Simulate multilevel SEM with simsem?

752 views
Skip to first unread message

Seongho Bae

unread,
Nov 6, 2017, 12:03:07 AM11/6/17
to lavaan
Hello all,

I want to simulate multilevel SEM with simsem.

Can I specify random cluster effect with simsem including development versions?

Best,
Seongho

Terrence Jorgensen

unread,
Nov 6, 2017, 7:53:44 AM11/6/17
to lavaan
I want to simulate multilevel SEM with simsem.  Can I specify random cluster effect with simsem including development versions?

Not using the LISREL matrices, but if you can use lavaan syntax to generate multilevel data using the simulateData() function, then you can pass that same syntax to sim(..., generate=) to use it as the population model, and write a corresponding lavaan syntax for analysis to pass to sim(..., model=)

There might be unforeseen issues, so keep me updated with a reproducible example if that does not work.  

Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Thalia Vrantsidis

unread,
Mar 7, 2019, 2:43:57 PM3/7/19
to lavaan
Hi, 
I am trying to do the same thing as Seongho, but as far as I can tell, simulateData does not output multilevel data. Is this correct, and is there another way to get this data then or something like it in order to compute a power analysis for multilevel data?

As a somewhat arbitrary model, I tried running this, and my output only had 2 columns (for some reason both were called c("a", "b"), but I assume the first correponds to a and the second to b),.

data.generation.model1 <- '
level: 1
a ~ 0.097*b 
a ~~ 420.900*a
b ~~ 1189.236 *b
a ~1
b ~1

level: 2
a ~~ -29.098*b 
a ~~ 222.361*a
b ~~ 104.413 *b
a ~1*48.275
b ~1*44.072
'

#simulate data based on fitted model

myData <- simulateData( data.generation.model1, sample.nobs = 100L)

Thanks,
Thalia

redh...@gmail.com

unread,
Mar 11, 2019, 2:05:29 PM3/11/19
to lavaan
Hi All,
I believe I'm having the same issue. I'm trying to do simulate a multilevel dataset with 2 latent and 3 observed variables in order to estimate sample size though power analysis a priori using a multilevel structural equation mode but I also can't find any way to do this using R. I've seen a few tutorials and papers doing this in Mplus but not in R. Any help on this would be very much appreciated.
Thank you,
Lukas

Joseph Bonito

unread,
Mar 11, 2019, 2:08:23 PM3/11/19
to lav...@googlegroups.com
It’s not easy but can be done.  See the R package simstudy, especially:


Keith, the developer, is very good at responding to specific questions.

Joe


--
You received this message because you are subscribed to the Google Groups "lavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lavaan+un...@googlegroups.com.
To post to this group, send email to lav...@googlegroups.com.
Visit this group at https://groups.google.com/group/lavaan.
For more options, visit https://groups.google.com/d/optout.

Terrence Jorgensen

unread,
Mar 11, 2019, 2:23:00 PM3/11/19
to lavaan
I'm trying to do simulate a multilevel dataset 

lavaan's simulateData() does not produce multilevel data, nor would it be an easy addition.  Supposing you specify your models at each level with population parameters, which can be transformed into model-implied covariance matrices at each level, you would then need additional arguments to specify how many clusters to generate as well as how many observations within each cluster.  You could do this manually:
  1. specify your between-level population model, and feed that alone to simulateData() to generate the number of clusters you want
  2. specify your within-level population model, and loop over the clusters (each row from Step 1) to generate a certain number of level-1 units within each cluster (not necessarily the same cluster sizes, but they can be), again using simulateData().  Cluster-level variables will not be part of the Level-1 model.
  3. Within each cluster, for each Level-1 variable, add the Level-2 component from Step 1 to each corresponding Level-1 component from Step 2.
Good luck,

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics

redh...@gmail.com

unread,
Mar 19, 2019, 7:49:23 AM3/19/19
to lavaan
Thank you very much Dr. Jorgensen. I was wondering if you might know of any tutorials or example syntax used to create MSEM/DSEM data? Thank you for your help!

Luca Menghini

unread,
Jun 26, 2019, 9:45:57 AM6/26/19
to lavaan
Dear Terrence,

I tried to follow your pipeline with the aim to perform a power analysis and estimate the required number of participants (clusters) to reliably estimate factor loadings and fit indices at both levels. However, I did not get what you mean with "add" in step 3: "Within each cluster, for each Level-1 variable, add the Level-2 component from Step 1 to each corresponding Level-1 component from Step 2". Indeed, level-2 scores for a given item and a given participant should be the average of the scores (21 in my case) of that participant on that item, and not their sum (what I believed you mean with "add").

I am also trying alternative approaches, such as simulating the BLUPs around the scores estimated at the between level. But in this way my simulated data will not be based on pre-defined factor loadings on both levels. The same problem applies to other strategies for simulating multilevel data (e.g., lme4::simulate). Finally, I was wandering if a reasonable solution would be to focus my power analysis exclusively on the between-individuals level, since the number of observations will be always higher than the number of pariticpants and hypotized the measurement model is the same on both levels. However, I am not sure about how the number of clusters affects the estimates and the fit indices at the within-individual level.

I thank you in advance for your time.
Best Regards,
Luca

Terrence Jorgensen

unread,
Jun 27, 2019, 4:30:04 PM6/27/19
to lavaan
I did not get what you mean with "add" in step 3

For simplicity, look at a univariate intercept-only multilevel model:

y_{ij} = mu + u_j + e_{ij}

The observed variable y for person i in cluster j is a sum of 3 components:
  1. mu is a grand mean of cluster means
  2. u_j is the deviation of cluster j's mean from mu
  3. within cluster j, e_{ij} is the deviation of individual i's score from cluster j's mean
u_j and e_{ij} are both normally distributed with means = 0 and variances = tau^2 for clusters and sigma^2 for individuals.  In practice, you can simulate data for N_2 clusters by generating N_2 cluster means from a normal distribution with M = mu and SD = tau 

set.seed(123)
cMeans
<- rnorm(N_2, m = mu, SD = tau)


That's the same as generating a vector of U from a distribution with mean = 0, then adding mu + U.  

set.seed(123)
U
<- rnorm(N_2, m = 0, SD = tau)
cMeans
<- mu + U

Then you could loop over clusters to generate Level-1 residuals for N_1 individuals within each cluster (assuming a constant cluster size; otherwise, loop over a vector of N_2 cluster sizes) and add it to the cluster mean.

dataList <- list()
for (cl in 1:N_2) {
  e
<- rnorm(N_1, m = 0, sd = sigma)
  dataList
[[cl]] <- cMeans[cl] + e
 
## same as mu + U[cl] + e
 
## recognize the equation?
}

That is what I meant, but on a multivariate level.  You can replace mu with a mean vector, tau and sigma by covariance matrices at L1 and L2, and rnorm() by MASS::mvrnorm() 

I was wandering if a reasonable solution would be to focus my power analysis exclusively on the between-individuals level

Only if that is the model you will run to test your hypothesis.  The whole point of a simulation is to simulate what you are going to do, but under known conditions.

Luca Menghini

unread,
Oct 22, 2019, 6:55:19 AM10/22/19
to lavaan
Dear Terrence,

thank you very much for your explanation and insights, and sorry for my delayed response.
I have tried to create a function to simulate multilevel multivariate data based on your suggestions, which returns the parameters estimated by a MCFA model specified on the simulated data.

If matrices.type="simulated", the function uses the loadings specified in the "level1.model" and "level2.model" arguments to simulate data (using simulateData) on both levels, separately. Then, the covariance matrices on both levels are saved and used to simulate the BLUPs and the residuals.
If matrices.type="empirical", the covariance matrices are computed from an existing dataset over which the MCFA model is specified.
Then, the two multivariate datasets are "added" (as you said) to create the multilevel dataset from which the parameters are estimated and saved.

It took a long time (I'm not used to write functions), but it seems to work under different conditions (e.g., with more than one latent variable, free loadings on both levels), and I would really appreciate receiving your feedback on it. Also, I hope the function can be useful for other lavaan users aiming to simulate multilevel SEM data.

Best Regards,
Luca

MCFAsim <- function(N_2=100,   # number of clusters (level 2)
                    N_1=10,    # number of observations(level 1)
                    MU=0,      # grand average
                    n.items=4, # number of items to be generated
                    item.letter="x", # letter that identifies items of interest
                    cluster="ID",   # name of the grouping variable
                    
                    # COVARIANCE MATRICES
                    matrices.type="simulated", # simulated or empirical
                    std=TRUE, # standardized cov matrices?
                    # 2.1. if empirical, specify the DATASET and the cluster variable
                    mydata=NA, # dataset (data.frame)
                    # 2.2. if simulated, specify the LOADINGS in the one-level model(s)
                    level2.model="WLb =~ 0.6*x1 + 0.6*x2 + 0.6*x3 + 0.6*x4",
                    level1.model=level2.model, # default: the same as level 1 (conf. construct)
                                             
                    # MCFA MODEL
                    twolevels.model='level: 1
                                     WLw =~ a*x1 + b*x2 + c*x3 + d*x4
                                     level: 2
                                     WLb =~ a*x1 + b*x2 + c*x3 + d*x4',
                    
                    corrs=FALSE # want to save also correlations between simulated variables?
){
  
  if(matrices.type=="simulated"){
    # generate data on lv2
    myData <- lavaan::simulateData(level2.model,model.type="cfa",std.lv=TRUE,sample.nobs=N_2) 
    b.cov <- lavaan::inspectSampleCov(level2.model,myData)$'cov' # cov matrix on lv2
    # generate data on lv1
    myData <- lavaan::simulateData(level1.model,model.type="cfa",std.lv=TRUE,sample.nobs=N_1)
    w.cov <- lavaan::inspectSampleCov(level2.model,myData)$'cov' # covariance matrix on lv1
    
    }else if(matrices.type=="empirical"){
      # take cov matrices from observed data using the specified MCFA model
      b.cov <- lavaan::inspectSampleCov(twolevels.model,mydata,cluster="ID")$ID$`cov`
      w.cov <- lavaan::inspectSampleCov(twolevels.model,mydata,cluster="ID")$`within`$`cov`
    }
  
  # standardized cov matrices?
  if(std==TRUE){
    b.cov <- cov2cor(b.cov) 
    w.cov <- cov2cor(w.cov) 
  }
  
  # Simulating clusters' means (lv2) = grandAverage + BLUP
  U <- MASS::mvrnorm(n=N_2,mu=rep(0,n.items),Sigma=b.cov) # BLUPs
  cMeans <- MU + U # simulated clusters' means (1 X cluster X item)
  
  # Simulating within-cluster scores (lv1) = clusters' means + residual deviance
  dataList <- list()
  for (cl in 1:nrow(cMeans)){
    e <- MASS::mvrnorm(n=N_1,mu=rep(0,n.items),Sigma=w.cov) # Residuals
    for (i in 1:nrow(e)){ e[i,] <- e[i,] + cMeans[cl,] } # within-cluster scores
    dataList[[cl]] <- e
  }
  
  # Merging lv1 and lv2 data in a multilevel dataset
  simulated <- data.frame(ID=rep(1:N_2,N_1),TIME=rep(rep(NA,N_1),N_2)) # empty dataset
  simulated <- simulated[order(simulated$ID),] # ordered by ID
  simulated$TIME=rep(1:N_1,N_2)
  for(i in 1:n.items){ simulated <- cbind(simulated,rep(NA,nrow(simulated))) } # empty items columns
  colnames(simulated)[3:ncol(simulated)] <- paste(item.letter,seq(1,n.items,1),sep="") # item names
  for(cl in 1:length(dataList)){ # from list to df
    simulated[simulated$ID==cl,3:ncol(simulated)] <- dataList[[cl]] 
  }
  
  # parameters estimation
  results <- try(lavaan::cfa(twolevels.model,data=simulated,cluster=cluster,std.lv=TRUE),silent=TRUE) # CFA
  parameters <- data.frame(matrix(nrow=0,ncol=n.items*6+5+3+6)) # empty dataset
  
  if(class(results)!="try-error"){ error = FALSE # errors check
    if(lavaan::lavInspect(results,"converged")==TRUE){ convergence = TRUE # convergence check
      if(any(as.vector(lavaan::lavInspect(results,"est")[["ID"]][["theta"]])<0) & # negative var check
         any(as.vector(lavaan::lavInspect(results,"est")[["within"]][["theta"]])<0)){ 
        neg.Variance = TRUE } else { neg.Variance = FALSE }
    
    parameters <- rbind(parameters,
                          # model info (ncol=5)
                        c(N_2,N_1,matrices.type,lavaan::lavInspect(results,"fit")[c("npar")],std, 
                          # converg. (ncol=3)
                          error,convergence,neg.Variance, 
                          # lv1 & lv2 loadings - default type="std.all" (ncol=n.items*2)
                          c(round(lavaan::standardizedsolution(results)[
                            lavaan::standardizedsolution(results)$op=="=~","est.std"],3)),
                          # fit indices (ncol=6)
                          round(lavaan::lavInspect(results,"fit")[c("chisq","df","rmsea","cfi",
                                                                    "srmr_within","srmr_between")],3), 
                          # lv1&lv2 loadings CI (ncol=n.items*4)
                          c(round(lavaan::standardizedsolution(results)[
                            lavaan::standardizedsolution(results)$op=="=~","ci.lower"],3)),
                          c(round(lavaan::standardizedsolution(results)[
                            lavaan::standardizedsolution(results)$op=="=~","ci.upper"],3))
                          
                          # correlations between observed variables
                          ))
    
    }else{ convergence = FALSE # not converged
    parameters <- rbind(parameters,
                        c(N_2,N_1,matrices.type,NA,std, # model info (ncol=5)
                          error,convergence,NA, # converg. (ncol=3)
                          rep(NA,n.items*6+6))) # loadings and fit indices (ncol=n.items*4+4)
    }
  }else{ error=TRUE # error
  parameters <- rbind(parameters,
                      c(N_2,N_1,matrices.type,NA,std, # model info (ncol=2)
                        error,NA,NA, # converg. (ncol=3)
                        rep(NA,n.items*6+6))) # loadings and fit indices (ncol=n.items*4+4)
    }
  
  colnames(parameters) <- c("N_2","N_1","matrices.type","npar","std","error","converged","negVariance",
                            paste(paste(item.letter,1:n.items,sep=""),rep("w",n.items),sep=""),
                            paste(paste(item.letter,1:n.items,sep=""),rep("b",n.items),sep=""),
                            "chisq","df","rmsea","cfi","srmr_between","srmr_within",
                            paste(paste(item.letter,1:n.items,sep=""),rep("w.CI.low",n.items),sep=""),
                            paste(paste(item.letter,1:n.items,sep=""),rep("b.CI.low",n.items),sep=""),
                            paste(paste(item.letter,1:n.items,sep=""),rep("w.CI.up",n.items),sep=""),
                            paste(paste(item.letter,1:n.items,sep=""),rep("b.CI.up",n.items),sep=""))
  
  if(corrs==TRUE){ # should add n.items*(n-1)/2 columns
    corrS <- cor(simulated[,3:ncol(simulated)],simulated[,3:ncol(simulated)],use="complete.obs",method="pearson")
    corr.lab <- matrix(apply(expand.grid(rownames(corrS),rownames(corrS)),1,paste,collapse="."),
                       nrow=n.items,ncol=n.items)
    corrS <- round(corrS[lower.tri(corrS, diag = FALSE)],2)
    names(corrS) <- corr.lab[lower.tri(corr.lab, diag = FALSE)]
    parameters <- cbind(parameters,t(corrS))
  }
  
  return(parameters)
  
  }


Reply all
Reply to author
Forward
0 new messages