Simulate data from LGC factors regressed on observed variables with interaction


John Flournoy

Feb 1, 2016, 4:41:27 PM2/1/16
to lavaan
Hi all,

I'm conducting a power analysis and want to use `simulateData` to generate data from a latent growth curve model where both the intercept and slope are regressed on two observed variables and their interaction. However, I'm finding it difficult to figure out how to generate the data for the interaction term. I know lavaan doesn't support syntax to specify an interaction, so perhaps there are some constraints I can impose that will simulate the proper relationship between the variables such that I can manually compute the interaction term for the model-fitting portion of the power analysis procedure. Or perhaps it's not important that the simulated data encode the multiplication of the two variables as that doesn't affect the covariance matrix? 

My code is something like this:

theDGM <- '
i_hrb =~ 1*hrb1 + 1*hrb2 + 1*hrb3 + 1*hrb4
s_hrb =~ 0*hrb1 + .1*hrb2 + .2*hrb3 + .3*hrb4

i_hrb ~ 0*1
s_hrb ~ .5*1

i_hrb ~ .3*A1 + -.2*A2 + .1*A1xA2

i_hrb ~~ -.1*s_hrb'

someData <- simulateData(model=theDGM, 
model.type='growth',  
standardized=T,
sample.nobs=500)

Using 'empirical = TRUE' in the simulateData call seems to reproduce this model's path weights, but of course the generated A1xA2 is not necessarily equal to A1 * A2. 

I'm pretty new at this, so any feedback is appreciated.

(Only tangentially related: please let me know if there is anything wrong with using standardize=T in this way; I'm trying to simulate standardized relations).

Many thanks,
~John

Terrence Jorgensen

Feb 2, 2016, 7:15:31 AM2/2/16
to lavaan
This one piqued my interest, so I can't help but post a solution I worked out, although it may seem like more work than it is worth.  I hope it helps.

Interaction / Moderation means the effect (slope) of A1 depends on the level of A2.  You can switch the roles of A2 and A1 without loss of generality -- it just depends how you want to interpret it.  The important point is that this regression equation:

ŷ = β0 + β1(A1) + β2(A2) + β12(A1)(A2)

can be expressed equivalently by factoring out the common term A2:
ŷ = β0 + β1(A1) + (β2 + β12A1) × A2

So the slope of A2 is a function of A1 -- specifically, the slope = β2 + β12A1.  You can capitalize on this parameterization to generate subsets of data for one level of A1 iteratively from min(A1) to max(A1) in some number of increments, such that the subsamples add up to N = 500.  For each iteration, you only need to include A2 as a variable in your model syntax, and after data generation you can save the variable A1 as a constant for that subset of data.

Your example specifies the latent-intercept being regressed on A1 and A2 and their interaction (with intercept == 0):

i_hrb = 0 + 0.3(A1) - 0.2(A2) + 0.1(A1)(A2)

which can be expressed equivalently as:

i_hrb = 0.3(A1) + [-0.2 + 0.1(A1)] × A2

So, for example, when A1 == 1, the equation simplifies to:

i_hrb = 0.3(1) + [-0.2 + 0.1(1)] × A2 = 0.3 - 0.1(A2)

And when A1 == 2, the equation simplifies to:

i_hrb = 0.3(2) + [-0.2 + 0.1(2)] × A2 = 0.6 - 0(A2) = 0.6
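
As a quick numerical sanity check (my own addition, not from the derivation above), the full interaction form and the factored form agree for any values of A1 and A2:

```r
## verify that b0 + b1*A1 + b2*A2 + b12*A1*A2 matches the factored form
b0 <- 0; b1 <- 0.3; b2 <- -0.2; b12 <- 0.1
A1 <- seq(1, 10, by = 0.5)
A2 <- rnorm(length(A1))
full     <- b0 + b1*A1 + b2*A2 + b12*A1*A2
factored <- b0 + b1*A1 + (b2 + b12*A1) * A2
all.equal(full, factored)  # TRUE
```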

Using paste0(), you can loop through the different levels of A1, constructing the model syntax with appropriate values of the intercept and slope for the i_hrb equation.  To keep the illustration simple, I will generate values of A1 uniformly between 1 and 10, with n = 5 at each integer-value of A1, for a total N = 50.  (You can generate n = 50 for each value of A1 to get N = 500, or you can make A1 look more continuous by keeping n = 5 and using 100 more fine-grained cut-points across the range of A1, but you would probably get similar results either way -- most behavioral research relies on ordinal Likert-type data anyway).  

## function to generate a subset of data at a fixed level of A1
library(lavaan)
getSubsample <- function(A1, N) {
  B0 <- 0 + 0.3*A1
  B1 <- -.2 + .1*A1
  theDGM <- paste0('i_hrb =~ 1*hrb1 + 1*hrb2 + 1*hrb3 + 1*hrb4
         s_hrb =~ 0*hrb1 + .1*hrb2 + .2*hrb3 + .3*hrb4
         s_hrb ~ .5*1
         i_hrb ~ ', B0, '*1 + ', B1, '*A2
         i_hrb ~~ -.1*s_hrb')
  someData <- simulateData(model = theDGM, model.type = 'growth',  
                           standardized = TRUE, sample.nobs = N)
  someData$A1 <- A1
  return(someData)
}
## check if it works
getSubsample(A1 = 2, N = 10)
## now apply it to the range of A1 == 1:10 (with small N to illustrate)
set.seed(123)
(dataList <- mapply(getSubsample, A1 = 1:10, N = 5, SIMPLIFY = FALSE)) 
## now combine the list of subsamples 
(myData <- do.call(rbind, dataList))

The distribution of the fixed exogenous variables A1 and A2 is not assumed to be normal (or anything else) unless you choose fixed.x = FALSE, so it doesn't really matter how A1 is distributed.  If you really want A1 to be standard normal, then instead of generating subsamples uniformly across the range of A1, you can generate as many values as you would expect in each of those bins.  Basically, you divide the standard normal curve into k segments/bins, calculate the corresponding cumulative probabilities for the bins, and multiply the total sample size by those probabilities to get subsample sizes in each bin.

nBins <- 10
(bins <- seq(-3, 3, length.out = nBins + 1))
(probs <- pnorm(bins))
(binprobs <- diff(probs)) # length(binprobs) == nBins
(binsize <- round(binprobs*500))
sum(binsize)
(newbins <- seq(-3, 3, length.out = nBins)) # length(newbins) == nBins

Now, you can apply the getSubsample function again, using the newbins and binsize objects as arguments:

set.seed(123)
dataList <- mapply(getSubsample, A1 = newbins, N = binsize, SIMPLIFY = FALSE)
lapply(dataList, head) # check the top of each subsample
sapply(dataList, nrow) # check the subsample sizes
myData <- do.call(rbind, dataList) # stack the dataList
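
For the model-fitting stage of the power analysis, you could then compute the product term in the stacked data and treat it as an ordinary observed predictor in the analysis model.  A sketch (the analysis-model syntax here is my assumption, not something fixed by the thread):

```r
library(lavaan)

## compute the interaction term in the combined data from above
myData$A1xA2 <- myData$A1 * myData$A2

## analysis model: A1xA2 is now just another observed predictor
fitModel <- '
  i_hrb =~ 1*hrb1 + 1*hrb2 + 1*hrb3 + 1*hrb4
  s_hrb =~ 0*hrb1 + .1*hrb2 + .2*hrb3 + .3*hrb4
  i_hrb ~ A1 + A2 + A1xA2
'
fit <- growth(fitModel, data = myData)
summary(fit)
```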



> (Only tangentially related: please let me know if there is anything wrong with using standardize=T in this way; I'm trying to simulate standardized relations).

Figuring out how to standardize is pretty tricky when you have interaction terms (whose total variance is a nontrivial function of the joint distribution of A1 and A2).  In my example above, the HRB variables had SDs around 1.4, similar to your example that yields observed variances > 2:

fitted(growth(theDGM))

and a model-implied variance > 1 for the latent intercept:

lavInspect(growth(theDGM), "cov.lv")

I would strongly suggest setting all population parameters manually in the syntax (especially because it doesn't make sense to standardize the latent growth factors -- you do not expect their means to be zero!).  Try thinking of change in units of the actual variables you will eventually analyze, choosing effect sizes according to what would be meaningful (i.e., a change that would be considered noticeable in real life).

A standardized effect size is just a real effect size scaled by the variability in the data, but that variability can depend on various aspects of the study design (e.g., the (sub)populations of interest, the measurement instruments, etc.).  If you really are agnostic about what a truly (practically) significant effect would look like and what variability to expect in your data, you can always check your average standardized estimates across 10 trial replications, then adjust your population parameters by trial and error until you are in the ballpark you want.  Just a suggestion.
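
That trial-and-error calibration could look something like this.  The population and analysis models below are simplified placeholders of my own (one predictor, integer time codes), just to show the shape of the loop:

```r
library(lavaan)

## population model with all parameters fixed (placeholder values)
popModel <- '
  i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
  i ~ 0.3*x
  i ~~ -0.1*s
  s ~ 0.5*1
'
## analysis model: same structure, regression freely estimated
anaModel <- '
  i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
  i ~ x
'
## average the standardized i-on-x slope across 10 replications,
## then tweak the 0.3 in popModel until it lands where you want
est <- replicate(10, {
  d <- simulateData(popModel, model.type = "growth", sample.nobs = 500)
  std <- standardizedSolution(growth(anaModel, data = d))
  std$est.std[std$op == "~"]
})
mean(est)
```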

Terry

John Flournoy

Feb 2, 2016, 5:09:25 PM2/2/16
to lavaan
Hi Terry,

This is a really clever solution! I'll try it out. Also, your advice regarding standardization is quite helpful. Many thanks! I hope this will be of use to others as well.

Very best,
~John

Terrence Jorgensen

Feb 3, 2016, 3:21:25 AM2/3/16
to lavaan
> This is a really clever solution! I'll try it out. Also, your advice regarding standardization is quite helpful. Many thanks! I hope this will be of use to others as well.

You're welcome.  A couple more things occurred to me after I made this post.
  • I should not have used "standardized = TRUE" in each separate script -- that will lead to different residual variances of endogenous variables in each subsample, so you end up with unintended (and unmodeled) heteroscedasticity.  That's another reason to instead specify all parameters manually in the lavaan model syntax.
  • I noticed you set your factor loadings for the latent slope to decimal values {0, 0.1, 0.2, 0.3} instead of integers {0, 1, 2, 3}.  The mean of the latent slope is the average amount of change per unit of time, which is now outside the scope of observed time in your simulation (because you are simulating time to range only 3 tenths of a time-unit).  Although there is nothing per se wrong with coding time that way in an analysis, it might complicate your choice of mean and variance parameters for the latent slope.
Good luck,

Terry
