Simulate data from LGC factors regressed on observed variables with interaction


John Flournoy

Feb 1, 2016, 4:41:27 PM2/1/16
to lavaan
Hi all,

I'm conducting a power analysis and want to use `simulateData` to generate data from a latent growth curve model where both the intercept and slope are regressed on two observed variables and their interaction. However, I'm finding it difficult to figure out how to generate the data for the interaction term. I know lavaan doesn't support syntax to specify an interaction, so perhaps there are some constraints I can impose that will simulate the proper relationship between the variables such that I can manually compute the interaction term for the model-fitting portion of the power analysis procedure. Or perhaps it's not important that the simulated data encode the multiplication of the two variables as that doesn't affect the covariance matrix? 

My code is something like this:

theDGM <- '
i_hrb =~ 1*hrb1 + 1*hrb2 + 1*hrb3 + 1*hrb4
s_hrb =~ 0*hrb1 + .1*hrb2 + .2*hrb3 + .3*hrb4

i_hrb ~ 0*1
s_hrb ~ .5*1

i_hrb ~ .3*A1 + -.2*A2 + .1*A1xA2

i_hrb ~~ -.1*s_hrb'

someData <- simulateData(model=theDGM, 
model.type='growth',  
standardized=T,
sample.nobs=500)

Using 'empirical = TRUE' in the simulateData call seems to reproduce this model's path weights, but of course the generated A1xA2 is not necessarily equal to A1 * A2. 

I'm pretty new at this, so any feedback is appreciated.

(Only tangentially related: please let me know if there is anything wrong with using standardize=T in this way; I'm trying to simulate standardized relations).

Many thanks,
~John

Terrence Jorgensen

Feb 2, 2016, 7:15:31 AM2/2/16
to lavaan
This one piqued my interest, so I can't help but post a solution I worked out, although it may seem like more work than it is worth.  I hope it helps.

Interaction / Moderation means the effect (slope) of A1 depends on the level of A2.  You can switch the roles of A2 and A1 without loss of generality -- it just depends how you want to interpret it.  The important point is that this regression equation:

ŷ = β0 + β1(A1) + β2(A2) + β12(A1)(A2)

can be expressed equivalently by factoring out the common term A2:
ŷ = β0 + β1(A1) + (β2 + β12A1) × A2

So the slope of A2 is a function of A1 -- specifically, the slope = β2 + β12A1.  You can capitalize on this parameterization to generate subsets of data for one level of A1 iteratively from min(A1) to max(A1) in some number of increments, such that the subsamples add up to N = 500.  For each iteration, you only need to include A2 as a variable in your model syntax, and after data generation you can save the variable A1 as a constant for that subset of data.

Your example specifies the latent-intercept being regressed on A1 and A2 and their interaction (with intercept == 0):

i_hrb = 0 + 0.3(A1) - 0.2(A2) + 0.1(A1)(A2)

which can be expressed equivalently as:

i_hrb = 0.3(A1) + [-0.2 + 0.1(A1)] × A2

So, for example, when A1 == 1, the equation simplifies to:

i_hrb = 0.3(1) + [-0.2 + 0.1(1)] × A2 = 0.3 - 0.1(A2)

And when A1 == 2, the equation simplifies to:

i_hrb = 0.3(2) + [-0.2 + 0.1(2)] × A2 = 0.6 - 0(A2) = 0.6
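
As a quick numerical sanity check (my own addition, not from the derivation above), the full interaction form and the factored form agree for any values of A1 and A2:

```r
## verify that b0 + b1*A1 + b2*A2 + b12*A1*A2 matches the factored form
b0 <- 0; b1 <- 0.3; b2 <- -0.2; b12 <- 0.1
A1 <- seq(1, 10, by = 0.5)
A2 <- rnorm(length(A1))
full     <- b0 + b1*A1 + b2*A2 + b12*A1*A2
factored <- b0 + b1*A1 + (b2 + b12*A1) * A2
all.equal(full, factored)  # TRUE
```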

Using paste0(), you can loop through the different levels of A1, constructing the model syntax with appropriate values of the intercept and slope for the i_hrb equation.  To keep the illustration simple, I will generate values of A1 uniformly between 1 and 10, with n = 5 at each integer-value of A1, for a total N = 50.  (You can generate n = 50 for each value of A1 to get N = 500, or you can make A1 look more continuous by keeping n = 5 and using 100 more fine-grained cut-points across the range of A1, but you would probably get similar results either way -- most behavioral research relies on ordinal Likert-type data anyway).  

## function to generate a subset of data at a fixed level of A1
library(lavaan)
getSubsample <- function(A1, N) {
  B0 <- 0 + 0.3*A1
  B1 <- -.2 + .1*A1
  theDGM <- paste0('i_hrb =~ 1*hrb1 + 1*hrb2 + 1*hrb3 + 1*hrb4
         s_hrb =~ 0*hrb1 + .1*hrb2 + .2*hrb3 + .3*hrb4
         s_hrb ~ .5*1
         i_hrb ~ ', B0, '*1 + ', B1, '*A2
         i_hrb ~~ -.1*s_hrb')
  someData <- simulateData(model = theDGM, model.type = 'growth',  
                           standardized = TRUE, sample.nobs = N)
  someData$A1 <- A1
  return(someData)
}
## check if it works
getSubsample(A1 = 2, N = 10)
## now apply it to the range of A1 == 1:10 (with small N to illustrate)
set.seed(123)
(dataList <- mapply(getSubsample, A1 = 1:10, N = 5, SIMPLIFY = FALSE)) 
## now combine the list of subsamples 
(myData <- do.call(rbind, dataList))

The distribution of the fixed exogenous variables A1 and A2 is not assumed to be normal (or anything else) unless you choose fixed.x = FALSE, so it doesn't really matter how A1 is distributed.  If you really want A1 to be standard normal, then instead of generating subsamples uniformly across the range of A1, you can generate as many values as you would expect in each of those bins.  Basically, you divide the standard normal curve into k segments/bins, calculate the corresponding cumulative probabilities for the bins, and multiply the total sample size by those probabilities to get subsample sizes in each bin.

nBins <- 10
(bins <- seq(-3, 3, length.out = nBins + 1))
(probs <- pnorm(bins))
(binprobs <- diff(probs)) # length(binprobs) == nBins
(binsize <- round(binprobs*500))
sum(binsize)
(newbins <- seq(-3, 3, length.out = nBins)) # length(newbins) == nBins

Now, you can apply the getSubsample function again, using the newbins and binsize objects as arguments:

set.seed(123)
dataList <- mapply(getSubsample, A1 = newbins, N = binsize, SIMPLIFY = FALSE)
lapply(dataList, head) # check the top of each subsample
sapply(dataList, nrow) # check the subsample sizes
myData <- do.call(rbind, dataList) # stack the dataList
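
For the model-fitting stage of the power analysis, you could then compute the product term in the stacked data and treat it as an ordinary observed predictor in the analysis model.  A sketch (the analysis-model syntax here is my assumption, not something fixed by the thread):

```r
library(lavaan)

## compute the interaction term in the combined data from above
myData$A1xA2 <- myData$A1 * myData$A2

## analysis model: A1xA2 is now just another observed predictor
fitModel <- '
  i_hrb =~ 1*hrb1 + 1*hrb2 + 1*hrb3 + 1*hrb4
  s_hrb =~ 0*hrb1 + .1*hrb2 + .2*hrb3 + .3*hrb4
  i_hrb ~ A1 + A2 + A1xA2
'
fit <- growth(fitModel, data = myData)
summary(fit)
```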



> (Only tangentially related: please let me know if there is anything wrong with using standardize=T in this way; I'm trying to simulate standardized relations).

Figuring out how to standardize is pretty tricky when you have interaction terms (whose total variance is a nontrivial function of the joint distribution of A1 and A2).  In my example above, the HRB variables had SDs around 1.4, similar to your example that yields observed variances > 2:

fitted(growth(theDGM))

and a model-implied variance > 1 for the latent intercept:

lavInspect(growth(theDGM), "cov.lv")

I would strongly suggest setting all population parameters manually in the syntax (especially because it doesn't make sense to standardize the latent growth factors -- you do not expect their means to be zero!).  Try thinking of change in units of the actual variables you will eventually analyze, choosing effect sizes according to what would be meaningful (i.e., a change that would be considered noticeable in real life).

A standardized effect size is just a real effect size scaled by the variability in the data, but that variability can depend on various aspects of the study design (e.g., the (sub)populations of interest, the measurement instruments, etc.).  If you really are agnostic about what a truly (practically) significant effect would look like and what variability to expect in your data, you can always check your average standardized estimates across 10 trial replications, then adjust your population parameters by trial and error until you are in the ballpark you want.  Just a suggestion.
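
That trial-and-error calibration could look something like this.  The population and analysis models below are simplified placeholders of my own (one predictor, integer time codes), just to show the shape of the loop:

```r
library(lavaan)

## population model with all parameters fixed (placeholder values)
popModel <- '
  i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
  i ~ 0.3*x
  i ~~ -0.1*s
  s ~ 0.5*1
'
## analysis model: same structure, regression freely estimated
anaModel <- '
  i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
  i ~ x
'
## average the standardized i-on-x slope across 10 replications,
## then tweak the 0.3 in popModel until it lands where you want
est <- replicate(10, {
  d <- simulateData(popModel, model.type = "growth", sample.nobs = 500)
  std <- standardizedSolution(growth(anaModel, data = d))
  std$est.std[std$op == "~"]
})
mean(est)
```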

Terry

John Flournoy

Feb 2, 2016, 5:09:25 PM2/2/16
to lavaan
Hi Terry,

This is a really clever solution! I'll try it out. Also, your advice regarding standardization is quite helpful. Many thanks! I hope this will be of use to others as well.

Very best,
~John

Terrence Jorgensen

Feb 3, 2016, 3:21:25 AM2/3/16
to lavaan
> This is a really clever solution! I'll try it out. Also, your advice regarding standardization is quite helpful. Many thanks! I hope this will be of use to others as well.

You're welcome.  A couple more things occurred to me after I made this post.
  • I should not have used "standardized = TRUE" in each separate script -- that will lead to different residual variances of endogenous variables in each subsample, so you end up with unintended (and unmodeled) heteroscedasticity.  That's another reason to instead specify all parameters manually in the lavaan model syntax.
  • I noticed you set your factor loadings for the latent slope to decimal values {0, 0.1, 0.2, 0.3} instead of integers {0, 1, 2, 3}.  The mean of the latent slope is the average amount of change per unit of time, which is now outside the scope of observed time in your simulation (because you are simulating time to range only 3 tenths of a time-unit).  Although there is nothing per se wrong with coding time that way in an analysis, it might complicate your choice of mean and variance parameters for the latent slope.
Good luck,

Terry
