Data simulation with moderating moderation and categorical variables.


diya roy

Mar 15, 2018, 9:22:52 AM
to lavaan
Hi all,

Can anybody point out how to simulate data for the attached model? I want to supply given factor loadings and a covariance matrix for the factors.

Previously I created a simulation of this sort with the following syntax.

loadingVal <- matrix(0, 51, 7)

loadingVal[1:5, 1] <- c(1.1,0.2, 1.2,1.2,1.0)
loadingVal[6:11, 2] <- c(0.3,1, 0.9,0.1,0.8,0.6)
loadingVal[12:20, 3] <- c(0.5,0.5, 1.1,0.1,0.9,0.6,0.1,0.1,0.8)
loadingVal[21:30, 4] <- c(1.0,1.1,0.9,0.4,1.1,0.1,0.1,0.4,0.4,0.2)
loadingVal[31:37, 5] <- c(0.8,1.1, 0.5, 0.3,0.1,0.1,1.1)
loadingVal[38:41,6] <- c(1.2,0.7, 0.7,0.1)
loadingVal[42:51, 7] <- c(0.7,0.5, 0.9,0.6,0.2,0.1,1.1,0.6,0.7,0.8)

LY <- simsem::bind(loadingVal, "runif(1, 0.7, 1.2)")

factor.mean <- rep(NA, 7)
factor.mean.starting <- c(3,3,3,3,3,3,3)
AL <- simsem::bind(factor.mean, factor.mean.starting)

# Regression paths among the 7 factors; numeric entries are treated as fixed values
W <- matrix(0, 7, 7)
W[lower.tri(W)] <- c(1.1, 1.3, 1.0, 1.0, 0.9, 1.1, 0.3, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
BE <- simsem::bind(W)

RPS <-simsem::bind(diag(7),symmetric = TRUE)

RTE <-simsem::bind(diag(51),symmetric = TRUE)

SEM.Model <- simsem::model(LY=LY, BE=BE, PS=RPS, TE=RTE, AL= AL ,modelType="SEM")
data_P <- simsem::generate(SEM.Model, 6000)
write.table(data_P, file = "data_P.txt", quote = FALSE)


But this new model has moderating moderation and also categorical variables. The categorical variables are income and education with three values each: low, medium and high.


Thanks in advance.
Regards,
Diya

simsem_model.png

diya roy

Mar 17, 2018, 2:55:11 AM
to lavaan

This is a piece of code for a similar case. Is it okay?

dataSem <- '
BE=~0.95*BE1+0.955*BE2+0.95*BE3+0.8*BE4+0.9*BE5
BA=~0.1*BA1+.95*BA2+0.9*BA3+0.1*BA4+0.95*BA5+0.8*BA6
BAS=~0.1*BAS1+0.95*BAS2+0.9*BAS3+0.1*BAS4+0.95*BAS5+0.98*BAS6+0.1*BAS7+0.86*BAS9
PQ=~0.977*PQ1+0.985*PQ2+0.959*PQ3+0.1*PQ4+0.987*PQ5+0.2*PQ6+0.1*PQ7+0.15*PQ8+0.05*PQ9+0.09*PQ10
L=~0.988*L1+0.99*L2+0.1*L3+0.1*L4+0.1*L5+0.1*L6+0.96*L7
INF=~0.99*INF1+0.989*INF2+0.87*INF3+0.8*INF4
C=~0.8*C1+0.1*C2+0.8*C3+0.8*C4+0.1*C5+0.1*C6+0.9*C7+0.9*C8+0.9*C9+0.9*C10

INFEDINC=~INF+ED+INC+INF:ED+INF:INC
CEDINC=~C+ED+INC+C:ED+C:INC
PQCI=~PQ+C+INF+PQ:C+PQ:INF+INFEDINC+CEDINC
BE=~NA*BA+a*PQCI+b*BAS+c*L
a+b+c == 2.5
'

If I want to add constraints so that INC and ED stay within the range 1 to 3
and the rest stay within the range 1 to 5, how do I do it?

kma...@aol.com

Mar 17, 2018, 11:28:25 PM
to lavaan
Diya,
I may not be following your question correctly.  However, I think that you want to simulate data for a model with latent variables named INC and ED.  Further, you want these to range from 1 to 3.  Here are some of the things that I find confusing about the question.

First, we normally only simulate data for the observed variables.  If you want to simulate INC and ED scores, perhaps the best strategy would be to rewrite the model to make these manifest variables.

Second, the scaling of latent variables is essentially arbitrary.  You can re-scale the latent variables without changing the data.

Third, latent variables in SE models are theoretically unbounded.  So, they do not have a population range.  The range is -Inf to +Inf and that cannot be re-scaled to finite values.  You could scale to mu=2, SD = 1/3, to obtain a normal distribution mostly between 1 and 3.
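As a rough illustration of the rescaling point above, here is a minimal base R sketch. It assumes a standard normal variable and, as a separate assumption that is not part of the original model, that a three-level observed variable such as INC could be produced by cutting a normal variable at its terciles. The names z, x, and INC_num are only illustrative.

```r
set.seed(2018)
n <- 10000

# A standard normal variable rescaled to mean 2 and SD 1/3
z <- rnorm(n)
x <- 2 + z / 3

# Proportion of values that fall between 1 and 3 (i.e., within 3 SDs of the mean)
inside <- mean(x > 1 & x < 3)
inside

# Assumption: a 3-category observed variable obtained by cutting at the terciles
cuts <- qnorm(c(1/3, 2/3))
INC <- cut(z, breaks = c(-Inf, cuts, Inf),
           labels = c("low", "medium", "high"), ordered_result = TRUE)
INC_num <- as.integer(INC)  # coded 1, 2, 3
table(INC)
```

The rescaled variable is still unbounded in principle; only the cut version is guaranteed to stay within 1 to 3.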

If the above is not helpful, perhaps you could scale down to a minimal example and explain a little further what you are trying to accomplish.

Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

diya roy

Mar 18, 2018, 6:00:33 AM
to lav...@googlegroups.com
Keith,

INC and ED are observed variables (income and education only).  INF, C, PQ, L, BA, BE and BAS are latent defined by certain observed variables ( as in a scale).
 I want a data simulation where INC and ED are moderating INF and C and INF and C are moderating PQ.


Thanks & regards,
Diya Guha Roy

PhD Research Scholar




Terrence Jorgensen

Mar 18, 2018, 11:59:34 AM
to lavaan
> I want a data simulation where INC and ED are moderating INF and C and INF and C are moderating PQ.

You can't use lavaan::simulateData() or simsem to simulate a latent interaction like this because latent interactions do not fit within the traditional Covariance Structure Analysis (which is all linear effects).  You can use tricks to simulate data from subsamples defined by each level of your moderator(s), then combine for the total sample.  You can find some guidance here


That example is about latent growth curves, but the same principles apply to your CFA.
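The subsample idea can be sketched in base R as follows. This is only an illustration under assumed values, not the linked example and not lavaan::simulateData() or simsem. The helper simulate_group(), the loadings, the group size, and the per-level slopes are all hypothetical. Each level of an observed moderator (here ED) gets its own latent slope of PQ on INF, and the subsamples are then stacked.

```r
set.seed(123)

# Hypothetical helper: one subsample for a given level of the moderator ED
simulate_group <- function(n, ed_level, slope) {
  INF <- rnorm(n)                                        # latent predictor
  PQ  <- slope * INF + rnorm(n, sd = sqrt(1 - slope^2))  # latent outcome, unit variance
  loadings <- c(0.9, 0.8, 0.7)                           # assumed loadings
  make_indicators <- function(eta, prefix) {
    out <- sapply(loadings, function(l) {
      l * eta + rnorm(length(eta), sd = sqrt(1 - l^2))   # indicator = loading * factor + error
    })
    colnames(out) <- paste0(prefix, seq_along(loadings))
    out
  }
  data.frame(ED = ed_level,
             make_indicators(INF, "INF"),
             make_indicators(PQ, "PQ"))
}

# Assumed slopes: the effect of INF on PQ grows with the level of ED
slopes <- c(0.2, 0.4, 0.6)
simData <- do.call(rbind, lapply(1:3, function(g) simulate_group(500, g, slopes[g])))

table(simData$ED)
head(simData)
```

Only the observed indicators and ED are kept in the stacked data; the latent scores are discarded, as they would be in real data.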

Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

diya roy

Mar 18, 2018, 12:41:58 PM
to lav...@googlegroups.com
Thanks Terrence, for the pointer. I am just replicating a data sample using simsem. This is of immense help.


Thanks & regards,
Diya Guha Roy

PhD Research Scholar



diya roy

Mar 19, 2018, 4:13:32 AM
to lav...@googlegroups.com
Terrence,

Can you kindly provide the command for standardization? There is a command called scale(), but it only standardizes directly to mean = 0, SD = 1.


Thanks & regards,
Diya Guha Roy

PhD Research Scholar



kma...@aol.com

Mar 19, 2018, 10:56:41 AM
to lavaan
Diya,
  I am still not sure that I am entirely following what you want to do.  I am accustomed to thinking about moderation as one variable moderating an effect, not another variable.  So, I am sometimes unclear whether x moderating y should be expanded to mean x moderating all the effects of y on other variables or x moderating all the effects of other variables on y.  With that caveat, here are some further thoughts.

1. It is unclear in what context you want to standardize (base R, lavaan, simsem).  You can just type in the z score equation in base R, however, here is another way to do it in base R.  You could extend it to work with data frames using some form of apply() or some matrix operations.



standardize <- function(x){
  # convert to normal probabilities, then back to standard normal quantiles
  z <- qnorm(pnorm(x, mean(x), sd(x)))
  return(z)
} # end function

y <- rnorm(10, 2, 1)
zy <- standardize(y)
mean(zy); sd(zy)
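To extend this to data frames and to a target mean and SD other than 0 and 1, something like the following base R sketch could work. The function name rescale() and its arguments new_mean and new_sd are hypothetical, and the example data are made up.

```r
# Hypothetical helper: rescale a numeric vector to a chosen sample mean and SD
rescale <- function(x, new_mean = 0, new_sd = 1) {
  z <- (x - mean(x)) / sd(x)   # ordinary z scores
  new_mean + new_sd * z
}

set.seed(1)
dat <- data.frame(a = rnorm(100, 10, 4), b = runif(100))

# Apply column by column and keep the result as a data frame
dat_rescaled <- as.data.frame(lapply(dat, rescale, new_mean = 2, new_sd = 1/3))

sapply(dat_rescaled, mean)
sapply(dat_rescaled, sd)
```

With the defaults, rescale() returns the same z scores as the base R scale() function does for a single numeric column.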


2.  If you want to use simsem, you could use the option to provide your own data simulation function as the generate argument to sim().  This would allow you to create product variables or stack data from different populations with a call to sim().


library(simsem)

mySimulationFunction <- function(nobs=1){
  myData <- data.frame(x=rnorm(nobs))
  return(myData)
} # end function

fitModel <- 'x ~~ x'

mySim <- sim(nRep=5, model=fitModel, n=50, generate=mySimulationFunction)
summary(mySim)


3. Your use of the latent variable constructor operator in lavaan syntax does not appear consistent with your verbal descriptions.  This is why Terrence referred to your model as a CFA.  Consider the following line from your sample code.

CEDINC=~C+ED+INC+C:ED+C:INC

The '=~' operator creates a latent variable called CEDINC and models the right hand side variables as a function of CEDINC.  Maybe that was what you had in mind, but your use of the colon to indicate product interactions suggests you may have meant the opposite, that CEDINC is modeled as a function of the variables on the right hand side.  For that, you would use '~'.

To simplify the example:

latent =~ var1 + var2 + var1:var2 means var1 = (l1 * latent) + e1, var2 = (l2 * latent) + e2, and var1*var2 = (l3 * latent) + e3.

latent ~ var1 + var2 + var1:var2 means latent = (g1 * var1) + (g2 * var2) + (g3 * var1 * var2) + e.

In the latter, var1 moderates the effect of var2 on latent.  In the former, what should be simply the product of var1 and var2 is instead modeled as a function of latent.  To verify the above, use the lavaanify() function on your model syntax or snippets thereof.
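To make the regression reading concrete, here is a small base R sketch with observed variables only. It uses lm() rather than lavaan, and the coefficient values and the name outcome are assumptions chosen for illustration. It generates data from outcome = (g1 * var1) + (g2 * var2) + (g3 * var1 * var2) + e and then recovers the coefficients with the same colon notation.

```r
set.seed(42)
n <- 5000

var1 <- rnorm(n)
var2 <- rnorm(n)

# Assumed population coefficients g1, g2, g3
g <- c(0.5, 0.3, 0.4)

# The outcome is a function of both variables and their product, plus error
outcome <- g[1] * var1 + g[2] * var2 + g[3] * var1 * var2 + rnorm(n, sd = 0.5)

# The product term is a predictor here, not an indicator
fit <- lm(outcome ~ var1 + var2 + var1:var2)
round(coef(fit), 2)
```

In this setup the slope of outcome on var2 is g2 + g3 * var1, which is what it means for var1 to moderate the effect of var2.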

4. My general suggestion would be to start small with a toy model containing only one moderated relationship.  Debug the toy model.  Then slowly work your way up to the full model one variable at a time.

HTH,

diya roy

Mar 19, 2018, 12:10:38 PM
to lav...@googlegroups.com
Keith,

Thanks for the lavaan syntax. I did exactly as you suggested.

See the following snippet. I had a few errors and fixed the bugs.

BE=~BE1+BE2+BE3+BE5
BA=~BA2+BA3+BA5+BA6
BAS=~BAS2+BAS3+BAS5+BAS6+BAS9
PQ=~PQ1+PQ2+PQ3+PQ5
L=~L1+L2+L7
INF=~INF1+INF2+INF3+INF4
C=~C1+C3+C4+C7+C8+C9+C10
INFEDINC=~INF:ED+INF:INC
CEDINC=~C:ED+C:INC
PQCI=~PQ+C+INF+PQ:C+PQ:INF+ED+INC+INFEDINC+CEDINC

But I forgot to add error terms. Are they in fact mandatory? I am not sure of one thing: the SEM will provide the error terms for the factor loadings as well as the standard error of the variance, so why do we deliberately specify the error in the defining equation?

One amusing finding: the model that I created does not return error terms, so I guess defining the error terms is more or less required. But I cannot fathom the logic.


Thanks & regards,
Diya Guha Roy

PhD Research Scholar



kma...@aol.com

Mar 19, 2018, 11:30:05 PM
to lavaan
Diya,
  Have you ever seen a RAM style path diagram?  They are distinguished from LISREL style by the use of double-headed curved arrows with both ends pointing to the same variable.  Lavaan model syntax follows a similar logic.  x ~~ x represents a variance if x is (syntactically) exogenous, a disturbance variance if x is (syntactically) endogenous, and a unique variance if x is an indicator of a (reflective) latent variable.  You do not need these for exogenous variables if you use fixed.x = TRUE.

  In addition to the two chapters of the lavaan tutorial on the lavaan web page, there is also a model.syntax help file in the lavaan package that can be very helpful.

  If you had a regression y = b1(x1) + b2(x2) + e[y], you could specify this as follows:

y ~ x1 + x2 # everything but the disturbance
y ~~ y      # the disturbance, e[y]

  I am not sure that you really want to treat two variables and their product as effects of a common cause.  However, they are certainly not locally independent effects of the common cause.  So, you can allow correlations between their unique variances by using the same double-tilde operator with different variables on the right and left sides.

diya roy

Mar 19, 2018, 11:43:41 PM
to lav...@googlegroups.com
Keith,

This is the most important help you have rendered. I will check out the help page. Thanks a ton, deeply touched.

Thanks & regards,
Diya Guha Roy

PhD Research Scholar


