Simulating Ordinal Data


Orowa Sikder

unread,
Mar 30, 2015, 3:20:45 PM3/30/15
to lav...@googlegroups.com
Hello,

I'm trying to use the simulateData() function to generate Likert-esque data. It's fairly straightforward if standardized factor loadings are used; however, if unstandardized factor loadings (i.e. some loadings > 1) are used, it runs into difficulty.

It seems the problem arises because, when categorical data are generated, the variances of the underlying continuous variables y* (of which the observed y are categorizations) are forced to 1, and the resulting model-implied Sigma is not positive definite. For example, suppose we use the following simple categorical indicator model:

simple.pop<-'
f1 =~ 0.55*y1 + 1.4*y2 + 1.1*y3 + 0.8*y4
y1 | -2*t1
y2 | -1*t1
y3 | 1*t1
y4 | 2*t1
'

group1 <- simulateData(simple.pop, sample.nobs = 1000, debug = TRUE)



Error in MASS::mvrnorm(n = sample.nobs[g], mu = Mu.hat[[g]], Sigma = COV, :
  'Sigma' is not positive definite

Sure enough the model-implied moments from the debug call are:


      [,1] [,2]  [,3] [,4]
[1,] 1.000 0.77 0.605 0.44
[2,] 0.770 1.00 1.540 1.12
[3,] 0.605 1.54 1.000 0.88
[4,] 0.440 1.12 0.880 1.00

(As a sense check, I also ran it with all loadings < 1, and it ran fine.)

I checked the paper the parameterisation is based on (p 552, equation 4).

As expected the parameterisation does require that diag(Sigma) is composed of ones. It even imposes this at the expense of having the error variances of y* underlying the observed variables be negative (which would be the implication in the example above).
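Concretely, the implied residual variances can be checked in base R (a quick sketch; this calculation is mine, not part of the debug output):

```r
# With Var(f1) = 1 and diag(Sigma) fixed to 1, the implied
# residual variance of each y* is 1 - lambda^2.
lambda <- c(0.55, 1.4, 1.1, 0.8)
theta <- 1 - lambda^2
round(theta, 4)
# y2 and y3 (loadings 1.4 and 1.1) come out negative
```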

So my question is, what would be a reasonable way to work around this without forcing my factor loadings to be below 1?

Many thanks in advance.

Oscar Olvera

unread,
Mar 31, 2015, 1:39:48 PM3/31/15
to lav...@googlegroups.com
Would it work if you generated your (continuous) data using simulateData() first and then did the categorization afterwards? Since the data coming from simulateData() are multivariate normal, you can set the thresholds at whichever standard deviations from the mean you want. In your example you're sampling from a multivariate standard normal, so (me being lazy and only taking the easy case of binary data) you could do:

simple.pop<-'f1 =~ 0.55*y1 + 1.4*y2 + 1.1*y3 + 0.8*y4'

group1<-simulateData(simple.pop,sample.nobs=1000)

group1_binary <- ifelse(group1>=0,1,0)

where I just chose 0 so all the y's are cut at their (approximate) mean. 

I'm pretty sure it wouldn't take you too long to cut each y in whichever way you find suitable :-)
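For more categories, the same idea with cut() (the cut points here are purely illustrative; pick whatever thresholds suit each item):

```r
# Cut each continuous y* into 5 ordered categories at fixed thresholds
# (illustrative cut points at -2, -1, 1, 2; adjust per item as needed).
likertize <- function(y, breaks = c(-Inf, -2, -1, 1, 2, Inf)) {
  cut(y, breaks = breaks, labels = FALSE)  # integer codes 1..5
}

# Applied to the continuous data frame from simulateData():
# group1_likert <- as.data.frame(lapply(group1, likertize))
```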

Orowa Sikder

unread,
Mar 31, 2015, 2:32:24 PM3/31/15
to lav...@googlegroups.com
Cheers - I'd already written that as a backup, but I was hoping to pass the model to simsem, which requires a straight lavaan model. More generally, it would be good if there were some documentation on the categorical data generation and analysis. I've pieced together what I can from referenced bits of Mplus documentation, but does anyone know of a detailed treatment of the lavaan categorical routines? For example, the use of scaling factors (the ~*~ syntax), or working out when notation refers to moments of the latent continuous variables as opposed to the manifest variables, is all a bit mysterious without something to refer to.

Orowa Sikder

unread,
Apr 1, 2015, 1:02:03 PM4/1/15
to lav...@googlegroups.com
Just a note: I've realized a workaround which makes it much easier. Simply specify a list with both the generative model and the argument parameterization = "theta", and feed that to simsem instead. Similarly for the specification model:

gen <- list(model = simple.pop.a, parameterization = "theta")
mod <- list(model = simple.spec, parameterization = "theta")

OutPut <- sim(nRep = 1000, model = mod, generate = gen, n = 1000, lavaanfun = "sem", seed = 1234)

I've also found a presentation by Yves which discusses Delta vs Theta parameterization and the implications for scaling factors.


Now if only multiple groups could be generated in simsem with lavaan syntax! :)

yrosseel

unread,
Apr 5, 2015, 4:49:23 AM4/5/15
to lav...@googlegroups.com
> As expected the parameterisation does require that diag(Sigma) is
> composed of ones. It even imposes this at the expense of having the
> error variances of y* underlying the observed variables be negative
> (which would be the implication in the example above).

> So my question is, what would be a reasonable way to work around this
> without forcing my factor loadings to be below 1?

There is (currently) only one way:

use simulateData() only for the continuous y* variables, and then 'cut'
the continuous y* variables manually.

Yves.


yrosseel

unread,
Apr 5, 2015, 4:54:24 AM4/5/15
to lav...@googlegroups.com
On 03/31/2015 08:32 PM, Orowa Sikder wrote:
> though it would be good if there was any documentation on the
> categorical data generation

The only source for now is the source code. Look for the file:

https://github.com/yrosseel/lavaan/blob/master/R/lav_simulate.R


> and analysis - I've pieced together what I
> can from referenced bits of MPlus documentation but does anyone know a
> detailed treatment of the lavaan categorical routines?

The categorical analysis routines of lavaan are heavily modeled after
Mplus, so the Mplus documentation is as good as it gets.

One day, lavaan will have its own technical documentation.

> For example, the
> use of scaling factors (with the ~*~ syntax)

The ~*~ scaling factors are the equivalent of the {} statements in Mplus
syntax.

> or checking when notation
> is referring to moments of latent continuous variables as opposed to
> manifest variables

I do not quite follow here. What do you mean?

Yves.

UCLU

unread,
Apr 21, 2015, 3:23:18 PM4/21/15
to lav...@googlegroups.com
I may be mistaken here, but I found that simulateData() works absolutely fine even if it is passed a categorical outcome model. For example:

pop<-'

f1 =~ 0.85*x1 + c(1.4,0.7)*x2 + 1.25*x3 + c(2,1)*x4 + 0.5*x5 + c(0.75,0.38)*x6
f2 =~ 0.55*y1 + 1.4*y2 + 1.1*y3 + 0.8*y4

f1 + f2 ~ 0*1

f2 ~ 0.22*f1

f1 ~~ 1*f1
f2 ~~ 0.9516*f2

x1 ~~ 1*x1
x2 ~~ 1*x2
x3 ~~ 1*x3
x4 ~~ 1*x4
x5 ~~ 1*x5
x6 ~~ 1*x6

x1 ~*~ x1
x2 ~*~ x2
x3 ~*~ x3
x4 ~*~ x4
x5 ~*~ x5
x6 ~*~ x6

x1 | -1.7*t1 + 1.5*t2
x2 | c(-0.4,-1.2)*t1 + 1.9*t2
x3 | 0.7*t1 + 2.3*t2
x4 | c(-0.45,-1.45)*t1 + 2.75*t2
x5 | 0.8*t1 + 2.2*t2
x6 | c(1.2,-0.2)*t1 + 2*t2

y1 ~~ 1*y1
y2 ~~ 1*y2
y3 ~~ 1*y3
y4 ~~ 1*y4

y1 ~*~ y1
y2 ~*~ y2
y3 ~*~ y3
y4 ~*~ y4

y1 | -2*t1 + -0.8*t2 + 0.2*t3 + 1.7*t4
y2 | -2.5*t1 + -0.9*t2 + 0.3*t3 + 1.9*t4
y3 | -1.5*t1 + -0.2*t2 + 0.8*t3 + 2.3*t4
y4 | -1.7*t1 + 0.3*t2 + 1*t3 + 2.5*t4


'

sample <- simulateData(model = pop, parameterization = "theta", sample.nobs = c(1000, 1000))


Furthermore, if you compile the relevant arguments into a list before passing it to simsem's sim() function, that works fine too.

Alex Schoemann

unread,
Mar 31, 2015, 3:56:21 PM3/31/15
to lav...@googlegroups.com
A quick note on simsem. There's no reason why you can't use this approach with the sim function. Data generation and analysis are two different steps. I can think of two methods of accomplishing the approach Oscar mentioned:

1. Generate the data outside of simsem, create a list of data frames (each element of the list represents a replication), and use the rawData argument in the sim command.
2. Specify the model with continuous outcomes using the generate argument in sim, create a function to categorize the continuous responses, and use the datafun argument to apply that function to each generated data set.
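A sketch of option 2 (the categorize() helper and its cut points are illustrative, not tested against simsem here):

```r
# Illustrative datafun for simsem: recode every column of a generated
# data set into ordered categories at fixed cut points.
categorize <- function(data, breaks = c(-Inf, -1, 0, 1, Inf)) {
  as.data.frame(lapply(data, function(y) cut(y, breaks = breaks, labels = FALSE)))
}

# Then, roughly (assumes 'mod' and a continuous-outcome 'gen' model):
# out <- sim(nRep = 1000, model = mod, generate = gen, n = 1000,
#            lavaanfun = "sem", datafun = categorize)
```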

Alex