Workaround to generate factor mixture model, and a question on factor scores.

Raul Corrêa Ferraz

Jan 11, 2023, 8:00:08 PM

to lavaan

Hello everyone,

I have a couple of questions - the first might be more of a stats question so I apologize if it is too off-topic.

1) I want to generate data under a factor mixture model with no covariates. I know that lavaan does not have mixture modeling implemented. As a workaround, could I simply generate data under a multiple-group factor analysis (MGFA) model and then "trash" the group membership variable?

This also seems convenient to do in Mplus to have better control of the class proportions. For example, if I want to do a simulation with two groups split 30/70%, in the MGFA case each replication has this exact split. If I generate the data under a mixture model, the 30/70% split is only on average across replications. However, I don't know if there are unintended consequences to my workaround, with the group membership being fixed and not random (these might not be the best choice of words, but hopefully what I mean is clear).
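A minimal base-R sketch of the workaround described above: generate two-group factor data with an exact 30/70 split, then withhold the group label from the analysis data. It does not use lavaan, and the one-factor model, the loadings, the residual variances, the factor means, and the helper gen_group() are all illustrative assumptions rather than anything specified in this thread.

```r
set.seed(123)

# Hypothetical one-factor model with four indicators (illustrative values only)
lambda <- c(1, 0.8, 0.7, 0.6)   # factor loadings, equal across groups
theta  <- rep(0.5, 4)           # residual variances
alpha  <- c(0, 1)               # factor means for group 1 and group 2
n_g    <- c(300, 700)           # exact group sizes: a 30/70 split of N = 1000

# Hypothetical helper: simulate one group's indicators from its factor mean
gen_group <- function(n, alpha_g) {
  eta <- rnorm(n, mean = alpha_g, sd = 1)                    # latent scores
  eps <- sapply(theta, function(v) rnorm(n, sd = sqrt(v)))   # n x 4 residuals
  y   <- outer(eta, lambda) + eps                            # observed indicators
  colnames(y) <- paste0("y", seq_along(lambda))
  as.data.frame(y)
}

dat   <- rbind(gen_group(n_g[1], alpha[1]), gen_group(n_g[2], alpha[2]))
group <- rep(1:2, times = n_g)

# Shuffle rows so that row order does not reveal membership, and keep the
# label in a separate vector instead of in the analysis data
perm       <- sample(nrow(dat))
dat        <- dat[perm, ]
true_group <- group[perm]

table(true_group)   # group sizes are identical in every replication
```

The data frame dat is what a mixture analysis would see; true_group stays available for checking classification accuracy afterwards.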

2) I am interested in comparing estimated factor scores with true factor scores. Mplus seems to only provide the estimated factor scores. Assuming I generate data in lavaan using the MGFA approach above, how do I save the true factor scores? Is there an existing function for that?

I appreciate any help you can send my way!

Best,

Raul

Keith Markus

Jan 12, 2023, 10:00:30 AM

to lavaan

Raul,

Question 1: It seems to me that the key issue here is that you want the sample size to be fixed and not random. Here is one strategy that might help.

1. Simulate two data sets, one from each model, with the full sample size that you desire.

2. Join them horizontally such that each row has data generated by each model, e.g., using cbind() or data.frame().

3. Use a random binomial variate to choose which of the two is used for each case, using rbinom().

4. Analyze the data containing only the selected data for each case.

Because you are selecting data within row for each case, your N will be fixed but the proportions from each model will be random.
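The four steps above could be sketched in base R roughly as follows. The generator sim_one_factor(), its parameter values, and the 0.3 mixing probability are hypothetical stand-ins for whatever two models are actually being compared; only cbind() and rbinom() come from the steps themselves.

```r
set.seed(456)
n        <- 1000
p_class1 <- 0.3   # assumed probability that a case comes from model 1

# Hypothetical generator for a one-factor, four-indicator model
sim_one_factor <- function(n, factor_mean) {
  lambda <- c(1, 0.8, 0.7, 0.6)
  eta    <- rnorm(n, mean = factor_mean)
  eps    <- matrix(rnorm(n * length(lambda), sd = sqrt(0.5)), nrow = n)
  y      <- outer(eta, lambda) + eps
  colnames(y) <- paste0("y", seq_along(lambda))
  y
}

# Step 1: simulate a full-size data set from each model
d1 <- sim_one_factor(n, factor_mean = 0)
d2 <- sim_one_factor(n, factor_mean = 1)

# Step 2: join them horizontally so every row holds data from both models
wide <- cbind(d1, d2)
colnames(wide) <- c(paste0(colnames(d1), "_m1"), paste0(colnames(d2), "_m2"))

# Step 3: draw a Bernoulli indicator per row to choose the generating model
from_m1 <- rbinom(n, size = 1, prob = p_class1) == 1

# Step 4: keep only the selected model's columns for each row
mixed <- wide[, 1:4]
mixed[!from_m1, ] <- wide[!from_m1, 5:8]
colnames(mixed) <- colnames(d1)
mixed <- as.data.frame(mixed)

nrow(mixed)      # always n
mean(from_m1)    # realized proportion from model 1 varies by replication
```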

Question 2: The last time I checked, there are no true factor scores. simulateData() derives the implied moment matrix from the model and simulates data from that moment matrix. It does not simulate latent variable scores and then use those to simulate observed scores. I believe that the reason for adopting this strategy is to accommodate non-recursive models.
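One way to obtain true factor scores is to generate the data by hand: draw the latent scores first, then build the observed variables from them. The base-R sketch below assumes a recursive one-factor model with made-up parameter values and zero indicator intercepts, and it computes regression-method factor scores from the known population parameters purely for illustration. In a real study the estimated scores would come from a fitted model (for example via lavaan's lavPredict()).

```r
set.seed(789)
n      <- 500
lambda <- c(1, 0.8, 0.7, 0.6)   # assumed loadings
theta  <- rep(0.5, 4)           # assumed residual variances
psi    <- 1                     # assumed factor variance
alpha  <- 0                     # assumed factor mean

# Draw the latent scores first and keep them: these are the true factor scores
eta_true <- rnorm(n, mean = alpha, sd = sqrt(psi))

# Build the observed indicators from the latent scores plus residual noise
eps <- matrix(rnorm(n * 4), nrow = n) %*% diag(sqrt(theta))
y   <- outer(eta_true, lambda) + eps

# Regression-method scores using the population parameters:
# eta_hat = alpha + psi * lambda' Sigma^{-1} (y - mu)
Sigma   <- psi * tcrossprod(lambda) + diag(theta)   # model-implied covariance
w       <- solve(Sigma, psi * lambda)               # scoring weights
mu      <- alpha * lambda                           # implied means (zero intercepts)
eta_hat <- alpha + as.vector(sweep(y, 2, mu) %*% w)

cor(eta_true, eta_hat)   # agreement between true and estimated scores
```

Because eta_true is stored alongside y, it can be compared with scores estimated by any software.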

Keith

------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

Raul Corrêa Ferraz

Jan 12, 2023, 1:05:57 PM

to lavaan

Hi Keith, thank you for the suggestions!

Re: question 1, my concern is more about doing the simplest thing - keeping the proportion fixed. Hypothetically it seems like a good thing to have control over the proportion, but I'm not sure if I'm missing something by doing that.

I'm curious because in Mplus the proportion is fixed for MGFA: if I generate 1000 two-group datasets with a 30/70 split, group 1 has N = 300 in every dataset. That is not the case for mixture modeling. So, I am wondering if there's a good reason for that or if it is that way because it was more convenient to code.

Keith Markus

Jan 12, 2023, 11:34:32 PM

to lavaan

Raul,

It seems to me that if you sample only data sets with an exact 70/30 split then your results generalize to other samples with a 70/30 split, which could be generated by a range of different population proportions with varying sample probabilities. Conversely, if you sample data sets with different splits based on a 70/30 population split, then your results generalize to samples drawn from populations with a 70/30 split. Neither is inherently wrong. They are means to different ends.
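To make the contrast concrete, here is a small base-R sketch of how the size of the smaller class behaves across replications under the two designs. N = 1000 and the 0.3 proportion follow the example in this thread; the number of replications is an arbitrary assumption.

```r
set.seed(2023)
n      <- 1000
p      <- 0.3
n_reps <- 2000   # arbitrary number of replications

# Design A: exact split, so the class size is the same in every replication
n1_fixed <- rep(round(n * p), n_reps)

# Design B: population proportion 0.3, so the class size is binomial
n1_random <- rbinom(n_reps, size = n, prob = p)

range(n1_fixed)    # no variation across replications
range(n1_random)   # varies around n * p

# Empirical versus theoretical standard deviation of the class size
sd(n1_random)
sqrt(n * p * (1 - p))
```

Under design B the theoretical standard deviation is sqrt(1000 * 0.3 * 0.7), about 14.5 cases, so the realized split typically differs from 30/70 by a percentage point or two.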
