I have a question about the use of weights in lavaan, and whether doing so is appropriate for the analysis I aim to run.
I am planning an MGCFA (multi-group confirmatory factor analysis) for ordered-categorical variables where the groups have unequal sizes. I will analyze data from a cross-national survey, and sample sizes vary dramatically from country to country: from about 1000 in Country A, to 2000 in Country B, to 3000 in Country C. There are actually more countries in the dataset, but these three illustrate the core challenge. It has been noted for a while now (e.g., Chen, 2007) that unequal sample sizes affect the sensitivity of the goodness-of-fit indices used to assess measurement invariance.
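For context, the kind of model I have in mind looks like this (a minimal sketch; the data frame `dat`, the grouping variable `country`, and the item names are placeholders, and the one-factor model is purely illustrative):

```r
library(lavaan)

model <- ' F1 =~ item1 + item2 + item3 + item4 '

# Configural model: ordered indicators, so lavaan defaults to WLSMV
fit_config <- cfa(model, data = dat, group = "country",
                  ordered = c("item1", "item2", "item3", "item4"))

# A more constrained model for invariance testing, e.g. equal
# thresholds and loadings across countries
fit_constr <- cfa(model, data = dat, group = "country",
                  ordered = c("item1", "item2", "item3", "item4"),
                  group.equal = c("thresholds", "loadings"))
```

The concern is that fit indices comparing such nested models may behave differently when one group is three times the size of another.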
I first considered randomly sampling a subset of equal size from each country, but that means throwing out a large amount of data for the countries with larger samples. E.g., if the subsamples have, say, 900 cases each, I would keep most of the Country A cases but would discard ~50% of the Country B sample and ~2/3 of Country C.
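Just to be concrete, the subsampling I considered would look something like this (a sketch; `dat` and `country` are placeholders):

```r
# Draw an equal-size random subsample (e.g., 900 cases) per country.
set.seed(123)
n_sub <- 900
dat_equal <- do.call(rbind, lapply(split(dat, dat$country), function(d) {
  d[sample(nrow(d), min(n_sub, nrow(d))), ]
}))
table(dat_equal$country)  # roughly 900 per country
```

This works, but the discarded data (and the dependence of results on the particular random draw) is what I would like to avoid.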
To avoid all that data loss, I wondered about weighting the data from each country so that the countries contribute 'equally' (i.e., with the 'same effective sample size') to the MGCFA after weighting. For instance, assigning a weight of 1 to each observation from Country A, 0.5 to observations from Country B, and 0.33 to Country C, so that each country would count as ~1000 observations.
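Constructing such weights is straightforward; the sketch below (again with placeholder names) just divides a common target size by each country's actual n:

```r
# Weight each case by (target n) / (country n), so each country
# contributes an effective sample of ~1000 cases.
target_n <- 1000
n_by_country <- table(dat$country)
dat$wt <- target_n / as.numeric(n_by_country[as.character(dat$country)])

tapply(dat$wt, dat$country, sum)  # each country should sum to ~target_n
```

The question is whether lavaan can actually use such a weight variable in the way I intend.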
lavaan 0.6 handles sampling weights, but it rescales the weights to the number of rows in the dataset, and it apparently only works with (robust) ML estimators. Moreover, what I am considering is not to use sampling weights (I don't have any) but simply to assign weights to some groups in order to downweight them and 'equalize' the contributions of groups of different sizes to the analysis.
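For reference, what I understand the existing lavaan feature to look like (a sketch; I am not sure this combination is even supported with ordered indicators, since sampling weights seem to require an ML-type estimator rather than WLSMV):

```r
# lavaan >= 0.6: pass the name of a weight variable via sampling.weights.
# Note that lavaan rescales the weights internally, which may defeat
# the 'equalize the groups' purpose described above.
fit_w <- cfa(model, data = dat, group = "country",
             sampling.weights = "wt", estimator = "MLR")
```

So even if this runs, it may not implement the downweighting scheme I have in mind.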
So these are my questions; any suggestion will be much appreciated:
Is there a way to apply such a weighting strategy in lavaan? If it is not possible (or if it is pure nonsense), what would be an alternative way to prevent the larger groups from 'dominating' model fit and potentially obscuring a lack of invariance between/across groups?
(There is a non-negligible percentage of missing data in the datasets, so I will most probably use multiple imputation. I add this note because I will then have M 'complete' datasets for each country, which increases the computational burden of the analysis. I considered the 'downweighting' strategy above partly for its simplicity; strategies based on resampling would greatly complicate matters given my limited computational resources.)
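In case it is relevant to any suggested solution: for the multiple-imputation side I was planning on something like the following (a sketch; `imp_list` is a hypothetical list of M imputed data frames, and I understand semTools provides pooling for lavaan models, though the exact interface may have changed across versions):

```r
library(semTools)  # cfa.mi() fits the model to each imputed dataset and pools

# imp_list: hypothetical list of M imputed data frames for all countries
fit_mi <- cfa.mi(model, data = imp_list, group = "country",
                 ordered = c("item1", "item2", "item3", "item4"))
summary(fit_mi)
```

Multiplying this by several nested invariance models is what makes resampling-based strategies unattractive for me.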