empty categories of variables - how to handle?


Isabella Otto

Nov 20, 2014, 5:36:49 AM
to lav...@googlegroups.com
Hello,

I'm doing a measurement invariance analysis for categorical data, but I have the problem that for one item in one group no participant answered "1". I get this error message:
some categories of variable `PL3' are empty in group 2.

I also have this problem with other scales and items that are extremely skewed, because group 2 is very small (n = 150).

Now I was wondering how to deal with this. The two options that came to my mind were, first, deleting the whole item from the analysis, or, second, doing the measurement invariance analysis treating the data as continuous, because then this problem no longer occurs. Of course, I'm not satisfied with either of these solutions.

Is there anything else you can think of?

Thanks,
Isabella

yrosseel

Nov 27, 2014, 1:47:13 PM
to lav...@googlegroups.com
On 11/20/2014 11:36 AM, Isabella Otto wrote:
> Now I was wondering how to deal with this. The two options that came
> to my mind were, first, deleting the whole item from the analysis,
> or, second, doing the measurement invariance analysis treating the
> data as continuous, because then this problem no longer occurs. Of
> course, I'm not satisfied with either of these solutions.
>
> Is there anything else you can think of?

There is no elegant solution. But you may consider collapsing two
adjacent response categories.
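
For example, a minimal sketch (assuming PL3 is scored 1-5 and stored as a numeric column in a data frame `dat`; the name `dat` is just for illustration):

## merge the empty category "1" into its neighbour "2" in every group, so the
## item has the same set of observed categories in all groups
dat$PL3 <- pmax(dat$PL3, 2)
## declare the recoded item ordered so lavaan treats it as categorical
dat$PL3 <- ordered(dat$PL3)

Collapsing in every group (not just the group with the empty cell) keeps the response scale identical across groups.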

Yves.

bfzldh

Sep 3, 2018, 9:12:15 PM
to lavaan
Hello,
Is it okay to select one case in group 2 and change its response on variable `PL3' to an adjacent category, to avoid the "empty category" problem?
It seems that this trick would make little difference to the model.

Thanks,
bfzldh

On Friday, November 28, 2014 at 2:47:13 AM UTC+8, yrosseel wrote:

Terrence Jorgensen

Sep 12, 2018, 4:53:36 PM
to lavaan
Is it okay to select one case in group 2 and change its response on variable `PL3' to an adjacent category, to avoid the "empty category" problem?
It seems that this trick would make little difference to the model.

Yes, that is the only choice you have unless you collect more data.  And it does not change the interpretations of the other model parameters.  

Terrence D. Jorgensen
Postdoctoral Researcher, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Carlos Lemos

Sep 15, 2018, 9:30:21 AM
to lav...@googlegroups.com

I also have this problem. The top level of one variable in my model (V49) is empty for some groups (two countries, Denmark and Russia). I tried to change one response in each of these countries by picking one random record from each group and manually recoding it as follows:

 

set.seed(1234)

ind.1 <- which(df[, "COUNTRY.NAME"] %in% "DK-Denmark" & df[, "V49"] %in% "Weekly")
ind.2 <- which(df[, "COUNTRY.NAME"] %in% "RU-Russia" & df[, "V49"] %in% "Weekly")

df$V49[sample(ind.1, 1)] <- "Daily"
df$V49[sample(ind.2, 1)] <- "Daily"

 

When I inspect the data frame with "table" prior to analysis, the two changed responses are shown correctly (other countries omitted):

 

> table(df$V49, df$COUNTRY.NAME)[, c(6, 19)]

          DK-Denmark RU-Russia
  Never          827       292
  Yearly         501       416
  Monthly         85        52
  Weekly         316        15
  Daily            1         1

However, "cfa" again complains that the counts are zero for one group, and the error message shows a vector of counts that I cannot relate to any of the results from "table":

 

> # CONFIGURAL
> config <- measEq.syntax(configural.model = model, data = df, parameterization = "theta",
+                         ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016",
+                         group = "COUNTRY.NAME", group.equal = "", return.fit = TRUE)
Error in lav_samplestats_step1(Y = Data, ov.names = ov.names, ov.types = ov.types,  :
  lavaan ERROR: some categories of variable `V49' are empty in group 18; frequencies are [128 139 18 8 0]

I tried three solutions that worked: changing 10 cases in each of the two groups instead of one, merging the two top levels for the variable, and removing the two groups from the analysis.

The first solution is obviously not acceptable. The second and third are better, but far from good.

 

I'm now trying multiple imputation, because the data frame also has a substantial proportion of missing values.

 

But what I do not understand is why, after changing the values manually and verifying the change with "table", "cfa" still issues the error message.

 

Carlos M. Lemos



Terrence Jorgensen

Sep 16, 2018, 10:44:53 AM
to lavaan

The first solution is obviously not acceptable.


I wouldn't even qualify reassigning 1 observation as "acceptable". That changes the data.

The second and third are better, but far from good.


If the "daily" category has infrequent responses in the other countries, I would disagree that merging the two top levels of the variable is far from good.  If you only have to ignore a minimal amount of individual differences between "daily" and "weekly" (in which case those threshold estimates would be very imprecise anyway), then it does not hurt to simply have a merged "at least weekly" category.  All of the estimated thresholds and other parameters would still have the same interpretation, and if anything, estimation would be easier and less error-prone.
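
For instance, something along these lines would merge the top level for every country (a rough sketch, assuming V49 carries the labels shown in the table earlier in this thread):

## collapse "Daily" into "Weekly" for all countries, then rebuild the ordered
## factor without the now-unused top level
df$V49 <- as.character(df$V49)
df$V49[df$V49 == "Daily"] <- "Weekly"
df$V49 <- ordered(df$V49, levels = c("Never", "Yearly", "Monthly", "Weekly"))
## the top level now effectively means "at least weekly"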

Carlos Lemos

Sep 27, 2018, 12:34:14 PM
to lav...@googlegroups.com
Hi again, Terrence,

Thank you once again for your response. I know that re-assigning observations is not acceptable, but I was trying to understand why cfa() keeps complaining (which I still don't understand, but this is not important).
The other two solutions I mentioned are not good in the sense that removing the two countries gives fine results only for the remaining ones, while merging the top categories degrades information on one important variable for all countries. I will run both analyses, and if the conclusions about invariance do not change when I merge the categories, I will describe what I did and report the results for the solution that keeps all the countries.

I'm trying to use multiple imputation because of the missing values (above 15% for several items).  The sequence cfa.mi() -> measEq.syntax() in your example works fine, but when I run fitMeasures() I get warning messages like these (df.list is a list of 20 imputed data frames):

> # CONFIGURAL
> configural <- cfa.mi(model, data = df.list, parameterization = "theta",
+                      std.lv = TRUE, estimator = "WLSMV", group = group, group.equal = "")
> configural <- measEq.syntax(configural.model = configural, parameterization = "theta",
+                             ID.fac = "std.lv", ID.cat = "Wu.Estabrook.2016", estimator = "WLSMV",
+                             group = group, group.equal = "", return.fit = TRUE)
> fitMeasures(configural,fit.indices)

"D3" only available using maximum likelihood estimation. Changed test to "D2".
Robust corrections are made by pooling the naive chi-squared statistic across 20 imputations for which the model converged, then applying the average (across imputations) scaling factor and shift parameter to that pooled value.
To instead pool the robust test statistics, set test = "D2" and pool.robust = TRUE.

"D3" only available using maximum likelihood estimation. Changed test to "D2".
Robust corrections are made by pooling the naive chi-squared statistic across 20 imputations for which the model converged, then applying the average (across imputations) scaling factor and shift parameter to that pooled value.
To instead pool the robust test statistics, set test = "D2" and pool.robust = TRUE.

                          chisq                              df                          pvalue                    chisq.scaled
                       4818.201                         118.000                           0.000                        9855.244
                      df.scaled                   pvalue.scaled            chisq.scaling.factor          chisq.shift.parameters
                        118.000                           0.000                           0.490                          30.747
                 baseline.chisq                     baseline.df                 baseline.pvalue           baseline.chisq.scaled
                     504419.478                         156.000                           0.000                      181322.623
             baseline.df.scaled          baseline.pvalue.scaled   baseline.chisq.scaling.factor baseline.chisq.shift.parameters
                        156.000                           0.000                           2.783                          99.954
                            cfi                      cfi.scaled                           rmsea                  rmsea.ci.lower
                          0.991                           0.946                           0.048                           0.046
                 rmsea.ci.upper                    rmsea.pvalue                    rmsea.scaled           rmsea.ci.lower.scaled
                          0.049                           1.000                           0.068                           0.067
          rmsea.ci.upper.scaled             rmsea.pvalue.scaled                            srmr
                          0.070                           0.000                           0.027
Warning message:
In pchisq(X2.sc, DF.sc, ncp = N * DF.sc * 0.05^2/nG, lower.tail = FALSE) :
  full precision may not have been achieved in 'pnchisq'
>
I don't understand these messages well. Also, it seems that fitMeasures() does not accept the arguments test and pool.robust.
I have the following questions: how does fitMeasures() pool the CFI, RMSEA, and SRMR that are needed for the invariance tests? Which fit measures can be pooled and which cannot?
(I read this paper on using imputation in regression models:
 

which states that some statistics can be combined using Rubin's rules, but others cannot).

Anyway: your measEq.syntax() function is great and has worked very nicely so far!
Thank you once again for your attention and advice.
Best regards,

Carlos



Terrence Jorgensen

Sep 28, 2018, 7:01:47 AM
to lavaan
I don't understand these messages well.

The first two are the same message (printed once for the hypothesized model and once for the baseline model used for CFI, TLI), just providing information about other pooling options.  You can ignore them, but if you want to pass those arguments, you can.

Also, it seems that fitMeasures() does not accept the arguments test and pool.robust

You can; I just forgot to document on the class?lavaan.mi help page that fitMeasures() has a "..." argument to pass arguments to lavTestLRT(.mi).
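
For example, to pool the robust test statistics as the printed message suggests, something like this should work:

fitMeasures(configural, fit.indices, test = "D2", pool.robust = TRUE)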

I have the following doubts: how does fitMeasures() pool the cfi, rmsea and srmr that are needed for the invariance tests?

It applies the complete-data formulas to the pooled chi-squared (and pooled baseline-model chi-squared for incremental indices) for any chi-squared-based fit index.  For (S)RMR, it applies the complete-data formulas to observed and model-implied moments based on pooled parameter estimates.

Which fit measures can be pooled and which cannot?
(I read this paper on using imputation in regression models:
 

which states that some statistics can be combined using Rubin's rules, but others cannot).

Correct, although their Table VIII only asserts that a model's chi-squared/LRT statistic cannot be pooled because it is not an estimate of a parameter, so Rubin's rules for pooling estimates do not apply.  But it can be pooled using different methods, as documented on the ?lavTestLRT.mi help page.

Regarding the SRMR, the differences between observed and model-implied moments are also NOT estimated parameters, but rather functions of parameters.  
  • The observed moments can be considered the parameter estimates of the saturated model (which is used to calculate the model's chi-squared/LRT statistic). So the pooled observed moments are simply the average of the sample statistics in each imputation.
  • The model-implied moments are not estimated parameters, but are functions of the estimated model parameters.  So the pooled model-implied moments are calculated the same way they are for complete data, but using the pooled parameter estimates.  In other words, this is the "sensible transformation before combination" they mention in their Table VIII.

Carlos Lemos

unread,
Sep 28, 2018, 11:04:21 AM9/28/18
to lav...@googlegroups.com
Hi again, Terrence.

Thank you very much for your response. It is very valuable, because I could not have tried using imputation without knowing how the pooled estimates are computed.
I will analyze both your response and the lavTestLRT(.mi) help page and try running the models with imputed data sets.
Best regards,

Carlos

Jeremy Eberle

Sep 17, 2023, 12:59:10 PM
to lavaan
When collapsing two adjacent response categories, would you do this (a) for only the item that has the empty response category or (b) for all items (i.e., even for those that contain no empty response categories)?

In my case, only 1 item (out of 36 five-point Likert items with a possible range of 0 to 4) has no responses of 0. I am conducting EFA on these items with efa(estimator = "WLSMV") given that some items are heavily skewed.

Thanks!

Jeremy

Terrence Jorgensen

Sep 18, 2023, 3:54:36 PM
to lavaan

When collapsing two adjacent response categories, would you do this (a) for only the item that has the empty response category or (b) for all items (i.e., even for those that contain no empty response categories)?

In SEM, it is only necessary to do so for the relevant item(s), not all items.
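
For example, a rough sketch (assuming the items are stored as integers 0-4 in a data frame `dat`, and the affected item is called `x07`; both names are hypothetical):

## collapse the empty bottom category of the affected item only:
## 0 is merged into 1, so x07 has four categories; the other 35 items keep five
dat$x07 <- pmax(dat$x07, 1)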

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics