Missing data with FIML in multiple group SEM


Patrick Forscher

Aug 31, 2018, 5:33:06 PM
to lavaan
I want to compare the fit of a multiple-group SEM and a single-group SEM.  I have a fair amount of missing data and am handling that using FIML estimation (missing="fiml" in the lavaan() function).

When I fit a multiple-group SEM (using, say, group="condition" in the call to lavaan()), I've noticed that FIML estimation proceeds separately on each group.  Based on my understanding, this means that the multiple-group SEM is not directly comparable to a single-group SEM, which uses observations from both groups to do the FIML estimation rather than just the data from a single group.

How should I be dealing with this?  When it's handling the missing data, is there a way to force the multiple-group SEM to consider the data from both groups rather than just a single group?

Jeremy Miles

Aug 31, 2018, 11:30:02 PM
to lavaan
That's an interesting question. A multiple-group analysis can use the same information as a single-group analysis and can give exactly (or very nearly exactly) the same results, even when data are missing.

Here's some code to generate a data frame that has some missing data (missing at random, not completely at random), and then analyze it: y1 and y2 are regressed on x.

First, generate some data where the true effects of x on y1 and y2 are 0.5 and 0.7.
library(lavaan)

set.seed(12345)
n <- 10000
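# generate some data; x has length 2 * n, so the n-length columns are
# recycled and the data frame has 20,000 rows (10,000 per group):
df <- data.frame(x = rep(c(0, 1), n),
                 F = rnorm(n),
                 randvar = runif(n))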

df$y1 <- df$F + rnorm(n)
df$y2 <- df$F + rnorm(n)

df$y1 <- df$y1 + df$x * 0.5
df$y2 <- df$y2 + df$x * 0.7


# now delete some y1 scores so that the data are MAR: missingness on y1 depends on y2, x, and a random factor.
df$y1 <- ifelse(df$x == 0 & df$randvar > 0.6 & df$y2 > 0, NA, df$y1)

We end up with about 20% of y1 missing in group x = 0.

Let's do a regression, and see what estimates we get:

> summary(lm(cbind(y1, y2) ~ x, data = df))
Response y1 :

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.15223    0.01583  -9.616   <2e-16 ***
x            0.66517    0.02124  31.318   <2e-16 ***

Response y2 :
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.28044    0.01553  -18.05   <2e-16 ***
x            0.98025    0.02084   47.04   <2e-16 ***

(I cut out the less interesting bits.)
lm() drops every row with a missing y1 (listwise deletion), so the estimates for both y1 and y2 are biased upwards. They should be 0.5 and 0.7, but they are 0.67 and 0.98.

Let's do a single-group analysis in lavaan:


singlemodel <- "y1 ~ x
                y2 ~ x
                y1 ~~ y1
                y2 ~~ y2
                y2 ~~ y1
"
> m1 <- sem(singlemodel, data = df, missing = "ml")
> summary(m1)
Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  y1 ~                                                
    x                 0.521    0.021   24.765    0.000
  y2 ~                                                
    x                 0.701    0.020   35.387    0.000


(Cutting out the boring bits).
We got 0.521 and 0.701 - pretty close to the right estimates! Cool! 

Now let's do a multiple group:


twogroupmodel <- 
   "y1 ~ c(a, b) * 1  # intercept
    y2 ~ c(c, d) * 1  # intercept
    y1 ~~ c(v1a, v1) * y1
    y2 ~~ c(v2a, v2) * y2
    y2 ~~ c(v12a, v12) * y1
    diff1 := b - a
    diff2 := d - c"

m2 <- sem(twogroupmodel, data = df, group = "x", missing = "ml")
summary(m2)

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)
    diff1             0.524    0.021   24.745    0.000
    diff2             0.701    0.020   35.387    0.000

For y1 (diff1), which had missing data, the estimate (0.524) is within 0.003 of the single-group estimate (0.521). The SE and z are (pretty much) identical to the single-group model.

For y2 (diff2), which had no missing data, the estimate, SE, and z are the same.

Jeremy

P.S. I just got nerd sniped



Patrick Forscher

Sep 1, 2018, 4:42:53 PM
to lavaan
Thanks for your reply, Jeremy!  However, I'm still a little confused.  Here's what lavaan says about the optimization of m1 (your one-group model):


  Optimization method                           NLMINB
  Number of free parameters                          7

  Number of observations                         20000
  Number of missing patterns                         2

  Estimator                                         ML
  Model Fit Test Statistic                       0.000
  Degrees of freedom                                 0

Here's what it says about your m2 (your two-group model):

  Optimization method                           NLMINB
  Number of free parameters                         10

  Number of observations per group         
  0                                              10000
  1                                              10000
  Number of missing patterns per group     
  0                                                  2
  1                                                  1

  Estimator                                         ML
  Model Fit Test Statistic                       0.000
  Degrees of freedom                                 0
  Minimum Function Value               0.0000000000000

I may be misinterpreting the output here, but this seems to indicate that lavaan is handling the missing data separately for each group -- in other words, it uses FIML on each group, ignoring the information from the other group.  This becomes a problem if I wish to compare the fit of a single-group vs multi-group SEM (i.e., test if the parameters are similar across groups).

Jeremy Miles

Sep 2, 2018, 12:31:08 AM
to lavaan

Oops. I made a mistake. Thanks for spotting.

Model 2 should be:


twogroupmodel <- 
   "y1 ~ c(a, b) * 1  # intercept
    y2 ~ c(c, d) * 1  # intercept
    y1 ~~ c(v1, v1) * y1
    y2 ~~ c(v2, v2) * y2
    y2 ~~ c(v12, v12) * y1
    diff1 := b - a
    diff2 := d - c"

The variances and covariances must be constrained to be equal across groups (the same label in both slots of c()), which is what the single-group model assumes.
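
One way to check the corrected model is to refit both models and put the estimates side by side. The sketch below rebuilds the simulated data from earlier in the thread; the intercept labels (i1a, i1b, i2a, i2b) and the object names are renamed purely for illustration. If the constraint does what it should, the two columns should agree up to optimizer tolerance.

```r
library(lavaan)

# Rebuild the simulated data from earlier in the thread (same seed, same draws).
set.seed(12345)
n <- 10000
df <- data.frame(x = rep(c(0, 1), n), F = rnorm(n), randvar = runif(n))
df$y1 <- df$F + rnorm(n) + df$x * 0.5
df$y2 <- df$F + rnorm(n) + df$x * 0.7
df$y1 <- ifelse(df$x == 0 & df$randvar > 0.6 & df$y2 > 0, NA, df$y1)

single_model <- "
  y1 ~ x
  y2 ~ x
  y1 ~~ y1
  y2 ~~ y2
  y2 ~~ y1
"

# Same label in both slots of c() = parameter held equal across groups.
two_group_model <- "
  y1 ~ c(i1a, i1b) * 1
  y2 ~ c(i2a, i2b) * 1
  y1 ~~ c(v1, v1) * y1
  y2 ~~ c(v2, v2) * y2
  y2 ~~ c(v12, v12) * y1
  diff1 := i1b - i1a
  diff2 := i2b - i2a
"

m1 <- sem(single_model, data = df, missing = "ml")
m2 <- sem(two_group_model, data = df, group = "x", missing = "ml")

pe1 <- parameterEstimates(m1)
pe2 <- parameterEstimates(m2)
single_est <- pe1$est[pe1$op == "~" & pe1$rhs == "x"]  # slopes for y1, y2
multi_est  <- pe2$est[pe2$op == ":="]                  # diff1, diff2

print(data.frame(outcome = c("y1", "y2"),
                 single  = single_est,
                 multi   = multi_est))
```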

J


Mauricio Garnier-Villarreal

Sep 3, 2018, 6:31:07 PM
to lavaan
Patrick

Actually, FIML does estimate the parameters for each group separately. But FIML maximizes one overall multiple-group log-likelihood, so it does take the information from both groups into account in its estimation. This is basically the same process as for a multiple-group SEM without missing data: the parameters are estimated separately, but the ML optimization is a function of the overall LL.
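
A small sketch of that point, under stated assumptions: the missing values in x1 are introduced artificially here, and the object names are illustrative. With no cross-group constraints, the multiple-group log-likelihood should equal the sum of the log-likelihoods from fitting each group on its own; equality constraints are what tie the groups together.

```r
library(lavaan)

# Assumption for illustration: delete 30 x1 values at random so FIML has
# something to do (HolzingerSwineford1939 itself is complete).
set.seed(1)
dat <- HolzingerSwineford1939
dat$x1[sample(nrow(dat), 30)] <- NA

HS.model <- "visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9"

# One multiple-group fit, no equality constraints across schools.
fit_mg <- cfa(HS.model, data = dat, group = "school", missing = "fiml")

# The same model fitted to each school separately.
fit_by_school <- lapply(split(dat, dat$school), function(d) {
  cfa(HS.model, data = d, missing = "fiml")
})

ll_mg  <- as.numeric(logLik(fit_mg))
ll_sum <- sum(sapply(fit_by_school, function(f) as.numeric(logLik(f))))

cat("multiple-group logLik:    ", ll_mg, "\n")
cat("sum of per-group logLiks: ", ll_sum, "\n")
```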

Now, if you are interested in comparing groups, I would recommend starting from the multiple-group model and equating parameters between groups. This gives you greater flexibility to identify which parameters differ between groups. If, in the end, all parameters can be equated between groups, that would suggest the one-group model is equally good and simpler.
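
As a rough sketch of that workflow (the object names and the particular choice of constraints are assumptions for illustration), one can fit the multiple-group model with and without equality constraints and compare the two nested fits with a likelihood ratio test:

```r
library(lavaan)

HS.model <- "visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9"

# All parameters free in each school.
fit_free <- cfa(HS.model, data = HolzingerSwineford1939,
                group = "school", missing = "fiml")

# Loadings and intercepts equated across schools.
fit_equal <- cfa(HS.model, data = HolzingerSwineford1939,
                 group = "school", missing = "fiml",
                 group.equal = c("loadings", "intercepts"))

# Nested comparison: does equating these parameters worsen fit?
lrt <- lavTestLRT(fit_free, fit_equal)
print(lrt)

df_diff <- unname(fitMeasures(fit_equal, "df") - fitMeasures(fit_free, "df"))
```

Further parameter types can be added to group.equal step by step to see which constraints the data tolerate.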

Comparing the single-group SEM to the multiple-group SEM also raises the question of whether they should be compared as nested or non-nested models. For example, you could use the vuongtest function from the nonnest2 package and compare them as non-nested models:

?nonnest2::vuongtest
> library(lavaan)
> library(nonnest2)
> HS.model <- 'visual  =~ x1 + x2 + x3
+               textual =~ x4 + x5 + x6
+               speed   =~ x7 + x8 + x9 '
> fit1 <- cfa(HS.model, data=HolzingerSwineford1939)
> fit2 <- cfa(HS.model, data=HolzingerSwineford1939, group="school")
> vuongtest(fit1, fit2)

Model 1 
 Class: lavaan 
 Call: lavaan::lavaan(model = HS.model, data = HolzingerSwineford1939, ...

Model 2 
 Class: lavaan 
 Call: lavaan::lavaan(model = HS.model, data = HolzingerSwineford1939, ...

Variance test 
  H0: Model 1 and Model 2 are indistinguishable 
  H1: Model 1 and Model 2 are distinguishable 
    w2 = 0.412,   p = 0.016

Non-nested likelihood ratio test 
  H0: Model fits are equal for the focal population 
  H1A: Model 1 fits better than Model 2 
    z = -4.989,   p = 1
  H1B: Model 2 fits better than Model 1 
    z = -4.989,   p = 3.035e-07

Hope this helps

Patrick Forscher

Sep 4, 2018, 10:07:18 AM
to lavaan
Interesting, thank you, Mauricio! Do you happen to have a reference that I could look at for how FIML works in multiple-group SEMs? I'd like to do a bit of reading so I can understand this issue better on my own.

Mauricio Garnier-Villarreal

Sep 5, 2018, 2:47:20 PM
to lavaan
Patrick

Sorry, I don't have a FIML paper specifically about multiple groups.

As a diagnostic, you can request the fraction of missing information (FMI) from lavaan and compare it between the single-group and multiple-group models:

summary(fit1, fmi = TRUE)
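
FMI is only reported for models fitted with missing = "ml" (or "fiml"). A minimal sketch, assuming artificially introduced missing values in x1 and illustrative object names, and shown for a single-group fit only, pulls the FMI column into a data frame so it can be compared across models:

```r
library(lavaan)

# Assumption for illustration: delete 30 x1 values at random.
set.seed(1)
dat <- HolzingerSwineford1939
dat$x1[sample(nrow(dat), 30)] <- NA

HS.model <- "visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9"

fit_fiml <- cfa(HS.model, data = dat, missing = "fiml")

# fmi = TRUE adds an 'fmi' column to the parameter table.
pe <- parameterEstimates(fit_fiml, fmi = TRUE)
print(pe[pe$op == "=~", c("lhs", "op", "rhs", "est", "se", "fmi")])
```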