Combine FIML and ML to improve estimation time?


Rogier Kievit

Sep 30, 2021, 3:42:04 AM
to lavaan
Dear all,
a bit of an oddball question perhaps, but is anyone familiar with combining FIML and regular ML to improve estimation time? Specifically (an idea by Ethan McCormick, who works with me): use regular ML for the complete cases and FIML for the cases with missingness, then combine the two likelihoods, much like one combines individual likelihoods in FIML. In a very non-thorough exploration (code attached), fitting a model this way is quicker than using FIML for the whole sample. Obviously this only starts to matter in practice with (very) big data, but we are exploring some datasets for which it could matter materially. Toy reprex attached:
Standard ML on complete data: 0.2433321 seconds
FIML on complete data: 1.978944 seconds
FIML on data with missingness: 2.23119 seconds
Hybrid FIML, i.e. regular ML on part a (without missingness), then FIML on the other half (with missingness): 1.486024 seconds

We could of course explore this further ourselves, but perhaps this has already been explored (or even implemented) in lavaan, or there might be problems with it that we haven't thought of yet. I could imagine that either dividing the data into complete and incomplete parts, or combining the two likelihoods afterwards, might wipe out any gains, but maybe not. The question was inspired in part by this nice piece on frugal computing: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009324#sec005
best,
Rogier

hybrid_fiml.R

Edward Rigdon

Oct 2, 2021, 5:35:20 PM
to lav...@googlegroups.com
Rogier--
     No specific insight, but this reminds me of ML vs GLS from the dawn of factor-based SEM. Joreskog figured that GLS, which inverts the empirical covariance matrix rather than the model-implied covariance matrix, would be faster than ML because the empirical matrix does not change, so the matrix inversion needed to be done only once, instead of with every iteration. But he found that ML converged in fewer iterations. Between more, faster iterations vs fewer slower iterations, it was a wash.
     I do wonder why FIML with complete data is so slow. That may be a code issue which could be improved. How old is the code?
--Ed Rigdon


Shu Fai Cheung

Oct 2, 2021, 8:34:34 PM
to lavaan
An interesting topic.

I checked and I believe the bottleneck is in finding the data patterns. You can try this:

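library(lavaan)
library(profvis)
# Profile FIML on the complete data (myModel and mydata_complete are objects from the attached script)
p_c_fiml <- profvis(sem(myModel, data = mydata_complete, missing = "fiml"))
p_c_fiml  # printing the object opens the interactive profile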

Only very limited information is available in the plot if you do not load the source of lavaan. Nevertheless, you can still see that a lot of time is spent identifying the missing data patterns by lav_data_missing_patterns().

One reason Hybrid FIML is faster, I believe, is that part of the work of identifying missing data patterns is done before calling lavaan::sem(), by separating the dataset into two parts, one with complete data and one with missing data. lavaan::sem() then does not need to spend time identifying missing data patterns for the complete half of the data.

-- Shu Fai

P.S.: Note that, for the comparison to be fair in the complete data case, meanstructure should be set to TRUE and information set to "observed" in standard ML, because these are the values of these options if we set missing to "fiml". Nevertheless, ML is still much faster after these settings have been added, as you found.
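A minimal sketch of such a like-for-like comparison, assuming lavaan is installed. The model and the HolzingerSwineford1939 dataset (shipped with lavaan, no missing values on x1 to x9) are stand-ins for illustration, not the attached reprex, so the timings will differ from those reported above.

```r
library(lavaan)

# Stand-in model and data (assumption: not the model from hybrid_fiml.R)
model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
dat <- HolzingerSwineford1939

# Standard ML, with the two options that missing = "fiml" switches on
t_ml <- system.time(
  fit_ml <- sem(model, data = dat, meanstructure = TRUE, information = "observed")
)

# FIML on the same complete data
t_fiml <- system.time(
  fit_fiml <- sem(model, data = dat, missing = "fiml")
)

# Elapsed seconds for each fit
rbind(ML = t_ml, FIML = t_fiml)[, "elapsed"]

# With complete data the two fits should reach the same log-likelihood
c(ML = as.numeric(logLik(fit_ml)), FIML = as.numeric(logLik(fit_fiml)))
```

On a dataset this small the timing gap may be negligible; the point is only that both calls now estimate the same model with the same options.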

Shu Fai Cheung

Oct 2, 2021, 8:58:09 PM
to lavaan
I forgot to mention that the sample code is for FIML on complete data. A lot of time is spent finding the missing data patterns, even though this is unnecessary for complete data, which results in a much longer processing time even in the complete data case. But it's not lavaan's fault: by setting missing = "fiml" we tell it that the dataset may have missing data, so it just does what we tell it to do.

-- Shu Fai

Shu Fai Cheung

Oct 3, 2021, 1:54:39 PM
to lavaan
Please pardon me for trying to answer this question. I may be wrong but I want to try because I am interested in knowing how things are implemented.

I think lavaan has already implemented the proposed approach. I believe log likelihoods are not computed for each individual, but for each missing data pattern.

After running the following line in the original code:

fit <- sem(myModel, data=mydata_partb_missing,missing='fiml')

You can take a look at the following slot:

fit@SampleStats@missing

It contains the sample statistics for each missing data pattern. To my understanding, this slot is used in the discrepancy function if missing = "fiml".
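The equivalence between case-by-case and pattern-by-pattern computation can be checked in base R. The sketch below is not lavaan's code: it assumes multivariate normal data, uses made-up values for mu and Sigma, and the function names are hypothetical. It evaluates the same log-likelihood once per row and once per missing data pattern (from each pattern's size, mean vector and covariance matrix).

```r
set.seed(1)
n <- 200
mu <- c(0, 1, 2)                       # illustrative parameter values
Sigma <- matrix(c(1.0, 0.5, 0.3,
                  0.5, 1.0, 0.4,
                  0.3, 0.4, 1.0), 3, 3)

# Simulate normal data, then delete some values
X <- matrix(rnorm(n * 3), n, 3) %*% chol(Sigma) + rep(mu, each = n)
X[sample(n, 40), 3] <- NA
X[sample(n, 30), 1] <- NA

log_det <- function(S) as.numeric(determinant(S, logarithm = TRUE)$modulus)

# One term per row, using only that row's observed variables
loglik_casewise <- function(X, mu, Sigma) {
  total <- 0
  for (i in seq_len(nrow(X))) {
    o <- !is.na(X[i, ])
    if (!any(o)) next
    d <- X[i, o] - mu[o]
    S_oo <- Sigma[o, o, drop = FALSE]
    total <- total - 0.5 * (sum(o) * log(2 * pi) + log_det(S_oo) +
                            drop(crossprod(d, solve(S_oo, d))))
  }
  total
}

# One term per missing data pattern, using pattern-level summary statistics
loglik_patternwise <- function(X, mu, Sigma) {
  obs <- !is.na(X)
  key <- apply(obs, 1, function(r) paste(as.integer(r), collapse = ""))
  total <- 0
  for (rows in split(seq_len(nrow(X)), key)) {
    o <- obs[rows[1], ]
    if (!any(o)) next
    Xp <- X[rows, o, drop = FALSE]
    n_p <- nrow(Xp)
    m_p <- colMeans(Xp)                          # pattern mean vector
    S_p <- crossprod(sweep(Xp, 2, m_p)) / n_p    # pattern covariance (divisor n_p)
    S_inv <- solve(Sigma[o, o, drop = FALSE])
    d <- m_p - mu[o]
    total <- total - 0.5 * n_p * (sum(o) * log(2 * pi) +
                                  log_det(Sigma[o, o, drop = FALSE]) +
                                  sum(S_inv * S_p) +   # trace of S_inv %*% S_p
                                  drop(crossprod(d, S_inv %*% d)))
  }
  total
}

ll_case <- loglik_casewise(X, mu, Sigma)
ll_pattern <- loglik_patternwise(X, mu, Sigma)
all.equal(ll_case, ll_pattern)
```

In this formulation the complete cases are simply the pattern in which every variable is observed, so their contribution already reduces to the usual complete-data expression.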

By the way, I think the proposed method could indeed be applied in some form, e.g., by telling lavaan which rows have complete data so that it searches for missing data patterns only among the rows with missing data; identifying rows with complete data is fast. In the missing data scenario, the slow processing is due to the time spent on the half of the sample without missing data.
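A minimal base R sketch of that shortcut, on simulated data. It is an illustration only and does not reproduce lav_data_missing_patterns(); the helper pattern_key() is hypothetical. Complete rows are flagged with complete.cases(), and the row-by-row pattern search is run only on the remaining rows.

```r
set.seed(42)
n <- 1000
X <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("y", 1:4)))
X[sample(n, 100), 2] <- NA
X[sample(n, 50), 4] <- NA

# Hypothetical helper: one string per row, "1" = observed, "0" = missing
pattern_key <- function(M) {
  apply(!is.na(M), 1, function(r) paste(as.integer(r), collapse = ""))
}

# Full search: build a pattern key for every row
counts_full <- table(pattern_key(X))

# Shortcut: flag complete rows first, search patterns only among the rest
is_complete <- complete.cases(X)
counts_incomplete <- table(pattern_key(X[!is_complete, , drop = FALSE]))
complete_key <- strrep("1", ncol(X))
counts_short <- c(setNames(sum(is_complete), complete_key), counts_incomplete)

# Both routes give the same pattern counts
counts_full
counts_short[names(counts_full)]
```

Whether this saves time in practice depends on the share of complete rows and on how the pattern search is implemented.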

Please correct me if I am wrong.

-- Shu Fai



Terrence Jorgensen

Oct 4, 2021, 8:17:28 AM
to lavaan
> I believe log likelihoods are not computed for each individual, but for each missing data pattern.

I think you are right, although the source code provides for both options:


Regarding the small gain in estimation speed, note that lavaan fits not only the hypothesized model but also the saturated model (to obtain "sample statistics" for incomplete data) and the independence model (the default baseline for CFI and friends).

Also, the trick with this approach is to estimate all the information together so that the parameters are constrained across complete/missing-data groups.  So it would have to be implemented such that the complete-data shortcut was used in the complete-data group but FIML used in the incomplete-data group.
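To make the constraint explicit, here is the standard pattern-wise form of the normal-theory log-likelihood, given as a sketch rather than as lavaan's exact implementation. The notation is assumed for illustration: pattern p has n_p cases and k_p observed variables, with pattern mean vector m_p and covariance matrix S_p (divisor n_p), and mu_p(theta) and Sigma_p(theta) are the model-implied moments restricted to the variables observed in pattern p.

```latex
\ell(\theta) = \sum_{p=1}^{P} \ell_p(\theta), \qquad
\ell_p(\theta) = -\frac{n_p}{2}\Big[ k_p \log(2\pi)
  + \log\lvert \Sigma_p(\theta) \rvert
  + \operatorname{tr}\big(\Sigma_p(\theta)^{-1} S_p\big)
  + \big(m_p - \mu_p(\theta)\big)^{\top} \Sigma_p(\theta)^{-1} \big(m_p - \mu_p(\theta)\big) \Big]
```

The same theta appears in every term, including the term for the complete-data pattern, so the terms have to be maximized jointly. Fitting the complete and incomplete parts separately and adding the two maximized log-likelihoods would, in general, correspond to two different parameter vectors.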

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Rogier Kievit

Oct 5, 2021, 8:14:55 AM
to lavaan
Thank you everyone for all the input and ideas, this is really interesting.