EM algorithm for ppp in blavaan


Metehan Feridun Sorkun

Jan 3, 2024, 10:42:19 AM
to blavaan
Hi,

I would like to calculate the ppp value via the EM algorithm based on the tutorial: https://ecmerkle.github.io/blavaan/articles/ordinal.html

However, I got confused about how I should run the algorithm mentioned below:

"In blavaan, we consequently run an EM algorithm for a fixed number of iterations in order to compute an H1 covariance matrix that is “good enough” for the ppp. The default number of iterations is set to 20, and users can change the default by supplying an emiter value via the mcmcextra argument. For example,

mcmcextra = list(data = list(emiter = 50))"

Should it be as follows, for example for the CFA model? If yes, what should be the values of the arguments "burnin" and "sample"?

fit.cfa <- bcfa(cfa.model, std.lv=TRUE, ordered=TRUE,
                     save.lvs= TRUE, target="stan",
                     data = df, burnin=500,  sample=1000, 
                     dp = dpriors(lambda = "normal(1, .5)"),
                     bcontrol = list(cores = 3),
                     mcmcextra = list(data = list(emiter = 50, llnsamp = 50), seed = 100) 
 




Ed Merkle

Jan 3, 2024, 10:48:27 AM
to blavaan
The EM algorithm is run automatically when you have missing data, and you do not necessarily need to set the emiter value yourself (though you can if you want). Additionally:

The burnin and sample arguments are mostly unrelated to emiter. The 500 and 1000 values are defaults but could be increased depending on your model. 

For the code that you posted, you want to ensure that the "seed" argument is outside of "mcmcextra". So the last line would be:

mcmcextra = list(data = list(emiter = 50, llnsamp = 50)), seed = 100)
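Putting the whole call together, a corrected version of the code from the first post might look like this (a sketch only; cfa.model and df are the original poster's objects, and the argument values are carried over from that post):

```r
# Sketch of the corrected call: seed moved outside mcmcextra,
# and the call properly closed. cfa.model and df are the user's objects.
fit.cfa <- bcfa(cfa.model, std.lv = TRUE, ordered = TRUE,
                save.lvs = TRUE, target = "stan",
                data = df, burnin = 500, sample = 1000,
                dp = dpriors(lambda = "normal(1, .5)"),
                bcontrol = list(cores = 3),
                mcmcextra = list(data = list(emiter = 50, llnsamp = 50)),
                seed = 100)
```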

Metehan Feridun Sorkun

Jan 3, 2024, 11:56:50 AM
to blavaan
Dear Ed, thanks a lot for your detailed response. 

Indeed, my data has no missing values. My aim is to save latent variable scores for further path analysis, including many moderation effects between latent variables. I was very interested in your tutorial because I understood (probably misunderstood) that the model's ppp value (my model's approximate fit) could be obtained after only 20 iterations, without the need to run thousands of iterations. My model takes 8-9 hours to run even with burnin = 500 and sample = 1000. It would be nice to first ensure a good ppp value while making minor changes to the model, and only then run the very time-consuming algorithm once.

On Wednesday, January 3, 2024 at 6:48:27 PM UTC+3, Ed Merkle wrote:

Ed Merkle

Jan 4, 2024, 10:46:01 PM
to Metehan Feridun Sorkun, blavaan
OK, I understand a bit more about what you want to do. It would be nice if the ppp worked after 20 MCMC iterations, but this is an "iterations within iterations" thing: at each MCMC iteration, we draw a posterior sample of parameters and then run the EM algorithm for 20 iterations to do a ppp computation. This ppp computation results in a 0 or 1, and we need many hundreds of MCMC iterations to get a stable ppp value between 0 and 1.

Additionally, this EM algorithm is only related to computing the ppp with missing data (for computing the "saturated" likelihood). So I don't think it will help you here.

You might have success with fitting a separate model for each latent variable, saving the lv scores, then doing a path analysis with those scores. This is related to the "structural after measurement" approach that Yves Rosseel recently wrote about (also see the MUPPET approach by Roy Levy). I plan to automate some of that in blavaan, but it is not yet available. Also, you might fit a maximum likelihood version of the model in lavaan before committing to the 8 hours for the Bayesian model.
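A quick frequentist check along these lines might look like the sketch below (cfa.model and df are the poster's objects; note that with ordinal indicators, lavaan uses a DWLS-type estimator by default when ordered = TRUE, rather than plain ML):

```r
library(lavaan)

# Quick frequentist check before committing to the long Bayesian run.
# cfa.model and df are the user's objects.
fit.check <- cfa(cfa.model, data = df, std.lv = TRUE, ordered = TRUE)

# Inspect approximate fit before investing hours in MCMC.
fitMeasures(fit.check, c("cfi", "tli", "rmsea", "srmr"))
```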

Ed

Metehan Feridun Sorkun

Jan 5, 2024, 2:46:45 AM
to Ed Merkle, blavaan
Yes, it seems you have fully understood, because I have also implemented the alternatives you suggested. In case you are interested, let me share how the different methods agree (or disagree) on my research hypothesis tests, along with my questions about them.

1. I followed the method suggested by Logan et al. (2022). Accordingly, I applied a CFA including all ordinal latent variables, and then extracted factor scores via the Ten Berge method. I used psych::factor.scores for this because it can work with the polychoric correlations underlying the CFA.
Logan, J.A.R., Jiang, H., Helsabeck, N. et al. Should I allow my confirmatory factors to correlate during factor score extraction? Implications for the applied researcher. Qual Quant 56, 2107–2131 (2022). https://doi.org/10.1007/s11135-021-01202-x
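A minimal sketch of this workflow (assumed reconstruction; 'items' and the number of factors are illustrative, not taken from the thread):

```r
library(psych)

# Sketch of method 1: Ten Berge factor scores from ordinal items.
# 'items' is a data frame of ordinal indicators; nfactors = 3 is illustrative.
pc  <- polychoric(items)$rho                 # polychoric correlation matrix
fsol <- fa(pc, nfactors = 3)                 # factor solution on that matrix

# Ten Berge scores preserve the correlations among the factors.
scores <- factor.scores(items, fsol, method = "tenBerge")$scores
```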
 
2. I am not able to use the SAM function because one of my latent variables is a second-order latent variable, which the function does not support yet. Alternatively, I again applied a CFA including all ordinal latent variables. Then I used lavPredict(fitted.cfa.object, method = "ebm", transform = TRUE). Using the latent variable scores together with control variables, I proceeded to the path analysis using the sem function.
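In code, method 2 might be sketched as follows (assumed reconstruction; cfa.model, df, and the structural model F3 ~ F1 + F2 are illustrative placeholders):

```r
library(lavaan)

# Sketch of method 2: CFA on ordinal indicators, EBM factor scores,
# then a path analysis on those scores.
fit.cfa <- cfa(cfa.model, data = df, std.lv = TRUE, ordered = TRUE)
fs <- lavPredict(fit.cfa, method = "EBM", transform = TRUE)

# Combine scores with the original data (for control variables),
# then fit an illustrative structural model.
df2 <- cbind(df, as.data.frame(fs))
fit.path <- sem('F3 ~ F1 + F2', data = df2)
```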

3. Before using the blavaan package, as you suggested, I applied CFA with DWLS (not ML, since my variables are ordinal) and got good fit indices. Then I ran the same model using bcfa (burnin = 500 and sample = 1000). Even in this case, fitting took around 8 hours and the resulting object exceeded 1 GB, so I unfortunately could not view or save the results. However, I did see the warning that the Rhat values are greater than 1, so the model needs more iterations to converge. To overcome this limitation, I have now subscribed to a higher tier of Posit Cloud. I started running the model there 10 hours ago (burnin = 1000 and sample = 10000), and it is still processing. I hope to get a result. Is there any free alternative here? For example, can I run the code directly in R and save the result, to get around the 1 GB memory limit imposed by RStudio?
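One way to take the IDE out of the picture is a non-interactive batch run that saves the fitted object to disk (a sketch; the file names, cfa.model, and df are illustrative placeholders):

```r
# run_model.R -- run non-interactively with:  Rscript run_model.R
# Saving the fitted object means the results survive the session,
# even if an interactive front end would have run out of memory.
library(blavaan)

df <- readRDS("df.rds")                      # illustrative data file

fit <- bcfa(cfa.model, data = df, std.lv = TRUE, ordered = TRUE,
            burnin = 1000, sample = 10000, target = "stan")

saveRDS(fit, "fit_cfa.rds", compress = "xz") # reload later with readRDS()
```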

4. As a last alternative, I applied CFA (bcfa) separately for each latent variable, as you suggested, to save the factor scores for further path analysis. As the models got smaller, I managed to run more iterations in blavaan (e.g., up to burnin = 1000 and sample = 4500 within the 1 GB memory constraint). Almost all of my models have a ppp value of zero. Since I have no missing values, should I not take this as evidence of poor model fit? Moreover, in a few of my models I again got the warning that the Rhat values are greater than 1, so those models need more iterations to converge. How critical is this warning (does it significantly affect the latent variable scores)? If I increase the number of iterations to very high values, how likely is it that the warning will go away? In short, is the extra effort worth it?

When I applied methods 1, 2, and 4 above to test my research model, I found that methods 2 and 4 gave very similar results. Method 1 does not show opposite results, but unlike 2 and 4, it does not find significant effects for many of the relationships tested in the model. Do you have any comments on which of these alternatives is methodologically most appropriate?

Sorry for the long post, but your comments and answers would be very helpful.

On Friday, January 5, 2024 at 06:46, Ed Merkle <ecme...@gmail.com> wrote:

Ed Merkle

Jan 9, 2024, 11:58:59 AM
to Metehan Feridun Sorkun, blavaan
Here are a few thoughts about these methods:

About Bayesian model not converging: have you considered the priors? The defaults in blavaan are not good for all situations, especially if you expect all the observed variables to be positively correlated with each other. In that case, something like

dpriors(lambda = "normal(1,.4)")

could encode this expectation and also avoid some nonconvergence.


About method 1 being different from 2 and 4: I wonder how different they are aside from significant effects. I would tend to de-emphasize the significance part and look at similarity of parameter estimates and SEs.


About the 1 GB memory limit: are you sure this can't be changed? There used to be a memory.limit() command to set this. It seems that the command is now obsolete, but maybe RStudio limits something here. If you have more memory than that, maybe try running the model in R outside of RStudio.


Also, some things that would reduce memory are save.lvs = FALSE (though maybe you want save.lvs) and thinning. For the latter, add this to the model estimation command:

bcontrol = list(thin = x)

where x is an integer > 1. (i.e., save every xth posterior sample).
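Combined in one call, the two memory-saving options might look like this (a sketch against the poster's model; cfa.model, df, and thin = 5 are illustrative):

```r
# Sketch: skip storing latent variable samples and keep only every 5th
# posterior draw, both of which shrink the fitted object considerably.
fit <- bcfa(cfa.model, data = df, std.lv = TRUE, ordered = TRUE,
            burnin = 1000, sample = 4500,
            save.lvs = FALSE,            # don't store lv samples
            bcontrol = list(thin = 5))   # save every 5th draw
```

The trade-off is that thinning leaves fewer posterior draws for summaries, and save.lvs = FALSE means factor scores are not available afterward.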

Ed

Metehan Feridun Sorkun

Jan 12, 2024, 8:34:06 AM
to Ed Merkle, blavaan
Thanks for your comments.

Yes, I changed the priors based on the factor loadings retrieved from the lavaan CFA fit. I tried dpriors(lambda = "normal(0.8,.4)"). I even once tried setting threshold values for the ordinal levels as priors, again based on the lavaan CFA fit values, by following the code in your tutorial (https://ecmerkle.github.io/blavaan/articles/ordinal.html).

I would also like to ask about the methodological difference between the factor scores derived from lavaan::lavPredict with "ebm" (empirical Bayes modal) and those from blavaan::bcfa. The former is also Bayesian (as I understand from its name) but gives the scores in seconds. How is that possible?

In my opinion, this 1 GB limit started after RStudio changed its company name to Posit.
Yes, I tried running the model both in R outside of RStudio and in Posit Cloud. However, when I increase the number of iterations into the 10000s, I receive error messages like the ones below:
MultisessionFuture (future_lapply-3) failed to call gassign() on cluster RichSOCKnode #3 (PID 25676 on localhost ‘localhost’). The reason reported was 
or
Post-mortem diagnostic: Failed to determined whether a process with this PID exists or not, i.e. cannot infer whether localhost worker is alive or not. Detected a non-exportable reference (‘externalptr’) in one of the globals



On Tuesday, January 9, 2024 at 19:58, Ed Merkle <ecme...@gmail.com> wrote:

Mauricio Garnier-Villarreal

Jan 12, 2024, 12:10:23 PM
to blavaan
Just a comment on the types of factor scores. The EBM scores from lavaan are Bayesian in the formula: lavaan takes the model's point estimates and computes the Bayes modal values implied by them. So you can think of it as a constrained "Bayesian" estimate of the factor scores, because they are derived under the fitted lavaan model's assumptions.
The factor scores from blavaan, in contrast, are estimated from the posterior distributions of the Bayesian model, so they involve fewer constraints/assumptions.

EBM is faster because it is a quick formula, while blavaan estimates the scores at every saved iteration (normally several thousands).
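In code, the two kinds of scores come from different functions (a sketch; fit.lav and fit.blav stand for already-fitted lavaan and blavaan objects and are not defined in the thread):

```r
# EBM scores from lavaan: one modal value per person, computed in a
# single pass from the model's point estimates.
fs.ebm <- lavPredict(fit.lav, method = "EBM")

# blavaan scores: one draw per person per saved MCMC iteration;
# "lvmeans" averages over those draws to yield one score per person.
fs.draws <- blavPredict(fit.blav, type = "lv")
fs.mean  <- blavPredict(fit.blav, type = "lvmeans")
```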