
BSEM for ordinal data


Giada Venaruzzo

Feb 5, 2024, 11:32:43 AM
to blavaan
Hello, 

I'm trying to fit a BSEM to ordinal data with two groups (male and female).
The dataset consists of 36 variables, all measured on a 1-to-7 scale (answers to a questionnaire). The problem is that, for some variables, one of the groups has categories with zero observations.

The model is the following:
model <- ' # latent variable definition
           ambiente =~ d11 + d12 + d13 + d14 + d15 + d16 + d17 + d18
           cibo =~ d21 + d22 + d23 + d24
           servizio =~ d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38
           empos =~ d61 + d62 + d63 + d64 + d65 + d66
           emneg =~ d71 + d72 + d73 + d74 + d75 + d76
           loyalty =~ d91 + d92 + d93

           # regression
           loyalty ~ ambiente + cibo + servizio + empos + emneg '
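
The call I use is along these lines (simplified; genere is the grouping variable):

bfit <- bsem(model, data = data, group = "genere", ordered = TRUE)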

When I try to fit the BSEM it gives the following error:
Error in lav_samplestats_step1(Y = Data, wt = wt, ov.names = ov.names,  :
  lavaan ERROR: some categories of variable `d34' are empty in group 2; frequencies are [112 235 70 66 6 12 0]

given that table(data$d34, data$genere) returns:
   
      1   2
  1 115 112
  2 279 235
  3 107  70
  4  65  66
  5   8   6
  6   7  12
  7   8   0

How can I handle this issue? 

Thank you in advance.

Giada

R. Noah Padgett

Feb 5, 2024, 12:29:40 PM
to blavaan
Hello Giada,

I've had this problem in the past, too. I think it has to do with how lavaan sets up ordinal models; Ed might be able to shed light on the specifics. One common approach is to collapse categories so that group 2 no longer has a category with 0 endorsements.

I've used code like the snippet below to collapse categories. Given that you have 7 response categories, I also suggest considering whether you need an ordinal model at all. The skewness might make the continuous approximation poor, but you wouldn't have to collapse categories, and the continuous model estimates much faster.


# Option 1: collapse only the top two categories (7 into 6)
data$d34 <- ifelse(data$d34 > 6, 6, data$d34)

# Option 2 (an alternative, not to be run after Option 1):
# dichotomize at/above the middle category
data$d34 <- ifelse(data$d34 >= 4, 1, 0)


Have a great day,
Noah Padgett

Ed Merkle

Feb 5, 2024, 12:40:19 PM
to blavaan
I agree with Noah that collapsing is the most straightforward thing to do here. A Bayesian model with an empty category is conceptually possible by relying on the prior, but this currently doesn't work in blavaan because it is disallowed in lavaan (which blavaan uses to set up the model). In the future, we might work around this error so that blavaan can estimate such models.

Ed

Giada Venaruzzo

Feb 10, 2024, 8:54:36 AM
to blavaan
Thank you very much. 

Since the majority of my observations were in the extreme categories (1, 2 or 6, 7), I recoded the variables as Noah suggested, this way:
1-2 -> 1
3-4-5 -> 2
6-7 -> 3
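
I did the recoding along these lines (a sketch, assuming all 36 items are the d* columns):

ord_vars <- grep("^d", names(data), value = TRUE)
data[ord_vars] <- lapply(data[ord_vars],
                         function(x) cut(x, breaks = c(0, 2, 5, 7), labels = FALSE))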

Using default priors, I tried to estimate the SEM with the model written above, without the groups.
When I first tried (with 7 levels), the estimation didn't finish within a couple of hours.
The thing is, it is really slow even with just 3 categories.

Do you have any suggestions? Maybe I could use a more informative prior?

Thank you again.


Giada

Ed Merkle

Feb 10, 2024, 3:10:16 PM
to Giada Venaruzzo, blavaan
It is generally helpful to consider the priors, and I tend to use priors that are more informative than the defaults. That said, I believe the big bottleneck here is that you have 36 observed variables. I have found that blavaan works fairly well for up to about 10 observed variables, then gets slower and slower. blavaan's current estimation strategy is general in the sense that it handles all sorts of ordinal models, but that also means it is not optimal for simpler models.

We will continue making improvements here, but you might be stuck with a slow model for now. On the plus side, convergence tends to happen pretty quickly. I often start with 100 burnin and 100 sample iterations, to quickly see whether the estimation is going OK.
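
For example, something like this (a sketch; the dpriors() value is only illustrative of a loading prior that is tighter than the default):

bfit_check <- bsem(model, data = data, ordered = TRUE,
                   burnin = 100, sample = 100,
                   dp = dpriors(lambda = "normal(0,1)"))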

Ed

Giada Venaruzzo

May 9, 2024, 6:40:21 AM
to blavaan
Hello,

I wanted to ask something: is there a way to estimate fit indices for ordinal SEMs? blavFitIndices() returns the following error: "Fit indices are currently unavailable for ordinal models."

Thank you in advance,

Giada

Mauricio Garnier-Villarreal

May 9, 2024, 10:16:27 AM
to blavaan
Giada

So far we don't have them available, because the method used for continuous indicators does not work with categorical indicators. We are working on a method for these models, but we are still testing it; as soon as we have research showing that it works properly, we will make it available in blavaan.

You could use the ppmc() function to extract residual-based indices, as in the sketch below.
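
Something like this (a sketch; bfit is your fitted blavaan object, and srmr is one of the measures ppmc() accepts):

res_ind <- ppmc(bfit, fit.measures = "srmr")
summary(res_ind)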


take care

Giada Venaruzzo

May 28, 2024, 5:33:08 AM
to blavaan
Hello,

I'm having trouble using the ppmc() function due to the large number of variables and the amount of data saved. At this point I think I will treat the variables as continuous instead of ordinal. My question is: is there a prior I can use to get around the skewness of the variables' distributions?

Thank you in advance.
Best wishes,

Giada

Mauricio Garnier-Villarreal

May 28, 2024, 6:34:34 AM
to blavaan
Giada

Can you share more information?

- How are you running ppmc() with ordinal data? We could help make it faster and save only the minimal information necessary.
- What are the characteristics of your data? How many categories? How many subjects? What model? Etc.

We can try to help.

Giada Venaruzzo

May 28, 2024, 10:57:47 AM
to Mauricio Garnier-Villarreal, blavaan
Thank you.

I'm trying to estimate a simple CFA to begin with; the model is the following:

mis_mod <- 'musica =~ d11 + d12
            ambiente =~ d15 + d16 + d17 + d18
            cibo =~ d22 + d24

            servizio =~ d31 + d32 + d33 + d34 + d35

            loyalty =~ d91 + d92 + d93 '

All the variables are on a scale from 1 to 7, and most of them are skewed (most values fall between 1 and 4).
I sampled 545 random observations and tried to estimate the model.

# raise the limit on globals exported to parallel workers (here ~1 GB)
options(future.globals.maxSize = 1048576000)
bfit_mis <- bcfa(mis_mod, n.chains = 3, burnin = 2900, sample = 500,
                 data = data_sam,
                 save.lvs = TRUE, bcontrol = list(cores = 3),
                 ordered = TRUE)

With lower burnin and sample parameters I had convergence problems.

res_ind <- ppmc(bfit_mis, thin = 1, fit.measures = c("srmr","chisq"))

This command gives me an error about the size of the saved objects.

Thank you in advance, 
Giada 




Mauricio Garnier-Villarreal

May 29, 2024, 1:17:05 PM
to blavaan
Giada

a few comments/ideas:

- I think the memory issues might be related to save.lvs = TRUE; this saves large objects containing the posteriors of the factor scores, so the memory issue might be about the overall R session rather than one specific object.
- With ordinal data the chisq would not be a good metric; better to save only the srmr. You can also use a higher thin to save smaller objects: res_ind <- ppmc(bfit_mis, thin = 2, fit.measures = "srmr")
- When ordinal data has more than 5 categories I usually treat it as continuous, unless the categories are very sparsely answered. You are estimating a large number of thresholds, and it would be good to check whether you have enough subjects answering each of the 7 options. When you say the data look skewed, how severe is it?
- In blavaan we don't currently have specific priors to account for skewness. But I would only consider it a problem if the skewness were severe.

hope this helps

Giada Venaruzzo

May 30, 2024, 9:33:34 AM
to Mauricio Garnier-Villarreal, blavaan
Thank you.
Setting save.lvs = FALSE solves the memory problem.
I report the frequency tables and the skewness of each variable below (the skewness() function from the moments package).
Would it be appropriate to use a log transformation of the manifest variables? How would that affect the interpretation of the latent variables? I report the skewness of the transformed variables too.

### frequency distribution

$d11

  1   2   3   4   5   6   7
223 481 187 139  42  13   5

$d12

  1   2   3   4   5   6   7
301 411 186 116  45  18  13

$d15

  1   2   3   4   5   6   7
133 486 261 142  45  13  10

$d16

  1   2   3   4   5   6   7
131 521 243 123  37  19  16

$d17

  1   2   3   4   5   6   7
128 596 194 116  33  14   9

$d18

  1   2   3   4   5   6   7
 70 513 256 156  66  15  14

$d22

  1   2   3   4   5   6   7
 23  80 217 157 265 168 180

$d24

  1   2   3   4   5   6   7
 31 140 329 142 193 133 122

$d31

  1   2   3   4   5   6   7
186 496 220 142  24   9  13

$d32

  1   2   3   4   5   6   7
 97 535 245 152  29  28   4

$d33

  1   2   3   4   5   6   7
225 500 201  98  49  14   3

$d34

  1   2   3   4   5   6   7
227 514 177 131  14  19   8

$d35

  1   2   3   4   5   6   7
211 481 213 143  28  11   3

$d91

  1   2   3   4   5   6   7
258 497 160  84  56  10  25

$d92

  1   2   3   4   5   6   7
144 485 140 212  60  17  32

$d93

  1   2   3   4   5   6   7
175 450 186 144  70  21  44

### skewness original variables
 [1]  1.0056975  1.1789930  1.0657154  1.3315931  1.4113586  1.1406057 -0.1638440  0.2551903  1.1967770
[10]  1.1360204  1.1009704  1.2732178  0.9119278  1.4800391  1.0529840  1.1486666

### skewness log variables
 [1] -0.087100386  0.084069920 -0.202957385 -0.049589388  0.015523674 -0.025665958 -1.105098949 -0.635060005
 [9] -0.129382470 -0.102559320 -0.051925728 -0.014866760 -0.179810611  0.190706522 -0.057827287  0.006054678
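
For reference, the skewness values were computed along these lines (a sketch):

library(moments)
items <- c("d11", "d12", "d15", "d16", "d17", "d18", "d22", "d24",
           "d31", "d32", "d33", "d34", "d35", "d91", "d92", "d93")
sapply(data[items], skewness)       # original variables
sapply(log(data[items]), skewness)  # log-transformed variables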
 
Thank you,
Giada

Mauricio Garnier-Villarreal

Jun 1, 2024, 10:09:50 AM
to blavaan
I think there is also the issue of interpretability. The log(variables) are less skewed, but a model based on them has to be interpreted in the log metric, which is hard to make sense of. Now, if the metric of your data is not that relevant to you, you could use the log variables.

Also, the normality assumption is on the residuals, not the variables themselves; that is why most of these models are "robust" to its violation, meaning it doesn't ruin the model but might lead to larger standard deviations of the posteriors.

Since SEM models seek to reproduce the correlations among the variables, it's good to note that skewness will attenuate those correlations. But I think the skewness would need to be higher to make a "substantial" difference (hard to say what the effect is right now).

Some recommendations for different scenarios (see the sketch after this list):
- (a) If you continue with the data as categorical, you could combine some of the response options in the tails, so that each category has more subjects and there are fewer categories.
- (b) If running the items as continuous, and the metric of the variables is important, use the raw data, and be aware that some correlations may be attenuated.
- (c) If running them as continuous, and the metric of the data is not important, you can use the log variables, but be aware that interpretation is harder and your results generalize on the log scale.
- You could also run (a) and (b); if the factor correlations etc. are not very different, that would indicate that treating the items as continuous does not make a "big" difference.
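
A sketch of the tail-collapsing in (a), assuming the items are the d* columns (the cutpoint 5 is only an example):

items <- grep("^d", names(data_sam), value = TRUE)
data_sam[items] <- lapply(data_sam[items], function(x) pmin(x, 5))  # merge 5, 6, 7 into one top category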



Hope this helps you make a decision