
BSEM for ordinal data


Giada Venaruzzo

Feb 5, 2024, 11:32:43 AM
to blavaan
Hello, 

I'm trying to fit a BSEM to ordinal data with two groups (male and female).
The dataset consists of 36 variables, all measured on a 1-to-7 scale (answers to a questionnaire). The problem is that, for some variables, one of the groups has categories with zero observations.

The model is the following:
model <- ' # latent variable definition
           ambiente =~ d11 + d12 + d13 + d14 + d15 + d16 + d17 + d18
           cibo =~ d21 + d22 + d23 + d24
           servizio =~ d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38
           empos =~ d61 + d62 + d63 + d64 + d65 + d66
           emneg =~ d71 + d72 + d73 + d74 + d75 + d76
           loyalty =~ d91 + d92 + d93

           # regression
           loyalty ~ ambiente + cibo + servizio + empos + emneg '
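
The call I use is along these lines (simplified; genere is the grouping variable):

bfit <- bsem(model, data = data, group = "genere", ordered = TRUE)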

When I try to fit the BSEM it gives the following error:
Error in lav_samplestats_step1(Y = Data, wt = wt, ov.names = ov.names,  :
  lavaan ERROR: some categories of variable `d34' are empty in group 2; frequencies are [112 235 70 66 6 12 0]

given that table(data$d34, data$genere) returns:
   
      1   2
  1 115 112
  2 279 235
  3 107  70
  4  65  66
  5   8   6
  6   7  12
  7   8   0

How can I handle this issue? 

Thank you in advance.

Giada

R. Noah Padgett

Feb 5, 2024, 12:29:40 PM
to blavaan
Hello Giada,

I've had this problem in the past, too. I think it has to do with how lavaan sets up ordinal models; Ed might be able to shed light on the specifics. One common approach is to collapse categories so that group 2 no longer has a category with 0 endorsements.

I've used code like the snippet below to collapse categories. Given that you have 7 response categories, I also suggest considering whether you need an ordinal model at all. The skewness might make the continuous approximation poor, but you wouldn't have to collapse categories, and the continuous model estimates much faster.


# Option 1: collapse only the top two categories (7 into 6)
data$d34 <- ifelse(data$d34 > 6, 6, data$d34)

# Option 2 (an alternative, not to be run after Option 1):
# dichotomize at/above the middle category
data$d34 <- ifelse(data$d34 >= 4, 1, 0)


Have a great day,
Noah Padgett

Ed Merkle

Feb 5, 2024, 12:40:19 PM
to blavaan
I agree with Noah that collapsing is the most straightforward thing to do here. A Bayesian model with an empty category is conceptually possible by relying on the prior, but this currently doesn't work in blavaan because it is disallowed in lavaan (which blavaan uses to set up the model). In the future, we might work around this error so that blavaan can estimate such models.

Ed

Giada Venaruzzo

Feb 10, 2024, 8:54:36 AM
to blavaan
Thank you very much. 

Since the majority of my observations were in the extreme categories (1, 2 or 6, 7), I recoded the variables as Noah suggested, this way:
1-2 -> 1
3-4-5 -> 2
6-7 -> 3
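
I did the recoding along these lines (a sketch, assuming all 36 items are the d* columns):

ord_vars <- grep("^d", names(data), value = TRUE)
data[ord_vars] <- lapply(data[ord_vars],
                         function(x) cut(x, breaks = c(0, 2, 5, 7), labels = FALSE))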

Using default priors, I tried to estimate the SEM with the model written above, without the groups.
When I first tried (with 7 levels), the estimation didn't finish within a couple of hours.
The thing is, it is really slow even with just 3 categories.

Do you have any suggestions? Maybe I could use a more informative prior?

Thank you again.


Giada

Ed Merkle

Feb 10, 2024, 3:10:16 PM
to Giada Venaruzzo, blavaan
It is generally helpful to consider the priors, and I tend to use priors that are more informative than the defaults. That said, I believe the big bottleneck here is that you have 36 observed variables. I have found that blavaan works fairly well for up to about 10 observed variables, then gets slower and slower. blavaan's current estimation strategy is general in the sense that it handles all sorts of ordinal models, but that also means it is not optimal for simpler models.

We will continue making improvements here, but you might be stuck with a slow model for now. On the plus side, convergence tends to happen pretty quickly. I often start with 100 burnin and 100 sample iterations, to quickly see whether the estimation is going OK.
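
For example, something like this (a sketch; the dpriors() value is only illustrative of a loading prior that is tighter than the default):

bfit_check <- bsem(model, data = data, ordered = TRUE,
                   burnin = 100, sample = 100,
                   dp = dpriors(lambda = "normal(0,1)"))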

Ed

Giada Venaruzzo

May 9, 2024, 6:40:21 AM
to blavaan
Hello,

I wanted to ask something: is there a way to estimate fit indices for ordinal SEMs? blavFitIndices() returns the following error: "Fit indices are currently unavailable for ordinal models."

Thank you in advance,

Giada

Mauricio Garnier-Villarreal

May 9, 2024, 10:16:27 AM
to blavaan
Giada

So far we don't have them available, because the method used for continuous indicators does not work with categorical indicators. We are working on a method for these models, but we are still testing it; as soon as we have research showing that it works properly, we will make it available in blavaan.

You could use the ppmc() function to extract residual-based indices, as in the sketch below.
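
Something like this (a sketch; bfit is your fitted blavaan object, and srmr is one of the measures ppmc() accepts):

res_ind <- ppmc(bfit, fit.measures = "srmr")
summary(res_ind)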


take care

Giada Venaruzzo

May 28, 2024, 5:33:08 AM
to blavaan
Hello,

I'm having trouble using the ppmc() function due to the large number of variables and the amount of data saved. At this point I think I will treat the variables as continuous instead of ordinal. My question is: is there a prior I can use to get around the skewness of the variables' distributions?

Thank you in advance.
Best wishes,

Giada

Mauricio Garnier-Villarreal

May 28, 2024, 6:34:34 AM
to blavaan
Giada

Can you share more information?

- How are you running ppmc() with ordinal data? We could help make it faster and save only the minimal information necessary.
- What are the characteristics of your data? How many categories? How many subjects? What model? Etc.

We can try to help.

Giada Venaruzzo

May 28, 2024, 10:57:47 AM
to Mauricio Garnier-Villarreal, blavaan
Thank you.

I'm trying to estimate a simple CFA to begin with; the model is the following:

mis_mod <- 'musica =~ d11 + d12
            ambiente =~ d15 + d16 + d17 + d18
            cibo =~ d22 + d24

            servizio =~ d31 + d32 + d33 + d34 + d35

            loyalty =~ d91 + d92 + d93 '

All the variables are on a scale from 1 to 7, and most of them are skewed (most values fall between 1 and 4).
I sampled 545 random observations and tried to estimate the model.

# raise the limit on globals exported to parallel workers (here ~1 GB)
options(future.globals.maxSize = 1048576000)
bfit_mis <- bcfa(mis_mod, n.chains = 3, burnin = 2900, sample = 500,
                 data = data_sam,
                 save.lvs = TRUE, bcontrol = list(cores = 3),
                 ordered = TRUE)

With lower burnin and sample parameters I had convergence problems.

res_ind <- ppmc(bfit_mis, thin = 1, fit.measures = c("srmr","chisq"))

This command gives me an error about the size of the saved objects.

Thank you in advance, 
Giada 




Mauricio Garnier-Villarreal

May 29, 2024, 1:17:05 PM
to blavaan
Giada

a few comments/ideas:

- I think the memory issues might be related to save.lvs = TRUE; this saves large objects containing the posteriors of the factor scores, so the memory issue might be about the overall R session rather than one specific object.
- With ordinal data the chisq would not be a good metric; better to save only the srmr. You can also use a higher thin to save smaller objects: res_ind <- ppmc(bfit_mis, thin = 2, fit.measures = "srmr")
- When ordinal data has more than 5 categories I usually treat it as continuous, unless the categories are very sparsely answered. You are estimating a large number of thresholds, and it would be good to check whether you have enough subjects answering each of the 7 options. When you say the data look skewed, how severe is it?
- In blavaan we don't currently have specific priors to account for skewness. But I would only consider it a problem if the skewness were severe.

hope this helps

Giada Venaruzzo

May 30, 2024, 9:33:34 AM
to Mauricio Garnier-Villarreal, blavaan
Thank you.
Setting save.lvs = FALSE solves the memory problem.
I report the frequency tables and the skewness of each variable below (the skewness() function from the moments package).
Would it be appropriate to use a log transformation of the manifest variables? How would that affect the interpretation of the latent variables? I report the skewness of the transformed variables too.

### frequency distribution

$d11

  1   2   3   4   5   6   7
223 481 187 139  42  13   5

$d12

  1   2   3   4   5   6   7
301 411 186 116  45  18  13

$d15

  1   2   3   4   5   6   7
133 486 261 142  45  13  10

$d16

  1   2   3   4   5   6   7
131 521 243 123  37  19  16

$d17

  1   2   3   4   5   6   7
128 596 194 116  33  14   9

$d18

  1   2   3   4   5   6   7
 70 513 256 156  66  15  14

$d22

  1   2   3   4   5   6   7
 23  80 217 157 265 168 180

$d24

  1   2   3   4   5   6   7
 31 140 329 142 193 133 122

$d31

  1   2   3   4   5   6   7
186 496 220 142  24   9  13

$d32

  1   2   3   4   5   6   7
 97 535 245 152  29  28   4

$d33

  1   2   3   4   5   6   7
225 500 201  98  49  14   3

$d34

  1   2   3   4   5   6   7
227 514 177 131  14  19   8

$d35

  1   2   3   4   5   6   7
211 481 213 143  28  11   3

$d91

  1   2   3   4   5   6   7
258 497 160  84  56  10  25

$d92

  1   2   3   4   5   6   7
144 485 140 212  60  17  32

$d93

  1   2   3   4   5   6   7
175 450 186 144  70  21  44

### skewness original variables
 [1]  1.0056975  1.1789930  1.0657154  1.3315931  1.4113586  1.1406057 -0.1638440  0.2551903  1.1967770
[10]  1.1360204  1.1009704  1.2732178  0.9119278  1.4800391  1.0529840  1.1486666

### skewness log variables
 [1] -0.087100386  0.084069920 -0.202957385 -0.049589388  0.015523674 -0.025665958 -1.105098949 -0.635060005
 [9] -0.129382470 -0.102559320 -0.051925728 -0.014866760 -0.179810611  0.190706522 -0.057827287  0.006054678
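
For reference, the skewness values were computed along these lines (a sketch):

library(moments)
items <- c("d11", "d12", "d15", "d16", "d17", "d18", "d22", "d24",
           "d31", "d32", "d33", "d34", "d35", "d91", "d92", "d93")
sapply(data[items], skewness)       # original variables
sapply(log(data[items]), skewness)  # log-transformed variables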
 
Thank you,
Giada

Mauricio Garnier-Villarreal

Jun 1, 2024, 10:09:50 AM
to blavaan
I think there is also the issue of interpretability. The log(variables) are less skewed, but a model based on them has to be interpreted in the log metric, which is hard to make sense of. Now, if the metric of your data is not that relevant to you, you could use the log variables.

Also, the normality assumption is on the residuals, not the variables themselves; that is why most of these models are "robust" to its violation, meaning it doesn't ruin the model but might lead to larger standard deviations of the posteriors.

Since SEM models seek to reproduce the correlations among the variables, it's good to note that skewness will attenuate those correlations. But I think the skewness would need to be higher to make a "substantial" difference (hard to say what the effect is right now).

Some recommendations for different scenarios (see the sketch after this list):
- (a) If you continue with the data as categorical, you could combine some of the response options in the tails, so that each category has more subjects and there are fewer categories.
- (b) If running the items as continuous, and the metric of the variables is important, use the raw data, and be aware that some correlations may be attenuated.
- (c) If running them as continuous, and the metric of the data is not important, you can use the log variables, but be aware that interpretation is harder and your results generalize on the log scale.
- You could also run (a) and (b); if the factor correlations etc. are not very different, that would indicate that treating the items as continuous does not make a "big" difference.
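
A sketch of the tail-collapsing in (a), assuming the items are the d* columns (the cutpoint 5 is only an example):

items <- grep("^d", names(data_sam), value = TRUE)
data_sam[items] <- lapply(data_sam[items], function(x) pmin(x, 5))  # merge 5, 6, 7 into one top category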



Hope this helps you make a decision