Excluding Thresholds from Models of Ordinal Data

Keith Markus

Nov 12, 2022, 5:27:06 PM
to lavaan
It came up in conversation with Angel that it is possible to fit a model to ordinal data without including the thresholds in the model.  Because threshold residuals are normally zero, it seems useful to compare lavaan's behavior with and without thresholds in the model to its behavior with and without fixed.x.

When we compare models with and without fixed.x, the model degrees of freedom remain the same because variances and covariances are added or removed from both the model parameters and the sufficient statistics, cancelling one another out.

When we omit thresholds from the model, the degrees of freedom change because thresholds are removed from the model free parameters but retained as sufficient statistics.  As a result, the same chi-square is tested against a different reference distribution.  The degrees of freedom increase by the number of thresholds removed from the model.  The estimates are reported by the summary method either way, but inferential statistics are reported only if the thresholds are included in the model.

Am I understanding this correctly?  If so, what are the intended use cases for this feature allowing the user to increase the degrees of freedom by omitting the thresholds from the model without removing them from the sufficient statistics?

Thanks,
Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/

Terrence Jorgensen

Nov 14, 2022, 7:39:20 AM
to lavaan
Hi Keith,

It came up in conversation with Angel that it is possible to fit a model to ordinal data without including the thresholds in the model

Is that a lavaan-group conversation you can link to?  A syntax example would clarify what you mean by "without including the thresholds in the model".  When you run cfa() or sem(), you don't need them in the model syntax for them to be saturated (similar to intercepts of continuous variables, when meanstructure=TRUE), but they are still "in" the model.
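
For instance, this saturation can be verified directly — a minimal sketch with simulated binary data (the variable names, cutoffs, and sample size are illustrative, not from Keith's example):

```r
library(lavaan)
set.seed(1)

# Simulate four binary indicators of one common factor (illustrative data)
eta <- rnorm(500)
d <- data.frame(lapply(1:4, function(i)
  as.integer(.7 * eta + rnorm(500, sd = .7) > 0)))
names(d) <- paste0("y", 1:4)

# No thresholds in the syntax: cfa() frees them anyway (auto.th = TRUE)
fit1 <- cfa('F =~ y1 + y2 + y3 + y4', data = d, ordered = TRUE)

# Thresholds written out explicitly
fit2 <- cfa('F =~ y1 + y2 + y3 + y4
             y1 | t1
             y2 | t1
             y3 | t1
             y4 | t1', data = d, ordered = TRUE)

# Identical models: same free-parameter count, same df
fitMeasures(fit1, c("npar", "df"))
fitMeasures(fit2, c("npar", "df"))
```

Both calls should report the same npar and df, because cfa() already counts the thresholds among the free parameters whether or not they appear in the syntax.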
 
Because threshold residuals are normally zero, it seems useful to compare lavaan's behavior with and without thresholds in the model to its behavior with and without fixed.x

I'm not sure what this has to do with exogenous predictors, which have no thresholds because no distributional assumptions are made.

But FYI (if it is relevant), lavaan's default behavior when any outcomes are ordered= is to set conditional.x=TRUE.  That goes a step beyond treating exogenous variables as fixed, by regressing them out of the endogenous/modeled variables.  So the SEM's "goal" is then to reproduce a partial rather than zero-order (polychoric) correlation matrix.  You can take a peek at lavInspect(fit, "est") and lavInspect(fit, "fitted") to see what this changes.
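
A quick way to see this — a sketch with made-up data (the predictor x, the outcome names, and the effect sizes are all hypothetical):

```r
library(lavaan)
set.seed(2)

# One continuous exogenous predictor and three binary outcomes (illustrative)
x <- rnorm(300)
f <- .5 * x + rnorm(300)
d <- data.frame(x  = x,
                y1 = as.integer(.7 * f + rnorm(300) > 0),
                y2 = as.integer(.7 * f + rnorm(300) > 0),
                y3 = as.integer(.7 * f + rnorm(300) > 0))

fit <- sem('F =~ y1 + y2 + y3
            F ~ x', data = d, ordered = c("y1", "y2", "y3"))

# conditional.x resolves to TRUE by default with ordered outcomes
lavInspect(fit, "options")$conditional.x

# Fitted moments are now conditional on x (residual intercepts,
# residual covariances, and slopes, rather than zero-order moments)
lavInspect(fit, "fitted")
```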

When we omit thresholds from the model

Do you mean fix them to zero?
 
the degrees of freedom change because thresholds are removed from the model free parameters but retained as sufficient statistics

Shouldn't that be compensated by freely estimating the latent-response intercepts?  For binary outcomes, they are the same parameter with opposite sign (i.e., the intercept equals the negated threshold, and their location along the latent-response continuum is arbitrary).
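
For binary items that equivalence can be checked directly — a sketch with simulated data (whether this reparameterization matches what Keith had in mind is an assumption on my part):

```r
library(lavaan)
set.seed(3)

# Three binary indicators with different category proportions (illustrative)
f <- rnorm(500)
d <- data.frame(y1 = as.integer(.7 * f + rnorm(500) > -.2),
                y2 = as.integer(.7 * f + rnorm(500) >  .1),
                y3 = as.integer(.7 * f + rnorm(500) >  .3))

# Default parameterization: thresholds free, latent-response intercepts = 0
fit.tau <- cfa('F =~ y1 + y2 + y3', data = d, ordered = TRUE)

# Swap: fix the thresholds to 0, free the intercepts instead
fit.nu <- cfa('F =~ y1 + y2 + y3
               y1 | 0*t1
               y2 | 0*t1
               y3 | 0*t1
               y1 ~ NA*1
               y2 ~ NA*1
               y3 ~ NA*1', data = d, ordered = TRUE)

# Equivalent models: same df and (up to numerical tolerance) same chi-square;
# each freed intercept absorbs the negated threshold
fitMeasures(fit.tau, c("chisq", "df"))
fitMeasures(fit.nu,  c("chisq", "df"))
```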

As a result, the same chi-square is tested against a different reference distribution

I would expect chi-squared to increase if thresholds are fixed to 0, unless the distribution is exactly 50% in each category.  A syntax example could reveal how I misunderstand. 

Am I understanding this correctly?  If so, what are the intended use cases for this feature allowing the user to increase the degrees of freedom by omitting the thresholds from the model without removing them from the sufficient statistics?

What "feature" are you referring to?  If you mean that the lavaan() function does not automatically estimate them (auto.th=FALSE), I think that is no different than the default setting int.ov.free=FALSE (i.e., intercepts are not estimated by default).  Both options are set TRUE by the cfa() and sem() wrappers, which are designed to let model syntax be minimal.

Best,

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam
 

Keith Markus

Nov 15, 2022, 10:02:59 AM
to lavaan
Thanks Terrence, that was very informative despite the fact that I left you fishing.  I definitely erred on the side of terseness.  Here is a script that illustrates what I had in mind.

Having further experimented with this, I now suspect that it is just incorrect to omit the thresholds despite the fact that the model goes through without a warning.  Nonetheless, I am not confident that I am thinking about this correctly.

The script below fits two single-factor CFAs with four ordinal indicators.  The first model omits the threshold for x1 from the model syntax.  The second model restores this threshold.  As you can see from the summary table below, the chi-square values are the same, but I pick up a df with one fewer free parameter when I omit the threshold from the model syntax.  Of course, this possibility relies on auto.th = FALSE.

> summaryTable(noThresholdFit, thresholdFit, c('No Threshold', 'Threshold'))
         model moments free     stat df    pvalue
1 No Threshold      10    7 1.314769  3 0.7256291
2    Threshold      10    8 1.314769  2 0.5182049
>

I am counting sufficient statistics as 6 polychoric correlations plus 4 thresholds = 10.  (However, in the table it is just the sum of free parameters plus df.)

Is there ever a good reason to fit a model like the first one, or with all thresholds omitted?

Thanks,
Keith
------------------------
Keith A. Markus
John Jay College of Criminal Justice, CUNY
http://jjcweb.jjay.cuny.edu/kmarkus
Frontiers of Test Validity Theory: Measurement, Causation and Meaning.
http://www.routledge.com/books/details/9781841692203/



# Counting DF with ordinal data
require(lavaan)
set.seed(54321)

# Simulated data
populationModel <- '
  F1 =~ .6*x1 + .6*x2 + .6*x3 + .6*x4
  F1 ~~ 1*F1
  x1 | -.2*t1  
  x2 | -.1*t1
  x3 | 0.1*t1
  x4 | 0.2*t1
' # end model

lavaanify(populationModel)

myData <- simulateData(model=populationModel, sample.nobs=1000)
summary(myData)


# Fit model with x1 threshold omitted
noThresholdModel <- '
  F1 =~ x1 + x2 + x3 + x4
  F1 ~~ 1*F1
  x2 | t1
  x3 | t1
  x4 | t1
' # end model

noThresholdFit <- lavaan(model=noThresholdModel,
                         data=myData,
                         ordered=TRUE)
summary(noThresholdFit)
lavInspect(noThresholdFit, what='free')
lavInspect(noThresholdFit, what='sampstat')
lavaanify(noThresholdModel)


# fit model with x1 threshold restored
thresholdModel <- '
  F1 =~ x1 + x2 + x3 + x4
  F1 ~~ 1*F1
  x1 | t1
  x2 | t1
  x3 | t1
  x4 | t1
' # end model

thresholdFit <- lavaan(model=thresholdModel,
                       data=myData,
                       ordered=TRUE)
summary(thresholdFit)
lavInspect(thresholdFit, what='free')
lavInspect(thresholdFit, what='est')
lavInspect(thresholdFit, what='fitted')
lavInspect(thresholdFit, what='sampstat')
lavaanify(thresholdModel)


# summary comparison
summaryTable <- function(fit1, fit2, names, ...){
  fit1Test <- lavInspect(fit1, what='test')
  fit2Test <- lavInspect(fit2, what='test')
  model <- names
  stat <- c(fit1Test$standard$stat, fit2Test$standard$stat)
  df <- c(fit1Test$standard$df, fit2Test$standard$df)
  pvalue <- c(fit1Test$standard$pvalue, fit2Test$standard$pvalue)
  # number of free parameters = highest index in the 'free' matrices
  free <- c(max(unlist(lavInspect(fit1, what='free'))),
            max(unlist(lavInspect(fit2, what='free'))))
  moments <- free + df
  myTable <- data.frame(model, moments, free, stat, df, pvalue)
  return(myTable)
} # end function

summaryTable(noThresholdFit, thresholdFit, c('No Threshold', 'Threshold'))


Yves Rosseel

Nov 15, 2022, 12:39:51 PM
to lav...@googlegroups.com
Hello Keith,

This is an interesting example. What happens in the 'No Threshold'
model is that the 'x1 | t1' parameter is added to the parameter table
anyway (otherwise everything would fail), but (because auto.th = FALSE)
as a non-free parameter.

But lavaan does compute a reasonable starting value (based on the
univariate information of x1 only), which is often so good that it
hardly changes (in correctly specified models) even when it is a free
parameter. As a result, the fit of the 'No Threshold' and 'Threshold'
models is the same, but the df count is not.
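
Using the script from earlier in the thread, this is visible in the parameter table: the 'x1 | t1' row exists but is non-free, with its estimate frozen at the univariate starting value rather than at zero.

```r
library(lavaan)
set.seed(54321)

# Same population model as in Keith's script
populationModel <- '
  F1 =~ .6*x1 + .6*x2 + .6*x3 + .6*x4
  F1 ~~ 1*F1
  x1 | -.2*t1
  x2 | -.1*t1
  x3 | 0.1*t1
  x4 | 0.2*t1
'
myData <- simulateData(model = populationModel, sample.nobs = 1000)

# The 'No Threshold' model: x1's threshold omitted from the syntax
noThresholdFit <- lavaan('F1 =~ x1 + x2 + x3 + x4
                          F1 ~~ 1*F1
                          x2 | t1
                          x3 | t1
                          x4 | t1',
                         data = myData, ordered = TRUE)

# The omitted x1 | t1 row is added to the table with free == 0,
# and its est equals its (data-dependent) starting value
pt <- parTable(noThresholdFit)
subset(pt, op == "|")[, c("lhs", "op", "rhs", "free", "start", "est")]
```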

Because there is (for this model) no reason to omit this threshold, you
could say this is just cheating.

You may wonder why lavaan does not simply fix the threshold to zero.
It could, and that would make sense for binary items, but what about
items with more than 2 levels? If we fixed all thresholds to zero,
things would break down.

So the only 'use-case' I could imagine for omitting a threshold (on
purpose) is to 'fix' it to its starting value (which depends on the
data). But I have never seen an example where this would make sense.

Yves.

Keith Markus

Nov 16, 2022, 2:26:47 PM
to lavaan
Yves,
Thanks.  That makes perfect sense.  Glad I asked.  This thread has cleared up several things that were a little fuzzy for me.