problems estimating generalized (non-parallel) ordered logit models

jun...@gmail.com

unread,

Jun 25, 2015, 8:05:38 PM6/25/15

to stan-...@googlegroups.com

Dear rstanians,

This is a bit problems loaded post since I ran into couple of problems estimating a generalized ordered logit model, which is an extension to ordered logit/probit model discussed in p.50 of the Stan Modeling Language User's Guide v2.6.2. Instead of assuming that slopes (for the same independent variables) are the same across different cut-point equations (hence the proportional odds assumption), the generalized ordered logit model allows slopes to vary across equations. Because simple ordered logit model assumes the equivalence of coefficients, stan uses the ordered_logistic() function to set up the likelihood; but to set up a generalized ordered logit model, one may have to some modified version of of the stan codes on p.51 for an ordered probit regression model. Here are my questions (from the most to least straightforward):

1. how to get those model convergence statistics. I know using print(stanfit) (assuming stanfit is the stan function output object) can produce potential scale reduction factor (psrf); is it the same as those produced by gelman.diag() after rjags? How about other convergence and model fit/comparison statistics (DIC)? Professor Gelman suggested that one do not use DIC since it is not fully Bayesian, but is there any quick way to get it and triangulate results with those from WAIC?

2. I have a couple technical questions about stan and rstan codes. In terms of the empirical example that I have been working on, the ordinal response variable has eight levels, so one way to write the model is:

data {

int<lower=2> K; // total number of levels for y

int<lower=0> N; // total number of cases

int<lower=1> D; // total number of independent variables

int<lower=0, upper=K> y[N];

row_vector[D] x[N];

}

parameters {

vector[D] b1;

vector[D] b2;

vector[D] b3;

vector[D] b4;

vector[D] b5;

vector[D] b6;

vector[D] b7;

ordered[K-1] thresh;

}

model {

vector[K] prob;

vector[K-1] eta;

for (d in 1:D) {

b1[d] ~ normal(0, 10);

b2[d] ~ normal(0, 10);

b3[d] ~ normal(0, 10);

b4[d] ~ normal(0, 10);

b5[d] ~ normal(0, 10);

b6[d] ~ normal(0, 10);

b7[d] ~ normal(0, 10);

}

for ( k in 1:K-1)

thresh[k] ~ normal( k+0.5, 10);

for (n in 1:N) {

eta[1] <- x[n]*b1;

eta[2] <- x[n]*b2;

eta[3] <- x[n]*b3;

eta[4] <- x[n]*b4;

eta[5] <- x[n]*b5;

eta[6] <- x[n]*b6;

eta[7] <- x[n]*b7;

prob[1] <- inv_logit(thresh[1] - eta[1]);

for (k in 2:(K-1)) {

prob[k] <- inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]);

}

prob[K] <- 1 - inv_logit(thresh[K-1] - eta[K-1]);

y[n] ~ categorical(prob);

}

After running my model, I got many error message like " the current indicating that the Metropolis proposal is about to be rejected because of the following issues:
stan::prob::categorical_log(N4stan5agrad3varE): Probabilities is not a valid simplex: Probabilities parameter .... should be greater than or equal to zero....
If this warning occurs sporadically....then the sampler is fine, but if this warning occurs often, then your model may be either severely ill-conditioned or misspecified."

Here I am guessing that the reason why I got so many error messages and the process continued forever is because the generalized ordered logit/probit does not guarantee that the probabilities are positive since they are differences in two cumulative densities, which is a design flaw with the model. And I don't know how to resolve this issue. I also tried using

"prob[k] <- max(0, inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]));"

But looks like it doesn't work

3. My way of specifying model as laid out above is clearly not efficient. Since I have different slope vectors, I am thinking about using a matrix for the slopes like (see the highlighted part), but then I don't know how to code the rest of it....

data {

int<lower=2> K; // total number of levels for y

int<lower=0> N; // total number of cases

int<lower=1> D; // total number of independent variables

int<lower=0, upper=K> y[N];

row_vector[D] x[N];

}

parameters {

matrix[D, K-1] b

ordered[K-1] thresh;

}

model {

vector[K] prob;

vector[K-1] eta;

for (d in 1:D) {

b1[d] ~ normal(0, 10);

b2[d] ~ normal(0, 10);

b3[d] ~ normal(0, 10);

b4[d] ~ normal(0, 10);

b5[d] ~ normal(0, 10);

b6[d] ~ normal(0, 10);

b7[d] ~ normal(0, 10);

}

for ( k in 1:K-1)

thresh[k] ~ normal( k+0.5, 10);

for (n in 1:N) {

eta[1] <- x[n]*b1;

eta[2] <- x[n]*b2;

eta[3] <- x[n]*b3;

eta[4] <- x[n]*b4;

eta[5] <- x[n]*b5;

eta[6] <- x[n]*b6;

eta[7] <- x[n]*b7;

prob[1] <- inv_logit(thresh[1] - eta[1]);

for (k in 2:(K-1)) {

eta[k] <- x[n]*b[]

prob[k] <- inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]);

}

prob[K] <- 1 - inv_logit(thresh[K-1] - eta[K-1]);

y[n] ~ categorical(prob);

}

Jun Xu, PhD
Associate Professor
Department of Sociology
Ball State University
Muncie, IN

jun...@gmail.com

unread,

Jun 25, 2015, 8:36:02 PM6/25/15

to stan-...@googlegroups.com

I also tried using

"prob[k] <- fmax(inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]), log(1));"

It looks like the model setup of "y[n] ~ categorical(prob);" does not allow the total sum of probabilities (the model design allows some negative probabilities for some y levels) to be greater than one. Is there any work-around?

jun...@gmail.com

unread,

Jun 25, 2015, 8:39:26 PM6/25/15

to stan-...@googlegroups.com

My question actually can be rephrased as whether there is way to estimate generalized ordered logit/probit models, in which we can get negative predicted probabilities (so if we replace those negative predicted probabilities with zero, we might have sums greater than one). Thanks!

On Thursday, June 25, 2015 at 8:05:38 PM UTC-4, jun...@gmail.com wrote:

Bob Carpenter

unread,

Jun 25, 2015, 10:35:39 PM6/25/15

to stan-...@googlegroups.com

I'll let Andrew or Aki answer the WAIC/DIC, etc questions.
You may need to repost with WAIC in the title to get their
attention.

Almost all the posts on this list are problem-oriented. We
wish more people would write in with success stories and to
share models. We hear them from people in person and via
off-list e-mail, but not so much on-list. It gives a very skewed
perspective!

Whatever you do, a model generating negative probabilities is
ill posed. The usual thing to do to convert unbounded values into
probabilities is to use softmax (exp-transform before normalizing).
It's the link for multi-logit.

As to (1), I don't know if Martyn Plummer updated rjags to
use split-R-hat the way we do in Stan. The split-R-hat (aka
split-"psrf") defined in our manual and in the latest Gelman
et al. Bayesian Data Analysis book is more conservative,
as is our n_eff calculation, which also uses cross-chain information.
You can pull the JAGS output into Stan this way, assuming you can
get the same kind of output out of rjags as out of R2jags.

fit_jags <- jags(model.file = "irt_hier.jags",
data = list(I=I, J=J, y=y),
init = init_list,
parameters.to.save = c("mu_a", "sigma_a", "a",
"mu_b", "sigma_b", "b",
"theta"),
n.chains = num_chains,
n.iter = 20000,
n.thin = 2,
jags.seed = init_seed);

print(c("jags", elapsed_time_jags));

fit_jags_mcmc <- as.mcmc(fit_jags);
arr <- as.array(fit_jags_mcmc);
arr <- aperm(arr, c(1,3,2));
mon <- monitor(arr, inc_warmup = FALSE);

The mon object should everything you need in it.

Some comments on the model below.

> On Jun 25, 2015, at 8:39 PM, jun...@gmail.com wrote:
>
> My question actually can be rephrased as whether there is way to estimate generalized ordered logit/probit models, in which we can get negative predicted probabilities (so if we replace those negative predicted probabilities with zero, we might have sums greater than one). Thanks!
>
> On Thursday, June 25, 2015 at 8:05:38 PM UTC-4, jun...@gmail.com wrote:
> Dear rstanians,
>
> This is a bit problems loaded post since I ran into couple of problems estimating a generalized ordered logit model, which is an extension to ordered logit/probit model discussed in p.50 of the Stan Modeling Language User's Guide v2.6.2. Instead of assuming that slopes (for the same independent variables) are the same across different cut-point equations (hence the proportional odds assumption), the generalized ordered logit model allows slopes to vary across equations. Because simple ordered logit model assumes the equivalence of coefficients, stan uses the ordered_logistic() function to set up the likelihood; but to set up a generalized ordered logit model, one may have to some modified version of of the stan codes on p.51 for an ordered probit regression model. Here are my questions (from the most to least straightforward):
>
> 1. how to get those model convergence statistics. I know using print(stanfit) (assuming stanfit is the stan function output object) can produce potential scale reduction factor (psrf); is it the same as those produced by gelman.diag() after rjags? How about other convergence and model fit/comparison statistics (DIC)? Professor Gelman suggested that one do not use DIC since it is not fully Bayesian, but is there any quick way to get it and triangulate results with those from WAIC?
>
> 2. I have a couple technical questions about stan and rstan codes. In terms of the empirical example that I have been working on, the ordinal response variable has eight levels, so one way to write the model is:
>
> data {
>
> int<lower=2> K; // total number of levels for y
>
> int<lower=0> N; // total number of cases
>
> int<lower=1> D; // total number of independent variables
>
> int<lower=0, upper=K> y[N];
>
> row_vector[D] x[N];
>
> }
>
> parameters {
>
> vector[D] b1;
>
> vector[D] b2;
>
> vector[D] b3;
>
> vector[D] b4;
>
> vector[D] b5;
>
> vector[D] b6;
>
> vector[D] b7;
>

This should just be

matrix[7, D] b;

> ordered[K-1] thresh;
>
> }
>
> model {
>
> vector[K] prob;
>
> vector[K-1] eta;
>
> for (d in 1:D) {
>
> b1[d] ~ normal(0, 10);
>
> b2[d] ~ normal(0, 10);
>
> b3[d] ~ normal(0, 10);
>
> b4[d] ~ normal(0, 10);
>
> b5[d] ~ normal(0, 10);
>
> b6[d] ~ normal(0, 10);
>
> b7[d] ~ normal(0, 10);
>
> }

Then this should be:

to_vector(b) ~ normal(0, 10);

though we'd recommend tighter priors unless you're expecting values
of the scale 10.

>
> for ( k in 1:K-1)
>
> thresh[k] ~ normal( k+0.5, 10);

I don't understand why you're doing this. You usually want
to deal with cutpoints. Check out either the Gelman and Hill
book or the ordered_logit example in the manual.

>
> for (n in 1:N) {
>
> eta[1] <- x[n]*b1;
>
> eta[2] <- x[n]*b2;
>
> eta[3] <- x[n]*b3;
>
> eta[4] <- x[n]*b4;
>
> eta[5] <- x[n]*b5;
>
> eta[6] <- x[n]*b6;
>
> eta[7] <- x[n]*b7;

This then becomes

eta <- x * b;

>
> prob[1] <- inv_logit(thresh[1] - eta[1]);
>
> for (k in 2:(K-1)) {
>
> prob[k] <- inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]);
>
> }
>

The problem here is that you have no guarantee these are going
to be positive.

> prob[K] <- 1 - inv_logit(thresh[K-1] - eta[K-1]);
>
> y[n] ~ categorical(prob);
>
> }
>
> }
>
>
> After running my model, I got many error message

Right --- the model as formulated won't work.

> like " the current indicating that the Metropolis proposal is about to be rejected because of the following issues:
> stan::prob::categorical_log(N4stan5agrad3varE): Probabilities is not a valid simplex: Probabilities parameter .... should be greater than or equal to zero....
> If this warning occurs sporadically....then the sampler is fine, but if this warning occurs often, then your model may be either severely ill-conditioned or misspecified."
>
> Here I am guessing that the reason why I got so many error messages and the process continued forever is because the generalized ordered logit/probit does not guarantee that the probabilities are positive since they are differences in two cumulative densities, which is a design flaw with the model. And I don't know how to resolve this issue. I also tried using

Correct. See above.

>
> "prob[k] <- max(0, inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]));"
>
> But looks like it doesn't work

You want to model the positive differences.

>
> 3. My way of specifying model as laid out above is clearly not efficient.

See above. The computational inefficiency comes from not
vectorizing, but the bigger problem is all the cut-and-paste
and repetition making it hard to understand.

> Since I have different slope vectors, I am thinking about using a matrix for the slopes like (see the highlighted part), but then I don't know how to code the rest of it....
>

I'd seriously recommend single-spacing code. But maybe
this is some kind of mailer or cut-and-paste artificact.

Let us know if that answer's not clear --- but do have a look
at the built-in ordered logit and how it's defined.

- Bob

Andrew Gelman

unread,

Jun 26, 2015, 5:44:03 PM6/26/15

to stan-...@googlegroups.com

Indeed, you don’t want negative probabilities! From a Bayesian perspective, you no longer have a generative model.
Regarding Waic etc.: Aki and Jonah and I will be posting a paper on this very soon.
A

> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

jun...@gmail.com

unread,

Jun 28, 2015, 4:18:15 PM6/28/15

to stan-...@googlegroups.com

Dear Professor Gelman and Bob,

Thanks a lot to you both for all your help! The generalized ordered logit allows possible negative predicted probabilities, which may deem either as a limitation of the model or some hurdle that usual estimation methods are hard to overcome (e.g., conditions such as the difference in two cumulative densities is positive and bound between 0 and 1). Is it possible to force the sum to be one (since I replace the negative predicted probabilities with zero's)? I will spend a few more days to try to find a work-around....

I agree with Bob's suggestion that we should collect all working examples. I'd be happy to share my rstan and jags codes on a series of ordered regression models for a project that I am working on.

Jun

On Thursday, June 25, 2015 at 8:05:38 PM UTC-4, jun...@gmail.com wrote:

Andrew Gelman

unread,

Jun 28, 2015, 4:45:17 PM6/28/15

to stan-...@googlegroups.com

lack of parallelism . . . Andrew and Robert . . . or Andy and Bob . . . or Dr. Gelman and Dr. Carpenter . . .

jun...@gmail.com

unread,

Jun 28, 2015, 7:38:02 PM6/28/15

to stan-...@googlegroups.com, gel...@stat.columbia.edu

Dr. Gelman and Dr. Carpenter,

My bad. I wasn't thinking much about it. I am kind of new to the Stan community, so I don't know people that well. I have been using JAGS almost exclusively before, although I did try estimating some simple models using Stan before. But most recently, I ran into problems with DIC from rjags (having Inf. for some measures), so I am trying to estimate the same models using Stan and WAIC....

Jun

Bob Carpenter

unread,

Jun 28, 2015, 9:00:46 PM6/28/15

to stan-...@googlegroups.com

The first thing I found by the name of "generalized ordered
logit" is this:

http://www.stata-journal.com/sjpdf.html?articlenum=st0097

You're right that as expressed it allows negative probabilities
for

Pr(Y[i] > 0) = 1
Pr(Y[i] > j) = inv_logit(x[i] * beta[j]) for j > 1

so that

Pr(Y[i] = j) = Pr(Y[i] > j -1) - Pr(Y[i] > j)

As the model is stated, it imposes a data-based consistency
condition on beta, which would not be guaranteed to be satisfied
by future observations of x.

This doesn't seem like a very good idea.

It seems like there must be a better way to parameterize
this model, perhaps by modeling

Pr(Y[i] > j | Y[i] > (j-1)) = inv_logit(x[i] * beta[j])

(modulo my inevitably having messed up an index somewhere).

- Bob

P.S. We're all on a first name basis here!

jun...@gmail.com

unread,

Jun 28, 2015, 10:42:41 PM6/28/15

to stan-...@googlegroups.com

Bob,

The model you suggested looks like a continuation ratio or sequential logit model with freely varying slopes. It's usually viewed as a different model than the generalized ordered logit model (based on cumulative logits). Among all alternatives (adjacent logit, sequential logit, and cumulative logit/proportional odds), the cumulative logit/proportional odds model is the only one that has this potential model design problem (negative probabilities) if we allow the slopes to vary across equations. I could've used the adjacent logit, but usually it's considered to be a different model.... I was able to use JAGS to get results (e.g., I also used an R function [vglm from the VGAM package] to get similar results using MLE), but I was not able to get a proper dic (infinity) for a complex model (8 levels; but it worked well with a four-level ordinal response variable), and that's why I am turning to Stan for help. This is a methodological project that specifically looks at generalized ordered logit, but probably I should use adjacent logit instead.

Thanks a lot!

Jun

On Thursday, June 25, 2015 at 8:05:38 PM UTC-4, jun...@gmail.com wrote:

Bob Carpenter

unread,

Jun 29, 2015, 5:08:38 PM6/29/15

to stan-...@googlegroups.com

For Stan to be efficient, the model needs to define
support for variables that matches the constraints defined
in the parameter block.

I don't think that's possible for your model for the
reasons outlined in the previous mail.

So I don't think Stan's going to help you here.

We also tend to recommend WAIC over DIC, because it uses Bayesian
estimates rather than point estimates to approximate cross-validation
in-sample.

- Bob

jun...@gmail.com

unread,

Jun 29, 2015, 9:58:12 PM6/29/15

to stan-...@googlegroups.com

Thanks for all your help! It looks like I may have to switch to the adjacent logit....

Jun

On Thursday, June 25, 2015 at 8:05:38 PM UTC-4, jun...@gmail.com wrote:

jun...@gmail.com

unread,

Aug 18, 2015, 10:56:55 AM8/18/15

to Stan users mailing list

Andrew and Bob,

Now I am back, but with mostly a success story plus a minor question. I think the initial subject line should be turned into something like "Problem Estimating Generalized Ordered Logit Model Because of User's Impatience." It turns out that Stan CAN estimate a model that has ill-conditioned probabilities (do not sum to one); it's just that I was not patient enough to wait till after ALL warning messages were spit out. Yes, I got a lot of warning messages and this time I waited till the job was completely done. It turns out that the results produced from Stan are almost the same as those from JAGS, and Stan runs about 30% faster than JAGS. So I will be excited to use the loo package to get loo and WAIC. As I indicated in my previous email, I am posting my codes to share (use rstan; see my codes toward the end of this email response):

Before you read my codes, here is my question:
1. At the very end of the model string (see below), I was trying to produce the generated quantity, incremental log likelihood to be relayed to WAIC; and I looked at section 38.5 of the Stan manual about how to spell out it, "real categorical_log(ints y, vector theta)". But I got an error message, and it looks like there is something wrong about how I specify the second argument in the categorical_log() function; so any quick solution? Thanks a lot!

### start of the error message
TRANSLATING MODEL 'modelString' FROM Stan CODE TO C++ CODE NOW.
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
failed to parse Stan model 'modelString' with error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
variable "prob" does not exist.
variable "prob" does not exist.

ERROR at line 40
38:        vector[N] log_lik;
39:        for (n in 1:N)
40:            log_lik[n] <- categorical_log(y[n], prob);
                                                  ^
41:        }
### end of error message

###### START OF THE RSTAN MODEL STRING CODES######
# BELOW I AM ESTIMATING AN GENERALIZED ORDERED LOGIT MODEL WITH A FOUR-LEVEL RESPONSE VARIABLE
# N = total number of case
# K = number of levels for y
# D = number of independent variables in x

#####
modelString = "
data {
    int<lower=2> K;
    int<lower=0> N;
    int<lower=1> D;

    int<lower=0, upper=K> y[N];
    row_vector[D] x[N];
}
parameters {
    vector[D] b1;
    vector[D] b2;
    vector[D] b3;

    ordered[K-1] thresh;
}
model {
    vector[K] prob;
    vector[K-1] eta;
    for (d in 1:D) {
        b1[d] ~ normal(0, 10);
        b2[d] ~ normal(0, 10);
        b3[d] ~ normal(0, 10);
    }

// for ( k in 1:K-1)
// thresh[k] ~ normal( k+0.5, 10);

    for (n in 1:N) {
        eta[1] <- x[n]*b1;
        eta[2] <- x[n]*b2;
        eta[3] <- x[n]*b3;

prob[1] <- inv_logit(thresh[1] - eta[1]);
for (k in 2:(K-1)) {

prob[k] <- fmax(inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]), log(1));

        }
        prob[K] <- 1 - inv_logit(thresh[K-1] - eta[K-1]);
        y[n] ~ categorical(prob);
    }
}

/* this is where the problem was coming from
* generated quantities {
*    vector[N] log_lik;
*    for (n in 1:N)
*        log_lik[n] <- categorical_log(y[n], prob);
* }
*/ end of my question part

}
"
###### END OF THE RSTAN MODEL STRING CODES######

Jun Xu, PhD
Associate Professor
Department of Sociology
Ball State University

Muncie, IN 47306

On Thursday, June 25, 2015 at 8:05:38 PM UTC-4, jun...@gmail.com wrote:

Bob Carpenter

unread,

Aug 18, 2015, 5:22:35 PM8/18/15

to stan-...@googlegroups.com

The problem is that you declare prob as a local variable
in the model block, so it's not visible in the generated
quantities. Instead define it as a transformed parameter
and everything should be OK. Or redefine prob in
the generated quantities block.

If you care about efficiency, you can also make it more
efficient by not repeating subexpressions, like
inv_logit(thresh[k-1] - eta[k-1]) in:

> for (n in 1:N) {
> eta[1] <- x[n]*b1;
> eta[2] <- x[n]*b2;
> eta[3] <- x[n]*b3;
> prob[1] <- inv_logit(thresh[1] - eta[1]);
> for (k in 2:(K-1)) {
> prob[k] <- fmax(inv_logit(thresh[k] - eta[k]) - inv_logit(thresh[k-1] - eta[k-1]), log(1));
> }
> prob[K] <- 1 - inv_logit(thresh[K-1] - eta[K-1]);

You recompute each one twice, which is a lot of extra computation
in this problem. You can replace "inv_logit(thresh[k-1] - eta[k-1])" with
"prob[k-1]". Similarly for the last one.

But the real problem with this model is that nothing ensures that
the difference is positive, which I assume is why you do the fmax(..., 0)
thing. So I'm not surprised you're getting all kinds of warnings and
exceptions as it goes.

- Bob

Reply all

Reply to author

Forward