Prediction in an ordered logistic model

Olivier Ma

unread,

Mar 22, 2017, 11:45:38 PM3/22/17

to stan-...@googlegroups.com

I'm implementing the ordered logistic model in the Stan reference manual, and I want to check whether the model is working well by comparing the observed values and the model prediction. The model prediction is nothing like the observed values, I don't know if I have programmed it wrong or if I misunderstand how to check the model fits from an ordered logistic model.

The model code is essentially copied from the manual, I only added the generated quantities part. The whole model file is attached. And here is the generated quantities part I added:

generated quantities {
    real<lower=0> y_p[N];       // prediction 
    {
        matrix[N, K] pred;      // prediction each case and each class
        for (n in 1:N) {        // cases
            for (k in 1:K) {    // classes
                pred[n, k] = exp(ordered_logistic_lpmf(k | x[n]*beta, c)) * k;
            }
            y_p[n] = sum(pred[n]);
        } 
    }
}

Basically what I have done is:

get the log pmf for each class, exponentialize it;
multiply it with the observed class value;a
and sum all the classes for the expected prediction

But the prediction is nothing like the observed values. I have 7 different classes, and here are the observed values:

> table(d$response)
   1    2    3    4    5    6    7 
1274  909 1071 2323 1462 1445 1446

And the predicted values are all in a much smaller range:

> range((m113_sample$y_p))
[1] 2.673227 4.953754

What am I doing wrong here?

ologit.stan

Ben Goodrich

unread,

Mar 23, 2017, 12:18:21 AM3/23/17

to Stan users mailing list

On Wednesday, March 22, 2017 at 11:45:38 PM UTC-4, Olivier Ma wrote:

But the prediction is nothing like the observed values. I have 7 different classes, and here are the observed values:

> table(d$response) 1 2 3 4 5 6 7 1274 909 1071 2323 1462 1445 1446

And the predicted values are all in a much smaller range:

> range((m113_sample$y_p)) [1] 2.673227 4.953754

What am I doing wrong here?

Not drawing from the posterior predictive distribution. You don't want to compare d$response to some real number than only reflects uncertainty in the coefficients and cutpoints; you want to compare d$response to what distribution the model implies d$response should be. So, rather than storing those probabilities in the generated quantities block, you want to store draws from a categorical distribution that is evaluated at those posterior probabilities. Then --- assuming your model is reasonable --- you have an apples-to-apples comparison with d$response.

Ben

Olivier Ma

unread,

Mar 23, 2017, 12:35:48 AM3/23/17

to Stan users mailing list

Thanks Ben!

Does this mean that I can only check the model by comparing the distribution of observed vs predicted, but not the prediction accuracy for each observation? What I have in mind is the kind of error distribution from linear models, that is, for each observation, whether the model predicted value will be higher or lower than the actually observed.

Ben Goodrich

unread,

Mar 23, 2017, 10:19:47 AM3/23/17

to Stan users mailing list

On Thursday, March 23, 2017 at 12:35:48 AM UTC-4, Olivier Ma wrote:

Thanks Ben!

Does this mean that I can only check the model by comparing the distribution of observed vs predicted, but not the prediction accuracy for each observation? What I have in mind is the kind of error distribution from linear models, that is, for each observation, whether the model predicted value will be higher or lower than the actually observed.

In rstanarm, we calculate latent residuals for ordinal models

https://github.com/stan-dev/rstanarm/blob/master/exec/polr.stan#L104
https://github.com/stan-dev/rstanarm/blob/master/exec/polr.stan#L219

but it involves rejection sampling so it can hang if your model doesn't fit the data very well.

Ben

Olivier Ma

unread,

Mar 24, 2017, 1:03:19 AM3/24/17

to Stan users mailing list

This is very helpful, thanks!

Reply all

Reply to author

Forward