Cutpoint Priors


Michael Betancourt

May 19, 2015, 5:46:58 PM
to stan-...@googlegroups.com
Is there a natural prior for cut points (specifically in
an ordered logistic regression) or is the ordered constraint
sufficiently strong that the uniform prior is well-posed?

Andrew Gelman

May 19, 2015, 8:25:50 PM
to stan-...@googlegroups.com
Hi, Ben and I actually have an (unwritten) paper on this. The short answer is that we do think there’s a place for a weakly informative prior here. One challenge is that the scale of the cutpoints themselves is hard to know ahead of time, so we were talking about setting up a prior by putting in a small bit (possibly a fractional bit) of pseudo-data in each bin.
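As a rough illustration, a minimal sketch of that pseudo-data idea (not the formulation from the unwritten paper; alpha here is a hypothetical data vector of fractional pseudo-counts, one per outcome category, and the pseudo-observations are evaluated at a reference linear predictor of 0):

model {
  ...
  # hypothetical pseudo-data prior: alpha[c] fractional observations in bin c,
  # each contributing a weighted ordered-logistic log-likelihood term
  for (c in 1:C)
    increment_log_prob(alpha[c] * ordered_logistic_log(c, 0, cutpoints));
}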


Ben Goodrich

May 19, 2015, 10:53:50 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
One of the reasons the paper is unwritten is that I think it is better to bring the prior information in through the parameters of a Dirichlet prior on the conditional probabilities of being in each category, given that the predictors are at the sample means. In code, it would look like

functions {
  vector make_cutpoints(vector probabilities, real scale) {
    vector[rows(probabilities) - 1] cutpoints;
    real running_sum;
    running_sum <- 0;
    for (c in 1:rows(cutpoints)) {
      running_sum <- running_sum + probabilities[c];
      cutpoints[c] <- logit(running_sum);
    }
    return scale * cutpoints;
  }
}
data {
  int<lower=1> N;            # observations
  int<lower=1> K;            # predictors
  int<lower=3> C;            # outcome categories
  matrix[N,K] X;             # "standardized" predictor matrix
  vector[K] xbar;            # vector of predictor means
  int<lower=1,upper=C> y[N]; # ordinal outcome
  vector[C] pseudocounts;    # prior value for concentrations in Dirichlet prior
  real<lower=0> nu;          # scale for Cauchy priors
}
transformed data {
  matrix[K,K] middle;
  middle <- xbar * transpose(xbar);
}
parameters {
  vector[K] beta_raw;        # coefficients
  simplex[C] probabilities;  # of falling in each category given xbar
}
transformed parameters {
  vector[K] beta;
  vector[C-1] cutpoints;
  for (k in 1:K) beta[k] <- nu * tan(pi() * (Phi(beta_raw[k]) - 0.5));
  cutpoints <- make_cutpoints(probabilities, sqrt(quad_form(middle, beta) + 1));
}
model {
  vector[N] eta;
  eta <- X * beta;                                           # linear predictor
  for (n in 1:N) y[n] ~ ordered_logistic(eta[n], cutpoints); # likelihood
  probabilities ~ dirichlet(pseudocounts);                   # our intuitive prior
  beta_raw ~ normal(0,1);                                    # prior implies beta ~ cauchy(0,nu)
}

Ben

Bob Carpenter

May 20, 2015, 2:27:28 PM
to stan-...@googlegroups.com
Cool. I like that idea.

Just ordering obviously isn't enough for the uniform distribution
on an ordered vector to be proper.

What does happen for reasons I still don't fully understand is
that

ordered[K] cutpoints;

cutpoints ~ normal(0, 5);

provides the same distribution for cutpoints as

vector[K] cutpoints_raw;

cutpoints_raw ~ normal(0, 5);

cutpoints <- sort(cutpoints_raw);

You can put a prior directly on the gaps with

real cutpoint0;
vector<lower=0>[K - 1] cutpoint_gaps;

cutpoint_gaps ~ normal(0, 2);

ordered[K] cutpoints;

cutpoints <- append_row(cutpoint0, cutpoint0 + cumulative_sum(cutpoint_gaps));

I'm not sure how to constrain the lowest cut point, cutpoint0.

- Bob

Lei Zhang

Apr 13, 2016, 9:38:20 AM
to Stan users mailing list
Sorry to bring this up again. I am (still) facing the cutpoint-collapse problem when fitting an ordered-logistic model. After reading this thread and several others, I am still not sure (1) how to properly specify the prior on the cutpoint parameters, and (2) whether there is a way to model the cutpoints hierarchically, like below. I tried something before, but it seems that it doesn't guarantee that c[n,k+1] > c[n,k], and hence it won't compile. I would highly appreciate any suggestions!

ordered[K] c_mu;
vector<lower=0>[K] c_sd;
ordered[K] c[N];

for (n in 1:N)
  for (k in 1:K)
    c[n,k] ~ normal(c_mu[k], c_sd[k]);

Bob Carpenter

Apr 13, 2016, 12:04:28 PM
to stan-...@googlegroups.com
We don't really have any recommendations for cutpoint priors.
Gelman and Hill's regression book does talk a bit about
hierarchical priors.

What you need to do is put a prior on the distance between
cutpoints that avoids zeroes if you don't want them to collapse.
Something like a Dirichlet prior with alpha >> 1 on a simplex
that can then be scaled to ordered cutpoints would work.

- Bob

Andrew Gelman

Apr 13, 2016, 11:43:57 PM
to stan-...@googlegroups.com
Hi, are you saying that a simple flat prior on the cutpoints won’t work? Or a super-weak prior such as normal(0,10)?

I’m surprised that you get “cutpoint collapse” if you’re doing full Bayes (as opposed to maximum likelihood).

If you want to put some prior info as below, I recommend putting the normal priors on the _differences_ between the cutpoints, not the cutpoints themselves.  Otherwise you have this weird thing going on where the cutpoints are all being pulled toward each other.

A


Dustin Tran

Apr 14, 2016, 12:11:58 AM
to stan-...@googlegroups.com
What about a truncated stick breaking prior? The size-biasedness of TSB priors will 
(1) force an ordering for the cutpoints; 
(2) make draws of cutpoints live on [0,1] which you can scale willy-nilly;
(3) reinforce what Ben said on adding prior information via pseudocounts in the parameters of a Dirichlet prior (here, use Beta stick-breaking priors, which is what Ben essentially does based on the running sum in the make_cutpoints function).

Dustin

Lei Zhang

Apr 14, 2016, 5:01:00 AM
to Stan users mailing list
Hi Dustin,

Sorry for my ignorance. What exactly do you mean by 'Beta stick-breaking priors'? Could you provide some example code of how to implement it?

Thanks a lot!
Lei

Dustin Tran

Apr 14, 2016, 10:56:41 AM
to stan-...@googlegroups.com
Hi Lei, see chapter 23 of BDA3. Section 23.1 explains the pseudo-counts in the discussion here, based on the basic notion of a Bayesian histogram. The ‘stick-breaking construction’ in 23.2 explains the stick-breaking prior for Dirichlet processes, where the sticks are broken off according to a Beta distribution. By a truncated stick-breaking prior, I mean that rather than running this stick-breaking process to infinity, it is run only up to the number of cutpoints. (This is just another representation of a Dirichlet distribution.)
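For what it's worth, a minimal sketch of a truncated stick-breaking construction for the cutpoints (assuming C categories and a concentration alpha supplied as data; the names v, stick, and cum_prob are illustrative, not from BDA3 or from Ben's program):

parameters {
  vector<lower=0,upper=1>[C-1] v;  # stick-breaking fractions
}
transformed parameters {
  vector[C-1] cutpoints;           # increasing by construction
  {
    real stick;                    # length of stick remaining
    real cum_prob;                 # cumulative category probability
    stick <- 1;
    cum_prob <- 0;
    for (c in 1:(C-1)) {
      cum_prob <- cum_prob + v[c] * stick;  # probability of category c
      stick <- stick * (1 - v[c]);
      cutpoints[c] <- logit(cum_prob);      # map to the logit scale
    }
  }
}
model {
  v ~ beta(1, alpha);  # truncated stick-breaking; category C gets the leftover stick
  ...
}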

Dustin

Bob Carpenter

Apr 14, 2016, 4:53:28 PM
to stan-...@googlegroups.com
The transform for the simplex uses stick-breaking, but
there's no ordering --- what do you mean by ordering of
the cutpoints? The usual thing to do here is take the
cumulative sums, so you get an increasing sequence to scale:

parameters {
  simplex[K] theta;     // gaps between cutpoints
  real<lower=0> kappa;  // scale for cutpoints
}

transformed parameters {
  ordered[K] cutpoints;
  cutpoints <- cumulative_sum(theta) * kappa;
}

model {
  theta ~ dirichlet(rep_vector(alpha, K));
}
Then if you put a Dirichlet prior on theta, with alpha >> 1, it'll keep
the cutpoints apart. And it's best to define that vector as
transformed data.
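For example, a minimal sketch of that transformed-data variant (assuming alpha is passed in through the data block):

data {
  int<lower=2> K;
  real<lower=0> alpha;
}
transformed data {
  vector[K] conc;              # Dirichlet concentrations, computed once
  conc <- rep_vector(alpha, K);
}
...
model {
  theta ~ dirichlet(conc);
}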

- Bob

Tobias Konitzer

Mar 13, 2017, 3:47:59 PM
to Stan users mailing list
I am wondering how to derive cut-points using only prior counts in the different categories, with no information from the data other than that. It sounds like Bob's approach above does this, but how do I go from K prior probabilities (assuming K = number of categories) to K-1 cut-points? Does the above not estimate the first cut-point? In that case, would we calculate the predicted probability of k=1 as 1 - sum(prob of the other categories), akin to estimating the probability of falling into the maximum category in standard ordinal logistic regression?

And, would that approach, without some scale taken from predictors, be good enough to avoid collapse?

Thanks,
Tobi 

Andrew Gelman

Mar 13, 2017, 5:59:59 PM
to stan-...@googlegroups.com
Collapse of cutpoints can occur with point estimation but should not occur with full Bayes.

Bob Carpenter

Mar 15, 2017, 2:44:28 AM
to stan-...@googlegroups.com
In theory, you won't ever get exactly collapsed cut points
in the posterior, but that doesn't mean that in practice you
won't get close enough to cause underflow and other problems.

Same as if you start drawing theta ~ beta(1e-10, 1e-10). Sure,
you shouldn't draw theta = 0 or theta = 1 in theory, but that's
what's going to happen with floating point arithmetic on computers.
You can get around that problem with just a little stronger prior.

- Bob

Andrew Gelman

Mar 15, 2017, 6:10:33 PM
to stan-...@googlegroups.com
No, that's not what I meant. If you have uniform priors on the distances between the cutpoints, and you have a bit of data, then your posterior won't be collapsed near 1e-10 or even near 0.001. That sort of collapsing behavior can occur with point estimates, and it can occur with hierarchical models (the "funnel") but it should not hold with flat priors. If you happen to have zero data in some cells, maybe you'll get the difference between cutpoints to be estimated at 0.01 or something like that.
A

Bob Carpenter

Mar 15, 2017, 7:42:31 PM
to stan-...@googlegroups.com
Yup, understood. The MLE will actually collapse the boundaries
into each other if there's no data, whereas the Bayesian one
will never get there.

Now we're just guessing how close the cut points will go
to each other in some of the posterior draws, which will
depend on the prior and how much data there is in the other
bins. I haven't fit one of these in a while, so I may be
misremembering how bad the numerical issues will be.
Either way, the distance on the unconstrained scale needs
to go to -infinity for its exponentiation to reach zero, so it
can cause issues if you get lots of data in the other bins.

Of course, instead of a uniform on the inter-cut-point distances,
you can use something that avoids zero like a gamma with the
appropriate params.
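For instance, a minimal sketch of such a zero-avoiding gap prior (assuming C outcome categories; the gamma(2, 2) shape and rate and the names c1 and gaps are just placeholders):

parameters {
  real c1;                    # location of the first cutpoint
  vector<lower=0>[C-2] gaps;  # distances between adjacent cutpoints
}
transformed parameters {
  ordered[C-1] cutpoints;
  cutpoints <- append_row(c1, c1 + cumulative_sum(gaps));
}
model {
  c1 ~ normal(0, 5);          # locates the overall position
  gaps ~ gamma(2, 2);         # keeps adjacent cutpoints from collapsing
  ...
}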

- Bob

Andrew Gelman

Mar 15, 2017, 7:48:00 PM
to stan-...@googlegroups.com
My intuition is that with N data points and 0 inside a particular bin, the estimated bin width would be of order 1/N. But I guess I should do an experiment.

Yes, a gamma prior with shape parameter 2 will get the point estimate away from 0 (as in this paper: http://www.stat.columbia.edu/~gelman/research/published/chung_etal_Pmetrika2013.pdf ). I luuuvv this idea but it's all about point estimation. I don't like it for full Bayes.

Again, I guess it's time for some experiments. Also, the point you're making about constraints is interesting, as it also came up in my rounded-data example. It's making me wonder if we need to think again about our automatic transformations to the unconstrained scale.

A

Bob K

Apr 5, 2017, 9:40:14 AM
to Stan users mailing list, gel...@stat.columbia.edu
I am running an IRT/ordered categorical logistic model and using a weakly informative normal prior on the differences between cutpoints. I'm also getting a lot of warnings (about 10) about improper values (negative values and infinity) for the cutpoints coming from the ordered_logistic function.

So here's a question -- should the ordered vector type naturally handle the fact that the normal prior is being put over the differences between cutpoints, i.e. by transforming to the unconstrained scale such that the normal prior is really a half-normal prior? Or should the normal prior be explicitly truncated at zero? Or, would it be better to use an exponential prior given that all values below zero are meaningless for ordered cutpoint differences?

I can post some data if y'all want to see it.

Thanks much for any advice,

Bob

Tran

Apr 5, 2017, 1:52:33 PM
to Stan users mailing list, gel...@stat.columbia.edu
Hi,

This link might be helpful for you.

Tran.

Bob Carpenter

Apr 6, 2017, 2:17:31 PM
to stan-...@googlegroups.com
The ordered vector deals with keeping the parameters ordered
and also applies the Jacobian for the transform.

So you can put whatever priors you want on the differences (zero-avoiding
priors like gamma or lognormal being a popular choice).

parameters {
  ordered[K] c;
  ...
model {
  for (k in 2:K)
    (c[k] - c[k - 1]) ~ ...;
  ...

This is OK because the Jacobian of the difference is constant, so it doesn't need to be accounted for. You may also want positive_ordered vectors depending on the application; the same thing works for those.

The above is still an improper prior because nothing's locating the
overall position. That can be fixed by putting a prior on c[1].

c[1] ~ ...

or really on any of the c[n] values or even on their sum.

You can even vectorize if every diff gets the same
prior:

(c[2:K] - c[1:(K - 1)]) ~ ...

- Bob