Re: sum-to-zero constraints in rstan/stan

Andrew Gelman

Feb 4, 2015, 5:45:55 PM
to Giles Crane, stan-...@googlegroups.com
Hi, Giles. There are different ways of doing this in Stan. Usually we’d do it by redefining parameters to sum to 0, so that if you have a bunch of unconstrained parameters alpha_1,…,alpha_n, you redefine beta_j <- alpha_j - mean(alpha).
But you’ll still want some informative prior on the alphas for stability.
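For concreteness, a minimal Stan sketch of this approach (the program structure and names are hypothetical, and the likelihood is elided); the proper prior on alpha is what keeps the raw parameters anchored, as discussed in the replies below:

data {
  int<lower=2> K;
}
parameters {
  vector[K] alpha;                 // unconstrained effects
}
transformed parameters {
  vector[K] beta;
  beta <- alpha - mean(alpha);     // exactly sums to zero
}
model {
  alpha ~ normal(0, 1);            // informative prior for stability
  // ... likelihood written in terms of beta ...
}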
A

P.S. I’m ccing stan-users because this is a general issue.

> On Feb 4, 2015, at 10:41 AM, Giles Crane <giles...@verizon.net> wrote:
>
> Dear Professor Gelman,
>
> After you replied to my question concerning R packages arm and MCMCglmm,
> I installed and tried to use rstan/stan.
>
> There seems to be no easy way to impose
> sum-to-zero constraints in rstan/stan.
>
> As you know better than I,
> such constraints have been used frequently,
> both in the form of constraining the first effect to be zero
> and in the form of centering the effects.
>
> My search of other people's comments
> did not turn up a solution.
>
> There seems to be a need for an rstan/stan method
> to implement sum-to-zero constraints.
> The BUGS-to-rstan examples
> do not involve sum-to-zero constraints.
>
> Cordially,
> Giles Crane
>
> --
> Giles L Crane, MPH, ASA, NJPHA
> Statistical Consultant and R Instructor
> 621 Lake Drive
> Princeton, NJ 08540
> Phone: 609 924-0971
> Email: giles...@verizon.net


Bob Carpenter

Feb 4, 2015, 9:06:46 PM
to stan-...@googlegroups.com
What Andrew is suggesting may appear to work in BUGS/JAGS
but it absolutely won't work in Stan. See the chapter of
the manual on problematic posteriors for an extended discussion
of what's going on with the posterior.

If z is a K-dimensional parameter vector constrained so that

sum(z) = 0,

then you can code up z as follows:

parameters {
  simplex[K] z_raw;
  real<lower=0> z_scale;
  ...
}
transformed parameters {
  vector[K] z;
  z <- (z_raw - 1.0 / K) * z_scale;   // subtracting 1/K makes z sum to zero
  ...
}

The parameterization involves K free parameters (K-1 for the
simplex and one for the scale).

There's a discussion in the manual of how to parameterize models
like IRT so that they are identified.

In the simplest case, if you have something like a 1PL IRT
model, with likelihood:

y[m,n] ~ bernoulli(inv_logit(alpha[m] - beta[n]))

then alpha and beta aren't identified. But if you give them
priors such as

alpha ~ normal(0,1);
beta ~ normal(0,10);

then you'll find in the posterior, alpha will sum to 0 and
beta will float because it's less constrained. That's our
recommended strategy for dealing with centering. It's not
a hard constraint.
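To make that concrete, here is a minimal sketch of the 1PL model with soft identification through the priors (the rectangular data layout and the variable names are assumptions, not part of the original example):

data {
  int<lower=1> M;                      // students
  int<lower=1> N;                      // questions
  int<lower=0, upper=1> y[M, N];       // responses
}
parameters {
  vector[M] alpha;                     // student abilities
  vector[N] beta;                      // question difficulties
}
model {
  alpha ~ normal(0, 1);                // pins down location and scale (softly)
  beta ~ normal(0, 10);                // much weaker, so beta floats
  for (m in 1:M)
    for (n in 1:N)
      y[m, n] ~ bernoulli(inv_logit(alpha[m] - beta[n]));
}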

An alternative is to pin one of the elements of alpha or
beta, and let everything adjust relative to that. This is
clunkier to code, but possible.
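A rough sketch of that pinning alternative, assuming the same alpha/beta setup (here beta[1] is the element pinned to zero):

parameters {
  vector[M] alpha;
  vector[N - 1] beta_rest;
}
transformed parameters {
  vector[N] beta;
  beta[1] <- 0;
  for (n in 2:N)
    beta[n] <- beta_rest[n - 1];
}
model {
  // ... same likelihood as before, now relative to beta[1] = 0 ...
}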

- Bob

Ben Goodrich

Feb 4, 2015, 9:16:19 PM
to stan-...@googlegroups.com
On Wednesday, February 4, 2015 at 9:06:46 PM UTC-5, Bob Carpenter wrote:
If z is a K-dimensional parameter vector constrained so that

  sum(z) = 0,

then you can code up z as follows:

Is that simplex-based thing somehow preferable to

parameters {
  vector[K-1] z_free;
}
transformed parameters {
  vector[K] z;
  for(k in 1:(K-1) z[k] <- z_free[k];
  z[K] <- -sum(z_free);
}

Ben

Andrew Gelman

Feb 4, 2015, 9:25:45 PM
to stan-...@googlegroups.com
Bob
But in the model below, the alphas won’t sum to 0. They’ll sum to approximately 0. If a user wants alphas that exactly sum to 0, he or she can compute alpha_adj <- alpha - mean(alpha) and these adjusted alphas will sum to 0.
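For instance, a minimal sketch (names hypothetical, assuming the M-vector alpha from the IRT example), with the adjustment done in generated quantities so it has no effect on sampling:

generated quantities {
  vector[M] alpha_adj;
  alpha_adj <- alpha - mean(alpha);   // sums to zero exactly
}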
A

Bob Carpenter

Feb 4, 2015, 10:34:47 PM
to stan-...@googlegroups.com
[re-inserted OP in BCC]

We have to put this all in context. Let's assume
we're talking about parameters for an IRT model.

If the goal is to identify the model by using

> alpha_adj <- alpha - mean(alpha)

and then using alpha_adj in the likelihood rather than alpha,
it just won't work in Stan. Read the problematic posteriors
chapter of the manual for both examples and theory. The synopsis
is that NUTS will run off forever and never turn around if
the parameter alpha isn't identified in the model. It's
not enough for the transformed parameter to be identified.

As Andrew says, using a centered prior on alpha with a constrained
scale relative to beta will only roughly center each posterior draw.
But it will identify the model in the sense Andrew describes in
his regression book with Jennifer.

Also, if you do a penalized MLE, you should get the alpha properly
centered if beta gets an uninformative prior.

alpha ~ normal(0, 1);

beta ~ uniform(-inf, inf);   // i.e., an improper flat prior; in Stan you would just omit the prior statement for beta

This lets the alpha identify the scale and location (in the sense
of "identify" Andrew talks about in his regression book with Jennifer),
while letting beta float. This is the direction Sophia Rabe-Hesketh
recommended because in IRT models the focus is usually on the test
questions (beta), not the students (alpha). Of course, we'd more usually
put a hierarchical prior on beta in Bayesian models:

beta ~ normal(mu_beta, sigma_beta);

- Bob

Andrew Gelman

Feb 4, 2015, 10:40:25 PM
to stan-...@googlegroups.com
Yes, there are two things going on:
1. Using a proper prior so that the posterior is anchored via soft constraints
2. Possibly doing the alpha - mean(alpha) trick to get parameters that exactly sum to 0.
From a Bayesian perspective, step 2 is typically unnecessary (why, after all, would a user be particularly interested in this?) but from the standpoint of compatibility with other software it could be helpful.
A

Bob Carpenter

Feb 4, 2015, 10:49:45 PM
to stan-...@googlegroups.com
Exactly. My only concern is that users not center
in the hope that it will identify translation
invariances in the model. That leaves an improper
posterior in the actual parameters Stan is sampling,
and NUTS will just skate along the ridge.

- Bob

Ben Goodrich

Feb 5, 2015, 12:53:07 AM
to stan-...@googlegroups.com
On Wednesday, February 4, 2015 at 10:46:46 PM UTC-5, Bob Carpenter wrote:
The remaining question in my mind is then what we should be
doing for priors.  Should the prior be on z or z_free?  

With that construction, we are basically saying

f(z | theta) = f(z_free | theta) * f(z[K] | z_free)

and the last term is conditionally deterministic given the sum-to-zero constraint.

So, I think specifying f(z_free | theta) as the prior is usually okay, particularly if you are using multivariate normals. f(z | theta) would have a singular covariance matrix, and the easiest way around that is to go to a full-rank subspace.

http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
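As a sketch of where that prior would go, assuming independent unit normals on the free elements (the scale here is a placeholder, not a recommendation):

data {
  int<lower=2> K;
}
parameters {
  vector[K - 1] z_free;
}
transformed parameters {
  vector[K] z;
  for (k in 1:(K - 1))
    z[k] <- z_free[k];
  z[K] <- -sum(z_free);
}
model {
  z_free ~ normal(0, 1);   // prior on the full-rank subspace
  // ... likelihood written in terms of z ...
}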

Ben

Bob Carpenter

Feb 5, 2015, 1:04:46 AM
to stan-...@googlegroups.com
It makes sense going that way --- it's the usual case
of a derived quantity based on a parameter.

But won't it produce a different prior based on which of
the z[k] is defined as the negative sum of the others?

If you just put the prior on all of z instead of z_free,
then there's some implication of double-prioring
z[1]...z[K-1] and a need to renormalize, and I'm not sure
what the implication is marginally if I say z ~ normal(0,1).
(I could probably figure it out with some effort, and
it'd be easy to investigate computationally.)

- Bob

Ben Goodrich

Feb 5, 2015, 1:17:12 AM
to stan-...@googlegroups.com
On Thursday, February 5, 2015 at 1:04:46 AM UTC-5, Bob Carpenter wrote:
But won't it produce a different prior based on which of
the z[k] is defined as the negative sum of the others?  

If you imagine a KxK covariance matrix that is singular, then yes: depending on which row / column you exclude, the remaining elements of the (K-1)x(K-1) covariance matrix differ. But I don't think the distribution of z changes unless you are doing something informative with the priors that go into the linear predictor.

This paper actually argues for using sum-to-zero constraints in discrete choice models

http://www.burgette.org/sMNP-R0-FINAL.pdf

relative to the usual practice of defining utility relative to a baseline category, largely because which category you decide to call the baseline affects the predictions. We can get a similar posterior in Stan but without the Gibbs sampling.
 
If you just put the prior on all of z instead of z_free,
then there's some implication of double-prioring the
z[1]...z[K-1] and need to renormalize, and I'm not sure
what the implication is marginally if I say z ~ normal(0,1).
(I could probably figure it out with some effort, and
it'd be easy to investigate computationally.)

The Wikipedia link explains how to evaluate a K-dimensional density with a singular covariance matrix, but Stan does not have an easy way for users to compute a generalized inverse or pseudo-determinant.

Ben



andre.p...@googlemail.com

Jul 6, 2015, 1:40:30 AM
to stan-...@googlegroups.com

If z is a K-dimensional parameter vector constrained so that

  sum(z) = 0,

then you can code up z as follows:

Is that simplex-based thing somehow preferable to

parameters {
  vector[K-1] z_free;
}
transformed parameters {
  vector[K] z;
  for(k in 1:(K-1) z[k] <- z_free[k];
  z[K] <- -sum(z_free);
}

In this case z[K] could get far off from z[1..K-1], whereas in the simplex-based version z[K] "behaves" like the others.
However, the one additional parameter in the simplex adds additional noise, and the solution is less optimal.

Andre

Bob Carpenter

Jul 6, 2015, 11:59:51 AM
to stan-...@googlegroups.com

> On Jul 6, 2015, at 1:40 AM, andre.p...@googlemail.com wrote:
>
>
> If z is a K-dimensional parameter vector constrained so that
>
> sum(z) = 0,
>
> then you can code up z as follows:
>
> Is that simplex-based thing somehow preferable to
>
> parameters {
> vector[K-1] z_free;
> }
> transformed parameters {
> vector[K] z;
> for(k in 1:(K-1) z[k] <- z_free[k];
> z[K] <- -sum(z_free);
> }

This is how we recommend doing it, except that
you need a closing paren after (K - 1) or it won't parse.

>
> In this case z[K] could get far off from z[1..k-1], whereas in simplex based z[K] "behave" like the others.
> However, the one additional parameters in the simplex adds additional noise and the solution is less optimal.

It's a different parameterization and the parameterization
of a model matters to Stan (and most samplers including Gibbs
and Metropolis) because it affects the posterior geometry.

This is still a wide-open area of research if anyone with
a bent for the theory of applied computational stats wants to delve
deeper. The Betancourt and Girolami paper on hierarchical priors
is a good role model for what such papers will look like. For
example, we've been playing around with IRT model identification
and that's ripe for a writeup.

- Bob

andre.p...@googlemail.com

Jul 17, 2015, 11:01:00 AM
to stan-...@googlegroups.com
I found a practical solution to this. First, estimate the intercept as a parameter with the K-1 or simplex method;
then, in a second run, exclude the intercept from the parameters, include it as a fixed constant,
and use K parameters. Although this requires two runs, the model behaves stably.
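One reading of that second run as a Stan sketch (the names intercept_fix and z are hypothetical): the intercept estimated in the first run is passed in as data, and all K effects are left free:

data {
  int<lower=2> K;
  real intercept_fix;        // e.g., posterior mean of the intercept from run 1
  // ... other data ...
}
parameters {
  vector[K] z;               // all K effects, no hard sum-to-zero constraint
}
model {
  z ~ normal(0, 1);          // a proper prior is still needed for stability
  // ... likelihood written in terms of intercept_fix + z ...
}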

- Andre

Bob Carpenter

Feb 4, 2015, 10:46:46 PM
to stan-...@googlegroups.com
Neat --- I hadn't thought of doing it that way.
Ben's approach should be much more efficient. I updated
the manual issue so I won't forget it.

The remaining question in my mind is then what we should be
doing for priors. Should the prior be on z or z_free? We had
this discussion before in the context of using K versus (K-1)
coefficient vectors for multinomial logistic regression.

The nice part about doing it on z_free is that it's simple, but
the unpleasant part is that it's asymmetric. And I don't see
how to put a prior on z that's sensible (Ben is much better than
me at this and will probably know what to do).

With the simplex, you could take the simplex prior to be uniform
(or Dirichlet to push toward symmetry or sparsity), and then
put a different prior on the scale.
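A sketch of what that could look like (the Dirichlet concentration and the scale prior here are placeholders, not recommendations):

data {
  int<lower=2> K;
}
parameters {
  simplex[K] z_raw;
  real<lower=0> z_scale;
}
transformed parameters {
  vector[K] z;
  z <- (z_raw - 1.0 / K) * z_scale;      // sums to zero
}
model {
  z_raw ~ dirichlet(rep_vector(2, K));   // rep_vector(1, K) would be uniform on the simplex
  z_scale ~ cauchy(0, 5);                // prior on the overall scale
}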

- Bob

P.S. Ben's code also shows how much I need to get slicing for lvalues
working in the language!