Sum-to-zero vector


kbo...@berkeley.edu

Aug 27, 2016, 7:33:25 PM
to Stan users mailing list
Hi,

I've been trying to implement a sum-to-zero parameter vector in Stan. The manual suggests two ways of doing this: the "K-1" method, where the last element is set to the negative sum of the previous ones, and the "scaled simplex" method, where the mean is subtracted from a simplex and the result is then scaled. I've been running into issues with both of these:

When using the simplex method, my models end up being non-identifiable. Fundamentally, a sum-to-zero vector has only K-1 degrees of freedom while this method has K degrees of freedom. For a two dimensional target vector, the simplex (0.4, 0.6) with scale 10 and the simplex (0.3, 0.7) with scale 5 both give the sum-to-zero vector (-1, 1).

With the K-1 method, I find that the last element is privileged relative to the other ones and samples poorly. A sum-to-zero vector is a point in the K-1 dimensional plane that is orthogonal to the (1, 1, 1, ...) vector, and the "basis vectors" of the K-1 method are complete but not orthogonal in this plane.
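To make the non-orthogonality concrete, here is a quick check with the standard basis vectors e_k: under the K-1 method, the k-th free parameter moves T along e_k - e_K, and

$$(e_j - e_K) \cdot (e_k - e_K) = 1 \;\; (j \neq k), \qquad \lVert e_k - e_K \rVert^2 = 2,$$

so the angle between any two of these directions is 60 degrees rather than 90, and the K-th component picks up a contribution from every single free parameter.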

I believe that a better method would be to define an orthogonal basis that maps out this plane. For example, the basis:
[1, -1, 0, 0, 0, ...]
[1/2, 1/2, -1, 0, 0, ...]
[1/3, 1/3, 1/3, -1, 0, ...]
...
(with each vector normalized by its length, sqrt(1/n + 1)) will work. There might be a better way of generating such a basis that I am not aware of. Here is some example code to implement this transformation:

parameters {
    vector[W-1] T_raw;
}
transformed parameters {
    vector[W] T;

    for (w in 1:W-1) {
        T[w] = 0.;
        for (w2 in w:W-1) {
            T[w] = T[w] + T_raw[w2] / w2 / sqrt(1/w2 + 1);
        }
    }
    T[W] = 0.;

    for (w in 1:W-1) {
        T[w+1] = T[w+1] - T_raw[w] / sqrt(1/w + 1);
    }
}

I get better sampling performance with this method compared to either the simplex or K-1 methods. Would it be possible to implement something like this in the Stan language, or to add something similar to the manual?

Thanks,
Kyle

Andrew Gelman

Aug 27, 2016, 7:35:21 PM
to stan-...@googlegroups.com
Kyle:
Can you explain why you want a set of parameters in your model to sum to 0?
A


kbo...@berkeley.edu

Aug 27, 2016, 7:45:55 PM
to Stan users mailing list, gel...@stat.columbia.edu
I'm trying to model differences between a set of astronomical spectra taken a given number of days apart. My model is that the difference will behave something like:

f2(n, wavelength) = f1(n, wavelength) + delta_t(n) * vector(wavelength)

However, the overall scale of the spectra is not known, and I need to include a constant offset between them:

f2(n, wavelength) = f1(n, wavelength) + delta_t(n) * vector(wavelength) + offset(n)

For my model to be identified, I need to impose some sort of a constraint on the vector. A sum-to-zero constraint makes the most sense.
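To be concrete, the model I have in mind looks roughly like the sketch below (the variable names and the normal likelihood are placeholders; T is the vector I want to constrain to sum to zero, written here without any constraint):

data {
    int<lower=1> N;              // number of spectrum pairs
    int<lower=1> W;              // number of wavelength bins
    matrix[N, W] f1;
    matrix[N, W] f2;
    vector[N] delta_t;           // days between the spectra in each pair
}
parameters {
    vector[W] T;                 // would be replaced by a sum-to-zero construction
    vector[N] offset;
    real<lower=0> sigma;
}
model {
    for (n in 1:N)
        f2[n] ~ normal(f1[n] + delta_t[n] * T' + offset[n], sigma);
}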

Kyle

Andrew Gelman

Aug 27, 2016, 7:53:53 PM
to kbo...@berkeley.edu, Stan users mailing list
If you put a prior on them, you should not need the sum-to-zero constraint.  I think what's happening is that you're making your life unnecessarily difficult by importing a trick (the hard constraint) from classical inference, which you will not need in Bayesian inference.  It's also an example of the folk theorem, I think.
A

Bob Carpenter

Aug 27, 2016, 9:08:19 PM
to stan-...@googlegroups.com

> On Aug 28, 2016, at 1:33 AM, kbo...@berkeley.edu wrote:
>
> With the K-1 method, I find that the last element is privileged relative to the other ones and samples poorly.

You mean just the K-th component has worse R-hat or
lower n_eff?

> A sum-to-zero vector is a point in the K-1 dimensional plane that is orthogonal to the (1, 1, 1, ...) vector, and the "basis vectors" of the K-1 method are complete but not orthogonal in this plane.
>
> I believe that a better method would be to define an orthogonal basis that maps out this plane. For example, the basis:
> [1, -1, 0, 0, 0, ...]
> [1/2, 1/2, -1, 0, 0, ...]
> [1/3, 1/3, 1/3, -1, 0, ...]
> ...
> (with normalization of sqrt(1/n + 1)) will work. There might be a better way of generating such a basis that I am not aware of. Here is some example code to implement this transformation:
>
> parameters {
>     vector[W-1] T_raw;
> }
> transformed parameters {
>     vector[W] T;
>
>     for (w in 1:W-1) {
>         T[w] = 0.;
>         for (w2 in w:W-1) {
>             T[w] = T[w] + T_raw[w2] / w2 / sqrt(1/w2 + 1);
>         }
>     }
>     T[W] = 0.;
>
>     for (w in 1:W-1) {
>         T[w+1] = T[w+1] - T_raw[w] / sqrt(1/w + 1);
>     }
> }

> I get better sampling performance with this method compared to either the simplex or K-1 methods. Would it be possible to implement something like this into the Stan language, or have something similar in the manual?

We've thought about implementing a sum-to-zero vector type in
Stan but haven't gotten around to it. We would've gone with
the K-1 vector, with the last element being the negative sum
of the first K - 1. So I'm curious as to why your approach
works better, at least for your problem.

One issue we have with all these parameterizations is how
to put priors on the elements, as we don't really want to
put priors on all K elements with only K-1 degrees of freedom,
but putting them only on K-1 has that asymmetry problem.

We run into similar problems with multilevel models, where
we have K groups, but usually use only (K - 1) non-zero parameters
in order to have identifiability. Or when we have a multi-logit
regression with K categories in the output and only K-1
coefficient vectors. We've talked about this one with Andrew,
too, and he's said the same thing about asymmetry, but we've
never come up with a solution that mixes well and has symmetry
for the priors.

For the unit vectors, we actually take K underlying parameters,
but then put a prior on them for identification. It seems like
we might be able to do something similar here.
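For instance (just a sketch of what "something similar" might
look like, not a worked-out proposal): take K raw parameters,
center them, and let the prior on the raw parameters pin down
the otherwise non-identified overall level.

data {
  int<lower=2> K;
}
parameters {
  vector[K] alpha_raw;
}
transformed parameters {
  vector[K] alpha;
  alpha = alpha_raw - mean(alpha_raw);  // sums to zero by construction
}
model {
  alpha_raw ~ normal(0, 1);  // pins down the level of alpha_raw
}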

- Bob

kbo...@berkeley.edu

Aug 28, 2016, 12:54:35 PM
to Stan users mailing list
Thanks for all of your suggestions and replies!

I have been trying out the different variants that you suggested.

If I define a length K vector and put priors on its components pulling them toward zero, then it kind of works in low dimensions (K ~ 3) but breaks in the higher dimensions that I'm working in (K ~ 300). n_eff and Rhat both get significantly worse as the number of dimensions increases, from n_eff ~ 100 and Rhat ~ 1.01 in 3 dimensions to n_eff ~ 4 and Rhat ~ 1.7 in 300 dimensions (with iter=1000, chains=4). It does seem like this should make the model identifiable, so I'm a bit surprised that it doesn't work. Is there a better way to implement the sum-to-zero prior than something like T ~ normal(0, 1)?
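Concretely, the soft version I tried looks roughly like this (just a sketch, with the likelihood omitted):

data {
    int<lower=1> K;
}
parameters {
    vector[K] T;            // full length-K vector, no hard constraint
}
model {
    T ~ normal(0, 1);       // soft identification through the prior only
    // ... likelihood as in the full model ...
}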

I played with the K-1 vector method and it does seem to work reasonably well assuming that I specifically choose initial parameters that give the last vector element a reasonable starting value. With specifically chosen initial parameters, I get n_eff ~ 688 and Rhat=1 for my model with K=300.

I think that the method that I suggested in my first message does have symmetry for the priors assuming that you want the same prior on every element. Putting a normal(0, 1) prior on the T_raw vector seems to be equivalent to putting a normal(0, sqrt((W-1)/W)) prior on each of the elements of the T vector with a sum-to-zero constraint. I don't think that this would work for a prior that is not symmetric around 0.
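For the record, the algebra behind that: writing the transform as T = Q * T_raw, where Q is the W x (W-1) matrix whose columns are the orthonormal basis vectors above, a normal(0, 1) prior on T_raw gives

$$\operatorname{Cov}(T) = Q Q^\top = I_W - \tfrac{1}{W}\mathbf{1}\mathbf{1}^\top, \qquad \operatorname{Var}(T_w) = 1 - \tfrac{1}{W} = \tfrac{W-1}{W},$$

since Q Q^T is just the orthogonal projection onto the plane orthogonal to (1, 1, ..., 1).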

Kyle

Bob Carpenter

Aug 28, 2016, 1:28:20 PM
to stan-...@googlegroups.com
We've had the same problem with trying to identify via the
priors. It's technically identified for the MLE, but it doesn't
sample well because nothing requires the draws to sum to zero.

In our simplex transform, which has the same asymmetry as the sum-to-one
construction, it's set up so that a zero init on the unconstrained space
leads to a uniform init on the constrained space. But we tend not to
use very high-dimensional simplexes. Our inits are uniform(-2, 2) by
default.

The sum-to-zero construction where the K-th element is

alpha[1:(K - 1)] = alpha_raw;
alpha[K] = -sum(alpha_raw);

will also have initial values of alpha = 0 for init alpha_raw = 0.
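Written out as a minimal complete program (with the prior
discussed just below), that construction would be:

data {
  int<lower=2> K;
}
parameters {
  vector[K - 1] alpha_raw;
}
transformed parameters {
  vector[K] alpha;
  alpha[1:(K - 1)] = alpha_raw;
  alpha[K] = -sum(alpha_raw);
}
model {
  alpha_raw ~ normal(0, 1);
}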

Thanks for the tip about your prior. For the sum-to-zero construction
in the manual, putting

alpha_raw ~ normal(0, 1)

(assuming independent normals per entry), implies

alpha[1:(K - 1)] ~ normal(0, 1)

but

alpha[K] ~ normal(0, sqrt(K - 1))

so it's definitely not symmetric.

- Bob

Andrew Gelman

Aug 28, 2016, 5:55:19 PM
to stan-...@googlegroups.com
It's up to you, but given your problem description it's pretty clear to me that you should _not_ be defining your parameters to sum to 0.  It's unnecessary and is just slowing you down in various ways.
A

Andrew Gelman

Aug 28, 2016, 5:56:29 PM
to stan-...@googlegroups.com
Again, nothing wrong with attacking the technical problem. But, as with many such problems, I don't think it's real.
A

Bob Carpenter

Aug 28, 2016, 6:14:20 PM
to stan-...@googlegroups.com
To use your own folk theorem analogy, the sum-to-zero
constraint will help with mixing. Having only soft identifiability
from priors makes the model hard to fit. That's the discussion
we were having about the K-1 coefficient vector vs. the K
coefficient vector for multi-logit.

- Bob

Andrew Gelman

Aug 28, 2016, 6:16:41 PM
to stan-...@googlegroups.com
I'm guessing that the best approach is not a hard sum-to-zero constraint but a soft constraint from priors. But maybe not as soft as the default. One could, for example, put a prior on the sum, something like normal(0, 1), maybe? Just to keep everything stable.
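In Stan that might be as simple as the following sketch (the element-wise prior and the scale on the sum are just placeholders):

data {
  int<lower=2> K;
}
parameters {
  vector[K] alpha;             // full-length, no hard constraint
}
model {
  alpha ~ normal(0, 1);        // whatever prior the elements get
  sum(alpha) ~ normal(0, 1);   // soft constraint pulling the sum toward zero
  // (Stan may warn about the transformed left-hand side; the transform
  // is linear, so no Jacobian adjustment is needed)
}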

Bob Carpenter

Aug 28, 2016, 6:21:21 PM
to stan-...@googlegroups.com
Kyle was reporting that the soft constraint didn't
mix well, even with a normal(0, 1) prior.

We should talk about this in a meeting when we're
all back. Don't need to flood the users list with
this discussion until we have something more definite
to say.

- Bob

andre.p...@googlemail.com

Aug 29, 2016, 10:57:21 AM
to Stan users mailing list
Bob,

> Kyle was reporting that the soft constraint didn't
> mix well, even with a normal(0, 1) prior.

I ran into the same thing in Stan, but not in JAGS.
Maybe this explains Andrew's experience.

André

Bob Carpenter

Aug 29, 2016, 11:24:23 AM
to stan-...@googlegroups.com
Do you have specific examples? All of our trials show Stan
outperforming JAGS on these kinds of problems as dimensionality
increases (we do lots of IRT models where this issue comes up).
See, for example:

https://arxiv.org/abs/1601.03443

which compares Stan to Stata and JAGS under favorable conditions
for them (such as conjugate priors for JAGS and centered parameterization
for Stata). We're revising a journal version of this paper that
should have much tighter comparisons, but the result's still the same.

- Bob

andre.p...@googlemail.com

Aug 29, 2016, 8:59:29 PM
to stan-...@googlegroups.com
Bob,

I did experiments with a GLM with a latent structure. JAGS has
the GLM module, whose manual says there is no need to subtract
the mean, and that worked out for me. Stan usually gets a higher
number of effective samples, all true. The JAGS model, however,
shows a 'better' looic.

It's somewhere in the dozens of variants I wrote.
I cannot provide an example for now;
it's on my worklist to extract the "effect".
Sorry, sorry.

Andre

Andrew Gelman

Aug 29, 2016, 9:05:34 PM
to stan-...@googlegroups.com
Best measure, I think, is clock time divided by n_eff, once the simulations have been run long enough for convergence.


andre.p...@googlemail.com

Aug 30, 2016, 2:13:00 AM
to Stan users mailing list, gel...@stat.columbia.edu
The best measure, I think: once the simulation has run long enough, every parameter of interest should share
the same distribution of its difference from its likelihood optimization point.

Andre

Bob Carpenter

Aug 30, 2016, 9:52:39 AM
to stan-...@googlegroups.com
I don't see how maximum likelihood is relevant here.
But yes, all the intervals are what we're really after,
or even better, the full joint posterior. That's harder
to measure as you go out into the tails.

Basically, all of our existing convergence diagnostics
are based on expectations. We just compare multiple MCMC
chains to each other using R-hat. Our n_eff calculations
for Stan also build in cross-chain convergence adjustments,
which is why n_eff is the one thing to look at (assuming it's
not so low per iteration that we don't trust the estimates).

- Bob

andre.p...@googlemail.com

Sep 5, 2016, 2:56:58 AM
to Stan users mailing list
library(rstan)

mc <- '
functions {
  vector kyles_simplex_orthogonal_base(vector T_raw) {
    vector[num_elements(T_raw) + 1] T;
    int W;

    W = num_elements(T_raw) + 1;

    for (w in 1:W-1) {
      T[w] = 0.;
      for (w2 in w:W-1) {
        T[w] = T[w] + T_raw[w2] / w2 / sqrt(1. / w2 + 1);
      }
    }
    T[W] = 0.;

    for (w in 1:W-1) {
      T[w+1] = T[w+1] - T_raw[w] / sqrt(1. / w + 1);
    }
    return T / sqrt((W - 1.0) / W);
  }

  matrix kyles_simplex_orthogonal_base_matrix(int N) {
    vector[N] T;
    matrix[N+1, N] M;

    for (i in 1:N) {
      for (j in 1:N)
        T[j] = 0.0;
      T[i] = 1.0;
      M[, i] = kyles_simplex_orthogonal_base(T);
    }
    return M;
  }
}
model {}
'

cppcode <- stanc(model_code = mc, model_name = "kyles_orthogonal_simplex")
expose_stan_functions(cppcode)

# standard deviation of each element of the transformed vector is roughly 1
m <- matrix(rnorm(4000), 1000, 4)
apply(m %*% t(kyles_simplex_orthogonal_base_matrix(4)), 2, sd)
#[1] 0.9893244 0.9679589 1.0389636 1.0222200 0.9503592

# each transformed draw sums to zero (row means are zero up to rounding)
m <- matrix(rnorm(4000), 1000, 4)
apply(m %*% t(kyles_simplex_orthogonal_base_matrix(4)), 1, mean)
cov(m %*% t(kyles_simplex_orthogonal_base_matrix(4)))

Changed relative to Kyle's original:
- integer division: 1/w, 1/w2 -> 1./w, 1./w2
- rescaling by / sqrt((W-1.0)/W) to restore unit variance per element

Andre

andre.p...@googlemail.com

Sep 5, 2016, 9:35:16 PM
to Stan users mailing list
Yesterday was not my day. Please change

  return T / sqrt((W-1.0)/W);

to

  return T;

The reduced variance is due to the multivariate (sum-to-zero) nature, so it is fine as it is.

I did a test with 150+ sum-to-zero parameters using Kyle's basis: the looic went up slightly, and the run time of the model increased by a factor of 8 compared to modeling the same thing with normal(0, sigma) and no sum-to-zero constraint.

Maybe that's due to the nature of my model. Kyle's basis is for sure not a default option.

Andre


