Non-centered parameterization for standard deviations / variances?


Julian King

Jun 6, 2016, 6:10:22 PM
to Stan users mailing list
Hello everyone,

I have a regression model which includes a number of different hierarchical standard deviations for model parameters. 

I have used a non-centered parameterisation for the model parameters, but I am wondering whether it is worth using a non-centered parameterisation for the hierarchical standard deviations as well.

Is this really just as simple as doing the standard non-centered parameterisation but constraining the values to be positive, i.e.
  • declare a bunch of N(0,1) variables but with <lower=0>, e.g. vector<lower=0>[n] stdev_norms, and then
  • set stdev[i] <- stdev_norms[i] * stdev_prior_width (see the sketch below)?
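
Concretely, a minimal Stan sketch of what I have in mind (the names, and the prior on stdev_prior_width, are just placeholders):

data {
  int<lower=1> n;                    // number of scale parameters
}
parameters {
  vector<lower=0>[n] stdev_norms;    // standardised scales
  real<lower=0> stdev_prior_width;   // hierarchical width
}
transformed parameters {
  vector<lower=0>[n] stdev;
  // rescaling: each stdev[i] is half-normal(0, stdev_prior_width)
  stdev = stdev_prior_width * stdev_norms;
}
model {
  // with the <lower=0> constraint this is a half-normal(0, 1) prior
  stdev_norms ~ normal(0, 1);
  // ... plus a prior on stdev_prior_width and the rest of the model
}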


Is there a useful non-centered parameterisation of the gamma distribution that can be used in this case, rather than assuming that the stdev variables have a normal hierarchical prior?


Thanks


Michael Betancourt

Jun 7, 2016, 3:09:05 AM
to stan-...@googlegroups.com
Non-centering works only for location-scale distributions.
The easiest thing to do would be to use a general linear
model, as in 

theta_tilde ~ normal(0, 1);
sigma = exp(mu + tau * theta_tilde);

This will limit you to log-normal distributions.  If you wanted
to use a Gamma distribution then you’d have to figure out
a new transform using the properties of the Gamma distribution.
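
For concreteness, here is one way that idea looks as a complete set of Stan blocks (a sketch only; K, mu and tau are illustrative names for the number of scales and the log-scale hyperparameters):

data {
  int<lower=1> K;             // number of scale parameters
}
parameters {
  real mu;                    // location of log(sigma)
  real<lower=0> tau;          // scale of log(sigma)
  vector[K] theta_tilde;      // non-centered innovations
}
transformed parameters {
  vector<lower=0>[K] sigma;
  // sigma is log-normal(mu, tau) by construction
  sigma = exp(mu + tau * theta_tilde);
}
model {
  theta_tilde ~ normal(0, 1);
  // ... plus priors on mu and tau, and the likelihood
}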

Julian King

Aug 25, 2016, 7:51:14 PM
to Stan users mailing list
OK, I had a crack at this, based on the scaling properties of the gamma distribution.
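
For reference, the scaling property being used: if c ~ gamma(a, b) in the shape-rate parameterisation that Stan uses, then k*c ~ gamma(a, b/k) for k > 0, so hp_alt*gamma(2, 1) has the same distribution as gamma(2, 1/hp_alt).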

I have created two toy data sets (dat1 and dat2), each drawn from N(0, sigma_i), where sigma_1 and sigma_2 have to be learned. sigma_1 and sigma_2 have a hierarchical prior given by gamma(2, 1/hp), where hp is the hyperparameter.

I have created two versions of the model (in one Stan file):
  • a centered parameterisation based on gamma(2, 1/hp), and
  • a non-centered parameterisation based on hp_alt*gamma(2, 1), where hp_alt is the hyperparameter in this alternative specification.


Code for both the Stan model and the R script which calls it is below.


I think that these two examples should give the same answer for hp, but they vary slightly. Why is this? Have I implicitly assumed slightly different priors here?


Results from 15 million iterations are below. Note that hp and hp_alt (the two versions of the same hyperparameter) differ very slightly, and by more than can reasonably be explained by sampling variation.


(sig1 is similar to sig1_alt, and sig2 is similar to sig2_alt, as expected).


            mean se_mean    sd    2.5%     25%     50%     75%   97.5%   n_eff Rhat
hp         2.633   0.001 2.608   0.706   1.337   1.974   3.061   8.501 3131450    1
hp_alt     2.618   0.002 2.648   0.698   1.329   1.963   3.046   8.447 1613607    1
sig1       3.398   0.000 0.820   2.232   2.828   3.257   3.805   5.392 4184178    1
sig2       1.933   0.000 0.489   1.246   1.593   1.846   2.172   3.128 4206156    1
c1_alt     1.897   0.001 1.097   0.400   1.097   1.687   2.468   4.587 2640897    1
c2_alt     1.091   0.000 0.663   0.219   0.612   0.955   1.420   2.746 2686321    1
sig1_alt   3.397   0.000 0.827   2.218   2.818   3.255   3.814   5.400 7500000    1
sig2_alt   1.933   0.000 0.494   1.239   1.588   1.845   2.177   3.135 7500000    1
lp__     -58.512   0.001 1.726 -62.730 -59.425 -58.185 -57.245 -56.152 2659432    1


---


Stan code:


data {
  int n;
  vector[n] dat1;
  vector[n] dat2;
}
parameters {
  real<lower=0> hp;
  real<lower=0> hp_alt;
  real<lower=0> sig1;
  real<lower=0> sig2;
  real<lower=0> c1_alt;
  real<lower=0> c2_alt;
}
transformed parameters {
  real sig1_alt;
  real sig2_alt;

  // non-centered scales: positive multiples of the hyperparameter
  sig1_alt = c1_alt * hp_alt;
  sig2_alt = c2_alt * hp_alt;
}
model {
  // Standard (centered) version
  dat1 ~ normal(0, sig1);
  dat2 ~ normal(0, sig2);

  sig1 ~ gamma(2, 1 / hp);
  sig2 ~ gamma(2, 1 / hp);

  // Alternate (non-centered) version
  dat1 ~ normal(0, sig1_alt);
  dat2 ~ normal(0, sig2_alt);

  c1_alt ~ gamma(2, 1);
  c2_alt ~ gamma(2, 1);
}


R code which calls this:


library(rstan)

set.seed(12345)
rstan_options(auto_write = TRUE)

n <- 10
modegen_1 <- 4
modegen_2 <- 2

gendat_1 <- rnorm(n, 0, modegen_1)
gendat_2 <- rnorm(n, 0, modegen_2)

DatForStan <- list(n = n, dat1 = gendat_1, dat2 = gendat_2)

niter <- 15000000
StanModel <- stan_model(file = 'NCP gamma STAN v01.Stan')
StanFit <- sampling(StanModel, data = DatForStan, iter = niter, chains = 1, refresh = 1000)
print(StanFit, digits = 3)

Bob Carpenter

Aug 26, 2016, 3:21:24 PM
to stan-...@googlegroups.com
15m iterations is a bit of overkill. You are right that the MCMC SE
is tight enough that they should be the same if the models are the same.

I'm all out of steam for algebra, but what you need to do is derive
p(hp | dat1, dat2) and p(hp_alt | dat1, dat2) and make sure
they're the same up to a constant of proportionality that doesn't
depend on the other parameters. I suspect the use of hp in one
context and 1/hp in the other, with no priors on hp, may be the culprit.

- Bob

Andrew Gelman

Aug 26, 2016, 4:04:50 PM
to stan-...@googlegroups.com
When Bob says "15m iterations is a bit of overkill," I think he's using understatement.
100 iterations would probably be enough here, no? Certainly 1000 would do it with no problem.
A

Bob Carpenter

Aug 26, 2016, 6:38:18 PM
to stan-...@googlegroups.com
In this case, Julian wanted a tight bound on the MCMC standard
error to show that the two models were different. So 1K iterations might not've
been enough, but 10K probably would've been.
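
(The relevant arithmetic: se_mean is roughly sd / sqrt(n_eff), so with sd around 2.6 on hp, the se_mean of 0.001-0.002 in the table above corresponds to n_eff in the millions.)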

- Bob

Aaron Goodman

May 17, 2017, 3:33:28 AM
to stan-...@googlegroups.com
I have been running into a similar issue, and was in the middle of coming up with a similar test case when I came upon this thread.

It looks like (up to additive constants, and writing sum(dat^2) for the sum of the squared observations):

log p(hp, sig1, sig2 | dat1, dat2) = -n*log(sig1) - sum(dat1^2)/(2*sig1^2) - n*log(sig2) - sum(dat2^2)/(2*sig2^2) - 2*log(hp) + log(sig1) - sig1/hp - 2*log(hp) + log(sig2) - sig2/hp

log p(hp_alt, c1_alt, c2_alt | dat1, dat2) = -n*log(c1_alt*hp_alt) - sum(dat1^2)/(2*(c1_alt*hp_alt)^2) - n*log(c2_alt*hp_alt) - sum(dat2^2)/(2*(c2_alt*hp_alt)^2) + log(c1_alt) - c1_alt + log(c2_alt) - c2_alt

which I simplify to

log p(hp_alt, c1_alt, c2_alt | dat1, dat2) = -sum(dat1^2)/(2*(c1_alt*hp_alt)^2) - sum(dat2^2)/(2*(c2_alt*hp_alt)^2) - 2*log(hp_alt) - c1_alt - c2_alt

so if I did my math right, the two log posteriors will differ by -n*(log(sig1) + log(sig2)).

So, I believe the reparameterization is equivalent to imposing the gamma(2+n, epsilon) prior from Chung 2015 on the original model?

It seems like this would be the case with normal distributions as well? My understanding from the mailing list was that the 'Matt Trick' was just a computational tool to speed up convergence, but working through this example it seems like it is implicitly putting a weakly informative prior on the scale parameters as well.