Stan and CmdStan v2.6.0 released


Daniel Lee

Feb 5, 2015, 5:13:25 PM
to stan-...@googlegroups.com
Stan and CmdStan v2.6.0 are out now. Download links can be found on the home page:

  http://mc-stan.org

If you have any trouble, please contact us on the stan-users group. The other interfaces will be released soon.


- Stan Dev Team






v2.6.0 (5 February 2015)
======================================================================

New Features
----------------------------------------
* log_mix function for binary mixtures [half of #970]
* neg_binomial_2 CDF function [#1129]

Language Enhancements
----------------------------------------
* allow local-variable integers to be treated as constants, which,
  among other things, enables more general ODE functions [#1131]
* initialize transformed data and generated quantities the
  same way as transformed parameters [#1099]
* libstan.a removed and model compilation is 100% header-only [#1095]
* check for infinite/nan inits from boundary initializations
  with warning [#537]

API/Build/Test Enhancements
----------------------------------------
* removed extra variables being generated by validation [#1248]
* generalize OperandsAndPartials to second order [#676]
* rationalize error handling, using const char* args for efficiency [#1234]
* upgrade to Boost version 1.55 library [#1206]
* generate .hpp instead of .cpp for models [#803]
* value_of_rec recursive metaprogram to pull values out of higher-order
  autodiff structures; removes order dependency on fwd/rev includes
  [#1232]
* tests now run more efficiently with Python script [#1110]
* add performance tests to continuous integration to perform
  end-to-end regression testing on speed [#708, #1245]
* manual index parser for tool automation in interfaces (such as
  auto-complete suggestions) and emacs mode
* refactor log determinant and fix return type cast in vector arena
  alloc  [#1187]
* update makefile dependencies for auto-generated cpp test files [#1108]
* move test models into clearly labeled directories (good/bad) [#1016]
* remove policies from math error checking [#1122]
* remove usage of Boost's formatting (which can cause runtime
  errors like printf) [#1103]
* rearrange directory structure of err_handling [#1102]
* clean up math includes to proper location in source tree [#778]
* remove Windows newline and write script to catch in future [#1233]
* remove extra copy of Eigen (3.2.0), leaving just Eigen 3.2.2 [#1206]
* remove example-models dependency from Git [#1105]


Bug Fixes
----------------------------------------
* allow identifiers with prefixes matching keywords [#1177]
* allow functions to be used in variable declaration constraints [#1151]
* fix segfault resulting from multivariate normal in optimizer (root
  cause wasn't in optimizer, but in autodiff nesting cleanup) [#1200]
* fixed return type in language and C++ for append_row of two
  column vectors and append_col of two row vectors [#1241]
* fixed derivative error for pareto_type_2_log() [#1223]
* remove unused models from stan distribution (they're now in
  the stan-dev/example-models repo on GitHub) [#1249]
* squash compiler warnings and fix windows tests (mostly
  signed vs. unsigned, as usual) [#1257]
* fix memory leak in ODE solver [#1160]
* fix overflow in gamma_p function to throw instead [#674]
* make sure multiply_lower_tri_self_transpose returns a symmetric
  matrix [#1121]
* fix overflow in Poisson RNG to throw instead [#1053]


Documentation
----------------------------------------
* manual updated for 2.6 [#1081]
  - new chapter on software process (thanks to Sebastian Weber
    and Tony Rossini for help)
  - new chapter on sparse and ragged arrays
  - pointers to Julia and MATLAB interfaces (yay!)
  - vectorization of CDFs described
  - fix priors on IRT models
  - added discussion of multiple change points
  - remove range-constrained scale priors
  - clarified fixed parameter call
  - remove references to "std::vector" in favor of "array"
  - corrected signs for lasso and ridge and discuss truncated gradient
    and absolute zero results
  - extended discussion of Gaussian process priors (thanks
    to Aki Vehtari, Jon Zelner, and Herra Huu)
  - remove bogus paths to models and replace with pointers to
    example-models repo
  - clarified Wishart/inverse Wishart parameterizations w.r.t. BDA3
  - fixed exp_mod_normal definition
  - fix student-t reparameterization
  - fix hurdle distribution definition

Thanks!
----------------------------------------
Thanks to all the users who contributed code and doc corrections
and comments: Alex Zvoleff, Juan Sebastián Casallas, Gökçen
Eraslan, seldomworks [GitHub handle], Avraham Adler, Sebastian
Weber, Amos Waterland, David Hallvig, Howard Zail, Andre Pfeuffer,
Bobby Jacob, Cody Ross, Krzysztof Sakrejda, Andrew Ellis, John Sutton


Bob Carpenter

Feb 5, 2015, 6:33:47 PM
to stan-...@googlegroups.com
I missed this in the release notes:

* fixes memory leak in multivariate normal

It was taken care of as a side-effect of fixing our autodiff
stacks (that is, the bug wasn't in multivariate normal but
lower level).

I'd also like to officially welcome our newest core developer, Rob Trangucci.
He's behind the new log_mix function:

log_mix(theta, lp1, lp2) =def= log(theta * exp(lp1) + (1-theta) * exp(lp2))

It has the log_sum_exp behavior built in. There's a vector-based version
in the works, too.
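For intuition, here's a rough R sketch of the stable computation (illustrative only, not the actual C++ implementation in Stan):

log_mix_r <- function(theta, lp1, lp2) {
  # log(theta * exp(lp1) + (1 - theta) * exp(lp2)),
  # computed via the log-sum-exp trick to avoid underflow
  a <- log(theta) + lp1
  b <- log1p(-theta) + lp2
  m <- max(a, b)
  m + log(exp(a - m) + exp(b - m))
}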

Rob also sped up autodiff (and hence sampling and optimization) by 5--10% by
tightening up the argument validation routines.

- Bob

Rob Trangucci

Feb 5, 2015, 7:18:15 PM
to stan-...@googlegroups.com
Thanks for the introduction, Bob. I wouldn't have been able to do much without all of your and Daniel's help running up the steep learning curve, especially on the log_mix implementation. Excited to be a part of the team!


Rob

Mike Lawrence

Feb 5, 2015, 9:04:13 PM
to stan-...@googlegroups.com
Exciting! Is there an example model using log_mix anywhere?

Rob Trangucci

Feb 6, 2015, 2:51:31 PM
to stan-...@googlegroups.com
Sure, here's a simple mixture of normals we used to test the function:

data {
  int<lower = 0> N;
  vector[N] y;
}
parameters {
  real<lower = 0, upper = 1> theta;
  vector<lower = 0>[2] sigmas;
  vector[2] means;
}
model {
  theta ~ beta(6,4);

  sigmas ~ normal(1,2);

  means[1] ~ normal(10,1);
  means[2] ~ normal(-1,1);


  for(i in 1:N)
    increment_log_prob(log_mix(theta,normal_log(y[i],means[1],sigmas[1]),
                              normal_log(y[i],means[2],sigmas[2])));
}
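If you want to run it, test data can be simulated along these lines (an illustrative sketch; the true values are hypothetical ones roughly matching the priors above):

# simulate a two-component normal mixture
set.seed(123)
N <- 1000
z <- rbinom(N, 1, 0.4)          # component indicators, weight 0.4 on component 1
y <- ifelse(z == 1,
            rnorm(N, 10, 1),    # component 1: mean 10, sd 1
            rnorm(N, -1, 2))    # component 2: mean -1, sd 2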


Rob

Steve

Feb 9, 2015, 9:34:44 PM
to stan-...@googlegroups.com
This new function is very helpful. I tried the code below and it worked, but the traceplot shows label switching: 2 chains with ordering (1,2) vs. 2 chains with (2,1). One would prefer the means (-1, 10) to appear in the summary instead of 4.59 and 4.51.

Call: fit.log.mix = stan("log_mix_test.stan",data=list(N=N,y=y),chains=4,iter=1000)

Is there a way to have print.stanfit summarize only a user-chosen subset of the chains? (e.g., something like print(fit.log.mix, chains=c(1,3)))

Using print(fit.log.mix) gives:

> print(fit.log.mix)
Inference for Stan model: log_mix_test.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
theta         0.50    0.08  0.12     0.36     0.38     0.50     0.62     0.65     2  8.10
sigmas[1]     1.55    0.38  0.54     0.97     1.01     1.48     2.08     2.20     2 10.43
sigmas[2]     1.55    0.38  0.54     0.97     1.01     1.48     2.08     2.21     2  9.82
means[1]      4.59    3.86  5.46    -1.04    -0.88     4.69    10.05    10.12     2 71.46
means[2]      4.51    3.90  5.52    -1.20    -1.00     4.65    10.03    10.09     2 68.10
lp__      -2438.18   42.87 60.67 -2501.80 -2498.55 -2442.36 -2377.26 -2375.40     2 39.19
 

Samples were drawn using NUTS(diag_e) at Tue Feb 10 09:37:25 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.6.0   inline_0.3.13 Rcpp_0.11.2  

loaded via a namespace (and not attached):
[1] codetools_0.2-8 stats4_3.1.1    tools_3.1.1    

Ben Goodrich

Feb 9, 2015, 10:09:33 PM
to stan-...@googlegroups.com
On Monday, February 9, 2015 at 9:34:44 PM UTC-5, Steve wrote:
Is there a way to have print.stanfit summarize only a user-chosen subset of the chains? (e.g., something like print(fit.log.mix, chains=c(1,3)))

monitor(as.array(stanfit_object)[,c(1,3),,drop=FALSE])

Ben

Steve

Feb 9, 2015, 10:21:36 PM
to stan-...@googlegroups.com
Thanks - that worked perfectly. After a few tries, the "right" grouping of chains turned out to be (1,2) and (3,4). See the label switch for the means below.

> monitor(as.array(fit.log.mix)[,c(1,2),,drop=FALSE])
Inference for the input samples (2 chains: each with iter=500; warmup=250):

             mean se_mean  sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
theta         0.4     0.0 0.0     0.4     0.4     0.4     0.4     0.4   367    1
sigmas[1]     2.1     0.0 0.1     1.9     2.0     2.1     2.1     2.2   485    1
sigmas[2]     1.0     0.0 0.0     1.0     1.0     1.0     1.0     1.1   314    1
means[1]     -0.9     0.0 0.1    -1.1    -1.0    -0.9    -0.8    -0.7   376    1
means[2]     10.0     0.0 0.0    10.0    10.0    10.0    10.1    10.1   363    1
lp__      -2498.8     0.1 1.4 -2502.4 -2499.6 -2498.6 -2497.7 -2496.9   228    1

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> monitor(as.array(fit.log.mix)[,c(3,4),,drop=FALSE])
Inference for the input samples (2 chains: each with iter=500; warmup=250):

             mean se_mean  sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
theta         0.6     0.0 0.0     0.6     0.6     0.6     0.6     0.7   325    1
sigmas[1]     1.0     0.0 0.0     1.0     1.0     1.0     1.0     1.1   262    1
sigmas[2]     2.1     0.0 0.1     1.9     2.0     2.1     2.1     2.2   405    1
means[1]     10.1     0.0 0.0    10.0    10.0    10.1    10.1    10.1   373    1
means[2]     -1.0     0.0 0.1    -1.2    -1.1    -1.0    -0.9    -0.8   278    1
lp__      -2377.6     0.1 1.7 -2381.5 -2378.5 -2377.4 -2376.3 -2375.3   162    1

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Bob Carpenter

Feb 11, 2015, 12:31:50 AM
to stan-...@googlegroups.com
It's odd that lp__ is so different. Are those literally the
same models? Index switching shouldn't affect the log density.

- Bob

Steve

Feb 11, 2015, 1:28:27 PM
to stan-...@googlegroups.com
Yes, the same model from the call in my first post. Four chains ran, with iter = 1000 and warmup = 500 (by default). The summaries were then produced by the code as stated, run directly after the print(fit.log.mix) and sessionInfo() statements.

And I'm puzzled by the "iter=500; warmup=250" in the monitor output. A related (positive) note: the se_mean for lp__ is much smaller for each chain subset (1,2) and (3,4), at 0.1, vs. 42.87 for all four chains.

Bob Carpenter

Feb 11, 2015, 2:05:52 PM
to stan-...@googlegroups.com

> On Feb 11, 2015, at 1:28 PM, Steve <ssk...@gmail.com> wrote:
>
> Yes, same model from the call in my first post. Four chains ran, with iter = 1000, warmup = 500 (by default). Then summaries produced by the code as stated, ran directly after print(fit.log.mix); sessionInfo(); statements.
>
> And I'm puzzled by the "iter=500; warmup=250" from the monitor statements. A related (positive) note is that the se_mean for lp__ is much smaller for each subset (1,2) and (3,4) of chains at 0.1 vs 42.87 for all 4 chains.

It doesn't look like the output below is from a run with iter=1000, warmup=500.
It looks half that size.

I still don't understand why the lp__ would be different. I'm thinking
they might not have been run with the same model, especially given that
the iterations are off.

- Bob

Steve

Feb 11, 2015, 2:39:07 PM
to stan-...@googlegroups.com
Bob:

I'm not sure of the best way to address this, but here is a copy of the statements and output that produced these summaries. I'm not sure that the "monitor" code does exactly what print(stan.fit) does, but it did seem to extract and summarize (half of) the right chains. I don't know why only half, though.

Steve



> N
[1] 1000
> length(y)
[1] 1000
> fit.log.mix = stan("log_mix_test.stan",data=list(N=N,y=y),chains=4,iter=1000)

TRANSLATING MODEL 'log_mix_test' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'log_mix_test' NOW.

SAMPLING FOR MODEL 'log_mix_test' NOW (CHAIN 1).

Iteration:   1 / 1000 [  0%]  (Warmup)
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 501 / 1000 [ 50%]  (Sampling)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 0.994405 seconds (Warm-up)
#                0.823456 seconds (Sampling)
#                1.81786 seconds (Total)


SAMPLING FOR MODEL 'log_mix_test' NOW (CHAIN 2).

Iteration:   1 / 1000 [  0%]  (Warmup)
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
stan::prob::normal_log(N4stan5agrad3varE): Scale parameter is 0:0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine, but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 501 / 1000 [ 50%]  (Sampling)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 0.974067 seconds (Warm-up)
#                0.829531 seconds (Sampling)
#                1.8036 seconds (Total)


SAMPLING FOR MODEL 'log_mix_test' NOW (CHAIN 3).

Iteration:   1 / 1000 [  0%]  (Warmup)
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 501 / 1000 [ 50%]  (Sampling)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 0.965403 seconds (Warm-up)
#                0.695382 seconds (Sampling)
#                1.66078 seconds (Total)


SAMPLING FOR MODEL 'log_mix_test' NOW (CHAIN 4).

Iteration:   1 / 1000 [  0%]  (Warmup)
Iteration: 100 / 1000 [ 10%]  (Warmup)
Iteration: 200 / 1000 [ 20%]  (Warmup)
Iteration: 300 / 1000 [ 30%]  (Warmup)
Iteration: 400 / 1000 [ 40%]  (Warmup)
Iteration: 500 / 1000 [ 50%]  (Warmup)
Iteration: 501 / 1000 [ 50%]  (Sampling)
Iteration: 600 / 1000 [ 60%]  (Sampling)
Iteration: 700 / 1000 [ 70%]  (Sampling)
Iteration: 800 / 1000 [ 80%]  (Sampling)
Iteration: 900 / 1000 [ 90%]  (Sampling)
Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 0.79934 seconds (Warm-up)
#                0.771192 seconds (Sampling)
#                1.57053 seconds (Total)

> print(fit.log.mix)
Inference for Stan model: log_mix_test.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
theta         0.50    0.08  0.12     0.36     0.38     0.50     0.62     0.65     2  8.10
sigmas[1]     1.55    0.38  0.54     0.97     1.01     1.48     2.08     2.20     2 10.43
sigmas[2]     1.55    0.38  0.54     0.97     1.01     1.48     2.08     2.21     2  9.82
means[1]      4.59    3.86  5.46    -1.04    -0.88     4.69    10.05    10.12     2 71.46
means[2]      4.51    3.90  5.52    -1.20    -1.00     4.65    10.03    10.09     2 68.10
lp__      -2438.18   42.87 60.67 -2501.80 -2498.55 -2442.36 -2377.26 -2375.40     2 39.19

Samples were drawn using NUTS(diag_e) at Tue Feb 10 09:37:25 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.6.0   inline_0.3.13 Rcpp_0.11.2  

loaded via a namespace (and not attached):
[1] codetools_0.2-8 stats4_3.1.1    tools_3.1.1    
> monitor(as.array(fit.log.mix)[,c(1,3),,drop=FALSE])
Inference for the input samples (2 chains: each with iter=500; warmup=250):

             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
theta         0.5     0.1  0.1     0.4     0.4     0.5     0.6     0.6     1  9.0
sigmas[1]     1.5     0.5  0.5     1.0     1.0     1.5     2.1     2.2     1 11.5
sigmas[2]     1.5     0.5  0.5     1.0     1.0     1.5     2.1     2.2     1 11.3
means[1]      4.6     5.5  5.5    -1.0    -0.9     4.7    10.1    10.1     1 82.5
means[2]      4.5     5.5  5.5    -1.2    -1.0     4.7    10.0    10.1     1 69.0
lp__      -2438.2    60.6 60.6 -2501.3 -2498.5 -2440.9 -2377.4 -2375.4     1 44.2

### end of R code/output


STAN code in log_mix_test.stan

data {
  int<lower = 0> N;
  vector[N] y;
}
parameters {
  real<lower = 0, upper = 1> theta;
  vector<lower = 0>[2] sigmas;
  vector[2] means;
}
model {
  theta ~ beta(6,4);

  sigmas ~ normal(1,2);

  means[1] ~ normal(10,1);
  means[2] ~ normal(-1,1);


  for(i in 1:N)
    increment_log_prob(log_mix(theta,normal_log(y[i],means[1],sigmas[1]),
                              normal_log(y[i],means[2],sigmas[2])));
}





Jiqiang Guo

Feb 11, 2015, 3:49:43 PM
to stan-...@googlegroups.com
On Wed, Feb 11, 2015 at 11:39:07AM -0800, Steve wrote:
> Bob:
>
> I'm not sure the best way to address this, but here is a copy of the
> statements and output that produced these summaries. I'm not sure that the
> "monitor" code does exactly what a print(stan.fit) does, but it did seem to
> extract and summarize (half of) the right chains. But I don't know why only
> half.

as.array extracts the chains after warmup, that is, the second half by default.
monitor has a warmup argument that defaults to half the length of the chain.
If you apply monitor to samples extracted from a stanfit object, warmup should
be set to zero. These are all documented.
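For example, to summarize all post-warmup draws of chains 1 and 2, something like (as.array has already dropped the warmup draws, so warmup is set to zero):

  monitor(as.array(fit.log.mix)[, c(1, 2), , drop = FALSE], warmup = 0)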

Jiqiang

Steve

Feb 12, 2015, 9:34:43 AM
to stan-...@googlegroups.com
I was looking at the ?as.array help file - the right help file is ?as.array.stanfit, which specifies that the non-warmup samples are coerced to an array.

To get closer to my original request for an equivalent of print(stanfit.object, chains=c(1,2)), one can borrow from Ben and use monitor(extract(fit.log.mix,permuted=F,inc_warmup=T)[,c(1,2),,drop=F],digits=2).

> monitor(extract(fit.log.mix,permuted=F,inc_warmup=T)[,c(1,2),,drop=F],digits=2)
 
Inference for the input samples (2 chains: each with iter=1000; warmup=500):

              mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
theta         0.38    0.00 0.01     0.36     0.37     0.38     0.39     0.41   747    1
sigmas[1]     2.08    0.00 0.07     1.94     2.03     2.08     2.13     2.22   824    1
sigmas[2]     1.01    0.00 0.03     0.96     0.99     1.01     1.03     1.07   620    1
means[1]     -0.87    0.00 0.11    -1.08    -0.95    -0.88    -0.80    -0.66   715    1
means[2]     10.03    0.00 0.04     9.95    10.00    10.03    10.05    10.10   765    1
lp__      -2498.81    0.07 1.51 -2502.71 -2499.52 -2498.55 -2497.73 -2496.87   433    1

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
 
> monitor(extract(fit.log.mix,permuted=F,inc_warmup=T)[,c(3,4),,drop=F],digits=2)
 
Inference for the input samples (2 chains: each with iter=1000; warmup=500):

              mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
theta         0.62     0.0 0.02     0.59     0.61     0.62     0.63     0.65   697 1.01
sigmas[1]     1.01     0.0 0.03     0.96     1.00     1.01     1.03     1.07   669 1.01
sigmas[2]     2.08     0.0 0.08     1.93     2.02     2.08     2.13     2.24   846 1.00
means[1]     10.05     0.0 0.04     9.96    10.02    10.05    10.08    10.13   620 1.00
means[2]     -1.00     0.0 0.12    -1.22    -1.08    -1.00    -0.93    -0.77   658 1.00
lp__      -2377.55     0.1 1.79 -2381.95 -2378.45 -2377.26 -2376.25 -2375.27   329 1.00

For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

As Bob pointed out, the lp__ means are very different. This may be due to the first two chains converging to a minor mode of the posterior instead of the major(?) mode to which the last two chains converged. There were two means and two SDs, giving four possible label switches and (I believe) four posterior modes. The similarity of the SEs might indicate that the inference is valid even when the chains converge to the minor mode. Does anyone have further insight into these issues (lp__ means very different, SEs the same after label switch, validity of inferences at minor modes)?

Bob Carpenter

Feb 14, 2015, 3:44:53 AM
to stan-...@googlegroups.com
Stan can definitely get stuck around a mode.

Diagnosing multiple modes is one of the motivations for using
multiple chains with diffuse starting points and using R-hat;
if they don't all converge to the same point, sampling around
multiple modes is one possible culprit.

For full Bayesian inference, you need to integrate over
the entire posterior. So if there are a few modes, you need
to visit each in proportion to its posterior mass. If there
are a combinatorially daunting number of posterior modes,
full Bayesian inference just isn't possible.

It's not enough to just do inference around one, though it's common
to do that in machine learning circles (e.g., with Gibbs
samplers for LDA, which are combinatorially multimodal).

This is a very very tricky business to get
right from a Bayesian perspective. Michael Betancourt's working
on some sampling techniques that'll work for a few modes.

There aren't even algorithms for finding the best modes in
general, because it's an NP-hard problem (i.e., no known
sub-exponential algorithm and every reason to believe there
will never be sub-exponential algorithms).

- Bob

Steve

Feb 14, 2015, 4:19:08 AM
to stan-...@googlegroups.com
Yes, in general, sampling from multiple modes in posteriors is (I could imagine) NP-hard. What I wondered was whether, for finite mixtures, the minor modes are "reflections" of the major mode due to the symmetry between label switches. If that's the case, then sampling from a minor mode would give the same SEs (and the same full posterior inference) as sampling from the major mode, or as sampling from all modes in proportion to their weight. This thought arose from the observation that in this case (a mixture of two normals - two means, two SDs - with four modes), sampling from one of the minor modes gave exactly the same SEs as sampling from the major mode. In the two sub-groups of chains, c(1, 2) and c(3, 4), Rhat was uniformly 1, indicating convergence within each sub-group, while Rhat evaluated over all four chains ranged from 8 to 71 across the parameters.

Thanks to Rob for the log_mix function, to Ben for the inference method for a sub-group of chains, and to Bob for pointing out the lp__ mean difference and the discussion. Looking forward to the full implementation of #970.

Michael Betancourt

Feb 14, 2015, 5:22:38 AM
to stan-...@googlegroups.com
The point is that full Bayesian inference marginalizes over the modes, so you have
to sample _all_ of them, not just one.  What people often really want is to sample
conditioned on a label assignment, which is a fundamentally different problem
with no immediate answer.

A second complication is that in mixture models there tend to be a bunch of
side modes in addition to the combinatorial label-reswitching modes, and these
are _not_ invariant to permutations.

Bob Carpenter

Feb 14, 2015, 8:05:55 AM
to stan-...@googlegroups.com

> On Feb 14, 2015, at 9:22 PM, Michael Betancourt <betan...@gmail.com> wrote:
>
> The point is that full Bayesian inference marginalizes over the modes, so you have
> to sample _all_ of them, not just one. What people often really want is to sample
> conditioned on a label assignment, which is a fundamentally different problem
> with no immediate answer.
>

Is the problem with what happens between the modes? I'd
assume if the troughs between the modes were deep enough,
and they were all just label switching symmetries, then
sampling one would be good enough.

> A second complication is that in mixture models there tend to be a bunch of
> side modes in addition to the combinatorial label-reswitching modes, and these
> are _not_ invariant to permutations.

Exactly --- that's why having two different lp values for
different chains that seem to have converged is worrisome.

- Bob

Michael Betancourt

Feb 14, 2015, 8:24:31 AM
to stan-...@googlegroups.com
>>
>> The point is that full Bayesian inference marginalizes over the modes, so you have
>> to sample _all_ of them, not just one. What people often really want is to sample
>> conditioned on a label assignment, which is a fundamentally different problem
>> with no immediate answer.
>>
>
> Is the problem with what happens between the modes? I'd
> assume if the troughs between the modes were deep enough,
> and they were all just label switching symmetries, then
> sampling one would be good enough.

The problem is that full Bayes says to integrate, which means average over
all of the modes. Means will typically put you in the middle of all those modes
in one of those troughs you mention, with the variance spanning the set of
all modes. Mathematically this is correct; it’s just not what people
typically (if implicitly) want.

What people really want in practice is to do model comparison (averaging
over a bunch of unimodal models can give multimodal marginal posteriors),
which they hack by saying things like “pick a mode”. But that’s not really
all that well-posed outside of the model comparison perspective.