Puzzling effective sample size values


Kevin Van Horn

Apr 20, 2015, 1:31:31 PM
to stan-...@googlegroups.com
 I am seeing vastly different numbers for effective sample size computed by Stan versus those computed by coda. The attached CSV file and plots are for a trace that Stan says only has an effective sample size of 167.4. Looking at it visually, that number of 167.4 looks very low. Running coda::effectiveSize on it, I get a result of 1616.5. The actual number of sampling iterations is 1500.

Any idea why the large discrepancy? I'm trying to figure out which of these numbers to believe.

anomalous-ess.csv
anomalous-ess-points.png
anomalous-ess-lines.png

Andrew Gelman

Apr 20, 2015, 4:13:19 PM
to stan-...@googlegroups.com
Hi, you need multiple chains but your plot just shows one.  Also I’m surprised that Stan returns a number like 167.4.  We try to round it because it is a noisy estimate in any case.
A

--
You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kevin Van Horn

Apr 20, 2015, 6:05:39 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
The effective sample size calculation requires multiple chains? How is it being computed?

In this case, one chain is all I have.

Michael Betancourt

Apr 20, 2015, 6:33:53 PM
to stan-...@googlegroups.com
The usual ESS estimators use just one chain; we use a slightly more
sophisticated estimator which considers multiple chains and is far
more sensitive to poor mixing (which is a good thing).  If you've
just run one chain, then what happens is that we split the one chain
into two and use our multi-chain estimator.

Naively looking at your trace plot, O(100) seems reasonable to me:
although there is high-frequency variation, there is a slower low-frequency
component with a period on the order of 100 samples, and this should
drive the ESS calculation.  I'm guessing the coda estimator is just not
sensitive enough to catch this lower-frequency component.
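The split-chain idea can be sketched in a few lines. This is a minimal Python sketch of the BDA3-style split R-hat, not Stan's actual implementation, and the function name `split_rhat` is made up for illustration:

```python
import numpy as np

def split_rhat(chain):
    """Split one chain into two halves, treat the halves as separate
    chains, and compare between-half to within-half variance (the
    split R-hat described in BDA3).  Values well above 1 flag the
    kind of poor mixing a single-chain estimator can miss.
    Sketch only -- not Stan's actual code."""
    n = len(chain) // 2
    halves = np.stack([chain[:n], chain[n:2 * n]])   # m = 2 "chains" of length n
    within = halves.var(axis=1, ddof=1).mean()       # W
    between = n * halves.mean(axis=1).var(ddof=1)    # B
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(0)
print(split_rhat(rng.normal(size=2000)))             # well mixed: close to 1
stuck = np.concatenate([rng.normal(loc=0.0, size=1000),
                        rng.normal(loc=5.0, size=1000)])
print(split_rhat(stuck))                             # halves disagree: far above 1
```

A chain whose two halves sit in different places looks fine to a short-lag single-chain estimator, but the between-half term inflates `var_plus` and pushes split R-hat well above 1.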

Bob Carpenter

Apr 20, 2015, 8:42:33 PM
to stan-...@googlegroups.com
Shouldn't the effective sample size always be lower than the
sample size, or can it be higher with anti-correlation?

Could we have your data to use as an example? This is just the
kind of thing we're focusing on: something that leads people to
the wrong conclusion in BUGS/JAGS but that our more careful diagnostics
will catch. I'm particularly worried about somebody running a JAGS
program with coda and then a Stan program with our own posterior
analysis, and comparing apples to oranges.

Maybe Martyn only updated R-hat in coda, and not n_eff, to use the
split-chain estimators described in our manual (and in BDA3).

Andrew --- the output is from R, so whatever you set the number
of digits to is what you get. We don't do "smart" printing that
would stop you from griping about "needless precision." Put it
on the to-do list :-)

- Bob

Andrew Gelman

Apr 20, 2015, 8:58:50 PM
to stan-...@googlegroups.com
Bob:

No, I’m pretty sure that the R print output for Stan objects always rounds the effective sample size.

Also, again, ESS is itself a noisy measure. This is the same thing that bugs me about how people interpret cross-validation: they have a habit of thinking of the cross-validated solution as the correct value rather than as a (noisy) function of data.

Jonah

Apr 20, 2015, 10:28:37 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
RStan does round n_eff to the nearest integer with the following code

 s$summary[, 'n_eff'] <- round(s$summary[, 'n_eff'], 0)

but that only happens if it's printed with one of RStan's functions/methods (e.g. monitor(stanfit), show(stanfit), print(stanfit), etc.). If the user extracts the summary stats info from the stanfit object and then does something else with it the n_eff values won't be rounded automatically. 

Bob Carpenter

Apr 21, 2015, 12:08:34 AM
to stan-...@googlegroups.com

> On Apr 21, 2015, at 12:28 PM, Jonah <jga...@gmail.com> wrote:
>
> RStan does round n_eff to the nearest integer with the following code
>
> s$summary[, 'n_eff'] <- round(s$summary[, 'n_eff'], 0)

Neat. I didn't realize it specialized by column (just n_eff it looks like).

I also didn't realize digits=1 was just digits after decimal
place, not arithmetic precision:

   mean se_mean  sd  2.5%  25%  50%  75% 97.5% n_eff Rhat
x   0.0     0.0 1.0  -1.9 -0.7  0.0  0.7   1.9  3073    1

So digits=3 gives you the same 4 digits of precision on n_eff but increases
precision on mean to 2 digits and 97.5% interval to 4 digits.

    mean se_mean    sd   2.5%    25%   50%   75%  97.5% n_eff  Rhat
x  0.016   0.018 0.992 -1.885 -0.668 0.010 0.688  1.937  3073 1.000

Alas, the only alternative that lets you align columns is scientific notation,
and I believe Andrew already vetoed that in general.

- Bob

Kevin Van Horn

May 15, 2015, 4:43:49 PM
to stan-...@googlegroups.com
Sure, feel free to use the trace data I posted.

Kevin Van Horn

May 15, 2015, 5:50:23 PM
to stan-...@googlegroups.com
On Monday, April 20, 2015 at 4:33:53 PM UTC-6, Michael Betancourt wrote:
> Naively looking at your trace plot O(100) seems reasonable to me —
> although there is high-frequency variation there is a slower low-frequency
> component with a period on the order of 100 samples, and this should

I'm not sure I buy that. When I plot

    plot(rnorm(1500), type='l', ylim=c(-3.5, 3.5))

and compare it to

    plot((x-mean(x))/sd(x), type='l', ylim=c(-3.5, 3.5))

where x is the trace I posted earlier, I can see just as much apparent low-frequency variation in the first plot as in the second.

Furthermore, take a look at what happens when we look at the empirical standard deviations of successively larger batch means, and compare that to what we would expect for independent draws (x has length 1500):

> sd(x)
[1] 0.1406752

> sd(x)/sqrt(10)
[1] 0.04448542
> sd(colMeans(matrix(x, nr=10)))
[1] 0.04134681
> sd(x)/sqrt(15)
[1] 0.03632219
> sd(colMeans(matrix(x, nr=15)))
[1] 0.03646076
> sd(x)/sqrt(20)
[1] 0.03145594
> sd(colMeans(matrix(x, nr=20)))
[1] 0.03124686
> sd(x)/sqrt(25)
[1] 0.02813505
> sd(colMeans(matrix(x, nr=25)))
[1] 0.02777601
> sd(x)/sqrt(30)
[1] 0.02568367
> sd(colMeans(matrix(x, nr=30)))
[1] 0.02636602
> sd(x)/sqrt(50)
[1] 0.01989448
> sd(colMeans(matrix(x, nr=50)))
[1] 0.01958211
> sd(x)/sqrt(75)
[1] 0.01624378
> sd(colMeans(matrix(x, nr=75)))
[1] 0.01497145
> sd(x)/sqrt(100)
[1] 0.01406752
> sd(colMeans(matrix(x, nr=100)))
[1] 0.01281342

If anything, we seem to be getting some negative correlation increasing the effective sample size.
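Kevin's batch-means check translates directly to a short script. This is a Python sketch; the helper name `batch_sd_ratio` is made up, and the AR(1) chain stands in for a generic positively autocorrelated trace:

```python
import numpy as np

def batch_sd_ratio(x, b):
    """Ratio of the empirical sd of size-b batch means to the iid
    prediction sd(x)/sqrt(b).  Near 1: draws behave like independent
    samples at that scale.  Well above 1: positive autocorrelation
    (ESS below N).  Below 1: negative autocorrelation (ESS above N)."""
    n = (len(x) // b) * b
    observed = x[:n].reshape(-1, b).mean(axis=1).std(ddof=1)
    expected = x.std(ddof=1) / np.sqrt(b)
    return float(observed / expected)

rng = np.random.default_rng(1)
iid = rng.normal(size=1500)          # independent draws
ar = np.empty(1500)                  # AR(1), rho = 0.9: strongly autocorrelated
ar[0] = rng.normal()
for t in range(1, 1500):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

for b in (10, 25, 50):
    print(b, round(batch_sd_ratio(iid, b), 2), round(batch_sd_ratio(ar, b), 2))
```

Kevin's ratios hover around, and at larger batch sizes slightly below, 1, which is consistent with his reading that mild negative correlation pushes the effective sample size above the 1500 draws.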

Akihiko Nishimura

Jun 10, 2015, 1:58:58 PM
to stan-...@googlegroups.com
I am noticing the same issues with the effective sample size outputs as well (using PyStan).

With my simple linear mixed-effects model, Stan's output consistently underestimates the effective sample sizes of the parameters by a factor of 3 to 6. I tried running 1, 5, and 10 chains, and it made no difference. I tried calculating the "true" effective sample size using Bartlett's spectral estimator, Geyer's monotone positive sequence estimator, and a spectrum estimate based on AR processes, and they all give more or less the same number. I ran the chains for a fairly long time, so the estimates are not that noisy.

Bob Carpenter

Jun 10, 2015, 2:08:08 PM
to stan-...@googlegroups.com
n_eff is always estimated --- you can't know the "true" effective
sample size unless you set the sampler up analytically.

Stan's been set up to be conservative compared to other methods by
using cross-chain information rather than estimating the n_eff of
each chain separately and then adding. What are the Rhat values for
those chains? If they're above 1, you haven't run the chains long enough.
The bigger the Rhat, the more we'll adjust the n_eff
estimates downward. As Rhat goes to 1, our n_eff estimate should approach
the ones you'd get from running the chains independently and adding.
You can see the formula we use in the manual (MCMC chapter) or in Gelman
et al.'s BDA3 book.
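The estimator Bob is pointing at looks roughly like this. A Python sketch following the BDA3 description, not Stan's exact code; the truncation details in particular differ across Stan versions:

```python
import numpy as np

def multichain_ess(chains):
    """ESS from m chains of length n, along the lines of the BDA3 /
    Stan-manual estimator (sketch only).  Lag correlations are measured
    against the pooled variance estimate var_plus, so between-chain
    disagreement (Rhat > 1) inflates the correlations and deflates ESS."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()        # W
    between = n * chains.mean(axis=1).var(ddof=1)     # B
    var_plus = (n - 1) / n * within + between / n
    centered = chains - chains.mean(axis=1, keepdims=True)

    def rho(lag):
        # average within-chain autocovariance at this lag
        acov = np.mean([c[:-lag] @ c[lag:] / n for c in centered])
        return 1.0 - (within - acov) / var_plus

    rho_sum, t = 0.0, 1
    while t + 1 < n:                    # Geyer-style truncation: stop when
        pair = rho(t) + rho(t + 1)      # a lag pair's summed correlation
        if pair < 0:                    # estimate goes negative
            break
        rho_sum += pair
        t += 2
    return m * n / (1 + 2 * rho_sum)

rng = np.random.default_rng(3)
print(round(multichain_ess(rng.normal(size=(4, 1000)))))  # iid draws: ESS near m * n
```

Because `var_plus` carries the between-chain term, chains that disagree drive every lag correlation up, which is exactly how a large Rhat pulls the n_eff estimate down.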

We should make sure that PyStan provides the same answer as CmdStan.
Can we borrow your sample for testing?

- Bob

Akihiko Nishimura

Jun 10, 2015, 3:50:03 PM
to stan-...@googlegroups.com
They seem like pretty substantial underestimates to me (a factor of 3 to 6). The Rhat's are mostly 1.0. (I work on a version of HMC for my PhD thesis, so I am not new to computing ESS.)

I attached a Python notebook that can reproduce the result. The number of burn-in iterations is intentionally large, so that NUTS will produce more or less independent samples after burn-in.
STAN for linear mixed effects (intercepts only) model.ipynb

Bob Carpenter

Jun 10, 2015, 3:56:09 PM
to stan-...@googlegroups.com
Thanks --- when I get time to figure out how to pull this
out of Python and run it from the command line, I'll try to investigate
what's up. I don't recall whether PyStan is currently using the same code
as CmdStan to compute Rhat --- it will be soon.

On what basis are you deciding that NUTS produces more or less
independent samples? Is that using our Rhat calcs on a single chain?

- Bob

Akihiko Nishimura

Jun 10, 2015, 4:27:45 PM
to stan-...@googlegroups.com
I inspected the autocorrelation functions from multiple Stan runs for my model, and I saw almost no autocorrelation.

Andrew Gelman

Jun 10, 2015, 5:26:02 PM
to stan-...@googlegroups.com
Bob:
I’m happy to sit with you and look at this example if you’d like, as I’ve been thinking a lot about R-hat recently anyway.
A

Allen B. Riddell

Jun 10, 2015, 8:56:01 PM
to stan-...@googlegroups.com
PyStan, RStan, and CmdStan all use different code to calculate Rhat.

RStan and CmdStan tend to agree with each other. PyStan's calculation is
typically off by a bit. I haven't been able to pin down the reason for
the discrepancy (bug report: https://github.com/stan-dev/pystan/issues/21).

Susan Anyosa

Feb 6, 2017, 12:06:45 PM
to Stan users mailing list
Hi Bob!

What do you think about getting one chain from Stan and one chain from JAGS (same model), and then computing the ESS (coda, batchmeans, etc.)?
I've done that, and the ESS seems greater for JAGS than for Stan.

Is it ok to do that? What kind of procedure do you recommend?
Thanks.

Cole Monnahan

Feb 6, 2017, 12:43:49 PM
to Stan users mailing list
I'm a bit late to this thread, but I have some model output that may be relevant here. For a paper I wrote comparing the performance of JAGS and Stan on a handful of ecological models, I compared the minimum ESS from coda and rstan.

I didn't put it in the paper, but I made this figure, where I evaluated both JAGS and Stan chains with both ways of calculating ESS. In this case the runs were much longer than necessary, and I only used a single chain in each calculation (hence each point is a stand-alone replicate).

It was clear that coda had higher estimates (as expected), and Stan had higher ESS (also as expected). I also thought it was reassuring that when coda was overestimating ESS, it did so for both platforms. It would be worse if the MCMC algorithms interacted with the ESS calculations, but I didn't see that at all.

Cheers,
Cole

Andrew Gelman

Feb 6, 2017, 4:58:10 PM
to stan-...@googlegroups.com
Hi, no, you need multiple chains to assess effective sample size.


Michael Betancourt

Feb 7, 2017, 9:52:57 AM
to stan-...@googlegroups.com
Or to put it another way: ESS is meaningful only when certain conditions
hold between the target distribution and the sampler.  For a general sampler,
only by running multiple chains will you be able to identify that these
conditions are not being met.  In particular, when they are not
met, Stan's ESS estimator over multiple chains will drop to very small values.
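A toy version of the failure mode Michael describes, with made-up numbers: two chains that each look perfect in isolation but explore different modes.

```python
import numpy as np

rng = np.random.default_rng(2)
# Each chain explores one mode of a bimodal target.  In isolation each
# looks like a perfectly mixed stream of independent draws, so a
# single-chain estimator would report ESS near 1000 for each.
chain_a = rng.normal(loc=-3.0, size=1000)
chain_b = rng.normal(loc=+3.0, size=1000)

within = 0.5 * (chain_a.var(ddof=1) + chain_b.var(ddof=1))
pooled = np.concatenate([chain_a, chain_b]).var(ddof=1)
print(within, pooled)   # pooled variance dwarfs within-chain variance
```

The pooled variance is roughly 1 + 3^2 = 10 times the within-chain variance, so an estimator that measures correlations against the pooled variance, as Stan's multi-chain ESS does, collapses to a small value, while a single-chain estimator sees nothing wrong.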

Bob Carpenter

Feb 7, 2017, 12:59:51 PM
to stan-...@googlegroups.com
Thanks --- that's a really helpful figure.

So the min is the min across parameters? Do you estimate the
ESS for each parameter and then take the min? I can imagine a different parameter
having the min ESS under coda vs. Stan's ESS estimator.

- Bob

Cole Monnahan

Feb 7, 2017, 1:16:38 PM
to stan-...@googlegroups.com
Bob: Correct, it is the minimum estimated ESS across parameters. I did not check whether the same parameter was minimal for the two ESS calculations, nor between software packages. I'd have to rerun things to get that. But I wouldn't be surprised... -Cole

