Puzzling effective sample size values


Kevin Van Horn

Apr 20, 2015, 1:31:31 PM
to stan-...@googlegroups.com
 I am seeing vastly different numbers for effective sample size computed by Stan versus those computed by coda. The attached CSV file and plots are for a trace that Stan says only has an effective sample size of 167.4. Looking at it visually, that number of 167.4 looks very low. Running coda::effectiveSize on it, I get a result of 1616.5. The actual number of sampling iterations is 1500.

Any idea why the large discrepancy? I'm trying to figure out which of these numbers to believe.

anomalous-ess.csv
anomalous-ess-points.png
anomalous-ess-lines.png

Andrew Gelman

Apr 20, 2015, 4:13:19 PM
to stan-...@googlegroups.com
Hi, you need multiple chains but your plot just shows one.  Also I’m surprised that Stan returns a number like 167.4.  We try to round it because it is a noisy estimate in any case.
A

--
You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kevin Van Horn

Apr 20, 2015, 6:05:39 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
The effective sample size calculation requires multiple chains? How is it being computed?

In this case, one chain is all I have.

Michael Betancourt

Apr 20, 2015, 6:33:53 PM
to stan-...@googlegroups.com
The usual ESS estimators use just one chain; we use a slightly more
sophisticated estimator which considers multiple chains and is far
more sensitive to poor mixing (which is a good thing).  If you've
just run one chain, then what happens is that we split the one chain
into two and use our multi-chain estimator.

Naively looking at your trace plot, O(100) seems reasonable to me:
although there is high-frequency variation, there is a slower low-frequency
component with a period on the order of 100 samples, and this should
drive the ESS calculation.  I'm guessing the coda estimator is just not
sensitive enough to catch this lower-frequency component.
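The split-chain idea can be sketched in a few lines. This is a minimal Python sketch of the BDA3-style split R-hat, not Stan's actual implementation, and the function name `split_rhat` is made up for illustration:

```python
import numpy as np

def split_rhat(chain):
    """Split one chain into two halves, treat the halves as separate
    chains, and compare between-half to within-half variance (the
    split R-hat described in BDA3).  Values well above 1 flag the
    kind of poor mixing a single-chain estimator can miss.
    Sketch only -- not Stan's actual code."""
    n = len(chain) // 2
    halves = np.stack([chain[:n], chain[n:2 * n]])   # m = 2 "chains" of length n
    within = halves.var(axis=1, ddof=1).mean()       # W
    between = n * halves.mean(axis=1).var(ddof=1)    # B
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(0)
print(split_rhat(rng.normal(size=2000)))             # well mixed: close to 1
stuck = np.concatenate([rng.normal(loc=0.0, size=1000),
                        rng.normal(loc=5.0, size=1000)])
print(split_rhat(stuck))                             # halves disagree: far above 1
```

A chain whose two halves sit in different places looks fine to a short-lag single-chain estimator, but the between-half term inflates `var_plus` and pushes split R-hat well above 1.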

Bob Carpenter

Apr 20, 2015, 8:42:33 PM
to stan-...@googlegroups.com
Shouldn't the effective sample size always be lower than the
sample size, or can it be higher with anti-correlation?

Could we have your data to use as an example? This is just the
kind of thing we're focusing on: something that leads people to
the wrong conclusion in BUGS/JAGS but that our more careful diagnostics
will catch. I'm particularly worried about somebody running a JAGS
program with coda and then a Stan program with our own posterior
analysis, and comparing apples to oranges.

Maybe Martyn only updated R-hat in coda, and not n_eff, to use the
split-chain estimators described in our manual (and in BDA3).

Andrew --- the output is from R, so whatever you set the number
of digits to is what you get. We don't do "smart" printing that
would stop you from griping about "needless precision." Put it
on the to-do list :-)

- Bob

Andrew Gelman

Apr 20, 2015, 8:58:50 PM
to stan-...@googlegroups.com
Bob:

No, I’m pretty sure that the R print output for Stan objects always rounds the effective sample size.

Also, again, ESS is itself a noisy measure. This is the same thing that bugs me about how people interpret cross-validation: they have a habit of thinking of the cross-validated solution as the correct value rather than as a (noisy) function of data.

Jonah

Apr 20, 2015, 10:28:37 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
RStan does round n_eff to the nearest integer with the following code

 s$summary[, 'n_eff'] <- round(s$summary[, 'n_eff'], 0)

but that only happens if it's printed with one of RStan's functions/methods (e.g. monitor(stanfit), show(stanfit), print(stanfit), etc.). If the user extracts the summary stats info from the stanfit object and then does something else with it the n_eff values won't be rounded automatically. 

Bob Carpenter

Apr 21, 2015, 12:08:34 AM
to stan-...@googlegroups.com

> On Apr 21, 2015, at 12:28 PM, Jonah <jga...@gmail.com> wrote:
>
> RStan does round n_eff to the nearest integer with the following code
>
> s$summary[, 'n_eff'] <- round(s$summary[, 'n_eff'], 0)

Neat. I didn't realize it specialized by column (just n_eff it looks like).

I also didn't realize digits=1 was just digits after decimal
place, not arithmetic precision:

   mean se_mean  sd  2.5%  25%  50%  75% 97.5% n_eff Rhat
x   0.0     0.0 1.0  -1.9 -0.7  0.0  0.7   1.9  3073    1

So digits=3 gives you the same 4 digits of precision on n_eff but increases
precision on mean to 2 digits and 97.5% interval to 4 digits.

    mean se_mean    sd   2.5%    25%   50%   75%  97.5% n_eff  Rhat
x  0.016   0.018 0.992 -1.885 -0.668 0.010 0.688  1.937  3073 1.000

Alas, the only alternative that lets you align columns is scientific notation,
and I believe Andrew already vetoed that in general.

- Bob

Kevin Van Horn

May 15, 2015, 4:43:49 PM
to stan-...@googlegroups.com
Sure, feel free to use the trace data I posted.

Kevin Van Horn

May 15, 2015, 5:50:23 PM
to stan-...@googlegroups.com
On Monday, April 20, 2015 at 4:33:53 PM UTC-6, Michael Betancourt wrote:
> Naively looking at your trace plot O(100) seems reasonable to me —
> although there is high-frequency variation there is a slower low-frequency
> component with a period on the order of 100 samples, and this should

I'm not sure I buy that. When I plot

    plot(rnorm(1500), type='l', ylim=c(-3.5, 3.5))

and compare it to

    plot((x-mean(x))/sd(x), type='l', ylim=c(-3.5, 3.5))

where x is the trace I posted earlier, I can see just as much apparent low-frequency variation in the first plot as in the second.

Furthermore, take a look at what happens when we look at the empirical standard deviations of successively larger batch means, and compare that to what we would expect for independent draws (x has length 1500):

> sd(x)
[1] 0.1406752

> sd(x)/sqrt(10)
[1] 0.04448542
> sd(colMeans(matrix(x, nr=10)))
[1] 0.04134681
> sd(x)/sqrt(15)
[1] 0.03632219
> sd(colMeans(matrix(x, nr=15)))
[1] 0.03646076
> sd(x)/sqrt(20)
[1] 0.03145594
> sd(colMeans(matrix(x, nr=20)))
[1] 0.03124686
> sd(x)/sqrt(25)
[1] 0.02813505
> sd(colMeans(matrix(x, nr=25)))
[1] 0.02777601
> sd(x)/sqrt(30)
[1] 0.02568367
> sd(colMeans(matrix(x, nr=30)))
[1] 0.02636602
> sd(x)/sqrt(50)
[1] 0.01989448
> sd(colMeans(matrix(x, nr=50)))
[1] 0.01958211
> sd(x)/sqrt(75)
[1] 0.01624378
> sd(colMeans(matrix(x, nr=75)))
[1] 0.01497145
> sd(x)/sqrt(100)
[1] 0.01406752
> sd(colMeans(matrix(x, nr=100)))
[1] 0.01281342

If anything, we seem to be getting some negative correlation increasing the effective sample size.
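Kevin's batch-means check translates directly to a short script. This is a Python sketch; the helper name `batch_sd_ratio` is made up, and the AR(1) chain stands in for a generic positively autocorrelated trace:

```python
import numpy as np

def batch_sd_ratio(x, b):
    """Ratio of the empirical sd of size-b batch means to the iid
    prediction sd(x)/sqrt(b).  Near 1: draws behave like independent
    samples at that scale.  Well above 1: positive autocorrelation
    (ESS below N).  Below 1: negative autocorrelation (ESS above N)."""
    n = (len(x) // b) * b
    observed = x[:n].reshape(-1, b).mean(axis=1).std(ddof=1)
    expected = x.std(ddof=1) / np.sqrt(b)
    return float(observed / expected)

rng = np.random.default_rng(1)
iid = rng.normal(size=1500)          # independent draws
ar = np.empty(1500)                  # AR(1), rho = 0.9: strongly autocorrelated
ar[0] = rng.normal()
for t in range(1, 1500):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

for b in (10, 25, 50):
    print(b, round(batch_sd_ratio(iid, b), 2), round(batch_sd_ratio(ar, b), 2))
```

Kevin's ratios hover around, and at larger batch sizes slightly below, 1, which is consistent with his reading that mild negative correlation pushes the effective sample size above the 1500 draws.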

Akihiko Nishimura

Jun 10, 2015, 1:58:58 PM
to stan-...@googlegroups.com
I am noticing the same issues with the effective sample size outputs as well (using PyStan).

With my simple linear mixed-effects model, Stan's output consistently underestimates the effective sample sizes of the parameters by a factor of 3 to 6. I tried running 1, 5, and 10 chains, and it made no difference. I tried calculating the "true" effective sample size using Bartlett's spectral estimator, Geyer's monotone positive sequence estimator, and a spectrum estimate based on AR processes, and they all give more or less the same number. I ran the chains for a fairly long time, so the estimates are not that noisy.

Bob Carpenter

Jun 10, 2015, 2:08:08 PM
to stan-...@googlegroups.com
n_eff is always estimated --- you can't know the "true" effective
sample size unless you set the sampler up analytically.

Stan's been set up to be conservative compared to other methods by
using cross-chain information rather than estimating the n_eff of
each chain separately and then adding. What are the Rhat values for
those chains? If they're above 1, you haven't run the chains long enough.
The bigger the Rhat, the more we'll adjust the n_eff
estimates downward. As Rhat goes to 1, our n_eff estimate should approach
the ones you'd get from running the chains independently and adding.
You can see the formula we use in the manual (MCMC chapter) or in Gelman
et al.'s BDA3 book.
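The estimator Bob is pointing at looks roughly like this. A Python sketch following the BDA3 description, not Stan's exact code; the truncation details in particular differ across Stan versions:

```python
import numpy as np

def multichain_ess(chains):
    """ESS from m chains of length n, along the lines of the BDA3 /
    Stan-manual estimator (sketch only).  Lag correlations are measured
    against the pooled variance estimate var_plus, so between-chain
    disagreement (Rhat > 1) inflates the correlations and deflates ESS."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()        # W
    between = n * chains.mean(axis=1).var(ddof=1)     # B
    var_plus = (n - 1) / n * within + between / n
    centered = chains - chains.mean(axis=1, keepdims=True)

    def rho(lag):
        # average within-chain autocovariance at this lag
        acov = np.mean([c[:-lag] @ c[lag:] / n for c in centered])
        return 1.0 - (within - acov) / var_plus

    rho_sum, t = 0.0, 1
    while t + 1 < n:                    # Geyer-style truncation: stop when
        pair = rho(t) + rho(t + 1)      # a lag pair's summed correlation
        if pair < 0:                    # estimate goes negative
            break
        rho_sum += pair
        t += 2
    return m * n / (1 + 2 * rho_sum)

rng = np.random.default_rng(3)
print(round(multichain_ess(rng.normal(size=(4, 1000)))))  # iid draws: ESS near m * n
```

Because `var_plus` carries the between-chain term, chains that disagree drive every lag correlation up, which is exactly how a large Rhat pulls the n_eff estimate down.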

We should make sure that PyStan provides the same answer as CmdStan.
Can we borrow your sample for testing?

- Bob

Akihiko Nishimura

Jun 10, 2015, 3:50:03 PM
to stan-...@googlegroups.com
They seem like pretty substantial underestimates to me (a factor of 3 to 6). The Rhat's are mostly 1.0. (I work on a version of HMC for my PhD thesis, so I am not new to computing ESS.)

I attached a Python notebook that can reproduce the result. The number of burn-in iterations is intentionally large, so that NUTS will produce more or less independent samples after burn-in.
STAN for linear mixed effects (intercepts only) model.ipynb

Bob Carpenter

Jun 10, 2015, 3:56:09 PM
to stan-...@googlegroups.com
Thanks --- when I get time to figure out how to pull this
out of Python and run it from the command line, I'll try to investigate
what's up. I don't recall whether PyStan is currently using the same code
as CmdStan to compute Rhat --- it will be soon.

On what basis are you deciding that NUTS produces more or less
independent samples? Is that using our Rhat calcs on a single chain?

- Bob

Akihiko Nishimura

Jun 10, 2015, 4:27:45 PM
to stan-...@googlegroups.com
I inspected the autocorrelation functions from multiple Stan runs for my model, and I saw almost no autocorrelation.

Andrew Gelman

Jun 10, 2015, 5:26:02 PM
to stan-...@googlegroups.com
Bob:
I’m happy to sit with you and look at this example if you’d like, as I’ve been thinking a lot about R-hat recently anyway.
A

Allen B. Riddell

Jun 10, 2015, 8:56:01 PM
to stan-...@googlegroups.com
PyStan, RStan, and CmdStan all use different code to calculate Rhat.

RStan and CmdStan tend to agree with each other. PyStan's calculation is
typically off by a bit. I haven't been able to pin down the reason for
the discrepancy (bug report: https://github.com/stan-dev/pystan/issues/21).

Susan Anyosa

Feb 6, 2017, 12:06:45 PM
to Stan users mailing list
Hi Bob!

What do you think about getting one chain from Stan and one chain from JAGS (same model), and then computing the ESS (coda, batchmeans, etc.)?
I've done that, and the ESS seems greater for JAGS than for Stan.

Is it ok to do that? What kind of procedure do you recommend?
Thanks.

Cole Monnahan

Feb 6, 2017, 12:43:49 PM
to Stan users mailing list
I'm a bit late to this thread, but I have some model output that may be relevant here. For a paper I wrote comparing the performance of JAGS and Stan on a handful of ecological models, I compared the minimum ESS from coda and rstan.

I didn't put it in the paper, but I made this figure, where I evaluated both JAGS and Stan chains with both ways of calculating ESS. In this case the runs were much longer than necessary, and I only used a single chain in each calculation (hence each point is a stand-alone replicate).

It was clear that coda had higher estimates (as expected), and Stan had higher ESS (also as expected). I also thought it was reassuring that when coda was overestimating ESS, it did so for both platforms. It would be worse if the MCMC algorithms interacted with the ESS calculations, but I didn't see that at all.

Cheers,
Cole

Andrew Gelman

Feb 6, 2017, 4:58:10 PM
to stan-...@googlegroups.com
Hi, no, you need multiple chains to assess effective sample size.


Michael Betancourt

Feb 7, 2017, 9:52:57 AM
to stan-...@googlegroups.com
Or to put it another way: ESS is meaningful only when certain conditions
hold between the target distribution and the sampler.  For a general sampler,
only by running multiple chains will you be able to identify that these
conditions are not being met.  In particular, when they are not
met, Stan's ESS estimator over multiple chains will drop to very small values.
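A toy version of the failure mode Michael describes, with made-up numbers: two chains that each look perfect in isolation but explore different modes.

```python
import numpy as np

rng = np.random.default_rng(2)
# Each chain explores one mode of a bimodal target.  In isolation each
# looks like a perfectly mixed stream of independent draws, so a
# single-chain estimator would report ESS near 1000 for each.
chain_a = rng.normal(loc=-3.0, size=1000)
chain_b = rng.normal(loc=+3.0, size=1000)

within = 0.5 * (chain_a.var(ddof=1) + chain_b.var(ddof=1))
pooled = np.concatenate([chain_a, chain_b]).var(ddof=1)
print(within, pooled)   # pooled variance dwarfs within-chain variance
```

The pooled variance is roughly 1 + 3^2 = 10 times the within-chain variance, so an estimator that measures correlations against the pooled variance, as Stan's multi-chain ESS does, collapses to a small value, while a single-chain estimator sees nothing wrong.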

Bob Carpenter

Feb 7, 2017, 12:59:51 PM
to stan-...@googlegroups.com
Thanks --- that's a really helpful figure.

So the min is the min across parameters? Do you estimate the
ESS for each parameter and then take the min? I can imagine a different parameter
having the min ESS under coda vs. Stan's ESS estimator.

- Bob

Cole Monnahan

Feb 7, 2017, 1:16:38 PM
to stan-...@googlegroups.com
Bob: Correct, it is the minimum estimated ESS across parameters. I did not check whether the same parameter was minimal for the two ESS calculations, nor between software packages. I'd have to rerun things to get that. But I wouldn't be surprised... -Cole

