recommended workflow for posterior predictive checks (maybe in R/Rstan)

Tamas Papp

Jan 21, 2014, 7:01:51 AM
to stan-...@googlegroups.com
Hi,

Sorry if this is somewhat off-topic. I am using Stan via RStan, and I
was wondering about the most convenient way to do posterior predictive
checks (eg the Gelman et al papers).

I have tried the following:

1. Generated quantities in Stan.

pro: easy to code, can usually copy-paste code from the model block with
little modification. if I am just looking at a few generated quantities
that I know in advance, this looks like the best way.

con: if I don't know what I am looking for, I either recompile and rerun
the model (relatively slow), or simulate _all_ the parameters and save
them (can be a huge blob in memory, but works OK for smaller models).

2. Simulation in R.

pro: also easy to code. flexible enough for experimentation with various
checks.

con: I have yet to find a clean, concise idiom for this. Loops are
messy, especially if the model itself has nested loops.
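
To make this concrete, the loopy idiom I mean is something like the
following sketch (the parameter names and sizes are made up, and I fake
the posterior instead of calling extract(fit)):

```r
## Sketch of option 2: simulate replicated data in R with an explicit
## loop over posterior draws. mu/sigma are stand-ins for what
## extract(fit) would return for a simple normal model.
posterior <- list(mu = rnorm(100), sigma = rexp(100))
n_sims <- length(posterior$mu)
N <- 50                                  # number of observations
y_rep <- matrix(NA_real_, n_sims, N)
for (s in 1:n_sims)
  y_rep[s, ] <- rnorm(N, posterior$mu[s], posterior$sigma[s])
```

This is fine for one level, but with nesting the bookkeeping gets out of
hand quickly.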

I have looked at the rv library [1] I did not find it convenient. Maybe
I am using it the wrong way?

[1] http://cran.r-project.org/web/packages/rv/

I was wondering if users of Stan on this list could share their way of
doing posterior predictive checks, possibly with example code. I am sure
I have a lot to learn that could simplify my work.

Best,

Tamas

Bob Carpenter

Jan 21, 2014, 7:11:39 AM
to stan-...@googlegroups.com

On 1/21/14, 1:01 PM, Tamas Papp wrote:
> Hi,
>
> Sorry if this is somewhat off-topic. I am using Stan via RStan, and I
> was wondering about the most convenient way to do posterior predictive
> checks (eg the Gelman et al papers).

Absolutely on topic.

> I have tried the following:
>
> 1. Generated quantities in Stan.
>
> pro: easy to code, can usually copy-paste code from the model block with
> little modification. if I am just looking at a few generated quantities
> that I know in advance, this looks like the best way.
>
> con: if I don't know what I am looking for, I either recompile and rerun
> the model (relatively slow), or simulate _all_ the parameters and save
> them (can be a huge blob in memory, but works OK for smaller models).
>
> 2. Simulation in R.
>
> pro: also easy to code. flexible enough for experimentation with various
> checks.
>
> con: I have yet to find a clean, concise idiom for this. Loops are
> messy, especially if the model itself has nested loops.

That's also a nice summary of the pros and cons.

The other two pros for doing it in Stan are (a) that there's no confusion
about parameterizations since our _rng functions and sampling
probability functions have the same parameterizations, and (b) it works
cross-platform (CmdStan, RStan, PyStan).

The third option would be to swap around the parameters and
generated quantities block. The pro is that it's the easiest of
all the options, but the con is that you don't get independent
samples, just MCMC samples. I'm about to drop the following into
the next version of the manual on just this topic:

https://github.com/stan-dev/stan/issues/485?source=cc#issuecomment-32600765

And I also plan to summarize the points you made (you'll see I added
them to the issue comment linked above).

- Bob

Tamas Papp

Jan 21, 2014, 7:25:19 AM
to stan-...@googlegroups.com
Thanks.

I don't think that it is possible to automate the generation of
posterior simulations (eg in a hierarchical model, maybe I just want to
simulate from a given level), so I would not insist on Stan being able
to do this by swapping parameters and data.

I am in favor of functional approaches, so ideally, I would prefer an R
function with the following syntax:

data_for_stan <- list(N_observations=nrow(something), ...)
fit <- stan(...)
posterior <- extract(fit, permuted=TRUE)

map_data_frame(posterior,
               function(mu, sigma, ...) {
                 ## simulate data
                 ...
                 ## summarize
                 list(p1=my_favorite_statistic(...), ...)
               })

Ideally, map_data_frame would do the "right thing": slice up arrays
along the first dimension (so it can deal with a 3-dimensional array in
the data frame that is a simulated matrix, etc), call a function
matching the names to arguments, and then pack up the result as another
data_frame.

One could use such a function in two stages, too: first simulate, save
that, then calculate statistics on it.

I just don't know enough R wizardry to implement this. My main problem
is that AFAIK R does not allow vectors to have array elements, so
sub-arrays need to be extracted, then repacked.
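
For what it's worth, here is roughly the slicing-and-repacking I have in
mind, as a naive (and surely slow) sketch; map_data_frame and its
semantics are just my invention:

```r
## Naive sketch of a hypothetical map_data_frame: slice every element of
## the posterior along its first (draw) dimension, call f with the
## matching arguments by name, and row-bind the returned summaries.
map_data_frame <- function(posterior, f) {
  n_sims <- dim(as.array(posterior[[1]]))[1]
  slice <- function(x, s) {
    if (is.null(dim(x))) x[s] else do.call(`[`, c(list(x, s), rep(TRUE, length(dim(x)) - 1)))
  }
  rows <- lapply(seq_len(n_sims), function(s) {
    args <- lapply(posterior, slice, s = s)
    as.data.frame(do.call(f, args[names(formals(f))]))
  })
  do.call(rbind, rows)
}
```

For example, map_data_frame(list(mu=1:3, sigma=c(1,1,1)),
function(mu, sigma) list(m=mu+sigma)) returns a 3-row data frame.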

Best,

Tamas

Andrew Gelman

Jan 21, 2014, 11:12:17 AM
to stan-...@googlegroups.com
Let me just say that I agree this would be a good idea. I've long felt that it would be good to have a general template for posterior checking. It should be as easy to check a model as to fit it. I also think that there should be a way to automate the graphics (for example, a histogram to show the distribution of a 1-D test statistic, a scatterplot for a 2-D test statistic, and parallel coordinate plots for 3 or more dimensions).
A


Tamas Papp

Jan 22, 2014, 11:06:05 AM
to stan-...@googlegroups.com
Hi Andrew,

For graphical posterior predictive checks, I am using something like
this (uses libraries lattice and latticeExtra, does not handle 3d but
that would be easy to add):

--8<---------------cut here---------------start------------->8---
comparisonplot <- function(xdata, xpred, main="Posterior predictive check", sub=NULL, ...) {
  ## Return a trellis plot for graphical posterior predictive
  ## checks. Type is automatically selected based on the dimensions of
  ## xdata. For stochastic predicted data, uses hexbin plots, with a
  ## LOESS smoother line.
  range <- extendrange(c(xdata, xpred), f=0.05)
  if (is.null(sub))
    sub <- sprintf("P(data <= predicted)=%.2f", mean(xdata <= xpred))
  if (length(xdata) == 1) {
    histogram(xpred,
              main=main,
              sub=sub,
              panel=function(...) {
                panel.histogram(...)
                panel.abline(v=xdata, col="black", lty=2, lwd=2)
                panel.abline(v=mean(xpred), col="red", lty=2, lwd=2)
              },
              xlim=range,
              ...)
  } else if (length(xpred) == length(xdata)) {
    hexbinplot(xpred ~ xdata,
               main=main,
               sub=sub,
               panel=function(...) {
                 panel.hexbinplot(...)
                 panel.abline(a=0, b=1)
                 panel.loess(..., col="red", lty=2, lwd=2)
               },
               xlim=range, ylim=range, aspect="iso")
  } else {
    stop("possible dimension mismatch")
  }
}
--8<---------------cut here---------------end--------------->8---

Example:

--8<---------------cut here---------------start------------->8---
comparisonplot(0,rnorm(100,mean=-3))
comparisonplot(rnorm(100),rnorm(100,mean=-1))
--8<---------------cut here---------------end--------------->8---

Obviously, it can be made much fancier (automatic variable names etc).

Best,

Tamas

Andrew Gelman

Jan 22, 2014, 5:24:09 PM
to stan-...@googlegroups.com
Interesting, perhaps worth writing up as a little paper with all the details (i.e., a vignette that people can use as a template).
A

Tamas Papp

Jan 27, 2014, 10:47:06 AM
to stan-...@googlegroups.com
Hi Andrew and everyone,

On the topic of posterior predictive checks without loops: I wrote up a
small set of functions for mapping posteriors functionally:

https://gist.github.com/tpapp/8650643

Examples are at the end. To take a well-known example from the BDA
appendix,

theta_rep <- array(NA, c(n_sims, J))
y_rep <- array(NA, c(n_sims, J))
for (s in 1:n_sims) {
  theta_rep[s,] <- rnorm(J, schools_sim$mu[s], schools_sim$tau[s])
  y_rep[s,] <- rnorm(J, theta_rep[s,], sigma)
}

would be something like

theta_and_y <- map_subarrays(school_sim,
                             function(mu, tau, sigma) {
                               theta_rep <- rnorm(J, mu, tau)
                               list(theta_rep=theta_rep,
                                    y_rep=rnorm(J, theta_rep, sigma))
                             })

The good news is that this does what I want, and the syntax works great,
simplifying loops (or nested loops) considerably.

The bad news is that it is painfully slow for nontrivial examples. This
comes from the management of list labels, if I understand
correctly. Currently, it is so slow as to be unusable.

If you know any R wizards, maybe they can speed it up a bit. I asked on
the R-devel mailing list, where the message was not approved ("please
ask on R-help") and then didn't get the kind of help I was looking for
on R-help. So I am back to loops, ugly as they are, I don't want to
waste too much time on this now.

Also, aaply from the plyr library works with a little tweaking
(cbinding arrays together, destructuring, applying the function, etc.),
but that's not much nicer than loops.
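
For completeness, when the parameters are scalars per draw, base R's
mapply gets partway there without an explicit loop index (a sketch with
made-up, eight-schools-like numbers):

```r
## Sketch: mapply over per-draw scalar parameters. Each call returns a
## length-J vector, so the simplified result is J x n_sims and needs
## transposing to get draws in rows.
mu <- rnorm(100)
tau <- rexp(100)
J <- 8
sigma <- rep(10, J)
y_rep <- t(mapply(function(m, s) rnorm(J, rnorm(J, m, s), sigma), mu, tau))
```

It still doesn't help with labeling multiple named outputs, though.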

Best,

Tamas

Bob Carpenter

Jan 27, 2014, 10:57:53 AM
to stan-...@googlegroups.com


On 1/27/14, 4:47 PM, Tamas Papp wrote:
> Hi Andrew and everyone,
>
> On the topic of posterior predictive checks without loops: I wrote up a
> small set of functions for mapping posteriors functionally:
>
> https://gist.github.com/tpapp/8650643
>
> Examples are at the end. To take a well-known example from the BDA
> appendix,
>
> theta_rep <- array(NA, c(n_sims, J))
> y_rep <- array(NA, c(n_sims, J))
> for (s in 1:n_sims) {
>   theta_rep[s,] <- rnorm(J, schools_sim$mu[s], schools_sim$tau[s])
>   y_rep[s,] <- rnorm(J, theta_rep[s,], sigma)
> }

This you can do inside the generated quantities block of Stan,
where it should be pretty fast.

> would be something like
>
> theta_and_y <- map_subarrays(school_sim,
>                              function(mu, tau, sigma) {
>                                theta_rep <- rnorm(J, mu, tau)
>                                list(theta_rep=theta_rep,
>                                     y_rep=rnorm(J, theta_rep, sigma))
>                              })
>
> The good news is that this does what I want, and the syntax works great,
> simplifying loops (or nested loops) considerably.
>
> The bad news is that it is painfully slow for nontrivial examples.

Is it slower than the simpler loops above?

> This
> comes from the management of list labels, if I understand
> correctly. Currently, it is so slow as to be unusable.

How many samples are you running it for? We shouldn't need that
many iterations.

> If you know any R wizards, maybe they can speed it up a bit. I asked on
> the R-devel mailing list, where the message was not approved ("please
> ask on R-help") and then didn't get the kind of help I was looking for
> on R-help. So I am back to loops, ugly as they are, I don't want to
> waste too much time on this now.

All the ones I know are on this list.

- Bob

Tamas Papp

Jan 27, 2014, 11:36:49 AM
to stan-...@googlegroups.com
On Mon, Jan 27 2014, ca...@alias-i.com wrote:

> On 1/27/14, 4:47 PM, Tamas Papp wrote:
>> theta_rep <- array(NA, c(n_sims, J))
>> y_rep <- array(NA, c(n_sims, J))
>> for (s in 1:n_sims) {
>>   theta_rep[s,] <- rnorm(J, schools_sim$mu[s], schools_sim$tau[s])
>>   y_rep[s,] <- rnorm(J, theta_rep[s,], sigma)
>> }
>
> This you can do inside the generated quantities block of Stan,
> where it should be pretty fast.

That's what I am planning to do. Perhaps getting a posterior with MCMC,
and then using that with various Stan model files to experiment with
posterior predictive checks (my impression is that the simpler the Stan
code, the faster the compilation is).

> Is it slower than the simpler loops above?
>
> How many samples are you running it for? We shouldn't need that
> many iterations.
>

Considerably slower. In my problem (not the one I had code for above),
1000 iterations (1 chain) take about 1s with loops, and 13s with my
function. Which is OK-ish, but for a few parallel chains etc this
becomes too long.

Best,

Tamas

Bob Carpenter

Jan 27, 2014, 11:51:08 AM
to stan-...@googlegroups.com


On 1/27/14, 5:36 PM, Tamas Papp wrote:
> On Mon, Jan 27 2014, ca...@alias-i.com wrote:
>
>> On 1/27/14, 4:47 PM, Tamas Papp wrote:
>>> theta_rep <- array(NA, c(n_sims, J))
>>> y_rep <- array(NA, c(n_sims, J))
>>> for (s in 1:n_sims) {
>>>   theta_rep[s,] <- rnorm(J, schools_sim$mu[s], schools_sim$tau[s])
>>>   y_rep[s,] <- rnorm(J, theta_rep[s,], sigma)
>>> }
>>
>> This you can do inside the generated quantities block of Stan,
>> where it should be pretty fast.
>
> That's what I am planning to do. Perhaps getting a posterior with MCMC,
> and then using that with various Stan model files to experiment with
> posterior predictive checks (my impression is that the simpler the Stan
> code, the faster the compilation is).

Yes, up to a point. A lot of the compilation is just for
the samplers and the functions being called. The generated
quantities execution stuff should be much faster because
the instantiations are all double values.

>> Is it slower than the simpler loops above?
>>
>> How many samples are you running it for? We shouldn't need that
>> many iterations.
>>
>
> Considerably slower. In my problem (not the one I had code for above),
> 1000 iterations (1 chain) take about 1s with loops, and 13s with my
> function.

So what's the motivation for the function over using loops?
I, at least, find the loops so much easier to understand.

> Which is OK-ish, but for a few parallel chains etc this
> becomes too long.

We definitely don't need 1K iterations for posterior predictive checks!
I'd say you'd be good with 50 (or even fewer) for most purposes.

- Bob

Tamas Papp

Jan 28, 2014, 5:57:30 AM
to stan-...@googlegroups.com

On Mon, Jan 27 2014, ca...@alias-i.com wrote:

> So what's the motivation for the function over using loops?
> I, at least, find the loops so much easier to understand.

In general, I like functional solutions: they mean that I don't have to
create the array (or other structure) for collecting the results, and I
don't need to keep track of a loop index variable. Loops are OK for a
single level, but nested loops can become error-prone very quickly.

But my main language is Common Lisp, so I guess I prefer more functional
approaches simply because of habit (and, also, I tend to use R as if it
was a recalcitrant Lisp with a thin layer of infix syntax upon it).

Meanwhile, Hadley Wickham, the author of various great R packages for
working with data (including plyr), kindly suggested some other
approaches which are more idiomatic in R. I will experiment with these
and report back to the list.

>> Which is OK-ish, but for a few parallel chains etc this
>> becomes too long.
>
> We definitely don't need 1K iterations for posterior predictive checks!
> I'd say you'd be good with 50 (or even fewer) for most purposes.

Then I am confused, I was under the impression that one calculates a
replication for each posterior draw.

Best,

Tamas

Bob Carpenter

Jan 28, 2014, 6:17:13 AM
to stan-...@googlegroups.com


On 1/28/14, 11:57 AM, Tamas Papp wrote:
>
> On Mon, Jan 27 2014, ca...@alias-i.com wrote:
>
>> So what's the motivation for the function over using loops?
>> I, at least, find the loops so much easier to understand.
>
> In general, I like functional solutions: that means that I don't have to
> create the array (or other structure) I use or collecting the results,
> and I don't need to keep track of a loop index variable. Loops are OK
> for a single level, but nested loops can become error-prone very
> quickly.
>
> But my main language is Common Lisp, so I guess I prefer more functional
> approaches simply because of habit (and, also, I tend to use R as if it
> was a recalcitrant Lisp with a thin layer of infix syntax upon it).
>
> Meanwhile, Hadley Wickham, the author of various great R packages for
> working with data (including plyr), kindly suggested some other
> approaches which are more idiomatic in R. I will experiment with these
> and report back to the list.

I understand that tastes in programming style do vary. And the functional
approach is both loved and hated by R users I've talked to. I don't
understand R's functional programming styles, which is why I find the loops
easier to comprehend. I do love functional languages in general --- I spent
years working on typed higher-order lambda calculus and domain theory; it's
just R I find opaque.

>>> Which is OK-ish, but for a few parallel chains etc this
>>> becomes too long.
>>
>> We definitely don't need 1K iterations for posterior predictive checks!
>> I'd say you'd be good with 50 (or even fewer) for most purposes.
>
> Then I am confused, I was under the impression that one calculates a
> replication for each posterior draw.

No reason you have to do it for each posterior draw. Also, no reason
to have 1000 posterior draws for most examples. The default number of
iterations is going to change in the near future to something more
reasonable like 200.

What you should do is decide how much accuracy you need in the posterior quantities
of interest, such as parameter means or intervals. Then figure out how many
effective samples you need to get it, and only draw that many samples.

Same for posterior predictive checks.
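
Concretely (in R, with made-up numbers): the Monte Carlo standard error
of a posterior mean is roughly sd / sqrt(n_eff), so the sizing rule is
something like:

```r
## Rough sizing: effective draws needed so the Monte Carlo standard
## error of a posterior mean hits a target, assuming roughly
## independent draws (MCSE = sd / sqrt(n_eff)).
n_eff_needed <- function(posterior_sd, target_mcse) {
  ceiling((posterior_sd / target_mcse)^2)
}
n_eff_needed(1, 0.1)  # MCSE of 10% of the posterior sd: 100 draws
```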

- Bob

Tamas Papp

Jan 28, 2014, 6:46:15 AM
to stan-...@googlegroups.com

On Tue, Jan 28 2014, ca...@alias-i.com wrote:

> I understand that tastes in programming style do vary. And the functional
> approach is both loved and hated by R users I've talked to. I don't
> understand R's functional programming styles, which is why I find the loops
> easier to comprehend. I do love functional languages in general --- I spent
> years working on typed higher-order lambda calculus and domain theory; it's
> just R I find opaque.

I am coming to the same conclusion --- even if base R could accommodate
a functional style, many functions, especially in libraries, rely on
resolving variables in a dynamically created environment (eg the columns
of a data frame). Also, generic (non-homogeneous) vectors/arrays are
cumbersome and not well supported. So maybe I will have to change my
style when I program in R. OTOH, it is hard to miss the similarities
between Common Lisp and R: eg the S4 classes, etc.

>> Then I am confused, I was under the impression that one calculates a
>> replication for each posterior draw.
>
> No reason you have to do it for each posterior draw. Also, no reason
> to have 1000 posterior draws for most examples. The default number of
> iterations is going to change in the near future to something more
> reasonable like 200.
>
> What you should do is decide how much accuracy you need in the posterior quantities
> of interest, such as parameter means or intervals. Then figure out how many
> effective samples you need to get it, then only draw that many samples.
>
> Same for posterior predictive checks.

I guess my habit of drawing many chains, each with at least 2000-5000
iterations, comes from the days when I used inefficient samplers (which
I usually programmed myself). Indeed Stan is much more efficient so I
will have to adjust.

Thanks for all the advice.

Best,

Tamas

Bob Carpenter

Jan 28, 2014, 7:20:19 AM
to stan-...@googlegroups.com


On 1/28/14, 12:46 PM, Tamas Papp wrote:
>
> On Tue, Jan 28 2014, ca...@alias-i.com wrote:
...
>> What you should do is decide how much accuracy you need in the posterior quantities
>> of interest, such as parameter means or intervals. Then figure out how many
>> effective samples you need to get it, then only draw that many samples.
>>
>> Same for posterior predictive checks.
>
> I guess my habit of drawing many chains, each with at least 2000-5000
> iterations, comes from the days when I used inefficient samplers (which
> I usually programmed myself). Indeed Stan is much more efficient so I
> will have to adjust.

You can also thin the samples, even if the number of effective samples
per iteration is low.
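
In R, thinning is just indexing the draws (sketch, assuming a
draws-by-parameters matrix):

```r
## Sketch: keep every k-th draw from a draws-by-parameters matrix.
thin_draws <- function(draws, k) {
  draws[seq(1, nrow(draws), by = k), , drop = FALSE]
}
```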

- Bob
