a Waic example!


Andrew Gelman

unread,
Dec 28, 2013, 8:53:16 AM12/28/13
to stan-...@googlegroups.com, stan...@googlegroups.com
Hi all. I've programmed a simple example of WAIC for the example on p.165 and p.177 of BDA3. Attached are three files:
waic_example.R
lm_waic.stan
hibbs.dat

I have some thoughts on how we could alter Stan to do this automatically. The simplest approach would be to just put a function waic() in rstan that looks in the output for the variable "log_lik" and, if such a variable exists, calculates Waic. A more comprehensive approach would allow tagging of the Stan model so that certain lines of code would be identified with the loglikelihood so that "log_lik" would get calculated automatically.

By the way: in calculating Waic, we do want to use the full loglik (including the normalizing factor); this gets done automatically when we use increment_log_prob in Stan. If we did it using tagging, we'd just have to be sure to remember to include those factors.
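For concreteness, here is a sketch of what such a waic() helper would compute, written in Python/NumPy purely for illustration (the attached waic_example.R is the actual implementation); it takes the S x n matrix of pointwise log-likelihoods and applies the lppd and p_waic formulas from BDA3:

```python
# Illustrative sketch (not the attached R code): WAIC from an S x n matrix
# of pointwise log-likelihoods, S posterior draws by n data points.
import numpy as np

def waic(log_lik):
    S = log_lik.shape[0]
    # lppd: sum over points of the log of the posterior-mean likelihood,
    # computed on the log scale to avoid underflow
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    # p_waic: sum over points of the posterior variance of the log-likelihood
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```

A waic() in rstan could do essentially this after pulling "log_lik" out of the fitted object.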

---

waic_example.R
lm_waic.stan
hibbs.dat

Shige Song

unread,
Dec 28, 2013, 4:49:56 PM12/28/13
to stan-...@googlegroups.com
It will be great if this can be incorporated into the rstan package.

Shige



--
You received this message because you are subscribed to the Google Groups "stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Bob Carpenter

unread,
Dec 28, 2013, 4:59:50 PM12/28/13
to stan-...@googlegroups.com


On 12/28/13, 4:49 PM, Shige Song wrote:
> Will be great if it can be incorporated into the rstan package.

It'll have to be changed in basic Stan first so that
there's a way to specify the log probabilities in the algorithm.

As Andrew brought up, it's a bit tricky now because of the interpretation
of up-to-a-proportion calcs. One thing I could do is template out the
propto template parameter for both the sampling statements and
log probability functions and let that be specified at compile time.
But there's still the issue of how to divide the data up and specify the
model. It won't be something we can do without user intervention in
setting up all the intermediate log probability variables needed.

- Bob

Andrew Gelman

unread,
Dec 29, 2013, 11:16:29 AM12/29/13
to stan-...@googlegroups.com, stan...@googlegroups.com
We could do Waic in rstan as is (just using the function I already wrote!), conditional on the user creating the log_lik vector correctly in the Stan program. But maybe it makes sense to get it working in regular Stan using tagging. That sort of user intervention (tagging lines in the model as components in the loglik vector) would be necessary in any case. There's no way that it could be done automatically given only the log posterior function with no labeling.

I do think, however, that we should be able to do it without setting up intermediate log probability variables. I think it should be possible (at least for many common examples) to just have the user tag the appropriate lines in the model as log-likelihood terms.

A

Linas Mockus

unread,
Jan 6, 2014, 2:40:13 PM1/6/14
to stan-...@googlegroups.com, stan...@googlegroups.com, gel...@stat.columbia.edu
Thanks for posting the example. 

I am trying to compare two models for count data: gamma and Poisson.  The WAIC for the Poisson model is Inf, while the WAIC for the gamma model is 2424.  I assume this means that the gamma model is much better in terms of predictive capability.  My question is why the WAIC for the Poisson model is Inf.  The data are under-dispersed, so I know the Poisson model is not well suited.

Code is attached.

Thank you,
Linas
gamma.stan
waic.r
poisson.stan

Andrew Gelman

unread,
Jan 6, 2014, 4:15:05 PM1/6/14
to Linas Mockus, stan-...@googlegroups.com, stan...@googlegroups.com
Hi--I don't know why WAIC is Inf; that sounds like a mistake or a problem, possibly something where the expected value is 0 but the data are nonzero.  Also, I think there's a mistake in your model: it looks like you're counting the data twice.  If you include the increment_log_prob statement, you don't need to also express your model using ~.

gamma.stan
waic.r
poisson.stan

Linas Mockus

unread,
Jan 6, 2014, 5:12:30 PM1/6/14
to stan-...@googlegroups.com, stan...@googlegroups.com, gel...@stat.columbia.edu
Thanks a lot for pinpointing the double-counting mistake.  I am a novice with rstan.

After correcting the Poisson model I am still getting Inf.  As you suggested, the reason is that the likelihood is essentially 0.  I am curious whether scaling could help.
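A small numerical illustration (in Python, as a sketch; the -800 figure is invented) of why WAIC comes out Inf when the likelihood is essentially zero, and of how shifting by the maximum before exponentiating, which is the kind of scaling in question, keeps the result finite:

```python
import numpy as np

# If every draw assigns essentially zero probability to a data point,
# exp() underflows to 0, the lppd term becomes -inf, and WAIC is +inf.
log_lik = np.full(1000, -800.0)           # hypothetical extreme log-densities
naive = np.log(np.mean(np.exp(log_lik)))  # exp(-800) underflows to 0.0 -> -inf

# Shifting by the max before exponentiating recovers the exact answer:
m = log_lik.max()
stable = m + np.log(np.mean(np.exp(log_lik - m)))  # exactly -800.0
```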


Leon-Alph

unread,
Jan 16, 2014, 4:30:11 PM1/16/14
to stan-...@googlegroups.com, stan...@googlegroups.com, gel...@stat.columbia.edu
Thank you for your work, and thanks to the whole RStan team.

I tried to compute WAIC (and DIC) by randomly subsampling parameters from the extract() function.
I did this to save computing time, as my code is very time-consuming. Does this seem correct, or is it against the Bayesian idea of WAIC?

Leon

Andrew Gelman

unread,
Jan 16, 2014, 4:34:24 PM1/16/14
to stan-...@googlegroups.com
I do not know what you mean, perhaps you can explain.

Leon-Alph

unread,
Jan 17, 2014, 4:16:17 AM1/17/14
to stan-...@googlegroups.com, gel...@stat.columbia.edu
In your paper "Understanding predictive information criteria for Bayesian models", I took the formula on page 21.
Taking the code you presented in this post, I computed WAIC based on a subsample of the draws (in the attached file), which gives a similar result if S.n is big enough.

I understand that the way you did it is better, but with my model the computing time was very long when I added something similar to "increment_log_prob(sum(log_lik));". Maybe that's because I have some trouble using vectors and matrices rather than loops.
Thank you. And thanks for your paper.
waic_example.R

Bob Carpenter

unread,
Jan 17, 2014, 7:30:14 AM1/17/14
to stan-...@googlegroups.com
One reason that adding the log likelihoods in the model itself
may be problematic in terms of speed is that unlike the sampling
statements, the log probability functions don't drop constants.
And they require more I/O to save all the intermediate values.

- Bob

Leon-Alph

unread,
Jan 17, 2014, 8:47:27 AM1/17/14
to
I understand. So, while I'm playing with several models I'll measure WAIC in my rough way, and add the log likelihoods later.
Thank you.
Message has been deleted

Bob Carpenter

unread,
Apr 19, 2014, 8:14:19 AM4/19/14
to stan-...@googlegroups.com

On Apr 18, 2014, at 11:49 PM, Brian Hayden <bha...@lbl.gov> wrote:

> What is the difference between doing log_lik in the transformed parameters block and using increment_log_prob in the model block, versus doing it in a generated quantities block like this:
>
> generated quantities {
>   real log_lik;
>   log_lik <- 0;
>   for (n in 1:N) {
>     log_lik <- log_lik + normal_log(y[n], X[n]*b, sigma);
>   }
> }
>
> I am asking mainly because I am trying to implement waic and I want to make sure the generated quantities version doesn't do something different.

They'll give you the same answer because Stan only cares about
the log probability function up to a constant as defined over
the parameters --- everything else is about what gets output.

So the two approaches will not give you a different answer.
But there are some subtle differences to what happens under the hood.

Transformed parameters are derived from parameters. This means they
involve heavy automatic differentiation types. This happens
every log prob evaluation, which is once per leapfrog step in
HMC/NUTS, which is roughly 2^treedepth per iteration.

Generated quantities are computed once per iteration using double
precision floating point values in C++ instead of auto-diff variables and
thus are faster to evaluate.

Another issue is that the function normal_log() does not drop constant
terms whereas the sampling statement version with ~ does. We may clean
this all up in the future so they do the same thing or are ideally
configurable, but that's how it works now and it means a bit more overhead
for calling normal_log() than using sampling statements. And if you
want to reuse these pieces to compute the log prob, then you have
to forego the built-in vectorization possibilities in the sampling
distributions.

Andrew prefers the generated quantities version because (a) it's faster,
and (b) it means the model can be written as it would be written if you weren't
computing WAIC. In contrast, the transformed parameter approach is nice because
you only have to write the model once, even if it's coded a bit differently
than it would otherwise be coded.

This general advice is all buried in the manual in the description of
which variables get computed where.

You probably want an array of log likelihoods, one for each data
point, too.

- Bob

Brian Hayden

unread,
Apr 19, 2014, 11:23:47 AM4/19/14
to stan-...@googlegroups.com
Thanks for the response. I had deleted my post because I realized I was missing the critical part of having the array of likelihoods for each data point. I am working in python with pystan and the R code confused me. I did manage to compute waic.

Andrew Gelman

unread,
Apr 19, 2014, 4:28:04 PM4/19/14
to Brian Hayden, stan-...@googlegroups.com, stan...@googlegroups.com, Aki Vehtari
We discuss this in the (not quite released) Waic paper.  The short answer is I like to do it like this:

data {
  int N;
  int J;
  vector[N] y;
  matrix[N,J] X;
}
parameters {
  vector[J] b;
  real<lower=0> sigma;
}
model {
  y ~ normal(X*b, sigma);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N){
    log_lik[n] <- normal_log(y[n], X[n]*b, sigma);
  }
}

I have two reasons for preferring the above form.  First, the Waic-computing code is just appended to the end of an existing model, so I don’t need to mess with the model itself.  Second, by not messing with the model, I can do the inference in vectorized form, which is faster.  In any case I need to save log_lik as a vector, not a scalar, because Waic needs all the components separately.

A

On Apr 18, 2014, at 11:49 PM, Brian Hayden <bha...@lbl.gov> wrote:

What is the difference between doing log_lik in the transformed parameters block and using increment_log_prob in the model block, versus doing it in a generated quantities block like this:

generated quantities {
    real log_lik;
    log_lik <- 0;
    for (n in 1:N){
        log_lik <- log_lik + normal_log(y[n], X[n]*b, sigma);
    }
}

I am asking mainly because I am trying to implement waic and I want to make sure the generated quantities version doesn't do something different.

Brian Hayden

unread,
May 1, 2014, 12:23:06 PM5/1/14
to stan-...@googlegroups.com, Brian Hayden, stan...@googlegroups.com, Aki Vehtari, gel...@stat.columbia.edu
I've been working on something else for the last couple of weeks, but in thinking about my Stan-based project today, I had some other thoughts about WAIC. I am doing essentially a model comparison project, and I had been using AIC as my metric of comparison. The Stan "optimizing" function was not giving me reliable results in testing on simulated data, so I was getting the maximum likelihood estimate by simply taking far more samples than needed for estimating posterior statistics and picking out the maximum likelihood from there.

For WAIC, tracking all of the pointwise likelihoods becomes a memory problem very quickly. However, it occurs to me that with waic, one shouldn't need many more samples than what is required to obtain good statistics on the parameters of the model. Since waic is based on the mean and variance of each pointwise likelihood, when I am satisfied that the "error" in waic (simply due to the error of the mean and variance that goes into waic) is sufficiently low, I don't need more samples. Is this true?

-Brian

Andrew Gelman

unread,
May 1, 2014, 1:13:22 PM5/1/14
to Brian Hayden, stan-...@googlegroups.com, stan...@googlegroups.com, Aki Vehtari
Hi—how many data points do you have in your example?

Brian Hayden

unread,
May 1, 2014, 2:00:53 PM5/1/14
to stan-...@googlegroups.com, Brian Hayden, stan...@googlegroups.com, Aki Vehtari, gel...@stat.columbia.edu
There are ~400 data points with ~800 parameters defined in the parameter block. Each data point has certain measurements associated with it and we draw a "true" value for these measurements from a prior distribution (also fit by the model) each sample.

-Brian

Andrew Gelman

unread,
May 1, 2014, 2:02:17 PM5/1/14
to Brian Hayden, stan-...@googlegroups.com, stan...@googlegroups.com, Aki Vehtari
400 points doesn’t seem like so much; I don’t see why this would overwhelm your memory.

Brian Hayden

unread,
May 1, 2014, 2:14:41 PM5/1/14
to stan-...@googlegroups.com, Brian Hayden, stan...@googlegroups.com, Aki Vehtari, gel...@stat.columbia.edu
Oh, because I was sampling ~500,000 times in order to get a relatively reliable maximum likelihood estimate for the AIC. I am using pystan, and when I go to calculate WAIC, python spits out a memory error when calculating the required mean and variance.

There is a related issue, which is that in pystan, there is currently a bug where the user cannot tell stan to 'ignore' any parameters, so sampling from a large model runs out of memory pretty quickly (in this case the amount of memory available to me is ~3 GB on a compute node).

Regardless, my main question is: is it true that with WAIC I am only really concerned with how well I measure the mean and variance terms from BDA (lppd and pwaic2), which basically requires the same number of samples as is required to get good statistics on the posterior of a parameter?

-Brian

Andrew Gelman

unread,
May 1, 2014, 2:30:57 PM5/1/14
to Brian Hayden, stan-...@googlegroups.com, stan...@googlegroups.com, Aki Vehtari
There is no way you should be doing 500,000 iterations in Stan (I think).  What is your effective sample size?

Marco Inacio

unread,
May 1, 2014, 2:35:47 PM5/1/14
to stan-...@googlegroups.com
Hi, I don't know if anybody has already explored this idea, but the
following small modification of your code avoids exponentiation of log
likelihood:

waic <- function(stanfit) {
  log_sum_exp <- function(x) {
    x_max <- max(x)
    x_max + log(sum(exp(x - x_max)))
  }

  log_lik <- extract(stanfit, "log_lik")$log_lik
  summands <- apply(log_lik, 2, log_sum_exp)
  correc <- -ncol(log_lik) * log(nrow(log_lik))
  lppd <- sum(summands) + correc
  p_waic_1 <- 2 * (sum(summands - colMeans(log_lik)) + correc)
  p_waic_2 <- sum(colVars(log_lik))
  waic_2 <- -2 * lppd + 2 * p_waic_2
  return(list(waic = waic_2, p_waic = p_waic_2, lppd = lppd,
              p_waic_1 = p_waic_1))
}


There was almost no difference in your example though:
> results #modified code
$waic
[1] 87.09191

$p_waic
[1] 2.689071

$lppd
[1] -40.85688

$p_waic_1
[1] 2.267086

> waic(fit) #non modified code
$waic
[1] 87.09191

$p_waic
[1] 2.689071

$lppd
[1] -40.85688

$p_waic_1
[1] 2.267086


> print(results$p_waic_1, digits=22)
[1] 2.267086340125842980342
> print(waic(fit)$p_waic_1, digits=22)
[1] 2.267086340125830545844

> print(results$waic, digits=22)
[1] 87.09191109667641228498
> print(waic(fit)$waic, digits=22)
[1] 87.09191109667644070669

Brian Hayden

unread,
May 1, 2014, 3:36:21 PM5/1/14
to stan-...@googlegroups.com, Brian Hayden, stan...@googlegroups.com, Aki Vehtari, gel...@stat.columbia.edu
The autocorrelation time of the parameters that end up in my final chain object is very close to 1 for all parameters. So I certainly am taking way (!) too many samples.

Some of the reasons I was doing so many samples:
1) computing is relatively cheap. By submitting embarrassingly parallel jobs, 500,000 samples still only takes ~10 minutes walltime.
2) I wanted to make 'triangle' posterior plots with very smooth contours. I could probably do a kernel density estimate to get the contours, but again, it's cheap (in time and resources, except memory) for me to just run a ton of samples.
3) Testing had indicated there were differences in the maximum likelihood estimate between different runs, and so I kept ramping up the number of samples until the maximum likelihood estimate was relatively stable across different runs. This hasn't been properly tested, but I am moving away from AIC anyway (hence my interest in WAIC).

My goal wasn't necessarily to have you help diagnose and improve my method (though I certainly appreciate the comments); it was mainly to raise the question "how many samples are necessary for WAIC?" My general point was that if you want to use AIC with MCMC samples, you have to get a good estimate of the maximum likelihood, which seems to quite obviously require a lot more samples than the posterior statistics needed for WAIC. Maybe this is a blatantly obvious point, and I'm just overstating it.

To make decisions about the number of samples needed for WAIC, can one simply propagate the error of the mean and error of the variance (meaning for lppd and pwaic2) into a statistical error on WAIC, and use that as a metric? Is there anything fundamentally wrong with this idea, or is there a better approach?

Richard McElreath

unread,
May 1, 2014, 5:17:52 PM5/1/14
to stan-...@googlegroups.com
(This has mixed topics at this point. Responding to Marco here.)

I've been using that log_sum_exp trick myself to compute WAIC from stanfit samples. Like you, I haven't found any cases in which it gives a different result. Still, I favor doing the log_sum_exp for the safety of it.

Bob Carpenter

unread,
May 4, 2014, 5:40:56 PM5/4/14
to stan-...@googlegroups.com

On May 1, 2014, at 8:35 PM, Marco Inacio <marcoig...@gmail.com> wrote:

> Hi, I don't know if anybody has already explored this idea, but the following small modification of your code avoids exponentiation of log likelihood:
>
> waic <- function(stanfit) {
>   log_sum_exp <- function(x) {
>     x_max <- max(x)
>     x_max + log(sum(exp(x - x_max)))
>   }

This is an excellent function to avoid overflow when doing addition
on the log scale. Algebraically,

log(a + b) -> log_sum_exp(log(a), log(b))

log(a * b) -> log(a) + log(b)

It avoids overflow in exp() because x - x_max <= 0.
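For reference, a minimal Python version of the same function (the R version quoted above is equivalent):

```python
import math

def log_sum_exp(xs):
    # log(exp(x1) + ... + exp(xn)), computed so that exp() never
    # overflows: every exponent x - x_max is <= 0.
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))
```

For example, log_sum_exp([1000, 1000]) returns 1000 + log(2), where the naive log(exp(1000) + exp(1000)) would overflow.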

- Bob

Bob Carpenter

unread,
May 4, 2014, 5:45:24 PM5/4/14
to stan-...@googlegroups.com
I didn't want this question to get lost, because
it seems like a good one:

Aki Vehtari

unread,
May 5, 2014, 7:47:51 AM5/5/14
to stan-...@googlegroups.com
Hi,
I had answered this, but my answer did not appear here due to a problem with using an @aalto.fi email address for Google services, which should be fixed now.
Here's the answer again.

Yes, you can use the usual approach for estimating the Monte Carlo error. It is quite easy to make this error small. Usually most of the uncertainty in WAIC (and AIC, DIC, CV, etc.) is due to not knowing the true future data distribution. See Vehtari and Ojanen (2012), "A survey of Bayesian predictive methods for model assessment, selection and comparison," Statistics Surveys, 6:142-228, http://dx.doi.org/10.1214/12-SS102, for further discussion.
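As a rough sketch of that usual approach (the batching scheme and function names here are illustrative, and it assumes the draws are roughly independent), one can compute WAIC on batches of draws and take the standard error across batches as the Monte Carlo error:

```python
import numpy as np

def waic_of(log_lik):
    # WAIC from an S x n matrix of pointwise log-likelihoods (BDA3 formulas)
    S = log_lik.shape[0]
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

def waic_mc_error(log_lik, n_batches=20):
    # Batch-means estimate of the Monte Carlo standard error of WAIC
    batches = np.array_split(log_lik, n_batches, axis=0)
    vals = np.array([waic_of(b) for b in batches])
    return vals.std(ddof=1) / np.sqrt(n_batches)
```

With autocorrelated draws, one would thin first or use an effective-sample-size-based estimate instead.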

Aki