Guidance on interpreting loo's elpd


Tom Wallis

Mar 17, 2016, 5:46:08 AM
to Stan users mailing list
Hi,

I'm looking for guidance on legitimate interpretations of the elpd. Briefly: can the elpd difference between two models be interpreted as the log odds ratio of the posterior model probabilities?

Example:

Model 1 elpd_loo = -50
Model 2 elpd_loo = -48

abs difference: 2

From these results, could one state that Model 2 "describes the data" (in the sense of expected prediction error) about 7 times better than model 1 (i.e. exp(2)), on average? Similarly, if one converted the elpd difference to base 2 (2 / log(2) = 2.89), could one say that the average expected information gain of Model 2 relative to Model 1 is 2.89 bits?
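
To make the arithmetic I'm asking about explicit, here is a minimal R sketch using the example numbers above:

elpd_1 <- -50            # Model 1 elpd_loo
elpd_2 <- -48            # Model 2 elpd_loo
d <- elpd_2 - elpd_1     # difference of 2 (natural-log units)

exp(d)       # ~7.4: the proposed "about 7 times better on average" reading
d / log(2)   # ~2.89: the same difference expressed in bits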

I am seeking ways to describe the results of loo such that they could be more easily understood or communicated to non-mathematical audiences. Apologies if my interpretations are way off base (I am a member of that non-mathematical audience).

Aki Vehtari

Mar 17, 2016, 1:48:40 PM
to Stan users mailing list
On Thursday, March 17, 2016 at 11:46:08 AM UTC+2, Tom Wallis wrote:
I'm looking for guidance on legitimate interpretations of the elpd. Briefly: can the elpd difference between two models be interpreted as the log odds ratio of the posterior model probabilities?

Example:

Model 1 elpd_loo = -50
Model 2 elpd_loo = -48

abs difference: 2

From these results, could one state that Model 2 "describes the data" (in the sense of expected prediction error) about 7 times better than model 1 (i.e. exp(2)), on average? Similarly, if one converted the elpd difference to base 2 (2 / log(2) = 2.89), could one say that the average expected information gain of Model 2 relative to Model 1 is 2.89 bits?

You could state it like that, but then you are ignoring the uncertainty in the estimates. This approach has been proposed informally as a pseudo Bayes factor, but there are no papers showing formally or experimentally that it is a good way to report results. I recommend the approach described in Section 5.2 of "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC" (http://arxiv.org/abs/1507.04544), which is also implemented in the compare function in the loo package.
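
A rough sketch of that workflow in R, assuming both Stan models write pointwise log-likelihoods to a generated quantity named log_lik and that fit1 and fit2 are the fitted stanfit objects (in later loo releases, compare() is replaced by loo_compare()):

library(loo)

# Pointwise log-likelihood matrices (posterior draws x observations)
log_lik1 <- extract_log_lik(fit1)
log_lik2 <- extract_log_lik(fit2)

loo1 <- loo(log_lik1)
loo2 <- loo(log_lik2)

# Reports the elpd difference together with its standard error,
# which is the uncertainty the pseudo-Bayes-factor reading ignores.
compare(loo1, loo2)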

Aki

Aki Vehtari

Mar 17, 2016, 1:50:42 PM
to Stan users mailing list
On Thursday, March 17, 2016 at 11:46:08 AM UTC+2, Tom Wallis wrote:
I am seeking ways to describe the results of loo such that they could be more easily understood or communicated to non-mathematical audiences. Apologies if my interpretations are way off base (I am a member of that non-mathematical audience).

And you could use application-specific utility functions (like classification accuracy, quality-adjusted life years, mean absolute error, etc.), which would be more easily understood and communicated to non-mathematical audiences.
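
For example, a rough R sketch of one such utility, mean absolute error on held-out data (y_holdout and y_rep here are hypothetical objects from your own workflow, not loo output):

# y_holdout: vector of held-out outcomes
# y_rep: matrix of posterior predictive draws for those outcomes
#        (draws x observations, e.g. from a generated quantities block)
y_pred <- colMeans(y_rep)               # posterior predictive point estimates
mae <- mean(abs(y_holdout - y_pred))    # mean absolute error, in outcome units
mae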

Aki

Michael Betancourt

Mar 17, 2016, 2:33:11 PM
to stan-...@googlegroups.com
Remember that LOO, cross-validation, WAIC, and the like
are just estimators of some number that quantifies the _relative_
compatibility of the model with the data.

That number is tricky to interpret as information gain or loss because 
it’s really a difference of two different Kullback-Leibler divergences.
And it’s definitely not a probability, so it shouldn’t be interpreted
as such.

Really this is just a way to rank models relative to each other.



Tom Wallis

Mar 18, 2016, 5:30:20 AM
to stan-...@googlegroups.com
Great, thanks for the guidance all. 

So if I take Michael's last sentence, it's basically not appropriate to treat any of these comparators as distance metrics: they only provide ordinal information about relative model performance, and do not quantify *how* much better one model is than another.


Daniel Emaasit

Aug 19, 2016, 6:51:17 AM
to Stan users mailing list
@Tom Wallis, how did you finally interpret the results? I still did not quite understand Michael & Aki's responses. (FYI, I am also one of those non-mathematicians.)

Thanks,
-- Daniel

Tom Wallis

Sep 13, 2016, 8:36:23 AM
to Stan users mailing list
Hi Daniel,

Sorry for the late reply. 

I'm afraid I didn't state much more than what is written at the bottom of p. 20 of Aki et al.'s paper (v5):

"Comparing the models on PSIS-LOO reveals an estimated difference in elpd of 10.2 (with a standard error of 5.1) in favor of Model A."


So in my application I basically state that one model is preferred according to the LOO elpd estimate. Given that I'm comparing one model with and one model without an interaction term, and there's a pretty obvious interaction just from plotting the data, I think the model comparison results for my application are uncontroversial.
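
For what it's worth, here is a small R sketch of how one can weigh that kind of statement against the reported standard error (numbers from the quote above):

elpd_diff <- 10.2   # estimated elpd difference in favor of Model A
se_diff   <- 5.1    # its standard error

# Rough two-standard-error interval for the difference: about 0.0 to 20.4,
# so the interval only just excludes zero.
elpd_diff + c(-2, 2) * se_diff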


More generally though, I think it would be of great practical value if Aki and co-authors could include an example of converting elpd to a more intuitive scale for some given application (as stated in Aki's second reply, above).

Avraham Adler

Sep 13, 2016, 10:58:17 AM
to Stan users mailing list
On Tuesday, September 13, 2016 at 8:36:23 AM UTC-4, Tom Wallis wrote:
More generally though, I think it would be of great practical value if Aki and co-authors could include an example of converting elpd to a more intuitive scale for some given application (as stated in Aki's second reply, above).


If you've multiplied by -2 and are on the deviance scale, you can probably use Burnham & Anderson's (2002, §2.6) AIC rules of thumb, as do Spiegelhalter et al. (2002, §9.2.4) for DIC, since they are all estimates of the expected log pointwise predictive density penalized for bias, I believe. Take the difference between any model's WAIC or elpd_loo on the deviance scale and that of the model with the minimal value. B&A's heuristics are:

  • Difference of 0 to 2: substantial evidence for the second model as well
  • Difference of 4 to 7: considerably less evidence for the second model relative to the first
  • Difference greater than 10: essentially no evidence for the second model

Remember that it is the difference that is important; the absolute magnitude is more a function of the number of data points than anything else.
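
A minimal R sketch of that conversion, with made-up elpd_loo values for illustration:

elpd <- c(model_A = -48, model_B = -50)   # hypothetical elpd_loo estimates

dev <- -2 * elpd        # the same estimates on the deviance scale
dev - min(dev)          # differences from the best model (0 and 4 here),
                        # which is what the AIC-style heuristics refer to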


Avi


References:

Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed.). Springer, New York.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64(4), 583-639.
Tom Wallis

Sep 28, 2016, 9:31:54 AM
to Stan users mailing list
Thanks for the pointers Avi, that's very helpful.

In general it would be great if Aki and co-authors would consider including guidance along these lines in their final loo paper.