help with reviewer comments on model selection with DIC


glae...@gmail.com

Mar 1, 2016, 10:08:58 AM
to Stan users mailing list
Dear Stan experts,

we have some trouble with a reviewer of a recent paper by one of my grad students. We estimated a large variety of models for our data with JAGS at the time and used DIC for model selection. Not ideal by today’s standards, but 2.5 years ago this was our approach to hierarchical modeling and model comparison.

Anyway, the reviewer criticized our use of DIC because it tends to select overfitted models. He requested that we also provide some non-Bayesian information criterion (AIC, BIC). In the first rebuttal, we argued against this by pointing out that DIC is the more appropriate index for hierarchical models, and we listed the model likelihoods to demonstrate that the selected model had the best fit to the data. We also said that while we could provide BIC or AIC estimates, they would not be as accurate for model comparison, since hierarchical Bayesian models are fundamentally different from classical models (meaning non-hierarchical models fitted with MLE).

In his 2nd revision, the reviewer insisted that we should confirm model selection with other non-Bayesian methods. “Otherwise, one is left wondering about the robustness of the conclusions.”

I was wondering if Aki or the other model selection experts on this list have additional and stronger arguments (preferably backed up by papers) that we could use to justify our DIC-based model selection.

Thanks a lot for your insights.

Best wishes,
Jan

Aki Vehtari

Mar 1, 2016, 10:28:32 AM
to Stan users mailing list
I guess you already know this one http://www.stat.columbia.edu/~gelman/research/published/waic_understand3.pdf , which explains why DIC is better than AIC, although WAIC and LOO are even better.
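For reference, WAIC can be computed directly from a matrix of pointwise log-likelihood draws. A minimal NumPy sketch (the iid normal toy model and fake posterior draws below are illustrative assumptions, not the models discussed in this thread):

```python
import numpy as np

def waic(log_lik):
    """WAIC (deviance scale) from an (S draws x N observations) matrix
    of pointwise log-likelihood values."""
    S = log_lik.shape[0]
    # lppd: log pointwise predictive density, a log-mean-exp over draws
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    # p_waic: posterior variance of the log-likelihood, summed over points
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)  # deviance scale, comparable to DIC

# toy example: fake posterior draws of mu for an iid normal model
rng = np.random.default_rng(0)
y = rng.normal(size=20)
mu = rng.normal(0.0, 0.1, size=1000)
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2
print(round(waic(log_lik), 1))
```

The same S x N log-likelihood matrix is what the LOO approach in the linked paper consumes, so saving it once from the sampler covers both criteria.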

Maybe the reviewer would be happy with cross-validation http://arxiv.org/abs/1507.04544 ?

Aki

Michael Betancourt

Mar 1, 2016, 11:48:16 AM
to stan-...@googlegroups.com
Did anyone point out that BIC, referring to “Bayesian Information Criterion”,
is an entirely Bayesian concept?  BIC is a linearized approximation to the
(Bayesian) marginal likelihood.  It also has nothing to do with the other
information criteria.
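For what it's worth, the contrast is visible in the formulas themselves: both criteria subtract twice the maximized log-likelihood, but BIC's penalty comes from an approximation to the log marginal likelihood and grows with the sample size. A minimal sketch with arbitrary numbers, just to show the two penalties side by side:

```python
import numpy as np

def aic(max_loglik, k):
    # AIC: 2k - 2 * maximized log-likelihood
    return 2.0 * k - 2.0 * max_loglik

def bic(max_loglik, k, n):
    # BIC: k * log(n) - 2 * maximized log-likelihood; the penalty comes
    # from an approximation to the log marginal likelihood, so BIC is
    # Bayesian in origin even though it is computed from a point estimate
    return k * np.log(n) - 2.0 * max_loglik

# arbitrary numbers: once n > e^2 (about 7.4), BIC penalizes parameters
# harder than AIC does
print(aic(-100.0, 5), round(bic(-100.0, 5, 50), 2))
```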

As Aki noted, AIC, DIC, LOO, and WAIC are all essentially approximations
to posterior predictive cross-validation, with LOO and WAIC being the
best approximations.  They are not different methods of model comparison,
and hence it makes no sense to argue that they provide orthogonal information.

--
You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
To post to this group, send email to stan-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dustin Tran

Mar 1, 2016, 11:57:44 AM
to stan-...@googlegroups.com
I think the reviewer’s comment, which is a common one, is to look at an evaluation criterion that is not justified using Bayesian analysis nor requires taking posterior expectations.

Dustin

Jan Gläscher

Mar 2, 2016, 3:27:49 PM
to stan-...@googlegroups.com
Dear Aki, Michael, and Dustin,

thanks for your comments and the links to the papers. Aki, I was aware of the first one as being (at least partially) included in Gelman et al. (2013), but it was still very informative to work through the detailed examples in this version. I have a quick follow-up question: Is it always the case that θ_mle equals the posterior mode (the MAP estimate), or does that only hold for uninformative priors? Stated differently, with uninformative priors could I compute an AIC using the MAP parameter estimates from my hierarchical model? (This still wouldn’t get the penalty term (expected vs. nominal number of parameters) right, though.)

In addition, in section 3.5 (Point-wise vs. joint predictive prediction) I found the following:

"A cost of using WAIC is that it relies on a partition of the data into n pieces, which is not so easy to do in some structured-data settings such as time series, spatial, and network data. AIC and DIC do not make this partition explicitly, but derivations of AIC and DIC assume that residuals are independent given the point estimate θ: conditioning on a point estimate θ eliminates posterior dependence at the cost of not fully capturing posterior uncertainty."

Now, the data that we are fitting (in a hierarchical model) come from an associative learning / decision-making experiment in which there is a serial dependence of data points, such that the data obey the Markov property. Given the statement above, does that suggest that DIC would be preferable over WAIC, or is the use of the full posterior still superior to using a point estimate as in DIC?

Finally, Dustin, are you suggesting that the reviewer is probably looking for a completely non-Bayesian approach to model fitting *and* selection, i.e., fitting the models with MLE and subsequently computing AIC/BIC? We did this, and it results in a different winning model, but the MLE parameter estimates are often very suspicious, i.e., they are at the range boundaries of the parameters; this was the primary reason why we started using Bayesian hierarchical modeling in the first place a few years ago.

Thanks again for your time and insights.
Jan




Aki Vehtari

Mar 2, 2016, 3:55:15 PM
to Stan users mailing list
On Wednesday, March 2, 2016 at 10:27:49 PM UTC+2, Jan Gläscher wrote:
> thanks for your comments and the links to the papers. Aki, I was aware of the first one as being (at least partially) included in Gelman et al. (2013), but it was still very informative to work through the detailed examples in this version. I have a quick follow-up question: Is it always the case that θ_mle equals the posterior mode (the MAP estimate), or does that only hold for uninformative priors?

Only for a uniform prior (there are also non-uniform priors that are called uninformative).
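A quick numerical illustration of why this holds only for a uniform prior: with a flat prior the log-posterior is the log-likelihood plus a constant, so the two modes coincide, while any informative prior shifts the MAP away from the MLE. (A sketch with a toy normal model; the N(0, 0.5) prior and the toy data are arbitrary choices for illustration.)

```python
import numpy as np

# toy data with true mean 2 and a grid search over the mean parameter
rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=10)

grid = np.linspace(-5.0, 5.0, 10001)
# normal log-likelihood (sigma = 1) evaluated on the grid
log_lik = np.array([-0.5 * np.sum((y - m) ** 2) for m in grid])

mle = grid[np.argmax(log_lik)]             # maximizes the likelihood
map_flat = grid[np.argmax(log_lik + 0.0)]  # flat prior: identical mode
log_prior = -0.5 * (grid / 0.5) ** 2       # N(0, 0.5) prior, illustrative
map_informative = grid[np.argmax(log_lik + log_prior)]

# flat-prior MAP equals the MLE; the informative prior shrinks the mode
print(mle == map_flat, abs(map_informative) < abs(mle))
```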
 
> Stated differently, with uninformative priors could I compute an AIC using the MAP parameter estimates from my hierarchical model? (This still wouldn’t get the penalty term (expected vs. nominal number of parameters) right, though.)

In my experience, the MAP of the full posterior is not a good choice in hierarchical models.
 
> In addition, in section 3.5 (Point-wise vs. joint predictive prediction) I found the following:

> "A cost of using WAIC is that it relies on a partition of the data into n pieces, which is not so easy to do in some structured-data settings such as time series, spatial, and network data. AIC and DIC do not make this partition explicitly, but derivations of AIC and DIC assume that residuals are independent given the point estimate θ: conditioning on a point estimate θ eliminates posterior dependence at the cost of not fully capturing posterior uncertainty."

> Now, the data that we are fitting (in a hierarchical model) come from an associative learning / decision-making experiment in which there is a serial dependence of data points, such that the data obey the Markov property. Given the statement above, does that suggest that DIC would be preferable over WAIC, or is the use of the full posterior still superior to using a point estimate as in DIC?

WAIC is better than DIC. Depending on your application and the decision task, it is possible that all methods making n point-wise predictions are bad choices. It is hard to give exact advice without knowing your model and what you want to predict.
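One alternative for serially dependent data is leave-future-out (one-step-ahead) evaluation: score each observation with a model fit only to earlier data, so the predictive never conditions on the future. A minimal sketch; the running-mean Gaussian "model" is a stand-in for whatever model is actually fit, purely for illustration:

```python
import numpy as np

def lfo_log_score(y, warmup=5):
    """Sum of one-step-ahead Gaussian log predictive densities, where each
    y[t] is scored by a 'model' (running mean and sd) fit only to y[:t]."""
    scores = []
    for t in range(warmup, len(y)):
        mu = np.mean(y[:t])
        sigma = np.std(y[:t], ddof=1)
        # Gaussian log predictive density of the next observation
        scores.append(-0.5 * np.log(2 * np.pi * sigma ** 2)
                      - 0.5 * (y[t] - mu) ** 2 / sigma ** 2)
    return float(np.sum(scores))

# toy serially dependent series (random walk plus noise), illustrative only
rng = np.random.default_rng(2)
y = np.cumsum(rng.normal(size=50)) * 0.1 + rng.normal(size=50)
print(round(lfo_log_score(y), 2))
```

Higher is better when comparing models on the same series; the first `warmup` points are used only for the initial fit.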

Aki