Dear Aki, Michael, and Dustin,
Thanks for your comments and the links to the papers. Aki, I was aware of the first one as being (at least partially) included in Gelman et al. (2013), but it was still very informative to work through the detailed examples in this version. I have a quick follow-up question: Is it always the case that θ_mle equals the posterior mode (the MAP estimate), or does that hold only for uninformative priors? Stated differently, with uninformative priors could I compute an AIC using the MAP parameter estimates from my hierarchical model? (This still wouldn’t get the penalty term (expected vs. nominal number of parameters) right, though.)
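To make the question concrete, here is how I picture it under a flat prior, where the posterior is proportional to the likelihood and the MAP therefore coincides with the MLE (a toy sketch only; the normal model and all names are placeholders, not our actual model):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # Toy data from a normal model with unknown mean and scale.
    rng = np.random.default_rng(0)
    y = rng.normal(loc=1.0, scale=2.0, size=100)

    def neg_log_lik(theta):
        mu, log_sigma = theta
        return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

    # With a flat prior, maximizing the likelihood also finds the
    # posterior mode, so theta_hat is both the MLE and the MAP.
    fit = minimize(neg_log_lik, x0=np.zeros(2))
    k = fit.x.size                    # nominal number of parameters
    aic = 2.0 * fit.fun + 2.0 * k     # AIC = -2 log L(theta_hat) + 2k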
In addition, in section 3.5 (Point-wise vs. joint prediction) I found the following:
"A cost of using WAIC is that it relies on a partition of the data into n pieces, which is not so easy to do in some structured-data settings such as time series, spatial, and network data. AIC and DIC do not make this partition explicitly, but derivations of AIC and DIC assume that residuals are independent given the point estimate θ: conditioning on a point estimate θ eliminates posterior dependence at the cost of not fully capturing posterior uncertainty.”
Now, the data that we are fitting (in a hierarchical model) come from an associative learning / decision-making experiment, in which there is a serial dependence between data points such that the data obey the Markov property. Given the statement above, does that suggest that DIC would be preferable to WAIC, or is using the full posterior still superior to using a point estimate as in DIC?
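In case it helps to make this concrete: if the pointwise partition is valid for Markov data by taking each “piece” to be the one-step-ahead conditional term log p(y_t | y_{t-1}, θ), then from a matrix of such terms over posterior draws I would compute WAIC like this (a sketch; waic and log_lik are my placeholder names):

    import numpy as np
    from scipy.special import logsumexp

    def waic(log_lik):
        # log_lik: (S posterior draws) x (n data points) matrix of
        # pointwise log-likelihoods; for Markov data each column would
        # hold log p(y_t | y_{t-1}, theta_s) rather than log p(y_t | theta_s).
        S = log_lik.shape[0]
        lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))  # log pointwise predictive density
        p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))       # effective number of parameters
        return -2.0 * (lppd - p_waic)                          # on the deviance scale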
Finally, Dustin, are you suggesting that the reviewer is probably looking for a completely non-Bayesian approach to model fitting *and* selection? That is, might he be looking for model fitting via MLE with AIC/BIC calculated afterwards? We did this, and it results in a different winning model, but the MLE parameter estimates are often very suspicious, i.e. they sit at the boundaries of the parameter ranges; this was the primary reason we started using Bayesian hierarchical modeling in the first place a few years ago.
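For reference, this is roughly the shape of what we did for the MLE + AIC/BIC analysis (a minimal sketch with placeholder names; the boundary check at the end is what keeps flagging suspicious estimates):

    import numpy as np
    from scipy.optimize import minimize

    def fit_and_score(neg_log_lik, x0, bounds, n_obs):
        # Bounded MLE for one subject; neg_log_lik, x0, and bounds stand
        # in for our actual learning model's likelihood and parameters.
        fit = minimize(neg_log_lik, x0=x0, bounds=bounds, method="L-BFGS-B")
        k = len(x0)
        aic = 2.0 * fit.fun + 2.0 * k
        bic = 2.0 * fit.fun + k * np.log(n_obs)
        # Flag estimates that ended up pinned at a range boundary.
        at_boundary = [np.isclose(x, lo) or np.isclose(x, hi)
                       for x, (lo, hi) in zip(fit.x, bounds)]
        return fit.x, aic, bic, at_boundary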
Thanks again for your time and insights.
Jan