Hi Paul,
The behaviour of log p(y_i|y_{-i}) is probably best understood through the lens of proper scoring rules (see e.g. Gneiting and Raftery's 2007 JASA paper), as it's essentially the log-score.
Let the true predictive distribution be G, the claimed predictive distribution F, and the observed value y, and denote the score by S(F, y). (For CPO, the distributions are the leave-one-out predictive distributions.) log-CPO is a positively oriented score, i.e. a large value is "good".
A proper score has the property that its expectation over the true distribution is optimized when the claimed distribution F matches the true one, G, i.e.
S(F, G) := E_{y~G}[S(F, y)]
is maximized when F = G.
Assume that the true predictive distribution is G = N(0, 1). Then F = N(0, 10^6) and F = N(0, 10^{-6}) would both receive a lower score than F = N(0, 1), on average.
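This is easy to verify numerically. The sketch below (a Monte Carlo check, not anything model-specific) estimates the expected log-score S(F, G) under G = N(0, 1) for three claimed predictive scales; the standard deviations 10^{-3} and 10^3 stand in for the over- and under-confident alternatives:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.standard_normal(100_000)  # draws from the true G = N(0, 1)

# Monte Carlo estimate of S(F, G) = E_{y~G}[log f(y)] for three claimed sds
scores = {sd: norm.logpdf(y, loc=0, scale=sd).mean() for sd in (1e-3, 1.0, 1e3)}
print(scores)
```

The correct scale 1.0 gets the highest average log-score; the too-narrow predictive is punished far more harshly than the too-wide one, which is typical of the log-score.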
So we can use proper scores to compare different predictions under the same circumstances, e.g. two different models for the same data.
But in your example it seems you wanted to change what you were conditioning _on_, and that's not the situation handled by these methods. They only cover the case where there is a fixed (but unknown) distribution to be predicted.
So the total sum or average of log-CPO values isn't really useful in itself. But if you take the pairwise differences between the scores computed from two different models for the same data, then they become more comparable. The variability of each difference still depends on the true G_i distribution, but at least the ordering is informative.
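As a minimal sketch of that pairwise comparison, with made-up log-CPO vectors standing in for the output of two fitted models (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical log-CPO values from models A and B on the same 5 observations
log_cpo_a = np.array([-1.2, -0.8, -2.1, -0.9, -1.5])
log_cpo_b = np.array([-1.4, -0.7, -2.6, -1.1, -1.6])

# Pointwise paired differences: a positive entry favours model A
# for that observation, and the sum gives the overall ordering.
diff = log_cpo_a - log_cpo_b
print(diff, diff.sum())
```

The pairing is what matters: each difference is taken under the same G_i, so the per-observation variability partly cancels, unlike the raw sums.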
If one knew the true G_i, one could compute the theoretical maximum score expectation, but that's only possible in toy examples where the true conditional distributions are already known (and these aren't always philosophically well-defined).
For internal model validation, the PIT values are much more meaningful, as they are interpretable for a single model, without the need to compare with another model.
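For concreteness, here is a toy PIT check. The predictive CDFs would normally be the model's leave-one-out predictives; here a single N(0, 1) predictive stands in for all of them, so the data are calibrated by construction, and the PIT values u_i = F_i(y_i) should look uniform on (0, 1):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(1)
y = rng.standard_normal(200)  # observations

# Toy stand-in: the same N(0, 1) predictive for every observation
# (in practice, use each point's leave-one-out predictive CDF F_i)
pit = norm.cdf(y)  # u_i = F_i(y_i)

# Under a calibrated model the PIT values are ~ Uniform(0, 1);
# a histogram or a uniformity test makes miscalibration visible
print(kstest(pit, "uniform").pvalue)
```

A U-shaped PIT histogram indicates an under-dispersed predictive, a hump-shaped one an over-dispersed predictive, which is exactly the single-model diagnostic the log-CPO sum cannot give you.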