The answer is (iii) --- it's a computational issue.
There's not much out there to read! Basically just what Michael's
written --- in papers (see arXiv), in our manual, and on this
list and the dev list.
It's almost always caused by regions of high curvature in the posterior,
which are difficult to follow using only small steps along the gradient.
It typically manifests as floating-point underflow or overflow in
the density or gradient calculations, which we report as "divergent"
iterations.
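To make the overflow concrete, here's a hedged sketch (NumPy, not Stan's actual internals): a leapfrog step that jumps to an extreme value of an unconstrained log-scale parameter overflows the density calculation in double precision, and the non-finite values then poison the gradient.

```python
import numpy as np

# Illustration only: a step lands at an extreme unconstrained
# log-scale value and the exp() in the density overflows float64.
with np.errstate(over="ignore", invalid="ignore"):
    log_sigma = 800.0            # position after an overly large step
    sigma = np.exp(log_sigma)    # exp(800) overflows float64 to inf
    grad_term = sigma - sigma    # inf - inf = nan in a gradient term

print(sigma, grad_term)          # inf nan
```

Once a trajectory produces inf or nan like this, the Hamiltonian is no longer conserved and the iteration gets flagged as divergent.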
If the curvature (second derivative matrix) doesn't vary by position
in the posterior (as in the multivariate normal), then adaptation
is much easier. When curvature varies a lot, as in a hierarchical
model (see the funnel example in the manual), adaptation to a single
step size/mass matrix isn't possible, and areas of the posterior become
hard to explore.
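You can see the varying curvature directly in Neal's funnel (the example in the manual): with y ~ normal(0, 3) and x ~ normal(0, exp(y/2)), the conditional log density of x is -x^2 / (2 exp(y)) + const, so its second derivative in x is -exp(-y) and depends on where you are in y. A small sketch:

```python
import numpy as np

# Curvature of log p(x | y) in x for Neal's funnel:
#   x ~ normal(0, exp(y/2))  =>  d2/dx2 log p(x|y) = -exp(-y)
def curvature_x(y):
    return -np.exp(-y)

for y in (-6.0, 0.0, 6.0):
    print(f"y = {y:+.0f}  d2/dx2 log p(x|y) = {curvature_x(y):.4g}")
```

The curvature changes by a factor of exp(12) (about 160,000) between y = -6 and y = +6, so no single step size works for both the neck and the mouth of the funnel.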
This also explains why reparameterization can help. Again, the
funnel in the manual is an extreme example; for much more detail,
read Michael's paper with Mark Girolami on hierarchical models,
which also explains how data can tame the funnel prior.
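The usual fix is the non-centered parameterization: sample a standard normal and scale it deterministically, so the sampler only ever sees parameters whose curvature doesn't vary with position. A hedged sketch of the idea (plain NumPy sampling, standing in for what the reparameterized Stan model does):

```python
import numpy as np

# Non-centered parameterization of the funnel:
# instead of  x ~ normal(0, exp(y/2))   (y-dependent curvature)
# sample      x_raw ~ normal(0, 1)      (fixed unit curvature)
# and define  x = exp(y/2) * x_raw      (same implied distribution)
rng = np.random.default_rng(0)

y = rng.normal(0.0, 3.0, size=100_000)
x_raw = rng.normal(0.0, 1.0, size=100_000)
x = np.exp(y / 2.0) * x_raw

print(np.std(x_raw))   # ~1, regardless of where y is
```

The sampler explores (y, x_raw), which has no funnel; the funnel shape is recovered in x as a deterministic transform.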
It's hard to quantify how much bias there might be without getting
a better handle on what's really going on in a particular posterior. Usually
it's not very big in the examples I've looked at. You'll wind up getting
the same kinds of biases with Gibbs, Metropolis and ensemble samplers---these are
hard problems computationally; we're just trying to be helpful in providing
diagnostics.
- Bob