Let me just summarize a few of the key points here.
Warmup:
Conceptually, MCMC warmup times are roughly equivalent to
the autocorrelation time: because HMC chains tend to be only
weakly autocorrelated, they also tend to converge very
quickly.
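(To be concrete about the quantity in question, here is a crude
NumPy sketch of the integrated autocorrelation time,
tau = 1 + 2 * sum_t rho_t, the factor relating raw draws to
effective draws. This is a hypothetical helper for illustration,
not Stan's own effective-sample-size machinery.)

    import numpy as np

    def integrated_autocorr_time(chain, max_lag=200):
        # Center the chain and compute its empirical autocorrelations.
        x = np.asarray(chain) - np.mean(chain)
        acf = np.correlate(x, x, mode="full")[len(x) - 1:]
        acf /= acf[0]
        rho = acf[1:max_lag]
        # Truncate at the first negative autocorrelation (a common heuristic).
        cut = int(np.argmax(rho < 0)) if np.any(rho < 0) else len(rho)
        return 1.0 + 2.0 * np.sum(rho[:cut])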
The HUGE caveat is that such an argument assumes uniformity
of curvature across the parameter space, an assumption which
is violated in many of the complex models we see. Very often
the tails have large curvature while the bulk is relatively
well-behaved; in other words, warmup is slow not because
the chain needs many iterations to converge but rather because
each HMC iteration is more expensive out in the tails, where
the high curvature forces a small step size and hence long
numerical trajectories.
Poor behavior in the tails is the kind of pathology that Andrew
notes we can find by running only a few warmup iterations.
By looking at the acceptance probabilities and step sizes of
the first few iterations, you can get an idea of how severe the
problem is and whether you need to address it with modeling
efforts.
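As a concrete way to run that check, here is a minimal sketch
using CmdStanPy (the model file name is hypothetical; any Stan
program you are debugging would do):

    from cmdstanpy import CmdStanModel

    # Hypothetical Stan program standing in for the model under study.
    model = CmdStanModel(stan_file="model.stan")

    # Run just a handful of warmup iterations and keep the warmup draws.
    fit = model.sample(
        chains=1,
        iter_warmup=50,
        iter_sampling=10,
        save_warmup=True,
        seed=1,
    )

    # The sampler diagnostics record the acceptance probability and
    # step size at every iteration; persistently tiny step sizes and
    # erratic acceptance probabilities this early hint at strong
    # curvature out in the tails.
    warmup = fit.draws_pd(inc_warmup=True)
    print(warmup[["accept_stat__", "stepsize__"]].head(20))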
The Mass Matrix (or, more formally, the Euclidean Metric):
The mass matrix can compensate for linear (i.e. global)
correlations in the posterior, which can dramatically improve
the performance of HMC in some problems. Of course, this
requires that we know the global correlations a priori.
In complex models this is incredibly difficult (for example,
nonlinear model components convolve the scales of the
data, so standardizing the data doesn’t always help) so in
Stan we learn these correlations online with an adaptive
warmup. In models with strong nonlinear (i.e. local)
correlations this learning can be slow, even with regularization.
This is ultimately why warmup in Stan often needs to be
so long, and why a sufficiently long warmup can yield such
substantial performance improvements.
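To make the "linear correlations" point concrete, here is a
minimal NumPy sketch (an illustration, not Stan's actual
adaptation code) of why setting the inverse metric to the
posterior covariance helps: it is equivalent to reparameterizing
into coordinates where the target is isotropic, which is exactly
the geometry HMC handles best.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical 2-d Gaussian posterior with strong linear correlation.
    Sigma = np.array([[1.0, 0.99],
                      [0.99, 1.0]])
    L = np.linalg.cholesky(Sigma)

    # Choosing the inverse Euclidean metric M^{-1} = Sigma is equivalent
    # to sampling in the decorrelated coordinates z, where x = L z.
    z = rng.standard_normal((100_000, 2))
    x = z @ L.T

    print(np.corrcoef(x, rowvar=False))  # ~0.99 off-diagonal: hard for unit-metric HMC
    print(np.corrcoef(z, rowvar=False))  # ~identity: what HMC sees with a good metric

In real models the covariance is not known in advance, which is
exactly why Stan has to spend warmup iterations estimating it.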