Hi Matt, Qing, and Marc,
Matt, thanks for the blog post. I'm sure many will find it useful. I just want to chime in with a few brief perspectives from a nimble developer.
Early versions of nimbleHMC were painfully slow. Now it should be much better. Of course performance will vary. I've seen our HMC be faster than Stan on some problems and slower on others. We haven't had time for a thorough comparison, and there are some problems where we'll be slow. It may be helpful to know that the marginal distributions in nimbleEcology (occupancy, N-mixture, HMM, etc.) were updated to be compatible with nimble's automatic differentiation (AD), which allows them to be used in HMC.
It is understandable that the Stan team is all in on HMC. It is a very good method, and some of their biggest arguments are about how its performance scales up to large data sets better than other methods. However, it is not necessarily universally the best method on all real problems, and not all real problems are so large that other methods won't be relevant. For example, I have seen Pólya-gamma samplers outperform HMC for occupancy models in spOccupancy or nimble (which does now have Pólya-gamma). The way I look at HMC is that in the computational cost tradeoff between iterations and mixing, it spends a lot of computation on each iteration in order to achieve good mixing per iteration. Other samplers can look worse *per iteration* in a trace plot but generate iterations much more rapidly, so we always like to look at effective sample size *per computation time* to really compare methods. Some of the comparisons we've published can be found on the Documentation tab at
r-nimble.org. Of particular interest may be Ponisio et al. (2020), where we illustrated that marginalization is not always worth the computational cost; it is another kind of tradeoff between computation and mixing. N-mixture models have very costly marginalizations, so it can sometimes be more efficient (per unit time, not per iteration) to sample them without marginalizing. (That paper was before we had nimbleHMC.)
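To make the "ESS per computation time" point concrete, here is a minimal sketch (in Python with NumPy, purely for illustration, not nimble code) that estimates effective sample size with a simple variant of Geyer's initial positive sequence estimator and contrasts a well-mixed chain with a sticky one:

```python
import numpy as np

def ess(chain):
    """Estimate effective sample size using a simple variant of
    Geyer's initial positive sequence estimator: accumulate
    autocorrelations until an adjacent pair sums to <= 0."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariance at all lags, normalized to autocorrelation.
    acov = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(n)])
    rho = acov / acov[0]
    tau = 1.0  # integrated autocorrelation time; 1.0 for iid draws
    for k in range(1, n - 1, 2):
        pair = rho[k] + rho[k + 1]
        if pair <= 0:
            break
        tau += 2.0 * pair
    return n / tau

rng = np.random.default_rng(1)
iid = rng.normal(size=4000)   # stand-in for a well-mixed chain
ar = np.empty(4000)           # stand-in for a sticky chain: AR(1), lag-1 corr 0.9
ar[0] = 0.0
for t in range(1, 4000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

# To compare samplers fairly, divide each sampler's ESS by its
# wall-clock time: ESS per second, not ESS per iteration.
print(ess(iid), ess(ar))  # the iid chain yields far higher ESS
```

The same number of iterations from the two chains carries very different amounts of information; dividing each ESS by the sampler's wall-clock time is what lets a cheap-but-sticky sampler compete with an expensive-but-well-mixing one.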
Regarding the point you relayed from the Stan team about not having a "principled" way to use HMC together with other samplers (though I'm not sure whether the comment was specifically about samplers for discrete latent states), that is curious. I would like to point out that Radford Neal, the inventor of HMC, wrote in his chapter in the
Handbook of MCMC about how he likes to use HMC on some parts of a model and other samplers on other parts. On the other hand, we've explored some of that and haven't found that it necessarily or automatically pays off, but I'm just saying it is feasible and valid. Maybe the point from the Stan team was about discrete samplers specifically, but I don't see an additional issue there (beyond the fact that HMC can't handle discrete latent states at all). Or maybe the idea was that HMC's tuning -- what happens during "warmup" -- may not work as well when other parts of a model are being sampled in other ways (which would just mean it is less efficient, not wrong). In any case there is a lot of art and pragmatism to MCMC, so my two cents would be to take comments like the one you relayed with a grain of salt.
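For anyone curious what mixing samplers looks like in practice, here is a rough sketch of the nimble workflow on a toy model. Treat it as a sketch from memory, not a tested script; in particular the nimbleHMC call (`addHMC`) and argument names should be checked against the package documentation:

```r
library(nimble)
library(nimbleHMC)

code <- nimbleCode({
  beta ~ dnorm(0, sd = 10)
  sigma ~ dunif(0, 5)
  for (i in 1:N) {
    y[i] ~ dnorm(beta * x[i], sd = sigma)
  }
})

## buildDerivs = TRUE enables nimble's AD, which HMC needs
model <- nimbleModel(code,
                     constants = list(N = 10, x = rnorm(10)),
                     data = list(y = rnorm(10)),
                     inits = list(beta = 0, sigma = 1),
                     buildDerivs = TRUE)

conf <- configureMCMC(model)   # default samplers for everything
conf$removeSamplers("beta")
addHMC(conf, target = "beta")  # HMC on beta only; sigma keeps
                               # its default sampler
mcmc <- buildMCMC(conf)
## then compileNimble() and runMCMC() as usual
```

The point is simply that the sampler assignment is per-node, so HMC and other samplers coexist in one MCMC configuration.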
For folks reading this who aren't familiar, I'd like to highlight two big differences between nimble and Stan: (i) nimble represents a model as a directed acyclic graph, which gives algorithms much more control and hence supports many different kinds of algorithms, while Stan represents a model as one big calculation; and (ii) nimble is an extensible platform on which users can write new algorithms from R, which makes it well suited to methodological experimentation. We have some exciting developments under way, so I hope you'll stay tuned. We're always keen for feedback and to provide support (we try to be quick, though sometimes it's hard), mostly on the nimble-users mailing list.
Best wishes,
Perry