Picking Samplers - IPM

Kenneth Loonam

Jun 1, 2024, 5:52:38 PM
to nimble-users
I'll start with a short question: Where can I go to learn more about which samplers perform best in various situations? I've found a few papers at the edge of my comprehension level, but I'd love a broad overview, particularly for models with highly correlated posteriors. I'd like to become more of a nimble power user who can pick good samplers for any model I write, but I'm at a loss as to where to start. Please feel free to answer only this question, but I'll elaborate on what is going on in my current model for future readers.

The motivating issue is an Integrated Population Model (IPM) that converges in JAGS but that I cannot get to converge in nimble. The exact behavior depends on which sampler I use. When I let the defaults stand, almost all of the nodes use the RW sampler. With that sampler, many of the nodes stick on a value, either from the very start or after exploring the posterior for a while (watching a functioning chain flat-line is fascinating). Similar patterns happen when I switch the samplers to the log scale or set reflective = TRUE. I've tried the RW_block sampler as well, but I'm not at all certain that I'm grouping the blocks correctly; guidance on when and how to group would be fantastic. Currently I just group temporally sequential variables together, like abundance at each time step. The RW_block sampler hasn't given significantly different results than the RW sampler. I've also tried slice sampling, both with onlySlice = TRUE and by assigning slice samplers to just the sticking nodes. Every version of the slice sampler results in nodes wandering off into crazy space (like 10^8 for a value that should be between 0 and 1000).
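
For concreteness, my reconfiguration attempts have looked roughly like this (node names and index ranges are placeholders, not my actual model):

conf <- configureMCMC(model)
## log-scale RW on a single node (reflective = TRUE is the other variant I tried)
conf$removeSamplers("sigma")
conf$addSampler(target = "sigma", type = "RW", control = list(log = TRUE))
## blocking temporally sequential abundances
conf$removeSamplers("N")
conf$addSampler(target = "N[1:20]", type = "RW_block")  # e.g., 20 time steps
## or slice sampling everything
conf <- configureMCMC(model, onlySlice = TRUE)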

For a bit more about the model: it has a removal component where a known number of individuals is subtracted at each time step. That led to many headaches when I was originally troubleshooting it in JAGS, but I was able to solve them with normal approximations to binomials and heavy use of truncation to the "minimum number alive" at various steps. All that to say, the model has a very narrow region of reasonable parameter values and a highly correlated posterior.
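
In sketch form, the approximation looks something like this inside the loop over time steps (placeholder names: removals holds the known removals and minAlive the minimum-number-alive bound):

## binomial survival replaced by a truncated normal approximation
N[t] ~ T(dnorm(phi * (N[t-1] - removals[t-1]),
               sd = sqrt(phi * (1 - phi) * (N[t-1] - removals[t-1]))),
         minAlive[t], Inf)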

At this point I could run it all in JAGS, but I want to be able to do all of my work in nimble going forward (I'd also prefer not to wait weeks for every run). With that in mind, a few specific questions I have are:

1. How do the various nimble samplers compare to those in JAGS?
2. When do various samplers work better/worse?
3. Are there any plans to resume maintaining the autoblocking functionality?
4. Is there a way to see which nodes in a single parameter (e.g. N[]) are sampled by each sampler as opposed to the posterior predictive sampler?
5. When/how should I block?
6. Could the "sticking" be a result of the adaptive RW? Should I switch to a non-adaptive sampler and manually set the step size or other parameters to unstick nodes? My thinking here is that the "reasonable" values are very tightly clustered, so if the sampler fails to find them at a high enough rate, it might start looking farther and farther out, driving the acceptance rate toward functionally 0.
7. If a node is stuck, is there some obscenely large number of iterations that will unstick it?

Sorry for the tome, but thank you so much for reading this far! Mostly, I am at a loss for resources for continued self-education.

Cheers,
Kenneth

Perry de Valpine

Jun 1, 2024, 6:29:12 PM
to Kenneth Loonam, nimble-users
Dear Kenneth,

Your question deserves a more thorough response, but I will at least give a quick one right now.

Here is a paper on sampler choices for some ecological models; it doesn't address all of your questions, but it is relevant: https://doi.org/10.1002/ece3.6053

Some of our workshop materials also include introductions to MCMC sampling performance issues (although perhaps not as thorough as you'd like). Specifically, I'd suggest the 3-day virtual workshop from 2023, modules 3 and 4 in the "content" folder.

I hope that gives you a start. We may have time later to address more of your specific points, and other readers might want to chime in as well.

Cheers,
Perry


Chris Paciorek

Jun 7, 2024, 8:06:58 PM
to kelo...@gmail.com, nimble-users
Hi Kenneth, a few other thoughts:

1. How do the various nimble samplers compare to those in JAGS?

JAGS tends to use slice sampling when things are not conjugate, while nimble tends to use adaptive Metropolis (i.e., "RW"). So I'm surprised that switching nimble to slice samplers doesn't perform similarly. Perhaps the priors or starting values differ, though it sounds like you've explored things carefully. For linear model components, JAGS uses some sampling tricks for the regression coefficients that often perform well and that nimble doesn't have; I don't know the structure of your model, so I'm not sure whether that is relevant.

If JAGS does reasonably well, then focusing on scalar samplers rather than diving into blocking seems like the way to go. It is quite odd that nimble would wander off to 10^8. Have you carefully checked your initial values? In particular, if you have flat or very non-informative priors and you don't provide nimble with initial values, nimble will sample them from the prior, which could cause the behavior you are seeing. You may have done this already, but I would set the initial values for nimble and JAGS to be the same (in particular, make sure that all hyperparameters have reasonable initial values) and make nimble's samplers match those of JAGS, presumably by applying slice sampling to all non-conjugate cases. If that doesn't help and you'd like to share your model and data (off-list if preferred), we might be able to take a look and see what is going on.
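
In sketch form, with placeholder names (the point is just identical initial values in both packages and slice samplers throughout):

inits <- list(phi = 0.8, sigma = 1, N = rep(500, 20))  # pass the same list to JAGS
model <- nimbleModel(code, constants = constants, data = data, inits = inits)
conf <- configureMCMC(model, onlySlice = TRUE)  # slice samplers wherever valid
mcmc <- buildMCMC(conf)
cmodel <- compileNimble(model)
cmcmc <- compileNimble(mcmc, project = model)
samples <- runMCMC(cmcmc, niter = 50000, nburnin = 10000)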

2. When do various samplers work better/worse?

This is a big topic. See the comments below about blocking. For individual parameters, slice sampling can often mix more quickly than RW on a per-iteration basis but involves more computation, so there is a tradeoff. Slice sampling also doesn't involve the adaptation that RW does, which is another advantage.

3. Are there any plans to resume maintaining the autoblocking functionality?

No. The autoblocking procedure tends to take a long time to determine the blocks, so we haven't maintained that approach.

4. Is there a way to see which nodes in a single parameter (e.g. N[]) are sampled by each sampler as opposed to the posterior predictive sampler?

I don't entirely understand the question. Any node assigned a posterior predictive sampler is sampled from the distribution assigned to it in the model code, in a "top-down" fashion so that parent nodes are sampled before child nodes.
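
But if part of what you're asking is how to inspect the sampler assignments, the MCMC configuration can print them. A minimal sketch, where conf is the configuration object:

conf <- configureMCMC(model)
conf$printSamplers("N")  # samplers assigned to the elements of N
conf$printSamplers()     # every sampler in the configuration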

5. When/how should I block?

One would generally block either when seeing high correlations between parameters at the same level of a model, or to group hyperparameters with the latent nodes/random effects that depend on them, if it seems that the mixing of the latent nodes is constrained by the relevant hyperparameter(s) and vice versa. But getting the RW_block sampler to adapt well can sometimes be a challenge. See the links Perry gave for some discussion of this, including modifying the `adaptInterval` and `adaptFactorExponent` tuning parameters of RW_block.
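
A sketch of that kind of adjustment, with placeholder names and illustrative tuning values:

conf$removeSamplers(c("phi", "sigma"))
conf$addSampler(target = c("phi", "sigma"), type = "RW_block",
                control = list(adaptInterval = 100,         # how often (in iterations) to adapt
                               adaptFactorExponent = 0.25)) # smaller values damp adaptation more strongly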

You might also try HMC, either on the entire model (at least on the parameters that are not discrete) or on blocks of the model. HMC often does well with correlated posteriors without being as finicky as block Metropolis.
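
With the nimbleHMC package, that looks roughly like this (the model needs to be built with derivative support):

library(nimbleHMC)
model <- nimbleModel(code, constants = constants, data = data,
                     inits = inits, buildDerivs = TRUE)  # AD support required for HMC
conf <- configureHMC(model)  # NUTS on continuous nodes, default samplers elsewhere
mcmc <- buildMCMC(conf)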

6. Could the "sticking" be a result of the adaptive RW? Should I switch to a non-adaptive sampler and manually set the step size or other parameters to unstick nodes? My thinking here is that the "reasonable" values are very tightly clustered, so if the sampler fails to find them at a high enough rate, it might start looking farther and farther out, driving the acceptance rate toward functionally 0.

Yes, if the model starts in a bad place, RW can adapt its tuning parameters to bad values. That may well be what is happening, though your lack of success with slice samplers is puzzling.
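
If you want to rule adaptation out, you can turn it off and set the proposal scale by hand; a sketch, with a placeholder node name and scale:

conf$removeSamplers("phi")
conf$addSampler(target = "phi", type = "RW",
                control = list(adaptive = FALSE, scale = 0.05))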

7. If a node is stuck, is there some obscenely large number of iterations that will unstick it?

Possibly not. I've seen cases where things get unstuck after thousands of iterations and others where they never do. With that kind of stickiness, you would usually want to change the sampling strategy (after first checking initial values).

-chris
