Extending previously run chains with newly built model

Frédéric LeTourneux

Mar 21, 2022, 12:23:24 PM

to nimble-users

Hello everyone,

I am running a very time- and memory-demanding CMR model and I need some help in trying to save some time (so that I may finish my PhD in less than 10 years lol).

Basically, the model I’m running can take up to 5-6 days to build and compile, and then runs at around 2000 iterations per 24 h. It is a complex model with many individual- and time-based random effects, so I’ll likely need over 100k iterations for the chains to converge, which means I’m looking at 1.5-2 months at least for each chain to run.

So one issue I have is that if my model crashes (like it did this weekend when the cluster I was running it on went into maintenance), I’d like to be able to resume from where it left off rather than starting again from scratch. I’ve been saving my samples along the way using something like

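# first chunk
mcmc$run(niter = 5000, nburnin = 0, thin = 5)
run1 <- as.matrix(mcmc$mvSamples)
save(run1, file = 'run1.RData')
rm(run1)

# next chunk: continue the chain (reset = FALSE) but clear stored samples (resetMV = TRUE)
mcmc$run(niter = 5000, nburnin = 0, thin = 5, reset = FALSE, resetMV = TRUE)
run2 <- as.matrix(mcmc$mvSamples)
save(run2, file = 'run2.RData')
rm(run2)
and so on…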
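
The chunked pattern above can also be written as a loop, with each chunk saved to its own file and recombined later. The sketch below uses base R only: `run_chunk()` is a hypothetical stand-in for calling `mcmc$run(..., reset = FALSE, resetMV = TRUE)` and then `as.matrix(mcmc$mvSamples)`, and the file names, chunk count and chunk size are assumptions chosen for illustration.

```r
# Hypothetical stand-in for one chunk of MCMC output. In real use this would
# call mcmc$run(niter, reset = FALSE, resetMV = TRUE) and return
# as.matrix(mcmc$mvSamples). Here it just returns a thinned random walk.
run_chunk <- function(niter, thin, start) {
  draws <- start + cumsum(rnorm(niter, sd = 0.1))
  matrix(draws[seq(thin, niter, by = thin)], ncol = 1,
         dimnames = list(NULL, "theta"))
}

out_dir <- file.path(tempdir(), "mcmc_chunks")
dir.create(out_dir, showWarnings = FALSE)

set.seed(1)
n_chunks <- 3
last <- 0
for (k in seq_len(n_chunks)) {
  chunk <- run_chunk(niter = 5000, thin = 5, start = last)
  last <- chunk[nrow(chunk), "theta"]  # the chain continues from its last value
  saveRDS(chunk, file.path(out_dir, sprintf("run%02d.rds", k)))
  rm(chunk)                            # free memory before the next chunk
}

# Later (possibly in another session): read the chunks back in order.
files <- sort(list.files(out_dir, pattern = "^run[0-9]+\\.rds$",
                         full.names = TRUE))
samples <- do.call(rbind, lapply(files, readRDS))
dim(samples)
```

Zero-padded file names keep the chunks in the right order when sorted, and `saveRDS()` avoids having to remember the object name stored inside each file.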

This allows me to save some memory by removing samples after each run while still extending the chains. My problem is that when the model crashes, I need to start over. So I have 2 questions.

1. Is it possible to save some parts of the model building process (i.e. model graph, links between objects created by nimble, anything else that could be saved) and store them in an object that could be re-loaded later, so as to save at least some of the (5-day) model building time? I have a hunch this won’t work given what I’ve read, but I’m asking anyway just in case.

2. Is there a way to restart a chain from where a chain from a past model ended, so as to simply extend the chains from a previous model? I’ve run some tests using mcmc$run(reset=FALSE), and others using the end points of a previous run of the same model as the initial values for the stochastic nodes (which also seems to be what happens when using mcmc$run(reset=FALSE), or am I wrong?). In these tests, the chains initialized with the last samples from a previous run do not seem to behave the same way (at least in the first iterations) as chains continued with mcmc$run(reset=FALSE) (see an example in the attached graphs). I imagine, then, that the sampler functions must store values other than the node values, and that these are used when running with reset=FALSE? Section 7.5 of the nimble manual also hints that other values in the sampler functions are reset when using reset=TRUE. I am wondering whether it is possible to access these values, save them after each run, and feed them to a freshly built model (for instance after a crash), so that I could simply extend the chains I had already run and saved, as I would with reset=FALSE. Is there any way to do this? Otherwise I must start over after each crash, and given the time it takes for my model to run, I’m not sure that is going to be an option.

Thanks a lot for any help with this, and let me know if some things are unclear,

Cheers,

Fred L.

Daniel Turek

Mar 21, 2022, 2:12:12 PM

to Frédéric LeTourneux, nimble-users

Frédéric, it sounds like you have a demanding model, and I hope you're able to get the runs done that you need. Hopefully others will have more helpful advice about model building, etc., but let me point you to two vignettes of mine, which might help a bit with the question about "restarting a chain right where it left off". Take a look at:

Restarting NIMBLE MCMC

Saving Model and MCMC State

Take a look at those, which are intended to be self-guiding, and hopefully something will be useful to you.

Cheers,

Daniel

--
You received this message because you are subscribed to the Google Groups "nimble-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nimble-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nimble-users/9667bc36-419f-4217-b79d-e3d45bfece59n%40googlegroups.com.

Perry de Valpine

Mar 22, 2022, 11:50:49 AM

to Daniel Turek, Frédéric LeTourneux, nimble-users

I'll add that two of the easiest tricks for faster building steps are calculate=FALSE in nimbleModel and useConjugacy=FALSE in configureMCMC. Do those help at all?
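
For reference, here is roughly where those two arguments go. This is a minimal sketch on a toy normal-mean model; the model, data and variable names are illustrative assumptions and have nothing to do with the CMR model discussed in this thread. The nimble-specific part only runs if the nimble package is installed.

```r
# Toy example only: a normal-mean model standing in for the real model.
sampler_names <- character(0)
if (requireNamespace("nimble", quietly = TRUE)) {
  library(nimble)

  code <- nimbleCode({
    mu ~ dnorm(0, sd = 10)
    for (i in 1:N) {
      y[i] ~ dnorm(mu, sd = 1)
    }
  })

  # calculate = FALSE: do not run calculate() on the model right after
  # building it
  model <- nimbleModel(code,
                       constants = list(N = 5),
                       data = list(y = c(0.1, -0.3, 0.8, 1.2, 0.4)),
                       inits = list(mu = 0),
                       calculate = FALSE)

  # useConjugacy = FALSE: no conjugate samplers are assigned, so the
  # conjugacy check is skipped when configuring the MCMC
  conf <- configureMCMC(model, useConjugacy = FALSE)
  sampler_names <- sapply(conf$getSamplers(), function(s) s$name)
  print(sampler_names)
}
```

On a model this small the time saved is negligible; the point is only to show which function each argument belongs to.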

Perry

Frédéric LeTourneux

Mar 22, 2022, 1:16:22 PM

to nimble-users

Hello Daniel and Perry,

Perry: yes, I'm already using these options and they do speed up the model building significantly. I dare not imagine how long it would take without them... ^^

Daniel: Great, that is exactly what I needed and it works just fine. Just to be sure, the second link is basically the equivalent of the first one, except it's already packed in a function, right?

Have a nice day!

F.

Daniel Turek

Mar 22, 2022, 3:22:20 PM

to Frédéric LeTourneux, nimble-users

Yes, that's right, Frédéric, and the second also addresses saving the model state. I'm glad it seems to work for you.

Cheers,

Daniel

Frédéric LeTourneux

Mar 24, 2022, 9:18:49 AM

to nimble-users

Hello Daniel,

Thanks for the input. I have two follow-up questions, and another one related to my original question.

1) in both vignettes there is a model$calculate step. I used the code from the first vignette without the calculate lines because that takes a very long time to run with my model. Resetting the MCMC seems to work fine without that. Is this simply a check or is there another reason to do this that eludes me?

2) I'm not sure what you mean by 'saving the model state'. Is this a way to reload the 'project' as it was before shutting down the R session? Would this mean that if you ran multiple MCMCs with different samplers, you could reload info on previous runs (e.g. WAICs, time spent in various samplers or other such info)?

Finally, I was talking with someone from Compute Canada (the cluster I use to run models) about running models for many weeks at a time. They were not super excited about the idea of locking up nodes with 500 GB of RAM for multiple weeks, and said my priority would probably diminish very quickly, meaning I could wait weeks before my jobs run (which becomes prohibitive considering I already need to run MCMC chains for a month or more at a time). They said there are normally ways to set checkpoints in the middle of a script, which save a "picture" of the information used by the script in its current state to a binary file, so that no part of the script has to be re-run for the work to resume at the same point (e.g. not having to re-build the model in my case). This should help, since I could then submit smaller, week-long jobs, which would allow my jobs to start sooner.

I need to look at this in more detail, but I was wondering if anyone here has done something similar: is it at all possible to stop a script after an MCMC run, save all the relevant information, reload all of it in a new session, and directly do another MCMC run without having to build the model again? With my limited understanding of how computers work, it feels like it should be possible to save this information (model graph, links between files and objects, etc.) so that it can be loaded again. My attempts at saving model objects as rds files and reloading them have been unsuccessful so far, but I wonder if there is a reason why this would simply not be possible with nimble, in which case I'm not going to go too much deeper down that rabbit hole.

Thanks a lot for the help!

Fred

Daniel Turek

Mar 24, 2022, 12:02:26 PM

to Frédéric LeTourneux, nimble-users

Fred, you're very welcome.

(1) The model$calculate() is not necessary. It's something I generally like to include, to verify that the model is fully initialized and ready-to-go. But it's not strictly necessary, and especially when the time required is non-trivial, it can certainly be skipped.

(2) No, I'm sorry to say that what is meant by "saving the model state" is not nearly as sophisticated as you're suggesting. In that vignette, it merely means saving the values of all variables (data, model parameters, latent states, etc) inside the model. That way, in a later R session, after re-creating a new model object (using the same model code and nimbleModel(code, ...)), you can restore the values of all model variables (the "model state") from your previous R session into the newly created model object. That, in combination with restoring the state of the MCMC, would effectively allow you to "continue" the previous MCMC run (from a previous R session) in this new R session. But unfortunately, no details of previous MCMC runs (WAIC, etc) are recorded or stored; that would be up to you. Does that make sense?
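
To make the idea concrete, here is a self-contained toy in base R. It is not the vignette code and does not use nimble: `run_sampler()` is a hypothetical adaptive random-walk sampler with a deliberately crude adaptation rule (not NIMBLE's), and `state` bundles the "model state" (the current value `x`) with the "MCMC state" (the proposal scale and adaptation counters). Saving R's random number seed alongside the state is an assumption of this sketch, made so that the continued run can be compared exactly with an uninterrupted one.

```r
# Toy adaptive random-walk Metropolis sampler for a standard normal target.
# All names are illustrative; the adaptation rule is a crude placeholder.
run_sampler <- function(state, niter, adapt_interval = 200) {
  draws <- numeric(niter)
  for (i in seq_len(niter)) {
    prop <- state$x + rnorm(1, sd = state$scale)
    if (log(runif(1)) < dnorm(prop, log = TRUE) - dnorm(state$x, log = TRUE)) {
      state$x <- prop
      state$timesAccepted <- state$timesAccepted + 1
    }
    state$timesRan <- state$timesRan + 1
    if (state$timesRan == adapt_interval) {
      # nudge the scale toward a 44% acceptance rate, then reset the counters
      rate <- state$timesAccepted / state$timesRan
      state$scale <- state$scale * exp(rate - 0.44)
      state$timesAdapted <- state$timesAdapted + 1
      state$timesRan <- 0
      state$timesAccepted <- 0
    }
    draws[i] <- state$x
  }
  list(draws = draws, state = state)
}

init <- list(x = 0, scale = 1, timesRan = 0, timesAccepted = 0,
             timesAdapted = 0)

# Reference: one uninterrupted run of 1500 iterations.
set.seed(42)
full <- run_sampler(init, 1500)

# Interrupted: run 1000 iterations, then checkpoint state and RNG seed.
set.seed(42)
part1 <- run_sampler(init, 1000)
ckpt <- tempfile(fileext = ".rds")
saveRDS(list(state = part1$state, seed = .Random.seed), ckpt)

# "New session": scramble the RNG, restore the checkpoint, and continue.
set.seed(999)
saved <- readRDS(ckpt)
assign(".Random.seed", saved$seed, envir = globalenv())
part2 <- run_sampler(saved$state, 500)
identical(full$draws, c(part1$draws, part2$draws))

# Restarting from the last value only (default scale, zeroed counters)
# takes a different path, because the adapted scale is lost.
assign(".Random.seed", saved$seed, envir = globalenv())
naive <- run_sampler(modifyList(init, list(x = saved$state$x)), 500)
identical(part2$draws, naive$draws)
```

The last comparison illustrates the behaviour reported at the top of the thread: starting a fresh sampler at the previous end point is a valid chain, but it is not the same as continuing with the sampler's internal state intact.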

You're correct, at present the "model object", along with the MCMC algorithm, cannot be saved between R sessions. That's a known shortcoming, and something that is being worked on. But, for the time being, a new model (and MCMC) would need to be built and compiled on starting a new R session.

There's a chance that your entire model could be streamlined to vastly reduce the building time, and hopefully MCMC runtime, but that would require some careful review of the code, and model-specific programming. This is the sort of thing that some members of the community (and development team) like to help with sometimes, so that might be an option for you, if anyone is willing and interested.

I hope this helps,

Daniel

Frédéric LeTourneux

Mar 24, 2022, 2:01:21 PM

to nimble-users

Daniel,

All right, that all makes sense. Thanks.

Regarding your 2nd point, some have already made such efforts on this post like Perry who showed me how to adapt dDHMMo to suit my needs and reduce memory usage, which is working out pretty well. Unfortunately right now I'm kinda stuck on the side of run time, but eh, I knew what I was in for when I moved to MCMC!

Thanks for everything, I'll be keeping an eye out for things I might be able to improve on the running time side.

Cheers!

F.

Frédéric LeTourneux

Apr 13, 2022, 10:57:27 AM

to nimble-users

Hello Daniel,

I've been using the functions provided in the second link from your first answer in this thread. They work well except for one thing. When I load the saved MCMC state, there are values for $scale, $timesAdapted and $gamma1, but the values for $timesRan and $timesAccepted are still at 0 even though the MCMC ran for 15 000 iterations (see the example for one of my parameters below). Is this normal? Should I be writing a function to also access $timesRan and $timesAccepted, or should these be set to 0 when restarting the MCMC? Am I missing something?

Thanks for the help,

Fred

example of saved mcmc state for one parameter:

[[946]]
[[946]]$scale
[1] 2.007314

[[946]]$timesRan
[1] 0

[[946]]$timesAccepted
[1] 0

[[946]]$timesAdapted
[1] 75

[[946]]$gamma1
[1] 0.03064251

Daniel Turek

Apr 13, 2022, 9:20:08 PM

to Frédéric LeTourneux, nimble-users

Totally reasonable. If the adaptationInterval (another variable internal to the sampler, which doesn't change) is at its default value of 200, then every 200 MCMC iterations, the adaptation procedure will take place (internal to the sampler). This includes updating the scale (proposal standard deviation) parameter based on the past 200 posterior samples, and also setting timesRan (which means times ran since the most recent adaptation took place - not the total times ran) and timesAccepted (similar interpretation) back to 0. We can see timesAdapted is exactly what we'd expect: adaptationInterval * timesAdapted = 200 * 75 = 15,000 total MCMC iterations. If you ran it for 15,001 iterations, you would see timesRan = 1, and timesAccepted would be either 0 or 1. So everything is bang on, and it looks like it's working fine.
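
The arithmetic can be checked with a couple of lines of base R. `adaptation_counters()` is a hypothetical helper written for this illustration, not part of nimble; it simply encodes the description above, assuming the sampler adapts every `adaptationInterval` iterations and resets timesRan each time.

```r
# Counters as described above: timesRan is reset at every adaptation, so it
# holds the iterations since the last adaptation, not the total.
adaptation_counters <- function(niter, adaptationInterval = 200) {
  list(timesAdapted = niter %/% adaptationInterval,
       timesRan     = niter %% adaptationInterval)
}

str(adaptation_counters(15000))  # timesAdapted 75, timesRan 0
str(adaptation_counters(15001))  # timesAdapted 75, timesRan 1
```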

Frédéric LeTourneux

Apr 19, 2022, 9:38:48 AM

to nimble-users

You're right, it is spot on! Sweet! Thanks!

F

Yihong Zhu

Sep 26, 2024, 7:50:12 PM

to nimble-users

Hi Daniel,

Thanks for providing the example! Following the two vignettes, I have successfully extended the iterations using Cmcmc$run(), but questions remain about how I could do it for multiple chains. Following is my situation:

I initially use runMCMC(Cmcmc, niter=200000, nburnin=100000, thin=20, nchains=2), which returns two chains, each with 10000 samples.
Then, without starting a new R session, I use Cmcmc$run(30000, reset = FALSE, resetWAIC = FALSE, time = TRUE), after which Cmcmc$mvSamples has 40000 rows. I checked that the first 10000 rows are the same as those in the 2nd chain, so I think in this case Cmcmc$run() is extending the 2nd chain.

Then how could I do that for the first chain as well? I am hoping that the next time I start a new R session, all the chains can be restarted. But I could not figure out what settings should be changed (in Cmcmc or in run(); run() does not seem to accept an argument like nchain), or what extra information needs to be saved. Or maybe the only plausible way is to use Cmcmc$run() to run each chain separately, and also save the information for each chain separately?

This is probably a dumb question just because I do not fully understand the mechanism behind it, but would appreciate any kind of help or clarification. Thanks!

Best,

Yihong

Daniel Turek

Sep 27, 2024, 9:13:07 AM

to Yihong Zhu, nimble-users

Yihong, thanks for your very clear question. I'm sorry to report, but the facilities for what you're trying to do are not built into the system.

The Cmcmc object stores information about a *single* chain - both the samples, and the internal state of the MCMC. Using Cmcmc$run(..., reset = FALSE) simply continues the run of the *single* chain which is internal to Cmcmc.

runMCMC() was designed to be a wrapper for the Cmcmc object, and give additional functionality. Using runMCMC(..., nchains = 2), for example, runs a single chain (using the Cmcmc), extracts and saves those samples, then begins a new, second, separate chain (again using the Cmcmc object), and at the same time discarding the state of the first MCMC. The samples from the second chain are then extracted, and runMCMC returns the samples from both chains. However, as you noted, you cannot continue the run of both chains using runMCMC.

If you need to do this, it can be done using some of the links earlier in this post, by:

- use Cmcmc to execute a first chain

- extract the samples, and also extract and save the state of the MCMC (as demonstrated earlier)

- use Cmcmc to execute a second chain

- extract the samples, and again extract and save the state of the MCMC

- now, if you want to continue either chain, you'd have to restore the state of the MCMC to that of one particular chain, and continue it from there.

So, this is possible, but the facilities are not built in for this.
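
The bookkeeping in the steps above can be sketched with a base R toy. Nothing here is nimble code: `engine` is a hypothetical stand-in for a single compiled MCMC object that can hold only one chain at a time, and `get_state()` / `set_state()` are illustrative helpers playing the role of the state-extraction functions from the vignettes. Storing the RNG seed with each chain is an assumption of the sketch.

```r
# Toy stand-in for one compiled MCMC object: a single environment that holds
# the state of exactly one chain at a time (illustrative names throughout).
engine <- new.env()
engine_reset <- function(x0) {
  engine$x <- x0
  engine$scale <- 1            # placeholder for the sampler's internal state
  engine$samples <- numeric(0)
}
engine_run <- function(niter) {
  for (i in seq_len(niter)) {
    prop <- engine$x + rnorm(1, sd = engine$scale)
    if (log(runif(1)) < dnorm(prop, log = TRUE) - dnorm(engine$x, log = TRUE)) {
      engine$x <- prop
    }
    engine$samples <- c(engine$samples, engine$x)
  }
}
get_state <- function() {
  list(x = engine$x, scale = engine$scale, seed = .Random.seed)
}
set_state <- function(s) {
  engine$x <- s$x
  engine$scale <- s$scale
  engine$samples <- numeric(0)
  assign(".Random.seed", s$seed, envir = globalenv())
}

chain_files <- character(2)
samples <- vector("list", 2)
for (ch in 1:2) {
  set.seed(100 + ch)
  engine_reset(x0 = ch)                  # a new chain overwrites the old one
  engine_run(200)
  samples[[ch]] <- engine$samples        # extract samples before moving on
  chain_files[ch] <- tempfile(fileext = ".rds")
  saveRDS(get_state(), chain_files[ch])  # save this chain's state separately
}

# Continue each chain in turn by restoring its own saved state first.
for (ch in 1:2) {
  set_state(readRDS(chain_files[ch]))
  engine_run(100)
  samples[[ch]] <- c(samples[[ch]], engine$samples)
}
lengths(samples)
```

Keeping one state file per chain is what makes it possible to pick up any chain later, in any order, with a single engine.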

I hope this helps,

Daniel
