MCMCTree CheckPointing

Christopher Powell

Dec 18, 2019, 1:57:55 PM

to PAML discussion group

Hello PAML discussion group,

Is there any way to checkpoint MCMCTree when it is running on an HPCC? I have a limit of 7 days for a single run, but the mcmctree analysis that we will be running will easily take 13+ days. Thus, I would need to start the analysis, have the HPCC kick me off the cluster, then restart the analysis from where it left off. Has anyone figured out a way around a run time constraint? Does PAML (mcmctree) work with DMTCP (Distributed MultiThreaded Checkpointing, http://dmtcp.sourceforge.net/index.html)?

Any help would be appreciated!

Thanks,

Christopher Powell

Ziheng

Feb 9, 2020, 11:37:50 AM

to PAML discussion group

i added something like this a month or two ago, but it is not well tested yet.
you can take a look and let me know if you spot anything strange, or if you have better ideas for options.
here are notes i wrote for someone else.
best, ziheng

http://abacus.gene.ucl.ac.uk/ziheng/paml4.9j.tgz

i copied the checkpointing option from bpp into mcmctree. here is the link. and here are some notes i wrote at the same time.

checkpoint = 1 * 0: nothing; 1 : save; 2: resume

checkpoint is a switch that turns on checkpointing (1 to save, 2 to resume). It does not save the memory image etc. Instead it saves the current state of the Markov chain (such as the divergence times, rates for loci, and the step lengths) in a file called mcmctree.ckpt. It does not save the conditional probability vectors, which are recalculated when the run is resumed. The state is written to the file at every 10th percentile of the MCMC iterations, and if the file already exists, it is overwritten.

With the resume option, the program reads the control file and sequence alignments, allocates memory, sets the state of the Markov chain by reading mcmctree.ckpt (which holds the last saved state of the chain), and then restarts the MCMC with burnin = 0. In effect it uses the last saved parameter values as the initial values. It then runs for nsample*sampfreq iterations, where nsample and sampfreq are read from the control file.

The old MCMC sample file is overwritten, so to use this option you need to save a copy of the sample file first and then change to checkpoint = 2. Afterwards you need to merge the samples into one file to summarize them (if you use print = -1, the program summarizes the sample instead of running MCMC). Perhaps this is too tedious.
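
The manual steps in these notes (switch the control file to resume, then merge the old and new sample files) could be scripted roughly as below. This is only a sketch and not part of PAML: the helper names set_checkpoint and merge_samples are made up for illustration, and it assumes the sample file is plain text with one header line followed by one row per sample, and that the control file has a single "checkpoint = N" line.

```python
import re
from pathlib import Path


def set_checkpoint(ctl_path, value):
    """Rewrite the `checkpoint = N` line of a control file in place.

    Hypothetical helper: assumes exactly one such line exists.
    """
    path = Path(ctl_path)
    text = path.read_text()
    # Replace only the number; any trailing "* comment" is left untouched.
    new_text, count = re.subn(
        r"(?m)^([ \t]*checkpoint[ \t]*=[ \t]*)\d+", rf"\g<1>{value}", text
    )
    if count != 1:
        raise ValueError(f"expected one checkpoint line in {ctl_path}, found {count}")
    path.write_text(new_text)


def merge_samples(sample_paths, merged_path):
    """Concatenate sample files in order, keeping a single header line.

    Assumes every file starts with the same header (same parameters).
    """
    header = None
    with open(merged_path, "w") as out:
        for p in sample_paths:
            with open(p) as fh:
                # Drop blank lines and make sure each row ends with a newline.
                lines = [ln.rstrip("\n") + "\n" for ln in fh if ln.strip()]
            if not lines:
                continue
            if header is None:
                header = lines[0]
                out.write(header)
            elif lines[0] != header:
                raise ValueError(f"{p} has a different header; refusing to merge")
            out.writelines(lines[1:])
```

A typical use would be to copy the sample file from the first run to a new name, call set_checkpoint("mcmctree.ctl", 2), run the resumed analysis, and then call merge_samples on the saved and new sample files before summarizing. Note that the iteration counter in the first column is not renumbered, so it starts again from the beginning in the second part of the merged file.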

Some options to consider: (i) Allow the user to specify the file name and also use different file names at different percentage points. This wastes space but allows the user to run multiple analyses in the same folder. (ii) Let nsample be the total number of samples both before and after the checkpoint. This means deleting the samples taken after the last checkpoint and appending new samples once the run is resumed. (iii) Save the state of the random number generator so that the save-resume option produces exactly the same results as running the whole chain without checkpointing.
(iii) is nice as it may be close to what people typically expect. but what i have right now may be useful if we want to run multiple mcmc with the starting point from the stationary distribution, without wasting time in the burnin. this should be good if the posterior has a single mode but can be problematic if the posterior is complex.
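
To make option (iii) concrete, here is a toy sketch of the idea: if the checkpoint stores the random number generator state together with the chain state, the resumed chain continues exactly as the uninterrupted one would. This is not mcmctree code. The sampler is a one-parameter Metropolis chain for a standard normal target, and the names run_chain, save_checkpoint, load_checkpoint and the file name toy.ckpt are invented for the example.

```python
import math
import pickle
import random


def run_chain(x, n_steps, rng):
    """Toy Metropolis sampler targeting N(0, 1); returns final state and samples."""
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-1.0, 1.0)
        # Log of the target density ratio pi(proposal) / pi(x).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        # Accept with probability min(1, ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return x, samples


def save_checkpoint(path, x, rng):
    """Store the chain state AND the generator state."""
    with open(path, "wb") as fh:
        pickle.dump({"x": x, "rng_state": rng.getstate()}, fh)


def load_checkpoint(path):
    """Restore the chain state and a generator that continues the same stream."""
    with open(path, "rb") as fh:
        saved = pickle.load(fh)
    rng = random.Random()
    rng.setstate(saved["rng_state"])
    return saved["x"], rng


if __name__ == "__main__":
    # Uninterrupted run of 200 steps.
    _, full = run_chain(0.0, 200, random.Random(1))

    # Same run, interrupted after 80 steps and resumed from the checkpoint.
    rng = random.Random(1)
    x, first = run_chain(0.0, 80, rng)
    save_checkpoint("toy.ckpt", x, rng)
    x, rng = load_checkpoint("toy.ckpt")
    _, rest = run_chain(x, 120, rng)

    print(first + rest == full)
```

Without the generator state, the resumed part would still be a valid continuation of the chain from the saved parameter values, which is the current behaviour described above, but it would not reproduce the uninterrupted run sample for sample.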

perhaps it is a good idea to have both options, something like the following

checkpoint = 1 frequency [filename] * to save
checkpoint = 2 r [filename] * to resume, r means replacing the mcmc sample file
checkpoint = 2 a [filename] * to resume, a means appending to the mcmc sample file
