Possibility for checkpoint?

Marc Hömberger

Dec 24, 2018, 1:58:28 PM
to bali-phy-users
Hi everyone, 

I am using BAli-Phy on a computing cluster, and a recent policy change introduced a wall-clock limit of 72 hours. That means I can no longer run BAli-Phy there, since none of my runs converge within 72 hours. Is there a way to resume a terminated BAli-Phy job, or has anyone managed to write checkpoint files so that runs can be continued?

Thanks, 
Happy Holidays, 
Marc

Joel Berendzen

Dec 24, 2018, 5:39:13 PM
to bali-ph...@googlegroups.com
Try criu. If that is problematic, then link bali-phy with libckp and use that. BAli-Phy's memory demands are modest, so why not run it on your own private machine?
--
You received this message because you are subscribed to the Google Groups "bali-phy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bali-phy-users+unsubscribe@googlegroups.com.
To post to this group, send email to bali-phy-users@googlegroups.com.
Visit this group at https://groups.google.com/group/bali-phy-users.
For more options, visit https://groups.google.com/d/optout.

Marc Hömberger

Dec 28, 2018, 8:12:22 AM
to bali-phy-users
The reason I run it on the cluster is so that I can run many chains. I will look into criu. 

Thanks,
Marc


Benjamin Redelings

Jan 23, 2019, 11:40:07 AM
to bali-ph...@googlegroups.com, katarzyna.zarem...@gmail.com

Hi Marc,

First of all, my apologies for taking so long to respond.  I got sick during the holidays and that meant that I had less energy to handle difficult questions like this.

You are the second person who has asked for checkpoint-restore capabilities in bali-phy, so I think this is an important issue.  However, bali-phy's internal state is very (VERY) complex, and so I have not yet spent the necessary time to figure out how to dump and restore that state.

I was able to get CRIU to work and restore a bali-phy job:

    % sudo /usr/sbin/criu dump -D snap-${PID} -t ${PID} -vvv --shell-job

    % sudo /usr/sbin/criu restore -D snap-${PID} -v3 --shell-job -o restore.log

The only problem here is that you need to be root! If the cluster administrators are willing to make the criu binary setuid root, that would solve the problem, but they might consider it a security risk.
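
For anyone who wants to script the two commands above, here is a minimal sketch. It only wraps the criu invocations quoted in this message; the `run` helper, the `DRY_RUN` switch, and the function names are illustrative assumptions, and the `mkdir -p` assumes criu wants the image directory to exist before dumping. It prints the commands by default, so nothing runs as root until `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the criu dump/restore commands quoted above.
# run, DRY_RUN, checkpoint_job and restore_job are hypothetical names.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Dump the process tree rooted at PID into the directory snap-<PID>.
checkpoint_job() {
    local pid
    pid="$1"
    run mkdir -p "snap-${pid}"   # assumption: image directory must already exist
    run sudo /usr/sbin/criu dump -D "snap-${pid}" -t "${pid}" -vvv --shell-job
}

# Restore the process tree previously dumped into snap-<PID>.
restore_job() {
    local pid
    pid="$1"
    run sudo /usr/sbin/criu restore -D "snap-${pid}" -v3 --shell-job -o restore.log
}

# Example (hypothetical PID):
#   checkpoint_job 12345            # prints the commands
#   DRY_RUN=0 checkpoint_job 12345  # actually runs them
```

Note that, by default, criu stops the dumped process once the image has been written, so the restore step is what brings the job back.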

Does your cluster use SLURM?  This presentation from 2016 discusses integrating the CRIU, BLCR, and DMTCP checkpoint-restore frameworks into SLURM:

    https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf

It seems reasonable to me that, if the cluster administrators are going to limit jobs to 72 hours, then they should do a bit of work to help out people whose jobs don't fit in that window.

Have you had any luck with CRIU on your cluster?

-BenRI

P.S. The cluster administrator at Duke says that professors who use the cluster are often quite pushy, and say things like "we are doing chemistry and molecular dynamics simulations, real science, and should have priority over those biologists who are not doing real science."  Maybe someone whose jobs take less than 72 hours pushed the cluster administrators to change their scheduling policy in favor of short jobs?  In that case, maybe you can push back (or find someone senior enough to push back) the other direction.

Joel Berendzen

Jan 23, 2019, 1:34:34 PM
to bali-ph...@googlegroups.com, katarzyna.zarem...@gmail.com
After thinking about this issue occasionally over the last month, I'd suggest not doing anything fancy with criu or the like. Instead, I'd recommend simply restarting from the state in the latest saved alignments and trees. There are time costs associated with startup, but if those costs aren't negligible compared with 72 hours of CPU, then you have bitten off too big an alignment for the compute power you have available. Since you can (I presume) run multiple 72-hour chains in parallel, you can also save some time by restarting from the replica with the best stats, until convergence. Yes, I'm aware that such a procedure can underestimate inter-chain variances if the convergence landscape is rough, but so far that seems not to be the case.
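
As a rough illustration of the "latest saved trees" part of this suggestion, the sketch below collects the last sampled tree from each run directory. It is not the steering code mentioned in this message. It assumes each run directory contains a tree sample file named `C1.trees` with one tree per line; the function names and the `.last.tree` output naming are made up for the example. Recovering the last sampled alignment would need a similar but separate step, which is not shown.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: each run directory holds a file C1.trees with
# one sampled tree per line; last_tree and collect_last_trees are
# hypothetical helper names.

# Print the final non-empty line of a tree sample file.
last_tree() {
    grep -v '^[[:space:]]*$' "$1" | tail -n 1
}

# collect_last_trees OUTDIR RUN_DIR...
# Write the last sampled tree of each run to OUTDIR/<run>.last.tree.
collect_last_trees() {
    local outdir run trees
    outdir="$1"
    shift
    mkdir -p "$outdir"
    for run in "$@"; do
        trees="${run%/}/C1.trees"
        if [ -s "$trees" ]; then
            last_tree "$trees" > "$outdir/$(basename "$run").last.tree"
        else
            echo "skipping $run: no tree samples found" >&2
        fi
    done
}

# Example (hypothetical directory names):
#   collect_last_trees restart-state myrun-1 myrun-2 myrun-3
```

Choosing "the replica with the best stats" is left to the user here, since which statistic to compare is a judgment call.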

I'm in the process of writing some steering code to automate some of the above.

As for how to push back against other cluster users, speaking as a physicist and sometimes crystallographer who has done a bit of molecular dynamics simulation, I'd point out that the models for molecular phylogeny are much more successful (in the sense of being quantitative) than those used for MD. The burden of being more intellectually disreputable is on the chemists in this case. Of course, what really matters to the cluster administrators is who will provide the justification for continued operation and eventual upgrades of the facility, which comes in the form of citations to work you did on the cluster and paragraphs in future funding proposals.

Benjamin Redelings

Feb 12, 2019, 2:50:54 PM
to bali-ph...@googlegroups.com

Hey Marc,

I'm looking into this some more, and I suspect it will be possible to suspend and resume tasks using criu. However, they have to be run inside a container, so I think you could do this using docker. I'm also looking into ways to make this work transparently. Sorry this is taking a while...

-BenRI

mmi...@sdsc.edu

Sep 9, 2022, 7:42:46 PM
to bali-phy-users

Hi all,
Just checking in on this thread. The manual has a section called "Starting and stopping the program", and I thought that might mean a restart is possible. However, I gather there is still no method for restarting a run from a checkpoint?

Mark

Benjamin Redelings

Sep 9, 2022, 7:56:20 PM
to bali-ph...@googlegroups.com

Hi Mark,

BAli-Phy doesn't implement the ability to do checkpoint/restore on its own.  BAli-Phy's internal data structures are really complicated.  Instead, it should be possible to checkpoint and restore bali-phy by running inside docker and using the "docker checkpoint" command.

    https://criu.org/Docker
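
A minimal sketch of what that could look like, assuming bali-phy is already running in a container. The container and checkpoint names are placeholders, and the `run` helper and `DRY_RUN` switch are illustrative only. As far as I know, `docker checkpoint` is an experimental feature, so this also assumes the Docker daemon has experimental features enabled and that CRIU is installed on the host.

```shell
#!/usr/bin/env bash
# Sketch only: container and checkpoint names are placeholders;
# run, DRY_RUN, docker_checkpoint and docker_resume are hypothetical names.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Checkpoint a running container. By default this also stops the container.
docker_checkpoint() {
    local container name
    container="$1"
    name="$2"
    run docker checkpoint create "$container" "$name"
}

# Start a stopped container again from a named checkpoint.
docker_resume() {
    local container name
    container="$1"
    name="$2"
    run docker start --checkpoint "$name" "$container"
}

# Example (hypothetical names):
#   docker_checkpoint bali-run1 ckpt-72h
#   docker_resume     bali-run1 ckpt-72h
```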

Are you interested in this approach?  I'd be happy to help with the implementation...

-BenRI

mmi...@sdsc.edu

Sep 20, 2022, 3:09:16 PM
to bali-phy-users
Hi Ben,
We can't use Docker at present, due to security concerns on our HPC machine.
However, I have implemented on CIPRES the ability to make multiple (up to 6) identical runs with most simple parameters; thus enabling the parallelism available in the code.
I also implemented a restart function, so people can run a consensus tree analysis after they complete their initial run(s).
We could add more features as they are requested.
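
Outside of CIPRES, the "several identical runs" idea can be approximated on a single machine with a small launcher. This is a generic sketch, not how CIPRES implements it: `run_chains` is a hypothetical helper that starts N copies of whatever command it is given and waits for all of them, and the bali-phy command line in the example comment is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: run_chains is a hypothetical helper, not part of BAli-Phy
# or CIPRES.

# run_chains N CMD [ARGS...]
# Start N copies of CMD in the background, wait for all of them, and
# return the number of copies that exited with a non-zero status.
run_chains() {
    local n pids failed i pid
    n="$1"
    shift
    pids=""
    failed=0
    i=1
    while [ "$i" -le "$n" ]; do
        "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}

# Example (placeholder command line):
#   run_chains 6 bali-phy sequences.fasta
```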

Mark