--
You received this message because you are subscribed to the Google Groups "bali-phy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bali-phy-users+unsubscribe@googlegroups.com.
To post to this group, send email to bali-phy-users@googlegroups.com.
Visit this group at https://groups.google.com/group/bali-phy-users.
For more options, visit https://groups.google.com/d/optout.
Try criu. If that is problematic, then link bali-phy with libckp and use that. Bali-phy doesn’t make expensive memory demands, so why not run it on your own private machine?
On Monday, December 24, 2018, Marc Hömberger <hoe...@gmail.com> wrote:
Hi everyone,
I am using BAli-Phy on a computing cluster, and due to some changes a wall clock limit of 72 hours has been put in place. That means I can no longer use the cluster for BAli-Phy, since none of my runs will converge within 72 hours. I was wondering if there is a way to resume a terminated BAli-Phy job, or if anyone has been able to write checkpoint files that allow runs to be continued?
Thanks,
Happy Holidays,
Marc
Hi Marc,
First of all, my apologies for taking so long to respond. I got sick during the holidays and that meant that I had less energy to handle difficult questions like this.
You are the second person who has asked for checkpoint-restore capabilities in bali-phy, so I think this is an important issue. However, bali-phy's internal state is very (VERY) complex, and so I have not yet spent the necessary time to figure out how to dump and restore that state.
I was able to get CRIU to work and restore a bali-phy job:
% sudo /usr/sbin/criu dump -D snap-${PID} -t ${PID} -vvv --shell-job
% sudo /usr/sbin/criu restore -D snap-${PID} -v3 --shell-job -o restore.log
The only problem here is that you need to be root! If the
cluster administrators are willing to make criu SUID then this
would solve the problem, but might be considered a security risk.
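
For repeated use, the two criu commands above could be wrapped in a small script. This is only a minimal sketch, not something I have tested on a cluster: the function names `checkpoint`, `restore`, and `run`, and the `CRIU` and `DRY_RUN` variables, are made up for illustration. With `DRY_RUN=1` (the default here) the commands are printed instead of executed, so you can inspect them before running anything as root.

```shell
#!/bin/sh
# Sketch only: wraps the criu dump/restore commands quoted above.
# CRIU and DRY_RUN are hypothetical variables introduced for this example.
CRIU="${CRIU:-/usr/sbin/criu}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Dump the process tree rooted at the given PID into snap-PID/.
checkpoint() {
    pid="$1"
    run mkdir -p "snap-$pid"
    run sudo "$CRIU" dump -D "snap-$pid" -t "$pid" -vvv --shell-job
}

# Restore the process tree from the images saved in snap-PID/.
restore() {
    pid="$1"
    run sudo "$CRIU" restore -D "snap-$pid" -v3 --shell-job -o restore.log
}
```

After sourcing the script, `checkpoint 1234` prints the dump command for PID 1234, and `DRY_RUN=0` makes it actually run. The root requirement described above still applies.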
Does your cluster use SLURM? This presentation from 2016 discusses integrating the CRIU, BLCR, and DMTCP checkpoint-restore frameworks into SLURM:
https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf
It seems reasonable to me that, if the cluster administrators are going to limit jobs to 72 hours, then they should do a bit of work to help out people whose jobs don't fit in that window.
Have you had any luck with CRIU on your cluster?
-BenRI
P.S. The cluster administrator at Duke says that professors who use the cluster are often quite pushy, and say things like "we are doing chemistry and molecular dynamics simulations, real science, and should have priority over those biologists who are not doing real science." Maybe someone whose jobs take less than 72 hours pushed the cluster administrators to change their scheduling policy in favor of short jobs? In that case, maybe you can push back (or find someone senior enough to push back) the other direction.
Hey Marc,
I'm looking into this some more, and I suspect it will be
possible to suspend and resume tasks using criu. However, they
have to be run inside of a container. So, you could do this using
docker, I think. However, I'm looking into ways to make this work
transparently. Sorry this is taking a while...
-BenRI
Hi Mark,
BAli-Phy doesn't implement the ability to do checkpoint/restore on its own. BAli-Phy's internal data structures are really complicated. Instead, it should be possible to checkpoint and restore bali-phy by running inside docker and using the "docker checkpoint" command.
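
A rough sketch of what the docker approach might look like follows. The container name, image name, and helper functions are all hypothetical; there is no image by that name that I am pointing to. As far as I know, "docker checkpoint" is an experimental feature, so the docker daemon needs experimental mode enabled and CRIU installed on the host. With `DRY_RUN=1` (the default here) the script only prints the docker commands.

```shell
#!/bin/sh
# Sketch only: checkpoint and resume a containerized bali-phy run.
# CONTAINER and IMAGE are hypothetical names chosen for this example.
DRY_RUN="${DRY_RUN:-1}"
CONTAINER="${CONTAINER:-bali-phy-run}"
IMAGE="${IMAGE:-bali-phy-image}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start bali-phy in a detached, named container; extra arguments go to bali-phy.
start_job() {
    run docker run -d --name "$CONTAINER" "$IMAGE" bali-phy "$@"
}

# Save a checkpoint with the given name; by default this stops the container.
checkpoint_job() {
    run docker checkpoint create "$CONTAINER" "$1"
}

# Restart the same container from the named checkpoint.
resume_job() {
    run docker start --checkpoint "$1" "$CONTAINER"
}
```

The idea would be to call `start_job` at the beginning, `checkpoint_job cp1` shortly before the wall clock limit, and `resume_job cp1` in the next job. Whether the cluster allows docker at all is a separate question.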
Are you interested in this approach? I'd be happy to help with
the implementation...
-BenRI
To view this discussion on the web visit https://groups.google.com/d/msgid/bali-phy-users/d9f8ed02-199d-4b35-841d-d7671efe1467n%40googlegroups.com.