[slurm-users] Suspending jobs for file system maintenance


Juergen Salk

Oct 19, 2021, 3:06:59 PM
to slurm...@lists.schedmd.com
Dear all,

we are planning to perform some maintenance work on our Lustre file system
which may or may not harm running jobs. Although failover functionality is
enabled on the Lustre servers we'd like to minimize risk for running jobs
in case something goes wrong.

Therefore, we thought about suspending all running jobs and resuming
them as soon as the file system is back again.

The idea would be to stop Slurm from scheduling new jobs as a first step:

# for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done

with foo, bar and baz being the configured partitions.

Then suspend all running jobs (taking job arrays into account):

# squeue -ho %A -t R | xargs -n 1 scontrol suspend
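Depending on the Slurm version, `scontrol suspend` also accepts a
comma-separated job list, which avoids forking one scontrol process per
job. A hedged sketch (verify against your scontrol man page; the
`SCONTROL` override is not part of Slurm, it only exists here so the
function can be dry-run):

```shell
# Join job IDs from stdin into a comma-separated list and suspend them
# with a single scontrol call (assumes scontrol accepts a job list).
suspend_list() {
    paste -sd, - | xargs -r ${SCONTROL:-scontrol} suspend
}
# Usage: squeue -ho %A -t R | suspend_list
```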

Then perform the failover of the OSTs to another OSS server.
Once done, verify that the file system is fully back and all
OSTs are in place again on the client nodes.
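The verification step could be scripted along these lines. This is only
a sketch, under the assumption that `lfs check osts` prints one line per
OST ending in "active." when everything is healthy (check the output
format on your Lustre version first; the `LFS` override exists only so
the check can be dry-run):

```shell
# Succeed only if every OST line reports "active."
osts_all_active() {
    ! ${LFS:-lfs} check osts 2>&1 | grep -qv 'active\.$'
}
# Usage: osts_all_active && echo "file system fully back"
```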

Then resume all suspended jobs:

# squeue -ho %A -t S | xargs -n 1 scontrol resume

Finally bring back the partitions:

# for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
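Taken together, the pre-maintenance half of the steps above could be
wrapped in a small script. This is only a sketch: the partition names
are the placeholders from the post, and the `SCONTROL`/`SQUEUE`
overrides exist solely so the script can be dry-run before the real
maintenance window:

```shell
#!/bin/sh
# Stop scheduling on the given partitions, then suspend all running jobs.
pre_maintenance() {
    for p in "$@"; do
        ${SCONTROL:-scontrol} update PartitionName="$p" State=DOWN
    done
    ${SQUEUE:-squeue} -ho %A -t R | while read -r job; do
        ${SCONTROL:-scontrol} suspend "$job"
    done
}
# Usage: pre_maintenance foo bar baz
```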

Does that make sense? Is that common practice? Are there any caveats that
we must think about?

Thank you in advance for your thoughts.

Best regards
Jürgen



Paul Edmon

Oct 19, 2021, 3:15:47 PM
to slurm...@lists.schedmd.com
Yup, this looks analogous to our process; we follow the same steps
when we do Slurm upgrades.

-Paul Edmon-

Juergen Salk

Oct 22, 2021, 6:08:20 PM
to Slurm User Community List
Thanks, Paul, for confirming our planned approach. We did it that way
and it worked very well. I have to admit that my palms were a bit
sweaty when suspending thousands of running jobs, but it went through
without any problems. I just didn't dare to resume all suspended jobs
at once, though, and did that in a staggered manner instead.
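The staggered resume could look something like this sketch (the batch
size and delay are made-up numbers, and the `SCONTROL` override only
exists so the function can be dry-run):

```shell
# Resume suspended jobs in batches, pausing between batches.
resume_staggered() {
    batch_size=$1
    delay=$2
    xargs -n "$batch_size" | while read -r batch; do
        for job in $batch; do
            ${SCONTROL:-scontrol} resume "$job"
        done
        sleep "$delay"
    done
}
# Usage: squeue -ho %A -t S | resume_staggered 100 30
```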

Best regards
Jürgen

* Paul Edmon <ped...@cfa.harvard.edu> [211019 15:15]:

Alan Orth

Oct 25, 2021, 4:48:56 AM
to Slurm User Community List
Dear Jurgen and Paul,

This is an interesting strategy, thanks for sharing. So if I read the scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job processes. The processes remain in memory, but are paused. What happens to open file handles, since the underlying filesystem goes away and comes back?

Thank you,
--

Paul Edmon

Oct 25, 2021, 8:59:43 AM
to slurm...@lists.schedmd.com

I think it depends on the filesystem type. Lustre generally fails over
nicely and handles reconnections without much of a problem. We've done
this before without any hitches, even with the jobs live; generally the
jobs just hang and then resolve once the filesystem comes back.

On a live system you will end up with a completion storm: jobs are
always exiting, so while the filesystem is gone, jobs that depend on it
will just hang, and jobs that are completing will stall on the
completion step. Once the filesystem returns, all that traffic flushes
at once. This can create issues where a bunch of nodes get closed due
to "Kill task failed" or other completion flags. Generally these are
harmless, though I have seen stuck processes on nodes and have had to
reboot them to clear, so you should check any node before putting it
back in action.

That said, if you are pausing all the jobs and the scheduling, this is
somewhat mitigated, though jobs will still exit due to timeout.
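Checking the nodes afterwards can be done with `sinfo -R`, which lists
down/drained nodes together with the reason Slurm recorded. A small
sketch (the format string just widens the columns; the `SINFO` override
only exists so the function can be dry-run):

```shell
# List drained/down nodes with state and recorded reason
# (e.g. "Kill task failed" after a completion storm).
drained_report() {
    ${SINFO:-sinfo} -R -o '%20n %10t %E'
}
```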

-Paul Edmon-

Juergen Salk

Oct 25, 2021, 10:43:01 AM
to Slurm User Community List
Hi Alan and Paul,

I can't claim to be a Lustre guru, but my understanding is that Lustre
failover does not imply an umount/mount of the file system on the
client side. On the clients, the OSTs just stall until they are back,
so open file handles should actually be kept across the process.
However, we were still unsure whether any running application would
survive an inaccessible file system, even one that is only temporarily
gone, and (in this specific case) we were also unsure how long the
failover would take to succeed.

I do not think that jobs will time out while suspended, though, as
Slurm does not count the time spent in suspend against the job's
walltime. So suspending jobs does not "steal" walltime from them.

Best regards
Jürgen


* Paul Edmon <ped...@cfa.harvard.edu> [211025 08:59]: