[slurm-users] Slurm reservation for migrating user home directories

Ole Holm Nielsen

Apr 16, 2021, 8:24:13 AM
to Slurm User Community List
I need to migrate several sets of user home directories from an old NFS
file server to a new NFS file server. Each group of users belongs to
specific Slurm accounts organized in a hierarchical tree.

I want to carry out the migration while the cluster is in full production
mode for all the other accounts (the terms "service window" or "downtime"
don't exist for me :-)

My idea is to make a Slurm reservation so that the accounts in question
will have zero jobs running during the reservation, and I also need to
kick users off the login nodes. During the reservation I can rsync the
home directories from the old NFS server to the new NFS server and update
the NFS automounter links.
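
A rough sketch of the copy-and-switch step during such a reservation
(server names, export paths, and the autofs map location below are
illustrative assumptions, not taken from this thread):

  # On the new NFS server: final rsync pass, pulling from the old server
  rsync -aHAX --delete oldnfs:/export/home/ /export/home/

  # On the automount master: point the map at the new server and reload
  sed -i 's/oldnfs:/newnfs:/' /etc/auto.home
  systemctl reload autofs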

Question: Does anyone have experience with this type of scenario? Any
good ideas or suggestions for other methods of data migration?

Thanks,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

Ward Poelmans

Apr 16, 2021, 9:20:45 AM
to slurm...@lists.schedmd.com
Hi Ole,

On 16/04/2021 14:23, Ole Holm Nielsen wrote:
> Question: Does anyone have experience with this type of scenario? Any
> good ideas or suggestions for other methods of data migration?

We once did something like that.

Basically, the process went like this (a rough shell sketch follows the
list):
- Process is kicked off per user by some trigger
- Block all new jobs of the given user
- Wait until all currently running jobs have finished
- Disable the user in LDAP and wipe the sssd cache for the user
- Kill all their processes on the login nodes
- Move the data
- Re-enable the user in LDAP
- Remove any blocks/limits so the user can start new jobs
- Mail the user that they can continue working again
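
A rough per-user sketch of the above (user name, host lists, and paths
are illustrative assumptions; the LDAP disable/enable steps are
site-specific and only indicated as comments):

  U=alice    # hypothetical user

  # Block new jobs (running jobs are unaffected), then wait for drain
  sacctmgr -i modify user where name=$U set MaxJobs=0
  while squeue -h -u $U | grep -q . ; do sleep 300; done

  # Disable the account in LDAP (site-specific, e.g. via ldapmodify),
  # then wipe the sssd cache for the user everywhere
  pdsh -w login[01-02],node[001-100] sss_cache -u $U

  # Kill the user's processes on the login nodes
  pdsh -w login[01-02] pkill -u $U

  # Move the data
  rsync -aHAX --delete /old_home/$U/ /new_home/$U/

  # Re-enable in LDAP (site-specific), lift the job limit, notify
  sacctmgr -i modify user where name=$U set MaxJobs=-1
  echo "Your home directory has been migrated." | mail -s "Done" $U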

The whole process went pretty smoothly.

Ward

Tina Friedrich

Apr 16, 2021, 9:40:08 AM
to slurm...@lists.schedmd.com
Had to do home directory migrations a couple of times without 'full'
downtimes. Similar process, only I don't think we ever bothered
disabling users in LDAP or blocking their jobs. Generally, we told them
we'd move their directory at time X and asked them to please log out
everywhere; at time X, we killed their jobs & sessions (if any),
migrated everything (including automount information), and let them
know they could log in again.

That said, clearing the sssd etc. caches sounds like a very good idea :)

Two suggestions to add:

- Make the old home directories read-only/immutable directly after
migration, so that forgotten sessions, or ones picking up the wrong
automount information, throw errors when trying to use them.

- I'd rsync the whole file system across to the new machines well ahead
of 'migration day', so that during the migration only a 'last pass' sort
of sync is required - generally much faster if most of the files are
already there (see the sketch below).
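
A minimal sketch of that two-pass approach (paths are illustrative; note
that chattr must run locally on the old file server, since it does not
work over NFS):

  # Well ahead of migration day: bulk copy while users keep working
  rsync -aHAX /old_home/ /new_home/

  # During the migration window: fast final pass
  rsync -aHAX --delete /old_home/ /new_home/

  # Afterwards, on the old server: make the old tree immutable
  chattr -R +i /old_home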

Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

Ole Holm Nielsen

Apr 16, 2021, 10:21:37 AM
to Slurm User Community List
Hi Niels Carl,

On 16-04-2021 14:41, Niels Carl Hansen wrote:
> For each account do
>      sacctmgr modify account name=<accountname> set GrpJobs=0
>
> After sync'ing, resume with
>      sacctmgr modify account name=<accountname> set GrpJobs=-1

Yes, but this would block all new jobs from <accountname> immediately.
If this account had a week-long job running, no shorter jobs could
start until after the migration.

That's why I'm thinking that a system reservation excluding all jobs
from <accountname> could be created several weeks in advance, so that
<accountname> jobs could keep starting and running until they are
blocked by the reservation.

I'm thinking of a reservation something like this:

scontrol create reservation starttime=... duration=12:00:00 \
    ReservationName=migrate_physics nodes=ALL Accounts=-physics

Would this work as expected?

Best regards,
Ole

Ole Holm Nielsen

Apr 27, 2021, 2:59:31 AM
to Slurm User Community List
On 4/16/21 4:21 PM, Ole Holm Nielsen wrote:
> I'm thinking of a reservation something like this:
>
> scontrol create reservation starttime=...  duration=12:00:00
> ReservationName=migrate_physics nodes=ALL Accounts=-physics

For the record: The idea of creating a Slurm reservation to exclude
specified accounts from running jobs seems to be a viable one. The
question is being tracked in https://bugs.schedmd.com/show_bug.cgi?id=11404

The correct way to make such a reservation is actually to add several flags:

$ scontrol create reservation reservationname=exclude_account \
    starttime=13:40:00 duration=30:00 flags=ignore_jobs,magnetic,flex \
    nodes=ALL accounts=-sub1
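
Once the migration window is over, the reservation can be removed again
with the standard command:

$ scontrol delete reservationname=exclude_account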

Caveat: This will result in all pending jobs getting an incorrect
Reason=(ReqNodeNotAvail, Reserved for maintenance). Jobs from other
accounts do seem to start correctly, however, so this does achieve the
goal - but it probably also causes confusion among users!

SchedMD is looking at enhancing a future Slurm version so that the
incorrect Reason doesn't appear.

/Ole
