[slurm-users] Cleanup of job_container/tmpfs


Jason Ellul

Feb 28, 2023, 9:29:12 PM
To: slurm...@lists.schedmd.com

Hi,

 

We have recently moved to Slurm 22.05.8 and have configured job_container/tmpfs to provide private tmp folders.

 

job_container.conf contains:

AutoBasePath=true

BasePath=/slurm

 

And in slurm.conf we have set

JobContainerType=job_container/tmpfs

 

I can see the folders being created and used, but when a job completes the root folder is not cleaned up.

 

Example of a running job:

[root@papr-res-compute204 ~]# ls -al /slurm/14292874

total 32

drwx------   3 root      root    34 Mar  1 13:16 .

drwxr-xr-x 518 root      root 16384 Mar  1 13:16 ..

drwx------   2 mzethoven root     6 Mar  1 13:16 .14292874

-r--r--r--   1 root      root     0 Mar  1 13:16 .ns

 

Example once a job completes (/slurm/<jobid> remains):

[root@papr-res-compute204 ~]# ls -al /slurm/14292794

total 32

drwx------   2 root root     6 Mar  1 09:33 .

drwxr-xr-x 518 root root 16384 Mar  1 13:16 ..

 

Is this to be expected, or should the folder /slurm/<jobid> also be removed?

Do I need to create an epilog script to remove the directory that is left behind?
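
(For illustration only: if an epilog does turn out to be needed, a minimal sketch, assuming BasePath=/slurm as above and that SLURM_JOB_ID is available in the epilog environment, could be as simple as the following.)

#!/bin/bash
# Hypothetical epilog sketch: remove the leftover per-job directory under the
# job_container BasePath after the job finishes. Assumes BasePath=/slurm.
if [ -n "$SLURM_JOB_ID" ] && [ -d "/slurm/$SLURM_JOB_ID" ]; then
    rm -rf "/slurm/$SLURM_JOB_ID"
fi
exit 0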

 

Many thanks for the assistance,

 

Jason

 

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Center

 



Ole Holm Nielsen

Mar 1, 2023, 4:28:47 AM
To: slurm...@lists.schedmd.com
Hi Jason,

IMHO, the job_container/tmpfs plugin does not work well in Slurm 22.05, but
there may be some significant improvements included in 23.02 (announced
yesterday). I've documented our experiences in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories
which contains links to bug reports against the job_container/tmpfs plugin.

We're using the auto_tmpdir SPANK plugin with great success in Slurm 22.05.

Best regards,
Ole

Michael Jennings

Mar 1, 2023, 5:16:00 PM
To: slurm...@lists.schedmd.com
On Wednesday, 01 March 2023, at 10:28:24 (+0100),
Ole Holm Nielsen wrote:


> but there may be some significant improvements included in 23.02

TL;DR: I can vouch for this.

The primary problem with the interaction between the new namespace
code and the automounter daemon was simply that the shared-subtree
flags on the root mount within the Slurm-built mount namespace were
not chosen with things like autofs in mind. The root mount was marked
private (no mounts propagate out, and none come in) rather than
shared+slave (no mounts propagate out, but mounts *do* come in); once
that was found and fixed, autofs and job containers played nicely again.
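
(A rough way to see the difference outside of Slurm, using util-linux's unshare; the automounted path is hypothetical, and this is only a sketch of the propagation semantics, not Slurm's implementation.)

# Run as root on a node where /users is managed by autofs (hypothetical path).
#
# Private propagation: when the automount daemon mounts the home directory,
# the mount lands only in the host namespace and does not propagate in, so
# the access typically fails or sees an empty directory.
unshare --mount --propagation private ls /users/someuser
#
# Slave propagation: host-side mounts (including those triggered by autofs)
# propagate into the namespace, while mounts made inside stay contained.
unshare --mount --propagation slave ls /users/someuser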

There were a couple other Oopsies with regard to mount/unmount
operations that were addressed at the same time. If you're curious
for more detail, you can follow the links in this comment on our bug
report:

https://bugs.schedmd.com/show_bug.cgi?id=12567#c47

Happy Slurming!
Michael

--
Michael E. Jennings (he/him) <m...@lanl.gov> https://hpc.lanl.gov/
HPC Platform Integration Engineer - Platforms Design Team - HPC Design Group
Ultra-Scale Research Center (USRC), 4200 W Jemez #301-25 +1 (505) 606-0605
Los Alamos National Laboratory, P.O. Box 1663, Los Alamos, NM 87545-0001

Jason Ellul

Mar 1, 2023, 6:28:16 PM
To: Ole.H....@fysik.dtu.dk, Slurm User Community List

Thanks so much Ole for the info and link,

 

Your documentation is extremely useful.

 

Prior to moving to 22.05, we had been using slurm-spank-private-tmpdir with an epilog to clean up the folders on job completion, but we were hoping to move to the built-in functionality to ensure future compatibility and reduce complexity.

 

We will try 23.02, and if that does not resolve our issue, we will consider moving back to slurm-spank-private-tmpdir or auto_tmpdir.

 

Thanks again,

 

Jason

 

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Center

 

 

From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Ole Holm Nielsen <Ole.H....@fysik.dtu.dk>
Date: Wednesday, 1 March 2023 at 8:29 pm
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Cleanup of job_container/tmpfs


Jason Ellul

Mar 1, 2023, 6:29:04 PM
To: Slurm User Community List

Hi Michael,

 

Thanks so much for the info; we will try 23.02.

 

Cheers,

 

Jason

 

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Center

 

 

From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Michael Jennings <m...@lanl.gov>
Date: Thursday, 2 March 2023 at 9:17 am
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Cleanup of job_container/tmpfs


Niels Carl W. Hansen

Mar 6, 2023, 4:15:54 AM
To: slurm...@lists.schedmd.com
Hi all

It seems there are still some issues with the autofs / job_container/tmpfs functionality in Slurm 23.02.
If the required directories aren't mounted on the allocated node(s) before job start, we get:

slurmstepd: error: couldn't chdir to `/users/lutest': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/users/lutest': No such file or directory: going to /tmp instead


An easy workaround, however, is to include this line in the Slurm prolog on the slurmd nodes:

/usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true


But perhaps there is a better way to solve the problem?
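
(Sketch of how that might be wired up, assuming the line lives in the script that Prolog= in slurm.conf points to; the path /etc/slurm/prolog.sh is hypothetical.)

#!/bin/bash
# /etc/slurm/prolog.sh (hypothetical): touch the job owner's home as that user
# so autofs mounts it on this node before the job's namespace is set up.
/usr/bin/su - "$SLURM_JOB_USER" -c /usr/bin/true
exit 0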

Best
Niels Carl

Brian Andrus

Mar 6, 2023, 3:07:24 PM
To: slurm...@lists.schedmd.com

That looks like the user's home directory doesn't exist on the node.

If you are not using a shared home across the nodes, your user-onboarding process should be reviewed to ensure it can handle issues like this one.

If you are using a shared home, you should do the above and also have the node verify that the shared filesystems are mounted before it accepts jobs.

-Brian Andrus

Michael Jennings

Mar 6, 2023, 4:06:34 PM
To: slurm...@lists.schedmd.com
On Monday, 06 March 2023, at 10:15:22 (+0100),
Niels Carl W. Hansen wrote:

> Seems there still are some issues with the autofs -
> job_container/tmpfs functionality in Slurm 23.02.
> If the required directories aren't mounted on the allocated node(s)
> before jobstart, we get:
>
> slurmstepd: error: couldn't chdir to `/users/lutest': No such file or
> directory: going to /tmp instead
> slurmstepd: error: couldn't chdir to `/users/lutest': No such file or
> directory: going to /tmp instead
>
> An easy workaround however, is to include this line in the slurm
> prolog on the slurmd -nodes:
>
> /usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true
>
> -but there might exist a better way to solve the problem?

What we do, and what any site using NHC in (or as) its Prolog script
can do, is use the `-F` option of `check_fs_mount()`. By changing
`-f` to `-F`, NHC will know to trigger the filesystem to be mounted.
Here's an example from one of our clusters (names changed, of course):

check_fs_mount_rw -t "nfs" -s "nas-srv:/proj" -F "/net/nfs/projects"

Documentation for the check is at https://github.com/mej/nhc#check_fs_mount
if you're interested.
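
(In an nhc.conf, that check would normally carry a host-match target on the left-hand side; the names below are the same made-up ones as above.)

* || check_fs_mount_rw -t "nfs" -s "nas-srv:/proj" -F "/net/nfs/projects"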

I'm not sure that's "better," but it's an option. :-)

HTH!

Ole Holm Nielsen

Mar 7, 2023, 3:20:15 AM
To: slurm...@lists.schedmd.com
Hi Brian,

Presumably the user's home directory is NFS-automounted using autofs, and
therefore it doesn't exist yet when the job starts.

The job_container/tmpfs plugin ought to work correctly with autofs, but
maybe this is still broken in 23.02?

/Ole

Hagdorn, Magnus Karl Moritz

Mar 7, 2023, 9:13:42 AM
To: slurm...@lists.schedmd.com
I just upgraded Slurm to 23.02 on our test cluster to try out the new
job_container/tmpfs functionality. I can confirm it works with autofs
(hurrah!), but you need to set the Shared=true option in the
job_container.conf file.
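
(Putting the thread together, a job_container.conf along these lines is what that implies; AutoBasePath and BasePath are taken from the configuration earlier in the thread, and Shared=true is the 23.02 option referred to above.)

# job_container.conf (Slurm 23.02) -- sketch based on this thread
AutoBasePath=true
BasePath=/slurm
# Needed so autofs-managed mounts from the host propagate into the job's namespace
Shared=true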
Cheers
magnus
--
Magnus Hagdorn
Charité – Universitätsmedizin Berlin
Geschäftsbereich IT | Scientific Computing
 
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
 
magnus....@charite.de
https://www.charite.de
HPC Helpdesk: sc-hpc-...@charite.de

Niels Carl W. Hansen

Mar 7, 2023, 10:06:23 AM
To: slurm...@lists.schedmd.com
That was exactly the bit I was missing. Thank you very much, Magnus!

Best
Niels Carl