[slurm-users] temporary SLURM directories


Arsene Marian Alain

May 23, 2022, 5:30:36 AM
to slurm...@lists.schedmd.com

Dear SLURM users,

 

I am the IT administrator of a small scientific computing center. We recently installed SLURM as the job scheduler on our cluster and everything seems to be working fine. I just have a question about how to create temporary directories with SLURM.

 

We use several programs for scientific calculation (such as Gromacs, Gaussian, NAMD, etc.), and the process is the following:

 

When we need to launch a calculation, the first step is to copy all the necessary files from the local "$SLURM_SUBMIT_DIR" directory to the "/scratch" of the remote node. The second step is to change into the remote node's "/scratch" and run the program. Finally, when the program finishes, we copy all the output files from the remote node's "/scratch" back to the local "$SLURM_SUBMIT_DIR" directory.

 

So, is there any way to automatically generate a temporary directory inside the "/scratch" of the remote node?

 

At the moment I am creating that directory manually as follows:

export HOMEDIR=$SLURM_SUBMIT_DIR
export SCRATCHDIR=/scratch/job.$SLURM_JOB_ID.$USER
export WORKDIR=$SCRATCHDIR

# create the per-job scratch directory and copy the input files there
mkdir -p "$WORKDIR"
cp "$HOMEDIR"/* "$WORKDIR"
cd "$WORKDIR" || exit 1

# run the calculation
$NAMD/namd2 +idlepoll +p11 run_eq.namd > run_eq.log

# copy the results back to the submission directory
wait
cp "$WORKDIR"/* "$HOMEDIR"

 

The main problem with creating the "/scratch" directory manually is that when the calculation ends (successfully or not), users have to go into "/scratch" and remove the directory by hand. I know I could add a line at the end of my script to delete the directory when the calculation is done, but I'm sure there must be a better way to do this.
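
(What I mean by "a line at the end of my script" is roughly the following sketch: a bash trap placed right after the mkdir line above, so the cleanup also runs if the program fails partway.)

trap 'rm -rf "$WORKDIR"' EXIT   # remove the scratch directory on any exit, clean or not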

 

 

Thanks in advance for the help.

 

best regards,

 

Alain

Diego Zuccato

unread,
May 23, 2022, 6:57:29 AM5/23/22
to Slurm User Community List, Arsene Marian Alain
Hi Arsene.

I did something like that some weeks ago.

In slurm.conf I used the lines
Prolog=/home/conf/Prolog.sh
TaskProlog=/home/conf/TaskProlog.sh
Epilog=/home/conf/Epilog.sh

The scripts for prolog and epilog manage the creation (and permission
assignment) of a directory on local storage (named with the job ID, so
that different jobs don't get mixed up).
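
Roughly, a simplified sketch of such a pair (not the exact scripts we
run; the per-job /scratch path is just an example):

/home/conf/Prolog.sh (runs as root on the node before the job starts):
#!/bin/bash
# create a per-job scratch directory owned by the job's user
SCRATCHDIR=/scratch/job.${SLURM_JOB_ID}
mkdir -p "$SCRATCHDIR"
chown "$SLURM_JOB_USER" "$SCRATCHDIR"
chmod 700 "$SCRATCHDIR"
exit 0

/home/conf/Epilog.sh (runs as root after the job ends, successful or
not):
#!/bin/bash
# remove the per-job scratch directory
rm -rf "/scratch/job.${SLURM_JOB_ID}"
exit 0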

The TaskProlog script should export an environment variable, but I
couldn't make it work :(
In your case, TaskProlog should copy the dataset to the local storage,
and then you should add a TaskEpilog script to copy back the results. I
don't know if TaskEpilog gets run for aborted jobs.

Moreover, IIRC you shouldn't do slow operations in the task prolog or
epilog, so in your case a state machine implemented as a job array
could probably be better suited than TaskProlog/TaskEpilog (you'd need
Prolog/Epilog anyway): the first "job" copies to scratch, the second
does the number crunching and the third copies back the results (see
the sketch below).
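
A possible way to wire the three stages together (using job
dependencies rather than a literal array; the script names are just
placeholders):

j1=$(sbatch --parsable copy_in.sh)      # stage input data to /scratch
j2=$(sbatch --parsable --dependency=afterok:$j1 compute.sh)
sbatch --dependency=afterany:$j2 copy_out.sh   # copy back even on failure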

HIH,
Diego
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

René Sitt

May 23, 2022, 7:25:09 AM
to slurm...@lists.schedmd.com

Hello,

you might want to have a look at the auto_tmpdir plugin ( https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir ), which does pretty much what you want - it creates job-specific temporary directories and bindmounts them into the specified locations (so the job will see /tmp/, /scratch/ and /dev/shm/ while these folders are actually located at /tmp/slurm-$jobid/, /scratch/slurm-$jobid/, and /dev/shm/slurm-$jobid/).
The bindmounts are destroyed when the job exits, so there's no need to manually delete them (and it's also much safer, because it also deletes them when the job crashes).

Note: Depending on the SLURM version you're running, you might need to check out the "dir-removal-fixup" branch instead of master (IIRC for SLURM < 20.x).
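
Since it's a SPANK plugin, it's enabled via plugstack.conf, roughly like this (just a sketch; the install path and the available mount options are listed in the plugin's README):

# /etc/slurm/plugstack.conf
required /usr/lib64/slurm/auto_tmpdir.so   # plus the options documented in the README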

Kind regards,
René Sitt

On 23.05.22 at 11:30, Arsene Marian Alain wrote:
-- 
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
si...@hrz.uni-marburg.de
www.hkhlr.de

Mark Dixon

May 25, 2022, 8:43:13 AM
to Slurm User Community List
In addition to the other suggestions, there's this:

https://slurm.schedmd.com/faq.html#tmpfs_jobcontainer
https://slurm.schedmd.com/job_container.conf.html
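
For reference, the basic setup those pages describe boils down to
something like this (a sketch based on the linked docs; the path is
illustrative):

# slurm.conf
JobContainerType=job_container/tmpfs

# job_container.conf
AutoBasePath=true
BasePath=/scratch

Each job then gets a private /tmp (and /dev/shm) backed by a per-job
directory under BasePath, which is removed when the job ends.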

I would be interested in hearing how well it works - it's so buried in the
documentation that unfortunately I didn't see it until after I had rolled
a solution similar to Diego's (which can be extended so that TaskProlog
sets the TMPDIR environment variable appropriately and limits the disk
space used by the job).

All the best,

Mark


Diego Zuccato

May 26, 2022, 5:49:00 AM
to Slurm User Community List, Mark Dixon
On 25/05/2022 14:42, Mark Dixon wrote:

> https://slurm.schedmd.com/faq.html#tmpfs_jobcontainer
> https://slurm.schedmd.com/job_container.conf.html
> I would be interested in hearing how well it works - it's so buried in
> the documentation that unfortunately I didn't see it until after I
> rolled a solution similar to Diego's
Well, I found it, but IIUC it just handles tmpfs (RAM-backed), while I
needed to use actual disk space: the RAM is needed for the job :)

> (which can be extended such that
> TaskProlog sets the TMPDIR environment variable appropriately, and limit
> the disk space used by the job).

I still can't
export TMPDIR=...
from the TaskProlog script. I'm surely missing something important.
Maybe TaskProlog is called in a subshell? In that case it can't alter
the caller's env... But IIUC someone made it work, and that confuses
me...

Diego Zuccato

May 26, 2022, 6:03:18 AM
to Slurm User Community List, Mark Dixon
On 26/05/2022 11:48, Diego Zuccato wrote:

> I still can't
> export TMPDIR=...
> from the TaskProlog script. I'm surely missing something important.
> Maybe TaskProlog is called in a subshell? In that case it can't alter
> the caller's env... But IIUC someone made it work, and that confuses
> me...

Seems I finally managed to understand the TaskProlog script! It's more
involved than I thought. :(

The script is run (on the first allocated node, IIUC) in a subshell (so
a direct export can't work), and *its output* is processed in the job
shell. Please correct me if I'm wrong.
That's why the FAQ https://slurm.schedmd.com/faq.html uses lines like
echo "print ..."

Changing my TaskProlog.sh from
export TMPDIR=...
to
echo "export TMPDIR=..."
fixed it. Now I'm much happier :)
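
So a working TaskProlog.sh ends up looking something like this (a
sketch following the pattern above; the scratch path just reuses the
naming from earlier in the thread):

#!/bin/bash
# stdout of TaskProlog is parsed by slurmstepd:
#   "export NAME=VALUE" lines are added to the task's environment,
#   "print ..." lines are written to the job's standard output.
echo "export TMPDIR=/scratch/job.${SLURM_JOB_ID}"
echo "print TMPDIR set to /scratch/job.${SLURM_JOB_ID}"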