Greetings all,
I am struggling a bit with how TaskProlog works with srun. Our TaskProlog creates a temporary directory on local compute-node scratch space and exports its path in a variable called $TMPDIR. Our TaskEpilog deletes that directory. This works great for jobs submitted with sbatch. If I start an srun job with --ntasks=1, everything works the same as with sbatch: the $TMPDIR variable is set and the directory is created on local scratch. However, if I use --ntasks=n with n > 1, the $TMPDIR variable is still set but the directory itself does not exist. Key files and examples:
slurm.conf (relevant entries):
#Prolog=/opt/slurm/prolog.bash
#PrologFlags=Alloc,NoHold
#Epilog=/opt/slurm/epilog.bash
#SrunProlog=/opt/slurm/srun_prolog
#SrunEpilog=/opt/slurm/srun_epilog
TaskProlog=/opt/slurm/task_prolog
TaskEpilog=/opt/slurm/task_epilog
/opt/slurm/task_prolog:
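#!/bin/bash
# Runs once per task as the job user. Lines printed to stdout in the
# form "export NAME=value" are added to the task's environment.
mytmpdir="/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID"
mkdir -p "$mytmpdir"
echo "export TMPDIR=$mytmpdir"
exit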
/opt/slurm/task_epilog:
#!/bin/bash
# Runs as the job user after each task terminates.
mytmpdir="/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID"
rm -Rf "$mytmpdir"
exit
Run example with --ntasks=1:
$ srun --pty --mem=16g --ntasks=1 --time 0-08:00 --gres=scratch:20g --partition=cbc --nodelist=c4-n13 $SHELL
[hputnam@c4-n13:job=421362 ~]$ echo $TMPDIR
/scratch/hputnam/421362
[hputnam@c4-n13:job=421362 ~]$ ls $TMPDIR
[hputnam@c4-n13:job=421362 ~]$
Run example with --ntasks=2 (the $TMPDIR variable is set but the directory does not exist):
$ srun --pty --mem=16g --ntasks=2 --time 0-08:00 --gres=scratch:20g --partition=cbc --nodelist=c4-n13 $SHELL
[hputnam@c4-n13:job=421370 ~]$ echo $TMPDIR
/scratch/hputnam/421370
[hputnam@c4-n13:job=421370 ~]$ ls $TMPDIR
ls: cannot access /scratch/hputnam/421370: No such file or directory
I am quite confused by this. I read https://slurm.schedmd.com/prolog_epilog.html, which says TaskProlog is run as the user executing srun, prior to launching the job step. I am not sure I understand what constitutes a job step. I do see a stepd process launched on the compute node each time I execute srun. That seems independent of --ntasks: I get one process per srun regardless of what --ntasks is set to.
Thanks in advance.
-Harry
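A note on the behavior described above: the slurm.conf documentation describes TaskProlog and TaskEpilog as running once per task, not once per job step or per job. One plausible reading of the --ntasks=2 result, offered as an assumption rather than something confirmed in this thread, is that the second task has no terminal under --pty, exits almost immediately, and its TaskEpilog then removes the directory that task 0 still expects. The local simulation below (no Slurm involved; the user name and job ID are placeholders) sketches that sequence:

```shell
#!/bin/bash
# Local simulation only: plain shell functions stand in for the per-task hooks.
scratch="$(mktemp -d)"
jobdir="$scratch/user/12345"   # placeholder for /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID

task_prolog() { mkdir -p "$jobdir"; }    # same logic as the TaskProlog above
task_epilog() { rm -rf -- "$jobdir"; }   # same logic as the TaskEpilog above

# One job step with two tasks: the prolog runs once for each task.
task_prolog   # task 0
task_prolog   # task 1

# Assumption: task 1 has no terminal attached and exits right away,
# so its epilog runs while task 0 is still active.
task_epilog   # task 1 finishes

if [ -d "$jobdir" ]; then
    result="directory still present"
else
    result="directory removed while task 0 is still running"
fi
echo "$result"

rm -rf -- "$scratch"   # clean up the simulation's temporary tree
```

If that reading is right, any per-job resource shared by all tasks is unsafe to create and delete from per-task hooks, whatever the value of --ntasks.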
Thanks for your reply, Bjorn-Helge.
This cleared things up for me. I had not understood that we need to use Prolog and Epilog for the TMPDIR stuff, because that guarantees the directory is created at the very beginning of the job and deleted at the very end. Everything now works as expected. Thanks so much for your help.
-Harry
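For anyone following along, here is a minimal sketch of the split described above. It is not the exact set of scripts used at this site. It assumes that Prolog and Epilog are run by slurmd (normally as root) once per node per job, with SLURM_JOB_ID, SLURM_JOB_USER and SLURM_JOB_UID in their environment, and that TaskProlog is kept only to export the variable. The function names and the SCRATCH_ROOT override are hypothetical, added so the logic can be tried outside Slurm. In practice each function body would live in its own script referenced from slurm.conf.

```shell
#!/bin/bash
# Sketch only: three hooks shown as functions in one file for readability.
# SCRATCH_ROOT is a hypothetical override; the thread uses /scratch.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch}"

job_tmpdir() {
    printf '%s/%s/%s\n' "$SCRATCH_ROOT" "$SLURM_JOB_USER" "$SLURM_JOB_ID"
}

# Body for Prolog=: create the directory once per node and hand it to the user.
job_prolog() {
    [ -n "$SLURM_JOB_USER" ] && [ -n "$SLURM_JOB_ID" ] || return 1
    local dir
    dir="$(job_tmpdir)"
    mkdir -p "$dir" && chown "$SLURM_JOB_UID" "$dir" && chmod 700 "$dir"
}

# Body for TaskProlog=: no filesystem work, only export the path to each task.
task_prolog() {
    echo "export TMPDIR=$(job_tmpdir)"
}

# Body for Epilog=: remove the directory once, after the whole job has ended.
job_epilog() {
    [ -n "$SLURM_JOB_USER" ] && [ -n "$SLURM_JOB_ID" ] || return 1
    rm -rf -- "$(job_tmpdir)"
}
```

The guards on SLURM_JOB_USER and SLURM_JOB_ID are there so that a missing variable can never turn the rm into a removal of a parent directory. No TaskEpilog is needed in this arrangement, since nothing is deleted per task.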