Hi all,
This comes a bit late, but we are having the same problem:
The sbatch script sees the job-specific /tmp created by
job_container/tmpfs, and so does the program the job runs, but srun and
mpirun do not when they launch it; they still see the system /tmp.
This is a problem especially if the user sets the working directory to
something inside the job-specific /tmp:
=====================
#!/bin/bash
#SBATCH --nodes=1
#SBATCH ...
mkdir /tmp/something
cd /tmp/something
srun hostname
=====================
This gives the following message:
slurmstepd: error: couldn't chdir to `/tmp/something': No such file or
directory: going to /tmp instead
In many cases, this message can apparently be ignored, since the program
itself sees the job-specific /tmp. For example, the following works as expected:
=====================
mkdir /tmp/something
cd /tmp/something
echo "42" > a
srun cat a
=====================
However, MPICH jobs fail with messages like these:
[proxy:1@gpu016] launch_procs (proxy/pmip_cb.c:869): unable to change
wdir to /tmp/something (No such file or directory)
[...] (more error messages; job aborts).
The new job_container/tmpfs parameter EntireStepInNS in Slurm 24.11
removes the slurmstepd error message, but MPICH still fails, so it seems
the problem is not entirely solved.
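In case it helps with diagnosing this, here is a minimal sketch for
comparing what the batch script and a job step actually see. It prints the
mount namespace ID and the device:inode pair of /tmp for each side. The
helper name tmp_view is made up for this illustration and is not part of
Slurm, and the sketch assumes a Linux node with /proc and GNU stat:

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print the mount namespace ID
# and the device:inode pair of /tmp as seen by the calling process.
tmp_view() {
    echo "$1: ns=$(readlink /proc/self/ns/mnt) tmp=$(stat -c '%d:%i' /tmp)"
}

# View from the batch script itself.
tmp_view batch

# View from inside a job step. The function body is passed along as text,
# so the step does not depend on exported shell functions.
# The guard only skips this part where srun is not installed.
if command -v srun >/dev/null 2>&1; then
    srun bash -c "$(declare -f tmp_view); tmp_view step"
fi
```

If both lines show the same values, the task itself runs inside the
job-specific namespace, and the failing chdir would then have to happen
before the namespace is entered. That is only our guess, though.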
Does anybody have a solution for this?
Best,
Martin
--
Dr. habil. Martin Lambers
Forschung und wissenschaftliche Informationsversorgung
IT.SERVICES
Ruhr-Universität Bochum | 44780 Bochum | Germany
fon: +49 234 32 29941
https://www.it-services.rub.de/