[slurm-dev] Concept of job environment with sbatch or salloc


Thomas Orgis

Jul 28, 2015, 11:27:02 AM
to slurm-dev
Hi,

we are preparing our first real-world deployment of Slurm on a cluster
here and encountered quite a clash between our concept of user and job
environment and that of Slurm. I intend to share our observations and
hope for some discussion that at least makes things more clear for us,
ideally providing some input on how to better make Slurm fit into our
environment.

But let me start by saying that I don't want to bash Slurm with my
criticism. I still expect/hope it to be a better alternative than
continuing with Torque/Maui, which has its own issues. I yell because I
_care_ ;-)

So, here it goes …

Slurm seems to start out with the idea of providing one common shell
environment for login shells and jobs. Therefore, by default it
exports all environment variables to sbatch jobs and does not source
any shell startup files (well, for certain not /etc/profile and
similar, which is the important one).

This may have a rationale for salloc, but for sure not for sbatch: You
schedule the job to run at some unspecified time in the future on some
other machine similar to the one you are submitting from, but not
necessarily that similar. It could even be a totally different machine in a
setup with distinct front end nodes.

For us, the obvious point where this falls down is that certain
elements of the user environment are not valid for the batch job
anymore, case in point being a session-dependent TMPDIR for scratch
work. We create and delete such a temporary directory for each login
session and we create and delete such a temporary directory for each
batch job in prolog/epilog. Actually, we create two: One local to each
node and a global one on the parallel filesystem. There are
applications that insist on heavy I/O that is best kept on a local
disk, no matter how good your parallel storage is.

Another point is that there simply is silly information that for sure
does not belong in the job environment, exported by default:

declare -x SSH_CLIENT="xxx.xxx.xxx.xxx ppppp pp"
declare -x SSH_CONNECTION="xxx.xxx.xxx.xxx pppp yyy.yyy.yyy.yyy pp"
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="rxvt"

The basic issue we have is that it is fragile to have the behaviour of
the job depend so heavily on the current environment at the time of
calling sbatch by default. A job script should be repeatable,
computations even reproducible for someone else. So all necessary
environment should either be global to site or set up inside the job
script (via environment modules).
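
To make the difference concrete, here is a small sketch that needs no Slurm at all. It uses `env -i` as a stand-in for a job that starts without the submitting shell's variables; this is only an analogy for --export=NONE, and the SSH_TTY value is made up for illustration.

```shell
# Pretend this is the interactive login session that calls sbatch.
SSH_TTY=/dev/pts/0
export SSH_TTY

# A child that inherits everything (comparable to the --export=ALL default).
inherited=$(/bin/sh -c 'echo "${SSH_TTY:-unset}"')

# A child started with an emptied environment (env -i is only an analogy).
clean=$(env -i /bin/sh -c 'echo "${SSH_TTY:-unset}"')

echo "inherited: $inherited"
echo "clean:     $clean"
```

The first child still sees the login session's SSH_TTY, the second does not. Anything a job needs then has to come from the site setup or from the job script itself.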

Long story short: For predictable behaviour of batch jobs, we urge our
users to use

#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

in the batch scripts and load environment modules from there as in a
fresh shell session.

Note that, while the sh profile is sourced in this case by SLURM and
exported variables carried over into the job, this still falls short of
providing the equivalent of

. /etc/profile

(or . ~/.profile)

in the job script as functions are not exported. In the case of wanting
to use environment modules, one still has to source a script that gives
the "module" command. Then you could just do the above and be done with
it anyway, as with traditional batch systems. So we actually need this
boilerplate:

#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
. /some/script/giving/module-command.sh

module load …
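
The remark that functions are not exported can be checked in any POSIX shell. In this minimal sketch, SITE_MODULES_DIR and site_module are hypothetical stand-ins for what a profile script might define; they are not real environment-modules names.

```shell
# Hypothetical stand-ins for what a profile script might define.
SITE_MODULES_DIR=/opt/modules              # a variable, exported below
export SITE_MODULES_DIR
site_module() { echo "site_module $*"; }   # a shell function, never exported

# A separately started shell sees the exported variable ...
child_var=$(/bin/sh -c 'echo "${SITE_MODULES_DIR:-unset}"')
# ... but not the function defined in the parent.
child_func=$(/bin/sh -c 'command -v site_module >/dev/null 2>&1 && echo yes || echo no')

echo "variable in child: $child_var"
echo "function in child: $child_func"
```

This is why carrying over variables alone does not provide a working "module" command: the script that defines the function has to be sourced inside the job script itself.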

Btw., the hack with unsetting SLURM_EXPORT_ENV seems to be rather
common (seen it on user documentation of other HPC sites). In our case,
it is necessary together with --export=NONE because otherwise, mpirun
from Intel MPI is badly broken (and in strange ways). We want to export
stuff from the job script to the MPI process(es), just not from the
initial interactive shell …

So, the whole concept of handling environment variables in Slurm seems
rather broken to me, by design. It may come from good intentions of
making it simpler for users, not having to think about separate
environments of interactive jobs and batch jobs. But this is wrong:
There _are_ separate environments (separate hardware!) and one has to
keep that in mind or be ready for nasty surprises. We don't want to
surprise our users (only with unexpected scientific discoveries).

From our experience (in our team we have decades of experience on
using/managing HPC systems), we expect one of two cases:

1. Batch job environment is empty, save for some variables set by the
batch system (job id, node list, etc.).

2. Batch job environment is equal to a fresh login session. You write
down the exact commands you would run when preparing your work
interactively.

Slurm wants to offer something different, perhaps more flexible and
helpful, but ultimately unpredictable and harmful. When a user has
issues with the default of --export=ALL, I cannot rely on the shell
profile or the job script, I have to know whatever was set in the
user's environment when running sbatch and cannot quickly reproduce
things by running the job script again.

Slurm adds to the confusion with its version of interactive jobs. With
Torque, you get that via

qsub -I

and you wait for your allocation and then you get an interactive shell on
the first reserved node just like your PBS script would run on that
node. There is a subtle difference in that you are inside the shell
that sourced the profile here and not in a called shell script as in
the non-interactive batch job, so that functions are not inherited from
profile. But that difference can be understood without much head-hurt.

Now,

salloc

is a totally different beast. The environment is taken from the current
session and you in fact stay on the current machine, as if you simply
continued in your shell (but again, shell functions are missing). But your
job reservation is on a different set of machines. But here, I somehow
do see the point of carrying over the current environment from the
running session, as the job is defined to run while that session lasts.

But it has to be understood to be quite different from the case of
running a normal batch script.

Coming back to the sourcing of /etc/profile (and equivalent) for batch
jobs. Slurm does that in a segregated shell instance and then only
extracts the variables, employing funny
not-likely-but-also-not-guaranteed-not-to-collide marker strings to
help the parser (right?). Perhaps someone can explain the idea behind
_that_ to me sometime, but there is one issue that looks like a plain
oversight:

Slurm has the SLURM_* variables set for the prolog script, but they are
not present in the stage that parses the profile script. As a result,
we now have a hack in place that looks into /proc to see whether the
current process is a child of slurmd or slurmstepd. Without it, the
profile script would think that this is a normal login session and go
ahead and create a temporary directory for it, which is not the
correct thing for batch jobs: they need directories on each node plus a
global one, all depending on the SLURM_JOBID.
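
A minimal sketch of such a /proc check, assuming the Linux layout of /proc/PID/status with its "Name:" and "PPid:" fields. The function name has_ancestor is made up here, and this is not necessarily the check used at the site described.

```shell
# has_ancestor NAME [PID]
# Succeeds if the process PID (default: the current shell) or one of its
# ancestors has the command name NAME according to /proc.
has_ancestor() {
    wanted=$1
    pid=${2:-$$}
    while [ -r "/proc/$pid/status" ]; do
        name=$(sed -n 's/^Name:[[:space:]]*//p' "/proc/$pid/status")
        [ "$name" = "$wanted" ] && return 0
        # Move one step up the process tree; PPid is 0 at the top.
        pid=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$pid/status")
    done
    return 1
}

# Example use inside a profile script:
if has_ancestor slurmstepd || has_ancestor slurmd; then
    echo "started by Slurm: skip the login-session TMPDIR setup"
else
    echo "ordinary login session"
fi
```

The walk stops once no readable status file is left for the next parent, so it terminates at the top of the process tree or at the edge of a PID namespace.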

And since we still don't have access to SLURM_JOBID, the profile script
cannot put the paths to the temporary directories (which are properly
created by prolog and removed by epilog) into the environment. We have
to tell our users to source a script that does that, which is not that
much of a burden since we have to source a script for getting the
environment module function already, also killing the (for us?)
poisonous SLURM_EXPORT_ENV variable, if the recommended --export=NONE
was used.
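
A sketch of what such a sourced script could look like. The function name, the directory layout and the LOCAL_TMPDIR/GLOBAL_TMPDIR variable names are assumptions for illustration only; the directories are assumed to have been created by the prolog already.

```shell
# Hypothetical site script, meant to be sourced from a job script.
# Directory layout and variable names are illustrative assumptions.
setup_job_env() {
    # Only act inside a batch job, where Slurm has set the job ID.
    [ -n "${SLURM_JOBID:-}" ] || return 0

    # Let variables set in the job script reach processes started from it.
    unset SLURM_EXPORT_ENV

    # Directories assumed to have been created by the prolog.
    LOCAL_TMPDIR="/local/scratch/job-$SLURM_JOBID"
    GLOBAL_TMPDIR="/parallel/scratch/job-$SLURM_JOBID"
    TMPDIR=$LOCAL_TMPDIR
    export LOCAL_TMPDIR GLOBAL_TMPDIR TMPDIR
}
setup_job_env
```

Outside a job the function does nothing, so sourcing the script from an interactive shell is harmless.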

But still: Is there a special reasoning behind not having the SLURM_*
vars defined at profile time?

In closing, I also miss being able to see the output from prolog and
epilog scripts in the job output logs. That way, we were able to put a
warning message in there if the prolog notices some condition that
might cause trouble, aiding debugging later on, or a message after job
completion detailing the resource usage (mainly for quickly seeing if
the job did actually use the allocated CPUs properly, as we had a lot
of jobs set up in inefficient ways in the past).

Now, I guess one has to manually code something to produce per-job batch
system logs and point the users there via documentation. Hm, actually,
I'm not sure right now if the output of prolog/epilog is going
_anywhere_. I presume I will be able to dig that out of slurmd logs,
but not the user.
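
One way to hand-roll such per-job logs is a small helper called from the prolog and epilog scripts. This is only a sketch: JOB_LOG_DIR, its default path and the file naming scheme are assumptions, not Slurm settings.

```shell
# job_log MESSAGE...
# Append a timestamped line to a per-job log file. JOB_LOG_DIR and the file
# naming scheme are assumptions for this sketch, not Slurm settings.
job_log() {
    log_dir=${JOB_LOG_DIR:-/var/log/slurm-jobs}
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" \
        >> "$log_dir/job-${SLURM_JOB_ID:-unknown}.log"
}
```

A prolog could then call, for example, `job_log "prolog: local scratch disk is nearly full"`, and the user documentation would point to the directory where these files end up.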


All in all, I hope we managed to get an environment set up now that the
users can cope with.*

But during the debugging sessions to get the shell environment working,
our surprise and head shaking grew more and more pronounced as we
discovered multiple instances of well-intentioned "smartness" that, in
our case, only got in the way. In our thinking, the resource manager /
job scheduler should not mess that much with the job script
environment. Just give the job the resource information and the support
for running parallel jobs**, do not try to guess which set of
environment variables is good for the job. Are we wrong?


Alrighty then,

Thomas

* Currently reinstalling Slurm to see if a slightly newer version
(from 14.11.5 to 14.11.8) doesn't have that bug where slurmd seems to
freeze and even not respond to a request to restart it … randomly
downing quite a number of nodes after some time. Bug report will
follow if that persists.

** We settled on not using libpmi of any kind, rather just
--bootstrap=slurm for Intel MPI and --with-slurm for OpenMPI to make
resource communication and spawning of processes work. With Intel
MPI, trying to use slurm's libpmi.so results in srun hanging and not
getting work done. It works with a properly built OpenMPI, but we
settled on always using mpirun for consistency. What are we missing
when not using PMI?

--
Dr. Thomas Orgis
Universität Hamburg
RRZ / Zentrale Dienste / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270

Christopher Samuel

Jul 29, 2015, 2:03:52 AM
to slurm-dev

Hi Thomas,

On 29/07/15 01:26, Thomas Orgis wrote:

> Slurm seems to start out with the idea of providing one common shell
> environment for login shells and jobs. Therefore, by default it
> exports all environment variables to sbatch jobs and does not source
> any shell startup files (well, for certain not /etc/profile and
> similar, which is the important one).

Just a quick reply to hopefully guide you, but what we do is:

1) To stop environment variables being exported we have this in our
slurm.conf file:

PropagateResourceLimits=NONE

2) To make sure that bash sources a profile script we have this in
taskprolog.sh:

echo export BASH_ENV=/etc/profile.d/module.sh

3) If you wish to set TMPDIR you can use the same trick in taskprolog.sh:

# Set environment variable with location of scratch storage for the job
#echo export TMPDIR=/scratch/merri/jobs/$SLURM_JOB_ID

Note that is commented out as we now use a spank plugin to use kernel
namespaces to map a job specific scratch directory over /tmp.

We needed to patch it so that it would work on our diskless clusters
with a global scratch filesystem, our version is here:

https://github.com/vlsci/spank-private-tmp

It's derived from this code from UMU:

https://github.com/hpc2n/spank-private-tmp

You'll still need a slurmdepilog.sh script to clean those directories
up at the end of the job though.

> Slurm adds to the confusion with its version of interactive jobs.

We use this tiny script for interactive jobs (called sinteractive):

#!/bin/bash
exec srun $* --pty -u ${SHELL} -i -l

> The basic issue we have is that it is fragile to have the behaviour of
> the job depend so heavily on the current environment at the time of
> calling sbatch by default. A job script should be repeatable,
> computations even reproducible for someone else. So all necessary
> environment should either be global to site or set up inside the job
> script (via environment modules).

Completely agree - though this might be because we too come from a
Torque/Moab background. :-)

> In closing, I also miss being able to see the output from prolog and
> epilog scripts in the job output logs.

You can write into those, for instance our BlueGene/Q system
has the following in the taskprolog.sh:

# Add a banner to the job standard output
echo "print Job $SLURM_JOB_ID started at" `date`
echo "print Scratch directory /scratch/avoca/$SLURM_JOB_ID has been allocated"
echo "print $SLURM_BG_NUM_NODES Blue Gene/Q compute nodes have been allocated"
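
These TaskProlog tricks rely on how Slurm treats the program's standard output: as described for TaskProlog in the slurm.conf documentation, lines of the form "export NAME=value" set environment variables for the task, and lines starting with "print" are written to the task's standard output. A minimal sketch that emits both kinds of directive, with a hypothetical scratch path and a made-up function name:

```shell
# Emit TaskProlog directives on standard output. The scratch path is a
# hypothetical example; in a real TaskProlog, Slurm provides SLURM_JOB_ID.
task_prolog() {
    scratch="/scratch/example/jobs/${SLURM_JOB_ID:?SLURM_JOB_ID is not set}"
    # Directive that sets TMPDIR in the task's environment.
    echo "export TMPDIR=$scratch"
    # Directive whose text ends up in the task's standard output.
    echo "print Job $SLURM_JOB_ID uses scratch directory $scratch"
}

# Outside Slurm, the directives can simply be inspected:
( SLURM_JOB_ID=12345; task_prolog )
```

Run outside Slurm like this, the function just prints the two directive lines, which makes the script easy to check before it is deployed.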


Hope these help!

All the best,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

Thomas Orgis

Jul 29, 2015, 2:41:52 AM
to slurm-dev
Am Tue, 28 Jul 2015 23:03:54 -0700
schrieb Christopher Samuel <sam...@unimelb.edu.au>:

> 1) To stop environment variables being exported we have this in our
> slurm.conf file:
>
> PropagateResourceLimits=NONE

Does that influence environment variables, too? I read that it is for
propagating resource limits from the submit host. But still, thanks for
pointing that out, as we intended to limit resources on the submit
machines where multiple users work, without limiting the jobs, where I
already had to raise limits in the slurmd systemd service file. Seems
like we need

PropagateResourceLimits=NONE

for separating those setups.

> 2) To make sure that bash sources a profile script we have this in
> taskprolog.sh:
>
> echo export BASH_ENV=/etc/profile.d/module.sh

> has the following in the taskprolog.sh:
>
> # Add a banner to the job standard output
> echo "print Job $SLURM_JOB_ID started at" `date`

Aha:

TaskProlog
TaskEpilog

So there is a chance to print something the user can see. But: All
prints from multiple tasks in a job get intermixed in one log file,
right? Well, just like the outputs from multiple nodes.

The thing about BASH_ENV is that we do not intend to enforce usage of a
certain shell. We support "some bourne shell" and "some C-ish shell",
notably ksh. Perhaps there are similar ways with other shells, but I
guess not with all.

Nevertheless, interesting suggestion for systems that indeed just use
bash.

> 3) If you wish to set TMPDIR you can use the same trick in taskprolog.sh:
>
> # Set environment variable with location of scratch storage for the job
> #echo export TMPDIR=/scratch/merri/jobs/$SLURM_JOB_ID

I even see now that there is TmpFS for slurm.conf, too, but as there is
some complexity involved (user being able to decide if to use local
and/or global TMPDIRs prepared for her), we still roll our own.

> Note that is commented out as we now use a spank plugin to use kernel
> namespaces to map a job specific scratch directory over /tmp.

Now that's somewhat nifty and somewhat scary;-)

> > Slurm adds to the confusion with its version of interactive jobs.
>
> We use this tiny script for interactive jobs (called sinteractive):
>
> #!/bin/bash
> exec srun $* --pty -u ${SHELL} -i -l

Another fine idea … just start a login shell. We might offer such a
thing, too. But I have a suggestion: Rather use

#!/bin/bash
exec srun "$@" --pty -u ${SHELL} -i -l

, or do you intend to have quoted arguments mangled? It might not
matter in practice for srun, as you rarely need to quote something in its
arguments, but I have it deeply ingrained to always use "$@" for the
argument list.
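
The difference between the two forms can be seen without srun. This small sketch counts the arguments that arrive at the wrapped command, assuming the default IFS; the function names are made up for the demonstration.

```shell
count_args() { echo "$#"; }        # prints how many arguments it received

pass_star() { count_args $*; }     # unquoted $*: arguments are split again
pass_at()   { count_args "$@"; }   # "$@": each argument is kept intact

star=$(pass_star --job-name "my job")
at=$(pass_at --job-name "my job")

printf 'unquoted $*: %s arguments\n' "$star"
printf 'quoted "$@": %s arguments\n' "$at"
```

With unquoted $* the job name is split at the space into two words, so the wrapped command sees three arguments instead of two.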


Alrighty then,

Thomas

Fitzpatrick, Ben

Jul 29, 2015, 4:19:52 AM
to slurm-dev
Hi Thomas,

At our site, we recommend that sbatch scripts start with:

#!/bin/bash -l

which launches a login shell (users could use /bin/ksh -l, etc, if they really want).

This sources the /etc/profile on the compute node (note: we don't allow multi-node jobs).
Sourcing the /etc/profile runs into this code:

if [[ -n "${SLURM_JOB_USER:-}" && -n "${SLURM_JOB_ID:-}" ]]; then
    export TMPDIR="/tmp/$SLURM_JOB_USER-$SLURM_JOB_ID"
    mkdir -p "$TMPDIR"
fi

This directory and any contents are then removed by the 'Epilog'. This works pretty well
for us. You could easily modify your compute node /etc/profile to unset everything you don't want.
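
A sketch of the matching cleanup step, assuming the Epilog sees the same SLURM_JOB_USER and SLURM_JOB_ID values as the profile code above. The function name cleanup_job_tmp is made up, and the actual Epilog used at this site is not shown in the thread.

```shell
# cleanup_job_tmp [BASE]
# Remove the per-job directory created by the profile snippet above.
# BASE defaults to /tmp to match that snippet.
cleanup_job_tmp() {
    base=${1:-/tmp}
    # Do nothing unless both identifiers are known.
    [ -n "${SLURM_JOB_USER:-}" ] && [ -n "${SLURM_JOB_ID:-}" ] || return 0
    rm -rf -- "${base}/${SLURM_JOB_USER}-${SLURM_JOB_ID}"
}
```

The guard matters: with either variable empty, the path would collapse to something like "/tmp/-" or worse, so the function prefers to leave things alone.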

The taskprolog has the downside that it only applies to 'srun'. Batch scripts with one or
two processes (quite common for us) wouldn't get any TMPDIR settings from it.

However, a combination of the taskprolog TMPDIR and an /etc/profile TMPDIR might work fine
for multi-node jobs.

Cheers,

Ben

Thomas Orgis

Jul 29, 2015, 5:18:54 AM
to slurm-dev
Am Wed, 29 Jul 2015 01:19:54 -0700
schrieb "Fitzpatrick, Ben" <ben.fit...@metoffice.gov.uk>:

> At our site, we recommend that sbatch scripts start with:
>
> #!/bin/bash -l

Another simple solution, yes. We might ponder that indeed.

> This sources the /etc/profile on the compute node (note: we don't allow multi-node jobs).

But this still means that it is sourced twice, which may have side
effects. Or, if --export=ALL is used, it is indeed sourced once only,
but all the variables from the login session that ran sbatch are
present. You have to write the profile script(s) with that in mind, it
might be confused by that.
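
One common way to write a profile fragment "with that in mind" is to make it idempotent. The following sketch is a general shell idiom, not something taken from any site's actual /etc/profile, and the directory /opt/site/bin is hypothetical.

```shell
# Write a hypothetical profile fragment to a temporary file.
profile_fragment=$(mktemp)
cat > "$profile_fragment" <<'EOF'
# Prepend a site directory to PATH only if it is not there yet.
case ":$PATH:" in
    *":/opt/site/bin:"*) ;;              # already present: do nothing
    *) PATH="/opt/site/bin:$PATH" ;;
esac
EOF

# Source it twice, as can happen when a login shell is started inside a job.
. "$profile_fragment"
. "$profile_fragment"

# Count how often the directory now appears in PATH.
occurrences=$(printf '%s\n' "$PATH" | tr ':' '\n' | grep -c -x '/opt/site/bin')
echo "/opt/site/bin appears $occurrences time(s) in PATH"
rm -f "$profile_fragment"
```

The same guard also covers the --export=ALL case, where PATH arrives from the submitting login session with the directory already in it.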

I see that folks found their ways around the behaviour of Slurm … now
I'm only missing an explanation for why Slurm tries so hard to mess
with the job environment so that people feel the need to work around it …


Alrighty then,

Thomas

Christopher Samuel

Jul 29, 2015, 8:00:00 PM
to slurm-dev

On 29/07/15 16:41, Thomas Orgis wrote:

> Am Tue, 28 Jul 2015 23:03:54 -0700
> schrieb Christopher Samuel <sam...@unimelb.edu.au>:
>
>> 1) To stop environment variables being exported we have this in our
>> slurm.conf file:
>>
>> PropagateResourceLimits=NONE
>
> Does that influence environment variables, too?

Argh, insufficient caffeination! You are completely right, that was
brain fail at the end of a long day - ignore that please!

[...]
> So there is a chance to print something the user can see. But: All
> prints from multiple tasks in a job get intermixed in one log file,
> right? Well, just like the outputs from multiple nodes.

Hmm, I guess we can't see that on our BlueGene/Q system as the batch
script is started on a dedicated launch node and the (cross compiled)
executables are started directly on the (non-Linux) compute nodes so it
can only ever run once.

> The thing about BASH_ENV is that we do not intend to enforce usage of a
> certain shell. We support "some bourne shell" and "some C-ish shell",
> notably ksh. Perhaps there are similar ways with other shells, but I
> guess not with all.

Thankfully our breakdown of users is:

597 bash
7 tcsh
2 zsh

and (IIRC) tcsh will source .cshrc on startup for a non-login shell.

I didn't even realise we had zsh users before now. :-) Looking at the
manual page it appears that zsh uses an ENV variable to declare a file
to source after profiles, but it's not clear if that is honoured for
non-login zsh's.

> Nevertheless, interesting suggestion for systems that indeed just use
> bash.

Pleasure!

>> 3) If you wish to set TMPDIR you can use the same trick in taskprolog.sh:
>>
>> # Set environment variable with location of scratch storage for the job
>> #echo export TMPDIR=/scratch/merri/jobs/$SLURM_JOB_ID
>
> I even see now that there is TmpFS for slurm.conf, too, but as there is
> some complexity involved (user being able to decide if to use local
> and/or global TMPDIRs prepared for her), we still roll our own.

Yeah, that's what we were doing too on x86 before the spank plugin, and
still do on BG/Q.

>> Note that is commented out as we now use a spank plugin to use kernel
>> namespaces to map a job specific scratch directory over /tmp.
>
> Now that's somewhat nifty and somewhat scary;-)

It's great as there are many applications that have no idea about
honouring $TMPDIR/$TMP.

>>> Slurm adds to the confusion with its version of interactive jobs.
>>
>> We use this tiny script for interactive jobs (called sinteractive):
>>
>> #!/bin/bash
>> exec srun $* --pty -u ${SHELL} -i -l
>
> Another fine idea … just start a login shell. We might offer such a
> thing, too. But I have a suggestion: Rather use
>
> #!/bin/bash
> exec srun "$@" --pty -u ${SHELL} -i -l
>
> , or do you intend to have quoted arguments mangled? It might not
> matter in practice for srun, as you rarely need to quote something in its
> arguments, but I have it deeply ingrained to always use "$@" for the
> argument list.

We've not hit issues *yet* with it, but it's a good call and we'll look
into that, thanks!

Best of luck,