[slurm-users] Custom Gres for SSD


Shunran Zhang

Jul 23, 2023, 11:48:49 PM
to slurm...@lists.schedmd.com
Hi all,

I am attempting to set up a GRES to manage jobs that need
scratch space, but only a few of our compute nodes are
equipped with SSDs for that purpose. Originally I set up a separate
partition for those IO-bound jobs, but jobs in that partition could
still be allocated to the same node and end up fighting each other
for IO.

Looking over the other settings, the GRES mechanism seems promising.
However, I am having trouble figuring out how to restrict access to
the scratch space to jobs that actually requested --gres=ssd:1.

For now I am using Flags=CountOnly and trusting users who use the SSD
to request it, but any job that lands on a node with an SSD can
simply use the space anyway. Our scratch space is two disks (sda and
sdb) in RAID 0, formatted as btrfs. What should I do to enforce a
limit on which jobs can use this space?

Related configurations for ref:
gres.conf:
NodeName=scratch-1 Name=ssd Flags=CountOnly
cgroup.conf:
ConstrainDevices=yes
slurm.conf:
GresTypes=gpu,ssd
NodeName=scratch-1 CPUs=88 Sockets=2 CoresPerSocket=22 ThreadsPerCore=2  RealMemory=180000 Gres=ssd:1 State=UNKNOWN
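
Jobs that need the scratch space are then submitted with something
along these lines (io_job.sh is just a placeholder):

sbatch --gres=ssd:1 io_job.sh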

Sincerely,
S. Zhang

Matthias Loose

Jul 24, 2023, 3:51:29 AM
to slurm...@lists.schedmd.com
Hi Shunran,

we do something very similar. I have nodes with 2 SSDs in a RAID 1
mounted on /local. We defined a GRES resource just like you did and
called it local. We define the resource in gres.conf like this:

# LOCAL
NodeName=hpc-node[01-10] Name=local

and add the resource, in counts of GB, to the slurm.nodes.conf:

NodeName=hpc-node01 CPUs=256 RealMemory=... Gres=local:3370
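
A job then requests its scratch space in the same units, for example
(the 100 and job.sh are just placeholders):

sbatch --gres=local:100 job.sh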

So in this case node01 has 3370 counts, i.e. GB, of the gres "local"
available for reservation. Now Slurm tracks that resource for you and
users can reserve counts of /local space. But there is still one big
problem: Slurm has no idea what "local" actually is and, as you
correctly noted, other jobs can just use it. I solved this the
following way:

- /local is owned by root, so no user can just write to it
- the node prolog creates a folder in /local of the form
/local/job_<SLURM_JOB_ID> and makes the job the owner of it
- the node epilog deletes that folder

This way you have already solved the problem of people/jobs using
/local without having reserved it. But there is still no enforcement
of limits. For that I use quotas.
My /local is XFS formatted, and XFS has a nifty feature called
project quotas, where you can set a quota on a folder.

This is my node prolog script for this purpose:

#!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

local_dir="/local"
local_job=0

## DETERMINE GRES:LOCAL
# get the job's gres list
JOB_TRES=$(scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" | cut -d '=' -f 2 | tr ',' ' ')

# parse for the "local" gres and extract its count
for gres in ${JOB_TRES}; do
    key=$(echo ${gres} | cut -d ':' -f 2 | tr '[:upper:]' '[:lower:]')
    if [[ ${key} == "local" ]]; then
        local_job=$(echo ${gres} | cut -d ':' -f 3)
        break
    fi
done

# make the job's local-dir if requested
if [[ ${local_job} -ne 0 ]]; then
    # make local-dir for job
    SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"
    mkdir ${SLURM_TMPDIR}

    # convert the requested GB to KB
    local_job=$((local_job * 1024 * 1024))

    # set hard limit to requested size + 5%
    hard_limit=$((local_job * 105 / 100))

    # create project quota and set limits
    xfs_quota -x -c "project -s -p ${SLURM_TMPDIR} ${SLURM_JOBID}" ${local_dir}
    xfs_quota -x -c "limit -p bsoft=${local_job}k bhard=${hard_limit}k ${SLURM_JOBID}" ${local_dir}

    chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
    chmod 750 ${SLURM_TMPDIR}
fi

exit 0

This is my epilog:

#!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

local_dir="/local"
SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

# remove the quota
xfs_quota -x -c "limit -p bsoft=0m bhard=0m ${SLURM_JOBID}" ${local_dir}

# remove the folder
if [[ -d ${SLURM_TMPDIR} ]]; then
    rm -rf --one-file-system ${SLURM_TMPDIR}
fi

exit 0

In order to use project quotas you need to enable them with the
pquota mount option in the fstab.
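The corresponding fstab entry looks something like this (the device
name here is only an example, use whatever your /local sits on):

/dev/md0  /local  xfs  defaults,pquota  0  0
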
I give the user 5% more than they requested, so you just have to make
sure that the count you configure in the nodes.conf is the available
space minus 5%.

This is what we do and it works great.

Kind regards, Matt

Matthias Loose

Jul 24, 2023, 4:07:52 AM
to Slurm User Community List
On 2023-07-24 09:50, Matthias Loose wrote:

Hi Shunran,

I just read your question again. If you don't want users to share the
SSD at all, even if both jobs have requested it, you can basically
skip the quota part of my answer.

If you really only want one user per SSD per node, set the gres count
in the node configuration to 1, just like you did, and then implement
the prolog/epilog solution (without quotas). If the mounted SSD can
only be written to by root, no one else can use it, and the job that
requested it gets a folder created by the prolog.
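
A stripped-down prolog for that case could look roughly like this
(untested sketch derived from the script I posted earlier; adjust the
grep pattern to however your Slurm version reports TresPerNode):

#!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

local_dir="/local"
SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

# only create the scratch dir if the job actually requested the gres
if scontrol show JobID=${SLURM_JOBID} | grep "TresPerNode=" | grep -q "local"; then
    mkdir ${SLURM_TMPDIR}
    chown ${SLURM_JOB_USER}:0 ${SLURM_TMPDIR}
    chmod 750 ${SLURM_TMPDIR}
fi

exit 0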

What we also do is export the folder name to the environment in the
user/task prolog, so the user can easily use it.

Our task prolog:

#!/bin/bash
#PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

local_dir="/local"

SLURM_TMPDIR="${local_dir}/job_${SLURM_JOBID}"

# check for the /local job dir
if [[ -d ${SLURM_TMPDIR} ]]; then
    # set tempdir env vars
    echo "export SLURM_TMPDIR=${SLURM_TMPDIR}"
    echo "export TMPDIR=${SLURM_TMPDIR}"
    echo "export JAVA_TOOL_OPTIONS=\"-Djava.io.tmpdir=${SLURM_TMPDIR}\""
fi
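
On the user side, a job script can then just rely on those variables,
roughly like this (gres name, size and file names are only examples):

#!/bin/bash
#SBATCH --gres=local:100

# SLURM_TMPDIR / TMPDIR are set by the task prolog above
cd ${SLURM_TMPDIR}
cp /shared/input.dat .           # stage input onto the fast local disk
my_program input.dat             # temporary files land on the SSD via TMPDIR
cp results.dat /shared/results/  # copy results back before the epilog cleans up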

Kind regards, Matt

Shunran Zhang

Jul 24, 2023, 4:26:27 AM
to Matthias Loose, slurm...@lists.schedmd.com
Hi Matthias,

Thank you for your info. The prolog/epilog way of managing it does look
quite promising.

Indeed, in my setup I only want one job per node per SSD set. Our
tasks that require the scratch space are IO-bound - we are more
worried about IO usage than actual disk space usage, which is why we
only define an ssd count of 1 per 2-disk RAID 0. For those IO-bound
operations, even if each job only uses 5% of the available disk
space, the IO on the disk would become the bottleneck, resulting in
both jobs running 2x slower and processes stuck in D state, which is
what I am trying to prevent. Also, as those IO-bound jobs are usually
submitted in a batch by a single user, a user-based approach would
not be adequate either.

I am considering modifying your script so that by default the scratch
space is world-writable but everyone except root has a quota of 0,
and the prolog lifts that quota. This way, if a user forgets to
specify --gres=ssd:1, the job fails with an IO error and they
immediately know what went wrong.

I am also thinking of a GPU-like cgroup-based solution. Maybe if I
limit device access to, say, /dev/sda, it would also stop the user
from accessing the mount point of /dev/sda - I am not sure, so I will
test this approach out as well...
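
If I go down that route, I suppose the gres.conf entry would need to
point at the actual device file so that ConstrainDevices has
something to restrict - a rough, untested guess (and sdb would need
the same treatment):

NodeName=scratch-1 Name=ssd File=/dev/sda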

I will investigate it a little more.

Sincerely,

S. Zhang