[slurm-users] Using "srun" on compute nodes -- Ray cluster

Kamil Wilczek

Jul 15, 2022, 5:17:59 AM
to slurm...@lists.schedmd.com
Dear Slurm Users,

one of my cluster users would like to run a Ray cluster on Slurm.
I noticed that the batch script example requires running the "srun"
command from a compute node that is already allocated:
https://docs.ray.io/en/latest/cluster/examples/slurm-template.html#slurm-template
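
For context, the template boils down to roughly this pattern (my
simplified sketch of the linked example; the script name and the
resource numbers are just placeholders):

#!/bin/bash
#SBATCH --job-name=ray-cluster
#SBATCH --nodes=3
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4

# All srun calls below run from inside the already-granted allocation,
# i.e. from the compute node that executes this batch script.
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379

# Start the Ray head on the first allocated node (in the background).
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port --block &
sleep 10

# Start a Ray worker on each remaining node of the same allocation.
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
    srun --nodes=1 --ntasks=1 -w "${nodes_array[$i]}" \
        ray start --address="$head_node_ip:$port" --block &
    sleep 5
done

# Finally, run the user's driver script, which connects to the Ray head.
python -u my_ray_script.py   # placeholder name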

This is the first time I have seen or heard of this type of usage,
and I am having trouble wrapping my head around it.
Is there anything wrong or unusual about it? I understand that
this would allocate some resources on other nodes. Would
Slurm enforce limits properly (QOS or partition limits)?

Kind Regards
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]

Ryan Novosielski

Jul 15, 2022, 12:22:43 PM
to slurm...@lists.schedmd.com
Are you talking about a script run via sbatch that contains srun
command lines? If so, there are a lot of reasons to do that. One is
better instrumentation, as I understand it, but srun --mpi is also a way
to eliminate mpiexec/mpirun/etc., and it is what we recommend at our site
(using the PMI2 or PMIx methods).
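
For example, instead of launching with mpirun inside the batch script,
a job here would look roughly like this (a minimal sketch; the
application name and resource counts are placeholders):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# srun acts as the MPI launcher, so no mpiexec/mpirun is needed.
# --mpi=pmix assumes Slurm was built with PMIx support; use --mpi=pmi2
# otherwise ("srun --mpi=list" shows what is available).
srun --mpi=pmix ./my_mpi_app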

On 7/15/22 05:17, Kamil Wilczek wrote:
> Dear Slurm Users,
>
> one of my cluster users would like to run a Ray cluster on Slurm.
> I noticed that the batch script example requires running the "srun"
> command from a compute node that is already allocated:
> https://docs.ray.io/en/latest/cluster/examples/slurm-template.html#slurm-template
>
> This is the first time I have seen or heard of this type of usage,
> and I am having trouble wrapping my head around it.
> Is there anything wrong or unusual about it? I understand that
> this would allocate some resources on other nodes. Would
> Slurm enforce limits properly (QOS or partition limits)?
>
> Kind Regards

--
#BlackLivesMatter
____
|| \\UTGERS, |----------------------*O*------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark
`'

Reed Dier

Jul 15, 2022, 3:24:21 PM
to Slurm User Community List
I have some users who are running Ray on Slurm.
I will preface this by saying we are new Slurm users, so we may not be doing everything exactly right.

The only issue we have come across so far was something somewhat Ray-specific.
Specifically (and pardon the lack of detail, the Ray user I worked on this with is on vacation at the moment), there was an environment variable that needed to be unset so that Ray wouldn't kneecap itself when it hit a cpuset corner case in cgroup fencing.

In this workload, the user spawns a "ray head", and it is important to mention that this head may not have the same resources allocated to it as the "ray workers".
TL;DR: the ray head would be given fewer CPUs than the worker(s), and in some corner cases a spawned worker PID would inherit the smaller cpuset through an environment variable passed along by the ray head, which launches the workers via srun.

The user noticed that some workers were able to reach 100% utilization of their allocated CPU resources, while other workers running identical workloads ended up at partial usage, which we discovered was due to the cpuset being inherited in a way we didn't intend.
I'll have to follow up with the name of the environment variable we had to unset when that user is back.

But here is my quick and dirty bash script that shows the CPUs allocated to the cgroup and the PIDs inside the cgroup; these should match, but didn't always, which was our discovery.
Just pass it the UID of the user submitting the jobs.

#!/bin/bash
# Pass the UID of the user submitting the jobs as the first argument.
# Note: $UID is read-only in bash, so use a different variable name for it.
TARGET_UID=$1

for JOB in $(ls /sys/fs/cgroup/cpuset/slurm/uid_$TARGET_UID/ | grep job | awk -F'_' '{print $2}' | xargs)
    do
        echo "Slurm JobID: $JOB"
        # CPUs assigned to the job's cgroup as a whole.
        echo -n "Cgroup CPU set: "
        cat /sys/fs/cgroup/cpuset/slurm/uid_$TARGET_UID/job_$JOB/cpuset.cpus

        # CPUs each PID in step_0 is actually allowed to run on -- should match the cgroup.
        for PID in $(cat /sys/fs/cgroup/cpuset/slurm/uid_$TARGET_UID/job_$JOB/step_0/cgroup.procs | xargs)
            do
                echo -n "CPUs allocated for PID $PID: "
                grep Cpus_allowed_list /proc/$PID/status | awk '{print $2}'
            done
        echo ""
    done
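
To collect output like the per-node listing below, run the script on
each compute node with that user's UID, e.g. something like this (the
script name and the UID 1001 are placeholders):

for host in slurmd1 slurmd2 slurmd3; do
    ssh "$host" 'bash check_cpuset.sh 1001'
done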

slurmd3:
    Slurm Job: 409
    Cgroup CPU set: 0-7
    CPUs allocated for PID 7907: 0-7
    CPUs allocated for PID 7912: 0-3
    CPUs allocated for PID 7931: 0-3
slurmd1:
    Slurm Job: 406
    Cgroup CPU set: 0-3
    CPUs allocated for PID 7409: 0-3
    CPUs allocated for PID 7414: 0-3
    CPUs allocated for PID 7425: 0-3
slurmd2:
    Slurm Job: 408
    Cgroup CPU set: 0-7
    CPUs allocated for PID 7491: 0-7
    CPUs allocated for PID 7496: 0-3
    CPUs allocated for PID 7515: 0-3

Otherwise, I've not had issues with users spawning jobs from within jobs, but I'm not a seasoned Slurm admin, so that may not hold up for others.

Reed