On a single Rocky 8 workstation with one GPU, we wanted interactive ssh
logins to get a small portion of its resources (shell, compiling, simple
data manipulation, console desktop, etc.) and the rest to go to SLURM.
We did this:
- Set it to use cgroupv2
* modify /etc/default/grub to add systemd.unified_cgroup_hierarchy=1
to GRUB_CMDLINE_LINUX. Regenerate the grub config with grub2-mkconfig
* create file /usr/etc/cgroup_cpuset_init with the lines
#!/bin/bash
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/system.slice/cgroup.subtree_control
* Modify/create /etc/systemd/system/slurmd.service.d/override.conf
so it has:
[Service]
ExecStartPre=-/usr/etc/cgroup_cpuset_init
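A quick way to confirm the cgroup v2 steps took effect after the reboot is sketched below. This is not part of the original setup; the helper names has_controller and check_cgroup_setup and the CGROOT variable are made up for illustration. It only assumes that a unified hierarchy reports the filesystem type cgroup2fs and that enabled controllers are listed in cgroup.subtree_control.

```shell
#!/bin/bash
# Sketch only: helper names and CGROOT are illustrative, not from the post.
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# Succeeds if controller $2 is listed in the subtree_control file $1.
has_controller() {
    [ -r "$1" ] && grep -qw -- "$2" "$1"
}

check_cgroup_setup() {
    # On a unified (v2) hierarchy the filesystem type is cgroup2fs.
    fstype=$(stat -fc %T "$CGROOT" 2>/dev/null || true)
    echo "cgroup filesystem type: ${fstype:-unknown}"
    # These are the two files the init script above writes "+cpuset" to.
    for f in "$CGROOT/cgroup.subtree_control" \
             "$CGROOT/system.slice/cgroup.subtree_control"; do
        if has_controller "$f" cpuset; then
            echo "cpuset enabled in $f"
        else
            echo "cpuset NOT enabled in $f"
        fi
    done
}

check_cgroup_setup
```

If the second file reports cpuset as not enabled, slurmd has probably not been started yet, since the init script only runs as its ExecStartPre.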
- Figure out the exact cores to reserve for "free user" use and the cores
for SLURM. Also use GPU sharding in SLURM so the GPU can be shared.
* install hwloc (the package that provides hwloc-ls)
* run 'hwloc-ls' to translate physical cores 0-9 to logical cores.
For me P 0-9 was Logical 0,2,4,6,8,10,12,14,16,18
* in /etc/slurm.conf the NodeName definition has
CPUs=128 Boards=1 SocketsPerBoard=1 CoresPerSocket=64 ThreadsPerCore=2 \
RealMemory=257267 MemSpecLimit=20480 \
CpuSpecList=0,2,4,6,8,10,12,14,16,18 \
TmpDisk=6000000 Gres=gpu:nvidia_a2:1,shard:nvidia_a2:32
reserving those 10 cores and 20GB of RAM for "free user"
* gres.conf has the lines:
AutoDetect=nvml
Name=shard Count=32
* Need to add shard to GresTypes= too (e.g. GresTypes=gpu,shard). Job
submissions use the option --gres=shard:N where N is less than 32
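The physical-to-logical translation used for CpuSpecList can be scripted rather than read off by eye. The sketch below assumes hwloc-ls prints its processing units as "PU L#n (P#m)", with L# the logical index and P# the physical one. The function name logical_for_physical is made up here, so treat this as a starting point and check it against your own hwloc-ls output.

```shell
#!/bin/bash
# Sketch only: logical_for_physical is an illustrative helper name.
# Reads hwloc-ls text output on stdin and prints, comma-separated, the
# logical PU indices whose physical index lies in the range [$1, $2].
logical_for_physical() {
    sed -n 's/.*PU L#\([0-9][0-9]*\) (P#\([0-9][0-9]*\)).*/\1 \2/p' |
        awk -v lo="$1" -v hi="$2" '$2 >= lo && $2 <= hi { print $1 }' |
        sort -n | paste -sd, -
}

# Only run against the real topology if hwloc-ls is installed.
if command -v hwloc-ls >/dev/null 2>&1; then
    hwloc-ls | logical_for_physical 0 9
fi
```

On the machine described above this would be expected to print the same list that went into CpuSpecList. A job would then ask for part of the GPU with something like sbatch --gres=shard:8 job.sh, where job.sh is a placeholder script name.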
- Set up systemd to restrict "free users" to cores 0-9 and the 20GB
* Run: systemctl set-property user.slice MemoryHigh=20480M
* Run for every individual user on the system
systemctl set-property user-$uid.slice AllowedCPUs=0-9
where $uid is that user's user ID. We do this in a script
that also runs sacctmgr to add them to the SLURM system.
I could not just set this once for user.slice itself, which is what I
first tried, because that also restricted the root user and
caused weird behavior with a lot of system tools. So far the
root/daemon processes work fine within the 20GB limit, though, so
MemoryHigh=20480M on user.slice is one and done.
Then reboot.
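The per-user step could look roughly like the sketch below. It is not our actual script: the function name restrict_user and the FREE_CPUS and SYSTEMCTL variables are made up for illustration, and the sacctmgr call is left out because its account arguments are site-specific.

```shell
#!/bin/bash
# Sketch only: restrict_user, FREE_CPUS and SYSTEMCTL are illustrative names.
FREE_CPUS="${FREE_CPUS:-0-9}"
# Overridable so the command can be dry-run, e.g. SYSTEMCTL=echo
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Pin one user's slice to the "free user" cores.
restrict_user() {
    uid=$(id -u "$1") || return 1
    "$SYSTEMCTL" set-property "user-${uid}.slice" "AllowedCPUs=${FREE_CPUS}"
    # A site script would also run sacctmgr here to add the user to SLURM.
}

# Apply to every user name given on the command line.
for u in "$@"; do
    restrict_user "$u"
done
```

Running it with SYSTEMCTL=echo prints the systemctl arguments without changing anything. The result can be checked afterwards with systemctl show -p AllowedCPUs user-$uid.slice and systemctl show -p MemoryHigh user.slice.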
-- Paul Raines (http://help.nmr.mgh.harvard.edu)