[slurm-users] slurmstepd: error: task_g_set_affinity: Operation not permitted


Christopher Harrop - NOAA Affiliate via slurm-users

Jun 13, 2024, 11:10:10 AM
to slurm...@lists.schedmd.com
Hi, I am building a containerized Slurm cluster with Ubuntu 20.04 and have it almost working.

The daemons start, and an “sinfo” command shows compute nodes up and available:

admin@slurmfrontend:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmpar*    up   infinite      3   idle slurmnode[1-3]
admin@slurmfrontend:~$ 

However, if I try to use “srun” to test a job submission it fails saying it could not execve the job:

admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error
admin@slurmfrontend:~$

If I go to the slurmnode1 container where the job should run, and look at the slurmd log, all I see is this:

admin@slurmnode1:/$ sudo cat /var/log/slurmd.log 
[2024-06-13T14:58:36.238] CPU frequency setting not configured for this node
[2024-06-13T14:58:36.239] warning: Core limit is only 0 KB
[2024-06-13T14:58:36.239] slurmd version 23.11.7 started
[2024-06-13T14:58:36.243] slurmd started on Thu, 13 Jun 2024 14:58:36 +0000
[2024-06-13T14:58:36.243] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=47926 TmpDisk=59767 Uptime=71713 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-06-13T14:58:47.230] launch task StepId=1.0 request from UID:1000 GID:1000 HOST:172.20.0.2 PORT:50618
[2024-06-13T14:58:47.230] task/affinity: lllp_distribution: JobId=1 implicit auto binding: sockets,one_thread, dist 8192
[2024-06-13T14:58:47.230] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2024-06-13T14:58:47.230] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [1]: mask_cpu,one_thread, 0x01
[2024-06-13T14:58:47.243] [1.0] error: task_g_set_affinity: Operation not permitted
[2024-06-13T14:58:47.243] [1.0] error: _exec_wait_child_wait_for_parent: failed: No error
[2024-06-13T14:58:47.244] [1.0] error: job_manager: exiting abnormally: Slurmd could not execve job
[2024-06-13T14:58:47.247] [1.0] stepd_cleanup: done with step (rc[0xfb4]:Slurmd could not execve job, cleanup_rc[0xfb4]:Slurmd could not execve job)
admin@slurmnode1:/$ 

I’ve installed by following the instructions for building and installing the Debian packages, and I can see that all the daemons are up and running.

I have this slurm.conf (on all nodes):

admin@slurmfrontend:~$ grep -v '#' /etc/slurm/slurm.conf 
ClusterName=cluster
SlurmctldHost=slurmmaster
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmdParameters=config_overrides
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=slurmnode[1-3] CPUs=8 State=UNKNOWN
PartitionName=slurmpar Nodes=ALL Default=YES MaxTime=INFINITE State=UP
admin@slurmfrontend:~$

And I have this cgroup.conf (on all nodes):

admin@slurmfrontend:~$ grep -v '#' /etc/slurm/cgroup.conf 
CgroupPlugin=cgroup/v1

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
admin@slurmfrontend:~$

Does anyone have any clues about where to look for why “srun” can’t run a job and where the "task_g_set_affinity: Operation not permitted” may be coming from?

Chris
---------------------------------------------------------------------------------------------------
Christopher W. Harrop                                                voice: (720) 649-0316
NOAA Global Systems Laboratory, R/GSL6                  fax: (303) 497-7259                 
325 Broadway                                                 
Boulder, CO 80303

Christopher Harrop - NOAA Affiliate via slurm-users

Jun 13, 2024, 11:53:25 AM
to slurm...@lists.schedmd.com
There is a permission problem somewhere, but I don’t know where.

Running as a regular user fails, but running as root works:

admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error

admin@slurmfrontend:~$ sudo srun hostname
slurmnode1

admin@slurmfrontend:~$ sudo srun -N 3 hostname
slurmnode1
slurmnode3
slurmnode2
admin@slurmfrontend:~$
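To check whether this is an affinity permission problem in the container itself, independent of Slurm, something like the following could be run as the non-root user inside a compute-node container. This is a hypothetical check I'm sketching, not anything shipped with Slurm; it exercises the same sched_setaffinity(2) syscall that the task/affinity plugin uses:

```python
import os

def check_affinity():
    """Try to pin this process to CPU 0, the way Slurm's task/affinity
    plugin binds tasks via sched_setaffinity(2), and report whether
    the kernel permits it."""
    try:
        old = os.sched_getaffinity(0)     # remember the current mask
        os.sched_setaffinity(0, {0})      # same syscall the plugin uses
        os.sched_setaffinity(0, old)      # restore the original mask
        return "ok"
    except PermissionError:
        return "EPERM"                    # -> "Operation not permitted"

print(check_affinity())
```

If this prints "EPERM" inside the container, the problem is the container's privileges rather than the Slurm configuration.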

Christopher Harrop via slurm-users

Jun 14, 2024, 10:22:11 AM
to slurm...@lists.schedmd.com
I believe I have solved this. I changed the configuration to replace:

TaskPlugin=task/affinity

with:

TaskPlugin=task/none

In my case, the login node, the head node, and all of the compute nodes each run in their own container, and Docker Compose starts them all to form a containerized Slurm cluster on a single physical host. My guess is that the containers are not permitted to change CPU affinity, so the "TaskPlugin=task/none" setting is required when running this way.
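For anyone who would rather keep "TaskPlugin=task/affinity", an untested alternative might be to grant the compute-node containers the privilege to set affinity in the compose file. A sketch, with made-up service and image names, and without having verified that CAP_SYS_NICE is actually sufficient here:

```yaml
# Hypothetical compose fragment -- service and image names are illustrative.
services:
  slurmnode1:
    image: my-slurm-node:ubuntu20.04
    hostname: slurmnode1
    # Untested alternative to TaskPlugin=task/none: allow the container
    # to set CPU affinity so the task/affinity plugin can work.
    cap_add:
      - SYS_NICE
    # or, more bluntly:
    # privileged: true
```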

If anyone has any other recommendations, please let me know.

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com