[slurm-users] slurmstepd: error: task_g_set_affinity: Operation not permitted


Christopher Harrop - NOAA Affiliate via slurm-users

Jun 13, 2024, 11:10:10 AM
to slurm...@lists.schedmd.com
Hi, I am building a containerized Slurm cluster with Ubuntu 20.04 and have it almost working.

The daemons start, and an “sinfo” command shows compute nodes up and available:

admin@slurmfrontend:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmpar*    up   infinite      3   idle slurmnode[1-3]
admin@slurmfrontend:~$ 

However, if I try to use “srun” to test a job submission it fails saying it could not execve the job:

admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error
admin@slurmfrontend:~$

If I go to the slurmnode1 container where the job should run, and look at the slurmd log, all I see is this:

admin@slurmnode1:/$ sudo cat /var/log/slurmd.log 
[2024-06-13T14:58:36.238] CPU frequency setting not configured for this node
[2024-06-13T14:58:36.239] warning: Core limit is only 0 KB
[2024-06-13T14:58:36.239] slurmd version 23.11.7 started
[2024-06-13T14:58:36.243] slurmd started on Thu, 13 Jun 2024 14:58:36 +0000
[2024-06-13T14:58:36.243] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=47926 TmpDisk=59767 Uptime=71713 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-06-13T14:58:47.230] launch task StepId=1.0 request from UID:1000 GID:1000 HOST:172.20.0.2 PORT:50618
[2024-06-13T14:58:47.230] task/affinity: lllp_distribution: JobId=1 implicit auto binding: sockets,one_thread, dist 8192
[2024-06-13T14:58:47.230] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2024-06-13T14:58:47.230] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [1]: mask_cpu,one_thread, 0x01
[2024-06-13T14:58:47.243] [1.0] error: task_g_set_affinity: Operation not permitted
[2024-06-13T14:58:47.243] [1.0] error: _exec_wait_child_wait_for_parent: failed: No error
[2024-06-13T14:58:47.244] [1.0] error: job_manager: exiting abnormally: Slurmd could not execve job
[2024-06-13T14:58:47.247] [1.0] stepd_cleanup: done with step (rc[0xfb4]:Slurmd could not execve job, cleanup_rc[0xfb4]:Slurmd could not execve job)
admin@slurmnode1:/$ 

I’ve installed by following the instructions for building and installing the Debian packages, and I can see that all the daemons are up and running.

I have this slurm.conf (on all nodes):

admin@slurmfrontend:~$ grep -v '#' /etc/slurm/slurm.conf 
ClusterName=cluster
SlurmctldHost=slurmmaster
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmdParameters=config_overrides
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=slurmnode[1-3] CPUs=8 State=UNKNOWN
PartitionName=slurmpar Nodes=ALL Default=YES MaxTime=INFINITE State=UP
admin@slurmfrontend:~$

And I have this cgroup.conf (on all nodes):

admin@slurmfrontend:~$ grep -v '#' /etc/slurm/cgroup.conf 
CgroupPlugin=cgroup/v1

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
admin@slurmfrontend:~$

Does anyone have any clues about where to look for why “srun” can’t run a job and where the "task_g_set_affinity: Operation not permitted” may be coming from?

Chris
---------------------------------------------------------------------------------------------------
Christopher W. Harrop                                                voice: (720) 649-0316
NOAA Global Systems Laboratory, R/GSL6                  fax: (303) 497-7259                 
325 Broadway                                                 
Boulder, CO 80303

Christopher Harrop - NOAA Affiliate via slurm-users

Jun 13, 2024, 11:53:25 AM
to slurm...@lists.schedmd.com
There is a permission problem somewhere, but I don’t know where.

Running as a regular user fails, but running as root works:

admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error

admin@slurmfrontend:~$ sudo srun hostname
slurmnode1

admin@slurmfrontend:~$ sudo srun -N 3 hostname
slurmnode1
slurmnode3
slurmnode2
admin@slurmfrontend:~$
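To check whether this is an affinity permission problem in the container itself, independent of Slurm, something like the following could be run as the non-root user inside a compute-node container. This is a hypothetical check I'm sketching, not anything shipped with Slurm; it exercises the same sched_setaffinity(2) syscall that the task/affinity plugin uses:

```python
import os

def check_affinity():
    """Try to pin this process to CPU 0, the way Slurm's task/affinity
    plugin binds tasks via sched_setaffinity(2), and report whether
    the kernel permits it."""
    try:
        old = os.sched_getaffinity(0)     # remember the current mask
        os.sched_setaffinity(0, {0})      # same syscall the plugin uses
        os.sched_setaffinity(0, old)      # restore the original mask
        return "ok"
    except PermissionError:
        return "EPERM"                    # -> "Operation not permitted"

print(check_affinity())
```

If this prints "EPERM" inside the container, the problem is the container's privileges rather than the Slurm configuration.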

Christopher Harrop via slurm-users

Jun 14, 2024, 10:22:11 AM
to slurm...@lists.schedmd.com
I believe I have solved this. I changed the configuration to replace:

TaskPlugin=task/affinity

with:

TaskPlugin=task/none

In my case, the login node, the head node, and all of the compute nodes each run in their own container, and Docker Compose starts them all to form a containerized Slurm cluster on a single physical host. My guess is that the containers are not permitted to change CPU affinity, so the "TaskPlugin=task/none" setting is required when running this way.
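For anyone who would rather keep "TaskPlugin=task/affinity", an untested alternative might be to grant the compute-node containers the privilege to set affinity in the compose file. A sketch, with made-up service and image names, and without having verified that CAP_SYS_NICE is actually sufficient here:

```yaml
# Hypothetical compose fragment -- service and image names are illustrative.
services:
  slurmnode1:
    image: my-slurm-node:ubuntu20.04
    hostname: slurmnode1
    # Untested alternative to TaskPlugin=task/none: allow the container
    # to set CPU affinity so the task/affinity plugin can work.
    cap_add:
      - SYS_NICE
    # or, more bluntly:
    # privileged: true
```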

If anyone has any other recommendations, please let me know.

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com