[slurm-users] Configless node problems.....

Phill Harvey-Smith

Mar 17, 2023, 1:25:29 PM
to Slurm User Community List
Hi all,

In preparation for deployment on a real-world system, I have been trying
things out on a set of virtual machines arranged as a cluster. One of
the things I am trying to implement is configless nodes.

I currently have my virtual cluster set up as follows:

frontend, frontback - both run slurmctld & slurmdbd
backend - has DNS, user management, shared filesystems and mariadb

exec1-exec3 - local partition nodes
execd1-execd3 - dragon partition nodes
execr1-execr3 - remote partition nodes

The cluster is spread across three physical machines, all linked by a tinc VPN.

This works without problems in the traditional configuration, with
slurm.conf present on every node: users can submit jobs and they are
executed as requested.

All nodes run Rocky Linux 9 with Slurm 22.05.2.


So, to test the configless setup, I have done the following:

1) Added SlurmctldParameters=enable_configless to slurm.conf on frontend
& frontback, then restarted slurmctld with systemctl restart slurmctld.service.
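
To confirm the parameter took effect after the restart, I believe this
should show it on the controller:

[root@frontend ~]# scontrol show config | grep -i SlurmctldParameters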

2) Added the following to the forward lookup (host->ip) zone file of
the DNS server on backend, then restarted it:

_slurmctld._tcp 3600 IN SRV 10 0 6817 frontend
_slurmctld._tcp 3600 IN SRV 0 0 6817 frontback
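
To check that the records are actually visible from a compute node,
something like this should work (I'm assuming the zone is
cluster.local, going by the node names below):

[root@exec1 ~]# dig +short SRV _slurmctld._tcp.cluster.local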

3) Removed /etc/slurm/slurm.conf on exec1 and attempted to restart
slurmd in configless mode. This is where I hit the problem...

Attempting to start slurmd causes it to fail; if I run it in debug mode
I get:

[root@exec1 slurm]# slurmd -D -vv
slurmd: debug: Log file re-opened
slurmd: debug: CPUs:2 Boards:1 Sockets:2 CoresPerSocket:1 ThreadsPerCore:1
slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=1:2(hw) CoresPerSocket=2:1(hw)
slurmd: error: cannot read (null)/user.slice/user-0.slice/session-4.scope/cgroup.controllers: No such file or directory
slurmd: error: Couldn't load specified plugin name for cgroup/v2: Plugin init() callback failed
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
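
For what it's worth, I believe these should show whether the unified
cgroup v2 hierarchy is mounted and which cgroup the shell launching
slurmd sits in (the (null) prefix in the error above looks like slurmd
failing to work out the latter):

[root@exec1 ~]# stat -fc %T /sys/fs/cgroup
[root@exec1 ~]# cat /proc/self/cgroup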


If I replace the slurm.conf, slurmd starts without problems.
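
As an aside, I understand slurmd can also be pointed at the controller
directly with --conf-server rather than relying on the DNS SRV records,
which might help separate a DNS problem from a slurmd one:

[root@exec1 ~]# slurmd -D -vv --conf-server frontend:6817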

Currently in exec1:/etc/slurm I have:

cgroup.conf:

CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainKmemSpace=no #avoid known Kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

plugstack.conf:

required auto_tmpdir.so mount=/tmp mount=/var/tmp


The slurm.conf from the slurmctld machines is below; I've snipped the
commented lines to save space.


# slurm.conf file generated by configurator.html.
ClusterName=cluster
SlurmctldHost=frontend
SlurmctldHost=frontback

SlurmctldParameters=enable_configless
JobSubmitPlugins=lua
MpiDefault=none
PlugStackConfig=/etc/slurm/plugstack.conf
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurmd
SrunPortRange=60001-63000

StateSaveLocation=/usr/local/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity

InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityUsageResetPeriod=NONE
PriorityWeightAge=1000
PriorityWeightFairshare=100000

AccountingStorageEnforce=associations,limits
AccountingStorageType=accounting_storage/slurmdbd
JobCompLoc=/var/log/slurm/joblog.txt
JobCompType=jobcomp/filetext
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log

NodeName=exec1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=exec2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=exec3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN

PartitionName=local Nodes=exec1.cluster.local,exec2.cluster.local,exec3.cluster.local Default=Yes

NodeName=execr1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execr2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execr3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN

PartitionName=remote Nodes=execr1.cluster.local,execr2.cluster.local,execr3.cluster.local Default=no

NodeName=execd1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execd2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execd3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN

PartitionName=dragon Nodes=execd1.cluster.local,execd2.cluster.local,execd3.cluster.local Default=no QOS=part_dragon
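
As a further aside, the "Node reconfigured socket/core boundaries"
warning in the slurmd output above presumably comes from a mismatch
between the Sockets=1 CoresPerSocket=2 definitions here and what the
VMs actually report; slurmd -C prints the layout the daemon detects:

[root@exec1 ~]# slurmd -C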



Anyone have any idea what the problem could be?

Cheers.

Phill.


