[slurm-users] Help debugging Slurm configuration


Jeffrey Layton

Dec 8, 2022, 1:38:29 PM
to slurm...@lists.schedmd.com
Good afternoon,

I have a very simple two node cluster using Warewulf 4.3. I was following some instructions on how to install the OpenHPC Slurm binaries (server and client). I booted the compute node and the Slurm Server says it's in an unknown state. This hasn't happened to me before but I would like to debug the problem.

I checked the services on the Slurm server (head node):

$ systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
     Docs: man:munged(8)
  Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 1182 (munged)
    Tasks: 4 (limit: 48440)
   Memory: 1.2M
   CGroup: /system.slice/munge.service
           └─1182 /usr/sbin/munged

Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.

$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
 Main PID: 1518 (slurmctld)
    Tasks: 10
   Memory: 23.0M
   CGroup: /system.slice/slurmctld.service
           ├─1518 /usr/sbin/slurmctld -D -s
           └─1555 slurmctld: slurmscriptd

Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>



I then booted the compute node and checked the services there:

systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
     Docs: man:munged(8)
  Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 804 (munged)
    Tasks: 4 (limit: 26213)
   Memory: 940.0K
   CGroup: /system.slice/munge.service
           └─804 /usr/sbin/munged

Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.

systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
  Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 897 (code=exited, status=1/FAILURE)

Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.

I restarted slurmd and checked again:

# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
 Main PID: 996 (slurmd)
    Tasks: 2
   Memory: 1012.0K
   CGroup: /system.slice/slurmd.service
           ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
           └─997 /usr/sbin/slurmd -D -s --conf-server localhost

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.




On the Slurm server I checked the queue and ran "sinfo -a", and found the following:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 1-00:00:00      1   unk* n0001



After a few moments (less than a minute, maybe 20-30 seconds), slurmd on the compute node fails. When I checked the service, I saw this:

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
  Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 996 (code=exited, status=1/FAILURE)

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.


Below are the logs from the Slurm server for today (I rebooted the compute node twice):

[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
[2022-12-08T13:12:17.343] error: Configured MailProg is invalid
[2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
[2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
[2022-12-08T13:12:17.374] Recovered state of 1 nodes
[2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
[2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
[2022-12-08T13:12:17.374] Recovered information about 2 jobs
[2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.375] Recovered state of 0 reservations
[2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
[2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.376] Running as primary controller
[2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
[2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
[2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-12-08T13:17:17.940] error: Nodes n0001 not responding
[2022-12-08T13:22:17.533] error: Nodes n0001 not responding
[2022-12-08T13:27:17.048] error: Nodes n0001 not responding

There are no logs on the compute node.

Any suggestions where to start looking? I think I'm seeing the trees and not the forest :)

Thanks!

Jeff

P.S. Here are some relevant parts of the slurm.conf on the server:


# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...




Here are some relevant parts of the slurm.conf on the client node:




# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...







Glen MacLachlan

Dec 8, 2022, 1:59:54 PM
to Slurm User Community List

What does running this on the compute node show? (It looks at the journal log for the past 12 hours.)
journalctl -S -12h -o verbose | grep slurm


You may want to increase your debug verbosity to debug5 https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug while tracking down this issue.
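For example, a minimal sketch of the slurm.conf lines involved (set both daemons to the highest level, then turn them back down once the issue is found):

# temporary debugging verbosity
SlurmctldDebug=debug5
SlurmdDebug=debug5
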
You should also address this error to fix logging:
[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied

by making a directory /var/log/slurm and making the slurm user the owner on both the controller and compute node. Then update your slurm.conf file like this:
# LOGGING
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

and then run 'scontrol reconfigure'.
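A rough sketch of those steps (assuming the slurm user and group already exist on both nodes):

# on both the controller and the compute node
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm

# then, on the controller, after editing slurm.conf
scontrol reconfigure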

Kind Regards,
Glen

==========================================
Glen MacLachlan, PhD
Lead High Performance Computing Engineer  
Research Technology Services
The George Washington University
44983 Knoll Square
Enterprise Hall, 328L
Ashburn, VA 20147
==========================================





Glen MacLachlan

Dec 8, 2022, 2:28:59 PM
to Slurm User Community List
One other thing to address is that SlurmctldHost should point to the controller node where slurmctld is running, the name of which I would expect Warewulf to put into /etc/hosts.
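For example, a minimal sketch using a hypothetical head-node hostname 'head' (use whatever name actually resolves to your controller), the same value in slurm.conf on both nodes:

SlurmctldHost=head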

Jeffrey Layton

Dec 8, 2022, 3:04:28 PM
to mac...@gwu.edu, Slurm User Community List
Thanks Glen!

I changed the slurm.conf logging to "debug5" on both the server and the client.

I also created /var/log/slurm on both the client and server and chown-ed to slurm:slurm.

On the server I did "scontrol reconfigure".

Then I rebooted the compute node. When I logged in, slurmd was not running. I ran 'systemctl start slurmd'. It stayed up for about 5 seconds and then stopped.


# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 19:51:58 UTC; 2min 33s ago
  Process: 1299 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1299 (code=exited, status=1/FAILURE)

Dec 08 19:51:49 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.


Here is the output from grepping through journalctl:


    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    MESSAGE=Operator of unix-process:911:7771 successfully authenticated as unix-user:root to gain
 ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1.24
 [systemctl start slurmd] (owned by unix-user:laytonjb)
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    MESSAGE=Operator of unix-process:1254:240421 successfully authenticated as unix-user:root to g
ain ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1
.47 [systemctl start slurmd] (owned by unix-user:laytonjb)
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.


These don't look too useful even with debug5 on.

Any thoughts?

Thanks!

Jeff

Jeffrey Layton

Dec 8, 2022, 3:07:09 PM
to mac...@gwu.edu, Slurm User Community List
localhost is the controller name :)

I can change it though if needed (I was lazy when I did the initial installation).

Thanks!

Jeff

Glen MacLachlan

Dec 8, 2022, 3:08:28 PM
to Jeffrey Layton, Slurm User Community List
Hi, 

Try starting the slurmd daemon on the compute node interactively with the command below and share any output.
/usr/sbin/slurmd -D -vvv

Glen MacLachlan

Dec 8, 2022, 3:13:22 PM
to Jeffrey Layton, Slurm User Community List
Then try using the IP of the controller node as explained here https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldAddr or here https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost

Also, if you look at the first few lines of /etc/hosts (just above the line that reads ### ALL ENTRIES BELOW THIS LINE WILL BE OVERWRITTEN BY WAREWULF ###), you should see a hostname for the head node and its IP address. If you don't set this correctly, the slurmd daemon won't know how to reach the slurmctld daemon.
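As a sketch of what that pairing might look like (the hostname 'head' and the address 10.0.0.1 are placeholders, not values from your setup):

# /etc/hosts on the compute node, above the Warewulf-managed section
10.0.0.1    head

# slurm.conf, optionally giving the address explicitly in parentheses
SlurmctldHost=head(10.0.0.1)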