[slurm-users] Help debugging Slurm configuration


Jeffrey Layton

Dec 8, 2022, 1:38:29 PM
to slurm...@lists.schedmd.com
Good afternoon,

I have a very simple two node cluster using Warewulf 4.3. I was following some instructions on how to install the OpenHPC Slurm binaries (server and client). I booted the compute node and the Slurm Server says it's in an unknown state. This hasn't happened to me before but I would like to debug the problem.

I checked the services on the Slurm server (head node):

$ systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:10 EST; 4min 42s ago
     Docs: man:munged(8)
  Process: 1140 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 1182 (munged)
    Tasks: 4 (limit: 48440)
   Memory: 1.2M
   CGroup: /system.slice/munge.service
           └─1182 /usr/sbin/munged

Dec 08 13:12:10 localhost.localdomain systemd[1]: Starting MUNGE authentication service...
Dec 08 13:12:10 localhost.localdomain systemd[1]: Started MUNGE authentication service.

$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 13:12:17 EST; 4min 56s ago
 Main PID: 1518 (slurmctld)
    Tasks: 10
   Memory: 23.0M
   CGroup: /system.slice/slurmctld.service
           ├─1518 /usr/sbin/slurmctld -D -s
           └─1555 slurmctld: slurmscriptd

Dec 08 13:12:17 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: No parameter for mcs plugin, de>
Dec 08 13:12:17 localhost.localdomain slurmctld[1518]: slurmctld: mcs: MCSParameters = (null). on>
Dec 08 13:13:17 localhost.localdomain slurmctld[1518]: slurmctld: SchedulerParameters=default_que>



I then booted the compute node and checked the services there:

systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:14:53 UTC; 3min 24s ago
     Docs: man:munged(8)
  Process: 786 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 804 (munged)
    Tasks: 4 (limit: 26213)
   Memory: 940.0K
   CGroup: /system.slice/munge.service
           └─804 /usr/sbin/munged

Dec 08 18:14:53 n0001 systemd[1]: Starting MUNGE authentication service...
Dec 08 18:14:53 n0001 systemd[1]: Started MUNGE authentication service.

systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:15:53 UTC; 2min 40s ago
  Process: 897 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 897 (code=exited, status=1/FAILURE)

Dec 08 18:15:44 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:15:53 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.

I restarted slurmd and checked again:

# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-12-08 18:19:04 UTC; 5s ago
 Main PID: 996 (slurmd)
    Tasks: 2
   Memory: 1012.0K
   CGroup: /system.slice/slurmd.service
           ├─996 /usr/sbin/slurmd -D -s --conf-server localhost
           └─997 /usr/sbin/slurmd -D -s --conf-server localhost

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.




On the Slurm server I checked the queue and ran "sinfo -a", and found the following:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 1-00:00:00      1   unk* n0001



After a few moments (less than a minute, maybe 20-30 seconds), slurmd on the compute node fails. When I checked the service, I saw this:

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 18:19:13 UTC; 10min ago
  Process: 996 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 996 (code=exited, status=1/FAILURE)

Dec 08 18:19:04 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 18:19:13 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.


Below are the logs from the Slurm server for today (I rebooted the compute node twice):

[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied
[2022-12-08T13:12:17.343] error: Configured MailProg is invalid
[2022-12-08T13:12:17.347] slurmctld version 22.05.2 started on cluster cluster
[2022-12-08T13:12:17.371] No memory enforcing mechanism configured.
[2022-12-08T13:12:17.374] Recovered state of 1 nodes
[2022-12-08T13:12:17.374] Recovered JobId=3 Assoc=0
[2022-12-08T13:12:17.374] Recovered JobId=4 Assoc=0
[2022-12-08T13:12:17.374] Recovered information about 2 jobs
[2022-12-08T13:12:17.375] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.375] Recovered state of 0 reservations
[2022-12-08T13:12:17.375] read_slurm_conf: backup_controller not specified
[2022-12-08T13:12:17.376] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2022-12-08T13:12:17.376] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2022-12-08T13:12:17.376] Running as primary controller
[2022-12-08T13:12:17.376] No parameter for mcs plugin, default values set
[2022-12-08T13:12:17.376] mcs: MCSParameters = (null). ondemand set.
[2022-12-08T13:13:17.471] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2022-12-08T13:17:17.940] error: Nodes n0001 not responding
[2022-12-08T13:22:17.533] error: Nodes n0001 not responding
[2022-12-08T13:27:17.048] error: Nodes n0001 not responding

There are no logs on the compute node.

Any suggestions where to start looking? I think I'm seeing the trees and not the forest :)

Thanks!

Jeff

P.S. Here are some relevant parts of the slurm.conf on the server:


# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...




Here are some relevant parts of the slurm.conf on the client node:




# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=localhost
#SlurmctldHost=
...







Glen MacLachlan

Dec 8, 2022, 1:59:54 PM
to Slurm User Community List

What does running this on the compute node show? (It looks at the journal log for the past 12 hours.)
journalctl -S -12h -o verbose | grep slurm


You may want to increase your debug verbosity to debug5 https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdDebug while tracking down this issue.
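For example, a minimal sketch of the slurm.conf lines involved (set both daemons to the highest level, then turn them back down once the issue is found):

# temporary debugging verbosity
SlurmctldDebug=debug5
SlurmdDebug=debug5
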
You should also address this error to fix logging:
[2022-12-08T13:12:17.343] error: chdir(/var/log): Permission denied

by making a directory /var/log/slurm and making the slurm user the owner on both the controller and compute node. Then update your slurm.conf file like this:
# LOGGING
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

and then run 'scontrol reconfigure'.
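A rough sketch of those steps (assuming the slurm user and group already exist on both nodes):

# on both the controller and the compute node
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm

# then, on the controller, after editing slurm.conf
scontrol reconfigure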

Kind Regards,
Glen

==========================================
Glen MacLachlan, PhD
Lead High Performance Computing Engineer  
Research Technology Services
The George Washington University
44983 Knoll Square
Enterprise Hall, 328L
Ashburn, VA 20147
==========================================





Glen MacLachlan

Dec 8, 2022, 2:28:59 PM
to Slurm User Community List
One other thing to address is that SlurmctldHost should point to the controller node where slurmctld is running, the name of which I would expect Warewulf to put into /etc/hosts.
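For example, a minimal sketch using a hypothetical head-node hostname 'head' (use whatever name actually resolves to your controller), the same value in slurm.conf on both nodes:

SlurmctldHost=head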

Jeffrey Layton

Dec 8, 2022, 3:04:28 PM
to mac...@gwu.edu, Slurm User Community List
Thanks Glen!

I changed the slurm.conf logging to "debug5" on both the server and the client.

I also created /var/log/slurm on both the client and server and chown-ed to slurm:slurm.

On the server I did "scontrol reconfigure".

Then I rebooted the compute node. When I logged in, slurmd was not running. I ran 'systemctl start slurmd'. It stayed up for about 5 seconds and then stopped.


# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-12-08 19:51:58 UTC; 2min 33s ago
  Process: 1299 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1299 (code=exited, status=1/FAILURE)

Dec 08 19:51:49 n0001 systemd[1]: Started Slurm node daemon.
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAIL>
Dec 08 19:51:58 n0001 systemd[1]: slurmd.service: Failed with result 'exit-code'.


Here is the output from grepping through journalctl:


    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    MESSAGE=Operator of unix-process:911:7771 successfully authenticated as unix-user:root to gain
 ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1.24
 [systemctl start slurmd] (owned by unix-user:laytonjb)
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    MESSAGE=Operator of unix-process:1254:240421 successfully authenticated as unix-user:root to g
ain ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1
.47 [systemctl start slurmd] (owned by unix-user:laytonjb)
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.
    UNIT=slurmd.service
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    SYSLOG_IDENTIFIER=slurmd
    _COMM=slurmd
    MESSAGE=error: slurmd initialization failed
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Main process exited, code=exited, status=1/FAILURE
    UNIT=slurmd.service
    MESSAGE=slurmd.service: Failed with result 'exit-code'.


These don't look too useful even with debug5 on.

Any thoughts?

Thanks!

Jeff

Jeffrey Layton

Dec 8, 2022, 3:07:09 PM
to mac...@gwu.edu, Slurm User Community List
localhost is the controller name :)

I can change it though if needed (I was lazy when I did the initial installation).

Thanks!

Jeff

Glen MacLachlan

Dec 8, 2022, 3:08:28 PM
to Jeffrey Layton, Slurm User Community List
Hi, 

Try starting the slurmd daemon on the compute node interactively with the command below and share any output.
/usr/sbin/slurmd -D -vvv

Glen MacLachlan

Dec 8, 2022, 3:13:22 PM
to Jeffrey Layton, Slurm User Community List
Then try using the IP of the controller node as explained here https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldAddr or here https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmctldHost

Also, if you look at the first few lines of /etc/hosts (just above the line that reads ### ALL ENTRIES BELOW THIS LINE WILL BE OVERWRITTEN BY WAREWULF ###), you should see a hostname for the head node and its IP address. If you don't set this correctly, the slurmd daemon won't know how to reach the slurmctld daemon.
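As a sketch of what that pairing might look like (the hostname 'head' and the address 10.0.0.1 are placeholders, not values from your setup):

# /etc/hosts on the compute node, above the Warewulf-managed section
10.0.0.1    head

# slurm.conf, optionally giving the address explicitly in parentheses
SlurmctldHost=head(10.0.0.1)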