Hi,
This might be a newbie question, since I'm new to Slurm.
I'm trying to restart the slurmd service on one of our Ubuntu boxes.
The slurmd.service is defined by:
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
The service starts without issue (systemctl start slurmd.service).
However, when checking the status of the service, I see a warning message, but nothing alarming:
~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago
Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 2713021 (slurmd)
Tasks: 1 (limit: 134845)
Memory: 1.9M
CGroup: /system.slice/slurmd.service
└─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...
Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe>
Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.
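As an aside, I believe the "Can't open PID file" warning is harmless: systemd checks the PID file before slurmd has finished writing it. If it bothers you, the unit's PIDFile= can be aligned with whatever SlurmdPidFile is set to in slurm.conf, for example via a drop-in override (this is a sketch assuming the default /run/slurmd.pid; check your slurm.conf first):

```shell
# See which PID file slurmd is configured to write (SlurmdPidFile):
grep -i SlurmdPidFile /etc/slurm/slurm.conf

# If it differs from the unit's PIDFile=, add a drop-in override:
sudo systemctl edit slurmd.service
# ...and in the editor, enter:
#   [Service]
#   PIDFile=/run/slurmd.pid
# then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart slurmd.service
```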
Unfortunately, the node is still seen as down when I issue 'sinfo':
root@ecpsc10:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Compute up infinite 2 idle ecpsc[11-12]
Compute up infinite 1 down ecpsc10
FastCompute* up infinite 2 idle ecpsf[10-11]
When I query this node, I get the following details:
root@ecpsc10:~# scontrol show node ecpsc10
NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11
OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021
RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=Compute
BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01
CfgTRES=cpu=16,mem=40195M,billing=16
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]
From the Reason field, I gather the node was marked down because the machine unexpectedly rebooted.
However, /etc/slurm/slurm.conf contains:
root@ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice
ReturnToService=2
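ReturnToService=2 should mean a DOWN node becomes eligible again once slurmd registers with a valid configuration. As a sanity check (in case the controller is running with a different slurm.conf than this node), one can ask the controller which value it actually has loaded:

```shell
# On the controller, show the value slurmctld is actually using:
scontrol show config | grep -i ReturnToService
```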
So I'm quite puzzled as to why the node will not go back online.
Any help will be greatly appreciated.
Best,
Emmanuel
Thanks for the quick reply.
Check if munge is working properly:
root@ecpsinf01:~# munge -n | ssh ecpsc10 unmunge
Warning: the ECDSA host key for 'ecpsc10' differs from the key for the IP address '128.178.242.136'
Offending key for IP in /root/.ssh/known_hosts:5
Matching host key in /root/.ssh/known_hosts:28
Are you sure you want to continue connecting (yes/no)? yes
STATUS: Success (0)
ENCODE_HOST: ecpsc10 (127.0.1.1)
ENCODE_TIME: 2021-11-16 16:57:56 +0100 (1637078276)
DECODE_TIME: 2021-11-16 16:58:10 +0100 (1637078290)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
Check if SELinux is enforced:
Controller node:
root@ecpsinf01:~# getenforce
-bash: getenforce: command not found
root@ecpsinf01:~# sestatus
-bash: sestatus: command not found
Compute node:
root@ecpsc10:~# getenforce
Command 'getenforce' not found, but can be installed with:
apt install selinux-utils
root@ecpsc10:~# sestatus
Command 'sestatus' not found, but can be installed with:
apt install policycoreutils
Check the Slurm log file:
[2021-11-16T16:19:54.646] debug: Log file re-opened
[2021-11-16T16:19:54.666] Message aggregation disabled
[2021-11-16T16:19:54.666] topology NONE plugin loaded
[2021-11-16T16:19:54.666] route default plugin loaded
[2021-11-16T16:19:54.667] CPU frequency setting not configured for this node
[2021-11-16T16:19:54.667] debug: Resource spec: No specialized cores configured by default on this node
[2021-11-16T16:19:54.667] debug: Resource spec: Reserved system memory limit not configured for this node
[2021-11-16T16:19:54.667] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.667] debug: Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.669] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.670] debug: Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated cores
[2021-11-16T16:19:54.670] debug: task/cgroup/memory: total:112428M allowed:100%(enforced), swap:0%(permissive), max:100%(112428M) max+swap:100%(224856M) min:30M kmem:100%(112428M enforced) min:30M swappiness:0(unset)
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated memory
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated devices
[2021-11-16T16:19:54.670] debug: task/cgroup: loaded
[2021-11-16T16:19:54.671] debug: Munge authentication plugin loaded
[2021-11-16T16:19:54.671] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2021-11-16T16:19:54.671] Munge cryptographic signature plugin loaded
[2021-11-16T16:19:54.673] slurmd version 17.11.12 started
[2021-11-16T16:19:54.673] debug: Job accounting gather cgroup plugin loaded
[2021-11-16T16:19:54.674] debug: job_container none plugin loaded
[2021-11-16T16:19:54.674] debug: switch NONE plugin loaded
[2021-11-16T16:19:54.674] slurmd started on Tue, 16 Nov 2021 16:19:54 +0100
[2021-11-16T16:19:54.675] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=112428 TmpDisk=224253 Uptime=1911799 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-11-16T16:19:54.675] debug: AcctGatherEnergy NONE plugin loaded
[2021-11-16T16:19:54.675] debug: AcctGatherProfile NONE plugin loaded
[2021-11-16T16:19:54.675] debug: AcctGatherInterconnect NONE plugin loaded
[2021-11-16T16:19:54.676] debug: AcctGatherFilesystem NONE plugin loaded
Check if firewalld is enabled:
No, firewalld is not enabled.
> Compute up infinite 1 down ecpsc10
How do you do that?
As per the documentation (https://slurm.schedmd.com/scontrol.html), the resume command applies to a job list, not to nodes.
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Stephen Cousins <steve....@maine.edu>
Reply to: Slurm User Community List <slurm...@lists.schedmd.com>
Date: Tuesday, 16 November 2021 at 19:09
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Unable to start slurmd service
I think you just need to use scontrol to "resume" that node.
> Compute up infinite 1 down ecpsc10
That was it. Thank you very much.
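For the archives, "resume" is applied to a node via 'scontrol update' (as opposed to 'scontrol resume', which takes a job list). The fix was presumably along these lines:

```shell
# Clear the DOWN state on the node (run as root or SlurmUser):
scontrol update NodeName=ecpsc10 State=RESUME

# Verify the node is back:
sinfo --nodes=ecpsc10
```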