Hi,
This might be a newbie question, since I'm new to Slurm.
I'm trying to restart the slurmd service on one of our Ubuntu boxes.
The slurmd.service is defined by:
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
The service starts without issue (systemctl start slurmd.service).
However, when checking the status of the service, I see a warning message, but nothing alarming:
~# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-11-16 15:58:01 CET; 50s ago
Process: 2713019 ExecStart=/usr/sbin/slurmd -d /usr/sbin/slurmstepd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 2713021 (slurmd)
Tasks: 1 (limit: 134845)
Memory: 1.9M
CGroup: /system.slice/slurmd.service
└─2713021 /usr/sbin/slurmd -d /usr/sbin/slurmstepd
Nov 16 15:58:01 ecpsc10 systemd[1]: Starting Slurm node daemon...
Nov 16 15:58:01 ecpsc10 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not pe>
Nov 16 15:58:01 ecpsc10 systemd[1]: Started Slurm node daemon.
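As an aside, I believe the "Can't open PID file" warning is harmless: systemd checks the PID file before slurmd has finished writing it. If it bothers you, the unit's PIDFile= can be aligned with whatever SlurmdPidFile is set to in slurm.conf, for example via a drop-in override (this is a sketch assuming the default /run/slurmd.pid; check your slurm.conf first):

```shell
# See which PID file slurmd is configured to write (SlurmdPidFile):
grep -i SlurmdPidFile /etc/slurm/slurm.conf

# If it differs from the unit's PIDFile=, add a drop-in override:
sudo systemctl edit slurmd.service
# ...and in the editor, enter:
#   [Service]
#   PIDFile=/run/slurmd.pid
# then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart slurmd.service
```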
Unfortunately, the node is still seen as down when I issue 'sinfo':
root@ecpsc10:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
Compute up infinite 2 idle ecpsc[11-12]
Compute up infinite 1 down ecpsc10
FastCompute* up infinite 2 idle ecpsf[10-11]
When I query this node, I get the following details:
root@ecpsc10:~# scontrol show node ecpsc10
NodeName=ecpsc10 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=ecpsc10 NodeHostName=ecpsc10 Version=17.11
OS=Linux 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021
RealMemory=40195 AllocMem=0 FreeMem=4585 Sockets=2 Boards=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=Compute
BootTime=2021-10-25T14:16:35 SlurmdStartTime=2021-11-16T15:58:01
CfgTRES=cpu=16,mem=40195M,billing=16
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]
From the Reason field, I gather the node was marked down because the machine unexpectedly rebooted.
However, /etc/slurm/slurm.conf contains:
root@ecpsc10:~# cat /etc/slurm/slurm.conf | grep -i returntoservice
ReturnToService=2
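ReturnToService=2 should mean a DOWN node becomes eligible again once slurmd registers with a valid configuration. As a sanity check (in case the controller is running with a different slurm.conf than this node), one can ask the controller which value it actually has loaded:

```shell
# On the controller, show the value slurmctld is actually using:
scontrol show config | grep -i ReturnToService
```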
So I'm quite puzzled as to why the node will not go back online.
Any help will be greatly appreciated.
Best,
Emmanuel
Thanks for the quick reply.
Check if munge is working properly:
root@ecpsinf01:~# munge -n | ssh ecpsc10 unmunge
Warning: the ECDSA host key for 'ecpsc10' differs from the key for the IP address '128.178.242.136'
Offending key for IP in /root/.ssh/known_hosts:5
Matching host key in /root/.ssh/known_hosts:28
Are you sure you want to continue connecting (yes/no)? yes
STATUS: Success (0)
ENCODE_HOST: ecpsc10 (127.0.1.1)
ENCODE_TIME: 2021-11-16 16:57:56 +0100 (1637078276)
DECODE_TIME: 2021-11-16 16:58:10 +0100 (1637078290)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
Check if SELinux is enforced:
Controller node:
root@ecpsinf01:~# getenforce
-bash: getenforce: command not found
root@ecpsinf01:~# sestatus
-bash: sestatus: command not found
Compute node:
root@ecpsc10:~# getenforce
Command 'getenforce' not found, but can be installed with:
apt install selinux-utils
root@ecpsc10:~# sestatus
Command 'sestatus' not found, but can be installed with:
apt install policycoreutils
Check the Slurm log file:
[2021-11-16T16:19:54.646] debug: Log file re-opened
[2021-11-16T16:19:54.666] Message aggregation disabled
[2021-11-16T16:19:54.666] topology NONE plugin loaded
[2021-11-16T16:19:54.666] route default plugin loaded
[2021-11-16T16:19:54.667] CPU frequency setting not configured for this node
[2021-11-16T16:19:54.667] debug: Resource spec: No specialized cores configured by default on this node
[2021-11-16T16:19:54.667] debug: Resource spec: Reserved system memory limit not configured for this node
[2021-11-16T16:19:54.667] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.667] debug: Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.669] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2021-11-16T16:19:54.670] debug: Ignoring obsolete CgroupReleaseAgentDir option.
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated cores
[2021-11-16T16:19:54.670] debug: task/cgroup/memory: total:112428M allowed:100%(enforced), swap:0%(permissive), max:100%(112428M) max+swap:100%(224856M) min:30M kmem:100%(112428M enforced) min:30M swappiness:0(unset)
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated memory
[2021-11-16T16:19:54.670] debug: task/cgroup: now constraining jobs allocated devices
[2021-11-16T16:19:54.670] debug: task/cgroup: loaded
[2021-11-16T16:19:54.671] debug: Munge authentication plugin loaded
[2021-11-16T16:19:54.671] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2021-11-16T16:19:54.671] Munge cryptographic signature plugin loaded
[2021-11-16T16:19:54.673] slurmd version 17.11.12 started
[2021-11-16T16:19:54.673] debug: Job accounting gather cgroup plugin loaded
[2021-11-16T16:19:54.674] debug: job_container none plugin loaded
[2021-11-16T16:19:54.674] debug: switch NONE plugin loaded
[2021-11-16T16:19:54.674] slurmd started on Tue, 16 Nov 2021 16:19:54 +0100
[2021-11-16T16:19:54.675] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 Memory=112428 TmpDisk=224253 Uptime=1911799 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-11-16T16:19:54.675] debug: AcctGatherEnergy NONE plugin loaded
[2021-11-16T16:19:54.675] debug: AcctGatherProfile NONE plugin loaded
[2021-11-16T16:19:54.675] debug: AcctGatherInterconnect NONE plugin loaded
[2021-11-16T16:19:54.676] debug: AcctGatherFilesystem NONE plugin loaded
Check if firewalld is enabled:
No, firewalld is not enabled.
> Compute up infinite 1 down ecpsc10
How do you do that?
As per the documentation (https://slurm.schedmd.com/scontrol.html), the resume command applies to a job list, not to nodes.
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Stephen Cousins <steve....@maine.edu>
Reply to: Slurm User Community List <slurm...@lists.schedmd.com>
Date: Tuesday, 16 November 2021 at 19:09
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Unable to start slurmd service
I think you just need to use scontrol to "resume" that node.
> Compute up infinite 1 down ecpsc10
That was it. Thank you very much.
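For the archives, "resume" is applied to a node via 'scontrol update' (as opposed to 'scontrol resume', which takes a job list). The fix was presumably along these lines:

```shell
# Clear the DOWN state on the node (run as root or SlurmUser):
scontrol update NodeName=ecpsc10 State=RESUME

# Verify the node is back:
sinfo --nodes=ecpsc10
```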