[slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

Sven Duscha

Mar 17, 2021, 2:17:28 PM
to slurm...@schedmd.com
Hi,

I am seeing an error with SLURM's slurmctld on Ubuntu 20.04 when starting
the service through systemctl.

I installed munge and SLURM version 19.05.5-1 through the package
manager from the default repository:

apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd


systemctl start slurmctld &
[1] 2735
18:55 [root@slurm ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
    Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
      Docs: man:slurmctld(8)
   Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
(code=exited, status=0/SUCCESS)
     Tasks: 12
    Memory: 2.5M
    CGroup: /system.slice/slurmctld.service
            └─2759 /usr/sbin/slurmctld

Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurmctld.pid (yet?) after start: Operation not permitted


After about 60 seconds slurmctld terminates:


-- A stop job for unit slurmctld.service has finished.
--
-- The job identifier is 1043 and the job result is done.
Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
-- Subject: A start job for unit slurmctld.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has begun execution.
--
-- The job identifier is 1044.
Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


My slurm.conf file lists custom PID file locations for slurmctld and slurmd:
/etc/slurm-llnl/slurm.conf

SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/run/slurm-llnl/slurmd.pid

Starting the slurmctld executable by hand works fine:
/usr/sbin/slurmctld &

pgrep slurmctld
2819
[1]+  Done                    /usr/sbin/slurmctld
pgrep slurmctld
2819
squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
sinfo -lNe
Wed Mar 17 19:01:45 2021
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK
WEIGHT AVAIL_FE REASON
ekgen1         1  cluster*    unknown*   16    2:8:1 480000       
0      1   (null) none
ekgen2         1  cluster*       down*   16    2:8:1 250000       
0      1   (null) Not responding
ekgen3         1    debian    unknown*   16    2:8:1 250000       
0      1   (null) none
ekgen4         1  cluster*    unknown*   16    2:8:1 250000       
0      1   (null) none
ekgen5         1  cluster*    unknown*   16    2:8:1 250000       
0      1   (null) none
ekgen6         1    debian    unknown*   16    2:8:1 250000       
0      1   (null) none
ekgen7         1  cluster*    unknown*   16    2:8:1 250000       
0      1   (null) none
ekgen8         1    debian       down*   16    2:8:1 250000       
0      1   (null) Not responding
ekgen9         1  cluster*    unknown*   16    2:8:1 192000       
0      1   (null) none

I then tried modifying /lib/systemd/system/slurmd.service:

cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig

and changed
PIDFile=/run/slurmd.pid
to
PIDFile=/run/slurm-llnl/slurmd.pid

systemctl start slurmctld &
[1] 1869
pgrep slurm
1875
squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

after ca. 60 seconds:

Job for slurmctld.service failed because a timeout was exceeded.
See "systemctl status slurmctld.service" and "journalctl -xe" for details


- Subject: A start job for unit packagekit.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit packagekit.service has finished successfully.
--
-- The job identifier is 586.
Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit slurmctld.service has entered the 'failed' state with result
'timeout'.
Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
-- Subject: A start job for unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has finished with a failure.
--
-- The job identifier is 511 and the job result is failed.
Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
-- Subject: A start job for unit slurmctld.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit slurmctld.service has begun execution.
--
-- The job identifier is 662.
Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support

mkdir /run/slurm-lnll/
chown slurm: /run/slurm-lnll/

ls -lthrd /run/slurm-lnll/
drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/

It doesn't create the PID file

ls -lthr /run/slurm-lnll/
total 0


A work-around, writing the PID manually to the PID file, does work:

systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > \
/run/slurm-lnll/slurmctld.pid && chown slurm: \
/run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid


The status still reports a problem:

systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
    Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
      Docs: man:slurmctld(8)
   Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
(code=exited, status=0/SUCCESS)
  Main PID: 2287 (slurmctld)
     Tasks: 7
    Memory: 2.3M
    CGroup: /system.slice/slurmctld.service
            └─2287 /usr/sbin/slurmctld

Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.


But the slurmctld process doesn't crash anymore. Stopping the service
does work:


systemctl stop slurmctld.service
systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
vendor preset: enabled)
     Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
       Docs: man:slurmctld(8)
    Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
(code=exited, status=0/SUCCESS)
   Main PID: 2287 (code=exited, status=0/SUCCESS)

Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.

I am a little astonished that the default package shows this strange
behaviour regarding slurmctld installed through the package manager.

The base installation is Ubuntu 20.04 server installation, where I did
no modifications apart from installing the SLURM-wlm packages and
importing my existing configuration and munge.key.


Best wishes,

Sven Duscha


--
Sven Duscha
Deutsches Herzzentrum München
Technische Universität München
Lazarettstraße 36
80636 München

Brian Andrus

Mar 17, 2021, 2:54:38 PM
to slurm...@lists.schedmd.com
I am guessing you aren't overly familiar with Linux/systemd since you
have the '&' at the end of your start command.

Be that as it may, you can see it is a permissions issue. Check
permissions on /run and ensure the slurmctld user is able to write there.

You can either change the slurmctld user to one that can write there or
change the permissions on the directory to allow the slurmctld user
write access.

Brian Andrus

Rodrigo Santibáñez

Mar 17, 2021, 3:37:18 PM
to Slurm User Community List
After installing SLURM in Ubuntu and before starting the services, I do:

mkdir -p /var/spool/slurmd
mkdir -p /var/lib/slurm-llnl
mkdir -p /var/lib/slurm-llnl/slurmd
mkdir -p /var/lib/slurm-llnl/slurmctld
mkdir -p /var/run/slurm-llnl   # change this to /run/slurm-llnl, your location for SlurmdPidFile and SlurmctldPidFile
mkdir -p /var/log/slurm-llnl

chmod -R 755 /var/spool/slurmd
chmod -R 755 /var/lib/slurm-llnl/
chmod -R 755 /var/run/slurm-llnl/   # also here
chmod -R 755 /var/log/slurm-llnl/

chown -R slurm:slurm /var/spool/slurmd
chown -R slurm:slurm /var/lib/slurm-llnl/
chown -R slurm:slurm /var/run/slurm-llnl/   # and here
chown -R slurm:slurm /var/log/slurm-llnl/

Hope that clarifies something. My first SLURM installations failed because of missing directories and wrong permissions.

Best!

Sven Duscha

Mar 17, 2021, 4:05:58 PM
to Slurm User Community List, Brian Andrus
Hi,

On 17.03.21 19:54, Brian Andrus wrote:
> Be that as it may, you can see it is a permissions issue. Check
> permissions on /run and ensure the slurmctld user is able to write there.
>
> You can either change the slurmctld user to one that can write there
> or change the permissions on the directory to allow the slurmctld user
> write access.


That I already did as you can see from my analysis log.


The default location is /run and that is of course only writable for
root, as it also belongs only to root:root.

I am reluctant to do a chgrp slurm on that system directory.
Furthermore, as slurmctld is started in the context of user "slurm",
I would expect the default location to be in a directory writable by
user slurm.

As I am issuing "systemctl start slurmctld" as root, the file is written
by user root. The permissions were fine for that:

ls -lthrd /run/slurm-lnll/
drwxrwxr-x 2 root slurm 60 Mar 17 20:50 /run/slurm-lnll/

Be that as it may, it is more suitable to let non-system services write
to subdirectories. Hence my creation of the directory /run/slurm-llnl.


Even making the directory world-writable (chmod o+w) doesn't solve the
issue.

ls -lthrd /run/slurm-lnll/

drwxrwxrwx 2 root slurm 40 Mar 17 18:50 /run/slurm-llnl


So, even though the error message seems clear in pointing to a
"permissions issue", the permissions on the relevant directories do not
show what is wrong.

But I might be missing something obvious here. Rodrigo's suggestion to
use /var/run/slurm-llnl works with "systemctl start slurmctld" for me,
even though it has the same permissions:

ls -lthrd /var/run/slurm-lnll/
drwxrwxr-x 2 root slurm 60 Mar 17 20:50 /var/run/slurm-lnll/


Best wishes,


Sven


--
Sven Duscha
Deutsches Herzzentrum München
Technische Universität München
Lazarettstraße 36
80636 München
+49 89 1218 2602


Brian Andrus

Mar 17, 2021, 4:33:16 PM
to Sven Duscha, Slurm User Community List
That is looking like your /run folder does not have world execute
permissions, making it impossible for anything to access sub-directories.
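
One way to test this is to walk up from the PID directory and look at the
world-execute (search) bit on every ancestor. The following is only a
minimal sketch: check_traverse is a hypothetical helper name, and it reads
the mode string from "ls -ldL", so it ignores ACLs and other access
controls.

```shell
#!/bin/sh
# Sketch: report, for a path and each of its ancestors, whether the
# directory has the world-execute (search) bit set.
# check_traverse is a hypothetical helper, not part of Slurm or systemd.
check_traverse() {
    path=$1
    rc=0
    while :; do
        if [ -d "$path" ]; then
            # -L follows symlinks such as /var/run -> /run, so the mode
            # shown is that of the real directory; keep the 10 mode chars.
            mode=$(ls -ldL "$path" | cut -c1-10)
            case $mode in
                d????????[xt]) echo "ok      $mode $path" ;;
                *)             echo "blocked $mode $path"; rc=1 ;;
            esac
        fi
        # Stop at the root (or at "." for a relative path).
        case $path in
            /|.) break ;;
        esac
        path=$(dirname "$path")
    done
    return $rc
}
```

Called as "check_traverse /run/slurm-llnl/slurmctld.pid", it prints one
line per existing directory on the way up to / and returns non-zero if any
of them lacks the bit.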

Brian Andrus

William Brown

Mar 17, 2021, 7:14:53 PM
to Slurm User Community List
I can't immediately check what I do with Slurm, but in several systemd files I create subfolders of /var/run and set their ownership to the user the service runs as.

I use CentOS (for now!).

I can post an actual service startup file in daylight if useful. 

William 


Stefan Staeglich

Mar 18, 2021, 6:30:02 AM
to slurm...@schedmd.com
Hi Sven,

I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf
and not the systemd units:
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid

Best,
Stefan
Stefan Stäglich, Universität Freiburg, Institut für Informatik
Georges-Köhler-Allee, Geb.52, 79110 Freiburg, Germany

E-Mail : stae...@informatik.uni-freiburg.de
WWW : gki.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax : +49 761 203-8222




Sven Duscha

Mar 18, 2021, 7:49:50 AM
to slurm...@lists.schedmd.com
Hi,

thanks for all the responses.

On 18.03.21 11:29, Stefan Staeglich wrote:
> I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf
> and not the systemd units:
> SlurmctldPidFile=/run/slurmctld.pid
> SlurmdPidFile=/run/slurmd.pid


That was of course my first approach. I had used the directory
/run/slurm-lnll/ on my CentOS 7 installations, from which I copied the
slurm.conf file over.

It turned out that those directories I defined there weren't used. The
error message suggested that slurmctld still tried to write to
/run/slurmctld.pid.

Changing the systemd file was my last resort. And as mentioned, I don't
expect to have to do that much fiddling with a (relatively old, 19.05.5)
package-manager version. It seems "snap" provides a more current
version, 20.02.1:


snap install slurm         # version 20.02.1, or
apt  install slurm-client  # version 19.05.5-1


The underlying distribution installation also hasn't been modified by
me, I want to use Ubuntu20.04 as my future cluster OS, and the
kvm-virtualized SLURM controller was the first I tried.


Brian Andrus suggested:

On 17.03.21 21:32, Brian Andrus wrote:
> That is looking like your /run folder does not have world execute
> permissions, making it impossible for anything to access sub-directories.

But I can write as user "sven" (I didn't set up the LDAP connection,
yet) in a subdirectory of /run/slurm-lnll, if it belongs to user "sven".


Furthermore, I used the option "SlurmUser=slurm" in my slurm.conf file,
because it is good practice to not use root. Changing this to "root",
which should give universal access to all directories, doesn't make a
difference:

#SlurmUser=slurm
SlurmdUser=root


My initial response, that /var/run/slurm-lnll/slurmctld.pid worked for me,
was also premature. It kind of works for the first start after a reboot with

systemctl start slurmctld

and

systemctl stop slurmctld

works, but then lingers around in the timeout. During that time
slurmctld still runs, I see the process, and can use squeue, sinfo etc.

After the PID-file timeout it shows the service as terminated. This time
the log does not complain about being unable to open the slurmctld.pid
file, but instead flags my change to the legacy location /var/run, which
itself is only a reference to /run:

Mar 18 12:30:43 slurm systemd[1]: Reloading.
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5:
ListenStream= references a path below legacy directory /var/run/,
updating /var/>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12:
PIDFile= references a path below legacy directory /var/run/, updating
/var/r>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


time systemctl start slurmctld
Job for slurmctld.service failed because a timeout was exceeded.
See "systemctl status slurmctld.service" and "journalctl -xe" for details.

real    1m1.314s
user    0m0.003s
sys    0m0.002s

-- A session with the ID 1 has been terminated.
Mar 18 12:30:43 slurm systemd[1]: Reloading.
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5:
ListenStream= references a path below legacy directory /var/run/,
updating /var/>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12:
PIDFile= references a path below legacy directory /var/run/, updating
/var/r>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


I put the initial "&" after systemctl because I wanted to get back to my
prompt to investigate the problem. Normal behaviour, as I expect it,
would be a start-up time of 1-2 seconds.


I am back to my work-around:

systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > \
/run/slurm-lnll/slurmctld.pid && chown slurm: \
/run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid


My configuration file is read, though, as I can check with scontrol:

scontrol show config | grep run
SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid


So all of this hassle shouldn't occur; my fiddling with systemd should
be entirely unnecessary.

Mar 18 12:37:13 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


Unmodified systemd file:

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmctld(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/run/slurm-lnll/slurmctld.pid
LimitNOFILE=65536
TasksMax=infinity

[Install]
WantedBy=multi-user.target


I have run into file-permission issues before on CentOS 7, but as far as
I can tell from checking the permissions, it should work with these
permissions on the subdirectory:

ls -lthrd /run/slurm-lnll/
drwxrwxr-x 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/


But this suggests that it ignores the setting in the slurm.conf file:

SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid


-- The job identifier is 2259.
Mar 18 12:41:34 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: start operation
timed out. Terminating.
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: Failed with result
'timeout'.


Though scontrol show config claims otherwise:

 scontrol show config | grep run
SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
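
One way to cross-check the two settings is to compare the path that
slurm.conf tells slurmctld to write with the path the systemd unit waits
for. The sketch below assumes both are read straight from the files;
get_value, normalize and pidfile_mismatch are hypothetical helper names,
the key match is case-sensitive, and drop-in overrides are not considered.
Note that the values quoted in this thread are spelled differently
(slurm-llnl in the scontrol output, slurm-lnll in the unit file and the
journal); whether that is a transcription slip or the actual state of the
machine is not clear from the messages.

```shell
#!/bin/sh
# Sketch: compare SlurmctldPidFile (slurm.conf) with PIDFile (unit file).
# All three function names are hypothetical helpers for illustration.

# Print the value of KEY from FILE (last match wins); commented-out lines
# do not match, and anything after whitespace or '#' is dropped.
get_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$2" | tail -n 1
}

# Map the legacy /var/run prefix onto /run, as the systemd log lines
# about "legacy directory /var/run/" in this thread indicate.
normalize() {
    printf '%s\n' "$1" | sed 's|^/var/run/|/run/|'
}

# Succeeds (exit 0) when the two paths differ after normalization.
pidfile_mismatch() {
    conf=$(normalize "$(get_value SlurmctldPidFile "$1")")
    unit=$(normalize "$(get_value PIDFile "$2")")
    echo "slurm.conf: $conf"
    echo "unit file:  $unit"
    [ "$conf" != "$unit" ]
}
```

On the machine described here the call would be something like
"pidfile_mismatch /etc/slurm-llnl/slurm.conf /lib/systemd/system/slurmctld.service";
a zero exit status means systemd is waiting for a file that slurmctld was
never told to write.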


I would put it down to my own mistake, but I started yesterday with a
"vanilla" installation of Ubuntu 20.04 server, and the only purpose of
this VM is to run slurmctld.


This "should" happen to many more people, or I am missing something
obvious. If it were down to permissions, making the directory
/run/slurm-lnll world-writable:

 ls -lthrd /run/slurm-lnll/
drwxrwxrwx 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/

should "fix" the problem. I could live with that, even though I try to
adhere to strict permission management.

That also doesn't work

Mar 18 12:46:33 slurm systemd[1]: slurmctld.service: Can't open PID file
/run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:46:38 slurm systemd[1]: Reloading.


So, I am turning in circles here.


Best wishes,

Sven


--
Sven Duscha
Deutsches Herzzentrum München
Technische Universität München
Lazarettstraße 36
80636 München
+49 89 1218 2602


Michael Gutteridge

Mar 18, 2021, 5:56:02 PM
to Slurm User Community List
I would also encourage you to use defaults in the slurm.conf (matching what's shipped in the Ubuntu packages).  However, here is what I've done to use non-Ubuntu-package paths for the PID files.

Create an override in /etc/systemd/system/slurmd.service.d/override.conf with something like:

node32[~]: cat /etc/systemd/system/slurmd.service.d/override.conf
[Service]
PIDFile=/var/run/slurm-llnl/slurmd.pid
RuntimeDirectory=slurm-llnl
RuntimeDirectoryMode=0775

Replace the daemon name as necessary. The "RuntimeDirectory" setting is needed because /run and /var/run are virtual file systems managed by systemd. Creating that directory "by hand" has unpredictable results.
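
For slurmctld, the same kind of override could be generated as sketched
below. This is an illustration rather than a tested recipe:
write_pid_override is a hypothetical helper, the unit directory is passed
in as an argument so the function can be tried against a scratch
directory, and the PID path uses /run instead of /var/run only because
systemd rewrites the legacy prefix anyway.

```shell
#!/bin/sh
# Sketch: write a drop-in with the settings from the override above for a
# given daemon. write_pid_override is a hypothetical helper name.
#   $1  systemd unit directory (normally /etc/systemd/system)
#   $2  daemon name, e.g. slurmctld or slurmd
write_pid_override() {
    unit_dir=$1
    daemon=$2
    dropin_dir="$unit_dir/$daemon.service.d"
    mkdir -p "$dropin_dir"
    # Unquoted EOF so that $daemon is expanded inside the here-document.
    cat > "$dropin_dir/override.conf" <<EOF
[Service]
PIDFile=/run/slurm-llnl/$daemon.pid
RuntimeDirectory=slurm-llnl
RuntimeDirectoryMode=0775
EOF
    # Print the path of the file that was written.
    echo "$dropin_dir/override.conf"
}
```

On a real host this would be run as root with /etc/systemd/system as the
first argument, followed by "systemctl daemon-reload" and a restart of the
service. SlurmctldPidFile in slurm.conf has to name the same path for the
override to help.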

HTH

 - Michael
