[slurm-users] After reboot nodes are in state = down


Rafał Kędziorski

Sep 27, 2019, 1:40:49 AM
to slurm...@lists.schedmd.com
Hi,

I'm running slurm-wlm 18.08.5-2 on a Raspberry Pi cluster:

- 1 Pi 4 as manager
- 4 Pi 4 nodes

This works fine. But after every restart of the nodes I get this

cluster@pi-manager:~ $ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
devcluster*    up   infinite      4   down pi-4-node-[1-4]


state. Then I can call

sudo scontrol update NodeName=<node_name> State=RESUME

for every node; sometimes all nodes then become idle, and sometimes some stay down:

cluster@pi-manager:~ $ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
devcluster*    up   infinite      2   idle pi-4-node-[1-2]
devcluster*    up   infinite      2   down pi-4-node-[3-4]


Status of all nodes:

cluster@pi-manager:~ $ scontrol show nodes
NodeName=pi-4-node-1 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.24
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.141 NodeHostName=pi-4-node-1 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:58 SlurmdStartTime=2019-09-19T00:26:36
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=pi-4-node-2 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.06
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.142 NodeHostName=pi-4-node-2 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:57 SlurmdStartTime=2019-09-19T00:26:49
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=pi-4-node-3 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.143 NodeHostName=pi-4-node-3 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3676 Sockets=4 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:55 SlurmdStartTime=2019-09-19T00:26:45
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:32]

NodeName=pi-4-node-4 Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.02
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.144 NodeHostName=pi-4-node-4 Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3687 Sockets=4 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=devcluster
   BootTime=2019-09-19T17:38:52 SlurmdStartTime=2019-09-19T00:26:47
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]

NodeName=pi-manager Arch=armv7l CoresPerSocket=1
   CPUAlloc=0 CPUTot=4 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=192.168.178.140 NodeHostName=pi-manager Version=18.08
   OS=Linux 4.19.66-v7l+ #1253 SMP Thu Aug 15 12:02:08 BST 2019
   RealMemory=1 AllocMem=0 FreeMem=3446 Sockets=4 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2019-09-19T17:35:48 SlurmdStartTime=2019-09-19T08:10:51
   CfgTRES=cpu=4,mem=1M,billing=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

For the nodes that are down, the Reason is:

Reason=Node unexpectedly rebooted [slurm@2019-09-19T17:39:30]

What is the problem? Note that the nodes in my cluster are not running all the time.



Regards,
Rafal

Henkel, Andreas

Sep 27, 2019, 2:42:04 AM
to Slurm User Community List
Hi Rafal,

How do you restart the nodes? If you don't use scontrol reboot <node>, Slurm doesn't expect the nodes to reboot, which is why you see that reason in those cases.
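
A minimal sketch of such a Slurm-initiated reboot, assuming RebootProgram is set in slurm.conf (Slurm runs it to perform the reboot) and that the installed scontrol supports the nextstate option; check man scontrol for your version. The reboot_nodes helper and the DRY_RUN variable are hypothetical names used only for illustration; the node list comes from the sinfo output earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: ask Slurm to reboot the given nodes and return them
# to service afterwards. Assumes RebootProgram is configured in slurm.conf.
# With DRY_RUN=1 the command is only printed, not executed.
reboot_nodes() {
    nodes=$1
    set -- scontrol reboot nextstate=RESUME "$nodes"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        sudo "$@"
    fi
}

# Example: show the command for all four compute nodes without running it.
# The node list is quoted so the shell does not treat [1-4] as a glob.
DRY_RUN=1 reboot_nodes 'pi-4-node-[1-4]'
```

This only covers reboots requested through Slurm; it does not change how nodes are treated after a plain shutdown and a later power-on.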

Best 
Andreas 

Rafał Kędziorski

Sep 27, 2019, 4:37:29 AM
to Slurm User Community List
Hi Andreas,

My cluster is not running all the time. I just call sudo shutdown, and after the next boot the nodes are in state down.

I'm using Slurm on a Raspi cluster (5x Pi 4). What is the best way to shut down the nodes so that they come up idle, not down, after boot?


Regards,
Rafal

Juergen Salk

Sep 27, 2019, 5:19:56 AM
to Slurm User Community List
Hi Rafał,

you may try setting `ReturnToService=2` in slurm.conf.
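
For reference, a sketch of the relevant slurm.conf fragment. The comments summarize the three values as described in the slurm.conf man page; the rest of the file is assumed to stay unchanged.

```
# slurm.conf (keep the file identical on the manager and on all nodes)
# 0: a DOWN node stays DOWN until an administrator changes its state
# 1: a DOWN node returns to service on registration only if it was
#    set DOWN for being non-responsive
# 2: a DOWN node returns to service as soon as it registers with a
#    valid configuration, whatever the reason it was set DOWN
ReturnToService=2
```

The change should take effect once the daemons have re-read the configuration, for example after scontrol reconfigure or a restart of slurmctld.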

Best regards
Jürgen

--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471

* Rafał Kędziorski <rafal.ke...@gmail.com> [190927 10:36]:
--
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3 E28F BD0D 4B33 A997 BA7A

Steffen Grunewald

Sep 27, 2019, 7:49:00 AM
to Slurm User Community List
On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> Hi Rafał,
>
> you may try setting `ReturnToService=2` in slurm.conf.
>
> Best regards
> Jürgen

Caveat: A spontaneously rebooting machine may create a "black hole" this way.

- Steffen

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

Rafał Kędziorski

Sep 27, 2019, 8:59:56 AM
to Slurm User Community List
On Fri, 27 Sep 2019 at 13:50, Steffen Grunewald <steffen....@aei.mpg.de> wrote:
> On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> > Hi Rafał,
> >
> > you may try setting `ReturnToService=2` in slurm.conf.
> >
> > Best regards
> > Jürgen
>
> Caveat: A spontaneously rebooting machine may create a "black hole" this way.

How do you mean this? Could ReturnToService=2 be a problem?

ps.

Juergen Salk

Sep 27, 2019, 9:33:29 AM
to Slurm User Community List
* Rafał Kędziorski <rafal.ke...@gmail.com> [190927 14:58]:
> > >
> > > you may try setting `ReturnToService=2` in slurm.conf.
> > >
> >
> > Caveat: A spontaneously rebooting machine may create a "black hole" this
> > way.
> >
>
> How do you mean this? Could ReturnToService=2 be a problem?
>

Hi Rafał,

Black hole syndrome occurs when a node keeps accepting new jobs
and then causes these jobs to fail. This can even drain all jobs
from the queue for no obvious reason.

As Steffen said, this scenario may also happen if a node accepts a
job, then spontaneously reboots, then accepts the next job, then
reboots again, ...

> > Max Planck Institute for Gravitational Physics (Albert Einstein Institute)

That adds a somewhat funny touch in this context. ;-)

Steffen Grunewald

Sep 27, 2019, 9:36:47 AM
to Slurm User Community List
On Fri, 2019-09-27 at 14:58:40 +0200, Rafał Kędziorski wrote:
> On Fri, 27 Sep 2019 at 13:50, Steffen Grunewald <
> steffen....@aei.mpg.de> wrote:
> > On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> > >
> > > you may try setting `ReturnToService=2` in slurm.conf.
> > >
> > Caveat: A spontaneously rebooting machine may create a "black hole" this
> > way.
> >
> How do you mean this? Could ReturnToService=2 be a problem?

For us it was - we had (and still have) nodes spontaneously rebooting.
If they come up idle, they will eat the next job, and so on ad infinitum -
thus we've set ReturnToService=0.

"Black hole" in a figurative sense: still swallowing everything it can get its hands on.

You've got to decide which matters more to you: not having to resume machines by hand
after an intentional reboot, or keeping full control over misbehaving ones. My own choice is clear.
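
With ReturnToService=0, nodes that were rebooted on purpose have to be resumed by hand. A minimal sketch of doing that selectively, assuming the reason string matches the one shown in the scontrol output earlier in this thread; rebooted_nodes is a hypothetical helper name, and the loop only runs where sinfo is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes whose reason starts with
# "Node unexpectedly rebooted".
# Input: lines of the form "<node>|<reason>", as produced by
#   sinfo -h -N -t down -o "%N|%E"
rebooted_nodes() {
    awk -F'|' '$2 ~ /^Node unexpectedly rebooted/ { print $1 }' | sort -u
}

# Resume only those nodes; nodes that are down for other reasons are left alone.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -N -t down -o "%N|%E" | rebooted_nodes | while read -r node; do
        sudo scontrol update NodeName="$node" State=RESUME
    done
fi
```

If all nodes are known to be healthy, scontrol also accepts a hostlist expression, so a single quoted NodeName='pi-4-node-[1-4]' can replace the loop.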

- S

Rafał Kędziorski

Sep 27, 2019, 12:38:32 PM
to Slurm User Community List
OK, thanks for the explanation.