[slurm-users] Nodes stuck in drain state


Roger Mason

May 25, 2023, 08:06:50
to slurm...@lists.schedmd.com
Hello,

slurm 20.02.7 on FreeBSD.

I have a couple of nodes stuck in the drain state. I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.

I then tried

/usr/local/sbin/slurmctld -c
scontrol update nodename=node012 state=idle

also without success.

Is there some other method I can use to get these nodes back up?

Thanks,
Roger

Ole Holm Nielsen

May 25, 2023, 08:29:53
to slurm...@lists.schedmd.com
On 5/25/23 13:59, Roger Mason wrote:
> slurm 20.02.7 on FreeBSD.

Uh, that's old!

> I have a couple of nodes stuck in the drain state. I have tried
>
> scontrol update nodename=node012 state=down reason="stuck in drain state"
> scontrol update nodename=node012 state=resume
>
> without success.
>
> I then tried
>
> /usr/local/sbin/slurmctld -c
> scontrol update nodename=node012 state=idle
>
> also without success.
>
> Is there some other method I can use to get these nodes back up?

What's the output of "scontrol show node node012"?

/Ole

Doug Meyer

May 25, 2023, 08:58:11
to Slurm User Community List
You could also review the node log in /var/log/slurm/. Often sinfo -lR will tell you the cause, for example memory not matching the config.
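In practice that might look something like this (the log path is an assumption; the actual location is whatever SlurmdLogFile points to in your slurm.conf):

# Show drained/down nodes together with the recorded reason:
sinfo -lR

# On the node itself, look at the most recent slurmd log entries:
tail -n 50 /var/log/slurm/slurmd.log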

Doug

Roger Mason

May 25, 2023, 09:26:50
to Slurm User Community List

Ole Holm Nielsen <Ole.H....@fysik.dtu.dk> writes:

> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!

Yes. It is what is available in ports.

> What's the output of "scontrol show node node012"?

NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node012 NodeHostName=node012
RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=macpro
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=4,mem=10193M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2023-05-25T09:26:59]

But the 'Low RealMemory' is incorrect. The entry in slurm.conf for
node012 is:

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193 State=UNKNOWN

Thanks for the help.
Roger

Roger Mason

May 25, 2023, 09:29:24
to Slurm User Community List
Hello,

Doug Meyer <dame...@gmail.com> writes:

> You could also review the node log in /var/log/slurm/. Often sinfo -lR will tell you the cause, for example memory not matching the config.
>
REASON           USER         TIMESTAMP            STATE   NODELIST
Low RealMemory   slurm(468)   2023-05-25T09:26:59  drain*  node012
Not responding   slurm(468)   2023-05-25T09:30:31  down*   node[001-003,008]

But, as I said in my response to Ole, the memory in slurm.conf and in
the 'show node' output match.

Many thanks for the help.

Roger

Davide DelVento

May 25, 2023, 09:40:04
to Slurm User Community List
Can you ssh into the node and check the actual availability of memory? Maybe there is a zombie process (or a healthy one with a memory leak bug) that's hogging all the memory?
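On a FreeBSD node (as here) a quick sketch of that check could be:

# Summary of memory use plus per-process resident sizes, largest first:
top -o res

# Physical memory as the kernel sees it, in bytes:
sysctl hw.physmem hw.realmem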

Ole Holm Nielsen

May 25, 2023, 09:50:45
to Roger Mason, Slurm User Community List
On 5/25/23 15:23, Roger Mason wrote:
> NodeName=node012 CoresPerSocket=2
> CPUAlloc=0 CPUTot=4 CPULoad=N/A
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=node012 NodeHostName=node012
> RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
> State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=macpro
> BootTime=None SlurmdStartTime=None
> CfgTRES=cpu=4,mem=10193M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low RealMemory [slurm@2023-05-25T09:26:59]
>
> But the 'Low RealMemory' is incorrect. The entry in slurm.conf for
> node012 is:
>
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=10193 State=UNKNOWN

Thanks for the info. Some questions arise:

1. Is slurmd running on the node?

2. What's the output of "slurmd -C" on the node?

3. Define State=UP in slurm.conf instead of UNKNOWN

4. Why have you configured TmpDisk=0? It should be the size of the /tmp
filesystem.

Since you run Slurm 20.02, there are some suggestions in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
where this might be useful:

> Note for Slurm 20.02: The Boards=1 SocketsPerBoard=2 configuration gives error messages, see bug_9241 and bug_9233. Use Sockets= instead:

I hope changing these slurm.conf parameters will help.
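As a rough sketch, putting points 3-4 and the note above together, the node definition might then look like this (RealMemory kept from your current entry; fill in TmpDisk with the size of /tmp on the node, in MB):

NodeName=node012 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=10193 TmpDisk=<MB of /tmp> State=UP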

Best regards,
Ole




Roger Mason

May 25, 2023, 10:30:35
to Slurm User Community List
Hello,

Davide DelVento <davide....@gmail.com> writes:

> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?

This is what top shows:

last pid: 45688; load averages: 0.00, 0.00, 0.00 up 0+03:56:52 11:58:13
26 processes: 1 running, 25 sleeping
CPU: 0.0% user, 0.0% nice, 0.1% system, 0.0% interrupt, 99.9% idle
Mem: 9452K Active, 69M Inact, 290M Wired, 287K Buf, 5524M Free
ARC: 125M Total, 37M MFU, 84M MRU, 168K Anon, 825K Header, 3476K Other
36M Compressed, 89M Uncompressed, 2.46:1 Ratio
Swap: 10G Total, 10G Free

Thanks for the suggestion.

Roger

Roger Mason

May 25, 2023, 10:35:58
to Ole Holm Nielsen, Slurm User Community List

Ole Holm Nielsen <Ole.H....@fysik.dtu.dk> writes:

> 1. Is slurmd running on the node?
Yes.

> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097

> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do.

> 4. Why have you configured TmpDisk=0? It should be the size of the
> /tmp filesystem.
I have not configured TmpDisk. This is the entry in slurm.conf for that
node:
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193 State=UNKNOWN

But I do notice that slurmd -C now says there is less memory than
configured.

Thanks again.

Roger

Brian Andrus

May 25, 2023, 10:54:43
to slurm...@lists.schedmd.com
That output of slurmd -C is your answer.

Slurmd only sees 6GB of memory and you are claiming it has 10GB.

I would run some memtests, look at meminfo on the node, etc.

Maybe even check that the type/size of memory in there is what you think
it is.
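On FreeBSD a rough way to see what the hardware actually reports (this assumes the sysutils/dmidecode port is installed) would be:

# List the DIMM slots and module sizes from the SMBIOS tables:
dmidecode -t memory | grep -Ei 'locator|size'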

Brian Andrus

Groner, Rob

May 25, 2023, 10:57:20
to Slurm User Community List
A quick test to see if it's a configuration error is to set config_overrides in your slurm.conf and see if the node then responds to scontrol update. 
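If I remember right, in 20.02 that is spelled as a SlurmdParameters value (please check the slurm.conf man page for your version), so the test would be roughly:

# In slurm.conf: take the node definition as authoritative instead of draining on mismatch
SlurmdParameters=config_overrides

# Push the change and try to bring the node back:
scontrol reconfigure
scontrol update nodename=node012 state=resume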


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Brian Andrus <toom...@gmail.com>
Sent: Thursday, May 25, 2023 10:54 AM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Nodes stuck in drain state

Roger Mason

May 25, 2023, 13:24:22
to slurm...@lists.schedmd.com
Hello,

"Groner, Rob" <rug...@psu.edu> writes:

> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.

Thanks to all who helped. It turned out that memory was the issue. I
have now reseated the RAM in the offending node and all seems well.

I have another node also stuck in drain that I will investigate. I
picked up some useful tips from the replies, but if I can't get it back
on-line I hope the friendly people on this list will rescue me.

Thanks again,
Roger
