[slurm-users] Nodes stuck in drain state


Roger Mason

May 25, 2023, 08:06:50
to slurm...@lists.schedmd.com
Hello,

slurm 20.02.7 on FreeBSD.

I have a couple of nodes stuck in the drain state. I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.

I then tried

/usr/local/sbin/slurmctld -c
scontrol update nodename=node012 state=idle

also without success.

Is there some other method I can use to get these nodes back up?

Thanks,
Roger

Ole Holm Nielsen

May 25, 2023, 08:29:53
to slurm...@lists.schedmd.com
On 5/25/23 13:59, Roger Mason wrote:
> slurm 20.02.7 on FreeBSD.

Uh, that's old!

> I have a couple of nodes stuck in the drain state. I have tried
>
> scontrol update nodename=node012 state=down reason="stuck in drain state"
> scontrol update nodename=node012 state=resume
>
> without success.
>
> I then tried
>
> /usr/local/sbin/slurmctld -c
> scontrol update nodename=node012 state=idle
>
> also without success.
>
> Is there some other method I can use to get these nodes back up?

What's the output of "scontrol show node node012"?

/Ole

Doug Meyer

May 25, 2023, 08:58:11
to Slurm User Community List
You could also review the node log in /var/log/slurm/. Often sinfo -lR will tell you the cause, for example memory not matching the config.
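In practice that might look something like this (the log path is an assumption; the actual location is whatever SlurmdLogFile points to in your slurm.conf):

# Show drained/down nodes together with the recorded reason:
sinfo -lR

# On the node itself, look at the most recent slurmd log entries:
tail -n 50 /var/log/slurm/slurmd.log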

Doug

Roger Mason

May 25, 2023, 09:26:50
to Slurm User Community List

Ole Holm Nielsen <Ole.H....@fysik.dtu.dk> writes:

> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!

Yes. It is what is available in ports.

> What's the output of "scontrol show node node012"?

NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node012 NodeHostName=node012
RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=macpro
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=4,mem=10193M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2023-05-25T09:26:59]

But the 'Low RealMemory' is incorrect. The entry in slurm.conf for
node012 is:

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193 State=UNKNOWN

Thanks for the help.
Roger

Roger Mason

May 25, 2023, 09:29:24
to Slurm User Community List
Hello,

Doug Meyer <dame...@gmail.com> writes:

> You could also review the node log in /var/log/slurm/. Often sinfo -lR will tell you the cause, for example memory not matching the config.
>
REASON           USER         TIMESTAMP            STATE   NODELIST
Low RealMemory   slurm(468)   2023-05-25T09:26:59  drain*  node012
Not responding   slurm(468)   2023-05-25T09:30:31  down*   node[001-003,008]

But, as I said in my response to Ole, the memory in slurm.conf and in
the 'show node' output match.

Many thanks for the help.

Roger

Davide DelVento

May 25, 2023, 09:40:04
to Slurm User Community List
Can you ssh into the node and check the actual availability of memory? Maybe there is a zombie process (or a healthy one with a memory leak bug) that's hogging all the memory?
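On a FreeBSD node (as here) a quick sketch of that check could be:

# Summary of memory use plus per-process resident sizes, largest first:
top -o res

# Physical memory as the kernel sees it, in bytes:
sysctl hw.physmem hw.realmem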

Ole Holm Nielsen

May 25, 2023, 09:50:45
to Roger Mason, Slurm User Community List
On 5/25/23 15:23, Roger Mason wrote:
> NodeName=node012 CoresPerSocket=2
> CPUAlloc=0 CPUTot=4 CPULoad=N/A
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=node012 NodeHostName=node012
> RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
> State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=macpro
> BootTime=None SlurmdStartTime=None
> CfgTRES=cpu=4,mem=10193M,billing=4
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low RealMemory [slurm@2023-05-25T09:26:59]
>
> But the 'Low RealMemory' is incorrect. The entry in slurm.conf for
> node012 is:
>
> NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
> ThreadsPerCore=1 RealMemory=10193 State=UNKNOWN

Thanks for the info. Some questions arise:

1. Is slurmd running on the node?

2. What's the output of "slurmd -C" on the node?

3. Define State=UP in slurm.conf instead of UNKNOWN

4. Why have you configured TmpDisk=0? It should be the size of the /tmp
filesystem.

Since you run Slurm 20.02, there are some suggestions in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration
where this might be useful:

> Note for Slurm 20.02: The Boards=1 SocketsPerBoard=2 configuration gives error messages, see bug_9241 and bug_9233. Use Sockets= instead:

I hope changing these slurm.conf parameters will help.
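As a rough sketch, putting points 3-4 and the note above together, the node definition might then look like this (RealMemory kept from your current entry; fill in TmpDisk with the size of /tmp on the node, in MB):

NodeName=node012 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=10193 TmpDisk=<MB of /tmp> State=UP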

Best regards,
Ole




Roger Mason

May 25, 2023, 10:30:35
to Slurm User Community List
Hello,

Davide DelVento <davide....@gmail.com> writes:

> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?

This is what top shows:

last pid: 45688; load averages: 0.00, 0.00, 0.00 up 0+03:56:52 11:58:13
26 processes: 1 running, 25 sleeping
CPU: 0.0% user, 0.0% nice, 0.1% system, 0.0% interrupt, 99.9% idle
Mem: 9452K Active, 69M Inact, 290M Wired, 287K Buf, 5524M Free
ARC: 125M Total, 37M MFU, 84M MRU, 168K Anon, 825K Header, 3476K Other
36M Compressed, 89M Uncompressed, 2.46:1 Ratio
Swap: 10G Total, 10G Free

Thanks for the suggestion.

Roger

Roger Mason

May 25, 2023, 10:35:58
to Ole Holm Nielsen, Slurm User Community List

Ole Holm Nielsen <Ole.H....@fysik.dtu.dk> writes:

> 1. Is slurmd running on the node?
Yes.

> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097

> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do.

> 4. Why have you configured TmpDisk=0? It should be the size of the
> /tmp filesystem.
I have not configured TmpDisk. This is the entry in slurm.conf for that
node:
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193 State=UNKNOWN

But I do notice that slurmd -C now says there is less memory than
configured.

Thanks again.

Roger

Brian Andrus

May 25, 2023, 10:54:43
to slurm...@lists.schedmd.com
That output of slurmd -C is your answer.

Slurmd only sees 6GB of memory and you are claiming it has 10GB.

I would run some memtests, look at meminfo on the node, etc.

Maybe even check that the type/size of memory in there is what you think
it is.
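On FreeBSD a rough way to see what the hardware actually reports (this assumes the sysutils/dmidecode port is installed) would be:

# List the DIMM slots and module sizes from the SMBIOS tables:
dmidecode -t memory | grep -Ei 'locator|size'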

Brian Andrus

Groner, Rob

May 25, 2023, 10:57:20
to Slurm User Community List
A quick test to see if it's a configuration error is to set config_overrides in your slurm.conf and see if the node then responds to scontrol update. 
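If I remember right, in 20.02 that is spelled as a SlurmdParameters value (please check the slurm.conf man page for your version), so the test would be roughly:

# In slurm.conf: take the node definition as authoritative instead of draining on mismatch
SlurmdParameters=config_overrides

# Push the change and try to bring the node back:
scontrol reconfigure
scontrol update nodename=node012 state=resume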


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Brian Andrus <toom...@gmail.com>
Sent: Thursday, May 25, 2023 10:54 AM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Nodes stuck in drain state

Roger Mason

May 25, 2023, 13:24:22
to slurm...@lists.schedmd.com
Hello,

"Groner, Rob" <rug...@psu.edu> writes:

> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.

Thanks to all who helped. It turned out that memory was the issue. I
have now reseated the RAM in the offending node and all seems well.

I have another node also stuck in drain that I will investigate. I
picked up some useful tips from the replies, but if I can't get it back
on-line I hope the friendly people on this list will rescue me.

Thanks again,
Roger
