[slurm-users] node health check


Ratnasamy, Fritz

Jan 30, 2023, 10:36:38 PM
to Slurm User Community List
Hi, 

Currently, some of our nodes are overloaded. The NHC we have installed used to check the load and drain a node when it was overloaded. However, for the past few days it has not been showing the state of the node. When I run /usr/sbin/nhc manually, it says:
20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )

It seems that it is not able to read the state of the node. I ran scontrol show node mcn26:
NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
   NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8


Any idea what happened and why nhc is not reading the state of the node anymore? 
Best, 


Fritz Ratnasamy

Data Scientist

Information Technology


Ole Holm Nielsen

Jan 31, 2023, 2:00:15 AM
to slurm...@lists.schedmd.com
On 1/31/23 04:35, Ratnasamy, Fritz wrote:
>  Currently, some of our nodes are overloaded. The nhc installed used to
> check the load and drain the node when it is overloaded. However, for the
> past few  days, it is not showing the state of the node. When I run
> /usr/sbin/nhc manually, it says
> 20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online
> mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Not sure how to handle node state ""
> on mcn26.chicagobooth.edu
> /usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu
> ( )
>
> It seems that it is not able to read the state of the node. I ran scontrol
> show node mcn26
> NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
>    NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8
>
> Any idea what happened and why nhc is not reading the state of the node
> anymore?

What's the complete output of "scontrol show node mcn26", especially the
State=... information?

Which version of NHC are you running?

/Ole





Brian Johanson

Jan 31, 2023, 9:38:36 AM
to slurm...@lists.schedmd.com

NHC is using the FQDN (mcn26.chicagobooth.edu), but Slurm isn't (NodeHostName=mcn26), so the node state query is failing.

We have a line 'export HOSTNAME=$(hostname -s)' in /etc/sysconfig/nhc
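
To illustrate the mismatch, here is a minimal sketch that only does string handling and does not call NHC or Slurm. The helper name short_name is hypothetical; it strips the domain part of a name, which is roughly what 'hostname -s' gives you for the local machine. The two names are the ones from the output quoted earlier in this thread.

```shell
# Hypothetical helper: strip everything from the first dot onward,
# similar to what `hostname -s` reports for the local machine.
short_name() {
    printf '%s\n' "${1%%.*}"
}

fqdn="mcn26.chicagobooth.edu"   # name NHC passed to node-mark-online
slurm_name="mcn26"              # NodeName / NodeHostName shown by scontrol

if [ "$fqdn" != "$slurm_name" ]; then
    echo "FQDN does not match the Slurm node name"
fi

if [ "$(short_name "$fqdn")" = "$slurm_name" ]; then
    echo "short name matches the Slurm node name"
fi
```

With HOSTNAME set to the short name, NHC should look the node up under the same name Slurm knows it by.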


-b

