[slurm-users] Best method to determine if a node is down

Doug Niven

Jun 26, 2021, 1:11:15 PM
to slurm...@lists.schedmd.com
Hi Folks,

I’d like to set up an email notification, perhaps via cron (unless there’s a better method), that notifies the sysadmin when a Slurm node is down and/or not firing off jobs...

For example, using ‘squeue’ I recently saw this in the NODELIST(REASON) column:

(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

And using ‘sinfo’ I saw:

% sinfo -Nl
Fri May 07 08:49:26 2021
NODELIST NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
trom         1 short*    draining  112 2:56:2 204800        0      1 (null)   Kill task failed
trom         1 long      draining  112 2:56:2 204800        0      1 (null)   Kill task failed

I’m not sure what the best value to grep for would be, as I suspect there are states other than DOWN or DRAINED that can mean a node is down and not firing off jobs.
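
Something along these lines is roughly what I had in mind for the cron job (the state list and the admin address below are just my guesses, which is exactly the part I’m unsure about):

#!/bin/bash
# Rough sketch: mail the admin if sinfo reports any node in a state
# where it won't be running jobs. The state list here is a guess.
ADMIN="root@localhost"
BAD=$(sinfo -h -N -o "%N %t %E" --states=down,drained,draining,fail,failing)
if [ -n "$BAD" ]; then
    echo "$BAD" | mail -s "Slurm nodes not available" "$ADMIN"
fi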

Thanks in advance for your ideas,

Doug


Marcus Boden

Jun 27, 2021, 4:03:05 PM
to slurm...@lists.schedmd.com
Hi Doug,

Slurm has the strigger [1] mechanism, which can do exactly that; the man page even has your use case as an example. It works quite well for us.
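
From memory, the man page example boils down to something like the sketch below (the script path and mail address are placeholders, so check the page itself for the exact wording):

# Register a trigger that fires whenever a node goes DOWN
# (--flags=PERM keeps it from being purged after it fires once):
strigger --set --node --down --flags=PERM --program=/usr/sbin/slurm_admin_notify

# /usr/sbin/slurm_admin_notify -- strigger passes the affected node names as arguments:
#!/bin/bash
echo "Nodes down: $*" | /bin/mail -s "Slurm nodes DOWN" admin@example.com

If I remember correctly, permanent triggers have to be set by the SlurmUser (or root), and there are matching --drained and --up options if you want separate notifications for those events.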

Best,
Marcus

[1] https://slurm.schedmd.com/strigger.html
--
Marcus Vincent Boden, M.Sc.
eScience Working Group, HPC Team
Tel.: +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-------------------------------------------------------------------------
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de

Support: Tel.: +49 551 201-1523, URL: https://www.gwdg.de/support
Secretariat: Tel.: +49 551 201-1510, Fax: -2150, E-Mail: gw...@gwdg.de

Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board: Prof. Dr. Norbert Lossau
Registered office: Göttingen
Register court: Göttingen, commercial register no. B 598

Certified according to ISO 9001
-------------------------------------------------------------------------
