[slurm-users] Run a healthcheck job on all nodes

John Hearns via slurm-users

Aug 18, 2025, 7:04:48 AM
to Slurm User Community List
I may have asked this already.

I want to run a healthcheck job on all nodes.
I can select the nodes in a partition by hand, then write a bash script to get a list of nodes using
nodeset -e
Then submit to each node in the list using sbatch -w
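
For reference, the manual approach described above might look roughly like the following sketch. The function name `submit_per_node`, the `SBATCH` override variable, and the job script name `healthcheck.sh` are hypothetical; node names are assumed to arrive one per line on standard input.

```shell
#!/usr/bin/env bash
# Sketch only: submit one health-check job per node with sbatch -w.
# Hypothetical names: submit_per_node, the SBATCH override, healthcheck.sh.

submit_per_node() {
    local script=$1
    local sbatch_cmd=${SBATCH:-sbatch}   # set SBATCH to another command for a dry run
    local node
    # Read one node name per line from stdin and skip blank lines.
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        # -w (--nodelist) pins the job to exactly this node.
        # </dev/null keeps the submit command from consuming the node list.
        "$sbatch_cmd" -w "$node" --job-name="healthcheck-$node" "$script" </dev/null
    done
}
```

One possible invocation, with an illustrative node range: `scontrol show hostnames 'node[01-04]' | submit_per_node healthcheck.sh`. The output of `nodeset -e` is space-separated, so it would need converting to one name per line first, for example with `tr ' ' '\n'`.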

Is there a cleaner way of doing this?


John Hearns

Ole Holm Nielsen via slurm-users

Aug 18, 2025, 7:37:51 AM
to slurm...@lists.schedmd.com
Hi John,

Nice to hear from you again!

On 8/18/25 13:00, John Hearns via slurm-users wrote:
> I want to run a healthcheck job on all nodes.
> I can select the nodes in a partition by hand, then write a bash script to
> get a list of nodes using
> nodeset -e
> Then submit to each node in the list using sbatch -w
>
> Is there a cleaner way of doing this?

IMHO the cleanest way is to use the great ClusterShell tool[1], where
Slurm partitions and nodes can be configured as shown in the Wiki
examples. For example, to run NHC on all nodes:

$ clush -ba nhc

Best regards,
Ole

[1] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell
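
As a rough illustration of the ClusterShell approach, the sketch below wraps clush so that a command runs on every node of one Slurm partition instead of on all nodes. It assumes a ClusterShell group source named `slurmpart` has already been configured (for example from ClusterShell's Slurm group bindings); the function name `run_on_partition`, the `CLUSH` override variable, and the partition name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run a command on all nodes of one Slurm partition via clush.
# Assumption: a ClusterShell group source called "slurmpart" is configured.

run_on_partition() {
    local partition=$1
    shift
    local clush_cmd=${CLUSH:-clush}   # set CLUSH to another command for a dry run
    # -b gathers identical output from different nodes;
    # -w selects the target nodes, here the group for the given partition.
    "$clush_cmd" -b -w "@slurmpart:${partition}" "$@"
}
```

For example, `run_on_partition compute nhc` would run NHC on the nodes of a partition named `compute`, if such a partition exists.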


Gerhard Strangar via slurm-users

Aug 18, 2025, 7:58:53 AM
to slurm...@lists.schedmd.com
John Hearns via slurm-users wrote:

> I want to run a healthcheck job on all nodes.

And using HealthCheckProgram in the slurm.conf would be too easy?

Ole Holm Nielsen via slurm-users

Aug 18, 2025, 8:45:45 AM
to slurm...@lists.schedmd.com
On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
> John Hearns via slurm-users wrote:
>
>> I want to run a healthcheck job on all nodes.
>
> And using HealthCheckProgram in the slurm.conf would be too easy?

But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd is
started, and possibly when a new job is started.

I think John asked for a way to run NHC on a set of nodes whenever desired
by the system administrator, and not at any random time, right?
ClusterShell is the ideal tool for making such parallel commands on the
cluster.

Best regards,
Ole

Bjørn-Helge Mevik via slurm-users

Aug 18, 2025, 9:00:50 AM
to slurm...@schedmd.com
Ole Holm Nielsen via slurm-users <slurm...@lists.schedmd.com> writes:

> On 8/18/25 13:56, Gerhard Strangar via slurm-users wrote:
>> John Hearns via slurm-users wrote:
>>
>>> I want to run a healthcheck job on all nodes.
>> And using HealthCheckProgram in the slurm.conf would be too easy?
>
> But the HealthCheckProgram=/usr/sbin/nhc is executed only when slurmd
> is started, and possibly when a new job is started.

That depends on HealthCheckInterval and HealthCheckNodeState. If
HealthCheckInterval=N with N > 0, the HealthCheckProgram is run every N
seconds, given that the node is in one of the HealthCheckNodeState
states (default: any state).
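
To see which of these settings are in effect on a running cluster, one option is to filter the output of `scontrol show config`. A minimal sketch, assuming the parameters appear there as lines beginning with `HealthCheck`; the function name `healthcheck_settings` is hypothetical, and the slurm.conf values in the comments are illustrative rather than recommendations.

```shell
#!/usr/bin/env bash
# Illustrative slurm.conf lines (example values only):
#   HealthCheckProgram=/usr/sbin/nhc
#   HealthCheckInterval=300
#   HealthCheckNodeState=ANY

healthcheck_settings() {
    # Keep only the HealthCheck* lines from the text on stdin.
    grep -i '^[[:space:]]*HealthCheck'
}
```

Typical use would be `scontrol show config | healthcheck_settings`.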

> I think John asked for a way to run NHC on a set of nodes whenever
> desired by the system administrator, and not at any random time,
> right? ClusterShell is the ideal tool for making such parallel
> commands on the cluster.

Yes, for running manually, setting up the Slurm groups in clush is the
easiest way, IMO.

--
Regards,
Bjørn-Helge Mevik

John Hearns via slurm-users

Aug 18, 2025, 9:17:23 AM
to Bjørn-Helge Mevik, slurm...@schedmd.com
Thank you both. For interest, this is the health check