[slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

Ole Holm Nielsen

Dec 13, 2021, 7:10:30 AM
to Slurm User Community List
Hi Slurm users,

I have updated the "pestat" tool, which prints the status of Slurm nodes
with one line per node, including job information. The download page is
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
(also listed in https://slurm.schedmd.com/download.html).

Improvements:

* The GRES/GPU output option "pestat -G" now prints each job's gres/gpu
information as obtained from squeue's tres-alloc output field, which
should contain the most accurate GRES/GPU information.
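
As a sketch of what this looks like under the hood: squeue's tres-alloc field
reports a comma-separated TRES string, from which the gres/gpu entry can be
pulled out. The sample string below is made up (modelled on the gpu=2 jobs
shown later in this thread), and the awk one-liner is illustrative, not
pestat's actual code:

```shell
# Made-up tres-alloc string (modelled on a 32-CPU, gpu=2 job); extract the
# gres/gpu entry with awk. pestat's real parsing may differ.
printf 'cpu=32,mem=95200M,node=1,billing=32,gres/gpu=2\n' |
awk -F',' '{ for (i = 1; i <= NF; i++)
               if ($i ~ /^gres\/gpu=/) { sub(/^gres\/gpu=/, "", $i); print "gpu=" $i } }'
```

This prints "gpu=2", the same form shown in the Joblist column.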

If you have a cluster with GPUs, could you try out the latest version and
send me any feedback?

Thanks to René Sitt for helpful suggestions and testing.

The pestat tool can print a large variety of node and job information, and
is generally useful for monitoring nodes and jobs on Slurm clusters. For
command options and examples please see the download page. My own
favorite usage is "pestat -F".

Thanks,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

Loris Bennett

Dec 13, 2021, 7:57:06 AM
to Ole.H....@fysik.dtu.dk, Slurm User Community List
Hi Ole,

Ole Holm Nielsen <Ole.H....@fysik.dtu.dk> writes:

> Hi Slurm users,
>
> I have updated the "pestat" tool for printing Slurm nodes status with 1 line per
> node including job info. The download page is
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
> (also listed in https://slurm.schedmd.com/download.html).
>
> Improvements:
>
> * The GRES/GPU output option "pestat -G" now prints the job gres/gpu information
> as obtained from squeue's tres-alloc output option, which should contain the
> most correct GRES/GPU information.
>
> If you have a cluster with GPUs, could you try out the latest version and send
> me any feedback?
>
> Thanks to René Sitt for helpful suggestions and testing.
>
> The pestat tool can print a large variety of node and job information, and is
> generally useful for monitoring nodes and jobs on Slurm clusters. For command
> options and examples please see the download page. My own favorite usage is
> "pestat -F".

Thanks for the update - the GPU information is a good addition.
However, the alignment of the columns with the headers seems a bit off:


$ pestat -p gpu -G
Print only nodes in partition gpu
GRES (Generic Resource) is printed after each jobid
Hostname Partition Node Num_CPU CPUload Memsize Freemem GRES/node Joblist
State Use/Tot (15min) (MB) (MB) JobID(JobArrayID) User GRES/job ...
g001 gpu mix 1 32 0.06* 95200 89990 gpu:gtx1080ti:2(S:0-1) 8692106 joesnow gpu=2
g002 gpu mix 6 32 1.70* 95200 71692 gpu:gtx1080ti:2(S:0-1) 8692181(8536946_566) gailhail gpu=1 8692131(8536946_563) gailhail gpu=1
g003 gpu mix 1 32 0.06* 95200 87622 gpu:gtx1080ti:2(S:0-1) 8692111 joesnow gpu=2
g004 gpu mix 6 32 1.74* 95200 65647 gpu:gtx1080ti:2(S:0-1) 8692124(8536946_562) gailhail gpu=1 8692122(8536946_561) gailhail gpu=1


It looks as if the column 'Partition' needs to be four spaces wider.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris....@fu-berlin.de

Ole Holm Nielsen

Dec 13, 2021, 8:53:50 AM
to Slurm User Community List
Hi Loris,

Thanks for the note. I need to figure out the correct variable-width
printf() options. I'm working on an update...
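
In case it helps: one portable way to get variable-width columns in awk is to
compute the widest value first and build the printf format string from it. A
minimal sketch with made-up data, not pestat's actual code:

```shell
# Minimal variable-width column sketch (illustrative data): find the widest
# hostname, then build the printf format string with that width.
printf 'g001 gpu\nslepner080 main*\n' |
awk '{ host[NR] = $1; part[NR] = $2; if (length($1) > w) w = length($1) }
     END { fmt = "%-" w "s %-10s\n"
           for (i = 1; i <= NR; i++) printf fmt, host[i], part[i] }'
```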

Best regards,
Ole
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: Ole.H....@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

Ole Holm Nielsen

Dec 13, 2021, 9:17:10 AM
to Slurm User Community List
Hi Loris,

I fixed errors in the hostname-length calculation and formatting.
Could you grab the latest pestat and test it?

Thanks,
Ole

Ole Holm Nielsen
PhD, Senior HPC Officer

Loris Bennett

Dec 13, 2021, 9:31:44 AM
to Ole Holm Nielsen, Slurm User Community List
Hi Ole,

The new version looks good to me.

Cheers,

Loris

Ole Holm Nielsen

Dec 14, 2021, 2:35:36 AM
to Slurm User Community List
The latest pestat version now adds a red highlight when a job's GRES/GPU
value is "(null)".

We use this to flag jobs on GPU nodes which didn't request any GPU
resources and may therefore be wasting them.
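
For anyone wondering how the highlight works: the usual mechanism is ANSI
escape codes. A minimal sketch, assuming "(null)" is the literal string
squeue reports for a job with no allocated GRES (pestat's actual logic may
differ):

```shell
# Minimal red-highlight sketch with ANSI escape codes. "(null)" is assumed
# to be the literal string reported for jobs without any allocated GRES.
gres="(null)"
if [ "$gres" = "(null)" ]; then
    printf '\033[1;31m%s\033[0m\n' "$gres"   # bold red
else
    printf '%s\n' "$gres"
fi
```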

Could you test whether this is useful and give me some feedback?

Thanks,
Ole

Loris Bennett

Dec 14, 2021, 8:17:48 AM
to Ole Holm Nielsen, Slurm User Community List
Hi Ole,

Ole Holm Nielsen <Ole.H....@fysik.dtu.dk> writes:

> The latest pestat version now adds a red color highlight if the GRES GPU is the
> (null) value.
>
> We use this to highlight jobs on GPU nodes which didn't request any GPU
> resources, thereby possibly wasting resources.
>
> Could you test if this is useful and give me a feedback?

In job_submit.lua we check whether a job sent to the GPU partition has
actually requested a GPU as a TRES and, if not, reject it. So that kind
of wastage doesn't occur.
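
For others reading along, the check looks roughly like the following sketch
of a job_submit.lua (the partition name "gpu" and the tres_per_node-only test
are simplifying assumptions; a real version should also inspect fields such
as tres_per_job and the older gres field):

```lua
-- Sketch of a job_submit.lua check: reject jobs submitted to the GPU
-- partition that request no GPU TRES. Partition name "gpu" is an
-- assumption; real-world versions need more checks than this.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition == "gpu" then
        local tres = job_desc.tres_per_node or ""
        if not string.find(tres, "gpu") then
            slurm.log_user("Jobs in the gpu partition must request a GPU")
            return slurm.ERROR
        end
    end
    return slurm.SUCCESS
end
```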

However, we do sometimes push non-GPU jobs onto GPU-nodes within a
scavenger partition, so it would be handy if pestat highlighted these.
At the moment, though, there are no such jobs, so I can't test.

It would, however, be good to be able to display the utilisation of the
GPUs from the command line. Some people request GPUs, but their jobs don't
manage to use them very much. At the opposite end of the usage
spectrum, today, via our Zabbix monitoring, I spotted some jobs with
unusually high GPU efficiencies which turned out to be doing
cryptomining :-/

Ole Holm Nielsen

Dec 14, 2021, 8:24:52 AM
to Loris Bennett, Slurm User Community List
Hi Loris,

It would be great if Slurm could read the GPU load using the NVIDIA
monitoring tools and then make the GPU load available through "scontrol
show node xxx". But I don't know whether anyone has asked (and paid)
SchedMD to implement this.
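
In the meantime, the raw numbers are at least available from the NVIDIA tools
themselves. A sketch that parses a made-up sample of nvidia-smi's CSV query
output (so it runs without a GPU; on a real GPU node you would pipe
`nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader` into the
awk instead):

```shell
# nvidia-smi's CSV query mode prints lines like "0, 87 %". The sample values
# here are made up so the sketch runs anywhere; the parsing is the point.
printf '0, 87 %%\n1, 3 %%\n' |
awk -F', ' '{ gsub(/ %/, "", $2); printf "GPU %s: %s%% utilized\n", $1, $2 }'
```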

Best regards,
Ole

Ryan Novosielski

Dec 14, 2021, 4:29:31 PM
to Ole.H....@fysik.dtu.dk, Slurm User Community List
Hi Ole,

Thanks again for your great tools!

Is something expected to have broken this script for older versions of Slurm? A version we have with a file timestamp of 1/19/21 shows job IDs and users for a given node, but the version you released yesterday does not seem to. (We may have skipped versions in between, so the change may not be in this exact release.)

Older:

[root@amarel1 pestat]# ./pestat -F -w slepner080
Print only nodes that are flagged by * (RED nodes)
Select only nodes in hostlist=slepner080
Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
State Use/Tot (MB) (MB) JobId User ...
slepner080 main* mix 22 24 1.07* 128000 116325 17036194 mt1044 17032319 as2654 17039145 vs670

Current:

[root@amarel1 pestat]# ~novosirj/bin/pestat -F -w slepner080
Print only nodes that are flagged by * (RED nodes)
Select only nodes in hostlist=slepner080
Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
State Use/Tot (15min) (MB) (MB) JobID User ...
slepner080 main* mix 22 24 1.07* 128000 116325

You can see that the Joblist and JobID/User columns are not populated.

--
#BlackLivesMatter
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

Ryan Novosielski

Dec 14, 2021, 4:45:27 PM
to Slurm User Community List
Did a git bisect and answered my own question: “yes.”

[novosirj@amarel1 Slurm_tools]$ git bisect good
72cd05d78f1077142143f20c4293c8c367ffb5a7 is the first bad commit
commit 72cd05d78f1077142143f20c4293c8c367ffb5a7
Author: OleHolmNielsen <Ole.H....@fysik.dtu.dk>
Date: Fri Apr 23 15:11:37 2021 +0200

Changes related to "squeue -O". May not work with Slurm 19.05 and older.

:040000 040000 dee11077f72dd898dcadccf9d0dd2cfc438a8d1f 61880fe14a49a7a96167b89d21dede41f2751d86 M pestat
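
As an aside for anyone repeating this kind of search: it can be automated
with `git bisect run`. A self-contained toy demo in a throwaway repo (the
five commits and the "bug" condition are made up):

```shell
# Toy demo of automating a bisect with `git bisect run` in a throwaway repo:
# five commits write 1..5 into a file; the "bug" is defined as the file
# containing a value >= 4, so c4 should be reported as the first bad commit.
dir=$(mktemp -d) && cd "$dir" && git init -q
git config user.email demo@example.com && git config user.name demo
for i in 1 2 3 4 5; do echo "$i" > v; git add v; git commit -qm "c$i"; done
first=$(git log --reverse --format=%H | head -n 1)
git bisect start HEAD "$first"                 # HEAD is bad, first commit good
git bisect run sh -c 'test "$(cat v)" -lt 4'   # exit 0 means "good commit"
git bisect reset
```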

Diego Zuccato

Dec 17, 2021, 3:51:42 AM
to slurm...@lists.schedmd.com
Hi Loris.

On 14/12/2021 14:16, Loris Bennett wrote:

> spectrum, today, via our Zabbix monitoring, I spotted some jobs with an
> unusually high GPU-efficiencies which turned out to be doing
> cryptomining :-/
What are you using to collect data for Zabbix?

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Loris Bennett

Dec 17, 2021, 5:16:26 AM
to Slurm User Community List
Hi Diego,

Diego Zuccato <diego....@unibo.it> writes:

> Hi Loris.
>
> On 14/12/2021 14:16, Loris Bennett wrote:
>
>> spectrum, today, via our Zabbix monitoring, I spotted some jobs with an
>> unusually high GPU-efficiencies which turned out to be doing
>> cryptomining :-/

> What are you using to collect data for Zabbix?

I used this:

https://github.com/plambe/zabbix-nvidia-smi-multi-gpu

Diego Zuccato

Dec 17, 2021, 5:38:58 AM
to Loris Bennett, Slurm User Community List
Thanks.
This will be useful soon :)
Are there other monitoring plugins you'd suggest?
