[slurm-users] monitoring and accounting

LEROY Christine 208562

May 5, 2023, 9:06:26 AM
to Slurm User Community List

Hello Everyone,

We would like to improve our visibility on our cluster usage.

We have Ganglia and currently use sacct, but I was wondering whether there is a recommended web tool that covers both monitoring and accounting (user- and admin-friendly)?

Thanks in advance

Christine

Davide DelVento

May 5, 2023, 9:20:10 AM
to Slurm User Community List
At a place I worked before, we used XDMoD several years ago. It was a bit tricky to set up correctly, and it was not exactly intuitive for users to get started with pulling data out of it (managers, allocation specialists and other not-super-technical people were most of our users). But once people were familiar with it, it worked great.
At the place I work now, monitoring and accounting are low on our priority list, so I haven't touched XDMoD in a while. Hopefully they have since improved its user and administration friendliness while keeping all the great things it could do.

Brian Andrus

May 5, 2023, 10:05:18 AM
to slurm...@lists.schedmd.com

Something I have been impressed with is Netdata.

It is in the standard repositories and will auto-detect quite a lot on a node. It is great for real-time monitoring of a node/job.

I also use Prometheus and Grafana for historic data (anything over 5 minutes).

Brian Andrus

LEROY Christine 208562

Jun 2, 2023, 5:21:02 AM
to Slurm User Community List

Hello,

Thanks for your feedback.

I’ve tried XDMoD, but after a lot of debugging it is still not working, and the support is not very responsive.

Is there any more recent feedback on any accounting tool?

Thanks in advance,

Christine

From: slurm-users <slurm-use...@lists.schedmd.com> On behalf of Davide DelVento
Sent: Friday, May 5, 2023 15:19
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] monitoring and accounting

Jörg Striewski

Jun 2, 2023, 8:02:02 AM
to Slurm User Community List
Hi, we use Grafana with Influx. It is easy to install and works fine.
--
Mit freundlichen Grüßen / kind regards

-- 
Jörg Striewski

Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim Germany
post address: Universitätsplatz 1, D-31141 Hildesheim, Germany
visitor address: Samelsonplatz 1, D-31141 Hildesheim, Germany

Andrew Elwell

Jun 11, 2023, 7:44:07 PM
to Slurm User Community List
On Fri, 2 June 2023, 22:03 Jörg Striewski, <stri...@ismll.de> wrote:
> Hi, we use grafana with influx, it is easy to install and works fine

Hi Jörg,

Are your Slurm-to-Influx scripts publicly available anywhere? I do something similar for squeue, using Python's subprocess module to call

squeue -M all -a -o "%P,%a,%u,%D,%q,%T,%r"

And some sinfo calls for node/cpu usage:

sinfo -M {} -o "%P,%a,%F"
sinfo -M {} -o "%%R,%a,%C,%B,%z"

But I'd be interested to see what other places do. Perhaps some examples could be gathered for Ole's wiki?
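
A minimal sketch of how the squeue call above might be wrapped and parsed in Python. This is an illustration rather than the script described in the post: the function names are hypothetical, and the `-h` (no header) flag is an addition. Lines that do not split into the seven expected fields, such as the `CLUSTER:` banners printed with `-M` or jobs listing several comma-separated partitions, are simply skipped.

```python
import subprocess
from collections import Counter

# Field order follows the format string from the post:
# partition, account, user, node count, QOS, state, reason.
SQUEUE_FORMAT = "%P,%a,%u,%D,%q,%T,%r"
FIELDS = ("partition", "account", "user", "nodes", "qos", "state", "reason")


def run_squeue():
    """Run squeue as in the post; -h (no header) is added in this sketch."""
    cmd = ["squeue", "-M", "all", "-a", "-h", "-o", SQUEUE_FORMAT]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_squeue(text):
    """Turn squeue output into a list of dicts, skipping lines that do not fit."""
    rows = []
    for line in text.splitlines():
        # The reason is the last field, so split at most len(FIELDS) - 1 times.
        parts = line.strip().split(",", len(FIELDS) - 1)
        # Skip banners, headers, and anything whose node count is not a number.
        if len(parts) != len(FIELDS) or not parts[3].isdigit():
            continue
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def jobs_per_state(rows):
    """Count jobs per (partition, state) pair."""
    counts = Counter()
    for row in rows:
        counts[(row["partition"], row["state"])] += 1
    return counts
```

On a machine with Slurm installed, `jobs_per_state(parse_squeue(run_squeue()))` would give the per-partition job counts; keeping the parsing separate from the subprocess call means it can be tested against saved output.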

Andrew

Ole Holm Nielsen

Jun 12, 2023, 3:21:15 AM
to slurm...@lists.schedmd.com
Hi Andrew,

On 6/12/23 01:43, Andrew Elwell wrote:
> Are your slurm to influx scripts publicly available anywhere? I do
> something similar for squeue via python subprocess to call
>
> squeue -M all -a -o "%P,%a,%u,%D,%q,%T,%r"
>
> And some sinfo calls for node/cpu usage:
>
> sinfo -M {} -o "%P,%a,%F"
> sinfo -M {} -o "%%R,%a,%C,%B,%z"
>
> But I'd be interested to see what other places do. Perhaps some examples
> could be gathered for Ole's wiki?

I'd be happy to copy examples and links to documentation to the Wiki. I
guess this would be the best place?

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#other-accounting-report-tools

/Ole

Reed Dier

Jun 12, 2023, 10:43:43 AM
to Slurm User Community List
Hey Andrew,

I don’t have any specific examples I can share right this second, though I’ll look into making them shareable. My solution was to throw some basic bash scripts into cron to scrape the data and ship it into Influx.

I have one script that looks at sinfo, parsing out the A/I/O/T (allocated/idle/other/total) state for nodes and CPUs, plus a very ugly, hacky sed/cut/awk pipeline to scrape GPU usage. It also calls squeue to count jobs per state. Both are broken down per partition and cluster.
I have another script that does basic sreport parsing for the TRES/GRES I care about, so that I can get a rough bird's-eye trend of utilization over time.
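
As an illustration of the sinfo part (in Python rather than the bash described above), here is a rough sketch that parses A/I/O/T CPU counts and formats them as InfluxDB line protocol. It assumes the input comes from something like `sinfo -h -o "%R,%C"` (partition name, CPUs as allocated/idle/other/total). The measurement name `slurm_cpus` and all function names are made up for the example, and actually writing the lines to Influx is not shown.

```python
import time


def parse_aiot(value):
    """Split an 'allocated/idle/other/total' string such as '12/3/1/16'."""
    allocated, idle, other, total = (int(x) for x in value.split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


def escape_tag(value):
    """Escape commas, equals signs and spaces in line-protocol tag values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point; integer fields carry the 'i' suffix."""
    tag_str = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}i" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"


def sinfo_to_lines(text, cluster, timestamp_ns=None):
    """Convert 'partition,A/I/O/T' lines into line-protocol strings."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    lines = []
    for raw in text.splitlines():
        partition, _, cpus = raw.strip().partition(",")
        # Skip anything that is not four slash-separated integers.
        if cpus.count("/") != 3 or not cpus.replace("/", "").isdigit():
            continue
        tags = {"cluster": cluster, "partition": partition}
        lines.append(
            to_line_protocol("slurm_cpus", tags, parse_aiot(cpus), timestamp_ns)
        )
    return lines
```

For the input line `work,12/3/1/16` on a cluster called `alpha`, this yields a point of the form `slurm_cpus,cluster=alpha,partition=work allocated=12i,idle=3i,other=1i,total=16i <timestamp>`.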

There’s likely to be something far, far better for this, but it was a quick and dirty solution to get something visible with existing tooling (Grafana/influx).

Reed

Josef Dvoracek

Jun 12, 2023, 11:20:47 AM
to slurm...@lists.schedmd.com
> But I'd be interested to see what other places do.

We installed this: https://github.com/vpenso/prometheus-slurm-exporter

We scrape this exporter with Telegraf's "inputs.prometheus" input plugin;
the data is sent to Influx and displayed in Grafana.
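
For a quick manual check of what the exporter serves, a small Python sketch like the following could read the Prometheus text format directly. This is not part of the Telegraf setup above, which does the scraping itself. The function names are hypothetical, the URL has to be supplied by the caller, and the metric names in any sample should be checked against the exporter's own documentation.

```python
import urllib.request


def fetch_metrics(url):
    """Download the raw metrics text from an exporter endpoint."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Map each series (name plus optional labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comments.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so cut after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if not rest:
            continue
        try:
            # rest[0] is the value; an optional timestamp may follow it.
            samples[series] = float(rest[0])
        except ValueError:
            continue
    return samples
```

Calling `parse_metrics(fetch_metrics(url))` with the exporter's metrics URL returns a dictionary that can be inspected before wiring the endpoint into Telegraf.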

--

josef

On 12. 06. 23 1:43, Andrew Elwell wrote:
...

Brian Andrus

Jun 12, 2023, 2:48:02 PM
to slurm...@lists.schedmd.com
Second that.

Prometheus+slurm exporter+grafana works great.

Brian Andrus