[slurm-users] Graphing job metrics


Nicholas McCollum

Nov 13, 2017, 12:18:36 PM
to slurm...@lists.schedmd.com
Now that there is a slurm-users mailing list, I thought I would share
something I have been working on, to see if anyone else is interested. I have
a lot of students on my cluster, and I really wanted a way to show my users
how efficient their jobs are, or to let them know when they are wasting
resources.

I created a few scripts that leverage Graphite and Whisper databases (RRD-like)
to gather metrics from Slurm jobs running in cgroups. The resolution of the
metrics is defined by the retention interval you specify in Graphite. In
my case I can store one-minute CPU and memory usage metrics for the
entire lifetime of a job.
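For anyone curious what such a collector might look like, here is a minimal sketch (not the actual scripts: the cgroup paths, metric names, and hostname below are assumptions, and the layout shown is the cgroup v1 hierarchy slurmd typically creates). It reads a job's cpuacct and memory cgroups and prints lines in Graphite's plaintext protocol, which you would pipe to the Carbon port:

```shell
#!/bin/sh
# Hypothetical sketch: emit Graphite plaintext-protocol lines for one
# Slurm job's cgroup. Adjust the paths to match your system.

graphite_line() {
    # Graphite plaintext protocol: "metric.path value unix_timestamp"
    printf '%s %s %s\n' "$1" "$2" "$3"
}

collect_job() {
    jobid="$1"
    now=$(date +%s)
    # Cumulative CPU time (ns) and current memory usage (bytes)
    for f in /sys/fs/cgroup/cpuacct/slurm/uid_*/job_"$jobid"/cpuacct.usage; do
        if [ -r "$f" ]; then
            graphite_line "jobs.$jobid.cpu_ns" "$(cat "$f")" "$now"
        fi
    done
    for f in /sys/fs/cgroup/memory/slurm/uid_*/job_"$jobid"/memory.usage_in_bytes; do
        if [ -r "$f" ]; then
            graphite_line "jobs.$jobid.mem_bytes" "$(cat "$f")" "$now"
        fi
    done
}

# Usage: collect_job 12345 | nc graphite.example.com 2003
```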

From these databases, I have written scripts that can notify me when a user's
job is wasting resources, for example requesting 64 cores when the application
only scales to 8.
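The core of such a check can be quite small. This is a hypothetical illustration (the half-allocation threshold is made up, not the actual rule): divide cumulative CPU time from cpuacct.usage by walltime to estimate how many cores were actually busy, and compare against the allocation.

```shell
#!/bin/sh
# Hypothetical waste check; the threshold (half the allocation) is an
# assumption for illustration.

flag_waste() {
    # $1 = cumulative CPU time in ns (from cpuacct.usage)
    # $2 = elapsed walltime in seconds
    # $3 = cores allocated
    used_cores=$(( $1 / ($2 * 1000000000) ))
    if [ "$used_cores" -lt $(( $3 / 2 )) ]; then
        echo "WASTE: using ~$used_cores of $3 cores"
    else
        echo "OK"
    fi
}
```

For example, a job allocated 64 cores that accumulated 8000 core-seconds of CPU over 1000 seconds of walltime is only keeping about 8 cores busy and would be flagged.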

I have also created a script that lets a user query a Grafana instance with
cURL to generate graphs of their job metrics.
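One way such a script could work is via Grafana's render API. This sketch is an assumption about the setup (the dashboard slug "jobs", the `var-jobid` template variable, and the panel id are all hypothetical names, not taken from the actual scripts):

```shell
#!/bin/sh
# Hypothetical helper: build a Grafana render-API URL for a single
# dashboard panel, parameterized by job id.

grafana_render_url() {
    # $1 = Grafana base URL, $2 = job id
    echo "$1/render/dashboard-solo/db/jobs?var-jobid=$2&panelId=1&width=800&height=400"
}

# Usage:
#   curl -s "$(grafana_render_url http://grafana.example.com:3000 12345)" -o job-12345.png
```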

If anyone is interested, I wrote up a quick post at:
https://xathor.blogspot.com/2017/11/graphing-slurm-cgroup-job-metrics.html

If there's interest I would be more than happy to polish the code a little and
share it on GitHub.

I am also at SC17 if anyone wants to meet up and check it out in person.

Thanks!


---

Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

Chris Samuel

Nov 14, 2017, 5:59:48 AM
to Slurm User Community List
On Tuesday, 14 November 2017 4:18:08 AM AEDT Nicholas McCollum wrote:

> If there's interest I would be more than happy to polish the code a little
> and share it on github.

Yup, certainly interest here!

--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


Rajiv Nishtala

Nov 14, 2017, 6:05:23 AM
to Slurm User Community List, Chris Samuel
Agree with Chris. Please do share the code with the community.

Simon Flood

Nov 14, 2017, 6:13:09 AM
to slurm...@lists.schedmd.com
On 14/11/17 10:58, Chris Samuel wrote:

> Yup, certainly interest here!

Ditto.
--
Simon Flood
HPC System Administrator
University of Cambridge Information Services
United Kingdom

Rémi Palancher

Nov 14, 2017, 7:35:40 AM
to slurm...@lists.schedmd.com
Hi there,

On 13/11/2017 at 18:18, Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it. I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
>
> I created a few scripts that leverage Graphite and whisper databases (RRD like)
> to gather metrics from Slurm jobs running in cgroups. The resolution for the
> metrics is defined by the retention interval that you specify in graphite. In
> my case I can store 1 minute metrics for CPU usage and Memory usage for the
> entire lifetime of a job.

FWIW, at EDF we wrote a collectd[1] plugin some time ago that does
basically the same thing, i.e. walking the cgroups to get CPU/memory
metrics out of jobs' processes. The code is here:

https://github.com/collectd/collectd/pull/1198

You then gain all of collectd's flexibility in terms of metrics processing
and backends (Graphite, RRD, InfluxDB, and so on).
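As an illustration of that flexibility, routing whatever the plugin collects to Graphite is just a matter of collectd configuration, along these lines (the host name is a placeholder):

```
# collectd.conf fragment: send all collected metrics to a Carbon listener.
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "example">
    Host "graphite.example.com"
    Port "2003"
    Protocol "tcp"
    Prefix "collectd."
  </Node>
</Plugin>
```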

We also wrote a tiny web interface to visualize the metrics. You can
find out more by searching for 'jobmetrics' in the following slides:

https://slurm.schedmd.com/SLUG16/EDF.pdf

NB: my intent is just to share, not to hijack the thread. Please forgive
me if it comes across the wrong way.

Best,
Rémi

[1] https://collectd.org/

Nicholas McCollum

Nov 14, 2017, 12:11:28 PM
to slurm...@lists.schedmd.com
All,

I went to the SchedMD booth last night and talked with the team. Tim told me
that the Barcelona Supercomputing Center is working on something similar, so
I am going to try to meet with their Slurm person and compare notes.

I'm also going to look into InfluxDB instead of Graphite, at the
recommendation of several people, for better performance when querying
hundreds of jobs at the same time.

If anyone wants a specific time to meet, just e-mail me directly. I will be at
the SC17 convention center all week.

---

Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

Carlos Fenoy

Nov 15, 2017, 12:14:26 PM
to Slurm User Community List
Hi,

I developed a plugin around 1.5 years ago that uses Slurm's profiling feature to collect resource-usage information and send it to InfluxDB. It is not yet merged into an official Slurm release, but it may be in the next 18.x release. If you want to test it, there is a branch in the SchedMD GitHub repo (https://github.com/SchedMD/slurm/tree/influxdb).

We've had this running since I created it on some mid-sized clusters with tens of thousands of jobs per day without an issue. We have a retention policy of 7 days in InfluxDB to avoid collecting too much data. We then provide a Grafana dashboard where users can filter by job ID to see the CPU and memory usage of their jobs.
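For reference, a 7-day retention policy like the one described takes a single InfluxQL statement (the database name "slurm" here is a placeholder, not necessarily what we use):

```sql
-- Keep 7 days of samples in the "slurm" database, then expire them.
CREATE RETENTION POLICY "seven_days" ON "slurm" DURATION 7d REPLICATION 1 DEFAULT
```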

If you need more details, I'll be glad to answer your questions.

Regards,
Carlos
--
Carles Fenoy

Nicholas McCollum

Nov 15, 2017, 1:07:11 PM
to Slurm User Community List
I've been tracking down people at SC17 and talking with them about graphing user jobs. There's a definite consensus that I should be using InfluxDB to store the data. After SC17 I'm going to rebuild my setup and write a better how-to.

The advantage of my current setup is that the only requirement is running Slurm with cgroups.

The better and more scalable solution is to have it written in C and managed by the slurmd process on the nodes themselves.

I think I may provide a Dockerfile later that will spin everything up automatically.  Then the only requirement is a crontab entry to run a shell script on your nodes to push data to your Docker instance.  
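That crontab entry would be the only node-side piece, something like the following (the script name and path are hypothetical):

```
# Push this node's cgroup job metrics to the collector once a minute.
* * * * * /usr/local/sbin/push_job_metrics.sh >/dev/null 2>&1
```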

Carlos, I'd definitely like to take a look at your setup, especially if you can segregate users so they cannot see another user's job metrics.

Nick McCollum


From: Carlos Fenoy <min...@gmail.com>
Sent: Nov 15, 2017 10:15 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Graphing job metrics

Markus Köberl

Jan 5, 2018, 10:08:17 AM
to slurm...@lists.schedmd.com
netdata (https://github.com/firehol/netdata) also provides this kind of
information, collected from cgroups in real time with a one-hour history. It
can be configured to use back-ends to archive the metrics.
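For example, archiving to Graphite takes only a small fragment in netdata.conf (destination is a placeholder; this is the backend section as it existed in netdata versions of that era):

```
# netdata.conf: stream collected metrics to a Graphite/Carbon backend.
[backend]
    enabled = yes
    type = graphite
    destination = graphite.example.com:2003
    update every = 10
```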


regards
Markus Köberl
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus....@tugraz.at
