[slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)


Will Furnell - STFC UKRI

Jul 24, 2023, 10:38:33 AM
to slurm...@schedmd.com

Hello,

 

I am aware of ‘seff’, which lets you check the efficiency of a single job and is good for users, but as a cluster administrator I would like to be able to track the efficiency of all jobs from all users on the cluster, so that I can ‘re-educate’ users whose jobs have terrible resource usage efficiency.

 

What do other cluster administrators use for this task? Is there anything you use and recommend (or don’t recommend), or have heard of, that is able to do this? Even something like a Grafana dashboard that hooks up to the SLURM database would be of interest.
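For concreteness, the crude version of what I am after would be something like looping ‘seff’ over job IDs pulled from sacct (purely illustrative; the date range and flags are just an example), but I am hoping there is something more structured than this:

    #!/usr/bin/env python3
    # Rough illustration only: run seff over all jobs that started since a given date.
    # Assumes sacct and seff are on PATH; adjust the start date to taste.
    import subprocess

    jobids = subprocess.run(
        ["sacct", "-a", "-X", "-n", "-P", "-S", "2023-07-01", "-o", "JobID"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    for jobid in jobids:
        report = subprocess.run(["seff", jobid], capture_output=True, text=True).stdout
        print(report)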

 

Thank you,

 

Will.

Matthew Brown

Jul 24, 2023, 4:26:03 PM
to Slurm User Community List, slurm...@schedmd.com
I use seff all the time as a first order approximation. It's a good hint at what's going on with a job but doesn't give much detail. 

We are in the process of integrating the Supremm node utilization capture tool with our clusters and with our local XDMOD installation. Plain old XDMOD can ingest the Slurm logs and give you some great information on utilization, but generally has more of a high-level or summary perspective on stats. To help users see their personal job efficiency, you really need to give them time-series data, and we're expecting to get that from the Supremm components.

The other angle, which I've recently asked our eng/admin team to try to implement on our newest cluster (yet to be released), is to turn on the bits that Slurm has built in for job profiling. With this properly configured, users can turn on job profiling with a Slurm job option and it will produce that time-series data. Look for the AcctGatherProfileType config options for slurm.conf.
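As a rough, untested sketch of what that involves (option names are from the slurm.conf and acct_gather.conf man pages; the paths and frequencies below are placeholders, so check them against your Slurm version):

    # slurm.conf
    AcctGatherProfileType=acct_gather_profile/hdf5
    JobAcctGatherFrequency=task=30

    # acct_gather.conf
    ProfileHDF5Dir=/var/spool/slurm/profile

Users would then request profiling per job with something like 'sbatch --profile=task jobscript.sh', and the resulting per-node HDF5 files can be merged and inspected with sh5util afterwards.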

Best,

Matt

Matthew Brown
Computational Scientist
Advanced Research Computing
Virginia Tech

Magnus Jonsson

Jul 24, 2023, 4:56:03 PM
to Slurm User Community List

We are feeding job usage information into a Prometheus database for our users (and us) to look at (via Grafana).

It is also possible to get a list of jobs that are under-using memory, GPU or whatever metric you feed into the database.

 

It’s a live feed with ~30s resolution from both compute jobs and the Lustre file system.

It’s easy to extend with more metrics.
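To give a flavour of the general shape (this is not our actual code; the metric name, port and fields below are arbitrary choices), an exporter can be as small as sampling running jobs with sstat and exposing the values for Prometheus to scrape:

    #!/usr/bin/env python3
    # Illustrative sketch: expose per-job average RSS as a Prometheus metric.
    # Not production code; only handles AveRSS values reported in KiB ("...K").
    import subprocess, time
    from prometheus_client import Gauge, start_http_server

    job_rss = Gauge("slurm_job_ave_rss_bytes", "Average RSS per job", ["jobid"])

    def running_jobs():
        out = subprocess.run(["squeue", "-h", "-t", "R", "-o", "%A"],
                             capture_output=True, text=True).stdout
        return out.split()

    def sample(jobid):
        out = subprocess.run(["sstat", "-n", "-P", "-j", jobid, "-o", "JobID,AveRSS"],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            _step, averss = line.split("|")
            if averss.endswith("K"):
                job_rss.labels(jobid=jobid).set(float(averss[:-1]) * 1024)

    if __name__ == "__main__":
        start_http_server(9101)      # port is an arbitrary example
        while True:
            for jobid in running_jobs():
                sample(jobid)
            time.sleep(30)           # roughly the ~30s resolution mentioned above

A Grafana panel (or a recording rule) can then compare a metric like this against the job's allocation to flag under-using jobs.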

 

If you want to know more about what we are doing, just send me an email and I can give you more details.

 

/Magnus

 

--

Magnus Jonsson, Developer, HPC2N, Umeå Universitet

By sending an email to Umeå University, the University will need to

process your personal data. For more information, please read www.umu.se/en/gdpr

Davide DelVento

Jul 24, 2023, 10:45:58 PM
to Slurm User Community List
I run a cluster we bought from ACT and recently updated to ClusterVisor v1.0.

The new version has (among many things) a really nice view of individual jobs' resource utilization (GPUs, memory, CPU, temperature, etc.). I did not pay attention to the overall statistics, so I am not sure how CV fares there -- because I care only about individual jobs (I work with individual users, and don't care about overall utilization, which is info for the upper management). At the moment only admins can see the info, but my understanding is that they are considering making it a user-space feature, which will be really slick.

Several years ago I used XDMOD and Supremm, and it was more confusing to use and had trouble collecting all the data we needed (which the team blamed on some BIOS settings), so the view was incomplete. Also, the tool seemed to focus more on the overall stats than on per-job info (both were available, but the emphasis seemed to be on the former). I am sure these tools have improved since then, so I'm not dismissing them, just giving my opinion based on old facts. Comparing that old version of XDMOD to current CV (unfair, I know, but that's the comparison I've got), the latter wins hands down for per-job information. Also probably unfair is that XDMOD and Supremm are free and open source whereas CV is proprietary.

Tina Friedrich

Jul 26, 2023, 12:41:51 PM
to slurm...@lists.schedmd.com
Hi Will,

I don't, currently, although it's on my list.

However, we had a presentation at a recent Oxford HPC-SIG meeting from a colleague who implemented a simple job profiler that, in a nutshell, saves a lot of job data (including efficiency) and creates plots of the efficiency of the job run. We all thought it sounded interesting :)

Code is here: https://github.com/OxfordCBRG/sps

(it's a SPANK plugin, I believe)

Tina

Will Furnell - STFC UKRI

Jul 27, 2023, 6:36:14 AM
to slurm...@lists.schedmd.com

Hi Magnus,

 

That does sound like an interesting solution – yes please, would you be able to send me (or the list, if you’re willing to share it) some more information?

 

And thank you to everyone else who has replied to my email – there are definitely a few solutions I need to look into here!

 

Thanks!

 

Will

Angel de Vicente

Sep 7, 2023, 2:16:16 PM
to Will Furnell - STFC UKRI, slurm...@lists.schedmd.com
Hi Will,
we also use 'seff', but it gives reliable stats only for jobs that finished properly (i.e. COMPLETED). In our case, we would also need to collect efficiency stats for jobs that TIMEOUT, and even for those that are CANCELLED.

Do you happen to know of some way to accomplish this?
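(The closest fallback I can think of, sketched below and not checked against seff's own numbers, is computing CPU efficiency directly from sacct fields, which are populated for TIMEOUT and CANCELLED jobs as well; a standard tool would still be nicer.)

    #!/usr/bin/env python3
    # Rough sketch: CPU efficiency for all jobs in a window, regardless of final state.
    # Efficiency here is TotalCPU / (Elapsed * AllocCPUS); the start date is an example.
    import subprocess

    def to_seconds(t):
        """Parse Slurm time strings like [DD-][HH:]MM:SS[.sss] into seconds."""
        if not t:
            return 0.0
        days = 0
        if "-" in t:
            d, t = t.split("-", 1)
            days = int(d)
        parts = [float(x) for x in t.split(":")]
        while len(parts) < 3:
            parts.insert(0, 0.0)     # pad missing hours/minutes on the left
        h, m, s = parts
        return days * 86400 + h * 3600 + m * 60 + s

    out = subprocess.run(
        ["sacct", "-a", "-X", "-n", "-P", "-S", "2023-09-01",
         "-o", "JobID,State,AllocCPUS,Elapsed,TotalCPU"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in out.splitlines():
        jobid, state, cpus, elapsed, totalcpu = line.split("|")
        wall = to_seconds(elapsed) * int(cpus or 0)
        if wall > 0:
            print(f"{jobid:<12} {state:<12} {100 * to_seconds(totalcpu) / wall:5.1f}% CPU")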

Many thanks,
--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.: http://research.iac.es/proyecto/polmag/

GPG: 0x8BDC390B69033F52

John Snowdon

Sep 8, 2023, 4:45:35 AM
to Slurm User Community List
I've been needing to do this as part of some analysis work we are undertaking to determine requirements for a replacement system.

We don't currently have anything structured in place to analyse Slurm data; lots of Grafana system-level metrics, but nothing that looks at trends in useful metrics like:

- Size (and age) of jobs sitting in the pending state
- Average runtime of jobs
- Plotting workload sizing information such as cores/job and memory/core so that we can understand how our users are utilising the service
- Demand (and utilisation) of particular partitions

I couldn't find anything that was exactly what we wanted, so I spent a couple of afternoons last week putting something together in Python to wrap around sacct / sinfo output.

So far I've got reports for what is happening 'now', as well as summaries for the following periods:

- 24 hours
- 7 days
- 30 days
- 1 year

Data is analysed based on jobs running/pending/completed/failed during windows in time and summarised in terms of sample periods per day (a 24-hour report having the finest sampling resolution of six 10-minute windows per hour). The output of each sample period is stored as a persistent JSON object on the filesystem, in case the same report is run again or that period is included as part of a larger analysis window.

I output to flat HTML files using the Jinja2 templating module and visualise data using the ubiquitous Highcharts and DataTables JavaScript libraries.

In our case we're more interested in things like:

- Min/Max/Median cores/job, plus lowest average value which would satisfy X% of all jobs
- Min/Max/Median memory/core, plus lowest average value which would satisfy X% of all jobs
- Min/Max/Median nodes/job, plus lowest average value which would satisfy X% of all jobs
- Backlog of jobs waiting in pending state
- Percentage of jobs that 'fail' (end up in some state other than completed)
- Scatter chart of cores/job to memory/core (i.e. what is the bulk of our user workload; parallel/serial, low memory/high memory?)

i.e. data points which will be useful in our sizing decisions for a replacement platform, both in terms of hardware and of partition definitions.

When it's at a point where it is usable, I'm sure that we can share the code. It's pretty much self-contained, the only dependencies being Slurm and Python 3 - no web components needed (unless you want to serve the generated reports to users, of course).
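To give a flavour of the approach (this is not the real code, and the fields, date and percentile are just examples), the heart of it is little more than parsing parsable sacct output and computing percentiles over it:

    #!/usr/bin/env python3
    # Illustrative sketch: "lowest cores/job that would satisfy X% of jobs" from sacct.
    import subprocess, statistics

    out = subprocess.run(
        ["sacct", "-a", "-X", "-n", "-P", "-S", "2023-08-01",
         "-o", "JobID,AllocCPUS,State"],
        capture_output=True, text=True, check=True,
    ).stdout

    cores, failed, total = [], 0, 0
    for line in out.splitlines():
        _jobid, cpus, state = line.split("|")
        total += 1
        if cpus:
            cores.append(int(cpus))
        if not state.startswith("COMPLETED"):
            failed += 1          # 'failed' = any state other than COMPLETED

    if cores:
        cores.sort()
        p95 = cores[int(0.95 * (len(cores) - 1))]   # crude percentile, fine for a sketch
        print(f"jobs: {total}  median cores/job: {statistics.median(cores)}")
        print(f"95% of jobs would fit within {p95} cores/job")
        print(f"jobs that 'failed': {100 * failed / total:.1f}%")

The real thing layers the time-window sampling, JSON caching and Jinja2/Highcharts reporting described above on top of queries like this.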

John Snowdon
Advanced Computing Consultant

Newcastle University IT Service
The Elizabeth Barraclough Building
91 Sandyford Road
Newcastle upon Tyne, 
NE1 8HW

Styrk, Daryl

Sep 8, 2023, 1:13:10 PM
to Slurm User Community List
John,

I'm interested to see where this goes.

Good luck.

Daryl

Jason Simms

Sep 8, 2023, 1:52:14 PM
to Slurm User Community List
Hello John,

I also am keen to follow your progress, as this is something we would find extremely useful as well.

Regards,
Jason
--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
Schedule a meeting: https://calendly.com/jlsimms