Hello,
I am aware of ‘seff’, which lets you check the efficiency of a single job. That is useful for users, but as a cluster administrator I would like to be able to track the efficiency of all jobs from all users on the cluster, so that I can ‘re-educate’ users who are running jobs with terrible resource usage efficiency.
What do other cluster administrators use for this task? Is there anything you use and recommend (or don’t recommend), or have heard of, that can do this? Even something like a Grafana dashboard that hooks into the SLURM database would be of interest.
Thank you,
Will.
We are feeding job usage information into a Prometheus database for our users (and us) to look at (via Grafana).
It is also possible to get a list of jobs that are under-using memory, GPU or whatever metric you feed into the database.
It’s a live feed with ~30s resolution from both compute jobs and Lustre file system.
It’s easy to extend with more metrics.
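For a rough idea of the pattern (this is only an illustrative sketch, not the code we actually run; it assumes the prometheus_client Python library and stock squeue/sstat, and the metric name and 30-second polling loop are made up for the example):

#!/usr/bin/env python3
# Rough sketch of a Slurm-to-Prometheus job exporter (illustrative only).
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Peak resident memory per running job, labelled by job id and user.
JOB_MAX_RSS = Gauge('slurm_job_max_rss_kib',
                    'Peak resident set size over all running steps (KiB)',
                    ['jobid', 'user'])

def to_kib(rss):
    """Convert an sstat MaxRSS string such as '123456K' or '2.5G' to KiB."""
    units = {'K': 1, 'M': 1024, 'G': 1024**2, 'T': 1024**3}
    if rss and rss[-1] in units:
        return float(rss[:-1]) * units[rss[-1]]
    return float(rss or 0)

def sample_jobs():
    """List running jobs with squeue, then ask sstat for their peak RSS."""
    jobs = subprocess.run(['squeue', '-t', 'R', '-h', '-o', '%A|%u'],
                          capture_output=True, text=True, check=True).stdout
    for line in jobs.splitlines():
        jobid, user = line.split('|')
        steps = subprocess.run(['sstat', '-a', '-n', '-P', '-j', jobid,
                                '--format=MaxRSS'],
                               capture_output=True, text=True).stdout
        rss = max((to_kib(s) for s in steps.split()), default=0)
        JOB_MAX_RSS.labels(jobid, user).set(rss)

if __name__ == '__main__':
    start_http_server(9100)   # Prometheus scrapes this endpoint
    while True:
        sample_jobs()
        time.sleep(30)        # roughly the ~30s resolution mentioned above

Once the samples are in Prometheus, the "jobs under-using memory" list is just a Grafana query comparing requested against used values.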
If you want to know more about what we are doing, just send me an email and I can give you more details.
/Magnus
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
Hi Magnus,
That does sound like an interesting solution – yes, would you be able to send me (or the list, if you’re willing to share it) some more information, please?
And thank you to everyone else who has replied to my email – there are definitely a few solutions I need to look into here!
Thanks!
Will
We don't have anything structured in place currently to analyse Slurm data; we have lots of Grafana system-level metrics, but nothing that looks at trends in useful metrics like:
- Size (and age) of jobs sitting in the pending state (see the squeue sketch after this list)
- Average runtime of jobs
- Plotting workload sizing information such as cores/job and memory/core so that we can understand how our users are utilising the service
- Demand (and utilisation) of particular partitions
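The pending-backlog numbers in the first item, for instance, can be pulled straight from squeue. A rough sketch (illustrative only; it assumes the default squeue timestamp format):

#!/usr/bin/env python3
# Illustrative sketch: summarise the pending backlog straight from squeue.
import subprocess
from datetime import datetime

# %C = CPUs requested, %V = submission time (default ISO-like format assumed)
out = subprocess.run(['squeue', '-t', 'PD', '-h', '-o', '%C|%V'],
                     capture_output=True, text=True, check=True).stdout
now = datetime.now()
cores, ages = [], []
for line in out.splitlines():
    ncpus, submit = line.split('|')
    cores.append(int(ncpus))
    ages.append((now - datetime.strptime(submit, '%Y-%m-%dT%H:%M:%S')).total_seconds())

print(f'{len(cores)} pending jobs, {sum(cores)} cores requested')
if ages:
    print(f'oldest pending job has waited {max(ages) / 3600:.1f} hours')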
I couldn't find anything that was exactly what we wanted, so I spent a couple of afternoons last week putting something together in Python to wrap around sacct / sinfo output.
So far I've got reports for what is happening 'now', as well as summaries for the following periods:
- 24 hours
- 7 days
- 30 days
- 1 year
Data is analysed based on jobs running/pending/completed/failed during windows in time and summarised in terms of sample periods per day (a 24-hour report having the finest sampling resolution of 6x 10-minute windows per hour). The output of each sample period is stored as a persistent JSON object on the filesystem in case the same report is run again, or that period is included as part of a larger analysis window.
I output to flat HTML files using the Jinja2 templating module and visualise data using the ubiquitous Highcharts and DataTables JavaScript libraries.
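To illustrate the sample-period / cached-JSON idea (a sketch only; the cache path, field list and summary contents here are placeholders, not the actual report code):

#!/usr/bin/env python3
# Sketch of caching one analysed sample period as a persistent JSON object.
import json
import subprocess
from pathlib import Path

CACHE = Path('/var/cache/slurm-reports')   # hypothetical cache location

def summarise_window(start, end):
    """Summarise jobs seen in [start, end); reuse a cached JSON blob if present."""
    CACHE.mkdir(parents=True, exist_ok=True)
    blob = CACHE / f'{start}_{end}.json'
    if blob.exists():                       # same window already analysed earlier
        return json.loads(blob.read_text())

    out = subprocess.run(
        ['sacct', '--allusers', '--noheader', '--parsable2',
         f'--starttime={start}', f'--endtime={end}',
         '--format=JobID,State,AllocCPUS,Elapsed'],
        capture_output=True, text=True, check=True).stdout

    jobs = [line.split('|') for line in out.splitlines()
            if '.' not in line.split('|')[0]]      # keep whole jobs, drop steps
    summary = {
        'jobs': len(jobs),
        'not_completed': sum(1 for j in jobs if j[1] != 'COMPLETED'),
        'alloc_cpus': [int(j[2]) for j in jobs],
    }
    blob.write_text(json.dumps(summary))    # persist so larger reports can reuse it
    return summary

# e.g. one 10-minute window of a 24-hour report:
print(summarise_window('2024-05-01T09:00:00', '2024-05-01T09:10:00'))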
In our case we're more interested in things like:
- Min/Max/Median cores/job, plus the lowest value which would satisfy X% of all jobs (essentially a percentile; see the sketch after this list)
- Min/Max/Median memory/core, plus the lowest value which would satisfy X% of all jobs
- Min/Max/Median nodes/job, plus the lowest value which would satisfy X% of all jobs
- Backlog of jobs waiting in the pending state
- Percentage of jobs that 'fail' (end up in some state other than completed)
- Scatter chart of cores/job against memory/core (i.e. what is the bulk of our user workload: parallel/serial, low memory/high memory?)
i.e. data points which will be useful in our sizing decisions for a replacement platform, both in terms of hardware and partition definitions.
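The "lowest value which would satisfy X% of all jobs" figures are effectively percentiles over the per-job samples collected in the report windows. A trivial sketch (the sample data is made up):

# Illustrative only: percentile of per-job core counts.
import math
import statistics

def satisfies(values, pct):
    """Smallest value that would have been enough for pct% of the jobs."""
    ranked = sorted(values)
    index = max(math.ceil(len(ranked) * pct / 100) - 1, 0)
    return ranked[index]

cores_per_job = [1, 1, 2, 4, 4, 8, 16, 28, 28, 128]   # made-up sample
print(statistics.median(cores_per_job))                # median cores/job
print(satisfies(cores_per_job, 90))                    # 90% of jobs fit in this many cores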
When it's at a point where it's usable, I'm sure that we can share the code. It's pretty much self-contained; the only dependencies are a Slurm installation and Python 3 - no web components needed (unless you want to serve the generated reports to users, of course).
John Snowdon
Advanced Computing Consultant
Newcastle University IT Service
The Elizabeth Barraclough Building
91 Sandyford Road
Newcastle upon Tyne,
NE1 8HW