Hi,
We have had similar questions from users about how best to find the peak memory usage of a job: they may run a job and get a value of limited use from sacct fields such as MaxRSS, because Slurm's polling may simply not coincide with the moment of maximum memory usage.
With cgroup v1, from what I can find online, memory.max_usage_in_bytes takes caches into account, so it varies with how much I/O is done, whereas total_rss in memory.stat looks more useful. Maybe memory.peak in cgroup v2 is clearer?
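To make that concrete, here is a minimal C sketch of reading those files directly. The cgroup directory used below is hypothetical; the real location depends on the cgroup version, the mount point and Slurm's cgroup layout on your nodes.

    /* Minimal sketch: read the peak-memory figures straight out of the
     * cgroup filesystem.  The directory below is hypothetical -- the real
     * location depends on cgroup version, mount point and Slurm's layout. */
    #include <stdio.h>
    #include <string.h>

    static long read_long(const char *path)
    {
        long v = -1;
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        /* Hypothetical job cgroup directory (cgroup v1 memory controller). */
        const char *dir = "/sys/fs/cgroup/memory/slurm/uid_1000/job_12345";
        char path[512], key[64];
        long v;
        FILE *f;

        /* v1 high-water mark: includes page cache, so it varies with I/O. */
        snprintf(path, sizeof path, "%s/memory.max_usage_in_bytes", dir);
        printf("max_usage_in_bytes: %ld\n", read_long(path));

        /* total_rss from memory.stat excludes the cache. */
        snprintf(path, sizeof path, "%s/memory.stat", dir);
        if ((f = fopen(path, "r")) != NULL) {
            while (fscanf(f, "%63s %ld", key, &v) == 2)
                if (strcmp(key, "total_rss") == 0)
                    printf("total_rss: %ld\n", v);
            fclose(f);
        }

        /* Under cgroup v2 the equivalent would be a single memory.peak file. */
        return 0;
    }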
It's not clear from the documentation how a user should interpret the sacct values to infer the actual memory usage of their jobs and correct their behaviour in future submissions.
I would be keen to see improvements in high water mark reporting. I noticed that the jobacct_gather plugin documentation was deleted back in Slurm 21.08; a SPANK plugin does look like a possible way to go. It also seems to be a common problem across technologies, e.g. https://github.com/google/cadvisor/issues/3286
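As a rough illustration of the SPANK route, a plugin could read the cgroup peak value when each task exits and log it. The sketch below uses the standard callbacks from slurm/spank.h, but the cgroup v2 path it constructs is an assumption and would need to match the site's actual hierarchy (and it would log once per task, not once per job).

    /* Sketch of a SPANK plugin that logs the cgroup v2 memory peak when a
     * task exits.  The cgroup path construction is an assumption and must
     * be adapted to the site's cgroup layout. */
    #include <stdio.h>
    #include <slurm/spank.h>

    SPANK_PLUGIN(peakmem, 1);

    int slurm_spank_task_exit(spank_t sp, int ac, char **av)
    {
        uint32_t jobid = 0;
        char path[256];
        long peak = -1;
        FILE *f;

        if (spank_get_item(sp, S_JOB_ID, &jobid) != ESPANK_SUCCESS)
            return ESPANK_SUCCESS;   /* don't fail the job over accounting */

        /* Hypothetical cgroup v2 path; adjust for the real hierarchy. */
        snprintf(path, sizeof path,
                 "/sys/fs/cgroup/system.slice/slurmstepd.scope/job_%u/memory.peak",
                 jobid);

        if ((f = fopen(path, "r")) != NULL) {
            if (fscanf(f, "%ld", &peak) == 1)
                slurm_info("peakmem: job %u memory.peak = %ld bytes",
                           jobid, peak);
            fclose(f);
        }
        return ESPANK_SUCCESS;
    }

Built with something like gcc -shared -fPIC -o peakmem.so peakmem.c and enabled via plugstack.conf, this would at least get a true high-water mark into the logs, even if not into the accounting database.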
Tom
From: Emyr James via slurm-users <slurm...@lists.schedmd.com>
Date: Monday, 20 May 2024 at 10:50
To: Davide DelVento <davide....@gmail.com>, Emyr James <emyr....@crg.eu>
Cc: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: [slurm-users] Re: memory high water mark reporting
Hi,
I came to the same conclusion and spotted similar places in the code that could be changed to get what was required. Without a new variable it will be tricky to implement properly, because of the way the existing variables are used and defined. Perhaps a new PeakMem field in the Slurm accounting database is needed to capture this, if there is enough interest in the feature.
N.B. I got confused about the memory counters: total_rss is already used, and max_usage_in_bytes is the only peak counter in cgroup v1 (the equivalent of memory.peak in cgroup v2).
Maybe the only proper way is to monitor this sort of thing outside of Slurm, with tools such as XDMoD.
Users can, of course, always just wrap the job itself in time to record the maximum memory usage. It's a bit of a naïve approach, but it does work. I agree that polling the current usage is not very satisfactory.
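For what it's worth, the wrapper doesn't have to be GNU time; a few lines of C around wait4() report the same figure (on Linux, ru_maxrss is in kilobytes), with the same caveat that it only sees a single process tree.

    /* Sketch of the "wrap the job" approach: fork/exec the real command
     * and read the child's maximum resident set size from wait4().
     * On Linux, ru_maxrss is reported in kilobytes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/resource.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 2;
        }

        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {                   /* child: run the real job */
            execvp(argv[1], &argv[1]);
            perror("execvp");
            _exit(127);
        }

        int status;
        struct rusage ru;
        wait4(pid, &status, 0, &ru);      /* reap child, collect usage */
        fprintf(stderr, "max RSS: %ld kB\n", ru.ru_maxrss);
        return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
    }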
Tim
--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca