We use Prometheus as our primary metrics tool, and I recently added a metric for jobs that are PENDING for the specific reason of “priority”. So we'll have some nice data when we are preparing for FY 2025, I suppose. The problem is that for this past year we are stuck with what Slurm gathered, unless I can find a better way to determine whether the PD reason is “priority” than “run a query on an active job and see”.
To put it another way: for a given past job, is there any way to determine how long its start was delayed due to lack of available resources?
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of "Groner, Rob" <rug...@psu.edu>
Reply-To: Slurm User Community List <slurm...@lists.schedmd.com>
Date: Thursday, December 7, 2023 at 2:26 PM
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: [ext] Re: [slurm-users] Time spent in PENDING/Priority
Ya, I'm kinda looking at exactly this right now as well. For us, I know we're under-utilizing our hardware currently, but I still want to know if the number of pending jobs is growing because that would probably point to something going wrong somewhere. It's a good metric to have.
We are going the route of using pyslurm/graphite/grafana to get our answers. I know there is also a Prometheus Slurm data tool with Grafana dashboards that might work just as well.
With pyslurm, I end up with an array of all current jobs and can then grab my metrics as needed. We currently measure the "queue" time by comparing when the job was submitted vs. current time, as long as the job is Pending. Once it's running, then the time spent in the queue is start time minus submit time.
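A minimal sketch of that calculation, assuming each job is a plain dict with epoch-second timestamps. The key names (job_state, submit_time, start_time) are illustrative assumptions, not a confirmed pyslurm schema, so adjust them to whatever your job records actually contain:

```python
import time


def queue_seconds(job, now=None):
    """Return the seconds a job has waited (or did wait) in the queue.

    `job` is assumed to be a dict with hypothetical keys:
      job_state   -- e.g. "PENDING", "RUNNING", "COMPLETED"
      submit_time -- epoch seconds when the job was submitted
      start_time  -- epoch seconds when the job started (0/None if never)
    """
    now = time.time() if now is None else now
    submit = job["submit_time"]

    if job["job_state"] == "PENDING":
        # Still waiting: queue time keeps growing until the job starts.
        return max(0, now - submit)

    start = job.get("start_time")
    if start:
        # Job has started: queue time is fixed at start minus submit.
        return max(0, start - submit)

    # Never started (e.g. cancelled while pending): no meaningful value.
    return None
```

Passing `now` explicitly keeps every job in one sampling pass measured against the same clock reading.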
You could view the job Reason to determine if it is pending for Resources, or for QOS limits, etc. I kinda only care about Resource-related pending, but we could also use the QOS/group CPU limit-related pending as a way to show users that if they purchased more CPU time, they'd wait much less.
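Splitting the pending jobs by Reason could look roughly like this. Again the dict keys (job_state, state_reason) are assumptions for illustration and may not match what your tooling returns:

```python
from collections import Counter


def pending_by_reason(jobs):
    """Count pending jobs per pending reason.

    `jobs` is assumed to be an iterable of dicts with hypothetical keys
    job_state and state_reason (e.g. "Resources", "Priority").
    """
    return Counter(
        job.get("state_reason", "Unknown")
        for job in jobs
        if job["job_state"] == "PENDING"
    )
```

Each count could then go out as its own series, so Resource-related and limit-related pending can be graphed separately.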
Some of what I'm saying is hypothetical; we aren't actually graphing queue time yet, or at least not the way I want to. But that is how I plan to go about it.
Rob
________________________________
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Chip Seraphine <csera...@DRWHoldings.com>
Sent: Thursday, December 7, 2023 3:09 PM
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: [slurm-users] Time spent in PENDING/Priority