[slurm-users] 'sacct -a' missing running jobs 'held' by user

Lee via slurm-users

Dec 15, 2025, 2:35:53 PM
to Slurm User Community List
Hello,

I am using slurm 23.02.6.  I have a strange issue.  I periodically use sacct to dump job data.  I then generate reports based on the resource allocation of our users.

Recently, I noticed some jobs 'missing' from my query. The missing jobs belonged to a user who had submitted a large array job and then 'held' all of its tasks, including the tasks that were already Running.
Now, if I run `sacct -a -S YYYY-MM-DD --format="jobidraw,jobname"`, the job is missing from the output.

However, if I query for that job specifically, i.e. `sacct -j RAWJOBID -S YYYY-MM-DD --format="jobidraw,jobname"`, the job is present.

Question:
1. How can I include these 'held' running jobs in my bulk `sacct -a` query? Finding the outliers and adding them to my dumped file ad hoc is too laborious to be feasible.
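
In the meantime, the only automatic detection I can think of is to diff the accounting dump against squeue. A rough, untested sketch (it assumes the missing jobs still show up in squeue while they are running):

        sacct -a -X -n -S 2025-12-12 -o jobidraw | awk '{print $1}' | sort -u > acct_ids.txt
        squeue -h -t RUNNING -o '%A' | sort -u > running_ids.txt
        comm -23 running_ids.txt acct_ids.txt    ## running jobs absent from the dump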


Minimum working example:
    #. Submit a job:
        myuser@clusterb01:~$ srun --pty bash # landed on dgx29

    #. Hold job
        myuser@clusterb01:~$ scontrol hold 120918
        myuser@clusterb01:~$ scontrol show job=120918
        JobId=120918 JobName=bash
           UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
           Priority=0 Nice=0 Account=allusers QOS=normal
           JobState=RUNNING Reason=JobHeldUser Dependency=(null)
           Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
           RunTime=00:00:29 TimeLimit=7-00:00:00 TimeMin=N/A
           SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
           AccrueTime=Unknown
           StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
           SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
           Partition=defq AllocNode:Sid=clusterb01:4145861
           ReqNodeList=(null) ExcNodeList=(null)
           NodeList=dgx29
           BatchHost=dgx29
           NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
           ReqTRES=cpu=1,mem=9070M,node=1,billing=1
           AllocTRES=cpu=2,mem=18140M,node=1,billing=2
           Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
           MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
           Features=(null) DelayBoot=00:00:00
           OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
           Command=bash
           WorkDir=/home/myuser
           Power=

    #. Release job
        myuser@clusterb01:~$ scontrol release 120918

    #. Show job again
        myuser@clusterb01:~$ scontrol show job=120918
        JobId=120918 JobName=bash
           UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
           Priority=1741 Nice=0 Account=allusers QOS=normal
           JobState=RUNNING Reason=None Dependency=(null)
           Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
           RunTime=00:01:39 TimeLimit=7-00:00:00 TimeMin=N/A
           SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
           AccrueTime=Unknown
           StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
           SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
           Partition=defq AllocNode:Sid=clusterb01:4145861
           ReqNodeList=(null) ExcNodeList=(null)
           NodeList=dgx29
           BatchHost=dgx29
           NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
           ReqTRES=cpu=1,mem=9070M,node=1,billing=1
           AllocTRES=cpu=2,mem=18140M,node=1,billing=2
           Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
           MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
           Features=(null) DelayBoot=00:00:00
           OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
           Command=bash
           WorkDir=/home/myuser/
           Power=

    #. In slurmctld, I see:
            root@clusterb01:~# grep 120918 /var/log/slurmctld
            [2025-12-15T13:31:28.706] sched: _slurm_rpc_allocate_resources JobId=120918 NodeList=dgx29 usec=1269
            [2025-12-15T13:31:47.751] sched: _hold_job_rec: hold on JobId=120918 by uid 123456
            [2025-12-15T13:31:47.751] sched: _update_job: set priority to 0 for JobId=120918
            [2025-12-15T13:31:47.751] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=189
            [2025-12-15T13:32:48.081] sched: _release_job_rec: release hold on JobId=120918 by uid 123456
            [2025-12-15T13:32:48.081] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=268
            [2025-12-15T13:33:20.552] _job_complete: JobId=120918 WEXITSTATUS 0
            [2025-12-15T13:33:20.552] _job_complete: JobId=120918 done

    #. Job is NOT missing when identified by jobid
            myuser@clusterb01:~$ sacct -j 120918 --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30"
            JobIDRaw     JobID               NodeList               Start                 End    Elapsed      State                     SubmitLine
            ------------ ------------ --------------- ------------------- ------------------- ---------- ---------- ------------------------------
            120918       120918                 dgx29 2025-12-15T13:31:28 2025-12-15T13:33:20   00:01:52  COMPLETED                srun --pty bash
            120918.0     120918.0               dgx29 2025-12-15T13:31:28 2025-12-15T13:33:20   00:01:52  COMPLETED                srun --pty bash
   
    #. Job IS missing when using -a
            myuser@clusterb01:~$ sacct -a --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30"  | grep -i 120918    ## MISSING

Best regards,
Lee

Lee via slurm-users

Jan 7, 2026, 7:25:09 AM
to Slurm User Community List
Hello,

I replicated this issue on a different cluster and determined that the root cause is that the time_eligible column in the underlying MySQL database gets set to 0 when a running job is held.  Let me demonstrate.

1. Allocate a job and check that I can query it via `sacct -S YYYY-MM-DD`

        jess@bcm10-h01:~$ srun --pty bash
        jess@bcm10-n01:~$ squeue
           JOBID PARTITION       NAME    USER ST         TIME  NODES   CPUS MIN_M
             114      defq       bash    jess  R         1:13      1      1 2900M

        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------
        114                bash       defq   allusers          1    RUNNING      0:0
        114.0              bash              allusers          1    RUNNING      0:0

        root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
           SubmitTime=2026-01-06T14:52:04 EligibleTime=2026-01-06T14:52:04



2. Hold and release the job, confirm that it is no longer returned by `sacct -S YYYY-MM-DD`, and notice that EligibleTime changes to Unknown.

        jess@bcm10-n01:~$ scontrol hold 114
        jess@bcm10-n01:~$ scontrol release 114

        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------

        root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
           SubmitTime=2026-01-06T14:52:04 EligibleTime=Unknown


3. Check time_eligible in the underlying MySQL database and confirm that changing time_eligible makes it queryable via `sacct -S YYYY-MM-DD`.

        root@bcm10-h01:~# mysql --host=localhost --user=slurm  --password=XYZ slurm_acct_db
        mysql> SELECT id_job  FROM slurm_job_table WHERE time_eligible = 0;
        +--------+
        | id_job |
        +--------+
        |    114 |
        |    112 |
        |    113 |
        +--------+
        3 rows in set (0.00 sec)

        mysql> UPDATE slurm_job_table SET time_eligible = 1767733491 WHERE id_job = 114;
        Query OK, 1 row affected (0.01 sec)
        Rows matched: 1  Changed: 1  Warnings: 0

        mysql> SELECT time_eligible FROM slurm_job_table WHERE id_job = 114;
        +---------------+
        | time_eligible |
        +---------------+
        |    1767733491 |
        +---------------+
        1 row in set (0.00 sec)

        ### WORKS AGAIN
        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------
        114                bash       defq   allusers          1    RUNNING      0:0
        114.0              bash              allusers          1    RUNNING      0:0
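
The same repair could presumably be applied in bulk by backfilling from each job's recorded start time. An untested sketch, assuming time_start is populated for the affected rows:

        mysql> UPDATE slurm_job_table SET time_eligible = time_start WHERE time_eligible = 0 AND time_start > 0;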

4. The sacct man page even hints at this behavior:

         "For  example  jobs  submitted with the "--hold" option will have "EligibleTime=Unknown" as they are pending indefinitely."

Conclusion:
This very much feels like a bug. A running job arguably shouldn't be 'holdable' at all: it cannot be 'pending indefinitely' when it is already actively running. Nor should EligibleTime be reset when a user tries to 'hold' a running job. Since `sacct -S` appears to select jobs by their eligible time, zeroing time_eligible pushes a job outside any sensible query window, which would explain why it vanishes from the bulk query.

Question:
1. Identifying these problematic jobs via the underlying MySQL database seems far from ideal. Are there any better workarounds? The best I have so far is the wrapper sketched below.
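
For completeness, the wrapper I have in mind (an untested sketch; it reuses the credentials and table name from the demonstration above):

        #!/bin/bash
        # Dump the bulk query, then append any jobs whose time_eligible was
        # zeroed by a hold, using the per-job query that still finds them.
        START=2026-01-06
        FORMAT="jobidraw,jobid,node,start,end,elapsed,state,submitline%30"

        sacct -a -S "$START" -o "$FORMAT" > jobs.txt

        for id in $(mysql -N --host=localhost --user=slurm --password=XYZ \
                          -e "SELECT id_job FROM slurm_job_table WHERE time_eligible = 0;" \
                          slurm_acct_db); do
            sacct -j "$id" -S "$START" -n -o "$FORMAT" >> jobs.txt
        done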

Best regards,
Lee

Ole Holm Nielsen via slurm-users

Jan 8, 2026, 2:29:49 AM
to slurm...@lists.schedmd.com
Hi Lee,

Just my 2 cents: Which database and OS versions do you run?

Furthermore, Slurm 23.02 is really old, so I'd recommend upgrading to
25.05 (or perhaps even 25.11). It just might be that your bug has been
resolved in later versions of Slurm or MySQL/MariaDB.

You can find detailed upgrade instructions in [1]. Be especially mindful
of the MySQL and slurmdbd upgrades, and perform a dry-run upgrade first on
a test node.
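
For example, one way to stage the dry run is from a dump of the
accounting database (just a sketch; adjust credentials and paths to
your site):

   mysqldump --user=slurm -p slurm_acct_db > slurm_acct_db.sql
   # on the test node:
   mysql -e "CREATE DATABASE slurm_acct_db;"
   mysql slurm_acct_db < slurm_acct_db.sql
   # then point a test slurmdbd at the copy and watch its log while it
   # performs the schema conversion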

On 1/7/26 13:22, Lee via slurm-users wrote:
> I replicated this issue on a different cluster and determined that the
> root cause is that the time_eligible column in the underlying MySQL
> database gets set to 0 when a running job is held.  Let me demonstrate.
...
> I am using slurm 23.02.6.  I have a strange issue.  I periodically use
> sacct to dump job data.  I then generate reports based on the resource
> allocation of our users.

IHTH,
Ole

[1]
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Lee via slurm-users

Jan 8, 2026, 10:54:09 AM
to Ole Holm Nielsen, slurm...@lists.schedmd.com
Thanks for the suggestion. In my test environment, I'm running:

root@bcm10-h01:~# mysql -V
mysql  Ver 8.0.36-0ubuntu0.22.04.1 for Linux on x86_64 ((Ubuntu))

root@bcm10-h01:~# cat /etc/os-release  | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.4 LTS"

This closely matches my production environment.  

My production environment is running in an Nvidia POD ecosystem, and I'm using Base Command Manager (v10) to manage my cluster. The version of Slurm in the BCM ISO tends to lag behind by at least 12 months; all this is to say that updating individual cluster components in the Base Command Environment isn't straightforward.

Best, 
Lee

Christopher Samuel via slurm-users

Jan 8, 2026, 1:46:20 PM
to slurm...@lists.schedmd.com
On 12/15/25 11:33 am, Lee via slurm-users wrote:

> I am using slurm 23.02.6.

FYI, six security issues have been fixed since 23.02.6, and 23.02.7
alone had a lot of other fixes in it. The last 23.02 release was
23.02.9:

https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-23.02.md

But that series has long been abandoned; 24.11 is the oldest supported
release (and you'd need to upgrade 23.02 to either 23.11 or 24.05 first,
as Slurm only supports upgrading across two major releases at a time).

FWIW we're running 24.11.7 and plan to upgrade directly to 25.11.x early
this year if testing goes well.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA