[slurm-users] 'sacct -a' missing running jobs 'held' by user

Lee via slurm-users

Dec 15, 2025, 2:35:53 PM
to Slurm User Community List
Hello,

I am using slurm 23.02.6.  I have a strange issue.  I periodically use sacct to dump job data.  I then generate reports based on the resource allocation of our users.

Recently, I noticed some jobs 'missing' from my query. The missing jobs belonged to a user who had submitted a large array job and then 'held' all of its tasks, including the tasks that were already Running.
Now, if I run `sacct -a -S YYYY-MM-DD --format="jobidraw,jobname"`, the job is missing from the output.

However, if I query for that job specifically, i.e. `sacct -j RAWJOBID -S YYYY-MM-DD --format="jobidraw,jobname"`, the job is present.

Question:
1. How can I include these 'held' running jobs in my bulk `sacct -a` query? Finding the outliers and adding them to my dumped file ad hoc is too laborious to be feasible.
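
In the meantime, the only automatic detection I can think of is to diff the accounting dump against squeue. A rough, untested sketch (it assumes the missing jobs still show up in squeue while they are running):

        sacct -a -X -n -S 2025-12-12 -o jobidraw | awk '{print $1}' | sort -u > acct_ids.txt
        squeue -h -t RUNNING -o '%A' | sort -u > running_ids.txt
        comm -23 running_ids.txt acct_ids.txt    ## running jobs absent from the dump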


Minimum working example:
    #. Submit a job:
        myuser@clusterb01:~$ srun --pty bash # landed on dgx29

    #. Hold job
        myuser@clusterb01:~$ scontrol hold 120918
        myuser@clusterb01:~$ scontrol show job=120918
        JobId=120918 JobName=bash
           UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
           Priority=0 Nice=0 Account=allusers QOS=normal
           JobState=RUNNING Reason=JobHeldUser Dependency=(null)
           Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
           RunTime=00:00:29 TimeLimit=7-00:00:00 TimeMin=N/A
           SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
           AccrueTime=Unknown
           StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
           SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
           Partition=defq AllocNode:Sid=clusterb01:4145861
           ReqNodeList=(null) ExcNodeList=(null)
           NodeList=dgx29
           BatchHost=dgx29
           NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
           ReqTRES=cpu=1,mem=9070M,node=1,billing=1
           AllocTRES=cpu=2,mem=18140M,node=1,billing=2
           Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
           MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
           Features=(null) DelayBoot=00:00:00
           OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
           Command=bash
           WorkDir=/home/myuser
           Power=

    #. Release job
        myuser@clusterb01:~$ scontrol release 120918

    #. Show job again
        myuser@clusterb01:~$ scontrol show job=120918
        JobId=120918 JobName=bash
           UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
           Priority=1741 Nice=0 Account=allusers QOS=normal
           JobState=RUNNING Reason=None Dependency=(null)
           Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
           RunTime=00:01:39 TimeLimit=7-00:00:00 TimeMin=N/A
           SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
           AccrueTime=Unknown
           StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
           SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
           Partition=defq AllocNode:Sid=clusterb01:4145861
           ReqNodeList=(null) ExcNodeList=(null)
           NodeList=dgx29
           BatchHost=dgx29
           NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
           ReqTRES=cpu=1,mem=9070M,node=1,billing=1
           AllocTRES=cpu=2,mem=18140M,node=1,billing=2
           Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
           MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
           Features=(null) DelayBoot=00:00:00
           OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
           Command=bash
           WorkDir=/home/myuser/
           Power=

    #. In slurmctld, I see:
            root@clusterb01:~# grep 120918 /var/log/slurmctld
            [2025-12-15T13:31:28.706] sched: _slurm_rpc_allocate_resources JobId=120918 NodeList=dgx29 usec=1269
            [2025-12-15T13:31:47.751] sched: _hold_job_rec: hold on JobId=120918 by uid 123456
            [2025-12-15T13:31:47.751] sched: _update_job: set priority to 0 for JobId=120918
            [2025-12-15T13:31:47.751] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=189
            [2025-12-15T13:32:48.081] sched: _release_job_rec: release hold on JobId=120918 by uid 123456
            [2025-12-15T13:32:48.081] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=268
            [2025-12-15T13:33:20.552] _job_complete: JobId=120918 WEXITSTATUS 0
            [2025-12-15T13:33:20.552] _job_complete: JobId=120918 done

    #. Job is NOT missing when identified by jobid
            myuser@clusterb01:~$ sacct -j 120918 --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30"
            JobIDRaw     JobID               NodeList               Start                 End    Elapsed      State                     SubmitLine
            ------------ ------------ --------------- ------------------- ------------------- ---------- ---------- ------------------------------
            120918       120918                 dgx29 2025-12-15T13:31:28 2025-12-15T13:33:20   00:01:52  COMPLETED                srun --pty bash
            120918.0     120918.0               dgx29 2025-12-15T13:31:28 2025-12-15T13:33:20   00:01:52  COMPLETED                srun --pty bash
   
    #. Job IS missing when using -a
            myuser@clusterb01:~$ sacct -a --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30"  | grep -i 120918    ## MISSING

Best regards,
Lee

Lee via slurm-users

Jan 7, 2026, 7:25:09 AM
to Slurm User Community List
Hello,

I replicated this issue on a different cluster and determined that the root cause is that the time_eligible column in the underlying MySQL database gets set to 0 when a running job is held.  Let me demonstrate.

1. Allocate a job and check that I can query it via `sacct -S YYYY-MM-DD`

        jess@bcm10-h01:~$ srun --pty bash
        jess@bcm10-n01:~$ squeue
           JOBID PARTITION       NAME    USER ST         TIME  NODES   CPUS MIN_M
             114      defq       bash    jess  R         1:13      1      1 2900M

        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------
        114                bash       defq   allusers          1    RUNNING      0:0
        114.0              bash              allusers          1    RUNNING      0:0

        root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
           SubmitTime=2026-01-06T14:52:04 EligibleTime=2026-01-06T14:52:04



2. Hold and release the job, confirm that it is no longer returned by `sacct -S YYYY-MM-DD`, and notice that EligibleTime changes to Unknown.

        jess@bcm10-n01:~$ scontrol hold 114
        jess@bcm10-n01:~$ scontrol release 114

        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------

        root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
           SubmitTime=2026-01-06T14:52:04 EligibleTime=Unknown


3. Check time_eligible in the underlying MySQL database and confirm that changing time_eligible makes it queryable via `sacct -S YYYY-MM-DD`.

        root@bcm10-h01:~# mysql --host=localhost --user=slurm  --password=XYZ slurm_acct_db
        mysql> SELECT id_job  FROM slurm_job_table WHERE time_eligible = 0;
        +--------+
        | id_job |
        +--------+
        |    114 |
        |    112 |
        |    113 |
        +--------+
        3 rows in set (0.00 sec)

        mysql> UPDATE slurm_job_table SET time_eligible = 1767733491 WHERE id_job = 114;
        Query OK, 1 row affected (0.01 sec)
        Rows matched: 1  Changed: 1  Warnings: 0

        mysql> SELECT time_eligible FROM slurm_job_table WHERE id_job = 114;
        +---------------+
        | time_eligible |
        +---------------+
        |    1767733491 |
        +---------------+
        1 row in set (0.00 sec)

        ### WORKS AGAIN
        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------
        114                bash       defq   allusers          1    RUNNING      0:0
        114.0              bash              allusers          1    RUNNING      0:0
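
The same repair could presumably be applied in bulk by backfilling from each job's recorded start time. An untested sketch, assuming time_start is populated for the affected rows:

        mysql> UPDATE slurm_job_table SET time_eligible = time_start WHERE time_eligible = 0 AND time_start > 0;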

4. The sacct man page even hints at this behavior:

         "For  example  jobs  submitted with the "--hold" option will have "EligibleTime=Unknown" as they are pending indefinitely."

Conclusion:
This very much feels like a bug. A running job arguably shouldn't be 'holdable' at all: it cannot be 'pending indefinitely' when it is already actively running. Nor should EligibleTime be reset when a user tries to 'hold' a running job. Since `sacct -S` appears to select jobs by their eligible time, zeroing time_eligible pushes a job outside any sensible query window, which would explain why it vanishes from the bulk query.

Question:
1. Identifying these problematic jobs via the underlying MySQL database seems far from ideal. Are there any better workarounds? The best I have so far is the wrapper sketched below.
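
For completeness, the wrapper I have in mind (an untested sketch; it reuses the credentials and table name from the demonstration above):

        #!/bin/bash
        # Dump the bulk query, then append any jobs whose time_eligible was
        # zeroed by a hold, using the per-job query that still finds them.
        START=2026-01-06
        FORMAT="jobidraw,jobid,node,start,end,elapsed,state,submitline%30"

        sacct -a -S "$START" -o "$FORMAT" > jobs.txt

        for id in $(mysql -N --host=localhost --user=slurm --password=XYZ \
                          -e "SELECT id_job FROM slurm_job_table WHERE time_eligible = 0;" \
                          slurm_acct_db); do
            sacct -j "$id" -S "$START" -n -o "$FORMAT" >> jobs.txt
        done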

Best regards,
Lee

Ole Holm Nielsen via slurm-users

Jan 8, 2026, 2:29:49 AM
to slurm...@lists.schedmd.com
Hi Lee,

Just my 2 cents: Which database and OS versions do you run?

Furthermore, Slurm 23.02 is really old, so I'd recommend upgrading to
25.05 (or perhaps even 25.11). It just might be that your bug has been
resolved in later versions of Slurm or MySQL/MariaDB.

You can find detailed upgrade instructions in [1]. Be especially mindful
of the MySQL and slurmdbd upgrades, and perform a dry-run upgrade first on
a test node.
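
For example, one way to stage the dry run is from a dump of the
accounting database (just a sketch; adjust credentials and paths to
your site):

   mysqldump --user=slurm -p slurm_acct_db > slurm_acct_db.sql
   # on the test node:
   mysql -e "CREATE DATABASE slurm_acct_db;"
   mysql slurm_acct_db < slurm_acct_db.sql
   # then point a test slurmdbd at the copy and watch its log while it
   # performs the schema conversion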

On 1/7/26 13:22, Lee via slurm-users wrote:
> I replicated this issue on a different cluster and determined that the
> root cause is that the time_eligible column in the underlying MySQL
> database gets set to 0 when a running job is held.  Let me demonstrate.
...
> I am using slurm 23.02.6.  I have a strange issue.  I periodically use
> sacct to dump job data.  I then generate reports based on the resource
> allocation of our users.

IHTH,
Ole

[1]
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Lee via slurm-users

Jan 8, 2026, 10:54:09 AM
to Ole Holm Nielsen, slurm...@lists.schedmd.com
Thanks for the suggestion. In my test environment, I'm running:

root@bcm10-h01:~# mysql -V
mysql  Ver 8.0.36-0ubuntu0.22.04.1 for Linux on x86_64 ((Ubuntu))

root@bcm10-h01:~# cat /etc/os-release  | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.4 LTS"

This closely matches my production environment.  

My production environment is running in an Nvidia POD ecosystem, and I'm using Base Command Manager (v10) to manage my cluster. The version of Slurm in the BCM ISO tends to lag behind by at least 12 months; all this is to say that updating individual cluster components in the Base Command Environment isn't straightforward.

Best, 
Lee

Christopher Samuel via slurm-users

Jan 8, 2026, 1:46:20 PM
to slurm...@lists.schedmd.com
On 12/15/25 11:33 am, Lee via slurm-users wrote:

> I am using slurm 23.02.6.

FYI, six security issues have been fixed since 23.02.6, and 23.02.7
alone had a lot of other fixes in it. The last 23.02 release was
23.02.9:

https://github.com/SchedMD/slurm/blob/master/CHANGELOG/slurm-23.02.md

But that series has long been abandoned; 24.11 is the oldest supported
release (and you'd need to upgrade 23.02 to either 23.11 or 24.05 first,
as Slurm only supports upgrading across two major releases at a time).

FWIW we're running 24.11.7 and plan to upgrade directly to 25.11.x early
this year if testing goes well.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA