[slurm-users] Job cancelled into the future

Reed Dier

Dec 20, 2022, 10:54:58 AM
to Slurm User Community List
Hoping this is a fairly simple one.

This is a small internal cluster that we’ve been using for about 6 months now. We’ve had some infrastructure instability in that time, which I think may be the root cause of this weirdness, but hopefully someone can point me in the right direction to solve it.

I do a daily email of sreport to show how busy the cluster was and who the top users were.
Weirdly, I have a user that seems to report the exact same usage day after day after day, down to the hundredth of a percent, conspicuously even when they were on vacation and claimed they had no job submissions in cron/etc.

So, taking the scom TUI posted this morning for a spin, I filtered on that user and noticed that even though I was only looking 2 days back in the job history, I was seeing a job from August.

Conspicuously, the job state is CANCELLED, but the end time is exactly one year after the start time, which puts it in August 2023.
So something in the dbd is confused about these jobs: they are lingering, reported as cancelled, yet somehow still “on the books” until next August.

╭──────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                          │
│  Job ID               : 290742                                                           │
│  Job Name             : $jobname                                                         │
│  User                 : $user                                                            │
│  Group                : $user                                                            │
│  Job Account          : $account                                                         │
│  Job Submission       : 2022-08-08 08:44:52 -0400 EDT                                    │
│  Job Start            : 2022-08-08 08:46:53 -0400 EDT                                    │
│  Job End              : 2023-08-08 08:47:01 -0400 EDT                                    │
│  Job Wait time        : 2m1s                                                             │
│  Job Run time         : 8760h0m8s                                                        │
│  Partition            : $part                                                            │
│  Priority             : 127282                                                           │
│  QoS                  : $qos                                                             │
│                                                                                          │
│                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
Steps count: 0

Filter: $user         Items: 13

 Job ID      Job Name                             Part.  QoS         Account     User             Nodes                 State
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 290714      $jobname                             $part  $qos        $acct       $user            node32                CANCELLED
 290716      $jobname                             $part  $qos        $acct       $user            node24                CANCELLED
 290736      $jobname                             $part  $qos        $acct       $user            node00                CANCELLED
 290742      $jobname                             $part  $qos        $acct       $user            node01                CANCELLED
 290770      $jobname                             $part  $qos        $acct       $user            node02                CANCELLED
 290777      $jobname                             $part  $qos        $acct       $user            node03                CANCELLED
 290793      $jobname                             $part  $qos        $acct       $user            node04                CANCELLED
 290797      $jobname                             $part  $qos        $acct       $user            node05                CANCELLED
 290799      $jobname                             $part  $qos        $acct       $user            node06                CANCELLED
 290801      $jobname                             $part  $qos        $acct       $user            node07                CANCELLED
 290814      $jobname                             $part  $qos        $acct       $user            node08                CANCELLED
 290817      $jobname                             $part  $qos        $acct       $user            node09                CANCELLED
 290819      $jobname                             $part  $qos        $acct       $user            node10                CANCELLED

I’d love to figure out the proper way to either purge these JIDs from the accounting database cleanly, or change the job end/run time to a sane/correct value.
Slurm is v21.08.8-2, and NTP is synced to a stratum 1 server, so time is in sync everywhere; not that multiple servers would all drift a year off like this anyway.
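
For reference, sacct shows the same bogus end time; something along these lines (using one of the job IDs from the output above) should reproduce it:

# sacct -j 290742 -X --format=JobID,State,Start,End,Elapsed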

Thanks for any help,
Reed

Sarlo, Jeffrey S

Dec 20, 2022, 11:02:12 AM
to Slurm User Community List

Do they show up as runaway jobs?

sacctmgr show runawayjobs

If they do, it should give you the option to fix them.

Jeff

Brian Andrus

Dec 20, 2022, 11:03:59 AM
to slurm...@lists.schedmd.com

Try:

    sacctmgr list runawayjobs

Brian Andrus

Reed Dier

Dec 20, 2022, 11:08:46 AM
to Slurm User Community List
Two votes for runawayjobs make a strong case (and it’s also something I’m glad to learn exists, for the future); however:

# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster $cluster

So unfortunately that doesn’t appear to be the culprit.

Appreciate the responses.

Reed

Reed Dier

Dec 20, 2022, 2:51:59 PM
to Slurm User Community List
Just to followup with some things I’ve tried:

scancel doesn’t want to touch it:
# scancel -v 290710
scancel: Terminating job 290710
scancel: error: Kill job error on job id 290710: Job/step already completing or completed

scontrol does see that these are all members of the same array, but doesn’t want to touch them either:
# scontrol update JobID=290710 EndTime=2022-08-09T08:47:01
290710_4,6,26,32,60,67,83,87,89,91,...: Job has already finished

And trying to modify the job’s end time with sacctmgr fails, as expected, because EndTime is only a ‘where’ spec, not a ‘set’ spec (I also tried EndTime=now with the same result):
# sacctmgr modify job where JobID=290710 set EndTime=2022-08-09T08:47:01
 Unknown option: EndTime=2022-08-09T08:47:01
 Use keyword 'where' to modify condition
 You didn't give me anything to set

I was able to set a comment for the jobs/array, so the DBD can see/talk to them.
One additional thing to mention: there are 14 JIDs stuck like this; 1 is the array job ID, and the other 13 are array tasks of that original array ID.
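
Re the comment: it was set through sacctmgr, roughly like the following (the comment text here is just an example); as far as I can tell, the comment and derived exit code are about the only job fields sacctmgr will let you set:

# sacctmgr modify job where JobID=290710 set Comment="stuck array, bogus 2023 end time"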

But I figured I would share the other steps I’ve tried, to rule those ideas out.

Thanks,
Reed

Brian Andrus

Dec 20, 2022, 6:02:10 PM
to slurm...@lists.schedmd.com

Seems like the time may have been off on the db server at the insert/update.

You may want to dump the database, find which table/records need to be updated, and try updating them. If anything goes south, you can restore from the dump.
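
Something along these lines, assuming the default slurm_acct_db database name and the usual time_end / id_job columns in the cluster's job table (verify the table and column names against your own schema; the target end time here is just an example one year back from the bogus value):

# mysqldump slurm_acct_db > /root/slurm_acct_db.backup.sql
# mysql slurm_acct_db -e "UPDATE ${CLUSTER}_job_table SET time_end = UNIX_TIMESTAMP('2022-08-08 08:47:01') WHERE id_job = 290742;"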

Brian Andrus

Chris Samuel

Dec 23, 2022, 8:19:16 PM
to slurm...@lists.schedmd.com
On 20/12/22 6:01 pm, Brian Andrus wrote:

> You may want to dump the database, find which table/records need to be
> updated, and try updating them. If anything goes south, you can restore
> from the dump.

+lots to making sure you've got good backups first. Stop slurmdbd
before you start on the backups and don't restart it until you've made
the changes, including setting the rollup times to before the jobs
started, so that the rollups pick up these changes!

When you start slurmdbd after making the changes it should see that it
needs to do rollups and kick those off.
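
If it helps, a rough sketch of that sequence; as far as I remember the rollup timestamps live in the cluster's last_ran_table as hourly_rollup / daily_rollup / monthly_rollup unix times, but do verify that against your own schema first (the date here is just an example that predates the stuck jobs):

# systemctl stop slurmdbd
# (take the backup and make the job_table edits here)
# mysql slurm_acct_db -e "UPDATE ${CLUSTER}_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2022-08-01'), daily_rollup = UNIX_TIMESTAMP('2022-08-01'), monthly_rollup = UNIX_TIMESTAMP('2022-08-01');"
# systemctl start slurmdbd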

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA


Reed Dier

Jan 17, 2023, 12:29:47 PM
to Slurm User Community List
So I’m going to take a stab at rectifying this, now that post-holiday matters are taken care of.

Here is a paste of the $CLUSTER_job_table where I think I see the issue; now I just want to sanity check my remediation steps.
https://rentry.co/qhw6mg (pastebin alternative because markdown is paywalled for pastebin).

There are a number of job records with a timelimit of 4294967295, whereas the others in the same job array have 525600.
Obviously I want to edit those time limits down to sane values (i.e. match them to the others).
I don’t see anything in the $CLUSTER_step_table that looks like it would need to be modified to match, though I could be wrong.
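
For concreteness, the update I have in mind is along these lines (using id_array_job as the selector to catch the whole array is my assumption; double-check the column names against the schema before running it):

# mysql slurm_acct_db -e "UPDATE ${CLUSTER}_job_table SET timelimit = 525600 WHERE id_array_job = 290710 AND timelimit = 4294967295;"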

But the part about getting Slurm to pick up the change is where I want to make sure I’m on the right page.
Should I manually update the mod_time timestamp, so that Slurm catches the change at its next rollup?
Or will Slurm notice the changed time limit and update mod_time itself when it does the rollup?

I also don’t see any documentation on how to manually trigger a rollup, either via slurmdbd.conf or a command-line flag.
Will it automagically perform a rollup at some predefined, non-configurable interval, or when the daemon is restarted?

Apologies if this is all trivial information, just trying to measure twice and cut once.

Appreciate everyone’s help so far.

Thanks,
Reed

Reed Dier

Jan 19, 2023, 12:33:56 PM
to Slurm User Community List
Just to hopefully close this out, I believe I was actually able to resolve this in “user-land” rather than mucking with the database.

I was able to requeue the bad JIDs, and they went pending.
Then I updated the jobs to a time limit of 60.
Then I scancelled the jobs; they returned to a cancelled state and rolled off within about 10 minutes.
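
In command form it was roughly this, per stuck job/array ID:

# scontrol requeue 290710
# scontrol update JobID=290710 TimeLimit=60
# scancel 290710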

I’m surprised I didn’t think to try requeueing earlier, but here’s hoping this did the trick, and that I will have more accurate reporting and fewer “more time than is possible” log errors.

Thanks,
Reed