[slurm-users] sacct end time for failed jobs

Brian Andrus

Feb 26, 2019, 1:04:26 PM
to slurm...@lists.schedmd.com
All,

So I am using sacct to generate daily reports of job run times that are
imported into an external db for cost and projected use planning.

One thing I have noticed is that the END field for jobs with a state of
FAILED is "Unknown" but the ELAPSED field has the time it ran.

It seems to me that END should be filled with the time the job failed,
no? Is there a setting or something that can be done to do this? Or a
schema so I could update the table(s) myself for any job with a state of
"FAILED"?


All the Best,
Brian Andrus


Chris Samuel

Feb 28, 2019, 1:41:40 AM
to slurm...@lists.schedmd.com
On Tuesday, 26 February 2019 10:03:34 AM PST Brian Andrus wrote:

> One thing I have noticed is that the END field for jobs with a state of
> FAILED is "Unknown" but the ELAPSED field has the time it ran.

That shouldn't happen, it works fine here (and where I've used Slurm in
Australia).

$ sacct -j ${FAILED_JOBID} -o start,end,elapsed,state
              Start                 End    Elapsed      State
------------------- ------------------- ---------- ----------
2019-02-27T22:35:23 2019-02-27T22:36:20   00:00:57     FAILED
2019-02-27T22:35:23 2019-02-27T22:36:20   00:00:57     FAILED
2019-02-27T22:35:23 2019-02-27T22:36:38   00:01:15  COMPLETED

The "COMPLETED" part is the extern step we have as we use pam_slurm_adopt.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA




Brian Andrus

Mar 5, 2019, 1:08:56 PM
to Slurm User Community List
Hmm. I do see that issue, and I also have several jobs in the db without an end time, even though they are not running.
Not sure how that happened, but I do want to find a good way to clean it up. Without an end time, sacct reports the jobs as if they are still running, and the total elapsed time keeps growing.

Does anyone have a process they use to handle empty (aka "Unknown") end times for jobs that are not running?

Brian Andrus
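
For the daily-report side of the question, one possible way to at least flag such records before they reach the external db is sketched below. It is only a sketch under stated assumptions: the input is pipe-delimited text as produced by something like `sacct -a -X -P -o jobid,state,start,end,elapsed`, the set of live job IDs comes from elsewhere (for example `squeue -h -o %i`), and the function name and sample rows are made up for illustration.

```python
import csv
import io


def unknown_end_jobs(sacct_text, live_ids=frozenset()):
    """Return sacct rows whose End is 'Unknown' and whose JobID is not live.

    sacct_text is assumed to be pipe-delimited output with a header line
    containing at least the JobID and End columns.
    """
    reader = csv.DictReader(io.StringIO(sacct_text), delimiter="|")
    return [
        row
        for row in reader
        if row["End"] == "Unknown" and row["JobID"] not in live_ids
    ]


# Illustrative input only, not real accounting data.
sample = """JobID|State|Start|End|Elapsed
1001|FAILED|2019-02-27T22:35:23|2019-02-27T22:36:20|00:00:57
1002|FAILED|2019-02-27T23:00:00|Unknown|00:10:00
1003|RUNNING|2019-02-28T01:00:00|Unknown|00:05:00
"""

# Job 1003 is treated as genuinely running, so only 1002 is flagged.
for row in unknown_end_jobs(sample, live_ids={"1003"}):
    print(row["JobID"], row["State"], row["Elapsed"])
```

This only reports the suspicious records; it does not change anything in the accounting database.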

Chris Samuel

Mar 6, 2019, 1:34:49 AM
to slurm...@lists.schedmd.com
On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:

> Does anyone have a process they use to handle empty (aka "Unknown") end
> times for jobs that are not running?

What does:

sacctmgr list runawayjobs

say?

Brian Andrus

Mar 6, 2019, 11:17:29 AM
to slurm...@lists.schedmd.com

It shows several jobs that all have "Unknown" for end_time. Some are
PENDING and some are RUNNING (none are truly in either state).

It asked to fix them, which I did, but nothing seems to have changed.
They still show up with that command and in reports.


Brian

Cyrus Proctor

Mar 6, 2019, 11:59:37 AM
to slurm...@lists.schedmd.com

Hi Brian,

Others probably have better suggestions before going the route I'm about to detail. If you do go this route, be warned: you definitely have the ability to irrevocably lose data or destroy your Slurm accounting database. Do so at your own risk. I got here with Google-fu after running out of other (known to me) options. Someone please save Brian having to do what comes below ;-)

Last warning: I'd recommend turning off slurmdbd and backing up the database (mysqldump) before going forward.

In my case, runaway jobs did not show up with `sacctmgr list runawayjobs`. My problem was that I could not remove a user from the Slurm database because it thought they still had active jobs. The likely cause of this was the slurmdbd daemon not shutting down gracefully at some point. The job was long gone but it was still in a pending state:

# sacct -j 899139 
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
899139            equil   gpu-long    p-1234         20    PENDING      0:0 
# scontrol show job 899139
slurm_load_jobs error: Invalid job id specified
# mysql -u root -p
...
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 7453
Server version: 5.1.73 Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
+-------+----------+------------+-------------+----------+-----------+
| state | time_end | time_start | time_submit | id_assoc | partition |
+-------+----------+------------+-------------+----------+-----------+
|     0 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
+-------+----------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)

mysql> update banana_job_table set state=3 where id_job=899139;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
+-------+----------+------------+-------------+----------+-----------+
| state | time_end | time_start | time_submit | id_assoc | partition |
+-------+----------+------------+-------------+----------+-----------+
|     3 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
+-------+----------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)

mysql> update banana_job_table set time_start=1546880712 where id_job=899139;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
+-------+----------+------------+-------------+----------+-----------+
| state | time_end | time_start | time_submit | id_assoc | partition |
+-------+----------+------------+-------------+----------+-----------+
|     3 |        0 | 1546880712 |  1546880711 |     2078 | gpu-long  |
+-------+----------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)

mysql> update banana_job_table set time_end=1546880713 where id_job=899139;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
+-------+------------+------------+-------------+----------+-----------+
| state | time_end   | time_start | time_submit | id_assoc | partition |
+-------+------------+------------+-------------+----------+-----------+
|     3 | 1546880713 | 1546880712 |  1546880711 |     2078 | gpu-long  |
+-------+------------+------------+-------------+----------+-----------+
1 row in set (0.00 sec)

In this case for job ID 899139 on the banana cluster, the state was not updated and neither were start or end times. I went in and manually edited the job entries such that Slurm thought they were complete with feasible start and end times. Again, this worked for me. I don't know if this is your problem or not. If you choose this route, be careful and good luck!
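
The three separate updates in the session above could also be applied as one guarded statement inside a transaction, so that a mistake rolls back cleanly. The sketch below only illustrates that idea: it uses Python's built-in sqlite3 with an in-memory table as a stand-in for the real MySQL database, the table has just the columns shown in the session (the real job table has more), and the guard on `time_end = 0` is an added assumption, not something Slurm requires. The values are the ones from the session.

```python
import sqlite3
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Render a Unix timestamp as an ISO-style UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# Stand-in for slurm_acct_db: an in-memory SQLite table with a subset of columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE banana_job_table ("
    "id_job INTEGER, state INTEGER, time_submit INTEGER, "
    "time_start INTEGER, time_end INTEGER)"
)
conn.execute("INSERT INTO banana_job_table VALUES (899139, 0, 1546880711, 0, 0)")
conn.commit()

# 'with conn' commits on success and rolls back if an exception is raised.
with conn:
    cur = conn.execute(
        "UPDATE banana_job_table "
        "SET state = ?, time_start = ?, time_end = ? "
        "WHERE id_job = ? AND time_end = 0",  # only touch rows still lacking an end time
        (3, 1546880712, 1546880713, 899139),
    )
    if cur.rowcount != 1:
        raise RuntimeError(f"expected to change 1 row, changed {cur.rowcount}")

row = conn.execute(
    "SELECT state, time_submit, time_start, time_end "
    "FROM banana_job_table WHERE id_job = ?",
    (899139,),
).fetchone()
print(row)
print("submitted:", epoch_to_utc(row[1]))
```

Against a real accounting database the same warnings apply as above: stop slurmdbd and take a mysqldump backup first.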

Paul Edmon

Mar 6, 2019, 1:07:23 PM
to slurm...@lists.schedmd.com

A lot of this is automated in the newer versions of Slurm.  You should just need to run:

sacctmgr show runawayjobs

It will then give you an option to clean them and Slurm will handle the rest.  If you add the -i option it will just clean them automatically.

-Paul Edmon-

Brian Andrus

Mar 6, 2019, 1:24:20 PM
to slurm...@lists.schedmd.com

I am running the latest and did that, but it didn't change anything. The jobs stay in the runaway state and no changes are made to the database.

Using 18.08.2-1.

Maybe try updating to 19.05.0-0pre1?

Brian

Paul Edmon

Mar 6, 2019, 1:32:28 PM
to slurm...@lists.schedmd.com

Odds are the new version won't help for that.  You will have to do some mysql work to fix it then.

-Paul Edmon-
