[slurm-users] Decreasing time limit of running jobs (notification)

Amjad Syed

Jul 6, 2023, 11:54:06 AM
to Slurm User Community List
Hello

We were trying to increase the time limit of a running Slurm job to 16 days:

scontrol update job=<jobid> TimeLimit=16-00:00:00

But we accidentally set it to 16 hours instead:

scontrol update job=<jobid> TimeLimit=16:00:00

This caused the job to time out, and it was killed without any notification.

Is this a bug? Shouldn't the user be warned that the job will be killed?

Amjad

Jason Simms

Jul 6, 2023, 12:05:10 PM
to Slurm User Community List
No, I would say this is not a bug. When the time limit is reached, that's it: the job dies. I'm not aware of a way to manage that. Once the time limit is reached, it wouldn't be a hard limit if you then had to notify the user and then... what? How long would you give them to extend the time? It wouldn't be much of a limit if a job could still be extended at that point, plus that would throw off the scheduler/estimator. I'd chalk it up to an unfortunate typo.

Jason
--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
Schedule a meeting: https://calendly.com/jlsimms

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Jul 6, 2023, 12:24:55 PM
to Slurm User Community List
Is the issue that the error in the time made it shorter than the time the job had already run, so it killed it immediately?

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil


Amjad Syed

Jul 6, 2023, 12:53:49 PM
to Slurm User Community List
Yes. The original time limit was 7-00:00:00, but Slurm accepted the typo (16:00:00), which caused the jobs to be killed without warning.

Amjad

Jason Simms

Jul 6, 2023, 1:14:52 PM
to Slurm User Community List
An unfortunate example of the “with great power comes great responsibility” maxim. Linux will gleefully let you rm -fr your entire system, drop production databases, etc., provided you have the right privileges. Ask me how I know…

Still, I get the point. Would it be possible to somehow ask for confirmation if you are setting a time limit that is less than the job's current elapsed walltime? Perhaps. Could you script that yourself? Yes, I'm certain of it. Those kinds of built-in safeguards aren't super common, however.
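The idea that this check could be scripted can be sketched as a small bash wrapper. This is a hypothetical helper, not a Slurm feature: the function names `slurm_to_seconds` and `safe_timelimit` are made up for illustration, and the sketch assumes that `squeue -h -j <jobid> -o %M` prints the time the job has used so far in Slurm's usual `[days-]hours:minutes:seconds` style. It only asks for confirmation when the new limit is not longer than the time already used, which is the situation that killed the job in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical guard around "scontrol update ... TimeLimit=" (not part of Slurm).
# Source this file, then call: safe_timelimit <jobid> <new TimeLimit>

# Convert a Slurm time string to seconds. Handles M, M:S, H:M:S, D-H,
# D-H:M and D-H:M:S; anything else (UNLIMITED, relative +/- values, typos)
# returns status 1 without printing anything.
slurm_to_seconds() {
  local t=$1 re='^([0-9]+-)?[0-9]+(:[0-9]+){0,2}$'
  local days h m s
  local -a f
  [[ $t =~ $re ]] || return 1
  if [[ $t == *-* ]]; then
    # Day form: D-H[:M[:S]]; missing minutes/seconds count as 0.
    days=${t%%-*}
    IFS=: read -r h m s <<< "${t#*-}"
    echo $(( ((10#$days * 24 + 10#$h) * 60 + 10#${m:-0}) * 60 + 10#${s:-0} ))
  else
    IFS=: read -r -a f <<< "$t"
    case ${#f[@]} in
      1) echo $(( 10#${f[0]} * 60 )) ;;                                   # M
      2) echo $(( 10#${f[0]} * 60 + 10#${f[1]} )) ;;                      # M:S
      3) echo $(( (10#${f[0]} * 60 + 10#${f[1]}) * 60 + 10#${f[2]} )) ;;  # H:M:S
    esac
  fi
}

# Refuse to shorten a job's limit to (or below) its elapsed time without a prompt.
safe_timelimit() {
  local jobid=$1 new=$2 used used_s new_s ans
  if [[ -z $jobid || -z $new ]]; then
    echo "usage: safe_timelimit <jobid> <new TimeLimit>" >&2
    return 2
  fi
  if [[ $new != UNLIMITED && $new != INFINITE ]]; then
    new_s=$(slurm_to_seconds "$new") || { echo "cannot parse '$new'" >&2; return 2; }
    # %M = time used so far by the job (assumed single line; job arrays not handled)
    used=$(squeue -h -j "$jobid" -o %M) || return 1
    used_s=$(slurm_to_seconds "$used") || { echo "cannot parse elapsed '$used'" >&2; return 1; }
    if (( new_s <= used_s )); then
      read -r -p "Job $jobid has already run $used; TimeLimit=$new will end it now. Continue? [y/N] " ans
      [[ $ans == [yY]* ]] || { echo "aborted" >&2; return 1; }
    fi
  fi
  scontrol update JobId="$jobid" TimeLimit="$new"
}
```

For example, `safe_timelimit 12345 16:00:00` (hypothetical job ID) on a job that has been running for two days would stop and ask before calling `scontrol`, while `safe_timelimit 12345 16-00:00:00` would go straight through. Anything empty or non-"y" at the prompt aborts, so the safe choice is the default. The sketch rejects relative `+`/`-` limits as unparseable rather than trying to interpret them.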

Jason

Amjad Syed

Jul 6, 2023, 1:38:39 PM
to Slurm User Community List
Agreed on the point about greater responsibility, but even rm -r (without -f) gives a warning. In this case, shouldn't Slurm have that option (forced), especially if it can immediately kill a running job?

Jason Simms

Jul 6, 2023, 1:43:53 PM
to Slurm User Community List
My opinion is no, at least not forced. 

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)

Jul 6, 2023, 1:48:32 PM
to Slurm User Community List
Given that the usual way to kill a running job is to use scancel, I would tend to agree that killing a job by shortening its walltime to below the time it has already used is likely to be an error, and deserves a warning.

Davide DelVento

Jul 10, 2023, 12:07:51 PM
to Slurm User Community List
Actually, rm -r does not give ANY warning by default, so on plain Linux, rm -r run as root will delete a whole directory tree without notice (on GNU systems, only rm's default --preserve-root stops a literal "rm -r /"). Your particular Linux distro may have added a safeguard with a shell alias such as `alias rm='rm -i'`; that's a common thing, but it's not guaranteed to be there.