[slurm-users] How to automatically release jobs that failed with "launch failed requeued held"


Roger Moye

Jan 22, 2019, 12:42:42 PM
to Slurm User Community List

This morning several of our jobs failed and went into the “launch failed requeued held” state. We traced this to a failed prolog and fixed the problem, but the jobs remained held.

 

Is there a way to configure Slurm so that it automatically releases such a job from the held state so that it can run? There were plenty of healthy nodes for this job, so I’d prefer that the job not remain held indefinitely.

 

Thanks!

-Roger

 


Roger Moye

HPC Engineer


QUANTLAB Financial, LLC



Doug Meyer

Jan 22, 2019, 8:40:30 PM
to Slurm User Community List
scontrol release nnnnn

I'm not sure whether the system can be set to release jobs automatically, but I would not want it to: on a faulty system the job would go into an endless loop of start, fail, start.
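
If you have many held jobs and want to release them in one go once the prolog is fixed, something along these lines might work. It is only a sketch: the function names are made up for illustration, and it assumes that squeue's %r field prints exactly "launch failed requeued held" for these jobs, which you should check against your own squeue output first.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, and the reason text below
# is assumed to match what squeue prints in its %r field on your cluster.

# Read "jobid|reason" lines on stdin and print the IDs of jobs whose
# reason is "launch failed requeued held".
filter_launch_failed() {
    awk -F'|' '$2 == "launch failed requeued held" { print $1 }'
}

# List pending jobs as "jobid|reason" (no header) and release each match once.
release_launch_failed() {
    squeue -h -t PD -o '%i|%r' | filter_launch_failed | while read -r jobid; do
        scontrol release "$jobid"
    done
}

# Do nothing unless explicitly asked, so the file can be sourced safely.
if [ "${1:-}" = "--release" ]; then
    release_launch_failed
fi
```

Run it by hand with --release after you have confirmed the underlying fault is gone. It releases each job only once per run, so it does not create the start, fail, start loop unless you schedule it to run repeatedly.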

Doug