[slurm-users] SLURM starts new job before CG finishes

Erwin, James

Jan 3, 2020, 2:25:59 PM
to slurm...@schedmd.com

Hello,

 

I’ve recently updated a cluster to SLURM 19.05.4 and noticed that new jobs are starting on nodes still in the CG state. In an epilog I am running node health checks that last about 2-3 minutes. In the previous version (ancient 15.08), jobs would not start running on these nodes until the epilog was complete and the node was out of the CG state. Does anyone know why this overlap of R with CG might be happening?

 

There is a release note for version 19.05.3 that looks possibly related but I’m not exactly sure what it means:

 

* Changes in Slurm 19.05.3
==========================
...
 -- Nodes in COMPLETING state treated as being currently available for job
    will-run test.

 

 

Thanks,

James

 

Lyn Gerner

Jan 22, 2020, 12:27:33 PM
to Slurm User Community List, slurm...@schedmd.com
James, you might take a look at CompleteWait and KillWait.

Regards,
Lyn

Erwin, James

Feb 3, 2020, 8:58:39 AM
to Slurm User Community List, slurm...@schedmd.com

Hello,

Thank you for your reply, Lyn. I found a temporary workaround: the epilog touches a file in /tmp/, and a prolog waits until the epilog finishes and removes the file.
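For readers who want to try the same idea, here is a minimal sketch of that lock-file handshake in Python. It is not the script used on the cluster described in this thread: the lock file name, the function names, and the timeout and polling values are all assumptions chosen for illustration. The epilog would call `epilog_start()` before its health checks and `epilog_finish()` after them, and the prolog would call `prolog_wait()` before letting the next job proceed.

```python
import time
from pathlib import Path

# Hypothetical lock location; the thread only says "a file in /tmp/".
LOCK = Path("/tmp/epilog_running.lock")


def epilog_start(lock: Path = LOCK) -> None:
    """Call at the top of the epilog: mark the node as still being checked."""
    lock.touch()


def epilog_finish(lock: Path = LOCK) -> None:
    """Call once the epilog's checks are done: release the node."""
    lock.unlink(missing_ok=True)  # missing_ok needs Python 3.8+


def prolog_wait(lock: Path = LOCK, timeout: float = 300.0, poll: float = 1.0) -> bool:
    """Block until the lock file disappears.

    Returns True if the lock is gone, False if the (assumed) timeout
    expired while the epilog was still holding it.
    """
    deadline = time.monotonic() + timeout
    while lock.exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

A timeout is included so that a crashed epilog that never removes the file cannot hold up the prolog forever; what the prolog should do when `prolog_wait()` returns False is a site-specific decision and is left out of the sketch.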

I looked at CompleteWait before trying this workaround, but as it is described in the docs, I do not understand how it would help.

 

CompleteWait

The time, in seconds, given for a job to remain in COMPLETING state before any additional jobs are scheduled. If set to zero, pending jobs will be started as soon as possible. Since a COMPLETING job's resources are released for use by other jobs as soon as the Epilog completes on each individual node, this can result in very fragmented resource allocations. 

 

In my case, the epilog is still executing (according to ps and the health checks), and Slurm still starts new jobs on the node.

 

Thanks,

James

Paddy Doyle

Feb 6, 2020, 10:48:59 AM
to Slurm User Community List
Hi James,

Just for a slightly different take, 2-3 minutes seems a bit long for an
epilog script. Do you need to run all of those checks after every job?

Also, you describe it as running health checks; why not run those checks
via the HealthCheckProgram every HealthCheckInterval (e.g. 1 hour)?

Or better, split more job-specific checks into the Epilog and put general
node-specific checks into HealthCheckProgram.

But either way, as Lyn noted, you might still need to set CompleteWait to a
non-zero value to allow the epilog to finish.
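As a rough illustration of the suggestions above, a slurm.conf fragment might look like the following. The parameter names are real slurm.conf options, but every path and value here is an assumption for the sake of the example: the 240 seconds is simply a little longer than the 2-3 minute epilog mentioned in this thread, and 3600 seconds matches the "1 hour" example.

```
# Job-specific cleanup and checks only (hypothetical path)
Epilog=/etc/slurm/epilog.sh

# Seconds a job may remain in COMPLETING before additional jobs are scheduled
# (assumed value, slightly longer than the epilog runtime)
CompleteWait=240

# General node checks, run periodically rather than after every job
# (hypothetical path; interval is in seconds)
HealthCheckProgram=/etc/slurm/node_health.sh
HealthCheckInterval=3600
```

Whether CompleteWait actually prevents the overlap on 19.05.4 is not confirmed anywhere in this thread, so treat the fragment as a starting point to test, not a verified fix.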

Kind regards,
Paddy

--
Paddy Doyle
Research IT / Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
https://www.tchpc.tcd.ie/
