[slurm-users] stopping job array after N failed jobs in a row


Josef Dvoracek

Aug 1, 2023, 9:49:39 AM
to Slurm User Community List
My users have discovered the beauty of job arrays, and they tend to use
them every now and then.

Sometimes the human factor steps in: something is wrong in the job array
specification, and the cluster "works" through one failed array job after another.

Is there any way to automatically stop/scancel a job array after,
let's say, 20 failed array jobs in a row?

My experience so far is that if the first ~20 array jobs go right, there is
no catastrophic failure in the sbatch file. If they fail, it's usually bad,
and there is no sense in crunching the remaining thousands of array jobs.

OT: what is the correct terminology for one item in a job array...
sub-job? job-array-job? :)

cheers

josef


Daniel Letai

Aug 1, 2023, 3:25:32 PM
to slurm...@lists.schedmd.com

I'm not sure about automatically canceling a job array, except perhaps by submitting two consecutive arrays: the first of size 20, and the other with the rest of the elements and an afterok dependency. That said, in the Slurm documentation a single job in a job array is referred to as a task. I personally prefer element, as in array element.


Consider creating a batch job with:


# submit the first 20 elements and capture the array's job ID
arrayid=$(sbatch --parsable --array=0-19 array-job.sh)

# submit the rest only once the whole first array has completed successfully
sbatch --dependency=afterok:$arrayid --array=20-50000 array-job.sh
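
One caveat, if I remember the sbatch docs correctly: if any of the first 20
elements fails, the second submission will sit pending forever with reason
DependencyNeverSatisfied, unless it is submitted with --kill-on-invalid-dep=yes
(or kill_invalid_depend is set in slurm.conf), in which case Slurm cancels it
instead.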


I'm not near a cluster right now, so I can't test for correctness. The main drawback is of course that if the first 20 jobs take a long time to complete and there are enough resources to run more than 20 jobs in parallel, those resources will be idle for the duration. Not a big issue on busy clusters, where some other job will run in the meantime, but it will delay completion of the array if 20 jobs use significantly less than the available resources.


It might also be possible to depend on afternotok of the first 20 tasks to run --wrap="scancel $arrayid".
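
A minimal sketch of that watchdog (untested, reusing $arrayid from the snippet
above; note that comma-separated dependencies require all listed jobs to fail,
while '?'-separated ones fire as soon as any one of them does):


# hypothetical watchdog: cancel the whole array as soon as task 0 or 1 fails
sbatch --kill-on-invalid-dep=yes \
       --dependency="afternotok:${arrayid}_0?afternotok:${arrayid}_1" \
       --wrap="scancel $arrayid"


Extend the '?'-separated list to cover as many early tasks as you want to
watch; --kill-on-invalid-dep=yes makes Slurm remove the watchdog itself if all
watched tasks succeed and the dependency can never be satisfied.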


Or, embedding the cancellation in the array job itself, maybe something like:


sbatch --array=1-50000 array-job.sh

with

cat array-job.sh

#!/bin/bash

# run the real work in the background so the watchdog step below can also start
srun myjob.sh "$SLURM_ARRAY_TASK_ID" &

# tasks beyond the first 20 cancel the whole array if the early tasks failed
# (note: comma-separated dependencies require ALL listed tasks to fail;
# separate them with '?' instead to trigger on ANY single failure)
[[ $SLURM_ARRAY_TASK_ID -gt 20 ]] && srun -d afternotok:${SLURM_ARRAY_JOB_ID}_1,afternotok:${SLURM_ARRAY_JOB_ID}_2,...afternotok:${SLURM_ARRAY_JOB_ID}_20 scancel $SLURM_ARRAY_JOB_ID

# wait for the background step so the task exits with myjob.sh's status
wait $!



This might also work. Untested, use at your own risk.


The other OTHER approach might be to use an epilog (or possibly EpilogSlurmctld) to log exit codes for the first 20 tasks in each array and cancel the array on any non-zero code. This is a global approach that affects all job arrays, so it might not be appropriate for your use case.

-- 
Regards,

--Dani_L.

Loris Bennett

Aug 2, 2023, 1:45:14 AM
to Slurm User Community List
Daniel Letai <da...@letai.org.il> writes:

> I'm not sure about automatically canceling a job array, except perhaps by submitting two consecutive arrays: the first of size 20, and the other with the
> rest of the elements and an afterok dependency. That said, in the Slurm documentation a single job in a job array is referred to as a task. I personally
> prefer element, as in array element.
>
> Consider creating a batch job with:
>
> arrayid=$(sbatch --parsable --array=0-19 array-job.sh)
>
> sbatch --dependency=afterok:$arrayid --array=20-50000 array-job.sh
>
> I'm not near a cluster right now, so I can't test for correctness. The main drawback is of course that if the first 20 jobs take a long time to complete
> and there are enough resources to run more than 20 jobs in parallel, those resources will be idle for the duration. Not a big issue on busy clusters,
> where some other job will run in the meantime, but it will delay completion of the array if 20 jobs use significantly less than the available resources.

I think running an initial sub-array is a good idea since, once it has
completed, it allows the user to check whether the right amount of
resources was requested. I often find users don't do this and end up,
say, specifying 10 or 100 times more memory than actually needed for an
array of several thousand jobs. This is obviously a problem even if the
jobs all complete successfully.
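
As a sketch of that check (untested here; the field names are from sacct's
--format option), once the initial sub-array has finished one can compare
requested and measured memory per element:


sacct -j $arrayid --units=G --format=JobID,State,Elapsed,ReqMem,MaxRSS


MaxRSS appears on the step lines; comparing it with ReqMem shows how
oversized the request was.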

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin

Michael DiDomenico

Aug 2, 2023, 12:10:17 PM
to slurm...@lists.schedmd.com
On Tue, Aug 1, 2023 at 3:27 PM Daniel Letai <da...@letai.org.il> wrote:
> The other OTHER approach might be to use an epilog (or possibly EpilogSlurmctld) to log exit codes for the first 20 tasks in each array and cancel the array on any non-zero code. This is a global approach that affects all job arrays, so it might not be appropriate for your use case.

You can set up a task prolog/epilog: just test for the error condition
in the task epilog and then cancel your array if need be.

https://slurm.schedmd.com/prolog_epilog.html

I've not tried it, nor do I know how it interacts with arrays, but it might work.
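
A rough sketch of that idea (untested; per the page above, SLURM_ARRAY_JOB_ID,
SLURM_ARRAY_TASK_ID and SLURM_JOB_EXIT_CODE are only exported for the slurmctld
epilog, so this would have to run as an EpilogSlurmctld script rather than a
per-task epilog):


#!/bin/bash
# hypothetical EpilogSlurmctld: cancel the rest of an array when one of
# its first 20 tasks exits non-zero (the threshold 20 is arbitrary)
[[ -n "$SLURM_ARRAY_JOB_ID" ]] || exit 0        # not an array task
[[ "$SLURM_ARRAY_TASK_ID" -le 20 ]] || exit 0   # only watch early tasks
# SLURM_JOB_EXIT_CODE is the raw wait() status, so non-zero means failure
if [[ "$SLURM_JOB_EXIT_CODE" -ne 0 ]]; then
    scancel "$SLURM_ARRAY_JOB_ID"
fi
exit 0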
