Slurm is trying to kill a job that has exceeded its time limit, but the job's processes don't die, so Slurm marks the node down because it treats this as a problem with the node. Increasing the value of GraceTime or KillWait might help:
- GraceTime: Specifies, in units of seconds, the preemption grace time to be extended to a job which has been selected for preemption. The default value is zero, no preemption grace time is allowed on this partition. Once a job has been selected for preemption, its end time is set to the current time plus GraceTime. The job's tasks are immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time. This second set of signals is sent to both the tasks and the containing batch script, if applicable. Meaningful only for PreemptMode=CANCEL. See also the global KillWait configuration parameter.
- KillWait: The interval, in seconds, given to a job's processes between the SIGTERM and SIGKILL signals upon reaching its time limit. If the job fails to terminate gracefully in the interval specified, it will be forcibly terminated. The default value is 30 seconds. The value may not exceed 65533.
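
As a rough sketch of what such a change could look like in slurm.conf: the values 120 and 300, the partition name "batch", and the node list are illustrative assumptions only, not recommendations, so adjust them to your site.

```
# slurm.conf fragment (all values illustrative)

# Global setting: seconds between SIGTERM and SIGKILL once a job
# reaches its time limit (default is 30).
KillWait=120

# Per-partition setting: preemption grace time in seconds (default is 0).
# "batch" and the node list are placeholder names.
PartitionName=batch Nodes=node[01-04] GraceTime=300
```

After editing the file, "scontrol reconfigure" should make the daemons reread it, and "scontrol show config | grep KillWait" shows the value currently in effect.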
--
Prentice