[slurm-users] Randomly draining nodes

Nacereddine Laddaoui via slurm-users

Oct 7, 2024, 6:08:54 AM
to slurm...@lists.schedmd.com

Hello everyone,

I’ve recently encountered an issue where some nodes in our cluster randomly enter a drain state, typically after completing long-running jobs. Below is the output of the sinfo command showing the reason “Prolog error”:

root@controller-node:~# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Prolog error         slurm     2024-09-24T21:18:05 node[24,31]

When checking the slurmd.log files on the nodes, I noticed the following errors:

[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step.
(repeated 90 times)
[2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.

...

[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514                                     
[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
[2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
[2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory     

If you know how to solve these errors, please let me know. I would greatly appreciate any guidance or suggestions for further troubleshooting.

Thank you in advance for your assistance.

Best regards,

--
Télécom Paris
Nacereddine LADDAOUI
Research and Development Engineer

19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex
A school of IMT

Laura Hild via slurm-users

Oct 8, 2024, 3:02:04 PM
to Nacereddine Laddaoui, slurm...@lists.schedmd.com
Apologies if I'm missing this in your post, but do you in fact have a Prolog configured in your slurm.conf?



laddaoui--- via slurm-users

Oct 11, 2024, 11:25:01 AM
to slurm...@lists.schedmd.com
Hi Laura,

Thank you for your reply.

Indeed, Prolog is not configured on my machine:
$ scontrol show config |grep -i prolog
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc,Contain
ResvProlog = (null)
SrunProlog = (null)
TaskProlog = (null)

Does it have to be set on all machines?

Laura Hild via slurm-users

Oct 15, 2024, 2:33:47 PM
to ladd...@telecom-paris.fr, slurm...@lists.schedmd.com
Your slurm.conf should be the same on all machines (is it? You don't have Prolog configured on some machines but not others?), but no, it is not mandatory to use a prolog. I am simply surprised that you could get a "Prolog error" without having a prolog configured, since an error in the prolog program itself is how I always get that error. Yours must be some kind of communication problem, or a difference in expectation between the daemons about which requests ought to be exchanged.
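If you want to double-check, comparing checksums of slurm.conf across the cluster is a quick test. Something along these lines, where the pdsh/dshbak invocation, the node range, and the path are only an illustration for your environment:

# compare slurm.conf checksums across the compute nodes (hypothetical range)
pdsh -w node[01-32] md5sum /etc/slurm/slurm.conf | dshbak -c
# and against the controller's copy
md5sum /etc/slurm/slurm.conf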

laddaoui--- via slurm-users

Oct 21, 2024, 7:37:50 AM
to slurm...@lists.schedmd.com
You were right: I found that the slurm.conf file was different between the controller node and the compute nodes, so I've synchronized it now. I was also considering setting up an epilog script to help debug what happens after a job finishes. Do you happen to have any examples of what an epilog script might look like?

However, I'm now encountering a different issue:

REASON               USER      TIMESTAMP           NODELIST
Kill task failed     root      2024-10-21T09:27:05 nodemm04
Kill task failed     root      2024-10-21T09:27:40 nodemm06

I also checked the logs and found the following entries:

On nodemm04:

[2024-10-21T09:27:06.000] [223608.extern] error: *** EXTERN STEP FOR 223608 STEPD TERMINATED ON nodemm04 AT 2024-10-21T09:27:05 DUE TO JOB NOT ENDING WITH SIGNALS ***

On nodemm06:

[2024-10-21T09:27:40.000] [223828.extern] error: *** EXTERN STEP FOR 223828 STEPD TERMINATED ON nodemm06 AT 2024-10-21T09:27:39 DUE TO JOB NOT ENDING WITH SIGNALS ***

It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this?

Thanks for your help!

Christopher Samuel via slurm-users

Oct 22, 2024, 1:01:59 AM
to slurm...@lists.schedmd.com
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:

> It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this?

That usually means processes are wedged in the kernel for some reason, in an
uninterruptible sleep state. You can define an "UnkillableStepProgram" to be
run on the node when that happens to capture useful state information, for
example by iterating through the processes in the job's cgroup, dumping their
`/proc/$PID/stack` somewhere useful, getting the `ps` info for those same
processes, and/or doing an `echo w > /proc/sysrq-trigger` to make the kernel
dump all blocked tasks.
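
A very rough sketch of what such a program could look like, assuming
SLURM_JOB_ID is in its environment and guessing at the cgroup layout
(which differs between cgroup v1 and v2 and across Slurm versions):

#!/bin/bash
# illustrative UnkillableStepProgram sketch -- adjust paths for your site
LOG=/var/log/slurm/unkillable-${SLURM_JOB_ID:-unknown}.log
{
  echo "=== $(date) job ${SLURM_JOB_ID:-?} on $(hostname) ==="
  # walk the job's cgroup(s); these globs are a guess at the layout
  for procs in /sys/fs/cgroup/*/slurm*/uid_*/job_${SLURM_JOB_ID}/cgroup.procs \
               /sys/fs/cgroup/*/slurm*/uid_*/job_${SLURM_JOB_ID}/*/cgroup.procs; do
    [ -r "$procs" ] || continue
    while read -r pid; do
      echo "--- pid $pid ---"
      ps -o pid,ppid,stat,wchan:32,cmd -p "$pid"
      cat "/proc/$pid/stack" 2>/dev/null   # kernel stack, needs root
    done < "$procs"
  done
  # make the kernel dump all blocked (D state) tasks to dmesg
  echo w > /proc/sysrq-trigger
} >> "$LOG" 2>&1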

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Paul Raines via slurm-users

Oct 22, 2024, 10:48:58 AM
to slurm-users

I have a cron job that emails me when hosts go into drain mode and
tells me the reason (scontrol show node=$host | grep -i reason).
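
Something along these lines does the job (simplified sketch; the address
is a placeholder):

#!/bin/bash
# cron sketch: mail the reasons for any drained/draining nodes
ADMIN=hpc-admin@example.org    # placeholder
drained=$(sinfo -R -h)
if [ -n "$drained" ]; then
    printf '%s\n' "$drained" | mail -s "Drained nodes" "$ADMIN"
fi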

We get drains with the "Kill task failed" reason probably about 5 times a
day, despite having UnkillableStepTimeout=180.

Right now we are still handling them manually by sshing to the node
and running a script we wrote called clean_cgroup_jobs, which looks
for the unkilled processes using the cgroup info for the job.

If it finds none, it deletes the cgroups for the job and we resume
the node. That covers about 95% of the cases.
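
The core idea, boiled down to a sketch (the cgroup paths depend on your
cgroup setup and Slurm release, so treat these globs as illustrative):

#!/bin/bash
# sketch: list leftover processes in a job's cgroups; if none, remove the cgroups
JOBID=$1
found=0
for procs in /sys/fs/cgroup/*/slurm*/uid_*/job_${JOBID}/cgroup.procs \
             /sys/fs/cgroup/*/slurm*/uid_*/job_${JOBID}/*/cgroup.procs; do
    [ -s "$procs" ] || continue
    found=1
    echo "Job $JOBID still has processes in $procs:"
    while read -r pid; do
        ps -o pid,stat,wchan:32,cmd -p "$pid"
    done < "$procs"
done
if [ "$found" -eq 0 ]; then
    # nothing left: remove the now-empty job cgroups (children first)
    find /sys/fs/cgroup/*/slurm*/uid_* -depth -type d \
         \( -path "*/job_${JOBID}" -o -path "*/job_${JOBID}/*" \) \
         -exec rmdir {} + 2>/dev/null
    # the node can then be resumed: scontrol update nodename=<node> state=resume
fi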

In the case of a truly unkillable process, it lists the process and we
investigate manually. Often the problem turns out to be a hung NFS mount,
and we have various ways of dealing with that, which can involve faking
the IP of the offline NFS server on another server to make the node's
NFS client kernel process finally exit.

In the rare case where we cannot find a way to kill the unkillable process,
we arrange to reboot the node.


-- Paul Raines (http://help.nmr.mgh.harvard.edu)

Ole Holm Nielsen via slurm-users

Oct 22, 2024, 2:07:22 PM
to slurm...@lists.schedmd.com
On 22-10-2024 16:46, Paul Raines via slurm-users wrote:
> I have a cron job that emails me when hosts go into drain mode and
> tells me the reason (scontrol show node=$host | grep -i reason)

Instead of cron you can also use Slurm triggers; see for example our
scripts at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers
You can tailor the triggers to do whatever tasks you need.
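
For example, a permanent trigger that runs a program whenever nodes become
drained can be set up roughly like this (the program path is just a
placeholder; Slurm passes the affected node list to the program as an
argument):

strigger --set --drained --flags=PERM --program=/usr/local/sbin/notify_drained_nodes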

> We get drains with the "Kill task failed" reason probably about 5 times a day. This despite having UnkillableStepTimeout=180

Some time ago it was recommended that UnkillableStepTimeout values above
127 (or 256?) should not be used, see
https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this
restriction is still valid with recent versions of Slurm?

Best regards,
Ole

Christopher Samuel via slurm-users

Oct 24, 2024, 1:53:59 AM
to slurm...@lists.schedmd.com
Hi Ole,

On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote:

> Some time ago it was recommended that UnkillableStepTimeout values above
> 127 (or 256?) should not be used, see https://support.schedmd.com/
> show_bug.cgi?id=11103.  I don't know if this restriction is still valid
> with recent versions of Slurm?

As I read it, that last comment includes a commit message for the fix to
that problem, and we happily use a much longer timeout than that without
apparent issue.

https://support.schedmd.com/show_bug.cgi?id=11103#c30
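
For reference, the two slurm.conf knobs involved are these (the values and
path here are only illustrative, not our production settings):

UnkillableStepTimeout=300
UnkillableStepProgram=/usr/local/sbin/unkillable_step_report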

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Ole Holm Nielsen via slurm-users

unread,
Oct 24, 2024, 2:28:41 AM10/24/24
to slurm...@lists.schedmd.com
Hi Chris,

Thanks for confirming that UnkillableStepTimeout can have larger values
without issues. Do you have some suggestions for values that would safely
cover network filesystem delays?

Best regards,
Ole

