[slurm-users] slurmctld hourly: Unexpected missing socket error


Jason Ellul via slurm-users

Jul 15, 2024, 11:48:22 PM
to slurm...@lists.schedmd.com
Hi all,
 
I am hoping someone can help with our problem. Every hour after restarting slurmctld the controller becomes unresponsive to commands for 1 sec, reporting errors such as:
 
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
 
It occurs consistently at around the hour mark, but generally not at other times, unless we run a reconfigure or restart the controller. We don't see any issues in slurmdbd.log, and the errors are always RESPONSE message types. We have tried building a new server on different infrastructure, but the problem has persisted. Yesterday we even tried updating Slurm to v24.05.1 in the hope that it might provide a fix. During our troubleshooting we have set:
  • SchedulerParameters     = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
  • SlurmctldPort           = 6808-6817

Although the stats in sdiag have improved, we still see the errors.
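
A quick way to spot-check controller pressure around the hour mark (illustrative only; the grep pattern assumes the stock sdiag field names "Server thread count" and "Agent queue size"):

watch -n 5 'sdiag | grep -E "Server thread count|Agent queue size"'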

Our monitoring also shows a drop in network and disk activity during this 1 second, always approximately 1 hour after restarting the controller.

Many Thanks in advance
 
Jason
 
Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Centre

Patryk Bełzak via slurm-users

Jul 22, 2024, 4:04:57 AM
to Jason Ellul via slurm-users
Hi,
we've been facing the same issue for some time. At the beginning the missing socket error happened every 20 minutes, later once per hour, and now it happens a few times a day.
The only downside was that the controller was unresponsive for a couple of seconds each time (up to 60, if I remember correctly).
We tried to debug it in many ways, but we found no straightforward solution or root cause.

Things we've changed since the problem came up:
* RPC user limit: `SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384`
* made sure the VM that Slurm runs on uses the "network-latency" profile in `tuned`, with the same profile on the worker nodes (see the command sketch after this list)
* implemented some of the recommendations from https://slurm.schedmd.com/high_throughput.html on the controllers
* largely optimized the Slurm accounting database with some housekeeping and by cleaning up inactive accounts, associations, etc.
* optimized the SSSD configuration (this one, I believe, had the biggest impact) both on the controllers and on the worker nodes
plus plenty of other (probably unrelated) changes.
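
Switching and checking the tuned profile is just standard tuned-adm usage (nothing site-specific here):
tuned-adm profile network-latency
tuned-adm active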

I'm not really sure if any of the above helped us significantly in that matter.

Best regards,
Patryk Belzak.


Jason Ellul via slurm-users

Jul 22, 2024, 8:11:41 PM
to Patryk Bełzak, Jason Ellul via slurm-users

Hi Patryk,

Thanks so much for your email.

There are a couple of things you list that we have not tried yet, so we will definitely look at them. Your mention of optimizing SSSD has me curious: are you using Red Hat Identity Management (FreeIPA)? We are, and after going through our logs it appears the errors became more consistent after upgrading our instance and replica to RHEL 9.

May I ask what optimizations you put in place for SSSD?

Many thanks

Jason

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research

My onsite days are Mon, alt Wed and Friday.


Phone +61 3 8559 6546
Email Jason...@petermac.org

305 Grattan Street
Melbourne, Victoria
3000 Australia

www.petermac.org



 


Patryk Bełzak via slurm-users

Jul 24, 2024, 6:04:23 AM
to Jason Ellul via slurm-users
Hi,

we're on 389 Directory Server (aka 389ds), which is a pretty large instance. One of the optimizations was to create proper ACIs on the server side, which significantly improved lookup times on the Slurm controller and worker nodes. The second was to move the SSSD cache to tmpfs, following the Red Hat instructions: https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/tuning_performance_in_identity_management/assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments_tuning-performance-in-idm#mounting-the-sssd-cache-in-tmpfs_assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments
The entire chapter 9 may be helpful.
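
The tmpfs move boils down to an fstab entry roughly like this (a sketch only; the size, ownership and SELinux context are illustrative, so check the linked chapter for the exact options on your release):
tmpfs /var/lib/sss/db tmpfs size=300M,mode=0700,uid=sssd,gid=sssd,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
followed by stopping sssd, mounting the new filesystem, and starting sssd again.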

I also remembered that I recently modified the kernel settings to match the slurmd port range from slurm.conf (60000-63001) by creating /etc/sysctl.d/91-slurm.conf with the following content:
# set the ipv4 port range according to SlurmdPortRange in slurm.conf
net.ipv4.ip_local_port_range = 32768 63001
Unfortunately it hasn't stopped the error from occurring.

Best regards,
Patryk.


Jason Ellul via slurm-users

Jul 29, 2024, 7:49:04 PM
to Patryk Bełzak, Jason Ellul via slurm-users

Thanks again for your insights, Patryk. We have implemented many of the same things, but the socket errors are still occurring regularly.

 

If we find a solution that works I will be sure to add it to this thread.

 

Many thanks

 

Jason

 

 

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research

My onsite days are Mon, alt Wed and Friday.




Phone +61 3 8559 6546
Email Jason...@petermac.org

305 Grattan Street
Melbourne, Victoria
3000 Australia

www.petermac.org



Emyr James via slurm-users

Jan 12, 2026, 9:14:59 AM
to Patryk Bełzak, Jason Ellul via slurm-users, Jason Ellul
Hi,

We had the same error message. We see it happen when users submit job arrays with lots of short jobs, which produces a storm of job starts and job completions within a short space of time.

For every job start or job end, multiple messages go back and forth between the compute node, the Slurm controller, the database daemon and the database itself, and multiple rows are inserted and updated in the database. When there is a high turnover of jobs in a short amount of time, Slurm is unable to keep up and it effectively becomes a denial-of-service attack against the controller.

This command allowed me to see the users that had a lot of jobs completing within the last day or so (you can add start and end filters for more precision):

sacct --allusers -o jobid%20,user%20,jobname,state,exitcode,elapsedraw | grep -v batch | grep -v extern | grep -v RUNNING | awk '{print $2}' | sort | uniq -c | sort -nr | head

(note: because we are using pam_slurm_adopt we get 3 rows for each job, so I filter out the batch and extern rows)

For the users showing up here I then ran

sacct --allusers -o jobid%20,user%20,jobname,state,exitcode,elapsedraw | grep -v batch | grep -v extern | grep -v RUNNING | grep <username> | awk '{print $6}' | sort -n | uniq -c

substituting in the usernames. This shows the number of jobs with an elapsed time of 0, 1, 2, 3, etc. seconds. If you see multiple users with high counts in the sub-30-second range then this could be the reason. Submitting lots of short jobs (<30 s runtime) is an anti-pattern.

Instead of submitting, say, a 1000-element job array of 5-second jobs, the user could repackage this into 10 jobs that each loop over 100 of the individual tasks, giving 10 jobs of roughly 500 seconds each. This avoids the message storms. You could even request multiple cores and use GNU parallel to run the 100 tasks across a few cores; that makes the individual jobs finish faster than the expected 500 seconds, and you may get much higher CPU efficiency if these jobs are bottlenecked on IO. A sketch of this repackaging is below.
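
A minimal sketch of the repackaged submission, assuming a hypothetical per-element script ./process_element.sh that does what one of the original 5-second array tasks did:

#!/bin/bash
#SBATCH --array=0-9            # 10 array tasks instead of 1000
#SBATCH --cpus-per-task=4
CHUNK=100
START=$(( SLURM_ARRAY_TASK_ID * CHUNK ))
# run this task's 100 elements across the allocated cores with GNU parallel
seq "$START" $(( START + CHUNK - 1 )) | parallel -j "$SLURM_CPUS_PER_TASK" ./process_element.sh {}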

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation



Cutts, Tim via slurm-users

Jan 12, 2026, 9:29:32 AM
to Jason Ellul via slurm-users

Hi Em, let me guess: a naïve Nextflow workflow? 😊

 

I don't know how much difference it will make to reducing load on Slurm, but Seqera did recently accept a small patch from me that allows Nextflow to use the --only-job-state option in squeue. I don't know exactly when they will release it; it was scheduled for Nextflow 25.12.0-edge or something like that.

It's no substitute for actually writing efficient Nextflow workflows, in particular making use of Nextflow's buffer approach, which is really great for grouping lots of small tasks into a single job.
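
As I understand it, that squeue option asks the controller only for job states, so status polling becomes much cheaper; on a new enough Slurm release it is just something like:
squeue --only-job-state
(check squeue --help on your version; older releases do not have the flag).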

 

Tim



Markus Köberl via slurm-users

Jan 13, 2026, 4:01:15 AM
to slurm...@lists.schedmd.com, Jason Ellul
On Tuesday, July 16, 2024 5:45:52 AM Central European Standard Time Jason Ellul via slurm-users wrote:
> Hi all,
>
> I am hoping someone can help with our problem. Every hour after restarting
> slurmctld the controller becomes unresponsive to commands for 1 sec,
> reporting errors such as:
>
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error

With Slurm 25.11.1 I noticed similar errors with array jobs.
I solved it for my small cluster with:

SlurmctldParameters=conmgr_max_connections=4096
SlurmdParameters=conmgr_max_connections=512

It seems the default values were lowered and are no longer the ones stated in the documentation.
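
You can confirm what the running daemons actually picked up with standard scontrol usage (just a quick check, nothing version-specific):
scontrol show config | grep -i -E 'SlurmctldParameters|SlurmdParameters'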


regards
Markus Köberl

John Hearns via slurm-users

Jan 13, 2026, 4:17:03 AM
to Emyr James, Patryk Bełzak, Jason Ellul via slurm-users, Jason Ellul
You can increase the number of possible sockets.
You can also enable socket reuse or recycling; one of these is a bit dangerous, I cannot remember which. A sketch of the relevant sysctls is below.
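
For reference, the usual knobs behind "socket reuse or recycle" are the kernel's TCP TIME_WAIT sysctls (an illustrative sketch; tcp_tw_recycle is the risky one, it breaks clients behind NAT and was removed from Linux 4.12 onward):
# /etc/sysctl.d/90-tcp-tw.conf (illustrative)
net.ipv4.tcp_tw_reuse = 1        # reuse TIME_WAIT sockets for new outbound connections; generally safe
# net.ipv4.tcp_tw_recycle = 1    # the dangerous one; do not enable, removed in kernel 4.12
Apply with: sysctl --system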

I have a war story related to tcp connections. Might tell it later.