[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"


Marcelo Garcia

Jun 11, 2019, 9:58:07 AM
to slurm...@lists.schedmd.com
Hi

Since mid-March 2019 we have been having a strange problem with Slurm. Sometimes the "sbatch" command fails:

+ sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

Ecflow runs preprocessing on the script which generates a second script that is submitted to slurm. In our case, the submission script is called "42.job1".

The problem we have is that sometimes the "sbatch" command fails with the message above. We couldn't find any hint in the logs. Hardware and software logs are clean. I increased the debug level of Slurm:
# scontrol show config
(...)
SlurmctldDebug = info

But still no clue about what is happening. Maybe the next thing to try is to use "sdiag" to inspect the server. Another complication is that the problem is random, so should we put "sdiag" in a cronjob? Is there a better way to run "sdiag" periodically?
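One simple way to run sdiag periodically is a cron job; a hypothetical crontab entry (the paths and the 5-minute interval are assumptions, and note that % must be escaped in crontabs):

```
# Snapshot scheduler statistics every 5 minutes into a dated log
*/5 * * * * /usr/bin/sdiag >> /var/log/slurm/sdiag-$(date +\%Y\%m\%d).log 2>&1
```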

Thanks for your attention.

Best Regards

mg.

Steffen Grunewald

Jun 11, 2019, 10:28:23 AM
to Slurm User Community List
On Tue, 2019-06-11 at 13:56:34 +0000, Marcelo Garcia wrote:
> Hi
>
> Since mid-March 2019 we have been having a strange problem with Slurm. Sometimes the "sbatch" command fails:
>
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

I've seen such an error message from the underlying file system.
Is there anything special (e.g. non-NFS) in your setup that may have changed
in the past few months?

Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script which generates a second script that is submitted to slurm. In our case, the submission script is called "42.job1".
>
> The problem we have is that sometimes the "sbatch" command fails with the message above. We couldn't find any hint in the logs. Hardware and software logs are clean. I increased the debug level of Slurm:
> # scontrol show config
> (...)
> SlurmctldDebug = info
>
> But still no clue about what is happening. Maybe the next thing to try is to use "sdiag" to inspect the server. Another complication is that the problem is random, so should we put "sdiag" in a cronjob? Is there a better way to run "sdiag" periodically?
>
> Thanks for your attention.
>
> Best Regards
>
> mg.
>

- S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

Daniel Letai

Jun 12, 2019, 12:05:23 AM
to slurm...@lists.schedmd.com

I had similar problems in the past.

The 2 most common issues were:

1. Controller load - if slurmctld was under heavy load, it sometimes didn't respond in a timely manner, exceeding the timeout limit.

2. Topology, message forwarding, and aggregation.


For 2 - it would seem the nodes designated for forwarding are statically assigned based on topology. I could be wrong, but that's my observation: I would get the socket timeout error when those nodes had issues, even though other nodes in the same topology 'zone' were fine and could have been used instead.


It took debug3 to observe this in the logs, I think.


HTH

--Dani_L.

Marcelo Garcia

Jun 12, 2019, 3:43:41 AM
to Slurm User Community List
Hi Steffen

We are using Lustre as underlying file system:
[root@teta2 ~]# cat /proc/fs/lustre/version
lustre: 2.7.19.11

Nothing has changed. I think this has been happening for a long time, but it used to be very sporadic and only recently became more frequent.

Best Regards

mg.

Bjørn-Helge Mevik

Jun 12, 2019, 4:54:42 AM
to slurm...@schedmd.com
Another possible cause (we currently see it on one of our clusters):
delays in ldap lookups.

We have sssd on the machines, and occasionally, when sssd contacts the
ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
answer. If that happens while slurmctld is trying to look up some user
or group, client commands depending on it will hang. The
default message timeout is 10 seconds, so if the delay is more than
that, you get the timeout error.

We don't know why the delays are happening, but while we are debugging
it, we've increased the MessageTimeout, which seems to have reduced the
problem a bit. We're also experimenting with GroupUpdateForce and
GroupUpdateTime to reduce the number of times slurmctld needs to ask
about groups, but I'm unsure how much that helps.
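For reference, the parameters mentioned live in slurm.conf; a sketch with illustrative values (these are assumptions, not recommendations):

```
# slurm.conf (fragment) -- illustrative values only
MessageTimeout=30       # raise the client/daemon message timeout above the 10 s default
GroupUpdateForce=1      # refresh group membership info even if /etc/group is unchanged
GroupUpdateTime=600     # seconds between group membership updates
```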

--
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

Christopher Benjamin Coffey

Jun 12, 2019, 10:37:54 AM
to slurm-users, slurm...@schedmd.com
Hi, you may want to look into increasing the sssd cache lifetime on the nodes, and improving the network connectivity to your LDAP directory. I recall when playing with sssd in the past that it wasn't actually caching. Verify with tcpdump and an "ls -l" through a directory. Once a uid/gid is resolved, sssd shouldn't hit the directory again until the cache expires.
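If caching does turn out to be the issue, the relevant sssd.conf knob is entry_cache_timeout; a sketch (the domain section name and value are assumptions — check the sssd.conf man page for your version):

```
# /etc/sssd/sssd.conf (fragment) -- illustrative only
[domain/example.com]
# Keep resolved users/groups cached for 4 hours (the default is 5400 seconds)
entry_cache_timeout = 14400
```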

Do the nodes NAT through the head node?

Best,
Chris


Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

Marcus Wagner

Jun 12, 2019, 11:06:48 AM
to slurm...@lists.schedmd.com
Hi,

we hit the same issue, with up to 30,000 entries per day in the slurmctld log.

When we first used SL6 (Scientific Linux), we had massive problems with sssd, which often crashed. We therefore decided to get rid of sssd and fill /etc/passwd and /etc/group manually via a cron job.

So yes, we have an LDAP server, but it can't be the issue in our case, since user and group lookups are done locally.

Best
Marcus
--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de


Christopher Harrop - NOAA Affiliate

Jun 13, 2019, 10:48:21 AM
to Slurm User Community List
Hi,

My group is struggling with this also.

The worst part of this, which no one has brought up yet, is that sbatch does not necessarily fail to submit the job in this situation. In fact, most of the time (for us), it succeeds. There appears to be some sort of race condition going on. The job is often (maybe most of the time?) submitted just fine, but sbatch returns a non-zero status, indicating that the submission failed, and reports the error message.

From a workflow management perspective this is an absolute disaster that leads to workflow corruption and messes that are difficult to clean up. Workflow management systems rely on the exit status of sbatch to tell the truth about whether a job submission succeeded. If submission fails, the workflow manager will resubmit the job; if it succeeds, it expects a jobid to be returned. Because sbatch usually lies about the outcome when these events happen, workflow management systems think the submission failed and resubmit the job. This causes two copies of the same job to run at the same time, each trampling over the other and causing a cascade of further failures that are difficult to deal with.

The problem is that the job submission request has already been received by the time sbatch dies with that error; the timeout happens after the request has been made. I don’t know how one would solve this. In my experience interfacing various batch schedulers to workflow management systems, I’ve learned that attempting to time out qsub/sbatch/bsub/etc. commands always leads to a race condition. You can’t time them out (barring ridiculously long timeouts to catch truly pathological scenarios), because the request has already been sent and received; it’s the response that never makes it back to you. Because of the race condition, there is probably no way to guarantee that failure really means failure and success really means success while also using a timeout. The best option I know of is to never time out a job submission command (or only after a finite but very long time); just wait for the response. That’s the only way to get the correct answer.

One way I’m using to work around this is to inject a long random string into the --comment option. Then, if I see the socket timeout, I use squeue to look for that job and retrieve its ID. It’s not ideal, but it can work.
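A minimal sketch of that workaround (the wrapper and helper names are hypothetical; it assumes sbatch --parsable and squeue's %i/%k format specifiers for job id and comment):

```shell
#!/bin/sh
# Pure text filter: print the job id whose comment field matches the tag.
# Kept separate so it can be tested without a running Slurm installation.
find_tagged_job() {
    awk -v tag="$1" '$2 == tag { print $1; found = 1 } END { exit !found }'
}

# Hypothetical submit wrapper: tag the job via --comment; if sbatch times
# out, check whether the job was accepted anyway before declaring failure.
submit_with_tag() {
    script="$1"
    tag="wf-$(date +%s)-$$"                      # long, unique-ish tag
    if jobid=$(sbatch --parsable --comment="$tag" "$script"); then
        echo "$jobid"
        return 0
    fi
    sleep 5                                      # let slurmctld settle
    squeue -h -u "$USER" -o '%i %k' | find_tagged_job "$tag"
}
```

If the squeue check also comes up empty after a generous wait, only then would the workflow manager treat the submission as failed and resubmit.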

Chris

Jeffrey Frey

Jun 13, 2019, 11:07:02 AM
to Slurm User Community List
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT, which is only ever raised by slurm_send_timeout() and slurm_recv_timeout(). Those functions raise that error when a generic socket-based send/receive operation exceeds an arbitrary time limit imposed by the caller. They use gettimeofday() to grab an initial timestamp, and on each iteration of the poll() loop they call gettimeofday() again, calculate the delta from the initial and current values, and subtract it from the timeout period.


Do you have any reason to suspect that your local times are fluctuating on the cluster?  That use of gettimeofday() to calculate actual time deltas is not recommended for that very reason:


NOTES
       The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system
       administrator manually changes the system time).  If you need a monotonically increasing clock, see
       clock_gettime(2).







::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::




Mark Hahn

Jun 13, 2019, 11:50:41 AM
to Slurm User Community List
On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
...
> One way I’m using to work around this is to inject a long random string
> into the --comment option. Then, if I see the socket timeout, I use squeue
> to look for that job and retrieve its ID. It’s not ideal, but it can work.

I would have expected a different approach: use a unique string for the
jobname, and always verify after submission. After all, squeue provides
a --name parameter for this (an efficient query by logical job "identity").

regards, mark hahn.

John Hearns

Jun 13, 2019, 12:03:42 PM
to Slurm User Community List
I agree with Christopher Coffey - look at the sssd caching.
I have had experience with sssd and can help a bit.
Also, if you are seeing long waits, could you have nested groups?
sssd is notorious for not handling these well, and there are settings in the configuration file you can experiment with.

Christopher W. Harrop

Jun 13, 2019, 12:04:30 PM
to Slurm User Community List
> ...
>> One way I’m using to work around this is to inject a long random string
>> into the --comment option. Then, if I see the socket timeout, I use squeue
>> to look for that job and retrieve its ID. It’s not ideal, but it can work.
>
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission. After all, squeue provides
> a --name parameter for this (an efficient query by logical job "identity").

The job name is already in use, and it is not unique because there may be many copies of a workflow running at the same time by the same user. There is essentially no difference between verifying a match with jobname and a match with the comment; it’s just a different field of the output you’re looking at, which you can control with format options.


Bjørn-Helge Mevik

Jun 14, 2019, 3:04:56 AM
to slurm...@schedmd.com
Christopher Benjamin Coffey <Chris....@nau.edu> writes:

> Hi, you may want to look into increasing the sssd cache length on the
> nodes,

We have thought about that, but it will not solve the problem, only make
it less frequent, I think.

> and improving the network connectivity to your ldap
> directory.

That is something we are investigating, yes.

> I recall when playing with sssd in the past that it wasn't
> actually caching. Verify with tcpdump, and "ls -l" through a
> directory. Once the uid/gid is resolved, it shouldn't be hitting the
> directory anymore till the cache expires.

We turned up the logging of the AD backend, and the logs indicate that
the caching works in our case: First time you look up a user/group in a
while, the backend gets the request, but subsequent lookups never reach
the backend (at least not according to the logs), which should mean that
sssd has cached the info.

> Do the nodes NAT through the head node?

We do, but we see the sssd delays on the head node as well, and on other
nodes outside the cluster that use the same ldap/ad servers. But we
_do_ have a quite complicated network setup due to security, so there
might be something there. I'm currently trying to get my hands on the
logs from the servers themselves to see whether they actually get the
requests at the time the sssd backend claims to make them.

--
Regards,

Marcelo Garcia

Jun 14, 2019, 4:41:05 AM
to Slurm User Community List
Hi Chris

You are right in pointing out that the job actually runs, despite the error from sbatch. The customer mentioned that:
=== start ===
Problem had usual scenario - job script was submitted and executed, but sbatch command returned non-zero exit status to ecflow, which thus assumed job to be dead.
=== end ===

Which version of Slurm are you using? I'm using 17.02.4-1, and we are wondering about upgrading to a newer version; that is, I hope there was a bug and SchedMD has fixed it.

Best Regards

mg.

-----Original Message-----
From: slurm-users [mailto:slurm-use...@lists.schedmd.com] On Behalf Of Christopher Harrop - NOAA Affiliate
Sent: Donnerstag, 13. Juni 2019 16:47
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"


Christopher Harrop - NOAA Affiliate

Jun 14, 2019, 9:30:14 AM
to Slurm User Community List
> Hi Chris
>
> You are right in pointing out that the job actually runs, despite the error from sbatch. The customer mentioned that:
> === start ===
> Problem had usual scenario - job script was submitted and executed, but sbatch command returned non-zero exit status to ecflow, which thus assumed job to be dead.
> === end ===
>
> Which version of Slurm are you using? I'm using 17.02.4-1, and we are wondering about upgrading to a newer version; that is, I hope there was a bug and SchedMD has fixed it.

Sorry, I missed that. I am not the admin of the system, but I believe we are using 18.08.7. I believe we have a ticket open with SchedMD and our admin team is working with them. The approach being taken is to capture statistics with sdiag and use that information to tune configuration parameters. It is my understanding that they view the problem as a configuration issue rather than a bug in the scheduler, which to me means the timeouts can only be minimized, not eliminated. And because workflow corruption is such a disastrous event, I have built in attempts to work around it even though occurrences are “rare”.

Chris