
OpenSSH: cause of random kex_exchange_identification errors?


Vincent Lefevre

Feb 2, 2022, 9:50:05 AM
When I want to connect with SSH (ssh/scp) to some machine, I sometimes
get errors, either

kex_exchange_identification: Connection closed by remote host

or

kex_exchange_identification: read: Connection reset by peer

immediately after the connection attempt. This happens randomly,
and there are some periods where this happens quite often. The
client machine doesn't seem to matter, and this issue even
occurs from machines on the local network.

With ssh -vvv, the output ends with

debug1: Local version string SSH-2.0-OpenSSH_8.7p1 Debian-4
kex_exchange_identification: read: Connection reset by peer
Connection reset by [...] port 22

In the source, this corresponds to function kex_exchange_identification
in kex.c:

    len = atomicio(read, ssh_packet_get_connection_in(ssh),
        &c, 1);
    if (len != 1 && errno == EPIPE) {
        error_f("Connection closed by remote host");
        r = SSH_ERR_CONN_CLOSED;
        goto out;
    } else if (len != 1) {
        oerrno = errno;
        error_f("read: %.100s", strerror(errno));
        r = SSH_ERR_SYSTEM_ERROR;
        goto out;
    }

so either with EPIPE or with ECONNRESET, and this apparently occurs
before the exchange of banners.

I could reproduce the issue with telnet, which gives

[...]
Escape character is '^]'.
Connection closed by foreign host.

while one normally has

SSH-2.0-OpenSSH_7.9p1 Debian-10+deb10u2

just after the "Escape character..." line.

Note that this is different from a "Connection refused". Here, the
connection is accepted, but immediately closed.

The admin of the machine could see nothing particular in the logs.
He eventually modified the MaxStartups value, but this did not
solve the issue (but AFAIK, if this were the cause, there would
have been something about it in the logs). The machine has enough
available memory.

Any idea about the possible cause of these random errors?

--
Vincent Lefèvre <vin...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Hans

Feb 2, 2022, 10:20:05 AM
On Wednesday, 2 February 2022, 15:44:32 CET, Vincent Lefevre wrote:
Sounds weird. I wonder if there is a typo. Your message beginning with

kex_exchange_identif....

looks like a typo to me. I would have expected "key_exchange_....".

However, I did not check this, and maybe it is correct.
On the other hand, maybe this typo (if it really is a typo!) causes some
weird behaviour.

As I said, I may be wrong, but this is what I saw at once.

Other reasons might be a timing problem on the network. Maybe you can take a
look with Wireshark or similar to see whether there are network problems.

I ran into this one day on my wireless link: lots of packets had to be
retransmitted, which I only saw with Wireshark and which was not noticeable
during normal internet use.

Just some ideas.....

Does this help? I guess not, really....

Best regards

Hans

Vincent Lefevre

Feb 2, 2022, 10:30:05 AM
On 2022-02-02 16:12:32 +0100, Hans wrote:
> On Wednesday, 2 February 2022, 15:44:32 CET, Vincent Lefevre wrote:
> Sounds weird. I wonder if there is a typo. Your message beginning with
>
> kex_exchange_identif....
>
> looks like a typo to me. I would have expected "key_exchange_....".

No, that's really kex_ in the OpenSSH source, and I think that it just
means "key exchange" (the "exchange" in kex_exchange_identification is
about identification, as part of the key exchange, if I understand
correctly).

[...]
> Other reasons might be a timing problem on the network. Maybe you
> can take a look with wireshark or similar, if there are network
> problems.

Note that the error is always immediate. So this is not due to packet
loss or something like that.

Bijan Soleymani

Feb 2, 2022, 11:50:05 AM
On 2022-02-02 09:44, Vincent Lefevre wrote:
> In the source, this corresponds to function kex_exchange_identification
> in kex.c:
>
>     len = atomicio(read, ssh_packet_get_connection_in(ssh),
>         &c, 1);
>     if (len != 1 && errno == EPIPE) {
>         error_f("Connection closed by remote host");
>         r = SSH_ERR_CONN_CLOSED;
>         goto out;
>     } else if (len != 1) {
>         oerrno = errno;
>         error_f("read: %.100s", strerror(errno));
>         r = SSH_ERR_SYSTEM_ERROR;
>         goto out;
>     }
>
> so either with EPIPE or with ECONNRESET, and this apparently occurs
> before the exchange of banners.

If you look at the source of atomicio you will see that in this case it
will do a read() of 1 byte on the file descriptor used for communicating
with the other side.

atomicio() sets errno to EPIPE if any of the reads it does returns
0 bytes, and it returns the total number of bytes read, which will be
0 or 1 in this case.

So the failure modes are: 0 bytes read and read() didn't return an error
(reported as EPIPE), or 0 bytes read and read() did return an error
(read() returns -1 and sets errno to something other than EPIPE, e.g.
ECONNRESET).

But basically this means that the read on the socket fails, i.e.
nothing can be read from the network.

Bijan

David Wright

Feb 2, 2022, 12:00:06 PM
On Wed 02 Feb 2022 at 15:44:32 (+0100), Vincent Lefevre wrote:
> When I want to connect with SSH (ssh/scp) to some machine, I sometimes
> get errors, either
>
> kex_exchange_identification: Connection closed by remote host
>
> or
>
> kex_exchange_identification: read: Connection reset by peer
>
> immediately after the connection attempt. This happens randomly,
> and there are some periods where this happens quite often. The
> client machine doesn't seem to matter, and this issue even
> occurs from machines on the local network.

My only guess about what might be random is whether there's a race
between connecting via IPv4 and IPv6 (printed earlier in the debug log).

Does one end of any failing connection always involve 8.7p1
(that's from testing, isn't it)?

Does it happen on ports other than 22? (If you're like me, almost
everything I see goes through port 22 at one end or the other.)

Cheers,
David.

Greg Wooledge

Feb 2, 2022, 2:30:06 PM
On Wed, Feb 02, 2022 at 02:21:08PM -0500, gene heskett wrote:
> When I change something, like rebooting the rpi4 running my big Sheldon
> lathe, from debian buster to debian bullseye, the keyfile changes, and I
> get an explicit error telling me to run ssh-keygen to remove the
> offending key, which I do, [...]

What *I* would do is copy the host key files from the buster instance
(the one that your client recognizes as valid) into the bullseye
instance. That way, the client will recognize *both* server instances
as the same host.

The host keys are in the /etc/ssh/ directory in Debian. There are
several files, and they all begin with ssh_host. Just copy them over
and make sure the permissions are retained. (The ones without .pub on
the end are meant to be private, so they have tighter permissions.)

If you're not running Debian, but instead are running some perverse
derivative that changes everything but still calls its releases "buster"
and "bullseye" in order to maximize confusion, then your host keys might
be in some other directory.

gene heskett

Feb 2, 2022, 2:30:06 PM
When I change something, like rebooting the rpi4 running my big Sheldon
lathe, from debian buster to debian bullseye, the keyfile changes, and I
get an explicit error telling me to run ssh-keygen to remove the
offending key, which I do, and the next attempt then works as it auto-
registers the new key. But this machine is bullseye, and the stretch
before it didn't have a self-advising failure.

The update was forced on me: a nearly new 2T main drive died in the
night, losing everything, so I threw money at it and now I'm booting from
a 500G SSD, and 4 1T SSDs are in a raid10 as /home of 2T capacity. One
spinning rust drive remains, amanda's morgue. I've put smaller SSDs in
all my machines now, and the only problem I've had was on the pi where
I'm using usb3-to-sata cables to mount work drives, and an off-brand
cable died; I replaced the cable with a startech brand and the SSD was
good, didn't lose a byte. They are about 6x faster than spinning rust,
putting new life in old machines.

Working on that fast storage, I can rebuild a v5.16.2-rt12 realtime
preempt-rt kernel in armhf flavor for the rpi4 in around 20 minutes. The
first time I did that, on spinning rust and a rpi3, it took 13+ hours.
And I'm still running that older kernel on a rpi4. With a full xfce4 gui,
it runs until I cause a power failure by unplugging it. It has a small
UPS, and because my now passed wife had COPD and needed a dependable
oxygen supply, there is a 20kw generac in the back yard that starts in
about 4 seconds.

FWIW, we've not yet been able to make linuxcnc build on a bullseye
system; boost::python in the 3.9.2 version of python is a total
showstopper. The same calls in buster work fine with python 3.7.

Probably more than you wanted to know.

Cheers, Gene Heskett.
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author, 1940)
If we desire respect for the law, we must first make the law respectable.
- Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/gene>

David Wright

Feb 2, 2022, 4:00:06 PM
I do similar, after checking that the keys look as if they were
generated by the same scheme. I do this just after Grub has been
installed on the disk, ie at "Finish the installation". In a shell
on VC2, or another remote ssh connection, I type:

# mount /dev/<previous-Debian-partition> /mnt
# cp -ipr /mnt/etc/ssh/s*[by] /target/etc/ssh/
# cp -ipr /mnt/root/.ssh (and most of root's dotfiles) /target/root/

The reason I do this in the d-i is that I typically install
over an ssh connection, and when the machine reboots at the end
and I want to log in remotely to finish the configuration, I can
just type (as root on the local machine):

# ssh -X hostname

and I'm in.

To summarise: to install a new system, I visit the machine to
plug in a USB installer stick, boot up from it using the
one-time-boot option, and run these commands:

│ Choose language │
│ Configure the keyboard │
│ Detect and mount CD-ROM │
│ Load installer components from CD │
→ network-console: Continue installation remotely using SSH ←
│ Detect network hardware │
│ Configure the network │
│ Continue installation remotely using SSH │
set a password (I use the hostname)

and return to my comfortable chair. I never /have to/ revisit
the target machine again.¹

One other trick: I run the remote installer with:

$ ssh -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null installer@hostname

which avoids polluting my ~/.ssh/known_hosts with the ephemeral
host key being used by the installer.

¹ unless I want my stick back. (Desktop machines are configured
with magic-packet wake-up in the BIOS.)

Cheers,
David.

Vincent Lefevre

Feb 2, 2022, 10:20:05 PM
On 2022-02-02 14:21:08 -0500, gene heskett wrote:
> When I change something, like rebooting the rpi4 running my big Sheldon
> lathe, from debian buster to debian bullseye, the keyfile changes, and I
> get an explicit error telling me to run ssh-keygen to remove the
> offending key, which I do, and the next attempt then works as it auto-
> registers the new key.
[...]

I recall that in my case, the error occurs at the very beginning
of the connection, *before* authentication has started.

Henrique de Moraes Holschuh

Feb 5, 2022, 4:40:06 PM
On Wed, 02 Feb 2022, Vincent Lefevre wrote:
> When I want to connect with SSH (ssh/scp) to some machine, I sometimes
> get errors, either
>
> kex_exchange_identification: Connection closed by remote host
>
> or
>
> kex_exchange_identification: read: Connection reset by peer

That's a very early stage of the initial connection, and your SSH client
just noticed the remote server (or some middle box like a firewall, or
oversubscribed NAT gateway) dropped the TCP connection for whatever
reason.

It will be related to the TCP connection itself, and nothing else. IP
protocol family/address/port, or something in the TCP path misbehaving.

The most common reason is that the remote server disliked your IP
address and/or port due to /etc/hosts.allow/deny, firewalling, or
something in sshd_config. Ensure you check both IPv4 and IPv6.

> The admin of the machine could see nothing particular in the logs.
> He eventually modified the MaxStartups value, but this did not
> solve the issue (but AFAIK, if this were the cause, there would
> have been something about it in the logs). The machine has enough
> available memory.
>
> Any idea about the possible cause of these random errors?

The debugging needs to be done either in the server side, or on *both*
sides.

If you're using socket-forwarding stuff in the client side, check that.
This is exceedingly rare nowadays, so I doubt it. Stuff like SOCKS4 or
SOCKS5 TCP proxies.

Find what is actually listening on the TCP socket server-side; it might
not be sshd (interposers like systemd socket activation, xinetd/inetd,
etc.). The logs/access control you need to look at server-side might not
be sshd's in that case.

If it is sshd, ensure it is actually logging all you need, and carefully
study the logs.

If nothing helps, packet-dump both sides (client and server) and find
out what sent the TCP RST, as that might give you clues for the "why".
A middlebox might be doing it...

But get the remote admin to re-check server-side /etc/hosts.allow and
deny, sshd_config, etc. with an eye on "what might my SSH client be
using when it failed, and what it might have been using when it worked"
before all that debugging work. Don't forget to consider IPv6, or all
possibly outgoing ranges of IPv4 NAT, if any. It might pay off :-)

--
Henrique Holschuh

Vincent Lefevre

Feb 8, 2022, 5:20:06 AM
On 2022-02-05 18:39:27 -0300, Henrique de Moraes Holschuh wrote:
> On Wed, 02 Feb 2022, Vincent Lefevre wrote:
> > When I want to connect with SSH (ssh/scp) to some machine, I sometimes
> > get errors, either
> >
> > kex_exchange_identification: Connection closed by remote host
> >
> > or
> >
> > kex_exchange_identification: read: Connection reset by peer
>
> That's a very early stage of the initial connection, and your SSH client
> just noticed the remote server (or some middle box like a firewall, or
> oversubscribed NAT gateway) dropped the TCP connection for whatever
> reason.

Yes, this is what I observed, e.g. when I reproduced the error
with telnet.

> It will be related to the TCP connection itself, and nothing else. IP
> protocol family/address/port, or something in the TCP path misbehaving.
>
> The most common reason is that the remote server disliked your IP
> address and/or port due to /etc/hosts.allow/deny, firewalling, or
> something in sshd_config.

I could reproduce the issue from multiple IP addresses (both from
the local network and from external networks), and the errors are
completely random. Immediately after a failure, a new attempt can succeed.

fail2ban is running there, but an IP can get banned only after
authentication failures, and in that case the error is not the same:
the connection is not accepted at all (something like "connection
refused"), whereas with a kex_exchange_identification error the
connection is first accepted.

> Ensure you check both IPv4 and IPv6.

Only IPv4 is supported there (the host does not have an IPv6 address,
so there can't be any mistake).

> > The admin of the machine could see nothing particular in the logs.
> > He eventually modified the MaxStartups value, but this did not
> > solve the issue (but AFAIK, if this were the cause, there would
> > have been something about it in the logs). The machine has enough
> > available memory.
> >
> > Any idea about the possible cause of these random errors?
>
> The debugging needs to be done either in the server side, or on *both*
> sides.
>
> If you're using socket-forwarding stuff in the client side, check that.
> This is exceedingly rare nowadays, so I doubt it. Stuff like SOCKS4 or
> SOCKS5 TCP proxies.

Nothing like that.

> Find what is actually listening on the TCP socket server-side, it might
> not be sshd (interposers like systemd socket activation, xinetd/inetd,
> etc). The logs/access control you need to look at server-side might not
> be SSHD's in that case.

There's an xinetd running, but its config files show that it does not
handle sshd, and there's a /usr/sbin/sshd process running anyway (with
lots of children corresponding to all the current connections).

> If it is sshd, ensure it is actually logging all you need, and carefully
> study the logs.

It appears that the failure occurs too soon. The first thing that
sshd normally logs is an "Accepted publickey" line, but the connection
is closed before authentication. Apparently the admin could only see
my successful connections in the logs.

> If nothing helps, packet-dump both sides (client and server) and find
> out what sent the TCP RST, as that might give you clues for the "why".
> A middlebox might be doing it...
>
> But get the remote admin to re-check server-side /etc/hosts.allow and
> deny, sshd_config, etc. with an eye on "what might my SSH client be
> using when it failed, and what it might have been using when it worked"
> before all that debugging work. Don't forget to consider IPv6, or all
> possibly outgoing ranges of IPv4 NAT, if any. It might pay off :-)

/etc/hosts.allow and /etc/hosts.deny just contain comments.

But if the goal were to reject the connection before authentication,
then the connection should not be accepted in the first place.

Currently the errors have stopped (or are just too rare to reproduce).
If they occur again, more debugging could be done.

Vincent Lefevre

Jun 7, 2022, 12:00:05 PM
On 2022-02-05 18:39:27 -0300, Henrique de Moraes Holschuh wrote:
> If it is sshd, ensure it is actually logging all you need, and carefully
> study the logs.

Nothing interesting in the logs, according to the admins of the server.

> If nothing helps, packet-dump both sides (client and server) and find
> out what sent the TCP RST, as that might give you clues for the "why".
> A middlebox might be doing it...

I eventually did a packet capture on the client side as I was able to
reproduce the problem. When it occurs, I get the following sequence:

Client → Server: [SYN] Seq=0
Server → Client: [SYN, ACK] Seq=0
Client → Server: [ACK] Seq=1
Server → Client: [FIN, ACK] Seq=1
Client → Server: Client: Protocol (SSH-2.0-OpenSSH_9.0p1 Debian-1)
Server → Client: [RST] Seq=2
Client → Server: [FIN, ACK] Seq=33
Server → Client: [RST] Seq=2

So the issue comes from the server, which sends [FIN, ACK] to terminate
the connection. In OpenSSH's sshd.c, this could be due to

    if (unset_nonblock(*newsock) == -1 ||
        drop_connection(*newsock, startups) ||
        pipe(startup_p) == -1) {
        close(*newsock);
        continue;
    }

At least 2 kinds of errors are not logged:

* In unset_nonblock(), a "fcntl(fd, F_SETFL, val) == -1" condition.

* the "pipe(startup_p) == -1" condition.

I'm not sure about drop_connection(), which is related to MaxStartups.

Tim Woodall

Jun 7, 2022, 12:20:05 PM
On Tue, 7 Jun 2022, Vincent Lefevre wrote:

> On 2022-02-05 18:39:27 -0300, Henrique de Moraes Holschuh wrote:
>> If it is sshd, ensure it is actually logging all you need, and carefully
>> study the logs.
>
> Nothing interesting in the logs, according to the admins of the server.
>
>> If nothing helps, packet-dump both sides (client and server) and find
>> out what sent the TCP RST, as that might give you clues for the "why".
>> A middlebox might be doing it...
>
> I eventually did a packet capture on the client side as I was able to
> reproduce the problem. When it occurs, I get the following sequence:
>
> Client → Server: [SYN] Seq=0
> Server → Client: [SYN, ACK] Seq=0
> Client → Server: [ACK] Seq=1
> Server → Client: [FIN, ACK] Seq=1
> Client → Server: Client: Protocol (SSH-2.0-OpenSSH_9.0p1 Debian-1)
> Server → Client: [RST] Seq=2
> Client → Server: [FIN, ACK] Seq=33
> Server → Client: [RST] Seq=2
>
> So the issue comes from the server, which sends [FIN, ACK] to terminate
> the connection. In OpenSSH's sshd.c, this could be due to
>
>     if (unset_nonblock(*newsock) == -1 ||
>         drop_connection(*newsock, startups) ||
>         pipe(startup_p) == -1) {
>         close(*newsock);
>         continue;
>     }
>
> At least 2 kinds of errors are not logged:
>
> * In unset_nonblock(), a "fcntl(fd, F_SETFL, val) == -1" condition.
>
> * the "pipe(startup_p) == -1" condition.
>
> I'm not sure about drop_connection(), which is related to MaxStartups.
>

I've not seen the start of this thread, but is this occasional or always?
If occasional, how many concurrent connections do you have starting all
at once? The default ssh config has a super-annoying default that
randomly kills sessions if too many are handshaking at once.

It's the MaxStartups setting you allude to. I've been bitten by this
where cron jobs all start at the same time and ssh to the same host.

Vincent Lefevre

Jun 14, 2022, 7:00:06 AM
On 2022-06-07 17:19:12 +0100, Tim Woodall wrote:
> On Tue, 7 Jun 2022, Vincent Lefevre wrote:
> > I eventually did a packet capture on the client side as I was able to
> > reproduce the problem. When it occurs, I get the following sequence:
> >
> > Client → Server: [SYN] Seq=0
> > Server → Client: [SYN, ACK] Seq=0
> > Client → Server: [ACK] Seq=1
> > Server → Client: [FIN, ACK] Seq=1
> > Client → Server: Client: Protocol (SSH-2.0-OpenSSH_9.0p1 Debian-1)
> > Server → Client: [RST] Seq=2
> > Client → Server: [FIN, ACK] Seq=33
> > Server → Client: [RST] Seq=2
> >
> > So the issue comes from the server, which sends [FIN, ACK] to terminate
> > the connection. In OpenSSH's sshd.c, this could be due to
> >
> >     if (unset_nonblock(*newsock) == -1 ||
> >         drop_connection(*newsock, startups) ||
> >         pipe(startup_p) == -1) {
> >         close(*newsock);
> >         continue;
> >     }
> >
> > At least 2 kinds of errors are not logged:
> >
> > * In unset_nonblock(), a "fcntl(fd, F_SETFL, val) == -1" condition.
> >
> > * the "pipe(startup_p) == -1" condition.
> >
> > I'm not sure about drop_connection(), which is related to MaxStartups.
> >
>
> I've not seen the start of this thread but is this occasional or always?

Occasional. Someone else at my lab could reproduce the issue.
But the admins can't.

> If occasional, how many concurrent connections do you have starting all
> at once.

I'm not sure what you mean by "concurrent connections". The server
is an SSH gateway, so many users connect to it. But for the
client host above (my personal machine at my lab), this was the
only connection from this machine; note that I made this connection
only for testing, as there is no need to connect to this SSH gateway
from the lab.

> The default ssh config has a super-annoying default that
> randomly kills sessions if too many are handshaking at once.
>
> It's the MaxStartups setting you allude to. I've been bitten by this
> where cron jobs all start at the same time and ssh to the same host.

MaxStartups was increased in February, after I initially reported
the problem.

Since this is a Debian 10 machine with OpenSSH_7.9p1 Debian-10+deb10u2,
I should have quoted the code from this sshd.c version. Thus the
connection close issue should occur in

    if (unset_nonblock(*newsock) == -1) {
        close(*newsock);
        continue;
    }
    if (drop_connection(startups) == 1) {
        char *laddr = get_local_ipaddr(*newsock);
        char *raddr = get_peer_ipaddr(*newsock);

        verbose("drop connection #%d from [%s]:%d "
            "on [%s]:%d past MaxStartups", startups,
            raddr, get_peer_port(*newsock),
            laddr, get_local_port(*newsock));
        free(laddr);
        free(raddr);
        close(*newsock);
        continue;
    }
    if (pipe(startup_p) == -1) {
        close(*newsock);
        continue;
    }

Now, it appears that verbose() logs at SYSLOG_LEVEL_VERBOSE, which is
just below the default SYSLOG_LEVEL_INFO, so that nothing would be
logged by default concerning MaxStartups, if I understand correctly.

But the admins changed the log level to a debug one a few days ago,
and debug messages do appear, but nothing concerning my case
(I had sent the exact times of the failures to the admins).

BTW, the issue also occurs at night, when there should be very few
connections in the handshaking state.

Tim Woodall

Jun 14, 2022, 2:20:04 PM
It doesn't matter if they're from the same machine; the problem happens
if the target machine has too many connections that haven't finished
authenticating (but from what you say below, I doubt this is the problem).

>> The default ssh config has a super-annoying default that
>> randomly kills sessions if too many are handshaking at once.
>>
>> It's the MaxStartups setting you allude to. I've been bitten by this
>> where cron jobs all start at the same time and ssh to the same host.
>
> MaxStartups was increased in February, after I initially reported
> the problem.
>
So long as they've increased the first parameter, that should have
fixed it if it was the cause.

In the case where I hit it, it was a cron job starting an ssh connection
from multiple machines - 'out of hours', where 'convenience' was more
valuable than 'performance'.

I don't have any more suggestions, sorry. Do you know how unset_nonblock
can fail? Other than building a patched version with more logging, I
don't know what else to try that you haven't already done.

Tim.

Vincent Lefevre

Jun 14, 2022, 9:50:05 PM
On 2022-06-14 19:17:01 +0100, Tim Woodall wrote:
[MaxStartups limit]
> In the case where I hit it it was a cron job starting an ssh connection
> from multiple machines - 'out of hours' where 'convenience' was more
> valuable than 'performance'.

Note that I get the errors at random times of the day and night,
with periods where the error occurs quite often and other periods
where I cannot reproduce it.

> I don't have any more suggestions, sorry. Do you know how unset_nonblock
> can fail?

The source from misc.c is

int
unset_nonblock(int fd)
{
    int val;

    val = fcntl(fd, F_GETFL);
    if (val < 0) {
        error("fcntl(%d, F_GETFL): %s", fd, strerror(errno));
        return (-1);
    }
    if (!(val & O_NONBLOCK)) {
        debug3("fd %d is not O_NONBLOCK", fd);
        return (0);
    }
    debug("fd %d clearing O_NONBLOCK", fd);
    val &= ~O_NONBLOCK;
    if (fcntl(fd, F_SETFL, val) == -1) {
        debug("fcntl(%d, F_SETFL, ~O_NONBLOCK): %s",
            fd, strerror(errno));
        return (-1);
    }
    return (0);
}

Well, one should get at least a debug message. I had already told
the admins that last week. But no such debug message appears, even
when the connection succeeds! I'll try to get more information from
the admins, in particular which debug lines they claim to see.

Vincent Lefevre

Jun 15, 2022, 9:20:06 AM
On 2022-06-15 03:48:38 +0200, Vincent Lefevre wrote:
> The source from misc.c is
>
> int
> unset_nonblock(int fd)
> {
>     int val;
>
>     val = fcntl(fd, F_GETFL);
>     if (val < 0) {
>         error("fcntl(%d, F_GETFL): %s", fd, strerror(errno));
>         return (-1);
>     }
>     if (!(val & O_NONBLOCK)) {
>         debug3("fd %d is not O_NONBLOCK", fd);
>         return (0);
>     }
>     debug("fd %d clearing O_NONBLOCK", fd);
>     val &= ~O_NONBLOCK;
>     if (fcntl(fd, F_SETFL, val) == -1) {
>         debug("fcntl(%d, F_SETFL, ~O_NONBLOCK): %s",
>             fd, strerror(errno));
>         return (-1);
>     }
>     return (0);
> }
>
> Well, one should get at least a debug message. I had already told
> that to the admins last week. But no such debug message appears,
> even when the connection succeeds! I'll try to have more information
> from the admins, in particular which debug lines they claim to see.

They set LogLevel to DEBUG, which explains why the debug3() message
doesn't appear. They can see debug lines when my connection succeeds,
but nothing in case of immediate failure. So this would mean that it
is the pipe() from server_accept_loop() in sshd.c that fails, as
nothing is logged in that case.
nothing is logged in that case.

Vincent Lefevre

Jun 15, 2022, 10:00:06 AM
On 2022-06-15 15:10:17 +0200, Vincent Lefevre wrote:
> They set LogLevel to DEBUG, which explains why the debug3() message
> doesn't appear. They can see debug lines when my connection succeeds,
> but nothing in case of immediate failure. So this would mean that it
> is the pipe() from server_accept_loop() in sshd.c that fails, as
> nothing is logged in that case.

I've eventually submitted an enhancement request to get something
logged in case of pipe() failure:

https://bugzilla.mindrot.org/show_bug.cgi?id=3447