The problems start around time index 09:21:39.860302, when the CLIENT issues a TCP packet with the SACK option set (seemingly for a data segment which has already been seen); from that point on the connection hangs.
Full dump available via email.
09:12:13.444999 IP CLIENT.50727 > SERVER.ssh: S 2919512080:2919512080(0) win 5840 <mss 1460,sackOK,timestamp 799860259 0,nop,wscale 7>
09:12:13.535643 IP SERVER.ssh > CLIENT.50727: S 492909547:492909547(0) ack 2919512081 win 5792 <mss 1460,sackOK,timestamp 7126976 799860259,nop,wscale 2>
09:12:13.535717 IP CLIENT.50727 > SERVER.ssh: . ack 1 win 46 <nop,nop,timestamp 799860282 7126976>
09:12:13.665481 IP SERVER.ssh > CLIENT.50727: P 1:24(23) ack 1 win 1448 <nop,nop,timestamp 7127074 799860282>
09:12:13.665895 IP CLIENT.50727 > SERVER.ssh: . ack 24 win 46 <nop,nop,timestamp 799860314 7127074>
09:12:13.666044 IP CLIENT.50727 > SERVER.ssh: P 1:23(22) ack 24 win 46 <nop,nop,timestamp 799860314 7127074>
09:12:13.776318 IP SERVER.ssh > CLIENT.50727: . ack 23 win 1448 <nop,nop,timestamp 7127216 799860314>
...SNIPPED SUCCESSFUL TRAFFIC...
09:21:39.490740 IP SERVER.ssh > CLIENT.50727: P 18200:18464(264) ack 2991 win 2728 <nop,nop,timestamp 7692910 800001727>
09:21:39.490775 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800001755 7692910>
09:21:39.860245 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7693293 800001749>
09:21:39.860302 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800001847 7692910,nop,nop,sack sack 1 {12408:13856} >
09:21:40.453440 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7693887 800001749>
09:21:40.453495 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800001996 7692910,nop,nop,sack sack 1 {12408:13856} >
09:21:41.641821 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7695075 800001749>
09:21:41.641938 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800002293 7692910,nop,nop,sack sack 1 {12408:13856} >
09:21:44.017552 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7697451 800001749>
09:21:44.017622 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800002887 7692910,nop,nop,sack sack 1 {12408:13856} >
09:21:48.770051 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7702203 800001749>
09:21:48.770099 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800004075 7692910,nop,nop,sack sack 1 {12408:13856} >
09:21:58.274061 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7711707 800001749>
09:21:58.274180 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800006450 7692910,nop,nop,sack sack 1 {12408:13856} >
09:22:17.282035 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7730715 800001749>
09:22:17.282153 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800011202 7692910,nop,nop,sack sack 1 {12408:13856} >
09:22:55.298955 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7768731 800001749>
09:22:55.299023 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800020705 7692910,nop,nop,sack sack 1 {12408:13856} >
09:24:11.329853 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7844763 800001749>
09:24:11.329923 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800039711 7692910,nop,nop,sack sack 1 {12408:13856} >
09:26:11.331578 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 7964763 800001749>
09:26:11.331699 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800069708 7692910,nop,nop,sack sack 1 {12408:13856} >
09:26:13.011885 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800070128 7692910>
09:26:13.309032 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800070202 7692910>
09:26:13.901048 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800070350 7692910>
09:26:15.085103 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800070646 7692910>
09:26:17.453195 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800071238 7692910>
09:26:22.189378 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800072422 7692910>
09:26:31.661696 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800074790 7692910>
09:26:50.610374 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800079526 7692910>
09:27:28.499729 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800088998 7692910>
09:28:11.331256 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 8084763 800001749>
09:28:11.331354 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800099704 7692910,nop,nop,sack sack 1 {12408:13856} >
09:28:44.286495 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800107942 7692910>
09:30:11.330959 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 8204763 800001749>
09:30:11.331074 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800129701 7692910,nop,nop,sack sack 1 {12408:13856} >
09:30:44.298756 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800137942 7692910>
09:32:11.331661 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 8324763 800001749>
09:32:11.331727 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800159698 7692910,nop,nop,sack sack 1 {12408:13856} >
09:32:44.311051 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800167942 7692910>
09:34:11.332375 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 8444763 800001749>
09:34:11.332447 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800189695 7692910,nop,nop,sack sack 1 {12408:13856} >
09:34:44.323298 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800197942 7692910>
09:36:11.332073 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 8564763 800001749>
09:36:11.332166 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800219692 7692910,nop,nop,sack sack 1 {12408:13856} >
09:36:44.335591 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800227942 7692910>
09:38:44.351950 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800257942 7692910>
09:40:44.368172 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800287943 7692910>
At the end of the dump, the connection is dropped by the client side.
Darryl
[CCing netdev as it's where people competent on the subject are]
On Sun, Jul 29, 2007 at 06:59:26AM +0100, Darryl L. Miles wrote:
> CLIENT = Linux 2.6.20.1-smp [Customer build]
> SERVER = Linux 2.6.9-55.ELsmp [Red Hat Enterprise Linux AS release 4
> (Nahant Update 5)]
>
> The problems start around time index 09:21:39.860302 when the CLIENT issues
> a TCP packet with SACK option set (seemingly for a data segment which has
> already been seen) from that point on the connection hangs.
Where was the capture taken? On the CLIENT or on the SERVER (I suspect the client, from the timers)? A possible, but very unlikely, reason would be an MTU limitation somewhere, because the segment which never gets correctly ACKed is also the largest one in this trace. It would be strange, as the SACKs are sent immediately after, but it is something to consider anyway.
Also, if everything is right with the packets on the server side, then it would mean that it's the RHEL kernel which is buggy (which does not mean that the same bug does not exist in mainline).
Regards,
Willy
> Full dump available via email.
>
>
> 09:12:13.444999 IP CLIENT.50727 > SERVER.ssh: S 2919512080:2919512080(0)
> win 5840 <mss 1460,sackOK,timestamp 799860259 0,nop,wscale 7>
> 09:12:13.535643 IP SERVER.ssh > CLIENT.50727: S 492909547:492909547(0) ack
> 2919512081 win 5792 <mss 1460,sackOK,timestamp 7126976 799860259,nop,wscale
> 2>
> 09:12:13.535717 IP CLIENT.50727 > SERVER.ssh: . ack 1 win 46
> <nop,nop,timestamp 799860282 7126976>
> 09:12:13.665481 IP SERVER.ssh > CLIENT.50727: P 1:24(23) ack 1 win 1448
> <nop,nop,timestamp 7127074 799860282>
> 09:12:13.665895 IP CLIENT.50727 > SERVER.ssh: . ack 24 win 46
> <nop,nop,timestamp 799860314 7127074>
> 09:12:13.666044 IP CLIENT.50727 > SERVER.ssh: P 1:23(22) ack 24 win 46
> <nop,nop,timestamp 799860314 7127074>
> 09:12:13.776318 IP SERVER.ssh > CLIENT.50727: . ack 23 win 1448
> <nop,nop,timestamp 7127216 799860314>
>
> ...SNIPPED SUCCESSFUL TRAFFIC...
> On Sun, Jul 29, 2007 at 06:59:26AM +0100, Darryl L. Miles wrote:
> > CLIENT = Linux 2.6.20.1-smp [Customer build]
> > SERVER = Linux 2.6.9-55.ELsmp [Red Hat Enterprise Linux AS release 4
> > (Nahant Update 5)]
> >
> > The problems start around time index 09:21:39.860302 when the CLIENT issues
> > a TCP packet with SACK option set (seemingly for a data segment which has
> > already been seen) from that point on the connection hangs.
...That's DSACK and it's being correctly sent. To me, it seems unlikely to be the cause of this breakage...
> Where was the capture taken ? on CLIENT or on SERVER (I suspect client from
> the timers) ?
...I would guess the same based on the SYN timestamps (and from the DSACK timestamps)...
> A possible, but very unlikely reason would be an MTU limitation
> somewhere, because the segment which never gets correctly ACKed is also the
> largest one in this trace.
Limitation for 48-byte segments? You have to be kidding... :-) But yes, it seems that one of the directions is dropping packets for some reason, though I would not assume an MTU limitation... Or did you mean some other segment?
> Also, if everything is right with the packets on the server side, then
> it would means that it's the RHEL kernel which is buggy (which does not
> mean that the same bug does not exist in mainline).
...There are two independent signs that the server side is not getting the packets (the ack field is not advanced, nor do the retransmissions of old data cease), so IMHO that would be the first thing to rule out.
...There is a limit on how many times retransmission of a segment is tried, and in the dump both directions are trying to retransmit (and then finally give up). Perhaps the CLIENT->SERVER direction drops all packets after 09:21:39.490740 (or even before that), so that neither the cumulative ACK for seq 18464 nor the retransmissions of 2991:3039 from the CLIENT ever reach the SERVER (which I guess still has snd_una at 12408).
Is this reproducible? Can you somehow verify that the packets the CLIENT is sending are indeed received by the SERVER...?
--
i.
I had something similar (and, strangely enough, it only seemed to happen when the source and/or destination address was outside a given subnet). It boiled down to some switch acting really strangely, and God knows why. A tcpdump on *both* sides (server+client) showed that the *other* peer sent a FIN/RST. In the end I settled for setting net.ipv4.tcp_sack=0; no idea if the broken switch got replaced in the meantime.
Well, just letting you know that it is not necessarily limited to the computers themselves.
Jan
--
No, I was talking about the 1448-byte segments. But in fact I don't believe it much, because the SACKs are always retransmitted just afterwards.
BTW, some information is missing. It would have been better if the trace had been read with tcpdump -Svv; we would have gotten absolute sequence numbers and the TTL. Also, we do not know if there's a firewall between the two sides. Sometimes IDSes identify "attacks" in encrypted traffic and kill connections. That might have been the case here, with the connection closed one way on an intermediate firewall.
Regards,
Willy
> Is this reproducable? Can you somehow verify that the packets CLIENT is
> sending are indeed received by the SERVER...?
One possibility is drops due to checksum errors on the receiver; this tends to pop up from time to time. Let's see some SNMP statistics.
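(For example, a before/after snapshot of these lines should do; counter names as I recall them, so please double-check on your kernels. Checksum-related drops show up in the Tcp "InErrs" column, and assorted discard reasons in the TcpExt line:

  grep '^Tcp:' /proc/net/snmp
  grep '^TcpExt:' /proc/net/netstat
)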
Ah, but it's ACKed correctly right below it...:
[...snip...]
> > > > 09:21:39.490740 IP SERVER.ssh > CLIENT.50727: P 18200:18464(264) ack 2991
> > > > win 2728 <nop,nop,timestamp 7692910 800001727>
> > > > 09:21:39.490775 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378
> > > > <nop,nop,timestamp 800001755 7692910>
> > > > 09:21:39.860245 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991
> > > > win 2728 <nop,nop,timestamp 7693293 800001749>
...segment below snd_una arrived => snd_una remains 18464, receiver generates a duplicate ACK:
> > > > 09:21:39.860302 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378
> > > > <nop,nop,timestamp 800001847 7692910,nop,nop,sack sack 1 {12408:13856} >
The cumulative ACK field of it covers _everything_ below 18464 (i.e., it ACKs them), including the 1448 bytes in 12408:13856... In addition, the SACK block is DSACK information [RFC2883], explicitly reporting the range of the received duplicate block. However, if this ACK doesn't reach the SERVER TCP, the RTO is triggered and the first not-yet-cumulatively-ACKed segment is retransmitted (I guess cumulative ACKs up to 12408 arrived at the SERVER without problems):
> > > > 09:21:40.453440 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991
> > > > win 2728 <nop,nop,timestamp 7693887 800001749>
[...snip...]
> BTW, some information are missing. It would have been better if the trace
> had been read with tcpdump -Svv. We would have got seq numbers and ttl.
> Also, we do not know if there's a firewall between both sides. Sometimes,
> some IDS identify attacks in crypted traffic and kill connections. It
> might have been the case here, with the connection closed one way on an
> intermediate firewall.
Yeah, a firewall or some other issue; I'd say it's quite unlikely to be a bug in TCP, because the behavior in both directions indicates a client -> sender blackhole, each independently of the other...
--
i.
I'd say the most likely scenario is that the SERVER is behind a Cisco PIX firewall, which has known bugs in handling packets with the SACK option. By default the Cisco has sequence number randomization enabled, but it's a half-assed implementation which doesn't bother adjusting the sequence numbers inside SACK options. This has been reported to Cisco, and they don't seem to care. As a workaround, you can do this:
echo 0 > /proc/sys/net/ipv4/tcp_sack
and it will probably fix it up. It'd be really nice, however, to have a per-route option for SACK, similar to how we can clamp window scaling per route. Something like the below:
ip r a <host> <gw> <nosack>
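For comparison, the existing per-route window clamp (which I take to be what is meant above) looks roughly like this with iproute2 (syntax from memory, see ip-route(8); addresses are just placeholders), so a hypothetical 'nosack' flag would slot into the same place:

  ip route add 192.0.2.10 via 192.0.2.1 window 65535

And to make the tcp_sack workaround above survive a reboot, the usual approach is a line in /etc/sysctl.conf:

  net.ipv4.tcp_sack = 0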
Phil
Oh yes, you're damn right. I did not notice that the ACK was higher than the SACK; I'm more used to reading traces with absolute rather than relative seqs/acks.
So I agree, it is this ACK which is lost between client and server,
reinforcing the supposition about the location of the capture (client side).
> [...snip...]
>
> > BTW, some information are missing. It would have been better if the trace
> > had been read with tcpdump -Svv. We would have got seq numbers and ttl.
> > Also, we do not know if there's a firewall between both sides. Sometimes,
> > some IDS identify attacks in crypted traffic and kill connections. It
> > might have been the case here, with the connection closed one way on an
> > intermediate firewall.
>
> Yeah, firewall or some other issue, I'd say it's quite unlikely a bug in
> TCP because behavior to both directions indicate client -> sender
> blackhole independently of each other...
It would also be possible that something stupid between both ends simply drops packets with the SACK option set. Also, something which sometimes happens is that some firewalls automatically translate sequence numbers but not necessarily SACK values, which could pretty well lead to this packet being received but ignored on the server side.
I'm pretty sure that the same trace taken on the server side will reveal
the reason for the problem.
> On Sun, Jul 29, 2007 at 12:28:04PM +0300, Ilpo Järvinen wrote:
>
> > [...snip...]
> >
> > > BTW, some information are missing. It would have been better if the trace
> > > had been read with tcpdump -Svv. We would have got seq numbers and ttl.
> > > Also, we do not know if there's a firewall between both sides. Sometimes,
> > > some IDS identify attacks in crypted traffic and kill connections. It
> > > might have been the case here, with the connection closed one way on an
> > > intermediate firewall.
> >
> > Yeah, firewall or some other issue, I'd say it's quite unlikely a bug in
> > TCP because behavior to both directions indicate client -> sender
> > blackhole independently of each other...
>
> It would also be possible that something stupid between both ends simply
> drops packets with the SACK option set. Also something which sometimes
> happen is that some firewalls automatically translate sequence numbers
> but not necessarily SACK values, which could pretty well lead to this
> packet being received but ignored on the server side.
...One can toggle those off with /proc/sys/net/ipv4/tcp_dsack, but I suspect DSACKs are not the cause, because these retransmissions are not making it through either (there are many of them also earlier in the log, I just quoted the easiest ones :-) ):
> > > > 09:36:44.335591 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win
> > > > 378 <nop,nop,timestamp 800227942 7692910>
> > > > 09:38:44.351950 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win
> > > > 378 <nop,nop,timestamp 800257942 7692910>
> > > > 09:40:44.368172 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win
> > > > 378 <nop,nop,timestamp 800287943 7692910>
...there are no SACKs involved in them, yet no cumulative ACK ever arrives from the SERVER...
--
i.
I've been informed there is no firewall at all; the only firewalling that takes place is the local Linux netfilter, though I've never audited the kit at the SERVER end to know for sure.
I have run a tcpdump from the server side before now (not for the tcpdump I enclosed) when trying to diagnose the problem in the first place, and the triplet of packets:
09:28:11.331256 IP SERVER.ssh > CLIENT.50727: . 12408:13856(1448) ack 2991 win 2728 <nop,nop,timestamp 8084763 800001749>
09:28:11.331354 IP CLIENT.50727 > SERVER.ssh: . ack 18464 win 378 <nop,nop,timestamp 800099704 7692910,nop,nop,sack sack 1 {12408:13856} >
09:28:44.286495 IP CLIENT.50727 > SERVER.ssh: P 2991:3039(48) ack 18464 win 378 <nop,nop,timestamp 800107942 7692910>
That keeps repeating from that point on and was observable at both sides. No checksum errors have ever been shown in any tcpdump I saw.
The dump I posted was the first dump I was able to reliably reproduce (observing from the CLIENT end); I'm pretty sure that, given some time tomorrow, I could create two dumps of the same session (the view from each end). Looking a little more at the dump today, I see that at time index 09:21:39.860245 the server sends data from seq 12408:13856 again.
Here is a snippet that backs up a little more around the time the trouble starts; as SERVER send sequence 12408 seems to be the problem, maybe I need to illustrate what happened when that sequence was first sent.
Excuse my understanding if it seems a bit limited. SACK is only used when segments are dropped, and from inspecting the dump I can't spot where a lost segment occurs. My main question is: why does the CLIENT ack data 12408:13856 and beyond, all the way up to 18464 (@ 09:21:39.490775), and then start to SACK {12408:13856} (@ 09:21:39.860302)?
Does this mean the client is the buggy end, for going back on data it has already acked cleanly? That would be my interpretation of events.
Enclosed should be a tcpdump with -vvS of the same session as before
observing from the client side.
I hope I've answered all the points raised in this thread.
Darryl
There is another concern: why did the SERVER perform a retransmission in the first place, when the tcpdump shows the ACK covering it has been seen?
I have made available the full dumps at:
http://darrylmiles.org/snippets/lkml/20070731/
There are some changes in 2.6.22 that appear to affect TCP SACK handling; do these fix a known issue?
This sequence is interesting from the client side:
03:58:56.419034 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack
4239 win 2728 <nop,nop,timestamp 16345815 819458859> # S1
03:58:56.419100 IP CLIENT.43726 > SERVER.ssh: . ack 27464 win 501
<nop,nop,timestamp 819458884 16345815> # C1
03:58:56.422019 IP SERVER.ssh > CLIENT.43726: P 27464:28176(712) ack
4239 win 2728 <nop,nop,timestamp 16345815 819458859> # S2
03:58:56.422078 IP CLIENT.43726 > SERVER.ssh: . ack 28176 win 501
<nop,nop,timestamp 819458884 16345815> # C2
The above 4 packets look as expected to me. Then we suddenly see a retransmission of 26016:27464.
03:58:56.731597 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack
4239 win 2728 <nop,nop,timestamp 16346128 819458864> # S3
So the client, instead of discarding the retransmission of the duplicate segment, issues a SACK.
03:58:56.731637 IP CLIENT.43726 > SERVER.ssh: . ack 28176 win 501
<nop,nop,timestamp 819458962 16345815,nop,nop,sack sack 1 {26016:27464}
> # C3
In response to this the server is confused??? It responds to the sack {26016:27464}, but the client is also saying "wnd 28176". Wouldn't the server expect "wnd < 26016" if there is a segment to retransmit?
03:58:57.322800 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack
4239 win 2728 <nop,nop,timestamp 16346718 819458864> # S4
Now viewed from the server side:
03:58:56.365655 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack
4239 win 2728 <nop,nop,timestamp 16345815 819458859> # S1
03:58:56.365662 IP SERVER.ssh > CLIENT.43726: P 27464:28176(712) ack
4239 win 2728 <nop,nop,timestamp 16345815 819458859> # S2
03:58:56.374633 IP CLIENT.43726 > SERVER.ssh: . ack 24144 win 488
<nop,nop,timestamp 819458861 16345731> # propagation delay
03:58:56.381630 IP CLIENT.43726 > SERVER.ssh: . ack 25592 win 501
<nop,nop,timestamp 819458863 16345734> # propagation delay
03:58:56.384503 IP CLIENT.43726 > SERVER.ssh: . ack 26016 win 501
<nop,nop,timestamp 819458864 16345734> # propagation delay
03:58:56.462583 IP CLIENT.43726 > SERVER.ssh: . ack 27464 win 501
<nop,nop,timestamp 819458884 16345815> # C1
03:58:56.465707 IP CLIENT.43726 > SERVER.ssh: . ack 28176 win 501
<nop,nop,timestamp 819458884 16345815> # C2
The above packets are just as expected.
03:58:56.678546 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack
4239 win 2728 <nop,nop,timestamp 16346128 819458864> # S3
I guess the above packet is indeed a retransmission of "# S1", but why was it retransmitted when we can clearly see "# C1" above acks this segment? It is not even as if the retransmission escaped before the kernel had time to process the ack, as 200ms had elapsed. CONCERN NUMBER TWO.
03:58:56.774778 IP CLIENT.43726 > SERVER.ssh: . ack 28176 win 501
<nop,nop,timestamp 819458962 16345815,nop,nop,sack sack 1 {26016:27464}
> # C3
CONCERN NUMBER ONE: why, in response to that escaped retransmission, was a SACK the appropriate response? At the time the client sent the SACK it had received all data up to 28176, a fact it continues to advertise in the "# C3" packet above.
There is nothing wrong in the CLIENT expecting to see a retransmission of that segment at this point in time; that is an expected circumstance.
03:58:57.269529 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack
4239 win 2728 <nop,nop,timestamp 16346718 819458864> # S4
Darryl
> I've been able to capture a tcpdump from both ends during the problem and its
> my belief there is a bug in 2.6.20.1 (at the client side) in that it issues a
> SACK option for an old sequence which the current window being advertised is
> beyond it. This is the most concerning issue as the integrity of the sequence
> numbers doesn't seem right (to my limited understanding anyhow).
You probably didn't check the reference I explicitly gave for those who are not familiar with how DSACK works; just in case you didn't pick it up last time, here it is again for you: RFC2883... However, if DSACKs still really bother you (though they shouldn't :-)), IIRC I also told you how you're able to turn them off (the tcp_dsack sysctl), but I assure you that it's not a bug but a feature called DSACK [RFC2883]. There's _absolutely_ nothing wrong with it; instead, it would be wrong to _not_ send the below-snd_una SACK in this scenario when tcp_dsack is set to 1.
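(For completeness, on the CLIENT that toggle is simply:

  echo 0 > /proc/sys/net/ipv4/tcp_dsack

analogous to the tcp_sack one mentioned earlier, but it shouldn't be needed.)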
> There is another concern of why the SERVER performed a retransmission in the
> first place, when the tcpdump shows the ack covering it has been seen.
There are only three possible reasons for this:
1) The ACK didn't reach the SERVER (your logs prove this is not the case)
2) The ACK got discarded by the SERVER
3) The SERVER (not the client) is buggy and sends an incorrect
retransmission
...So we have just two options remaining...
> I have made available the full dumps at:
>
> http://darrylmiles.org/snippets/lkml/20070731/
Thanks for these... Based on a quick check, it is rather clear that the SERVER is for some reason discarding the packets it's receiving:
04:11:26.833935 IP CLIENT.43726 > SERVER.ssh: P 4239:4287(48) ack 28176 win 501 <nop,nop,timestamp 819646456 16345815>
04:11:27.132425 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack 4239 win 2728 <nop,nop,timestamp 17096579 819458864>
04:11:27.230081 IP CLIENT.43726 > SERVER.ssh: . ack 28176 win 501 <nop,nop,timestamp 819646555 16345815,nop,nop
Notice, the (cumulative) ack field didn't advance even though new data arrived, and, for the record, it's within the advertised window too. There are no DSACKs in here, so your theory about the below-snd_una SACK won't help to explain this one at all... We'll just have to figure out why it's discarding it. And there's even more to prove this...
> This sequence is interesting from the client side:
>
> 03:58:56.419034 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack 4239
> win 2728 <nop,nop,timestamp 16345815 819458859> # S1
> 03:58:56.419100 IP CLIENT.43726 > SERVER.ssh: . ack 27464 win 501
> <nop,nop,timestamp 819458884 16345815> # C1
> 03:58:56.422019 IP SERVER.ssh > CLIENT.43726: P 27464:28176(712) ack 4239
> win 2728 <nop,nop,timestamp 16345815 819458859> # S2
> 03:58:56.422078 IP CLIENT.43726 > SERVER.ssh: . ack 28176 win 501
> <nop,nop,timestamp 819458884 16345815> # C2
>
> The above 4 packets look as expect to me. Then we suddenly see a
> retransmission of 26016:27464.
>
> 03:58:56.731597 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack 4239
> win 2728 <nop,nop,timestamp 16346128 819458864> # S3
...Look at this on the retransmission:
... timestamp 16346128 819458864>
...it tells us what was really received by the TCP. The corresponding ACK with the matching timestamp is, surprise, surprise, this one:
> 03:58:56.340180 IP CLIENT.43726 > SERVER.ssh: . ack 26016 win 501
> <nop,nop,timestamp 819458864 16345734>
...thus the SERVER has _not_ received (or has discarded) the subsequent cumulative ACKs!!! Therefore it's retransmitting from 26016 onward but never receives any reply, as everything seems to get discarded...
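(A crude way to do this correlation on the -vvS text dumps, assuming they were saved as client.dump and server.dump, those names being placeholders:

  grep 'timestamp 819458864 ' client.dump    # CLIENT packets sent with TSval 819458864
  grep ' 819458864>' server.dump             # SERVER packets still echoing it as TSecr

If the SERVER keeps echoing that old TSval as its TSecr, it never processed any of the later ACKs.)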
There was one bad checksum btw:
> 03:58:56.365662 IP (tos 0x10, ttl 64, id 28685, offset 0, flags [DF],
> proto 6, length: 764) SERVER.ssh > CLIENT.43726: P [bad tcp cksum 6662
> (->ef2b)!] 617734888:617735600(712) ack 2634113543 win 2728
> <nop,nop,timestamp 16345815 819458859>
> There are some changes in 2.6.22 that appear to affect TCP SACK handling
> does this fix a known issue ?
There is no such "known issue" :-)... This issue has nothing to do with TCP SACK handling, since that code _won't_ be reached... We could verify that from the timestamps. But if you still insist that the SACK under snd_una is the issue, please turn tcp_dsack to 0 on the CLIENT; you will not get them after that, and you can be happy, as your out-of-window SACK "issue" is then fixed :-)...
...Seriously, somebody other than me is probably better at suggesting what could cause the discarding at the SERVER in this case. The SNMP stuff Dave was asking for could help; you can find it in /proc/net/{netstat,snmp}...
--
i.
> ...Seriously, somebody else than me is probably better in suggesting what
> could cause the discarding at the SERVER in this case. SNMP stuff Dave was
> asking could help, you can find them from /proc/net/{netstat,snmp}...
That will also tell us if TCP discarded the packets due to timestamp tests or similar.
I've now squinted at the D-SACK RFC and understand a little about this; however, the RFC does make the claim: "This extension is compatible with current implementations of the SACK option in TCP. That is, if one of the TCP end-nodes does not implement this D-SACK extension and the other TCP end-node does, we believe that this use of the D-SACK extension by one of the end nodes will not introduce problems."
What if it turns out that is not true for a large enough number of the SACK implementations out there, from the timeframe when SACK was supported but D-SACK was not? Would it be possible to clearly categorise an implementation as:
* 100% SACK RFC compliant. SACK works, and by virtue of the mandatory requirements written into the previous SACK RFCs this implementation would never see a problem with receiving D-SACK, even though the stack itself does not support D-SACK.
* Mostly SACK RFC compliant. SACK works, but if it saw D-SACK it would have problems dealing with it, possibly resulting in fatal TCP lockups. Are there mandatory SACK implementation requirements in place to be able to clearly draw the line and state that the 2.6.9 SACK implementation was not RFC compliant?
* 100% SACK and D-SACK RFC compliant. Such an implementation was written to support D-SACK on top of SACK.
So if there is a problem, whose fault would it be:
* The original SACK RFCs, for not specifying a mandatory course of action which D-SACK exploits, thus making the claim in RFC2883 unsound.
* The older Linux kernel, for not being 100% SACK RFC compliant in its implementation? Not a lot we can do about this now, but if we're able to identify that there may be backward compatibility issues with that implementation, that's a useful point to take forward.
* The newer Linux kernel, for enabling D-SACK by default when RFC2883 doesn't even claim a cast-iron case for D-SACK to be compatible with any 100% RFC compliant SACK implementation.
Does TCP support the concept of vendor-dependent options, i.e. TCP options in a special range that would identify both the vendor and the vendor-specific option id? Such a system would allow Linux to implement a <D-SACK Ok> option, even if the RFC claims one is not needed. This would allow moving forward through this era until such point in time as it is officially agreed that it is just a Linux problem or an RFC problem. If it's an RFC problem then IANA (or whoever) would issue a generic TCP option for it.
If the dump of this problem really does identify a risk/problem then, as it's between two versions of Linux, a vendor-specific option also makes sense.
I don't really want to switch new, useful stuff off by default (so it never gets used); I'm all for experimentation, but not to the point of failure between default configurations of widely distributed versions of the kernel.
So those are the technical approaches I can come up with to discuss. Does Ilpo have a particular vested interest in D-SACK that should be disclosed?
> However, if DSACKs really
> bother you still (though it shouldn't :-)), IIRC I also told you how
> you're able to turn it off (tcp_dsack sysctl) but I assure you that it's
> not a bug but feature called DSACK [RFC2883], there's _absolutely_ nothing
> wrong with it, instead, it would be wrong to _not_ send the below snd_una
> SACK in this scenario when tcp_dsack set to 1.
So it is necessary to turn off a TCP option (that is enabled by default) to be sure of having reliable TCP connections (that don't lock up) in the bug-free Linux networking stack? This is absurd.
If such an option causes such a problem, then that option should not be enabled by default. If, however, the problem is because of a bug, then let us continue to try to isolate the cause rather than wallpaper over the cracks with the voodoo of turning off things that are enabled by default.
It only makes sense to turn options off when there is a 3rd party involved (or other means beyond your control) which is affecting function; the case here is that two Linux kernel stacks are affected and no 3rd-party device has been shown to be affecting function.
>> There is another concern of why the SERVER performed a retransmission in the
>> first place, when the tcpdump shows the ack covering it has been seen.
>
> There are only three possible reasons to this thing:
> 1) The ACK didn't reach the SERVER (your logs prove this to not be the
> case)
> 2) The ACK got discarded by the SERVER
I'd thought about that one; it's a possibility. The server in question does have periods of high UDP load on a fair number of UDP sockets at once (a few thousand). Both systems have 2GB of RAM. The server has maybe just 250MB of RSS for all apps combined.
> 3) The SERVER (not the client) is buggy and sends an incorrect
> retransmission
This was my initial stab at the cause, simply due to a sequence like this (from when the lockup starts):
03:58:56.731637 IP (tos 0x10, ttl 64, id 53311, offset 0, flags [DF],
proto 6, length: 64) CLIENT.43726 > SERVER.ssh: . [tcp sum ok]
2634113543:2634113543(0) ack 617735600 win 501 <nop,nop,timestamp
819458962 16345815,nop,nop,sack sack 1 {617733440:617734888} >
03:58:57.322800 IP (tos 0x0, ttl 50, id 28689, offset 0, flags [DF],
proto 6, length: 1500) SERVER.ssh > CLIENT.43726: .
617733440:617734888(1448) ack 2634113543 win 2728 <nop,nop,timestamp
16346718 819458864>
The client sent a SACK. From understanding more about D-SACK, this is a valid D-SACK response, but it appears to confuse the older Linux kernel at the server end.
> ...So we have just two options remaining...
>
>> I have made available the full dumps at:
>>
>> http://darrylmiles.org/snippets/lkml/20070731/
>
> Thanks about these... Based on a quick check, it is rather clear that the
> SERVER is for some reason discarding the packets it's receiving:
>
> 04:11:26.833935 IP CLIENT.43726 > SERVER.ssh: P 4239:4287(48) ack 28176 win 501 <nop,nop,timestamp 819646456 16345815>
> 04:11:27.132425 IP SERVER.ssh > CLIENT.43726: . 26016:27464(1448) ack 4239 win 2728 <nop,nop,timestamp 17096579 819458864>
> 04:11:27.230081 IP CLIENT.43726 > SERVER.ssh: . ack 28176 win 501 <nop,nop,timestamp 819646555 16345815,nop,nop
>
> Notice, (cumulative) ack field didn't advance though new data arrived, and
> for the record, it's in advertised window too. There are no DSACK in here
> so your theory about below snd_una SACK won't help to explain this one
> at all... We'll just have to figure out why it's discarding it. And
> there's even more to prove this...
Agreed on this. However, discarding data is allowed (provided it is intentional discarding, not a bug where the TCP stack is discarding segments it shouldn't); TCP should recover provided sufficient packets get through.
So the SNMP data would show up intentional discards (due to memory/resource issues); I'll get some of those too.
>> ...SNIPPED...MORE SIGNS OF UNEXPLAINED DISCARDING BY THE SERVER...
>
> There was one bad checksum btw:
>
>> 03:58:56.365662 IP (tos 0x10, ttl 64, id 28685, offset 0, flags [DF],
>> proto 6, length: 764) SERVER.ssh > CLIENT.43726: P [bad tcp cksum 6662
>> (->ef2b)!] 617734888:617735600(712) ack 2634113543 win 2728
>> <nop,nop,timestamp 16345815 819458859>
I noticed this one too, but discarded the "[bad tcp cksum 6662 (->ef2b)!]" as bogus on the basis of the following packet turning up at the client:
03:58:56.422019 IP (tos 0x0, ttl 50, id 28685, offset 0, flags [DF],
proto 6, length: 764) SERVER.ssh > CLIENT.43726: P [tcp sum ok]
617734888:617735600(712) ack 2634113543 win 2728 <nop,nop,timestamp
16345815 819458859>
Forgive me if I am mistaken, but while the server reports a checksum error, the client did not. I took this to be a misreporting by tcpdump at the server, probably due to the "e1000" network card's checksum offloading (I'd guess this level of card does offloading; I've never audited the driver before). If you search both dumps for the timestamps "16345815 819458859", two packets were sent by the server and two were received by the client with those timestamps.
>> There are some changes in 2.6.22 that appear to affect TCP SACK handling
>> does this fix a known issue ?
>
> There is no such "known issue" :-)... This issue has nothing to do with
> TCP SACK handling, since that code _won't_ be reached... We could verify
> that from the timestamps. But if you still insist that SACK under snd_una
> is the issue, please turn tcp_dsack to 0 on the CLIENT, you will not get
> them after that and you can be happy as your out-of-window SACK "issue"
> is then fixed :-)...
I had thrown out one interpretation of events for others to comment on, so thanks for your comments and viewpoint on the issue.
> ...Seriously, somebody else than me is probably better in suggesting what
> could cause the discarding at the SERVER in this case. SNMP stuff Dave was
> asking could help, you can find them from /proc/net/{netstat,snmp}...
The SNMP stats aren't so useful right now, as the box has been rebooted since then, but I shall attempt to capture /proc/net/* data, cause the problem, then capture /proc/net/* data again, if those numbers can help.
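Something along these lines on both boxes, I assume (file names arbitrary):

  cat /proc/net/snmp /proc/net/netstat > proc-net-before.txt
  # ... reproduce the hang ...
  cat /proc/net/snmp /proc/net/netstat > proc-net-after.txt
  diff proc-net-before.txt proc-net-after.txt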
Darryl