I am experiencing packet loss over TCP/IPv[46], which causes 1 second
delays when connect()ing to a socket. This happens even on loopback, and
on multiple kernels. On the older kernels, the connect() time is nearly
3 seconds, I believe this is due to a recent TCP connect retrasmit
parameter changed in the kernel.
1. Linux dc-s1000-2114 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
16:26:12 UTC 2011 x86_64 GNU/Linux
2. Linux dc-a1000-2131.cleversafelabs.com 2.6.39.4-2-clevos+ #1 SMP
Tue Nov 8 09:06:49 CST 2011 x86_64 x86_64 x86_64 GNU/Linux
3. Linux telperion.jlyo.org 3.1.0-4-ARCH #1 SMP PREEMPT Mon Nov 7
22:47:18 CET 2011 x86_64 Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
GenuineIntel GNU/Linux
I have created some test cases which reify this problem, the first set
of tests use select() multiplexing, and have some problems, however,
they exhibit odd behavior as well, especially in the difference between
tcp4 and tcp6.
Please note: these tests will quickly exaust the amount of available
ephemeral TCP ports on your system, which will cause any TCP connect()
calls in other processes to return with EADDRNOTAVAIL. However, ports
will become available after a short while.
The first test fails super quick, while the others haven't timed out
so far. NOTE: The second test requires /proc/sys/net/ipv6/bindv6only
to be set to 1.
./packetloss :: ::1
./packetloss :: 127.0.0.1
./packetloss 0.0.0.0 127.0.0.1
The other tests run a client and server in different processes.
Run the "close" daemon using one of:
./closed ::
./closed 0.0.0.0
And flood connect() pings against 8009, the port closed listens on.
./tcping -f -p8009 ::1
./tcping -f -p8009 127.0.0.1
Wait for a pause, then ^C, and notice the max statistic is ~1000ms.
These tests have been rn between machines on a relativley noiseless
ethernet LAN with similar results.
What's also puzzling, is that I see no packet drop reporting in
$ ifconfig lo
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 16436 metric 1
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 276411482 bytes 15822880567 (14.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 276411482 bytes 15822880567 (14.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions
I'm thinking this may be a bug in the TCP/IP stack, however, I'm not
certain if I'm missing a socket option, or some other configuration
that may elimiate this behavior.
If there's anything else I can help you with, please don't hesitate
to Cc me.
Thanks,
Jesse
Attached: syndrop.pcap
Get the code here
https://github.com/jlyo/packetloss
git clone git://github.com/jlyo/packetloss.git
https://github.com/jlyo/tcping
git clone git://github.com/jlyo/tcping.git
https://github.com/jlyo/closed
git clone git://github.com/jlyo/closed.git
> What's also puzzling, is that I see no packet drop reporting in
> $ ifconfig lo
> lo: flags=73<UP,LOOPBACK,RUNNING> mtu 16436 metric 1
> inet 127.0.0.1 netmask 255.0.0.0
> inet6 ::1 prefixlen 128 scopeid 0x10<host>
> loop txqueuelen 0 (Local Loopback)
> RX packets 276411482 bytes 15822880567 (14.7 GiB)
> RX errors 0 dropped 0 overruns 0 frame 0
> TX packets 276411482 bytes 15822880567 (14.7 GiB)
> TX errors 0 dropped 0 overruns 0 carrier 0 collisions
The device driver therefore isn't even seeing the packets, they are
being dropped elsewhere.
Why is this "puzzling"?
There's layers upon layers and thousands of places where packets can
be dropped between the originating network stack and the actual device
driver.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
> From: Jesse Young <jl...@jlyo.org>
> Date: Tue, 22 Nov 2011 18:13:20 -0600
>
> > What's also puzzling, is that I see no packet drop reporting in
> > $ ifconfig lo
> > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 16436 metric 1
> > inet 127.0.0.1 netmask 255.0.0.0
> > inet6 ::1 prefixlen 128 scopeid 0x10<host>
> > loop txqueuelen 0 (Local Loopback)
> > RX packets 276411482 bytes 15822880567 (14.7 GiB)
> > RX errors 0 dropped 0 overruns 0 frame 0
> > TX packets 276411482 bytes 15822880567 (14.7 GiB)
> > TX errors 0 dropped 0 overruns 0 carrier 0 collisions
>
> The device driver therefore isn't even seeing the packets, they are
> being dropped elsewhere.
>
> Why is this "puzzling"?
>
> There's layers upon layers and thousands of places where packets can
> be dropped between the originating network stack and the actual device
> driver.
Maybe puzzling isn't the best word... just some more relevant
information. Also, this is the loopback interface, there is no device
driver, PHY or DLL layer in question here (just the loopback's mock
driver/PHY/DLL).
I presume that the drop is occuring in between the NET layer, and the sys
call interface, do you agree? Where should I begin looking?
>I presume that the drop is occuring in between the NET layer, and the sys
>call interface, do you agree? Where should I begin looking?
perf script net_dropmonitor.py from Neil should do the job.
HGN
-John
There is one litle thing called "netstat -s", a very useful tool,
included in many distros :)
This is related to TIMEWAIT syndrom ?
06:47:42.090522 IP6 ::1.49374 > ::1.8009: Flags [SEW], seq 2646115915, win 32752, options [mss 16376,sackOK,TS val 26574090 ecr 0,nop,wscale 6], length 0
06:47:42.090579 IP6 ::1.8009 > ::1.49374: Flags [S.E], seq 184529170, ack 2646115916, win 32728, options [mss 16376,sackOK,TS val 26574090 ecr 26574090,nop,wscale 6], length 0
06:47:42.090616 IP6 ::1.49374 > ::1.8009: Flags [.], ack 1, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0
06:47:42.090718 IP6 ::1.8009 > ::1.49374: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0
06:47:42.090780 IP6 ::1.49374 > ::1.8009: Flags [F.], seq 1, ack 2, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0
06:47:42.090843 IP6 ::1.8009 > ::1.49374: Flags [.], ack 2, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0
First connection went well.
Now we try to reuse tuple (ports 49374, 8009 on loopback) while a socket is in TIMEWAIT, and first
SYN packet (time 06:48:20.337335) is dropped (considered as a packet part of previous session)
Now why the first SYN packet is dropped and not the second one, I dont know yet.
06:48:20.337335 IP6 ::1.49374 > ::1.8009: Flags [SEW], seq 3243722104, win 32752, options [mss 16376,sackOK,TS val 26612337 ecr 0,nop,wscale 6], length 0
06:48:21.340112 IP6 ::1.49374 > ::1.8009: Flags [SEW], seq 3243722104, win 32752, options [mss 16376,sackOK,TS val 26613340 ecr 0,nop,wscale 6], length 0
06:48:21.340162 IP6 ::1.8009 > ::1.49374: Flags [S.E], seq 797804014, ack 3243722105, win 32728, options [mss 16376,sackOK,TS val 26613340 ecr 26613340,nop,wscale 6], length 0
06:48:21.340217 IP6 ::1.49374 > ::1.8009: Flags [.], ack 1, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0
06:48:21.340360 IP6 ::1.8009 > ::1.49374: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0
06:48:21.340466 IP6 ::1.49374 > ::1.8009: Flags [F.], seq 1, ack 2, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0
06:48:21.340541 IP6 ::1.8009 > ::1.49374: Flags [.], ack 2, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0
> First connection went well.
>
> Now we try to reuse tuple (ports 49374, 8009 on loopback) while a socket is in TIMEWAIT, and first
> SYN packet (time 06:48:20.337335) is dropped (considered as a packet part of previous session)
>
> Now why the first SYN packet is dropped and not the second one, I dont know yet.
in netstat -s output, the suspect increasing counter is :
"45 congestion windows recovered without slow start after partial ack"
# vi +2831 net/ipv4/tcp_input.c
/* Undo during loss recovery after partial ACK. */
static int tcp_try_undo_loss(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
if (tcp_may_undo(tp)) {
struct sk_buff *skb;
tcp_for_write_queue(skb, sk) {
if (skb == tcp_send_head(sk))
break;
TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
}
tcp_clear_all_retrans_hints(tp);
DBGUNDO(sk, "partial loss");
tp->lost_out = 0;
tcp_undo_cwr(sk, true);
HERE: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
inet_csk(sk)->icsk_retransmits = 0;
tp->undo_marker = 0;
if (tcp_is_sack(tp))
tcp_set_ca_state(sk, TCP_CA_Open);
return 1;
}
return 0;
Here is the patch to solve this IPv6 problem, thanks a lot for the
report !
[PATCH] ipv6: tcp: fix tcp_v6_conn_request()
Since linux 2.6.26 (commit c6aefafb7ec6 : Add IPv6 support to TCP SYN
cookies), we can drop a SYN packet reusing a TIME_WAIT socket.
(As a matter of fact we fail to send the SYNACK answer)
As the client resends its SYN packet after a one second timeout, we
accept it, because first packet removed the TIME_WAIT socket before
being dropped.
This probably explains why nobody ever noticed or complained.
Reported-by: Jesse Young <jl...@jlyo.org>
Signed-off-by: Eric Dumazet <eric.d...@gmail.com>
---
net/ipv6/tcp_ipv6.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 36131d1..2dea4bb 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1255,6 +1255,13 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
if (!want_cookie || tmp_opt.tstamp_ok)
TCP_ECN_create_request(req, tcp_hdr(skb));
+ treq->iif = sk->sk_bound_dev_if;
+
+ /* So that link locals have meaning */
+ if (!sk->sk_bound_dev_if &&
+ ipv6_addr_type(&treq->rmt_addr) & IPV6_ADDR_LINKLOCAL)
+ treq->iif = inet6_iif(skb);
+
if (!isn) {
struct inet_peer *peer = NULL;
@@ -1264,12 +1271,6 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
atomic_inc(&skb->users);
treq->pktopts = skb;
}
- treq->iif = sk->sk_bound_dev_if;
-
- /* So that link locals have meaning */
- if (!sk->sk_bound_dev_if &&
- ipv6_addr_type(&treq->rmt_addr) & IPV6_ADDR_LINKLOCAL)
- treq->iif = inet6_iif(skb);
if (want_cookie) {
isn = cookie_v6_init_sequence(sk, skb, &req->mss);
> [PATCH] ipv6: tcp: fix tcp_v6_conn_request()
>
> Since linux 2.6.26 (commit c6aefafb7ec6 : Add IPv6 support to TCP SYN
> cookies), we can drop a SYN packet reusing a TIME_WAIT socket.
>
> (As a matter of fact we fail to send the SYNACK answer)
>
> As the client resends its SYN packet after a one second timeout, we
> accept it, because first packet removed the TIME_WAIT socket before
> being dropped.
>
> This probably explains why nobody ever noticed or complained.
>
> Reported-by: Jesse Young <jl...@jlyo.org>
> Signed-off-by: Eric Dumazet <eric.d...@gmail.com>
Applied and queued up for -stable, thanks!
...response from 0.0.0.0:8009, seq=1907 time=0.02 msresponse from 0.0.0.0:8009, seq=1908 time=0.03 msresponse from 0.0.0.0:8009, seq=1909 time=999.11 ms--- 0.0.0.0:8009 ping statistics ---1909 responses, 1910 ok, 0.00% failedround-trip min/avg/max = 0.0/0.6/999.1 msreal 0m1.125suser 0m0.008ssys 0m0.104s