Missing TCP SYN on loopback, retransmits after 1s

Jesse Young

Nov 22, 2011, 7:13:20 PM
to net...@vger.kernel.org
Hi all,

I am experiencing packet loss over TCP/IPv[46], which causes 1-second
delays when connect()ing to a socket. This happens even on loopback, and
on multiple kernels. On the older kernels, the connect() time is nearly
3 seconds; I believe this is due to a recent change to a TCP connect
retransmit parameter in the kernel.

1. Linux dc-s1000-2114 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
16:26:12 UTC 2011 x86_64 GNU/Linux
2. Linux dc-a1000-2131.cleversafelabs.com 2.6.39.4-2-clevos+ #1 SMP
Tue Nov 8 09:06:49 CST 2011 x86_64 x86_64 x86_64 GNU/Linux
3. Linux telperion.jlyo.org 3.1.0-4-ARCH #1 SMP PREEMPT Mon Nov 7
22:47:18 CET 2011 x86_64 Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
GenuineIntel GNU/Linux

I have created some test cases which reproduce this problem. The first
set of tests uses select() multiplexing and has some problems of its
own; however, these tests exhibit odd behavior as well, especially in
the difference between tcp4 and tcp6.

Please note: these tests will quickly exhaust the available ephemeral
TCP ports on your system, which will cause TCP connect() calls in other
processes to fail with EADDRNOTAVAIL. However, the ports become
available again after a short while.

The first test fails very quickly, while the others haven't timed out
so far. NOTE: the second test requires /proc/sys/net/ipv6/bindv6only
to be set to 1.

./packetloss :: ::1
./packetloss :: 127.0.0.1
./packetloss 0.0.0.0 127.0.0.1

The other tests run a client and server in different processes.
Run the "closed" daemon using one of:
./closed ::
./closed 0.0.0.0

Then flood connect() pings against port 8009, the port that closed listens on.
./tcping -f -p8009 ::1
./tcping -f -p8009 127.0.0.1

Wait for a pause, then ^C, and notice the max statistic is ~1000ms.

These tests have been run between machines on a relatively noiseless
Ethernet LAN, with similar results.

What's also puzzling is that I see no packet drop reporting in
$ ifconfig lo
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 16436 metric 1
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 276411482 bytes 15822880567 (14.7 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 276411482 bytes 15822880567 (14.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions

I'm thinking this may be a bug in the TCP/IP stack; however, I'm not
certain whether I'm missing a socket option or some other configuration
that might eliminate this behavior.

If there's anything else I can help you with, please don't hesitate
to Cc me.

Thanks,
Jesse

Attached: syndrop.pcap

Get the code here:
https://github.com/jlyo/packetloss
git clone git://github.com/jlyo/packetloss.git

https://github.com/jlyo/tcping
git clone git://github.com/jlyo/tcping.git

https://github.com/jlyo/closed
git clone git://github.com/jlyo/closed.git


David Miller

Nov 22, 2011, 7:23:38 PM
to jl...@jlyo.org, net...@vger.kernel.org
From: Jesse Young <jl...@jlyo.org>
Date: Tue, 22 Nov 2011 18:13:20 -0600

> What's also puzzling is that I see no packet drop reporting in
> $ ifconfig lo
> lo: flags=73<UP,LOOPBACK,RUNNING> mtu 16436 metric 1
> inet 127.0.0.1 netmask 255.0.0.0
> inet6 ::1 prefixlen 128 scopeid 0x10<host>
> loop txqueuelen 0 (Local Loopback)
> RX packets 276411482 bytes 15822880567 (14.7 GiB)
> RX errors 0 dropped 0 overruns 0 frame 0
> TX packets 276411482 bytes 15822880567 (14.7 GiB)
> TX errors 0 dropped 0 overruns 0 carrier 0 collisions

The device driver therefore isn't even seeing the packets; they are
being dropped elsewhere.

Why is this "puzzling"?

There's layers upon layers and thousands of places where packets can
be dropped between the originating network stack and the actual device
driver.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Jesse Young

Nov 22, 2011, 7:37:27 PM
to David Miller, net...@vger.kernel.org
On Tue, 22 Nov 2011 19:23:38 -0500 (EST)
David Miller <da...@davemloft.net> wrote:

> From: Jesse Young <jl...@jlyo.org>
> Date: Tue, 22 Nov 2011 18:13:20 -0600
>
> > What's also puzzling is that I see no packet drop reporting in
> > $ ifconfig lo
> > lo: flags=73<UP,LOOPBACK,RUNNING> mtu 16436 metric 1
> > inet 127.0.0.1 netmask 255.0.0.0
> > inet6 ::1 prefixlen 128 scopeid 0x10<host>
> > loop txqueuelen 0 (Local Loopback)
> > RX packets 276411482 bytes 15822880567 (14.7 GiB)
> > RX errors 0 dropped 0 overruns 0 frame 0
> > TX packets 276411482 bytes 15822880567 (14.7 GiB)
> > TX errors 0 dropped 0 overruns 0 carrier 0 collisions
>
> The device driver therefore isn't even seeing the packets, they are
> being dropped elsewhere.
>
> Why is this "puzzling"?
>
> There's layers upon layers and thousands of places where packets can
> be dropped between the originating network stack and the actual device
> driver.

Maybe puzzling isn't the best word; it's just some more relevant
information. Also, this is the loopback interface: there is no device
driver, PHY, or DLL layer in question here (just the loopback's mock
driver/PHY/DLL).

I presume that the drop is occurring between the NET layer and the
syscall interface; do you agree? Where should I begin looking?

Hagen Paul Pfeifer

Nov 22, 2011, 8:44:13 PM
to Jesse Young, David Miller, net...@vger.kernel.org
* Jesse Young | 2011-11-22 18:37:27 [-0600]:

>I presume that the drop is occurring between the NET layer and the
>syscall interface; do you agree? Where should I begin looking?

perf script net_dropmonitor.py from Neil should do the job.

HGN

John Heffner

Nov 22, 2011, 9:06:04 PM
to Jesse Young, net...@vger.kernel.org
Offhand, I'd guess you're overflowing the TCP SYN queue. (You can try
tuning tcp_max_syn_backlog.)

-John

Eric Dumazet

Nov 23, 2011, 12:24:40 AM
to John Heffner, Jesse Young, net...@vger.kernel.org
Le mardi 22 novembre 2011 à 21:06 -0500, John Heffner a écrit :
> Offhand, I'd guess you're overflowing the TCP SYN queue. (You can try
> tuning tcp_max_syn_backlog.)
>

There is one little thing called "netstat -s", a very useful tool,
included in many distros :)

Eric Dumazet

Nov 23, 2011, 12:58:31 AM
to John Heffner, Jesse Young, net...@vger.kernel.org
Le mercredi 23 novembre 2011 à 06:24 +0100, Eric Dumazet a écrit :
> Le mardi 22 novembre 2011 à 21:06 -0500, John Heffner a écrit :
> > Offhand, I'd guess you're overflowing the TCP SYN queue. (You can try
> > tuning tcp_max_syn_backlog.)
> >
>
> There is one little thing called "netstat -s", a very useful tool,
> included in many distros :)

This is related to the TIME_WAIT syndrome?

06:47:42.090522 IP6 ::1.49374 > ::1.8009: Flags [SEW], seq 2646115915, win 32752, options [mss 16376,sackOK,TS val 26574090 ecr 0,nop,wscale 6], length 0
06:47:42.090579 IP6 ::1.8009 > ::1.49374: Flags [S.E], seq 184529170, ack 2646115916, win 32728, options [mss 16376,sackOK,TS val 26574090 ecr 26574090,nop,wscale 6], length 0
06:47:42.090616 IP6 ::1.49374 > ::1.8009: Flags [.], ack 1, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0
06:47:42.090718 IP6 ::1.8009 > ::1.49374: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0
06:47:42.090780 IP6 ::1.49374 > ::1.8009: Flags [F.], seq 1, ack 2, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0
06:47:42.090843 IP6 ::1.8009 > ::1.49374: Flags [.], ack 2, win 512, options [nop,nop,TS val 26574090 ecr 26574090], length 0

First connection went well.

Now we try to reuse the tuple (ports 49374, 8009 on loopback) while a socket is in TIME_WAIT, and the first
SYN packet (time 06:48:20.337335) is dropped (considered to be part of the previous session).

Now why the first SYN packet is dropped and not the second one, I don't know yet.

06:48:20.337335 IP6 ::1.49374 > ::1.8009: Flags [SEW], seq 3243722104, win 32752, options [mss 16376,sackOK,TS val 26612337 ecr 0,nop,wscale 6], length 0
06:48:21.340112 IP6 ::1.49374 > ::1.8009: Flags [SEW], seq 3243722104, win 32752, options [mss 16376,sackOK,TS val 26613340 ecr 0,nop,wscale 6], length 0
06:48:21.340162 IP6 ::1.8009 > ::1.49374: Flags [S.E], seq 797804014, ack 3243722105, win 32728, options [mss 16376,sackOK,TS val 26613340 ecr 26613340,nop,wscale 6], length 0
06:48:21.340217 IP6 ::1.49374 > ::1.8009: Flags [.], ack 1, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0
06:48:21.340360 IP6 ::1.8009 > ::1.49374: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0
06:48:21.340466 IP6 ::1.49374 > ::1.8009: Flags [F.], seq 1, ack 2, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0
06:48:21.340541 IP6 ::1.8009 > ::1.49374: Flags [.], ack 2, win 512, options [nop,nop,TS val 26613340 ecr 26613340], length 0

Eric Dumazet

Nov 23, 2011, 1:09:00 AM
to John Heffner, Jesse Young, net...@vger.kernel.org
Le mercredi 23 novembre 2011 à 06:58 +0100, Eric Dumazet a écrit :

> First connection went well.
>
> Now we try to reuse the tuple (ports 49374, 8009 on loopback) while a socket is in TIME_WAIT, and the first
> SYN packet (time 06:48:20.337335) is dropped (considered to be part of the previous session).
>
> Now why the first SYN packet is dropped and not the second one, I don't know yet.


In the netstat -s output, the suspect increasing counter is:

"45 congestion windows recovered without slow start after partial ack"

Eric Dumazet

Nov 23, 2011, 1:13:22 AM
to John Heffner, Jesse Young, net...@vger.kernel.org
Le mercredi 23 novembre 2011 à 07:09 +0100, Eric Dumazet a écrit :
> Le mercredi 23 novembre 2011 à 06:58 +0100, Eric Dumazet a écrit :
>
> > First connection went well.
> >
> > Now we try to reuse the tuple (ports 49374, 8009 on loopback) while a socket is in TIME_WAIT, and the first
> > SYN packet (time 06:48:20.337335) is dropped (considered to be part of the previous session).
> >
> > Now why the first SYN packet is dropped and not the second one, I don't know yet.
>
>
> In the netstat -s output, the suspect increasing counter is:
>
> "45 congestion windows recovered without slow start after partial ack"
>
>

# vi +2831 net/ipv4/tcp_input.c

/* Undo during loss recovery after partial ACK. */
static int tcp_try_undo_loss(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (tcp_may_undo(tp)) {
		struct sk_buff *skb;
		tcp_for_write_queue(skb, sk) {
			if (skb == tcp_send_head(sk))
				break;
			TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
		}

		tcp_clear_all_retrans_hints(tp);

		DBGUNDO(sk, "partial loss");
		tp->lost_out = 0;
		tcp_undo_cwr(sk, true);
HERE:		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO);
		inet_csk(sk)->icsk_retransmits = 0;
		tp->undo_marker = 0;
		if (tcp_is_sack(tp))
			tcp_set_ca_state(sk, TCP_CA_Open);
		return 1;
	}
	return 0;
}
Eric Dumazet

Nov 23, 2011, 9:38:44 AM
to Jesse Young, David Miller, net...@vger.kernel.org

Here is the patch to solve this IPv6 problem. Thanks a lot for the
report!

[PATCH] ipv6: tcp: fix tcp_v6_conn_request()

Since linux 2.6.26 (commit c6aefafb7ec6 : Add IPv6 support to TCP SYN
cookies), we can drop a SYN packet reusing a TIME_WAIT socket.

(As a matter of fact we fail to send the SYNACK answer)

As the client resends its SYN packet after a one-second timeout, we
accept it, because the first packet removed the TIME_WAIT socket before
being dropped.

This probably explains why nobody ever noticed or complained.

Reported-by: Jesse Young <jl...@jlyo.org>
Signed-off-by: Eric Dumazet <eric.d...@gmail.com>
---
net/ipv6/tcp_ipv6.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 36131d1..2dea4bb 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1255,6 +1255,13 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (!want_cookie || tmp_opt.tstamp_ok)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
 
+	treq->iif = sk->sk_bound_dev_if;
+
+	/* So that link locals have meaning */
+	if (!sk->sk_bound_dev_if &&
+	    ipv6_addr_type(&treq->rmt_addr) & IPV6_ADDR_LINKLOCAL)
+		treq->iif = inet6_iif(skb);
+
 	if (!isn) {
 		struct inet_peer *peer = NULL;
 
@@ -1264,12 +1271,6 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 			atomic_inc(&skb->users);
 			treq->pktopts = skb;
 		}
-		treq->iif = sk->sk_bound_dev_if;
-
-		/* So that link locals have meaning */
-		if (!sk->sk_bound_dev_if &&
-		    ipv6_addr_type(&treq->rmt_addr) & IPV6_ADDR_LINKLOCAL)
-			treq->iif = inet6_iif(skb);
 
 		if (want_cookie) {
 			isn = cookie_v6_init_sequence(sk, skb, &req->mss);

David Miller

Nov 23, 2011, 5:29:59 PM
to eric.d...@gmail.com, jl...@jlyo.org, net...@vger.kernel.org
From: Eric Dumazet <eric.d...@gmail.com>
Date: Wed, 23 Nov 2011 15:38:44 +0100

> [PATCH] ipv6: tcp: fix tcp_v6_conn_request()
>
> Since linux 2.6.26 (commit c6aefafb7ec6 : Add IPv6 support to TCP SYN
> cookies), we can drop a SYN packet reusing a TIME_WAIT socket.
>
> (As a matter of fact we fail to send the SYNACK answer)
>
> As the client resends its SYN packet after a one-second timeout, we
> accept it, because the first packet removed the TIME_WAIT socket before
> being dropped.
>
> This probably explains why nobody ever noticed or complained.
>
> Reported-by: Jesse Young <jl...@jlyo.org>
> Signed-off-by: Eric Dumazet <eric.d...@gmail.com>

Applied and queued up for -stable, thanks!

para...@gmail.com

Mar 22, 2013, 10:03:57 PM
to linu...@googlegroups.com, eric.d...@gmail.com, jl...@jlyo.org, net...@vger.kernel.org, da...@davemloft.net
Hi List!

First, I'm sorry for resurrecting an extremely old thread, but I've exhausted all other resources. We're experiencing the same "1 second retransmit" with IPv4 (including loopback). The best part is that it can be replicated very easily using the 'closed' and 'tcping' tests provided by Jesse Young in the initial post. For reference:

$ git clone git://github.com/jlyo/tcping.git
$ cd tcping && make

$ git clone git://github.com/jlyo/closed.git
$ cd closed && make

$ ./closed 0.0.0.0

$ time ./tcping -f -p8009 0.0.0.0

Results:

...
response from 0.0.0.0:8009, seq=1907 time=0.02 ms
response from 0.0.0.0:8009, seq=1908 time=0.03 ms
response from 0.0.0.0:8009, seq=1909 time=999.11 ms
--- 0.0.0.0:8009 ping statistics ---
1909 responses, 1910 ok, 0.00% failed
round-trip min/avg/max = 0.0/0.6/999.1 ms

real    0m1.125s
user    0m0.008s
sys     0m0.104s


Packet captures from tcpdump look remarkably similar to what Eric Dumazet shared. That eventually led me to this thread.

This happens on a fresh Ubuntu 12.10 install, and also with our tuning parameters applied (these include increasing the SYN backlog, open file descriptors, TCP memory, max orphans, etc.). I've also seen the problem with other kernels, on EC2 and Azure. I have not been able to test with IPv6 yet.

$ uname -a
Linux test 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I'm hoping to spark some interest in revisiting this issue (with a focus on IPv4 this time).

Thanks everyone!
Jay