Slow start after idle can harm some scientific applications which
interleave computation and communication. Assume an iterative
application in which each iteration consists of a computation phase and
a communication phase. If the computation phase takes long enough (i.e.
more than 2*RTT), the communication phase will always begin in slow
start and might never reach wire speed.
This patch allows each application to disable slow start after idle,
just like we allow delay-sensitive applications (e.g. telnet, SSH) to
disable NAGLE.
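
For illustration, an application would disable it roughly like this
(a minimal sketch; the option value 18 comes from this patch and
error handling is omitted):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_SLOW_START_AFTER_IDLE
#define TCP_SLOW_START_AFTER_IDLE 18	/* value proposed by this patch */
#endif

/* Keep the congestion window across idle periods between iterations. */
static int disable_slow_start_after_idle(int fd)
{
	int val = 0;	/* 0 = do not slow start after idle */

	return setsockopt(fd, IPPROTO_TCP, TCP_SLOW_START_AFTER_IDLE,
			  &val, sizeof(val));
}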
Signed-off-by: Cristian KLEIN <crist...@gmail.com>
---
 include/linux/tcp.h      |    4 +++-
 net/ipv4/tcp.c           |   10 ++++++++++
 net/ipv4/tcp_ipv4.c      |    1 +
 net/ipv4/tcp_minisocks.c |    1 +
 net/ipv4/tcp_output.c    |    4 ++--
 5 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a778ee0..132aab0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -105,6 +105,7 @@ enum {
 #define TCP_COOKIE_TRANSACTIONS	15	/* TCP Cookie Transactions */
 #define TCP_THIN_LINEAR_TIMEOUTS	16	/* Use linear timeouts for thin streams*/
 #define TCP_THIN_DUPACK			17	/* Fast retrans. after 1 dupack */
+#define TCP_SLOW_START_AFTER_IDLE	18	/* Slow start after transmission idle */

 /* for TCP_INFO socket option */
 #define TCPI_OPT_TIMESTAMPS	1
@@ -345,7 +346,8 @@ struct tcp_sock {
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
-		unused      : 2;
+		slow_start_after_idle : 1,/* Slow start after transmission idle */
+		unused      : 1;

 	/* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3 */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6afb6d8..3cf3863 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2252,6 +2252,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;

+	case TCP_SLOW_START_AFTER_IDLE:
+		if (val)
+			tp->slow_start_after_idle = 1;
+		else
+			tp->slow_start_after_idle = 0;
+		break;
+
 	case TCP_THIN_LINEAR_TIMEOUTS:
 		if (val < 0 || val > 1)
 			err = -EINVAL;
@@ -2497,6 +2504,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_NODELAY:
 		val = !!(tp->nonagle&TCP_NAGLE_OFF);
 		break;
+	case TCP_SLOW_START_AFTER_IDLE:
+		val = tp->slow_start_after_idle;
+		break;
 	case TCP_CORK:
 		val = !!(tp->nonagle&TCP_NAGLE_CORK);
 		break;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f4df5f9..1380902 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1449,6 +1449,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		newtp->advmss = tcp_sk(sk)->rx_opt.user_mss;

 	tcp_initialize_rcv_mss(newsk);
+	newtp->slow_start_after_idle = sysctl_tcp_slow_start_after_idle;

 #ifdef CONFIG_TCP_MD5SIG
 	/* Copy over the MD5 key from the original socket */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 4199bc6..1e6f0bb 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -495,6 +495,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 			newtp->rx_opt.ts_recent_stamp = 0;
 			newtp->tcp_header_len = sizeof(struct tcphdr);
 		}
+		newtp->slow_start_after_idle = sysctl_tcp_slow_start_after_idle;
 #ifdef CONFIG_TCP_MD5SIG
 		newtp->md5sig_info = NULL;	/*XXX*/
 		if (newtp->af_specific->md5_lookup(sk, newsk))
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f181b78..175d499 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -154,7 +154,7 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	const u32 now = tcp_time_stamp;

-	if (sysctl_tcp_slow_start_after_idle &&
+	if (tp->slow_start_after_idle &&
 	    (!tp->packets_out && (s32)(now - tp->lsndtime) > icsk->icsk_rto))
 		tcp_cwnd_restart(sk, __sk_dst_get(sk));
@@ -1279,7 +1279,7 @@ static void tcp_cwnd_validate(struct sock *sk)
 		if (tp->packets_out > tp->snd_cwnd_used)
 			tp->snd_cwnd_used = tp->packets_out;

-		if (sysctl_tcp_slow_start_after_idle &&
+		if (tp->slow_start_after_idle &&
 		    (s32)(tcp_time_stamp - tp->snd_cwnd_stamp) >= inet_csk(sk)->icsk_rto)
 			tcp_cwnd_application_limited(sk);
 	}
--
1.7.0
> Allows user-space to override the sysctl
> net.ipv4.tcp_slow_start_after_idle, on a per-socket basis, using
> setsockopt().
>
> Slow start after idle can harm some scientific applications which
> interleave computation and communication. Assume an iterative
> application in which each iteration consists of a computation phase
> and a communication phase. If the computation phase takes long enough
> (i.e. more than 2*RTT), the communication phase will always begin in
> slow start and might never reach wire speed.
>
> This patch allows each application to disable slow start after idle,
> just like we allow delay-sensitive applications (e.g. telnet, SSH) to
> disable NAGLE.
>
> Signed-off-by: Cristian KLEIN <crist...@gmail.com>
We specifically did not add a socket option for this facility.
It is a very dangerous option to enable, and depends deeply
upon the characteristics of your network and the paths by
which remote hosts are reached.
Therefore, only the system administrator can determine whether it is
safe to enable this, and that's why it can only be changed via sysctl.
Letting arbitrary applications change this aspect of TCP is beyond
dangerous.
I will not be applying this patch.
Could you please explain to me why it is dangerous? To me it seems that
it's just like allowing applications to disable NAGLE or to choose a
congestion control algorithm.
> Could you please explain to me why it is dangerous? To me it seems that
> it's just like allowing applications to disable NAGLE or to choose a
> congestion control algorithm.
Because you can cause undue congestion for other people on the network
by trusting path information that is outdated and has not been
validated by sending data for a certain amount of time.
I consider your argument an important concern, but I'm not quite
convinced this patch is so bad.
An application which does not need this behaviour will continue to slow
start after idle by default.
Without this patch, an application which needs this behaviour (i.e. not
to slow start after idle) is forced to implement its own UDP-based
protocol with all the congestion control, retransmission etc. Undue
congestion might still occur.
If you don't agree with the above two points, would you consider
accepting a patch with an allow_user_fast_start_after_idle sysctl?
Cristi.
> Without this patch, an application which needs this behaviour
> (i.e. not to slow start after idle) is forced to implement its own
> UDP-based protocol with all the congestion control, retransmission
> etc. Undue congestion might still occur.
Ask your system administrator to set the existing sysctl, because it
is a physical network attribute whether this behavior is safe or not.
And if it is safe, it is safe for all applications; there is no reason
for one application to ask for it and others not to. If it's legal,
it helps all applications without exception.
Your attempt to tie this to NAGLE is complete nonsense.
NAGLE changes acking behavior, whereas this feature controls in what
way we trust congestion control information we've probed for in the
past.
It should be safe as long as you don't have any packet loss (which
means the network can take it). Or am I missing something?
So perhaps allow it, but force-disable it on the first retransmit?
There could still be some oversubscription in the first window,
but hopefully not too bad.
OK, it could still be gamed by opening lots of sockets, but that
problem is there anyway.
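
A rough sketch of that idea on top of this patch (the hook name and
its placement on the retransmit path are my assumptions, untested):

/* Called when a segment is retransmitted; net/ipv4/tcp_output.c perhaps. */
static void tcp_fast_start_saw_loss(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* A retransmit means the path information we trusted is stale,
	 * so revert this connection to the default behaviour of slow
	 * starting after idle.
	 */
	tp->slow_start_after_idle = 1;
}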
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
> It should be safe as long as you don't have any packet loss (which
> means the network can take it). Or am I missing something?
>
> So perhaps allow it, but force-disable it on the first retransmit?
Unless you can detect packet loss or queueing delay incurred by
everyone else on the internet, this won't work.
Look, just drop this discussion, people simply don't understand the
implications of what they are suggesting.
The only thing I am willing to consider is a per-route attribute for
this setting, because that would be another legitimate way to make
this setting.
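
Something like this, perhaps (RTAX_FASTSTART is a purely hypothetical
metric name, only to illustrate the shape such a change would take):

/* Hypothetical: let a per-route metric override the sysctl. */
static inline int tcp_slow_start_after_idle_enabled(struct sock *sk)
{
	struct dst_entry *dst = __sk_dst_get(sk);

	if (dst && dst_metric(dst, RTAX_FASTSTART))
		return 0;	/* admin marked this path as safe */

	return sysctl_tcp_slow_start_after_idle;
}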