
Disable TCP delayed ack timeout (Nagle's algorithm)


guil...@databerries.com

Apr 8, 2016, 8:25:09 AM
Hi,

We are trying to be as close to real time as possible on our system. We would like to reduce the latency of our TCP packets. What is the equivalent option in Debian to this Red Hat feature:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Reducing_the_TCP_delayed_ack_timeout.html ?

We would like to disable Nagle's algorithm (https://en.wikipedia.org/wiki/Nagle%27s_algorithm), also called "TCP_NODELAY".

Thank you for your help.

PS : We are using Debian 8

Jorgen Grahn

Apr 8, 2016, 12:30:36 PM
On Fri, 2016-04-08, guil...@databerries.com wrote:
> Hi,
>
> We are trying to be as real time as possible on our system. We would
> like to reduce the latency on our tcp packets. What is the
> equivalent option in Debian to this RedHat feature:
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-Reducing_the_TCP_delayed_ack_timeout.html
> ?

That page talks about changing /proc/sys/net/ipv4/tcp_delack_min .

I don't have tcp_delack_min on my machine, and it seems likely some
guy on Stack Overflow is right:

It is only available in certain - mostly realtime focused - kernel
branches. A lot of the documentation relating to this kernel
parameter is referring to the MRG (Messaging Realtime-Grid) kernel
variant that some RHEL machines will use.

> We would like to disable the Nagle's algorithm
> (https://en.wikipedia.org/wiki/Nagle%27s_algorithm) also called
> "TCP_NODELAY".

That's a different issue. The application typically does that when it
knows it's a good idea for the particular application protocol: see
tcp(7) and TCP_NODELAY.

I don't know exactly what tcp_delack_min means, but judging from the
name it's a tuning parameter of the Nagle algorithm, and if you
disable the Nagle algorithm ... the parameter is useless, right?

> Thank you for your help.
>
> PS : We are using Debian 8

Are you sure any of this would be of use to you? Your problem seems a
bit vague ... have you measured and seen that delays in TCP are a
major bottleneck?

I believe with a good application-level protocol[1], any decent TCP
stack is good enough for most things.

/Jorgen

[1] Few, long-lived TCP connections with lots of traffic are good.
Pipelining is good. Tiny request-response datagrams are bad.
Applications which try not to generate many tiny TCP segments are
good.

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Rick Jones

Apr 8, 2016, 5:56:45 PM
Jorgen Grahn <grahn...@snipabacken.se> wrote:
> On Fri, 2016-04-08, guil...@databerries.com wrote:
> > We would like to disable the Nagle's algorithm
> > (https://en.wikipedia.org/wiki/Nagle%27s_algorithm) also called
> > "TCP_NODELAY".

Quibble - the algorithm is called Nagle's algorithm; the setsockopt()
one can issue to disable/enable it is TCP_NODELAY.

> I don't know exactly what tcp_delack_min means, but judging from the
> name it's a tuning parameter of the Nagle algorithm, and if you
> disable the Nagle algorithm ... the parameter is useless, right?

I'm guessing it is controlling the delayed ACK timer, which while it
does interact with the Nagle algorithm, isn't really part of it.

Nagle on/off is a perennial discussion topic. In olden days at least,
it was often tripped-over when applications did what could be
described as "write, write, read" behaviour - sending logically
associated data in separate write/send calls. The common expedient at
the time was to disable Nagle (set TCP_NODELAY to 1) and/or disable
delayed ACKs. Short of the matter of having multiple "transactions"
in flight at one time, I was always of the opinion that the correct
answer was to write all logically associated data to the transport in
a single, perhaps gathering, send call.

rick jones

Some ancient boilerplate I used to post (you will want to read it in a
fixed-width font for greatest ease):

raj@tardy:~$ cat usenet_replies/nagle_algorithm.txt

> I'm not familiar with this issue, and I'm mostly ignorant about what
> tcp does below the sockets interface. Can anybody briefly explain what
> "nagle" is, and how and when to turn it off? Or point me to the
> appropriate manual.

In broad terms, whenever an application does a send() call, the logic
of the Nagle algorithm is supposed to go something like this:

1) Is the quantity of data in this send, plus any queued, unsent data,
greater than the MSS (Maximum Segment Size) for this connection? If
yes, send the data in the user's send now (modulo any other
constraints such as receiver's advertised window and the TCP
congestion window). If no, go to 2.

2) Is the connection to the remote otherwise idle? That is, is there
no unACKed data outstanding on the network? If yes, send the data
in the user's send now. If no, queue the data and wait. Either the
application will continue to call send() with enough data to get to
a full MSS-worth of data, or the remote will ACK all the currently
sent, unACKed data, or our retransmission timer will expire.

Now, where applications run into trouble is when they have what might
be described as "write, write, read" behaviour, where they present
logically associated data to the transport in separate 'send' calls
and those sends are typically less than the MSS for the connection.
It isn't so much that they run afoul of Nagle as they run into issues
with the interaction of Nagle and the other heuristics operating on
the remote. In particular, the delayed ACK heuristics.

When a receiving TCP is deciding whether or not to send an ACK back to
the sender, in broad handwaving terms it goes through logic similar to
this:

a) is there data being sent back to the sender? If yes, piggy-back the
ACK on the data segment.

b) is there a window update being sent back to the sender? If yes,
piggy-back the ACK on the window update.

c) has the standalone ACK timer expired?

Window updates are generally triggered by the following heuristics:

i) would the window update be for a non-trivial fraction of the window
- typically somewhere at or above 1/4 the window, that is, has the
application "consumed" at least that much data? If yes, send a
window update. If no, check ii.

ii) would the window update be for at least 2*MSS worth of data
consumed by the application? If yes, send a window update; if no,
wait.

Now, going back to that write, write, read application, on the sending
side, the first write will be transmitted by TCP via Nagle rule 2 -
the connection is otherwise idle. However, the second small send will
be delayed as there is at that point unACKnowledged data outstanding
on the connection.

At the receiver, that small TCP segment will arrive and will be passed
to the application. The application does not have the entire app-level
message, so it will not send a reply (data to TCP) back. The typical
TCP window is much much larger than the MSS, so no window update would
be triggered by heuristic i. The data that just arrived and was consumed
by the application is < 2*MSS, so no window update from heuristic ii. Since
there is no window update, no ACK is sent by heuristic b.

So, that leaves heuristic c - the standalone ACK timer. That ranges
anywhere between 50 and 200 milliseconds depending on the TCP stack in
use.

If you've read this far :) now we can take a look at the effect of
various things touted as "fixes" to applications experiencing this
interaction. We take as our example a client-server application where
both the client and the server are implemented with a write of a small
application header, followed by application data. First, the
"default" case which is with Nagle enabled (TCP_NODELAY _NOT_ set) and
with standard ACK behaviour:

Client Server
Req Header ->
<- Standalone ACK after Nms
Req Data ->
<- Possible standalone ACK
<- Rsp Header
Standalone ACK ->
<- Rsp Data
Possible standalone ACK ->


For two "messages" we end-up with at least six segments on the wire.
The possible standalone ACKs will depend on whether the server's
response time, or client's think time is longer than the standalone
ACK interval on their respective sides. Now, if TCP_NODELAY is set we
see:


Client Server
Req Header ->
Req Data ->
<- Possible Standalone ACK after Nms
<- Rsp Header
<- Rsp Data
Possible Standalone ACK ->

In theory, we are down to four segments on the wire, which seems good,
but frankly we can do better. First though, consider what happens
when someone disables delayed ACKs:

Client Server
Req Header ->
<- Immediate Standalone ACK
Req Data ->
<- Immediate Standalone ACK
<- Rsp Header
Immediate Standalone ACK ->
<- Rsp Data
Immediate Standalone ACK ->

Now we definitely see 8 segments on the wire. It will also be that way
if both TCP_NODELAY is set and delayed ACKs are disabled.

How about if the application did the "right" thing in the first place?
That is, sent the logically associated data at the same time:


Client Server
Request ->
<- Possible Standalone ACK
<- Response
Possible Standalone ACK ->

We are down to two segments on the wire.

For "small" packets, the CPU cost is about the same whether they carry
data or an ACK. This means that the application which is making the proper
gathering send call will spend far fewer CPU cycles in the networking
stack.


--
oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates
these opinions are mine, all mine; HPE might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hpe.com but NOT BOTH...