
Tuning delayed ACK affects select() dispatching?


Thomas Segev

Dec 20, 2001, 12:32:28 PM
A sockets application on OS/390 sends a file to a receiver on AIX.
In netstat display on AIX the recv-q of the receiver connection always
has data.

Tuning the parameter fasttimo on AIX has a significant effect on the
throughput. This timeout controls how often the system scans the TCP
control blocks to send delayed acknowledgments.

With default fasttimo set to 200 milliseconds the receiver application
using select() is dispatched with available data about every 200
milliseconds.

When fasttimo is set to 50 milliseconds the receiver application is
dispatched about every 50 milliseconds.

The result is a much faster transfer.

Since the data is already in TCP/IP recv buffers of the receiver
connection, it seems that besides any effect that changing fasttimo may
have on TCP/IP packet sends and acks, changing fasttimo also affects how
often the kernel will scan socket applications and dispatch those with
available data.

I have experienced similar results with decreasing
tcp_deferred_ack_interval on HP-UX.

Can anybody confirm or comment on this?

Thanks,
Thomas

David Schwartz

Dec 20, 2001, 3:42:59 PM

When the reader application gets a hit on 'select', does it keep
reading until read returns EWOULDBLOCK?

DS

Thomas Segev

Dec 21, 2001, 1:48:42 AM
David Schwartz <dav...@webmaster.com> wrote in message news:<3C224D53...@webmaster.com>...

> When the reader application gets a hit on 'select', does it keep
> reading until read returns EWOULDBLOCK?
>
> DS

No, it just reads one logical message (based on the length preceding the
message) and enters select again.
I assume that rewriting the reader side to keep reading until
EWOULDBLOCK will solve the problem. I was under the (wrong?)
impression that when you issue select - and there is already data in
the receive buffers - you immediately get a 'hit' on the select.

Thomas

Adam Dunkels

Dec 21, 2001, 5:05:15 AM
Thomas Segev wrote:

> A sockets application on OS/390 sends a file to a receiver on AIX.
> In netstat display on AIX the recv-q of the receiver connection always
> has data.
>
> Tuning the parameter fasttimo on AIX has a significant effect on the
> throughput
> This timeout controls how often the system scans the TCP control blocks
> to send delayed acknowledgments.
>
> With default fasttimo set to 200 milliseconds the receiver  application
> using select() is dispatched with available data about  every 200
> milliseconds.
>
> When fasttimo is set to 50 milliseconds the receiver application is
> dispatched about every 50 milliseconds.
>
> The result is a much faster transfer.

By reducing the fasttimo value, ACKs are delayed a shorter time and more
ACKs will be sent. This will make the slow start mechanism at the sending
side inflate the congestion window faster (it increases the congestion
window by one segment for every ACK it receives, roughly doubling it each
round trip - more ACKs = faster window increase). The same behaviour can
be seen if delayed ACKs are turned off completely.

/adam
--
Adam Dunkels <ad...@dunkels.net> (Spambait)
http://dunkels.com/adam/

Adam Dunkels

Dec 21, 2001, 7:45:43 AM
Emil Naepflein wrote:

>> By reducing the fasttimo value, ACKs are delayed a shorter time and more
>> ACKs will be sent. This will make the slow start mechanism at the sending
>> side inflate the congestion window faster (it increases the congestion
>> window by one segment for every ACK it receives, roughly doubling it each
>> round trip - more ACKs = faster window increase).
>> The same behaviour can be seen if delayed ACKs are turned off completely.
>

> How can a timeout that only happens when *no* data is available cause an
> ACK to be delayed for a shorter time?
> The ack handling in the protocol stack is independent from select().

This has nothing to do with select(); it is just how TCP works in most
cases involving short uni-directional transfers. With a low delayed ACK
delay, more ACKs are sent by the receiver, which makes the sender open the
congestion window quicker. The result is a faster transfer.

I believe newer TCP/IP stacks such as the one in newer versions of Linux
and FreeBSD are "immune" to this kind of artificial inflation of the
congestion window; instead of counting incoming ACKs, they count the number
of segments that are acknowledged by the ACKs.

Thomas Segev

Dec 21, 2001, 3:08:36 PM
Adam Dunkels <ad...@dunkels.net> wrote in message news:<9vvatp$2i2o$1...@not.sics.se>...

OK, I admit, this was not the whole story.
It seems we can't avoid it, and it will probably be more interesting.
This is not a real file transfer: on the OS/390 side there is a
monitoring application sending object status change reports as they
occur to the UNIX side.
This flow of updates happens all the time, and at peak periods the rate
becomes high enough for the receive window to shut completely (zero
window).
Here is what I have seen from a packet trace on OS/390 communicating
with HP-UX:
With tcp_deferred_ack_interval set to 50 ms, the zero window condition
lasts for about 27 seconds. Afterwards the window opens fully to its
maximum value of 32K.
With tcp_deferred_ack_interval set to 5 ms, the window opens to 32K
after 9 seconds.
My only explanation for this is that the reader side (which issues
select after every message and does not keep reading until
EWOULDBLOCK) gets control back from select with some delay that is
correlated with the deferred ACK interval. This deferred ACK interval
has no effect on the sender at this time, as there is no data transfer
taking place while the window is shut.

Thomas

David Schwartz

Dec 21, 2001, 4:08:47 PM
Thomas Segev wrote:

> No, it just reads one logical message (based on the length preceding the
> message) and enters select again.
> I assume that rewriting the reader side to keep reading until
> EWOULDBLOCK will solve the problem. I was under the (wrong?)
> impression that when you issue select - and there is already data in
> the receive buffers - you immediately get a 'hit' on the select.

No, the impression is correct. It should be okay to call 'select',
'read' one time, and then call 'select' again. If the read won't block,
'select' should return immediately. And it's usually inefficient to make
an extra 'read' call on every 'select' hit just to get an EWOULDBLOCK.
However, forcing an extra pass to 'select' for each 'read' can cause
performance problems if the buffer you pass to 'read' is too small. It
also depends upon how often you manage to call 'select' and how much
other code you run inside your 'select' loop.

I suggest looping on 'read' until one of the following:

1) You get zero, indicating the connection closed,
2) You get a negative number, with EWOULDBLOCK indicating that there's
no more data to read,
3) You get a return value less than the size of the buffer you passed,
indicating that there's no more data to read right now, or
4) You get a negative number, with an error that indicates the
connection has failed.

In other words, if you pass a 4KB buffer to read and you get back 4KB,
call 'read' again to try to drain the buffer.
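A minimal sketch of such a drain loop in C, assuming a non-blocking socket
and a hypothetical handle_data() callback for whatever the application does
with the bytes:

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Drain a non-blocking socket after select() reports it readable.
     * Returns 0 while the connection is still usable, -1 on EOF or error.
     * handle_data() is a placeholder for the application's processing. */
    static int drain_socket(int fd, void (*handle_data)(const char *, size_t))
    {
        char buf[4096];

        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                handle_data(buf, (size_t)n);
                if ((size_t)n < sizeof buf)
                    return 0;          /* short read: nothing more right now */
                continue;              /* full buffer: keep reading */
            }
            if (n == 0)
                return -1;             /* peer closed the connection */
            if (errno == EWOULDBLOCK || errno == EAGAIN)
                return 0;              /* buffer drained */
            return -1;                 /* connection failed */
        }
    }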

This assumes that your priority is to give attention to the busy
connection, possibly at the expense of others. 'select' is an expensive
system call for large numbers of sockets, so you want to do as much work
in between calls to 'select' as possible.

DS

Rishi Sinha

Dec 22, 2001, 4:25:11 PM
On Fri, 21 Dec 2001, Adam Dunkels wrote:

> This has nothing to do with select(), it is just how TCP works in most
> cases involving short uni-directional transfers. With a low delayed ACK
> delay, more ACKs are sent by the receiver which makes the sender open the
> congestion winodw quicker. The result is a faster transfer.

I'm not sure that's true (see below).

> I believe newer TCP/IP stacks such as the one in newer versions of Linux
> and FreeBSD are "immune" to this kind of artificial inflation of the
> congestion window; instead of counting incoming ACKs, they count the number
> of segments that are acknowledged by the ACKs.

Was there ever a stack that opened up its cwnd based on the _number_ of
ACKs received? I don't think so. I think cwnd has always been reacting to
the number of unique bytes acknowledged (exclude fast retransmit from this
discussion). Are you sure there was a stack that did the artificial
inflation you describe?

In any case, faster ACKs will open up the cwnd faster, of course.

Rishi.

Thomas Segev

Dec 26, 2001, 2:52:31 AM
You were all right that it was not select() being delayed by the ACK
delay.
I failed to mention some details that seemed irrelevant, but were the
root of the problem:
The update messages were written by the reader to Sybase as they were
read from the incoming connection. This, of course, was via a TCP
connection to SQL Server, AND the server was running with "tcp no
delay" (TCP_NODELAY) set to the default of 0 (off). This updating of
the database was preceded by a query (SQL SELECT) that sent two reply
packets from SQL Server to the client. The second packet was sent only
after the first had been acked by the client (the reader). Decreasing
the ACK delay of the TCP stack improved performance by forcing faster
ACKs from the client.
Setting tcp no delay to 1 (on) via sp_configure improved performance
even more.

Thomas

Brian Utterback

Dec 26, 2001, 9:16:26 AM
On the other hand, that is exactly the wrong solution. The delayed ack algorithm
is designed to lower network congestion by limiting the number of packets that
make inefficient use of the network. Efficient packets are not limited. Some
applications require timely response and cannot use the network efficiently (the
telnet service is the time-honored example), so the ability to bypass the limit
is provided, by setting TCP_NODELAY on the connection.

Unfortunately, when the delayed ack algorithm comes into play and these
inefficient protocols are delayed thereby, the immediate reaction is almost
always "how do I turn off the delay?", when it should be "how do I use the
network more efficiently?"

The behavior of your application is broken. I realize that you may not have
any control over the inner workings of it (Sybase, did you say?), but you should
view using TCP_NODELAY as a workaround and demand that the vendor fix his app.
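For reference, the workaround itself is only a per-socket option; a minimal
sketch (error handling omitted):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable the Nagle algorithm on one TCP socket. This is the
     * workaround discussed above, not a fix for the application. */
    static int set_tcp_nodelay(int fd)
    {
        int on = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
    }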

thomas...@bmc.com (Thomas Segev) wrote in message news:<83b78f96.0112...@posting.google.com>...

Jeffrey Mogul

Dec 26, 2001, 8:31:21 PM
In article <9bb98867.01122...@posting.google.com>, bl...@onebox.com (Brian Utterback) writes:
|> On the other hand, that is exactly the wrong solution. The delayed ack algorithm
|> is designed to lower network congestion by limiting the number of packets that
|> make inefficient use of the network. Efficient packets are not limited. Some
|> applications require timely response and cannot use the network effciently (the
|> telnet service is the time honored example), so the ability to bypass the limit
|> is provided, by setting TCP_NODELAY on the connection.

On the other other hand, the TCP_NODELAY option (in BSD kernels, at least)
does not directly affect the delayed ACK algorithm at all. It controls
the Nagle algorithm, which in turn controls the decisions about when
to send small data packets. (The delayed ACK algorithm controls when
the receiver sends ACKs.)

It's questionable whether either the Nagle algorithm or the delayed
ACK algorithm, in their semi-standard forms, are really the best solution.
They often interact in such a way that seduces people into setting
TCP_NODELAY (thus disabling Nagle, thus avoiding this particular
interaction).

This is not good, because most people who set TCP_NODELAY really don't
understand what they are doing wrong, and those who are forced to set
it (even with full understanding) ought to have a somewhat more precise
tool available.

Anyway: if you don't know why setting TCP_NODELAY is a bad idea, you
shouldn't do it. If you know why it is a bad idea, and you decide you
have to set it anyway ... well, that's a shame, but maybe necessary
(in the absence of improved implementations).

-Jeff

Brian Utterback

Dec 28, 2001, 8:56:46 AM
Jeffrey Mogul wrote:
>
> In article <9bb98867.01122...@posting.google.com>, bl...@onebox.com (Brian
> Utterback) writes:
> |> On the other hand, that is exactly the wrong solution. The delayed ack algorithm
> |> is designed to lower network congestion by limiting the number of packets that
> |> make inefficient use of the network. Efficient packets are not limited. Some
> |> applications require timely response and cannot use the network effciently (the
> |> telnet service is the time honored example), so the ability to bypass the limit
> |> is provided, by setting TCP_NODELAY on the connection.
>
> On the other other hand, the TCP_NODELAY option (in BSD kernels, at least)
> does not directly affect the delayed ACK algorithm at all. It controls
> the Nagle algorithm, which in turn controls the decisions about when
> to send small data packets. (The delayed ACK algorithm controls when
> the receiver sends ACKs.)

Well, yes, I just didn't want to get into the full explanation. The TCP_NODELAY
option affects the Nagle algorithm, but if delayed ack wasn't on, then we wouldn't
have as large a delay. Since we can't control delayed ack on the other system, the
best we can do is control Nagle on this one. But my point is still valid, that
they are both your friends.

>
> It's questionable whether either the Nagle algorithm or the delayed
> ACK algorithm, in their semi-standard forms, are really the best solution.
> They often interact in such a way that seduces people into setting
> TCP_NODELAY (thus disabling Nagle, thus avoiding this particular
> interaction).
>
> This is not good, because most people who set TCP_NODELAY really don't
> understand what they are doing wrong, and those who are forced to set
> it (even with full understanding) ought to have a somewhat more precise
> tool available.

Exactly my point.

>
> Anyway: if you don't know why setting TCP_NODELAY is a bad idea, you
> shouldn't do it. If you know why it is a bad idea, and you decide you
> have to set it anyway ... well, that's a shame, but maybe necessary
> (in the absence of improved implementations).

Quite. Which brings me to one of my pet peeves. Why in the world isn't there a
way for the application to tell the system when the write is complete? If there
was such a way, basically allowing the application to specify that the push
bit should be set on this write call and not to delay any more, it would
provide the "best of both worlds" solution. With the TCP_NODELAY option, you
can only specify that the protocol on this connection *overall* needs to write
tinygrams that are not delayed, but what about the application that gets data
in dribs and drabs, but knows when it is going to wait for the response from
the other end? By allowing small writes to accumulate via Nagle up to the last
write that finishes one "logical" transaction, we could have efficient network
use, and no delay. The info on the "push" bit screams out for such an API, but
I don't think anybody has one, although I have heard about setting and
unsetting TCP_NODELAY on the fly to accomplish the same thing, a kludge if
there ever was one.
--
blu

________________________________________________________________________________
NTP: the protocol that ensures that a good time is had by all.

Vernon Schryver

Dec 28, 2001, 11:49:01 AM
In article <9bb98867.01122...@posting.google.com>,
Brian Utterback <bl...@onebox.com> wrote:

> ...


>Quite. Which brings me to one of my pet peeves. Why in the world isn't there a
>way for the application to tell the system when the write is complete? ...

Reasonable systems have something much better than an output PSH bit.
Instead of forcing the operating system to allocate and copy data buffers
while the application dithers and tries to collect its data, the BSD
socket sendmsg() system call lets the application build a gather list
of the pieces of its message and then do the operation in a single system
call. The operating system can then send it in one burst, perhaps by
pointing DMA hardware at a gather list built from the (struct msghdr)
so that no memory bandwidth is wasted on useless copies. That sendmsg()
is a single system call saves lots of cycles that are otherwise wasted
on validating file descriptors, socket buffer limits, and so forth.
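A minimal sketch of such a gather send, assuming a two-piece message (a header
and a body; both layouts are hypothetical):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Send a header and body in one system call by gathering them from
     * separate buffers, so the stack can emit them as one burst. */
    static ssize_t send_gathered(int fd, const void *hdr, size_t hdrlen,
                                 const void *body, size_t bodylen)
    {
        struct iovec  iov[2];
        struct msghdr msg;

        iov[0].iov_base = (void *)hdr;
        iov[0].iov_len  = hdrlen;
        iov[1].iov_base = (void *)body;
        iov[1].iov_len  = bodylen;

        memset(&msg, 0, sizeof msg);
        msg.msg_iov    = iov;
        msg.msg_iovlen = 2;

        return sendmsg(fd, &msg, 0);
    }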


Vernon Schryver v...@rhyolite.com

Rishi Sinha

Dec 29, 2001, 1:54:10 AM
On 28 Dec 2001, Brian Utterback wrote:
[...]

> Quite. Which brings me to one of my pet peeves. Why in the world isn't
> there a way for the application to tell the system when the write is
> complete? If there was such a way, basically allowing the application
> to specify that the push bit should be set on this write call and not
> to delay any more, it would provide the "best of both worlds"
> solution. With the TCP_NODELAY option, you can only specify that the
> protocol on this connection *overall* needs to write tinygrams that
> are not delayed, but what about the application that gets data in dribs
> and drabs, but knows when it is going to wait for the response from
> the other end? By allowing small writes to accumulate via Nagle up to
> the last write that finishes one "logical" transaction, we could have
> efficient network use, and no delay. The info on the "push" bit
> screams out for such an API, but I don't think anybody has one,
> although I have heard about setting and unsetting TCP_NODELAY on the
> fly to accomplish the same thing, a kludge if there ever was one. --

Setting the PSH bit on a tinygram would not override the Nagle delaying of
the transmission of this tinygram. I believe the PSH bit is set by most
stacks anyway, on every tinygram, and on the last segment resulting from a
large write().

Here's the relevant quote from Section 4.2.3.3 of RFC 1122:

"The Nagle algorithm is generally as follows:

If there is unacknowledged data (i.e., SND.NXT > SND.UNA), then the
sending TCP buffers all user data (regardless of the PSH bit), until the
outstanding data has been acknowledged or until the TCP can send a
full-sized segment[...]"
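In rough C terms, the rule as quoted boils down to something like this (a
sketch only; window checks, retransmission and the TCP_NODELAY override are
left out):

    #include <stddef.h>

    /* Sender-side Nagle test, per the RFC 1122 text above. Note that the
     * PSH bit plays no part in the decision. */
    static int nagle_allows_send(size_t queued_bytes, size_t mss,
                                 int unacked_data_outstanding)
    {
        if (queued_bytes >= mss)
            return 1;   /* a full-sized segment can always go out */
        if (!unacked_data_outstanding)
            return 1;   /* nothing in flight: send the tinygram now */
        return 0;       /* otherwise buffer until the outstanding data is acked */
    }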

Rishi.

Brian Utterback

Dec 31, 2001, 9:40:50 AM
Rishi Sinha <rish...@aludra.usc.edu> wrote in message news:<Pine.GSO.4.33.01122...@aludra.usc.edu>...

> On 28 Dec 2001, Brian Utterback wrote:
> [...]
> > Quite. Which brings me to one of my pet peeves. Why in the world isn't
> > there a way for the application to tell the system when the write is
> > complete? If there was such a way, basically allowing the application
> > to specify that the push bit should be set on this write call and not
> > to delay any more, it would provide the "best of both worlds"
> > solution. With the TCP_NODELAY option, you can only specify that the
> > protocol on this connection *overall* needs to write tinygrams that
> > are not delayed, but what about the application that gets data in dribs
> > and drabs, but knows when it is going to wait for the response from
> > the other end? By allowing small writes to accumulate via Nagle up to
> > the last write that finishes one "logical" transaction, we could have
> > efficient network use, and no delay. The info on the "push" bit
> > screams out for such an API, but I don't think anybody has one,
> > although I have heard about setting and unsetting TCP_NODELAY on the
> > fly to accomplish the same thing, a kludge if there ever was one. --
>
> Setting the PSH bit on a tinygram would not override the Nagle delaying of
> the transmission of this tinygram. I believe the PSH bit is set by most
> stacks anyway, on every tinygram, and on the last segment resulting from a
> large write().
>
> Here's the relevant quote from Section 4.2.3.3 of RFC 1122:
>
> "The Nagle algorithm is generally as follows:
>
> If there is unacknowledged data (i.e., SND.NXT > SND.UNA), then the
> sending TCP buffers all user data (regardless of the PSH bit), until the
> outstanding data has been acknowledged or until the TCP can send a
> full-sized segment[...]"

You are missing my point. Data driven application design often produces
applications that write small amounts of data in a single write, perhaps
many writes in a row which Nagle tries to accumulate. Only the application
knows when it is done doing these small writes and must wait for a reply. If
the sum of these writes does not fill a packet, then the first write typically
created one packet, and the rest of the writes get sent in a second packet when
the acknowledgment for the first packet arrives. Since the process at the other
end wants all the writes before sending a reply, if any, the acknowledgment is
delayed due to delayed ack.

Since the delivery of the data is delayed, 9 times out of 10 the application
programmer will simply set TCP_NODELAY, resulting in no delay but a horribly
inefficient use of network resources. Now, the traditionally correct thing
to do would be to rewrite the app to either copy the data into a buffer, or
possibly use writev, or as mentioned in another article in this thread, use
sendmsg with gather to also grab all the data in a single write. That's
great, but the buffer management can get tricky if the data is larger than a
packet, and might needlessly delay data that might have been sent earlier,
making traffic "bursty" rather than "smooth". Most programmers won't bother.

A compromise solution is to modify the API to allow the programmer to set a
"flush" bit with the write. This bit would signify "I am done writing and
cannot do anything until I get a reply". This would then turn off the
Nagle algorithm for *this write only*, setting the push bit and writing all
pending data subject to receive and congestion windows, but not outstanding
tinygrams. Allow a zero length write to set this bit without moving data, and
then the mods to our hypothetical program become trivial, and allow efficient
use of the network without introducing unnecessary delays.

I know that most implementations set PUSH on each packet, and that most
implementations don't do anything with that knowledge. That is because there
is no API to set the bit by the application. Since no info is imparted by the
bit, it does not make sense to treat it any differently. If the bit did say
something about the packet, then it would be possible (and desirable) to
treat packets with the bit set differently from other packets.

Vernon Schryver

Dec 31, 2001, 12:49:31 PM
In article <9bb98867.01123...@posting.google.com>,
Brian Utterback <bl...@onebox.com> wrote:

> ...


>possibly use writev, or as mentioned in another article in this thread, use
>sendmsg with gather to also grab all the data in a single write. That's
>great, but the buffer management can get tricky if the data is larger than a
>packet, and might needlessly delay data that might have been sent earlier,
>making traffic "bursty" rather than "smooth". Most programmers won't bother.

> ...

Managing the necessarily small buffers that are being discussed here
is very unlikely to be tricky. If the data is much larger than a packet
(or more accurately, the MSS), then there is no need to combine write()
requests in a single writev()/sendmsg(). You need only use writev()
on that last couple thousand bytes of dribbles.

Using writev()/sendmsg() instead of a bunch of write()s is very unlikely
to make the traffic more bursty than it would otherwise be.

Programmers who are too lazy and incompetent to use the appropriate
tools are worse than useless. Programmers who cannot handle writev()
could not handle an API PSH bit. Adding a PSH bit would only
provide another excuse for whining by the incompetent and lazy
about how terribly hard things are.


Vernon Schryver v...@rhyolite.com

David Schwartz

Dec 31, 2001, 5:04:01 PM
Brian Utterback wrote:

> A compromise solution is to modify the API to allow the programmer to set a
> "flush" bit with the write. This bit would signify "I am done writing and
> cannot do anything until I get a reply". This would then turn off the
> Nagle algorithm for *this write only*, setting the push bit and writing all
> pending data subject to receive and congestion windows, but not outstanding
> tinygrams. Allow a zero length write to set this bit without moving data, and
> then the mods to our hypothetical program become trivial, and allow effcient
> use of the network without introducing unneccesary delays.

Linux has an interesting setting, TCP_CORK, which causes the stack to
only send full packets. You can set this option, make multiple writes,
and then uncork the socket to immediately send whatever was left over
that didn't fit in a full packet. Theoretically, this should allow for
high network efficiency without extra delays.

However, it has some disadvantages. You must ensure your code always
corks and uncorks in the right place. If you mess up (miss an uncork),
you can deadlock. And if you could find the right places to cork/uncork,
you could start writing to a buffer and flush the buffer in those same
places. The corresponding user-space solution would require fewer
user/kernel transitions.
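A sketch of the cork/uncork pattern (Linux-specific; the two-part message and
the missing error handling are just for illustration):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Hold back partial segments while several small writes are issued,
     * then release whatever is left as one final packet. */
    static void send_corked(int fd, const char *hdr, size_t hdrlen,
                            const char *body, size_t bodylen)
    {
        int on = 1, off = 0;

        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);    /* cork */
        write(fd, hdr, hdrlen);
        write(fd, body, bodylen);
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof off);  /* uncork */
    }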

DS

Rishi Sinha

Jan 1, 2002, 11:43:38 AM
On 31 Dec 2001, Brian Utterback wrote:

[...]


> A compromise solution is to modify the API to allow the programmer to
> set a "flush" bit with the write. This bit would signify "I am done
> writing and cannot do anything until I get a reply". This would then
> turn off the Nagle algorithm for *this write only*, setting the push

[...]

I agree with what you say about the motivation for and use of such a bit.

> I know that most implementations set PUSH on each packet, and that
> most implementations don't do anything with that knowledge. That is
> because there is no API to set the bit by the application. Since no
> info is imparted by the bit, it does not make sense to treat it any
> differently. If the bit did say something about the packet, then it
> would be possible (and desireable) to treat packets with the bit set
> differently from other packets.

I got the impression you were proposing a change to the API only. An API
that let the application set the PSH bit would not help.

Rishi.

Adam Dunkels

Jan 2, 2002, 11:32:26 AM
Rishi Sinha wrote:

>> I believe newer TCP/IP stacks such as the one in newer versions of Linux
>> and FreeBSD are "immune" to this kind of artificial inflation of the
>> congestion window; instead of counting incoming ACKs, they count the
>> number of segments that are acknowledged by the ACKs.
>

> Was there ever a stack that opened up its cwnd based on the number of
> ACKs received? I don't think so. I think cwnd has always been reacting to
> the number of unique bytes acknowledged (exclude fast retransmit from this
> discussion). Are you sure there was a stack that did the artificial
> inflation you describe?
>
> In any case, faster ACKs will open up the cwnd faster, of course.

I meant that the congestion window was inflated based on the number of ACKs
that acknowledge new data (just as you say). The artificial inflation occurred
when the receiver sent many such ACKs per received segment - for a 1500-byte
segment the receiver could send up to 1500 unique ACKs, one for each byte in
the segment. The sender would only count the number of unique ACKs received,
and not the number of segments that the ACKs really acknowledged, thus
inflating cwnd too much.

I believe this problem was called "TCP Daytona" when it was found, and at
least FreeBSD version 3.x was vulnerable.
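To illustrate the difference, a toy sketch of the two sender policies during
slow start (not any particular stack's code; cwnd and mss are in bytes):

    #include <stddef.h>

    /* ACK-counting behaviour: one MSS of growth per ACK, however little it
     * acknowledges. A receiver that splits its ACKs ("ACK division")
     * inflates cwnd this way. */
    static size_t cwnd_grow_per_ack(size_t cwnd, size_t mss)
    {
        return cwnd + mss;
    }

    /* Byte-counting behaviour: growth is bounded by the amount of new data
     * actually acknowledged, so many tiny ACKs gain the receiver nothing. */
    static size_t cwnd_grow_per_bytes(size_t cwnd, size_t mss, size_t bytes_acked)
    {
        size_t inc = bytes_acked < mss ? bytes_acked : mss;
        return cwnd + inc;
    }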

Thomas Segev

Jan 3, 2002, 7:50:28 AM
I am glad to have stirred up this lively discussion.

I would like to make two comments:
1. Although disabling NAGLE with TCP_NODELAY is bad in general,
this is not so in my SPECIFIC case.

With TCP_NODELAY set to off, this is what happens:

Client Server

Request
------------------------>
Reply (1st part)
<-----------------------
...
...
...
ack
--------------->
Reply (2nd part)
<-----------------------

With TCP_NODELAY set to on:


Client Server

Request
------------------------>
Reply (1st part)
<-----------------------
Reply (2nd part)
<-----------------------


Packet accumulation does not occur anyway, and the NAGLE delay is for naught.
Since this is the typical behavior pattern of this specific application,
the overall result of setting TCP_NODELAY is only beneficial.


2. If I may mention SNA here, I would like to point out that CPI-C provides
an explicit Flush() call to send the data immediately.
It also performs implicit flush when the application issues a Receive() call.

Thomas


Thomas Segev

Jan 3, 2002, 10:27:24 AM

Alun Jones

Jan 3, 2002, 10:34:55 AM
In article <3C345394...@bmc.com>, Thomas Segev <thomas...@bmc.com>
wrote:

>I would like to make two comments:
>1. Although disabling NAGLE with TCP_NODELAY is bad in general,
>this is not so in my SPECIFIC case.
..

>Packet accumulation does not occur anyway, and the NAGLE delay is for naught.

Actually, your example _does_ show that Nagle has a benefit - with Nagle on,
you're sending only one ACK - with Nagle off, you're sending two. That's two
round trips instead of one. Your server should endeavour to send both parts
of the reply in one send() to avoid the delay associated with delayed ACK,
rather than disabling Nagle.

>Since this is the typical behavior pattern of this specific application,
>the overall result of setting TCP_NODELAY is only beneficial.

And yet, is not as beneficial as _not_ setting TCP_NODELAY and writing your
server such that it coalesces associated data into a single send().

>2. If I may mention SNA here, I would like to point out that CPI-C provides
>an explicit Flush() call to send the data immediately.
>It also performs implicit flush when the application issues a Receive() call.

And, of course, when the network card's buffer is full, it pushes all those
electrons up even tighter, such that it forces data to flow down the wire,
right? I'm being sarcastic. Flush, even if it was to be implemented, would
be of _no_ use whatever, because the data doesn't qualify as having been
successfully sent (i.e. it's still in the buffer) until the ACK comes back.

Alun.
~~~~

[Note that answers to questions in newsgroups are not generally
invitations to contact me personally for help in the future.]
--
Texas Imperial Software | Try WFTPD, the Windows FTP Server. Find us at
1602 Harvest Moon Place | http://www.wftpd.com or email al...@texis.com
Cedar Park TX 78613-1419 | VISA/MC accepted. NT-based sites, be sure to
Fax/Voice +1(512)258-9858 | read details of WFTPD Pro for NT.

Brian Utterback

Jan 3, 2002, 3:34:59 PM
Thomas Segev <thomas...@bmc.com> wrote in message news:<3C345394...@bmc.com>...

As I said earlier, this is the typical situation when TCP_NODELAY is used to
fix a problem that is really a problem in the application. Only if the two
packet sizes sum to greater than one full packet and you can be sure that the
sequence won't ever go to more than two packets is it reasonable to use
TCP_NODELAY. While the purpose of Nagle is to accumulate data if possible,
another thing it does is limit the number of small packets on the wire at the
same time.

In that regard Nagle is working as it is supposed to, and using TCP_NODELAY
is simply thwarting it. The correct answer is to send the two replies as a
single write operation, and accumulate them in the application. If they sum
to greater than one full packet, in this one case the delay due to Nagle is
for naught, and increases the congestion due to the extra ACK that is returned.

David Schwartz

Jan 3, 2002, 3:55:37 PM
Thomas Segev wrote:

> 1. Although disabling NAGLE with TCP_NODELAY is bad in general,
> this is not so in my SPECIFIC case.

Yes it is, because it becomes the excuse not to fix the application.



> With TCP_NODELAY set to off, this is what happens:
>
> Client Server
>
> Request
> ------------------------>
> Reply (1st part)
> <-----------------------
> ...
> ...
> ...
> ack
> --------------->
> Reply (2nd part)
> <-----------------------

This is good. A poorly written application is being penalized for being
poorly written. It is not permitted to monopolize network bandwidth by
sending two packets where one will do.



> With TCP_NODELAY set to on:
>
> Client Server
>
> Request
> ------------------------>
> Reply (1st part)
> <-----------------------
> Reply (2nd part)
> <-----------------------
>
> Packet accumulation does not occur anyway, and the NAGLE delay is for naught.

And here the application is not penalized for its bad behavior. It *is*
allowed to send two packets where one will do. And the incentive to
actually fix the application is removed.



> Since this is the typical behavior pattern of this specific application,
> the overall result of setting TCP_NODELAY is only beneficial.

How is hiding a problem better than fixing it?



> 2. If I may mention SNA here, I would like to point out that CPI-C provides
> an explicit Flush() call to send the data immediately.
> It also performs implicit flush when the application issues a Receive() call.

Just do that yourself. Have a buffer. Put small writes into the buffer.
You can flush the buffer too. Then you'll get the right behavior and you
won't have to disable Nagle. You'll get:

Request
------------------------>
Reply (whole thing)
<-----------------------

which is better than either of your scenarios above.
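A minimal sketch of such a user-space buffer, assuming a blocking socket and
one connection per buffer (overflow here simply flushes early, and send()
errors are ignored):

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>

    struct outbuf { int fd; size_t len; char data[1400]; };

    /* Push whatever has accumulated as a single send(). */
    static void outbuf_flush(struct outbuf *b)
    {
        if (b->len > 0) {
            send(b->fd, b->data, b->len, 0);
            b->len = 0;
        }
    }

    /* Queue a small message; it goes on the wire only at flush time
     * (or when the buffer fills). Assumes n <= sizeof b->data. */
    static void outbuf_write(struct outbuf *b, const void *p, size_t n)
    {
        if (b->len + n > sizeof b->data)
            outbuf_flush(b);
        memcpy(b->data + b->len, p, n);
        b->len += n;
    }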

DS

Thomas Segev

Jan 3, 2002, 4:10:43 PM
al...@texis.com (Alun Jones) wrote in message news:<zS_Y7.1806$ot3.85...@newssvr11.news.prodigy.com>...

> In article <3C345394...@bmc.com>, Thomas Segev <thomas...@bmc.com>
> wrote:
> >I would like to make two comments:
> >1. Although disabling NAGLE with TCP_NODELAY is bad in general,
> >this is not so in my SPECIFIC case.
> ..
> >Packet accumulation does not occur anyway, and the NAGLE delay is for naught.
>
> Actually, your example _does_ show that Nagle has a benefit - with Nagle on,
> you're sending only one ACK - with Nagle off, you're sending two. That's two
> round trips instead of one. Your server should endeavour to send both parts
> of the reply in one send() to avoid the delay associated with delayed ACK,
> rather than disabling Nagle.
OK, I re-examined the trace, and the following is the correct picture:

With TCP_NODELAY set to off, this is what happens:

Client Server

Select (109 bytes)
------------------------>
ack (40 bytes)
<-------------
Reply (1st part - 552 bytes)
<-----------------------
...
...
...
ack (40 bytes)
--------------->
Reply (2nd+3rd part - 648 bytes)
<-----------------------
Begin tran (64 bytes)
----------------------->


With TCP_NODELAY set to on:


Client Server

Select (109 bytes)
------------------------>
Reply (1st part 552 bytes)
<-----------------------
Reply (2nd part 552 bytes)
<-----------------------
Reply (3rd part 136 bytes)
<----------------------
Begin tran (64 bytes)
----------------------->


With Nagle on, there are two extra ACKs, but the Nagle delay does
accomplish combining two messages into one. The first ACK could be
just a coincidence of the delayed ACK timer expiring just before there
was some reply to send.

A similar scenario is described in "Internet Core Protocols", Eric
Hall, on page 388: "Interactions between Nagle and delayed
acknowledgments".
In the example there "if an application... needs to send
one-and-a-half segments of data, then the first segment will be sent
immediately but the second (small) segment will not be sent until the
first segment has been acknowledged... the first full-sized segment
described above would not get acknowledged immediately, since two full
sized segments had not been received, nor would there be any data
being returned since not all application data had been received ... as
of yet. Instead, the acknowledgement would not be sent until the
acknowledgement timer had expired...".
It also says there that "The only real cure for this situation is to
disable the use of the Nagle algorithm on the system that is sending
the one-and-a-half segments of application data".

In my example it happens that the total size of the reply is less than a
full segment, so if the application had combined it into one send
there would have been no delay.
However, this works only if the reply does not exceed the segment size.

>
> >Since this is the typical behavior pattern of this specific application,
> >the overall result of setting TCP_NODELAY is only beneficial.
>
> And yet, is not as beneficial as _not_ setting TCP_NODELAY and writing your
> server such that it coalesces associated data into a single send().

Rewriting the server would work, but only if the reply does not exceed
the segment size.

>
> >2. If I may mention SNA here, I would like to point out that CPI-C provides
> >an explicit Flush() call to send the data immediately.
> >It also performs implicit flush when the application issues a Receive() call.
>
> And, of course, when the network card's buffer is full, it pushes all those
> electrons up even tighter, such that it forces data to flow down the wire,
> right? I'm being sarcastic. Flush, even if it was to be implemented, would
> be of _no_ use whatever, because the data doesn't qualify as having been
> successfully sent (i.e. it's still in the buffer) until the ACK comes back.

Yes, flush does not evacuate the buffers, and it cannot force a send
when the window is shut (zero window), but - used properly - it can
tell the TCP stack that there is no more data coming and that delaying
the sending of the data in an attempt to accumulate some more data is
pure wasted time.
In my example, using flush would achieve the best utilization of
network resources because it would make it possible to combine all
three parts into one message without causing delay. This is what APPC
does.

>
> Alun.
> ~~~~
>
> [Note that answers to questions in newsgroups are not generally
> invitations to contact me personally for help in the future.]


Thomas

Alun Jones

Jan 3, 2002, 6:09:41 PM
In article <83b78f96.0201...@posting.google.com>,
thomas...@bmc.com (Thomas Segev) wrote:
>Yes, flush does not evacuate the buffers, and it can not force send
>when the connection is in congestion (zero window), but - used
>properly - it can tell TCP stack that there is no more data coming and
>that delaying the sending of the data in an attempt to accumulate some
>more data is pure wasted time.

So why not implement this yourself? Instead of calling the send() function,
call a function that instead buffers the data, and create a flush function
that sends the buffered data.

Alun.
~~~~

[Note that answers to questions in newsgroups are not generally
invitations to contact me personally for help in the future.]

Rick Jones

Jan 3, 2002, 6:12:59 PM
Thomas Segev <thomas...@bmc.com> wrote:

> Client Server


> Client Server

I would have thought that the ack of the "Select" would happen as a
function of the response time of the DB server versus the standalone
ack timer, so it would be just as likely in the TCP_NODELAY case. So,
you would have six packets on the network in both cases, or five
packets on the network in both cases. (Versus the three you would have
if all the reply were in one send :)

> A similar scenario is described in "Internet Core Protocols", Eric
> Hall, on page 388: "Interactions between Nagle and delayed
> acknowledgments".
> In the example there "if an application... needs to send
> one-and-a-half segments of data, then the first segment will be sent
> immediately but the second (small) segment will not be sent until the
> first segment has been acknowledged... the first full-sized segment

Only in a "broken" stack. There have been stacks that have mistakenly
interpreted the "nagle rules" on a per-segment basis. The nagle rules
are supposed to be interpreted on a "per-send" basis. That is, any
send that is larger than the MSS, is to be sent without nagle-induced
delays.

> It also says there that "The only real cure for this situation is to
> disable the use of the Nagle algorithm on the system that is sending
> the one-and-a-half segments of application data".

Or fix that stack's broken implementation of nagle :)

You can find such stacks with the netperf TCP_RR test. Increase the -r
parm from just below to just above one MSS and if you see a _major_
drop that stack has a broken implementation of nagle.

rick jones
--
Wisdom Teeth are impacted, people are affected by the effects of events.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com but NOT BOTH...

Thomas Segev

Jan 3, 2002, 11:10:30 PM
al...@texis.com (Alun Jones) wrote in message news:<Vw5Z7.3538$wc1.68...@newssvr30.news.prodigy.com>...

> In article <83b78f96.0201...@posting.google.com>,
> thomas...@bmc.com (Thomas Segev) wrote:
> >Yes, flush does not evacuate the buffers, and it can not force send
> >when the connection is in congestion (zero window), but - used
> >properly - it can tell TCP stack that there is no more data coming and
> >that delaying the sending of the data in an attempt to accumulate some
> >more data is pure wasted time.
>
> So why not implement this yourself? Instead of calling the send() function,
> call a function that instead buffers the data, and create a flush function
> that sends the buffered data.

I would if I could control the application. But it would not eliminate
the redundant delay on the last part when the data amounts to more
than a full segment (one or more) with some leftover.


>
> Alun.
> ~~~~
>
> [Note that answers to questions in newsgroups are not generally
> invitations to contact me personally for help in the future.]

Thomas

Thomas Segev

Jan 3, 2002, 11:36:20 PM
Rick Jones <f...@bar.baz.invalid> wrote in message news:<a12ohr$gk$3...@web1.cup.hp.com>...

With TCP_NODELAY and reasonable DB response time the three replies
span a short duration which is much less than the delayed ack
interval, so no more than one ack timeout is likely to fall in this
time frame.

>
> > A similar scenario is described in "Internet Core Protocols", Eric
> > Hall, on page 388: "Interactions between Nagle and delayed
> > acknowledgments".
> > In the example there "if an application... needs to send
> > one-and-a-half segments of data, then the first segment will be sent
> > immediately but the second (small) segment will not be sent until the
> > first segment has been acknowledged... the first full-sized segement
>
> Only in a "broken" stack. There have been stacks that have mistakenly
> interpreted the "nagle rules" on a per-segment basis. The nagle rules
> are supposed to be interpreted on a "per-send" basis. that is, any
> send that is larger than the MSS, is to be sent without nagle-induced
> delays.
>
> > It also says there that "The only real cure for this situation is to
> > disable the use of the Nagle algorithm on the system that is sending
> > the one-and-a-half segments of application data".
>
> Or fix that stack's broken implementation of nagle :)

Any likelihood of that happening soon? :(


>
> You can find such stacks with the netperf TCP_RR test. Increase the -r
> parm from just below to just above one MSS and if you see a _major_
> drop that stack has a broken implementation of nagle.
>
> rick jones

Thomas

David Schwartz

Jan 4, 2002, 6:48:47 AM
Thomas Segev wrote:

> > So why not implement this yourself? Instead of calling the send() function,
> > call a function that instead buffers the data, and create a flush function
> > that sends the buffered data.

> I would if I could control the application. But it would not eliminate
> the redundant delay on the last part when the data amounts to more
> than full segment (one or more) with some leftover.

That's the problem. The application doesn't provide any way for the
stack to do the right thing. The application is broken.

I'm just curious: was the application coded to formal specifications?
Is the protocol the application implements over TCP formally
specified? Was the application layered on top of a set of network
classes? These are the kinds of mistakes that are easy to make when
protocols and applications are just written out of someone's head
without any kind of specification; well-written network classes don't
let you do things like this.

DS

Thomas R. Truscott

Jan 4, 2002, 12:51:53 PM
>1. Although disabling NAGLE with TCP_NODELAY is bad in general, ...

Please don't buy into their propaganda.
Trust the evidence before your eyes.

They have claimed that you are making a mistake by disabling Nagle. Yet:
1) Disabling Nagle helped more than anything else you tried.
2) They claim doing so increases the ACKs. The opposite is true.
3) They say that disabling Nagle hurts the network by causing congestion.
Is your network like mine? Mine supports remote filesystems
(all home directories and source code are accessed over it),
Web clients and servers, video (I have a video seminar paused
in another window right now), audio, and so on.
All of those services have either disabled Nagle or do not use TCP.
Would your network even 'notice' the traffic from your application?
Don't listen to them, listen to your common sense.
They have only a theoretical claim that
Nagle is keeping the network from collapsing (by slowing it down ? :-).
You have a real example of real harm that is caused by Nagle.

They have claimed superior knowledge of this issue, and that
"if you don't know why setting TCP_NODELAY is a bad idea,
you shouldn't do it." But think back.
You posted your problem on Dec 20 and for the next 6 days they
suggested that select had a bug or performance problem,
that your TCP/IP stack had a bug, that it was due to the slow-start algorithm.
Only on Dec 26 was the obvious "yet another victim of Nagle" suggested.
4) And guess what, it was YOU that suggested it!
Don't be swayed by their claims of intellectual superiority,
or by implications that you are an incompetent programmer.

You made a perfectly reasonable suggestion for a flush mechanism
(both explicit and implicit). They were dismissive.

Their claims and attacks are hopelessly distorted by their dogma.
Even now they are emitting fog about bugs in implementations of Nagle,
oblivious to the reality that it is irrelevant to your problem.
They are beyond the reach of reason.
Don't waste your time, just ignore them.

Tom Truscott

Rick Jones

Jan 4, 2002, 3:25:18 PM
Thomas Segev <thomas...@bmc.com> wrote:
> I would if I could control the application. But it would not eliminate
> the redundant delay on the last part when the data amounts to more
> than full segment (one or more) with some leftover.

Nagle is supposed to be interpreted on a per-user-send basis. As such,
any bytes from a send >= MSS, even the residual sub-MSS stuff, are not
supposed to be delayed in any way by Nagle. Any stack that delays such
residual data is simply broken. The Nagle algorithm is only supposed
to come into play when the user's sends are less than the MSS.

Rick Jones

Jan 4, 2002, 3:20:15 PM
>> I would have thought that the ack of the "Select" would happen as a
>> function of the response time of the DB server versus the standalone
>> ack timer, so it would be just as likely in the TCP_NODELAY case. So,
>> you would have six packets on the netowkr in both cases, or five
>> packets on the network in both cases. (Versus the three you would have
>> if all the reply were in one send :)

> With TCP_NODELAY and reasonable DB response time the three replies
> span a short duration which is much less than the delayed ack
> interval, so no more than one ack timeout is likely to fall in this
> time frame.

I was refering to the ACK you show being sent by the server in
response to the select message, not the ones in the midst of the reply
data. Whether that ACK is sent or not would depend _entirely_ on the
DB response time and not the setting of TCP_NODELAY.

Brian Utterback

Jan 4, 2002, 6:36:37 PM
t...@news.cs.duke.edu (Thomas R. Truscott) wrote in message news:<a14q3p$d2r$1...@hal.cs.duke.edu>...

> They have claimed superior knowledge of this issue, and that
> "if you don't know why setting TCP_NODELAY is a bad idea,
> you shouldn't do it." But think back.
> You posted your problem on Dec 20 and for the next 6 days they
> suggested that select had a bug or performance problem,
> that your TCP/IP stack had a bug, that it was due to the slow-start algorithm.
> Only on Dec 26 was the obvious "yet another victim of Nagle" suggested.
> 4) And guess what, it was YOU that suggested it!
> Don't be swayed by their claims of intellectual superiority,
> or by implications that you are an incompetent programmer.
>
> You made a perfectly reasonable suggestion for a flush mechanism
> (both explicit and implicit). They were dismissive.
>
> Their claims and attacks are hopelessly distorted by their dogma.
> Even now they are emitting fog about bugs in implementations of Nagle,
> oblivious to the reality that it is irrelevant to your problem.
> They are beyond the reach of reason.
> Don't waste your time, just ignore them.
>
> Tom Truscott

I suppose I shouldn't get involved in a discussion with you on this,
since your postings in the past indicate what is a figurative chip on
your shoulder about this, but since I have a bit of one about this
subject, I just couldn't resist.

You say that it took 8 days for anyone to figure out that the problem
was with Nagle, as if that proves that we don't know what we are talking
about or that Nagle is somehow difficult to understand, but as the
original poster at the time admits, he left out details that he thought
were irrelevant, making the diagnosis at a distance difficult.

I provide network support for a living, and I can tell you I always ask
for a trace of the packets and can recognize Nagle in a heartbeat. Or
200ms anyway 8-)

While your claim that we can't expect programmers to know about Nagle
has merit, what about once the problem is pointed out? The delays
happen, they understand why, but you want them to just set TCP_NODELAY
and forget about Nagle. To make a mistake through inexperience and lack
of knowledge is one thing, but to then ignore and circumvent the
established safeguards is quite another.

I agree that sometimes the right thing is to set TCP_NODELAY; sometimes
it is the only thing that can be done. But the decision needs thoughtful
reflection and understanding either way. If you don't want to think
about it and decide based on reason, it is selfish to take the more
dangerous route. If you are writing an application that runs on your own
network and never shares bandwidth with anything else, and you are never
going to run it in any other environment, by all means set TCP_NODELAY.
But if you think that you may give it to somebody else to run, then it
is wise to take the conservative route.

I wish that the TCP stack could always do the "right thing" in all
cases without the need for specialized knowledge by the programmer, but
that doesn't appear to be possible, so dealing with the delays in
programs that are not written with Nagle in mind, either by coding so as
not to send small packets or by setting TCP_NODELAY, seems to me to be
the safest course. On any reasonable TCP stack, you can nearly always
code around Nagle, and make more efficient use of the network.

I would like to give a real-world example. Back in my younger days, I
took a class on software design given by a teacher who was from the
"data driven" design school. This school held that you should look at
the input and output data structures and code around those. I find
programs all the time that are clearly designed around this principle.
Anyway, I had a customer that had an application that was provided by a
third-party vendor. The application searched a database of fixed-format
records and returned the records that matched certain search criteria.
The records were quite short in length.

The application was coded around the data records, and did one write
per record. Thus, the records were delayed and the application ran
slowly. They had a consultant who told them to set TCP_NODELAY, while I
told them to recode using buffering. Since it would be difficult to get
the vendor to recode with buffering, and the vendor would be happy to
make a one-line change and set TCP_NODELAY, they opted for the easy fix
that you advocate.

After setting TCP_NODELAY, the application pretty much had a total
meltdown. The inefficient use of the network was compounded by the lack
of throttling, and packets started to be dropped at a very high rate.
TCP had to retransmit those packets, which further added to the
congestion. This caused the drop rate to get so high that the requests
timed out and were reissued, resulting in even greater congestion. The
result was a complete failure of the application. The vendor sent an
engineer onsite to code the fix (which took about 15 minutes once he
got there), they redeployed the application, and after that the
application worked flawlessly. The network load was much less than it
was before the TCP_NODELAY and the performance was much faster, with no
delays at all. They were actually able to scale down the number of
switches and servers they used due to the new numbers for bandwidth
required.

An extreme example, I admit, but it actually happened that way. That
experience pretty much sealed my position on Nagle and TCP_NODELAY. I
was an advocate of the "write with push" school of thought, but since
this thread started I found a message from my colleague James Carlson
that showed me that even this simple change was unnecessary. With proper
coding, you can almost always get the best of both worlds.

The bottom line is that we want the title "engineer" without needing
the craft and rigor that goes with it. Would we accept it if the
designer of a bridge that fell down said that he didn't know that
rivets weren't as strong as welds?

blu
--
Brian Utterback

Joe Doupnik

Jan 4, 2002, 8:33:41 PM
--------------
There are finer points than this, but Rick has gotten close.
The problem of ACKs interacting badly with Nagle mode has been solved.
To see the solution and a discussion of Nagle mode in the round please
get draft-doupnik-tcpimpl-nagle-mode-00.txt from directory pub/misc on
either netlab1 or netlab2.usu.edu (web browsers: click on "Complete file
archives" to reach pub). This is a draft RFC and I need to resubmit it
to hammer on the system.
My implementation replaces Nagle mode. It forms as complete
segments as possible, as Nagle mode tries but does not quite succeed,
is independent of returning ACKs, as Nagle mode unfortunately is not,
and thus achieves the goals of avoiding the Silly Window Syndrome and
providing crisp responsiveness in request/reply environments (and others).
It has no controls and needs none. It fully interoperates with all
existing implementations of TCP.
Implementations in public source code are provided for FreeBSD
and Linux, in pub/misc/newpolicy.sources. I can't show the changes for
Solaris, but an outline is in the same directory. Other implementations
are not available in public sources.
Joe D.

Thomas Segev

Jan 8, 2002, 11:37:51 AM

Due to (almost) everybody here urging me to fix the application instead of touching TCP_NODELAY, I
did some more homework. I found that the application (Sybase Server) has parameters "maximum network
packet size" and "default network packet size" that can be tuned. By default they are set to 512.
The client side also has a default packet size of 512, which can be changed in code.
After changing it on both sides to 1536 (the size must be a multiple of 512), the server no longer
breaks its reply down, and now it sends it all in one send.

I also ran some tests to find out whether there is any delay problem when sending a message with a
size exceeding one segment (a broken stack implementation of Nagle), and found that on AIX 4.3.3,
Solaris 2.5.1 and HP-UX 10.20 the problem does not exist.


Thanks for all your help.
Thomas


Rick Jones

Jan 8, 2002, 4:22:34 PM
Thomas Segev <thomas...@bmc.com> wrote:

> After changing in both sides to 1536 (the size must be a multiple of
> 512) the server no longer breaks down its reply and now it sends it
> all in one send.

Great!

> I also made some tests to find out whether there is any delay
> problem when sending a message with size exceeding one segment
> (broken stack implementation of Nagle), and found out that in AIX
> 4.3.3, Solaris 2.5.1 and HP-UX 10.20 the problem does not exist.

Excellent.
