I've seen a lot of news articles of TCP going slowly, but none that look similar to this. Anyone else seen this problem, or can say why it's happening?
A problem in WinSock2 appears to cause a severe drop in throughput. The problem shows up when the application uses send() to send a large buffer of data, followed by a send of a small buffer of data, e.g.
for (;;) {
    send (sd, buf, 12000);
    send (sd, buf, 4);
}
Of course TCP is a byte stream and the concept of message boundaries should be irrelevant, but the way WinSock2 appears to implement socket buffers, it seems that they assume send buffers to imply message boundaries! If a small send buffer is passed following a large send buffer, WinSock2 will wait until *all* the data from the large buffer is acknowledged before starting to transmit the small buffer. At least, this is the way it seems. What's more, disabling Nagle doesn't change this.
This only happens if a large send is followed immediately by a small send. The definitions of a large and a small send are as follows:
A large send is defined as any send that is both greater than the current TCP window size and greater than or equal to the current SO_SNDBUF size. Note: I'm only talking about unidirectional data flow, so "TCP window size" refers to the window size advertised by the receiver.
A small send seems to be: insufficient data to fill 2 full packets. Actually, I'm not 100% sure about this...
For example, take an Ethernet connection with a TCP MSS of 1460, with a WinSock2 server sending data to a Solaris client. Typically the Solaris client (data consumer) advertises a TCP window size of 8760 bytes. The default SO_SNDBUF size on the WinSock2 server (data producer) is 8192 bytes...
Then, sending a 12000 byte buffer followed by a 4 byte buffer, repeatedly, will show this problem: the 12000 bytes go out as 8 full packets (1460 bytes each, 11680 bytes in total), leaving a 320 byte remainder. Therefore, (2 * 1460) - 320 = 2600: any send of less than 2600 bytes following a send of 12000 bytes will show this problem.
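To make the definitions above concrete, here is a minimal sketch in C that encodes them as two small functions. The names is_large_send and small_send_limit are made up for illustration, and the rules are only the working hypothesis stated in this post, not documented WinSock2 behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: "large" per the working definition above, i.e.
 * bigger than the peer's advertised window AND at least SO_SNDBUF. */
bool is_large_send(size_t len, size_t peer_window, size_t sndbuf)
{
    return len > peer_window && len >= sndbuf;
}

/* Hypothetical helper: a send shorter than this, issued right after a
 * large send of large_len bytes, cannot fill two full segments together
 * with the large send's remainder. */
size_t small_send_limit(size_t large_len, size_t mss)
{
    assert(mss > 0);
    return 2 * mss - large_len % mss;
}
```

With the numbers used here (window 8760, SO_SNDBUF 8192, MSS 1460), is_large_send(12000, 8760, 8192) is true and small_send_limit(12000, 1460) gives 2600. A send of exactly 8760 bytes does not count as large, because it is not greater than the window.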
What happens is this:
WinSock2 sends the 8 full sized 1460 byte packets. It coalesces the 320 byte remainder and the 4 byte buffer (I left Nagle enabled in this test) and sends them together in a 324 byte packet. It then waits until all 12004 bytes of data have been acknowledged before indicating (via select or a blocking send call) that buffer space is now available. Since the last packet sent (324 bytes) is a small packet, the peer delays the ACK (for approx 100ms), leading to a massive drop in performance.
A snoop of such an exchange:
0.00044 a -> b Ack=536882553 Seq=421461919 Len=1460 Win=64240
0.00011 a -> b Ack=536882553 Seq=421463379 Len=1460 Win=64240
0.00009 b -> a Ack=421464839 Seq=536882553 Len=0 Win=8760
0.00003 a -> b Ack=536882553 Seq=421464839 Len=1460 Win=64240
0.00012 a -> b Ack=536882553 Seq=421466299 Len=1460 Win=64240
0.00007 b -> a Ack=421467759 Seq=536882553 Len=0 Win=8760
0.00005 a -> b Ack=536882553 Seq=421467759 Len=1460 Win=64240
0.00011 a -> b Ack=536882553 Seq=421469219 Len=1460 Win=64240
0.00007 b -> a Ack=421470679 Seq=536882553 Len=0 Win=8760
0.00005 a -> b Ack=536882553 Seq=421470679 Len=1460 Win=64240
0.00012 a -> b Ack=536882553 Seq=421472139 Len=1460 Win=64240
0.00002 a -> b Ack=536882553 Seq=421473599 Len=324 Win=64240
0.00005 b -> a Ack=421473599 Seq=536882553 Len=0 Win=8760
<delayed ack>
0.09862 b -> a Ack=421473923 Seq=536882553 Len=0 Win=8760
<all data ack'd, only now does WinSock2 send the next data>
0.00061 a -> b Ack=536882553 Seq=421473923 Len=1460 Win=64240
You get a similar trace with Nagle disabled (ie TCP_NODELAY set).
Keep in mind that WinSock2 tends to advertise much larger TCP windows, so if you're trying this out sending data Windows to Windows, you have to use larger numbers to see this - but it's still visible. Try sending 64241 bytes followed by 1000 bytes, repeatedly.
There seems to be a timing issue involved as well... On our Windows to Windows tests, I was using Ethereal software running on the Windows server (the data producer) to trace the TCP traffic. Sending 64241 bytes followed by 1000 bytes *without* running Ethereal, the transfer rate was low (as described above). As soon as I started tracing on the server, the transfer rate went back to normal (high)!
Astonishingly, using WSASend with two buffers, one of 12000 bytes and one of 4 bytes exhibits the same problem! You might expect WinSock2 to treat the two buffers as one (in the style of Unix writev), but it doesn't.
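For comparison, this is roughly what the Unix-style gathering write alluded to above looks like. It is a sketch using POSIX writev, not WinSock2 code; send_pair and send_pair_demo are hypothetical names, and the demo runs over a pipe rather than a TCP socket purely so that it can be tried without a network.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: hand two buffers to the kernel in one gathering
 * write, so the stack sees a single contiguous run of bytes. */
ssize_t send_pair(int fd, const void *big, size_t big_len,
                  const void *small, size_t small_len)
{
    struct iovec iov[2];

    assert(big != NULL && small != NULL);
    iov[0].iov_base = (void *)big;    /* iov_base is not const-qualified */
    iov[0].iov_len  = big_len;
    iov[1].iov_base = (void *)small;
    iov[1].iov_len  = small_len;

    /* May return a short count on a stream socket; real code should loop. */
    return writev(fd, iov, 2);
}

/* Round trip over a pipe: returns the number of bytes read back,
 * or -1 on any error or mismatch. */
int send_pair_demo(void)
{
    static char big[12000];
    char small[4] = { 'd', 'o', 'n', 'e' };
    char sink[12004];
    size_t got = 0;
    int fds[2];

    memset(big, 'x', sizeof big);
    if (pipe(fds) != 0)
        return -1;
    if (send_pair(fds[1], big, sizeof big, small, sizeof small) != 12004) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);                    /* reader sees EOF after the data */
    for (;;) {
        ssize_t n = read(fds[0], sink + got, sizeof sink - got);
        if (n <= 0)
            break;
        got += (size_t)n;
    }
    close(fds[0]);
    /* The 4 small bytes must directly follow the 12000 large ones. */
    if (got != sizeof sink || memcmp(sink + 12000, small, 4) != 0)
        return -1;
    return (int)got;
}
```

On a typical POSIX system send_pair_demo() returns 12004, i.e. both buffers arrive as one uninterrupted byte sequence.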
Bernard Brooks wrote:
> A problem of WinSock2 appears to cause a severe drop in thoughput. The problem
> shows when the application uses send() to send a large buffer of data, followed
> by a send of a small buffer of data. eg
Okay, first, this program is broken. You would expect pathological behavior from a program like this.
When you send data to TCP, you must use a sensible buffer flushing strategy and the one shown above is nonsensical. There is absolutely no reason to do a 4-byte send when you have more than 10,000 bytes ready to go at that time.
You must either pass a reasonably large buffer (2Kb or more) or all the data that is ready to go at that time on each call to send. If you don't do this, TCP performance will suck.
> Of course TCP is a byte stream and the concept of message boundaries should be
> irrelevant, but the way WinSock2 appears to implement socket buffers, it seems
> that they assume send buffers to imply message boundaries! If a small send
> buffer is passed following a large send buffer, WinSock2 will wait until *all*
> the data from the large buffer is acknowledged before starting to transmit the
> small buffer. At least, this is the way it seems. What's more, disabling
> Nagle doesn't change this.
Winsock does not have ESP. It has to make a decision whether or not to send a packet and it has to make it when you call 'send'. Otherwise, it can set a timer. It has no other options. If Winsock were to send data immediately in your 4 byte send, what happens if it's followed later by 32 4-byte sends? Should they all go in their own packet, with data efficiency dropping in the toilet as headers exceed data by huge factors?
Winsock could handle this better, I admit. But this is definitely a "then don't do that". If you care about TCP throughput and latency, you *must* implement sensible buffer flushing. Disabling Nagle is not a reasonable shortcut.
> Okay, first, this program is broken. You would expect pathological > behavior from a program like this.
I absolutely agree that the program is poorly written, and the solution, as you point out, is to employ a sensible buffering strategy.
But the code snippet above is purely pedagogical. Regardless of the poor coding, I still think there's a genuine problem with WinSock2 that's worth looking at. Neither Linux nor Solaris TCP/IP stacks suffer from this problem.
Look at these two examples - both send a large buffer followed by a small buffer. If you replace the 12000 and 4 with:
8760 bytes and 2 bytes = over 10 MBytes/s
8761 bytes and 1 byte = under 100 KBytes/s
You can't just explain this away by saying that the application's buffering strategy is poor. Perversely, you only become subject to these problems when you DO buffer!
> Winsock does not have ESP. It has to make a decision whether or not to > send a packet and it has to make it when you call 'send'.
As I understand it, the algorithm on whether to send a packet is well defined.
> If Winsock were to send data > immediately in your 4 byte send, what happens if it's followed later by > 32 4-byte sends? Should they all go in their own packet, with data > efficiency dropping in the toilet as headers exceed data by huge > factors?
+ With Nagle disabled, that's exactly what it should do. And yes, it's inefficient.
+ With Nagle enabled, it'll send one 4 byte packet, and then a sequence of full packets (since the computer is easily fast enough to coalesce a full packet before receiving the ACK from the previous packet). This is the whole point of the Nagle algorithm.
But... the problem I'm showing is quite different. [ No I'm not talking about the well discussed problem of the deadlock introduced by Nagle and the Delayed Ack. ]
This problem seems to be in the WinSock2 socket buffering layer. It seems that the WinSock2 socket buffer is not allowing data to be presented to the TCP/IP stack in a timely fashion.
My tests seem to show that it's the socket buffering layer itself that's blocking until *all* data has been acknowledged. It's blocking to such an extent that it's not allowing the program to present any more data, which prevents it from using the Nagle algorithm to coalesce it.
The first send(12000) gets presented to the TCP/IP stack. The next send(4) gets presented, and the 4 bytes + the 320 bytes (remaining from the 12000) get coalesced. The socket buffer is now full, so you'd expect the next send(12000) to block - which is fine.
Then WinSock2 starts to receive ACKs from the peer. As the ACKs arrive, you'd expect it to release buffer space, and maybe it does... who can tell? What needs explaining is that (as you can see from the trace) even after receiving ACKs for 11680 bytes of data (in my mind leaving _plenty_ of space in the socket buffer), it still doesn't unblock the socket! Why?
The explanation seems to be that the WinSock2 socket buffering layer treats individual send buffers as *messages*, and not as a byte stream as it should. I'm sure this bug is affecting a lot of other people's code too.
> Hi David. Thanks for the reply.
>> > for (;;) {
>> >     send (sd, buf, 12000);
>> >     send (sd, buf, 4);
>> > }
>> Okay, first, this program is broken. You would expect pathological
>> behavior from a program like this.
> I absolutely agree that the program is poorly written, and the solution,
> as you point out, is to employ a sensible buffering strategy.
> But the code snippet above is purely pedagogical. Regardless of the poor
> coding, I still think there's a genuine problem with WinSock2 that's
> worth looking at. Neither Linux nor Solaris TCP/IP stacks suffer from
> this problem.
> Look at these two examples - both send a large buffer followed by a small
> buffer: if you replace the 12000 and 4 with
> 8760 bytes and 2 bytes = over 10 MBytes/s
> 8761 1 = under 100 KBytes/s
> You can't just explain this away by saying that the applications
> buffering strategy is poor. Peversely, you only become subject to
> these problems when you DO buffer!
Any system that is more complex than a stack of punched cards must be understood and used properly to be reasonably efficient.
The above is like driving a car in first gear only and complaining about low gas mileage.
Conclusion: do not expect inexperienced programmers to write good code. Education and measurement are needed.
--
Peter Håkanson
IPSec Sverige ( At Gothenburg Riverside )
Sorry about my e-mail address, but i'm trying to keep spam out, remove "icke-reklam" if you feel for mailing me. Thanx.
<p...@icke-reklam.ipsec.nu> wrote in message news:ap6c53$8br$2@nyheter.crt.se... > In comp.protocols.tcp-ip Bernard Brooks <bernardb...@hotmail.com> wrote: > > Hi David. Thanks for the reply.
> >> Okay, first, this program is broken. You would expect pathological > >> behavior from a program like this.
> > I absolutely agree that the program is poorly written, and the solution, > > as you point out, is to employ a sensible buffering strategy.
> > But the code snippet above is purely pedagogical. Regardless of the poor > > coding, I still think there's a genuine problem with WinSock2 that's > > worth looking at. Neither Linux nor Solaris TCP/IP stacks suffer from > > this problem.
> > Look at these two examples - both send a large buffer followed by a small > > buffer: if you replace the 12000 and 4 with > > 8760 bytes and 2 bytes = over 10 MBytes/s > > 8761 1 = under 100 KBytes/s
> > You can't just explain this away by saying that the applications > > buffering strategy is poor. Peversely, you only become subject to > > these problems when you DO buffer!
> Any system that is more complex then a stack of punced cards must be > understood and used properly to be reasonable efficient.
> The above is to drive a car in first gear only and complain about > low gas mileage.
> conclusion : do not expect inexperienced programmers to write > good code. Education and measurment is needed.
IT'S JUST AN EXAMPLE, PETER! NOT ACTUAL CODE!
The same behavior and performance hit could happen if two objects were serializing themselves in sequence out the same socket and one happened to be large and one small.
Or two threads might use the same socket and be completely oblivious to what the other thread is sending out. One just happened to have a large block and the other a small one.
Why are you fixated on the form or quality of the example????
On Tue, 22 Oct 2002 12:47:57 -0700, David Schwartz <dav...@webmaster.com> wrote:
>> Of course TCP is a byte stream and the concept of message boundaries should be
>> irrelevant, but the way WinSock2 appears to implement socket buffers, it seems
>> that they assume send buffers to imply message boundaries! If a small send
>> buffer is passed following a large send buffer, WinSock2 will wait until *all*
>> the data from the large buffer is acknowledged before starting to transmit the
>> small buffer. At least, this is the way it seems. What's more, disabling
>> Nagle doesn't change this.
> Winsock does not have ESP. It has to make a decision whether or not to
> send a packet and it has to make it when you call 'send'.
Sorry, what does "ESP" stand for?
-- Fernando Gont e-mail: ferna...@ANTISPAM.gont.com.ar
[To send a personal reply, please remove the ANTISPAM tag]
Fernando Gont <arielg...@softhome.net> wrote:
>On Tue, 22 Oct 2002 12:47:57 -0700, David Schwartz
><dav...@webmaster.com> wrote:
>> Winsock does not have ESP. It has to make a decision whether or not to
>>send a packet and it has to make it when you call 'send'.
>Sorry, what does "ESP" stand for?
Extra-Sensory Perception, i.e. mind-reading.
-- Barry Margolin, bar...@genuity.net Genuity, Woburn, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups. Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
Thanks again for the responses. Please let's not get hung up on the example programs. It's the underlying problem with WinSock2 that needs to be addressed.
I'll give another example which I think might interest you more. Using David Schwartz's advice:
> You must either pass a reasonably large buffer (2Kb or more) or all the > data that is ready to go at that time on each call to send.
Then you'll be surprised to hear that
for (;;) {
    send (sd, buf, 12000);
    send (sd, buf, 2500);
}
achieves an appalling 140 KBytes/s. This must surely be affecting many people?
Has anyone else seen this performance problem or know if Microsoft knows anything about it? Better still, can anyone explain what's going on in WinSock2 to cause it?
Bernard Brooks
The small print... as per my first posting, my examples depend on a unidirectional data flow from a WinSock2 machine with SO_SNDBUF=8192 (the default) and MSS=1460 (Ethernet) to a machine offering a TCP window size of 8760 (e.g. Solaris). If your setup is different (e.g. sending to another Windows machine), then you just have to plug in different numbers to demonstrate the problem. Also, my tests were done with Nagle ENABLED, though a similar problem occurs with Nagle disabled.
Bernard Brooks wrote: > This problem seems to be in the WinSock2 socket buffering layer. It > seems that the WinSock2 socket buffer is not allowing data to be > presented to the TCP/IP stack in a timely fashion.
> My tests seem to show that it's the socket buffering layer itself that's > blocking until *all* data has been acknowledged. It's blocking to such > an extent that it's not allowing the program to present any more data, > which prevents it from using the Nagle algorithm to coalesce it.
> The first send(12000) gets presented to the TCP/IP stack. The next > send(4) gets presented, and the 4 bytes + the 320 bytes (remaining > from the 12000) get coalesced. The socket buffer is now full, so you'd > expect the next send(12000) to block - which is fine.
> Then WinSock2 starts to receive ACK's from the peer. As the ACK's > arrive, you'd expect it to release buffer space, and maybe it does... > who can tell? What needs explaining is that (as you can see from the > trace) even after receiving ACK's for 11680 bytes of data (in my mind > leaving _plenty_ of space in the socket buffer), it still doesn't > unblock the socket! Why?
> The explanation seems to be that WinSock2 socket buffering layer treats > individual send buffers as *messages*, and not as a byte stream as it > should. I'm sure this bug is affecting a lot of other peoples code too.
> <p...@icke-reklam.ipsec.nu> wrote in message news:ap6c53$8br$2@nyheter.crt.se... >> In comp.protocols.tcp-ip Bernard Brooks <bernardb...@hotmail.com> wrote: >> > Hi David. Thanks for the reply.
>> >> Okay, first, this program is broken. You would expect pathological >> >> behavior from a program like this.
>> > I absolutely agree that the program is poorly written, and the solution, >> > as you point out, is to employ a sensible buffering strategy.
>> > But the code snippet above is purely pedagogical. Regardless of the poor >> > coding, I still think there's a genuine problem with WinSock2 that's >> > worth looking at. Neither Linux nor Solaris TCP/IP stacks suffer from >> > this problem.
>> > Look at these two examples - both send a large buffer followed by a small >> > buffer: if you replace the 12000 and 4 with >> > 8760 bytes and 2 bytes = over 10 MBytes/s >> > 8761 1 = under 100 KBytes/s
>> > You can't just explain this away by saying that the applications >> > buffering strategy is poor. Peversely, you only become subject to >> > these problems when you DO buffer!
>> Any system that is more complex then a stack of punced cards must be >> understood and used properly to be reasonable efficient.
>> The above is to drive a car in first gear only and complain about >> low gas mileage.
>> conclusion : do not expect inexperienced programmers to write
>> good code. Education and measurment is needed.
> IT'S JUST AN EXAMPLE, PETER! NOT ACTUAL CODE!
> The same behavior and performance hit could happen if two objects
> were serializing themselves in sequence out the same socket and one
> happened to be large and one small.
Serializing ON THE SAME socket needs to be done with knowledge about how tcp works.
> Or two threads might use the same socket and be completele oblivious > to what the other thread is sending out. One just happened to have a > large block and the other a small one.
Threads are no cure for everything. In fact they might screw things up, for example when the same socket is used for two independent communication channels. Again, they might ruin your performance. Again, the cure is knowledge about TCP and skillful programming.
> Why are you fixated on the form or quality of the example????
I'm not.
> Rufus
-- Peter Håkanson
On 22 Oct 2002 09:38:23 -0700, bernardb...@hotmail.com (Bernard Brooks) wrote:
>Of course TCP is a byte stream and the concept of message boundaries should be
>irrelevant, but the way WinSock2 appears to implement socket buffers, it seems
>that they assume send buffers to imply message boundaries! If a small send
>buffer is passed following a large send buffer, WinSock2 will wait until *all*
>the data from the large buffer is acknowledged before starting to transmit the
>small buffer. At least, this is the way it seems. What's more, disabling
>Nagle doesn't change this.
Do you know whether using TCP_NODELAY option with Winsock *really* disables the Nagle algorithm or not?
-- Fernando Gont e-mail: ferna...@ANTISPAM.gont.com.ar
> In article <3db5f5a2.2551...@News.CIS.DFN.DE>,
> Fernando Gont <arielg...@softhome.net> wrote:
> >On Tue, 22 Oct 2002 12:47:57 -0700, David Schwartz
> ><dav...@webmaster.com> wrote:
> >> Winsock does not have ESP. It has to make a decision whether or not to
> >>send a packet and it has to make it when you call 'send'.
> >Sorry, what does "ESP" stand for?
> Extra-Sensory Perception, i.e. mind-reading.
In this case, it can't predict the future. It doesn't know whether your 4 byte send is the first of 100 such sends, to be followed immediately by another 4 byte send, or the last byte of data you'll send for a week. So it *can't* do the right thing all the time.
"Rufus V. Smith" wrote: > IT'S JUST AN EXAMPLE, PETER! NOT ACTUAL CODE!
It's an example of bad code, and so it works badly.
> The same behavior and performance hit could happen if two objects > were serializing themselves in sequence out the same socket and one > happened to be large and one small.
Sure, that's why you have to handle those cases properly.
You could, for example, accumulate small writes into an application buffer until it either hits 2Kb or 100 milliseconds pass with no data being written.
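A rough sketch of that strategy in C follows. Everything named co_* is hypothetical; the 2048 byte and 100 ms thresholds are the ones suggested above; the caller passes in the current time in milliseconds and is assumed to call co_tick periodically; and the sink callback stands in for whatever actually calls send(). Error handling is deliberately thin: a failed flush simply discards the buffered bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CO_CAPACITY = 2048, CO_IDLE_MS = 100 };

/* Sink callback: returns 0 on success. In real use it would wrap send(). */
typedef int (*co_sink_fn)(void *ctx, const void *data, size_t len);

struct coalescer {
    char       buf[CO_CAPACITY];
    size_t     used;
    long       last_write_ms;
    co_sink_fn sink;
    void      *ctx;
};

void co_init(struct coalescer *c, co_sink_fn sink, void *ctx)
{
    assert(c != NULL && sink != NULL);
    c->used = 0;
    c->last_write_ms = 0;
    c->sink = sink;
    c->ctx = ctx;
}

/* Push whatever is buffered to the sink. */
int co_flush(struct coalescer *c)
{
    int rc = 0;
    if (c->used > 0) {
        rc = c->sink(c->ctx, c->buf, c->used);
        c->used = 0;
    }
    return rc;
}

/* Queue len bytes; flushes when the buffer reaches CO_CAPACITY. */
int co_write(struct coalescer *c, const void *data, size_t len, long now_ms)
{
    c->last_write_ms = now_ms;
    if (c->used + len > CO_CAPACITY) {
        int rc = co_flush(c);         /* keep byte order: old data first */
        if (rc != 0)
            return rc;
    }
    if (len >= CO_CAPACITY)           /* already big enough to send as-is */
        return c->sink(c->ctx, data, len);
    memcpy(c->buf + c->used, data, len);
    c->used += len;
    return c->used == CO_CAPACITY ? co_flush(c) : 0;
}

/* Call periodically: flushes if nothing was written for CO_IDLE_MS. */
int co_tick(struct coalescer *c, long now_ms)
{
    if (c->used > 0 && now_ms - c->last_write_ms >= CO_IDLE_MS)
        return co_flush(c);
    return 0;
}

/* Example sink for experiments: counts calls and bytes instead of sending. */
struct co_stats { int calls; size_t bytes; };

int co_count_sink(void *ctx, const void *data, size_t len)
{
    struct co_stats *s = ctx;
    (void)data;
    s->calls++;
    s->bytes += len;
    return 0;
}
```

With the counting sink, 512 writes of 4 bytes result in a single 2048 byte call to the sink, and a lone 4 byte write is pushed out by the first co_tick that happens at least 100 ms after it.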
> Or two threads might use the same socket and be completele oblivious > to what the other thread is sending out. One just happened to have a > large block and the other a small one.
Sure, that's why you have to handle those cases properly. Users of a socket *can't* be oblivious to other socket users. If you need to do this, you need to write a sensible multiplexer with a sensible buffer flushing strategy.
> Why are you fixated on the form or quality of the example????
Because the example shows why you have to handle those cases properly. In all of these cases, the programmer has more information than the TCP stack, does the wrong thing, and expects the TCP stack to magically fix it.
And you know what cracks me up completely? The one thing that could have helped an application with a poor buffer flushing strategy to still work reasonably, which is Nagle's algorithm, was the thing that was disabled first.
On Tue, 22 Oct 2002 12:47:57 -0700, David Schwartz <dav...@webmaster.com> wrote:
> Winsock does not have ESP. It has to make a decision whether or not to
>send a packet and it has to make it when you call 'send'. Otherwise, it
>can set a timer. It has no other options. If Winsock were to send data
>immediately in your 4 byte send, what happens if it's followed later by
>32 4-byte sends? Should they all go in their own packet, with data
>efficiency dropping in the toilet as headers exceed data by huge
>factors?
Supposing Nagle was enabled, his data should be sent (at least) when he gets MSS bytes in the socket send buffer, as the idea of Nagle is "do not send *small* packets when.....".
With Nagle disabled, I see no reason for not sending the data, when you have MSS bytes available to be sent.
-- Fernando Gont e-mail: ferna...@ANTISPAM.gont.com.ar
> > The same behavior and performance hit could happen if two objects > > were serializing themselves in sequence out the same socket and one > > happened to be large and one small.
> Serializing ON THE SAME socket needs to be done with knowledge > about how tcp works.
Why is that? A socket should be handled like any other byte stream. When I serialize something out, I just pump bytes into a stream, formatted such that I can deserialize them back into an equivalent of the object.
> > Or two threads might use the same socket and be completele oblivious > > to what the other thread is sending out. One just happened to have a > > large block and the other a small one.
> Threads is no cure for everything. In fact they might screw up > stuff, like using the same socket for two independent comminucation > channels. Again, they might blew your performance. Again, the cure > is knowledge about tcp and skillful programming.
My mistake to mention threads, to cause you to take issue with threads.
If the problem described is a "feature" of TCP, rather than a bug in Winsock2, why do the other implementations not exhibit the "problem"?
Is your knowledge about tcp complete enough to explain this behavior? If you did explain it in a prior posting, I must have missed it somehow. I'll look back.
It sounds to me as though the kind of cure you are talking about (skillful programming and knowledge of TCP) means that they should simply accept the problem and "skillfully" program a workaround, perhaps by adding another layer of buffering to ensure the buffer handling described doesn't happen. That's like building delay loops into an output driver because your I/O card can't handle full speed I/O. Sure it works, but it doesn't solve the base problem. Or as they call it around here: the "root cause".
>> > IT'S JUST AN EXAMPLE, PETER! NOT ACTUAL CODE!
>> > The same behavior and performance hit could happen if two objects >> > were serializing themselves in sequence out the same socket and one >> > happened to be large and one small.
>> Serializing ON THE SAME socket needs to be done with knowledge >> about how tcp works.
> Why is that? A socket should be handled like any other byte stream. > When I serialize something out, I just pump bytes into a stream that > are fomatted such that I can serialize back into an equivalent of the object.
>> > Or two threads might use the same socket and be completele oblivious >> > to what the other thread is sending out. One just happened to have a >> > large block and the other a small one.
>> Threads is no cure for everything. In fact they might screw up >> stuff, like using the same socket for two independent comminucation >> channels. Again, they might blew your performance. Again, the cure >> is knowledge about tcp and skillful programming.
> My mistake to mention threads, to cause you to take issue with
> threads.
> If the problem described is a "feature" of TCP, rather than a bug in Winsock2,
> why do the other implementations not exhibit the "problem"?
> Is your knowledge about tcp complete enough to explain this behavior? If
> you did explain it in a prior posting, I must have missed it somehow. I'll look
> back.
> It sound to me the kind of cure you are talking about (skillful programming
> and knowledge of tcp) would suggest that they live accept the problem and
> "skillfully" program a workaround, by perhaps adding another layer of
> and ensure the buffer handling described doesn't happen. That's like building
> in delay loops in an output driver because your I/O card can't handle full
> speed I/O. Sure it works, but it doesn't solve the base problem. Or as they
> call it around here: the "root cause".
Get a copy of "TCP/IP Illustrated, Volume 1" by W. Richard Stevens (ISBN 0-201-63346-9); pages 223 to 357 are devoted to the basics of TCP.
There is also a companion, "Unix Network Programming" (ISBN 0-13-490012-X), that deals with issues at the socket layer that a programmer should think about.
And yes, writing to a TCP socket has to be done carefully so that the built-in features won't hurt performance.
You might consider UDP ( if you never can saturate your network)
> Rufus
-- Peter Håkanson
arielg...@softhome.net (Fernando Gont) wrote in message <news:3db7fd89.1001698@News.CIS.DFN.DE>... > On 22 Oct 2002 09:38:23 -0700, bernardb...@hotmail.com (Bernard > Brooks) wrote:
> >Of course TCP is a byte stream and the concept of message boundaries should be > >irrelevant, but the way WinSock2 appears to implement socket buffers, it seems > >that they assume send buffers to imply message boundaries! If a small send > >buffer is passed following a large send buffer, WinSock2 will wait until *all* > >the data from the large buffer is acknowledged before starting to transmit the > >small buffer. At least, this is the way it seems. What's more, disabling > >Nagle doesn't change this.
> Do you know whether using TCP_NODELAY option with Winsock *really* > disables the Nagle algorithm or not?
I have evidence to show that TCP_NODELAY does disable Nagle on WinSock2. BUT - the problem I'm talking about happens with Nagle ENABLED (and also with it disabled). The problem has nothing to do with Nagle.
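For reference, the option is set and read back like this. The sketch below uses POSIX socket headers so that it can be compiled outside Windows; under WinSock2 the same setsockopt/getsockopt calls are used, with the option pointers cast to char *. The helper names are hypothetical. Note that reading the option back only shows that the flag is recorded on the socket; whether the stack honours it still has to be judged from a packet trace.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if TCP_NODELAY reads back as set,
 * 0 if it reads back clear, -1 if either call fails. */
int set_and_check_nodelay(int fd)
{
    int on = 1;
    int value = 0;
    socklen_t len = sizeof value;

    assert(fd >= 0);
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on) != 0)
        return -1;
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0)
        return -1;
    return value != 0;
}

/* Creates an unconnected TCP socket just to exercise the helper. */
int nodelay_demo(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int rc;

    if (fd < 0)
        return -1;
    rc = set_and_check_nodelay(fd);
    close(fd);
    return rc;
}
```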
arielg...@softhome.net (Fernando Gont) wrote in message <news:3db849e3.4241698@News.CIS.DFN.DE>... > On Tue, 22 Oct 2002 12:47:57 -0700, David Schwartz > <dav...@webmaster.com> wrote:
> > Winsock does not have ESP. It has to make a decision whether or not to > >send a packet and it has to make it when you call 'send'. Otherwise, it > >can set a timer. It has no other options. If Winsock were to send data > >immediately in your 4 byte send, what happens if it's followed later by > >32 4-byte sends? Should they all go in their own packet, with data > >efficiency dropping in the toilet as headers exceed data by huge > >factors?
> Supposing Nagle was enabled, his data should be sent (at least) when > he gets MSS bytes in the socket send buffer, as the idea of Nagle is > "do not send *small* packets when.....".
Absolutely true - and yet it doesn't. This (imho) is a bug.
> With Nagle disabled, I see no reason for not sending the data, when > you have MSS bytes available to be sent.
Also absolutely true - and yet again, it doesn't. Also a bug.
> It's an example of bad code, and so it works badly.
According to you (and I agree), my later example was an example of good code:
for (;;) {
    send (sd, buf, 12000);
    send (sd, buf, 2500);
}
and yet it still suffers from the same appalling problem.
> You could, for example, accumulate small writes into an application > buffer until it either hits 2Kb or 100 milliseconds pass with no data > being written.
If the application is presented with 12000 bytes of data, would you split that into five 2Kb sends, and buffer the rest? Of course you wouldn't. You'd think, this data is more than 2Kb, I'll pass it directly to send(). If you are then presented with 512 small writes of 4 bytes, you might accumulate them into a single 2Kb buffer and send that next... The result:
send (sd, buf, 12000);
send (sd, buf, 2048);
and crap performance... Explain that.
> And you know what cracks me up completely? The one thing that could > have helped an application with a poor buffer flushing strategy to still > work reasonably, which is Nagle's algorithm, was the thing that was > disabled first.
Actually I have said quite explicitly in all my posts that these tests are done with nagle ENABLED.
Let me restate the salient points of this problem:
+ the problem has NOTHING to do with Nagle
+ the size of the "small data buffer" is IRRELEVANT - I used 4 bytes just to demonstrate, but as I pointed out, even buffers of 2Kb or more can show this problem
+ the problem occurs immediately after you use a LARGE buffer
> In this case, it can't predict the future. It doesn't know whether your > 4 byte send is the first of 100 such sends, to be followed immediately > by another 4 byte send, or the last byte of data you'll send for a week. > So it *can't* do the right thing all the time.
The algorithm is the Nagle algorithm: It should store and coalesce data while there's outstanding unacknowledged data, or until it coalesces enough to fill a segment (subject of course, to the TCP window size and the congestion window).
So you see it *can* do the right thing.
If you were to send a thousand 4 byte buffers, it'll send one 4 byte packet, and then a sequence of full packets (since the computer is easily fast enough to coalesce a full packet before receiving the ACK from the previous packet).
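The rule described above can be sketched in a few lines of C. This is a simplified model, assuming no ACK arrives while the application is writing and ignoring the receive and congestion windows; nagle_may_send and nagle_count_segments are hypothetical names, not part of any real stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified Nagle test: a full segment may always go; a partial one
 * may go only if nothing is outstanding or Nagle is switched off. */
bool nagle_may_send(size_t queued, size_t unacked, size_t mss, bool nodelay)
{
    if (queued == 0)
        return false;
    if (nodelay || queued >= mss)
        return true;
    return unacked == 0;
}

/* Counts segments emitted for a burst of equal-sized writes, under the
 * assumption that no ACK comes back during the burst. Bytes still
 * waiting in the queue afterwards are reported through left_queued. */
size_t nagle_count_segments(size_t writes, size_t write_len, size_t mss,
                            size_t *left_queued)
{
    size_t queued = 0, unacked = 0, segments = 0, i;

    assert(mss > 0);
    for (i = 0; i < writes; i++) {
        queued += write_len;
        while (nagle_may_send(queued, unacked, mss, false)) {
            size_t seg = queued < mss ? queued : mss;
            queued -= seg;
            unacked += seg;
            segments++;
        }
    }
    if (left_queued != NULL)
        *left_queued = queued;
    return segments;
}
```

Under those assumptions, a thousand 4 byte writes with an MSS of 1460 produce one 4 byte segment and two full segments, with 1076 bytes still queued waiting for an ACK or for more data.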
But why are we arguing about the 4 byte buffer? It's irrelevant, and I wish I'd never used it in my example.
You seem to have missed two key points from my original post:
1) This problem occurs with Nagle enabled.
2) Replace the 4 byte buffer with a 2Kb buffer if you like... It'll still go slowly.
Send this sequence of buffers from a Windows box to a Solaris box and time it:
12000,2500,12000,2500,12000,2500,12000,2500,12000,2500,
12000,2500,12000,2500,12000,2500,12000,2500,12000,2500
You'll find it takes roughly 1 second to send these 20 buffers! Surely this isn't right? You can read it faster than Windows can send it!
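As a back-of-the-envelope check, assume each 12000 + 2500 cycle is held up by one delayed ACK of roughly 100 ms (the figure seen in the earlier trace) and that time on the wire is negligible by comparison. The helpers below are hypothetical and only do that arithmetic.

```c
#include <assert.h>

/* Rough throughput in bytes/s if every cycle pays one stall of
 * stall_ms milliseconds and nothing else costs any time. */
unsigned long stalled_bytes_per_second(unsigned long bytes_per_cycle,
                                       unsigned long stall_ms)
{
    assert(stall_ms > 0);
    return bytes_per_cycle * 1000UL / stall_ms;
}

/* Total time in milliseconds for a number of such cycles. */
unsigned long stalled_total_ms(unsigned long cycles, unsigned long stall_ms)
{
    return cycles * stall_ms;
}
```

Ten cycles of 14500 bytes then take about 1000 ms, i.e. about 145000 bytes/s, which is in line with the roughly 140 KBytes/s reported earlier in the thread.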
In article <a7458e5b.0210250733.76602...@posting.google.com>,
bernardb...@hotmail.com (Bernard Brooks) wrote:
>I have evidence to show that TCP_NODELAY does disable Nagle on WinSock2.
>BUT - the problem I'm talking about happens with Nagle ENABLED (and also
>with it disabled). The problem has nothing to do with Nagle.
Most of the 'problems' laid at Nagle's door likewise have nothing to do with Nagle. Coalescence of outgoing data happens even in the absence of Nagle. Delayed sending also occurs in the absence of Nagle (for instance, when the send buffer size is larger than the negotiated window).
Alun. ~~~~
[Please don't email posters, if a Usenet response is appropriate.] -- Texas Imperial Software | Try WFTPD, the Windows FTP Server. Find us at 1602 Harvest Moon Place | http://www.wftpd.com or email a...@texis.com Cedar Park TX 78613-1419 | VISA/MC accepted. NT-based sites, be sure to Fax/Voice +1(512)258-9858 | read details of WFTPD Pro for XP/2000/NT.
In article <a7458e5b.0210250735.45a02...@posting.google.com>,
bernardb...@hotmail.com (Bernard Brooks) wrote:
>arielg...@softhome.net (Fernando Gont) wrote in message
>> With Nagle disabled, I see no reason for not sending the data, when
>> you have MSS bytes available to be sent.
>Also absolutely true - and yet again, it doesn't. Also a bug.
Have you run a network trace to determine what is, and isn't being negotiated and sent? Have you checked the buffer size that's being given to your application? What about the window size?
Alun. ~~~~
> >> With Nagle disabled, I see no reason for not sending the data, when
> >> you have MSS bytes available to be sent.
> >Also absolutely true - and yet again, it doesn't. Also a bug.
> Have you run a network trace to determine what is, and isn't being negotiated
> and sent? Have you checked the buffer size that's being given to your
> application? What about the window size?
Yes, I've traced it, and if you're thinking that the delay is because the TCP window size is being reduced, or because of congestion - the trace shows that neither of these is the case.
The setup: A local ethernet (no routers, etc). MSS is 1460 bytes; no window scaling negotiated; Solaris is advertising a receive window of 8760 bytes; SO_SNDBUF on the Windows box is the default (8192 bytes).
The trace is of traffic from a Windows box to a Solaris box, produced by:

    for (;;) {
        send(sd, buf, 12000, 0);   /* large buffer */
        send(sd, buf, 2500, 0);    /* small buffer */
    }
  0.00056 a -> b Ack=1 Seq=4144714689 Len=1460 Win=64240
  0.00009 a -> b Ack=1 Seq=4144716149 Len=1460 Win=64240
  0.00008 a -> b Ack=1 Seq=4144717609 Len=1460 Win=64240
  0.00009 a -> b Ack=1 Seq=4144719069 Len=1460 Win=64240
  0.00010 b -> a Ack=4144717609 Seq=1 Len=0 Win=8760
  0.00008 a -> b Ack=1 Seq=4144720529 Len=1460 Win=64240
  0.00006 b -> a Ack=4144720529 Seq=1 Len=0 Win=8760
  0.00002 a -> b Ack=1 Seq=4144721989 Len=1460 Win=64240
  0.00021 a -> b Ack=1 Seq=4144723449 Len=1460 Win=64240
p 0.00015 a -> b Ack=1 Seq=4144724909 Len=1460 Win=64240
  0.00001 b -> a Ack=4144723449 Seq=1 Len=0 Win=8760
q 0.00007 a -> b Ack=1 Seq=4144726369 Len=1460 Win=64240
  0.00007 b -> a Ack=4144726369 Seq=1 Len=0 Win=8760
X 0.00005 a -> b Ack=1 Seq=4144727829 Len=1360 Win=64240
  0.09829 b -> a Ack=4144729189 Seq=1 Len=0 Win=8760
  0.00046 a -> b Ack=1 Seq=4144729189 Len=1460 Win=64240
The packet I've marked with an 'X' is the first suspicious event. At this point in the conversation there's outstanding unacknowledged data. The ACK immediately before packet 'X' acknowledges all data up to and including packet 'p'. Packet 'q' is unacknowledged.
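The reading of the trace above can be checked mechanically. The following is a small sketch, with a hypothetical helper name, that tests whether a cumulative ACK number covers a given segment; it uses modulo 2^32 arithmetic so that it would also hold across a sequence number wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when the cumulative ACK number `ack` covers every byte of the
 * segment [seq, seq + len). Hypothetical helper for reading a trace.
 * The difference is taken modulo 2^32; a result in the lower half of the
 * range means `ack` is at or beyond the end of the segment.
 */
bool segment_is_acked(uint32_t seq, uint32_t len, uint32_t ack)
{
    assert(len > 0);
    uint32_t end = seq + len;   /* first sequence number after the segment */
    return (uint32_t)(ack - end) < 0x80000000u;
}
```

Plugging in the numbers from the trace: with Ack=4144726369, packet 'p' (Seq=4144724909, Len=1460) is covered, while packet 'q' (Seq=4144726369, Len=1460) is not.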
The Nagle algorithm says that if there's outstanding unacknowledged data, then it should NOT send data unless it's got enough to fill a packet. So why does it send the 1360 byte packet? It appears (from other tests I've done) that the bug is not in the WinSock2 implementation of the Nagle algorithm.
What actually happens is that the socket buffering layer within WinSock2 tells the application (via select for write, or via a blocking send() call) that there's insufficient space in the buffer, so the next send() cannot hand over its data. But we know from the ACKs received that there should be plenty of space in the buffer. This seems to be where the bug is.
In fact (as you can see from the trace) it waits until the ACK arrives for the 1360 byte packet before it unblocks the socket - causing these appalling delays.
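The arithmetic behind the trace can be sketched as follows. The helper names are hypothetical, and the model is deliberately crude: it assumes each 12000 + 2500 byte iteration costs one stall of the length seen in the trace (0.09829 s) and ignores everything else.

```c
#include <assert.h>
#include <stddef.h>

/* Result of cutting a byte stream into MSS-sized segments. */
struct segmentation {
    size_t full_segments;   /* number of segments carrying exactly mss bytes */
    size_t tail_bytes;      /* size of the final short segment, 0 if none */
};

/* Hypothetical helper: how a run of `bytes` splits at a given MSS. */
struct segmentation split_into_segments(size_t bytes, size_t mss)
{
    assert(mss > 0);
    struct segmentation s = { bytes / mss, bytes % mss };
    return s;
}

/* Total stall time in microseconds if every iteration waits once for
 * `stall_us` (for example, one delayed ACK per iteration). */
unsigned long stalled_time_us(unsigned long iterations, unsigned long stall_us)
{
    return iterations * stall_us;
}
```

For 12000 + 2500 = 14500 bytes at an MSS of 1460 this gives 9 full segments plus a 1360 byte tail, which is the shape of the trace. Ten iterations at 98290 microseconds each come to 982900 microseconds, in line with the roughly 1 second measured for the 20 buffer sequence.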
Fernando Gont wrote:
> On Tue, 22 Oct 2002 12:47:57 -0700, David Schwartz
> <dav...@webmaster.com> wrote:
> > Winsock does not have ESP. It has to make a decision whether or not to
> >send a packet and it has to make it when you call 'send'. Otherwise, it
> >can set a timer. It has no other options. If Winsock were to send data
> >immediately in your 4 byte send, what happens if it's followed later by
> >32 4-byte sends? Should they all go in their own packet, with data
> >efficiency dropping in the toilet as headers exceed data by huge
> >factors?
> Supposing Nagle was enabled, his data should be sent (at least) when
> he gets MSS bytes in the socket send buffer, as the idea of Nagle is
> "do not send *small* packets when.....".
Right, that's why disabling Nagle didn't help.
> With Nagle disabled, I see no reason for not sending the data, when > you have MSS bytes available to be sent.
There are any number of reasons, even without Nagle, when you might not send data even if you have an MSS worth. See, for example, RFC 1122. With Nagle disabled, the 4 byte send is much more likely to result in a packet being sent, thus robbing the stack of the chance to use that opportunity to send more data. TCP pacing only allows so many opportunities.
In article <3DBB0ACA.9C56C...@webmaster.com>, David Schwartz <dav...@webmaster.com> wrote:
> ...
>> With Nagle disabled, I see no reason for not sending the data, when
>> you have MSS bytes available to be sent.
> There are any number of reasons, even without Nagle, when you might not
>send data even if you have an MSS worth. See, for example, RFC1122. With
>Nagle disabled, the 4 byte send is much more likely to result in a
>packet being sent, thus robbing the stack of the chance to use that
>opportunity to send more data. TCP pacing only allows so many
>opportunities.
Which part of RFC 1122 are you referring to?
Given the sample code with the Nagle Algorithm enabled, how can the 4-byte send() result in sending a segment that is not maximum sized or otherwise rob the stack of a chance to send more data?
Your advice to write application code that knows the MSS and RTT of the network can result in code that is much worse than the ugly and inefficient infinite loop of alternating big and tiny writes. Unless the network is very much faster than the host and probably also has 20K MTU, the only unnecessary cost of those alternating writes is CPU cycles in the host. Your advice to wire "buffer flushing" into the application can produce code that also has problems on the wire.