
Udp sending performance in Gbit Ethernet


JTL

Dec 29, 2005, 4:12:02 AM
Hello,

This topic has been previously discussed in win32.programmer.networks, but
no clear solution has been found.

It seems that when sending UDP packets at maximum speed, the sending
throughput drops significantly when the payload size is increased from 1024
to 1025 bytes. This appears to be connected to the NIC interrupts in some
way: 1024-byte packets can be sent at full rate, but 1025-byte packets only
once per interrupt (and per sending application).

I made a web page of the situation so far, and it is located at:

http://www.kolumbus.fi/juha.lemmetti/Udp.html

If you have an explanation or a work-around, please reply to this post or to
me (my e-mail is on the web page).

Greetings,

Juha
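[For readers trying to reproduce this: a minimal sketch of the kind of
looping send test under discussion, in Python rather than the original
Winsock C. The target address and packet counts are placeholders, not
JTL's actual test program.]

```python
# Hypothetical reproduction sketch: blast fixed-size UDP datagrams in a
# tight loop and measure the achieved send rate. The payload size is the
# knob that triggers the reported 1024 -> 1025 byte throughput cliff on
# Windows; on other platforms both sizes should behave the same.
import socket
import time

def measure_send_rate(payload_size, count=20000, addr=("127.0.0.1", 9)):
    """Send `count` UDP datagrams of `payload_size` bytes; return MB/s."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * payload_size
    start = time.perf_counter()
    for _ in range(count):
        sock.sendto(payload, addr)
    elapsed = time.perf_counter() - start
    sock.close()
    return (payload_size * count) / elapsed / 1e6

if __name__ == "__main__":
    for size in (1024, 1025):
        print(f"{size} bytes: {measure_send_rate(size):.1f} MB/s")
```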

m

Jan 8, 2006, 7:24:24 PM
FYI: I have retested my code on gigabit Ethernet and do not see this
problem. I measure the throughput at the receiver, and use Intel NICs with a
3Com baseline switch.


"JTL" <J...@discussions.microsoft.com> wrote in message
news:A8BE3C6F-FB46-4FEC...@microsoft.com...

JTL

Jan 12, 2006, 3:08:02 AM
"m" wrote:

> FYI: I have retested my code on gigabit Ethernet and do not see this
> problem. I measure the throughput at the receiver, and use Intel NICs with a
> 3Com baseline switch.
>

Hi,

You were using IOCP, right? Could you please give some details of your setup?

1) What was the throughput you measured?
2) How many threads did you have in your application? How many open ports on
the server side?
3) How many clients did you have? How many distinct ports on the client side?

Thanks for the input,

Juha

m

Jan 12, 2006, 3:12:18 PM
Yes, IOCP

Total throughput of about 112 MB/s (896 Mb/s) with 1200 byte packets.

My app checks the number of processors in the system to determine the optimal
number of IO worker threads - in this case, 16.

4 client PCs (Server 2003) - each with a single receiving app.

FYI: I spent about half an hour on this test, so the numbers may not be
totally optimal etc.

"JTL" <J...@discussions.microsoft.com> wrote in message

news:FFBB872C-A5F3-4939...@microsoft.com...

Stephan Wolf [MVP]

Jan 13, 2006, 7:45:58 AM
Not sure whether your test program or setup is correct. You would be better
off using a well-known test program like TTCP. Various implementations of
TTCP are available for Windows, see e.g.

"Test TCP (TTCP) Benchmarking Tool for Measuring TCP and UDP
Performance"
http://www.pcausa.com/Utilities/pcattcp.htm

We've been using the UDP tests with WSTTCP for years and IIRC, we got it to
load our Gigabit network at 100% (using cards with the Marvell "Yukon" GigE
chipset).

The performance drop you describe can actually be caused by an
increased number of interrupts. However, modern NICs usually implement
"interrupt moderation" so that should not actually be a problem.

Stephan
---

JTL

Jan 14, 2006, 11:00:02 AM
"Stephan Wolf [MVP]" wrote:

> Not sure whether your test program or setup is correct. You should
> better use some well-known test program like TTCP. Various
> implementations of TTCP are available for Windows, see e.g.
>
> "Test TCP (TTCP) Benchmarking Tool for Measuring TCP and UDP
> Performance"
> http://www.pcausa.com/Utilities/pcattcp.htm

Here are the results:

>wsttcp -t -u -l1024 -n1000000 -p1234 192.168.2.201
wsttcp-t: buflen=1024, nbuf=1000000, align=16384/+0, port=1234 udp -> 192.168.2.201
wsttcp-t: socket
wsttcp-t: 1024000000 bytes in 23.22 real sec = 43068.18 KB/sec (352814505.36 bps)
wsttcp-t: 1000006 I/O calls, msec/call = 0.02, calls/sec = 43068.44
1024000000 1137252422.38 1137252445.60 23.22 352814505.36

>wsttcp -t -u -l1025 -n1000000 -p1234 192.168.2.201
wsttcp-t: buflen=1025, nbuf=1000000, align=16384/+0, port=1234 udp -> 192.168.2.201
wsttcp-t: socket
wsttcp-t: 1025000000 bytes in 67.05 real sec = 14929.48 KB/sec (122302265.57 bps)
wsttcp-t: 1000006 I/O calls, msec/call = 0.07, calls/sec = 14915.00
1025000000 1137252450.74 1137252517.78 67.05 122302265.57

>ipconfig /all

Ethernet adapter Local Area Connection:

Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Compaq NC7131 Gigabit Server Adapter #2
Physical Address. . . . . . . . . :
Dhcp Enabled. . . . . . . . . . . : No
IP Address. . . . . . . . . . . . : 192.168.2.200
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . :
DNS Servers . . . . . . . . . . . : 195.74.0.47
NetBIOS over Tcpip. . . . . . . . : Disabled

... So the throughput drops by about 65%.
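[The quoted wsttcp numbers give that figure directly:]

```python
# Check of the quoted drop: wsttcp measured 43068.18 KB/s at 1024 bytes
# and 14929.48 KB/s at 1025 bytes (numbers from the output above).
rate_1024 = 43068.18
rate_1025 = 14929.48
drop = 1 - rate_1025 / rate_1024
print(f"throughput drop: {drop:.0%}")  # -> 65%
```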

> We've been using the UDP tests with WSTTCP for years and IIRC, we got
> it to 100% load our Gigabit network (using cards with the Marvell
> "Yukon" GigE chipset).

Okay, what were the settings? I have achieved wire speed also, BUT with 1024
byte packet size (and faster interface than PCI).

> The performance drop you describe can actually be caused by an
> increased number of interrupts. However, modern NICs usually implement
> "interrupt moderation" so that should not actually be a problem.

I did not understand your point, but I did run Performance Monitor when
running the tests. The results were:

- With 1024 byte packets, Interrupts/sec was approximately 5450
- With 1025 byte packets, Interrupts/sec was approximately 15100

So,

1) The 1025 byte packets really did increase the number of interrupts by a
factor of 3 compared to 1024 byte packets
2) The 1025 byte packets were again sent at 1 packet per interrupt. For 1024
byte packets this ratio was about 8...

I again kindly ask you to read the previous posts and check the previous
threads on Google. This phenomenon has been detected in multiple
configurations and multiple Windows versions.

Clearly the posts from "m" show that there is a work-around for this, and I
will investigate it when I have the time. But it still does not explain why
the naive approach to UDP sending suffers such a dramatic performance drop at
the 1024->1025 byte transition.

Greetings,

Juha

Maxim S. Shatskih

Jan 14, 2006, 5:17:39 PM
> 1) The 1025 byte packets really did increase the # of interrupts by factor
> of 3 compared to 1024 byte packets
> 2) The 1025 byte packets were again sent at 1 packet per interrupt. For 1024
> byte packets this ratio was about 8...

Peculiarities of the particular card and its driver.

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
ma...@storagecraft.com
http://www.storagecraft.com

JTL

Jan 16, 2006, 3:53:02 AM
"m" wrote:

> Yes, IOCP
>
> Total throughput of about 112 MB/s (896 Mb/s) with 1200 byte packets.
>
> My app checks the number of processors in the system to determine the optimal
> number of IO worker threads - in this case, 16.
>
> 4 client PCs (Server 2003) - each with a single receiving app.

Hello,

I modified my test program so that there are multiple sending threads (with
a single sending socket). I did not yet use IOCP. The results were again a
bit strange:

- When increasing the number of sending threads, the throughput did
increase, but not linearly.
- Multiple sending threads did, however, reduce the number of interrupts/sec
in Performance Monitor. This decrease was something like 70% with 6 sending
threads.

My current guess is that there is some kind of relationship between the
number of sending threads, the number of interrupts per second, and the
throughput. It could even be that the relationship is close to

throughput = # of sending threads * # of interrupts / sec * datagram size

That is, each thread is able to send a single datagram with each interrupt.

However, this is just a guess. But it would give some clue as to why you
cannot reproduce the problem with your application - 16 threads are probably
sufficient to fill the wire (with the hardware you are using). The other
possibility is that IOCP functions in a different way. I'll look into that.
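[The multi-threaded variant described above can be sketched as follows -
a hypothetical Python illustration of the structure (several plain threads
sharing one sending socket), not the actual Winsock test program:]

```python
# Sketch: several threads hammering sendto() on one shared UDP socket.
# A SOCK_DGRAM socket can be shared this way because each sendto() call
# hands off exactly one datagram. Target address and counts are
# placeholders.
import socket
import threading

def blast(sock, payload, count, addr):
    for _ in range(count):
        sock.sendto(payload, addr)

def run(n_threads=6, payload_size=1025, count_per_thread=5000,
        addr=("127.0.0.1", 9)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * payload_size
    threads = [threading.Thread(target=blast,
                                args=(sock, payload, count_per_thread, addr))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    sock.close()
    return n_threads * count_per_thread * payload_size  # bytes attempted

if __name__ == "__main__":
    print(run(), "bytes sent")
```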

Thanks a lot for your input and efforts,

Juha


Stephan Wolf [MVP]

Jan 16, 2006, 9:46:39 AM
JTL wrote:
[..]

> I again kindly ask you to read the previous posts and check the previous
> threads from google. This phenomenon has been detected in multiple
> configurations and multiple windows versions.

Writing network drivers is my main job. I have done that for various network
cards since 1988.

We always did a lot of testing and thinking wrt performance optimization,
both for hardware and software. One thing you learn is that the more complex
your optimization approach gets, the less efficient it usually also gets
(paradoxical, I know). There are just too many side effects one cannot think
of.

So I am *not* actually surprised by the behaviour you describe. We often saw
a sudden drop of performance for certain packet (frame) sizes. And there can
be just so many reasons, like interrupts and DMA.

If the number of interrupts increases by a factor of 3 as you describe, then
there is probably some limit reached in the driver or in the card, such as
the maximum DMA block size or the like, which forces the driver to split
frames into several DMA transfers. Since some cards generate an interrupt at
the end of each DMA transfer, this would be an explanation.

But this is just wild guessing. I am not familiar with the actual
architecture of either the driver or the card's hardware that you use.

All you can do is:

1. Try different versions of the driver for the card (even older ones).
2. Try some other card (different chipset) along with its driver.
3. Holler at the card vendor's support (don't expect too much).

BTW, which chipset does your GigE card use (Marvell, Broadcom, Intel,
etc.)?

HTH, Stephan

JTL

Jan 17, 2006, 7:38:01 AM
"Stephan Wolf [MVP]" wrote:
> So I am *not* actually surprised by the behaviour you describe. We often
> saw a sudden drop of performance for certain packet (frame) sizes. And
> there can be just so many reasons, like interrupts and DMA.
>
> If the number of interrupts increases by a factor of 3 as you describe,
> then there is probably some limit reached in the driver or in the card
> such as the maximum DMA block size or alike, which forces the driver to
> split frames into several DMA transfers. Since some cards generate an
> interrupt at the end of each DMA transfer, this would be an
> explanation.

I do not know the internals of the Windows driver model, so I really cannot
comment. Your comments seem reasonable, but they still leave the following
points open:

- Why is it faster to send 3 UDP packets of size 1024 bytes than 1 of size
1025?
- Why is it twice as fast to send one fragmented UDP packet, where both
fragments are of size 1400+ bytes, than two un-fragmented packets of size
1400 bytes?
- Why is there no such phenomenon with TCP? It sends large packets also.
(And it achieves the same speeds as UDP with 1024 byte datagrams.)
- Why does increasing the number of applications (sockets) increase the
throughput?
- Why does increasing the number of threads with one socket increase the
throughput?

As I said, I don't know the internals of Windows networking, but in my
opinion the points above belong to the TCP/IP stack implementation, not the
driver implementation.

Furthermore:
- I have tested this with at least 4 different cards (see below), each of
them has the same problem.
- This problem does not exist in Linux, when run on exactly the same hardware.

See also:
http://www.chch.demon.co.uk/wintest/wintest.html

> 1. Try different versions of the driver for the card (even older ones).

This I have not tried.

> 2. Try some other card (different chipset) along with its driver.

This I have tried.

> 3. Holler at the card vendor's support (don't expect too much).

This I have not tried.


> BTW, which chipset does your GigE card use (Marvell, Broadcom, Intel,
> etc.)?

I had the list in one of the previous posts, but at least:

- 3Com 3c2000
- Intel PRO/1000 GT
- Compaq Server NIC
- NVIDIA NForce chipset (motherboard) NIC

I would be surprised if each of the manufacturers above made the same
mistake in their drivers...

Greetings

Juha

Stephan Wolf [MVP]

Jan 18, 2006, 10:20:40 AM
Network drivers follow the NDIS model. The NDIS library ("Wrapper")
itself uses WDM, but NDIS drivers have no idea of WDM.

There are just so many things to mention here:

1. TCP/UDP Checksum offload - Most modern NICs offer this feature, i.e.
they calculate the checksum in hardware. Try and play with this option
for your NIC if available.

2. Large Send Offload - ditto.

3. Is there any difference in the frames on the wire? Run a network sniffer
like Ethereal and check for things like IP fragments.

n. ...just so many more.

Stephan

m

Jan 18, 2006, 6:08:23 PM
BTW: did you test with different motherboards?

"JTL" <J...@discussions.microsoft.com> wrote in message

news:23EE9A62-AD63-4D10...@microsoft.com...

JTL

Jan 19, 2006, 4:21:01 PM
"m" wrote:

> BTW: did you test with different motherboards?

Yep, I have tested several motherboards, both AMD and Intel processors, with
e.g. Intel, NVidia, VIA chipsets.

I have not yet heard of a single configuration that does not show this
phenomenon in one form or another. I do believe your results, but your
application's structure is different (i.e. multiple threads and IOCP). Until
I hear of a configuration that does not reproduce the problem with a simple
looping test program (like TTCP), I believe this is some kind of phenomenon
in every Windows installation.

In other words, I do not believe that I can correct the situation by
changing hardware, driver versions or configuration. I have to change the
structure of the program, which is IMHO rather unnecessary work.

Greetings,

Juha

JTL

Jan 25, 2006, 9:03:20 AM
"JTL" wrote:
> It seems that when sending UDP packets at maximum speed, the throughput of
> the sending drops significantly when the payload size is increased from 1024
> to 1025 bytes. This seems to be connected to the NIC interrupts in some way,
> it seems like that 1024-byte packets can be sent at full rate, but 1025 byte
> packets only once per interrupt (and per sending application).

This problem has been discussed in the Usenet group comp.protocols.tcp-ip
under the topic "large data transfers over GigE problems". In that
discussion Mr. Charles Bryant gave the solution: the 1024-byte UDP payload
size is really a threshold inside the TCP/IP stack, and it can be changed
using a registry key.

I have tested the registry key: when set to e.g. 2048, the problem
disappears, and the throughput returns to the expected value with packets >
1024 bytes.

Quote from Mr. Bryant:

On 2006-01-25, Charles Bryant <n95474...@chch.demon.co.uk> wrote:
> That sounds like the same issue I described in
><URL: http://groups.google.co.uk/group/comp.os.ms-windows.programmer.nt.kernel-mode/browse_thread/thread/7ec2673b5471490f/409197bb36320ace%3Fhl%3Den%23409197bb36320ace>.

> I have since been told about the registry key FastSendDatagramThreshold,
> which is 1024 by default. Try increasing this value. It's mentioned in
><http://www.microsoft.com/technet/itsolutions/network/deploy/depovg/tcpip2k.mspx>.

Thus, not a driver problem, not an MTU problem, not a software problem -
just a simple matter of finding the right registry key...
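[For reference, the change being described can be captured in a .reg file.
The key path below is the standard AFD parameters location as given in the
Microsoft TCP/IP implementation-details article linked above; 0x800 is the
2048-byte value used in the test, and AFD typically needs a reboot to pick
the change up.]

```reg
Windows Registry Editor Version 5.00

; FastSendDatagramThreshold (DWORD): datagrams at or below this size take
; AFD's "fast send" path. 0x800 = 2048 bytes, as tested above.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters]
"FastSendDatagramThreshold"=dword:00000800
```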

Greetings,

Juha

Pavel A.

Jan 26, 2006, 5:27:07 AM
Thanks, this is very interesting.
I didn't know that the "fast i/o path" is indeed this fast.

By the way, the FastSendDatagramThreshold parameter is documented for
win2003 as well
http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/networking/tcpip03.mspx

And there are other parameters of the "fast i/o path":
MaxFastCopyTransmit
MaxFastTransmit (used only for TransmitFile?)
FastCopyReceiveThreshold

Regards,
--PA

Stephan Wolf [MVP]

Jan 26, 2006, 7:26:50 AM
Pavel A. wrote:
> Thanks, this is very interesting.
> I didn't know that the "fast i/o path" is indeed this fast.

Very interesting indeed, and good to know.

Thanks!
Stephan
P.S.: As I said... there are just too many things that have an impact
on the overall network throughput/performance.

Maxim S. Shatskih

Jan 26, 2006, 12:01:48 PM
> By the way, the FastSendDatagramThreshold parameter is documented for
> win2003 as well

This limit works the following way - datagrams smaller than it are copied to
AFD's temp memory chunks and the sends are immediately completed, while the
TDI send is done via AFD's temp chunk. For larger datagrams, send() equals a
TDI send.

You can also use a nonblocking socket (FIONBIO); this will effectively set
this limit to infinity by forcing the intermediate copying for all datagrams.

This can be faster in some scenarios, but consumes more CPU and kernel
memory.
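[A minimal sketch of that work-around - shown in Python, where
setblocking(False) is the portable equivalent of the FIONBIO ioctl; in
Winsock C one would call ioctlsocket(s, FIONBIO, &enabled) instead:]

```python
# Put a UDP socket into non-blocking mode so sends complete immediately
# from the caller's point of view. Python's setblocking(False) issues the
# platform's non-blocking ioctl (FIONBIO on Windows) under the hood.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)  # equivalent of FIONBIO with a non-zero argument
assert sock.gettimeout() == 0.0  # non-blocking mode reports a zero timeout
sock.sendto(b"x" * 1400, ("127.0.0.1", 9))  # returns without blocking
sock.close()
```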

JTL

Jan 27, 2006, 9:50:02 AM
"Pavel A." wrote:

> Thanks, this is very interesting.
> I didn't know that the "fast i/o path" is indeed this fast.

I think that the reason is the phrase "Larger ones are held until the
datagram is actually sent". I guess it means that the call waits until the
datagram is sent on an interrupt. Now there is a big difference between
100 Mbit and gigabit networks:

With 100 Mbit and 1500 byte packets, the rate is roughly 8,000 interrupts per
second, which is feasible. With a gigabit network, the rate increases to
80,000 interrupts per second, which in turn is not possible to handle. So, in
order to achieve full gigabit rate, several packets must be sent with every
interrupt. This is why the effect is so noticeable with gigabit networks (and
I guess the default of 1024 was chosen based on tests with 100 Mbit networks).
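[The interrupt-rate arithmetic above follows from one interrupt per
full-size frame:]

```python
# One interrupt per 1500-byte frame at 100 Mbit/s vs. 1 Gbit/s, using the
# paragraph's own numbers: roughly 8,000 vs. 80,000 frames per second.
frame_bits = 1500 * 8
for link_bps in (100e6, 1e9):
    print(f"{link_bps / 1e6:.0f} Mbit/s -> "
          f"{link_bps / frame_bits:,.0f} frames/s")
```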

> By the way, the FastSendDatagramThreshold parameter is documented for
> win2003 as well
> http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/networking/tcpip03.mspx
>
> And there are other parameters of the "fast i/o path":
> MaxFastCopyTransmit
> MaxFastTransmit ( used only for TransmitFile ?)
> FastCopyReceiveThreshold
>

Thanks,

Juha

Pavel A.

Jan 27, 2006, 10:45:26 AM
This is even more interesting because of the interference with QoS
scheduling. Since QoS is becoming a hot topic (multimedia, wireless, and
both of them together) and TDI is getting a major overhaul in LH, all this
will probably change...

--PA

"Maxim S. Shatskih" <ma...@storagecraft.com> wrote in message news:%23xzsomp...@TK2MSFTNGP10.phx.gbl...

Pavel A.

Jan 27, 2006, 10:59:11 AM
"JTL" <J...@discussions.microsoft.com> wrote in message news:057BD625-BFD6-4761...@microsoft.com...

> I think that the reason is the phrase "Larger ones are held until the
> datagram is actually sent". I guess it means that the call waits until the
> datagram is sent on an interrupt. Now there is a big difference between
> 100 Mbit and gigabit networks:
>
> With 100 Mbit and 1500 byte packets, the rate is roughly 8,000 interrupts
> per second, which is feasible. With a gigabit network, the rate increases to
> 80,000 interrupts per second, which in turn is not possible to handle. So, in
> order to achieve full gigabit rate, several packets must be sent with every
> interrupt. This is why the effect is so noticeable with gigabit networks (and
> I guess the default of 1024 was chosen based on tests with 100 Mbit networks).

Gigabit and faster adapters should have some kind of interrupt moderation.
Also, they use bus-mastering DMA, so it's hard to tell what the exact
dependency between TX throughput and interrupt rate is. Of course the
behavior greatly depends on the chipset and on how the OEM optimizes the
chip vendor's prototype driver - and there are not a lot of such chips.
Server-class products can behave differently from cheap notebook adapters.

Regards,
--PA


Arkady Frenkel

Jan 27, 2006, 11:09:44 AM
Maybe post that to sdk...@microsoft.com and/or
ndi...@microsoft.com

Arkady

"JTL" <J...@discussions.microsoft.com> wrote in message
news:057BD625-BFD6-4761...@microsoft.com...

Michael K. O'Neill

Jan 27, 2006, 7:34:53 PM

"JTL" <J...@discussions.microsoft.com> wrote in message
news:057BD625-BFD6-4761...@microsoft.com...

Juha, may I suggest that you update your web page to reflect the discovery
of the "FastSendDatagramThreshold" parameter.

http://www.kolumbus.fi/juha.lemmetti/Udp.html

Mike


m

Jan 28, 2006, 3:59:27 PM
I am glad you figured this out.

"JTL" <J...@discussions.microsoft.com> wrote in message
news:057BD625-BFD6-4761...@microsoft.com...
