
large data transfers over GigE problems


Dave Moore

Jan 20, 2006, 6:03:34 AM
Hi All,
I'm attempting to efficiently transfer large amounts of data (say
60-70 Mbytes/s) between PCs over a GigE connection. So far, I've been unable
to do this without incurring a large host CPU load.
Currently I'm transferring the data in small 1k packets. For some reason
(that I assume has something to do with fragmentation), using packets larger
than this greatly reduces the throughput, typically to < 20 Mbytes/sec.
Different packet sizes give different throughputs, but nothing seems as
efficient at transferring data across the ethernet as 1k packets. However,
one consequence of this is that the host CPU is loaded to an
unacceptable level.

As far as I can see, the only way I can improve the situation is to do something
drastic like rewrite the ethernet drivers, or remove the IP stack and
introduce DMA into the system.

Has anybody any ideas (either on the packet efficiency issue or the
driver rewrite)? Is there any low-level driver source code around that I
could use?

Ta in advance,
Dave

Rick Jones

Jan 20, 2006, 2:49:43 PM
Dave Moore <dave.m...@baesystems.com> wrote:
> I'm attempting to efficiently transfer large amounts of data (say
> 60-70 Mbytes/s) between PCs over a GigE connection. So far, I've been
> unable to do this without incurring a large host CPU load.

One of the oft-overlooked but omnipresent dirty little secrets of
Ethernet is that the Ethernet specification, in and of itself, has
done _nothing_ to make data transfer easier on hosts since it was
first put forth. It has retained that lovely 1500 byte MTU from (near
as I can tell) day one.

So, from the standpoint of Ethernet itself, it is no easier for a host
to send a packet on a gigabit Ethernet network than on a 100 megabit
or a 10 megabit one.

> Currently I'm transferring the data in small 1k packets.

Then you are not helping any. :)

> For some reason (that I assume has something to do with
> fragmentation), using packets larger than this greatly reduces the
> throughput to < 20 Mbytes/sec typically. Different packet sizes give
> different throughputs but nothing seems as efficient at transferring
> data across the ethernet as 1k packets. However, one consequence
> of this is that the host CPU is very heavily loaded to an
> unacceptable level.

If you are transferring directly over IP then there may be some stuff
to do with fragmentation, but that needn't be a big performance hit
_unless_ there is packet loss. Some usenet searches for "frankengrams"
may find the discussion on that bit.

If you are using TCP, then one would likely be _segmenting_. Are you
using TCP? If you are using TCP you may also have either a broken
application or broken stack when it comes to the Nagle Algorithm.
Group searches on that one should find plenty in the history on that -
my fingers are too cold still this morning to rehash it all here :)

> As far as I can see, the only way I improve the situation is to do
> something drastic like rewrite the ethernet drivers or remove the IP
> stack and introduce DMAs into the system.

Um, the NIC should already be doing DMA to send/recv packets, just
perhaps not directly to your buffers. I hope you do not really mean
to imply that you have a Gigabit Ethernet NIC using PIOs for data
transfer...

Now, having dissed Ethernet :) There are some _implementations_ of
Gigabit Ethernet cards that provide things to lessen CPU overheads.
One is support for Jumbo Frames - this ups the MTU to 9000 bytes. It
does require that everything in the broadcast domain support the larger
MTU size.

If you are using TCP, then some NICs have support for TSO or "large
send" which is a "poor man's" jumbo frame - the NIC allows the stack
to hand it really large TCP segments, which it will then _resegment_
into TCP segments that fit in the standard 1500 byte MTU. This means
the rest of the infrastructure is a don't care, but does have the
downside of doing virtually nothing for the receiver.

So, some additional details of your data transfers may be of help.
And of the systems involved.

rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Dave Moore

Jan 23, 2006, 4:25:52 AM
Thanks Rick,
I'm using UDP multicast. I've tried using Jumbo Frames by setting the
config of the NIC drivers and using, say, 5k packets. However, this doesn't
seem to make any difference and the same bad throughput persists. Also, I
don't know that our switch supports Jumbo Frames (despite being brand new),
so I guess it's possible that they are getting fragmented at the switch. I've
also tried removing the switch and using a cross-over cable between the
machines, but this just breaks the multicast completely!

Just to clarify, I'm trying to find the most efficient way (in processor load
terms) of transferring a large amount of data between PCs.

Regards,
Dave

Rick Jones

Jan 23, 2006, 1:31:22 PM
Dave Moore <dave.m...@baesystems.com> wrote:
> I'm using UDP Multicast.

OK, so that _does_ mean IP fragmentation. Lose a fragment of an IP
datagram and the entire datagram is toast.
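To put numbers on that: a UDP datagram bigger than the MTU is carved into IP fragments, and losing any one fragment discards the whole datagram at the receiver. A rough fragment count, sketched under the assumption of a 20-byte IP header with no options and non-final fragments carrying a multiple of 8 bytes (per the IP fragmentation rules):

```c
/* Rough number of IP fragments for a UDP datagram of `udp_payload`
   bytes on a link with the given MTU. Assumes a 20-byte IP header
   with no options; fragment data is aligned down to a multiple of 8. */
int ip_fragments(int udp_payload, int mtu)
{
    int ip_payload = udp_payload + 8;   /* add the 8-byte UDP header */
    int per_frag = (mtu - 20) & ~7;     /* usable data per fragment */
    return (ip_payload + per_frag - 1) / per_frag;
}
```

On a standard 1500-byte MTU a 5024-byte datagram becomes four fragments, so one lost frame costs the receiver all 5024 bytes; with a 9000-byte jumbo MTU the same datagram fits in a single frame.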

> I've tried using Jumbo Frames by setting the config of the NIC
> drivers and using, say, 5k packets. However, this doesn't seem to make
> any difference and the same bad throughput persists. Also, I don't
> know that our switch supports Jumbo Frames (despite being brand new),
> so I guess it's possible that they are getting fragmented at the
> switch. I've also tried removing the switch and using a cross-over
> cable between the machines, but this just breaks the multicast
> completely!

Well, if the switch didn't support JumboFrames, or supported them and
they were not enabled by default, and they were indeed being used, you
would have had nothing get through.

I'm wondering if JF gets used for multicast traffic on your platform.
You should take some tcpdump traces and see just what is being sent
when you enable JF. Might also want to take some system call traces
to make sure your application is indeed making the larger send calls
(OK, I'm paranoid :)

> Just to clarify, I'm trying to find the most efficient way (in
> processor load terms) of transferring a large amount of data between
> PCs.

If you are using multicast, do you intend to have the same data
received by multiple PCs at the same time, or are you still really
only going from one machine to another?

David Moore

Jan 23, 2006, 4:04:06 PM
> Well, if the switch didn't support JumboFrames, or supported them and
> they were not enabled by default, and they were indeed being used, you
> would have had nothing get through.

Just loaded up Ethereal to see what's going on. It appears that Jumbo
Frames do work and are received at the destination PC without fragmentation,
which I guess is good, so the switch must be compliant. However, the
situation is still the same, i.e. the overall throughput is much less
than with 1k packets.

> If you are using multicast, do you intend to have the same data
> received by multiple PCs at the same time, or are you still really
> only going from one machine to another?

No, I need a one-to-many situation, but I'm just using 2 PCs to start with
to keep things simple.

Dave


Message has been deleted

Network - TCP/IP

Jan 23, 2006, 4:35:45 PM
Well, my original message was edited all to hell because of the
registry information I included. To make a long story short, you need
to change the default TCP/IP parameters in order to get anywhere near
decent speeds on GigE. I can't spell it out for you without being
filtered, and in reality if you're a reasonably advanced user you should
be able to understand where to make these changes and what they do.

You need to add a few TCP/IP parameters in the Windows registry. They
are listed below:

TcpWindowSize=131400
Tcp1323Opts=3
ForwardBufferMemory=80000
NumForwardPackets=60000

See this article for more information:
http://www.enterprisenetworkingplanet.com/nethub/article.php/3485486

Rick Jones

Jan 23, 2006, 5:26:55 PM
David Moore <dave_m...@post2me.freeserve.co.uk> wrote:
>> Well, if the switch didn't support JumboFrames, or supported them
>> and they were not enabled by default, and they were indeed being
>> used, you would have had nothing get through.

> Just loaded up Ethereal to see what's going on. It appears that Jumbo
> Frames do work and are received at the destination PC without
> fragmentation, which I guess is good, so the switch must be compliant.
> However, the situation is still the same, i.e. the overall throughput
> is much less than with 1k packets.

Time for some CPU profiling. First with the 1K and then with the
larger. Perhaps the larger sends are going through a "slow" path with
extra copies or whatnot. Which OS is this again?

Are there any retransmissions being recorded for the (presumably
reliable) multicast transfer?

rick jones
--
a wide gulf separates "what if" from "if only"

Dave Moore

Jan 24, 2006, 5:39:37 AM
Ok, here are some (hopefully relevant) numbers. Just to confirm, I'm using
UDP multicast:

If I blast 1024 byte multicast packets continually, Ethereal shows that it
transmits 52000 packets per second with 100% CPU loading. I'm happy with
this transmission rate but not with the CPU loading. So I try larger packets
(with presumably less processing overhead).

If I blast 5024 byte multicast packets continually, Ethereal shows that it
transmits 4057 packets per second with 25% CPU loading.

So what I can't understand is why the overall bandwidth is reduced when
using larger packets.
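Multiplying those rates out makes the gap concrete. A quick back-of-the-envelope check (payload bytes only, ignoring UDP/IP/Ethernet header overhead):

```c
/* Payload throughput implied by a packet rate, in Mbytes/s,
   ignoring UDP/IP/Ethernet header overhead. */
double mbytes_per_sec(long packets_per_sec, long payload_bytes)
{
    return (double)packets_per_sec * payload_bytes / 1e6;
}
```

1k packets: 52000 x 1024 comes to about 53 Mbytes/s, while 5k packets: 4057 x 5024 is only about 20 Mbytes/s - the larger packets really are moving less than half the data, matching the "< 20 Mbytes/sec" figure earlier in the thread.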

For reference the NIC settings are the following:

Adaptive Interface Spacing Disabled
Enable PME No Action
Flow Control Generate & Respond
Interrupt Moderation Rate Hardware Default
Jumbo Frames 9014 bytes
Link Speed & Duplex Auto Detect
Log Link State Event Enabled
Offload Transmit IP Checksum On
Offload Transmit TCP Checksum On
Qos Packet Tagging Disabled
Receive Descriptors 256
Transmit Descriptors 256

Dave

"Rick Jones" <rick....@hp.com> wrote in message
news:PKcBf.2003$yi4....@news.cpqcorp.net...

Rick Jones

Jan 24, 2006, 2:16:24 PM
Dave Moore <dave.m...@baesystems.com> wrote:
> Ok, here are some (hopefully relevant) numbers. Just to confirm,
> I'm using UDP multicast:

> If I blast 1024 byte multicast packets continually, Ethereal shows
> that it transmits 52000 packets per second with 100% CPU
> loading. I'm happy with this transmission rate but not with the CPU
> loading. So I try larger packets (with presumably less processing
> overhead).

> If I blast 5024 byte multicast packets continually, Ethereal shows that it
> transmits 4057 packets per second with 25% CPU loading.

OS and rev? IIRC Linux has _intra-stack_ flow control for UDP - if
you start sending data faster than the link, you get flow controlled,
and perhaps it takes a bit longer to re-enable. You could try making
the SO_SNDBUF much larger.
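Enlarging SO_SNDBUF is a one-call change; it is worth reading the size back, since the kernel adjusts the request. A sketch, assuming a Linux-style stack (Linux doubles the requested value and caps it at net.core.wmem_max; other stacks behave differently):

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask for a larger socket send buffer and return the size the kernel
   actually granted, or -1 on error. On Linux the granted value is
   roughly double the request, capped by net.core.wmem_max. */
int grow_sndbuf(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) < 0)
        return -1;
    int granted = 0;
    socklen_t len = sizeof granted;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```

Calling this on the sending UDP socket before blasting datagrams gives the stack more room to absorb bursts before any intra-stack flow control engages.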

> So what I can't understand is why the overall bandwidth is reduced when
> using larger packets.

> For reference the NIC settings are the following:

> Adaptive Interface Spacing Disabled
> Enable PME No Action
> Flow Control Generate & Respond
> Interrupt Moderation Rate Hardware Default
> Jumbo Frames 9014 bytes
> Link Speed & Duplex Auto Detect
> Log Link State Event Enabled
> Offload Transmit IP Checksum On
> Offload Transmit TCP Checksum On
> Qos Packet Tagging Disabled
> Receive Descriptors 256
> Transmit Descriptors 256

It would be interesting to know if in your 5Kish case it was filling
the transmit descriptors.

Do you have some way to "pace" your sends so you aren't ever trying to
send faster than link-rate? If there is indeed intra-stack flow
control, that would be one way to try to avoid it.
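One simple way to pace is to compute the inter-send gap a target rate implies and burn it off between datagrams. A sketch (the 60 Mbytes/s target below is just an illustrative assumption):

```c
/* Microseconds to wait between sends so the payload rate stays at or
   below target_bytes_per_sec for datagrams of payload_bytes each. */
double send_gap_us(int payload_bytes, double target_bytes_per_sec)
{
    return (double)payload_bytes / target_bytes_per_sec * 1e6;
}
```

At a 60 Mbytes/s target, 5024-byte datagrams want roughly an 84-microsecond gap between sends. In practice a timestamp-checked busy-wait is more accurate than usleep() here, since timer granularity is often coarser than the gap itself.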

rick jones
--
No need to believe in either side, or any side. There is no cause.
There's only yourself. The belief is in your own precision. - Jobert

Charles Bryant

Jan 24, 2006, 7:25:19 PM
In article <43d0c0b5$1...@glkas0286.greenlnk.net>,

Dave Moore <dave.m...@baesystems.com> wrote:
> I'm attempting to efficiently transfer large amounts of data (say
> 60-70 Mbytes/s) between PCs over a GigE connection. So far, I've been unable
> to do this without incurring a large host CPU load.
> Currently I'm transferring the data in small 1k packets. For some reason
> (that I assume has something to do with fragmentation), using packets larger
> than this greatly reduces the throughput to < 20 Mbytes/sec typically.

That sounds like the same issue I described in
<URL: http://groups.google.co.uk/group/comp.os.ms-windows.programmer.nt.kernel-mode/browse_thread/thread/7ec2673b5471490f/409197bb36320ace?hl=en#409197bb36320ace >.

I have since been told about the registry key FastSendDatagramThreshold,
which is 1024 by default. Try increasing this value. It's mentioned in
<URL: http://www.microsoft.com/technet/itsolutions/network/deploy/depovg/tcpip2k.mspx >.

Juha Lemmetti

Jan 25, 2006, 5:45:04 AM

I have noticed the same problem (throughput drops at 1025 bytes), and
setting the registry key (to e.g. 2048) fixes it.

My guess is that Windows will transfer only one datagram per interrupt when
the size of the datagram is larger than 1024 bytes. With smaller packets (<=
1024 bytes), Windows can transfer several datagrams per interrupt. This is
just what the second link Mr. Bryant gave states.

This leads to the following:

- Because virtually all GigE drivers/NICs limit the number of interrupts
per second, the throughput drops dramatically.
- In order to reach the maximum throughput, there must be as many
interrupts as possible. I guess this is why the CPU load increases (the
number of interrupts per second increases).

However, this limit is only per socket (or per application or per thread).
Thus, when two programs transmit packets of size 1025 bytes, the throughput
is exactly double the throughput of a single program. This is not the case
with 1024 byte packets (obviously).

But, as I said, setting the registry key to a larger value fixes the
problem. I guess that this has really only been a problem since the
introduction of gigabit networks.

Greetings,

Juha Lemmetti

Charles Bryant

Jan 26, 2006, 6:48:13 PM
In article <slrndtem9a...@lehtori.cc.tut.fi>,

Juha Lemmetti <ou...@nowhere.invalid> wrote:
>On 2006-01-25, Charles Bryant <n95474...@chch.demon.co.uk> wrote:
>> I have since been told about the registry key FastSendDatagramThreshold,
>> which is 1024 by default. Try increasing this value. It's mentioned in
>><URL: http://www.microsoft.com/technet/itsolutions/network/deploy/depovg/tcpip2k.mspx >.
>
>I have noticed the same problem (throughput drops at 1025 bytes), and
>setting the registry key (to e.g. 2048) fixes it.
>
>My guess is that Windows will transfer only one datagram per interrupt when
>the size of the datagram is larger than 1024. With smaller packets (<=
>1024), Windows can transfer several datagrams per interrupt. This is just
>what the second link Mr. Bryant gave states.

From what Microsoft has said about the option, I believe that datagrams
below the threshold are copied into a kernel buffer, while larger ones
are locked into memory and the Ethernet hardware is told to DMA the
data directly from memory.

I can guess why this would be affected by interrupts. A typical
program will send a datagram from a memory buffer and then overwrite
the same buffer with the data for the next datagram. Obviously if the
hardware is going to read the data directly from the buffer, it's
essential that the buffer isn't modified until the data has been read.
This could be done in one of two ways: either make send() block until
the hardware has read the data (and, therefore, until the datagram has
been sent), or use the virtual memory protection mechanism to trap any
attempt to modify the data before it has been sent and suspend the
thread which is writing to the buffer until the datagram has been
sent.

Since interrupts take time to process, a common optimisation on
transmit is to wait until several datagrams have been sent before
allowing a transmit interrupt. This may slow things down where a
thread is waiting for the interrupt.

If my guess is correct, then this type of code:

for (;;) {
    char buff[8192];
    fill(buff);                       /* produce the next datagram */
    send(sock, buff, sizeof buff, 0); /* sock: a connected UDP socket */
}

should be much slower than:

for (;;) {
    char buff_a[8192];
    char buff_b[8192];
    fill(buff_a);
    send(sock, buff_a, sizeof buff_a, 0);
    fill(buff_b);
    send(sock, buff_b, sizeof buff_b, 0);
}

since with two buffers the work of filling the second buffer can
overlap with the hardware reading the first buffer. Maybe even more
buffers would go faster.

Dave

Jan 30, 2006, 11:51:01 AM

> I have since been told about the registry key FastSendDatagramThreshold,
> which is 1024 by default. Try increasing this value. It's mentioned in
> <URL: http://www.microsoft.com/technet/itsolutions/network/deploy/depovg/tcpip2k.mspx >.

Well, that certainly sounds like a good idea, but unfortunately that key
doesn't exist on any of the Windows 2000 machines I have. There's no
'Parameters' key under AFD. Am I looking in the wrong place, or is it
different for 2000? I tried adding it, but it didn't seem to have any
noticeable effect.

Dave


Juha Lemmetti

Jan 30, 2006, 2:56:01 PM
On 2006-01-30, Dave <david....@baesystems.com> wrote:
> Well, that certainly sounds like a good idea, but unfortunately that key
> doesn't exist on any of the Windows 2000 machines I have. There's no
> 'Parameters' key under AFD. Am I looking in the wrong place, or is it
> different for 2000? I tried adding it, but it didn't seem to have any
> noticeable effect.

I tested in XP SP2, and by default there was no such value under the key
"Parameters". However, I added the value (type DWORD) and rebooted - and I
could see the difference.

Check that you have typed exactly the right value name with the right type
and remember to reboot between attempts.

Greetings,

Juha

Dave

Jan 31, 2006, 6:18:14 AM
Thanks Juha,
That definitely makes a difference. I can definitely send more data now.
Shame the receiving end baulks at the additional data!
Ta,
Dave

"Juha Lemmetti" <ou...@nowhere.invalid> wrote in message
news:slrndtsseg...@mozart.cc.tut.fi...
