[erlang-questions] why is gen_tcp:send slow?


Rapsey

Jun 19, 2008, 4:55:22 AM
to erlang-q...@erlang.org
I have a streaming server written in Erlang. When it was pushing 200-300 mb/s, the CPU was getting completely hammered. I traced the problem to gen_tcp:send.
So instead of sending every audio/video packet with a single gen_tcp:send call, I now buffer 3 packets and send them all at once; CPU consumption dropped dramatically.
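The batching described above might be sketched like this (an illustrative sketch with made-up names, not the actual server code). Since gen_tcp:send/2 accepts iodata, the buffered packets need not be concatenated into one binary first:

```erlang
%% Illustrative sketch (not the poster's code): accumulate packets and
%% flush three at a time with a single gen_tcp:send/2 call. gen_tcp
%% accepts an iolist, so no binary concatenation is needed.
-module(batch3).
-export([send/3]).

%% send(Socket, Packet, Buffer) -> NewBuffer
send(Sock, Packet, Buf) when length(Buf) =:= 2 ->
    ok = gen_tcp:send(Sock, lists:reverse([Packet | Buf])),
    [];
send(_Sock, Packet, Buf) ->
    [Packet | Buf].
```

Each call either buffers the packet or flushes three packets in a single port operation, which is where the CPU saving would come from.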
On one of the servers I have a simple proxy; the main process, which shuttles packets between the client and some other server, looks like this:

%% Data from the client: forward to the server, then re-arm the socket.
transmit_loop({tcp, Sock, Data}, P) when P#transdat.client == Sock ->
    gen_tcp:send(P#transdat.server, Data),
    inet:setopts(P#transdat.client, [{active, once}]),
    {ok, P};
%% Data from the server: forward to the client.
transmit_loop({tcp, Sock, Data}, P) when P#transdat.server == Sock ->
    gen_tcp:send(P#transdat.client, Data),
    inet:setopts(P#transdat.server, [{active, once}]),
    {ok, P};
%% Initial connect to the backend server.
transmit_loop({start, ServerPort}, P) ->
    {ok, Sock} = gen_tcp:connect("127.0.0.1", ServerPort,
                                 [binary, {active, once}, {packet, 0}]),
    {ok, P#transdat{server = Sock}};
transmit_loop({tcp_closed, _}, _) ->
    exit(stop).

The proxy is eating more CPU time than the streaming server.
Is this normal behavior? The server is running OS X 10.4.


Sergej

Edwin Fine

Jun 19, 2008, 1:39:47 PM
to Rapsey, erlang-q...@erlang.org
How large is each packet? Can multiple packets fit into one TCP window? Have you looked at the wire-level data with Wireshark/Ethereal to see whether the packets are being combined at the TCP level? If you see only one packet per TCP frame (assuming a packet is much smaller than the window size), you might be falling foul of the Nagle congestion algorithm; the fact that manually buffering your packets improves performance suggests this may be the case. Under Nagle, send/send/send is OK, receive/receive/receive is OK, and even send/receive/send/receive is OK, but you get into trouble if sends and receives are mixed asymmetrically on the same socket (e.g. send, send, receive).

Also, I don't understand your transmit_loop. Where is it looping (or am I misunderstanding something)?

From what I have seen, people writing Erlang TCP/IP code do an {active, once} receive, and when getting the first packet, drop into another loop that does a passive receive until there's no data waiting, then go back into the {active, once} receive. Are you doing this? I am not sure, but I fear that if all your receives are {active, once} it will incur more CPU overhead than the active/passive split. It's hard to know because I can't see enough of your code to know what you are doing overall. Disclaimer: I'm no Erlang or TCP/IP expert.
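The hybrid pattern described above might be sketched like this (illustrative names, not code from the thread). After an {active, once} message has been delivered the socket is back in passive mode, so gen_tcp:recv/3 with a zero timeout can poll for anything already buffered before re-arming:

```erlang
-module(hybrid_rx).
-export([recv_loop/2]).

%% Sketch: handle one message in {active, once} mode, then drain any
%% data the driver has already buffered with non-blocking passive
%% recvs, and only then re-arm {active, once}.
recv_loop(Sock, Handle) ->
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, Data} ->
            Handle(Data),
            drain(Sock, Handle),
            recv_loop(Sock, Handle);
        {tcp_closed, Sock} ->
            closed
    end.

%% After the {active, once} message fires, the socket is passive
%% again, so gen_tcp:recv/3 with timeout 0 just polls.
drain(Sock, Handle) ->
    case gen_tcp:recv(Sock, 0, 0) of
        {ok, Data}       -> Handle(Data), drain(Sock, Handle);
        {error, timeout} -> ok;
        {error, closed}  -> ok
    end.
```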

Hope this helps.

2008/6/19 Rapsey <rap...@gmail.com>:
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions

Rapsey

Jun 19, 2008, 2:06:13 PM
to erlang-q...@erlang.org
It loops from another module; that way I can update the code at any time without disrupting anything.
The packets are generally a few hundred bytes in size, except keyframes, which tend to be in the kB range. I haven't tried looking with Wireshark. Still, it seems a bit odd that high CPU consumption would be the symptom. The traffic is strictly one way: either someone is sending the stream or receiving it.
The transmit loop could of course be written with a passive receive, but the code would be significantly uglier. I'm sure someone here knows whether setting {active, once} for every packet is CPU-intensive or not.
It seems the workings of gen_tcp are quite platform-dependent. If I run the code on Windows, sending more than 128 bytes per gen_tcp call significantly decreases network output.
Oh, and I forgot to mention that I'm using R12B-3.
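The "loops from another module" arrangement typically has roughly this shape (an illustrative sketch with assumed names, not Sergej's actual code); the fully qualified call is what makes live code upgrade possible:

```erlang
-module(driver).
-export([loop/2]).

%% Because Mod:transmit_loop/2 is a fully qualified call, each new
%% message is handled by the newest loaded version of Mod -- which is
%% what allows updating the code without disrupting the stream.
loop(Mod, State) ->
    receive
        Msg ->
            {ok, NewState} = Mod:transmit_loop(Msg, State),
            loop(Mod, NewState)
    end.
```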


Sergej

Rapsey

Jun 20, 2008, 12:35:43 AM
to erlang-q...@erlang.org
All data goes through nginx which acts as a proxy. Its CPU consumption is never over 1%.


Sergej

On Thu, Jun 19, 2008 at 9:35 PM, Javier París Fernández <javie...@udc.es> wrote:


Hi,

Without being an expert:

200-300 mb/s in small (hundreds-of-bytes) packets means a *lot* of system calls if you are doing a gen_tcp:send for each one. If you buffer 3 packets, you are reducing that by a factor of 3 :). I'd try a small test doing the same thing in C and compare the results; I think it will also eat a lot of CPU.

About the proxy CPU usage, I'm a bit lost, but speculating wildly: it is possible that the time spent in the system calls that gen_tcp makes is being accounted to the proxy process.

Regards.

Edwin Fine

Jun 20, 2008, 2:24:05 AM
to Rapsey, erlang-q...@erlang.org
Which Erlang command-line options are you using? Specifically, are you using +K true and the +A flag? Does OS X support kernel poll (+K true)? I saw benchmarks where CPU usage without kernel poll was high (60-80%), and with it was much lower (5-10%).

I wouldn't necessarily agree that "the workings of gen_tcp are quite platform-dependent." I would rather guess that TCP/IP stacks, and TCP/IP parameters, are very different across operating systems, and the default values are often not even close to optimal. There are numerous registry tweaks to improve Windows TCP/IP performance, for example. I am surprised that on Windows you are forced to send only 128 bytes at a time or face lower performance; that seems odd indeed. I would look at the default buffer sizes and the registry tweaks findable on Google, and then experiment.

I was able to improve the performance of an application I am working on from 3 msgs/sec to 70 msgs/sec simply by spawning the function (which does the gen_tcp:send) that was previously being called sequentially. This was because TCP/IP could then pack multiple packets into the same frame, where previously each frame carried only one packet. The RTT of the link was dreadful (290 ms), so this was a bit of a special case, but I think the principle remains the same: transmitting data in fewer packets means fewer system calls, better utilization of available frame space, and less CPU. Plus, using +K true and perhaps +A 128 should improve things.

Give it a try (if you haven't already) and see if it improves things. Also take a look, if you will, at "Boost socket performance on Linux", which has some interesting information on this topic.


Rapsey

Jun 20, 2008, 3:52:50 AM
to erlang-q...@erlang.org
I have kernel poll enabled (OS X supports it). I don't have async threads enabled, because I don't know what they do. I'll try it.


Sergej

Adam Kelly

Jun 20, 2008, 6:04:13 AM
to Rapsey, erlang-q...@erlang.org
2008/6/19 Rapsey <rap...@gmail.com>:

> The transmit could of course be written with a passive receive, but the code
> would be significantly uglier. I'm sure someone here knows if setting
> {active, once} every packet is CPU intensive or not.

Is the streaming process building up a message queue (a backlog of messages waiting to be sent)? gen_tcp:send does a selective receive in order to acknowledge that the packet has been sent, so if the process has a message queue it has to do a linear search of that queue on each send, and the streaming becomes O(N^2).

Adam.
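One common mitigation for the effect Adam describes, sketched here with made-up names (it is not proposed in the thread), is to do the sends from a dedicated process, so the selective receive inside gen_tcp:send only ever scans a short queue of pending send requests rather than the streaming process's full mailbox:

```erlang
-module(sender).
-export([start/1, send/2]).

%% Illustrative: gen_tcp:send/2 may be called from any process, not
%% just the socket owner, so a dedicated sender keeps the selective
%% receive cheap -- its queue holds nothing but send requests.
start(Sock) ->
    spawn_link(fun() -> loop(Sock) end).

send(Pid, Data) ->
    Pid ! {send, Data},
    ok.

loop(Sock) ->
    receive
        {send, Data} ->
            ok = gen_tcp:send(Sock, Data),
            loop(Sock)
    end.
```

Note this only helps if the sender's own queue stays short, i.e. the producer does not persistently outpace the network.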

Rapsey

Jun 20, 2008, 6:38:55 AM
to erlang-q...@erlang.org
The proxy is not building up a queue, because {active, once} is set only after each packet has been sent on to the other side.
The streaming server keeps a close eye on its message queue and does not allow it to grow by much.

Bjorn Gustavsson

Jun 23, 2008, 5:55:24 AM
to erlang-q...@erlang.org
"Edwin Fine" <erlang-ques...@usa.net> writes:

> Plus using -K true and
> perhaps +A 128 should improve things.
>

Async threads (+A 128) can only speed up file operations (the inet driver
does not use the async thread pool).

/Bjorn
--
Björn Gustavsson, Erlang/OTP, Ericsson AB

Edwin Fine

Jun 23, 2008, 11:12:48 AM
to Bjorn Gustavsson, erlang-q...@erlang.org
That's good to know. It was given as a general recommendation to speed up communications (I forget where). Thanks.


Edwin Fine

Jun 24, 2008, 2:43:05 PM
to Rapsey, erlang-q...@erlang.org
I wrote a small benchmark in Erlang to see how fast I could get socket communications to go. All the benchmark does is pump the same buffer to a socket for (by default) 10 seconds. It uses {active, once} each time, just like you do.

Server TCP options:

    {active, once},
    {reuseaddr, true},
    {packet, 0},
    {packet_size, 65536},
    {recbuf, 1000000}

Client TCP options:

    {packet, raw},
    {packet_size, 65536},
    {sndbuf, 1024 * 1024},
    {send_timeout, 3000}

Here are some results using Erlang R12B-3 (erl +K true in the Linux version):

Linux (Ubuntu 8.10 x86_64, Intel Core 2 Q6600, 8 GB):
- Using localhost (127.0.0.1): 7474.14 MB in 10.01 secs (746.66 MB/sec)
- Using 192.168.x.x IP address: 8064.94 MB in 10.00 secs (806.22 MB/sec) [Don't ask me why it's faster than using loopback, I repeated the tests and got the same result]

Windows XP SP3 (32 bits), Intel Core 2 Duo E6600:
- Using loopback: 2166.97 MB in 10.02 secs (216.35 MB/sec)
- Using 192.168.x.x IP address: 2140.72 MB in 10.02 secs (213.75 MB/sec)
- On Gigabit Ethernet to the Q6600 Linux box: 1063.61 MB in 10.02 secs (106.17 MB/sec) using non-jumbo frames. I don't think my router supports jumbo frames.

There's undoubtedly a huge discrepancy between the two systems; whether it's because of kernel poll in Linux, because Linux is running 64-bit, or because of unoptimized Windows TCP/IP settings, I don't know. I don't believe it's the number of CPUs (there's only one process sending and one receiving), or the CPU speed (they are both 2.4 GHz Core 2s).

Maybe some Erlang TCP/IP gurus could comment.

I've attached the code for those interested. It's not supposed to be production quality, so please don't beat me up :) although I am always open to suggestions for improvement. If you do improve it, I'd like to see what you've done. Maybe there is another simple Erlang TCP benchmark program out there (i.e. not Tsung), but I couldn't find one in a cursory Google search.

To run:

VM1:

tb_server:start(Port, Opts).
tb_server:stop() to stop.

Port = integer()
Opts = []|[opt()]
opt() = {atom(), term()} (Accepts inet setopts options, too)

The server prints out the transfer rate (for simplicity).

VM2:
tb_client(Host, Port, Opts).

Host = atom()|string() hostname or IP address
Port, Opts as in tb_server

Runs for 10 seconds, sending a 64K buffer as fast as possible to Host/Port.
You can change this to 20 seconds (for example) by adding the tuple {time_limit, 20000} to Opts.
You can change buffer size by adding the tuple {blksize, Bytes} to Opts.

tcp_bench.tgz

Rapsey

Jun 24, 2008, 3:00:42 PM
to erlang-q...@erlang.org
You're using very large packets. I think the results would be much more telling if the packets were a few kB at most; that is closer to most real-life situations.


Sergej

David Mercer

Jun 24, 2008, 4:02:11 PM
to erlang-q...@erlang.org

I tried some alternative block sizes (using the blksize option). I found that from 1 up to somewhere around (maybe a bit short of) 1000 bytes, the test was able to send about 300,000 blocks in 10 seconds regardless of size. (That means 0.03 MB/sec for a block size of 1, 0.3 MB/sec for a block size of 10, 3 MB/sec for a block size of 100, etc.) I suspect the system was CPU-bound at those levels.

Above 1000, the number of blocks sent seemed to decrease, though this was more than offset by the increased size of the blocks. Above about 10,000-byte blocks (it may have been less; I didn't check any value between 4,000 and 10,000), performance peaked and block size no longer mattered: it always sent between 70 and 80 MB/sec. My machine is clearly slower than Edwin's…

DBM


Edwin Fine

Jun 24, 2008, 5:16:18 PM
to dme...@alum.mit.edu, erlang-q...@erlang.org
David,

Thanks for trying out the benchmark.

With my limited knowledge of TCP/IP, I believe you are seeing the 300,000 limit because TCP/IP requires acknowledgements to each packet, and although it can batch up multiple acknowledgements in one packet, there is a theoretical limit of packets per second beyond which it cannot go, due to the laws of physics. I understand that limit is determined by the Round-Trip Time (RTT), which can be shown by ping. On my system, pinging 127.0.0.1 gives a minimum RTT of 0.018 ms (out of 16 pings). That means that the maximum number of packets that can make it to the destination and back per second is 1/0.000018, or about 55555 packets per second. The TCP/IP stack is evidently packing 5 or 6 blocks into each packet to get the 300K blocks/sec you are seeing; using Wireshark or Ethereal would confirm this. I am guessing that this means the TCP window is about 6 * 1000 bytes, or 6 KB.

What I neglected to tell this group is that I have modified the Linux sysctl.conf as follows, which might have had an effect (like I said, I am not an expert):

# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB, or higher if you use very high BDP paths
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 32768 16777216

When I have more time, I will vary a number of different Erlang TCP/IP parameters and get a data set together that gives a broader picture of the effect of the parameters.

Thanks again for taking the time.


Edwin Fine

Jun 24, 2008, 6:55:44 PM
to Johnny Billquist, erlang-q...@erlang.org
Johnny,

Thanks for the lesson! I am always happy to learn. Like I said, I am not an expert in TCP/IP.

What I was writing about when I said that packets are acknowledged is what I saw in Wireshark while trying to understand performance issues. I perhaps should have said "TCP/IP" instead of just "TCP". There were definitely acknowledgements, but I guess they were at the IP level.

I wonder what the MSS is for loopback? I think it's about 1536 on my eth0 interface, but not sure.

As for RTT, I sent data over a link that had a very long (290 ms) RTT, and that definitely limited the rate at which packets could be sent. Can RTT be used to calculate the theoretical maximum traffic that a link can carry? For example, a satellite link with a 400 ms RTT but 2 Mbps bandwidth?

Ed

On Tue, Jun 24, 2008 at 6:00 PM, Johnny Billquist <b...@softjar.se> wrote:
No, TCP doesn't acknowledge every packet. In fact, TCP doesn't acknowledge packets as such at all: TCP is not packet-based. It's just that if you use IP as the carrier, IP itself is packet-based.
TCP can in theory generate any number of packets per second. However, the amount of unacknowledged data that can be outstanding at any time is limited by the transmit window. Each packet carries a window size, which is how much more data can be accepted by the receiver. TCP can (is allowed to) send that much data and no more.

The RTT calculations are used for figuring out how long to wait before doing retransmissions. You also normally have a slow-start transmission algorithm, which prevents the sender from even using the full window size from the start, as a way of avoiding congestion. That is used in combination with a backoff algorithm, when retransmissions are needed, to further decrease congestion; but all of this only really comes into effect if you start losing data and TCP actually needs to do retransmissions.

Another thing you have is the Nagle algorithm, which tries to collect small amounts of sent data into larger packets before sending, so that you don't flood the net with silly small packets.

One additional detail: when its receive buffer becomes full, a receiver normally doesn't announce newly freed space immediately, since that is usually a rather small amount; instead it waits until a larger part of the receive buffer is free, so that the sender can send some full-sized packets once it starts sending again.

In addition to all this, you also have a maximum segment size, negotiated between the TCP endpoints, which limits the size of a single IP packet sent by the TCP protocol. This is done to try to avoid packet fragmentation.

So the window size is actually a flow-control mechanism, and in reality it limits the amount of data that can be sent. And it varies all the time. And the number of packets used to send that much data is determined by the MSS (Maximum Segment Size).

Sorry for the long text on how TCP works. :-)

       Johnny



Johnny Billquist

Jun 24, 2008, 7:15:15 PM
to Edwin Fine, erlang-q...@erlang.org
Edwin, always happy to help out...

Edwin Fine wrote:


> Johnny,
>
> Thanks for the lesson! I am always happy to learn. Like I said, I am not an
> expert in TCP/IP.
>
> What I was writing about when I said that packets are acknowledged is what I
> saw in Wireshark while trying to understand performance issues. I perhaps
> should have said "TCP/IP" instead of just "TCP". There were definitely
> acknowledgements, but I guess they were at the IP level.

No. IP doesn't have any acknowledgements. IP (as well as UDP) is basically just sending packets, without any guarantee that they will ever reach the other end. What you saw were TCP acknowledgements, but you misunderstood how they work.

Think of a TCP connection as a byte stream of unbounded length. Each byte in this stream has a sequence number. TCP sends bytes from this stream, packed into IP packets; each IP packet carries one or more bytes from the stream. TCP at the other end acknowledges the highest-numbered byte it has received. How many packets it took to get to that byte is irrelevant, as are any retransmissions, and so on. The window size tells how many additional bytes from this stream can be sent, counted onward from the point the acknowledgement refers to.

(In reality, the sequence numbers are not infinite but are actually a 32-bit number, which wraps. But since window sizes normally fit in a 16-bit quantity, there is no chance of getting back to the same sequence number again before it has long been passed, so no risk of confusion or errors there.)

> I wonder what the MSS is for loopback? I think it's about 1536 on my eth0
> interface, but not sure.

Smart implementations use the interface's MTU minus 40 as the MSS for a loopback connection. Otherwise the rule of thumb is that MSS is usually set to 1460 for destinations on the same network and 536 for destinations on other networks. This comes from the fact that the local network (usually Ethernet) has an MTU of 1500, and the IP header is normally 20 bytes, as is the standard TCP header, leaving 1460 bytes of data in an Ethernet frame.
For non-local destinations, IP requires that at least 576-byte packets can go through unfragmented. The rest follows. :-)

> As for RTT, I sent data over a link that had a very long (290ms) RTT, and
> that definitely limited the rate at which packets could be sent. Can RTT be
> used to calculate the theoretical maximum traffic that a link can carry?
> For example, a satellite link with a 400ms RTT but 2 Mbps bandwidth?

No. RTT cannot be used to calculate anything regarding traffic bandwidth. You can keep sending packets until the window is exhausted, no matter what the RTT says. The RTT is only used to calculate when to do retransmissions if you haven't received an ACK.
The only other thing that affects packet rates is the slow-start algorithm. That is affected by round-trip delays, since it adds a throttling effect on the window in addition to what the receiver says; the reason is that the slow-start window size is only increased when you get ACK packets back.
But assuming the link can take the load, and you don't lose a lot of packets, the slow-start algorithm will pretty quickly stop being a factor.

Johnny


--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: b...@softjar.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol

Edwin Fine

Jun 24, 2008, 8:38:15 PM
to Johnny Billquist, erlang-q...@erlang.org
OK Johnny, I think I get it now. Thanks for the detailed explanation. I wonder why the original poster (Sergej/Rapsey) is seeing such poor TCP/IP performance. In any case, I am still going to do some more benchmarks to see if I can understand how the different components of TCP/IP communication in Erlang (inet:setopts() and gen_tcp) affect performance, CPU overhead, and so on.

The reason I got into all this is that I was seeing very good performance between two systems on a LAN, and terrible performance over a non-local overseas link that had an RTT of about 290 ms. Through various measurements and Wireshark usage, I found the link was carrying only 3.4 packets per second, with only about 56 data bytes in each packet. When I investigated further, I found that a function I thought was running asynchronously was actually running synchronously, inside a gen_server:call(). When I spawned the function, I still saw only 3.4 packets per second (using Wireshark timestamps), but each packet was now full of multiple blocks of data rather than just 56 bytes, so the actual throughput went up hugely. Nothing else changed. When I tried to work out where the 3.4 was coming from, I calculated 1/3.4 = 0.294 s, which was (coincidentally?) the exact RTT. That's why I thought there was a relationship between RTT and the number of packets/second a link could carry.

Now I have to go back and try to figure it all out all over again :( unless you can explain it to me (he said hopefully).

Thanks
Ed

Per Hedeland

unread,
Jun 25, 2008, 2:51:27 AM6/25/08
to b...@softjar.se, erlang-q...@erlang.org
Johnny Billquist <b...@softjar.se> wrote:
>
>No. RTT can not be used to calculate anything regarding traffic bandwidth.
>You can keep sending packets until the window is exhausted, no matter what the
>RTT says. The RTT is only used to calculate when to do retransmissions if you
>haven't received an ACK.

Well, yes and no - RTT by itself cannot be used to calculate bandwidth,
and TCP itself doesn't need to "know" the bandwidth anyway, but the
possible throughput is dependent on RTT: Since you can have at most one
window size of un-ack'ed data outstanding, and data can't be ack'ed
until it's been received:-), the throughput is bounded by the ratio of
(max) window size to RTT. With only 16 bits of window size available and
an RTT of 300 ms, the theoretical max throughput is 65535/0.3 bytes/s or
~ 1.75 Mbit/s.
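[That bound is easy to reproduce; a quick sketch of the same calculation:]

```python
# Throughput ceiling imposed by an unscaled 16-bit TCP window:
# at most one window of un-acked data can be outstanding per round trip.
window = 65535                    # bytes, maximum unscaled window
rtt = 0.3                         # seconds
max_bytes_per_s = window / rtt
max_mbit_per_s = max_bytes_per_s * 8 / 1e6
print(round(max_mbit_per_s, 2))   # ~1.75 Mbit/s
```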

Of course this problem, a.k.a. "long fat pipe", was solved long ago as
far as TCP is concerned - enter window scaling (RFC 1323), which allows
for the 16 bit window size to have a unit of anything from 1 to
(theoretically) 2^255 bytes. These days it should also actually work
most everywhere. Nevertheless, the max window size is under the control
of the TCP "user", and if the kernel and/or the application limits the
size of the receive buffer to something less than 64kB, window scaling
can't help.
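[To put numbers on the "long fat pipe" point: the window has to cover the bandwidth-delay product for the pipe to stay full. A minimal sketch, with an illustrative 100 Mbit/s link (the link speed is an assumption, not a figure from the thread):]

```python
import math

# Bandwidth-delay product: bytes that must be "in flight" to fill the pipe.
link_mbit = 100                   # illustrative link speed, Mbit/s
rtt = 0.3                         # seconds
bdp_bytes = link_mbit * 1e6 / 8 * rtt
print(int(bdp_bytes))             # 3750000 bytes, far beyond 65535

# Smallest RFC 1323 window-scale shift that covers the BDP:
shift = math.ceil(math.log2(bdp_bytes / 65535))
print(shift)                      # 6 (window unit becomes 2**6 = 64 bytes)
```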

Whether this is Edwin's problem I don't know - the "fixed packet rate"
observation may actually be more or less correct: As you explained, TCP
doesn't ack packets, it acks bytes - but the actual *sending* of acks is
definitely related to the reception of packets (or "segments" if you
prefer), in particular in a one-way data transfer where there are no
outgoing data packets that can have acks "piggy-backed". The details may
vary, but in general in such a case an ack is sent for every other
packet received, or after a ("long" - 200 ms) timeout if no packets are
received.

--Per Hedeland

Johnny Billquist

unread,
Jun 25, 2008, 3:44:04 AM6/25/08
to Edwin Fine, erlang-q...@erlang.org
Hi,
no, unfortunately I don't have an answer for their observed performance.
Although I haven't really looked at it either. Time, you know... :-)
But keep testing and try to figure it out. You always learn something, and
hopefully you'll find the answer as well.

Johnny

Edwin Fine skrev:

Johnny Billquist

unread,
Jun 25, 2008, 4:01:33 AM6/25/08
to Per Hedeland, erlang-q...@erlang.org
Per Hedeland skrev:

> Johnny Billquist <b...@softjar.se> wrote:
>> No. RTT can not be used to calculate anything regarding traffic bandwidth.
>> You can keep sending packets until the window is exhausted, no matter what the
>> RTT says. The RTT is only used to calculate when to do retransmissions if you
>> haven't received an ACK.
>
> Well, yes and no - RTT by itself cannot be used to calculate bandwidth,
> and TCP itself doesn't need to "know" the bandwidth anyway, but the
> possible throughput is dependant on RTT: Since you can have at most one
> window size of un-ack'ed data outstanding, and data can't be ack'ed
> until it's been received:-), the throughput is bounded by the ratio of
> (max) window size to RTT. With only 16 bits of window size available and
> an RTT of 300 ms, the theoretical max throughput is 65535/0.3 bytes/s or
> ~ 1.75 Mbit/s.

Yes. If you manage to fill the whole window, then the bandwidth starts to be
reduced from the theoretical max of the media, and then RTT is related to how
much your BW is reduced. And of course, your "ability" to fill the window is
related to RTT as well.

> Of course this problem, a.k.a. "long fat pipe", was solved long ago as
> far as TCP is concerned - enter window scaling (RFC 1323), which allows
> for the 16 bit window size to have a unit of anything from 1 to
> (theoretically) 2^255 bytes. These days it should also actually work
> most everywhere. Nevertheless, the max window size is under the control
> of the TCP "user", and if the kernel and/or the application limits the
> size of the receive buffer to something less than 64kB, window scaling
> can't help.
>
> Whether this is Edwin's problem I don't know - the "fixed packet rate"
> observation may actually be more or less correct: As you explained, TCP
> doesn't ack packets, it acks bytes - but the actual *sending* of acks is
> definitely related to the reception of packets (or "segments" if you
> prefer), in particular in a one-way data transfer where there are no
> outgoing data packets that can have acks "piggy-backed". The details may
> vary, but in general in such a case an ack is sent for every other
> packet received, or after a ("long" - 200 ms) timeout if no packets are
> received.

Hmm. Well, a tcpdump will quickly tell if the window size is zero, and he's
hitting the RTT-limiting factor. But unless the receiver is acting funny, it
should only announce a new window size when there is substantial space available
in the receive buffer, at which time the sender should send large packets, and
not lots of small ones.

Oh well, someone needs to look into this a bit more obviously. I haven't even
properly looked at the problem description. I just thought I'd point out some
wrong assumptions on how TCP works. :-)

Johnny

--

Per Hedeland

unread,
Jun 25, 2008, 4:24:48 PM6/25/08
to b...@softjar.se, erlang-q...@erlang.org
Johnny Billquist <b...@softjar.se> wrote:
>
>Per Hedeland skrev:

>>
>> Whether this is Edwin's problem I don't know - the "fixed packet rate"
>> observation may actually be more or less correct: As you explained, TCP
>> doesn't ack packets, it acks bytes - but the actual *sending* of acks is
>> definitely related to the reception of packets (or "segments" if you
>> prefer), in particular in a one-way data transfer where there are no
>> outgoing data packets that can have acks "piggy-backed". The details may
>> vary, but in general in such a case an ack is sent for every other
>> packet received, or after a ("long" - 200 ms) timeout if no packets are
>> received.
>
>Hmm. Well, a tcpdump will quickly tell if the window size is zero, and he's
>hitting the RTT-limiting factor.

Hmm hmm, I think I wasn't thinking straight when I wrote the above - I
was speculating that this could be something other than the RTT-limiting
factor (otherwise small vs big packets shouldn't make a difference in
throughput), along the lines of "the receiver has data to ack but is
delaying the sending of the ack until he receives another segment". But
this can't be the explanation I believe - if you're sending smaller
packets, you can send that many more into a given window before you need
to wait for an ack, so there are more "opportunities" for the receiver to
send a delayed ack.
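[The "more packets per window" point can be made concrete; the segment sizes below are illustrative (a typical Ethernet MSS, the classic default MSS, and the ~56-byte packets seen earlier in the thread):]

```python
# Smaller segments mean more packets fit into one window, hence more
# chances for the receiver to generate a (delayed) ack before the
# sender stalls waiting for window space.
window = 65535
for seg in (1460, 536, 56):       # bytes per segment
    print(seg, window // seg)     # segments that fit in one window
```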

>Oh well, someone needs to look into this a bit more obviously. I haven't even
>properly looked at the problem description. I just thought I'd point out some
>wrong assumptions on how TCP works. :-)

Agreed!

--Per

Johnny Billquist

unread,
Jun 26, 2008, 7:38:48 AM6/26/08
to Per Hedeland, erlang-q...@erlang.org
Per Hedeland wrote:
> Johnny Billquist <b...@softjar.se> wrote:
>> Per Hedeland skrev:
>>> Whether this is Edwin's problem I don't know - the "fixed packet rate"
>>> observation may actually be more or less correct: As you explained, TCP
>>> doesn't ack packets, it acks bytes - but the actual *sending* of acks is
>>> definitely related to the reception of packets (or "segments" if you
>>> prefer), in particular in a one-way data transfer where there are no
>>> outgoing data packets that can have acks "piggy-backed". The details may
>>> vary, but in general in such a case an ack is sent for every other
>>> packet received, or after a ("long" - 200 ms) timeout if no packets are
>>> received.
>> Hmm. Well, a tcpdump will quickly tell if the window size is zero, and he's
>> hitting the RTT-limiting factor.
>
> Hmm hmm, I think I wasn't thinking straight when I wrote the above - I
> was speculating that this could be something other than the RTT-limiting
> factor (otherwise small vs big packets shouldn't make a difference in
> throughput), along the lines of "the receiver has data to ack but is
> delaying the sending of the ack until he recieves another segment". But
> this can't be the explanation I believe - if you're sending smaller
> packets, you can send that many more into a given window before you need
> to wait for an ack, so there are more "opprtunities" for the reciever to
> send a delayed ack.

Yes.
However, one thing to also keep an eye out for is if a lot of small
packets are generated, and the receiver can't accept packets at that
pace. Then you'll experience lost packets, followed by retransmits. And
this is definitely related to the RTT.
This is especially a problem if you have a transmitter sitting with an
ethernet interface running at a higher speed than the receiver ethernet.
Not totally unheard of if you have a switch in between...

But this situation will be very visible in the tcp counters on the system.

Does the Erlang system turn off the Nagle algorithm, or force TCP to push
packets? Otherwise the underlying tcp layer should not be sending small
packets at a high pace.

Johnny

Per Hedeland

unread,
Jun 26, 2008, 8:42:09 AM6/26/08
to b...@softjar.se, erlang-q...@erlang.org
Johnny Billquist <b...@softjar.se> wrote:
>
>Does the Erlang system turn off the Nagle algorithm, or force TCP to push
>packets?

No - i.e. gen_tcp/inet_drv doesn't (you can turn off Nagle with the
'nodelay' option).

> Otherwise the underlying tcp layer should not be sending small
>packets at a high pace.

True. This wasn't a "high pace" though (3-4 packets/s), but of course
one has to wonder why this was happening in the first place - possibly
the "messages" to be sent were just "naturally" spaced far enough apart
to defeat Nagle.
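[For reference, disabling Nagle is a single socket option on any BSD-style socket API; shown here in Python for illustration, since the Erlang side is just the 'nodelay' option mentioned above:]

```python
import socket

# Disable the Nagle algorithm: small writes are then sent immediately
# instead of being coalesced while waiting for outstanding acks.
# (In Erlang's gen_tcp this corresponds to the {nodelay, true} option.)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)  # True
s.close()
```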

--Per
