Re: [erlang-questions] How to handle a massive amount of UDP packets?

Valentin Nechayev

unread,

Apr 22, 2012, 4:26:00 PM4/22/12

to erlang-q...@erlang.org

Using {active,once} on socket type other than stream one principally can't
give performance compared to uncontrolled stream, due to asynchronous manner
of sending messages to owner's mailbox and its processing.
OTOH, {active,once} is shorthand for "send me one next portion" and this "one"
can be extended to more numbers; e.g. {active,100} for "send me 100 next
packets and then stay and wait for next permit". This combines advantages
of both modes ('once' and 'true').

-netch-
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Jon Watte

unread,

Apr 22, 2012, 5:26:23 PM4/22/12

to John-Paul Bader, erlang-q...@erlang.org

There are two options I would consider in this situation, both of which use synchronous reads on the socket:

1) A single process does recv() on the socket (synchronously) and pitches the received data to one of many worker processes (assuming work processing > receive processing.)

2) Each worker process blocks on recv() on the socket, and the OS will give each incoming packet to some arbitrary process that is currently blocked on the socket. This is nice because it "magically" load balances.

I'd probably go for 2) first, and only look at 1) if I run into surprising lock contention issues between workers.

Sincerely,

Jon Watte

--
"I pledge allegiance to the flag of the United States of America, and to the republic for which it stands, one nation indivisible, with liberty and justice for all."
~ Adopted by U.S. Congress, June 22, 1942

On Sun, Apr 15, 2012 at 11:08 AM, John-Paul Bader <hu...@berlin.ccc.de> wrote:

Dear list,

I'm currently writing a bittorrent tracker in Erlang. While a naive implementation of the protocol is quite easy, there are some performance related challanges where I could use some help.

In the first test run as a replacement for a very popular tracker, my erlang tracker got about 40k requests per second.

My initial approach was to initialize the socket in one process with {active, once}, handle the message in handle_info with minimal effort and pass the data asynchronously to a freshly spawned worker processes which responds to the clients. After spawning the process I'm setting the socket back to {active, once}.

Now when I switched the erlang tracker live the erlang vm was topping at 100% CPU load. My guess is that the process handling the udp packets from the socket could not keep up. Since I'm still quite new to the world of erlang I'd like to know if there are some best practices / patterns to handle this massive amount of packets.

For example using the socket in {active, once} might be too slow? Also the response to the clients needs to come from the same port as the request was coming in. Is it a problem to use the same socket for that? Should I pre-spawn a couple of thousand workers and dispatch the data from the socket to them rather than spawning them on each packet?

It would be really great if you could give some advice or point me into the right directions.

~ John

Ulf Wiger

unread,

Apr 22, 2012, 5:30:05 PM4/22/12

to Valentin Nechayev, erlang-q...@erlang.org

What one can do is to combine {active, once} with gen_tcp:recv().

Essentially, you will be served the first message, then read as many as you
wish from the socket. When the socket is empty, you can again enable
{active, once}.

BR,
Ulf W

On 22 Apr 2012, at 22:26, Valentin Nechayev wrote:

> Using {active,once} on socket type other than stream one principally can't
> give performance compared to uncontrolled stream, due to asynchronous manner
> of sending messages to owner's mailbox and its processing.
> OTOH, {active,once} is shorthand for "send me one next portion" and this "one"
> can be extended to more numbers; e.g. {active,100} for "send me 100 next
> packets and then stay and wait for next permit". This combines advantages
> of both modes ('once' and 'true').

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
http://feuerlabs.com

Valentin Nechayev

unread,

Apr 23, 2012, 2:25:15 AM4/23/12

to Ulf Wiger, erlang-q...@erlang.org

> From: Ulf Wiger <u...@feuerlabs.com>

> What one can do is to combine {active, once} with gen_tcp:recv().
>
> Essentially, you will be served the first message, then read as many as you
> wish from the socket. When the socket is empty, you can again enable
> {active, once}.

First, the approach you described is quite badly documented. No
description how such non-waiting recv() can be reached. If this is call
with Timeout=0, type timeout() isn't defined, and return value for
timeout isn't defined. It only defines Reason = closed or
inet:posix(). But it's incorrect to guess that eagain (or ewouldblock?)
will be returned, if the implementing code is uniform against timeout
value except infinity. I dislike to use such undocumented ways.

Second, your approach gives useless process switches. If a long message
is in receiving via TCP, there will be two switches to owner or more -
the first one for the first part of a message, and some next ones for
rest of it. If incoming rate is enough to process each small portion
(TCP window) separately, owner process will get and process them
separately; if its and system speed isn't enough for such switching,
data will group in larger portions. This means that performance
measuring will be total lie, with three intervals - uselessly quick
saturation, then stable 100% under wide load interval, and then
unexpected overloading. It's very hard to diagnose and optimize a
system with such behavior, and this trend to fill the whole system by
one subsystem affects other concurrent subsystems in bad way.

People invented many mechanisms of avoiding both uselessly fast
switching and non-reasonable delays - see e.g. VMIN and VTIME in
termios, low matermark in BSD sockets. The Max Lapshin's proposition is
among them and should only get small but major extension - to specify
both full limit and inter-portion timeout.

Third, please see measures by John-Paul Bader in neighbour message:
with {active,false} he gets substantial packet loss, compared to
{active,true}. Yep, this is UDP specifics and nobody guaranteed the
delivery but there is no reason to increase loss without reason. His
result shall be checked against the real reason but I guess these are
socket buffer overflows. With {active,true}, owner mailbox becomes
additional socket buffer with much larger size, but owner process loses
control on its mailbox. Having window of allowed packets, it can
provide more fine tuning of its load.

-netch-

Ulf Wiger

unread,

Apr 23, 2012, 3:56:43 AM4/23/12

to Valentin Nechayev, erlang-q...@erlang.org

On 23 Apr 2012, at 08:25, Valentin Nechayev wrote:

>> From: Ulf Wiger <u...@feuerlabs.com>
>
>> What one can do is to combine {active, once} with gen_tcp:recv().
>>
>> Essentially, you will be served the first message, then read as many as you
>> wish from the socket. When the socket is empty, you can again enable
>> {active, once}.
>
> First, the approach you described is quite badly documented. No
> description how such non-waiting recv() can be reached. If this is call
> with Timeout=0, type timeout() isn't defined, and return value for
> timeout isn't defined. It only defines Reason = closed or
> inet:posix(). But it's incorrect to guess that eagain (or ewouldblock?)
> will be returned, if the implementing code is uniform against timeout
> value except infinity. I dislike to use such undocumented ways.

Huh?

Well, first of all, I wrote gen_tcp:recv() - apologies for that.

I agree that the documentation should say that {error, timeout} is one
of the possible return values, but this is a small oversight - feel free to
submit a patch. It is by no means an undocumented or unsupported
behavior. gen_[tcp|udp]:recv() is what you use when you have {active,false}.

type timeout() _is_ defined. It's just that the gen_tcp/gen_udp manual
doesn't tell you where to find it (I agree this is annoying, but if we're
discussing optimal tuning of live systems, perhaps we can agree that we
shouldn't let bugs in the documentation limit our options?)

Actually, timeout() is a built-in type:

timeout() :: 'infinity' | non_neg_integer()
non_neg_integer() :: 0..

It's documented in the Reference Manual, chapter 6.2
http://www.erlang.org/doc/reference_manual/typespec.html#id74831

> Second, your approach gives useless process switches. If a long message
> is in receiving via TCP, there will be two switches to owner or more -
> the first one for the first part of a message, and some next ones for
> rest of it. If incoming rate is enough to process each small portion
> (TCP window) separately, owner process will get and process them
> separately; if its and system speed isn't enough for such switching,
> data will group in larger portions. This means that performance
> measuring will be total lie, with three intervals - uselessly quick
> saturation, then stable 100% under wide load interval, and then
> unexpected overloading. It's very hard to diagnose and optimize a
> system with such behavior, and this trend to fill the whole system by
> one subsystem affects other concurrent subsystems in bad way.

One way to look at it is that you go from {active, once} to {active, false}
and stay with {active, false} until you get a timeout. Then switch to
{active, once} to avoid being locked up in a blocking recv(), which can
be bad for e.g. code updates and reconfigurations. If that particular
problem doesn't bother you, it may be better to stay in {active, false}
and do a blocking read. Chances are, you _will_ regret this. ;-)

The only switching that goes on is between the port owner and the
port. Granted, there is a performance penalty in using {active, false}
and {active, true} (much of the 25% difference reported before). OTOH,
{active, true} can only be used if you are absolutely sure you will not
kill the system. It completely lacks flow control and effectively disables
the back-pressure mechanisms in TCP.

Packet loss in UDP _is_ the way for the server to stay alive if it cannot
keep up with the rate of incoming requests. If you can over-provision
your server side so that it cannot be killed by clients (which usually
cannot be guaranteed), foregoing flow control will surely give the best
throughput.

In my experience, using UDP in situations where high availability is
required, and overload possible, is a PITA. It's extremely difficult to
achieve a proper overload behavior, if you also want the clients to have
a predictable experience.

If you want _really_ undocumented, here is one way to get better
throughput than ({active,false} and gen_tcp:recv/3), but still keep the packets
in the TCP buffer for as long as possible.

(Not showing the other shell, where I'm just connecting and sending the
messages "one", "two", …, "five").

Eshell V5.9 (abort with ^G)
1> {ok,L} = gen_tcp:listen(8888,[{packet,2}]).
{ok,#Port<0.760>}
2> {ok,S} = gen_tcp:accept(L).
{ok,#Port<0.771>}
3> inet:setopts(S,[{active,false}]).
ok

% Can't use the Length indicator to tell the socket we want as much as possible:
4> gen_tcp:recv(S,1000,1000).
{error,einval}

% With Length = 0, we get exactly one message. This is what you normally do.
5> gen_tcp:recv(S,0,1000).
{ok,"one"}
6> gen_tcp:recv(S,0,1000).
{ok,"two"}

% This is going directly at the low-level function used by both gen_tcp and gen_udp:
7> [prim_inet:async_recv(S,0,0) || _ <- [1,2,3,4]].
[{ok,3},{ok,4},{ok,5},{ok,6}]
8> flush().
Shell got {inet_async,#Port<0.771>,3,{ok,"three"}}
Shell got {inet_async,#Port<0.771>,4,{ok,"four"}}
Shell got {inet_async,#Port<0.771>,5,{ok,"five"}}
Shell got {inet_async,#Port<0.771>,6,{error,timeout}}

Note to self: the JOBS load regulation system internally figures out a quota of
jobs for each dispatch. For counter-based regulation, it only supports a fixed
job size per-queue, but it wouldn't be hard to allow a configuration that makes
the job quota (the 'increment' in the JOBS config) to be dynamic up to a certain
limit. The list of counters and their increment size is already passed on when
the request is granted, and can be inspected by the client. This could be used
in combination with the above to determine how many messages to receive.

Jesper, if you want to steal back the MVP status, there's one way to get ahead. ;-)

The main trick would be to figure out how to fairly divide the quota among
several queued requests - and, as always, how this should be expressed in the
config.

BR,
Ulf W

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
http://feuerlabs.com

_______________________________________________

Valentin Micic

unread,

Apr 23, 2012, 4:59:12 AM4/23/12

to Valentin Nechayev, erlang-q...@erlang.org

First, why are we making a reference to TCP if subject line says we're discussing UDP? ;-)

Second, may I say why I think Ulf's approach better than Valentin's (hey, finally a namesake -- please to meet you):

Valentin's approach:
Whilst {active, 100} may reduce some overhead, it creates far more damage if one consider ease of programming. For example, how would one know when to issue another {active, N}? What's worse than "useess process switching" is having programmer doing the counting in order to be able to issue another {active, 100} at the end of the cycle. And what happens if one issues {active, 100} while another one is still running? Does this mean that one should expect additional 100 messages, or just 100?

OTOH, "useless process switching" notwithstanding, Ulf's approach offers a far simpler solution that does not require changes to driver -- if nothing else, far more practical approach.

Now, if I may be impractical, there's an alternative approach, which we've used for some high-throughput application operating on a raw socket - since raw IP is not natively supported by Erlang, we had to write a driver for it. Motivated by a need to increase a throughput, we derived a *at-most-N method*. Simply put, the semantic of {active, 100} syntax in this case would mean that driver shall send in one message a list of up to 100 packets. Assuming it would be simpler for a programmer to traverse a list, rather than counting variable-timing events.

This method combines relative simplicity suggested by Ulf, with lower overhead advocated by Valentin Nechayev.

Kind regards

V/

Or, what should happen if one issue another {active, 100} while the current one is still running?

Saying that Ulf's approach gives useless process switches is quite useless -- process switches is what computer does.

Jon Watte

unread,

Apr 23, 2012, 4:38:59 PM4/23/12

to Ulf Wiger, erlang-q...@erlang.org, Valentin Nechayev

but if we're
discussing optimal tuning of live systems, perhaps we can agree that we
shouldn't let bugs in the documentation limit our options?

All of Ulf's advice was great, except for the implicit assumption I read in this sentence.

I know of operational environments, where operators may actively refuse to do anything that is not documented in a reference.

These are large, performance-critical systems, and I think the operators are totally in their right to do this.

In fact, I'd go so far as to say that, for experiments, academia, and engineering what-ifs, documentation is optional, and for high-value, operational systems, documentation is required.

It turns out the type in question was documented elsewhere, so we're in agreement on the particular feature, but I'd like to pitch in a word from the poor operations guys on whom engineers sometimes "dump" "stuff" that will wake them up in the middle of the night without any idea what to do about it :-)

Sincerely,

jw

Ulf Wiger

unread,

Apr 23, 2012, 5:30:03 PM4/23/12

to Jon Watte, erlang-q...@erlang.org, Valentin Nechayev

Just to clarify, I definitely didn't mean to imply that we should accept a poor state of documentation. :)

I just wanted to highlight that the problem here was not that the behavior was 'undocumented' (as in unsupported), but simply that the documentation left something to be desired.

This latter problem can fairly easily be fixed, even by the community. It is pretty easy to submit a patch on the documentation (much easier than it is to actually *write* good documentation…)

BR,

Ulf W

Max Lapshin

unread,

Apr 24, 2012, 2:26:45 AM4/24/12

to Ulf Wiger, erlang-q...@erlang.org, Valentin Nechayev

I want to explain a bit about my words on highload, large throughput
and big latency.

Let's first discuss TCP.

I consider, that {active,once} approach is one of the most beautiful
and convenient ways to handle incoming network traffic, I've ever
seen. gen_tcp:recv is not a choice (just like a blocking accept,
which is very popular due its "documented" state, considering with
non-blocking accept). {active,false} is a step back to a
non-responsive C-like program that hangs because of client GPRS lag.

So, again: {active,true} is for brave people. For very brave with very
good monitoring.
{active,false} — I don't know what is it for, but it is really fast right now
{active,once} — this is the best way, because it is flexible and
convenient. But it has one performance problem: messages.

When I want to download 150 MB of data via HTTP into erlang via 1 Gbps
channel, I suppose that it will be downloaded in 1 second. But lets
see what may happen. 150 MB is about 100 000 times of 1500 bytes. If
we use {active,once} than we may receive 100_000 messages in our
process. Erlang VM can hardly handle more messages into one process
than this amount. But this is not the biggest problem: we must handle
this data. And often handling 1500 and 15000 bytes is almost the same
time.

So according to my experience, biggest problem is not an amount of
data, it is amount of messages. Each message is a very expensive thing
if we speak about hundreds of thousands of them.

If we receive these 150MB in 150 messages, each 1 MB, our erlang
system will not even start coolers working. Something about 5% CPU.

But I was wrong about {active, 1024*1024}, because there should be
some timeout. It must be something like {active, 1024*1024, 1000}.
Don't wait more than 1 second and don't accumulate more than 1 MB.

Such hinting will help a lot in terms of copying memory, because it is
possible to preallocate required binary in tcp driver.

Now about UDP: problem is also in amount of messages. I know only one
way to reduce it: to reply with several udp messages in one message,
but perhaps Ulf's approach to combine {active,once} and recv will be
the easiest way to go.

Reply all

Reply to author

Forward