[erlang-questions] Unicast 20k messages, $500-$1000 bounty


Joel Reymont

unread,
Jul 15, 2009, 1:39:18 PM7/15/09
to Erlang Users' List
Folks,

I ran out of ideas on how to optimize my code, but my client was kind
enough to let me open source a basic version of the code under the MIT
license and offer a bounty for performance improvements.

The rules are very simple...

1) Unicast to 20K clients in < 1s nets you a $1000 bounty.

2) Unicast to 20K clients in < 2s nets you a $500 bounty.

3) Final proof has to be from EC2, one small instance for the server
and one for the bots.

4) {packet, 0}

5) TCP

Let's keep this thread for technical discussion and use #erlang on
irc.freenode.com for general chat.

You can also email me with any questions and follow wagerlabs on
Twitter.

I'll be posting the reference implementation on Github tonight.

Thanks, Joel

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont


________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Joel Reymont

unread,
Jul 15, 2009, 5:01:00 PM7/15/09
to Erlang Users' List
The Ultimate Erlang Challenge is on:

http://github.com/tinycode/janus/tree/master

Be the first to improve Janus to unicast to 20K clients with a
consistent maximum latency of less than 1s and grab a $1000 bounty.

Be the first to get the maximum latency consistently under 2s and
claim $500.

Develop on any hardware but test on Amazon EC2 for final numbers.

Here's the current set of timings for 20k clients, with the bots
running on one small instance and the server on another:

setup: 47791.69ms, n: 20000, run: 55429.86ms
-21.7180ms | min
500.0000ms | 1064 - 05.32%
1000.0000ms | 828 - 04.14%
1500.0000ms | 1596 - 07.98%
2000.0000ms | 2786 - 13.93%
2500.0000ms | 2272 - 11.36%
3000.0000ms | 2574 - 12.87%
3500.0000ms | 2717 - 13.58%
4000.0000ms | 2418 - 12.09%
4500.0000ms | 2594 - 12.97%
5000.0000ms | 1151 - 05.75%
4679.1540ms | max

setup: 50666.82ms, n: 20000, run: 58361.03ms
-25.4550ms | min
500.0000ms | 1330 - 06.65%
1000.0000ms | 811 - 04.06%
1500.0000ms | 792 - 03.96%
2000.0000ms | 2800 - 14.00%
2500.0000ms | 2484 - 12.42%
3000.0000ms | 2902 - 14.51%
3500.0000ms | 2526 - 12.63%
4000.0000ms | 2861 - 14.31%
4500.0000ms | 2777 - 13.88%
5000.0000ms | 717 - 03.58%
4716.3200ms | max

setup: 42524.39ms, n: 20000, run: 50148.11ms
6.6660ms | min
500.0000ms | 1168 - 05.84%
1000.0000ms | 1016 - 05.08%
1500.0000ms | 1546 - 07.73%
2000.0000ms | 2442 - 12.21%
2500.0000ms | 2713 - 13.56%
3000.0000ms | 2684 - 13.42%
3500.0000ms | 2636 - 13.18%
4000.0000ms | 2363 - 11.82%
4500.0000ms | 2248 - 11.24%
5000.0000ms | 1184 - 05.92%
4680.1510ms | max

Here are the numbers from my unibody MacBook Pro 2.93GHz:

setup: 15040.77ms, n: 20000, run: 23101.30ms
1.5010ms | min
500.0000ms | 579
1000.0000ms | 3693
1500.0000ms | 509
2000.0000ms | 1758
2500.0000ms | 3922
3000.0000ms | 2577
3500.0000ms | 2809
4000.0000ms | 1657
4500.0000ms | 2496
4185.8580ms | max

setup: 14538.21ms, n: 20000, run: 22750.28ms
2.8300ms | min
500.0000ms | 400
1000.0000ms | 1541
1500.0000ms | 7107
2000.0000ms | 5402
2500.0000ms | 4670
3000.0000ms | 880
2595.8890ms | max

setup: 15929.55ms, n: 20000, run: 23789.73ms
1.7730ms | min
500.0000ms | 1330
1000.0000ms | 668
1500.0000ms | 1878
2000.0000ms | 4109
2500.0000ms | 4954
3000.0000ms | 2546
3500.0000ms | 1259
4000.0000ms | 3256
3902.5510ms | max
ok

Joel Reymont

unread,
Jul 15, 2009, 5:20:58 PM7/15/09
to Erlang Users' List
One last thing... There must be no errors in the console and all bots
must run through successfully!

On Jul 15, 2009, at 10:01 PM, Joel Reymont wrote:

> The Ultimate Erlang Challenge is on:
>
> http://github.com/tinycode/janus/tree/master
>
> Be the first to improve Janus to unicast to 20K clients with a
> consistent maximum latency of less than 1s and grab a $1000 bounty.
>
> Be the first to get the maximum latency consistently under 2s and
> claim $500.

---

Jim Morris

unread,
Jul 15, 2009, 7:11:11 PM7/15/09
to erlang-q...@erlang.org
Well, as I have said, my server already does this easily, but I have
to ask: did you try the one-process-per-client approach that everyone
was suggesting?

On Jul 15, 2:20 pm, Joel Reymont <joe...@gmail.com> wrote:
> One last thing... There must be no errors in the console and all bots  
> must run through successfully!
>
> On Jul 15, 2009, at 10:01 PM, Joel Reymont wrote:
>
> > The Ultimate Erlang Challenge is on:
>
> >http://github.com/tinycode/janus/tree/master
>
> > Be the first to improve Janus to unicast to 20K clients with a  
> > consistent maximum latency of less than 1s and grab a $1000 bounty.
>
> > Be the first to get the maximum latency consistently under 2s and  
> > claim $500.
>
> ---

> Mac hacker with a performance bent
> http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 15, 2009, 7:15:15 PM7/15/09
to Jim Morris, erlang-q...@erlang.org
Jim,

On Jul 16, 2009, at 12:11 AM, Jim Morris wrote:

> Well as I have said my server already does this easily, but I have to
> ask did you try the one process per client that everyone was
> suggesting?


Take a look at the code; I have 2 processes per client.

Thanks, Joel

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont



Joel Reymont

unread,
Jul 16, 2009, 6:29:22 AM7/16/09
to Jim Morris, erlang-q...@erlang.org
Some notes on the architecture...

janus_acceptor spawns a transport process, which initializes janus_flash.

janus_flash:start sends a token to the client.

janus_flash will also initialize a client_proxy. Think of client_proxy
as the session process.

Upon receipt of the token, the client will try to subscribe and wait
for subscription confirmation. Once the subscription is confirmed by
the server, the client starts a 5s timer and expects to receive its
packet before the timer expires.

topman maps subscription topics to pubsub instances, which keep track
of subscribers.

pubsub keeps client_proxy (subscriber) pids in ETS and monitors
subscribers to automatically unsubscribe if they die.

Once a publish request is received, pubsub traverses the client_proxy
pids in ETS and gen_server:casts the message to each client_proxy.
Each client_proxy forwards the message to the transport instance
connected to the socket, which then writes it to the socket itself.

bot launches as many instances of the flashbot client as required,
then waits for the bots to finish and calculates statistics.

launcher takes care of starting bots on the node where it's running.

The client load can actually be spread across multiple instances.
Simply connect the nodes and invoke launcher:start(true) on each one.
Try pg2:get_members(launcher) on the master bot node to make sure you
have multiple bot launcher nodes.
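For example, spreading the bots over extra nodes might look like this
(a sketch; the master node name is whatever you started the first bot
node with):

%% On each extra bot node: connect to the master bot node and register
%% a local launcher.
connect_bot_node(MasterNode) ->
    pong = net_adm:ping(MasterNode),
    launcher:start(true).

%% Then, on the master bot node, pg2:get_members(launcher) should list
%% one launcher pid per node.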

flashbot connects, subscribes and waits for its message.

Note that it does not matter how long the setup takes; the critical
path is in pubsub:publish and, perhaps, in the parsing and handling of
messages by flashbot.
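To make that path concrete, here is a minimal sketch of what it might
look like (hypothetical table layout and names, not the actual janus
source):

%% Sketch of the publish path described above: traverse the subscriber
%% pids stored in ETS and cast the message to each client_proxy.
publish(Msg, #state{subs = Subs} = State) ->
    ets:foldl(fun({Pid, _Ref}, Acc) ->
                      gen_server:cast(Pid, {message, Msg}),
                      Acc
              end, ok, Subs),
    {noreply, State}.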

Thanks, Joel

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont



Joel Reymont

unread,
Jul 16, 2009, 7:48:39 AM7/16/09
to Erlang Users' List, Will
I switched to ets:foldl, as suggested by Jim Morris, and to calculating
timestamps for each message, as per the patch sent by Will Glozer.

Here are the latest timings on my MBP 2.93GHz. The maximum latency
seems to be 2.1-2.2s; I'm not sure what to make of the 1.5s run.

On to testing on EC2...

Thanks, Joel

---

(de...@biggie.local)1> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::12:24:08 ===
setup: 161030.13ms, good: 20000, bad: 0, run: 168624.54ms
0.9540ms | min
500.0000ms | 4858 - 24.29%
1000.0000ms | 1547 - 7.74%
1500.0000ms | 2551 - 12.75%
2000.0000ms | 8044 - 40.22%
2500.0000ms | 3000 - 15.00%
2181.8090ms | max
ok

(de...@biggie.local)3> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::12:36:02 ===
setup: 14496.73ms, good: 20000, bad: 0, run: 22286.76ms
0.5900ms | min
500.0000ms | 2810 - 14.05%
1000.0000ms | 1228 - 6.14%
1500.0000ms | 2470 - 12.35%
2000.0000ms | 10811 - 54.05%
2500.0000ms | 2681 - 13.41%
2258.7140ms | max
ok

(de...@biggie.local)4> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::12:43:19 ===
setup: 14338.41ms, good: 20000, bad: 0, run: 22433.91ms
0.5340ms | min
500.0000ms | 9485 - 47.42%
1000.0000ms | 6033 - 30.16%
1500.0000ms | 4482 - 22.41%
1483.6170ms | max
ok
(de...@biggie.local)5> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::12:45:08 ===
setup: 14256.80ms, good: 20000, bad: 0, run: 22089.76ms
3.6720ms | min
500.0000ms | 5368 - 26.84%
1000.0000ms | 874 - 4.37%
1500.0000ms | 2688 - 13.44%
2000.0000ms | 9781 - 48.91%
2500.0000ms | 1289 - 6.44%
2210.5790ms | max

Joel Reymont

unread,
Jul 16, 2009, 8:10:31 AM7/16/09
to Erlang Users' List
EC2 run, one instance for the server and another one for bots. ~3s
mostly.

(de...@ip-10-244-145-238.ec2.internal)2> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 16-Jul-2009::12:05:07 ===
setup: 30900.04ms, good: 20000, bad: 0, run: 38579.24ms
-38.6400ms | min
500.0000ms | 2650 - 13.25%
1000.0000ms | 1972 - 9.86%
1500.0000ms | 2234 - 11.17%
2000.0000ms | 4611 - 23.05%
2500.0000ms | 4518 - 22.59%
3000.0000ms | 4015 - 20.08%
2963.5520ms | max
ok
(de...@ip-10-244-145-238.ec2.internal)3> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 16-Jul-2009::12:06:29 ===
setup: 55302.13ms, good: 20000, bad: 0, run: 63033.07ms
-41.0070ms | min
500.0000ms | 2836 - 14.18%
1000.0000ms | 1733 - 8.67%
1500.0000ms | 2275 - 11.38%
2000.0000ms | 4142 - 20.71%
2500.0000ms | 4555 - 22.78%
3000.0000ms | 4239 - 21.20%
3500.0000ms | 220 - 1.10%
3075.1930ms | max
ok
(de...@ip-10-244-145-238.ec2.internal)4> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 16-Jul-2009::12:07:51 ===
setup: 52532.12ms, good: 20000, bad: 0, run: 60196.70ms
-43.1910ms | min
500.0000ms | 2330 - 11.65%
1000.0000ms | 2205 - 11.03%
1500.0000ms | 2158 - 10.79%
2000.0000ms | 3742 - 18.71%
2500.0000ms | 4371 - 21.86%
3000.0000ms | 4630 - 23.15%
3500.0000ms | 564 - 2.82%
3059.0860ms | max

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 9:57:26 AM7/16/09
to Joel Reymont, Erlang Users' List, Will
Hi Joel,

Joel Reymont wrote:
> I switched to ets:foldl as suggested by Jim Morris and switched to
> calculating timestamps for each message, as per the patch sent by Will
> Glozer.

ets:foldr or foldl is slow for what you need. You don't need to
accumulate or to look at the Ref, and the function passed is trivial.

For what you need you should:

- obtain the list of pids with ets:select, as in:

ets:select(State#state.subs, [{{'$1', '_'}, [], ['$1']}]).

- iterate the list in a tight loop as in:

send([], _M) -> ok;
send([H|T], M) -> H ! M, send(T, M).

I haven't measured it in this case, but on a similar problem, using
select and traversing the list made things faster.

But another source of inefficiency is using 2 middleman procs (as
opposed to just 1) between the pubsub and the socket.

Regards,
Paulo

Joel Reymont

unread,
Jul 16, 2009, 10:01:49 AM7/16/09
to Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 2:57 PM, Paulo Sérgio Almeida wrote:

> But another source of inefficiency is using 2 middleman procs (as
> opposed to just 1) between the pubsub and the socket.


We had a discussion about this in an earlier thread. I used to store a
closure that captured a socket and invoke that in a loop. The
overwhelming consensus was to use a middleman process :-). I guess
I'll try the closure approach again.
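For reference, a rough sketch of that closure approach (hypothetical,
not the original code): store a send fun that captures the socket and
invoke it for each subscriber at publish time.

subscribe(Socket, Subs) ->
    Send = fun(Data) -> gen_tcp:send(Socket, Data) end,
    [Send | Subs].

publish(Msg, Subs) ->
    [Send(Msg) || Send <- Subs],
    ok.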

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Witold Baryluk

unread,
Jul 16, 2009, 10:02:25 AM7/16/09
to Paulo Sérgio Almeida, Joel Reymont, Erlang Users' List, Will
On Thu 2009-07-16 at 14:57 +0100, Paulo Sérgio Almeida wrote:
> Hi Joel,

> - iterate the list in a tight loop as in:
>
> send([], _M) -> ok;
> send([H|T], M) -> H ! M, send(T, M).

List comprehension can be faster:

send(L, M) ->
    [H ! M || H <- L],
    ok.

--
Witold Baryluk <bar...@smp.if.uj.edu.pl>


Joel Reymont

unread,
Jul 16, 2009, 10:23:57 AM7/16/09
to Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 2:57 PM, Paulo Sérgio Almeida wrote:

> - obtain the list of pids with ets:select, as in:
>
> ets:select(State#state.subs, [{{'$1', '_'}, [], ['$1']}]).

Would it not be faster to call ets:tab2list?

I would think so, intuitively, since select has matching work to do.

Thanks, Joel

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 10:44:32 AM7/16/09
to Witold Baryluk, Joel Reymont, Erlang Users' List, Will
Witold Baryluk wrote:
> On Thu 2009-07-16 at 14:57 +0100, Paulo Sérgio Almeida wrote:
>> Hi Joel,
>
>> - iterate the list in a tight loop as in:
>>
>> send([], _M) -> ok;
>> send([H|T], M) -> H ! M, send(T, M).
>
> List comprehension can be faster:
>
> send(L, M) ->
> [ H ! M || H <- L],
> ok.
>

Or it can be slower. It varies. :)
They are basically the same if the comprehension optimizes away building
a result list, as when using this "ok" here after it.

Joel Reymont

unread,
Jul 16, 2009, 10:49:01 AM7/16/09
to Paulo Sérgio Almeida, Witold Baryluk, Erlang Users' List, Will

On Jul 16, 2009, at 3:44 PM, Paulo Sérgio Almeida wrote:

> Or it can be slower. It varies. :)
> They are basically the same if the comprehension optimizes away
> building a result list, as when using this "ok" here after it.


Also, instead of duplicating the table data by building a list, why
not use ets:first and ets:next?

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 10:56:03 AM7/16/09
to Joel Reymont, Erlang Users' List, Will
Joel Reymont wrote:
>
> On Jul 16, 2009, at 2:57 PM, Paulo Sérgio Almeida wrote:
>
>> - obtain the list of pids with ets:select, as in:
>>
>> ets:select(State#state.subs, [{{'$1', '_'}, [], ['$1']}]).
>
> Would it not be faster to call ets:tab2list?

I have a weak memory, but I have the feeling select is faster in this
case when you don't need all the content in the table. It is a built-in,
it is producing less results (just a list of Pids, not of tuples), and
then you don't have to extract the Pid from the tuples when traversing
the list.

Paulo

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 11:35:02 AM7/16/09
to Joel Reymont, Witold Baryluk, Erlang Users' List, Will
Joel Reymont wrote:
>
> Also, instead of duplicating the table data by building a list, why not
> use ets:first and ets:next?

To be sure one has to measure. But my intuition is that select followed
by list traversal avoids a ping-pong between built-in and user code. It
will use a built-in function (C) to operate in bulk over the table,
extracting only what is needed. Then you just traverse the list; and
Erlang is very good operating on lists.
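One quick way to settle it is to time the extraction step alone with
timer:tc; a rough sketch, assuming a table of {Pid, Ref} entries:

%% Quick-and-dirty comparison of extraction strategies (sketch only).
bench(Tab) ->
    {T1, _} = timer:tc(ets, select, [Tab, [{{'$1', '_'}, [], ['$1']}]]),
    {T2, _} = timer:tc(ets, tab2list, [Tab]),
    {T3, _} = timer:tc(ets, foldl,
                       [fun({P, _}, Acc) -> [P | Acc] end, [], Tab]),
    io:format("select: ~pus  tab2list: ~pus  foldl: ~pus~n",
              [T1, T2, T3]).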

Also, the list produced should be quite compact, with good memory
locality (is this true?).

Also, the negative impact due to GC should come later, after the
messages have been sent. But in this case not even GC is a problem, as
the process exits after sending the messages.

Btw, now I noticed a potential problem. I don't know what delivery
semantics you aim for, but now you don't have total order in message
delivery, even for the same topic. As the sending is done in a spawned
proc, two clients can receive two messages A and B from the same topic,
one in the order A -> B and the other B -> A. As you already have
middlemen, why not do the sending without spawning a proc?

Regards,
Paulo

Joel Reymont

unread,
Jul 16, 2009, 11:35:55 AM7/16/09
to Paulo Sérgio Almeida, Erlang Users' List, Will
I'd like to introduce another useful constraint into the equation.

You must grab the timestamp _before_ you start broadcasting since what
we are trying to measure is delivery time from the moment the message
is published.

The time it takes for the message to travel through the server and
into pubsub:publish is rather small and can be ignored but resetting
the timestamp before sending the message to each client both skews the
timings and does not measure "time from publishing".

With that in mind, here's my latest set of Mac timings: 2s, 2.2s and
2.3s for 20k messages, with the starting time captured before
broadcasting [1].

I traded the time taken to broadcast over a list against the time and
effort taken to subscribe and unsubscribe. Setup time is much longer
now, since inserting into, deleting from and searching a 20k-item tuple
list is garbage collection galore. Traversal times of 617.352ms,
645.084ms, 871.228ms and 697.16ms beat anything we have seen from ETS
so far, and I'm only looking at a maximum of 20k subscribers per server.

I think it's better to build the list over time than to build it from
ets at broadcasting time.

Thanks, Joel

[1] http://github.com/tinycode/janus/commit/19c3cc954fdbc5ecb7c950510a80f56423ab9ec9

---

(de...@biggie.local)3> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::16:10:26 ===
setup: 154206.92ms, good: 20000, bad: 0, run: 157926.75ms
2.4990ms | min
500.0000ms | 2359 - 11.79%
1000.0000ms | 6323 - 31.61%
1500.0000ms | 4934 - 24.67%
2000.0000ms | 5769 - 28.84%
2500.0000ms | 615 - 3.08%
2078.7770ms | max


ok
(de...@biggie.local)4> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::16:16:36 ===
setup: 150305.92ms, good: 20000, bad: 0, run: 154186.14ms
2.1070ms | min
500.0000ms | 2235 - 11.18%
1000.0000ms | 4840 - 24.20%
1500.0000ms | 5497 - 27.48%
2000.0000ms | 5283 - 26.41%
2500.0000ms | 2145 - 10.72%
2185.3810ms | max


ok
(de...@biggie.local)5> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::16:22:07 ===
setup: 201656.95ms, good: 20000, bad: 0, run: 205959.23ms
2.5690ms | min
500.0000ms | 2573 - 12.86%
1000.0000ms | 2704 - 13.52%
1500.0000ms | 5835 - 29.18%
2000.0000ms | 5527 - 27.63%
2500.0000ms | 3361 - 16.80%
2371.7610ms | max
ok

(de...@biggie.local)6> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::16:29:57 ===
setup: 153129.67ms, good: 20000, bad: 0, run: 157174.14ms
3.4600ms | min
500.0000ms | 2682 - 13.41%
1000.0000ms | 3922 - 19.61%
1500.0000ms | 5523 - 27.62%
2000.0000ms | 4728 - 23.64%
2500.0000ms | 3145 - 15.72%
2295.7770ms | max
ok

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 16, 2009, 11:47:23 AM7/16/09
to Paulo Sérgio Almeida, Witold Baryluk, Erlang Users' List, Will

On Jul 16, 2009, at 4:35 PM, Paulo Sérgio Almeida wrote:

> I don't know what delivery semantics you aim for, but now you don't
> have total order in message delivery, even for the same topic. As the
> sending is done in a spawned proc, two clients can receive two
> messages A and B from the same topic, one in the order A -> B and the
> other B -> A. As you already have middlemen, why not do the sending
> without spawning a proc?

You are right, I'll fix this!

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

mats cronqvist

unread,
Jul 16, 2009, 11:54:58 AM7/16/09
to Paulo Sérgio Almeida, Joel Reymont, Erlang Users' List, Will
Paulo Sérgio Almeida <p...@di.uminho.pt> writes:

> Joel Reymont wrote:
>>
>> On Jul 16, 2009, at 2:57 PM, Paulo Sérgio Almeida wrote:
>>
>>> - obtain the list of pids with ets:select, as in:
>>>
>>> ets:select(State#state.subs, [{{'$1', '_'}, [], ['$1']}]).
>>
>> Would it not be faster to call ets:tab2list?
>
> I have a weak memory, but I have the feeling select is faster in this
> case when you don't need all the content in the table. It is a
> built-in, it is producing less results (just a list of Pids, not of
> tuples), and then you don't have to extract the Pid from the tuples
> when traversing the list.

+1

mats

Joel Reymont

unread,
Jul 16, 2009, 11:56:13 AM7/16/09
to mats cronqvist, Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 4:54 PM, mats cronqvist wrote:

>> I have a weak memory, but I have the feeling select is faster in this
>> case when you don't need all the content in the table. It is a
>> built-in, it is producing less results (just a list of Pids, not of
>> tuples), and then you don't have to extract the Pid from the tuples
>> when traversing the list.
>
> +1


How would it compare to building up a list over time?

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 16, 2009, 12:09:04 PM7/16/09
to Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 3:56 PM, Paulo Sérgio Almeida wrote:

> I have a weak memory, but I have the feeling select is faster in
> this case when you don't need all the content in the table. It is a
> built-in, it is producing less results (just a list of Pids, not of
> tuples), and then you don't have to extract the Pid from the tuples
> when traversing the list.


There's nothing like empirical proof. Traversal times with ets:select
are 1046.224ms and 1009.256ms respectively. Compare to my previous
timings.

Trashing memory while building the list of subscribers over time beats
the other techniques thus far, by a large margin, since we are looking
at 2.9s and 3s here.

=INFO REPORT==== 16-Jul-2009::17:05:17 ===
setup: 15100.95ms, good: 20000, bad: 0, run: 22863.07ms
5.3150ms | min
500.0000ms | 432 - 2.16%
1000.0000ms | 1230 - 6.15%
1500.0000ms | 6055 - 30.28%
2000.0000ms | 4970 - 24.85%
2500.0000ms | 4417 - 22.09%
3000.0000ms | 2896 - 14.48%
2923.3990ms | max


ok
(de...@biggie.local)3> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::17:05:47 ===
setup: 14442.21ms, good: 20000, bad: 0, run: 21925.20ms
6.3430ms | min
500.0000ms | 723 - 3.62%
1000.0000ms | 3308 - 16.54%
1500.0000ms | 3442 - 17.21%
2000.0000ms | 4217 - 21.09%
2500.0000ms | 1943 - 9.71%
3000.0000ms | 3084 - 15.42%
3500.0000ms | 1130 - 5.65%
4000.0000ms | 2153 - 10.76%
3711.4690ms | max
ok

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 12:27:34 PM7/16/09
to Joel Reymont, Erlang Users' List, Will
Joel Reymont wrote:

> I traded the time taken to broadcast over a list against the time and
> effort taken to subscribe and unsubscribe. Setup time is much longer
> now, since inserting into, deleting from and searching a 20k-item
> tuple list is garbage collection galore. Traversal times of 617.352ms,
> 645.084ms, 871.228ms and 697.16ms beat anything we have seen from ETS
> so far, and I'm only looking at a maximum of 20k subscribers per
> server.
>
> I think it's better to build the list over time than to build it from
> ets at broadcasting time.

Have you already tried using select? In my very slow very old Dell X1
with a 1.0GHz ULV processor, extracting the list of keys from a 20k
entry ets table takes about 5 - 8 ms.

I tried extracting using ets:foldl and was shocked. It was even much
slower than I expected: it varies between 30 - 80 ms.

I think you don't need to compromise and build the list beforehand. If
8ms is not good enough, the solution in this case is ...

the process dictionary

(I don't have any more patience to hear what a sacrilege it is to use
it; some other day I could digress about Erlang philosophy,
encapsulation and mutable state, ...)

For now I will just say that:

- If I have 100000 processes not each of them is a gen_*. I use gen_*
sparingly, for the main services.

- Sometimes, for processes that provide a simple service, with little
code, the process dictionary can be the right solution to store exactly
this kind of information that would be otherwise stored in slower global
ets tables, or even slower immutable containers (like dict).

In this case, in my slow laptop, extracting the entries of a 20k process
dictionary takes about 2 - 3 ms.
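For illustration, a rough sketch of a pubsub-style process keeping its
subscribers in the process dictionary (hypothetical names, not the
janus code):

%% put/erase keyed on the subscriber pid; the value is the monitor ref.
subscribe(Pid) ->
    put(Pid, erlang:monitor(process, Pid)).

unsubscribe(Pid) ->
    case erase(Pid) of
        undefined -> ok;
        Ref -> erlang:demonitor(Ref)
    end.

publish(Msg) ->
    %% get/0 returns the whole dictionary as a list of {Key, Value} pairs
    [Pid ! Msg || {Pid, _Ref} <- get()],
    ok.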

Regards,
Paulo

Joel Reymont

unread,
Jul 16, 2009, 12:35:33 PM7/16/09
to Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 5:27 PM, Paulo Sérgio Almeida wrote:

> Have you already tried using select? In my very slow very old Dell
> X1 with a 1.0GHz ULV processor, extracting the list of keys from a
> 20k entry ets table takes about 5 - 8 ms.

I tried. See my previous email. 2.9-3s whereas I can consistently get
2.2s with a list.

> the process dictionary

How?

> In this case, in my slow laptop, extracting the entries of a 20k
> process dictionary takes about 2 - 3 ms.


Are you suggesting get() to get all the key/value pairs?

Thanks, Joel

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 12:36:26 PM7/16/09
to Joel Reymont, Erlang Users' List, Will
Joel Reymont wrote:

> There's nothing like empirical proof. Traversal time with ets:select is
> 1046.224ms and 1009.256ms respectively. Compare to my previous timings.

What??? For 20k entries? Just traversal time (without sending msgs)?
There is something wrong. It cannot be this slow. This would be 50 us
per entry.

Paulo

Joel Reymont

unread,
Jul 16, 2009, 12:37:24 PM7/16/09
to Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 5:36 PM, Paulo Sérgio Almeida wrote:

> What??? For 20k entries? Just traversal time (without sending msgs)?
> There is something wrong. It cannot be this slow. This would be 50
> us per entry.


Apologies, I meant time taken to extract the list with ets:select and
traverse the list while broadcasting.

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 16, 2009, 1:17:59 PM7/16/09
to Erlang Users' List
An interesting bit of info...

This is how long it takes to broadcast (traverse + send) to bots
running on 1 EC2 instance (in addition to the server instance):

info: 20000
time: 1168.112

info: 20000
time: 1230.546

info: 20000
time: 1095.887

And this is how long it takes to do the same for 20k bots spread over
3 instances:

info: 20000
time: 5387.765

info: 20000
time: 5380.398

Could there be a Linux networking optimization of sorts for when all
packets are going to a single server? This is Ubuntu 9.04.

Cliff Moon

unread,
Jul 16, 2009, 1:18:35 PM7/16/09
to Joel Reymont, Paulo Sérgio Almeida, Erlang Users' List, Will
The problem you're seeing there is process scheduling overhead. In the
fun it sends a message and it's doing an ungodly amount of context
switching. The way to get around this is to bump the process priority
level of the process which iterates the ets table.
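Presumably something along these lines (a sketch, not Cliff's actual
change): raise the scheduling priority of the publishing process around
the broadcast so the loop is not preempted by the woken-up receivers.

publish(Msg, Subscribers) ->
    Old = process_flag(priority, high),
    [Pid ! Msg || Pid <- Subscribers],
    process_flag(priority, Old),
    ok.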

On that note, I'd like to share my latest findings:


=INFO REPORT==== 16-Jul-2009::12:44:30 ===
setup: 557892.25ms, good: 20000, bad: 0, run: 11261.47ms
-382.1810ms | min
500.0000ms | 13035 - 65.18%
1000.0000ms | 6904 - 34.52%
1500.0000ms | 8 - 0.04%
2000.0000ms | 6 - 0.03%
2500.0000ms | 6 - 0.03%
3000.0000ms | 4 - 0.02%
3500.0000ms | 7 - 0.03%
4000.0000ms | 3 - 0.01%
4500.0000ms | 1 - 0.01%
5000.0000ms | 6 - 0.03%
10843.4820ms | 20 - 0.10%
10843.4820ms | max
ok

Overall latency is improved drastically in my version; however, there
are some stragglers that are taking way too long. I'm going to try to
figure out what's going on with them.

Joel Reymont

unread,
Jul 16, 2009, 1:21:25 PM7/16/09
to Cliff Moon, Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 6:18 PM, Cliff Moon wrote:

> The problem you're seeing there is process scheduling overhead. In
> the fun it sends a message and it's doing an ungodly amount of
> context switching. The way to get around this is to bump the
> process priority level of the process which iterates the ets table.


My latest version on GitHub is drastically faster and does not use ETS.

What kind of numbers do you get with that when you bump the process
priority level?

Can you show us the code that does the bumping of process priority?

Thanks, Joel

Joel Reymont

unread,
Jul 16, 2009, 1:25:20 PM7/16/09
to Cliff Moon, Paulo Sérgio Almeida, Erlang Users' List, Will
Indeed, as soon as I spread bot load over more than 1 server,
broadcasting all of a sudden takes 5x longer. It's 1s vs 5s.

Why?

Joel Reymont

unread,
Jul 16, 2009, 3:11:15 PM7/16/09
to Cliff Moon, Paulo Sérgio Almeida, Erlang Users' List, Will

On Jul 16, 2009, at 6:18 PM, Cliff Moon wrote:

> The problem you're seeing there is process scheduling overhead. In
> the fun it sends a message and it's doing an ungodly amount of
> context switching. The way to get around this is to bump the
> process priority level of the process which iterates the ets table.


bumping priority to high before the broadcasting list comprehension
and lowering it after results in ~10ms broadcasting time and shaves
~300ms off max latency. Loving it!

Mac is at ~2s (one box for everything) and ec2 is ~2.7s when bots are
spread over 3 instances. It looks like process priority is a very
useful tool in one's toolbox.

(de...@ip-10-244-145-238.ec2.internal)4> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 16-Jul-2009::19:05:18 ===
setup: 69312.72ms, good: 20000, bad: 0, run: 72269.01ms
315.0020ms | min
500.0000ms | 2019 - 10.10%
1000.0000ms | 3651 - 18.25%
1500.0000ms | 4433 - 22.17%
2000.0000ms | 5259 - 26.30%
2500.0000ms | 4057 - 20.29%
3000.0000ms | 581 - 2.90%
2713.3960ms | max
ok
(de...@ip-10-244-145-238.ec2.internal)5> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 16-Jul-2009::19:08:37 ===
setup: 65606.03ms, good: 20000, bad: 0, run: 68362.39ms
328.4420ms | min
500.0000ms | 1361 - 6.80%
1000.0000ms | 4231 - 21.15%
1500.0000ms | 4477 - 22.38%
2000.0000ms | 4490 - 22.45%
2500.0000ms | 4051 - 20.26%
3000.0000ms | 1390 - 6.95%
2716.1440ms | max
ok
(de...@ip-10-244-145-238.ec2.internal)6> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 16-Jul-2009::19:10:45 ===
setup: 99778.01ms, good: 19953, bad: 47, run: 104700.73ms
313.9300ms | min
500.0000ms | 1420 - 7.12%
1000.0000ms | 4553 - 22.82%
1500.0000ms | 3987 - 19.98%
2000.0000ms | 4777 - 23.94%
2500.0000ms | 4241 - 21.25%
3000.0000ms | 968 - 4.85%
3500.0000ms | 0 - 0.00%
4000.0000ms | 0 - 0.00%
4500.0000ms | 7 - 0.04%
4143.9490ms | max
ok

Cliff Moon

unread,
Jul 16, 2009, 3:29:53 PM7/16/09
to Joel Reymont, Paulo Sérgio Almeida, Erlang Users' List, Will
Word dogg. On my ec2 cluster (2 machines, small instance) I'm pretty
consistently getting under 2 seconds:

(de...@domU-12-31-39-03-C2-07.compute-1.internal)1> bot:test(flashbot,
20000, "domU-12-31-39-02-B5-46.compute-1.internal", 8081).
publishing

=INFO REPORT==== 16-Jul-2009::14:34:42 ===
setup: 77401.66ms, good: 20000, bad: 0, run: 13670.42ms
590.9040ms | min
500.0000ms | 0 - 0.00%
1000.0000ms | 3553 - 17.77%
1500.0000ms | 8093 - 40.47%
2000.0000ms | 8354 - 41.77%
1861.2000ms | max
ok
(de...@domU-12-31-39-03-C2-07.compute-1.internal)2> bot:test(flashbot,
20000, "domU-12-31-39-02-B5-46.compute-1.internal", 8081).
publishing

=INFO REPORT==== 16-Jul-2009::14:36:32 ===
setup: 79961.47ms, good: 20000, bad: 0, run: 13641.45ms
459.8880ms | min
500.0000ms | 424 - 2.12%
1000.0000ms | 4116 - 20.58%
1500.0000ms | 7236 - 36.18%
2000.0000ms | 8224 - 41.12%
1760.0160ms | max
ok

The code is here: http://github.com/cliffmoon/janus

I'll have a write-up a little later with the optimizations that seem
to matter most.

Joel Reymont

unread,
Jul 16, 2009, 4:23:11 PM7/16/09
to Cliff Moon, Paulo Sérgio Almeida, Erlang Users' List, Will
Cliff,

On Jul 16, 2009, at 8:29 PM, Cliff Moon wrote:

> Word dogg. On my ec2 cluster (2 machines, small instance) I'm
> pretty consistently getting under 2 seconds:


Some comments on your code...

Enabling SMP will not improve things on EC2 since it's single-core;
try disabling it.

Did you really see an improvement from +A 30? I thought it made sense
in linked-in drivers mostly. I'm guilty of leaving +A 8 in my make
file but that's because I forgot to remove it.

Does {keepalive, true} really help?

What did you change in gen_fsm.erl?

I will remove all the messaging from bots back to the mothership since
it's not needed, i.e. everything but success and failure.

Thanks, Joel

Will

unread,
Jul 16, 2009, 4:23:24 PM7/16/09
to Joel Reymont, Cliff Moon, Paulo Sérgio Almeida, Erlang Users' List
2009/7/16 Joel Reymont <joe...@gmail.com>:

>
> bumping priority to high before the broadcasting list comprehension and
> lowering it after results in ~10ms broadcasting time and shaves ~300ms off
> max latency. Loving it!
>
> Mac is at ~2s (one box for everything) and ec2 is ~2.7s when bots are spread
> over 3 instances. It looks like process priority is a very useful tool in
> one's toolbox.
>

I suspect that the bots are the bottleneck now. Here is the average
max time reported out of 3 runs, with 1-5 independent VMs running a
total of 20,000 bots. The benchmark machine is a 3.0GHz dual quad-core
Intel Xeon running 64-bit CentOS 5.2 and R13B01.

1 VM: 1723.75ms
2 VMs: 1429.36ms
3 VMs: 1030.18ms
4 VMs: 845.32ms
5 VMs: 784.75ms

5 VMs on one box is probably pushing it, even with 8 cores...

-Will

Cliff Moon

unread,
Jul 16, 2009, 4:31:13 PM7/16/09
to Erlang Users' List
So it turns out that the optimizations which mattered most were:

- compile everything with HiPE, including mochiweb
- bump the process priority for publishing
- have pubsub write directly to the socket during its loop

The last two send it over the top. The reason the wall clock time was
so high during the ets fold was that it was getting hung up context
switching processes. Each message sent out ends up putting a process
in the run queue, and Erlang was not allowing the fold to complete
before context switching off of the pubsub process. So bumping the
priority level helps a lot with fold time; however, it simply delays
the pain of all of that context switching. The message has to wend its
way through two processes before it finally gets written to a socket.
This means that at least 3 context switches, two for Erlang procs and
one for the linked-in TCP driver, must happen per message when
publishing to 20k clients. So making pubsub write directly to the TCP
socket is a big win in terms of absolute latency. I think there are
some more inefficiencies to tease out, but I'm unsure whether it can
be gotten below 1s on a small instance.
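A rough sketch of that last point, assuming pubsub holds the subscriber
sockets itself and skips the per-client processes entirely:

publish(Data, Sockets) ->
    lists:foreach(fun(Sock) -> gen_tcp:send(Sock, Data) end, Sockets).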

Joel Reymont

unread,
Jul 16, 2009, 4:41:27 PM7/16/09
to Cliff Moon, Erlang Users' List
Cliff,

Awesome work!

On Jul 16, 2009, at 9:31 PM, Cliff Moon wrote:

> I think there are some more inefficiencies to tease out, but I'm
> unsure whether it can be gotten below 1s on a small instance.


Have you tried splitting the bots over a couple more small ec2
instances?

Thanks, Joel

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 4:47:09 PM7/16/09
to Joel Reymont, Cliff Moon, Erlang Users' List, Will
Joel Reymont wrote:
>
> On Jul 16, 2009, at 6:18 PM, Cliff Moon wrote:
>
>> The problem you're seeing there is process scheduling overhead. In
>> the fun it sends a message and it's doing an ungodly amount of context
>> switching. The way to get around this is to bump the process priority
>> level of the process which iterates the ets table.

I don't think the pubsub causes a huge amount of context switching.
Sending a message in itself should not cause context switching. But
there is in fact a huge amount of context switching due to the 2 * 20k
middlemen.

This means that even if the pubsub is preempted a single time, it will
have to wait for a lot of switching before resuming the sending. Making
the pubsub priority high makes it complete the sending to everyone
before being preempted.

However, I still think that performance should improve noticeably if
there is one less hop of middlemen: it will be at least 20k less context
switches and 20k less messages.

Regards,
Paulo

Joel Reymont

unread,
Jul 16, 2009, 4:49:44 PM7/16/09
to Paulo Sérgio Almeida, Cliff Moon, Erlang Users' List, Will

On Jul 16, 2009, at 9:47 PM, Paulo Sérgio Almeida wrote:

> But there is in fact a huge amount of context switching due to the 2
> * 20k middlemen.

I used to have a version that used a closure with a socket to send
directly.

Consensus then was that a middleman process should be used :-).

> However, I still think that performance should improve noticeably if
> there is one less hop of middlemen: it will be at least 20k less
> context switches and 20k less messages.


There are no middlemen now, Cliff's version sends directly to the
socket.

Thanks, Joel

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Jim Morris

unread,
Jul 16, 2009, 5:01:23 PM7/16/09
to erlang-q...@erlang.org
Seeing as I have a similar server that can do 20,000 subscribers
easily under 1s (more like 600ms), I am trying to see where the major
differences are.

First, the change to a list of pids rather than an ets table makes the
broadcast loop almost the same as mine. I store all the subscribers in
a list and add/delete them as needed (an uncommon occurrence).

I also go through two gen_server procs before the tcp:send, so I don't
think that is a problem.

I do not use HiPE; I found it makes very little difference on my setup
(maybe 5% at best).

I think doing a TCP send in the loop will make it much slower, for
reasons I explained in the previous thread.

I do not bump the priority around my broadcast list comprehension,
although I may try that to see if it speeds my server up even more.

Other than the JSON munging there is no other major difference
anymore, apart from the way the stats are gathered, which may be the
source of the discrepancy. I'll need to look more closely at how you
do that. I do like the histogram approach, so I may incorporate that
into mine rather than the min/max/avg I do now.

Also, the way the test clients are set up is very different from the
way I do it, although I doubt that is significant. I basically have a
latency tester running in its own VM, and it spawns the given number
of TCP clients as regular processes, and they do very little other
than connect, and wait for messages, extract the timestamp from the
message, and store the results. When the latency tester stops it sends
a message to each client process to end, and they send their results
to the tester. Although I'll never have 20,000 clients in a single
topic, I need to have sub 10ms latency on every message with 200
subscribers, which I currently achieve handily.

This has been an interesting exercise; I look forward to seeing the
end results.

Joel Reymont

unread,
Jul 16, 2009, 5:25:04 PM7/16/09
to Jim Morris, erlang-q...@erlang.org
Mac, HiPE-compiled everything, including mochiweb, consistently ~1s.

This is the list version (no ets) that sends via 2 middleman processes.

=INFO REPORT==== 16-Jul-2009::22:20:59 ===
setup: 156250.41ms, good: 20000, bad: 0, run: 162741.96ms
1000.0000ms | 10868 - 54.34%
1500.0000ms | 8125 - 40.63%
1075.4410ms | max
ok
(de...@biggie.local)2> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::22:21:42 ===
setup: 17522.91ms, good: 20000, bad: 0, run: 23429.93ms
327.8270ms | min
500.0000ms | 2205 - 11.03%
1000.0000ms | 15385 - 76.92%
1500.0000ms | 2410 - 12.05%
1022.2170ms | max


ok
(de...@biggie.local)3> bot:test(flashbot, 20000).

=INFO REPORT==== 16-Jul-2009::22:22:10 ===
setup: 17570.93ms, good: 20000, bad: 0, run: 23337.27ms
320.6370ms | min
500.0000ms | 849 - 4.25%
1000.0000ms | 18066 - 90.33%
1500.0000ms | 1085 - 5.42%
1009.3740ms | max
ok

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Jim Morris

unread,
Jul 16, 2009, 5:25:38 PM7/16/09
to erlang-q...@erlang.org
Hmmm. Setting the priority higher around my broadcast loop actually
makes the average latency worse, not better, by about 8%. That's
rather interesting, as my theory for why broadcasting to a process
rather than directly to TCP is faster hinges on parallel execution of
system commands, and it would predict that outcome.

>
> I do not bump the priority around my broadcast list comprehension,
> although I may try that to see if it speeds my server up even more.
>


Joel Reymont

unread,
Jul 16, 2009, 6:16:34 PM7/16/09
to Paulo Sérgio Almeida, Cliff Moon, Erlang Users' List, Will
More fun with Erlang optimization...

bot:wait used to take 3 arguments and I thought I'd remove the last
two since they are not needed, as well as remove the subscribing,
connected and disconnected handling code. I then removed the sending
of these messages from flashbot.

The above trimming consistently bumped my latency from 1s to 2s on the
Mac with HiPE.

What gives?

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Paulo Sérgio Almeida

unread,
Jul 16, 2009, 7:24:26 PM7/16/09
to Joel Reymont, Cliff Moon, Erlang Users' List, Will
Joel Reymont wrote:

> There are no middlemen now, Cliff's version sends directly to the socket.

I did not suggest sending directly to the socket in the loop because it
is a bad idea: if there is a single slow client and gen_tcp:send
blocks, the remaining clients will suffer ...

It is important to have one middleman to make the service tolerant of
slow (or even malicious) clients. This is especially true for large
messages. The middleman should also use send_timeout, and should keep
the inbox small, retrieving all messages into a data structure before
doing a gen_tcp:send and possibly discarding messages if the client
cannot keep up with the message rate.
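A rough sketch of such a middleman (the 5000 ms send_timeout and the
drop-on-error policy are placeholders): drain the mailbox into one
batch so a burst becomes a single gen_tcp:send, and give up on clients
that time out.

middleman(Socket) ->
    ok = inet:setopts(Socket, [{send_timeout, 5000}]),
    loop(Socket).

loop(Socket) ->
    receive
        {send, Data} ->
            Batch = drain([Data]),
            case gen_tcp:send(Socket, Batch) of
                ok -> loop(Socket);
                {error, _} -> gen_tcp:close(Socket)  %% drop the slow client
            end
    end.

drain(Acc) ->
    receive
        {send, Data} -> drain([Data | Acc])
    after 0 ->
        lists:reverse(Acc)
    end.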

Regards,
Paulo

Cliff Moon

unread,
Jul 16, 2009, 7:30:27 PM7/16/09
to Erlang Users' List
The latest timings from EC2, this time with 4 instances distributing
the client load and a single server instance. The max latency hovers
just under 1s most of the time.

(de...@domU-12-31-39-03-C2-07.compute-1.internal)4> bot:test(flashbot,

20000, "domU-12-31-39-02-B5-46.compute-1.internal", 8081).
publishing

=INFO REPORT==== 16-Jul-2009::19:13:32 ===
setup: 39034.92ms, good: 20000, bad: 0, run: 5097.52ms
-29.2580ms | min
500.0000ms | 10224 - 51.12%
1000.0000ms | 9776 - 48.88%
965.6640ms | max
ok
(de...@domU-12-31-39-03-C2-07.compute-1.internal)5> bot:test(flashbot,

20000, "domU-12-31-39-02-B5-46.compute-1.internal", 8081).
publishing

=INFO REPORT==== 16-Jul-2009::19:14:30 ===
setup: 36596.29ms, good: 20000, bad: 0, run: 5070.07ms
-29.4500ms | min
500.0000ms | 10260 - 51.30%
1000.0000ms | 9740 - 48.70%
999.8180ms | max
ok
(de...@domU-12-31-39-03-C2-07.compute-1.internal)9> bot:test(flashbot,

20000, "domU-12-31-39-02-B5-46.compute-1.internal", 8081).
publishing

=INFO REPORT==== 16-Jul-2009::19:24:34 ===
setup: 36501.23ms, good: 20000, bad: 0, run: 5119.36ms
-19.9900ms | min
500.0000ms | 10796 - 53.98%
1000.0000ms | 9204 - 46.02%
979.9250ms | max
ok
(de...@domU-12-31-39-03-C2-07.compute-1.internal)4> bot:test(flashbot,

20000, "domU-12-31-39-02-B5-46.compute-1.internal", 8081).
publishing

=INFO REPORT==== 16-Jul-2009::19:27:16 ===
setup: 37879.43ms, good: 20000, bad: 0, run: 5192.95ms
-37.1400ms | min
500.0000ms | 10062 - 50.31%
1000.0000ms | 9938 - 49.69%
971.6150ms | max
ok

Jim Morris

unread,
Jul 16, 2009, 7:37:40 PM7/16/09
to erlang-q...@erlang.org
Sorry a quick look at the code didn't answer this question :)

How many messages are you timing? I usually send 10 or more messages
and average the time, or take the 95th percentile one, this may even
out the results a bit.

Also I run the server and do two runs of the clients, and time the
second one to eliminate any one time dynamic loading glitches. This
made run-to-run differences minimal.

On Jul 16, 3:16 pm, Joel Reymont <joe...@gmail.com> wrote:
> More fun with Erlang optimization...
>
> bot:wait used to take 3 arguments and I thought I'd remove the last  
> two since they are not needed, as well as remove the subscribing,  
> connected and disconnected handling code. I then removed the sending  
> of these messages from flashbot.
>
> The above trimming consistently bumped my latency from 1s to 2s on the  
> Mac with HiPE.
>
> What gives?
>
> ---

> Mac hacker with a performance bent
> http://www.linkedin.com/in/joelreymont

Cliff Moon

unread,
Jul 16, 2009, 10:00:39 PM7/16/09
to Paulo Sérgio Almeida, Joel Reymont, Erlang Users' List, Will
Your assertion about gen_tcp:send blocking is incorrect in this
particular case. In this test the async thread queue is enabled, which
allows the io subsystem to queue blocking IO operations and execute them
concurrently with the rest of the VM.

Scott Lystig Fritchie

unread,
Jul 16, 2009, 10:13:07 PM7/16/09
to Cliff Moon, Paulo Sérgio Almeida, Joel Reymont, Erlang Users' List, Will
Cliff Moon <cl...@moonpolysoft.com> wrote:

cm> Your assertion about gen_tcp:send blocking is incorrect in this
cm> particular case. In this test the async thread queue is enabled,
cm> which allows the io subsystem to queue blocking IO operations and
cm> execute them concurrently with the rest of the VM.

Cliff, I haven't looked deeply into the guts of the R13B VM, but IIRC
the R11 VM + inets driver didn't use the async thread pool. File I/O
yes, TCP/UDP I/O I don't think so. Independent (and preferably
authoritative :-) confirmation/refutation is welcome.

-Scott

Joel Reymont

unread,
Jul 17, 2009, 4:02:17 AM7/17/09
to Cliff Moon, Erlang Users' List

On Jul 17, 2009, at 12:30 AM, Cliff Moon wrote:

> The latest timings from ec2, this time with 4 servers distributing
> the client load and a single server. The max latency hovers just
> under 1s most of the time.

Cliff takes the $1000 bounty and the contest is over :-).

I am concerned about gen_tcp:send blocking on slow clients and having
at least one middleman process, but that does not take away from
Cliff's achievement. Woot!

Thanks, Joel

Paulo Sérgio Almeida

unread,
Jul 17, 2009, 5:03:40 AM7/17/09
to Scott Lystig Fritchie, Cliff Moon, Joel Reymont, Erlang Users' List, Will
Scott Lystig Fritchie wrote:
> Cliff Moon <cl...@moonpolysoft.com> wrote:
>
> cm> Your assertion about gen_tcp:send blocking is incorrect in this
> cm> particular case. In this test the async thread queue is enabled,
> cm> which allows the io subsystem to queue blocking IO operations and
> cm> execute them concurrently with the rest of the VM.
>
> Cliff, I haven't looked deeply into the guts of the R13B VM, but IIRC
> the R11 VM + inets driver didn't use the async thread pool. File I/O
> yes, TCP/UDP I/O I don't think so. Independent (and preferably
> authoritative :-) confirmation/refutation is welcome.

I have a feeling async threads are for file I/O, but even if they are
used for TCP, the problem still holds. They would tolerate transient
slowness. But if some client(s) are persistently slow and cannot keep
up, after the TCP window has been exhausted and all async threads have
also been exhausted, then it will block.

You need to do some flow control: possibly discarding messages and, in
the worst case, even disconnecting the client. And as it should be done
independently for each client, a per-client middleman process is the
best place to do it.

Regards,
Paulo

Joel Reymont

unread,
Jul 17, 2009, 5:10:39 AM7/17/09
to Paulo Sérgio Almeida, Scott Lystig Fritchie, Cliff Moon, Erlang Users' List, Will

On Jul 17, 2009, at 10:03 AM, Paulo Sérgio Almeida wrote:

> I have a feeling async threads are for File I/O, but even if they
> are used for TCP, the problem still holds. They would tolerate a
> transient slowness. But if some client(s) are persistently slow and
> cannot keep up, after the TCP window has been exausted and all async
> threads have also been exausted, then it will block.


Do you have any code that simulates this scenario?

This is what prim_inet:send looks like:

%% This is a generic "port_command" interface used by TCP, UDP, SCTP,
%% depending on the driver it is mapped to, and the "Data". It actually
%% sends out data,-- NOT delegating this task to any back-end. For SCTP,
%% this function MUST NOT be called directly -- use "sendmsg" instead:
%%
send(S, Data) when is_port(S) ->
    ?DBG_FORMAT("prim_inet:send(~p, ~p)~n", [S,Data]),
    try erlang:port_command(S, Data) of
        true ->
            receive
                {inet_reply,S,Status} ->
                    ?DBG_FORMAT("prim_inet:send() -> ~p~n", [Status]),
                    Status
            end
    catch
        error:_Error ->
            ?DBG_FORMAT("prim_inet:send() -> {error,einval}~n", []),
            {error,einval}
    end.

It looks like it will block waiting for a port reply.

Since the result of gen_tcp:send is ignored in the broadcasting loop,
perhaps erlang:port_command can be used directly on the socket to make
things completely asynchronous and avoid the need for a middleman
process.

I think solid proof is needed here rather than speculation.

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 17, 2009, 5:43:20 AM7/17/09
to Erlang Users' List

On Jul 17, 2009, at 10:10 AM, Joel Reymont wrote:

> Since the result of gen_tcp:send is ignored in the broadcasting
> loop, perhaps erlang:port_command can be used directly on the socket
> to make things completely asynchronous and avoid the need for a
> middleman process.


The inet_reply messages would need to be ignored by
pubsub:handle_info, of course.

No middleman then?
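A sketch of what that might look like (hypothetical; error handling
omitted):

%% Fire port_command at each socket without waiting for replies...
publish(Data, Sockets) ->
    [catch erlang:port_command(Sock, Data) || Sock <- Sockets],
    ok.

%% ...and swallow the asynchronous acknowledgements in pubsub.
handle_info({inet_reply, _Sock, _Status}, State) ->
    {noreply, State}.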

Jim Morris

unread,
Jul 17, 2009, 6:12:33 AM7/17/09
to erlang-q...@erlang.org
> No middleman then?

I think you are really asking for trouble, or at least a not very
robust server, if you go that route.

It is Socket programming 101 that writes can block when the TCP
receiver does not read and flow control causes the TCP buffers to
back up.

Additionally, read the docs for gen_tcp:send; at the end, in the
example section, it states...

"The fact that the send call does not accept a timeout option, is
because timeouts on send is handled through the socket option
send_timeout. The behavior of a send operation with no receiver is in
a very high degree defined by the underlying TCP stack, as well as the
network infrastructure. If one wants to write code that handles a
hanging receiver that might eventually cause the sender to hang on a
send call, one writes code like the following. "

Although it does not explicitly state that gen_tcp:send can block, I
think it is common knowledge that it does, and the explanation of how
to handle timeouts would indicate that too.

Again just my 0.02c worth.

Joel Reymont

unread,
Jul 17, 2009, 6:19:37 AM7/17/09
to Jim Morris, erlang-q...@erlang.org
Jim,

On Jul 17, 2009, at 11:12 AM, Jim Morris wrote:

> It is Socket programming 101 that writes can block when the TCP
> receiver does not read and the flow control causes the TCP buffers to
> backup.

This all makes sense. I will redo the code using just 1 middleman
process then.

Now, if we could just figure out why your server is so much faster ;-).

Thanks, Joel

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Sverker Eriksson

unread,
Jul 17, 2009, 6:35:39 AM7/17/09
to Erlang Users' List
Joel Reymont wrote:
> Since the result of gen_tcp:send is ignored in the broadcasting loop,
> perhaps erlang:port_command can be used directly on the socket to make
> things completely asynchronous and avoid the need for a middleman
> process.
>
> I think solid proof is needed here rather than speculation.
>
Calling erlang:port_command directly will not make things completely
asynchronous. The inet driver will suspend the calling process when the
send queue has reached its high watermark. You will block in
erlang:port_command until some sending has succeeded and the send queue
is below a low watermark. Or you can time out using the send_timeout
option.

/Sverker, Erlang/OTP

Ulf Wiger

unread,
Jul 17, 2009, 6:39:16 AM7/17/09
to Joel Reymont, Jim Morris, erlang-q...@erlang.org
Joel Reymont wrote:

> Jim,
>
> On Jul 17, 2009, at 11:12 AM, Jim Morris wrote:
>
>> It is Socket programming 101 that writes can block when the TCP
>> receiver does not read and the flow control causes the TCP buffers to
>> backup.
>
> This all makes sense. I will redo the code using just 1 middleman
> process then.

You may want to consider the following passage from the inet man page
on setopts/2:

{delay_send, Boolean}
Normally, when an Erlang process sends to a socket, the driver will try
to immediately send the data. If that fails, the driver will use any
means available to queue up the message to be sent whenever the
operating system says it can handle it. Setting {delay_send, true} will
make all messages queue up. This makes the messages actually sent onto
the network be larger but fewer. The option actually affects the
scheduling of send requests versus Erlang processes instead of changing
any real property of the socket. Needless to say it is an implementation
specific option. Default is false.

I'm not sure if setting {delay_send, true} would have any beneficial
effect on your particular problem, but the text does shed some light
on the semantics of sending to a socket.
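For reference, enabling it is a plain setopts call on the connected
socket (sketch):

ok = inet:setopts(Socket, [{delay_send, true}]).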

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com

Joel Reymont

unread,
Jul 17, 2009, 6:46:03 AM7/17/09
to Sverker Eriksson, Erlang Users' List
The challenge then would seem to be getting under 1s and 2s with a
middleman process, to avoid blocking the send loop. Possible?

On Jul 17, 2009, at 11:35 AM, Sverker Eriksson wrote:

> Calling erlang:port_command directly will not make things completely
> asynchronous. The inet driver will suspend the calling process when
> the send queue has reached its high watermark. You will block in
> erlang:port_command until some sending has succeeded and the send
> queue is below a low watermark. Or you can time out using the
> send_timeout option.

---


Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 17, 2009, 7:03:26 AM7/17/09
to Ulf Wiger, Jim Morris, erlang-q...@erlang.org

On Jul 17, 2009, at 11:39 AM, Ulf Wiger wrote:

> {delay_send, Boolean}
> Normally, when an Erlang process sends to a socket, the driver will
> try to immediately send the data. If that fails, the driver will use
> any means available to queue up the message to be sent whenever the
> operating system says it can handle it. Setting {delay_send, true}
> will make all messages queue up.

On EC2 with 3 bot instances I seem to be doing ~200-300ms better with
delay_send enabled and sending directly to the socket. Could be just a
fluke.

=INFO REPORT==== 17-Jul-2009::10:59:17 ===
setup: 32944.80ms, good: 20000, bad: 0, run: 37288.00ms
-60.8130ms | min
500.0000ms | 4992 - 24.96%
1000.0000ms | 4652 - 23.26%
1500.0000ms | 5312 - 26.56%
2000.0000ms | 3914 - 19.57%
2500.0000ms | 1130 - 5.65%
2363.9200ms | max
ok


(de...@ip-10-244-145-238.ec2.internal)4> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 17-Jul-2009::11:00:01 ===
setup: 30712.31ms, good: 20000, bad: 0, run: 35419.80ms
-61.9370ms | min
500.0000ms | 5376 - 26.88%
1000.0000ms | 4844 - 24.22%
1500.0000ms | 5187 - 25.93%
2000.0000ms | 3358 - 16.79%
2500.0000ms | 1235 - 6.17%
2472.4190ms | max


ok
(de...@ip-10-244-145-238.ec2.internal)5> bot:test(flashbot, 20000,
'ip-10-244-47-97', 8081).

=INFO REPORT==== 17-Jul-2009::11:00:49 ===
setup: 30307.11ms, good: 20000, bad: 0, run: 35555.13ms
-63.3260ms | min
500.0000ms | 6321 - 31.61%
1000.0000ms | 5827 - 29.14%
1500.0000ms | 5422 - 27.11%
2000.0000ms | 1438 - 7.19%
2500.0000ms | 992 - 4.96%
2318.4370ms | max
ok

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 17, 2009, 7:46:19 AM7/17/09
to Jim Morris, Erlang Users' List, Alexis Richardson, Matthew Sackman

On Jul 17, 2009, at 11:12 AM, Jim Morris wrote:

>> No middleman then?
>
> I think you are really asking for trouble, or at least a not very
> robust server, if you go that route.
>
> It is Socket programming 101 that writes can block when the TCP
> receiver does not read and the flow control causes the TCP buffers to
> backup.


RabbitMQ 1.6 (bottom of rabbit_writer.erl) uses erlang:port_command
directly.

The premise is that selective receive of inet_reply messages in
gen_tcp:send can be very slow when the sender's message queue is large.

---
Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Joel Reymont

unread,
Jul 17, 2009, 8:13:08 AM7/17/09
to Sverker Eriksson, Erlang Users' List
Can we have a final expert word on this from the OTP team?

Must we use a middleman process or can we combine {delay_send, true}
with directly sending to the socket when unicasting to 20k clients?

Are there other tweaks to apply apart from raising priority to high
around the broadcasting ets:foldr?

This dispute must be settled as soon as possible for the benefit of
future generations!

On Jul 17, 2009, at 11:35 AM, Sverker Eriksson wrote:

> Joel Reymont wrote:
>> Since the result of gen_tcp:send is ignored in the broadcasting
>> loop, perhaps erlang:port_command can be used directly on the
>> socket to make things completely asynchronous and avoid the need
>> for a middleman process.
>>
>> I think solid proof is needed here rather than speculation.
>>
> Calling erlang:port_command directly will not make things completely
> asynchronous. The inet driver will suspend the calling process when
> the send queue has reached its high watermark. You will block in
> erlang:port_command until some sending has succeeded and the send
> queue is below a low watermark. Or you can time out using the
> send_timeout option.

---


Mac hacker with a performance bent
http://www.linkedin.com/in/joelreymont

Paulo Sérgio Almeida

unread,
Jul 17, 2009, 9:59:30 AM7/17/09
to Joel Reymont, Sverker Eriksson, Erlang Users' List
Joel Reymont wrote:
> Can we have a final expert word on this from the OTP team?
>
> Must we use a middleman process or can we combine {delay_send, true}
> with directly sending to the socket when unicasting to 20k clients?

I am no such expert, but there is a simple universal rule: there are
no miracles. If a client stops receiving, then either:

- you discard messages to that client or disconnect it, or
- the messages are going to be buffered somewhere (TCP stack, message
queues, data structures), memory consumption will grow without bound,
and eventually the VM crashes out of memory.

You cannot escape doing some flow control. I said a middleman was the
best place to do it, as in the most convenient place. If you are
desperate and want to do it with less convenience and more trouble, I
guess the best way to obtain low latency would be the following (a
rough sketch follows below):

- use erlang:port_command to write to the socket; do the loop without
waiting for the reply messages;
- for each client, keep a counter of outstanding replies from the port;
update it as replies arrive; use ets to store this info and
ets:update_counter;
- if that counter reaches a given value, start discarding messages
(i.e. ignore that client in the send loop); for this you can use
ets:select with a condition on the counter to obtain the list of ports
to use in the loop.

That's it. Have fun.
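A rough sketch of that recipe (hypothetical names; the threshold of 100
outstanding replies is arbitrary):

%% Counters live in an ETS table of {Socket, Outstanding} entries.
publish(Data, Tab) ->
    %% only sockets with fewer than 100 unacknowledged writes
    Socks = ets:select(Tab, [{{'$1', '$2'}, [{'<', '$2', 100}], ['$1']}]),
    lists:foreach(fun(Sock) ->
                          erlang:port_command(Sock, Data),
                          ets:update_counter(Tab, Sock, 1)
                  end, Socks).

%% As the driver acknowledges writes, decrement the counter.
handle_info({inet_reply, Sock, _Status}, State) ->
    catch ets:update_counter(State#state.subs, Sock, -1),
    {noreply, State}.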
Paulo

Joel Reymont

unread,
Jul 18, 2009, 2:12:26 AM7/18/09
to Erlang Users' List
Folks,

Janus has been renamed to Hot Wheels and now lives at
http://github.com/wagerlabs/hotwheels/tree/master.

The goal is still to get a broadcast to 40k users in < 1s.

Thanks, Joel

Witold Baryluk

unread,
Jul 23, 2009, 4:30:42 PM7/23/09
to Ulf Wiger, Joel Reymont, Jim Morris, erlang-q...@erlang.org
On Fri 2009-07-17 at 12:39 +0200, Ulf Wiger wrote:

There is also a similar way of doing this under Linux. Linux's TCP
sockets have an option called TCP_CORK, which is essentially delay_send
but in kernel space. See the tcp(7) man page for details and other
useful socket options.

This is the way I access this option in Erlang:

cork(TcpSocket) ->
    inet:setopts(TcpSocket, [{raw, 6, 3, <<1:32/native>>}]).

uncork(TcpSocket) ->
    inet:setopts(TcpSocket, [{raw, 6, 3, <<0:32/native>>}]).
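Usage would presumably be to cork around a burst of writes and uncork
to flush them as fewer, larger packets (sketch):

send_burst(Socket, Packets) ->
    cork(Socket),
    lists:foreach(fun(P) -> gen_tcp:send(Socket, P) end, Packets),
    uncork(Socket).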

Witek

--
Witold Baryluk <bar...@smp.if.uj.edu.pl>
