[erlang-questions] Improve performance of IO bounded server written in Erlang via having pollset for each scheduler and bind port to scheduler together with process

Wei Cao

Jul 9, 2012, 5:01:12 AM

to erlang-q...@erlang.org

Hi, all

We wrote a proxy server in Erlang. The proxy's logic is rather simple:
it listens on a TCP port, accepts new connections from user clients,
and, after authentication, forwards packets back and forth between the
client and the backend server until the connection is closed.

It's very easy to write such a proxy in Erlang: spawn a process for
each new user connection and connect to the backend server in the same
process. The process works like a pipe: the sockets on both sides are
set to {active, once} mode, and whenever a TCP packet is received from
one socket, it is sent to the other socket. (A simplified version
of the proxy code is attached at the end of this mail.)

However, the performance is not quite satisfying: the proxy can handle
at most 40k requests per second on our 16-core machine (Intel Xeon L5630,
2.13GHz) under heavy stress (100 concurrent clients). We then analyzed
the behavior of beam.smp using tools like tcprstat, mpstat, perf, and
SystemTap.

tcprstat shows that QPS is about 40k, with an average latency of
1.7 milliseconds per request.

timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
1341813326 39416 17873 953 1718 1519 724 2919 1609 340 3813 1674 462
1341813327 40528 9275 884 1645 1493 491 2777 1559 290 3508 1619 409
1341813328 40009 18105 925 1694 1507 714 2868 1586 328 3753 1650 450

mpstat shows 30% CPU is idle,

03:30:19 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
03:30:20 PM  all  38.69   0.00  21.92    0.00   0.06   7.52   0.00   0.00  31.80
03:30:21 PM  all  37.56   0.00  21.99    0.00   0.00   7.50   0.00   0.00  32.95

and perf top shows, much time is wasted in scheduler_wait, in spin wait I guess.

9320.00 19.8% scheduler_wait  /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
1813.00  3.9% process_main    /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp
1379.00  2.9% _spin_lock      /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
1201.00  2.6% schedule        /home/mingsong.cw/erlangr16b/lib/erlang/erts-5.10/bin/beam.smp

It seems the performance problem may be related to scheduler_wait() and
erts_check_io(). With a SystemTap script (attached at the end of this
mail), we can find out how many times per second beam.smp invokes the
epoll_wait system call and how many revents it gets each time.

 cpu   process   times  revents  min  max  avg  timeouts
 all              1754   128042    -    -   73         3
[14]  beam.smp     151    14127   82   97   93         0
[ 5]  beam.smp     142    13291   83   97   93         0
[13]  beam.smp     127    11948   86   96   94         0
[ 6]  beam.smp     127    11836   81   96   93         0
[ 4]  beam.smp     121    11323   81   96   93         0
[15]  beam.smp     117    10935   83   96   93         0
[12]  beam.smp     486    10128    0   96   20         2
[ 1]  beam.smp      71     6549   71  100   92         0
[ 2]  beam.smp      62     5695   82   96   91         0
[ 7]  beam.smp      55     5102   81   95   92         0
[11]  beam.smp      52     4822   85   95   92         0
[ 9]  beam.smp      52     4799   85   95   92         0
[ 8]  beam.smp      51     4680   78   95   91         0
[10]  beam.smp      49     4508   85   97   92         0
[ 3]  beam.smp      46     4211   81   95   91         0
[ 0]  beam.smp      44     4088   83   95   92         0

The results show that epoll_wait is invoked 1754 times per second and
gets 73 I/O events per call on average. This is unacceptable for a
high-performance server: if epoll_wait is invoked no more than 2k
times per second, a packet can wait 500 microseconds or more
(1 s / 2000) before it is read or written, which causes long delays
and ultimately hurts throughput.
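
The arithmetic behind these figures can be reproduced from the "all" row
of the SystemTap output. The following is only a sketch: the module and
function names are made up for illustration, and the "mean gap" is simply
one second divided by the number of epoll_wait calls, which is a rough
proxy for how long a ready socket may sit unnoticed.

```erlang
-module(poll_stats).
-export([summary/2]).

%% Calls and Events are the per-second totals printed in the "all" row
%% of the epollwait.stp output (columns "times" and "revents").
summary(Calls, Events) when Calls > 0 ->
    #{events_per_call => Events div Calls,      %% integer average, as stap prints it
      mean_gap_us     => 1000000 div Calls}.    %% microseconds between two calls
```

With the unpatched numbers, poll_stats:summary(1754, 128042) gives 73
events per call and a gap of 570 microseconds between calls.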

The problem lies in there being only one global pollset system-wide,
so at any time at most one scheduler can call erts_check_io() to
obtain pending I/O tasks from the underlying pollset, and no scheduler
can call erts_check_io() again before all pending I/O tasks are
processed. So for an I/O-bound application, it's very likely that a
scheduler finishes its own job but must then wait idly for the other
schedulers to complete theirs.

Hence, we developed a patch to solve this problem by having a pollset
for each scheduler, so that each scheduler can invoke erts_check_io()
on its own pollset concurrently. After a scheduler completes its tasks,
it can invoke erts_check_io() immediately, no matter what state the
other schedulers are in. This patch also handles port migration:
all file descriptors used by each port are recorded, and when a port is
migrated, these fds are removed from the original scheduler's pollset
and added to the new scheduler's.
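
The bookkeeping for port migration can be pictured with a toy model. This
is not the ERTS code (the patch itself is C inside the emulator); it is a
minimal Erlang sketch in which a "pollset" is just a set of fds keyed by
scheduler id, and every name in it is hypothetical.

```erlang
-module(pollset_model).
-export([new/1, add_fd/3, migrate_port/4, fds/2]).

%% Toy model only: one set of file descriptors per scheduler id.
new(NumSchedulers) ->
    maps:from_list([{I, sets:new()} || I <- lists:seq(1, NumSchedulers)]).

%% Register Fd in the pollset of scheduler Sched.
add_fd(Fd, Sched, Pollsets) ->
    maps:update_with(Sched, fun(S) -> sets:add_element(Fd, S) end, Pollsets).

%% Move every fd recorded for a port from one scheduler's pollset
%% to another's, as described for a migrating port.
migrate_port(PortFds, From, To, Pollsets) ->
    lists:foldl(
      fun(Fd, Acc0) ->
              Acc1 = maps:update_with(
                       From, fun(S) -> sets:del_element(Fd, S) end, Acc0),
              maps:update_with(
                To, fun(S) -> sets:add_element(Fd, S) end, Acc1)
      end, Pollsets, PortFds).

%% Sorted list of fds currently in Sched's pollset.
fds(Sched, Pollsets) ->
    lists:sort(sets:to_list(maps:get(Sched, Pollsets))).
```

The point of the model is that each scheduler only ever polls its own
set, so the cost of migration is one removal and one insertion per fd.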

Binding a port to a scheduler together with its process also helps
performance: it reduces the cost of thread switches and
synchronization, and a bound port won't be migrated between schedulers.

After applying the two patches, with the same pressure (100 concurrent
clients), epoll_wait is invoked 49332 times per second and gets about 4
revents per call on average; that is to say, our server responds
more quickly and becomes more real-time.

 cpu   process   times  revents  min  max  avg  timeouts
 all             49332   217186    -    -    4         3
[ 2]  beam.smp    3219    16050    2    7    4         0
[11]  beam.smp    4275    16033    1    6    3         0
[ 8]  beam.smp    4240    15992    1    6    3         0
[ 9]  beam.smp    4316    15964    0    6    3         2
[10]  beam.smp    4139    15851    1    6    3         0
[ 3]  beam.smp    4256    15816    1    6    3         0
[ 1]  beam.smp    3107    15800    2    7    5         0
[ 0]  beam.smp    3727    15259    1    6    4         0
[ 7]  beam.smp    2810    14722    3    7    5         0
[13]  beam.smp    1981    11499    4    7    5         0
[15]  beam.smp    2285    10942    3    6    4         0
[14]  beam.smp    2258    10866    3    6    4         0
[ 4]  beam.smp    2246    10849    3    6    4         0
[ 6]  beam.smp    2206    10730    3    6    4         0
[12]  beam.smp    2173    10573    3    6    4         0
[ 5]  beam.smp    2093    10240    3    6    4         0

scheduler_wait no longer takes so much time now:

169.00 6.2% process_main beam.smp
55.00 2.0% _spin_lock [kernel]
45.00 1.7% driver_deliver_term beam.smp

and idle CPU time has dropped as well:

04:30:44 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
04:30:45 PM  all  60.34   0.00  21.44    0.00   0.06  16.45   0.00   0.00   1.71
04:30:46 PM  all  60.99   0.00  21.22    0.00   0.00  16.26   0.00   0.00   1.52

and tcprstat shows QPS reaching 100k, with an average latency below 1 millisecond:

timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
1341822689 96078 11592 314 910 817 311 1447 869 228 1777 897 263
1341822690 100651 24245 209 914 819 381 1458 870 229 1800 899 263

I also wrote an extremely simple keep-alive HTTP server (attached at the
end of this mail) to compare performance before and after applying the
patches. Measured with the Apache ab tool (ab -n 1000000 -c 100 -k ), a
performance gain of about 30% can be obtained.

before
Requests per second: 103671.70 [#/sec] (mean)
Time per request: 0.965 [ms] (mean)

after
Requests per second: 133701.24 [#/sec] (mean)
Time per request: 0.748 [ms] (mean)
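
The relative gain can be checked directly from the two "Requests per
second" figures. A minimal sketch (the module name is hypothetical):

```erlang
-module(gain).
-export([percent/2]).

%% Relative improvement of After over Before, in percent.
percent(Before, After) when Before > 0 ->
    (After - Before) / Before * 100.
```

Calling gain:percent(103671.70, 133701.24) yields a value just under 29,
which is what the "30%" above rounds to.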

Patches can be found on GitHub; compile with
./configure CFLAGS=-DERTS_POLLSET_PER_SCHEDULER

Pollset per scheduler:

git fetch git://github.com/weicao/otp.git pollset_per_scheduler

https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler
https://github.com/weicao/otp/compare/weicao:master...weicao:pollset_per_scheduler.patch

Bind port to scheduler:

git fetch git://github.com/weicao/otp.git bind_port_to_scheduler

https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler
https://github.com/weicao/otp/compare/weicao:pollset_per_scheduler...weicao:bind_port_to_scheduler.patch

Appendix:

-----------------------------------
proxy.erl
------------------------------------

-module(proxy).
-compile(export_all).

-define(RECBUF_SIZE, 8192).
-define(ACCEPT_TIMEOUT, 2000).

%% Arguments are passed as a list of atoms.
start([MyPortAtom, DestIpAtom, DestPortAtom]) ->
    {MyPort, []} = string:to_integer(atom_to_list(MyPortAtom)),
    DestIp = atom_to_list(DestIpAtom),
    {DestPort, []} = string:to_integer(atom_to_list(DestPortAtom)),
    listen(MyPort, DestIp, DestPort),
    receive Any ->
        io:format("recv ~p~n", [Any])
    end.

listen(MyPort, DestIp, DestPort) ->
    io:format("start proxy on [local] 0.0.0.0:~p -> [remote] ~p:~p~n",
              [MyPort, DestIp, DestPort]),
    case gen_tcp:listen(MyPort,
                        [inet,
                         {ip, {0,0,0,0}},
                         binary,
                         {reuseaddr, true},
                         {recbuf, ?RECBUF_SIZE},
                         {active, false},
                         {nodelay, true}
                        ]) of
        {ok, Listen} ->
            %% Start one acceptor per scheduler.
            N = erlang:system_info(schedulers),
            lists:foreach(fun(I) -> accept(Listen, DestIp, DestPort, I) end,
                          lists:seq(1, N));
        {error, Reason} ->
            io:format("error listen ~p~n", [Reason])
    end.

%% Spawn an acceptor process bound to scheduler I.
accept(Listen, DestIp, DestPort, I) ->
    spawn_opt(?MODULE, loop, [Listen, DestIp, DestPort, I], [{scheduler, I}]).

loop(Listen, DestIp, DestPort, I) ->
    case gen_tcp:accept(Listen, ?ACCEPT_TIMEOUT) of
        {ok, S1} ->
            %% Start a replacement acceptor on the same scheduler, then
            %% serve this connection in the current process.
            accept(Listen, DestIp, DestPort, I),
            case catch gen_tcp:connect(DestIp, DestPort,
                                       [inet, binary, {active, false},
                                        {reuseaddr, true}, {nodelay, true}]) of
                {ok, S2} ->
                    io:format("new connection~n"),
                    loop1(S1, S2);
                {error, Reason} ->
                    io:format("error connect ~p~n", [Reason])
            end;
        {error, timeout} ->
            loop(Listen, DestIp, DestPort, I);
        Error ->
            io:format("error accept ~p~n", [Error]),
            accept(Listen, DestIp, DestPort, I)
    end.

%% Pipe data between the client socket S1 and the backend socket S2.
loop1(S1, S2) ->
    active(S1, S2),
    receive
        {tcp, S1, Data} ->
            gen_tcp:send(S2, Data),
            loop1(S1, S2);
        {tcp, S2, Data} ->
            gen_tcp:send(S1, Data),
            loop1(S1, S2);
        {tcp_closed, S1} ->
            io:format("S1 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2);
        {tcp_closed, S2} ->
            io:format("S2 close~n"),
            gen_tcp:close(S1),
            gen_tcp:close(S2)
    end.

%% Re-arm both sockets so that each delivers one more message.
active(S1,S2)->
    inet:setopts(S1, [{active, once}]),
    inet:setopts(S2, [{active, once}]).

-----------------------------------
epollwait.stp
------------------------------------
#! /usr/bin/env stap
#
#

global epoll_timeout_flag, epoll_count, epoll_min, epoll_max,
       epoll_times, epoll_timeouts

probe syscall.epoll_wait {
    if(timeout > 0) {
        epoll_timeout_flag[pid()] = 1
    }
}

probe syscall.epoll_wait.return {
    c = cpu()
    p = execname()

    epoll_times[c,p] ++
    epoll_count[c,p] += $return

    if($return == 0 && pid() in epoll_timeout_flag) {
        epoll_timeouts[c,p] ++
        delete epoll_timeout_flag[pid()]
    }

    if(!([c, p] in epoll_min)) {
        epoll_min[c,p] = $return
    } else if($return < epoll_min[c,p]) {
        epoll_min[c,p] = $return
    }

    if($return > epoll_max[c,p]) {
        epoll_max[c,p] = $return
    }
}

probe timer.s($1) {
    printf ("%4s %45s %10s %10s %10s %10s %10s %10s\n", "cpu",
            "process", "times", "revents", "min", "max", "avg", "timeouts" )
    foreach ([cpu, process] in epoll_count- limit 20) {
        all_epoll_times += epoll_times[cpu,process]
        all_epoll_count += epoll_count[cpu,process]
        all_epoll_timeouts += epoll_timeouts[cpu,process]
    }
    printf ("%4s %45s %10d %10d %10s %10s %10d %10d\n",
            "all", "", all_epoll_times, all_epoll_count, "-", "-",
            all_epoll_count == 0? 0:all_epoll_count/all_epoll_times, all_epoll_timeouts)

    foreach ([cpu, process] in epoll_count- limit 20) {
        printf ("[%2d] %45s %10d %10d %10d %10d %10d %10d\n",
                cpu, process, epoll_times[cpu, process], epoll_count[cpu, process],
                epoll_min[cpu, process], epoll_max[cpu, process],
                epoll_count[cpu,process]/epoll_times[cpu,process],
                epoll_timeouts[cpu,process])
    }
    delete epoll_count
    delete epoll_min
    delete epoll_max
    delete epoll_times
    delete epoll_timeouts
    printf ("--------------------------------------------------------------------------\n\n")
}

------------------------------------------------
ehttpd.erl
-------------------------------------------------
-module(ehttpd).
-compile(export_all).

start() ->
    start(8888).
start(Port) ->
    N = erlang:system_info(schedulers),
    listen(Port, N),
    io:format("ehttpd ready with ~b schedulers on port ~b~n", [N, Port]),

    register(?MODULE, self()),
    receive Any -> io:format("~p~n", [Any]) end. %% to stop: ehttpd!stop.

listen(Port, N) ->
    Opts = [inet,
            binary,
            {active, false},
            {recbuf, 8192},
            {nodelay,true},
            {reuseaddr, true}],

    {ok, S} = gen_tcp:listen(Port, Opts),
    %% One acceptor per scheduler, each spawned with {scheduler, I}.
    lists:foreach(fun(I)-> spawn_opt(?MODULE, accept, [S, I],
                                     [{scheduler, I}]) end, lists:seq(1, N)).

accept(S, I) ->
    case gen_tcp:accept(S) of
        {ok, Socket} ->
            %% Start a replacement acceptor, then serve this connection.
            spawn_opt(?MODULE, accept, [S, I],[{scheduler,I}] ),
            io:format("new connection @~p~n", [I]),
            loop(Socket,<<>>);
        Error -> erlang:error(Error)
    end.

loop(S,B) ->
    inet:setopts(S, [{active, once}]),
    receive
        {tcp, S, Data} ->
            B1 = <<B/binary, Data/binary>>,
            Size = byte_size(B1),
            %% A request is complete once the buffer ends with an empty
            %% line; the size check avoids a badarg on short buffers.
            case Size >= 4 andalso binary:part(B1, {Size, -4}) of
                <<"\r\n\r\n">> ->
                    Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\n"
                                 "Connection: Keep-Alive\r\n\r\nhello world!">>,
                    gen_tcp:send(S, Response),
                    loop(S, <<>>);
                _ ->
                    loop(S, B1)
            end;
        {tcp_closed, S} ->
            io:format("connection closed forced~n"),
            gen_tcp:close(S);
        Error ->
            io:format("unexpected message~p~n", [Error]),
            Error
    end.

--

Best,

Wei Cao
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Zabrane Mickael

Jul 9, 2012, 7:14:33 PM

to Wei Cao, erlang-q...@erlang.org

Hi,

Performance of our HTTP Web Server drops down after applying your patches.

Box: Linux F17, 4GB of RAM:

$ lscpu
Architecture:          i686
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              7
CPU MHz:               2499.772
BogoMIPS:              4999.54
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              3072K

Bench:

before: 77787 rps

after: 53056 rps

Any hint to explain this result Wei ?

Regards,

Zabrane

Wei Cao

Jul 9, 2012, 10:30:01 PM

to Zabrane Mickael, erlang-q...@erlang.org

My tests all used keep-alive/persistent connections, and there was a
significant performance gain in those situations. But I just wrote an
HTTP server with short-lived connections, and the performance did drop.
I'll look into it later :).

2012/7/10 Zabrane Mickael <zabr...@gmail.com>:

Zabrane Mickael

Jul 10, 2012, 5:08:13 AM

to Wei Cao, erlang-q...@erlang.org

On Jul 10, 2012, at 4:30 AM, Wei Cao wrote:

> My tests were all keepalived/persistent connections, and there was
> significant performance gain in these situations.

Of course I enabled keepalive on my test.

> but I wrote a http server with short connection just now, the performance did drop down,
> I'll find it out later :).

Great. Let me know if you make any progress.

Regards,

Zabrane

erlang

Jul 10, 2012, 5:14:54 PM

to erlang-q...@erlang.org

Hi, all (first time)
To start, I'm sorry for my English.
I think the problem is in all the lines where you use "io:format".
In my system, when I started writing Erlang, I used io:format many times
for debugging. It was a big bottleneck.

JanM

On 2012-07-09 11:01, Wei Cao wrote:

Zabrane Mickael

Jul 10, 2012, 5:36:56 PM

to erlang, erlang-q...@erlang.org

I don't think so!

When I tested Wei's code the first time, I removed all io:format statements and made
the inner loop a lot simpler. Something like this:

[...]
loop(S,B) ->
    inet:setopts(S, [{active, once}]),
    receive
        {tcp, S, _Data} ->
            Response = <<"HTTP/1.1 200 OK\r\nContent-Length: 12\r\nConnection: Keep-Alive\r\n\r\nhello world!">>,
            gen_tcp:send(S, Response),
            loop(S, <<>>)
[...]

It didn't change anything in my case.

Hope this helps!

Regards,

Zabrane

Zabrane Mickael

Jul 10, 2012, 8:24:53 PM

to Wei Cao, erlang-q...@erlang.org

Hi Wei,

I did some other tests on our 8-Cores 64-bit machine:

before: 157141 rps
after: 80245 rps

Same behaviour as before ... hope this help.

Regards,
Zabrane

On Jul 10, 2012, at 4:30 AM, Wei Cao wrote:

Wei Cao

Jul 11, 2012, 2:44:49 AM

to Zabrane Mickael, erlang-q...@erlang.org

Found the reason :) Please compile with

./configure CFLAGS="-DERTS_POLLSET_PER_SCHEDULER -g -O3 -fomit-frame-pointer"

Otherwise compiler optimization is disabled. (Running ./configure with no
CFLAGS set includes "-g -O3 -fomit-frame-pointer" by default.)

2012/7/11 Zabrane Mickael <zabr...@gmail.com>:

Zabrane Mickael

Jul 11, 2012, 3:12:04 PM

to Wei Cao, Erlang Questions

Hi Wei,

On Jul 11, 2012, at 2:35 PM, Wei Cao wrote:

> sure, the steps is correct

I re-installed everything from scratch with your second patch and tested your ehttpd web server example.

before: ~55K rps

after: ~70K rps

but was unable to reach the 100K rps.

Anyone courageous enough to help us reach the 100K rps?

Regards,

Zabrane

Ronny Meeus

Jul 11, 2012, 4:35:36 PM

to Zabrane Mickael, Erlang Questions


I think the feature that certain processes can be bound to a certain
CPU at application level would also be useful.
The solution implemented here only works (if I understand it well)
when a process is using socket communication.

There are scenarios where tasks are using just a little load (for
example short processing of a message) where the overhead introduced
by the scheduler is large so that nothing is gained by switching to a
multi-core processor (in fact single core runs much faster).
If certain processes can be grouped at application level and bound
to specific cores, the application scales a lot better.
The other solution would be to run a separate VM instance per core but
I have the feeling that this is more complex to manage and that there
is also more overhead if messages need to be sent between the
applications running on different VMs.

This is a "taskset"-like primitive of the kind we know from the Linux thread world.
For example, in the thread "Strange observation running in an SMP
environment." on this mailing list, it would certainly be beneficial.

Best regards,
Ronny

Wei Cao

Jul 11, 2012, 9:48:09 PM

to Zabrane Mickael, Erlang Questions

We can reach 135k rps on a 16-core machine, so it's quite reasonable to
have 70k rps on an 8-core machine.

lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               2134.000
BogoMIPS:              4266.58
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15

2012/7/12 Zabrane Mickael <zabr...@gmail.com>:

Zabrane Mickael

Jul 12, 2012, 2:19:16 AM

to Wei Cao, Erlang Questions

Hi Wei,

We already surpassed the 100krps on an 8-cores machine with our HTTP server (~150K rps).

My question was: could we reach the 100K rps on a 4-cores machine with ehttpd ?

That will be awesome.

Regards,

Zabrane

Max Lapshin

Jul 12, 2012, 2:21:17 AM

to Zabrane Mickael, Erlang Questions

Can you post the code for the benchmarks? Or is it here in the thread and I've missed it?

Zabrane Mickael

Jul 12, 2012, 3:37:08 AM

to Max Lapshin, Erlang Questions

Hi Max,

On Jul 12, 2012, at 8:21 AM, Max Lapshin wrote:

> Can You post code for benchmarks? Or it is here in topic and I've missed it?

The ehttpd code was sent by Wei (first email). Here it is again (attached).

You also need his patched version of the VM. More precisely, the "bind_port_to_scheduler" patch:

git clone git://github.com/weicao/otp.git

cd otp

git fetch git://github.com/weicao/otp.git bind_port_to_scheduler

./otp_build autoconf

make clean

CFLAGS="-DERTS_POLLSET_PER_SCHEDULER -g -O3 -fomit-frame-pointer" ./configure --prefix=/SOMEWHERE/usr

make && make install

Then:

export PATH=/SOMEWHERE/usr/bin:$PATH

Regards,

Zabrane

ehttpd.erl

Wei Cao

Jul 12, 2012, 4:45:14 AM

to Zabrane Mickael, Erlang Questions

Hi,

2012/7/12 Zabrane Mickael <zabr...@gmail.com>:
> Hi Wei,
>

> We already surpassed the 100krps on an 8-cores machine with our HTTP server
> (~150K rps).

Which Erlang version did you use to get ~150k rps on the 8-core machine,
patched or unpatched? If it was measured on an unpatched Erlang
version, would you mind measuring it on the patched version and letting
me know the result?

>
> My question was: could we reach the 100K rps on a 4-cores machine with
> ehttpd ?
> That will be awesome.
>

Today I found a lock bottleneck using SystemTap, trace-cmd and lcnt.
After fixing it, ehttpd on my 16-core machine can reach 325k rps.

RX packets: 326117 TX packets: 326122
RX packets: 326845 TX packets: 326859
RX packets: 327983 TX packets: 327996
RX packets: 326651 TX packets: 326624

This is the upper limit of our Gigabit network card. I ran ab on three
standalone machines to generate enough pressure. I posted the fix to
GitHub; give it a try ~

Zabrane Mickael

Jul 12, 2012, 5:02:20 AM

to Wei Cao, Erlang Questions

Hi Wei,

>> We already surpassed the 100krps on an 8-cores machine with our HTTP server
>> (~150K rps).
>
> Which erlang version did you use to get ~150k rps on 8-cores machine,
> patched or unpatched?

We reach the 150K on the unpatched version.

> if it was measured on a unpatched erlang
> version, would you mind measuring it on the patched version and let me
> know the result?

I haven't yet adapted our code to use the VM with your patch.
I'll keep you informed.

> Today I found a lock bottleneck through SystemTap, trace-cmd and lcnt,
> after fixing it, ehttpd on my 16-cores can reach 325k rps.
>
> RX packets: 326117 TX packets: 326122
> RX packets: 326845 TX packets: 326859
> RX packets: 327983 TX packets: 327996
> RX packets: 326651 TX packets: 326624
>
> This is the upper limit of our Gigabit network card, I run ab on three
> standalone machines to make enough pressure, I posted the fix to
> github, have a try ~

That's simply fantastic. Could you share your bottleneck tracking method?
Any new VM patch to provide?

Regards,
Zabrane

Wei Cao

Jul 12, 2012, 5:12:25 AM

to Zabrane Mickael, Erlang Questions

The fix has been pushed to the patch branches; retrieve it with

git fetch git://github.com/weicao/otp.git bind_port_to_scheduler

git fetch git://github.com/weicao/otp.git pollset_per_scheduler

I used git push --force, so it's better to fetch the whole branch again.

--

Best,

Wei Cao

Zabrane Mickael

Jul 12, 2012, 5:43:12 AM

to Wei Cao, Erlang Questions

Wei,

On Jul 12, 2012, at 11:23 AM, Wei Cao wrote:

> Not really, bind_port_to_scheduler is based on pollset_per_scheduler,
> you can use bind_port_to_scheduler only,
>
> git fetch git://github.com/weicao/otp.git bind_port_to_scheduler

I can't compile the VM with your new patch:

git clone git://github.com/weicao/otp.git

cd otp

git fetch git://github.com/weicao/otp.git bind_port_to_scheduler

./otp_build autoconf

CFLAGS="-DERTS_POLLSET_PER_SCHEDULER -g -O3 -fomit-frame-pointer" ./configure --prefix=/SOMEWHERE/usr

make && make install

make clean (WITH OR WITHOUT make clean, it doesn't compile)

make

[...]

gcc -m32 -DERTS_POLLSET_PER_SCHEDULER -g -O3 -fomit-frame-pointer -I/opt/otp/erts/i686-pc-linux-gnu -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename -D_GNU_SOURCE -DERTS_SMP -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -DERTS_ENABLE_LOCK_COUNT -Ii686-pc-linux-gnu/opt/smp -Ibeam -Isys/unix -Isys/common -Ii686-pc-linux-gnu -Izlib -Ipcre -Ihipe -I../include -I../include/i686-pc-linux-gnu -I../include/internal -I../include/internal/i686-pc-linux-gnu -c beam/erl_process_lock.c -o obj/i686-pc-linux-gnu/opt/smp/erl_process_lock.o

beam/erl_process_lock.c: In function 'erts_lcnt_enable_proc_lock_count':

beam/erl_process_lock.c:1275:15: error: 'process_tab' undeclared (first use in this function)

beam/erl_process_lock.c:1275:15: note: each undeclared identifier is reported only once for each function it appears in

make[3]: *** [obj/i686-pc-linux-gnu/opt/smp/erl_process_lock.o] Error 1

make[3]: Leaving directory `/opt/otp/erts/emulator'

make[2]: *** [opt] Error 2

make[2]: Leaving directory `/opt/otp/erts/emulator'

make[1]: *** [smp] Error 2

make[1]: Leaving directory `/opt/otp/erts'

Regards,

Zabrane

2012/7/12 Zabrane Mickael <zabr...@gmail.com>:

On Jul 12, 2012, at 11:12 AM, Wei Cao wrote:

The fix is git-pushed to patch branches, retrieved by

git fetch git://github.com/weicao/otp.git bind_port_to_scheduler
git fetch git://github.com/weicao/otp.git pollset_per_scheduler

I used git push --force, so it's better to fetch it whole again.

So I need both patches? Right?

Wei Cao

Jul 12, 2012, 5:52:15 AM

to Zabrane Mickael, Erlang Questions

It seems you have lcnt enabled. Did you include --enable-lock-counter in
your ./configure command?

lcnt was broken in the master branch at the time I branched to commit
the patches; it's fixed in the latest OTP master branch.

2012/7/12 Zabrane Mickael <zabr...@gmail.com>:

Wei Cao

Jul 12, 2012, 5:58:37 AM

to Zabrane Mickael, Erlang Questions

2012/7/12 Zabrane Mickael <zabr...@gmail.com>:

> Hi Wei,
>
>>> We already surpassed the 100krps on an 8-cores machine with our HTTP server
>>> (~150K rps).
>>
>> Which erlang version did you use to get ~150k rps on 8-cores machine,
>> patched or unpatched?
>
> We reach the 150K on the unpatched version.
>
>
>> if it was measured on a unpatched erlang
>> version, would you mind measuring it on the patched version and let me
>> know the result?
>
> I didn't yet adapted our code to use VM with your patch.
> I'll keep you informed.
>
>> Today I found a lock bottleneck through SystemTap, trace-cmd and lcnt,
>> after fixing it, ehttpd on my 16-cores can reach 325k rps.
>>
>> RX packets: 326117 TX packets: 326122
>> RX packets: 326845 TX packets: 326859
>> RX packets: 327983 TX packets: 327996
>> RX packets: 326651 TX packets: 326624
>>
>> This is the upper limit of our Gigabit network card, I run ab on three
>> standalone machines to make enough pressure, I posted the fix to
>> github, have a try ~
>
> That's simply fantastic. Could you share your bottleneck tracking method?
> Any new VM patch to provide?

Through perf top, I saw that a large percentage of time was wasted in
the kernel's _spin_lock:

1894.00 16.0% _spin_lock    /usr/lib/debug/lib/modules/2.6.32-131.21.1.tb477.el6.x86_64/vmlinux
 566.00  4.8% process_main  /home/mingsong.cw/erlangpps/lib/erlang/erts-5.10/bin/beam.smp

After dumping _spin_lock's call stacks via trace-cmd and gathering
statistics on them, I found that most _spin_lock calls come from
futex_wake, which is invoked by pthread mutexes.

Finally, I used lcnt to locate all lock collisions in the Erlang VM and
found that the timeofday mutex is the bottleneck.

lock       location                   #tries  #collisions  collisions [%]  time [us]  duration [%]
-----      ---------                 -------  -----------  --------------  ---------  ------------
timeofday  'beam/erl_time_sup.c':939  895234       551957         61.6551    3185159       23.5296
timeofday  'beam/erl_time_sup.c':971  408006       264498         64.8270    1473816       10.8874

The timeofday mutex is locked each time erts_check_io is invoked, to
"sync the machine's idea of time". erts_check_io is executed hundreds
of thousands of times per second, so the mutex is locked far too often,
which reduces performance.

I solved this problem by moving the sync operation into a standalone
thread that runs once per millisecond.
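
The same shape of fix (one dedicated updater, many cheap readers) can be
illustrated at the Erlang level. This is an analogy only, not the patch:
the actual change is in the emulator's C code, while the sketch below
uses an ordinary process that refreshes a cached timestamp in an ETS
table roughly once per millisecond. The module, table and function names
are all hypothetical.

```erlang
-module(time_cache).
-export([start/0, now_ms/0, stop/1]).

-define(TAB, time_cache_tab).

%% Start the updater process; it owns the table holding the cached time.
start() ->
    Parent = self(),
    Pid = spawn(fun() ->
                        ets:new(?TAB, [named_table, public,
                                       {read_concurrency, true}]),
                        refresh(),
                        Parent ! {self(), ready},
                        tick()
                end),
    receive {Pid, ready} -> {ok, Pid} end.

%% Refresh about once per millisecond until told to stop.
tick() ->
    receive
        stop -> ok
    after 1 ->
        refresh(),
        tick()
    end.

refresh() ->
    ets:insert(?TAB, {now_ms, erlang:monotonic_time(millisecond)}).

%% Readers only look up the cached value; they never compute the time.
now_ms() ->
    [{now_ms, T}] = ets:lookup(?TAB, now_ms),
    T.

stop(Pid) ->
    Pid ! stop,
    ok.
```

Readers get a value that may be up to about a millisecond stale, which is
the trade-off the standalone-thread approach accepts as well.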


--

Best,

Wei Cao

Zabrane Mickael

Jul 12, 2012, 6:14:18 AM

to Wei Cao, Erlang Questions

Awesome analysis Wei.

You were right about lcnt (--enable-lock-counter); now it compiles just fine.

I'll keep you informed.

Regards,
Zabrane

Zabrane Mickael

Jul 12, 2012, 7:01:47 AM

to Wei Cao, Erlang Questions

Hi,

Good news. With the new (today) patch:

old bench: ~70K rps
new bench: ~85K rps

About 15K more rps handled now !!
We're not far from the 100K rps ;-)

Well done Wei.

Regards,
Zabrane

Zabrane Mickael

Jul 12, 2012, 7:09:02 AM

to Wei Cao, Erlang Questions

For everyone using ab or siege, I strongly advise you to move to a better tool.

The best one for now, which supports threading and the fast libev, is weighttp (from the Lighttpd web server project).

weighttp uses exactly the same syntax as ab, so there is nothing to change in your benchmarks.

Hope this helps!

###################################################

# Weighttp:

# http://redmine.lighttpd.net/projects/weighttp/wiki

###################################################

INSTALL

1. LibEV (http://software.schmorp.de/pkg/libev.html)

cvs -z3 -d :pserver:anon...@cvs.schmorp.de/schmorpforge co libev

cd libev

aclocal && automake --add-missing && autoconf

libtoolize --copy --force --ltdl

sh autogen.sh && ./configure --prefix=/usr && make && make install

2. Weighttp (http://redmine.lighttpd.net/projects/weighttp/wiki)

git clone git://git.lighttpd.net/weighttp

cd weighttp

./waf configure

./waf build

./waf install

Regards,
Zabrane


Wojtek Narczyński

Jul 12, 2012, 12:17:10 PM

to erlang-q...@erlang.org

Hello,

> I solved this problem by moving the sync operation into a standalone
> thread, invoked 1 time per millisecond

What about periods when the machine is idle? Is this thread going to
make 1000 syscalls per second anyway?

--Regards

Wojtek Narczyński

Jul 12, 2012, 1:45:09 PM

to Wei Cao, Erlang Questions

On 07/12/2012 11:58 AM, Wei Cao wrote:
> the mutex timeofday is locked each time erts_check_io is invoked to
> "sync the machine's idea of time", erts_check_io is executed hundreds
> of thounds of times per second, so it's locked too much times, hence
> reduce performance.
>

I am trying to figure out how "the machine's idea of time" is relevant
to checking for I/O. I am beginning to suspect that the lone call to
erts_deliver_time() in erl_check_io.c:1179 might be unnecessary. If
do_wait is true, the call to erts_time_remaining(&wait_time) will update
the global time. If do_wait is false, which should be the case most of
the time on a busy server, as per the erts_check_io(!runnable) calls, the
timeout passed to select/poll/epoll (whichever is used) is going to be
zero anyway. There must be a catch, because this simply looks too good
to be true...

--Regards,
Wojtek Narczynski

Wei Cao

Jul 12, 2012, 10:10:49 PM

to Wojtek Narczyński, Erlang Questions

I think periodic calls to erts_deliver_time() are necessary: at first I
tried commenting out the call to erts_deliver_time() in
erts_check_io(), and it caused the Erlang VM to hang.

erts_deliver_time() does more than erts_time_remaining(): it updates
a clock counter named do_time, which is used inside the timer wheel.


--

Best,

Wei Cao

Wojtek Narczyński

Jul 13, 2012, 9:41:53 AM

to Wei Cao, Erlang Questions

On 07/13/2012 04:10 AM, Wei Cao wrote:
> I think periodly calls to erts_deliver_time() is necessary.Because at
> first I tried to comment out the call to erts_deliver_time() from
> erts_check_io(), it would cause erlang VM hang out.

Well, that's hard evidence.

> erts_deliver_time() does more than erts_time_remaining(), it updates
> a clock counter named do_time which is used inside timer wheel.
>

Okay, I see that now.

Maybe it would be enough to call erts_deliver_time() only if do_wait is
true?

Or if that fails, create a new mutex, say erts_timedelivery_mtx, use
erts_smp_mtx_trylock() on it, and only proceed with the time delivery,
from the thread that succeeds in obtaining the lock. This should keep
do_time relatively fresh.
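
The trylock idea can be sketched outside ERTS as follows. This is only an
illustration of the pattern in Erlang, using the atomics module as a
stand-in for erts_smp_mtx_trylock(); erts_timedelivery_mtx is the name
proposed above, and everything else here (module, function names) is
hypothetical. It says nothing about whether the approach is sufficient
inside the emulator.

```erlang
-module(try_deliver).
-export([new/0, maybe_deliver/2]).

%% A single atomic flag: 0 means free, 1 means held.
new() ->
    atomics:new(1, []).

%% Run Fun only if this caller wins the flag; otherwise return at once
%% without waiting, so losers never queue up behind the winner.
maybe_deliver(Flag, Fun) ->
    case atomics:compare_exchange(Flag, 1, 0, 1) of
        ok ->
            try
                {delivered, Fun()}
            after
                atomics:put(Flag, 1, 0)   %% always release the flag
            end;
        _Held ->
            busy
    end.
```

A caller that finds the flag held simply skips the delivery and carries
on, relying on the current holder to keep the shared state fresh.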

Hope you don't mind me throwing my silly ideas at you. That is because I
am excited about your great work. I would like Erlang to outperform
everything in socket-heavy applications.

I think, I'll try to build and test otp on my own, too. After all, how
hard can it be ;-)

Hynek Vychodil

Jul 15, 2012, 5:06:36 AM

to Ronny Meeus, Erlang Questions

You can bind a process to a specific core in Erlang. First you have to
bind schedulers to cores using the +sbt emulator option (I'm using the
default binding, +sbt db), and then you can bind a process to a specific
scheduler using the spawn option {scheduler, N}, where N is the id of
the scheduler, starting from 1 (0 may bind to the same scheduler; it is
not documented anywhere I could find). You can also get the current
scheduler id using erlang:system_info(scheduler_id).

[spawn_opt(fun() ->
               io:format("~p -> Scheduler: ~p~n",
                         [X, erlang:system_info(scheduler_id)])
           end, [{scheduler, X}])
 || X <- lists:seq(1, erlang:system_info(schedulers))].
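
A variant of the one-liner above that returns the observed scheduler ids
instead of printing them may be easier to use in a test. This is a
sketch: the module and function names are made up for illustration, and
{scheduler, N} remains an undocumented spawn option whose behaviour may
differ between OTP releases.

```erlang
-module(sched_pin).
-export([where/1, all/0]).

%% Spawn a process with the undocumented {scheduler, N} option and
%% report the id of the scheduler it actually ran on.
where(N) ->
    Parent = self(),
    Ref = make_ref(),
    spawn_opt(fun() ->
                      Parent ! {Ref, erlang:system_info(scheduler_id)}
              end, [{scheduler, N}]),
    receive
        {Ref, Id} -> Id
    after 1000 ->
        timeout
    end.

%% Requested vs. observed scheduler id for every online scheduler.
all() ->
    [{N, where(N)}
     || N <- lists:seq(1, erlang:system_info(schedulers_online))].
```

Whether the schedulers themselves are bound to cores can be checked
separately with erlang:system_info(scheduler_bind_type).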

> This is a "taskset" like primitive that we know it from the Linux thread
> world.
> For example in thread "Strange observation running in an SMP
> environment." in this mailing list this would certainly be beneficial.
>
> Best regards,
> Ronny

Wojtek Narczyński

Jul 16, 2012, 10:27:08 AM

to Wei Cao, Erlang Questions

On 07/13/2012 03:41 PM, Wojtek Narczyński wrote:

> Maybe it would be enough to call erts_deliver_time() only if do_wait is
> true?

I checked, it would not.

> Or if that fails, create a new mutex, say erts_timedelivery_mtx, use
> erts_smp_mtx_trylock() on it, and only proceed with the time delivery,
> from the thread that succeeds in obtaining the lock. This should keep
> do_time relatively fresh.

This did not work either.

Neither did calling erts_deliver_time() only when the poll call returned
zero workable descriptors.

And, because there are quite a few variables related to time, atomic
operations are not an easy solution either. Perhaps something like
seqlock, invented exactly in the context of timekeeping, would do the job.
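
For readers unfamiliar with the technique, a seqlock can be sketched in a
few lines. The version below is written in Erlang on top of the atomics
module purely for illustration; it assumes a single writer, stores a
two-word value (seconds and microseconds), and has nothing to do with the
actual ERTS time code. All names are hypothetical.

```erlang
-module(seqlock).
-export([new/0, write/3, read/1]).

%% Slot 1 is the sequence counter; slots 2 and 3 hold the two-word
%% value (here: seconds and microseconds). All slots start at 0.
new() ->
    atomics:new(3, [{signed, false}]).

%% Single writer assumed. The counter is odd while a write is in
%% progress and even again once the value is consistent.
write(Ref, Secs, Micros) ->
    atomics:add(Ref, 1, 1),
    atomics:put(Ref, 2, Secs),
    atomics:put(Ref, 3, Micros),
    atomics:add(Ref, 1, 1),
    ok.

%% Readers take no lock. They retry if the counter was odd or changed
%% between the two reads, which means they overlapped a write.
read(Ref) ->
    Seq0 = atomics:get(Ref, 1),
    Secs = atomics:get(Ref, 2),
    Micros = atomics:get(Ref, 3),
    Seq1 = atomics:get(Ref, 1),
    if
        Seq0 =:= Seq1, Seq0 band 1 =:= 0 -> {Secs, Micros};
        true -> read(Ref)
    end.
```

The attraction for timekeeping is that several related values can be read
consistently without readers ever blocking the writer.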

--Regards,
Wojtek
