[erlang-questions] performance drop in R15A?


Nico Kruber

Nov 30, 2011, 5:36:17 AM11/30/11
to erlang-q...@erlang.org
Is it possible that some debug option is on by default in the R15A-tagged git
checkout?
We build the source using the following commands (on an x86_64 system with 8
cores):

./otp_build autoconf
./configure --prefix=/home/buildbot/apps/erlang-R15A
make
make install

and then compiled our application (http://code.google.com/p/scalaris/) using
R15A.

Our internal benchmarks ("make bench") showed a drop of around 15% in
performance compared to R14B04. We were actually hoping for increased
performance after reading about the memory allocation improvements...

Is there some switch (during configuration or build) influencing the
performance?
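
As a quick sanity check, something like the following in the Erlang shell
should at least show whether the emulator announces itself as a debug build;
as far as I know a debug emulator adds a "[debug]" tag to its banner (the
banners below do not show one), so this may not catch every build option:

%% Hedged sanity check: does the running emulator announce a debug build?
Banner = erlang:system_info(system_version),   %% the banner string erl prints at startup
string:str(Banner, "[debug]") > 0.             %% true would indicate a debug emulator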

output of R14B04 erl:
------------------------------------------------------------
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-
threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.5 (abort with ^G)
1>
------------------------------------------------------------

output of R15A erl:
------------------------------------------------------------
Erlang R15A (erts-5.9.pre) [source] [64-bit] [smp:8:8] [rq:8] [async-
threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.pre (abort with ^G)
1>
------------------------------------------------------------


Regards
Nico Kruber


Lukas Larsson

Nov 30, 2011, 10:30:05 AM11/30/11
to Nico Kruber, erlang-q...@erlang.org
Hello!

I was able to verify your findings on my machine for R14B04 and R15A. The problem is, however, fixed in our latest internal master (I even get a 15% throughput improvement on your bench), which will be pushed to GitHub later this evening. So if you could pull the latest from GitHub tomorrow and verify my findings, that would be great!

Lukas



Nico Kruber

Nov 30, 2011, 11:05:29 AM11/30/11
to erlang-q...@erlang.org, Lukas Larsson
Thank you!
We'll try again tomorrow.

Nico

--
Nico Kruber
phone: +49 30 84185-253
-----
Zuse Institute Berlin
Takustr. 7
D-14195 Berlin


Nico Kruber

Dec 2, 2011, 5:26:56 AM12/2/11
to erlang-q...@erlang.org, Lukas Larsson
Tonight, I ran the tests with an Erlang checkout from around 16:00 yesterday
(erlang-otp-OTP_R15A-126-g0290bed.tar.gz).
This includes the fix Björn-Egil was referring to,
i.e. 'rickard/pix-mutex/OTP-9723'.


Following are the results of our Scalaris "make bench":

** transactions/s, i.e. "wall clock throughput : xxx" **
R14B04: ~950, R15A (ge21ff9b): ~800, R15A (g0290bed): ~635

** latency, i.e. "wall clock avg. latency: xxx ms" **
R14B04: ~42ms, R15A (ge21ff9b): ~50ms, R15A (g0290bed): ~60ms


Note: This "make bench" command sets up a Scalaris ring consisting of 4 nodes,
each in a single VM. They all communicate via TCP.


Additionally, we measure our performance through our Java API, which
accesses Erlang via JInterface, using
"SCALARIS_BENCHMARKS="100 6 9 15 18" make java-bench"

** number of transactions per second for "Benchmark incrementing an
integer key (read+write):" **
R14B04: ~310, R15A (ge21ff9b): ~200, R15A (g0290bed): ~140


I saw some new bug fixes for the scheduler in the recent commits, but with the
checkout from 6971b96 (just now), the results are still the same as with
g0290bed :(

Regards,
Nico


Nico Kruber

Dec 5, 2011, 9:14:42 AM12/5/11
to erlang-q...@erlang.org, Lukas Larsson
Is there any news on this?

The performance drop of the recent R15 checkout compared to R14B04 in our
application is now at _30%_ compared to a drop of "only" 15% in R15A!


Nico


Kenneth Lundin

Dec 9, 2011, 6:09:56 AM12/9/11
to Nico Kruber, erlang-q...@erlang.org, Lukas Larsson
On 12/5/11, Nico Kruber <kru...@zib.de> wrote:
> Is there any news on this?
>
> The performance drop of the recent R15 checkout compared to R14B04 in our
> application is now at _30%_ compared to a drop of "only" 15% in R15A!
>
When working with R15B we are, of course, performing a number of
benchmarks to keep track of the performance.

We always strive to improve general performance in each release AND
OUR BENCHMARKS INDICATE THAT R15B IS FASTER IN GENERAL THAN R14B04.

It is however always possible that some specific operations or
combinations of operations get slower while others are optimized.

Regarding the findings for this particular application, we think the
reason for the slowdown is a very frequent but non-typical usage pattern
that now becomes visible as a performance bottleneck.


/ Kenneth, Erlang/OTP Ericsson

Tomas Abrahamsson

Dec 9, 2011, 10:01:38 AM12/9/11
to Kenneth Lundin, erlang-q...@erlang.org

Nico Kruber wrote:
NK>>>>>> Our internal benchmarks ("make bench") showed a drop of
NK>>>>>> around 15% in performance compared to R14B04. [...]

NK>> The performance drop of the recent R15 checkout compared to R14B04 in our
NK>> application is now at _30%_ compared to a drop of "only" 15% in R15A!

Kenneth Lundin wrote:
KL> When working with R15B we are, of course, performing a number of
KL> benchmarks to keep track of the performance.
KL>
KL> We always strive to improve general performance in each release AND
KL> OUR BENCHMARKS INDICATE THAT R15B IS FASTER IN GENERAL THAN R14B04.
KL>
KL> It is however always possible that some specific operations or
KL> combinations of operations get slower while others are optimized.

Hi,

we're currently prototyping a simulator for mobile
telephony. Performance is one of our main goals, so
we've been measuring with R14B04, R15A.pre and with
OTP from github.

We, too, have seen a performance drop similar to what
Nico has observed, although not as drastic. We start a
number of UEs; each UE is a few processes passing a
sequence of messages back and forth. Then we measure
how many UEs a certain machine can handle, given a
fixed rate at which they should pass the messages.
This is what we've seen (higher numbers mean better
performance):

- R14B04: 12200
- R15A.pre: 11800 (== git rev e21ff9b)
- git rev 6971b96: 11500

The workload for the processes consists of NAS TLV
encoding and decoding (bit/binary manipulations), a
fair bit of ASN.1 encoding and decoding, and some
ciphering/deciphering using the crypto lib.
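
To give a feel for the shape of that load, below is a
minimal sketch (illustrative only, not our actual UE
code) of the message-passing pattern: each UE sends
itself a tick at a fixed interval and exchanges one
request/reply with a peer process per tick.

-module(ue_sketch).
-export([start_pair/1]).

%% Illustrative sketch: one UE-like process paired with a peer,
%% exchanging a request/reply at a fixed interval.
start_pair(IntervalMs) ->
    Peer = spawn(fun peer_loop/0),
    spawn(fun() -> ue_loop(Peer, IntervalMs) end).

ue_loop(Peer, IntervalMs) ->
    erlang:send_after(IntervalMs, self(), tick),   %% schedule the next tick
    receive
        tick ->
            Peer ! {request, self()},              %% one message exchange per tick
            receive
                {reply, Peer} -> ok
            end,
            ue_loop(Peer, IntervalMs)
    end.

peer_loop() ->
    receive
        {request, From} ->
            From ! {reply, self()},
            peer_loop()
    end.

The benchmark then scales this pattern up, roughly
[ue_sketch:start_pair(100) || _ <- lists:seq(1, 10000)],
plus the encoding/ciphering work mentioned above on
every message.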

KL> Regarding the findings for this particular application we think the
KL> reason for slowdown is a very frequent but non typical usage pattern
KL> that now becomes visible as a performance bottleneck.

It might well be that our usage pattern is a bit away
from normal (currently we hope so, but we fear it
might not be!). We'll arrange to send the source code
for our benchmark off-list, in case you want to
repeat it.

BRs
Tomas

Tomas Abrahamsson

Dec 13, 2011, 12:24:24 PM12/13/11
to Kenneth Lundin, erlang-q...@erlang.org

Nico Kruber wrote:
NK>>>>>>> Our internal benchmarks ("make bench") showed a drop of
NK>>>>>>> around 15% in performance compared to R14B04. [...]

Kenneth Lundin wrote:


KL>> When working with R15B we are, of course, performing a number of
KL>> benchmarks to keep track of the performance.
KL>>
KL>> We always strive to improve general performance in each release AND
KL>> OUR BENCHMARKS INDICATE THAT R15B IS FASTER IN GENERAL THAN R14B04.
KL>>
KL>> It is however always possible that some specific operations or
KL>> combinations of operations get slower while others are optimized.

I wrote:
TA> We, too, have seen a performance drop similar to what
TA> Nico has observed [...]

Hi again (replying to self ... hmm :) )

Well, anyway, a status update: we've developed our
prototype a bit further, experimented with various
scheduler bind settings and with Erlang process
communication patterns, and now it seems that the
performance of R15A for our prototype is well on par
with R14B04. We can no longer say that R15 is slower.
(We haven't tracked down exactly why R15 looked
slower earlier, but we don't see it anymore.)
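
For anyone wanting to try the same knobs, a minimal
sketch (illustrative, not our exact settings) of how
scheduler binding can be set and inspected:

%% Start the node with a bind setting, e.g.
%%   erl +sbt db      (bind schedulers, default bind order)
%% and then inspect the result from the shell:
erlang:system_info(scheduler_bind_type).  %% how schedulers are bound
erlang:system_info(scheduler_bindings).   %% per-scheduler cpu id, or 'unbound'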

Nico Kruber

Dec 14, 2011, 11:41:33 AM12/14/11
to erlang-q...@erlang.org, Lukas Larsson
OK, I did some profiling and made some optimisations, but the differences
between R15B (a git checkout from master just now) and R14B04 are still the same:

First, the results of the calls to bench:quorum_read(10, 50000) (best of three;
times in microseconds, as returned by timer:tc):
------------------------------------------------------------
                  |   R14B04 |     R15B | time surplus
------------------------------------------------------------
timer:tc, smp8:8  | 22484619 | 26641737 | +18%
timer:tc, smp4:4  | 34009785 | 35963489 | + 6%
timer:tc, smp2:2  | 32863428 | 34102038 | + 4%
timer:tc, smpoff  | 53805419 | 54214795 | + 1%
------------------------------------------------------------

It seems there is some lock contention or some other SMP-related problem in our
case which wasn't there before.
I also observed that the results using R15B vary quite a lot, especially with
smp8:8 (see below for an example); R14B04 is more stable across different test
runs.
But still, even without SMP, R15B is slower than R14B04 :(
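
If someone wants to dig into the lock-contention theory, here is a rough
sketch of how I imagine the lcnt tool could be used (untested here; it needs
an emulator built with ./configure --enable-lock-counter, which our plain
build above is not):

lcnt:start(),                               %% start the lcnt server
lcnt:clear(),                               %% reset the lock counters
timer:tc(bench, quorum_read, [10, 50000]),
lcnt:collect(),                             %% pull counter data from the emulator
lcnt:conflicts().                           %% list the most contended locks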

Lukas suggested running fprof to dig a bit deeper and maybe find the function
that takes longer in R15 than in R14B04.
Although it was helpful in finding some hot spots, it did not help with this
problem. Following are the totals reported by fprof
(if you like, I can send the complete logs - I didn't want to send too much to
the mailing list, though).

------------------------------------------------------------
                    |   R14B04 |     R15B | time surplus
------------------------------------------------------------
fprof, CNT, smp8:8  |  7723796 |  7722635 | n/a
fprof, ACC, smp8:8  |    98239 |    85477 | -13%
fprof, OWN, smp8:8  |   300534 |   264018 | -12%
------------------------------------------------------------

As you can see, according to fprof, R15B should be faster (but it isn't!).
I assume that some function calls are not reported by fprof
(e.g. I did not see erlang:length/1).
Is there some way to include them again in order to analyse this further?
Maybe suspend or garbage_collect are not included in ACC or OWN?
But even here, R15B wastes less time, as you can see in the table below.
----------------------------------------------------------
                                |     R14B04 |        R15B
----------------------------------------------------------
fprof, suspend, smp8:8          | 9.8421e+06 | 8.51855e+06
fprof, garbage_collect, smp8:8  |       1315 |         794
----------------------------------------------------------

We rely quite heavily on the process dictionary (erlang:put and erlang:get),
which you might see as a "non-typical usage pattern" for Erlang.
However, if I switch to ets instead, the performance of bench:quorum_read(10, 50000)
for R15B does not change, and for R14B04 it decreases by about 7%.
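
For anyone curious, the kind of swap meant here is roughly the following
(illustrative names only, not the actual Scalaris code):

-module(state_store).
-export([pd_put/2, pd_get/1, ets_new/0, ets_put/3, ets_get/2]).

%% Variant 1: per-process state in the process dictionary.
pd_put(Key, Val) -> erlang:put(Key, Val).
pd_get(Key)      -> erlang:get(Key).

%% Variant 2: per-process state in a private ets table.
ets_new()          -> ets:new(state_store, [set, private]).
ets_put(Tab, K, V) -> ets:insert(Tab, {K, V}).
ets_get(Tab, K)    ->
    case ets:lookup(Tab, K) of
        [{K, V}] -> V;
        []       -> undefined
    end.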

Everything else I would consider "typical use" - or at least I'm not aware
of anything untypical.

I hope someone can help to further analyse the problem. I did everything I
could think of and am stuck now.
Is other real-world software also affected? E.g. testing CouchDB would be nice.


Regards
Nico Kruber


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
how to re-produce the values
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

I'm using scalaris from http://code.google.com/p/scalaris/
./configure && make
================ R15B ===============
--------------------------------
./bin/firstnode.sh
Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{33764932,ok} % 1st try
{26641737,ok} % 2nd try
{27575975,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S4:4"
Erlang R15B (erts-5.9) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{35963489,ok} % 1st try
{36332156,ok} % 2nd try
{37295819,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S2:2"
Erlang R15B (erts-5.9) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{34102038,ok} % 1st try
{34379052,ok} % 2nd try
{34416173,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "-smp disable"
Erlang R15B (erts-5.9) [source] [64-bit] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{54214795,ok} % 1st try
{55210947,ok} % 2nd try
{54537194,ok} % 3rd try

================ R14B04 ===============
--------------------------------
./bin/firstnode.sh


Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{22598819,ok} % 1st try
{22706185,ok} % 2nd try
{22484619,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S4:4"
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{35482368,ok} % 1st try
{34009785,ok} % 2nd try
{34527203,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S2:2"
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{33590356,ok} % 1st try
{33192033,ok} % 2nd try
{32863428,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "-smp disable"
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:0] [hipe] [kernel-poll:false]

timer:tc(bench, quorum_read, [10, 50000]).
{53805419,ok} % 1st try
{54054538,ok} % 2nd try
{54066521,ok} % 3rd try


fprof runs:
./bin/firstnode.sh

fprof:trace([start, {procs, processes()}]).
bench:quorum_read(10, 1000).
fprof:trace(stop).
fprof:profile().
fprof:analyse([{dest, "test.R15B.fprof.log"}]). % or R14B04 respectively
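
%% The awk one-liners below just sum one column of the analysis output over
%% all rows mentioning suspend resp. garbage_collect: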

> cat test.R14B04.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1} END {print s}'
9.8421e+06
> cat test.R15B.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1} END {print s}'
8.51855e+06
> cat test.R14B04.fprof.log |grep garbage_collect|cut -d',' -f3 | awk '{s+=$1} END {print s}'
1315.33
> cat test.R15B.fprof.log |grep garbage_collect|cut -d',' -f3 | awk '{s+=$1} END {print s}'
794.956
