./otp_build autoconf
./configure --prefix=/home/buildbot/apps/erlang-R15A
make
make install
and then compiled our application (http://code.google.com/p/scalaris/) using
R15A.
Our internal benchmarks ("make bench") showed a drop of around 15% in
performance compared to R14B04. We were actually hoping for increased
performance after reading about the memory allocation improvements...
Is there some switch (during configuration or build) influencing the
performance?
output of R14B04 erl:
------------------------------------------------------------
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.8.5 (abort with ^G)
1>
------------------------------------------------------------
output of R15A erl:
------------------------------------------------------------
Erlang R15A (erts-5.9.pre) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.9.pre (abort with ^G)
1>
------------------------------------------------------------
Regards
Nico Kruber
--
Nico Kruber
phone: +49 30 84185-253
-----
Zuse Institute Berlin
Takustr. 7
D-14195 Berlin
Following are the results of our "make bench" of Scalaris:
** transactions/s , i.e. "wall clock throughput : xxx" **
R14B04: ~950, R15A (ge21ff9b): ~800, R15A (g0290bed): ~635
** latency, i.e. "wall clock avg. latency: xxx ms" **:
R14B04: ~42ms, R15A (ge21ff9b): ~50ms, R15A (g0290bed): ~60ms
Note: This "make bench" command sets up a Scalaris ring consisting of 4 nodes,
each in its own VM. They all communicate through TCP.
Additionally, we also measure our performance through our Java API, which
accesses Erlang via JInterface, using
"SCALARIS_BENCHMARKS="100 6 9 15 18" make java-bench"
** number of transactions per second for "Benchmark incrementing an
integer key (read+write):"
R14B04: ~310, R15A (ge21ff9b): ~200, R15A (g0290bed): ~140
I saw some new bug fixes for the scheduler in the recent commits, but with the
checkout from 6971b96 (just now), the results are still the same as with
g0290bed :(
Regards,
Nico
We always strive to improve general performance in each release, and our
benchmarks indicate that R15B is faster in general than R14B04.
It is however always possible that some specific operations or
combinations of operations get slower while others are optimized.
Regarding the findings for this particular application, we think the
reason for the slowdown is a very frequent but non-typical usage pattern
that now becomes visible as a performance bottleneck.
/ Kenneth, Erlang/OTP Ericsson
NK>> The performance drop of the recent R15 checkout compared to R14B04 in our
NK>> application is now at _30%_ compared to a drop of "only" 15% in R15A!
Kenneth Lundin wrote:
KL> When working with R15B we are of course performing a number of
KL> benchmarks to keep track of the performance.
KL>
KL> We always strive to improve general performance in each release, and our
KL> benchmarks indicate that R15B is faster in general than R14B04.
KL>
KL> It is however always possible that some specific operations or
KL> combinations of operations get slower while others are optimized.
Hi,
we're currently prototyping a simulator for mobile
telephony. Performance is one of our main goals, so
we've been measuring with R14B04, R15A.pre, and
OTP from GitHub.
We, too, have seen a performance drop similar to what
Nico has observed, although not as drastic. We
start a number of UEs; each UE is a few processes
passing a sequence of messages back and forth. We then
measure how many UEs a certain machine can handle,
given a fixed rate at which they should pass the
messages. This is what we've seen (higher numbers mean
better performance):
- R14B04: 12200
- R15A.pre: 11800 (== git rev e21ff9b)
- git rev 6971b96: 11500
The workload for the processes consists of NAS TLV
encoding and decoding (bit/binary manipulations), a
fair bit of ASN.1 encoding and decoding, and some
ciphering/deciphering using the crypto lib.
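To give an idea of the binary-handling part of that workload, here is a minimal sketch (not our actual code; module name and field sizes are made up, and real NAS TLV formats are more involved) of a TLV decoder using Erlang's bit syntax:

```erlang
%% Illustrative TLV decoder: consumes a binary of <<Tag:8, Len:8,
%% Value:Len/binary>> triples and returns a list of {Tag, Value} pairs.
-module(tlv_sketch).
-export([decode/1]).

decode(<<Tag:8, Len:8, Value:Len/binary, Rest/binary>>) ->
    [{Tag, Value} | decode(Rest)];
decode(<<>>) ->
    [].
```

e.g. tlv_sketch:decode(<<1,2,"ab",3,1,"c">>) returns [{1,<<"ab">>},{3,<<"c">>}].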
KL> Regarding the findings for this particular application, we think the
KL> reason for the slowdown is a very frequent but non-typical usage pattern
KL> that now becomes visible as a performance bottleneck.
It might well be that our usage pattern is a bit away
from normal (currently we hope so, but we fear it
might not be!). We'll arrange to send the source code
for our benchmark off-list, in case you would want to
repeat it.
BRs
Tomas
Kenneth Lundin wrote:
KL>> When working with R15B we are of course performing a number of
KL>> benchmarks to keep track of the performance.
KL>>
KL>> We always strive to improve general performance in each release, and our
KL>> benchmarks indicate that R15B is faster in general than R14B04.
KL>>
KL>> It is however always possible that some specific operations or
KL>> combinations of operations get slower while others are optimized.
I wrote:
TA> We, too, have seen a performance drop similar to what
TA> Nico has observed [...]
Hi again, (replying to self ... hmm :) )
well, anyway, status update: we've further developed
our prototype, experimented with various
scheduler bind settings and with Erlang
process communication patterns, and now it seems that
the performance of R15A for our prototype is well on par
with R14B04. We can no longer say that R15 is slower.
(We haven't tracked down exactly why we earlier saw R15
to be slower, but we don't see it anymore.)
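For reference, the scheduler-bind experiments mentioned above can be tried from the Erlang shell (illustrative only; binding can also be set at startup with the +sbt emulator flag, and the exact effect is system dependent):

```erlang
%% Inspect the current bind type and the resulting bindings.
erlang:system_info(scheduler_bind_type).
erlang:system_info(scheduler_bindings).
%% Change the bind type at runtime (documented types include e.g.
%% no_spread, thread_spread, processor_spread, spread, default_bind,
%% unbound).
erlang:system_flag(scheduler_bind_type, thread_spread).
```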
First, the results of the calls to bench:quorum_read(10, 50000) (best of three):
------------------------------------------------------------
| R14B04 | R15B | time surplus
------------------------------------------------------------
timer:tc, smp8:8 | 22484619| 26641737| +18%
timer:tc, smp4:4 | 34009785| 35963489| + 6%
timer:tc, smp2:2 | 32863428| 34102038| + 4%
timer:tc, smpoff | 53805419| 54214795| + 1%
------------------------------------------------------------
It seems there is some lock contention or some other SMP-related problem in our
case which wasn't there before.
I also observed that the results using R15B vary quite a lot, especially with
smp8:8 (see below for an example). R14B04 is more stable across different test
runs.
But still, even without smp, R15B is slower than R14B04 :(
Lukas suggested running fprof to dig a bit deeper and maybe find the functions
taking longer in R15 than in R14B04.
Although it was helpful in finding some hot spots, it did not, however, help with
this problem. Following are the totals reported by fprof
(if you like, I can send the complete logs - I didn't want to send too much to
the mailing list though).
------------------------------------------------------------
| R14B04 | R15B | time surplus
------------------------------------------------------------
fprof, CNT, smp8:8 | 7723796| 7722635| n/a
fprof, ACC, smp8:8 | 98239| 85477| -13%
fprof, OWN, smp8:8 | 300534| 264018| -12%
------------------------------------------------------------
As you can see, according to fprof, R15B should be faster (but it isn't!).
I assume that some function calls are not reported by fprof
(e.g. I did not see erlang:length/1).
Is there some way to include them again in order to analyse further?
Maybe suspend or garbage_collect are not included in ACC or OWN?
But even there, R15B spends less time, as you can see in the table below.
----------------------------------------------------------
| R14B04 | R15B |
----------------------------------------------------------
fprof, suspend, smp8:8 | 9.8421e+06| 8.51855e+06|
fprof, garbage_collect, smp8:8 | 1315| 794|
----------------------------------------------------------
We do rely quite heavily on the process dictionary (erlang:put and erlang:get),
which you might see as a "non-typical usage pattern" for Erlang.
However, if I switch to ets instead, the performance of bench:quorum_read(10, 50000)
for R15B does not change, and for R14B04 it decreases by about 7%.
Everything else I would consider "typical use" - or at least I'm not aware
of anything untypical.
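For clarity, the swap between the process dictionary and ets mentioned above looks roughly like this (a sketch with made-up names, not our actual code; the ets variant assumes a public named table created at startup):

```erlang
%% Process-dictionary style: per-process, no locking involved.
store(Key, Value) -> put(Key, Value).
fetch(Key)        -> get(Key).

%% ets-based alternative; assumes
%% ets:new(cache, [set, public, named_table]) was called at startup.
store_ets(Key, Value) -> ets:insert(cache, {Key, Value}).
fetch_ets(Key) ->
    case ets:lookup(cache, Key) of
        [{Key, Value}] -> Value;
        []             -> undefined
    end.
```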
I hope someone can help to further analyse the problem. I did everything I
could think of and am stuck now.
Is other real-world software also affected? E.g. testing CouchDB would be nice.
Regards
Nico Kruber
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
how to reproduce the values
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
I'm using scalaris from http://code.google.com/p/scalaris/
./configure && make
================ R15B ===============
--------------------------------
./bin/firstnode.sh
Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{33764932,ok} % 1st try
{26641737,ok} % 2nd try
{27575975,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S4:4"
Erlang R15B (erts-5.9) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{35963489,ok} % 1st try
{36332156,ok} % 2nd try
{37295819,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S2:2"
Erlang R15B (erts-5.9) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{34102038,ok} % 1st try
{34379052,ok} % 2nd try
{34416173,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "-smp disable"
Erlang R15B (erts-5.9) [source] [64-bit] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{54214795,ok} % 1st try
{55210947,ok} % 2nd try
{54537194,ok} % 3rd try
================ R14B04 ===============
--------------------------------
./bin/firstnode.sh
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{22598819,ok} % 1st try
{22706185,ok} % 2nd try
{22484619,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S4:4"
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{35482368,ok} % 1st try
{34009785,ok} % 2nd try
{34527203,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "+S2:2"
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{33590356,ok} % 1st try
{33192033,ok} % 2nd try
{32863428,ok} % 3rd try
--------------------------------
./bin/firstnode.sh "-smp disable"
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:0] [hipe] [kernel-poll:false]
timer:tc(bench, quorum_read, [10, 50000]).
{53805419,ok} % 1st try
{54054538,ok} % 2nd try
{54066521,ok} % 3rd try
fprof runs:
./bin/firstnode.sh
fprof:trace([start, {procs, processes()}]).
bench:quorum_read(10, 1000).
fprof:trace(stop).
fprof:profile().
fprof:analyse([{dest, "test.R15B.fprof.log"}]). % or R14B04 respectively
> cat test.R14B04.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1} END {print s}'
9.8421e+06
> cat test.R15B.fprof.log |grep suspend|cut -d',' -f3 | awk '{s+=$1} END {print s}'
8.51855e+06
> cat test.R14B04.fprof.log |grep garbage_collect|cut -d',' -f3 | awk '{s+=$1} END {print s}'
1315.33
> cat test.R15B.fprof.log |grep garbage_collect|cut -d',' -f3 | awk '{s+=$1} END {print s}'
794.956