I appreciate everything that the Erlang/OTP team is doing, but I thought I
would vent a few of my recent frustrations with Erlang. I'm in a good
mood after spending a day with OCaml and I have calmed down. Still,
prepare yourself for a long rant ahead!
My development workstation is a Mac Pro with 2x2.8GHz quad-core Xeons, 12GB of
memory, and one 250GB drive plus two 500GB drives, all 7200RPM SATA. I
use R12B3, SMP and kernel poll, i.e.
Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-
threads:0] [kernel-poll:true]
My overwhelming frustration is the opacity of a running Erlang system.
There are no decent tools for peering inside. No usable ones whatsoever!
With any other language you can profile, make changes, evaluate
performance and make a judgement but not with Erlang.
I first wrote OpenPoker using OTP everywhere. My players, games, pots,
limits, hands, decks, etc. were all gen_server processes. I used
Mnesia transactions everywhere and I used them often.
Then I discovered that I cannot scale past 2-3k concurrent players
under heavy use.
I have a test harness that launches bots which connect to the server
and play by the script. The bots don't wait before replying to bet
requests and so launching a few thousand bots heavily loads the server.
I don't want just a few thousand concurrent bots, though! I want at
least 10k on a single VM and hundreds of thousands on a cluster, so I
set to optimize my poker server.
The Erlang Efficiency Guide recommends fprof as the tool. I ran fprof
on my test harness and discovered that the result set cannot be
processed in my 12GB of memory. I made this discovery after leaving
fprof running for a couple of days: the fprof data files were
approaching 100GB and my machine became unusable due to heavy swapping.
fprof uses ETS tables to analyze the trace results, and ETS tables
must fit in memory.
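Keeping the traced run small is one way around this; a minimal sketch of what I mean, using fprof:apply/3 (lists:seq/2 here is just a stand-in workload, not my harness):

```erlang
-module(short_prof).
-export([run/0]).

%% Trace one small workload with fprof:apply/3 instead of the whole
%% harness, so the analysis ETS tables stay small enough to fit in memory.
run() ->
    fprof:apply(lists, seq, [1, 1000]),            %% traced run, writes fprof.trace
    ok = fprof:profile(),                          %% builds ETS tables from the trace
    ok = fprof:analyse([{dest, "fprof.analysis"},  %% human-readable output file
                        {cols, 120}]).
```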
I shortened my test run and was able to see the output of the fprof
trace analysis. To say that it's dense would be an understatement! I
realize that dumping out tuples is easy, but aren't computers supposed
to help us humans?
The final output from fprof is still too raw for me to analyze.
There's absolutely, positively, definitely no way to get a picture of
a running system by reading through it. I understand that I can infer
from the analysis that certain functions take a lot of time but what
if there are none?
The bulk of the time in my system was taken by various OTP functions
and processes, Mnesia and unknown functions. All I could infer from it
is that perhaps I have too many processes.
Another thing that I inferred is that the normal method of writing
gen_server code doesn't work for profiling.
I had to rewrite the gen_server clauses to immediately dispatch to
functions, e.g.
handle_cast('LOGOUT', Data) ->
    handle_cast_logout(Data);
handle_cast('DISCONNECT', Data) ->
    handle_cast_disconnect(Data);
otherwise all the clauses of a gen_server are squashed together,
regardless of the message pattern. I don't know if there's a better
way to tackle this.
Next, I rewrote most of my gen_servers as data structures, e.g. pot,
limit, deck, etc. A deck of cards can take a message to draw a card
but the message can just as well be a function call. The deck
structure will need to be modified regardless and the tuple will be
duplicated anyway. There didn't seem to be any advantage in using a
process here, much less a gen_server.
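To illustrate, a deck as plain data might look something like this (a minimal sketch, not the actual OpenPoker code):

```erlang
%% A deck as a plain data structure. No process, no gen_server:
%% drawing a card is just a function call that returns the card
%% and the new deck.
-module(deck).
-export([new/0, draw/1]).

%% A fresh deck: 52 {Rank, Suit} tuples, ranks 2..14 (ace high).
new() ->
    [{Rank, Suit} || Suit <- [clubs, diamonds, hearts, spades],
                     Rank <- lists:seq(2, 14)].

%% Pure draw: no message passing; the caller keeps the updated deck.
draw([Card | Rest]) -> {Card, Rest};
draw([])            -> {error, empty}.
```

The caller threads the deck through its own state, which is exactly the tuple duplication a gen_server would have done anyway.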
Next I carefully went through my Mnesia schema and split some tables
into smaller tables. I made sure that only the absolutely necessary
tables were disk-based. I wish I could run without updating Mnesia
tables during a game, but this is impossible since player balances and
status need to be updated when players join or leave a game, as well
as when a game finishes.
All my hard work paid off and I was able to get close to 10K players,
with kernel poll enabled, of course. Then I ran out of ETS tables.
I don't create ETS tables on the fly but, apparently, Mnesia does. For
every transaction!!!
This prompted me to go through the server again and use dirty_read and
dirty_write wherever possible. I also placed balances in two separate
"counter" tables, as integers to be divided by 10000 to get 4 decimal
places of precision. This is so that I could use dirty_update_counter
instead of the regular read, bump, write pattern.
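The counter scheme amounts to something like this (module, table, and field names are my own illustration, not the actual schema):

```erlang
%% Sketch of the fixed-point balance scheme. Balances are integers in
%% 1/10000 units, so mnesia:dirty_update_counter/3 can bump them
%% without a transaction.
-module(balance).
-export([to_units/1, from_units/1, credit/2]).

to_units(Amount)  -> round(Amount * 10000).   %% dollars -> integer units
from_units(Units) -> Units / 10000.           %% integer units -> dollars

%% Assumes a counter table 'balance' with records {balance, PlayerId, Units}.
credit(PlayerId, Amount) ->
    mnesia:dirty_update_counter(balance, PlayerId, to_units(Amount)).
```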
My frustration kept increasing but I gained more concurrent players. I
can now safely run up to 8K bots before timeouts start to appear.
These are gen_server call timeouts when requests for game information
take longer than the default 5 seconds. I have an average of 5 players
per game so this is not because a large number of processes are trying
to access the game.
I suppose this is a reflection of the load on the system, although CPU
usage never goes past 300% which tells me that no more than 3 cores
are used by Erlang.
The straw that broke my back was when stopping a bot's matching player
gen_server by returning {stop, ...} started causing my observer
process to receive tcp_close and exit. I could repeat this like
clockwork. Only spawning a separate process to send the player a stop
message would fix it.
Then I changed the way I represent cards and started seeing this
behavior again, in just one of my tests. What do cards have to do with
tcp_close? I don't know, and the dbg tracer is my best friend! What I
know is what git tells me, and git says cards were the only difference.
Anyway, I don't think I have fully recovered yet. I may need a weekend
just to regain my sanity. I will try to spread the load among several
VMs, but my hunch is that my distributed 100k players target is far,
far away. I may have to keep flying blind, with only traces and
printouts to my rescue.
Thanks for listening, Joel
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions
> I can't see it clearly. How many requests per second did your server
> serve, or how fast do your bots send requests?
I can't see it clearly either. How would I properly measure the # of
requests per second?
I tried to relieve the server by spreading the load and making each
bot wait a random amount of time before replying to a bet request. It
didn't improve the situation much.
> Count served requests during some longer period (minutes, better a
> quarter of an hour, up to day(s)) under an almost-constant load?
> The goal is to avoid dynamic load leveling.
Here are the stats from a 1k game run with a total of ~5k players,
collected on a 10s timer.
Elapsed time between collections is always a little over 10 seconds.
The first report covers game startup. Lots of requests!
Things level off once games start and the load increases again once
games start to end and I need to update player balances and things.
What does this tell you?
Thanks, Joel
---
Simulating gameplay with 1000 games...
Waiting for games to end...
50 games started, 262 players
100 games started, 504 players
...
950 games started, 4777 players
1000 games started, 5066 players
=INFO REPORT==== 12-Sep-2008::16:34:34 ===
requests: 12132
bytes: 333257
requests_per_second: 1213
bytes_per_second: 33322
=INFO REPORT==== 12-Sep-2008::16:34:44 ===
requests: 2012
bytes: 27082
requests_per_second: 200
bytes_per_second: 2702
=INFO REPORT==== 12-Sep-2008::16:34:54 ===
requests: 3054
bytes: 41990
requests_per_second: 300
bytes_per_second: 4134
=INFO REPORT==== 12-Sep-2008::16:35:39 ===
requests: 2009
bytes: 30349
requests_per_second: 197
bytes_per_second: 2980
=INFO REPORT==== 12-Sep-2008::16:35:39 ===
requests: 4161
bytes: 35942
requests_per_second: 415
bytes_per_second: 3592
=INFO REPORT==== 12-Sep-2008::16:35:39 ===
requests: 5452
bytes: 48210
requests_per_second: 543
bytes_per_second: 4808
=INFO REPORT==== 12-Sep-2008::16:35:44 ===
requests: 7662
bytes: 53124
requests_per_second: 765
bytes_per_second: 5304
http://www.lshift.net/blog/2007/02/17/json-and-json-rpc-for-erlang
> It doesn't seem as good as I gathered from your previous messages. It
> looks like you can get less than 1k req/s.
Right.
> I don't know if it is a load which you can sustain for, say, 5,
> 10, ... minutes.
Yes and no, see below.
> You can serve 15k concurrent players. Does that still sound too bad
> for you?
15K concurrent players would be excellent. The stats I posted before
don't paint the full picture, though. The part that I didn't post
looks like this...
Simulating gameplay with 1000 games...
Waiting for games to end...
50 games started, 262 players
100 games started, 504 players
...
950 games started, 4777 players
1000 games started, 5066 players
4 Mnesia overloaded error reports
=ERROR REPORT x 4==== 12-Sep-2008::16:34:40 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
And 11 more
=ERROR REPORT x 11 ==== 12-Sep-2008::16:34:44 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
And then the real bummer, a whole lot of these:
=ERROR REPORT ==== 12-Sep-2008::16:35:01 ===
** State machine <0.9390.0> terminating
** Last timer event in was 'CHECK'
** When State == dispatch
...
** Reason for termination =
** {timeout,{gen_server,call,[<0.9391.0>,{'SEATS',119}]}}
The above says that even with just 5K players the card game FSM times
out when requesting seats matching a certain mask, i.e. the reply
takes longer than 5 seconds. The game bombs out at this point.
I could increase the timeout or make the call just wait forever. I
don't want to do this, though, as this timeout serves as a performance
indicator.
Thanks, Joel
> What is your hardware configuration? How many cores are available? The
> reason I ask is because some people have seen better performance in
> certain
> clustering-type Erlang applications by running one VM per core, each
> VM
> running in +S 1 (non-SMP) mode, instead of using SMP. Have you tried
> this?
I ran the same test with +S 1. It doesn't look like SMP affects things
too much in terms of # requests per second.
The big difference is that the test runs through to the end, no
gen_server timeouts. It starts timing out with 2k games and ~7K
players, though.
=INFO REPORT==== 12-Sep-2008::18:42:55 ===
requests: 910
bytes: 24558
requests_per_second: 90
bytes_per_second: 2455
=INFO REPORT==== 12-Sep-2008::18:43:05 ===
requests: 11222
bytes: 308699
requests_per_second: 1122
bytes_per_second: 30867
10 Mnesia overloaded reports.
=ERROR REPORT x10 ==== 12-Sep-2008::18:43:09 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=INFO REPORT==== 12-Sep-2008::18:43:15 ===
requests: 4461
bytes: 60746
requests_per_second: 442
bytes_per_second: 6025
=ERROR REPORT x6 ==== 12-Sep-2008::18:43:16 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=INFO REPORT==== 12-Sep-2008::18:43:25 ===
requests: 1874
bytes: 24746
requests_per_second: 187
bytes_per_second: 2470
=INFO REPORT==== 12-Sep-2008::18:43:35 ===
requests: 3864
bytes: 55476
requests_per_second: 384
bytes_per_second: 5516
=ERROR REPORT x4 ==== 12-Sep-2008::18:43:38 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=INFO REPORT==== 12-Sep-2008::18:43:45 ===
requests: 9752
bytes: 84331
requests_per_second: 974
bytes_per_second: 8425
=ERROR REPORT x7 ==== 12-Sep-2008::18:43:47 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
1000 games finished
Elapsed: 59.147214s, Average run time: 0.059147214 seconds
Mb exited successfully, 59.148019 seconds elapsed
ok
(1@mothership)3>
=INFO REPORT==== 12-Sep-2008::18:43:55 ===
requests: 10950
bytes: 74516
requests_per_second: 1094
bytes_per_second: 7450
- Ian
On Sep 12, 2008, at 12:00 PM, Vance Shipley wrote:
> On Fri, Sep 12, 2008 at 11:54:27AM -0700, Ian Baird wrote:
> } Actually, a dtrace-enabled version of the erlang interpreter
> would be
> } pretty sweet, and useful for people running on Solaris, *BSD, or
> OS X.
> } Is this in the works?
>
> http://groups.google.com/group/erlang-dtrace
>
> -Vance
> Like any technical project you are dealt a hand and you have to play
> it.
Aye! Rest assured, I'm not giving up.
Have you tried adding a thread pool? If you are IO bound on disk
that should help quite a bit. It doesn't help with TCP/IP however
as the inet drivers are not threaded. Maybe they should be.
-Vance
> Have you tried adding a thread pool? If you are IO bound on disk
> that should help quite a bit.
Tried 5 async threads and got a much higher rate of "Mnesia
overloaded" messages. Didn't help much otherwise, I still get the
dreaded gen_server timeout.
Thanks, Joel
Actually, a dtrace-enabled version of the erlang interpreter would be
pretty sweet, and useful for people running on Solaris, *BSD, or OS X.
Is this in the works?
- Ian
> [...]
> I suppose this is a reflection of the load on the system, although CPU
> usage never goes past 300% which tells me that no more than 3 cores
> are used by Erlang.
> [...]
You know, my own experience with R12B on Mac OS X (analogous machine
to yours, 8 cores, buncha RAM, fast disks, etc.) was similar:
http://is.gd/2y0j = http://mult.ifario.us/p/use-the-cores-erl
http://is.gd/2y0g = http://www.erlang.org/pipermail/erlang-questions/2008-January/032386.html
I'll chalk it up to something like the preemption timing or other low-
level characteristics of MacOS. You need to head to a larger number
of OS processes running Erlang VMs to get your utilization up, and
this seems like a sensible reality for multicore in general, as OS
scheduling and Erlang scheduling are two independent behaviors from
which you need to select the happy medium.
Cheers.
Paul Brown
paulr...@gmail.com
> I don't want to sound like an ass, but what about leasing server
> storage and processing power? Most do a decent job at dealing with the
> hardware side of things, all that you have to focus on is the code.
I do lease a server at Joyent. I'll post the metrics in a little while
so that we have a chance to compare.
We have a system with a completely different workload profile, but our
initial testing has indicated an SMP scaling barrier, such that scaling
beyond 4 cores (on x86-64 Linux 2.6.24 w/R12B-3) deteriorates quickly.
My very preliminary investigations indicated a substantial number of
futex system calls (which should happen only on contended locks which
need kernel arbitration) as the work scaled toward 4 cores. I had
initially thought that this was related to the amount of work we perform
in linked-in drivers via async threads, but some more current
information indicates that this might be more core-scheduler related.
Can anyone with experience running a sufficiently complex system on 8+
core machines share it? I would love to hear from someone
intimate with the SMP implementation on the state of affairs and
possible future directions.
--
paul
 '1@mothership'                                                 17:19:03
 Load:  cpu        90   Memory:  total     1309096   binary        1302
        procs   21259            processes 1185824   code          6406
        runq        0            atom          961   ets          52340

Pid            Name or Initial Func     Time     Reds  Memory MsgQ Current Function
----------------------------------------------------------------------------------------
<5624.3952.0>  proc_lib:init_p/5         '-'  1046280 1113656    0 gen_server:loop/6
<5624.0.0>     init                      '-'   576618   26320    0 init:loop/1
<5624.3941.0>  portServer3000            '-'   183416   88336    0 tcp_server:socket_lo
<5624.3956.0>  inet_gethost_native       '-'   178043    6864    0 inet_gethost_native:
<5624.73.0>    proc_lib:init_p/5         '-'    89051  688680    0 disk_log:loop/1
<5624.3937.0>  proc_lib:init_p/5         '-'    68600   24520    0 gen_server:loop/6
<5624.3938.0>  portServer6000            '-'    27716  418544    0 tcp_server:socket_lo
<5624.30.0>    group:server/3            '-'    20382   24520    0 group:server_loop/3
<5624.25.0>    file_server_2             '-'     7192   88400    0 gen_server:loop/6
<5624.16494.0> proc_lib:init_p/5         '-'     4770    8688    0 gen_server:loop/6
========================================================================
Output server crashed: connection_lost
Another issue with etop:
Erlang top got garbage {trace_ts,<5623.28.0>,out,
{user_drv,server_loop,5},
{1221,332097,210811}}
Erlang top got garbage {trace_ts,<5623.25.0>,out,
{gen_server,loop,6},
{1221,332097,267093}}
Erlang top got garbage {trace_ts,<5623.4.0>,out,
{gen_event,fetch_msg,5},
{1221,332097,340833}}
I'm not doing any tracing myself so I really don't know where these
traces come from. The easiest way to reproduce is to connect etop and
then start Mnesia.
You might get more help debugging this if you provided code
which triggers the problem. Obviously not your proprietary code,
but something that stresses the Erlang system in the same way.
The Erlang system doesn't do any explicit futex ops on its own.
But glibc's pthread lock ops will do futex ops in their slow paths.
> How many mnesia transactions do you have left?
No idea. Is there anything in the etop output that makes you think
transactions are a problem?
I think I only have one transaction in my code, everything else is
dirty ops.
> From what I tested,
> transactions are slow as hell. I'm sure you would get much better
> performance, if a gen_server was in charge of a table and all
> queries were dirty and went through that process.
I also tried making all the tables ram_copies, just for the sake of
it. I still get gen_server timeouts.
I'm trying to profile with eprof now. I started profiling but my test
harness hangs waiting for the games that will never finish.
I thought I'd try to connect to the node, find that test process and
kill it.
erl -remsh 1@mothership
Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-
threads:0] [kernel-poll:false]
*** ERROR: Shell process terminated! (^G to start new job) ***
User switch command
--> r '1@mothership'
--> j
2* {'1@mothership',shell,start,[]}
--> c 2
*** ERROR: Shell process terminated! (^G to start new job) ***
Tools are failing me and the saga continues!
'erl -sname 2 -remsh 1@mothership' works fine.
On Sep 13, 2008, at 9:09 PM, Joel Reymont wrote:
> I thought I'd try to connect to the node, find that test process and
> kill it.
>
> erl -remsh 1@mothership
> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8]
> [async-threads:0] [kernel-poll:false]
>
> *** ERROR: Shell process terminated! (^G to start new job) ***
--
I don't claim any experience in this field but I did a few tests with
CouchDB on a 32-core machine. I've seen linear performance increase up
to 14 cores with a CPU utilisation of ~22% and no disk I/O (makes
sense: little data, loads of RAM). I tested with increasing +S
parameters for `erl`. We ran the results by some friends at Sun who
suggested that this might be a bottleneck in the networking layer.
Something might have a per-CPU (not per-core) bottleneck. Some kernel
tuning in Solaris and possibly better NIC programming could solve this
one, but I only had access to the box for a short time and had no
chance to change any boot parameters, let alone hardware specs.
Profiling showed that Erlang was mostly waiting for network I/O, which
further underlines the Solaris config problem theory.
Cheers
Jan
--
> Here's the stupid idea: Run one +S 1 VM per core instead of N +S N
> VMS.
> [...]
> What do you think? Is the newbie smoking something, or is there maybe
> something to it?
There's definitely something to it and I'll try it just as soon as I'm
able to make my test harness a distributed one. It's designed to
launch everything on a single node right now.
There's something that's nagging at me as well, which is why I haven't
been in a rush to rewrite the test harness. I want to know why my
gen_server calls start to time out!
I even killed the VM upon the first timeout to get a crash dump. Lo
and behold, there were no processes with lengthy message queues. The
distribution among the 21234 processes goes like this:
0, 1, 2, 3, 4, 5, 6, 7 (20), 8 (18), 9 (28), 10 (36), 11 (19), 12
(24), 13 (8), 14 (8), 15, 26, 32, 33
The # of processes with each queue length is in parentheses, and is 1
where there is no count. Perhaps I should pay attention to the counts
over 5 or 10.
The complication is that most of the gen_servers look like this in the
crash dump...
=proc:<0.4093.0>
State: Waiting
Spawned as: proc_lib:init_p/5
Spawned by: <0.4001.0>
Started: Sat Sep 13 18:36:39 2008
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [#Port<0.5677>]
Reductions: 3422
Stack+heap: 2584
OldHeap: 377
Heap unused: 1800
OldHeap unused: 377
Program counter: 0x0000000144d3c748 (gen_server:loop/6 + 288)
CP: 0x0000000144cfe458 (proc_lib:init_p/5 + 400)
arity = 0
Note that the current function is gen_server:loop/6, they were
(obviously) spawned by proc_lib:init_p/5 and the processes are unnamed.
There's more info in the crash dump on every process, e.g. process
dictionary, stack and heap (below). I'm able to deduce that this is a
bot process but I'm still scratching my head over jumping from this to
identifying the gen_server timeouts.
=proc_dictionary:<0.4093.0>
H10B4D30
H10B4D48
=proc_stack:<0.4093.0>
0x000000015e0fc498:SReturn addr 0x44CFE458 (proc_lib:init_p/5 + 400)
y0:N
y1:A8:infinity
y2:A3:bot
y3:H5E0F8C30
y4:P<0.4093.0>
y5:P<0.4093.0>
0x000000015e0fc4d0:SReturn addr 0x1A9D28 (<terminate process normally>)
y0:SCatch 0x44CFE478 (proc_lib:init_p/5 + 432)
y1:A3:gen
y2:A7:init_it
y3:H10B4D20
=proc_heap:<0.4093.0>
5E0F8C30
:tD:A3
:bot
,H10B4DB0,I14,I3,p<0.5677>,H10B4FA0,N,I1150,I2,I797372250,A5:false,I1,I0
10B4FA0:lA5:CHECK|H10B4FB0
10B4FB0:lA5:CHECK|H10B4FC0
10B4FC0:lA4:FOLD|H10B4FD0
10B4FD0:lA3:nil|N
10B4DB0:YhE:6B6D616E2F373937333732323530
10B4D20:lAA:gen_server|H10B4DE8
10B4DE8:lP<0.4001.0>|H10B4E48
10B4E48:lA4:self|H10B4E88
10B4E88:lA3:bot|H10B4EC0
10B4EC0:lH10B4EE0|H10B4EF0
10B4EE0:lH10B4DB0|H10B4F10
10B4F10:lI797372250|H10B4F30
10B4F30:lI2|H10B4F50
10B4F50:lI1160|N
10B4EF0:lN|N
10B4D30:t2:AD:$initial_call,H10B4DF8
10B4DF8:t3:A3:gen,A7:init_it,H10B4D20
10B4D48:t2:AA:$ancestors,H10B4E18
10B4E18:lP<0.4001.0>|H10B4E58
10B4E58:lP<0.37.0>|N
> You'd probably have to partition the load to round-robin across the
> individual VMs, possibly using some front-end load-balancing
> hardware. This is why I keep harping on this: some time ago I put
> the system I am working on under heavy load to test the maximum
> possible throughput. There was no appreciable disk I/O. The kicker
> is that I did not see an even distribution of load across the 4
> cores of my box. In fact, it looked as if one or maybe two cores
> were being used at 100% and the rest were idle. When I re-ran the
> test on a whim, using only 1 non-SMP (+S 1) node, I actually got
> better performance.
>
> This seemed counter-intuitive and against the "Erlang SMP scales
> linearly for CPU-intensive loads" idea. I have not done a lot of
> investigation into this because I have other fish to fry right now,
> but the folks over at LShift (RabbitMQ) - assuming I did not
> misunderstand them - wrote that they had seen similar behavior when
> running clustered Rabbit nodes (i.e. better performance from N
> single-CPU nodes than N +S N nodes). However, they, like me, are not
> ready to come out and state this bluntly as a fact because (I
> believe) they feel not enough investigation has been done to make
> this a conclusive case.
I've also been seeing similar behavior trying to parallelize the
alioth shootout code, fwiw. I'd also say it's premature to draw any
concrete conclusions, but another anecdotal point.
(Also, on the particular OS & hardware the benchmarks run on, the
total CPU usage nearly doubles for the parallel implementations. On
my 2-core mac, though, I see no more than a 10% increase in total CPU
usage, and a near 100% improvement in the wall-time, as one should
expect on the embarrassingly parallel problems. Dunno, if this is
related to the OS, the chip (Core 2 Duo vs Core 2 Quad), HiPE, or what.)
-kevin
I haven't read the whole correspondence (it seems to have been going on for way too long), but I'd like to add my 2c worth...
While the +S 1 approach may solve some problems, it defeats the purpose of having a multi-core machine. Please note that multi-core machines have lower clock speeds, and thus should generally run slower per CPU core. IMHO, if +S 1 solves your problem, maybe you should revisit your code -- I think it is wrong to expect that the same code would work better on SMP just because you had such expectations. For example, it is a known fact that ETS works slower in an SMP environment.
Also, one should not forget to use +A in addition to +S -- although you do not have any disk I/O, I think this parameter is relevant for port scheduling, therefore improving the performance of your I/O.
V.
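For reference, the flags under discussion combine like this (the values and node name are illustrative, not recommendations):

```shell
# +S sets the scheduler count, +A the async-thread pool size,
# +K true enables kernel poll.
erl +S 1 +A 128 +K true -sname gameserver
```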
----- Original Message ----- From: "Kevin Scaldeferri" <ke...@scaldeferri.com>
To: "Edwin Fine" <erlang-ques...@usa.net>
Cc: "erlang-questions Questions" <erlang-q...@erlang.org>
Sent: Sunday, September 14, 2008 12:07 AM
Subject: Re: [erlang-questions] My frustration with Erlang
That said, the reasons for running threads rather than nodes might be:
- easier to setup and change (porting to 16 cores is "+S 16", or perhaps "+S 20")
- easier to load-balance
- less costly messages and cheaper table access than remote operations
- less memory used due to sharing inside VM
- perhaps better memory performance, again due to sharing
A further technical limitation of nodes, once pointed out to me by Sean Hinde, is that sockets can't be sent between nodes today.
Best,
Thomas
--- On Sun, 9/14/08, Edwin Fine <erlang-ques...@usa.net> wrote:
Did you try increasing the maximum number of ETS tables?
This is done using the environment variable ERL_MAX_ETS_TABLES
It can be set to a number considerably higher than the default
(1400 tables).
BR,
Ulf W
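For what it's worth, the variable can be set when starting the node, along these lines (the value is illustrative):

```shell
# Raise the ETS table limit from the default 1400 before the emulator starts.
ERL_MAX_ETS_TABLES=32768 erl -sname gameserver
```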
> Did you try increasing the maximum number of ETS tables?
>
> This is done using the environment variable ERL_MAX_ETS_TABLES
That would have been the easy way out :-).
No, I just combed through the code and switched to dirty mnesia ops
whenever possible.
I'm very close to figuring out my issue and I'm almost certain that
it's mine to fix.
I'll post again next week, explaining what I did to pinpoint the
problem and how I did it (tracing, crash dump analysis). I'll also
post new throughput statistics.
Thanks, Joel
Sometimes it is possible to ask for new cards.
bengt
On Fri, 2008-09-12 at 14:11 -0400, john s wolter wrote:
> Joel,
>
>
> Like any technical project you are dealt a hand and you have to play
> it.
>
> On Fri, Sep 12, 2008 at 9:52 AM, Joel Reymont <joe...@gmail.com>
> wrote:
> I sell a poker server written in Erlang. It's supposed to be
> super-
> robust and super-scalable. I'm about to move to the next level
> by
> adding the missing features, e.g. tournaments and a Flash
> client.
>
> I appreciate everything that the Erlang/OTP is doing but I
> thought I
> would vent a few of my recent frustrations with Erlang. I'm in
> a good
> mood after spending a day with OCaml and I have calmed down.
> Still,
> prepare yourself for a long rant ahead!
>
> My development workstation is a Mac Pro 2x2.8Ghz Quad Xeon,
> 12Gb of
> memory, one 250Gb and two more drives 500Gb each, all 7200RPM
> SATA. I
> use R12B3, SMP and kernel poll, i.e.
>
> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8]
> [async-
> threads:0] [kernel-poll:true]
>
> My overwhelming frustration is the opacity of a running Erlang
> system.
> There are no decent tools for peering inside. No usable ones
> whatsoever!
>
> With any other language you can profile, make changes,
> evaluate
> performance and make a judgement but not with Erlang.
>
> I first wrote OpenPoker using OTP everywhere. My players,
> games, pots,
> limits, hands, decks, etc. were all gen_server processes. I
> used
> Mnesia transactions everywhere and I used them often.
>
> Then I discovered that I cannot scale past 2-3k concurrent
> players
> under heavy use.
>
> I have a test harness that launches bots which connect to the
> server
> and play by the script. The bots don't wait before replying to
> bet
> requests and so launching a few thousand bots heavily loads
> the server.
>
> I don't want just a few thousand concurrent bots, though! I
> want at
> least 10k on a single VM and hundreds of thousands on a
> cluster, so I
> set to optimize my poker server.
>
> The Erlang Efficiency Guide recommends fprof as the tool. I
> ran fprof
> on my test harness and discovered that the result set cannot
> be
> processed in my 12Gb of memory. I made this discovery after
> leaving
> fprof running for a couple of days and realized this because
> the fprof
> data files were approaching 100Gb and my machine became
> unusable due
> to heavy swapping.
>
> fprof usets ets tables to analyze the trace results and ets
> tables
> must fit in memory.
>
> I shortened my test run and was able to see the output of the
> fprof
> trace analysis. To say that it's dense would be an
> understatement! I
> realize that dumping out tuples is easy but aren't computers
> suppose
> to help us humans?
>
> The final output from fprof is still too raw for me to
> analyze.
> There's absolutely, positively, definitely no way to get a
> picture of
> a running system by reading through it. I understand that I
> can infer
> from the analysis that certain functions take a lot of time
> but what
> if there are none?
>
> The bulk of the time in my system was taken by various OTP
> functions
> and processes, Mnesia and unknown functions. All I could infer
> from it
> is that perhaps I have too many processes.
>
> Another thing that I inferred is that the normal method of
> writing
> gen_server code doesn't work for profiling.
>
> I had to rewrite the gen_server clauses to immediately
> dispatch to
> functions, e.g.
>
> handle_cast('LOGOUT', Data) ->
> handle_cast_logout(Data);
>
> handle_cast('DISCONNECT', Data) ->
> handle_cast_disconnect(Data);
>
> otherwise all the clauses of a gen_server are squashed
> together,
> regardless of the message pattern. I don't know if there's a
> better
> way to tackle this.
>
> Next, I rewrote most of my gen_servers as data structures,
> e.g. pot,
> limit, deck, etc. A deck of cards can take a message to draw a
> card
> but the message can just as well be a function call. The deck
> structure will need to be modified regardless and the tuple
> will be
> duplicated anyway. There didn't seem to be any advantage in
> using a
> process here, much less a gen_server.
>
> Next I carefully went trough my Mnesia schema and split some
> tables
> into smaller tables. I made sure that only the absolutely
> necessary
> tables were disk-based. I wish I could run without updating
> Mnesia
> tables during a game but this is impossible since player
> balances and
> status need to be updated when players join or leave a game,
> as well
> as when a game finishes.
>
> All my hard work paid off and I was able to get close to 10K
> players,
> with kernel poll enabled, of course. Then I ran out of ETS
> tables.
>
> I don't create ETS tables on the fly but, apparently, Mnesia
> does. For
> every transaction!!!
>
> This prompted me to go through the server again and use
> dirty_read,
> dirty_write wherever possible. I also placed balanced in two
> separate
> "counter" tables, integers to be divided by 10000 to get 4
> decimal
> points of precision. This is so that I could use
> dirty_update_counter
> instead of a regular read, bump, write pattern.
>
> My frustration kept increasing but I gained more concurrent
> players. I
> can now safely run up to 8K bots before timeouts start to
> appear.
>
> These are gen_server call timeouts when requests for game
> information
> take longer than the default 5 seconds. I have an average of 5
> players
> per game so this is not because a large number of processes
> are trying
> to access the game.
>
> I suppose this is a reflection of the load on the system,
> although CPU
> usage never goes past 300% which tells me that no more than 3
> cores
> are used by Erlang.
>
> The straw that broke my back was when stopping a bot's
> matching player
> gen_server by returning {stop, ... } started causing my
> observer
> process to receive tcp_close and exit. I could repeat this
> like
> clockwork. Only spawning a separate process to send the player a
> stop
> message would fix this.
>
> Then I changed the way I represent cards and started seeing this
> behavior
> again, in just one of my tests. What do cards have to do with
> tcp_close? I don't know and dbg tracer is my best friend! What
> I know
> is what git tells me and git says cards were the only
> difference.
>
> Anyway, I don't think I have fully recovered yet. I may need a
> weekend
> just to regain my sanity. I will try to spread the load among
> several
> VMs but my hunch is that my distributed 100k players target is
> far far
> away. I may have to keep flying blind, with only traces and
> printouts to my rescue.
>
> Thanks for listening, Joel
>
> --
> wagerlabs.com
>
> _______________________________________________
> erlang-questions mailing list
> erlang-q...@erlang.org
> http://www.erlang.org/mailman/listinfo/erlang-questions
>
>
>
> --
> John S. Wolter President
> Wolter Works
> Mailto:johns...@wolterworks.com
> Desk 1-734-665-1263
> Cell: 1-734-904-8433
> Mnesia overloaded error reports:
> ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
> e.g. the reply takes longer than 5 seconds. The game bombs out at this point.
> I could increase the timeout or make the call just wait forever. I don't want to do this, though, as this timeout serves as a performance indicator.
I've had the exact same two reports causing all kinds of
unexpected behaviour.
While setting infinity on gen_server calls worked for me, I think
setting a higher timeout value (than the default 5 seconds) is your
best bet, and that's what worked for us. We had some python<->erlang
logic happening in one of the gen_server calls, and it used to
consistently time out until I learned of the 5 second default
value.
IMHO, perhaps you can also move the 'performance analysis' to another
part of the poker server, rather than mixing it in with the decks and
poker logic. In that case, having gen_server calls wait with infinity
may not seem like a bad idea.
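For reference, the timeout is the optional third argument to gen_server:call/3; a minimal sketch (the module and message names are hypothetical):

```erlang
%% gen_server:call/3 takes an explicit timeout in milliseconds, or the
%% atom 'infinity'. The default (gen_server:call/2) is 5000 ms.
get_game_info(GamePid) ->
    %% raise the timeout for requests known to be slow under load
    gen_server:call(GamePid, get_info, 30000).

get_game_info_blocking(GamePid) ->
    %% wait forever; the caller is then protected only by links/monitors
    gen_server:call(GamePid, get_info, infinity).
```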
Regarding the mnesia dump_threshold calls, we did two things:
#1 If latency isn't an issue, take a whole bunch of writes, write
them to a file, then on a trial-and-error basis configure that after
every N entries you do a mnesia:load_textfile(Dump). This way all
those N entries get written in one transaction. One thing to look at,
though, is mnesia access permissions at load time. In our case we
access mnesia only through Yaws, so no rwx permission errors show up,
and we also made sure that mnesia already knows where to distribute
the newly loaded data (in our case a 4-node cluster currently). If
latency is critical, then why do you need mnesia in the first place?
Just message pass (though I suppose you would still need to archive
the data even if it behaves like a messaging system).
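An alternative to the text-file round trip is to batch N writes into one transaction directly; a hedged sketch, where the records passed in are assumed to belong to existing tables:

```erlang
%% Write a buffered list of records in a single Mnesia transaction, so
%% the transaction (and dump_log) overhead is paid once per batch
%% rather than once per write.
flush_buffer(Records) ->
    F = fun() -> lists:foreach(fun mnesia:write/1, Records) end,
    mnesia:transaction(F).
```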
#2 Creating user-specific tables. I know most normalization/db gurus
out there might raise an eyebrow. As much as we're used to pattern
matching all rows from a table where user matches X, we just created
separate tables for each user, and know exactly where to look. Now we
can take maybe a much higher factor of the load into a table, without
even considering fragmentation. Btw, thanks for pointing out the
default max ets tables, will have to look into that.
Just FYI, the third biggest error we used to get in our logs was the
{error,open} message "yaws failed to bind socket error", and we spent
a long time figuring out if it was related to the timeout. It turned
out that yaws.conf has a little property that sets "fail completely
when yaws fails to bind socket" to true by default. We switched it to
false, and that's the last time we saw those error,open reports 8)
~
Keep Clicking,
Bhasker V Kode | bosky101
hover.in
While the SMP (+S 1) approach may solve some problems, it defeats the purpose of
having a multi-core machine. Please note that multi-core machines have lower
clock speeds, and thus should generally run slower per CPU core. IMHO, if
+S 1 solves your problem, maybe you should revisit your code -- I think that
it is wrong to expect that the same code will work better on SMP just
because you had such expectations. For example, it is a known fact that ETS
works slower in an SMP environment.
Also, one should not forget to use +A in addition to +S -- although you do
not have any disk I/O, I think this parameter is relevant for port
scheduling, therefore improving the performance of your I/O.
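As a concrete illustration, the flags go on the erl command line; the values below are placeholders to tune, not recommendations:

```shell
# 8 schedulers, 16 async threads for port/driver I/O, kernel poll on.
# Appropriate values depend entirely on your hardware and workload.
erl +S 8 +A 16 +K true -sname poker
```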
V.
----- Original Message -----
From: "Kevin Scaldeferri" <ke...@scaldeferri.com>
To: "Edwin Fine" <erlang-ques...@usa.net>
Cc: "erlang-questions Questions" <erlang-q...@erlang.org>
Sent: Sunday, September 14, 2008 12:07 AM
Subject: Re: [erlang-questions] My frustration with Erlang
>
I might have a suggestion. At my job, we've used CORBA. Although not
exactly the same, it's similar in principle. We had an issue with
lots of messages getting passed around and gumming up the works,
causing messages to be delivered late (our software had requirements
to deliver the messages within a certain amount of time). The biggest
problem was that there were lots of small messages getting sent
around. Which meant, behind the scenes, a lot of connections opened
and closed, which takes time, during which less productive processing
(on the CPU) is done. We tried 2 solutions:
1) Turn the small messages into large, long ones. That way, you just
have one massive message sent. Combine them.
2) Collect pieces of information into a buffer and then send it out
once the buffer has reached a limit.
Say you do:
receive
    Message1 ->
        ....
    Message2 ->
        ....
    Message3 ->
        ....
    Message4 ->
        ....
end.
But you change it to:
receive
    LargeMessage ->
        ....
end.
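Option 2 above, buffering until a limit, might be sketched like this (the names and the flush threshold are made up):

```erlang
%% Accumulate small messages in a list and forward them to Dest as one
%% large {batch, ...} message once the buffer reaches a limit.
buffer_loop(Dest, Buf) when length(Buf) >= 100 ->
    Dest ! {batch, lists:reverse(Buf)},
    buffer_loop(Dest, []);
buffer_loop(Dest, Buf) ->
    receive
        Msg -> buffer_loop(Dest, [Msg | Buf])
    end.
```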
Also, I have a question. Isn't it possible to simulate multiple CPUs
with Erlang's virtual machine? If so, then wouldn't creating lots of
these virtual CPUs reproduce the said problem and make it easier to
see it coming? etop seems like a tool that might answer this
question too; too bad it crashes so often :( .
r> How many mnesia transactions do you have left? From what I tested,
r> transactions are slow as hell.
I'm curious, what kind of transactions? In tests from over a year ago
(using R11B-5, I think; it's the only hard data that I have within easy
arm's reach), I had single-table performance on a 2-node cluster, Xeon
2.8GHz CPUs (maybe?), 'disc_copies' tables that were replicated on each
node as well as fragmented, with single-table read txns (context =
sync_transaction) of up to 7K transactions/sec and single-table (single-
key) update txns at 4.1K transactions/sec. Not stupendously great, but
for a non-SMP VM plus Mnesia's distributed transaction managers involved
in every transaction, it isn't too bad.
If your tables were type 'disc_only_copies', then I'd have far fewer
problems believing.
-Scott
tl> A further technical limitation of nodes, once pointed out to me by
tl> Sean Hinde, is that sockets can't be sent between nodes today.
If you can extract the file descriptor number, the "unixdom_drv" item in
Jungerl should allow file descriptor passing on the same machine.
Assuming support for your OS, no bit-rot in the repository ... YMMV?
-Scott
--- On Fri, 9/19/08, Scott Lystig Fritchie <frit...@snookles.com> wrote:
> tl> A further technical limitation of nodes, once
> pointed out to me by
> tl> Sean Hinde, is that sockets can't be sent
> between nodes today.
>
> If you can extract the file descriptor number, the
> "unixdom_drv" item in
> Jungerl should allow file descriptor passing on the same
> machine.
> Assuming support for your OS, no bit-rot in the repository
> ... YMMV?
By no means a fundamental limitation (on some Unices at least), but as you note, it may need some trailblazing.
Best,
Thomas