[erlang-questions] My frustration with Erlang


Joel Reymont

Sep 12, 2008, 9:52:07 AM
to erlang-questions Questions
I sell a poker server written in Erlang. It's supposed to be super-
robust and super-scalable. I'm about to move to the next level by
adding the missing features, e.g. tournaments and a Flash client.

I appreciate everything that the Erlang/OTP is doing but I thought I
would vent a few of my recent frustrations with Erlang. I'm in a good
mood after spending a day with OCaml and I have calmed down. Still,
prepare yourself for a long rant ahead!

My development workstation is a Mac Pro 2x2.8Ghz Quad Xeon, 12Gb of
memory, one 250Gb and two more drives 500Gb each, all 7200RPM SATA. I
use R12B3, SMP and kernel poll, i.e.

Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-
threads:0] [kernel-poll:true]

My overwhelming frustration is the opacity of a running Erlang system.
There are no decent tools for peering inside. No usable ones whatsoever!

With any other language you can profile, make changes, evaluate
performance and make a judgement but not with Erlang.

I first wrote OpenPoker using OTP everywhere. My players, games, pots,
limits, hands, decks, etc. were all gen_server processes. I used
Mnesia transactions everywhere and I used them often.

Then I discovered that I cannot scale past 2-3k concurrent players
under heavy use.

I have a test harness that launches bots which connect to the server
and play by the script. The bots don't wait before replying to bet
requests and so launching a few thousand bots heavily loads the server.

I don't want just a few thousand concurrent bots, though! I want at
least 10k on a single VM and hundreds of thousands on a cluster, so I
set out to optimize my poker server.

The Erlang Efficiency Guide recommends fprof as the tool. I ran fprof
on my test harness and discovered that the result set cannot be
processed in my 12Gb of memory. I made this discovery after leaving
fprof running for a couple of days, when the fprof data files were
approaching 100Gb and my machine became unusable due to heavy
swapping.

fprof uses ETS tables to analyze the trace results, and ETS tables
must fit in memory.
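One way to keep the trace bounded is to profile only a short window of
the run rather than the whole thing; a rough sketch (the wrapper
function and the file names are only illustrative):

profile_window(Seconds) ->
    fprof:trace([start, {file, "fprof.trace"}]),
    timer:sleep(Seconds * 1000),                  %% let the bots play briefly
    fprof:trace(stop),
    fprof:profile([{file, "fprof.trace"}]),
    fprof:analyse([{dest, "fprof.analysis"}]).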

I shortened my test run and was able to see the output of the fprof
trace analysis. To say that it's dense would be an understatement! I
realize that dumping out tuples is easy but aren't computers supposed
to help us humans?

The final output from fprof is still too raw for me to analyze.
There's absolutely, positively, definitely no way to get a picture of
a running system by reading through it. I understand that I can infer
from the analysis that certain functions take a lot of time but what
if there are none?

The bulk of the time in my system was taken by various OTP functions
and processes, Mnesia and unknown functions. All I could infer from it
is that perhaps I have too many processes.

Another thing that I inferred is that the normal method of writing
gen_server code doesn't work for profiling.

I had to rewrite the gen_server clauses to immediately dispatch to
functions, e.g.

handle_cast('LOGOUT', Data) ->
handle_cast_logout(Data);

handle_cast('DISCONNECT', Data) ->
handle_cast_disconnect(Data);

otherwise all the clauses of a gen_server are squashed together,
regardless of the message pattern. I don't know if there's a better
way to tackle this.
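For reference, the dispatch style above written out in full, so each
message type shows up as its own function in the profiler output;
handle_cast_logout/1 and handle_cast_disconnect/1 are hypothetical
handlers, not OpenPoker's actual code:

handle_cast('LOGOUT', Data) ->
    handle_cast_logout(Data);
handle_cast('DISCONNECT', Data) ->
    handle_cast_disconnect(Data).

%% the real work lives in one function per message type
handle_cast_logout(Data) ->
    %% ... logout logic ...
    {noreply, Data}.

handle_cast_disconnect(Data) ->
    %% ... disconnect logic ...
    {stop, normal, Data}.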

Next, I rewrote most of my gen_servers as data structures, e.g. pot,
limit, deck, etc. A deck of cards can take a message to draw a card
but the message can just as well be a function call. The deck
structure will need to be modified regardless and the tuple will be
duplicated anyway. There didn't seem to be any advantage in using a
process here, much less a gen_server.
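A minimal sketch of the idea, assuming nothing about OpenPoker's actual
deck representation: the deck is just a list, and drawing a card is a
pure function that returns the card together with the new deck.

-module(deck).
-export([new/0, draw/1]).

new() ->
    shuffle([{Rank, Suit} || Rank <- lists:seq(2, 14),
                             Suit <- [clubs, diamonds, hearts, spades]]).

draw([Card | Rest]) ->
    {Card, Rest};
draw([]) ->
    none.

%% crude shuffle: tag each card with a random key and sort on it
shuffle(Cards) ->
    [C || {_, C} <- lists:sort([{random:uniform(), C} || C <- Cards])].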

Next I carefully went through my Mnesia schema and split some tables
into smaller tables. I made sure that only the absolutely necessary
tables were disk-based. I wish I could run without updating Mnesia
tables during a game but this is impossible since player balances and
status need to be updated when players join or leave a game, as well
as when a game finishes.

All my hard work paid off and I was able to get close to 10K players,
with kernel poll enabled, of course. Then I ran out of ETS tables.

I don't create ETS tables on the fly but, apparently, Mnesia does. For
every transaction!!!

This prompted me to go through the server again and use dirty_read,
dirty_write wherever possible. I also placed balances in two separate
"counter" tables, integers to be divided by 10000 to get 4 decimal
points of precision. This is so that I could use dirty_update_counter
instead of a regular read, bump, write pattern.
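A rough sketch of that counter scheme (table and record names are
illustrative): balances are stored as integers scaled by 10000 so they
can be bumped with a single dirty_update_counter call.

-record(balance, {player, amount}).   %% amount = real value * 10000

credit(Player, Amount) ->             %% Amount in real units, e.g. 12.3456
    mnesia:dirty_update_counter(balance, Player, trunc(Amount * 10000)).

balance(Player) ->
    case mnesia:dirty_read(balance, Player) of
        [#balance{amount = A}] -> A / 10000;
        []                     -> 0.0
    end.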

My frustration kept increasing but I gained more concurrent players. I
can now safely run up to 8K bots before timeouts start to appear.

These are gen_server call timeouts when requests for game information
take longer than the default 5 seconds. I have an average of 5 players
per game so this is not because a large number of processes are trying
to access the game.

I suppose this is a reflection of the load on the system, although CPU
usage never goes past 300% which tells me that no more than 3 cores
are used by Erlang.

The straw that broke my back was that stopping a bot's matching player
gen_server by returning {stop, ... } started causing my observer
process to receive tcp_close and exit. I could repeat this like
clockwork. Only spawning a separate process to send the player a stop
message would fix this.

Then I changed the way I represent cards and started seeing this
behavior again, in just one of my tests. What do cards have to do with
tcp_close? I don't know, and dbg tracer is my best friend! What I know
is what git tells me, and git says cards were the only difference.

Anyway, I don't think I have fully recovered yet. I may need a weekend
just to regain my sanity. I will try to spread the load among several
VMs but my hunch is that my distributed 100k players target is far far
away. I may have to keep flying blind, with only traces and
printouts to my rescue.

Thanks for listening, Joel

--
wagerlabs.com

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions

Joel Reymont

Sep 12, 2008, 10:33:46 AM
to Hynek Vychodil, erlang-questions Questions

On Sep 12, 2008, at 3:29 PM, Hynek Vychodil wrote:

> I can't see it clearly. How much requests per second did serve your
> server or how fast your bots sends requests?

I can't see it clearly either. How would I properly measure the # of
requests per second?

I tried to relieve the server by spreading the load and making each
bot wait a random amount of time before replying to a bet request. It
didn't improve the situation much.
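For what it's worth, a rough sketch of the kind of counter that
produces the per-10-second reports shown later in this thread; this is
not the actual OpenPoker stats code, just the general shape.

-module(stats).
-export([start/0, incr/1, loop/2]).

start() ->
    register(stats, spawn(?MODULE, loop, [0, 0])),
    timer:send_interval(10000, stats, report).

incr(Bytes) ->
    stats ! {incr, Bytes}.             %% call this once per request handled

loop(Requests, Bytes) ->
    receive
        {incr, B} ->
            loop(Requests + 1, Bytes + B);
        report ->
            error_logger:info_report([{requests, Requests},
                                      {bytes, Bytes},
                                      {requests_per_second, Requests div 10},
                                      {bytes_per_second, Bytes div 10}]),
            loop(0, 0)
    end.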

Joel Reymont

Sep 12, 2008, 11:55:57 AM
to Hynek Vychodil, erlang-questions Questions

On Sep 12, 2008, at 3:52 PM, Hynek Vychodil wrote:

> Count served requests during some longer period (from minutes, better
> quarter of hour up to day(s)) under some "almost constant" load?
> Goal is to avoid dynamic load leveling.

Here are the stats from a 1k game run with a total of ~5k players,
collected on a 10s timer.

Elapsed time between collections is always a little over 10 seconds.

The first report covers game startup. Lots of requests!

Things level off once games start and the load increases again once
games start to end and I need to update player balances and things.

What does this tell you?

Thanks, Joel

---

Simulating gameplay with 1000 games...
Waiting for games to end...
50 games started, 262 players
100 games started, 504 players
...
950 games started, 4777 players
1000 games started, 5066 players

=INFO REPORT==== 12-Sep-2008::16:34:34 ===
requests: 12132
bytes: 333257
requests_per_second: 1213
bytes_per_second: 33322

=INFO REPORT==== 12-Sep-2008::16:34:44 ===
requests: 2012
bytes: 27082
requests_per_second: 200
bytes_per_second: 2702

=INFO REPORT==== 12-Sep-2008::16:34:54 ===
requests: 3054
bytes: 41990
requests_per_second: 300
bytes_per_second: 4134

=INFO REPORT==== 12-Sep-2008::16:35:39 ===
requests: 2009
bytes: 30349
requests_per_second: 197
bytes_per_second: 2980

=INFO REPORT==== 12-Sep-2008::16:35:39 ===
requests: 4161
bytes: 35942
requests_per_second: 415
bytes_per_second: 3592

=INFO REPORT==== 12-Sep-2008::16:35:39 ===
requests: 5452
bytes: 48210
requests_per_second: 543
bytes_per_second: 4808

=INFO REPORT==== 12-Sep-2008::16:35:44 ===
requests: 7662
bytes: 53124
requests_per_second: 765
bytes_per_second: 5304

Joel Reymont

Sep 12, 2008, 12:02:30 PM
to Hynek Vychodil, erlang-questions Questions
I forgot to mention that these are JSON requests and that I'm using
LShift's library to encode and decode requests:

http://www.lshift.net/blog/2007/02/17/json-and-json-rpc-for-erlang

Edwin Fine

Sep 12, 2008, 12:14:28 PM
to Joel Reymont, erlang-questions Questions
Joel,

What is your hardware configuration? How many cores are available? The reason I ask is because some people have seen better performance in certain clustering-type Erlang applications by running one VM per core, each VM running in +S 1 (non-SMP) mode, instead of using SMP. Have you tried this? I couldn't figure that out from what you wrote.

Regards,
Edwin

Edwin Fine

Sep 12, 2008, 12:17:47 PM
to Joel Reymont, erlang-questions Questions
Oh, I see it now, I missed it. An 8-core Mac. Sorry.

Joel Reymont

Sep 12, 2008, 1:41:10 PM
to Hynek Vychodil, erlang-questions Questions

On Sep 12, 2008, at 5:35 PM, Hynek Vychodil wrote:

> It not seems as good as I realize from your previous messages. It
> looks like you can get less than 1kreq/s.

Right.

> I don't know if it is load which you can manage for example 5,
> 10, ... minutes.

Yes and no, see below.

> You can serve 15k concurrent players. Sound it still to much bad for
> you?

15K concurrent players would be excellent. The stats I posted before
don't paint the full picture, though. The part that I didn't post
looks like this...

Simulating gameplay with 1000 games...
Waiting for games to end...
50 games started, 262 players
100 games started, 504 players
...
950 games started, 4777 players
1000 games started, 5066 players

4 Mnesia overloaded error reports

=ERROR REPORT x 4==== 12-Sep-2008::16:34:40 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}

And 11 more

=ERROR REPORT x 11 ==== 12-Sep-2008::16:34:44 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}

And then the real bummer, a whole lot of these:

=ERROR REPORT ==== 12-Sep-2008::16:35:01 ===
** State machine <0.9390.0> terminating
** Last timer event in was 'CHECK'
** When State == dispatch
...
** Reason for termination =
** {timeout,{gen_server,call,[<0.9391.0>,{'SEATS',119}]}}

The above says that even with just 5K players the card game FSM times
out when requesting seats matching a certain mask, e.g. the reply
takes longer than 5 seconds. The game bombs out at this point.

I could increase the timeout or make the call just wait forever. I
don't want to do this, though, as this timeout serves as a performance
indicator.

Thanks, Joel

Joel Reymont

Sep 12, 2008, 1:52:16 PM
to Edwin Fine, erlang-questions Questions

On Sep 12, 2008, at 5:14 PM, Edwin Fine wrote:

> What is your hardware configuration? How many cores are available? The
> reason I ask is because some people have seen better performance in
> certain
> clustering-type Erlang applications by running one VM per core, each
> VM
> running in +S 1 (non-SMP) mode, instead of using SMP. Have you tried
> this?

I ran the same test with +S 1. It doesn't look like SMP affects things
too much in terms of # requests per second.

The big difference is that the test runs through to the end, no
gen_server timeouts. It starts timing out with 2k games and ~7K
players, though.

=INFO REPORT==== 12-Sep-2008::18:42:55 ===
requests: 910
bytes: 24558
requests_per_second: 90
bytes_per_second: 2455

=INFO REPORT==== 12-Sep-2008::18:43:05 ===
requests: 11222
bytes: 308699
requests_per_second: 1122
bytes_per_second: 30867

10 Mnesia overloaded reports (x10).

=ERROR REPORT x10 ==== 12-Sep-2008::18:43:09 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}

=INFO REPORT==== 12-Sep-2008::18:43:15 ===
requests: 4461
bytes: 60746
requests_per_second: 442
bytes_per_second: 6025

=ERROR REPORT x6 ==== 12-Sep-2008::18:43:16 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}

=INFO REPORT==== 12-Sep-2008::18:43:25 ===
requests: 1874
bytes: 24746
requests_per_second: 187
bytes_per_second: 2470

=INFO REPORT==== 12-Sep-2008::18:43:35 ===
requests: 3864
bytes: 55476
requests_per_second: 384
bytes_per_second: 5516

=ERROR REPORT x4 ==== 12-Sep-2008::18:43:38 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}

=INFO REPORT==== 12-Sep-2008::18:43:45 ===
requests: 9752
bytes: 84331
requests_per_second: 974
bytes_per_second: 8425

=ERROR REPORT x7 ==== 12-Sep-2008::18:43:47 ===
Mnesia('1@mothership'): ** WARNING ** Mnesia is overloaded: {dump_log, write_threshold}

1000 games finished
Elapsed: 59.147214s, Average run time: 0.059147214 seconds
Mb exited successfully, 59.148019 seconds elapsed
ok
(1@mothership)3>

=INFO REPORT==== 12-Sep-2008::18:43:55 ===
requests: 10950
bytes: 74516
requests_per_second: 1094
bytes_per_second: 7450

john s wolter

Sep 12, 2008, 2:11:09 PM
to Joel Reymont, john s wolter, erlang-questions
Joel,

Like any technical project you are dealt a hand and you have to play it.
--
John S. Wolter President
Wolter Works
Mailto:johns...@wolterworks.com
Desk 1-734-665-1263
Cell: 1-734-904-8433

yin sun

Sep 12, 2008, 2:17:16 PM
to Joel Reymont, erlang-questions Questions
Erlang needs to implement performance counters similar to Dtrace.

/Yin

Ian Baird

Sep 12, 2008, 3:01:38 PM
to Vance Shipley, erlang-q...@erlang.org
Sweet - thanks.

- Ian

On Sep 12, 2008, at 12:00 PM, Vance Shipley wrote:

> On Fri, Sep 12, 2008 at 11:54:27AM -0700, Ian Baird wrote:
> } Actually, a dtrace-enabled version of the erlang interpreter
> would be
> } pretty sweet, and useful for people running on Solaris, *BSD, or
> OS X.
> } Is this in the works?
>
> http://groups.google.com/group/erlang-dtrace
>
> -Vance

Vance Shipley

Sep 12, 2008, 3:00:34 PM
to Ian Baird, erlang-q...@erlang.org

Edwin Fine

Sep 12, 2008, 3:08:31 PM
to Joel Reymont, erlang-questions Questions
I'd be interested in seeing the output of  vmstat over a period of time. Because only 3 of the 8 CPUs are being used, it sounds as if either the app is suffering from some sort of lock contention, or is I/O bound. If it's IO bound, then putting the database tables on heavily-memory-cached RAID or SAN hardware like EMC could make a huge difference. I know that probably doesn't help because if you're like most people, you don't have the megabucks one needs for such a device. However, maybe a lower-end (but still fast) RAID would be useful.

Is there any way you could run the app for 15 minutes under full load and capture the system performance using vmstat?

Ed

Joel Reymont

Sep 12, 2008, 2:13:59 PM
to johns...@wolterworks.com, erlang-questions
On Sep 12, 2008, at 7:11 PM, john s wolter wrote:

> Like any technical project you are dealt a hand and you have to play
> it.


Aye! Rest assured, I'm not giving up.

Vance Shipley

Sep 12, 2008, 3:34:14 PM
to Joel Reymont, erlang-questions Questions
On Fri, Sep 12, 2008 at 02:52:07PM +0100, Joel Reymont wrote:
} Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-
} threads:0] [kernel-poll:true]

Joel,

Have you tried adding a thread pool? If you are IO bound on disk
that should help quite a bit. It doesn't help with TCP/IP however
as the inet drivers are not threaded. Maybe they should be.

-Vance

Raoul Duke

Sep 12, 2008, 3:46:41 PM
to erlang-questions Questions
cool! lots of very useful info coming across the wire here from kind
experienced folks. would be neat if it were available on the wiki, or
(as folks have suggested) were supported by tools like dtrace.

> Have you tried adding a thread pool? If you are IO bound on disk
> that should help quite a bit. It doesn't help with TCP/IP however
> as the inet drivers are not threaded. Maybe they should be.

Joel Reymont

Sep 12, 2008, 3:53:13 PM
to Vance Shipley, erlang-questions Questions

On Sep 12, 2008, at 8:34 PM, Vance Shipley wrote:

> Have you tried adding a thread pool? If you are IO bound on disk
> that should help quite a bit.

Tried 5 async threads and got a much higher rate of "Mnesia
overloaded" messages. Didn't help much otherwise, I still get the
dreaded gen_server timeout.

Thanks, Joel

--
wagerlabs.com

Ian Baird

Sep 12, 2008, 2:54:27 PM
to erlang-q...@erlang.org
> Erlang needs to implement performance counters similar to Dtrace.
>
> /Yin

Actually, a dtrace-enabled version of the erlang interpreter would be
pretty sweet, and useful for people running on Solaris, *BSD, or OS X.
Is this in the works?

- Ian

Paul Brown

Sep 12, 2008, 5:03:39 PM
to Joel Reymont, erlang-questions Questions, erlang-questions Questions

On Sep 12, 2008, at 6:52 AM, Joel Reymont wrote:

> [...]


> I suppose this is a reflection of the load on the system, although CPU
> usage never goes past 300% which tells me that no more than 3 cores
> are used by Erlang.

> [...]


You know, my own experience with R12B on Mac OS X (analogous machine
to yours, 8 cores, buncha RAM, fast disks, etc.) was similar:

http://is.gd/2y0j = http://mult.ifario.us/p/use-the-cores-erl
http://is.gd/2y0g = http://www.erlang.org/pipermail/erlang-questions/2008-January/032386.html

I'll chalk it up to something like the preemption timing or other low-
level characteristics of MacOS. You need to head to a larger number
of OS processes running Erlang VMs to get your utilization up, and
this seems like a sensible reality for multicore in general, as OS
scheduling and Erlang scheduling are two independent behaviors from
which you need to select the happy medium.

Cheers.

Paul Brown
paulr...@gmail.com

YourSurrogateGod

Sep 12, 2008, 10:04:39 PM
to erlang-q...@erlang.org, yoursurr...@gmail.com
I don't want to sound like an ass, but what about leasing a server
from a provider? They do a decent job on the hardware end and all that
you have to do is focus on the software.

YourSurrogateGod

Sep 12, 2008, 10:02:16 PM
to erlang-q...@erlang.org
I don't want to sound like an ass, but what about leasing server
storage and processing power? Most do a decent job at dealing with the
hardware side of things, all that you have to focus on is the code.

Joel Reymont

Sep 13, 2008, 2:44:53 AM
to YourSurrogateGod, erlang-q...@erlang.org

On Sep 13, 2008, at 3:02 AM, YourSurrogateGod wrote:

> I don't want to sound like an ass, but what about leasing server
> storage and processing power? Most do a decent job at dealing with the
> hardware side of things, all that you have to focus on is the code.

I do lease a server at Joyent. I'll post the metrics in a little while
so that we have a chance to compare.

--
wagerlabs.com

Paul Fisher

Sep 12, 2008, 11:00:19 AM
to erlang-questions Questions
Joel Reymont wrote:
> These are gen_server call timeouts when requests for game information
> take longer than the default 5 seconds. I have an average of 5 players
> per game so this is not because a large number of processes are trying
> to access the game.
>
> I suppose this is a reflection of the load on the system, although CPU
> usage never goes past 300% which tells me that no more than 3 cores
> are used by Erlang.


We have a system with a completely different workload profile, but our
initial testing has indicated an SMP scaling barrier, such that scaling
beyond 4 cores (on x86-64 linux 2.6.24 w/R12-3) deteriorates quickly.

My very preliminary investigations indicated a substantial number of
futex system calls (which should happen only on contended locks which
need kernel arbitration) as the work scaled toward 4 cores. I had
initially thought that this was related to the amount of work we perform
in linked-in drivers via async threads, but more current
information indicates that this might be more core scheduler related.

Can anyone with experience running a sufficiently complex system on 8+
core systems share your experiences? I would love to hear from someone
intimate with the SMP implementation on the state of affairs and
possible future directions.


--
paul

Joel Reymont

Sep 13, 2008, 2:59:30 PM
to pfi...@alertlogic.net, erlang-questions Questions
I thought of running etop on another node to figure out what's going
on at peak load. Sadly, etop crashes


'1@mothership'                                                     17:19:03
 Load:  cpu        90            Memory:  total     1309096    binary     1302
        procs   21259                     processes 1185824    code       6406
        runq        0                     atom          961    ets       52340

Pid            Name or Initial Func     Time    Reds   Memory MsgQ Current Function
----------------------------------------------------------------------------------------
<5624.3952.0>  proc_lib:init_p/5         '-' 1046280  1113656    0 gen_server:loop/6
<5624.0.0>     init                      '-'  576618    26320    0 init:loop/1
<5624.3941.0>  portServer3000            '-'  183416    88336    0 tcp_server:socket_lo
<5624.3956.0>  inet_gethost_native       '-'  178043     6864    0 inet_gethost_native:
<5624.73.0>    proc_lib:init_p/5         '-'   89051   688680    0 disk_log:loop/1
<5624.3937.0>  proc_lib:init_p/5         '-'   68600    24520    0 gen_server:loop/6
<5624.3938.0>  portServer6000            '-'   27716   418544    0 tcp_server:socket_lo
<5624.30.0>    group:server/3            '-'   20382    24520    0 group:server_loop/3
<5624.25.0>    file_server_2             '-'    7192    88400    0 gen_server:loop/6
<5624.16494.0> proc_lib:init_p/5         '-'    4770     8688    0 gen_server:loop/6
========================================================================

Output server crashed: connection_lost

Another issue with etop:

Erlang top got garbage {trace_ts,<5623.28.0>,out,
{user_drv,server_loop,5},
{1221,332097,210811}}
Erlang top got garbage {trace_ts,<5623.25.0>,out,
{gen_server,loop,6},
{1221,332097,267093}}
Erlang top got garbage {trace_ts,<5623.4.0>,out,
{gen_event,fetch_msg,5},
{1221,332097,340833}}

I'm not doing any tracing myself so I really don't know where these
traces come from. The easiest way to reproduce is to connect etop and
then start Mnesia.

--
wagerlabs.com

Mikael Pettersson

Sep 13, 2008, 3:22:03 PM
to pfi...@alertlogic.net, erlang-questions Questions
Paul Fisher writes:
> Joel Reymont wrote:
> > These are gen_server call timeouts when requests for game information
> > take longer than the default 5 seconds. I have an average of 5 players
> > per game so this is not because a large number of processes are trying
> > to access the game.
> >
> > I suppose this is a reflection of the load on the system, although CPU
> > usage never goes past 300% which tells me that no more than 3 cores
> > are used by Erlang.
>
>
> We have a system with a completely different workload profile, but our
> initial testing has indicated a SMP scaling barrier, such that scaling
> beyond 4 cores (on x86-64 linux 2.6.24 w/R12-3) deteriorate quickly.
>
> My very preliminary in investigations indicated a substantial number of
> futex system calls (which should happen only on contended locks which
> need kernel arbitrartion) as the work scaled toward 4 cores. I had
> initially thought that this was related to the amount of work we perform
> in linked in drivers via async threads, but some more current
> information indicates that this might be more core scheduler related.

You might get more help debugging this if you provided code
which triggers the problem. Obviously not your proprietary code,
but something that stresses the Erlang system in the same way.

The Erlang system doesn't do any explicit futex ops on its own.
But glibc's pthread lock ops will do futex ops in their slow paths.

Rapsey

Sep 13, 2008, 3:36:32 PM
to erlang-q...@erlang.org
How many mnesia transactions do you have left? From what I tested, transactions are slow as hell. I'm sure you would get much better performance if a gen_server were in charge of a table and all queries were dirty and went through that process. Unless you need transactions for more than just thread safety, in which case this method won't work, of course.


Sergej
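A minimal sketch of Sergej's suggestion, with hypothetical module and
table names: one gen_server per table serialises access, while the
operations themselves stay dirty.

-module(table_srv).
-behaviour(gen_server).
-export([start_link/1, write/2, read/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link(Tab) -> gen_server:start_link({local, Tab}, ?MODULE, Tab, []).

write(Tab, Rec) -> gen_server:call(Tab, {write, Rec}).
read(Tab, Key)  -> gen_server:call(Tab, {read, Key}).

init(Tab) -> {ok, Tab}.

%% all access to the table funnels through this one process
handle_call({write, Rec}, _From, Tab) ->
    {reply, mnesia:dirty_write(Tab, Rec), Tab};
handle_call({read, Key}, _From, Tab) ->
    {reply, mnesia:dirty_read(Tab, Key), Tab}.

handle_cast(_Msg, Tab)            -> {noreply, Tab}.
handle_info(_Info, Tab)           -> {noreply, Tab}.
terminate(_Reason, _Tab)          -> ok.
code_change(_OldVsn, Tab, _Extra) -> {ok, Tab}.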

Joel Reymont

Sep 13, 2008, 4:09:53 PM
to Rapsey, erlang-q...@erlang.org

On Sep 13, 2008, at 8:36 PM, Rapsey wrote:

> How many mnesia transactions do you have left?

No idea. Is there anything in the etop output that makes you think
transactions are a problem?

I think I only have one transaction in my code, everything else is
dirty ops.

> From what I tested,
> transactions are slow as hell. I'm sure you would get much better
> performance, if a gen_server was in charge of a table and all
> queries were dirty and went through that process.

I also tried making all the tables ram_copies, just for the sake of
it. I still get gen_server timeouts.

I'm trying to profile with eprof now. I started profiling but my test
harness hangs waiting for the games that will never finish.

I thought I'd try to connect to the node, find that test process and
kill it.

erl -remsh 1@mothership


Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-threads:0] [kernel-poll:false]

*** ERROR: Shell process terminated! (^G to start new job) ***

User switch command
--> r '1@mothership'
--> j
2* {'1@mothership',shell,start,[]}
--> c 2
*** ERROR: Shell process terminated! (^G to start new job) ***

Tools are failing me and the saga continues!

Joel Reymont

Sep 13, 2008, 4:12:36 PM
to Rapsey, erlang-questions Questions
Foot in mouth. I forgot to supply the node name.

'erl -sname 2 -remsh 1@mothership' works fine.

On Sep 13, 2008, at 9:09 PM, Joel Reymont wrote:

> I thought I'd try to connect to the node, find that test process and
> kill it.
>
> erl -remsh 1@mothership
> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8]

> [async-threads:0] [kernel-poll:false]


>
> *** ERROR: Shell process terminated! (^G to start new job) ***

--

Jan Lehnardt

Sep 13, 2008, 4:11:49 PM
to pfi...@alertlogic.net, erlang-questions Questions

On Sep 12, 2008, at 17:00, Paul Fisher wrote:
> Can anyone with experiences running a sufficiently complex system on
> 8+
> core systems share your experiences? I would love to hear from
> someone
> intimate with the SMP implementation on the state of affairs and
> possible future directions.

I don't claim any experience in this field but I did a few tests with
CouchDB
on a 32 core machine. I've seen linear performance increase up to 14
cores
with a CPU utilisation of ~22% and no disk I/O (makes sense, little
data, loads
of RAM). I tested with increasing +S parameters for `erl`. We ran the
results
by some friends at Sun who suggested that this might be a bottleneck
in the
networking layer. Something might have a per-CPU (not per-core)
bottleneck.
Some kernel tuning in Solaris and possibly better NIC-programming
could solve
this one, but I only had access to the box for a short time and had no
chance
to change any boot-parameters, let alone hardware specs. Profiling
showed that
Erlang was mostly waiting for network I/O which further underlines the
Solaris
config problem theory.

Cheers
Jan
--

Edwin Fine

Sep 13, 2008, 4:56:21 PM
to Joel Reymont, erlang-questions Questions
Joel,

I have mentioned this before on this list, but I'm getting the feeling that folks on this list must think it's a really stupid idea, or not worth looking into because I am new to Erlang or some other reason - maybe I say lots of dumb things - because I have had virtually no responses. Or maybe it is worth looking into, but could be a big can of worms. Either way, there is something nagging at me that won't allow me to let this go until I know a bit more about it.

Here's the stupid idea: Run one +S 1 VM per core instead of N +S N VMs.

You'd probably have to partition the load to round-robin across the individual VMs, possibly using some front-end load-balancing hardware. This is why I keep harping on this: some time ago I put the system I am working on under heavy load to test the maximum possible throughput. There was no appreciable disk I/O. The kicker is that I did not see an even distribution of load across the 4 cores of my box. In fact, it looked as if one or maybe two cores were being used at 100% and the rest were idle. When I re-ran the test on a whim, using only 1 non-SMP (+S 1) node, I actually got better performance.

This seemed counter-intuitive and against the "Erlang SMP scales linearly for CPU-intensive loads" idea. I have not done a lot of investigation into this because I have other fish to fry right now, but the folks over at LShift (RabbitMQ) - assuming I did not misunderstand them - wrote that they had seen similar behavior when running clustered Rabbit nodes (i.e. better performance from N single-CPU nodes than N +S N nodes). However, they, like me, are not ready to come out and state this bluntly as a fact because (I believe) they feel not enough investigation has been done to make this a conclusive case.

It looks, though, like you are in a very good position to see if this makes any difference, and perhaps bring to light a hitherto-unknown flaw (or at least hitch) in the SMP implementation, which will benefit everyone if there is something there and it gets resolved.

What do you think? Is the newbie smoking something, or is there maybe something to it?

Regards,
Edwin Fine

Joel Reymont

Sep 13, 2008, 5:15:41 PM
to Edwin Fine, erlang-questions Questions

On Sep 13, 2008, at 9:56 PM, Edwin Fine wrote:

> Here's the stupid idea: Run one +S 1 VM per core instead of N +S N
> VMS.

> [...]


> What do you think? Is the newbie smoking something, or is there maybe
> something to it?


There's definitely something to it and I'll try it just as soon as I'm
able to make my test harness a distributed one. It's designed to
launch everything on a single node right now.
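A sketch of what the distributed harness might look like; the bot
module, launch_bot/1 and the node names are all hypothetical.

launch_bots(N, Nodes) ->
    lists:foreach(
      fun(I) ->
              Node = lists:nth(1 + (I rem length(Nodes)), Nodes),
              spawn(Node, bot, launch_bot, [I])   %% round-robin over the nodes
      end,
      lists:seq(1, N)).

%% e.g. launch_bots(10000, ['bots1@mothership', 'bots2@mothership']).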

There's something that's nagging at me as well, which is why I haven't
been in a rush to rewrite the test harness. I want to know why my
gen_server calls start to timeout!

I even killed the VM upon the first timeout to get a crash dump. Lo
and behold, there were no processes with lengthy message queues. The
distribution among the 21234 processes goes like this:

0, 1, 2, 3, 4, 5, 6, 7 (20), 8 (18), 9 (28), 10 (36), 11 (19), 12
(24), 13 (8), 14 (8), 15, 26, 32, 33

The number of processes with each queue length is in parentheses, and
is 1 where no count is shown. Perhaps I should pay attention to the
counts over 5 or 10.
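The same histogram can be pulled from a live node (rather than a crash
dump) with process_info/2; a shell-level sketch:

Hist = lists:foldl(
         fun(P, D) ->
                 case erlang:process_info(P, message_queue_len) of
                     {message_queue_len, N} -> dict:update_counter(N, 1, D);
                     undefined              -> D    %% process exited meanwhile
                 end
         end, dict:new(), erlang:processes()),
lists:keysort(1, dict:to_list(Hist)).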

The complication is that most of the gen_servers look like this in the
crash dump...

=proc:<0.4093.0>
State: Waiting
Spawned as: proc_lib:init_p/5
Spawned by: <0.4001.0>
Started: Sat Sep 13 18:36:39 2008
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [#Port<0.5677>]
Reductions: 3422
Stack+heap: 2584
OldHeap: 377
Heap unused: 1800
OldHeap unused: 377
Program counter: 0x0000000144d3c748 (gen_server:loop/6 + 288)
CP: 0x0000000144cfe458 (proc_lib:init_p/5 + 400)
arity = 0

Note that the current function is gen_server:loop/6, they were
(obviously) spawned by proc_lib:init_p/5 and the processes are unnamed.

There's more info in the crash dump on every process, e.g. process
dictionary, stack and heap (below). I'm able to deduce that this is a
bot process but I'm still scratching my head over jumping from this to
identifying the gen_server timeouts.

=proc_dictionary:<0.4093.0>
H10B4D30
H10B4D48
=proc_stack:<0.4093.0>
0x000000015e0fc498:SReturn addr 0x44CFE458 (proc_lib:init_p/5 + 400)
y0:N
y1:A8:infinity
y2:A3:bot
y3:H5E0F8C30
y4:P<0.4093.0>
y5:P<0.4093.0>
0x000000015e0fc4d0:SReturn addr 0x1A9D28 (<terminate process normally>)
y0:SCatch 0x44CFE478 (proc_lib:init_p/5 + 432)
y1:A3:gen
y2:A7:init_it
y3:H10B4D20
=proc_heap:<0.4093.0>
5E0F8C30
:tD:A3
:bot
,H10B4DB0,I14,I3,p<0.5677>,H10B4FA0,N,I1150,I2,I797372250,A5:false,I1,I0
10B4FA0:lA5:CHECK|H10B4FB0
10B4FB0:lA5:CHECK|H10B4FC0
10B4FC0:lA4:FOLD|H10B4FD0
10B4FD0:lA3:nil|N
10B4DB0:YhE:6B6D616E2F373937333732323530
10B4D20:lAA:gen_server|H10B4DE8
10B4DE8:lP<0.4001.0>|H10B4E48
10B4E48:lA4:self|H10B4E88
10B4E88:lA3:bot|H10B4EC0
10B4EC0:lH10B4EE0|H10B4EF0
10B4EE0:lH10B4DB0|H10B4F10
10B4F10:lI797372250|H10B4F30
10B4F30:lI2|H10B4F50
10B4F50:lI1160|N
10B4EF0:lN|N
10B4D30:t2:AD:$initial_call,H10B4DF8
10B4DF8:t3:A3:gen,A7:init_it,H10B4D20
10B4D48:t2:AA:$ancestors,H10B4E18
10B4E18:lP<0.4001.0>|H10B4E58
10B4E58:lP<0.37.0>|N

Kevin Scaldeferri

Sep 13, 2008, 6:07:39 PM
to Edwin Fine, erlang-questions Questions

On Sep 13, 2008, at 1:56 PM, Edwin Fine wrote:

> You'd probably have to partition the load to round-robin across the
> individual VMs, possibly using some front-end load-balancing
> hardware. This is why I keep harping on this: some time ago I put
> the system I am working on under heavy load to test the maximum
> possible throughput. There was no appreciable disk I/O. The kicker
> is that I did not see an even distribution of load across the 4
> cores of my box. In fact, it looked as if one or maybe two cores
> were being used at 100% and the rest were idle. When I re-ran the
> test on a whim, using only 1 non-SMP (+S 1) node, I actually got
> better performance.
>
> This seemed counter-intuitive and against the "Erlang SMP scales
> linearly for CPU-intensive loads" idea. I have not done a lot of
> investigation into this because I have other fish to fry right now,
> but the folks over at LShift (RabbitMQ) - assuming I did not
> misunderstand them - wrote that they had seen similar behavior when
> running clustered Rabbit nodes (i.e. better performance from N
> single-CPU nodes than N +S N nodes). However, they, like me, are not
> ready to come out and state this bluntly as a fact because (I
> believe) they feel not enough investigation has been done to make
> this a conclusive case.

I've also been seeing similar behavior trying to parallelize the
alioth shootout code, fwiw. I'd also say it's premature to draw any
concrete conclusions, but another anecdotal point.

(Also, on the particular OS & hardware the benchmarks run on, the
total CPU usage nearly doubles for the parallel implementations. On
my 2-core mac, though, I see no more than a 10% increase in total CPU
usage, and a near 100% improvement in the wall-time, as one should
expect on the embarrassingly parallel problems. Dunno, if this is
related to the OS, the chip (Core 2 Duo vs Core 2 Quad), HiPE, or what.)


-kevin

Edwin Fine

Sep 14, 2008, 4:39:40 AM
to Valentin Micic, erlang-questions Questions
I think you have misunderstood my reasoning here. If you have 8 Erlang VMs going, each with +S 1, how exactly does this defeat the purpose of having a multi-core machine? Long before threading models came into vogue, multiple processes were taking advantage of multi-CPU systems simply by letting the OS scheduler choose the CPU on which to run the next runnable process. Since Erlang threads are "green" threads, they don't individually use the threading model of the underlying operating system anyway. Each VM, other than for I/O operations on files (controlled by +A, I believe), uses AFAIK one O/S thread per scheduler. So +S 8 will use 8 O/S threads. When you have 8 threads sharing something (which they will when running SMP), there is a risk of contention slowing things down.

I wonder what happens when Erlang processes running on the scheduler thread of one core send lots of messages to Erlang processes running on a scheduler on a different core? There HAS to be a lock there somewhere while the message moves from the memory space of the first process to the second one, or if shared memory is being used (more likely scenario), there is going to be some locking of shared data structures. Now if there are thousands of processes sending in aggregate hundreds of thousands of messages, I believe this will not scale well if the messages cross scheduler boundaries. I could be wrong, but it fits the anecdotal data. It would be very interesting to see the effect of designing an application so that processes that send lots of messages to each other run on the same scheduler, sort of "clustering" the messages so that they stay within the same VM's memory space. Using multiple VMs with +S 1 would force this to happen because there IS only one scheduler thread, so hopefully the VM doesn't use any SMP locks/mutexes under those conditions. This is in contrast to assuming a uniform model where the cost of sending messages is the same regardless of the fact that it may be going to a different process space and incurring the cost of a lock. The locks used are Posix semaphores, IIRC, and they are not cheap.

I guess what I am trying to say is that the basic assumption Erlang applications seem to have been designed on top of up to now, knowing that the cost of sending a message is extremely low, is perhaps not as true as it used to be when using SMP. I would be very interested in hearing from Joe Armstrong about this. I seem to recall that he wrote something about the cost of what a process does needing to be much greater than the cost of starting a process or sending a message to it in order to scale. This is true of any IPC, even cheap IPC like Erlang. On top of that, things that use shared structures like ETS heavily, for example Mnesia transactions, are possibly going to suffer in the SMP context. All of this is just conjecture based on my own work and anecdotal evidence presented by some others, but I feel in my bones that there is something to this. Look at it this way: Erlang got its major performance from eliminating locks. Running on a 1024 processor system in SMP mode and treating it like it's one big uniform processor is going to backfire badly. I haven't seen much discussion about this. Maybe it's too obvious to mention.

There is an interesting discussion (http://www.erlang.org/pipermail/erlang-questions/2008-January/032273.html) about assigning individual +S 1 Erlang VMs to a given CPU using processor affinity. This could help considerably by leaving more of a VM's running state cached in the processor code and data caches than if the VM's thread were to be switched between CPUs by the OS.
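As an illustration of that thread's suggestion (Linux only, and the
node names and core numbers are arbitrary), pinning one +S 1 VM to each
core might look like:

taskset -c 0 erl -sname bots0 +S 1 -detached
taskset -c 1 erl -sname bots1 +S 1 -detached
taskset -c 2 erl -sname bots2 +S 1 -detached
taskset -c 3 erl -sname bots3 +S 1 -detached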

I plan Real Soon Now ;-) to do some extensive research into this to see if there is merit to it. Just as soon as I get over the hump in a current project.

Anyway, hopefully this clarifies my thinking to you and you have less of an issue with it.


On Sun, Sep 14, 2008 at 3:21 AM, Valentin Micic <vale...@pixie.co.za> wrote:
I havent read the whole correspndance (it seems to be going on for a way too long), but like to add my 2c worth...

While SMP (+S1) approach may solve some problems, it defeats the purpose of having a multi-core machine. Please note that multi-core machines have lower clock speeds, thus should run generally slower per given CPU core. IMHO, if +S 1 solves your problem, maybe you should revisit your code -- I think that it is wrong to expect that the same code would work better on SMP just because you had such expectations. For example, it is known fact that ETS works slower in SMP environment.
Also, one should not forget to use +A in addition to +S -- although you do not have any disk I/O, I think this parameter is relevant for PORT scheduling, therefore improving performance of your I/O.

V.


----- Original Message ----- From: "Kevin Scaldeferri" <ke...@scaldeferri.com>
To: "Edwin Fine" <erlang-ques...@usa.net>
Cc: "erlang-questions Questions" <erlang-q...@erlang.org>
Sent: Sunday, September 14, 2008 12:07 AM
Subject: Re: [erlang-questions] My frustration with Erlang

Thomas Lindgren

Sep 14, 2008, 10:58:12 AM
to erlang-questions Questions
For what it's worth, I basically agree with Edwin's reasoning: letting the cores run nodes rather than SMP/threads in essence means there will be "minimal" contention (mediated by the kernel) and "minimal" synchronization (mediated by sockets).

That said, the reasons for running threads rather than nodes might be:
- easier to setup and change (porting to 16 cores is "+S 16", or perhaps "+S 20")
- easier to load-balance
- less costly messages and cheaper table access than remote operations
- less memory used due to sharing inside VM
- perhaps better memory performance, again due to sharing

A further technical limitation of nodes, once pointed out to me by Sean Hinde, is that sockets can't be sent between nodes today.

Best,
Thomas

--- On Sun, 9/14/08, Edwin Fine <erlang-ques...@usa.net> wrote:

Ulf Wiger (TN/EAB)

Sep 14, 2008, 1:30:03 PM
to Joel Reymont, erlang-questions Questions
Joel Reymont skrev:
>
> All my hard work paid off and I was able to get close to 10K players,
> with kernel poll enabled, of course. Then I ran out of ETS tables.
>
> I don't create ETS tables on the fly but, apparently, Mnesia does.
> For every transaction!!!
>
> This prompted me to go through the server again and use dirty_read,
> dirty_write wherever possible. I also placed balanced in two
> separate "counter" tables, integers to be divided by 10000 to get 4
> decimal points of precision. This is so that I could use
> dirty_update_counter instead of a regular read, bump, write pattern.

Did you try increasing the maximum number of ETS tables?

This is done using the environment variable ERL_MAX_ETS_TABLES

It can be set to a number considerably higher than the default
(1400 tables).
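For example (the value here is arbitrary), the limit can be raised when
the node is started:

ERL_MAX_ETS_TABLES=32768 erl -sname 1 +K true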

BR,
Ulf W

Joel Reymont

Sep 14, 2008, 1:40:19 PM
to Ulf Wiger (TN/EAB), erlang-questions Questions

On Sep 14, 2008, at 6:30 PM, Ulf Wiger (TN/EAB) wrote:

> Did you try increasing the maximum number of ETS tables?
>
> This is done using the environment variable ERL_MAX_ETS_TABLES


That would have been the easy way out :-).

No, I just combed through the code and switched to dirty mnesia ops
whenever possible.

I'm very close to figuring out my issue and I'm almost certain that
it's mine to fix.

I'll post again next week, explaining what I did to pinpoint the
problem and how I did it (tracing, crash dump analysis). I'll also
post new throughput statistics.

Thanks, Joel

--
wagerlabs.com

Bengt Kleberg

Sep 15, 2008, 2:17:20 AM
to erlang-questions
Greetings,

Sometimes it is possible to ask for new cards.


bengt

On Fri, 2008-09-12 at 14:11 -0400, john s wolter wrote:
> Joel,


>
>
> Like any technical project you are dealt a hand and you have to play
> it.

Bhasker Kode

Sep 15, 2008, 4:49:31 AM
to erlang-q...@erlang.org
Hey Joel,

>4 Mnesia overloaded error reports

> ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}

> e.g. the reply takes longer than 5 seconds. The game bombs out at this point.
> I could increase the timeout or make the call just wait forever. I don't want to do this, though, as this timeout serves as a performance indicator.

I've had the exact same two reports causing all kinds of unexpected
behaviour.

While setting infinity on gen_server calls worked for me, I think
setting a higher timeout value (than the default 5 seconds) is your
best bet, and that's what worked for us. We had some python<->erlang
logic happening in one of the gen_server calls, which used to
consistently time out until I came to know of the 5 second default
value.
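For example, on a specific call the timeout is just the third argument
(the 60000 below is arbitrary, and Game/Mask stand in for the real
values):

gen_server:call(Game, {'SEATS', Mask}, 60000),
%% or, if the call should never give up:
gen_server:call(Game, {'SEATS', Mask}, infinity)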

IMHO, perhaps you can also move the 'performance analysis' to another
part of the poker server, rather than mixing it with the deck and poker
logic. In that case, having gen_server calls wait forever may not seem
like a bad idea.

Regarding the Mnesia dump_log threshold warnings, we did two things:

#1 If latency isn't an issue, take a whole bunch of writes, write
them to a file, then on a trial-and-error basis configure that after
every N entries you do a mnesia:load_textfile(Dump). This way all those
N entries get written in one transaction. One thing to look at, though,
is the permissions needed to access Mnesia at load time. In our case we
access Mnesia only through Yaws, so no rwx permission errors show up,
and we also made sure that Mnesia already knows where to distribute the
newly loaded data (in our case a 4-node cluster currently). If latency
is critical, then why do you need Mnesia in the first place? Just pass
messages (though I suppose you still need to archive them even if it
behaves like a messaging system).
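The batching idea in its simplest form (not mnesia:load_textfile/1,
just the same principle of many writes per transaction):

flush(Records) ->
    mnesia:transaction(fun() ->
                               lists:foreach(fun mnesia:write/1, Records)
                       end).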

#2 Creating user-specific tables. I know most normalization/DB gurus
out there might raise an eyebrow. As much as we're used to pattern
matching all rows from a table where the user matches X, we just
created separate tables for each user, and so know exactly where to
look. Now we can take maybe a much higher factor of the load into a
table, without even considering fragmentation. BTW, thanks for pointing
out the default max ETS tables, I will have to look into that.

Just FYI, the third biggest error we used to get in our logs was the
{error,open} message "yaws failed to bind socket error", and we spent
a long time figuring out if it was related to the timeout. It turned
out that yaws.conf has a little property that sets "Fail completely
when yaws fails to bind socket = true" by default. We switched it to
false, and that's the last time we saw those error,open reports 8)

~
Keep Clicking,
Bhasker V Kode | bosky101
hover.in

Valentin Micic

Sep 14, 2008, 3:21:41 AM
to Kevin Scaldeferri, Edwin Fine, erlang-questions Questions
I haven't read the whole correspondence (it seems to be going on for
way too long), but I'd like to add my 2c worth...

While the SMP (+S 1) approach may solve some problems, it defeats the
purpose of having a multi-core machine. Please note that multi-core
machines have lower clock speeds, and thus should generally run slower
per CPU core. IMHO, if +S 1 solves your problem, maybe you should
revisit your code -- I think that it is wrong to expect that the same
code would work better on SMP just because you had such expectations.
For example, it is a known fact that ETS works slower in an SMP
environment.
Also, one should not forget to use +A in addition to +S -- although you do
not have any disk I/O, I think this parameter is relevant for PORT
scheduling, therefore improving performance of your I/O.

V.


----- Original Message -----
From: "Kevin Scaldeferri" <ke...@scaldeferri.com>
To: "Edwin Fine" <erlang-ques...@usa.net>
Cc: "erlang-questions Questions" <erlang-q...@erlang.org>
Sent: Sunday, September 14, 2008 12:07 AM
Subject: Re: [erlang-questions] My frustration with Erlang


>

Valentin Micic

Sep 14, 2008, 11:14:50 AM
to Edwin Fine, erlang-questions Questions
> Long before threading models came into vogue, multiple processes
> were taking advantage of multi-CPU systems simply by letting
> the OS scheduler choose which CPU on which to run the next
> runnable process.
 
Well... a single process would never exceed the capacity of a single CPU. Therefore, if you had, say, 4 CPUs, one process would never exceed more than 25% of the overall system capacity. In the case of Solaris, this may be true even if you create multiple threads (e.g. if threads were created as bound threads, they will still be scheduled using the same light-weight process, but that is another matter altogether).
 
> Since Erlang threads are "green" threads, they don't individually use
> the threading model of the underlying operating system anyway.
> Each VM, other than for I/O operations on files (controleld by +A, I believe),
> uses AFAIK one O/S thread per scheduler. So +S 8 will use 8 O/S threads.
> When you have 8 threads sharing something (which they will when running SMP),
> there is a risk of contention slowing things down.

 
That is why I'm saying that one should look at the code, and ensure that the code is written to take advantage of an SMP environment. If you are going to use an SMP machine and then use it as a "normal" multi-CPU machine, you would be better off staying with a multi-CPU machine which has a faster clock.
In my experience so far, one should steer clear of servers that are doing a lot of processing per request -- the clock is slower, clearly indicating that the server will be slower, and yet we have more schedulers resulting in more requests being generated. One may experiment with different priorities and change the server to a higher priority, which is, I suspect, what one should do for code that was written for non-SMP machines. My preference, however, would be to write another kind of server, one that spawns a new process for each request, which will in turn report back to the caller, bypassing the server.
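A sketch of that last kind of server, where handle_request/1 is a
hypothetical worker function: the gen_server spawns a process per
request and lets it answer the caller directly with gen_server:reply/2,
so the server itself never blocks.

handle_call(Request, From, State) ->
    spawn(fun() -> gen_server:reply(From, handle_request(Request)) end),
    {noreply, State}.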
 
V.

YourSurrogateGod

Sep 15, 2008, 5:16:06 PM
to erlang-q...@erlang.org
I'll admit that I'm highly inexperienced when it comes to Erlang.
However, I'm surprised that there is such a barrier. From what I
understood, Erlang's way of handling messages is *very* efficient.
Just what percentage of overall processing is done by the process
that's responsible for passing messages around? As opposed to other
processes which can do actual work.

I might have a suggestion. At my job, we've used CORBA. Although not
exactly the same, in principle, it's similar. We had an issue with
lots of messages getting passed around and gumming up the works,
causing messages to get delivered late (our software had requirements
to deliver the messages within a certain amount of time). The biggest
problem was that there were lots of small messages getting sent
around. Which meant, behind the scenes, a lot of connections being opened
and closed, which takes time, during which less productive processing
(on the CPU) is done. We tried 2 solutions:
1) Turn the small messages into large, long ones. That way, you just
have one massive message sent. Combine them.
2) Collect pieces of information into a buffer and then send it out
once the buffer has reached a limit.

Say you do:
receive
    Message1 ->
        ...;
    Message2 ->
        ...;
    Message3 ->
        ...;
    Message4 ->
        ...
end.

But you change it to:
receive
    LargeMessage ->
        ...
end.
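A sketch of option 2 (buffering), where the destination, the batch
format and the limit are all illustrative:

buffer(Dest, Acc, N, Limit) when N >= Limit ->
    Dest ! {batch, lists:reverse(Acc)},
    buffer(Dest, [], 0, Limit);
buffer(Dest, Acc, N, Limit) ->
    receive
        Msg -> buffer(Dest, [Msg | Acc], N + 1, Limit)
    end.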

Also, I have a question. Isn't it possible to simulate multiple CPU's
with Erlang's virtual machine? If so, then wouldn't creating lots of
these virtual CPUs emulate the said problem and make it easier to see
on the horizon? etop seems like a tool that might answer this
question, too bad it crashes so often :( .

Scott Lystig Fritchie

Sep 19, 2008, 2:24:04 AM
to Rapsey, erlang-q...@erlang.org
Rapsey <rap...@gmail.com> wrote:

r> How many mnesia transactions do you have left? From what I tested,
r> transactions are slow as hell.

I'm curious, what kind of transactions? In tests from over a year ago
(using R11B-5, I think, it's the only hard data that I have within easy
arm's reach), I had single table performance on a 2-node cluster, Xeon
2.8GHz CPUs (maybe?), 'disc_copies' tables that were replicated on each
node as well as fragmented, with single table read txns (context =
sync_transaction) of up to 7K transactions/sec and single table (single
key) update txns at 4.1K transactions/sec. Not stupendously great, but
for a non-SMP VM plus Mnesia's distributed transaction managers involved
in every transaction, it isn't too bad.
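For the record, the kind of single-key transactions measured above look
roughly like this; the record and table are hypothetical, and the
context is the same sync_transaction.

-record(account, {key, amount}).

read_account(Key) ->
    mnesia:activity(sync_transaction, fun() -> mnesia:read(account, Key) end).

update_account(Key, Delta) ->
    mnesia:activity(sync_transaction,
                    fun() ->
                            [A] = mnesia:read(account, Key),
                            mnesia:write(A#account{amount = A#account.amount + Delta})
                    end).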

If your tables were type 'disc_only_copies', then I'd have far fewer
problems believing.

-Scott

Scott Lystig Fritchie

Sep 19, 2008, 2:28:32 AM
to thomasl...@yahoo.com, erlang-questions Questions
Thomas Lindgren <thomasl...@yahoo.com> wrote:

tl> A further technical limitation of nodes, once pointed out to me by
tl> Sean Hinde, is that sockets can't be sent between nodes today.

If you can extract the file descriptor number, the "unixdom_drv" item in
Jungerl should allow file descriptor passing on the same machine.
Assuming support for your OS, no bit-rot in the repository ... YMMV?

-Scott

Thomas Lindgren

Sep 19, 2008, 12:16:39 PM
to erlang-questions Questions


--- On Fri, 9/19/08, Scott Lystig Fritchie <frit...@snookles.com> wrote:

> tl> A further technical limitation of nodes, once
> pointed out to me by
> tl> Sean Hinde, is that sockets can't be sent
> between nodes today.
>
> If you can extract the file descriptor number, the
> "unixdom_drv" item in
> Jungerl should allow file descriptor passing on the same
> machine.
> Assuming support for your OS, no bit-rot in the repository
> ... YMMV?

By no means a fundamental limitation (on some Unices at least), but as you note, it may need some trailblazing.

Best,
Thomas
