[erlang-questions] gen_server clustering strategy


Knut Nesheim

Feb 1, 2011, 12:25:00 PM
to erlang-q...@erlang.org
Hello list,

We are interested in understanding how the Erlang distribution
mechanism behaves in a fully connected cluster. Our concern is that as
we add nodes, the overhead will become a problem.

Our application is a bunch of gen_servers containing state, which we
need to run on multiple nodes due to memory usage. A request will come
to our application and it needs to be handled by the correct
gen_server. Every request includes a user id which we can use to map
to the process.

We have two specific use cases in mind.

1) Proxy to workers

In this scenario we have a proxy (or multiple proxies) accepting
requests at the edge of our cluster. The proxy will ask the correct
server for a response. Using the 'global' module, gproc or something
similar, the proxy node will keep track of the mapping of user ids to
processes and nodes. The proxy will call the node
using the Erlang distribution protocol.
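(A rough sketch of what we have in mind, using the 'global' module; the
{user, UserId} name term and the message shapes are just placeholders:)

%% Worker side: each gen_server registers itself globally under its user id.
init([UserId]) ->
    yes = global:register_name({user, UserId}, self()),
    {ok, UserId}.

%% Proxy side: resolve the user id to a pid and call it; the call goes over
%% the Erlang distribution protocol when the pid lives on another node.
handle_request(UserId, Request) ->
    case global:whereis_name({user, UserId}) of
        undefined -> {error, no_such_user};
        Pid       -> gen_server:call(Pid, {request, Request})
    end.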

2) Mesh

In this scenario a request can be handled by any node in the cluster.
If the request cannot be handled by the local node, the correct node
is called instead. Every node in the cluster needs to keep track of
which id belongs to which process.


The numbers:
* We must be able to run 100,000 processes, each of which may peak at 300 KB of memory
* We expect 5,000 requests per second coming in to our application at
peak, generating the same number of messages
* Out of those 5,000 messages, 1,000 have content that may peak at 300 KB
* All the other messages are well under 1 KB

Our concerns:

* In the mesh, will the traffic between the nodes become a problem?
Let's say we have 4 nodes; if the requests are evenly divided between
the nodes, the probability of hitting the correct node is 25%. With 100
nodes this drops to 1%. As we add nodes, there will be more chatter. Could
this chatter "drown out" the important messages, like pid down, etc.?

* Will it be more efficient to use the proxy to message every node?
Will the message then always go to the node directly, or may it go
through some other node in the cluster?

* For the mesh, we need to keep the 'global' or gproc state
replicated. With 'global' we will run into the default limit of atoms.
If we were to increase this limit, what kind of memory usage could we
expect? Are there any other nasty side-effects that we should be aware
of?

* In the "Erlang/OTP in Action" book it is stated that you may have a
"couple of dozen, but probably not hundreds of nodes." Our
understanding is that this is because of traffic like "this pid just
died", "this node is no longer available", etc.

* If we were to eliminate messaging between nodes, how many nodes
could we actually run in a fully connected cluster? Will a
high-latency network like EC2 affect this number, or just the latency
of messages? What else may affect the size?


Regards
Knut

________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

Evans, Matthew

Feb 1, 2011, 2:10:05 PM
to Knut Nesheim, erlang-q...@erlang.org
Without knowing much more, another model would be to use your own ets/mnesia table to keep track of the workers.

I am assuming the creation of a worker is relatively infrequent (compared to the times you need to find a worker).

When you create a worker, use gen_server:start(?MODULE, [], []).

This will return {ok,Pid}. You can save the Pid in an mnesia table along with whatever reference you need (it doesn't need to be an atom), and all nodes in the mesh will then see the pid and the reference. Starting the server as shown above (the arity-3 start, with no name argument) means you don't need to use the global service or ensure that each server's name is a unique atom. You could even start the gen_servers via some application that spreads them over different nodes.

When a request comes in, it only needs to do a lookup to find the correct pid.
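
As a sketch (the worker table, record and helper names below are just
placeholders, and this assumes the module implements the usual gen_server
callbacks):

-record(worker, {user_id, pid}).

%% Run once; every node in the mesh holds a RAM copy of the registry.
create_table(Nodes) ->
    mnesia:create_table(worker,
                        [{ram_copies, Nodes},
                         {attributes, record_info(fields, worker)}]).

%% Start an anonymous gen_server and record its pid against the user id.
start_worker(UserId) ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    ok = mnesia:dirty_write(#worker{user_id = UserId, pid = Pid}),
    {ok, Pid}.

%% Any node can look the pid up with a plain dirty read.
lookup_worker(UserId) ->
    case mnesia:dirty_read(worker, UserId) of
        [#worker{pid = Pid}] -> {ok, Pid};
        []                   -> not_found
    end.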

I can't recall how bad the process-crash messaging is. But what you could do is monitor each gen_server locally, and monitor each node globally. When a node fails, all the other nodes can clean up the entries for processes that were registered on that node.
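
A fragment of what that cleanup could look like (assuming the worker table
from the sketch above; each node runs one small gen_server like this):

%% Subscribe to node up/down events and purge registry entries for pids
%% that lived on a node that went down.
init([]) ->
    ok = net_kernel:monitor_nodes(true),
    {ok, no_state}.

handle_info({nodedown, Node}, State) ->
    All = mnesia:dirty_match_object(#worker{user_id = '_', pid = '_'}),
    [mnesia:dirty_delete_object(W) || W <- All, node(W#worker.pid) =:= Node],
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.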

Matt

________________________________________
From: erlang-q...@erlang.org [erlang-q...@erlang.org] On Behalf Of Knut Nesheim [knut.n...@wooga.net]
Sent: Tuesday, February 01, 2011 12:25 PM
To: erlang-q...@erlang.org
Subject: [erlang-questions] gen_server clustering strategy

James Churchman

Feb 2, 2011, 10:37:36 AM
to Evans, Matthew, Knut Nesheim, erlang-q...@erlang.org
All great questions, most of which I don't know the answers to, but queuing systems can help if a bit of extra latency is OK, and more so if the order does not have to be exact.

I think at very large scale hashing becomes the only answer, and handling failed nodes becomes tricky. But a basic distributed mnesia table that contains the number of alive nodes (and maybe reports on their repeated failures), combined with hashing the input (on user id in your case) to the correct node, would get you halfway there.

Also, I don't think that 100,000 processes that peak at 300 KB each is a problem for Erlang. That's only 30 GB max. If the data is stored as binaries, and possibly after a really large message you invoke the garbage collector for that process (I have no idea if this is a bad idea or not, but it can be done easily and should reduce memory consumption), then a single box should be able to handle your needs easily, and three boxes with no problem at all. My Mac Pro should be OK :-)
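
(Something like this in the worker, as a sketch; the 300 KB threshold and
the process_payload helper are made up:)

%% After handling an unusually large binary, trigger a GC for this process
%% so references into the shared binary heap are dropped promptly.
handle_cast({inbound, Payload}, State) when is_binary(Payload) ->
    NewState = process_payload(Payload, State),
    case byte_size(Payload) > 300 * 1024 of
        true  -> erlang:garbage_collect();
        false -> ok
    end,
    {noreply, NewState}.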

I think dedicated hardware is always better than virtualised.

As for 1,000 messages at 300 KB, that seems like quite a lot, bordering on what standard gigabit Ethernet can handle, but just benchmark Erlang; again, one box might be enough... you can get a 32-core AMD server for not much these days. Are these requirements realistic, though, or just "if 10 million people use my service on day one" wishful thinking for something put together on a shoestring?

Evans, Matthew

Feb 2, 2011, 11:40:25 AM
to James Churchman, Knut Nesheim, erlang-q...@erlang.org
My worry is that the OP is suggesting using the global name server. This has a few problems:

1) That he will soon run out of the atom table space (I think he's making the assumption that gen_servers need to be registered, which is not the case).
2) Sharing process state information between all nodes
3) All lookups need to go to a central gen_server (the global name server) running on a single logical core.

(I might be wrong about point 3).

But actually a hashing function could be a good idea, at least to find the correct node. This could remove the need to share process-availability information across host/node boundaries.

One could maintain a list of the current nodes in the mesh, and then use a hash to find the correct node. The identifier still needs to be mapped to a pid, but this can be done with a local ets table (optimized and fast for lookups).

For example:

process_inbound_request(Message, Identifier) ->
    Nodes = nodes() ++ [node()],
    Id = erlang:phash2(Identifier, length(Nodes)),
    ThisNode = node(),
    case lists:nth(Id + 1, Nodes) of
        ThisNode ->
            %% On the same node; look up locally to find the correct instance
            dispatch_locally(Message, Identifier);
        Node ->
            dispatch_remotely(Message, Identifier, Node)
    end.

dispatch_locally(Message, Identifier) ->
    case ets:lookup(local_pids, Identifier) of
        [{_, Pid}] ->
            gen_server:cast(Pid, {inbound, Message});
        _ ->
            ok
    end.

dispatch_remotely(Message, Identifier, Node) ->
    gen_server:cast({local_proxy, Node}, {request, Message, Identifier}).

local_proxy is a locally registered gen_server on each node that implements the equivalent of the dispatch_locally function.
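
For completeness, a sketch of that proxy (only the relevant callbacks shown,
assuming dispatch_locally/2 from above is available on the node):

%% One per node, registered under the local name local_proxy, so remote
%% nodes can reach it as {local_proxy, Node}.
start_link() ->
    gen_server:start_link({local, local_proxy}, ?MODULE, [], []).

init([]) ->
    {ok, no_state}.

handle_cast({request, Message, Identifier}, State) ->
    dispatch_locally(Message, Identifier),
    {noreply, State}.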

Matt

________________________________________
From: James Churchman [jamesch...@gmail.com]
Sent: Wednesday, February 02, 2011 10:37 AM
To: Evans, Matthew
Cc: Knut Nesheim; erlang-q...@erlang.org
Subject: Re: [erlang-questions] gen_server clustering strategy

Anthony Molinaro

Feb 2, 2011, 2:17:02 PM
to Evans, Matthew, James Churchman, Knut Nesheim, erlang-q...@erlang.org
Riak Core actually does a bunch of this stuff for you

http://blog.basho.com/category/riak-core/
https://github.com/basho/riak_core

It will manage a partitioned ring, sending reads and writes to some number
of nodes, and it is a great library. We use it as a bridge between a Thrift
server and a custom storage module, and it allows us to scale out as
necessary.

Also, according to the Riak FAQ, the largest cluster they've run is about
60 nodes:

http://wiki.basho.com/FAQ.html#What-is-the-largest-cluster-currently-running?

so that's at least one example of scaling into double digits.

-Anthony

--
------------------------------------------------------------------------
Anthony Molinaro <anth...@alumni.caltech.edu>

Knut Ivar Nesheim

Feb 2, 2011, 3:27:37 PM
to Evans, Matthew, James Churchman, Knut Nesheim, erlang-q...@erlang.org
Thanks for all the great feedback!

To clarify some points:

* We need to be able to send messages from every process to any other
process, so we could use some form of registry. Using the global
module would get us going, but not take us all the way (which is
something we can worry about later).

* We have considered writing our own process registry, based on ets
or mnesia or redis or whatever, to perfectly support our use case. We
are, however, interested in standing on the shoulders of giants,
especially if we can use a library like gproc, which has been
extensively tested and verified (at least without distribution).

* We won't hit 10 million users anytime soon; however, 2 million daily
users four months after marketing starts is realistic. Also,
our system cannot break when doubling that number. This means we need
to be able to handle at least 5,000 requests per second. The numbers
are from one of our other products, so we know pretty well what to
expect.

* We would like to add nodes without disturbing the processes already
running, so if we do hash, we must use consistent hashing (see the sketch
below).
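
(To illustrate what we mean by that, a minimal consistent-hash ring over the
node names; a real implementation would use many virtual points per node to
balance load better:)

%% Place each node at a point on a ring; a user id is owned by the first
%% node at or after its own hash. Adding a node then only moves the keys
%% that fall between the new node and its predecessor on the ring.
ring(Nodes) ->
    lists:keysort(1, [{erlang:phash2(N), N} || N <- Nodes]).

owner(UserId, Ring) ->
    Hash = erlang:phash2(UserId),
    case [Node || {Point, Node} <- Ring, Point >= Hash] of
        [Node | _] -> Node;
        []         -> element(2, hd(Ring))   %% wrap around to the first node
    end.

So owner(UserId, ring(nodes() ++ [node()])) would give the node to route to.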

Our worry with the cluster of identical nodes, the "mesh" described
earlier, is that there will be too many messages passing around. As we
add nodes, it will only get worse and we might start hitting the
limits of the Erlang distribution itself.

Our worry with the cluster that includes "proxy" nodes is that they
must maintain the state of the cluster in order to route accordingly.
At least they must know where to send the requests, using some form of
hashing. To this end, building on top of riak_core might be a very
good option. However, we feel that riak_core is pretty complex to
understand fully, and we would prefer a simpler solution.

As Matt suggested in his second post, we can make the system simpler
by not keeping a global registry of the userid-to-process mapping, but
only a mapping of userid to the node the process is running on.
The node itself can then store the userid-to-process mapping locally,
using gproc or something similar.
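
(Locally that mapping could be as simple as this with gproc; the
{user, UserId} key is just an example:)

%% On the owning node: the worker process registers itself under a local
%% (node-only) gproc name.
register_self(UserId) ->
    true = gproc:reg({n, l, {user, UserId}}).

%% Any process on the same node can then resolve the user id to a pid.
lookup(UserId) ->
    gproc:where({n, l, {user, UserId}}).   %% pid or undefined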

If Riak is running on 60 nodes, then the question for us is how much
message passing is going on. Our concern is that when we reach 10 or
20 nodes, our architecture will just not cut it anymore, as the message
passing will keep increasing.

Also, what kind of overhead can we expect if we monitor processes on
other nodes? Our understanding is that the information regarding pids
that die will be sent to every node in the cluster anyway, so actually
monitoring and receiving the info won't make much of a difference, as
it is already there.

Thanks for all the great feedback!

Regards
Knut

Scott Lystig Fritchie

Feb 6, 2011, 2:25:00 PM
to Knut Ivar Nesheim, Evans, Matthew, James Churchman, Knut Nesheim, erlang-q...@erlang.org
Following up many days after the last posting, sorry.

Knut Ivar Nesheim <knu...@gmail.com> wrote:

kin> * We need to be able to send messages from every process to any
kin> other process, so we could use some form of registry. Using the
kin> global module would get us going, but not take us all the way
kin> (which is something we can worry about later).

'global' will allow prototype code to work in a small development
system, but it won't scale to what you need. Previous messages have
pointed to reasons why.

kin> * We have considered writing our own process registry, based on ets
kin> or mnesia or redis or whatever [...]

Mnesia and Redis and Riak would work quite well for that, ...

kin> * We would like to add nodes without disturbing the processes
kin> already running, so consistent hashing must be used if hashing...

... though it isn't clear to me (haven't followed recent Redis
developments) if Redis can cope with changing cluster sizes dynamically.
Mnesia and Riak both can.

kin> Our worry with the cluster of identical nodes, the "mesh" described
kin> earlier, is that there will be too many messages passing around.
kin> [...]
kin> Our worry with the cluster that includes "proxy" nodes, is that
kin> they must maintain the state of the cluster in order to route
kin> accordingly. At least they must know where to send the requests
kin> using some form of hashing.

Er, you could build a system that routes messages like that, but neither
Mnesia nor Riak uses that method, so one need not worry about it.

kin> To this end, building on top of riak_core might be a very good
kin> option. However, we feel that riak_core is pretty complex to
kin> understand fully and would prefer a simpler solution.

Riak Core's callback functions aren't terribly complex to implement.
Though for your name->pid process registry, using the Riak KV
application as-is would be sufficient: KV is already using Core to figure
out where the key for your name->pid mapping would live.

kin> If riak is running on 60 nodes, then the question for us is how
kin> much message passing is going on.

The hash calculation is O(1) with respect to CPU time, and fetching the
value is O(1) with respect to the number of messages required.

kin> Also, what kind of overhead can we expect if we monitor processes
kin> on other nodes? Our understanding is that the information regarding
kin> pids that die, will be sent to any node in the cluster anyway, so
kin> actually monitoring and receiving the info won't make that much of
kin> a difference as it is already there.

Er, if I'm understanding your understanding, no. Information on a
monitored process P is only sent to interested parties, i.e. those procs
that actually created a monitor relationship with P.
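
In other words, only a process that has explicitly set something like the
following up will hear about P going down (a minimal sketch):

%% The 'DOWN' message is delivered only to the process that created the
%% monitor; nothing is broadcast cluster-wide on its behalf.
watch(Pid) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {worker_down, Pid, Reason}
    end.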

One thing that you didn't mention is what happens to the application
when one of your gen_server worker processes dies. Do those processes
contain state that must be persistent? If that state must be
persistent, then you have another data management task that you haven't
asked about yet.

-Scott
