[erlang-questions] Erlang suitability


Ovid

May 18, 2012, 5:00:24 AM
to erlang-q...@erlang.org
Hi there,

We've a system that runs across 75 servers and needs to be highly performant, fault-tolerant and scalable, and to share persistent data across all 75 servers. We're investigating Erlang/Mnesia (which we don't know) because it sounds tailor-made for our situation.

We are not using Erlang for our first implementation, but are instead hacking together a solution from known technologies including Perl, MySQL and Redis. We're considering Erlang for our future work.

We have two primary needs: Each box can bid on an auction and potentially spend a tiny amount of money and each of the 75 boxes will receive notifications of a small amount of money spent if they win the auction (the auction notification will probably not be sent to the box bidding in the auction).

Use case 1: If the *total* of all of those small amounts exceeds a daily cap or an all-time cap, all 75 boxes must immediately stop bidding in auctions. It seems that each box can run a separate Erlang process and write out "winning bid" information to an Mnesia database, and all boxes can read the total amount spent from that to determine whether they should stop bidding.

This seems trivial to set up.

Use case 2: we periodically need to reauthenticate to the auction system. We MUST NOT have all 75 boxes trying to reauthenticate at the same time because we will be locked out of the system if we do this. Having a central box handling reauthentication is a single point of failure that we would like to avoid, but we don't know what design pattern Erlang would use to ensure that only one of the 75 Erlang instances would attempt to reauthenticate at any one time (all 75 boxes can share the same authentication token).

Use case 1 is clearly perfect for Erlang. Use case 2 is less clear to us. We're going to be spending a fair amount of time hacking together a non-Erlang solution, but the benefit is using known technologies that don't depend on the "one guy who knows the system".

At full scale, we anticipate billions of auctions per day.
 
Does anyone care to comment about things we should look at? Particularly, is use case 2 not something appropriate for Erlang? A tiny amount of sample code would be lovely (note: I am somewhat comfortable with Prolog, so Erlang looks fairly straightforward for me, aside from understanding the message passing).

Cheers,
Ovid

Rapsey

May 18, 2012, 5:40:36 AM
to Ovid, erlang-q...@erlang.org
75 machines in a fully connected distributed erlang network is a bit much but it should work.
use case 1: Why do you need to store to Mnesia and have all machines read from it? If you have 75 servers connected via distributed Erlang, you can just send a simple Erlang message to all of them.
use case 2: In a fully connected Erlang network you always know which machines are offline/online. Every machine has a name. If you simply sort all online machines by node name, the first one in the sorted list is the node you pick to do authentication. No real synchronization is required, and it will work even if that machine is lost. Or you can hardcode a sequence of machines that do the authenticating: the first in the list that is online is the one that does it.
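
A minimal sketch of the "sort by node name" rule described above, written in Python rather than Erlang to keep it language-neutral. The function names and node names are hypothetical, and the set of online nodes is assumed to be supplied by whatever membership view each machine has:

```python
def pick_authenticator(online_nodes):
    """Return the node that should reauthenticate: the lowest name among online nodes."""
    if not online_nodes:
        raise ValueError("no nodes online")
    return min(online_nodes)


def should_authenticate(my_name, online_nodes):
    # Every node evaluates this locally against its own view; no messages are exchanged.
    return pick_authenticator(online_nodes) == my_name


online = {"bidder03@host3", "bidder01@host1", "bidder02@host2"}
print(pick_authenticator(online))   # bidder01@host1
online.discard("bidder01@host1")    # that machine is lost
print(pick_authenticator(online))   # bidder02@host2
```

The rule only yields a single authenticator while every node sees the same set of online nodes; two halves of a partitioned network would each pick their own.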


Sergej

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions


Ovid

May 18, 2012, 5:49:02 AM
to Rapsey, erlang-q...@erlang.org
Hi Sergej,

Thanks for the response. It sounds like use case 2 is easier than I thought.

As for Mnesia and use case 1, our thought was twofold:

1. We want ACID compliance. We need to ensure that were all 75 boxes to mysteriously crash, we could bring them back up and not worry about data integrity.
2. Our storage needs are likely to become more complex in the future. The daily and total capping amounts will eventually be broken down into a variety of different categories, along with synchronization of which machines bid on a request and if those individual machines won a request (something that we currently cannot do without offline aggregation of this data).
 
When you say "75 machines in a fully connected distributed erlang network is a bit much but it should work." could you elaborate on that? It makes me a bit nervous to hear, particularly since we have no Erlang knowledge and just assumed that it was suited for this. Are we thinking about this the wrong way?

Cheers,
Ovid

From: Rapsey <rap...@gmail.com>
To: Ovid <curtis_...@yahoo.com>
Cc: "erlang-q...@erlang.org" <erlang-q...@erlang.org>
Sent: Friday, 18 May 2012, 11:40
Subject: Re: [erlang-questions] Erlang suitability

Gleb Peregud

May 18, 2012, 6:27:51 AM
to Ovid, erlang-q...@erlang.org
On Fri, May 18, 2012 at 11:49 AM, Ovid <curtis_...@yahoo.com> wrote:
> 1. We want ACID compliance. We need to ensure that were all 75 boxes to
> mysteriously crash, we could bring them back up and not worry about data
> integrity.

This is mostly guaranteed by Mnesia, but it is sensitive to start/stop
order. I.e. the last node to stop has to be the first node to start
(in most cases this is not necessary, but it is the case if whole
cluster crashed). Please correct me if I'm wrong.

> 2. Our storage needs are likely to become more complex in the future. The
> daily and total capping amounts will eventually be broken down into a
> variety of different categories, along with synchronization of which
> machines bid on a request and if those individual machines won a request
> (something that we currently cannot do without offline aggregation of this
> data).

Not sure if Mnesia supports such "sub clusters". Although it probably
runs transactions only over the nodes on which an involved table is
located, doesn't it?

> When you say "75 machines in a fully connected distributed erlang network is
> a bit much but it should work." could you elaborate on that? It makes me a
> bit nervous to hear, particularly since we have no Erlang knowledge and just
> assumed that it was suited for this. Are we thinking about this the wrong
> way?

Mnesia was tailored for 2 to 5 nodes at most, since it was designed to
store configuration and short-lived data over pairs of boxes, where one
of the boxes is a failover node.

Rapsey

May 18, 2012, 6:29:08 AM
to Ovid, erlang-q...@erlang.org
Distributed erlang

I actually don't have a lot of experience in running a distributed erlang installation, but guys in this discussion do:

But from the information you have provided, Erlang should definitely be the best tool for the job. Since you lack Erlang experience, maybe you should contact the Erlang Solutions guys: http://www.erlang-solutions.com/



Sergej

Ulf Wiger

May 18, 2012, 7:22:06 AM
to Gleb Peregud, erlang-q...@erlang.org

On 18 May 2012, at 12:27, Gleb Peregud wrote:

> This is mostly guaranteed by Mnesia, but it is sensitive to start/stop
> order. I.e. the last node to stop has to be the first node to start
> (in most cases this is not necessary, but it is the case if whole
> cluster crashed). Please correct me if I'm wrong.

It doesn't have to be the first to start, but the other nodes will wait
for it to appear. That is, all nodes that share replicas with that node,
since they cannot know the status of that replica until it comes back
on-line.

BR,
Ulf W

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
http://feuerlabs.com

Fred Hebert

May 18, 2012, 8:02:22 AM
to Ovid, erlang-q...@erlang.org
Answers inline.


On 12-05-18 5:00 AM, Ovid wrote:
Hi there,

We've a system that runs across 75 servers and needs to be highly performant, fault-tolerant and scalable, and to share persistent data across all 75 servers. We're investigating Erlang/Mnesia (which we don't know) because it sounds tailor-made for our situation.
As mentioned earlier in this thread, 75 servers is a bit much, but people have done it before.


We are not using Erlang for our first implementation, but are instead hacking together a solution from known technologies including Perl, MySQL and Redis. We're considering Erlang for our future work.

We have two primary needs: Each box can bid on an auction and potentially spend a tiny amount of money and each of the 75 boxes will receive notifications of a small amount of money spent if they win the auction (the auction notification will probably not be sent to the box bidding in the auction).

Use case 1: If the *total* of all of those small amounts exceeds a daily cap or an all-time cap, all 75 boxes must immediately stop bidding in auctions. It seems that each box can run a separate Erlang process and write out "winning bid" information to an Mnesia database, and all boxes can read the total amount spent from that to determine whether they should stop bidding.

This seems trivial to set up.
It isn't trivial. You have to think about what happens when a box is seen as crashing. How strongly consistent do you want things to be? There is always a risk that a box didn't crash, but was cut off in a netsplit. You might get divergences in budget that will be hard to explain.

There is also a definite timing issue depending on how your data is being observed. For example, you ask permission to bid on an item, but you do not get instant feedback; by the time you sent maybe 5-10 bids, the cap is finally reached and broken at once because the delay to the other network made you keep on bidding without a final result. How much tolerance do you have for this?

You mentioned in another post that "We need to ensure that were all 75 boxes to mysteriously crash, we could bring them back up and not worry about data integrity." Possibly, but what about 1 node only? What about 5? What about 30 or 35? What if they crash and you miss winning bids because you went out after bidding but before getting your notifications back (if that is possible by the bidding rules of whatever exchange you're dealing with)?

The most solid synchronous database setup might not give you the guarantees you expect in the first place.


Use case 2: we periodically need to reauthenticate to the auction system. We MUST NOT have all 75 boxes trying to reauthenticate at the same time because we will be locked out of the system if we do this. Having a central box handling reauthentication is a single point of failure that we would like to avoid, but we don't know what design pattern Erlang would use to ensure that only one of the 75 Erlang instances would attempt to reauthenticate at any one time (all 75 boxes can share the same authentication token).
That depends on: 1. how many times you can try to re-authenticate before being blocked, 2. how close together they have to be.

Central points of failures are definitely something to avoid. Leader election across 75 boxes might not be the funnest thing in the world either. I could see a scheme where you use some distributed cached value that can say "I am currently being logged" that can time out at some point, visible to all readers. When you read that timeout value from each box (possibly from an OTP Application that only handles auth), each reading of that value adds or subtracts a random number to the timeout. This is to try and avoid a cluster-wide synchronization on the timeout value, and instead have them happen at different times. You could add an "I'm updating" flag related to that value and that could give you good probabilities that only a fraction of all the nodes attempt an authentication at any point in time close to the timeout value.
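
A rough sketch of the jittered-timeout idea above, in Python for brevity. Everything here is an assumption made for illustration: the `shared` dict stands in for the distributed cached value, the names are hypothetical, and a real store would have to make the check-and-set on the "I'm updating" flag atomic:

```python
import random


def my_attempt_time(expires_at, jitter, rng):
    """Shift the shared expiry by a per-box random offset of up to +/- jitter seconds."""
    return expires_at + rng.uniform(-jitter, jitter)


def try_claim(shared, node, now, attempt_at):
    """Claim the reauthentication unless it is too early or another box holds the flag."""
    if now < attempt_at or shared["updating_by"] is not None:
        return False
    shared["updating_by"] = node
    return True


shared = {"expires_at": 1000.0, "updating_by": None}
rng = random.Random(1)  # seeded only to make the run repeatable
attempts = {n: my_attempt_time(shared["expires_at"], 30.0, rng) for n in ("a", "b", "c")}

# Walk the boxes in the order their jittered timers fire.
in_order = sorted(attempts.items(), key=lambda kv: kv[1])
claims = {n: try_claim(shared, n, t, t) for n, t in in_order}
print(claims)  # only the box whose timer fires first gets True
```

With a non-atomic or slowly replicated flag, more than one box can still slip through, which is why the scheme only gives "good probabilities" rather than a guarantee.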

Again, this would depend on how often your authentication needs to be done, and to what frequency you're allowed to do it.

If it's too tight, you might need a central server or node that takes care of it, with one or two fail-overs to add some reliability.

Note you will still have to care about netsplits ruining your day with this whole scheme.

-- I had nothing to add on the rest of the mail, so I cut it off.

Hope this helps,
Fred.

Ovid

May 18, 2012, 8:29:16 AM
to Fred Hebert, erlang-q...@erlang.org
Netsplits. Damn. I forgot to put my thinking CAP on. In this case, a netsplit would be disastrous unless we fell back to a central data store such as Redis. At that point, Erlang doesn't look like a solution at all.

On top of that, you're the second person to point out that a complete graph with 75 nodes is problematic. Now that I think about it, I guess I can see why. It now sounds like an Erlang solution is not the quick win we thought it might be (quelle surprise!). 
 
Thanks to everyone for all of your answers.

Cheers,
Curtis

From: Fred Hebert <mono...@ferd.ca>
Sent: Friday, 18 May 2012, 14:02

Subject: Re: [erlang-questions] Erlang suitability

Ladislav Lenart

May 18, 2012, 10:05:22 AM
to Ovid, erlang-q...@erlang.org
Hello.

I am by no means an Erlang expert, but...

Netsplits are an inherent part of a distributed solution. Erlang or not, you
will have to deal with them one way or the other. And even if some 3rd party
framework takes care of them for you, they are still there. Erlang shines here,
because it gives you the tools you need to build a solution tailored to your
specific problem. I don't know of any other language that has builtin language
constructs and libraries that deal with beasts such as netsplits and HW / SW
failures. So for me, Erlang would fit nicely to your problem description.

Regarding 75 nodes in a cluster...

One objection was against the native Erlang distribution protocol (fully
connected mesh of nodes), not against the Erlang language + platform itself.
You can fairly easily build your own communication layer on top of TCP/IP,
term_to_binary/1 and binary_to_term/1. For example, you can divide 75 nodes
into fully connected islands with one node responsible for communication with
the outside world. I believe "hidden nodes" should be of value here.

I know that not long ago, someone posted on this very list that their app runs
on ~700 nodes (sorry I don't remember more) so 75 nodes is certainly doable.

Another problem is mnesia, which as some pointed to you, was not built to run on
this number of nodes. Nevertheless there are other possibilities, for example
riak. See basho.com for more.


HTH,

Ladislav Lenart


On 18.5.2012 14:29, Ovid wrote:
> Netsplits. Damn. I forgot to put my thinking CAP on. In this case, a netsplit
> would be disastrous unless we fell back to a central data store such as Redis.
> At that point, Erlang doesn't look like a solution at all.
>
> On top of that, you're the second person to point out that a complete graph with
> 75 nodes is problematic. Now that I think about it, I guess I can see why. It
> now sounds like an Erlang solution is not the quick win we thought it might be
> (quelle surprise!).
>
> Thanks to everyone for all of your answers.
>
> Cheers,
> Curtis
> --
> Live and work overseas - http://www.overseas-exile.com/
> Buy the book - http://www.oreilly.com/catalog/perlhks/
> Tech blog - http://blogs.perl.org/users/ovid/
> Twitter - http://twitter.com/OvidPerl/
>
> --------------------------------------------------------------------------------
> *From:* Fred Hebert <mono...@ferd.ca>
> *To:* Ovid <curtis_...@yahoo.com>
> *Cc:* "erlang-q...@erlang.org" <erlang-q...@erlang.org>
> *Sent:* Friday, 18 May 2012, 14:02
> *Subject:* Re: [erlang-questions] Erlang suitability

Garrett Smith

May 18, 2012, 11:52:04 AM
to Ovid, erlang-q...@erlang.org
Hi Ovid,

Thanks for the very interesting use case!

On Fri, May 18, 2012 at 4:00 AM, Ovid <curtis_...@yahoo.com> wrote:
> Hi there,
>
> We've a system that runs across 75 servers and needs to be highly performant,
> fault-tolerant and scalable, and to share persistent data across all 75 servers.
> We're investigating Erlang/Mnesia (which we don't know) because it sounds
> tailor-made for our situation.

The previous comments about 75 being "a bit much" are warranted. In
distributed mode, Erlang establishes a "fully connected mesh" where
each node tries to maintain connections to every other node. This
N(N-1) scheme adds up quickly and caps the practical limits of these
clusters.
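
To put numbers on that growth, here is a small Python calculation. N(N-1) counts each connection once per endpoint; the number of distinct links is half of that. The function name is made up for the example:

```python
def mesh_links(n):
    """Distinct node-to-node connections in a fully connected mesh of n nodes."""
    return n * (n - 1) // 2


for n in (5, 75, 150):
    # nodes, distinct links in the mesh, connections each node must maintain
    print(n, mesh_links(n), n - 1)
# 5 10 4
# 75 2775 74
# 150 11175 149
```

Doubling the cluster from 75 to 150 nodes roughly quadruples the number of links, which is the sense in which the scheme "adds up quickly".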

You can partition the clusters to limit their size, but you're moving
into a more complex topology. It's probably worth looking at though
since you get so much out of distributed Erlang for this problem.

> We are not using Erlang for our first implementation, but are instead
> hacking together a solution from known technologies including Perl, MySQL
> and Redis. We're considering Erlang for our future work.
>
> We have two primary needs: Each box can bid on an auction and potentially
> spend a tiny amount of money and each of the 75 boxes will receive
> notifications of a small amount of money spent if they win the auction (the
> auction notification will probably not be sent to the box bidding in the
> auction).
>
> Use case 1: If the *total* of all of those small amounts exceeds a daily cap
> or an all-time cap, all 75 boxes must immediately stop bidding in
> auctions. It seems that each box can run a separate Erlang process and write
> out "winning bid" information to an Mnesia database and all boxes can read
> the total amount spent from that to determine if it should stop bidding.

You might also look at Riak for this.

Also, something like doozerd.

Or a consensus algorithm like Paxos, if you don't mind the extra work
in building smarter clients.

> This seems trivial to set up.
>
> Use case 2: we periodically need to reauthenticate to the auction system. We
> MUST NOT have all 75 boxes trying to reauthenticate at the same time because
> we will be locked out of the system if we do this. Having a central box
> handling reauthentication is a single point of failure that we would like to
> avoid, but we don't know what design pattern Erlang would use to ensure that
> only one of the 75 Erlang instances would attempt to reauthenticate at any
> one time (all 75 boxes can share the same authentication token).

This could also be framed as a consensus problem -- you'd elect a node
to reauthenticate among a set of candidates.

> Use case 1 is clearly perfect for Erlang. Use case 2 is less clear to us.
> We're going to be spending a fair amount of time hacking together a
> non-Erlang solution, but the benefit is using known technologies that don't
> depend on the "one guy who knows the system".

This is a great approach!

One tech you might look at to glue these pieces together is 0MQ. This
would let you use whatever languages you'd want, hacking something
that just works, then replace pieces as needed without breaking your
network protocols.

You'll have a learning curve using 0MQ, but this is because you'll be
learning core messaging patterns, not so much 0MQ itself, which is
very simple. I think all of that learning will be indispensable for
your team as you move along.

> At full scale, we anticipate billions of auctions per day.
>
> Does anyone care to comment about things we should look at? Particularly, is
> use case 2 not something appropriate for Erlang? A tiny amount of sample
> code would be lovely (note: I am somewhat comfortable with Prolog, so Erlang
> looks fairly straightforward for me, aside from understanding the message
> passing).

If you haven't already, take a look at this excellent document from Google:

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/chubby-osdi06.pdf

Distributed algorithms are a PITA (for me anyway) and the Chubby lock
service is a great "cheat" that simplifies a lot of the harder
problems.

Doozer is one of the few implementations of something Chubby-like that
I'm aware of.

> Cheers,
> Ovid

Best of luck!

Garrett

Joe Armstrong

May 18, 2012, 12:00:27 PM
to Ovid, erlang-q...@erlang.org
It's a bit difficult to answer your questions since you don't say much
about time.

Is the total time of an auction a millisecond or a day (or longer) - I
have no idea and don't
like to guess. What are the timing requirements for an individual bid
in an auction?
How many bids before an auction is closed?

Re authentication - how often is periodically? - every hour / every
day / every second
not knowing makes answering your questions very difficult.

If re authentication is every 75 minutes - then you could authenticate
machine one on
the first minute, machine 2 on the second minute and so on....
Erlang/Mnesia is not magic
you still need to think out your algorithms in detail first ...

Cheers

/Joe

o...@cs.otago.ac.nz

May 18, 2012, 12:17:45 PM
to Garrett Smith, erlang-q...@erlang.org

> On Fri, May 18, 2012 at 4:00 AM, Ovid <curtis_...@yahoo.com> wrote:

>> Use case 2: we periodically need to reauthenticate to the auction
>> system. We
>> MUST NOT have all 75 boxes trying to reauthenticate at the same time
>> because
>> we will be locked out of the system if we do this. Having a central box
>> handling reauthentication is a single point of failure that we would
>> like to
>> avoid, but we don't know what design pattern Erlang would use to ensure
>> that
>> only one of the 75 Erlang instances would attempt to reauthenticate at
>> any
>> one time (all 75 boxes can share the same authentication token).

How *often* is "periodically"?
What would happen if each of your boxes just chose times at random?
If you have to re-authenticate once an hour, each box could be given
two 24-second windows to use (time division multiplexing, as it were).
You can probably keep the clocks sufficiently synchronised to make that
work.
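
The arithmetic behind the two 24-second windows: an hour has 3600 seconds, so 75 boxes get 48 seconds each, which can be split into two 24-second slots half an hour apart. A minimal Python sketch, assuming boxes are numbered 0 to 74 and clocks are synchronised well enough; the function names are hypothetical:

```python
WINDOW = 24                          # seconds per window
BOXES = 75
WINDOWS_PER_HOUR = 3600 // WINDOW    # 150, i.e. two windows per box


def windows_for(box):
    """Return the two (start, end) second offsets within the hour owned by box 0..74."""
    return [(w * WINDOW, (w + 1) * WINDOW) for w in (box, box + BOXES)]


def owner_at(second_of_hour):
    """Return which box may reauthenticate at the given second of the hour."""
    return (second_of_hour // WINDOW) % BOXES


print(windows_for(0))    # [(0, 24), (1800, 1824)]
print(windows_for(74))   # [(1776, 1800), (3576, 3600)]
print(owner_at(1810))    # 0
```

Because each second of the hour maps to exactly one box, no coordination between nodes is needed beyond agreeing on the numbering.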

Could you use a scheme where each node holds a token and uses
gproc:give_away/2 to pass it to the next node?

Garrett Smith

May 18, 2012, 12:24:37 PM
to o...@cs.otago.ac.nz, erlang-q...@erlang.org
On Fri, May 18, 2012 at 11:17 AM, <o...@cs.otago.ac.nz> wrote:
>
>> On Fri, May 18, 2012 at 4:00 AM, Ovid <curtis_...@yahoo.com> wrote:
>>> Use case 2: we periodically need to reauthenticate to the auction
>>> system.

-snip-

> How *often* is "periodically"?
> What would happen if each of your boxes just chose times at random?
> If you have to re-authenticate once an hour, each box could be given
> two 24-second windows to use (time division multiplexing, as it were).
> You can probably keep the clocks sufficiently synchronised to make that
> work.
>
> Could you use a scheme where each node holds a token and uses
> gproc:give_away/2 to pass it to the next node?

Hah! I forgot about gproc's leader election features.

I know that was "in process" for a while -- can anyone share recent
experience with it?

Garrett

Ulf Wiger

May 18, 2012, 12:39:28 PM
to Garrett Smith, erlang-q...@erlang.org

On 18 May 2012, at 18:24, Garrett Smith wrote:

>> Could you use a scheme where each node holds a token and uses
>> gproc:give_away/2 to pass it to the next node?
>
> Hah! I forgot about gproc's leader election features.
>
> I know that was "in process" for a while -- can anyone share recent
> experience with it?

Well, it is about as good as the latest and greatest gen_leader. ;-)

Gproc relies on gen_leader, and simply passes on the start options
to it. Some versions of gen_leader (I believe you have the fork that's
currently 'leading the pack' in the github network graph ;-)

However, gproc uses full replication, so gproc:give_away/2 will involve
all 75 nodes - not just the two.

BR,
Ulf W

PS Since gen_leader hasn't had a strategy for resolving netsplits
(i.e. merging data and resolving inconsistencies), gproc has no support
for this either. Does any of the latest gen_leader versions have something
that gproc can use?

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
http://feuerlabs.com



Michael Turner

May 19, 2012, 3:24:17 AM
to Fred Hebert, erlang-q...@erlang.org
"As mentioned earlier in this thread, 75 servers is a bit much, but
people have done it before."

What I haven't seen mentioned so far: depending on your application,
reimplementing it in Erlang might mean that you no longer need nearly
as many as 75 servers.

-michael turner

Jesper Louis Andersen

May 21, 2012, 11:40:21 AM
to Ovid, erlang-q...@erlang.org
On 5/18/12 11:00 AM, Ovid wrote:

Use case 1: If the *total* of all of those small amounts exceeds a daily cap or an all-time cap, all 75 boxes must immediately stop bidding in auctions. It seems that each box can run a separate Erlang process and write out "winning bid" information to an Mnesia database, and all boxes can read the total amount spent from that to determine whether they should stop bidding.

Have you considered using the opposite direction? A bidder takes a lease on a part of the possible cap. If there is a lot of remaining money you can pick out a fairly large lease and as you get closer to the spending limit you can decrease the amount of extra money you can get for bids. For instance, you may know there is $100 in the pool and a bidder needs to do a bid. Hence it allocates $5 to itself and can now roam on those $5 as it sees fit.

If the leaser crashes, you have a monitor on it, so you will get notified and can recover money it did not spend.

You will still need some database solution that is running on multiple nodes to battle a single-point-of-failure. But this idea works even if this is the case.

The advantage is that this scales a lot better. A bidder now knows for how much it is allowed to bid and can then operate independently on a synchronization point of "How much more am I allowed to spend?"
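
A minimal sketch of the leasing idea, in Python with amounts in cents. The class, method names and the "one twentieth of what is left" rule are assumptions chosen so that a $100 pool yields a $5 lease as in the example above. Recording spend against the lease in the pool is also an assumption; in practice the bidder would track it locally and only come back for more:

```python
class BudgetPool:
    """Pool that hands out spending leases against a cap (all amounts in cents)."""

    def __init__(self, cap_cents, divisor=20, min_lease=1):
        self.remaining = cap_cents
        self.divisor = divisor      # each lease is 1/divisor of what is left
        self.min_lease = min_lease
        self.leases = {}            # bidder -> unspent cents

    def lease(self, bidder):
        """Grant a slice of what is left; slices shrink as the pool drains."""
        amount = min(self.remaining, max(self.min_lease, self.remaining // self.divisor))
        self.remaining -= amount
        self.leases[bidder] = self.leases.get(bidder, 0) + amount
        return amount

    def spend(self, bidder, cents):
        """Record a winning bid; refuse it if it exceeds the bidder's lease."""
        if cents > self.leases.get(bidder, 0):
            return False
        self.leases[bidder] -= cents
        return True

    def bidder_down(self, bidder):
        """Called when a monitor reports a crash: return unspent money to the pool."""
        self.remaining += self.leases.pop(bidder, 0)


pool = BudgetPool(10_000)        # $100 in the pool
print(pool.lease("bidder_a"))    # 500, i.e. $5
pool.spend("bidder_a", 120)
pool.bidder_down("bidder_a")     # crash: the unspent $3.80 goes back
print(pool.remaining)            # 9880
```

The total of outstanding leases plus the remaining pool never exceeds the cap, so a bidder that cannot reach the pool can overspend by at most its own lease.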


This seems trivial to set up.
It isn't. But Erlang could perhaps lend you some tools to make this work.

Use case 2: we periodically need to reauthenticate to the auction system. We MUST NOT have all 75 boxes trying to reauthenticate at the same time because we will be locked out of the system if we do this. Having a central box handling reauthentication is a single point of failure that we would like to avoid, but we don't know what design pattern Erlang would use to ensure that only one of the 75 Erlang instances would attempt to reauthenticate at any one time (all 75 boxes can share the same authentication token).

Your problem is that of a hypothesis: waiting-requires-locking. That is, if you need to wait on others, you need to synchronize who is doing things in what order - and that requires you to block on a single point. This in turn makes it hard to avoid the single-point-of-failure.

If you know that you may have up to K simultaneous authentications running, it becomes easier to handle because then you have some leeway in how much synchronization is needed.

There is no really good solution though. A problem here is the split-brain scenario, where your network gets disconnected, but individual nodes are still operating and can authenticate. In that case, you might have double auths if you pick the lowest possible node in a list.

What you should really do is to use *risk* as a deciding factor. You must weigh the likelihood of something happening against its impact. It is, for instance, more likely that a node is lost than that the network is in a split brain where both sides can still authenticate. Hence you probably decide to take that risk, knowing that certain split-brain scenarios cannot be handled by the solution.

If there is anything I wish to tell people about distributed programming it is that it is a fuzzy logic. On a single machine you are *not* safe since it can die. On multiple machines you have an error rate and different types of errors. What is important is that you control the error rate rather than let it flow by itself. You will almost never hit a situation where 100% stability can be guaranteed if you also need speed. So it becomes a question of risk management and trade-offs.

-- 
Jesper Louis Andersen
  Erlang Solutions Ltd., Copenhagen, DK