Remove master_node dependency or full clustering support


Bip Thelin

Jan 9, 2012, 1:38:11 PM
to ChicagoBoss
Hi,

We've just started digging into Chicago Boss and are loving it so far.
From my understanding of the docs and code, CB uses global
modules to achieve "clustering", which means that the one master_node is a
potential single point of failure. My first idea was to change this
behaviour from a normal gen_server to gen_leader. Since I've just done
a couple of projects with gen_leader, I have fairly good insight into
it. But after spending a good 30 minutes converting, I felt that it might be
the wrong way to go. Here's a modified session controller that handles
creation of sessions on all nodes in a cluster:
https://github.com/bipthelin/ChicagoBoss/commit/48eded3c7a03166aac45f1940aec2cc1789be715

This only works for new sessions and doesn't remove/update existing
sessions, and before I code anything more I'd like to discuss some
design decisions, since I don't (yet) have any deep knowledge of the
architecture.

My gut feeling is that there should be a more generic approach to
clustering instead of handling it separately in every place: some
gen_leader here, some mnesia there, etc.

If anyone with more insight into the code (Evan?) can share some of
their ideas/visions for clustering in CB, I'd be glad to chip in my
0.5c and the corresponding code.

Thanks, Bip

Evan Miller

Jan 9, 2012, 8:53:33 PM
to chica...@googlegroups.com
I've never used gen_leader so I can't say whether it's a good fit. In
general I think this is a hard problem because for true clustering
we'll need replication of data (not just services). I think you've
discovered the difficulty with session storage, but then we also need
to worry about making BossMQ (the message queue) truly distributed and
fault-tolerant. That sounds hard to me.

A more practical approach might be to cluster the master_node's
pure-computation services (e.g. incoming email), then for data
services just interface to external applications that have already
implemented fault tolerance. This is done for sessions (which can use
memcached), and could be done for the message queue as well. As much
as I like CB's batteries-included approach, I think it's best to farm
out hard problems like data replication to other servers.
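
To make the idea concrete, here is a minimal sketch in Python (not CB code, and not Erlang) of session handling that delegates storage to a pluggable backend. All names here (SessionBackend, InMemoryBackend, SessionService) are hypothetical, and the in-memory class only stands in for an external store such as memcached; no real client library is used.

```python
from abc import ABC, abstractmethod


class SessionBackend(ABC):
    """Hypothetical storage interface; a real backend would talk to an external store."""

    @abstractmethod
    def get(self, sid): ...

    @abstractmethod
    def set(self, sid, data): ...

    @abstractmethod
    def delete(self, sid): ...


class InMemoryBackend(SessionBackend):
    """Stand-in for an external, already fault-tolerant store (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, sid):
        return self._data.get(sid)

    def set(self, sid, data):
        self._data[sid] = dict(data)

    def delete(self, sid):
        self._data.pop(sid, None)


class SessionService:
    """Holds no session state itself, so any web node can serve any session."""

    def __init__(self, backend):
        self.backend = backend

    def create(self, sid):
        self.backend.set(sid, {})

    def put(self, sid, key, value):
        data = self.backend.get(sid) or {}
        data[key] = value
        self.backend.set(sid, data)

    def read(self, sid, key):
        data = self.backend.get(sid)
        return None if data is None else data.get(key)

    def destroy(self, sid):
        self.backend.delete(sid)


# Two "nodes" sharing one backend: a write on one is visible on the other.
shared = InMemoryBackend()
node1, node2 = SessionService(shared), SessionService(shared)
node1.create("s1")
node1.put("s1", "user", "bip")
print(node2.read("s1", "user"))
```

Because the service objects keep no state of their own, creation, update and removal all go through the same store, and no node has to act as master for session data.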

Evan

--
Evan Miller
http://www.evanmiller.org/

Dave Cottlehuber

Jan 9, 2012, 9:26:09 PM
to chica...@googlegroups.com
I'd love to add a couchdb backend for Boss and then point it at a
bigcouch cluster :-)))

The first part I hopefully will have time for in Feb.

Erlang FTW.

Evan Miller

Jan 9, 2012, 9:44:26 PM
to chica...@googlegroups.com
Incidentally, "Prehistoric Boss" (circa 2008) used CouchDB
exclusively. But I kept getting weird errors and gave up on my
NoSQL/Erlang dreams until I discovered Tyrant the next year. Ah,
memories.

Anyways, Couch is much more stable now and it'd be great to add it to the mix.

Evan

Dave Cottlehuber

Jan 10, 2012, 3:01:10 AM
to chica...@googlegroups.com
On 10 January 2012 03:44, Evan Miller <emmi...@gmail.com> wrote:
> Incidentally, "Prehistoric Boss" (circa 2008) used CouchDB
> exclusively. But I kept getting weird errors and gave up on my
> NoSQL/Erlang dreams until I discovered Tyrant the next year. Ah,
> memories.

Any code remnants??

Bip Thelin

Jan 10, 2012, 4:39:41 AM
to chica...@googlegroups.com
Good, this was the kind of response I was looking for. A few notes on gen_leader (sometimes described as Paxos-like, though it is a leader-election behaviour rather than a Paxos implementation): it's an Erlang behaviour where a cluster of services can dispatch messages to an elected leader, and that leader can dispatch messages to all workers (i.e. gen_leader processes that are not the elected leader). If the elected leader goes down, a new leader is automatically elected and takes over its responsibilities. It's a simple and elegant solution to the problem where you have a bunch of services but at any given time want only one of them performing some task, like sending mail. In my opinion it's a perfect candidate for a master_node setup.
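
A toy model of that pattern, sketched in Python rather than Erlang: any node hands work to the current leader, the leader can broadcast to the other workers, and a new leader is chosen when the old one dies. This is not gen_leader's actual election algorithm or API. The Cluster and Node classes and the "lowest live name wins" rule are assumptions made purely for illustration.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.inbox = []  # messages delivered to this node


class Cluster:
    """Toy elected-leader dispatch; the election rule is hypothetical."""

    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}
        self.leader = None
        self.elect()

    def elect(self):
        # Assumed rule: the live node with the lowest name becomes leader.
        live = sorted(n for n, node in self.nodes.items() if node.alive)
        self.leader = live[0] if live else None
        return self.leader

    def fail(self, name):
        # Mark a node as down; re-elect only if it was the leader.
        self.nodes[name].alive = False
        if name == self.leader:
            self.elect()

    def leader_call(self, msg):
        # Any node may hand work to whichever node is currently leader.
        if self.leader is None:
            raise RuntimeError("no live nodes")
        self.nodes[self.leader].inbox.append(msg)
        return self.leader

    def broadcast(self, msg):
        # The leader pushes a message to every live non-leader worker.
        targets = sorted(
            n for n, node in self.nodes.items()
            if node.alive and n != self.leader
        )
        for n in targets:
            self.nodes[n].inbox.append(msg)
        return targets


cluster = Cluster(["a", "b", "c"])
print(cluster.leader_call("send_mail"))  # handled by the first leader
cluster.fail("a")                        # leader goes down
print(cluster.leader_call("send_mail"))  # handled by the newly elected leader
```

Only one node does the "send mail" work at any time, and callers never need to know which node that is, which is why the pattern suits master_node-style services better than replicated key/value data.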

It's not as good a fit for clustering if you want a true horizontal approach with a "gossip" protocol like memcache or Riak.

This is where I stopped. I started with a gen_leader approach but felt halfway through that a true clustering approach is more suitable.

I agree that one approach (and a pretty good one) is to use external applications like memcached. My biggest gripe with this is ending up with dependencies on a bunch of different servers/applications, the hassle of configuring and running all of them, and the eventual cyclic dependency hell. I'm not saying that this is where one always ends up, but I just had a rather unpleasant experience with Scribe (log transport for Hadoop), which ended in us rolling our own (https://github.com/bipthelin/zerolog).

There is a fine line between keeping it simple to set up, use and maintain and ending up rolling your own Riak. I like the "batteries-included approach", so I'll do some research and see what I come up with. On another note, gen_leader might be a good addition to some other parts of CB, just not as a distributed k/v store.

--
Bip Thelin
 
Evolope AB | Lugnets Allé 1 | 120 33 Stockholm
Tel 08-533 335 37 | Mob 0735-18 18 90
www.evolope.se

Evan Miller

Jan 10, 2012, 9:32:28 AM
to chica...@googlegroups.com
On Tue, Jan 10, 2012 at 2:01 AM, Dave Cottlehuber <da...@muse.net.nz> wrote:
> On 10 January 2012 03:44, Evan Miller <emmi...@gmail.com> wrote:
>> Incidentally, "Prehistoric Boss" (circa 2008) used CouchDB
>> exclusively. But I kept getting weird errors and gave up on my
>> NoSQL/Erlang dreams until I discovered Tyrant the next year. Ah,
>> memories.
>
> Any code remnants??

Sure -- after minutes of digging under the hot Illinois sun, I found a
partial skeleton:

https://gist.github.com/1589346

Karmen Blake

Jan 10, 2012, 12:15:48 PM
to chica...@googlegroups.com
+1 for couchdb :)

Florent Gallaire

May 19, 2012, 11:50:52 PM
to chica...@googlegroups.com
The master_node problem is an important design choice, and fixing it
could be really complex.
But is it on the roadmap?

Cheers

Florent

--
FLOSS Engineer & Lawyer