Fire drill for replacing the main node

Eran Sandler

Mar 21, 2012, 5:11:41 AM
to doo...@googlegroups.com
Hi there,

I'm new to Doozer and it looks like a really cool project.

Looking at the fire drill page, I saw that it basically kills the non-"master" instances and replaces them with new ones. (Is "master" the right term? I mean the one that everyone else attaches to.)
What would be the right drill/methodology for replacing the "master" itself? Is there some sort of handover protocol that says the "master" is going down and someone else should take over?

Eran

Andreas Fuchs

Mar 21, 2012, 3:04:41 PM
to doo...@googlegroups.com
On Wed, Mar 21, 2012 at 2:11 AM, Eran Sandler <eran.s...@gmail.com> wrote:
> Looking at the fire drill page I saw that it basically kills the non
> "master" (is that the right term. That's the one that everyone else attaches
> to) instances to replace them with a new one.
> What would be the right drill/methodology to replace the "master" one? Is
> there sort of a handover protocol saying the "master" is going down, someone
> else should take the call?

In theory, replacing the initial master should work just like replacing
any other node (once other nodes have joined, the initial master should
be no different from any of them).

In practice, though, there are a bunch of ways in which this can turn
out very wrong. I've run a few fire drill scripts (with random sleeps
in them), and the behavior ranged from everything-ok (very rarely)
through "Too late" messages and any new node refusing to start (very
often) to random deadlocks (pretty rarely). There's a patch in the
currently-open doozer issues that fixes the worst "Too late" behavior,
but it's still very easy to trigger problems.

After running these tests, we put our effort to use doozer in
production on hold for now, hoping it'll become a bit more stable with
time /-:
