Potential problem of deadlock with membership change


Hitoshi Mitake

Sep 13, 2015, 10:19:11 PM
to raft...@googlegroups.com
Hi Raft folks,
I'm Hitoshi Mitake, nice to meet you.

I'd like to share a potential problem in Raft that can cause deadlock
triggered by a membership change. I found the problem in etcd and fixed
it ( https://github.com/coreos/etcd/pull/3479 ). However, I don't think
this bug is specific to the etcd implementation, so sharing it on
raft-dev may be valuable for developers working on other Raft
implementations, even though the problem can only happen in very rare
corner cases.

The problem can be reproduced in the following 2 steps:
1. construct an N node cluster
2. add N new nodes without starting the new members

After the N nodes are added, the total size of the cluster becomes 2 *
N and its quorum size becomes N + 1. This means any further membership
change requires at least N + 1 running nodes, because Raft treats
membership information in its log like any other ordinary log append
request, for design simplicity (especially for reducing the number of
RPC types and storage types?).
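
To make the arithmetic concrete, here is a tiny sketch in Go (N = 3 is
just a hypothetical example, not a configuration taken from the etcd issue):

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int { return n/2 + 1 }

func main() {
	n := 3
	fmt.Println(quorum(n))     // 2: the 3 running nodes can still commit entries
	fmt.Println(quorum(2 * n)) // 4: but only 3 nodes are actually running, so nothing
	                           // (including a config change to undo the mistake) can commit
}
```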

Now assume the node information (in the case of etcd, the peer URLs) of
the added nodes is wrong because of an operator mistake or a bug in the
wrapping program which launches the Raft implementation. In such a case,
both adding and removing members become impossible because a quorum can
no longer be formed. Of course, ordinary requests cannot be served
either. The cluster appears to be deadlocked.

The problem can easily be avoided by rejecting a reconfiguration
request when the number of running nodes is smaller than the quorum of
the reconfigured cluster, so it doesn't point to any weakness in the
Raft design.
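
For example, such a guard could look roughly like the sketch below (the
names are hypothetical; this is not the actual code of the etcd pull request):

```go
package membership

import "errors"

// errUnsafeReconfig signals that applying the membership change could leave
// the cluster unable to commit any further entries (including a change to
// undo the mistake).
var errUnsafeReconfig = errors.New("reconfiguration would lose quorum")

// checkReconfig rejects a membership change when the nodes currently known
// to be running could not form a quorum of the reconfigured cluster.
func checkReconfig(runningNodes, newClusterSize int) error {
	newQuorum := newClusterSize/2 + 1
	if runningNodes < newQuorum {
		return errUnsafeReconfig
	}
	return nil
}
```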

However, I believe considering this problem in practical
implementations is valuable, because the Raft algorithm doesn't say
anything about the ordering between starting a newly added node and
issuing the reconfiguration request (if I understand correctly).
Therefore operator mistakes or bugs in launcher programs can slip in
between the two events.

I'm very new to Raft. If this problem is already known, sorry for the noise ;)

Thanks,
Hitoshi

jordan.h...@gmail.com

Sep 13, 2015, 10:52:57 PM
to raft...@googlegroups.com
So, what you're saying is that when adding >= n nodes, because the quorum must include at least one of the added nodes, a majority of servers do not have the committed entries, correct?

The way Copycat and some other Raft implementations have handled this is by making configuration changes a sort of two-stage process: when members are first added to the cluster, they're added in a state that prevents them from participating in elections or being counted in quorums (thanks to Ayende for this) but still allows logs to be replicated to those nodes. So, the leader still sends AppendEntries RPCs to non-voting members, but non-voting members don't have an election timeout or receive pre-vote or vote RPCs. In Copycat, as new members catch up to the rest of the cluster, a second configuration change is logged by the leader to promote each member to a full voting member. How to determine when a new node is caught up is debatable. Copycat currently logs the promotion when the joining node has all committed entries from the previous round of heartbeats, but there are some problems with this. Once the leader logs the promotion and the joining node receives the second configuration change adding it as a full voting member, it transitions to follower and resumes normal operation. So, if n nodes are added to a cluster, the configuration change doesn't complete and the quorum size doesn't effectively change until they can participate in the algorithm.
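
For illustration, the leader-side bookkeeping for such non-voting members
could look roughly like the following sketch (the names here are made up for
the example and are not Copycat's actual API):

```go
package membership

// MemberState distinguishes learners from full voters when counting quorums.
type MemberState int

const (
	NonVoting MemberState = iota // receives AppendEntries but is ignored for quorums and elections
	Voting                       // participates in elections and commit quorums
)

// Member is the leader's view of one cluster member.
type Member struct {
	ID    uint64
	State MemberState
	Match uint64 // highest log index known to be replicated on this member
}

// quorumSize counts only voting members, so adding learners never changes
// what the leader needs in order to commit.
func quorumSize(members []Member) int {
	voters := 0
	for _, m := range members {
		if m.State == Voting {
			voters++
		}
	}
	return voters/2 + 1
}

// readyForPromotion is one possible catch-up criterion: the learner has
// replicated everything the leader currently knows to be committed.
func readyForPromotion(m Member, commitIndex uint64) bool {
	return m.State == NonVoting && m.Match >= commitIndex
}
```

With that split, adding n learners doesn't change quorumSize, so the leader can keep committing and later promote them one at a time.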

Henrik Ingo

Sep 14, 2015, 2:43:10 AM
to raft...@googlegroups.com
Hi Hitoshi

It seems like maybe you've tried to implement cluster reconfiguration like it was described in the first Raft papers (aka the Usenix paper). I'll refer to Diego's thesis: https://ramcloud.stanford.edu/~ongaro/thesis.pdf

What you need to do correctly in your implementation is the joint consensus part. This is section 4.3 in the thesis.

However, a simpler and safer way to reconfigure the cluster is presented (new in the thesis version of Raft) in the preceding sections 4.1-4.2. With this approach you only add one server at a time.
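
As a rough illustration of the single-server rule, a leader could simply
refuse any configuration that differs from the current one by more than one
server (a sketch only, the names are made up):

```go
package membership

// singleServerChangeOK enforces the thesis's simpler rule: a new configuration
// may add or remove at most one server relative to the current one, so a
// majority of the old configuration always overlaps a majority of the new one.
func singleServerChangeOK(oldCfg, newCfg map[uint64]bool) bool {
	added, removed := 0, 0
	for id := range newCfg {
		if !oldCfg[id] {
			added++
		}
	}
	for id := range oldCfg {
		if !newCfg[id] {
			removed++
		}
	}
	return added+removed <= 1
}
```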

henrik




--
henri...@avoinelama.fi
+358-40-5697354        skype: henrik.ingo            irc: hingo
www.openlife.cc

My LinkedIn profile: http://fi.linkedin.com/pub/henrik-ingo/3/232/8a7

Martin Hedenfalk

Sep 14, 2015, 3:19:47 AM
to raft...@googlegroups.com
On Mon, Sep 14, 2015 at 11:19:10AM +0900, Hitoshi Mitake wrote:
> Hi Raft folks,
> I'm Hitoshi Mitake, nice to meet you.
>
> I'd like to share a potential problem of Raft which can cause deadlock
> triggered by membership change. I found the problem in etcd and fixed
> it ( https://github.com/coreos/etcd/pull/3479 ). However, I thought
> this bug isn't etcd implementation specific one. So I thought sharing
> it on raft-dev would be a little bit valuable for developers working
> on other Raft implementations, even though the problem can happen only
> in very corner cases.
>
> The problem can be caused in the below 2 steps:
> 1. construct N node cluster
> 2. add N new nodes without starting the new members

This problem is not restricted to nodes that are not started. If the new nodes
are slow, the cluster will not be able to make progress until the logs of a
majority have caught up with the leader.

In fact, this exact situation is discussed in Diego's thesis, section 4.2.
Figure 4.4b illustrates the problem. The solution is given in section 4.2.1,
"Catching up new servers". It also gives a suggestion for criteria to determine
when the new nodes are sufficiently up to date.
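
If I recall the thesis correctly, the suggested criterion is round-based: the
leader replicates its log to the new server in a bounded number of rounds and
considers it caught up once a round completes within an election timeout. A
rough sketch of that idea (not taken from any particular implementation):

```go
package membership

import "time"

// caughtUp replicates the leader's log to a new server in rounds and reports
// whether some round completed within one election timeout, which is the kind
// of criterion suggested in section 4.2.1 of the thesis. replicateRound is a
// placeholder for "send the new server everything the leader currently has"
// and returns how long that round took.
func caughtUp(replicateRound func() time.Duration, maxRounds int, electionTimeout time.Duration) bool {
	for i := 0; i < maxRounds; i++ {
		if replicateRound() < electionTimeout {
			return true
		}
	}
	return false
}
```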

/martin hedenfalk

Hitoshi Mitake

Sep 24, 2015, 9:31:52 AM
to raft...@googlegroups.com
Hi Jordan, thanks for your response.

On Mon, Sep 14, 2015 at 11:52 AM, jordan.h...@gmail.com
<jordan.h...@gmail.com> wrote:
> So, what you're saying is when adding >= n nodes, because the quorum must include one of the added nodes, that means a majority of servers do not have committed entries, correct?
>
> The way Copycat and some other Raft implementations have handled this is by making configuration changes a sort of two-stage process wherein when members are first added to the cluster, they're added in a state that prevents them from participating in elections or being counted in quorums (thanks to Ayende for this) but still allows logs to be replicated to those nodes. So, the leader still sends AppendEntries RPCs to non-voting members, but non-voting members don't have an election timeout or receive pre-vote or vote RPCs. In Copycat, as new members are caught up to the rest of the cluster, a second configuration change is logged by the leader to promote each member to full voting members. How to determine when a new node is caught up is questionable. Copycat currently logs the promotion when the joining node has all committed entries from the previous round of heartbeats, but there are some problems with this. Once the leader logs a promotion a joining node receives the second configuration change adding it as a full voting member it transitions to follower and resumes normal operation. So, if n nodes are added to a cluster, the configuration change doesn't complete and the quorum size doesn't effectively change until they can participate in the algorithm.
>

Thanks for your explanation of Copycat. It seems that the above way is
similar to the algorithm described in the thesis of Diego Ongaro.
IIUC, the main benefit of adding multiple nodes at once seems to be
parallelizing log synchronization of the new members. Is this correct?

Thanks,
Hitoshi

Hitoshi Mitake

Sep 24, 2015, 9:34:48 AM
to raft...@googlegroups.com
Hi Henrik, Martin, thanks for your response.
I had only read the TR (the extended version of the paper published at
USENIX ATC), so I hadn't noticed the description in the thesis. Thanks a lot!
Actually, I could find the case of a typo in a TCP port number (and the other
related cases you pointed out) in that section :)

Thanks,
Hitoshi


Diego Ongaro

Sep 27, 2015, 3:13:51 PM
to raft...@googlegroups.com
> IIUC, the main benefit of adding multiple nodes at once seems to be
> parallelizing log synchronization of the new members. Is this correct?
Definitely. Transferring the logs takes the longest (for medium to large logs), whereas committing a log entry or two should be quick (for wired networks on planet Earth).

Hitoshi Mitake

Sep 28, 2015, 3:49:12 AM
to raft...@googlegroups.com
Hi Diego,

On Mon, Sep 28, 2015 at 4:13 AM, Diego Ongaro <onga...@gmail.com> wrote:
>> IIUC, the main benefit of adding multiple nodes at once seems to be
>> parallelizing log synchronization of the new members. Is this correct?
>
>
> Definitely. Transferring the logs takes the longest (for medium to large
> logs), whereas committing a log entry or two should be quick (for wired
> networks on planet Earth).

Thanks for your answer. Now I can be confident about my understanding
of the topic :)

Thanks,
Hitoshi

Oren Eini (Ayende Rahien)

Sep 28, 2015, 5:16:26 AM
to raft...@googlegroups.com
That is why I like doing this in two stages.
First, add the node as a promotable (non-voting) member, then send the logs to it.
That can take a while; when it is done, you mark it as voting.
Both actions are topology changes that are very quick to run, and you aren't holding up the cluster while sending all the logs.
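
In other words, something like the following flow; the interface and method
names are made up for illustration and are not any real library's API:

```go
package membership

// adder is a hypothetical leader-side API for the two-stage flow.
type adder interface {
	AddNonVoting(id uint64) error           // quick config change; quorum is unchanged
	ReplicateUntilCaughtUp(id uint64) error // may take a long time; the cluster keeps serving
	Promote(id uint64) error                // second quick config change; the node now votes
}

// addServer brackets the slow log transfer with two fast configuration changes.
func addServer(a adder, id uint64) error {
	if err := a.AddNonVoting(id); err != nil {
		return err
	}
	if err := a.ReplicateUntilCaughtUp(id); err != nil {
		return err
	}
	return a.Promote(id)
}
```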

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 
