idea of backup voters

Dániel Urbán

Apr 15, 2021, 6:33:34 AM
to raft-dev
Hi raft devs,

I had an idea for a Raft extension. I suspect someone has already thought of it, but I couldn't find any references. Regardless, I'm curious what your take on it is.

I know that many Raft implementations have some kind of replica/witness concept, where certain nodes follow the cluster but don't have the right to vote.
I was wondering if this could be extended into a backup-voter concept: if a voter node fails, the leader could promote a caught-up replica to a voter role and then demote the failed node. Both the "caught up" concept and single-server membership changes are pretty well defined, so this shouldn't be too complex to do.
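
To sketch what I mean (every name below is made up for illustration, not taken from any real implementation): the leader tracks replication progress per follower, picks a non-voter whose log is within some small lag of its own commit index, adds it as a voter, and only once that change commits removes the failed voter as a second change.

    package sketch

    // All types and thresholds here are hypothetical.
    type Progress struct {
        ID         string
        IsVoter    bool
        MatchIndex uint64 // highest log index known to be replicated
    }

    // pickCaughtUpBackup returns a non-voter whose log is within maxLag
    // entries of the leader's commit index, if one exists. The leader
    // would then issue AddVoter(backup) and, once that change commits,
    // RemoveServer(failed) as a separate single-server change.
    func pickCaughtUpBackup(peers []Progress, commitIndex, maxLag uint64) (string, bool) {
        for _, p := range peers {
            if !p.IsVoter && p.MatchIndex+maxLag >= commitIndex {
                return p.ID, true
            }
        }
        return "", false
    }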

The advantage is that systems using Raft for metadata storage and coordination could run a large number of nodes in the Raft cluster while keeping the quorum of the cluster itself small.
E.g. a cluster of 7 nodes could run a 3-voter Raft cluster, with the remaining 4 nodes acting as backup voters. Even if a majority of the nodes (4) go down, the Raft cluster stays operational - provided the nodes went down one by one, so the leader had a chance to promote backups to keep the quorum functional.
This is better than running a 7-voter Raft cluster, where the 4th node going down would render the whole cluster unable to make progress.
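
To put numbers on it, here is a throwaway sketch of the usual majority rule:

    package main

    import "fmt"

    func majority(voters int) int { return voters/2 + 1 }

    func main() {
        // 7-voter cluster: quorum is 4, so the 4th failure halts it.
        fmt.Println(majority(7)) // 4
        // 3 voters + 4 backups: quorum stays 2, and after each voter
        // failure the leader promotes a caught-up backup, so failures
        // can be absorbed one at a time.
        fmt.Println(majority(3)) // 2
    }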

I was mainly thinking of Kafka when I arrived at this idea - Kafka moving from ZooKeeper to Raft is a great thing, but it still requires the controller processes (the nodes running Raft) to be treated separately from the brokers. If instead all brokers could participate in the Raft cluster without actually increasing the quorum size, the controllers could be baked into the brokers.

This is a very high-level and vague idea, but would be nice to see you tearing it apart :)

Thanks,
Daniel

Oren Eini (Ayende Rahien)

Apr 15, 2021, 6:39:51 AM
to raft...@googlegroups.com
You can do that as two separate single-server membership changes, sure. But you are running the risk of midway failures: you add the new node, and then it goes down - what now?
You now have a 4-member cluster (3 old, 1 new) with 2 nodes down, and you cannot make forward progress. Note that this is strictly worse than the previous behavior, where the watcher's failure wouldn't have mattered.
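
Spelling the arithmetic out (quorum = n/2 + 1):

    package main

    import "fmt"

    func canCommit(voters, up int) bool { return up >= voters/2+1 }

    func main() {
        fmt.Println(canCommit(3, 2)) // true:  one voter fails, still fine
        fmt.Println(canCommit(4, 3)) // true:  backup promoted to 4th voter
        fmt.Println(canCommit(4, 2)) // false: the new voter fails too - stuck
        // With no promotion, the same two failures would have left the
        // original 3-voter cluster at 2 of 3 voters up, able to commit.
    }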

To be honest, that is the sort of thing you usually want to have manual control over. It reminds me of the GitHub failover incident: they had a 43-second network split, and their own automated processes basically corrupted their state, leading to more than 24 hours of downtime.



Urbán Dániel

Apr 15, 2021, 6:39:53 AM
to raft...@googlegroups.com

And just as I sent this, I found the "4.4 System integration" section in Diego's thesis - so I understand this is not new.

Has anyone implemented this, or has experience with this kind of functionality?

Daniel


Urbán Dániel

Apr 15, 2021, 7:03:53 AM
to raft...@googlegroups.com

In that specific situation, using the weakened log replication quorum requirement would help, wouldn't it? In the 4-node cluster, 2 nodes would then be enough to commit a message. (Ensar Basri Kahveci shared a paper titled "Improved Majority Quorums for Raft" on this list some time ago.)
If the leader is still alive in that situation, it can roll back the addition of the new node by committing a new membership change. Then we are back at square one.
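
As I read that paper (and I may be off here), the key point is that replication quorums and election quorums only need to intersect, i.e. r + e > n, so for an even-sized cluster the replication quorum can drop below a majority:

    package main

    import "fmt"

    // quorums computes the smallest replication quorum that still
    // intersects every election quorum - my reading of the idea,
    // not code from the paper.
    func quorums(n int) (replication, election int) {
        election = n/2 + 1
        replication = n + 1 - election
        return
    }

    func main() {
        r, e := quorums(4)
        fmt.Println(r, e) // 2 3: commit with 2 of 4, elect with 3 of 4
    }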

Too bad if the leader goes down instead of the new voter, but without the voter promotion, the cluster would be broken anyway, right?

Thanks for sharing that post, I'll check it out.

Daniel

Oren Eini (Ayende Rahien)

Apr 15, 2021, 7:06:28 AM
to raft...@googlegroups.com
Really complex to reason about, and likely fragile. 
Not in the concept, but in implementation, testing, actual usage, etc. 

Martin Furmanski

Jun 30, 2021, 4:52:05 PM
to raft-dev
I've been involved in the design of a Raft implementation where the leader would eventually remove members considered dead. That gives the effect you mention of surviving more than a quorum's worth of failures, the trade-off being the reduced redundancy of data replication. Whether this was allowed was configurable.
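
"Configurable" meaning, roughly, something along these lines - a from-memory sketch, not our actual code:

    package sketch

    import "time"

    type Peer struct {
        ID          string
        LastContact time.Time // last successful AppendEntries round-trip
    }

    // deadPeers returns members the leader hasn't heard from for longer
    // than the configured removal timeout; the leader then proposes one
    // RemoveServer membership change per dead peer, committing each
    // change before starting the next.
    func deadPeers(peers []Peer, now time.Time, removeAfter time.Duration) []string {
        var dead []string
        for _, p := range peers {
            if now.Sub(p.LastContact) > removeAfter {
                dead = append(dead, p.ID)
            }
        }
        return dead
    }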