A more flexible Paxos


Sugu Sougoumarane

Aug 11, 2016, 7:19:41 PM
to raft-dev
Hello,
I was sent to this forum to present a proposed improvement to Paxos: http://ssougou.blogspot.com/2016/08/a-more-flexible-paxos.html.
It's basically a way to run a larger number of nodes without losing write performance.
On one hand, the idea looks simple enough that someone must have thought of it before. On the other hand, I haven't seen anyone use this flexibility in real life.

The change is orthogonal enough that it can be applied to most consensus algorithms, including Raft.

Philip Haynes

Aug 12, 2016, 1:21:48 AM
to raft-dev
Hi Sugu,

What was your motivation for creating this modification? We are not write-bound in practice, and I can see (even after the proofs are done) that it would add a lot of complexity to a practical Raft implementation.

Kind Regards,
Philip

Sugu Sougoumarane

Aug 12, 2016, 9:02:17 AM
to raft-dev
The change involves splitting one variable into two: use a different quorum size for leader election than for proposal approval. It should not be too complex.
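
As a rough sketch (hypothetical names, not code from any real implementation), the split amounts to carrying two quorum sizes instead of one and checking that they overlap:

    package main

    import "fmt"

    // QuorumConfig is a hypothetical sketch of the split: one quorum
    // size for leader election, another for proposal approval.
    type QuorumConfig struct {
        N int // total number of nodes
        L int // votes needed to elect a leader
        P int // acks needed to approve (commit) a proposal
    }

    // Safe checks that every election quorum intersects every
    // proposal quorum, which holds exactly when L + P > N.
    func (q QuorumConfig) Safe() bool {
        return q.L+q.P > q.N && q.P >= 1 && q.L <= q.N
    }

    func main() {
        classic := QuorumConfig{N: 9, L: 5, P: 5}  // standard majorities
        flexible := QuorumConfig{N: 9, L: 7, P: 3} // fewer acks per write
        fmt.Println(classic.Safe(), flexible.Safe()) // true true
    }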

The motivation comes from a few data points:
- The etcd admin guide (https://coreos.com/etcd/docs/latest/admin_guide.html) talks about running up to 7 nodes, but not going beyond. I felt that those who decide to go with 7 should be able to keep the write performance of a 3- or 5-node cluster.
- An even number of nodes would be easier to administer if you're trying to balance them across zones; you can distribute them more uniformly.
- People are starting to use Paxos (or Raft) for high-QPS systems beyond lock servers (Spanner, Cockroach). Write performance becomes important for them, and the option to have more nodes in a quorum makes the offering more attractive.
- Something I haven't fully thought through: we could convert all the followers (acceptors) to voters, because the number of nodes would no longer affect write performance. So why differentiate?

Archie Cobbs

Aug 12, 2016, 9:50:02 AM
to raft-dev
Hi Sugu,

This is an interesting idea. However, it's not obvious to me exactly how one would modify the Raft algorithm, or whether it would still preserve Raft's correctness. That doesn't mean it can't and doesn't, only that it's not obvious (nothing in this consensus space ever is). In other words, the devil is in the details.

Have you or anyone else tried to create a more thorough description of what your modified Raft algorithm would be, and to generalize Raft's proof of correctness for this modified algorithm?

-Archie

Sugu Sougoumarane

Aug 12, 2016, 11:06:54 AM
to raft-dev
For now, it's a conjecture based on intuitive understanding. No proof yet.

Oren Eini (Ayende Rahien)

Aug 12, 2016, 2:10:09 PM
to raft...@googlegroups.com
You run the risk of losing proposals here.
A proposal that was sent to 3 nodes (and thus accepted) can be lost if those three nodes go down.

Also, from a practical perspective, assume that the proposal is:

CompareAndSwap("lock", 1, 0);


If two clients issue this at the same time, it may be handled by two different groups of acceptors, and then you have a conflict.



Sugu Sougoumarane

Aug 12, 2016, 3:14:26 PM
to raft-dev
If you can fully explain the use cases (how many nodes in the quorum, etc.) with regular Paxos, I can show you how each of them would work with the alternate scheme.

Oren Eini (Ayende Rahien)

Aug 12, 2016, 3:23:09 PM
to raft...@googlegroups.com
Consider a cluster with 9 nodes.
Majority: 5
Your proposal: 3

Imagine two actions:

Set("X", 1);

CompareAndSwap("my-lock", 1, 0);

How do you handle them?

Sugu Sougoumarane

Aug 12, 2016, 3:33:47 PM
to raft-dev
For N=9 and P=3, L will be 9-3+1=7.

So a leader will need 7 votes. When a new leader comes up, it's guaranteed to see any proposal that has reached 3 nodes, and will be forced to honor it.
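
To sanity-check that claim, here is a brute-force sketch (my own illustration, not from the blog) confirming that every 7-node election quorum intersects every 3-node proposal quorum when N=9:

    package main

    import (
        "fmt"
        "math/bits"
    )

    func main() {
        const n, l, p = 9, 7, 3
        ok := true
        // Enumerate node sets as 9-bit masks and compare every l-sized
        // election quorum against every p-sized proposal quorum.
        for e := 0; e < 1<<n; e++ {
            if bits.OnesCount(uint(e)) != l {
                continue
            }
            for c := 0; c < 1<<n; c++ {
                if bits.OnesCount(uint(c)) != p {
                    continue
                }
                if e&c == 0 { // disjoint quorums would let a proposal be missed
                    ok = false
                }
            }
        }
        fmt.Println("every quorum pair intersects:", ok) // true, since 7+3 > 9
    }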

Does this answer your question?

Oren Eini (Ayende Rahien)

Aug 12, 2016, 3:46:22 PM
to raft...@googlegroups.com
No, because the operations are conflicting.

Sugu Sougoumarane

Aug 12, 2016, 4:23:48 PM
to raft-dev
I think I don't understand the problem. Are you saying that the two proposals:
Set("X", 1);
and
CompareAndSwap("my-lock", 1, 0);

are conflicting, and only one of them must win? If so, can you show me the flow (and failure cases) that makes it work correctly for normal Paxos? Then we can see how it maps to the flexible scheme.

Heidi Howard

Aug 12, 2016, 5:04:01 PM
to raft...@googlegroups.com
Hi raft-dev team,

As some of you are aware, we have previously made such an observation and are working on a generalization of Paxos to apply it. 

Currently, we are busy preparing a paper for submission on the topic. It includes a formal specification and proof of safety, as well as a prototype implementation. We chose to use traditional Multi-Paxos for our implementation, but it would be interesting to apply this to Raft in the future. 

Our work is waiting to be checked and reviewed by various researchers in the field, so it is not quite ready for public release. However, I will make sure to post it to this list as soon as it is available.

If anyone else is interested in exploring this area further, please do get in touch.

--
Regards
Heidi

Diego Ongaro

Aug 12, 2016, 5:07:42 PM
to raft...@googlegroups.com
Heidi,

Just to clarify, who is "we"?

-Diego

Heidi Howard

Aug 12, 2016, 5:27:09 PM
to raft...@googlegroups.com
Diego,

Sorry for the confusion. 
I did not mean me and Sugu; I was referring to myself and the teams at VMware and Cambridge University.


John Ousterhout

Aug 12, 2016, 6:24:45 PM
to raft...@googlegroups.com
Here's the core of Sugu's proposal:

------------------------------------------------------------------------------

For standard Paxos, L and P are set to N/2+1, which makes (N/2+1)*2 > N. Instead, we can choose N and P such that P <= N, and L can be computed as N-P+1.

An example:

    N=11
    L=9
    P=3

In the above case, 9 voters have to agree for a server to be a leader, but only 3 voters are needed for a proposal to be final. This system will have the performance of a 5-node paxos, but the uptime of 11 nodes.

---------------------------------------------------------------------------------

I think this approach would work, but the system would only have the uptime of 5 nodes: it could tolerate 2 failures, but would become unavailable with 3 failures. Once 3 servers are down, there are only 8 up, so there's no way for leader election to complete. I suppose the system could keep operating as long as the leader doesn't crash....

Furthermore, if 3 nodes crash in a way that loses their stable storage, you could lose all the copies of committed data. In a traditional 11-node cluster, up to 5 nodes could lose their stable storage without permanent loss of committed data.

Presumably you eventually need to update all 11 nodes even with a commit quorum of 3, so the result is that this cluster will do as much work as an 11-node cluster, but will only achieve the reliability of a 5-node cluster.
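
Spelling out that arithmetic (a sketch with hypothetical names, not from the blog):

    package main

    import "fmt"

    // crashTolerance: the cluster stays available only while both
    // quorums can still form, so tolerance is the smaller of N-L and N-P.
    func crashTolerance(n, l, p int) int {
        t := n - l // crashes survivable for leader election
        if n-p < t {
            t = n - p // crashes survivable for proposal approval
        }
        return t
    }

    func main() {
        fmt.Println(crashTolerance(11, 9, 3)) // 2: the 5-node-cluster figure above
        fmt.Println(crashTolerance(11, 6, 6)) // 5: a traditional 11-node cluster
    }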

Is there something I am missing?

-John-

Maciej Smoleński

Aug 12, 2016, 7:49:00 PM
to raft...@googlegroups.com
Hi Sugu,

People choose L=C=ceil((N+1)/2) because it provides the best availability.

You want to be able to handle two situations when working with systems built on Paxos/Raft:
a) be able to elect a new leader
b) be able to commit the next entry

Consider that there are N servers, where L are required to elect a leader and C are required to commit a new entry.
For correctness, L + C > N is required.

To be able to elect a new leader, there must be L servers up.
To be able to commit the next entry, there must be C servers up.

High availability is the goal of systems built with Raft/Paxos.
To achieve high availability, we need to be prepared for a) and b) at any time.
For system availability, max(L,C) servers must be up, because this is required to handle both a) and b).

Requirements:
L + C > N
max(L,C) should be minimal - fewer servers required means more time of service availability

For availability it is best to set L=C=ceil((N+1)/2).
Setting L and C to other values degrades availability, as more servers are required to keep the service available than with L=C=ceil((N+1)/2).
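
A quick enumeration (an illustrative sketch, not production code) makes the argument concrete for N=9:

    package main

    import "fmt"

    func main() {
        const n = 9
        best, bestL, bestC := n+1, 0, 0
        // Try every safe (L, C) pair and keep the one minimizing max(L, C),
        // the number of servers that must be up for full availability.
        for l := 1; l <= n; l++ {
            for c := 1; c <= n; c++ {
                if l+c <= n {
                    continue // unsafe: quorums might not intersect
                }
                m := l
                if c > m {
                    m = c
                }
                if m < best {
                    best, bestL, bestC = m, l, c
                }
            }
        }
        fmt.Printf("L=%d C=%d max=%d\n", bestL, bestC, best) // L=5 C=5 max=5
    }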

Kind regards,
Maciej

Sugu Sougoumarane

Aug 12, 2016, 9:40:10 PM
to raft...@googlegroups.com
John & Maciej: Your understanding is correct. I covered the failure modes you're describing in the Caveats section, but I didn't elaborate too much because the blog was getting too long.

The term John used, uptime, I would call durability: if P=3 and 3 nodes crash, you can lose data. This is the same durability as a standard 5-node Paxos. In other words, write availability has to be traded off against durability. No choice there (CAP theorem).

My definition of uptime is orthogonal to durability. If you kept P at 3, an 11-node system would survive more failure modes than a 5-node system. The simple reason is that you're starting with 11 nodes, and you can keep iterating the quorum size down to N=5 as nodes fail. I called this subsetting in the Caveats section, and mentioned that human intervention is needed. But I think the system could subset itself under certain conditions; I just haven't thought through the mechanism.

The main advantage of this configuration is the number of nodes you can take down for maintenance without fear of jeopardizing the system.

The inspiration for high values of L and low P comes from how we run Vitess in production. In a way, it's a degenerate form of Paxos. Our masters have tens of replicas worldwide, but we set P=2. In our case, any replica can be promoted to master, but we obviously exclude some because they're not geographically convenient. With this kind of setup, we don't fear losing entire cells, and our software updates are bolder.



Dahlia Malkhi

Aug 12, 2016, 11:56:00 PM
to raft-dev
Hi everyone,

Indeed, Heidi approached me earlier this summer with an observation that subsumes this idea. I immediately saw it was correct and I think it is brilliant.

A manuscript describing the generalized scheme, with more far-reaching applications and use cases, is under preparation. We will shortly post a link to this list; stay tuned!

As always, there is a difficult balance to strike between making work open and available to the community as quickly as possible and ensuring that the work is correct, of high quality, and well explained. Heidi has already corresponded with various members of the community. Notwithstanding, this is one of those rare cases where a breakthrough idea might have occurred in two places, and we will definitely cite Sugu's post.

best,
-Dahlia

Sugu Sougoumarane

Aug 13, 2016, 4:20:28 AM
to raft-dev
For the record, Paxos is not my main area of work. It was more of a personal fascination, which is why I posted this on my personal blog rather than the Vitess blog.
It looks like Heidi has worked pretty hard on this, and for longer, so I'd rather focus on endorsing and supporting her work. It's refreshing to see doctoral research that's relevant to the industry.

Heidi Howard

Aug 24, 2016, 8:21:05 PM
to raft...@googlegroups.com
I am pleased to announce that our paper is now available online.

Title: Flexible Paxos: Quorum intersection revisited

Abstract: 

Distributed consensus is integral to modern distributed systems. The widely adopted Paxos algorithm uses two phases, each requiring majority agreement, to reliably reach consensus. In this paper, we demonstrate that Paxos, which lies at the foundation of many production systems, is conservative. Specifically, we observe that each of the phases of Paxos may use non-intersecting quorums. Majority quorums are not necessary as intersection is required only across phases.
Using this weakening of the requirements made in the original formulation, we propose Flexible Paxos, which generalizes over the Paxos algorithm to provide flexible quorums. We show that Flexible Paxos is safe, efficient and easy to utilize in existing distributed systems. We conclude by discussing the wide reaching implications of this result. Examples include improved availability from reducing the size of second phase quorums by one when the number of acceptors is even and utilizing small disjoint phase-2 quorums to speed up the steady-state.


I look forward to a lively discussion :)


--
Regards
Heidi

Sugu Sougoumarane

Sep 8, 2016, 1:51:12 PM
to raft-dev, hh...@cam.ac.uk
If any of you have an interest in MySQL, I've proposed how its semi-sync feature can be brought into FPaxos compliance: http://ssougou.blogspot.com/2016/09/distributed-durability-in-mysql.html.
In some respects, this was the original idea that led to the discovery of the Paxos generalization.

A noteworthy point in the article: pointing a replica at a master is equivalent to gaining a vote from that node in the leader election process.

Dahlia Malkhi

Sep 8, 2016, 2:21:58 PM
to 'Sugu Sougoumarane' via raft-dev
Nice blog post!

Does Orchestrator assign a new master a rank in order to be able to resolve conflicts?

-Dahlia

PS: you might want to refer to FPaxos as work by "Howard, Malkhi and Spiegelman", or by "Howard et al.", because to me, without taking away any credit from Heidi, it is disconcerting when a joint work is referred to as "X's work" (e.g., "Lamport's Vertical Paxos", which was really my work, argh).

Sugu Sougoumarane

Sep 8, 2016, 2:46:08 PM
to raft...@googlegroups.com
Sorry about that. I've fixed this in both of my posts.


Sugu Sougoumarane

Sep 8, 2016, 2:54:40 PM
to raft...@googlegroups.com
Forgot to answer the question about rank. I'm actually discussing that part with the author of Orchestrator. The current behavior is that the tool uses a different backend (Galera cluster) to keep track of the rank. It should be solvable without the need for such a backend.

If we come to a better solution, I can post it here.