Lock and leadership election in RAFT


andy maule

Dec 28, 2017, 6:56:47 PM
to raft-dev
Hi all, 

I'm trying to create a simple single-lock / leadership-election algorithm using RAFT. The goal is to guarantee that only a single node in a cluster can be performing an action at any given time, but that if any node fails, another will take over.

I can't simply use the RAFT leader, as there could be multiple nodes at the same time that think they're the leader. So this requires some kind of lock. I read this post: https://groups.google.com/forum/#!searchin/raft-dev/lock%7Csort:date/raft-dev/mVqo0FAAKU8/FSD_LRfqAQAJ and looked at alternative locking and leadership election algorithms such as the Apache Curator Zookeeper recipes. Using these I have tried to put some of the ideas into an algorithm using RAFT.

The goal is to have a single lock acquired by the leader. The basic algorithm is that the RAFT leader specifies a lock duration in ticks and appends it to the log. If the append to the log is successful, the leader has acquired the lock. All other nodes will wait for at least the lock time before checking again. The leader that holds the lock will try to renew it by appending a new lock entry after some percentage of the lock time has expired.

I can't see any issues with this at present, but I understand there are lots of edge cases, so any advice would be greatly appreciated.

Here is the pseudo-code, with more detail:

LOCK_DURATION_TICKS = 5   // 5 ticks * 2 s = 10 second locks
SINGLE_TICK_MS = 2000     // 2 seconds per tick
RENEW_THRESHOLD = 0.5     // try to renew the lock after 50% of the lock time has expired

function main()
  while true:
    log_entry = raft.latest_log_entry()
    last_lock_ticks = log_entry.lock_duration_ticks

    // wait for the previous lock to be guaranteed expired
    sleep(SINGLE_TICK_MS * last_lock_ticks)

    if raft.is_leader()
      message = {
        lock_duration_ticks: LOCK_DURATION_TICKS
      }

      lock_start_time = time.now
      lock_end_time   = lock_start_time + (message.lock_duration_ticks * SINGLE_TICK_MS)
      lock_renew_time = lock_start_time + (message.lock_duration_ticks * SINGLE_TICK_MS * RENEW_THRESHOLD)

      success = raft.append_message(message)

      if success
        // Everyone else has agreed to wait at least this amount of time before trying to lock again,
        // but the leader node can continue with the lock for at most this amount of time.
        OnLockAcquired()
      else
        continue

      while (lock_end_time > time.now && raft.is_leader()) {
        if (time.now > lock_renew_time)
          message = {
            lock_node_id: raft.current_node_id,
            lock_duration_ticks: LOCK_DURATION_TICKS
          }
          success = raft.append_message(message)
          if success
            OnLockRenewed()
            // extend the lock from the time of the successful renewal
            lock_end_time   = time.now + (LOCK_DURATION_TICKS * SINGLE_TICK_MS)
            lock_renew_time = time.now + (LOCK_DURATION_TICKS * SINGLE_TICK_MS * RENEW_THRESHOLD)
      }
      OnLockExpired()

function OnLockAcquired()
  // start some async process

function OnLockRenewed()
  // noop

function OnLockExpired()
  // stop the async processes

Again, any feedback or links to further reading that I might have missed are greatly appreciated.

Thanks

Andy


David B Murray

Dec 29, 2017, 10:07:15 AM
to raft...@googlegroups.com
https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html is a personal favorite on this subject. TL;DR: be very, very careful about depending on clocks for correctness guarantees.

The concept of "fencing" described in the post is the safe way to implement leases. This is more or less what Raft is doing with its election term: it's okay if two nodes think they are the leader simultaneously, because the one with the lower term won't be able to actually get anything done.
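
For illustration, here's a minimal sketch of fencing on the resource side (the names are made up, not from the article or any particular library): the resource remembers the highest token it has accepted and rejects anything older.

class FencedResource:
    """Toy storage service that only accepts writes carrying a fresh fencing token."""

    def __init__(self):
        self.highest_token_seen = 0
        self.value = None

    def write(self, token, value):
        # A delayed or paused ex-lock-holder still carries its old (lower) token,
        # so its late writes are rejected instead of corrupting the resource.
        if token < self.highest_token_seen:
            return False
        self.highest_token_seen = token
        self.value = value
        return True

resource = FencedResource()
resource.write(7, "written by the current holder")   # accepted
resource.write(3, "late write from a stale holder")  # rejected

In Raft terms, the election term (or the log index of the lock entry) can play the role of the token.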

-d


andy maule

Dec 29, 2017, 6:38:42 PM
to raft...@googlegroups.com
Thanks for the link David. Great article.

I think you're right; I'm missing some edge cases that can be solved with fencing. I'm updating the algorithm. The basic algorithm will now be: put a lock/fencing token on the log that signifies which Raft leader has the lock. If a node thinks it's the current leader, and the term matches the lock at the head of the Raft log, the node has the lock. There's one wrinkle, though: you have to deal with timeouts. The leader can only use the lock for the lease duration and will have to renew before that to maintain the lock. When the leader changes, we have to make sure to wait for the maximum lease duration, to give the old leader a chance to time out and clean up. I've tried updating the pseudocode for this, and I'll send an update when I've cleaned it up.
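
Roughly, the check I have in mind looks like this (a sketch; the raft handle and helpers are placeholders, not a particular library's API):

import time

MAX_LEASE_MS = 10_000   # illustrative; at least as long as any lease a leader may claim

def has_lock(is_leader, current_term, head_lock_entry_term):
    # A node holds the lock only if it believes it is leader AND the lock entry
    # at the head of the committed log was written in its own term.
    return is_leader and head_lock_entry_term == current_term

def on_elected(raft):
    # 'raft' is a placeholder for whatever handle the underlying library exposes.
    # Wait out the maximum lease duration so the previous holder has a chance to
    # time out and clean up before we append our own lock entry for this term.
    time.sleep(MAX_LEASE_MS / 1000)
    raft.append_message({"type": "lock", "term": raft.current_term()})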

Thanks

Andy



Archie Cobbs

Dec 29, 2017, 7:55:06 PM
to raft-dev
Dumb question... how is the functionality you're trying to get not already provided by Raft?

In other words, why wouldn't the following work...

First, set up a normal Raft cluster. Next, tweak the normal Raft state machine as follows:
* Do not transition from follower to candidate unless you "want to acquire the lock".
* If you are the leader, and no longer "want to acquire the lock", revert to follower, i.e. passively allow your leadership to time out.

Then whenever a node needs to determine whether it currently "has the lock" it does this:
* Node checks whether it is a Raft leader; if not, it does not have the lock
* Otherwise the node performs the operation as described in the dissertation section 6.4.1

The result of section 6.4.1 ("Processing read-only queries more efficiently") is to verify that the node is still leader and also to produce a "lease" timeout, such that the node knows that no other node can become the Raft leader prior to this timeout. So now the node has an exclusive lease on the "lock" until this timeout. If it needs more time, it just continues to be leader. When it is done, it unilaterally reverts to being a follower. Then later if/when some other node wants the lock, it can become the new leader, etc.
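
For concreteness, a rough sketch of those two tweaks plus the ownership check (the method names are hypothetical; real implementations expose candidacy and step-down differently):

import time

class LockingNode:
    # Thin wrapper around a hypothetical Raft node handle; method names are made up.

    def __init__(self, raft_node):
        self.raft = raft_node
        self.want_lock = False

    def may_start_election(self):
        # Tweak 1: never transition from follower to candidate unless we want the lock.
        return self.want_lock

    def release_lock(self):
        # Tweak 2: stop wanting the lock and passively let leadership time out.
        self.want_lock = False

    def holds_lock(self):
        # "Has the lock" only while leader AND a section 6.4.1 style check says the
        # leader lease timeout is still in the future.
        return self.raft.is_leader() and self.raft.leader_lease_timeout() > time.time()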

-Archie

jordan.h...@gmail.com

Dec 29, 2017, 7:57:06 PM
to raft...@googlegroups.com
Kleppmann's article is indeed great. I tend to point people there a lot for questions about distributed locking as well.

Atomix (http://github.com/atomix/atomix) does reactive distributed locking, leader election, etc. on top of Raft using algorithms very similar to ZooKeeper's. You can use polling, but what you need to make this as efficient as ZK is some way of notifying clients of changes inside a state machine while still preserving strong consistency guarantees. ZooKeeper does this with its watches. To do it in Atomix, we expose Raft sessions to the state machine and built a session-events feature that allows state machines to send events to sessions.

For example, here’s the lock state machine (in Java):

Session objects are queued when a lock is already held, and once it’s released or the owning session expires, the next session in the queue is sent an event to notify it that it acquired the lock. Clients never poll, they’re just notified when interesting changes occur in the state machine.
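
Roughly, the behaviour is this (a simplified sketch with made-up names, not the actual Atomix source):

from collections import deque

class LockStateMachine:
    """One holder at a time; waiting sessions are queued and notified by event."""

    def __init__(self, publish_event):
        self.publish_event = publish_event   # callback(session_id, event_name)
        self.holder = None
        self.queue = deque()

    def lock(self, session_id):
        if self.holder is None:
            self.holder = session_id
            self.publish_event(session_id, "LOCKED")
        else:
            self.queue.append(session_id)

    def unlock(self, session_id):
        if self.holder == session_id:
            self._grant_next()

    def session_expired(self, session_id):
        # An expired holder implicitly releases; expired waiters just drop out.
        if self.holder == session_id:
            self._grant_next()
        elif session_id in self.queue:
            self.queue.remove(session_id)

    def _grant_next(self):
        self.holder = self.queue.popleft() if self.queue else None
        if self.holder is not None:
            self.publish_event(self.holder, "LOCKED")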

Most of this is simply a matter of implementing Raft sessions. But depending on the consistency guarantees you'd like to achieve, implementing session events can add significant complexity. We provide a sequential guarantee for session events. That is, so long as a client's session is not expired, it will see all events in the order in which they occurred in the state machine. To do that, we send each batch of events with the previous event index and the current event index. The client then either waits for the previous index if it hasn't been received yet or handles the event immediately. We also use client-provided sequence numbers to provide a consistent sequential ordering for both command responses and events. That is, if a command at index 1 created an event, the client is guaranteed to see that event before it sees a response for any index > 1. This greatly simplifies reasoning about state changes in client-side primitive APIs, especially for distributed locks and leader elections. In case events are lost on their way to the clients, periodic session keep-alives also include the highest event index received by the client, and events beyond that index are held in memory on _all_ nodes and resent on keep-alives.
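
On the client side, the reordering amounts to something like this (again a sketch, with hypothetical names):

class SessionEventSequencer:
    """Delivers event batches to the application in state-machine order, using the
    previous-index/current-index pair attached to each batch."""

    def __init__(self, deliver):
        self.deliver = deliver        # callback(events) for in-order delivery
        self.last_index = 0           # index of the last batch handed to the app
        self.waiting = {}             # prev_index -> (index, events) held until in order

    def on_event_batch(self, prev_index, index, events):
        if prev_index != self.last_index:
            # Predecessor not seen yet; hold this batch until it arrives.
            self.waiting[prev_index] = (index, events)
            return
        self.deliver(events)
        self.last_index = index
        # Flush any batches that were waiting on the one just delivered.
        while self.last_index in self.waiting:
            index, events = self.waiting.pop(self.last_index)
            self.deliver(events)
            self.last_index = index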

But the last point can complicate snapshots a bit. Events can be re-generated from individual commands in the Raft log, but not from a snapshot unless the events themselves are also stored in the snapshot. So to ensure that nodes always have unacknowledged events in memory, you must either store pending events in the snapshots or wait for events that occurred prior to the snapshot to be acknowledged before “persisting” it. We take the latter approach and flag a stored snapshot as “complete” only once all pending events have been acknowledged.
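
As a sketch, that rule is essentially (hypothetical names):

def snapshot_is_complete(snapshot_index, lowest_unacked_event_index):
    # The snapshot can only be relied on once every event generated at or before
    # its index has been acknowledged, since those events can no longer be
    # regenerated from the (soon to be compacted) log.
    return lowest_unacked_event_index > snapshot_index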

Anyways, Atomix is a wealth of information on using Raft to do ZooKeeper style distributed coordination at a higher level. I’m always happy to share the low level details of how it’s implemented. We use Raft in Atomix successfully for leader election, distributed locking, sharding, fault-tolerant cross-shard transactions, primary-backup replication, persistent messaging, and general coordination. We’ve put a lot of time and effort into refactoring its algorithms for efficiency and scalability.

andy maule

Dec 30, 2017, 4:08:51 AM
to raft...@googlegroups.com
Hi Archie,

Not a dumb question at all; in fact, this seems like a great suggestion. My intuition tells me that there is probably a much simpler way to do this by relying on the Raft protocol more, so I hope you're correct. I originally had an algorithm that just relied on the Raft leadership, but as far as I'm aware, there is a possibility of more than one node becoming leader at a given time, for example in a network partition (see the last slides of this presentation, for example: http://thesecretlivesofdata.com/raft/).

As for section 6.4.1 and basing read-only queries on leasing, it sounds like this might require alterations to most Raft implementations? I think the algorithm I'm proposing is very similar to this, but after reading sections 6.3 and 6.4 more closely, I think I'd made some incorrect assumptions, and I can probably rely on the Raft protocol much more. In particular, the linearizability guarantees of section 6.3 might really help. High performance is not an issue, so putting read queries on the Raft log instead of using leasing might simplify this a lot.

Thanks for the suggestion! I'll see if I can take this into account and come up with a simplified solution.

Andy



andy maule

Dec 30, 2017, 4:19:44 AM
to raft...@googlegroups.com
Thanks Jordan, using Raft sessions in the state machine seems like a really good idea. I'll dig into the Atomix implementation and take a look. However, I assume this is going to be quite tricky to implement on top of a standard Raft implementation such as etcd/hashicorp?

There's a lot for me to look into in the Atomix code; it seems like a great project. I see you have TLA+ checks and Jepsen tests, nice! Sounds like quite a reliable implementation. I'm going to have to dig into it a bit more. How mature is the Atomix project, and is it being used a lot in production? I've been looking more at hashicorp and etcd, but this might be a good alternative.

Thanks,

Andy

Archie Cobbs

Dec 30, 2017, 1:03:17 PM
to raft-dev
Hi Andy,


On Saturday, December 30, 2017 at 3:08:51 AM UTC-6, andy maule wrote:
I originally had an algorithm that just relied on the Raft leadership, but as far as I'm aware, there is a possibility of more than one node becoming leader at a given time, for example in a network partition (see the last slides of this presentation, for example: http://thesecretlivesofdata.com/raft/).

You're correct that it's possible for more than one node to be in the "leader" state at the same time. Being in the leader state doesn't really prove anything. It only means that at some point in the past the node won an election.

So one has to be careful when referring to "the" leader, because that's not always well defined. However a node X can sometimes safely conclude that at the current moment no other leader node Y could have possibly appended anything new to the log that (a) X doesn't know about already and (b) is/will eventually be committed. In other words, X can assume that its top log entry is current, IF it is already - or eventually - committed.

When can X assume this? When X is in the leader state AND it has calculated its "leader lease timeout" to be sometime in the future.
 
As for section 6.4.1 and basing read-only queries on leasing, it sounds like this might require alterations to most raft implementations?

The alteration is very easy. Leaders simply keep an up-to-date calculation of their "leader lease timeout", which is the time in the past at which the leader sent AppendEntries to a majority of followers (minus one for the leader) who have since responded, plus the minimum election timeout, minus the maximum possible clock drift. This assumes the maximum clock drift is known/bounded and accounted for, and that all nodes have the same minimum election timeout (if not, use the minimum of the minimums).
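
As a sketch of that bookkeeping (names are made up; all times are in the same units; assumes a cluster of at least two nodes):

def leader_lease_timeout(acked_send_times, cluster_size,
                         min_election_timeout, max_clock_drift):
    """acked_send_times: for each follower that has responded, the send time of the
    latest AppendEntries it has acknowledged. Returns the time before which no other
    node can become leader, or None if no majority has acknowledged anything yet."""
    needed = (cluster_size // 2 + 1) - 1   # majority, minus one for the leader itself
    if len(acked_send_times) < needed:
        return None
    # Newest send time covered by acknowledgements from `needed` followers.
    send_time = sorted(acked_send_times, reverse=True)[needed - 1]
    return send_time + min_election_timeout - max_clock_drift

The leader just recomputes this whenever AppendEntries responses arrive; while the result is in the future, it can serve the read-only ownership check locally.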

In practice, assuming (a) functioning network, (b) nodes that respond promptly, and (c) heartbeat interval much less than minimum election timeout, then the leader lease timeout will almost always be in the future.

The net effect is: if the leader lease timeout is in the future and the top log entry is already committed, then the leader can perform an immediate read-only transaction without any communication. If the top entry is not already committed, then the transaction simply has to wait until it is.

In your application you could have each log entry say, in effect, "Node X can assume it owns the lock as long as it is still leader in term T". Then when node X wins an election and becomes leader of term T, the first thing it does is commit such an entry. As long as node X is leader, and a read-only transaction confirms this log entry is still the last log entry, node X can safely assume it owns the lock. When it no longer needs the lock, it simply relinquishes leadership.

Note that this only enables node X to confirm that it currently "owns the lock". Any other, non-leader node Y can perform the same read-only transaction, and that will tell Y that node X did in fact "own the lock" at some point between the time the transaction started and completed (per linearizability), but it does not allow Y to conclude that node X still owns the lock when the transaction completes. I don't think that's one of your requirements, but I mention it for completeness.

-Archie

jordan.h...@gmail.com

Dec 31, 2017, 9:04:08 PM
to raft...@googlegroups.com
Good stuff Archie!

Indeed, all of these approaches will have the problem of multiple nodes potentially *believing* themselves to hold the lock at the same time, because that's just an unavoidable aspect of asynchronous networks. Atomix clients can be partitioned and their sessions expired while they hold a lock, without the client ever being notified. We add APIs for clients to listen for communication failures, and clients assume their locks are released when they can't communicate with a leader and their session timeout has expired, but that's still just as imperfect as leader leases are (both can still suffer from clock skew and GC pauses). All of these approaches should use Raft log indexes to provide some semblance of ordering of locks across nodes (back to the fencing tokens in Kleppmann's article).

As for the client-centric approach I described, I suspect it would indeed be difficult to modify an existing Raft implementation to do this because of the back and forth communication needed between client and server to handle leader changes and what not. Archie’s method seems entirely possible to do in most moderately extensible existing implementations.

andy maule

Jan 2, 2018, 2:12:29 PM
to raft...@googlegroups.com
Thanks Jordan, Thanks Archie,

It sounds like the approach Archie suggests, without clocks, is likely to be "good enough" in general, and that the lease approach or the Atomix approach is the same but with better performance. It also seems Jordan is correct that there will always be some cases that aren't fully accounted for, and that relying on the log much more and using fencing is the way to account for them, as David mentioned in the first response.

Really appreciate you both taking the time to write such detailed answers. I think I have the answers to my original question, but the clarifications on the semantics of Raft have really helped, and I might update the design to use the log more and the lock less, although this will take some thought and iteration on the design. It seems there is a consistency/latency trade-off and a simplicity/correctness-safety trade-off.

Thanks for the discussion!

Andy


Martin Furmanski

Jan 30, 2018, 3:44:03 PM
to raft-dev
First of all, a little bit on strategy when dealing with consensus. Only deal with the basic primitives that you have at hand, which are the sequential log, the client, the state machine and commands. Whenever your solution needs to reason about who the leader is, you are bound to be doing the wrong thing. The only thing you should have to reason about is the operations in the sequential log, which you know are completely ordered.

Here is a rough sketch of how a solution working only with the basic primitives could look:

LeaseRequest (command)
    seq_no
    owner
    start_time
    lease_time

LeaseSM (state machine, applied for every committed LeaseRequest)
    // ignore requests that were not based on the latest applied state (stale sequence number)
    if ( seq_no != sm_seq_no + 1 )
    {
        return;
    }

    sm_seq_no = seq_no;

    if ( owner != me )
    {
        // assume worst case for other lease span (longest span)
        sm_lease_tmo = max( sm_lease_tmo, current_time() + lease_time );
    }
    else
    {
        // assume worst case for own lease span (from time request was sent)
        sm_lease_tmo = start_time + lease_time;
    }

    sm_owner = owner;

Client (lease monitor and potentially holder)
    if ( sm_owner == me )
    {
        if ( current_time() + REAFFIRM_DELTA > sm_lease_tmo )
        {
            // lease is close to expiring; resend request to reaffirm it
            raft.send( new LeaseRequest( sm_seq_no + 1, me, current_time(), LEASE_TIME ) );
        }
    }
    else if ( sm_lease_tmo < current_time() )
    {
        // nobody seems to hold an active lease, send request for lease
        raft.send( new LeaseRequest( sm_seq_no + 1, me, current_time(), LEASE_TIME ) );
    }

The client code above can be interpreted as running continuously, but of course a real implementation should set up timers.

What is happening above is that the client is monitoring the state machine for an active current lease, and if it detects that no current lease is active, it sends a request for a lease, making sure to base it upon the latest information by using a sequence number, basically functioning as a fence, barrier, or whatever you want to call it. The request encodes the start_time, which is guaranteed to be earlier than any time at which any other consensus group member is able to observe your lease request. When another member observes your request, they assume that it has the worst-case validity from their perspective: the full time of the lease (lease_time) from the time they processed the message. Those members will respect the expiry of that full worst-case timeout before attempting to request a lease of their own. This should guarantee that there never is an overlap in leases. Note that other members do not care about the start_time; it is merely a convenient way for the client to keep track of the time it sent its own lease request and, on a successful lease request, to calculate its worst-case allowed validity from that point in time.

I am sure I have made some mistake somewhere and at the very least forgotten about some edge case, but this should give you a general idea of how a proper solution can be created on top of Raft and the Lamport state machine approach, rather than by trying to dig deep into Raft and "hooking into" some internal aspects of it.