Improving weak read consistency in raft

Alex Bligh

unread,

Apr 11, 2016, 5:33:31 AM4/11/16

to raft-dev, Alex Bligh, philip...@yahoo.com

This was inspired by Philip's post on rqlite.

I'll post a snippet from his documentation, though this has wider applicability to raft as it sets out the problem space admirably.

> Read Consistency
>
> Even though serving queries does not require consensus (because the database is not changed), queries should generally be served by the leader. Why is this? Because without this check queries on a node could return out-of-date results. This could happen for one of two reasons:
>
> • The node, which still part of the cluster, has fallen behind the leader.
> • The node is no longer part of the cluster, and has stopped receiving Raft log updates.
> This is why rqlite offers read consistency levels of none, weak, and strong. Each is explained below.
>
> With none, the node simply queries its local SQLite file, and does not even check if it is leader. This offers the fastest query response, but suffers from the problems listed above. Weak instructs the node to check that it is the leader, before querying the local SQLite file. Checking leader state only involves checking local state, so is still very fast. There is, however, a very small window of time (milliseconds by default) during which the node may return stale data. This is because after the leader check, but before the local SQLite file is read, another node could be elected leader. As result the node may not be up-to-date with the rest of cluster.

Here's my suggestion to improve weak consistency of reads:

1. Check if you are the leader; if not query the leader instead, and stop.
2. Do the query locally.
3. Check if you are still the leader. If not, go to 1.

The addition of step 3 is what I am proposing. This improves matters if the query itself may take a substantial amount of time. As the leader check is almost free, it is (I think) pretty much without drawbacks. In order for the query to be permissible, we know it must not have side effects. Therefore it is safe to issue it more than once.

This still is not completely safe for the following reason. Philip's phrase 'another node could be elected leader' is not the 100% opposite of "the node in question thinks it is the leader). The reasons are as follows:

* At least in theory, in a network partition situation where the leader is in a partition that does not have a quorum of nodes, the other nodes might elect themselves a leader before the current leader notices it cannot contact the rest of the quorum. In some raft implementations this should not happen as the leader should step down more quickly than the election timeout is triggered (hashicorp/raft does this for instance). However, one cannot rely on this because raft does not permit us to rely on absence of clock skew. In practice, the leader's clock would need to be running at less than half the speed of the followers over a short period, so I think this flaw is unlikely.

* In a network partition scenario where the leader is in a partition that does not have a quorum of nodes, at the time of the query no one might have noticed the network is partitioned. Between the point of partition and the point of the query concerned, another write might have arrived on the other side of the partition. This write cannot yet have been processed (or the other side of the partition would have selected a leader and we'd be in the above position). However, the write won't have been replied to yet, so this is no great a disordering of reads/writes than could have happened if the write had been sent to any other non-leader and forwarded during normal operation.

WDYT?

--
Alex Bligh

Oren Eini (Ayende Rahien)

unread,

Apr 11, 2016, 5:47:35 AM4/11/16

to raft...@googlegroups.com, Alex Bligh, philip...@yahoo.com

A strong consistency for reads is outline in the spec. A read reply from the leader will only be answered after a quorum of heartbeats has confirmed that we are still the leader.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--
Alex Bligh

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex Bligh

unread,

Apr 11, 2016, 5:53:04 AM4/11/16

to Oren Eini (Ayende Rahien), Alex Bligh, raft...@googlegroups.com, philip...@yahoo.com

On 11 Apr 2016, at 10:47, Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:

> A strong consistency for reads is outline in the spec. A read reply from the leader will only be answered after a quorum of heartbeats has confirmed that we are still the leader.

Indeed, but that means waiting for a quorum of heartbeats.

My suggestion involved no waiting at all. It isn't perfect, but better than the weak consistency mode as is.

--
Alex Bligh

Archie Cobbs

unread,

Apr 11, 2016, 11:27:34 AM4/11/16

to raft-dev, al...@alex.org.uk, philip...@yahoo.com

On Monday, April 11, 2016 at 4:33:31 AM UTC-5, Alex Bligh wrote:

Here's my suggestion to improve weak consistency of reads:

1. Check if you are the leader; if not query the leader instead, and stop.
2. Do the query locally.
3. Check if you are still the leader. If not, go to 1.

This "lease" idea from the dissertation is a way to provide fully consistency reads efficiently.

First of all, we must be able to assume (a) clock slew is bounded by some value X%, and (b) all nodes are configured with the same election timeout minimum M. If so, then we can build into our protocol a simple mechanism to allow a leader to ask the question "Am I still the leader at this moment?" and with high likelihood be able to give an immediate answer of "yes" without any network communication.

Do this by having each message from Leader to Follower include the leader's current timestamp. Follower replies include a copy of this timestamp. Now the leader knows how "up-to-date" each follower is. In particular, for some easily calculated timestamp value X, the leader knows a majority of the cluster (including itself) is up-to-date as of leader time X.

The leader can then calculate its "lease timeout" LT, which is a lower bound on the earliest time at which it is possible for another leader to be elected: LT = X + M * (1.0 - X%).

During normal operation with regular heartbeats, the lease timeout will always be greater than the current time. So fully consistent reads are possible immediately on the leader, or via a single round trip communication from follower to leader asking the question "Is my latest entry also your latest entry?". In either case, if that last entry is not committed (in the Raft sense), the transaction simply waits for that to occur.

RaftKVDatabase has three other weaker consistency levels for read-only transactions besides LINEARIZABLE: these are UNCOMMITTED, EVENTUAL, and EVENTUAL_COMMITTED. These weaken various requirements in exchange for less network communication. In particular, the EVENTUAL_COMMITTED is useful in situations where the network cannot be relied upon at all (e.g., a read-only transaction when you are in a minority network partition).

See Consistency Javadoc for details.

-Archie

Archie Cobbs

unread,

Apr 11, 2016, 11:32:32 AM4/11/16

to raft-dev, al...@alex.org.uk, philip...@yahoo.com

Minor correction...

On Monday, April 11, 2016 at 10:27:34 AM UTC-5, Archie Cobbs wrote:

First of all, we must be able to assume (a) clock slew is bounded by some value X%

"Clock slew" is probably the wrong term here .. I should have said "clock drift".

I.e., what you have to bound is the difference in the rates at which the clocks run, not their difference in actual time readings. Clock drift should be very small (less than 1%).

-Archie

Reply all

Reply to author

Forward