
Is this type of read really linearizable?


Philip O'Toole

Oct 3, 2024, 8:15:14 AM
to raft...@googlegroups.com
From section 8 of the Raft paper, on avoiding returning stale data:

Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. 

However, isn't there still a possibility of returning stale data? If, after exchanging Heartbeats, but before the read is initiated, the Leader is partitioned from the cluster, a new Leader is elected by the rest of the cluster, and data is written to the cluster, the "original" Leader will then read its copy of the data -- and return stale data. Am I missing something here?

BTW, immediately after the section above the paper states this:

Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).

But that implies that the first approach does not rely on timing. Or is it implicit in the first approach that the Leader can be sure there won't be a new Leader elected until the Election Timeout expires, so there is a window of time immediately after the heartbeat exchange in which it can "safely" do a read? But a node could be running slow (inside a paused VM, for example), and its sense of the passage of time may not match that of other nodes.

Philip

milan...@axoniq.io

Oct 3, 2024, 8:28:20 AM
to raft-dev
Hi Philip,

To avoid stale reads, you must replicate the read request like any other request (ensuring a majority of the nodes confirm it). This approach ensures that you get the most recent data (not only the leader's view of the data).
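
For concreteness, here is a minimal sketch of that approach in Go. Everything in it (Store, RaftNode, FSM, the method names) is a hypothetical placeholder rather than any particular library's API: the read is proposed through the log like a write, and the answer is produced only once the read entry itself has been applied.

    package store

    import "time"

    // Hypothetical placeholder types, not any real library's API.
    type Result struct{ Rows []string }

    type RaftNode interface {
        // Propose replicates data as a log entry and returns its index
        // once a majority of the cluster has committed it.
        Propose(data []byte) (index uint64, err error)
    }

    type FSM interface {
        // WaitApplied blocks until the state machine has applied the log
        // through the given index, then evaluates the query against it.
        WaitApplied(index uint64, query string, timeout time.Duration) (Result, error)
    }

    type Store struct {
        raft RaftNode
        fsm  FSM
    }

    // StrongRead sends the read through the log like a write. Raft then
    // orders it against every other operation, so the reply reflects all
    // writes committed before the read entry.
    func (s *Store) StrongRead(query string) (Result, error) {
        idx, err := s.raft.Propose([]byte(query)) // majority commit
        if err != nil {
            return Result{}, err
        }
        return s.fsm.WaitApplied(idx, query, 5*time.Second)
    }

The cost is a full consensus round (and a log write) per read, which is why this mode is usually the slowest one a system offers.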

Cheers,
Milan

Jordan Halterman

Oct 3, 2024, 8:45:12 AM
to raft...@googlegroups.com
I don’t think you’re missing anything. You’re absolutely right that the heartbeat needs to happen after reading the state for reads to be linearizable. But that’s what the paper is saying. It says to exchange heartbeats with a majority of the cluster before *responding* to read-only requests, and that means after reading the leader’s state machine but before sending the response to the client. 
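
To make that ordering concrete, a minimal sketch (readStateMachine and heartbeatQuorum are hypothetical helpers, not any specific library's API):

    package sketch

    type Result struct{ Rows []string }

    // Hypothetical helpers standing in for the real implementation.
    var readStateMachine func(query string) (Result, error)
    var heartbeatQuorum func() error

    func handleRead(query string) (Result, error) {
        // 1. Read the leader's state machine first.
        res, err := readStateMachine(query)
        if err != nil {
            return Result{}, err
        }
        // 2. Exchange heartbeats with a majority *before* responding.
        //    If this fails, the value read in step 1 may be stale and
        //    must not be returned to the client.
        if err := heartbeatQuorum(); err != nil {
            return Result{}, err
        }
        // 3. Only now send the response.
        return res, nil
    }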

Jordan Halterman  


Philip O'Toole

Oct 3, 2024, 9:07:52 AM
to raft...@googlegroups.com
On Thu, Oct 3, 2024 at 8:28 AM 'milan...@axoniq.io' via raft-dev <raft...@googlegroups.com> wrote:
Hi Philip,

To avoid stale reads, you must replicate the read request as any other request (ensuring the majority of the nodes confirm). This approach ensures that you get the most recent data (not only the leader's view of the data).

Thanks Milan. 

Got it, yeah, I knew about that mode, which definitely makes sense to me. rqlite (the distributed database I maintain) already offers that form: https://rqlite.io/docs/api/read-consistency/#strong

Of course, these kinds of reads are slow, but they offer the strongest consistency guarantees.

Thanks,

Philip

Philip O'Toole

Oct 3, 2024, 9:19:18 AM
to raft...@googlegroups.com
Thanks Jordan, one response inline.

On Thu, Oct 3, 2024 at 8:45 AM Jordan Halterman <jordan.h...@gmail.com> wrote:
I don’t think you’re missing anything. You’re absolutely right that the heartbeat needs to happen after reading the state for reads to be linearizable. But that’s what the paper is saying. It says to exchange heartbeats with a majority of the cluster before *responding* to read-only requests, and that means after reading the leader’s state machine but before sending the response to the client. 

I guess there is some ambiguity in the paper. "Responding" could mean -- as you say -- reading the data, doing the heartbeat, then sending the data to the client. However, I think many folks (including me) interpreted it as "exchange the heartbeats, then do the read, then return the data to the client".

However, I'm still not 100% convinced. :-) Couldn't this happen?
  1. Leader receives read request
  2. Leader reads state machine
  3. Leader exchanges heartbeats with majority of nodes successfully
  4. Leader responds to the client with the data it read in step #2.
However, if the Leader is partitioned between 3 and 4, and new data is written to the cluster, we're back to sending stale data to the client. In fact, an entire leader-election cycle could have taken place between steps 2 and 3. The point is that the sequence above is not atomic, and hence can't offer 100% linearizable reads.

As Milan said earlier, it seems to me the *only* way to be 100% certain the Leader doesn't send stale data to the client is to replicate the read. However, the Raft paper doesn't seem to make this point clearly.

(In practical systems you may not need 100% certainty, and this is a theoretical discussion for most people, but I'm considering adding a new type of read request to rqlite, and I want to ensure my technical documentation is absolutely correct.)

Philip
 

Archie Cobbs

Oct 3, 2024, 9:54:06 AM
to raft...@googlegroups.com
On Thu, Oct 3, 2024 at 8:19 AM 'Philip O'Toole' via raft-dev <raft...@googlegroups.com> wrote:
However, I'm still not 100% convinced. :-) Couldn't this happen?
  1. Leader receives read request
  2. Leader reads state machine
  3. Leader exchanges heartbeats with majority of nodes successfully
  4. Leader responds to the client with the data it read in step #2.
It's helpful to remember just what "linearizable" means (this is my understanding anyway)....

First you have to say that each "transaction" i has a starting timestamp Sᵢ (open) and an ending timestamp Eᵢ (commit) - in a distributed network, there is no such thing as an instantaneous transaction.

Then "linearizable" means there exists an ordered sequence of timestamps Tᵢ such that (a) Sᵢ ≤ Tᵢ ≤ Eᵢ and (b) regarding the state of the system, transaction i "appears to happen" (i.e., all of its reads and writes take place atomically) at instant Tᵢ.

So when the leader receives a request, that obviously happens after Sᵢ. If the leader then wants to "snapshot" the state of the system at that time, and that snapshot can be proven to be up-to-date at that time (via Raft), then it doesn't matter what happens afterwards; the requirements of linearizability have already been satisfied. In particular, it doesn't matter how long it takes the leader to reply. This of course assumes the leader doesn't handle some other transaction differently (out of order) in a way that would violate requirement (b).
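
To make that concrete with made-up timestamps: suppose a read R opens at S = 1, the leader reads its state machine at T = 2, and a heartbeat round finishing at time 3 proves the snapshot was current at T. Even if the reply doesn't reach the client until E = 50, choosing T = 2 satisfies S ≤ T ≤ E. A write W that begins at S(W) = 10 must be assigned T(W) ≥ 10 > 2, so it is ordered after the read, and the client simply observes a value that was current at some instant during its own request. That is all linearizability demands.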

-Archie

--
Archie L. Cobbs

Free Ekanayaka

Oct 3, 2024, 10:24:33 AM
to 'Philip O'Toole' via raft-dev
Hello Philip,

"'Philip O'Toole' via raft-dev" <raft...@googlegroups.com> writes:

> Thanks Jordan, one response inline.
>
> On Thu, Oct 3, 2024 at 8:45 AM Jordan Halterman <jordan.h...@gmail.com>
> wrote:
>
>> I don’t think you’re missing anything. You’re absolutely right that the
>> heartbeat needs to happen after reading the state for reads to be
>> linearizable. But that’s what the paper is saying. It says to exchange
>> heartbeats with a majority of the cluster before *responding* to read-only
>> requests, and that means after reading the leader’s state machine but
>> before sending the response to the client.
>>
>
> I guess there is some ambiguity in the paper. "Responding" could mean -- as
> you say -- reading the data, doing the heartbeat, then sending the data to
> the client. However I think many folks (including me) interpreted it as
> "exchange the heartbeats, then do the read, and return the data to the
> client".
>
> However, I'm still not 100% convinced. :-) Couldn't this happen?
>
> 1. Leader receives read request
> 2. Leader reads state machine
> 3. Leader exchanges heartbeats with majority of nodes successfully
> 4. Leader responds to the client with the data it read in step #2.
>
> However if the Leader is partitioned between 3 and 4, and new data is
> written to the cluster, we're back sending stale data to the client. In
> fact an entire leader-election cycle could also have taken place between
> step 2 and 3. The point is that the sequence above is not atomic, hence
> can't offer 100% linearizable reads.

I think there is a subtle confusion here about the meaning of linearizability.

There are multiple ways of defining it, some more formal than others; personally, I find this definition very good and intuitive:

https://jepsen.io/consistency/models/linearizable

"Linearizability [...] implies that every operation appears to take
place atomically, in some order, consistent with the real-time ordering
of those operations: e.g., if operation A completes before operation B
begins, then B should logically take effect after A."

The main point here is the relationship between logical and real-time ordering, which is pretty much what the guarantee provided by linearizability is all about.

According to this definition, I believe that for a read A to be linearizable it is sufficient that its *real time* completion happens before a subsequent write B starts.

In step #2 you can consider the read complete, from a state machine
point of view (it does not matter when the client will receive the
reply). If step #3 completes successfully, it means that no other write
could have happened between #2 and #3, which is enough to guarantee
linearizability.

At least, that's my understanding. I'm sure Diego and/or Kyle (from jepsen.io) could provide a definitive answer on the topic.

Free

Philip O'Toole

Oct 7, 2024, 11:38:38 AM
to raft...@googlegroups.com
Thanks Free -- one response inline.

I'm not sure this is true. Nothing says that execution of the steps above can't be paused at any moment, for an indeterminate amount of time.

What is stopping the "leader" from being deposed, another write happening, the "old leader" rejoining the cluster, the "new leader" being deposed, and the old leader being re-elected -- all between steps #2 and #3? Obviously this is highly unlikely in practice, but it is surely possible in theory -- and we're arguing theory here.

Let me know what you think.

Philip

Free Ekanayaka

Oct 8, 2024, 10:14:09 AM
to 'Philip O'Toole' via raft-dev
I would not consider step #3 "successful" in that case, in the sense that it would not be a successful heartbeat *in the same term*: the old, deposed leader would have needed to increase its term to win the election, and at that point it also needs to commit a new blank entry in order to learn the current commit index, before replying to any client request.

It would need to send an error to the client for the original read
request.

Free

Philip O'Toole

Oct 8, 2024, 11:08:10 AM
to raft...@googlegroups.com
Got it -- that sounds reasonable, and had occurred to me too. Let me think about this some more.

Philip




Free Ekanayaka

Oct 8, 2024, 12:04:41 PM
to 'Philip O'Toole' via raft-dev
"'Philip O'Toole' via raft-dev" <raft...@googlegroups.com> writes:

> Got it -- that sounds reasonable, and had occurred to me too. Let me think
> about this some more.

The way I think of this heartbeat is as a lightweight "commit". A regular commit is basically a consensus write, but since this is a read-only request there is nothing to write; only the consensus is needed, and that is what the heartbeat provides.

If you think of it that way, then all the implications of the scenario you mentioned (leader deposed and re-elected) are pretty much the same as those of a regular commit request.

Free

Philip O'Toole

Oct 8, 2024, 12:45:37 PM
to raft-dev

OK, I'm convinced -- and this type of read is something we should add to rqlite.


Thanks,

Philip

Diego Ongaro

Oct 14, 2024, 8:12:06 PM
to raft...@googlegroups.com
Hi all,

I'm a bit late to chime in here. I agree with what Archie and Free wrote above. I just wanted to add that some more detailed steps for processing read-only requests while preserving linearizability are in my dissertation ( https://github.com/ongardie/dissertation ), section 6.4 "Processing read-only queries more efficiently". I'll highlight two things from there:

1. The leader's commit index isn't necessarily up-to-date until it has committed one of its own entries, so it's critical to wait for that to happen. You might want to review your code to make sure.

2. My dissertation suggests reading the current commit index first, then doing the heartbeats, then waiting for the state machine to catch up through that index, then doing the read against the state machine. (This is similar to what's outlined on the concurrent thread: https://groups.google.com/g/raft-dev/c/epxGXp461qc .) Assuming the state machine's applied index can lag behind Raft's commit index, this sequence helps reduce latency, as the state machine can catch up while the heartbeats are taking place.

It's a minor and fairly obvious optimization: because you can't simply query a state machine that can lag arbitrarily, you have to wait for it to get up to some index and then query it, and you might as well delay waiting for things until you need them. LogCabin's code structure encourages (requires?) this optimization. The Raft module has a method to get and confirm the latest commit index, including doing the heartbeats (and it does the appropriate Raft-level synchronization). Waiting for the state machine is the next logical step (done outside the Raft module, without Raft locks).

BTW, the index at which to do the read is a lower bound: it's OK and probably preferred to reply with fresher results if the state machine has made more progress past that index by the time you query it.
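
Put together, the steps might look something like this in Go (the node interface and all of its methods are hypothetical placeholders; in practice this sequencing lives inside the Raft implementation):

    package sketch

    type Result struct{ Rows []string }

    // node is a hypothetical view of a Raft server's internals.
    type node interface {
        waitOwnTermEntryCommitted() error // the leader's no-op entry
        commitIndex() uint64
        currentTerm() uint64
        heartbeatMajority(term uint64) error // acks must match term
        waitApplied(index uint64) error
        queryStateMachine(query string) (Result, error)
    }

    func linearizableRead(n node, query string) (Result, error) {
        // 0. A new leader's commit index is not trustworthy until an
        //    entry from its own term has committed.
        if err := n.waitOwnTermEntryCommitted(); err != nil {
            return Result{}, err
        }
        // 1. Capture readIndex and the current term together.
        readIndex, term := n.commitIndex(), n.currentTerm()
        // 2. Heartbeat a majority; acknowledgments must come back for
        //    the *same* term, proving no higher-term leader existed
        //    when the heartbeats were sent.
        if err := n.heartbeatMajority(term); err != nil {
            return Result{}, err
        }
        // 3. Wait for the state machine to apply through readIndex (it
        //    can be catching up while the heartbeats are in flight).
        if err := n.waitApplied(readIndex); err != nil {
            return Result{}, err
        }
        // 4. Query the state machine. readIndex is a lower bound, so
        //    fresher already-applied state is fine to return.
        return n.queryStateMachine(query)
    }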

-Diego




Philip O'Toole

Oct 14, 2024, 11:47:50 PM
to raft...@googlegroups.com
Thanks Diego -- my latest implementation is then flawed, as I didn't add the "wait for the state machine to catch up" logic. I forgot that just because a node is elected leader doesn't mean that its *state machine* is also caught up. It just means that the leader's log is "up to date" (as defined in the paper, section 5.4.1). 

A node could be elected leader, do a read, send out the heartbeats, confirm it's still the leader (and confirm that the term never changed from before the read was initiated), and still return stale data if its state machine wasn't yet up to date. The "previous" leader could actually have a state machine with more recent data if it was further ahead in the apply process! I missed this point until now -- I had convinced myself the code was right without the "wait for the state machine to catch up" step.

I'd better go fix it (or let me know if I'm missing something else!).

Philip

Philip O'Toole

Oct 15, 2024, 12:17:54 AM
to raft...@googlegroups.com
https://github.com/rqlite/rqlite/pull/1944

(or something very close -- the timeout should be configurable by the client).

Thanks again,

Philip

Philip O'Toole

Oct 15, 2024, 7:40:24 AM
to raft-dev
On Monday, October 14, 2024 at 8:12:06 PM UTC-4 onga...@gmail.com wrote:
Hi all,

I'm a bit late to chime in here. I agree with what Archie and Free wrote above. I just wanted to add that some more detailed steps for processing read-only requests while preserving linearizability are in my dissertation ( https://github.com/ongardie/dissertation ), section 6.4 "Processing read-only queries more efficiently". I'll highlight two things from there:

1. The leader's commit index isn't necessarily up-to-date until it has committed one of its own entries, so it's critical to wait for that to happen. You might want to review your code to make sure.

rqlite uses the Hashicorp Raft implementation underneath, which takes care of that. See https://github.com/hashicorp/raft/blob/v1.7.1/raft.go#L565

Philip

Philip O'Toole

Oct 15, 2024, 3:15:37 PM
to raft-dev
Free -- I just want to confirm one last time, now that Diego has chimed in. You seem to believe there is an implicit statement in Section 6.4, Processing read-only queries more efficiently, that the heartbeats are done within the same Raft term as the one in which the commit index was stored in readIndex (to use the variable name from the dissertation).

This makes sense to me, because otherwise I can come up with a scenario whereby the value of readIndex is set while the node isn't actually "really" the Leader (it just thinks it is, but it's been partitioned, and there is actually another Leader in the cluster at the time), but by the time the heartbeats are sent out, it's been (re-)elected Leader. That would mean the statement in the dissertation that "the readIndex was, at the time, the largest commit index ever seen by any server in the cluster" was not necessarily true.

WDYT?

Philip

Philip O'Toole

Oct 15, 2024, 3:56:45 PM
to raft-dev
One clarification inline.

On Tue, Oct 15, 2024 at 3:15 PM Philip O'Toole <oto...@google.com> wrote:
Free -- I just want to confirm one last time, now Diego chimed in. You seem to believe that there is an implicit statement in the Section 6.4 Processing read-only queries more efficiently that the Heartbeats are done within the same Raft term as the storing of the Commit Index in readIndex (to use the variable name from the dissertation) was done. 

This makes sense to me, because otherwise I seem to be able to come up with a scenario whereby the value of readIndex is set, but the node isn't actually "really" the Leader (it just thinks it is, but it's been partitioned, and there is actually another Leader on the cluster at the time),

In other words, just after the node sets the value of readIndex, but before it sends out the heartbeats, the node is re-elected, and when it sends those heartbeats out -- lo and behold -- it gets a successful response. Almost anything is possible in these racy systems, and my assumption is that Leader election happens in a separate thread from the one serving the read request (which is the case with rqlite and Hashicorp Raft).

Philip

Free Ekanayaka

Oct 15, 2024, 4:15:20 PM
to 'Philip O'Toole' via raft-dev
Hi Philip,

"'Philip O'Toole' via raft-dev" <raft...@googlegroups.com> writes:

> One clarification inline.
>
> On Tue, Oct 15, 2024 at 3:15 PM Philip O'Toole <oto...@google.com> wrote:
>
>> Free -- I just want to confirm one last time, now Diego chimed in. You
>> seem to believe that there is an implicit statement in the Section 6.4 *Processing
>> read-only queries more efficiently* that the Heartbeats are done within
>> the same Raft term as the storing of the Commit Index in *readIndex* (to
>> use the variable name from the dissertation) was done.
>>
>> This makes sense to me, because otherwise I seem to be able to come up
>> with a scenario whereby the value of *readIndex* is set, but the node
>> isn't actually "really" the Leader (it just thinks it is, but it's been
>> partitioned, and there is actually another Leader on the cluster at the
>> time),
>>
>
> In other words, just *after* the node sets the value of readIndex, but
> *before* it sends out the heartbeats, the node is re-elected, and when it
> sends those heartbeats out -- lo-and-behold -- it gets a successful
> response. Almost anything is possible in these racy systems, and my
> assumption is that Leader election happens in a separate thread than that
> serving the read request (which is the case with rqlite and Hashicorp Raft).

I think I see what you mean, and yes, you could say there is an implicit (if fairly obvious) assumption in Section 6.4. Quoting it:

"The leader needs to make sure it hasn’t been superseded by a newer
leader of which it is unaware. It issues a new round of heartbeats and
waits for their acknowledgments from a majority of the cluster. Once
these acknowledgments are received, the leader knows that there could
not have existed a leader for a greater term at the moment it sent the
heartbeats."

Of course this means the goal is to be sure there was no leader for a term greater than the one in which readIndex was taken.
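
As a sketch (hypothetical API, not any real library's): the term is captured together with readIndex, and the heartbeat acknowledgments must come back for that exact term, so a depose-and-re-elect in between surfaces as a term change and the read is rejected.

    package sketch

    // Hypothetical view of the relevant server state.
    type node interface {
        currentTerm() uint64
        commitIndex() uint64
        // heartbeatMajority succeeds only if a majority acknowledges
        // a heartbeat carrying exactly this term.
        heartbeatMajority(term uint64) error
    }

    // confirmReadIndex returns a readIndex that is safe to serve reads
    // at, or an error if the node may have been deposed (and possibly
    // re-elected under a higher term) in the meantime.
    func confirmReadIndex(n node) (uint64, error) {
        term := n.currentTerm()
        readIndex := n.commitIndex()
        if err := n.heartbeatMajority(term); err != nil {
            return 0, err // term changed or no quorum: retry the read
        }
        return readIndex, nil
    }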

Hope this clarifies it.

Free

Diego Ongaro

Oct 15, 2024, 5:10:20 PM
to raft...@googlegroups.com
You've linked to where hashicorp/raft appends a no-op entry when a server becomes leader. The leader's commit index is not up-to-date until that entry has been committed. Is there anything preventing rqlite from reading the commit index before that no-op entry has committed?

-Diego

Diego Ongaro

Oct 15, 2024, 5:20:23 PM
to raft...@googlegroups.com
And yes, I agree with what Free wrote here. The commit index and the heartbeats all have to be tied to the same leadership term. Otherwise, Philip's scenario would be problematic. 

-Diego

Philip O'Toole

Oct 15, 2024, 5:30:58 PM
to raft...@googlegroups.com
On Tue, Oct 15, 2024 at 5:10 PM Diego Ongaro <onga...@gmail.com> wrote:


You've linked to where hashicorp/raft appends a no-op entry when a server becomes leader. The leader's commit index is not up-to-date until that entry has been committed. Is there anything preventing rqlite from reading the commit index before that no-op entry has committed?

Nope, I don't think so. Hmmmm, not good. I'm not even sure the Hashicorp library makes this easy to do. All their GoDoc states is:

func (*Raft) CommitIndex 

func (r *Raft) CommitIndex() uint64

CommitIndex returns the committed index. This API maybe helpful for server to implement the read index optimization as described in the Raft paper.

Nothing about ensuring that the initial log entry has actually been committed before calling this. I'll need to dig into this; there may be something else in the library that helps. I'm so close to getting this code right.
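
For what it's worth, the library does expose two primitives that look relevant: Barrier(), which blocks until all preceding operations (including the leader's initial no-op entry) have been applied to the FSM, and VerifyLeader(), which round-trips with a quorum. A possible shape, as a sketch only -- whether VerifyLeader's semantics rule out the depose-and-re-elect race discussed above is exactly what I still need to verify:

    package store

    import (
        "fmt"
        "time"

        "github.com/hashicorp/raft"
    )

    // Sketch only: one plausible way to assemble a linearizable read
    // on top of hashicorp/raft. Not yet verified against the library
    // internals.
    func linearizableRead(r *raft.Raft, query func() (string, error)) (string, error) {
        // Barrier enqueues an entry and blocks until everything before
        // it, including the leader's initial no-op, has been applied
        // to the FSM.
        if err := r.Barrier(5 * time.Second).Error(); err != nil {
            return "", fmt.Errorf("barrier: %w", err)
        }
        // Read the local FSM.
        result, err := query()
        if err != nil {
            return "", err
        }
        // Confirm leadership with a quorum before responding, so the
        // value read above was current at some instant during the
        // request. Whether this check is tied to a single term is the
        // open question above.
        if err := r.VerifyLeader().Error(); err != nil {
            return "", fmt.Errorf("not leader: %w", err)
        }
        return result, nil
    }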

Philip
