Groups keyboard shortcuts have been updated
Dismiss
See shortcuts

Is it necessary to wait fo FSM applied in readindex read to be linearizable?

77 views
Skip to first unread message

李緒成

unread,
Oct 4, 2024, 3:02:35 AM10/4/24
to raft-dev
I am working with Philip on the read index optimization in rqlite :https://github.com/rqlite/rqlite/pull/1930

The previous conversation is talking about the ordering of heartbeat and FSM read.

In current change, We take the same approach as consul, but after doing some search, I found TiDB's readindex read has two addition step:
1. get current commit index
2. wait for the FSM until it applied to the commit index we get

So I wonder if these steps are necessary to be linearizable?

My thought is, with the Archie's and Free's point, those committed commands waiting for FSM to apply can be consider is concurrent  to the incoming readindex read request, so even if the read doesn't affected by those committed but not applied request, that's also lineariziable, since they are not logical related.

李緒成

unread,
Oct 4, 2024, 3:18:02 AM10/4/24
to raft-dev
TiDB read index & lease read: https://www.pingcap.com/blog/lease-read/

Free Ekanayaka

unread,
Oct 4, 2024, 5:28:51 AM10/4/24
to 李緒成, raft-dev
李緒成 <pete...@gmail.com> writes:

> I am working with Philip on the read index optimization in rqlite
> :https://github.com/rqlite/rqlite/pull/1930
>
> The previous conversation
> <https://groups.google.com/g/raft-dev/c/4QlyV0aptEQ> is talking about the
> ordering of heartbeat and FSM read.
>
> In current change, We take the same approach
> <https://github.com/hashicorp/consul/blob/a6898939910b175db7495f02131918c0cc73027c/internal/storage/raft/backend.go#L129-L134>
> as consul, but after doing some search, I found TiDB's readindex read has
> two addition step:
> 1. get current commit index
> 2. wait for the FSM until it applied to the commit index we get
>
> So I wonder if these steps are necessary to be linearizable?

I assume that you are referring the "ReadIndex Read" section of this
page:

https://www.pingcap.com/blog/lease-read/

right? In particular that section says:

1. Record its current commit index into the local variable ReadIndex.

2. Send a heartbeat message to other Regions. If the majority of Regions
replies with the corresponding heartbeat response, then the leader
confirms its state.

3. Wait for the execution of its state machine until the apply index
equals to or exceeds ReadIndex. By then, the data can be read
consistently.

4. Execute the read request and return the result to the client.

I think it's necessary to wait for the state machine to apply the
current commit index, before performing the read, otherwise you wouldn't
read the latest state.

However, in many implementations the when the commit index is increased
the corresponding log entry is immediately applied to the state machine,
in an atomic manner. In those implementations you can be sure that the
commit index is always equal to the apply index, and in that case there
isn't anything to wait for.

>
> My thought is, with the Archie's
> <https://groups.google.com/g/raft-dev/c/4QlyV0aptEQ/m/Nliuzo0WAwAJ> and
> Free's <https://groups.google.com/g/raft-dev/c/4QlyV0aptEQ/m/a6QoLjcYAwAJ>
> point, those committed commands waiting for FSM to apply can be consider is
> *concurrent* to the incoming readindex read request, so even if the read
> doesn't affected by those committed but not applied request, that's also
> lineariziable, since they are not logical related.

I believe this would be a violation of linearizability, because you
would not read the latest committed state at the time that the read
request was received.

I'm not entirely sure about this though, and your one is an interesting
question. I'm curious to see what other folks think.

Free

李緒成

unread,
Oct 5, 2024, 12:25:17 AM10/5/24
to raft-dev
Thanks Free,


> I assume that you are referring the "ReadIndex Read" section of this
> page:
https://www.pingcap.com/blog/lease-read/
> right? 

yes~


> However, in many implementations the when the commit index is increased
> the corresponding log entry is immediately applied to the state machine,
> in an atomic manner. In those implementations you can be sure that the
> commit index is always equal to the apply index, and in that case there
> isn't anything to wait for.

rqlite use hashicorp/raft library, 
and the implementation of leader commit index and FSM apply is at: https://github.com/hashicorp/raft/blob/dd1f3da54cd2f093e4caf643e9fb761a5ae74f51/raft.go#L785-L846
1. set commit index
2. process a batch of logs and reply to those logs sender
3. set applied index 

Which seems not in an atomic manner, it‘s still possible that read commit index behind the fsm index; however it does fit the property "when the commit index is increased
the corresponding log entry is immediately applied to the state machine" you said, so I'm not quite sure about that 😅

What do you think?


Jordan Halterman

unread,
Oct 5, 2024, 2:24:18 AM10/5/24
to raft...@googlegroups.com
This is a fine way to do it. The state that’s read at a committed index will never change. What matters is that you verify leadership after reading that index to ensure there wasn’t a write to another leader that would violate linearizability. Once you have that, it’s safe to read the state at that index and return it to the client. This just sounds like a decent optimization in the state machine. 

Jordan Halterman

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/ba285c33-88c0-49a0-833f-3586b9aee0fan%40googlegroups.com.

Free Ekanayaka

unread,
Oct 5, 2024, 4:23:00 AM10/5/24
to 李緒成, raft-dev
Jordan has already shed light on this, so I guess it doesn't matter much
if the log is applied atomically or not, since apparently
linearizability is preserved as long as you just check leadership before
replying to the client.

李緒成 <pete...@gmail.com> writes:

> Thanks Free,
>
>> I assume that you are referring the "ReadIndex Read" section of this
>> page:
>> https://www.pingcap.com/blog/lease-read/
>> right?
>
> yes~
>
>> However, in many implementations the when the commit index is increased
>> the corresponding log entry is immediately applied to the state machine,
>> in an atomic manner. In those implementations you can be sure that the
>> commit index is always equal to the apply index, and in that case there
>> isn't anything to wait for.
>
> rqlite use hashicorp/raft library,
> and the implementation of leader commit index and FSM apply is at:
> https://github.com/hashicorp/raft/blob/dd1f3da54cd2f093e4caf643e9fb761a5ae74f51/raft.go#L785-L846
> 1. set commit index
> 2. process a batch of logs and reply to those logs sender
> 3. set applied index
>
> Which seems not *in an atomic manner, *it‘s still possible that read commit
> index behind the fsm index*;* however it does fit the property "when the

milan...@axoniq.io

unread,
Oct 5, 2024, 4:31:15 AM10/5/24
to raft-dev
What I find weird about this approach is that leader's state machine might be really slow in applying entries (slower than other nodes' state machines). 

Cheers,
Milan

Vilho Raatikka

unread,
Oct 5, 2024, 5:00:03 AM10/5/24
to raft...@googlegroups.com
Milan, could it be that the Leader does more than Followers and although Followers can handle it is just too much for the Leader?

What does the apply-process actually do in your case? Are you simply updating a value of a key, or, from another extreme, changing the state of a relational database? 

If Followers can handle the load then I'd check concurrency control; additional locking/mutexing required in the Leader.

In general, writing to a persistent storage is a typical source for delays. Especially in networked or cloud environments.

Regards

Vilho

Milan Savic

unread,
Oct 5, 2024, 5:54:32 AM10/5/24
to raft-dev
I'm building an event store, the state machine that appends events durably to the disk. The apply speed may differ significantly depending on the nodes' attached volumes for the RAFT log and an event store

In our case, "stale reads" mean that we are only a few events behind the head of the event stream since the event store is immutable by nature.
Nothing can go wrong since the event store must guarantee consistency on the append. But of course, we want to limit stale reads as much as possible to avoid too many conflicts on appends.

Cheers,
Milan

Vilho Raatikka

unread,
Oct 5, 2024, 6:42:39 AM10/5/24
to raft...@googlegroups.com
Reading committed values (even stale) may sometimes be ok but there may be an issue if the client who wrote a value immediately needs to read the value after that. 

Regards

Vilho

Reply all
Reply to author
Forward
0 new messages