How do applications commonly reverse transitions on the follower state machine?

Rajiv Kurian

Dec 14, 2013, 3:37:09 AM
to raft...@googlegroups.com
I just watched http://www.youtube.com/watch?v=06cTPhi-3_8. This was my first introduction to Raft. Congratulations on this work. A few follow-up questions:

1. When certain followers have extraneous entries during log replication, they eventually need to reconcile and shed those entries. But they have already applied the changes to the state machine. Don't they need to "unapply" the changes to the state machine too as part of the reconciliation, before they can service read requests or become a leader later? If so, then I guess this is the responsibility of the application.

2. How do clients discover who the leader is? Do they just settle on a leader by asking around the cluster, with a heuristic on the failure rate of requests (if they picked the wrong leader, their requests should fail) before switching to another leader? If that is the case, then in a typical database application, would client reads also need to be logged? My guess is they would; otherwise a client might ask a stale leader and get incorrect results.

Thank you,
Rajiv

allen....@gmail.com

Dec 14, 2013, 6:09:54 AM
to Rajiv Kurian, raft...@googlegroups.com
It's a little late, so bear with me :)

1. Log entries are only applied to the state machine once they're committed. Committed entries are never rolled back.
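
In code, the rule looks roughly like this. A Go sketch with invented names (Entry, Server, and so on are mine, not the paper's); the point is just that entries flow to the state machine strictly behind commitIndex:

    package sketch

    // Sketch only: Entry, StateMachine, and Server are invented names,
    // not types prescribed by the Raft paper.
    type Entry struct{ Command string }

    type StateMachine interface{ Apply(cmd string) }

    type Server struct {
        log         []Entry // log[0] unused; Raft indexes entries from 1
        commitIndex int     // highest log index known to be committed
        lastApplied int     // highest log index applied to the state machine
        sm          StateMachine
    }

    // applyCommitted applies entries in order, and only once they're
    // committed. commitIndex never decreases, so nothing that reaches
    // the state machine ever needs to be "unapplied": extraneous
    // uncommitted entries are truncated from the log before they can
    // ever be applied.
    func (s *Server) applyCommitted() {
        for s.lastApplied < s.commitIndex {
            s.lastApplied++
            s.sm.Apply(s.log[s.lastApplied].Command)
        }
    }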

2. I think there are two questions here:
"How do clients discover the cluster leader?"
"Should reads require consensus?"

2a. This is implementation-dependent. You could have the client pick and send a request to an arbitrary server. If the destination is the leader, the request is processed; if not, it is rejected and the location of the leader returned. Alternatively, you could have a separate directory service that stores the leader location. Or, you could have follower servers forward requests to the leader. There are probably more options that would make sense for particular use cases.
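
A rough Go sketch of the first option (the RPC shape and all names here are invented for illustration, not any particular implementation's API):

    package sketch

    import "errors"

    // Invented reply shape: a non-leader rejects the request and, if it
    // knows, points the client at the leader.
    type Reply struct {
        OK         bool
        LeaderHint string
        Value      string
    }

    // send stands in for the actual RPC to one server (stubbed here).
    func send(server, cmd string) Reply { return Reply{} }

    // submit tries an arbitrary server first and follows leader hints,
    // falling back to the next server when no hint comes back.
    func submit(servers []string, cmd string) (string, error) {
        target := servers[0]
        for i := 0; i < 2*len(servers); i++ { // bounded retries, for the sketch
            r := send(target, cmd)
            if r.OK {
                return r.Value, nil
            }
            if r.LeaderHint != "" {
                target = r.LeaderHint // rejected: retry at the hinted leader
            } else {
                target = servers[(i+1)%len(servers)] // no hint: try another server
            }
        }
        return "", errors.New("could not find a leader")
    }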

2b. I believe the paper covers this. If you absolutely need the latest system state, a read could be implemented as a state machine command, which would require a log entry to be committed before a result is returned. Alternatively, reads could be serviced locally. In that case, yes, there's a potential for the client to receive stale data. But through the use of timeouts you can place a bound on how old that data is.

For example, if a follower services a read request, the data is potentially election_timeout time units old. I think things are a little more complicated if you somehow manage to talk to an old leader that doesn't realize another leader has been elected. You'd probably have to implement some sort of leader lease mechanism to deal with that. (I think that's right...)
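
To make the two options concrete, a Go sketch (all names invented; proposeAndWait stands in for whatever commit-and-apply path an implementation exposes):

    package sketch

    import "time"

    // proposeAndWait appends cmd to the Raft log and blocks until the
    // entry is committed and applied, returning the state machine's
    // result (stubbed here).
    func proposeAndWait(cmd string) (string, error) { return "", nil }

    // consensusRead: a read implemented as a state machine command. It
    // costs a round of consensus and a log write, but is never stale.
    func consensusRead(key string) (string, error) {
        return proposeAndWait("GET " + key)
    }

    // Follower state for the local-read option.
    type Follower struct {
        store             map[string]string
        lastLeaderContact time.Time // last AppendEntries heard from a leader
    }

    // localRead serves from local state and reports a rough staleness
    // bound: the follower can't silently trail a live leader by much
    // more than an election timeout without noticing.
    func (f *Follower) localRead(key string, electionTimeout time.Duration) (string, time.Duration) {
        bound := time.Since(f.lastLeaderContact) + electionTimeout
        return f.store[key], bound
    }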

Cheers,
Allen

Rajiv Kurian

Dec 14, 2013, 1:56:33 PM
to raft...@googlegroups.com, Rajiv Kurian
Thanks for the replies, Allen.


On Saturday, December 14, 2013 3:09:54 AM UTC-8, Allen George wrote:
> It's a little late, so bear with me :)
>
> 1. Log entries are only applied to the state machine once they're committed. Committed entries are never rolled back.

Aah, this makes sense. Am I correct in saying that every commit is then a 2PC between the leader and any majority of followers?

> 2. I think there are two questions here:
> "How do clients discover the cluster leader?"
> "Should reads require consensus?"
>
> 2a. This is implementation-dependent. You could have the client pick and send a request to an arbitrary server. If the destination is the leader, the request is processed; if not, it is rejected and the location of the leader returned. Alternatively, you could have a separate directory service that stores the leader location. Or, you could have follower servers forward requests to the leader. There are probably more options that would make sense for particular use cases.

Makes sense. So clients might pick the wrong server for a request, but they'll eventually reach the leader.

> 2b. I believe the paper covers this. If you absolutely need the latest system state, a read could be implemented as a state machine command, which would require a log entry to be committed before a result is returned. Alternatively, reads could be serviced locally. In that case, yes, there's a potential for the client to receive stale data. But through the use of timeouts you can place a bound on how old that data is.
>
> For example, if a follower services a read request, the data is potentially election_timeout time units old. I think things are a little more complicated if you somehow manage to talk to an old leader that doesn't realize another leader has been elected. You'd probably have to implement some sort of leader lease mechanism to deal with that. (I think that's right...)

Yeah, I was worried about the case where the client connects to an old leader that doesn't know about the existence of a new leader. I'll have to look up what a leader lease is.

Diego Ongaro

Dec 15, 2013, 6:38:43 PM
to Rajiv Kurian, raft...@googlegroups.com
On Sat, Dec 14, 2013 at 10:56 AM, Rajiv Kurian <geet...@gmail.com> wrote:
> On Saturday, December 14, 2013 3:09:54 AM UTC-8, Allen George wrote:
>>
>> 1. Log entries are only applied to the state machine once they're committed. Committed entries are never rolled back.
>
> Aah, this makes sense. Am I correct in saying that every commit is then a 2PC between the leader and any majority of followers?

I wouldn't call it 2PC, to avoid confusion: two-phase commit
refers to a different protocol with different behavior (see the
Wikipedia page on two-phase commit).


>> 2b. I believe the paper covers this. If you absolutely need the latest system state, a read could be implemented as a state machine command, which would require a log entry to be committed before a result is returned. Alternatively, reads could be serviced locally. In that case, yes, there's a potential for the client to receive stale data. But through the use of timeouts you can place a bound on how old that data is.
>
>
>>
>> For example, if a follower services a read request, the data is potentially election_timeout time units old. I think things are a little more complicated if you somehow manage to talk to an old leader that doesn't realize another leader has been elected. You'd probably have to implement some sort of leader lease mechanism to deal with that. (I think that's right...)
>
> Yeah, I was worried about the case where the client connects to an old leader that doesn't know about the existence of a new leader. I'll have to look up what a leader lease is.

To guarantee the leader has the latest system state, it needs to make
sure it's still leader. Without relying on clocks, this requires
communication with a majority of the cluster, but it doesn't require a
new log entry to be committed (so there's no need for any disk
writes). LogCabin implements this so that leaders never return stale
information on reads.
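
In Go-flavored sketch form (heartbeatMajority and waitApplied are
invented helpers standing in for implementation-specific mechanics,
not LogCabin's actual API):

    package sketch

    import "errors"

    type KV interface{ Get(key string) string }

    type Leader struct {
        commitIndex int
        kv          KV
    }

    // heartbeatMajority sends one round of heartbeats and reports whether
    // a majority of the cluster still acknowledges this server as leader
    // (stubbed here).
    func (l *Leader) heartbeatMajority() bool { return true }

    // waitApplied blocks until the state machine has applied the log at
    // least up to index (stubbed here).
    func (l *Leader) waitApplied(index int) {}

    // read answers without appending to the log or touching disk. The
    // heartbeat round proves this server was still leader after the read
    // arrived, so every entry committed before the read is visible.
    func (l *Leader) read(key string) (string, error) {
        readIndex := l.commitIndex // snapshot before confirming leadership
        if !l.heartbeatMajority() {
            return "", errors.New("no longer leader")
        }
        l.waitApplied(readIndex) // catch the state machine up to the snapshot
        return l.kv.Get(key), nil
    }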

If you're ok with relying on clocks for read freshness, then you
should have the leader step down if it can't communicate with a
majority of the cluster for some period of time (such as an election
timeout or a bit less to allow for clock drift). That's all that's
meant by "leader lease mechanism". (Leaders in LogCabin also do this
because it makes sure that client reads don't wait for too long at a
partitioned leader.)
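
The lease check itself can be as small as this (a Go sketch; the field
names and the drift factor are my assumptions, not something the paper
prescribes):

    package sketch

    import "time"

    // Invented bookkeeping: lastMajorityContact is the last time a
    // heartbeat round reached a majority of the cluster.
    type LeaseLeader struct {
        lastMajorityContact time.Time
        electionTimeout     time.Duration
        driftFactor         float64 // e.g. 0.9: expire a bit early for clock drift
    }

    // leaseValid reports whether this server may still serve reads
    // locally. Once it hasn't reached a majority for roughly an election
    // timeout, a new leader may already exist, so it steps down instead.
    func (l *LeaseLeader) leaseValid(now time.Time) bool {
        lease := time.Duration(float64(l.electionTimeout) * l.driftFactor)
        return now.Sub(l.lastMajorityContact) < lease
    }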

Rajiv Kurian

Dec 15, 2013, 6:56:44 PM
to raft...@googlegroups.com, Rajiv Kurian


On Sunday, December 15, 2013 3:38:43 PM UTC-8, Diego Ongaro wrote:
> On Sat, Dec 14, 2013 at 10:56 AM, Rajiv Kurian <geet...@gmail.com> wrote:
>> On Saturday, December 14, 2013 3:09:54 AM UTC-8, Allen George wrote:
>>>
>>> 1. Log entries are only applied to the state machine once they're committed. Committed entries are never rolled back.
>>
>> Aah, this makes sense. Am I correct in saying that every commit is then a 2PC between the leader and any majority of followers?
>
> I wouldn't call it 2PC, to avoid confusion: two-phase commit
> refers to a different protocol with different behavior (see the
> Wikipedia page on two-phase commit).
>
>>> 2b. I believe the paper covers this. If you absolutely need the latest system state, a read could be implemented as a state machine command, which would require a log entry to be committed before a result is returned. Alternatively, reads could be serviced locally. In that case, yes, there's a potential for the client to receive stale data. But through the use of timeouts you can place a bound on how old that data is.
>>>
>>> For example, if a follower services a read request, the data is potentially election_timeout time units old. I think things are a little more complicated if you somehow manage to talk to an old leader that doesn't realize another leader has been elected. You'd probably have to implement some sort of leader lease mechanism to deal with that. (I think that's right...)
>>
>> Yeah, I was worried about the case where the client connects to an old leader that doesn't know about the existence of a new leader. I'll have to look up what a leader lease is.
>
> To guarantee the leader has the latest system state, it needs to make
> sure it's still leader. Without relying on clocks, this requires
> communication with a majority of the cluster, but it doesn't require a
> new log entry to be committed (so there's no need for any disk
> writes). LogCabin implements this so that leaders never return stale
> information on reads.

So for every read operation the leader pings all the followers and needs a reply from the majority confirming it's still the leader before replying? Couldn't a new leader still be elected in the time between processing those replies and sending the read response to the client? If a new leader is indeed elected, it could have processed a write request, making the old leader's state machine invalid. The old leader, though, doesn't know this yet and happily sends a stale read response. Does that make sense?

Diego Ongaro

Dec 15, 2013, 7:04:01 PM
to Rajiv Kurian, raft...@googlegroups.com
Yep, that can happen, and that's inevitable and perfectly fine. You
can think of it this way: if the new leader serviced the read request,
it too could have replied with that same response, assuming it
processed the read request prior to processing the write.

To be precise, the consistency model Raft provides is linearizability
(the strongest consistency model attainable in a distributed system,
practically speaking). The read and write operations in your example
are concurrent, because the write could not have possibly depended on
the result of the read and the read could not have possibly depended
on the result of the write. Because they're concurrent,
linearizability states that one operation must appear to happen before
the other, but we don't know in which order, a priori. In your
example, the read happened before the write; that's a perfectly valid
ordering.
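
To make that concrete, here's a hypothetical timeline (the operations
and values are invented for illustration):

    client A:  |-------- read(x) --------|        old leader replies x=1
    client B:        |-- write(x=2) --|           commits at the new leader

The two calls overlap in real time, so they're concurrent;
linearizability is free to order the read before the write, which is
exactly the ordering your scenario produced.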

Rajiv Kurian

Dec 15, 2013, 7:07:17 PM
to Diego Ongaro, raft...@googlegroups.com
Thank you, that helps.

lemon wonder

Dec 28, 2016, 4:04:38 AM
to raft-dev, geet...@gmail.com, allen....@gmail.com
great

On Saturday, December 14, 2013 at 7:09:54 PM UTC+8, allen....@gmail.com wrote: