Interaction of section 6.3 (Linearizability) and section 6.2 (routing to leader) in the raft thesis


Alex Bligh

Jan 5, 2016, 10:48:51 AM
to raft...@googlegroups.com, Alex Bligh
This question concerns the raft thesis, not the original raft paper.

s6.3 concerns linearizability. Raft client requests are made idempotent by caching request IDs and their replies and in the event a duplicate is received, the reply is resent:

> To achieve linearizability in Raft, servers must filter out duplicate requests. The basic idea is that servers save the results of client operations and use them to skip executing the same request multiple times. To implement this, each client is given a unique identifier, and clients assign unique serial numbers to every command. Each server’s state machine maintains a session for each client. The session tracks the latest serial number processed for the client, along with the associated response. If a server receives a command whose serial number has already been executed, it responds immediately without re-executing the request.

s6.2 concerns how clients communicate with the leader. In essence, it's up to the client to find the leader. One option is for the client to pick a random server; if that server is not the leader, it replies with the leader's address, and the client retries against the leader.
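To make that concrete, here is a rough Go sketch of such a client loop. All of the names and types below are invented purely for illustration; nothing here is prescribed by the thesis or taken from a real implementation.

package raftclient

import "math/rand"

// Illustrative only: none of these names come from the thesis or LogCabin.
type Request struct {
	ClientID uint64
	Serial   uint64
	Command  string
}

type Response struct {
	NotLeader  bool   // set by a non-leader that rejects the request
	LeaderHint string // address of the leader, if the rejecting server knows it
	Result     string
}

// sendRequest stands in for whatever RPC mechanism the client actually uses.
func sendRequest(addr string, req Request) (Response, error) {
	// network call elided in this sketch
	return Response{}, nil
}

// submit sketches the s6.2 routing: pick a random server, and if it turns out
// not to be the leader, follow its hint (or pick another server) and retry.
func submit(servers []string, req Request) Response {
	target := servers[rand.Intn(len(servers))]
	for {
		resp, err := sendRequest(target, req)
		if err != nil || (resp.NotLeader && resp.LeaderHint == "") {
			target = servers[rand.Intn(len(servers))] // unreachable or no hint: try another server
			continue
		}
		if resp.NotLeader {
			target = resp.LeaderHint // redirected: retry against the leader
			continue
		}
		return resp // the leader accepted the request
	}
}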


My question is how these two sections interact. Let us say we have a non-idempotent operation T1, which client C first sends to server S1, S1 having previously been the leader. As S1 is not the leader, it returns an error and points C to S2 (the new leader). C then sends T1 to S2, which executes it. This works fine.

However, suppose C sends T1 to S1, and soon afterwards (either before or after S1 has processed the transaction) S1's outbound communications fail (before it has sent a reply). This failure causes S2 to be elected leader. C eventually retries sending T1 and (somehow) discovers that S2 is now the leader, so sends it to S2. If S1 did not process T1, there is no issue as S2 can now process it as normal. However, if S2 is to continue to provide the linearizable semantics set out in s6.3, then in the case where S1 has already processed T1, C's request should be ignored by S2 and the same reply returned that S1 sent (which S2 does not have access to), rather than S2 attempting to process it again. However, I can see no way for S2 to determine whether or not this is the case as the list of IDs that have been processed is kept per server and not in the FSM.

How should this work?

--
Alex Bligh




Archie Cobbs

Jan 5, 2016, 1:19:20 PM
to raft-dev, al...@alex.org.uk
On Tuesday, January 5, 2016 at 9:48:51 AM UTC-6, Alex Bligh wrote:
> However, suppose C sends T1 to S1, and soon afterwards (either before or after S1 has processed the transaction) S1's outbound communications fail (before it has sent a reply). This failure causes S2 to be elected leader. C eventually retries sending T1 and (somehow) discovers that S2 is now the leader, so sends it to S2. If S1 did not process T1, there is no issue as S2 can now process it as normal. However, if S2 is to continue to provide the linearizable semantics set out in s6.3, then in the case where S1 has already processed T1, C's request should be ignored by S2 and the same reply returned that S1 sent (which S2 does not have access to), rather than S2 attempting to process it again. However, I can see no way for S2 to determine whether or not this is the case as the list of IDs that have been processed is kept per server and not in the FSM.

First note this bit in the first section you quoted:


Each server’s state machine maintains a session for each client. The session tracks the latest serial number processed for the client

This means that if a client transaction is committed, then it will be visible to any leader of any subsequent term... eventually... but in any case before that leader is able to commit a new transaction.

Therefore, if the "second" T1 sent to S2 is committed, then the "first" T1 that was sent to S1 must not have been committed because otherwise S2 would have seen it.

-Archie

Alex Bligh

Jan 5, 2016, 3:53:56 PM
to Archie Cobbs, Alex Bligh, raft-dev
Archie,

On 5 Jan 2016, at 18:19, Archie Cobbs <archie...@gmail.com> wrote:

> First note this bit in the first section you quoted:
>
> > Each server’s state machine maintains a session for each client. The session tracks the latest serial number processed for the client
>
> This means that if a client transaction is committed, then it will be visible to any leader of any subsequent term... eventually... but in any case before that leader is able to commit a new transaction.
>
> Therefore, if the "second" T1 sent to S2 is committed, then the "first" T1 that was sent to S1 must not have been committed because otherwise S2 would have seen it.

Are you saying "Each server's state machine" should be parsed as "the replicated state machine on each server"?

If so (this is not how I read it), how might that work in practice?

If not, I'm afraid (raft newbie) I may be missing the point. If the information isn't replicated, then yes the result of T1 must be in S2's state machine, but as T1 isn't idempotent (by assumption) it will try and process it again (as would S1 if it received it in the absence of the provisions of s6.3), just like in Figure 6.2.

--
Alex Bligh




Archie Cobbs

Jan 5, 2016, 5:19:20 PM
to raft-dev, archie...@gmail.com, al...@alex.org.uk
On Tuesday, January 5, 2016 at 2:53:56 PM UTC-6, Alex Bligh wrote:
> First note this bit in the first section you quoted:
>
> > Each server’s state machine maintains a session for each client. The session tracks the latest serial number processed for the client
>
> This means that if a client transaction is committed, then it will be visible to any leader of any subsequent term... eventually... but in any case before that leader is able to commit a new transaction.
>
> Therefore, if the "second" T1 sent to S2 is committed, then the "first" T1 that was sent to S1 must not have been committed because otherwise S2 would have seen it.

> Are you saying "Each server's state machine" should be parsed as "the replicated state machine on each server"?

Yes... at least that's how I was reading it. It's not crystal clear to me either.

> If so (this is not how I read it), how might that work in practice?

The replicated state machine is augmented with a new table having three columns, Client ID, Serial Number, and Response.
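For concreteness, one minimal way such a table might be declared (a Go sketch; the field names are invented here, the thesis only specifies the three columns, with the client ID serving as the key):

package sketch

// One row of the table described above: the latest serial number executed
// for a client, plus the response that was sent back for it.
type SessionEntry struct {
	Serial   uint64
	Response []byte
}

// The table itself, keyed by client ID; it lives inside the replicated
// state machine, not alongside it.
type SessionTable map[uint64]SessionEntry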

Each time a client requests a transaction, it provides its client ID and serial number. If the client's serial number is already known (i.e., in the table), the server just sends back the corresponding response (or, if the serial number is stale, some out-of-sync error). Otherwise it performs the client's transaction and privately includes in that transaction the appending (or replacing) of the client's row in the table with the new serial number. The overall transaction (client activity plus new table row) is then committed as a new Raft log entry. This guarantees that serial numbers increase monotonically for each client, and therefore no client transaction is duplicated.

As always, the client does not get back a "your transaction was successfully committed" message until that overall transaction is committed to the Raft log in the Raft sense (this is standard).

-Archie

Diego Ongaro

Jan 5, 2016, 9:05:57 PM
to raft...@googlegroups.com, Archie Cobbs, Alex Bligh
> However, I can see no way for S2 to determine whether or not this is the case as the list of IDs that have been processed is kept per server and not in the FSM.

Agreed! The list of IDs that have been processed should be part of the replicated state machine. As in, each server's state machine independently processes the requests to arrive at the same client session state.

Archie is almost right, except for:
> The overall transaction (client activity plus new table row) is then committed as a new Raft log entry
The simplest way, I think, is to include the client table as part of the state machine and only consult it within the state machine. And the entire client request goes right into the Raft log as one entry, not two. A typical request in the Raft log then looks like:
(client ID, serial number, command to modify primary data structure)
Once an entry is committed, it's applied to the state machine. Assuming the request hasn't already been executed and has an effect, applying it updates both the state machine's primary data structure (key-value store or whatever) and also its client session table. The client session table is just part of the state machine state. That's it, no second entry needed, no consulting the client table until after the request was committed in the log.
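A rough sketch of that apply step, in Go with invented names (the real LogCabin code is linked below); the point is that the duplicate check and the response caching both happen inside apply, only after the entry has been committed:

package sketch

// Entry is one Raft log entry, exactly the triple described above.
type Entry struct {
	ClientID uint64
	Serial   uint64
	Command  SetCommand // the command that modifies the primary data structure
}

// SetCommand is a toy primary-data operation for a key-value store.
type SetCommand struct{ Key, Value string }

type sessionEntry struct {
	serial   uint64 // latest serial number executed for this client
	response string // the reply that was sent (or will be sent) for it
}

// StateMachine holds both the primary data structure and the client session
// table; the two are updated together when an entry is applied.
type StateMachine struct {
	kv       map[string]string
	sessions map[uint64]sessionEntry
}

// apply is called once per committed entry, in log order, on every server.
// The session table is consulted only here, after the entry is committed.
func (sm *StateMachine) apply(e Entry) string {
	if s, ok := sm.sessions[e.ClientID]; ok {
		if e.Serial == s.serial {
			return s.response // duplicate request: reply without re-executing
		}
		if e.Serial < s.serial {
			return "ERR_STALE_SERIAL" // older than the latest serial already executed
		}
	}
	// New request: execute it and record the response, all in one step.
	sm.kv[e.Command.Key] = e.Command.Value
	sm.sessions[e.ClientID] = sessionEntry{serial: e.Serial, response: "OK"}
	return "OK"
}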

LogCabin has an implementation like this that you can refer to. The client table is managed in Server/StateMachine.cc, while the primary data structure is in Tree/Tree.cc. Here's the line where the state machine calls into the Tree to do the real processing: https://github.com/logcabin/logcabin/blob/v1.1.0/Server/StateMachine.cc#L324 . You can see that all the client table-related logic surrounds that call, and it's all encapsulated within the state machine. (Warning: the syntax might make you puke--the joys of protobuf and C++.)

-Diego



Alex Bligh

Jan 6, 2016, 8:45:50 AM
to Diego Ongaro, Alex Bligh, raft...@googlegroups.com, Archie Cobbs

On 6 Jan 2016, at 02:05, Diego Ongaro <onga...@gmail.com> wrote:

>> However, I can see no way for S2 to determine whether or not this is the case as the list of IDs that have been processed is kept per server and not in the FSM.
>
> Agreed! The list of IDs that have been processed should be part of the replicated state machine. As in, each server's state machine independently processes the requests to arrive at the same client session state.

OK thanks. I think perhaps the language might benefit from a little tidying.

> Archie is almost right, except for:
>
>> The overall transaction (client activity plus new table row) is then committed as a new Raft log entry
>
> The simplest way, I think, is to include the client table as part of the state machine and only consult it within the state machine. And the entire client request goes right into the Raft log as one entry, not two. A typical request in the Raft log then looks like:
> (client ID, serial number, command to modify primary data structure)
> Once an entry is committed, it's applied to the state machine. Assuming the request hasn't already been executed and has an effect, applying it updates both the state machine's primary data structure (key-value store or whatever) and also its client session table. The client session table is just part of the state machine state. That's it, no second entry needed, no consulting the client table until after the request was committed in the log.

Suppose the client's request/response also includes some state to return. Let us assume that the client's call is an RPC call to set the value of a given key in a key/value store AND return the previous value. This would be the sort of call that we would want to ensure runs exactly once, as if it runs twice the returned value would be wrong.

In the situation where the leader's response is lost, there are two possibilities: either the leader has applied the transaction or it has not. The client resends the request. The above method allows the leader to determine whether the original transaction was applied. If not it applies it now. If it has been applied, how does it return the old value, now it has been removed from the K/V store?

What I was doing was simply caching queries and replies (on a per server basis) and discarding those appropriately. Do I need to effectively put the entire cache into the FSM? I have a nasty suspicion the session cache might then become the largest part of the FSM!

--
Alex Bligh




Diego Ongaro

Jan 6, 2016, 1:53:05 PM
to Alex Bligh, raft...@googlegroups.com, Archie Cobbs
On Wed, Jan 6, 2016 at 5:45 AM, Alex Bligh <al...@alex.org.uk> wrote:

> On 6 Jan 2016, at 02:05, Diego Ongaro <onga...@gmail.com> wrote:

>>  However, I can see no way for S2 to determine whether or not this is the case as the list of IDs that have been processed is kept per server and not in the FSM.
>
> Agreed! The list of IDs that have been processed should be part of the replicated state machine. As in, each server's state machine independently processes the requests to arrive at the same client session state.

> OK thanks. I think perhaps the language might benefit from a little tidying.

It's not my best written chapter, but hey, they signed.
 
> Archie is almost right, except for:
>
>> The overall transaction (client activity plus new table row) is then committed as a new Raft log entry
>
> The simplest way, I think, is to include the client table as part of the state machine and only consult it within the state machine. And the entire client request goes right into the Raft log as one entry, not two. A typical request in the Raft log then looks like:
> (client ID, serial number, command to modify primary data structure)
> Once an entry is committed, it's applied to the state machine. Assuming the request hasn't already been executed and has an effect, applying it updates both the state machine's primary data structure (key-value store or whatever) and also its client session table. The client session table is just part of the state machine state. That's it, no second entry needed, no consulting the client table until after the request was committed in the log.

> Suppose the client's request/response also includes some state to return. Let us assume that the client's call is an RPC call to set the value of a given key in a key/value store AND return the previous value. This would be the sort of call that we would want to ensure runs exactly once, as if it runs twice the returned value would be wrong.
>
> In the situation where the leader's response is lost, there are two possibilities: either the leader has applied the transaction or it has not. The client resends the request. The above method allows the leader to determine whether the original transaction was applied. If not it applies it now. If it has been applied, how does it return the old value, now it has been removed from the K/V store?
>
> What I was doing was simply caching queries and replies (on a per server basis) and discarding those appropriately. Do I need to effectively put the entire cache into the FSM? I have a nasty suspicion the session cache might then become the largest part of the FSM!


Yeah, if those replies are necessary to provide the linearizable semantics you presumably want, then they can't just be kept in a cache -- they ought to be as consistent and available as the state machine state. So if the operation is [client c, op n, set x to v2 and reply with v1] then each server's state machine would independently put the reply including v1 into its client session table under (c,n) and then set x to v2.
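Concretely, a sketch of that operation (Go again, with invented names):

package sketch

type session struct {
	serial uint64
	reply  string // the full saved reply, including the previous value v1
}

type kvSM struct {
	kv       map[string]string  // primary data structure
	sessions map[uint64]session // client session table, part of the state machine
}

// applySet handles [client c, op n, set key to v2 and reply with v1]. The
// reply is computed and stored inside the state machine, so any later leader
// can resend it without re-executing the write.
func (sm *kvSM) applySet(c, n uint64, key, v2 string) string {
	if s, ok := sm.sessions[c]; ok && n == s.serial {
		return s.reply // duplicate: resend the saved v1
	}
	v1 := sm.kv[key] // the previous value, about to be overwritten
	sm.kv[key] = v2
	sm.sessions[c] = session{serial: n, reply: v1}
	return v1
}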

This session information might grow to be large, depending on your operations and workload. Here are some ideas, in no particular order:

* sharing: if many replies refer to the same data, use a layer of indirection to only store that data once

* aggressively acknowledging: If the client receives a large reply, maybe it should quickly acknowledge to the servers that they don't need to keep that around anymore (you can do that through the Raft log or not).

* redefining the operations: Maybe it's not necessary for every write to return the previous value, for example. What would the client do with the previous value? Perhaps there's a smaller reply that lets it get the job done. In LogCabin, for example, a write just returns a status code. If a client wants to know the value that was overwritten, it reads the current value (read-only query so the reply doesn't need to be stored) and uses that value as a condition for its write. That covers the more common use case that the client is modifying the old value rather than generating something new out of thin air. (A sketch of this appears after the list.)

* relaxing the consistency guarantees: Maybe you're ok with the compromise of returning the wrong result some of the time. For example, you could return the correct status code but possibly incorrect data. Whether that's reasonable really depends on the application.
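To illustrate the "redefining the operations" idea (continuing the invented kvSM sketch from earlier in this message), the write can be made conditional so that only a small status code ever needs to be saved per client:

// Conditional write: succeed only if the key currently holds `expected`.
// The reply is just a status code, so the session table stays small.
func (sm *kvSM) applyConditionalSet(c, n uint64, key, expected, v2 string) string {
	if s, ok := sm.sessions[c]; ok && n == s.serial {
		return s.reply // duplicate: resend the saved status code
	}
	reply := "CONDITION_FAILED" // the client can re-read and retry if it wants
	if sm.kv[key] == expected {
		sm.kv[key] = v2
		reply = "OK"
	}
	sm.sessions[c] = session{serial: n, reply: reply}
	return reply
}

The client first issues a read-only query for the current value (whose reply never needs to be stored) and then submits the conditional write, re-reading and retrying on CONDITION_FAILED.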

-Diego