Relationship of state machine and consensus module


Matthias Vallentin

Dec 12, 2016, 5:11:07 AM
to raft...@googlegroups.com
After reading the paper and looking at several Raft implementations, I
wonder about the relationship between the state machine and the
consensus module. Initially, I thought that the implementations of the
two could be strictly decoupled:

User ---> State Machine ---> Consensus Module ---> Log (Diagram 1)

That is, the state machine uses the consensus module only as an
implementation detail. But then I see implementations where this is not
the case, where the state machine lives *inside* the consensus module.
Consul, for example [1]:

Since all servers participate as part of the peer set, they all know
the current leader. When an RPC request arrives at a non-leader
server, the request is forwarded to the leader. If the RPC is a
query type, meaning it is read-only, the leader generates the result
based on the current state of the FSM. If the RPC is a transaction
type, meaning it modifies state, the leader generates a new log
entry and applies it using Raft. Once the log entry is committed and
applied to the FSM, the transaction is complete.

That is, users send a command to the consensus module as opposed to the
state machine, and the command is dispatched based on its type (read vs.
write):

                                   /---> State Machine
User ---> Consensus Module ------<                          (Diagram 2)
                                   \---> Log
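
In code, Diagram 2 would look roughly like the following sketch
(hypothetical Go types and stubbed-out helpers, not Consul's actual
API): the consensus module is the user-facing entry point and
dispatches on command type, so reads never touch the log.

    package raftsketch

    // Command is what a user submits; ReadOnly selects the dispatch path.
    type Command struct {
        ReadOnly bool
        Payload  []byte
    }

    // FSM is the application state machine owned by the consensus module.
    type FSM interface {
        Apply(entry []byte) []byte // mutate state with a committed log entry
        Query(q []byte) []byte     // answer a read from current state
    }

    // ConsensusModule is the user-facing component in Diagram 2.
    type ConsensusModule struct {
        fsm FSM
        // log, peers, election state, ...
    }

    // Stubs standing in for real Raft plumbing (replication, commit tracking).
    func (c *ConsensusModule) appendAndReplicate(entry []byte) (uint64, error) { return 0, nil }
    func (c *ConsensusModule) waitForCommit(index uint64) error                { return nil }

    // Submit dispatches based on command type (read vs. write).
    func (c *ConsensusModule) Submit(cmd Command) ([]byte, error) {
        if cmd.ReadOnly {
            // Plus a ReadIndex or lease check if the read must be linearizable.
            return c.fsm.Query(cmd.Payload), nil
        }
        index, err := c.appendAndReplicate(cmd.Payload) // replicate via the Raft log
        if err != nil {
            return nil, err
        }
        if err := c.waitForCommit(index); err != nil {
            return nil, err
        }
        return c.fsm.Apply(cmd.Payload), nil
    }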

Going back to the paper and looking at Figure 1, it seems that the
consensus module indeed controls the state machine and, more
importantly, is directly user-facing. I wonder whether that's a
necessity. In fact, the accompanying text in the paper reads more like
Diagram 1.

For example, would it be possible for Raft to only manage the log and
let the state machine drive snapshotting/compaction? Then, the state
machine would use the consensus module only as an implementation
detail, forwarding mutable operations (writes) to it while answering
immutable operations (reads) from its own state, derived from the local
log of its consensus node.
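
A rough sketch of what I have in mind for Diagram 1 (hypothetical
names, nothing from an existing implementation): the state machine is
user-facing, answers reads from its own state, and forwards only writes
to the consensus module.

    package raftsketch

    // Consensus is all the state machine needs from Raft in Diagram 1.
    type Consensus interface {
        // Replicate appends a command to the Raft log and blocks until it
        // is committed, returning its log index.
        Replicate(cmd []byte) (index uint64, err error)
    }

    // KVStateMachine is the user-facing component.
    type KVStateMachine struct {
        raft  Consensus
        state map[string]string // derived from the local log of its consensus node
    }

    // Get answers reads from local state; ensuring the read is linearizable
    // (e.g., via a read index) is the state machine's job in this design.
    func (s *KVStateMachine) Get(key string) string {
        return s.state[key]
    }

    // Put forwards the write to the consensus module and applies it once the
    // entry is committed (a real implementation would apply entries as they
    // come back from the log, to stay consistent across replicas).
    func (s *KVStateMachine) Put(key, value string) error {
        cmd := []byte("set " + key + "=" + value) // toy encoding
        if _, err := s.raft.Replicate(cmd); err != nil {
            return err
        }
        s.state[key] = value
        return nil
    }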

In Diagram 2, the consensus module takes care of both committing and
applying, returning to the user as soon as the command has been
"applied" (even though "applied" could have meant only reading
successfully from the state machine in the case of an immutable
command). In Diagram 1, the state machine only dispatches the command
to the consensus module if it's a write. In this case, the consensus
module returns successfully to the caller after persisting the command
in the log and applying it to the state machine. The latter approach
makes more sense to me from an architectural point of view: only the
state machine knows how to interpret its commands and whether
dispatching to the consensus module makes sense.

I haven't used many other Raft implementations, but these concerns came
up during my own implementation of Raft within an actor model runtime.
Any comments that help me better understand the fundamental design
decisions would be much appreciated.

Matthias

[1] https://www.consul.io/docs/internals/consensus.html

Samo Pogačnik

Dec 12, 2016, 1:45:24 PM
to raft-dev, vall...@berkeley.edu
Hi Matthias,

I had similar thoughts about decoupling the consensus module from the state machine. I was wondering whether anyone has considered the possibility of implementing the (common) consensus module part at the kernel level of the operating system (i.e., like a special transport protocol with its own socket type?), used by different (application-specific) state machine parts implemented as user code.

regards, Samo

On Monday, December 12, 2016 at 11:11:07 AM UTC+1, Matthias Vallentin wrote:

Matthias Vallentin

Dec 13, 2016, 8:04:07 AM
to raft...@googlegroups.com
> I had similar thoughts about decoupling the consensus module from the
> state machine. I was wondering whether anyone has considered the
> possibility of implementing the (common) consensus module part at the
> kernel level of the operating system (i.e., like a special transport
> protocol with its own socket type?), used by different
> (application-specific) state machine parts implemented as user code.

Interesting, what's your use case for a kernel-level implementation?

I've thought a bit more about the coupling of consensus module and state
machine:

- If the state machine is user-facing, then it has to ensure
linearizability by keeping track of user requests. Otherwise this
could be done (and implemented only once) in the consensus module.
This appears subtle in practice: etcd/consul didn't handle stale
reads properly initially.

- If the state machine is user-facing, the consensus module does not
  know how to compact the log into snapshots (unless it is given that
  logic somehow, e.g., upon construction). Although this is just an
  optimization, it could be important for large state machines: say
  the log consists of N assignments that all overwrite the same value;
  then the snapshot size is bounded by O(N) if the consensus module
  doesn't know the internals of the log entries, but comes down to
  O(1) if it does (see the sketch below).
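
A toy illustration of that space argument (made-up toy code, not a
measurement): a consensus module that treats entries as opaque can only
keep all of them, while the state machine can collapse them to the
final value.

    package main

    import "fmt"

    func main() {
        // 1000 log entries that all overwrite the same key.
        var opaque [][]byte
        for i := 0; i < 1000; i++ {
            opaque = append(opaque, []byte(fmt.Sprintf("set x=%d", i)))
        }

        // A consensus module that cannot interpret entries can only
        // "compact" by retaining them all: O(N) space.
        fmt.Println("opaque snapshot entries:", len(opaque))

        // The state machine, which understands the entries, snapshots the
        // final state only: O(1) space for this key.
        fsmSnapshot := map[string]string{"x": "999"}
        fmt.Println("fsm snapshot entries:", len(fsmSnapshot))
    }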

I'm sure that I'm just scratching the surface here, but these are two
points that come to mind.

Matthias

Archie Cobbs

Dec 13, 2016, 10:43:01 AM
to raft-dev, vall...@berkeley.edu
On Monday, December 12, 2016 at 4:11:07 AM UTC-6, Matthias Vallentin wrote:
For example, would it be possible for Raft to only manage the log and
let the state machine drive snapshotting/compaction? 

I'm having a hard time understanding this idea. I can't help seeing the state machine as just an implementation detail of how the log is stored. In other words, it's a data structure resulting from an optimization. The state machine has no useful meaning by itself, except perhaps as a place to go for weakly consistent (i.e., eventually consistent) reads. If you care about linearizable reads (presumably the reason you're using Raft in the first place), then the state machine and the leader's unapplied log entries must be viewed together as one unified thing.

If you really want to break them apart, a good next question to ask is: what communication must occur between the Raft consensus machinery (which maintains the uncompacted log entries) and the state machine? E.g., if they were micro-services, what would the API look like? Actually, there are three parties communicating now: Raft consensus, the state machine, and clients.
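
One way that API could look, as a sketch only (a hypothetical
interface, not any particular library's): the consensus module pushes
committed entries to the state machine and asks it for snapshots when
the log needs compacting.

    package raftsketch

    // StateMachineAPI is a hypothetical contract between the consensus
    // module and a separately hosted state machine.
    type StateMachineAPI interface {
        // Apply is called once per committed log entry, in log order, and
        // returns the result the consensus module hands back to the client.
        Apply(index uint64, entry []byte) (result []byte, err error)

        // LastApplied reports how far the state machine has caught up, which
        // clients (or the consensus module) need for linearizable reads.
        LastApplied() uint64

        // Snapshot and Restore let the consensus module compact and rebuild
        // its log without understanding the entries themselves.
        Snapshot() (state []byte, lastIncludedIndex uint64, err error)
        Restore(state []byte, lastIncludedIndex uint64) error
    }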

E.g., if you split out the state machine from Raft, a client is still required to contact both in order to do a linearizable read, and it must do so in a way that avoids seeing a missed or duplicated log entry if Raft happens to be applying a new log entry to the state machine concurrently. You could do this by contacting Raft first, then the state machine, and checking the state machine's last applied index.
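
As a sketch, with hypothetical service interfaces for the two split-out
parts: ask Raft for the current commit index first, wait until the
state machine has applied at least that far, then read.

    package raftsketch

    import "time"

    // Hypothetical handles to the two split-out services.
    type RaftService interface {
        // ReadIndex returns the commit index after the leader has confirmed
        // it is still the leader.
        ReadIndex() (uint64, error)
    }

    type StateMachineService interface {
        LastApplied() uint64
        Get(key string) (string, error)
    }

    // linearizableRead avoids missed or duplicated entries by ordering the
    // two calls and checking the state machine's last applied index.
    func linearizableRead(raft RaftService, sm StateMachineService, key string) (string, error) {
        readIndex, err := raft.ReadIndex()
        if err != nil {
            return "", err
        }
        for sm.LastApplied() < readIndex {
            time.Sleep(time.Millisecond) // or block on apply notifications
        }
        return sm.Get(key), nil
    }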

Worse is a compare-and-swap operation. Now the client must "lock" the Raft module while it performs a linearizable read, then apply its write (if appropriate) before releasing the lock. Yuck.

You could have the Raft consensus module proxy those state machine requests and handle the locking for you, but now you're back where you started - i.e., Raft + state machine integrated into one.

-Archie



Samo Pogačnik

Dec 13, 2016, 2:56:22 PM
to raft-dev, vall...@berkeley.edu

I had no particular use case in mind. Initially it seemed to me that a part of the Raft consensus protocol, at some layer, could be seen as a one-to-many "replacement" of what TCP does for one-to-one reliable, ordered data delivery over an unreliable IP network. Of course this comparison is very loose, as Raft goes far beyond pure delivery of data.

From what you explained, it looks like I am trying to split the consensus module itself into a kernel part, which contains all the network plumbing and protocol-timing mechanics and operates on persistent cluster data (the log, ...) provided by the user-space part of the consensus module. If such decoupling were possible, it might be a way to standardize a system-level Raft consensus API.

I imagine that, for instance, the complete election procedure could be handled inside the kernel using only a few pieces of persistent data (like the last term and last index) provided by the user-space part.

Similarly, less data copying between kernel and user space would be required during the log replication process…

Maybe some client-handling mechanics also fit into this model...

  Samo



On Tuesday, December 13, 2016 at 2:04:07 PM UTC+1, Matthias Vallentin wrote: