[QUEUED scylla next] Merge "raft: misc documentation edits" from Kostja

Skip to first unread message

Commit Bot

Sep 28, 2021, 6:13:27 AMSep 28
to scylla...@googlegroups.com, Tomasz Grabiec
From: Tomasz Grabiec <tgra...@scylladb.com>
Committer: Tomasz Grabiec <tgra...@scylladb.com>
Branch: next

Merge "raft: misc documentation edits" from Kostja

* scylla-dev/raft-misc-v4-docedit:
raft: document pre-voting and protection against disruptive leaders
raft: style edits of README.md.
raft: document snapshot API

diff --git a/raft/README.md b/raft/README.md
--- a/raft/README.md
+++ b/raft/README.md
@@ -7,41 +7,68 @@ This library provides an efficient, extensible, implementation of
Raft consensus algorithm for Seastar.
For more details about Raft see https://raft.github.io/

+## Terminology
+Raft PhD is using a set of terms which are widely adopted in the
+industry, including this library and its documentation. The
+library provides replication facilities for **state machines**.
+Thus the **state machine** here and in the source is a user
+application, distributed by means of Raft.
+The library is implemented in a way which allows to replace/
+plug in its key components:
+- communication between peers by implementing **rpc** API
+- persisting the library's private state on disk,
+ via **persistence** class
+- shared failure detection by supplying a custom
+ **failure detector** class
+- user state machine, by passing an instance of **state
+ machine** class.
+Please note that the library internally implements its own
+finite state machine for protocol state - class fsm. This class
+shouldn't be confused with the user state machine.
## Implementation status
- log replication, including throttling for unresponsive
-- leader election
+- managing of the user's state machine snapshots
+- leader election, including the pre-voting algorithm
+- non-voting members (learners) support
- configuration changes using joint consensus
+- read barriers
+- forwarding commands to the leader

## Usage

-In order to use the library the application has to provide implementations
-for RPC, persistence and state machine APIs, defined in raft/raft.hh. The
-purpose of these interfaces is:
-- provide a way to communicate between Raft protocol instances
-- persist the required protocol state on disk,
-a pre-requisite of the protocol correctness,
-- apply committed entries to the state machine.
-While comments for these classes provide an insight into
-expected guarantees they should provide, in order to provide a complying
-implementation it's necessary to understand the expectations
-of the Raft consistency protocol on its environment:
+In order to use the library, the application has to provide implementations
+for RPC, persistence and state machine APIs, defined in `raft/raft.hh`,
+- `class rpc`, provides a way to communicate between Raft protocol instances,
+- `class persistence` persists the required protocol state on disk,
+- class `state_machine` is the actual state machine being replicated.
+A complete description of expected semantics and guarantees
+is maintained in the comments for these classes and in sample
+implementations. Let's list here key aspects the implementor should
+bear in mind:
- RPC should implement a model of asynchronous, unreliable network,
in which messages can be lost, reordered, retransmitted more than
once, but not corrupted. Specifically, it's an error to
- deliver a message to a Raft server which was not sent to it.
+ deliver a message to a wrong Raft server.
- persistence should provide a durable persistent storage, which
survives between state machine restarts and does not corrupt
- its state.
+ its state. The storage should contain an efficient mostly-appended-to
+ part containing Raft log, thousands and hundreds of thousands of entries,
+ and a small register-like memory are to contain Raft
+ term, vote and the most recent snapshot descriptor.
- Raft library calls `state_machine::apply_entry()` for entries
reliably committed to the replication log on the majority of
servers. While `apply_entry()` is called in the order
- entries are serialized in the distributed log, there is
+ the entries were serialized in the distributed log, there is
no guarantee that `apply_entry()` is called exactly once.
- E.g. when a protocol instance restarts from the persistent state,
+ E.g. when a server restarts from the persistent state,
it may re-apply some already applied log entries.

Seastar's execution model is that every object is safe to use
@@ -58,17 +85,19 @@ In a nutshell:
- create instances of RPC, persistence, and state machine
- pass them to an instance of Raft server - the facade to the Raft cluster
on this node
+- call server::start() to start the server
- repeat the above for every node in the cluster
- use `server::add_entry()` to submit new entries
- on a leader, `state_machine::apply_entries()` is called after the added
+ `state_machine::apply_entries()` is called after the added
entry is committed by the cluster.

### Subsequent usages

-Similar to the first usage, but `persistence::load_term_and_vote()`
-`persistence::load_log()`, `persistence::load_snapshot()` are expected to
-return valid protocol state as persisted by the previous incarnation
-of an instance of class server.
+Similar to the first usage, but internally `start()` calls
+`persistence::load_term_and_vote()` `persistence::load_log()`,
+`persistence::load_snapshot()` to load the protocol and state
+machine state, persisted by the previous incarnation of this
+server instance.

## Architecture bits

@@ -79,18 +108,19 @@ changes: it is possible to add and remove one or multiple
nodes in a single transition, or even move Raft group to an
entirely different set of servers. The implementation adopts
the two-step algorithm described in the original Raft paper:
-- first, an entry in the Raft log with joint configuration is
-committed. The joint configuration contains both old and
+- first, a log entry with joint configuration is
+committed. The "joint" configuration contains both old and
new sets of servers. Once a server learns about a new
-configuration, it immediately adopts it.
+configuration, it immediately adopts it, so as soon as
+the joint configuration is committed, the leader will require two
+majorities - the old one and the new one - to commit new entries.
- once a majority of servers persists the joint
entry, a final entry with new configuration is appended
to the log.

If a leader is deposed during a configuration change,
-the new leader carries out the transition from joint
-to final configuration for it.
-it carries out the transition for the prevoius leader.
+a new leader carries out the transition from joint
+to the final configuration.

No two configuration changes could happen concurrently. The leader
refuses a new change if the previous one is still in progress.
@@ -116,6 +146,29 @@ this:
not sepately for each Raft instance. The library expects an accurate
`failure_detector` instance from a complying implementation.

+### Pre-voting and protection against disruptive leaders
+tl;dr: do not turn pre-voting OFF
+The library implements the pre-voting algorithm described in Raft PHD.
+This algorithms adds an extra voting step, requiring each candidate to
+collect votes from followers before updating its term. This prevents
+"term races" and unnecessary leader step downs when e.g. a follower that
+has been isolated from the cluster increases its term, becomes a candidate
+and then disrupts an existing leader.
+The pre-voting extension is ON by default. Do not turn it OFF unless
+testing or debugging the library itself.
+Another extension suggested in the PhD is protection against disruptive
+leaders. It requires followers to withhold their vote within an election
+timeout of hearing from a valid leader. With pre-voting ON and use of shared
+failure detector we found this extension unnecessary, and even leading to
+reduced liveness. It was thus removed from the implementation.
+As a downside, with pre-voting *OFF* servers outside the current
+configuration can disrupt cluster liveness if they stay around after having
+been removed.
### RPC module address mappings

Raft instance needs to update RPC subsystem on changes in
@@ -148,3 +201,107 @@ return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.
+## Snapshot API
+A snapshot is a compact representation of the user state machine
+state. The structure of the snapshot and details of taking
+a snapshot are not known to the library. It uses instances
+of class `snapshot_id` (essentially UUIDs) to identify
+state machine snapshots.
+The snapshots are used in two ways:
+- to manage Raft log length, i.e. be able to truncate
+ the log when it grows too big; to truncate a log,
+ the library takes a new state machine snapshot and
+ erases most log entries older than the snapshot;
+- to bootstrap a new member of the cluster or
+ catch up a follower that has fallen too much behind
+ the leader and can't use the leader's log alone; in
+ this case the library instructs the user state machine
+ on the leader to transfer its own snapshot (identified by snapshot
+ id) to the specific follower, identified by `raft::server_id`.
+ It's then the responsibility of the user state machine
+ to transfer its compact state to the peer in full.
+`snapshot_descriptor` is a library container for snapshot id
+and associated metadata. This class has the following structure:
+struct snapshot_descriptor {
+ // Index and term of last entry in the snapshot
+ index_t idx = index_t(0);
+ term_t term = term_t(0);
+ // The committed configuration in the snapshot
+ configuration config;
+ // Id of the snapshot.
+ snapshot_id id;
+// The APIs in which the snapshot descriptor is used:
+future<snapshot_id> state_machine::take_snapshot()
+void state_machine::drop_snapshot(snapshot_id id)
+future<> state_machine::load_snapshot(snapshot_id id)
+future<snpashot_reply> rpc::send_snapshot(server_id server_id, const install_snapshot& snap, seastar::abort_source& as)
+future<> persistence::store_snapshot_descriptor(const snapshot& snap, size_t preserve_log_entries);
+future<> persistence::load_snapshot_descriptor()
+The state machine must save its snapshot either
+when the library calls `state_machine::take_snapshot()`, intending
+to truncate Raft log length afterwards, or when the snapshot
+transfer from the leader is initiated via `rpc::send_snapshot()`.
+In the latter case the leader's state machine is expected
+to contact the follower's state machine and send its snaphsot to
+When Raft wants to initialize a state machine with a snapshot
+state it calls `state_machine::load_snapshot()` with appropriate
+snapshot id.
+When raft no longer needs a snapshot it uses
+`state_machine::drop_snapshot()` to inform the state machine it
+can drop the snapshot with a given id.
+Raft persists the currently used snapshot descriptor by calling
+`persistence::store_snapshot_descriptor()`. There is no separate
+API to explicitly drop the previous stored descriptor, the
+call is allowed to overwrite it. On success, this call is followed
+by `state_machine::drop_snapshot()` to drop the snapshot referenced
+by the previous descriptor in the state machine.
+The snapshot state must survive restarts, so it should be put to
+disk either in `take_snapshot()` or when with persisting the
+snapshot descriptor, in `persistence::store_snapshot_descriptor()`.
+It is possible that a crash or stop happens soon after creating a
+new snapshot and before dropping the old one. In that case
+`persistence` contains only the latest snapshot descriptor.
+The library never uses more than one snapshot, so when the state
+machine is later restarted all snapshots except the one with its
+id persisted in the snapshot descriptor can be
+safely dropped.
+Calls to `state_machine::take_snapshot()` and snapshot transfers are not
+expected to run instantly. Indeed, respective API returns
+`future<>`, so these calls may take a while. Imagine the
+state machine has a snapshot and is asked by the library to
+take a new one. A leader change happens while snapshot-taking is in
+progress and the new leader starts a snapshot transfer to the
+follower. Even less likely, but still possible, that a yet another
+leader is elected and it also starts an own snapshot transfer to the
+follower. And another one. Thus a single server may be taking a
+local state machine snapshot and running multiple transfers. When
+all of this is done, the library will automatically select the
+snapshot with the latest term and index, persist its id in the
+snapshot descriptor and call `state_machine::load_snapshot()` with this
+id. All the extraneous snapshots will be dropped
+by the library, unless the server crashes.
+Once again, to cleanup any garbage after a crash, the complying
+implementation is expected to delete all snapshots except the one
+which id is persisted in the snapshot descriptor upon restart.

Commit Bot

Sep 28, 2021, 1:12:00 PMSep 28
to scylla...@googlegroups.com, Tomasz Grabiec
From: Tomasz Grabiec <tgra...@scylladb.com>
Committer: Tomasz Grabiec <tgra...@scylladb.com>
Branch: master
Reply all
Reply to author
0 new messages