---
raft/README.md | 86 ++++++++++++++++++++++++++++++++++----------------
1 file changed, 58 insertions(+), 28 deletions(-)
diff --git a/raft/README.md b/raft/README.md
index 5f1f813279..160bd134bc 100644
--- a/raft/README.md
+++ b/raft/README.md
@@ -7,41 +7,68 @@ This library provides an efficient, extensible, implementation of
Raft consensus algorithm for Seastar.
For more details about Raft see
https://raft.github.io/
+## Terminology
+
+Raft PhD is using a set of terms which are widely adopted in the
+industry, including this library and its documentation. The
+library provides replication facilities for **state machines**.
+Thus the **state machine** here and in the source is a user
+application, distributed by means of Raft.
+The library is implemented in a way which allows to replace/
+plug in its key components:
+- communication between peers by implementing **rpc** API
+- persisting the library's private state on disk,
+ via **persistence** class
+- shared failure detection by supplying a custom
+ **failure detector** class
+- user state machine, by passing an instance of **state
+ machine** class.
+
+Please note that the library internally implements its own
+finite state machine for protocol state - class fsm. This class
+shouldn't be confused with the user state machine.
+
## Implementation status
---------------------
- log replication, including throttling for unresponsive
servers
-- leader election
+- managing of the user's state machine snapshots
+- leader election, including the pre-voting algorithm
+- non-voting members (learners) support
- configuration changes using joint consensus
+- read barriers
+- forwarding commands to the leader
## Usage
-----
-In order to use the library the application has to provide implementations
-for RPC, persistence and state machine APIs, defined in raft/raft.hh. The
-purpose of these interfaces is:
-- provide a way to communicate between Raft protocol instances
-- persist the required protocol state on disk,
-a pre-requisite of the protocol correctness,
-- apply committed entries to the state machine.
-
-While comments for these classes provide an insight into
-expected guarantees they should provide, in order to provide a complying
-implementation it's necessary to understand the expectations
-of the Raft consistency protocol on its environment:
+In order to use the library, the application has to provide implementations
+for RPC, persistence and state machine APIs, defined in `raft/raft.hh`,
+namely:
+- `class rpc`, provides a way to communicate between Raft protocol instances,
+- `class persistence` persists the required protocol state on disk,
+- class `state_machine` is the actual state machine being replicated.
+
+A complete description of expected semantics and guarantees
+is maintained in the comments for these classes and in sample
+implementations. Let's list here key aspects the implementor should
+bear in mind:
- RPC should implement a model of asynchronous, unreliable network,
in which messages can be lost, reordered, retransmitted more than
once, but not corrupted. Specifically, it's an error to
- deliver a message to a Raft server which was not sent to it.
+ deliver a message to a wrong Raft server.
- persistence should provide a durable persistent storage, which
survives between state machine restarts and does not corrupt
- its state.
+ its state. The storage should contain an efficient mostly-appended-to
+ part containing Raft log, thousands and hundreds of thousands of entries,
+ and a small register-like memory are to contain Raft
+ term, vote and the most recent snapshot descriptor.
- Raft library calls `state_machine::apply_entry()` for entries
reliably committed to the replication log on the majority of
servers. While `apply_entry()` is called in the order
- entries are serialized in the distributed log, there is
+ the entries were serialized in the distributed log, there is
no guarantee that `apply_entry()` is called exactly once.
- E.g. when a protocol instance restarts from the persistent state,
+ E.g. when a server restarts from the persistent state,
it may re-apply some already applied log entries.
Seastar's execution model is that every object is safe to use
@@ -58,17 +85,19 @@ In a nutshell:
- create instances of RPC, persistence, and state machine
- pass them to an instance of Raft server - the facade to the Raft cluster
on this node
+- call server::start() to start the server
- repeat the above for every node in the cluster
- use `server::add_entry()` to submit new entries
- on a leader, `state_machine::apply_entries()` is called after the added
+ `state_machine::apply_entries()` is called after the added
entry is committed by the cluster.
### Subsequent usages
-Similar to the first usage, but `persistence::load_term_and_vote()`
-`persistence::load_log()`, `persistence::load_snapshot()` are expected to
-return valid protocol state as persisted by the previous incarnation
-of an instance of class server.
+Similar to the first usage, but internally `start()` calls
+`persistence::load_term_and_vote()` `persistence::load_log()`,
+`persistence::load_snapshot()` to load the protocol and state
+machine state, persisted by the previous incarnation of this
+server instance.
## Architecture bits
@@ -79,18 +108,19 @@ changes: it is possible to add and remove one or multiple
nodes in a single transition, or even move Raft group to an
entirely different set of servers. The implementation adopts
the two-step algorithm described in the original Raft paper:
-- first, an entry in the Raft log with joint configuration is
-committed. The joint configuration contains both old and
+- first, a log entry with joint configuration is
+committed. The "joint" configuration contains both old and
new sets of servers. Once a server learns about a new
-configuration, it immediately adopts it.
+configuration, it immediately adopts it, so as soon as
+the joint configuration is committed, the leader will require two
+majorities - the old one and the new one - to commit new entries.
- once a majority of servers persists the joint
entry, a final entry with new configuration is appended
to the log.
If a leader is deposed during a configuration change,
-the new leader carries out the transition from joint
-to final configuration for it.
-it carries out the transition for the prevoius leader.
+a new leader carries out the transition from joint
+to the final configuration.
No two configuration changes could happen concurrently. The leader
refuses a new change if the previous one is still in progress.
--
2.25.1