TL;DR
https://github.com/kuujo/copycat
Hey all. This is a brief update on Copycat and an in-depth examination of some of the abstractions I've been able to build around the Raft algorithm.
I'm hoping that I can get some feedback. Please feel free to point out any issues in my description of Copycat's Raft implementation.
I've mentioned Copycat on this forum a few times. I first became interested in consensus algorithms a little more than a year ago, and I've been working on Copycat ever since. It started out as a basic Raft implementation, but as I've identified use cases in my day job, it has evolved into a more generally useful distributed coordination framework.
After I built Copycat's original Raft implementation, I realized that many of the components of the algorithm could be encapsulated from one another. A few months ago I set out to build a more widely useful API by abstracting leader election, logging, state management, and other elements of the algorithm from one another. Eventually, Copycat transformed into a multi-module project that includes separate interfaces for leader elections, event logs, state logs, state machines, and distributed collections. Each of these types of resources simply exposes different features of the Raft algorithm in a unique API.
Abstracting portions of the Raft algorithm in order to facilitate such a modular API required a lot of trial and error. Eventually, I was able to create a core implementation of the Raft algorithm that strictly handles communication and logging, is extensible and reusable, and is generic enough to support a variety of features. High-level resources wrap the Raft algorithm and expose use-case-specific features such as election information and events, state management, and log compaction.
The Raft implementation
Copycat's core Raft implementation takes an extensible approach, implementing only the bare necessities of the Raft algorithm: voting, logging, and replication. Other tasks, such as membership changes and log compaction, are delegated to wrapper classes (resources). Additionally, rather than having the algorithm interact directly with any transport layer, Copycat mediates between the Raft algorithm and the transport layer in order to facilitate multiple instances of the algorithm running over the same transport.
At the protocol level, Copycat supports pluggable transports, allowing the framework to be used within a variety of asynchronous environments (actually, this is sort of a legacy feature, as Copycat was originally implemented on the Vert.x event bus).
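For a sense of what "pluggable" means here, the transport layer boils down to something like the interfaces below. To be clear, these names are illustrative, not Copycat's actual SPI:

    import java.util.concurrent.CompletableFuture;
    import java.util.function.Function;

    // Client side: send an opaque request to a remote member, get a reply.
    interface ProtocolClient {
      CompletableFuture<byte[]> send(byte[] request);
    }

    // Server side: register a handler invoked for each incoming request.
    interface ProtocolServer {
      void handle(Function<byte[], CompletableFuture<byte[]>> handler);
    }

    // A transport implementation (TCP, Vert.x event bus, in-memory for
    // tests, etc.) just needs to produce clients and servers from URIs.
    interface Protocol {
      ProtocolClient createClient(String remoteUri);
      ProtocolServer createServer(String localUri);
    }

Because requests and responses are asynchronous futures, the same interfaces work for event-loop environments like Vert.x as well as plain socket-based transports.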
For the most part, Copycat adheres to the Raft protocol specification, but it does diverge in some areas. Specifically, rather than implementing heartbeats via AppendEntries RPCs, Copycat uses a separate RPC for sending heartbeats when replication is not taking place. This is simply a cleanliness feature and makes no practical difference, since the heartbeat RPC carries all the same data as an AppendEntries RPC except the entries themselves.
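As a rough illustration of the relationship between the two request types (with hypothetical class and field names - these aren't Copycat's actual messages):

    import java.util.List;

    // The heartbeat carries everything needed for the consistency check...
    class PingRequest {
      long term;         // leader's current term
      String leaderId;   // so followers can redirect clients
      long logIndex;     // index of the leader's last log entry
      long logTerm;      // term of the leader's last log entry
      long commitIndex;  // leader's commit index
    }

    // ...and the append request is the same thing plus entries.
    class AppendRequest extends PingRequest {
      List<byte[]> entries; // entries to replicate
    }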
Another change to the protocol, this one an optimization, is the use of separate RPC types for reads and writes. Read and write requests between nodes are separated because Copycat internally allows optional per-request configuration options for reads. Normally, in order to guarantee consistency, Raft dictates that the leader send a heartbeat to a majority of the cluster to ensure it has not been deposed before processing a read-only request. Of course, there are some common trade-offs to this frequently futile consistency check, so Copycat exposes several consistency modes for reads (illustrated in the sketch just after this list):
* weak - All reads are immediately evaluated on the local node
* default - When a read request is received, the request is forwarded to the leader to be evaluated, and the leader's heartbeat is used as a sort of lease timeout
* full - All reads are forwarded to the leader to be evaluated and the leader will synchronously heartbeat a majority of the cluster before processing the request
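Here's the sketch I mentioned - a quick illustration of how per-request consistency might look to a caller. The enum mirrors the modes above, but the API shown is hypothetical rather than Copycat's actual interface:

    import java.util.concurrent.CompletableFuture;

    enum Consistency { WEAK, DEFAULT, FULL }

    interface StateLog {
      // Submit a read-only query with a per-request consistency mode.
      <T> CompletableFuture<T> query(byte[] query, Consistency consistency);
    }

    class ReadExamples {
      static void reads(StateLog log) {
        byte[] q = "get foo".getBytes();
        log.query(q, Consistency.WEAK);    // local read; may be stale
        log.query(q, Consistency.DEFAULT); // leader read; heartbeat as lease
        log.query(q, Consistency.FULL);    // leader read; quorum heartbeat first
      }
    }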
With Copycat's core Raft implementation complete, I decided to explore some options for expanding the conceivable size of a Copycat cluster. After realizing how simple gossip is compared to Raft, I put some effort into implementing a gossip protocol for replication to large clusters. The concept is that gossip is used to replicate only committed entries to "passive" members of the cluster. This makes the gossip protocol significantly more lightweight than Raft, since almost no coordination needs to take place between nodes. Once an entry is committed on any "active" (Raft voting) member of the cluster, the entry becomes available for replication via the gossip protocol. As in other areas of the framework, the algorithm classes are responsible for communicating entries, while the cluster classes are responsible for managing the dynamic cluster configuration.
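In rough Java, the core of the idea looks something like this (all names are illustrative, not Copycat's actual classes):

    import java.util.List;

    interface Log {
      List<byte[]> slice(long fromIndex, long toIndex); // inclusive range
    }

    interface PassiveMember {
      long matchIndex();  // highest index known to be replicated to this member
      void send(List<byte[]> entries, long fromIndex);
    }

    class PassiveReplicator {
      // Push committed entries to a passive member. Entries above
      // commitIndex are still subject to Raft consensus and are never
      // gossiped, which is what keeps this protocol coordination-free.
      void gossip(PassiveMember member, Log log, long commitIndex) {
        long from = member.matchIndex() + 1;
        if (from <= commitIndex) {
          member.send(log.slice(from, commitIndex), from);
        }
      }
    }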
Copycat's gossip implementation uses vector clocks to keep track of membership and indexes for passive nodes. Whenever an active or passive member communicates with another passive member, the two exchange a vector clock of cluster membership and index information. Versions are used to determine which entries in the vector clock are stale and which are updates. This ensures that index changes are propagated quickly throughout the passive cluster members, reducing the risk of unnecessary communication.
Additionally, the cluster - which handles membership changes and failure detection - uses the gossip protocol to determine the status of unreachable nodes. In order to verify that a passive member has left the cluster or died (it makes no distinction between the two), Copycat requires independent verification from multiple nodes by simply piggybacking suspicion counters on exchanged gossip messages. Once enough nodes have marked a member as suspicious, it is considered dead, and that information is gossiped along with the updated passive membership list.
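A rough sketch of the exchange, again with hypothetical names and an illustrative threshold:

    import java.util.HashMap;
    import java.util.Map;

    class MemberInfo {
      long version;   // bumped whenever the member's index or status changes
      long index;     // highest log index replicated to the member
      int suspicions; // how many nodes have failed to reach the member
    }

    class GossipState {
      final Map<String, MemberInfo> members = new HashMap<>();
      static final int SUSPICION_THRESHOLD = 3; // illustrative value

      // Merge a vector clock received from a peer: higher versions win,
      // and members marked suspicious by enough nodes are dropped.
      void merge(Map<String, MemberInfo> remote) {
        remote.forEach((id, info) -> {
          MemberInfo local = members.get(id);
          if (local == null || info.version > local.version) {
            members.put(id, info);
          }
          MemberInfo current = members.get(id);
          if (current != null && current.suspicions >= SUSPICION_THRESHOLD) {
            members.remove(id); // considered dead; gossiped on the next exchange
          }
        });
      }
    }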
Logs
As with many other features of Copycat, the Log is highly configurable and extensible. This helps Copycat serve the wide variety of use cases covered by its various resource types. Additionally, Copycat's logs are designed to support fast compaction. Whereas Copycat's original log implementations were simple files, logs are now separated into segments. By requiring that log compaction occur on segment rollovers, compacting the Copycat log is as simple as deleting a segment file.
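To make that concrete, here's an illustrative sketch (not Copycat's actual log classes) of why segmenting makes compaction cheap:

    import java.io.File;
    import java.util.TreeMap;

    class SegmentedLog {
      // Maps the first index of each segment to its backing file.
      final TreeMap<Long, File> segments = new TreeMap<>();

      // Delete every full segment whose entries all fall at or below
      // compactIndex; the newest segment is always kept for appends.
      void compact(long compactIndex) {
        while (segments.size() > 1) {
          long first = segments.firstKey();
          Long next = segments.higherKey(first);
          if (next != null && next - 1 <= compactIndex) {
            segments.remove(first).delete(); // compaction == one file delete
          } else {
            break;
          }
        }
      }
    }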
State Machines
Copycat does provide a state machine interface. The state machine wraps the core Raft algorithm to provide Raft-log-based state management and to expose distributed state machine proxies. More relevantly, though, the state machine is an excellent example of the modularity of Copycat's Raft implementation. Because of the variety of use cases supported in Copycat, the core Raft algorithm does not implement log compaction; different use cases dictate different methods for compacting the log. Event logs often require time- or size-based compaction, while state machines require more careful management of logs. The state machine wraps Copycat's pluggable logs in a special log that handles snapshots: when the log grows large enough, Copycat serializes a snapshot of the state machine's state and compacts the log.
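Here's a rough sketch of how those pieces fit together; all of the names are hypothetical:

    interface CompactableLog {
      long sizeInBytes();
      void compact(long toIndex); // e.g. by deleting whole segments
    }

    interface Snapshottable {
      byte[] takeSnapshot();             // serialize current state
      void installSnapshot(byte[] data); // restore state on recovery
    }

    class SnapshotLog {
      final Snapshottable stateMachine;
      final CompactableLog log;
      static final long MAX_SIZE = 32L * 1024 * 1024; // illustrative threshold

      SnapshotLog(Snapshottable stateMachine, CompactableLog log) {
        this.stateMachine = stateMachine;
        this.log = log;
      }

      // Called after entries are applied; lastApplied is the index of the
      // last entry applied to the state machine.
      void maybeCompact(long lastApplied) {
        if (log.sizeInBytes() > MAX_SIZE) {
          byte[] snapshot = stateMachine.takeSnapshot();
          store(snapshot, lastApplied); // persist the snapshot first...
          log.compact(lastApplied);     // ...then the covered entries can go
        }
      }

      void store(byte[] snapshot, long index) { /* write snapshot to disk */ }
    }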
Event Logs
Finally, the feature that I was most interested in when I began my effort to abstract the Raft algorithm was event logging. A few months ago, when I was researching Kafka for work, I realized that the algorithm Kafka uses for managing partitions and replication via ZooKeeper seemed eerily similar to Raft. Of course, the two aren't identical - as opposed to a quorum-based approach, Kafka synchronously replicates to all "in-sync" replicas, followers pull from the leader, and it doesn't write to disk - but the challenge of abstracting the Raft algorithm to fit the range between state and event logging intrigued me. This led to the highly configurable logging framework within Copycat. Whereas state logs and state machines frequently flush to disk and perform snapshot-based compaction, event logs allow for fast off-heap in-memory logs that are compacted only via time- and size-based compaction policies.
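As an illustration of the kind of configurability I mean - the builder shown here is hypothetical, not Copycat's actual config API:

    import java.util.concurrent.TimeUnit;

    class EventLogConfig {
      boolean offHeap;
      long maxSegmentSize;
      long retentionMillis;

      EventLogConfig offHeap(boolean offHeap) {
        this.offHeap = offHeap;
        return this;
      }

      EventLogConfig maxSegmentSize(long bytes) {
        this.maxSegmentSize = bytes;
        return this;
      }

      EventLogConfig retention(long duration, TimeUnit unit) {
        this.retentionMillis = unit.toMillis(duration);
        return this;
      }
    }

    class ConfigExample {
      static EventLogConfig eventLog() {
        return new EventLogConfig()
            .offHeap(true)                     // keep entries off the Java heap
            .maxSegmentSize(64L * 1024 * 1024) // roll segments at 64 MB
            .retention(1, TimeUnit.HOURS);     // drop segments older than 1 hour
      }
    }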
Conclusion
You've probably noticed those last few sections were brief. That's because I pretty much tired myself out. There's still a lot to be done on this project, and it's nowhere near production ready, but I have every intention of getting it to that point. I value the insight of people on this forum, and, as I said, I would be really interested in getting feedback, particularly in relation to the Raft and gossip implementations and the general interest in the project. I promise not to make any further responses quite so long :-)