Snapshot Recovery in Copycat


Terry Tan

Mar 31, 2017, 7:13:43 AM
to raft-dev
Hi Jordan,

I have a question about the server recovery process. I found that after the server starts, it reads the lastIndex from the Raft log, then the state machine applies operations until it reaches that log index, and only after that does it install the snapshot. I expected it to install the snapshot first and then recover from the Raft log on top of the snapshot. Why do you do it this way?

jordan.h...@gmail.com

Mar 31, 2017, 3:10:02 PM
to raft...@googlegroups.com
Good question! So, snapshots in Copycat only represent a subset of the entries in the log. Copycat primarily uses an incremental compaction algorithm, with snapshots implemented on top of it.


Essentially, when a snapshot is taken, it replaces only those entries marked with the SNAPSHOT compaction mode. Entries that use other compaction modes are compacted incrementally and are not overridden by the snapshot. So what you're seeing is non-SNAPSHOT entries being applied up to the snapshot index, after which SNAPSHOT entries begin being applied as well.
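
To make that concrete, a command declares its compaction mode, roughly like this (a sketch in the style of the Copycat API; IncrementCommand is just an illustrative name and exact signatures may differ):

  // Sketch: a command whose log entries are only removed once a
  // snapshot covering them has been stored.
  public class IncrementCommand implements Command<Long> {
    @Override
    public CompactionMode compaction() {
      return CompactionMode.SNAPSHOT;
    }
  }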

The way we use this in practice is to apply different types of compaction to different types of state machines that share the same log. For example, a key-value store uses incremental compaction: setting a key marks any past operations on that key for compaction. A counter state machine, on the other hand, marks all its commands as SNAPSHOT and uses snapshots to store the counter state (a single 64-bit number). We often even mix compaction modes within the same state machine.
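
As a sketch of the key-value case (not the exact Copycat code; PutCommand and its key() accessor are illustrative), the put handler keeps the new commit and closes the one that held the previous value for the same key, which marks the old entry for compaction:

  import java.util.HashMap;
  import java.util.Map;

  public class MapStateMachine extends StateMachine {
    // Live commits, one per key; each holds its log entry open.
    private final Map<String, Commit<PutCommand>> entries = new HashMap<>();

    public Object put(Commit<PutCommand> commit) {
      Commit<PutCommand> previous = entries.put(commit.operation().key(), commit);
      if (previous != null) {
        // The overwritten entry is no longer needed to rebuild state,
        // so closing it lets the incremental compactor remove it.
        previous.close();
      }
      return null;
    }
  }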

I think we'll be testing the performance differences between pure snapshotting, incremental compaction, and incremental snapshots in the relatively near future.


Terry Tan

Apr 1, 2017, 2:35:30 AM
to raft-dev
Hi Jordan,

1. So combining log compaction and the snapshot process keeps the log relatively short, which lets you apply entries to the state machine without losing performance?
2. But I have another question: the initial value of lastApplied seems to be 0. Why not simply initialize it to the log's firstIndex (given that the log is compacted, it should be some value greater than 1), so the loop doesn't do any meaningless iterations (see the code below)?

Could you help me with these two questions?

  public void applyAll(long index) {
    if (!log.isOpen())
      return;

    // If the effective commit index is greater than the last index applied
    // to the state machine, apply the remaining entries.
    long lastIndex = Math.min(index, log.lastIndex());
    if (lastIndex > lastApplied) {
      for (long i = lastApplied + 1; i <= lastIndex; i++) {
        // Compacted entries read back as null and are skipped, but the
        // index is still marked as applied.
        Entry entry = log.get(i);
        if (entry != null) {
          apply(entry).whenComplete((result, error) -> entry.release());
        }
        setLastApplied(i);
      }
    }
  }

jordan.h...@gmail.com

Apr 1, 2017, 4:29:14 AM
to raft...@googlegroups.com
Yeah, so the idea behind the log compaction algorithm is that incremental compaction allows for more consistent performance and reduces the overhead of replication. Taking and/or replicating a large snapshot can be costly, while incremental compaction works at its own pace on individual segments of the log. The state machine continues to operate independently of the incremental compaction algorithm, and the compaction threads can be throttled to reduce the load on the servers. Perhaps more interestingly, the incremental compaction algorithm allows Copycat to exclude a lot of entries from ever being replicated to some nodes in the first place, and that helps followers that briefly fall behind the rest of the cluster quickly catch back up. We've actually seen Copycat exclude large percentages of entries from replication in our systems.

But incremental compaction doesn't suit all use cases, which is why I mentioned counters. The state of a counter is the sum of all its increments, so Copycat would otherwise have to hold every increment in the log for as long as the counter exists. Snapshots are supported primarily for use cases like that. We try to keep the size of snapshots relatively small, do counting on clients when possible (e.g. for a reentrant lock/mutex), and use incremental compaction for the most active state machines like maps, queues, and other data structures.
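
The counter state machine itself then only needs to write and read that one number through the snapshot hooks, roughly like this (a sketch; the names follow Copycat's Snapshottable style but may not match exactly):

  // Sketch of a snapshot-compacted counter. Its increment commands use
  // the SNAPSHOT compaction mode, so their log entries can be dropped
  // once a snapshot of `value` has been stored.
  public class CounterStateMachine extends StateMachine implements Snapshottable {
    private long value;

    public long increment(Commit<IncrementCommand> commit) {
      try {
        return ++value;
      } finally {
        commit.close(); // safe to close immediately; the snapshot preserves the state
      }
    }

    @Override
    public void snapshot(SnapshotWriter writer) {
      writer.writeLong(value); // the entire snapshot is a single 64-bit number
    }

    @Override
    public void install(SnapshotReader reader) {
      value = reader.readLong();
    }
  }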

As for the log, Copycat's log always starts at index 1 even if the actual entries on disk start at index 1,000,000. There's no real reason for that other than simplicity and consistency. Indexes are never removed from the log; the entries just become null, and iterating from 1 to 1,000,000 when the server starts is trivial in terms of time. But sure, starting from firstIndex would be fine too, since the compacted entries can't be applied anyway.
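
For what it's worth, the change you're suggesting would just clamp the loop's starting point to the log's first index, something like this hypothetical tweak to the snippet you posted:

  // Hypothetical tweak: start at the log's firstIndex, since indexes
  // below it hold only null (compacted) entries anyway.
  long firstIndex = Math.max(lastApplied + 1, log.firstIndex());
  long lastIndex = Math.min(index, log.lastIndex());
  for (long i = firstIndex; i <= lastIndex; i++) {
    Entry entry = log.get(i);
    if (entry != null) {
      apply(entry).whenComplete((result, error) -> entry.release());
    }
    setLastApplied(i);
  }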