Tuning for very large-scale systems


Yawei

May 6, 2017, 9:50:55 PM
to raft-dev
Hi there,

I'd like to hear from the community about the scaling aspects of Raft. Do we have any information on the largest known OLTP-type database deployments running Raft in production? Recently I have been thinking about whether there have been any discussions of protocol/algorithmic enhancements that would let Raft support very large database deployments: say O(10K) machines, O(10M) write rps, across multiple data centers and possibly different geographic regions.

Thoughts off the top of my head:

1. Heartbeats. More Raft instances in the system mean more parallelism/throughput and possibly better hardware utilization. However, the heartbeat mechanism does look like a potential scalability bottleneck for large deployments; e.g. both CRDB and TiDB are working on quiescing Raft leaders in their multi-Raft implementations. I am wondering: 1) Can we use a relatively long heartbeat interval? (Spanner has a default 10s Paxos leader lease.) Does a long interval necessarily lead to long write unavailability? Can an election be triggered by a live client request when it tries to talk to a troubled leader while the next heartbeat isn't due yet? (A rough sketch of that idea follows below.) 2) Or, going further, can the protocol be changed not to rely on a heartbeat push at all? We might then choose not to eagerly converge the system but to do it in a more deferred way, e.g. triggered by writes or strong reads, with the push from the leader or pull from the followers happening later in a more batched fashion.
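
To illustrate the "election triggered by a live client request" question, here is a rough, purely hypothetical sketch in Go of a follower whose election can fire either on a (long) election timeout or on a hint from the client-facing path that the leader looks unreachable. None of the names (leaderSuspect, startElection, etc.) come from any real implementation; this is just to show the shape of the idea, not a design.

package sketch

import "time"

// follower models only the pieces relevant to the question above.
type follower struct {
	electionTimeout time.Duration // deliberately long, e.g. seconds
	heartbeatC      chan struct{} // signaled when a leader heartbeat arrives
	leaderSuspect   chan struct{} // signaled by the client-facing path when a
	// live request to the leader times out
}

// resetTimer safely stops, drains, and re-arms a timer.
func resetTimer(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		select {
		case <-t.C:
		default:
		}
	}
	t.Reset(d)
}

func (f *follower) run() {
	timer := time.NewTimer(f.electionTimeout)
	defer timer.Stop()
	for {
		select {
		case <-f.heartbeatC:
			// Leader is alive; re-arm the long election timer.
			resetTimer(timer, f.electionTimeout)
		case <-timer.C:
			// Normal timeout-driven election.
			f.startElection()
			resetTimer(timer, f.electionTimeout)
		case <-f.leaderSuspect:
			// A live client request failed against the leader; campaign
			// now instead of waiting out the long heartbeat interval.
			f.startElection()
			resetTimer(timer, f.electionTimeout)
		}
	}
}

func (f *follower) startElection() { /* RequestVote RPCs elided */ }
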

2. Snapshots. Binding snapshotting tightly to Raft may have non-trivial performance implications in production: the state machine can be so huge that disk I/O and memory consumption become a concern. Has anyone considered decoupling snapshotting from the Raft implementation so that it becomes more of an offline operation (a sketch of one possible interface follows below)? I'd like to hear thoughts here.
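
To make the decoupling concrete, here is a minimal sketch of what an interface might look like, assuming the state machine can take a cheap consistent checkpoint (e.g. a hard-linked LSM/RocksDB checkpoint) and stream it in the background. The interface and function names are hypothetical, not from any existing library.

package sketch

// SnapshotSink is wherever the snapshot bytes end up (a local file, an
// object store, a follower needing a full state transfer, ...).
type SnapshotSink interface {
	Write(p []byte) (n int, err error)
	Close() error
}

// StateMachine is assumed to support a cheap, consistent checkpoint taken
// at its last applied index, without blocking ongoing writes.
type StateMachine interface {
	Checkpoint() (Checkpoint, error)
}

// Checkpoint is a frozen view that can be streamed later, off the Raft thread.
type Checkpoint interface {
	AppliedIndex() uint64
	AppliedTerm() uint64
	StreamTo(sink SnapshotSink) error // slow I/O happens here, in the background
	Release() error
}

// snapshotWorker runs outside the Raft critical path: take a cheap checkpoint,
// stream it in the background, and only then allow Raft to compact its log up
// to the checkpoint's applied index.
func snapshotWorker(sm StateMachine, sink SnapshotSink, compactTo func(index uint64) error) error {
	cp, err := sm.Checkpoint() // cheap, e.g. hard links
	if err != nil {
		return err
	}
	defer cp.Release()
	if err := cp.StreamTo(sink); err != nil {
		return err
	}
	return compactTo(cp.AppliedIndex())
}
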
 

3. Allowing holes in the log sequence. I understand this idea comes from the Paxos world and to some degree it contradicts the simplicity principle of the original Raft design, though such complexity often pays off in large production systems. Suppose we have a cluster of nodes A, B, C with A and B in one region and C in a remote region, and A is the leader. Write latency is usually determined by (A, B), so it will be low. While B is down, the quorum (A, C) imposes higher latency. Now suppose B comes back: AFAIK, B has to fully catch up before it can participate in quorum voting. If we allowed holes in the log sequence (as long as each committed log entry can be resolved from a quorum), B could vote immediately without fully catching up. This brings two benefits: 1) better latency, since A and B are close to each other; 2) better availability, since the system can quickly recover to its three-voter configuration. Yes, this introduces non-trivial complexity: we would likely need per-entry metadata (e.g. which replicas have accepted each entry) and more per-replica watermarks to track hole, commit, and apply status (a rough sketch of that bookkeeping follows below). I am wondering how disruptive this change would be to the Raft framework, or whether I would just be turning Raft into multi-Paxos.
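
As a rough illustration of the per-entry bookkeeping this would require, here is a hypothetical sketch of a sparse log that tracks which replicas have accepted each slot and commits slots independently, which is essentially multi-Paxos-style state rather than Raft's single contiguous log. Nothing here corresponds to an existing implementation.

package sketch

type entryState struct {
	term       uint64
	data       []byte
	acceptedBy map[string]bool // replica ID -> has accepted this slot
	committed  bool
}

type sparseLog struct {
	slots      map[uint64]*entryState // index -> state; gaps are allowed
	quorumSize int
}

// recordAccept marks that replica `id` has accepted slot `index` and commits
// the slot once a quorum has it, independently of any other slot.
func (l *sparseLog) recordAccept(index uint64, id string) {
	e, ok := l.slots[index]
	if !ok {
		return
	}
	e.acceptedBy[id] = true
	if !e.committed && len(e.acceptedBy) >= l.quorumSize {
		e.committed = true
	}
}

// applyWatermark is the highest index below which every slot is committed;
// entries can only be applied in order up to this point, so gaps still delay
// application even though they no longer delay voting.
func (l *sparseLog) applyWatermark() uint64 {
	var i uint64 = 1
	for {
		e, ok := l.slots[i]
		if !ok || !e.committed {
			return i - 1
		}
		i++
	}
}
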

I'd like to know whether these concerns are valid and, if so, what the community's take is on addressing them.

Thanks
