Multi-DC RAFT extensions


Philip Haynes

Nov 14, 2016, 8:27:28 PM
to raft-dev
Hi,

We are in the process of upgrading a number of our transaction systems to support multiple, global data centres for redundancy, legal and scale reasons.
Our needs and capabilities are much more modest than Google's, so it looks like we don't need to replicate something as complex as Spanner, which requires
atomic clocks, a sophisticated ops / SRE team and the like.

Rather, it looks like we can get what we need through a relatively simple extension of our RAFT implementation, and this type of extension would
be fairly generic. My questions are thus:
a) Is anyone else doing something like this, and how far have you got with your implementation?
b) Is extending RAFT to support multi-DC of interest to others, or do you think it is off topic?
c) Would anyone be interested in collaborating on describing such an extension?

Kind Regards,
Philip Haynes
ThreatMetrix Inc


Henrik Ingo

Nov 15, 2016, 4:18:23 AM
to raft...@googlegroups.com
What extensions do you have in mind?

There's no particular reason why Raft couldn't be used in multi-DC
setups as is. Of course, the client will see the additional latency,
and you might want to create optimizations around that.

henrik




Philip Haynes

Nov 15, 2016, 5:24:26 PM
to raft-dev
> Of course, the client will see the additional latency, and you might want to create optimizations around that
Yes - the speed of light is a bummer. A straight cross-DC RAFT deployment would increase latency from about 200 µs to
a best case of 200-300 ms or more. A show stopper for us. Furthermore, we need to be able to keep processing transactions
even if the links between DCs are down (e.g. continue to process EU transactions and US transactions even if the
links between the EU and US are down).

Our RAFT cluster processes transactions for different entities (customers). We have a wholesale business model,
so the number of entities we worry about is dramatically smaller than for those with a retail model such as Google. Thus,
to simplify our transaction model, entities are constrained to perform transactions from one DC at a time
(i.e. a global strong leader) rather than supporting cross-DC transactions.
External consistency is provided from the leader DC, and other DCs are updated asynchronously (again in contrast to Spanner).
This also means global leader election for an entity is different to local cluster leader election.

For operational simplicity, our goal is a single database log rather than every client having their own replicated database.
Current thinking is to introduce a namespace into the log index, along with a namespace log index, so that irrespective of
where a transaction is done, from the perspective of the namespace, the transaction log is sequential.

To support cross-DC replication, we furthermore add the originating DC of a log entry and its remote log index.

In this way, within a single cluster, with the exception of the additional fields, the protocol is unchanged.
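
A minimal sketch of what such an entry could look like, in Go (the field names are my own illustration, not taken from any particular implementation):

package raftext

// LogEntry is a Raft log entry extended with the fields described above.
type LogEntry struct {
    // Standard Raft fields: position and term in the local cluster's log.
    Index uint64
    Term  uint64

    // Namespace (entity) this transaction belongs to, plus a per-namespace
    // sequence number, so that from the perspective of the namespace the
    // log is sequential irrespective of where the transaction was done.
    Namespace      string
    NamespaceIndex uint64

    // Cross-DC replication metadata: the DC that originated the entry and
    // the entry's index in that DC's local log.
    OriginDC    string
    RemoteIndex uint64

    // Opaque command applied to the FSM on commit.
    Command []byte
}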

Furthermore, after a bit of design to and fro, we are looking to completely separate out cross-DC replication.
One design option was to piggyback on the replication logic already within the protocol. We decided that,
with the strong DC leaders, introducing logic to exclude the foreign remotes from leader election and from the
commit process was too hard. Furthermore, execution of the FSM on commit in a remote DC may have different
logic to a local follower (e.g. a foreign US DC may not have the data needed for an EU transaction due to
data residency requirements).

I hope this provides a sense of the sort of modifications we are considering, and that we are going in a
direction in the spirit of the protocol.

Henrik Ingo

Nov 16, 2016, 5:15:30 AM
to raft...@googlegroups.com
Hi Philip

Thanks for the additional detail. Seems like you've pretty much
figured out the issues. Thinking of the multi-DC architecture as an
entirely separate layer of replication, with its own elections, etc,
is one approach. What you describe reminds me for example of
Couchbase.

A fundamental truth is that you cannot have your cake and eat it too.
If you want your transactions to be durable against a DC-failure, the
commit time must include the round trip time to another DC. Either you
wait, or you give up on durability. There's just no way around that.
Once we accept this fact, it's easy to reason about solutions one way
or another.

henrik

Oren Eini (Ayende Rahien)

Nov 16, 2016, 1:01:15 PM
to raft...@googlegroups.com
You can't really do that on a single log, and adding a namespace just makes the logs harder to reason about (there is no global order in the log).

What I would do in this case is have multiple Raft clusters, one for each DC, but with non-voting nodes from the other DCs in each of those.
This way, each DC can do strong transactions, and I have automatic replication outside the DC.

However, my client / server code needs to know that it can't submit a request to mutate a value outside of the owning DC.
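
A rough sketch of that layout (the types and names below are hypothetical, just to make the shape concrete):

package multidc

import "fmt"

// Member is a server in a per-DC Raft group. Servers located in foreign DCs
// are added as non-voting members: they receive the log but never count
// towards elections or the commit quorum.
type Member struct {
    ID     string
    DC     string
    Voting bool
}

// Group is one Raft cluster, owned by a single DC.
type Group struct {
    OwningDC string   // the only DC allowed to accept writes for this group
    Members  []Member // voters in OwningDC, non-voters everywhere else
}

// SubmitWrite enforces the rule that a mutation may only be submitted in the
// group's owning DC; elsewhere the data is read-only via the non-voting
// replicas.
func SubmitWrite(g Group, localDC string, cmd []byte) error {
    if localDC != g.OwningDC {
        return fmt.Errorf("writes for this group belong to DC %q, not %q", g.OwningDC, localDC)
    }
    // ... hand cmd to the local Raft leader as usual ...
    return nil
}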


Philip Haynes

Nov 16, 2016, 10:20:01 PM
to raft-dev
Hi Henrik,

Thanks for the pointer to Couchbase's XDCR replication. What I was calling a namespace, they are calling a bucket / shard.

> If you want your transactions to be durable against a DC-failure, the commit time must include the round trip time to another DC

Yes. For our use cases, accepting the loss of transactions in the case of DC failure is the better choice, rather than
introducing the vagaries of wide-area comms into the middle of every transaction.

> Seems like you've pretty much figured out the issues

Not really - I have had a first stab. I _may_ have figured out all the details when I have had something in production for 2-3 years and
all the real world issues have been handled. Getting this consensus stuff to reliably work is hard - hence the suggestion for a
more common design to cover off all the issues.


Hi Ayende,

I should have been clearer. Each DC must have its own log, and its log index and terms will of course differ between DCs.
The secondary namespace log index is included to provide consistency within a namespace / shard / bucket.

We have considered the approach of having non-voting followers. This still involves modifying the
protocol so that passive followers do not participate in the commit count (given the assumption above that you don't want
wide-area comms in the middle of the transaction commit).

Whilst this approach does provide the benefit of a globally consistent log, it does mean all transactions must flow through a central DC, otherwise
you see a growth in log replicas. Apart from the legal issues that would make this approach a non-starter for us (i.e. PII between Europe and the US),
it has scaling issues. Alternatively, what you are doing is pulling the namespace concept outside the logs, creating a more operationally
complex situation (since you have multiple logs) that we are trying to avoid like the plague.

> my client / server code needs to know that it can't submit a request to mutate a value outside of the owning DC

So yes, there needs to be an "owning DC" for each transaction. We keep a separate map of transaction space to owning DC, so
this is not something either the client or the server needs to worry about. A sophisticated approach here would involve something like
TrueTime to automate failover - however, our initial implementation will be more agricultural in design.
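
Roughly, in code (hypothetical names; the "agricultural" failover is just an operator-triggered reassignment):

package ownership

import (
    "fmt"
    "sync"
)

// OwnerMap is the separate map of transaction space (namespace / entity) to
// owning DC mentioned above.
type OwnerMap struct {
    mu     sync.RWMutex
    owners map[string]string // namespace -> owning DC
}

// NewOwnerMap builds the map from whatever source of truth holds the
// assignments (configuration, an admin tool, etc.).
func NewOwnerMap(assignments map[string]string) *OwnerMap {
    return &OwnerMap{owners: assignments}
}

// RouteFor returns the DC that must process transactions for a namespace, so
// neither the client nor the server has to reason about ownership itself.
func (m *OwnerMap) RouteFor(namespace string) (string, error) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    dc, ok := m.owners[namespace]
    if !ok {
        return "", fmt.Errorf("no owning DC recorded for namespace %q", namespace)
    }
    return dc, nil
}

// Failover reassigns a namespace to a secondary DC. No TrueTime here: an
// operator makes the call and triggers the change manually.
func (m *OwnerMap) Failover(namespace, newDC string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.owners[namespace] = newDC
}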

Philip

Oren Eini (Ayende Rahien)

Nov 17, 2016, 2:06:40 AM
to raft...@googlegroups.com
Given PII limitations, what cross DC information do you have, then?


Henrik Ingo

Nov 18, 2016, 5:53:41 AM
to raft...@googlegroups.com
On Thu, Nov 17, 2016 at 5:20 AM, Philip Haynes <haynes...@gmail.com> wrote:
>> Seems like you've pretty much figured out the issues
> Not really - I have had a first stab. I _may_ have figured out all the details when I have had something in production for 2-3 years and
> all the real world issues have been handled. Getting this consensus stuff to reliably work is hard - hence the suggestion for a
> more common design to cover off all the issues.

The problem with aspiring towards a general-purpose solution is that
this is an area where you have to choose a trade-off, and different
trade-offs are all valid and optimal for different applications.

For example, after the initial shock, many architects accept that if
they want durable multi-DC redundancy, they need to pay the price and
simply wait for the data to arrive at the other DCs. Also, it
turns out that, for example, most user-facing applications can very well
wait 300 ms on writes, as long as reads are fast. (Page load times must
be <100 ms, but if I just spent a minute manually typing something into
a form, it's ok to block 300 ms when saving it.)

On the other hand, for some other applications such a latency on writes is not ok.
An interesting case is applications which are mostly ok, but have
some hot-spot records that are updated 100 times per second.

Yet one more variant of solving this problem is the case of
asynchronous multi-master replication, such as is seen in Cassandra
(when using consistency level of 1). In this case writes will
trivially "succeed" and be fast on the local node, and the hard
science is how to reconcile that with potentially conflicting writes
coming from other nodes. And ultimately there's no guarantee that the
"successful" write will ever actually reach a majority of the cluster
such that it would have been committed in a majority sense.

IMO the most interesting approach to solving this problem is perhaps
what is called consistency levels in some products, and write concern
in MongoDB. With such a concept, all nodes in a cluster use the same
replication whether in the same DC or over a WAN. (In the case of
MongoDB, a very Raft-like replication, which does however pre-date
Raft.) The server then provides enough information and hooks, and
leaves it up to the client to decide whether to wait for a majority of
nodes to commit the transaction, or some other number, such as just
the primary alone. In the case of MongoDB there's even the (rarely
used) functionality to tag your servers with identities or groupings,
so that an application can ask the cluster to confirm (for example)
when the transaction was committed on at least 1 server in the US and
1 in the EU, but not care about the DC in APAC. The existence of voting
and non-voting nodes is similarly a common feature that can be useful.
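
As a sketch of the general idea (purely illustrative, not MongoDB's actual API): the client attaches a requirement to each write, and the server decides when enough replicas have acknowledged it.

package writeconcern

// Ack records that a replica in a given DC has persisted up to a log index.
type Ack struct {
    DC    string
    Index uint64
}

// Requirement expresses a write concern such as "at least 1 server in US and
// 1 in EU must have the entry"; a DC that is not listed (e.g. APAC) is simply
// not waited for.
type Requirement struct {
    MinPerDC map[string]int
}

// Satisfied reports whether the acknowledgements received so far cover the
// given log index. The server blocks the client's commit response until this
// returns true (or a timeout fires).
func (r Requirement) Satisfied(acks []Ack, index uint64) bool {
    counts := make(map[string]int)
    for _, a := range acks {
        if a.Index >= index {
            counts[a.DC]++
        }
    }
    for dc, need := range r.MinPerDC {
        if counts[dc] < need {
            return false
        }
    }
    return true
}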

Your requirement to write different data to different "local" nodes or
shards is yet another, somewhat orthogonal dimension that may or may not
be part of someone's requirements.

Beyond this there are interesting possibilities for further
optimizations. Heidi Howard's "Flexible Paxos" is somewhat related.
(...but won't actually provide as much value for the multi-DC problem
as you might first think, because fundamentally it doesn't circumvent
the speed of light.) At OSDI there was an interesting article
"Incremental Consistency Guarantees for Replicated Objects", which
allows applications to speculatively continue executing after a
commit, while still waiting for the majority confirmation.

henrik

Philip Haynes

Nov 20, 2016, 7:03:59 PM
to raft-dev
Thanks for your reply Henrik.

Completely agree that there are inevitable hard compromises that must be made: either wait until consensus is agreed across DCs, or get additional write throughput.
Our local node sharding is not an issue; were we processing retail transactions directly, you wouldn't have quite the same performance concerns.

Certainly, choosing the right consistency level with Cassandra across DCs took us quite some time.
Client-side decisioning, however, is not something we would ever do. There are too many moving parts for us to push the responsibility
for the right consensus trade-offs onto the application programmer. This is the sort of thing we do server-side, as a combined engineering /
operations group, to minimise the chances of things going south.

The OSDI paper you reference is interesting - both directly and through its reference list. It does quite a good job of characterising
the tension between the demand for consistency and the demand for performance. It does jog the thinking, however, on how one could plug
a futures-based consistency model into our APIs.

So you are arguing there is no universal consistency model - just ones for given sets of constraints.


Philip

Oren Eini (Ayende Rahien)

Nov 21, 2016, 1:57:34 AM
to raft...@googlegroups.com
As a note, we started out with the client making those decisions, and we are moving all of them to the server.
The problem is that you need the admin to make those kinds of decisions, not the devs. And you need to be able to change them on the fly, without deploying code.


Henrik Ingo

Nov 21, 2016, 5:21:15 AM
to raft...@googlegroups.com
On Mon, Nov 21, 2016 at 2:03 AM, Philip Haynes <haynes...@gmail.com> wrote:
> So you are arguing there is no universal consistency model - just ones for given sets of constraints.

If there were, surely it would include a majority based durable
commit. But even you yourself aren't using that. So no, I don't
believe there is.

Also for relational databases, the different isolation levels exist
for a good reason, not just historical legacy. I've used READ
UNCOMMITTED once, and it was well justified (to the extent that the
long running count() was justified to begin with...) and "safe" in
that specific app.
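
For what it's worth, a minimal example of opting into that isolation level through Go's standard database/sql package (whether the driver honours it is, of course, driver-specific):

package main

import (
    "context"
    "database/sql"
)

// approximateCount runs a long-running count under READ UNCOMMITTED, trading
// accuracy for not blocking on (or being blocked by) concurrent writers.
func approximateCount(ctx context.Context, db *sql.DB) (int64, error) {
    tx, err := db.BeginTx(ctx, &sql.TxOptions{
        Isolation: sql.LevelReadUncommitted,
        ReadOnly:  true,
    })
    if err != nil {
        return 0, err
    }
    defer tx.Rollback()

    var n int64
    // "orders" is a placeholder table name for the example.
    err = tx.QueryRowContext(ctx, "SELECT COUNT(*) FROM orders").Scan(&n)
    return n, err
}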

Diego Ongaro

Nov 29, 2016, 12:53:58 AM
to raft...@googlegroups.com
Hi Philip,

I'm a bit late to jump in here, but I do think this topic is relevant to raft-dev. The thread has successfully argued that there won't be a one-size-fits-all solution, but should we come back around to talking about your design?

I didn't quite understand the requirements in your original descriptions. Does each key range map to one DC where all writes to those keys will take place? And is that mapping static?

If so, I'd suggest that each datacenter have (1) a Raft group for the key range it owns and (2) for each remote DC, a set of servers that are nonvoting members following the remote DC's Raft group. I think this would be cleaner than trying to somehow unify the logs.
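
For what it's worth, some Raft libraries support this arrangement directly. As one possible illustration (an assumption on my part, not something settled in the thread), with hashicorp/raft the remote observers could be added roughly like this:

package main

import (
    "time"

    "github.com/hashicorp/raft"
)

// addRemoteObservers joins servers in other DCs to this DC's Raft group as
// non-voting members: they receive the log for replication, but never count
// towards elections or the commit quorum, so commit latency stays local.
func addRemoteObservers(r *raft.Raft, observers map[raft.ServerID]raft.ServerAddress) error {
    for id, addr := range observers {
        // Must be called on the leader; prevIndex 0 means no precondition.
        if err := r.AddNonvoter(id, addr, 0, 30*time.Second).Error(); err != nil {
            return err
        }
    }
    return nil
}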

Best,
Diego

Philip Haynes

Dec 5, 2016, 5:05:50 PM
to raft-dev
Hi Diego,

Definitely agree that one size does not fit all. A WAN solution looks very different to a LAN one, from both a software and an operations perspective.

We don't have a key range per se, but assume that a given organisation will write only to a single DC at a time, with a reasonably long window to fail over to a secondary DC if required. Our key is a Long that has no particular DC mapping.

My initial instinct was very much as you suggest: non-voting followers in the remote DC, and no modification of the core RAFT logs.
Our ops team has a complex environment to manage, and when our systems move beyond the basics, we get into problems.
From an operational perspective, there was pessimism that multiple logs and replicas from each DC could be operationally managed correctly at scale. A single log is one thing to set up, configure / monitor and ensure is right in each DC.

Having started to go down the unified-log approach, I have noticed that there are other benefits, such as being able to manage, at the application level, execution of the FSM at consensus depending on which DC and jurisdiction you are running from.

Philip

