We are in the process of upgrading a number of our transaction systems to support multiple, global data centres for redundancy, legal and scale reasons.
Now our needs and capabilities are much more modest than Google's, so it looks like we don't need to replicate something as complex as Spanner,
which requires atomic clocks, sophisticated ops / SRE teams and the like.
Rather, it looks like we can get what we need through a relatively simple extension of our RAFT implementation, and that this type of extension would
be fairly generic. My questions are thus:
a) Is anyone else doing something like this, and how far have you got with your implementation?
b) Is extending RAFT to support multi-DC operation of interest to others, or do you think it is off topic?
c) Would anyone be interested in collaborating on describing such an extension?
Kind Regards,
Philip Haynes
ThreatMetrix Inc
Our RAFT cluster processes transactions for different entities (customers). We have a wholesale business model,
so the number of entities we worry about is dramatically smaller than for those with a retail model such as Google's. Thus,
to simplify our transaction model, entities are constrained to perform transactions from one DC at a time
(i.e. a global strong leader) rather than supporting cross-DC transactions.
External consistency is provided from the leader DC and other DCs are updated asynchronously (again in contrast to Spanner).
This also means global leader election for an entity is different to local cluster leader election.
For operational simplicity, our goal is a single database log rather than every client having their own replicated database.
Current thinking is to introduce a namespace into the log index, along with a namespace log index, so that irrespective of
where a transaction is done, the transaction log is sequential from the perspective of the namespace.
To support cross-DC replication, we further add the originating DC of a log entry and its remote log index.
In this way, within a single cluster, the protocol is unchanged apart from the additional fields.
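To make that concrete, the extended entry would look something like the following rough Go sketch. The field names are made up for
illustration and are not from any particular implementation:

// Sketch of an extended Raft log entry carrying the extra fields described above.
type LogEntry struct {
    // Standard Raft fields.
    Term  uint64 // leader term when the entry was created
    Index uint64 // position in this cluster's local log

    // Namespace fields: every entry belongs to a namespace, and the namespace
    // index increases by one per entry in that namespace, regardless of which
    // DC produced the entry.
    Namespace      string
    NamespaceIndex uint64

    // Cross-DC fields: where the entry was originally committed and its index
    // in that DC's log, so a receiving DC can detect gaps and apply entries
    // for a namespace in order.
    OriginDC    string
    OriginIndex uint64

    Command []byte // opaque state-machine command
}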
Furthermore, after a bit of design to and fro, we are looking to completely separate out cross-DC replication.
One design option was to piggyback on the replication logic already within the protocol. We decided that,
with the strong DC leaders, introducing logic to exclude the foreign remotes from voting for the leader and from the
commit process was too hard. Furthermore, execution of an FSM on commit in a remote DC may have different
logic from a local follower (e.g. a foreign US DC may not have the data needed for an EU transaction due to
data residency requirements).
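To give a sense of what "completely separating out" cross-DC replication might look like, here is a rough Go sketch of an asynchronous
shipper that sits entirely outside the commit path: it reads entries the local cluster has already committed and pushes them to a remote
DC. The interfaces and names are illustrative only; it assumes the LogEntry sketch above and Go's standard time package.

// CommittedLog exposes entries the local Raft cluster has already committed.
// RemoteDC accepts batches of entries shipped from the owning DC.
// Both interfaces are hypothetical stand-ins for whatever the real implementation provides.
type CommittedLog interface {
    EntriesSince(index uint64, max int) ([]LogEntry, error)
}

type RemoteDC interface {
    Append(entries []LogEntry) (lastAcked uint64, err error)
}

// shipTo pushes committed entries to one remote DC, entirely outside the local
// commit path, so wide-area latency never sits inside a transaction commit.
func shipTo(log CommittedLog, remote RemoteDC, start uint64) {
    next := start
    for {
        entries, err := log.EntriesSince(next, 256)
        if err != nil || len(entries) == 0 {
            time.Sleep(100 * time.Millisecond) // nothing new (or a transient error); retry
            continue
        }
        acked, err := remote.Append(entries)
        if err != nil {
            time.Sleep(time.Second) // remote unreachable; back off and retry
            continue
        }
        next = acked + 1 // resume after the last entry the remote acknowledged
    }
}

The point is that wide-area failures and latency live in this loop, never inside a local transaction commit.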
I hope this provides a sense of the sort of modifications we are considering, and that we are heading in a
direction in the spirit of the protocol.
Thanks for the pointer to Couchbase's XDCR replication. What I was calling a namespace they are calling a bucket / shard.
> If you want your transactions to be durable against a DC-failure, the commit time must include the round trip time to another DC
Yes. For our use cases, accepting the loss of transactions in the case of DC failure is the better choice, rather than bringing
the vagaries of wide-area comms into the middle of every transaction.
> Seems like you've pretty much figured out the issues
Not really - I have had a first stab. I _may_ have figured out all the details once I have had something in production for 2-3 years and
all the real-world issues have been handled. Getting this consensus stuff to work reliably is hard - hence the suggestion for a
more common design to cover off all the issues.
Hi Ayende,
I should have been clearer. Each DC must have its own log, and its log index and terms will of course differ between DCs.
The secondary namespace log index is included to provide consistency within a namespace / shard / bucket.
We have considered the approach of having non-voting followers. This still involves modifying the
protocol so that passive followers do not participate in the commit count (with the assumption above that you don't want wide-area comms
in the middle of the transaction commit).
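By way of illustration, the commit-count change amounts to something like the Go sketch below. It uses made-up names and the standard
sort package, and it elides Raft's requirement that the entry be from the current term before committing.

// Peer describes a cluster member as the leader sees it. Voting is false for
// passive followers in remote DCs: they receive entries but never count
// towards commitment and never vote in elections.
type Peer struct {
    ID         string
    Voting     bool
    MatchIndex uint64 // highest log index known to be replicated on this peer
}

// commitIndex returns the highest index replicated on a majority of *voting*
// members; the leader's own log counts as one voting member.
// (Raft's current-term check on commitment is omitted from this sketch.)
func commitIndex(leaderLast uint64, followers []Peer) uint64 {
    matches := []uint64{leaderLast}
    for _, p := range followers {
        if p.Voting {
            matches = append(matches, p.MatchIndex)
        }
    }
    sort.Slice(matches, func(i, j int) bool { return matches[i] > matches[j] })
    // With matches sorted descending, the element at len/2 is replicated on a
    // majority of voting members.
    return matches[len(matches)/2]
}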
Whilst this approach does provide the benefit of a globally consistent log, it does mean all transactions must flow through a central DC, otherwise
you see a growth in log replicas. Apart from legal issues that would make this approach a non-starter for us (e.g. PII between Europe and the US),
it has scaling issues. Alternatively, what you are doing is pulling the namespace concept outside the logs, creating a more operationally
complex situation (since you have multiple logs) that we are trying to avoid like the plague.
> my client / server code need to know that it can't submit a request to mutate a value outside of the owning DC
So yes, there needs to be an "owning DC" for a transaction, for which we keep a separate map of transaction space to owning DC, so
this is not a problem for either the client or the server to worry about. A sophisticated approach here would involve something like
TrueTime to fail over automatically - however, our initial implementation will be more agricultural in design.
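The map itself is conceptually no more than the following Go sketch (hypothetical names, standard sync package); the interesting work is
in the failover path, not the lookup.

// ownership maps an entity / namespace to the DC currently allowed to accept
// writes for it. It is kept and consulted server-side so clients never need
// to reason about placement.
type ownership struct {
    mu      sync.RWMutex
    ownerOf map[string]string // namespace -> owning DC
    localDC string
}

// route decides what to do with a write for the given namespace: handle it
// locally, or tell the caller which DC to forward it to.
func (o *ownership) route(namespace string) (local bool, owner string) {
    o.mu.RLock()
    defer o.mu.RUnlock()
    owner = o.ownerOf[namespace]
    return owner == o.localDC, owner
}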
Philip
Completely agree that there are inevitable hard compromises that must be made: either wait until consensus is agreed across DCs, or get additional write throughput.
Our local node sharding would not be an issue were we processing retail transactions directly - you wouldn't have quite the same performance concerns.
Certainly, choosing the right consistency level with Cassandra across DCs took us quite some time.
Client-side decisioning, however, is not something we would ever do. There are too many moving parts for us to push the responsibility
for the right consensus trade-offs onto the application programmer. This is the sort of thing we do server-side, as a combined engineering /
operations group, to minimise the chances of things going south.
The OSDI paper you reference is interesting - both directly and for its reference list. It does quite a good job of characterising
the demand for weaker consistency versus the demand for performance. It does, however, jog the thinking on how one could plug
a futures-based consistency model into our APIs.
So you are arguing there is no universal consistency model - just ones for given sets of constraints.
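To jog that thinking a little further, a futures-based consistency level might surface in an API roughly as below. This is purely a
sketch with made-up names: a write never blocks on wide-area comms, and the caller chooses whether to wait on the future.

// Consistency selects how far a write must have propagated before the
// returned future resolves. Purely illustrative.
type Consistency int

const (
    LocalCommit   Consistency = iota // committed by the owning DC's local Raft cluster
    RemoteDurable                    // also acknowledged by at least one remote DC
)

// WriteResult is a minimal future: Done is closed on success, or receives an
// error, once the requested consistency level has been reached.
type WriteResult struct {
    Done <-chan error
}

// Store is a stand-in for the transaction API; Write returns as soon as the
// write is durable in the owning DC and hands back a future for the rest.
type Store interface {
    Write(namespace string, cmd []byte, level Consistency) (WriteResult, error)
}

A caller happy to lose transactions on DC failure simply ignores the future; one that is not waits on it outside the critical path.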
Philip
Definitely agree that one size does not fit all. A WAN solution looks very different from a LAN one, both from a software and an operations perspective.
We don't have a key range per se, but assume that a given organisation will write to only a single DC at a time, with a reasonably long time to fail over to a secondary DC if required. Our key is a Long that has no particular DC mapping.
My initial instinct was very much as you suggest: non-voting followers in the remote DC, and not modifying the core RAFT logs.
Our ops team has a complex environment to manage, and when our systems move beyond the basics, we get into problems.
From an operational perspective, there was pessimism that multiple logs and replicas from each DC could be managed correctly at scale. With a single log there is one thing to set up, configure / monitor and ensure is right in each DC.
Having started to go down the unified log approach, I have noticed that there are other benefits, such as being able to manage, at the application level, how the FSM is executed at consensus depending on which DC and jurisdiction you are running in.
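Roughly, that application-level hook could look like the Go sketch below, reusing the LogEntry fields from earlier; allowedInJurisdiction
and apply are placeholders for the real residency policy and command application, not anything we have implemented.

type stateMachine struct {
    jurisdiction string            // e.g. "EU" or "US"
    lastApplied  map[string]uint64 // namespace -> last applied namespace index
}

// allowedInJurisdiction is a placeholder for the real data-residency policy.
func allowedInJurisdiction(namespace, jurisdiction string) bool { return true }

// apply is a placeholder for applying a command to the real state machine.
func (s *stateMachine) apply(cmd []byte) error { return nil }

// applyEntry is called once an entry is committed. The same log is shipped
// everywhere, but what the FSM does with an entry can depend on the DC it
// runs in - e.g. a US DC skipping the payload of an EU entry it may not hold.
func (s *stateMachine) applyEntry(e LogEntry) error {
    if !allowedInJurisdiction(e.Namespace, s.jurisdiction) {
        // Track the namespace index so sequencing is preserved, but do not
        // materialise data this jurisdiction is not allowed to hold.
        s.lastApplied[e.Namespace] = e.NamespaceIndex
        return nil
    }
    if err := s.apply(e.Command); err != nil {
        return err
    }
    s.lastApplied[e.Namespace] = e.NamespaceIndex
    return nil
}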
Philip