Part of Voldemort's design is that consensus is not used in
replication. Consensus is used to achieve the strong sort of
consistency that's typically associated with distributed commit and
atomic multicast protocols. Atomic multicast means that either all
machines receive the same sequence of updates or none do -- total
order and atomicity: receipt of message N is not acknowledged until
all machines have acknowledged receipt of message N-1, and a message
is not considered "stable" until it has been acknowledged by all
machines. Beautiful and proven algorithms exist for these problems,
but they are tricky to implement in a manner that performs well.
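To make that ordering rule concrete, here's a toy sketch in Java of
the acknowledgement discipline described above (everything in it is
illustrative, not Voldemort code): message N may not be acknowledged
until message N-1 is stable, and N only becomes stable once every
member has acknowledged it.

    import java.util.*;

    // Toy sketch of the atomic multicast acknowledgement rule; the
    // class and its members are illustrative, not part of Voldemort.
    class TotalOrderSketch {
        private final Set<String> members;
        private final Map<Long, Set<String>> acks = new HashMap<>();
        private long stable = 0; // highest message acked by *all* members

        TotalOrderSketch(Set<String> members) {
            this.members = members;
        }

        // A member acknowledges message n; returns true once n is stable.
        boolean acknowledge(String member, long n) {
            if (n != stable + 1) // N may not be acked before N-1 is stable
                throw new IllegalStateException(
                        "message " + (stable + 1) + " is not yet stable");
            Set<String> acked = acks.computeIfAbsent(n, k -> new HashSet<>());
            acked.add(member);
            if (acked.containsAll(members)) {
                stable = n; // only now may the group move on to n + 1
                return true;
            }
            return false;
        }
    }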
Vector clocks in Dynamo provide a partial ordering, which alongside
quorums allows the client to see the most recently written version,
i.e., you are able to "read your writes". Read-write conflicts are
prevented when R+W > N, and when W > ceil(N/2) *write-write* conflicts
are also prevented (see the sketch after the list below). In practice,
this is the sort of consistency most people will want. It's intuitive
and makes sense to their end-users. Even when R+W < N (a choice made
to reduce latency or provide for higher availability with less
hardware, something I'll discuss in more detail later), it's important
to note that:
1) Not being able to read your writes only occurs under failure
conditions: the consistent routing strategy still directs you to the
same replicas to which the earlier value was written
2) This weak form of eventual consistency still means *consistency is
eventually achieved*: the TCP messages sent by the coordinator (with
Java and C++ clients, the coordinator is the client) will shortly
arrive at all live replicas (including replicas that are slow and may
have "not made it" into a quorum).
Even the weak form of eventual consistency described in point 2 above
is a stronger guarantee than what conventional database replication
provides: effectively, when you're using MySQL replication or Slony,
you're choosing a system where W is 1 (or 0, if you're not using the
full guarantees of InnoDB, Postgres or another transactional system),
where you are not guaranteed to read from the same replica as the one
you wrote to (i.e., very frequently, you will *not* be able to read
your writes even under normal conditions) and where there is no
guarantee of eventual consistency (this is "potential consistency").
This isn't a jab at these systems (several NoSQL systems in fact
employ the same approach for replication): it's difficult to build a
well-performing system based on replicating a totally ordered commit
log, and doing so is very frequently not required.
In practice, other approaches are chosen: eventual consistency, weaker
forms of virtual synchrony (reliable multicast/distributed commit are
its strong forms) such as the "timeline consistency" used by
PNUTS/Sherpa, a single elected master (e.g., BigTable, where only a
single tablet server may answer requests for a given key range, with
Paxos/Chubby used only to provide a consistent view of cluster
membership and request routing; partitioning allows this approach to
scale) and others.
Eventual consistency has the advantage of being simpler to implement
(no need for a low-latency distributed file system or a lock manager,
no need to implement "Spanner" on top to allow multi-site operation)
and to operate (the system is symmetric: all nodes are functionally
the same and can be configured with the same hardware). It's reliable
in the guarantees it provides. It does come at the cost of a more
complex API to develop against (application developers need to be
aware of versioning, vector clocks and the chosen consistency
settings).
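For example, a read-modify-write against a store looks roughly like
the snippet below, adapted from the standard client usage (the
bootstrap URL, store name and key are placeholders):

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VersionedReadModifyWrite {
        public static void main(String[] args) {
            // Placeholders: point the factory at your own cluster/store.
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
            StoreClient<String, String> client = factory.getStoreClient("my_store");

            // get() returns a Versioned value carrying the vector clock.
            Versioned<String> value = client.get("some_key");
            if (value == null) {
                // No existing version: a plain put() starts a new version.
                client.put("some_key", "initial value");
            } else {
                // Modify and put back under the version we read, so the
                // servers can detect concurrent, conflicting writes.
                value.setObject(value.getValue() + " (updated)");
                client.put("some_key", value);
            }
        }
    }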
To answer the question at hand ("why are there additional consistency
mechanisms in Dynamo"), we have to examine the downside to using
quorums to provide strong consistency guarantees in *all* cases.
When R+W > N and W > ceil(N/2), you lose high availability in the case
of a network partition. Assume you have four nodes connected to two
leaf switches, two nodes on each. If N=4, R=2, W=3 (i.e., any read
quorum is guaranteed to contain a version with the latest vector
clock), then if either switch fails, the system is no longer able to
take writes. This isn't an esoteric scenario: network equipment is the
"last bastion of mainframe computing" (<
http://perspectives.mvdirona.com/2009/12/19/NetworkingTheLastBastionOfMainframeComputing.aspx
>). Large read or write quorums also mean additional latency and less
fault tolerance for these operations (with N=4, W=3, R=2, write
availability could be lost if two nodes responsible for a specific key
fail).
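Put in terms of failures tolerated: a read only needs R of the N
replicas and a write needs W, so reads survive N-R replica failures
and writes survive N-W. A small sketch (illustrative only, not
Voldemort code) applying this to the example above:

    // Illustrative only: replica failures each operation survives.
    public class QuorumAvailability {

        static boolean writesAvailable(int n, int w, int failed) {
            return n - failed >= w;
        }

        static boolean readsAvailable(int n, int r, int failed) {
            return n - failed >= r;
        }

        public static void main(String[] args) {
            int n = 4, r = 2, w = 3;
            // Losing one leaf switch takes out two of the four replicas:
            System.out.println(writesAvailable(n, w, 2)); // false: 2 left, need 3
            System.out.println(readsAvailable(n, r, 2));  // true: 2 left, need 2
        }
    }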
In short, in an environment spanning multiple switches (or subject to
other causes of correlated multi-machine failures, such as power or
human error), a failure forces us to tune down either our expectation
of availability or our expectation of consistency. In reality, we want
both.
To avoid a completely binary choice between large quorums (and reduced
availability) and weak eventual consistency, Dynamo and Voldemort
implement additional consistency mechanisms. Since there's no totally
ordered commit log that can be replayed, synchronizing an out-of-date
replica isn't straightforward: you can't say "the high water mark on
the recovering machine is X, replay all operations since X and only
then allow requests to be made against the machine".
Like Dynamo, Voldemort implements read repair. As I've explained on
IRC, it's a fairly simple algorithm:
1) Perform a quorum read
2) Write the newest value back to any out-of-date replicas
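In pseudo-Java, the flow looks something like the sketch below. This
is illustrative only, not the actual Voldemort implementation: the
Replica interface is a stand-in for a remote store, and a plain
counter stands in for a vector clock comparison.

    import java.util.*;

    // Illustrative read-repair sketch; types are stand-ins, not Voldemort's.
    class ReadRepairSketch {

        static class Versioned {
            final String value;
            final long version; // stand-in for a vector clock
            Versioned(String value, long version) {
                this.value = value;
                this.version = version;
            }
        }

        interface Replica {
            Versioned get(String key);
            void put(String key, Versioned value);
        }

        static Versioned quorumGet(String key, List<Replica> replicas,
                                   int requiredReads) {
            // 1) Perform a quorum read.
            Map<Replica, Versioned> responses = new LinkedHashMap<>();
            for (Replica r : replicas) {
                responses.put(r, r.get(key));
                if (responses.size() >= requiredReads)
                    break;
            }
            // Pick the most recent version among the responses.
            Versioned newest = null;
            for (Versioned v : responses.values())
                if (v != null && (newest == null || v.version > newest.version))
                    newest = v;
            // 2) Write the newest value back to any out-of-date replica.
            for (Map.Entry<Replica, Versioned> e : responses.entrySet()) {
                Versioned seen = e.getValue();
                if (newest != null && (seen == null || seen.version < newest.version))
                    e.getKey().put(key, newest);
            }
            return newest;
        }
    }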
To avoid the large quorum problem, Voldemort allows "preferred-reads"
and "preferred-writes" to be set in addition to "required-reads" and
"required-writes" (they are configured just like required reads/writes
in stores.xml, on a per-store basis; see the example below). In this
case, we aim for a stronger guarantee of consistency, falling back on
a weaker one (as defined by required reads/writes) if availability is
in jeopardy.
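As a sketch, a store definition with both settings might look like the
fragment below. The store name, persistence, routing and serializer
values are placeholders following the usual stores.xml layout; only
the required/preferred elements are the settings being discussed:

    <store>
      <name>my_store</name>
      <persistence>bdb</persistence>
      <routing>client</routing>
      <replication-factor>3</replication-factor>
      <!-- fall-back (weaker) guarantee -->
      <required-reads>1</required-reads>
      <required-writes>1</required-writes>
      <!-- what we aim for when enough replicas are reachable -->
      <preferred-reads>2</preferred-reads>
      <preferred-writes>2</preferred-writes>
      <key-serializer><type>string</type></key-serializer>
      <value-serializer><type>string</type></value-serializer>
    </store>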
We're also working on an implementation of hinted handoff and sloppy
quorums: I won't go into details on these methods, but you can read
about them in the Dynamo paper.
We also allow a machine that's been brought down to be restored from
its replicas purely by copying (much like how we rebalance to a newly
inserted node). There's a command line tool
(bin/voldemort-admin-tool.sh) and a JMX hook to do so. I've mentioned
this feature elsewhere on this list and we should properly document it
on the site. This is a total copy of all data and may be slow
(depending on how much data you have per node).
We do not presently implement other, constantly running anti-entropy
protocols such as Merkle trees. This isn't a design decision, but
rather a consequence of the fact that we do not have a known robust
and performant implementation of them. If we build one (proper scoping
and an implementation attempt are on our roadmap) or if one is
contributed, we'll use it. However, there's some indication from the
other open source Dynamo implementations that this approach has its
limitations: Dynomite's developer found the overhead of Merkle trees
to be too high; the last time I spoke to the Cassandra community
(March/April of this year), they mentioned that their implementation
hadn't seen large-scale production testing (this may have changed
recently).
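For completeness, the core idea is sketched below (illustrative only,
not an actual Merkle tree implementation: a real one would persist the
hash tree and exchange hashes over the network rather than compare
in-memory maps). Replicas compare a hash per key range and only
descend into ranges whose hashes differ, so locating divergent keys
costs roughly a logarithmic number of range comparisons plus the size
of the actual difference.

    import java.util.*;

    // Illustrative only: the range-hash comparison idea behind
    // Merkle-tree anti-entropy, using in-memory sorted maps as
    // stand-ins for replicas.
    class AntiEntropySketch {

        // Hash of all entries in a range; a real implementation would
        // cache these in a hash tree instead of recomputing them.
        static int rangeHash(SortedMap<String, String> range) {
            int h = 17;
            for (Map.Entry<String, String> e : range.entrySet())
                h = 31 * h + e.getKey().hashCode() + e.getValue().hashCode();
            return h;
        }

        // Recursively narrow down keys that differ between two replicas.
        static void findDivergent(SortedMap<String, String> a,
                                  SortedMap<String, String> b,
                                  List<String> keysToRepair) {
            if (rangeHash(a) == rangeHash(b))
                return; // hashes match: the whole range agrees, skip it
            if (a.size() <= 1 || b.size() <= 1) {
                // Base case: range is tiny, compare the keys directly.
                Set<String> keys = new TreeSet<>(a.keySet());
                keys.addAll(b.keySet());
                for (String k : keys)
                    if (!Objects.equals(a.get(k), b.get(k)))
                        keysToRepair.add(k);
                return;
            }
            // Split the key space at a median key and recurse into halves.
            String mid = new ArrayList<>(a.keySet()).get(a.size() / 2);
            findDivergent(a.headMap(mid), b.headMap(mid), keysToRepair);
            findDivergent(a.tailMap(mid), b.tailMap(mid), keysToRepair);
        }

        public static void main(String[] args) {
            SortedMap<String, String> replicaA = new TreeMap<>(
                    Map.of("k1", "v1", "k2", "v2", "k3", "v3"));
            SortedMap<String, String> replicaB = new TreeMap<>(
                    Map.of("k1", "v1", "k2", "stale", "k3", "v3"));
            List<String> toRepair = new ArrayList<>();
            findDivergent(replicaA, replicaB, toRepair);
            System.out.println("keys needing repair: " + toRepair); // [k2]
        }
    }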
We'd also like to implement publish/subscribe, for the purpose of
synchronizing nodes that have been off-line for a medium-length
period. It's a long-standing "desirement" and would allow other
advanced features to be implemented, such as secondary indices. It
would also make it possible to integrate Voldemort with other
specialized systems (search engines and offline batch/data-warehousing
environments such as Hadoop/MapReduce and OLAP databases) without
performing full extraction of values. There's an issue open for this:
http://code.google.com/p/project-voldemort/issues/detail?id=2
Sorry for the long detour -- I just wanted to explain the reason for
these "additional" consistency mechanisms, why some of them are
present in Dynamo but not in Voldemort and vice versa, and why we
can't just "replay the commit logs".
Thanks,
- Alex