[project-voldemort] Anti-entropy thread?


Freeman, Tim

May 7, 2010, 5:33:28 PM
to project-...@googlegroups.com
Suppose I have three nodes, A, B, and C, replication factor 3, 2 reads, 2 writes. Node A goes down, and during that time we write a new key x. Node A comes back up. Now B and C have a value for x, but not A. Some time later, B goes down. Then someone tries to read x. There are no other read or write operations on x.

What should I expect in this case? I see multiple possibilities:

1) Maybe it reads no x from A, and a fresh value from C, and writes the good one back to A (see the sketch after this list).
2) Maybe it fails because it can't find two x's and we need two reads to be good.
3) Maybe there's some thread that's supposed to eventually check for consistency between nodes, and it might eventually notice that A needs a value for x and copy it from B or C to A. So if enough time passes between A coming up and B going down, everything works, but if too little time passes, the read fails. This is called the anti-entropy thread in the Dynamo paper, if I recall correctly: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf, section 4.7.
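
To make possibility #1 concrete, here's a toy sketch of what I'd expect to happen (this is not Voldemort code; the node names and values are made up, and it assumes an empty response from A counts toward the read quorum, which is exactly the part I'm unsure about):

    import java.util.*;

    public class QuorumReadSketch {
        public static void main(String[] args) {
            // Replica contents after the partial write: A missed x, B and C have it.
            Map<String, Map<String, String>> nodes = new HashMap<>();
            nodes.put("A", new HashMap<>());                     // A was down during the write
            nodes.put("B", new HashMap<>(Map.of("x", "v1")));    // B has x but is down now
            nodes.put("C", new HashMap<>(Map.of("x", "v1")));
            Set<String> down = Set.of("B");

            int requiredReads = 2;
            List<String> responders = new ArrayList<>();
            String freshest = null;
            for (String node : nodes.keySet()) {
                if (down.contains(node)) continue;               // unreachable replica
                responders.add(node);
                String v = nodes.get(node).get("x");
                if (v != null) freshest = v;
            }
            if (responders.size() >= requiredReads && freshest != null) {
                nodes.get("A").put("x", freshest);               // write the good value back to A
                System.out.println("read succeeds, A repaired with " + freshest);
            } else {
                System.out.println("read fails: not enough replicas answered");
            }
        }
    }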

Some people in the chat room helpfully pointed me to voldemort.store.routed.ReadRepairer, which is used from RoutedStore. That code looks to me like it just repairs the records you actually manipulated, rather than proactively enumerating records and repairing broken things that aren't being manipulated, so I'm not seeing an anti-entropy thread.

Tim Freeman
Email: tim.f...@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536



Bruce Ritchie

May 13, 2010, 12:47:46 PM
to project-voldemort
Tim,

To the best of my knowledge, the current version of Voldemort does not
have a background anti-entropy thread on the server repairing data.
Read repair is what is currently offered, though there is work being
done on hinted handoff.

In your example I'd expect #1 to occur.


Regards,

Bruce Ritchie


Yang Y

May 16, 2010, 1:09:05 PM
to project-voldemort
I'm not sure about Voldemort, but generally in a consensus/reliable
broadcast protocol (Paxos, ZooKeeper, etc.), when a dead node comes
back up, it has to acquire the past transactions from the other nodes.
In this case, when A comes up it needs to pull the list of messages
that were delivered while it was dead. Until it has successfully done
so, A is still considered "dead", so you effectively have both A and B
dead, and the system isn't expected to function according to spec.
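
A rough sketch of that recovery rule (generic, not specific to any of these systems; the transaction log shown is invented for illustration):

    import java.util.*;

    public class RecoverySketch {
        public static void main(String[] args) {
            List<String> committedLog = List.of("tx1", "tx2", "tx3"); // cluster-wide history
            int lastApplied = 1;                                      // A had applied only tx1

            // While catching up, A still counts as dead for quorum purposes.
            System.out.println("A alive? " + (lastApplied == committedLog.size()));

            for (int i = lastApplied; i < committedLog.size(); i++) {
                System.out.println("A replays " + committedLog.get(i));
                lastApplied = i + 1;
            }
            System.out.println("A alive? " + (lastApplied == committedLog.size()));
        }
    }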


Yang


Alex Feinberg

Jun 3, 2010, 3:23:30 AM
to project-...@googlegroups.com
These are interesting discussions. I am going to take a little detour
in this message before getting back to Tim and Yang. The Dynamo paper
explains very well how quorums and vector clocks allow a
"read-your-writes" consistent view to be achieved on top of an
eventually consistent system. Yet it also mentions additional
consistency mechanisms such as read repair (the most straightforward
one), sloppy quorums, Merkle trees and anti-entropy, which has often
been a source of confusion. I'll also address consensus, atomic
multicast and the more general problem -- distributed commit -- of
which atomic multicast is an instance.

Part of Voldemort's design is that consensus is not used in
replication. Consensus is used to achieve the strong sort of
consistency that's typically associated with distributed commit and
atomic multicast protocols. Atomic multicast means that either all
machines receive the same sequence of updates or none do -- total
order and atomicity: receipt of message N is not acknowledged until
all machines have acknowledged receipt of message N-1, and a message
is not considered "stable" until it's been acknowledged by all
machines. Beautiful and proven algorithms exist for these problems,
but they are tricky to implement in a manner that performs well.

Vector clocks in Dynamo provide a partial ordering which, alongside
quorums, allows the client to see the most recently written version,
i.e., you are able to "read your writes". Read-write conflicts are
prevented when R+W > N, and when W > ceil(N/2), *write-write*
conflicts are also prevented. In practice, this is the sort of
consistency most people will want. It's intuitive and makes sense to
their end users. Even when R+W < N (a choice made to reduce latency or
provide higher availability with less hardware, something I'll discuss
in more detail later), it's important to note the following two points
(a small arithmetic sketch of the quorum conditions follows them):

1) Not being able to read your writes is only a failure condition:
the consistent routing strategy still directs you to the same replicas
to which the earlier value was written.

2) This weak form of eventual consistency still means *consistency is
eventually achieved*: the TCP messages sent by the coordinator (with
the Java and C++ clients, the coordinator is the client) will shortly
arrive at all live replicas (including replicas that are slow and may
not have "made it" into a quorum).

This is actually a stronger guarantee than what conventional database
replication provides: effectively, when you're using MySQL replication
or Slony, you're choosing a system where W is 1 (or 0, if you're not
using the full guarantees of InnoDB, Postgres or another transactional
system), where you are not guaranteed to read from the same replica as
the one you wrote to (i.e., very frequently you will *not* be able to
read your writes even under normal conditions), and where there is no
guarantee of eventual consistency (this is "potential consistency").
This isn't a jab at these systems (several NoSQL systems in fact
employ the same approach to replication); it's just difficult to build
a well-performing system based on replicating a totally ordered commit
log, and very frequently it isn't required.

In practice, other approaches are chosen: eventual consistency;
weaker forms of virtual synchrony (reliable multicast/distributed
commit are its strong forms), such as the "timeline consistency" used
by PNUTS/Sherpa; the single-elected-master approach (e.g., BigTable,
where only a single Memtable may answer requests for a given key
range, with Paxos/Chubby used only to provide a consistent view of
cluster membership and request routing; partitioning allows this
approach to scale); and others.

Eventual consistency has the advantage of being a simpler system to
implement (no need for a low-latency distributed file system or a lock
manager, and no need to implement "Spanner" on top to allow multi-site
operation) and to operate (the system is symmetric; all nodes are
functionally the same and can be configured with the same hardware).
It's reliable in the guarantees it provides. It does come at the cost
of a more complex API to develop against (application developers need
to be aware of versioning, vector clocks and the chosen consistency
settings).

To answer the question at hand ("why are there additional consistency
mechanisms in Dynamo"), we have to examine the downside to using
quorums to provide strong consistency guarantees in *all* cases.

When R+W > N and W > ceil(N/2), you lose high availability in the
case of a network partition. Assume you have four nodes connected to
two leaf switches. If N=4, R=2, W=3 (i.e., any read quorum is
guaranteed to contain a version with the latest vector clock), then if
either switch fails, the system is no longer able to take writes. This
isn't an esoteric scenario: network equipment is the "last bastion of
mainframe computing"
(http://perspectives.mvdirona.com/2009/12/19/NetworkingTheLastBastionOfMainframeComputing.aspx).
Large read or write quorums also mean additional incurred latency and
less fault tolerance for these operations (with N=4, R=2, W=3, write
availability could be lost if two of the nodes responsible for a
specific key fail).

In short, in an environment spanning multiple switches (or subject to
other causes of correlated multi-machine failures, such as power or
human error), we have to tune down, in a failure scenario, either our
expectation of availability or our expectation of consistency. In
reality, we want both.

To avoid a completely binary choice between large quorums (and
reduced availability) and weak eventual consistency, Dynamo and
Voldemort implement additional consistency mechanisms. Since there's
no totally ordered commit log that can be replayed, synchronizing an
out-of-date replica isn't straightforward: you can't say "the
high-water mark on the recovering machine is X; replay all operations
since X and only then allow requests to be made against the machine".

Like Dynamo, Voldemort implements read repair. As I explained on IRC,
it's a fairly simple algorithm (sketched in code after the two steps
below):

1) Perform a quorum read
2) Write the new value back to out-of-date replicas
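
A minimal sketch of those two steps (illustrative only; the real logic lives in voldemort.store.routed.ReadRepairer and compares vector clocks rather than the scalar version number used here):

    import java.util.*;

    public class ReadRepairSketch {
        // One replica's answer: the value it returned (null if none) and its version.
        record Response(String node, String value, long version) {}

        public static void main(String[] args) {
            Map<String, Map<String, String>> stores = new HashMap<>();
            stores.put("A", new HashMap<>());                  // missing the value
            stores.put("C", new HashMap<>(Map.of("x", "v2"))); // up to date

            // 1) Quorum read: collect the responses for key "x".
            List<Response> responses = List.of(
                    new Response("A", null, 0),
                    new Response("C", "v2", 2));
            Response freshest = responses.stream()
                    .filter(r -> r.value() != null)
                    .max(Comparator.comparingLong(Response::version))
                    .orElseThrow();

            // 2) Write the freshest value back to out-of-date replicas.
            for (Response r : responses)
                if (r.version() < freshest.version())
                    stores.get(r.node()).put("x", freshest.value());

            System.out.println("A after repair: " + stores.get("A"));
        }
    }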

To avoid the large-quorum problem, Voldemort allows "preferred-reads"
and "preferred-writes" to be set in addition to "required-reads" and
"required-writes" (they are configured just like required reads/writes
in stores.xml, on a per-store basis). In this case, we aim for a
stronger guarantee of consistency, falling back on a weaker one (as
defined by required reads/writes) if availability is in jeopardy.
We're also working on an implementation of hinted handoff and sloppy
quorums: I won't go into detail on these methods, but you can read
about them in the Dynamo paper.
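
For example, a store definition along these lines (a sketch from memory; check the stores.xml documentation for the exact element names, and the numbers are illustrative only):

    <store>
      <name>test</name>
      <replication-factor>3</replication-factor>
      <!-- fall back to these if the preferred counts can't be met -->
      <required-reads>2</required-reads>
      <required-writes>2</required-writes>
      <!-- aim for full-replica reads/writes while all nodes are healthy -->
      <preferred-reads>3</preferred-reads>
      <preferred-writes>3</preferred-writes>
    </store>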

We also allow a machine that's been brought down to be restored from
its replicas purely by copying (much like how we rebalance to a newly
inserted node). There's a command-line tool
(bin/voldemort-admin-tool.sh) and a JMX hook to do so. I've mentioned
this feature elsewhere on this list, and we should properly document
it on the site. This is a total copy of all data and may be slow
(depending on how much data you have per node).

We do not presently implement other, constantly running anti-entropy
protocols such as Merkle trees. This isn't due to a design decision,
but rather to the fact that we do not have a known robust and
performant implementation of them. If we build one (proper scoping and
an implementation attempt are on our roadmap), or if one is
contributed, we'll use it. However, there's some indication from the
other open-source Dynamo implementations that this approach has its
limitations: Dynomite's developer found the overhead of Merkle trees
to be too high, and when I last spoke to the Cassandra community
(March/April of this year), they mentioned that their implementation
hadn't seen large-scale production testing (this may have changed
recently).

We'd also like to implement publish/subscribe, for the purpose of
synchronizing nodes that have been offline for a medium-length period.
It's a long-standing "desirement" and would allow other advanced
features to be implemented, such as secondary indices. It would also
make it possible to integrate Voldemort with other specialized systems
(search engines and offline batch/data-warehousing environments such
as Hadoop/MapReduce and OLAP databases) without performing a full
extraction of values. There's an issue open for this:

http://code.google.com/p/project-voldemort/issues/detail?id=2

Sorry for the long detour; I just wanted to explain the reason for
these "additional" consistency mechanisms, why some of them are
present in Dynamo but not in Voldemort and vice versa, and why we
can't just "replay the commit logs".

Thanks,
- Alex
