I have a working initial implementation of Hinted Handoff. It is
presently undergoing testing. I wrote up a wiki on the implementation
and the test plan, please take a look.
http://wiki.github.com/voldemort/voldemort/hinted-handoff
Thanks,
- Alex
--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To post to this group, send email to project-...@googlegroups.com.
To unsubscribe from this group, send email to project-voldem...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.
To deal with that, you could have the newly recovered node request data from the slop store of each node that might have some data, rather than having all nodes with potentially useful data trying to push it to the newly recovered node.
If we assume that nodes never have sure knowledge of a network partition, this approach would imply that every node periodically asks every other node whether there's data for it in the slop store. In the normal case, there's no data there and this ought to be quick, so maybe the n**2 traffic is okay. Or maybe we do have a good model of when network partitions start and end so the n**2 behavior is not required. I haven't dug through the code enough to know.
Tim Freeman
Email: tim.f...@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536
Hey guys,
http://wiki.github.com/voldemort/voldemort/hinted-handoff
Thanks,
- Alex
--
Perhaps measuring the response time on a push and using that as an indicator. It's really cheap and tune-able.
>Perhaps measuring the response time on a push and using that as an indicator. It's really cheap and tune-able.
How exactly would we tune this?
The consequences of tuning it wrong seem bad. If nodes are too willing to postpone pushing because of a slow response time from the recipient, then the push never happens. If they are not willing enough to postpone pushing because of a slow response time from the recipient, then the recipient is unusable. The right tuning parameters would seem to depend on the cluster size, since if the cluster is really big, a recently rebooted recipient might have crashed by the time the sender's experiment to determine the response time has completed, since all the other nodes are doing the same experiment at the same time.
Not knowing how to tune the parameter is exactly why I was suggesting a pull model.
Hmm. The problem with the pull model is the n**2 behavior, and that seems fixable. Suppose the node containing a partition P is down. The proposal at
http://wiki.github.com/voldemort/voldemort/hinted-handoff
says to use a random node to store slop when we observe that a node is down. Suppose that instead, we rehash P a bunch of times, getting a sequence of partition numbers P1, P2, P3, and so forth. Truncate the list after some reasonable number of probes T, which might be 10 or so. We try storing the slop on the node containing P1, and if that node is down we try the node containing P2, and continue on down the list until we succeed or we get to the end. When we move partitions between nodes during rebalancing, we move the associated pending slop with them. When a node comes back up, it can enumerate its partitions and then for each partition derive the same list and query the nodes with those partitions for slop. This way, if our nodes don't have a model of network partitions, we don't have N nodes looking at N nodes, we just have N nodes looking at R*T nodes, where R is the number of partitions per node and R<<N and T is around 10.
Tim Freeman
Email: tim.f...@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536
I could be wrong, but it seems that requests to pull could be smaller
than requests to push (which have payload)?
And getting some statistics to guide rate of sending requests is
easier too, as one can measure speed of getting response through, as
APIs expose that more readily (this is not a fundamental limitation).
-+ Tatu +-
> What are the objectives of this HH implementation? It appears as if
> the main goal is to provide a background mechanism of synchronization
> without requiring a read repair. Are there other goals this
> implementation hopes to accomplish?
> These two statements are true today without the addition of new code
> for HH, so I am unclear as to what HH made better. The situation that
> exists today still exists after this implementation -- a PUT operation
> returns an error that states it failed when in fact it may have
> succeeded without the application being able to tell what the state of
> the request is.
This is correct, the main goal is to provide a way to do
synchronization without a read repair. A read repair requires a quorum
read, which is going to be expensive in a multi-datacenter scenario.
The situation where the network is partitioned off and one partition
is behind another is also much more likely to happen in a
multi-datacenter environment.
What this also accomplishes is when a required-writes succeeds for W
out of N nodes (return success to the application), but some of the
background asynchronous writes fail the put requests for those nodes
are queued up and replayed when the nodes are back online.
> From the wiki, it says that "if required-writes aren’t met by a strict
> quorum, the request is still considered failed (even if hinted handoff
> succeeds". This statement appears to state that:
> - If too few copies were written from the primaries, PUT will return
> an error even though the data was written and replicated to the slop
> stores;
> - Since the handoff nodes are not queried, a GET is unreliable after
> a partial PUT operation.
> I am not sure that HH works for DELETE operations. Imagine a scenario
> with 3 replica/2 required writes store. The DELETE operation
> completes successfully on nodes A and B but not on node C, meaning the
> key still exists there but is now marked as DELETED in a slop store.
> What prevents that key from being read-repaired from C to A and B
> before the Slop Store kicks in to update C?
A delete is done with a version. This way, the delete will eventually
be processed. That being said, in a fully distributed system like
Voldemort deletes are "best-effort", meant only for space reclamation.
Applications should instead use soft deletes, by marking the data as
deleted e.g., a boolean flag in the definition of a value.
> How does this HH implementation work with rebalancing? Are the Slop
> Stores updated with the new node to rebalance to?
I believe this should be handled by the existing metadata invalidation
mechanism, if it isn't, it shouldn't be difficult to fix. I'll make
sure to test this, although I am not sure if that will be a common
scenario (adding nodes to one side of a long standing partition).
> I believe to make an effective HH implementation, nodes must be able
> to determine who their handoff replicas are. There needs to be
> periodic chatter (at both startup and run-time) between the nodes and
> their HH partners to determine what keys and versions the partner is
> storing for the given node. If a node determines that it is out-of-
> date for a given key through this chatter, it does not respond
> directly to the request but instead proxies the request from its HH
> partner.
Problem with assigning peer replicas for a node is that this may lead
to cascading failures / degradation when the original node fails. I do
like the idea of chatter i.e., having nodes periodically pull and find
*who* their partners are and then requests the versions from them
rather than wait for a push.
Thanks,
- Alex
The javadoc at http://project-voldemort.com/javadoc/all/voldemort/client/StoreClient.html#delete%28K%29 says that delete will "Delete any version of the given key which [is] equal to or less than the current versions", which is different from and more useful for the application than what you said above.
You could implement "delete" to do what the javadoc says. Voldemort could manage the boolean flag for the application by implementing delete so it writes a tombstone. Then "delete" would then really delete the record from the application's point of view, and then some batch process comes by occasionally and removes tombstones from the index if they are old enough that they should have propagated to all of the nodes.
I don't immediately see a good way to get reliable results where delete makes a best-effort attempt to immediately reclaim all storage for the record, which is what I think you're saying. What if someone reads the record near the time it's deleted? Suppose N=3, R=W=2, and deletion has deleted one of three copies at the time we read. Is read going to see that one copy is missing and "repair" it, thus leaving it around forever when the deletion completes? If not, how is this avoided?
Tim Freeman
Email: tim.f...@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536
-----Original Message-----
From: project-...@googlegroups.com [mailto:project-...@googlegroups.com] On Behalf Of Alex Feinberg
Sent: Tuesday, August 10, 2010 1:01 PM
To: project-...@googlegroups.com
Hi Mark,
Thanks,
- Alex
--
You are right that there are a lot of machines to poll, but likely the
polling would be round robin with a fixed frequency (with some
randomized ordering to avoid pilling on) so the impact would probably
be latency in finding an update rather than exceptional network
traffic.
I think that this issue could be addressed by making the handoff
target be related to the partition so there was a bounded set of nodes
that had to be polled.
Alex, what do you think? Since we have a working hh, I think we should
go with it, this can be a future improvement that would change the
slop pusher to be a slop puller. :-)
-Jay
HH respects these semantics since what is saved is "delete all
versions prior to X" which remains a valid operation even when
re-ordered.
Others have proposed a hybrid strategy which uses a tombstone and then
does periodic garbage collection and hopes that all operations have
been applied. This would not be bad, though my experience is that any
time period you gave as the maximum time for tombstones to live would
likely be violated in practice since you would have a machine down for
longer than that time and want to bring it back.
You also mentioned the fact that a failed PUT could actually end up
being written via HH. The semantics of PUT are that if it succeeds it
is written, if you get an error there is no guarantee either way. This
is the limitation of any remote RPC-like call that changes the state
of the remote machine since, for example, an IOException due to
timeout may occur either before or after the state change on the
remote node. So the goal for HH is to maintain the same guarantee--if
it is written successfully it can be immediately read back with the
normal quorum guarantees, if you get an error it may eventually
succeed but there is no read quorum guaranteed to return the value.
Hope that makes sense.
-Jay
> Alex, what do you think? Since we have a working hh, I think we should
> go with it, this can be a future improvement that would change the
> slop pusher to be a slop puller. :-)
How would you suggest determing the peers for hinted handoff?
"Neighbours in the ring" approach, I am afraid, would mean
overwhelming nodes holding the partitions that are replicas of the
dead node's partitions.
One theory to investigate (experimentally) would be to use the failure
detector listener, starting the push process (to specific nodes) when
they become available again. I think this is really worth looking at,
along with the "chatter" approach.
Of course since failure detection is done on the client without
server-side routing, if client side routing is used (which is what's
done in majority of cases), we'd need to be periodically sending
canary requests from the servers a way of failure detection, which
would amount to a form of chatter.
I am thinking out loud, but:
We could also use gossip to "multicast" out the information "hey, node
A has a slop for node B". When node B receives the information, node B
would then contact node A asking for the slop (node A wouldn't have to
guess or know whom to poll). Any thoughts on this idea?
- Alex