Hinted Handoff proposal, initial implementation and testplan

Alex Feinberg

unread,

Jul 29, 2010, 5:52:30 PM7/29/10

to project-...@googlegroups.com

Hey guys,

I have a working initial implementation of Hinted Handoff. It is
presently undergoing testing. I wrote up a wiki on the implementation
and the test plan, please take a look.

http://wiki.github.com/voldemort/voldemort/hinted-handoff

Thanks,
- Alex

Rob Adams

unread,

Jul 29, 2010, 5:55:02 PM7/29/10

to project-...@googlegroups.com

Awesome; this is going to be a great step forward for V.

--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To post to this group, send email to project-...@googlegroups.com.
To unsubscribe from this group, send email to project-voldem...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.

Freeman, Tim

unread,

Jul 30, 2010, 7:39:45 PM7/30/10

to project-...@googlegroups.com

I assume one issue is ensuring that the broken node doesn't get flooded receiving hinted handoffs when it comes back up.

To deal with that, you could have the newly recovered node request data from the slop store of each node that might have some data, rather than having all nodes with potentially useful data trying to push it to the newly recovered node.

If we assume that nodes never have sure knowledge of a network partition, this approach would imply that every node periodically asks every other node whether there's data for it in the slop store. In the normal case, there's no data there and this ought to be quick, so maybe the n**2 traffic is okay. Or maybe we do have a good model of when network partitions start and end so the n**2 behavior is not required. I haven't dug through the code enough to know.

Tim Freeman
Email: tim.f...@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536

Hey guys,

http://wiki.github.com/voldemort/voldemort/hinted-handoff

Thanks,
- Alex

--

spence

unread,

Jul 31, 2010, 11:59:21 AM7/31/10

to project-...@googlegroups.com

Perhaps measuring the response time on a push and using that as an indicator. It's really cheap and tune-able.

Mark Rambacher

unread,

Aug 2, 2010, 10:25:15 AM8/2/10

to project-voldemort

What are the objectives of this HH implementation? It appears as if
the main goal is to provide a background mechanism of synchronization
without requiring a read repair. Are there other goals this
implementation hopes to accomplish?

From the wiki, it says that "if required-writes aren’t met by a strict
quorum, the request is still considered failed (even if hinted handoff
succeeds". This statement appears to state that:
- If too few copies were written from the primaries, PUT will return
an error even though the data was written and replicated to the slop
stores;
- Since the handoff nodes are not queried, a GET is unreliable after
a partial PUT operation.
These two statements are true today without the addition of new code
for HH, so I am unclear as to what HH made better. The situation that
exists today still exists after this implementation -- a PUT operation
returns an error that states it failed when in fact it may have
succeeded without the application being able to tell what the state of
the request is.

I am not sure that HH works for DELETE operations. Imagine a scenario
with 3 replica/2 required writes store. The DELETE operation
completes successfully on nodes A and B but not on node C, meaning the
key still exists there but is now marked as DELETED in a slop store.
What prevents that key from being read-repaired from C to A and B
before the Slop Store kicks in to update C?

How does this HH implementation work with rebalancing? Are the Slop
Stores updated with the new node to rebalance to?

I believe to make an effective HH implementation, nodes must be able
to determine who their handoff replicas are. There needs to be
periodic chatter (at both startup and run-time) between the nodes and
their HH partners to determine what keys and versions the partner is
storing for the given node. If a node determines that it is out-of-
date for a given key through this chatter, it does not respond
directly to the request but instead proxies the request from its HH
partner.

Once some of the key design decisions are made and the trade-offs
acknowledged, then the "push" versus "pull" nature of updates can be
decided upon.

- Mark

Freeman, Tim

unread,

Aug 2, 2010, 12:40:59 PM8/2/10

to project-...@googlegroups.com

>Perhaps measuring the response time on a push and using that as an indicator. It's really cheap and tune-able.

How exactly would we tune this?

The consequences of tuning it wrong seem bad. If nodes are too willing to postpone pushing because of a slow response time from the recipient, then the push never happens. If they are not willing enough to postpone pushing because of a slow response time from the recipient, then the recipient is unusable. The right tuning parameters would seem to depend on the cluster size, since if the cluster is really big, a recently rebooted recipient might have crashed by the time the sender's experiment to determine the response time has completed, since all the other nodes are doing the same experiment at the same time.

Not knowing how to tune the parameter is exactly why I was suggesting a pull model.

Hmm. The problem with the pull model is the n**2 behavior, and that seems fixable. Suppose the node containing a partition P is down. The proposal at

http://wiki.github.com/voldemort/voldemort/hinted-handoff

says to use a random node to store slop when we observe that a node is down. Suppose that instead, we rehash P a bunch of times, getting a sequence of partition numbers P1, P2, P3, and so forth. Truncate the list after some reasonable number of probes T, which might be 10 or so. We try storing the slop on the node containing P1, and if that node is down we try the node containing P2, and continue on down the list until we succeed or we get to the end. When we move partitions between nodes during rebalancing, we move the associated pending slop with them. When a node comes back up, it can enumerate its partitions and then for each partition derive the same list and query the nodes with those partitions for slop. This way, if our nodes don't have a model of network partitions, we don't have N nodes looking at N nodes, we just have N nodes looking at R*T nodes, where R is the number of partitions per node and R<<N and T is around 10.

Tim Freeman
Email: tim.f...@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536

spence

unread,

Aug 2, 2010, 1:17:19 PM8/2/10

to project-...@googlegroups.com

I honestly haven't put an effort into thinking it through. Figure I'd put it out there to flush out while running around w/ other stuff.

Though wouldn't there be a similar problem w/ pull if a node gets too busy, and everyone continues to attempt to pull?

Cheers,

-s

Tatu Saloranta

unread,

Aug 2, 2010, 3:26:06 PM8/2/10

to project-...@googlegroups.com

On Mon, Aug 2, 2010 at 10:17 AM, spence <exu...@gmail.com> wrote:
> I honestly haven't put an effort into thinking it through. Figure I'd put
> it out there to flush out while running around w/ other stuff.
> Though wouldn't there be a similar problem w/ pull if a node gets too busy,
> and everyone continues to attempt to pull?

I could be wrong, but it seems that requests to pull could be smaller
than requests to push (which have payload)?
And getting some statistics to guide rate of sending requests is
easier too, as one can measure speed of getting response through, as
APIs expose that more readily (this is not a fundamental limitation).

-+ Tatu +-

Alex Feinberg

unread,

Aug 10, 2010, 4:01:00 PM8/10/10

to project-...@googlegroups.com

Hi Mark,

> What are the objectives of this HH implementation? It appears as if
> the main goal is to provide a background mechanism of synchronization
> without requiring a read repair. Are there other goals this
> implementation hopes to accomplish?

> These two statements are true today without the addition of new code

> for HH, so I am unclear as to what HH made better. The situation that
> exists today still exists after this implementation -- a PUT operation
> returns an error that states it failed when in fact it may have
> succeeded without the application being able to tell what the state of
> the request is.

This is correct, the main goal is to provide a way to do
synchronization without a read repair. A read repair requires a quorum
read, which is going to be expensive in a multi-datacenter scenario.
The situation where the network is partitioned off and one partition
is behind another is also much more likely to happen in a
multi-datacenter environment.

What this also accomplishes is when a required-writes succeeds for W
out of N nodes (return success to the application), but some of the
background asynchronous writes fail the put requests for those nodes
are queued up and replayed when the nodes are back online.

> From the wiki, it says that "if required-writes aren’t met by a strict
> quorum, the request is still considered failed (even if hinted handoff
> succeeds". This statement appears to state that:
> - If too few copies were written from the primaries, PUT will return
> an error even though the data was written and replicated to the slop
> stores;
> - Since the handoff nodes are not queried, a GET is unreliable after
> a partial PUT operation.

> I am not sure that HH works for DELETE operations. Imagine a scenario
> with 3 replica/2 required writes store. The DELETE operation
> completes successfully on nodes A and B but not on node C, meaning the
> key still exists there but is now marked as DELETED in a slop store.
> What prevents that key from being read-repaired from C to A and B
> before the Slop Store kicks in to update C?

A delete is done with a version. This way, the delete will eventually
be processed. That being said, in a fully distributed system like
Voldemort deletes are "best-effort", meant only for space reclamation.
Applications should instead use soft deletes, by marking the data as
deleted e.g., a boolean flag in the definition of a value.

> How does this HH implementation work with rebalancing? Are the Slop
> Stores updated with the new node to rebalance to?

I believe this should be handled by the existing metadata invalidation
mechanism, if it isn't, it shouldn't be difficult to fix. I'll make
sure to test this, although I am not sure if that will be a common
scenario (adding nodes to one side of a long standing partition).

> I believe to make an effective HH implementation, nodes must be able
> to determine who their handoff replicas are. There needs to be
> periodic chatter (at both startup and run-time) between the nodes and
> their HH partners to determine what keys and versions the partner is
> storing for the given node. If a node determines that it is out-of-
> date for a given key through this chatter, it does not respond
> directly to the request but instead proxies the request from its HH
> partner.

Problem with assigning peer replicas for a node is that this may lead
to cascading failures / degradation when the original node fails. I do
like the idea of chatter i.e., having nodes periodically pull and find
*who* their partners are and then requests the versions from them
rather than wait for a push.

Thanks,
- Alex

Freeman, Tim

unread,

Aug 10, 2010, 8:04:08 PM8/10/10

to project-...@googlegroups.com

From: Alex Feinberg

>A delete is done with a version. This way, the delete will eventually
>be processed. That being said, in a fully distributed system like
>Voldemort deletes are "best-effort", meant only for space reclamation.
>Applications should instead use soft deletes, by marking the data as
>deleted e.g., a boolean flag in the definition of a value.

The javadoc at http://project-voldemort.com/javadoc/all/voldemort/client/StoreClient.html#delete%28K%29 says that delete will "Delete any version of the given key which [is] equal to or less than the current versions", which is different from and more useful for the application than what you said above.

You could implement "delete" to do what the javadoc says. Voldemort could manage the boolean flag for the application by implementing delete so it writes a tombstone. Then "delete" would then really delete the record from the application's point of view, and then some batch process comes by occasionally and removes tombstones from the index if they are old enough that they should have propagated to all of the nodes.

I don't immediately see a good way to get reliable results where delete makes a best-effort attempt to immediately reclaim all storage for the record, which is what I think you're saying. What if someone reads the record near the time it's deleted? Suppose N=3, R=W=2, and deletion has deleted one of three copies at the time we read. Is read going to see that one copy is missing and "repair" it, thus leaving it around forever when the deletion completes? If not, how is this avoided?

Tim Freeman
Email: tim.f...@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536

-----Original Message-----
From: project-...@googlegroups.com [mailto:project-...@googlegroups.com] On Behalf Of Alex Feinberg

Sent: Tuesday, August 10, 2010 1:01 PM
To: project-...@googlegroups.com

Hi Mark,

Thanks,
- Alex

--

Jay Kreps

unread,

Aug 10, 2010, 11:40:47 PM8/10/10

to project-...@googlegroups.com

This is a legitimate concern. Pull would certainly alleviate the
problem, though there is no real way to know that one is "newly
recovered" so this would amount to polling for changes.

You are right that there are a lot of machines to poll, but likely the
polling would be round robin with a fixed frequency (with some
randomized ordering to avoid pilling on) so the impact would probably
be latency in finding an update rather than exceptional network
traffic.

I think that this issue could be addressed by making the handoff
target be related to the partition so there was a bounded set of nodes
that had to be polled.

Alex, what do you think? Since we have a working hh, I think we should
go with it, this can be a future improvement that would change the
slop pusher to be a slop puller. :-)

-Jay

Jay Kreps

unread,

Aug 10, 2010, 11:56:18 PM8/10/10

to project-...@googlegroups.com

The semantics of delete are the following: "Attempt to delete any
version prior to X". This is different from "Mark as deleted the key Y
at version X". Only the former frees up space but only the later
guarantees that get(k) returns a null when it immediately succeeds a
delete(k). Arguably we should have both, but we only implemented the
former, as the later is easily implemented with some marker/tombstone
value, and the only real use cases we have seen for delete are
actually interested in freeing space.

HH respects these semantics since what is saved is "delete all
versions prior to X" which remains a valid operation even when
re-ordered.

Others have proposed a hybrid strategy which uses a tombstone and then
does periodic garbage collection and hopes that all operations have
been applied. This would not be bad, though my experience is that any
time period you gave as the maximum time for tombstones to live would
likely be violated in practice since you would have a machine down for
longer than that time and want to bring it back.

You also mentioned the fact that a failed PUT could actually end up
being written via HH. The semantics of PUT are that if it succeeds it
is written, if you get an error there is no guarantee either way. This
is the limitation of any remote RPC-like call that changes the state
of the remote machine since, for example, an IOException due to
timeout may occur either before or after the state change on the
remote node. So the goal for HH is to maintain the same guarantee--if
it is written successfully it can be immediately read back with the
normal quorum guarantees, if you get an error it may eventually
succeed but there is no read quorum guaranteed to return the value.

Hope that makes sense.

-Jay

Alex Feinberg

unread,

Aug 11, 2010, 12:39:53 AM8/11/10

to project-...@googlegroups.com

> This is a legitimate concern. Pull would certainly alleviate the
> problem, though there is no real way to know that one is "newly
> recovered" so this would amount to polling for changes.
>
> You are right that there are a lot of machines to poll, but likely the
> polling would be round robin with a fixed frequency (with some
> randomized ordering to avoid pilling on) so the impact would probably
> be latency in finding an update rather than exceptional network
> traffic.
>
> I think that this issue could be addressed by making the handoff
> target be related to the partition so there was a bounded set of nodes
> that had to be polled.

> Alex, what do you think? Since we have a working hh, I think we should
> go with it, this can be a future improvement that would change the
> slop pusher to be a slop puller. :-)

How would you suggest determing the peers for hinted handoff?
"Neighbours in the ring" approach, I am afraid, would mean
overwhelming nodes holding the partitions that are replicas of the
dead node's partitions.

One theory to investigate (experimentally) would be to use the failure
detector listener, starting the push process (to specific nodes) when
they become available again. I think this is really worth looking at,
along with the "chatter" approach.

Alex Feinberg

unread,

Aug 11, 2010, 12:47:13 AM8/11/10

to project-...@googlegroups.com

> One theory to investigate (experimentally) would be to use the failure
> detector listener, starting the push process (to specific nodes) when
> they become available again. I think this is really worth looking at,
> along with the "chatter" approach.

Of course since failure detection is done on the client without
server-side routing, if client side routing is used (which is what's
done in majority of cases), we'd need to be periodically sending
canary requests from the servers a way of failure detection, which
would amount to a form of chatter.

I am thinking out loud, but:

We could also use gossip to "multicast" out the information "hey, node
A has a slop for node B". When node B receives the information, node B
would then contact node A asking for the slop (node A wouldn't have to
guess or know whom to poll). Any thoughts on this idea?

- Alex

Alex Esterkin

unread,

Sep 10, 2010, 4:56:47 PM9/10/10

to project-voldemort

I agree with Mark.

Though, I would not be as cautious in drawing some early conclusions.
I am afraid that with inconsistency resolution techniques and client-
side routing, "pull" based HH may be unrealistic. No doubt, it can be
done, but it is likely a ten times bigger effort. All the mentioned
"pull" based solution ideas resemble distributed cache coherence
algorithms too much, and this may be indicative of what a "pull" based
implementation might entail...

- Alex Esterkin

Reply all

Reply to author

Forward