Making rebalance resumable

chinmay gupte

Mar 11, 2016, 5:06:13 AM

to project-voldemort

Hi community,

Rebalancing a Voldemort cluster, as it currently stands, does not tolerate transient node failures.

Let's say we are zone-expanding a Voldemort cluster and the partitions have almost finished moving to the new nodes in the new zone, but just before the rebalance completes, one of the new nodes fails. The entire rebalance fails and automatically rolls back to the original cluster state (which is a good thing!).

Let's assume the data is not updated often. In that case, even though the data migration is almost complete and only needs a final verification and sync, the entire migration restarts from scratch when we run the rebalance again. The impact is greater on big clusters with huge data sizes and thousands of partition moves.

Has anyone given thought to making the rebalance process resumable, so that we do not lose the work already done?

One possible solution would be a Merkle-tree-based approach (like the one used by Dynamo and Cassandra for out-of-band repair): the stealer node asks the donor node for a Merkle tree of the partitions to be streamed, compares it with the Merkle tree of its own existing data, and then requests only the missing subset, not the entire partition, via the fetch-and-update admin command (since fetch-and-update is used internally by rebalance).
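To make the comparison step concrete, here is a minimal sketch in Python. It is not Voldemort code: the bucket count, the hashing scheme, and the names `build_tree`, `bucket_of`, and `differing_buckets` are all assumptions made for illustration, and versions (vector clocks) are ignored. Each node hashes the entries of one partition into a fixed number of buckets, builds a tree over the bucket digests, and the two trees are then walked from the root to find the buckets that differ.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: bytes, num_buckets: int) -> int:
    # Stable bucket assignment; donor and stealer must use the same function.
    return int.from_bytes(_h(key)[:8], "big") % num_buckets


def build_tree(entries: dict, num_buckets: int = 8) -> list:
    """Return the tree as a list of levels, leaves first and root last.

    `entries` maps key bytes to value bytes for one partition.
    `num_buckets` must be a power of two.
    """
    buckets = [[] for _ in range(num_buckets)]
    for key, value in entries.items():
        buckets[bucket_of(key, num_buckets)].append((key, value))
    leaves = []
    for bucket in buckets:
        digest = hashlib.sha256()
        for key, value in sorted(bucket):
            # Length-prefix each field so concatenations stay unambiguous.
            digest.update(len(key).to_bytes(4, "big") + key)
            digest.update(len(value).to_bytes(4, "big") + value)
        leaves.append(digest.digest())
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_buckets(donor: list, stealer: list) -> list:
    """Walk both trees from the root and return the leaf indexes that differ."""
    stack = [(len(donor) - 1, 0)]
    out = []
    while stack:
        level, idx = stack.pop()
        if donor[level][idx] == stealer[level][idx]:
            continue  # Whole subtree matches; nothing to fetch here.
        if level == 0:
            out.append(idx)
        else:
            stack.append((level - 1, 2 * idx))
            stack.append((level - 1, 2 * idx + 1))
    return sorted(out)
```

In this sketch the stealer would then ask the donor only for the keys that fall into the returned buckets. With more buckets, a single missing key costs less to re-stream, at the price of a larger tree to exchange.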

I am digging into the code to gauge the feasibility of such an approach, but I would like to get an early opinion from the community at large, especially if anyone has already thought about it.

Thanks much,

Chinmay

cgu...@apple.com

Mar 16, 2016, 8:39:56 PM

to project-voldemort

Gentle bump! Any opinions on this?

Cheers,

Chinmay

Félix GV

Mar 17, 2016, 11:57:23 AM

to project-voldemort

Async tasks in general should be resumable, but unfortunately they're not. There is a similar problem with Read-Only file fetches, where a server restart fails any ongoing fetch. It would be good to persist async tasks and to periodically checkpoint their progress, so that they can be resumed properly from the last checkpoint. Sidian might be looking into doing something similar for the Read-Only fetches in the near future. You might be able to piggyback on that work to also make the rebalance task resumable.
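As a rough illustration of what persisting and checkpointing a task could look like, here is a minimal Python sketch. It is not how Voldemort's async tasks work: the `CheckpointedTask` class, the JSON checkpoint file, and the per-partition granularity are all assumptions made for the example. The task records each completed partition on disk and skips those partitions when it is started again.

```python
import json
import os
import tempfile


class CheckpointedTask:
    """Hypothetical task that remembers which partitions it has finished."""

    def __init__(self, path, partitions):
        self.path = path
        self.partitions = list(partitions)
        self.done = self._load()

    def _load(self):
        try:
            with open(self.path) as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()  # No checkpoint yet: start from scratch.

    def _save(self):
        # Write to a temp file and rename it, so a crash in the middle
        # of a save never leaves a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(sorted(self.done), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def run(self, move_partition):
        for partition in self.partitions:
            if partition in self.done:
                continue  # Finished in an earlier run.
            move_partition(partition)  # May raise on a transient failure.
            self.done.add(partition)
            self._save()
```

Creating a new `CheckpointedTask` with the same checkpoint path after a failure resumes at the first unfinished partition. The sketch does not address writes that land on an already-copied partition while the task is stopped.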

That is likely not something that can be accomplished as a quick fix, however. Can you afford the extra cost of restarting your rebalance from scratch in the short term? How urgently do you need a solution for this?

-F

--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-voldem...@googlegroups.com.
Visit this group at https://groups.google.com/group/project-voldemort.
For more options, visit https://groups.google.com/d/optout.

--
--
Félix

cgu...@apple.com

Mar 17, 2016, 2:30:35 PM

to project-voldemort

Hi Felix,

Thanks for your response. Please find my comments inline.

> It would be good to persist async tasks and to periodically checkpoint their progress, so that they can be resumed properly from the last checkpoint.

I am not too familiar with the BDB storage engine and how the data is laid out on disk, but wouldn't a checkpoint lose accuracy if new data starts flowing in while the rebalance is paused and the cleaner threads start cleaning the data? Also, is there a WIP branch for the Read-Only fetches that I can take a look at?

> That is likely not something that can be accomplished as a quick fix, however. Can you afford the extra cost of restarting your rebalance from scratch in the short term? How urgently do you need a solution for this?

We need it ASAP ;) We have a few candidates for zone expansion whose rebalance will take weeks, and assuming there won't be any transient failures over that period is too strong an assumption. Lost time is only part of the problem with starting from scratch; there is also the performance impact of proxy PUTs and GETs. So ideally we would like the rebalance for zone expansion to finish on the first attempt and, if that attempt fails, to reuse the work already done rather than pay the time, resource, and performance cost again.

FWIW, we completely understand this might not be something which can be done immediately, so we are willing to work with the community to drive this.

Thanks,

Chinmay

Félix GV

Mar 17, 2016, 3:28:40 PM

to project-voldemort

Response inline.

On Thu, Mar 17, 2016 at 11:30 AM, cgu...@apple.com <cgu...@apple.com> wrote:

>> It would be good to persist async tasks and to periodically checkpoint their progress, so that they can be resumed properly from the last checkpoint.

> I am not too familiar with the BDB storage engine and how the data is laid out on disk, but wouldn't a checkpoint lose accuracy if new data starts flowing in while the rebalance is paused and the cleaner threads start cleaning the data?

It's possible that checkpointing is unnecessary for the use case of resuming rebalances, since as you said, BDB is itself the checkpoint, in some way. I haven't given it much thought.

> Also, is there a WIP branch for the Read-Only fetches I can take a look at?

There isn't any branch at this point. It is only at the idea stage.

If you are in a rush to fix this, feel free to go ahead and submit a PR when you're done. We'll be happy to review.

-F

--
--
Félix
