Re-distribution of data from a failed node


Dan

Jan 27, 2009, 1:51:50 PM
to project-voldemort
Hi,

I'm evaluating a number of different DHTs, and Project Voldemort seems
to be one of the more promising. The following passage from the
website is a bit unclear to me:

"The number of data partitions cannot be changed. Online
redistribution of data is not yet supported, but this provides the
mechanism by which it will work when it is. Partitions will be moved
to the new servers (or rebalanced between servers), but the total
number of partitions will always remain the same, as will the mapping
of key to partition. This means it is important to give a good number
of partitions to start with. The script test/integration/
generate_partitions.py will generate this part of the config for you.
Note that the configuration is currently simple files so it is
important that the data in cluster.xml and stores.xml be exactly the
same on each server, and that the node ids and partitions not be
changed, since that can mean that clients will think their data should
be on node X when really it was stored on node Y. This limitation will
be removed as the configuration is moved into voldemort itself."

Is this saying that, at the present time, data from a failed node is
never re-distributed to other systems? Or is it saying that its
partitions will rebalance themselves right now, but that the list
can't be changed? Is there an ETA for having dynamic membership and
balancing of partitions built in? It seems like graceful recovery from
failed nodes would be essential before describing PV as truly
fault tolerant.

-Dan

Bhupesh Bansal @ Linkedin

Jan 27, 2009, 4:50:21 PM
to project-voldemort
Hey Dan,

some background on data rebalancing was discussed here
http://groups.google.com/group/project-voldemort/browse_thread/thread/8f280c17e0254fbf

The assumption is that failing nodes are temporary failures and will
come back online later, so in the meantime:
1) the replica nodes serve the keys
2) after coming back online, the master gets the delta through
read-repair (it saves keys it doesn't currently have)

If the nodes are not temporary failures, then you are better off doing
grandfathering (copying the data files) than auto-rebalancing, from a
performance perspective. This is still under active discussion; please
let us know your comments.

Best
Bhupesh

Dan Dockery

Jan 27, 2009, 5:21:45 PM
to project-...@googlegroups.com
Hi Bhupesh,

Thanks for the response.  I've been experimenting with Voldemort in the meantime.  I've been doing a limited test with two nodes and there may be a problem with the behavior when a node is offline.

Replication scenario (the store definition I'm using is sketched after the list):
Replication factor: 3 - Required reads: 1 - Required writes: 2
1) Bring up two nodes, each with one partition (Node 0 / Partition 0, Node 1 / Partition 1).
2) Insert data into either node: Works as expected. Data is stored and can be queried.
3) Shut down node 0: Works as expected. Data can still be queried.
4) Insert data into node 1: Seems to work as expected; an exception is thrown indicating that storage was unsuccessful.
5) Attempt to query node 1: Returns the stored value despite the previous failure message.
6) Bring node 0 back online: Data inserted while node 0 was down for which node 1 is authoritative can now be queried from both servers. Data inserted while node 0 was down for which node 0 is authoritative cannot be read from either server. This is still the case after 10 minutes, which seems like more than enough time for two nodes to replicate around 10 values.
7) Bring node 0 down again: Data is all available again.
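
For reference, the store definition I'm using looks roughly like this (the
other elements are whatever the sample config ships with; only the quorum
settings matter here):

<store>
<name>test</name>
<persistence>bdb</persistence>
<routing>client</routing>
<replication-factor>3</replication-factor>
<required-reads>1</required-reads>
<required-writes>2</required-writes>
...
</store>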

I haven't confirmed yet that updates to existing values exhibit the same behavior, but it seems likely that they would.  This could lead to unexpected behavior any time enough nodes in a cluster are unavailable to prevent write quorum.

If Voldemort is only to be used as a cache (fancier memcached with replication?), that behavior is likely fine, but if it's going to be used as a distributed form of data storage, that is probably less than desirable.  What I'm hoping to find, really, is a form of low-latency distributed storage which can accept systems randomly joining and leaving its ring and deal elegantly with data re-replication without intervention or manually copying files.  Bamboo came awfully close to fitting the bill, but it's unstable under even a moderate load and has been abandoned since 2006.

I'm not sure whether the right answer would be to throw the exception and not store the value or whether it would be to throw the exception and then assume that the most recent generation identifier for the data is the best when another node comes back online.

-Dan

Jay Kreps

Jan 28, 2009, 4:28:59 AM
to project-...@googlegroups.com
Hi Dan,

I see what you are saying. I think it is important to stress the
semantics of the put() method: if it succeeds, it guarantees it wrote
your data to at least "required-writes" places; if it fails, it makes
no guarantee about the state of the data. In general I think these are
the semantics of almost any RPC mechanism that changes state: if it
fails, you don't really know what state you will get back from another
call (e.g. if you get an IOException for a socket timeout, does that
mean it did or did not change the state?).
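
In code terms, roughly (a sketch only; the exact factory setup may differ
in your build, so treat the bootstrap lines as illustrative):

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class PutSemantics {
    public static void main(String[] args) {
        // Bootstrap against any live node in the cluster.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://node1:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        try {
            client.put("some-key", "some-value");
            // Success: the value reached at least required-writes replicas.
        } catch (Exception e) {
            // Failure: no guarantee either way. The value may still have
            // landed on some replicas, so a later get() can return it
            // (which is what you saw in step 5).
        }
    }
}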

So (5) is the intended semantics. If you want to enforce strict
semantics you need to have required-reads + required-writes >
replication-factor (e.g. with replication-factor 3, required-reads = 2
and required-writes = 2). This will give you the semantics you want,
but only by throwing an exception on reads when two nodes are down as
in your example, which may not be better. This is the "read-your-writes"
consistency guarantee, which is often very important for things with a
UI, so as not to confuse the user.

For (6) I think you are expecting behavior that is not there. The
three consistency approaches in Dynamo are:
1. Read repair. This does repair the value, but does so at read time,
not in the background.
2. Writing to random nodes when a node is down, and having that node
push the value to the right location later.
3. A checksum-based log catch-up mechanism.

We have really only implemented (1) and (2). (2) is disabled for now
because our tests showed it exhibits a swarming behavior: when a node
is down for a while, it ends up having many writes that need to be
delivered to it, so when it comes back up, in addition to normal
traffic it gets a bazillion missed writes thrown at it. To be done
properly there needs to be a backoff mechanism in place, and overall
just more engineering and testing.
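
To make (1) a bit more concrete, here is a conceptual sketch of read
repair (illustrative only, not the actual Voldemort code; a plain
timestamp stands in for the vector clocks that really order versions):

import java.util.HashMap;
import java.util.Map;

public class ReadRepairSketch {

    // A value plus something that orders versions (Voldemort uses vector clocks).
    static class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) { this.value = value; this.version = version; }
    }

    // responses: replica name -> the value it returned (null if it had none).
    static void readRepair(Map<String, Versioned> responses) {
        // Find the newest version among the replicas that answered.
        Versioned newest = null;
        for (Versioned v : responses.values())
            if (v != null && (newest == null || v.version > newest.version))
                newest = v;
        if (newest == null)
            return; // no replica had the value; nothing to repair with

        // Push the newest value back to any replica that was stale or missing it.
        for (Map.Entry<String, Versioned> e : responses.entrySet()) {
            Versioned v = e.getValue();
            if (v == null || v.version < newest.version)
                System.out.println("repair: push '" + newest.value + "' to " + e.getKey());
        }
    }

    public static void main(String[] args) {
        Map<String, Versioned> responses = new HashMap<String, Versioned>();
        responses.put("node0", null);                    // node0 missed the write
        responses.put("node1", new Versioned("v2", 2L)); // node1 has the latest
        readRepair(responses);
    }
}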

I think what you want is (3). We want that too, but no one has written
it yet. Patches accepted :-)

-Jay

Jay Kreps

Jan 28, 2009, 4:44:06 AM
to project-...@googlegroups.com
Also, to answer your original question: I don't think the partitions
will be dynamic (since they determine load distribution), and I don't
think cluster membership will be dynamic (since you bought some fixed
number of servers, and presumably know which ones they are). What
really, really does need to get done is online adding of servers to
the cluster and rebalancing of data partitions onto the new server.

So to clarify:
(1) Adding a server and moving data to that server in a push-button
manner is really, really essential to being able to do incremental
expansion.
(2) Removing a server is not very useful but makes sense and needs
some button or command.
(3) Having the cluster dynamically determine which servers are
currently up and working, for doing routing of requests, is essential.
(4) Having the system dynamically determine which servers are or are
not available and try to auto-rebalance your many gigabytes of data
every time a server appears or disappears is really not a good idea.

(4) is the problem that real distributed hash tables (attempt to)
solve, because they are trying to use unreliable client nodes that
come and go very frequently (e.g. everyone runs a node on their PC).
The niche for Voldemort is really not that. We are really just trying
to make a service-based storage mechanism that has good latency and
that you can run your site on. Dynamically discovering members and
doing data distribution based on that is a non-feature in this domain,
since it leads to really bad failure modes where a node is popping in
and out of the cluster and data is going everywhere. Our expectation
is that nodes have > 99% uptime, so it is often the case that some
node is down, but not the case that a particular node (say node 12)
will be down most of the time.

I don't know if that provides any clarification.

Obviously the best would be if we just had all the features now :-)

-Jay

ethane

Feb 6, 2009, 3:22:52 PM
to project-voldemort


On Jan 28, 1:28 am, Jay Kreps <jay.kr...@gmail.com> wrote:
> For (6) I think you are expecting behavior that is not there. There
> are three consistency approaches in dynamo are:
> 1. Read repair. This does repair the value but does so at read time
> not in the background.
> 2. Writing to random nodes when a node is down, and having that node
> push the value to the right location
> 3. A checksum based log catchup mechanism
>
> We have really only implemented (1) and (2). (2) is disabled for now
> because our tests showed this exhibits a swarming behavior.  

In my testing, similar to Dan's, I'm not able to get (1) to work.
Here are the steps I took and my configuration.
Two servers, with 8 partitions total.
stores.xml
---------------
<store>
<name>test</name>
<persistence>bdb</persistence>
<routing>client</routing>
<replication-factor>2</replication-factor>
<preferred-reads>2</preferred-reads>
<required-reads>1</required-reads>
<preferred-writes>2</preferred-writes>
<required-writes>1</required-writes>
...
</store>
cluster.xml
---------------
<server>
<id>0</id>
<host>node0</host>
<http-port>8081</http-port>
<socket-port>6666</socket-port>
<partitions>0,2,4,6</partitions>
</server>
<server>
<id>1</id>
<host>node1</host>
<http-port>8081</http-port>
<socket-port>6666</socket-port>
<partitions>1,3,5,7</partitions>
</server>
-------------------
1) Put a value, read the value: works fine, both servers running.
2) Shut down node0, put a value, read the value: works fine.
3) Start node0, get the value from step 2: works fine.
4) Shut down node1, read the value from step 2: fails.

What am I missing? Wouldn't the read in step 3 have pushed the
missing value from node1 to node0, so that step 4 would succeed? Would
setting required-reads = replication-factor fix this problem?
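
E.g. something like this (only required-reads changed from my config above,
so that required-reads + required-writes > replication-factor):

<store>
<name>test</name>
<persistence>bdb</persistence>
<routing>client</routing>
<replication-factor>2</replication-factor>
<required-reads>2</required-reads>
<required-writes>1</required-writes>
...
</store>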

ethane

Feb 9, 2009, 8:57:41 PM
to project-voldemort
Hi all,

Does anyone have input about my previous post? Am I just being thick,
and not getting something, or is the functionality that I describe
just not supported yet? From previous posts on this subject, it seems
like read-repair was implemented and expected to work.

Thanks again,
Ethan

Jay Kreps

Feb 9, 2009, 9:46:45 PM
to project-...@googlegroups.com
Hi Ethan,

This functionality is implemented and expected to work. Sorry for not
getting back to you; I have been moving, so I have had no internet
access except via coffee shop and am a bit behind on emails.

Can you give me your full config so I can test exactly what you are doing?

Thanks,

-Jay

Ethan Erchinger

Feb 10, 2009, 1:38:21 PM
to project-...@googlegroups.com
Hey Jay, thanks for responding.

Here are the configs. Is anything other than the three main
configuration files needed?
http://pastebin.com/m53ba6a59

Also, here's a patch I made to the command-line client to get
multi-server support. With this patch you can do something like:
bin/voldemort-shell.sh test tcp://node0:6666,tcp://node1:6666

http://pastebin.com/m1e892a21

Thanks for your help,
Ethan



jay....@gmail.com

Feb 13, 2009, 8:23:21 PM
to project-voldemort
Hey guys,

So I did some testing. I don't see the exact problem Ethan refers to,
but in the process of doing a variety of tests I did find a real
problem, which is that read repair is not running on null values.
This is bad, these need to be repaired, and I think this could be the
cause of what Ethan sees. The effect of this is that repair(V1, V2)
correctly gives V3, but repair(V1, null) gives back no repairs. I
believe it is a fairly straightforward fix. Thanks for insisting I
follow up on this. I will update when I have a tested fix in place.
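
To illustrate the bug (this is not the actual ReadRepairer code; a plain
version number stands in for our vector clocks):

public class NullRepairBug {

    // Current (buggy) check: a node that returned nothing is never repaired.
    static boolean needsRepairBuggy(Long nodeVersion, long newest) {
        return nodeVersion != null && nodeVersion < newest;
    }

    // Intended check: a missing value is just as repairable as a stale one.
    static boolean needsRepairFixed(Long nodeVersion, long newest) {
        return nodeVersion == null || nodeVersion < newest;
    }

    public static void main(String[] args) {
        long newest = 2L;      // newest version seen across the replicas
        Long staleNode = 1L;   // replica holding an old version
        Long emptyNode = null; // replica with no value for the key at all

        System.out.println(needsRepairBuggy(staleNode, newest)); // true: gets repaired
        System.out.println(needsRepairBuggy(emptyNode, newest)); // false: the bug
        System.out.println(needsRepairFixed(emptyNode, newest)); // true: what we want
    }
}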

-Jay

On Feb 10, 10:38 am, "Ethan Erchinger" <et...@plaxo.com> wrote:
> Hey Jay, thanks for responding.
>
> Here's are the configs.  Anything other than the 3 main configuration
> files needed? http://pastebin.com/m53ba6a59

ethane

Feb 13, 2009, 8:32:06 PM
to project-voldemort
Thanks for the follow-up, Jay. Could it be that the command-line
client I'm using doesn't do the right thing from a repair perspective?