[project-voldemort] Does Voldemort support reading/writing during rebalancing?

bjno...@gmail.com

May 24, 2010, 10:51:07 AM
to project-voldemort
Hi

What is the expected behaviour for reading & writing to a cluster
during a rebalance?

I've set up a scenario based on the example configurations
(test_config1 and test_config2), and added a node (test_config3). The
initial partition configuration is:
node0: 0,1,2
node1: 3,4,5

The target configuration is:
node0: 0,1
node1: 2,3
node2: 4,5
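
Expressed in Voldemort's cluster.xml, the target layout looks
something like this (a sketch: the host and port values are assumed
here, not copied from the actual target file; node 0's socket port
matches the bootstrap URL used later in the thread):

<cluster>
  <name>testcluster</name>
  <server>
    <id>0</id>
    <host>localhost</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <partitions>0, 1</partitions>
  </server>
  <server>
    <id>1</id>
    <host>localhost</host>
    <http-port>8082</http-port>
    <socket-port>6667</socket-port>
    <partitions>2, 3</partitions>
  </server>
  <server>
    <id>2</id>
    <host>localhost</host>
    <http-port>8083</http-port>
    <socket-port>6668</socket-port>
    <partitions>4, 5</partitions>
  </server>
</cluster>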

While the rebalance happens, I have another process that is writing and
reading values ("k0" → "v0" up to "k1999" → "v1999") in a single
thread.

During rebalance the read/write process stalls, then times out after
15 seconds with
"voldemort.store.InsufficientOperationalNodesException: 1 writes
succeeded, but 2 are required."

I.e. Voldemort doesn't seem to support reading/writing whilst
rebalancing?
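
For context, the "2 are required" in that exception comes from the
store's required-writes setting. The "test" store definition in
stores.xml presumably looks something along these lines (the values
here are assumed for illustration, not copied from the sample configs):

<stores>
  <store>
    <name>test</name>
    <persistence>bdb</persistence>
    <routing>client</routing>
    <replication-factor>2</replication-factor>
    <required-reads>1</required-reads>
    <required-writes>2</required-writes>
    <key-serializer>
      <type>string</type>
    </key-serializer>
    <value-serializer>
      <type>string</type>
    </value-serializer>
  </store>
</stores>

With required-writes=2, a put only succeeds once two replicas
acknowledge it, so a single unreachable node surfaces as exactly the
exception above.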

The rebalance time is
real 0m9.178s
user 0m2.125s
sys 0m0.220s

and I'm on Voldemort 0.80.2

Thanks
Benjamin


Alex Feinberg

May 24, 2010, 4:02:57 PM
to project-...@googlegroups.com
Hey Benjamin,

Yes, it does support reading/writing during a rebalance. You should be
seeing "InvalidMetadataException", which is a sign that the client is
going to the cluster to get the updated metadata, and the updates
should go on as usual. This should work even under significant load
(I've done this both under simulated load in a QA lab and in
production on one of the clusters powering LinkedIn).

What GC/memory settings are you using with Voldemort? You may be
encountering GC stops in Voldemort, which could occur during a
rebalance. The other option (if you have a slower disk and/or less
memory) is that the data transfer is causing latency spikes (as it's
being written to disk), in which case you should enable throttling for
the rebalance.

It would help if you let us know:

- The JVM options you're passing to Voldemort
- The amount of BDB cache you're using
- The amount of disk and memory on your machines

Lastly, are you using code from trunk (which has a slightly different
rebalance algorithm that writes data even faster) or a tar/zip
archive?

Thanks,
- Alex

Benjamin Nortier

May 25, 2010, 5:23:21 AM
to project-...@googlegroups.com
On Mon, May 24, 2010 at 9:02 PM, Alex Feinberg <fein...@gmail.com> wrote:
> Hey Benjamin,
>
> Yes, it does support reading/writing during a rebalance. You should be
> seeing "InvalidMetadataException", which is a sign that the client is
> going to the cluster to get the updated metadata, and the updates
> should go on as usual. This should work even under significant load
> (I've done this both under simulated load in a QA lab and in
> production on one of the clusters powering LinkedIn).

I have observed the InvalidMetadataException; good to know that's expected.

> What GC/memory settings are you using with Voldemort? You may be
> encountering GC stops in Voldemort, which could occur during a
> rebalance. The other option (if you have a slower disk and/or less
> memory) is that the data transfer is causing latency spikes (as it's
> being written to disk), in which case you should enable throttling for
> the rebalance.

-Xmx2G -server -Dcom.sun.management.jmxremote -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=2048m -XX:MaxNewSize=2048m

How do I enable throttling during rebalance?
> It would help if you let us know:
>
> - The JVM options you're passing to Voldemort

As above.

> - The amount of BDB cache you're using

bdb.cache.size=100MB (the quickstart default)

> - The amount of disk and memory on your machines

200GB/4GB (running locally on my MBP). All three nodes are running locally.

> Lastly, are you using code from trunk (which has a slightly different
> rebalance algorithm that writes data even faster) or a tar/zip
> archive?

As mentioned, I'm using 0.80.2.

Cheers
Ben



--
Benjamin Nortier
e: bjno...@gmail.com
c: +44 (0)778 946 1959
gtalk: bjno...@gmail.com

Alex Feinberg

May 25, 2010, 3:04:40 PM
to project-...@googlegroups.com
Ben,

Thanks for the additional information. You're actually exhausting
your disk seek capacity in this case: all the nodes are reading from
the same disk, and as the ratio of data to BDB cache is 1000:1, you're
almost certainly going to disk on every request (I am assuming the page
cache isn't devoted to Voldemort, as you're also running lots of other
applications on your machine), while the disk is also seeking to
stream the data for rebalancing (unfortunately, BerkeleyDB Java
Edition opens cursors in key rather than disk order, resulting in
random disk seeks).

This is actually a fairly good test for throttling though, so I'd
suggest lowering these settings in server.properties (for all the
nodes):

stream.read.byte.per.sec (default: 10000000, i.e. 10MB/s)
stream.write.byte.per.sec (default: 10000000, i.e. 10MB/s)
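
For example, to throttle the transfer to roughly 1MB/s, each node's
server.properties would get something like this (the values here are
only a suggestion, following the settings named above):

# throttle rebalance data transfer to ~1MB/s
stream.read.byte.per.sec=1000000
stream.write.byte.per.sec=1000000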

By the way, I'd highly advise against using such a high data-to-memory
ratio in production with magnetic disks (SSDs may do a lot better).
BDB-JE performs best when the ratio is 5:1, although we've pushed it
to 10:1 (_per_ machine, that is -- you can have many machines in the
cluster) in some production cases. We are exploring storage engines
that would let us go beyond that ratio, but Voldemort allows you to
scale horizontally, decreasing the amount of data each node has to
store. Of course, the other factor to consider is the size of your
working set: if your working set is only a part of your total data,
you'll be able to achieve a higher data-to-memory ratio, as both BDB
and the operating system are very aggressive about caching.

I'd also try going to a 3GB JVM heap and a 1GB BDB cache size.
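
Concretely, that would mean something like this (a sketch based on the
options posted earlier in the thread, not tested on your setup):

# JVM options: raise the heap from 2G to 3G
-Xmx3G -server -Dcom.sun.management.jmxremote -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

# server.properties: raise the BDB cache from 100MB to 1GB
bdb.cache.size=1GB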

Testing with all nodes on a single machine is also somewhat misleading,
in that you're not able to see the effects of latency or
non-correlated failures (e.g., one machine being slowed down by disk
seeks while another keeps moving along).

Thanks,
- Alex

Benjamin Nortier

May 26, 2010, 6:03:35 AM
to project-...@googlegroups.com
Alex, thanks for the good info. 

However, there's been a miscommunication: I don't think I'm exhausting the disk seek capacity. I'm aware that running all nodes on the same machine can be misleading, but the test scenario is extremely basic, so it should not be a problem. I've even simplified it tremendously and still see the problem.

There are only 10 values, and I'm reading and writing at 1 request/second. I'll paste the Java and the output below.

Here's the procedure:
1. Clean up all data (i.e. remove all database values)
2. Fire up the 3 nodes
3. Execute the Java test process
4. Wait until I see some values being written/read
5. Start the rebalancing
6. Observe the output
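
In shell terms, the drill is roughly this (a sketch: the server launch
lines assume the standard bin/voldemort-server.sh from the
distribution; only the rebalance command is copied verbatim from the
output further down):

# 1. clean up all data
rm -rf config/test_config1/data config/test_config2/data config/test_config3/data
# 2. fire up the 3 nodes
bin/voldemort-server.sh config/test_config1 &
bin/voldemort-server.sh config/test_config2 &
bin/voldemort-server.sh config/test_config3 &
# 3 + 4. run the test process and wait for some values to be written/read
java Exerciser &
# 5. start the rebalancing
bin/voldemort-rebalance.sh --cluster config/rebalance/target.xml --url tcp://localhost:6666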

Every time, without fail, the test process stalls and times out on one of the writes after I see the InvalidMetadataException.

Here's the Java for the test reader/writer:

import java.io.PrintStream;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class Exerciser {
    private static long before;
    private static long after;

    public static void main(String[] args) {
        String bootstrapUrl = "tcp://localhost:6666";
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls(bootstrapUrl));

        // create a client that executes operations on a single store
        StoreClient<String, String> client = factory.getStoreClient("test");

        String key = "_";
        String value = "_";
        before = System.currentTimeMillis();
        for (int i = 0; i < 10; ++i) {
            try {
                key = "k" + i;
                value = "v" + i;
                client.put(key, value);
                client.get(key);
                reportEnd(key, value, System.out);
                Thread.sleep(1000);
            } catch (Exception e) {
                System.err.print("\nERROR:");
                reportEnd(key, value, System.err);
                System.err.println("\nERROR:" + e.toString());
            }
        }
        System.out.println("\nDone");
    }

    private static void reportEnd(String key, String value, PrintStream out) {
        after = System.currentTimeMillis();
        out.print(String.format("%s:%s->%d ", key, value, (after - before)));
        before = after;
    }
}

And the output:

k0:v0->58 k1:v1->1006 k2:v2->1007 k3:v3->1006 k4:v4->1005 k5:v5->1025 k6:v6->1004 
Exception in thread "voldemort-client-thread-3" voldemort.store.InvalidMetadataException: client attempt accessing key belonging to partition:[3, 4] at Node:Node0 partitionList:[0, 1]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
...

ERROR:k7:v7->16005 
ERROR:voldemort.store.InsufficientOperationalNodesException: 1 writes succeeded, but 2 are required.
k8:v8->39 k9:v9->1004 
Done
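
As an aside, one application-level workaround while a rebalance is in
flight would be to retry the put when this happens. Here's a sketch of
a hypothetical helper for the Exerciser above; it assumes the write is
safe to retry and needs
"import voldemort.store.InsufficientOperationalNodesException;":

private static void putWithRetry(StoreClient<String, String> client,
                                 String key, String value, int attempts)
        throws InterruptedException {
    // InsufficientOperationalNodesException can be transient while nodes
    // are mid-rebalance, so retry a few times with a linear backoff.
    for (int i = 0; i < attempts; i++) {
        try {
            client.put(key, value);
            return;
        } catch (InsufficientOperationalNodesException e) {
            Thread.sleep(500L * (i + 1));
        }
    }
    throw new RuntimeException("put failed after " + attempts + " attempts: " + key);
}

That said, a retry would only paper over the stall rather than explain it.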


The size of the data afterwards is only a few kilobytes:
$ du -h
4.0K config/test_config1/data/bdb
4.0K config/test_config1/data
4.0K config/test_config2/data/bdb
4.0K config/test_config2/data
4.0K config/test_config3/data/bdb
4.0K config/test_config3/data

And the rebalancer output:

$ bin/voldemort-rebalance.sh --cluster config/rebalance/target.xml --url tcp://localhost:6666
[2010-05-26 10:41:24,067 voldemort.client.rebalance.RebalanceController] INFO Cluster Rebalancing Plan:
StealerNode:1
RebalancingStealInfo(1 <--- 0 partitions:[0, 1, 2] stores:[test])
StealerNode:2
RebalancingStealInfo(2 <--- 0 partitions:[2] stores:[test])
RebalancingStealInfo(2 <--- 1 partitions:[3, 4, 5] stores:[test])
 
[2010-05-26 10:41:24,136 voldemort.client.rebalance.RebalanceController] INFO Starting rebalancing for stealerNode:1 with rebalanceInfo:RebalancingStealInfo(1 <--- 0 partitions:[0, 1, 2] stores:[test]) 
[2010-05-26 10:41:25,723 voldemort.client.rebalance.RebalanceController] INFO Successfully finished rebalance attempt:RebalancingStealInfo(1 <--- 0 partitions:[0, 1, 2] stores:[test]) 
[2010-05-26 10:41:25,723 voldemort.client.rebalance.RebalanceController] INFO Starting rebalancing for stealerNode:2 with rebalanceInfo:RebalancingStealInfo(2 <--- 0 partitions:[2] stores:[test]) 
[2010-05-26 10:41:27,291 voldemort.client.rebalance.RebalanceController] INFO Successfully finished rebalance attempt:RebalancingStealInfo(2 <--- 0 partitions:[2] stores:[test]) 
[2010-05-26 10:41:27,292 voldemort.client.rebalance.RebalanceController] INFO Starting rebalancing for stealerNode:2 with rebalanceInfo:RebalancingStealInfo(2 <--- 1 partitions:[3, 4, 5] stores:[test]) 
[2010-05-26 10:41:27,867 voldemort.client.rebalance.RebalanceController] INFO Successfully finished rebalance attempt:RebalancingStealInfo(2 <--- 1 partitions:[3, 4, 5] stores:[test]) 
[2010-05-26 10:41:27,867 voldemort.client.rebalance.RebalanceController] INFO Thread run() finished:

I would be very surprised if this were a configuration/resource issue, because the load is minuscule. I think it's a bug.

Thanks
Ben

Alex Feinberg

May 26, 2010, 4:01:48 PM
to project-...@googlegroups.com
Ben,

My apologies, I misread you as having 200GB of data per node. Sorry
for the poor reading comprehension in this case.

This is fairly odd; I'll replicate your experiment and see if I can
get the same result.

There's a unit test within Voldemort's test suite that does a local
machine rebalance, but most of our testing for rebalancing is done on
EC2 (through automated tests) or on physical hardware with real data
(in the QA lab and production). In all cases I am specifically throwing
load and making sure that ongoing requests are able to succeed.

I'll let you know if I find any issue in the code.

Thanks,
- Alex

Benjamin Nortier

May 27, 2010, 5:14:28 AM
to project-...@googlegroups.com
On Wed, May 26, 2010 at 9:01 PM, Alex Feinberg <fein...@gmail.com> wrote:
> Ben,
>
> My apologies, I misread you as having 200GB of data per node. Sorry
> for the poor reading comprehension in this case.
>
> This is fairly odd; I'll replicate your experiment and see if I can
> get the same result.
>
> There's a unit test within Voldemort's test suite that does a local
> machine rebalance, but most of our testing for rebalancing is done on
> EC2 (through automated tests) or on physical hardware with real data
> (in the QA lab and production). In all cases I am specifically throwing
> load and making sure that ongoing requests are able to succeed.
>
> I'll let you know if I find any issue in the code.
>
> Thanks,
> - Alex

Much appreciated!
Ben 