Hey Benjamin,
Yes, Voldemort does support reading and writing during a rebalance. You
should be seeing "InvalidMetadataException", which is a sign that the
client is going back to the cluster to fetch the updated metadata, and
the updates should go on as usual. This should work even under
significant load
(I've done this both under simulated load in a QA lab and in
production on one of the clusters powering LinkedIn).
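If it helps as a point of comparison, below is roughly the kind of loop
I keep running against the cluster while a rebalance is in flight. This
is just a sketch, not our actual test code -- the bootstrap URL and the
"test" store name are placeholders for your own cluster and store.

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class RebalanceLoadCheck {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder bootstrap URL and store name -- substitute your own.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        // Keep reading and writing while the rebalance runs; the default
        // client refreshes its cluster metadata when the topology changes,
        // so these calls should keep succeeding throughout.
        for (int i = 0; i < 100000; i++) {
            String key = "key-" + (i % 1000);
            client.put(key, "value-" + i);
            client.get(key);
            Thread.sleep(1);
        }
        factory.close();
    }
}

If the puts/gets start failing outright during the rebalance (rather
than just slowing down), that would be worth digging into.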
What GC/memory settings are you using with Voldemort? You may be
encountering GC pauses in Voldemort, which can occur during a
rebalance. The other possibility (if you have a slower disk and/or less
memory) is that the data transfer is causing latency spikes (as it's
being written to disk), in which case you should enable throttling for
the rebalance.
It would help if you let us know:
- The JVM options you're passing to Voldemort
- The amount of BDB cache you're using (default in quickstart)
- The amount of disk and memory on your machines
Lastly, are you using code from trunk (which has a slightly different
rebalance algorithm that can write data even faster) or from a
tar/zip archive?
Thanks for the additional information. You're actually exhausting
your disk seek capacity in this case: all the nodes are reading from
the same disk, and since the ratio of data to BDB cache is 1000:1,
you're almost certainly going to disk on every request (I'm assuming
the page cache isn't devoted to Voldemort, since you're also running
lots of other applications on the machine), while the disk is also
seeking to stream the data for the rebalance (unfortunately, BerkeleyDB
Java Edition opens cursors in key order rather than disk order,
resulting in random disk seeks).
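(For a rough sense of scale: a single 7200rpm magnetic disk can
typically sustain only on the order of 100-200 random seeks per second,
so once every client request needs a seek and the rebalance stream is
seeking as well, you run out of headroom very quickly.)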
This is actually a fairly good test for throttling though, so I'd
suggest lowering these settings in server.properties (for all the
nodes):
stream.read.byte.per.sec (default is 10000000, i.e. 10MB/s)
stream.write.byte.per.sec (default is 10000000, i.e. 10MB/s)
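For example (these numbers are only a starting point, not a
recommendation -- tune them against the latency you actually observe),
something like:

stream.read.byte.per.sec=2000000
stream.write.byte.per.sec=2000000

would throttle the rebalance streams to roughly 2MB/s in each
direction.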
By the way, I'd highly advise against using such a high data-to-memory
ratio in production with magnetic disks (SSDs may do a lot better).
BDB-JE performs best when the ratio is around 5:1, although we've
pushed it to 10:1 (_per_ machine, that is -- you can have many machines
in the cluster) in some production cases. We are exploring storage
engines that would let us go beyond that ratio, but Voldemort already
lets you scale horizontally, decreasing the amount of data each node
has to store. The other factor to consider is the size of your working
set: if your working set is only a part of your total data, you'll be
able to achieve a higher data-to-memory ratio, as both BDB and the
operating system are very aggressive about caching.
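To make that concrete with made-up numbers: under the 5:1 guideline, a
node holding 50GB of data would want roughly 10GB of memory available
between the BDB cache and the OS page cache; at 10:1, roughly 5GB. At
1000:1 essentially nothing fits in memory, so nearly every request
turns into a disk seek.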
I'd also try going to a 3GB JVM heap and a 1GB BDB cache size.
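Concretely (the exact mechanism depends on how you launch the server,
so treat these values as illustrative), that would mean passing
something like

-Xms3g -Xmx3g -server

to the server JVM (e.g., via VOLD_OPTS if you're using the bundled
voldemort-server.sh), and setting

bdb.cache.size=1073741824

in server.properties (the BDB cache size is specified in bytes).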
Testing with all the nodes on a single machine is also somewhat
misleading, in that you're not able to see the effects of latency or of
non-correlated failures (e.g., one machine being slowed down by disk
seeks while another keeps moving along).
Thanks,
- Alex
Ben,
My apologies, I misread your message and thought you had 200GB of data
per node. Sorry for the poor reading comprehension on my part.
This is fairly odd; I'll replicate your experiment and see if I can
get the same result.
There's a unit test within Voldemort's test suite that does a local
machine rebalance, but most of our testing for re-balancing is done on
EC2 (through automated tests) or on physical hardware with real data
(in the QA lab and in production). In all cases I'm specifically
throwing load at the cluster and making sure that ongoing requests are
able to succeed.
I'll let you know if I find any issue in the code.
Thanks,
- Alex