StorageEngine Methods and Performance of Rebalancing Commands

Mark Rambacher

unread,

Jan 13, 2011, 12:10:07 PM1/13/11

to project-voldemort

We have created our own custom storage engine where each partition is
stored in a separate database (similar to what is proposed for BDB in
http://code.google.com/p/project-voldemort/issues/detail?id=179).
However, these changes do not "play nice" with a rebalancing in that
the rebalancing decisions are made at a layer *above* the storage
engine. The rebalance code ends up calling storageEngine.keys() and
storageEngine.get(key):
ByteArray key = keyIterator.next();

if(validPartition(key.get()) && counter % skipRecords == 0) {
for(Versioned<byte[]> value: storageEngine.get(key, null))
{
throttler.maybeThrottle(key.length());
if(filter.accept(key, value)) {

It seems like this code is very inefficient, at least for the case
where the storage engine can already make some of these determinations
(it might be for other cases as well since the store must be traversed
once -- to get the keys -- and then searched -- to get the values --
rather than getting both in one chunk).

Is there a reason not to change the method signatures for
storageEngine.keys() and storageEngine.entries to also take a list of
partitions to get keys from and the filter, e.g.,
public ClosableIterator<K> keys(List<Integer> partitions,
VoldemortFilter filter);
public ClosableIterator<Pair<K, Versioned<V>>>
entries(List<Integer> partitions, VoldemortFilter filter);

If filter is null, all values are returned. If partitions is null or
empty, all values are returned. If the interfaces are being changed,
it might also be nice to add T transform to the StorageEngine and add
it as an argument to entries as well. Finally, there should probably
also be a method like:
public void deleteEntries(List<Integer> partitions,
VoldemortFilter filter);
That can delete all entries that match a filter in the given
partition.

Before we go about making such a change, is there any reason anyone
can think of NOT to do this? It seems like it could be beneficial to
have more of the decisions made closer to the data, as implementations
might be able to optimize some of the decision processing.

Thanks,

Mark

bhupesh bansal

unread,

Jan 13, 2011, 12:18:08 PM1/13/11

to project-...@googlegroups.com

+ 1,

I think it make perfect sense, till the time we can keep client and server backward compatible I dont think

there is any other issue which should block this.

--
You received this message because you are subscribed to the Google Groups "project-voldemort" group.
To post to this group, send email to project-...@googlegroups.com.
To unsubscribe from this group, send email to project-voldem...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.

Alex Feinberg

unread,

Jan 14, 2011, 2:49:37 AM1/14/11

to project-...@googlegroups.com

Can't think of a reason not to do this. The reason we stored each
partition in the same database was due to the fact that in older
versions of BerkeleyDB, the environments (BDB equivalent to a MySQL
database) would not share the cache with each other. That's no longer
the case with more recent versions of BDB.

Thanks,
- Alex

bhupesh bansal

unread,

Jan 14, 2011, 11:57:21 AM1/14/11

to project-...@googlegroups.com

One bdb environment per partition is something I was thinking about long back, even after bdb have shared cache feature there was a question of number of cleaner threads etc and the TODO was to write your own cleaner thread pool etc, I completely believe that will speed up the rebalancing

by a 20X factor ateast.

John Cohen

unread,

Jan 14, 2011, 6:39:14 PM1/14/11

to project-...@googlegroups.com

Yes it does.
No longer is needed read the entire DB an later on throw away (filter)
all the keys that do not belong to the partition that we want. In
other words this is materialized in minimizing disk access IO, we just
read what we need nothing else.
It has other benefits too, it opens the possibility to experiment in
other areas, for example, a rebalance that can be a mix of
backup/restore (to avoid moving bulk the data by Voldemort - JVM's
resource consumption) and the "diff" the data that did not made into
the backup will be migrated as a regular Voldemort rebalance. For
this to be effective some other things has to happen like provide an
AdminClient API to fetch partition based on a point-in-time or other
parameter that can be utilized to get a subset of the data (data that
did not make it into the backup, after the backup was finished)...

Reply all

Reply to author

Forward