New store / Build in concurrency resolver

Francois

unread,

Mar 17, 2012, 1:02:53 PM3/17/12

to project-voldemort

After looking at cache store (a nice implementation, that we like
because hash makes sense), we are building a new type of store called
DiskMap , for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
cheap machines with a very low ratio RAM/Disk. The use case is tons of
data, but only 5% live ones.

Our need is not the fastest performance, but a very good, consistent
performance even with 1 to 10 GKeys. This is based on a logged needle
structure (vaguely inspired from haystack)+ a disk mapped hash table
with a dirty cache and an elevator pattern for committing the index.
Unit tests are promising with results on a test workstation with a
very slow disk being > 10 times BDB. In fact as fast as BDB or cache
store when all index fits in memory, which is impossible in our case
with a target of 160TB drive arrays on a 24GB RAM machine (like the
Backblaze storage pod).
With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
get a crazy write performance.

Our app does not need to keep versions, this is done at a higher level
using different keys.

So to optimize this storage strictly dedicated to Voldemort, would it
create problems to clean versions on the fly during puts ?
- Eliminating obsolete versions at write time, like in the bdb
implementation
- Eliminating concurrent versions based on timestamp, unlike the bdb
implementation

Our goal is to keep the hash chain as short as possible to minimize
disk seeks. Any danger doing this ?

Francois

Francois

unread,

May 10, 2012, 8:42:52 AM5/10/12

to project-voldemort

Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
running on a host with entry level disks in RAID 1 (so not the optimal
situation).

Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
key depending on the hashmap loading factor (10 runs OK). The rest is
dedicated to needles and needlepointer cache.

Performance can be better with one small SSD for the bucket table, but
this runs fine on 2 HDD (you just need two different axis).

It has been designed to be modified to be partition aware (sub-
partitionning), and if 100% dedicated to Voldemort.

Use case is good performance for very large data sets with 10% live
data (so that last access based caching is efficient). We need a small
effort to make it generic, but if someone has the same use case feel
free to contact me.

François

On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
> After looking at cache store (a nice implementation, that we like

> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on

Greg Moulliet

unread,

May 10, 2012, 12:04:34 PM5/10/12

to project-...@googlegroups.com

Sounds neat Francois. Can you share the code via github?

Thanks,

Greg

> --
> You received this message because you are subscribed to the Google Groups "project-voldemort" group.
> To post to this group, send email to project-...@googlegroups.com.
> To unsubscribe from this group, send email to project-voldem...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.
>

Francois

unread,

May 11, 2012, 4:35:12 AM5/11/12

to project-voldemort

Until it's shared, you can directly call me.

Francois

unread,

May 12, 2012, 3:20:49 AM5/12/12

to project-voldemort

BTW we're walking away from the Storage Pod model, and go for blades
powered by ATOM processors. With the low cost of ATOM motherboards and
2 X 3 or 4GB disks, the motherboard/box cost overhead is not much and
the form factor us smaller (better for maintenance, + no RAID to
setup). And 4GB RAM is more than enough to drive 8TB storage (at least
with our access pattern).

It's also a good candidate for large blob storage slitted in chunks:
by modifying the Voldemort hashing function to introduce a sub key
which is transparent for the ring and for the DiskMap bucket, you end
up chaining chunks on the same node which are contiguous on disk.

Ultimate goal would be with a sequencial number per sub-partition (=
one disk) to facilitate automatic single disk rebuild without Merkle
trees, but we're not yet there. This would allow to increase the
number of disks per CPU without any RAID overhead (because RAID0 would
generate too large disks sets, with an unmanageable rebuild time).

We're running a bit after time for and that's not yet on GitHub, but
we're looking for partners with similar needs to further developp and
optimize this cost driven concept as a sub-project.

If some are interested with also have a retry manager, which replaces
the updateAction loop on obsolete writes by adding a increasing and
random wait so that processes stop fighting in case of high
concurrency: this is a must if you have some long transactions (e.g.:
20-30ms). Same principle here as CSMA/CD network retries.

Francois

unread,

May 12, 2012, 3:23:08 AM5/12/12

to project-voldemort

And sorry for typos, I didn't find the message edit function !

Jeff Clites

unread,

May 12, 2012, 4:50:12 PM5/12/12

to project-...@googlegroups.com

To your original question about resolving concurrent versions at put
time: I think that part of the reason Voldemort doesn't currently do
this is to allow custom conflict-resolution strategies which operate
client-side, where there may be more information available. (For
instance, a custom conflict resolver might consult some other data
source or use business-logic code not conveniently available
server-side.)

So I think that's the point of the current implementation. But, I
don't know how common it is to actually take advantage of that;
probably most people just use the default conflict resolver and fall
back to timestamps as you are planning to do, so it may not matter.

On the other hand, if your implementation is able to support
duplicates and you are just trying to minimize the data stored, then
you may want to go ahead and keep the concurrent versions as is done
today, on the theory that it's rare to actually have such concurrent
writes, so it would amount to only a tiny amount of data anyway. (Not
sure if that's actually true, but it seems likely.)

JEff

Francois

unread,

May 13, 2012, 3:14:43 AM5/13/12

to project-voldemort

@Jeff: we made the same conclusion and finally did a timestamp based
conflict resolution directly in DiskMap, because in our use case this
is fine. We could keep duplicates, but this would have a negative
impact on performance ( we chain N needles behind the same hash
bucket, so allowing duplicates means you can't stop walking the chain
at the first one, and even with chains of 1 to 4 needles there's a
performance impact when you try to kill disk seeks ). That's what I
meant about "we need to make it more generic". Today this storage is
really designed for our use case, and performance ( to be seen as a
cost factor ) and ease of operations ( log forward structure ) where
the main criteria. If fact in case we can't resolve the conflict, an
exception is thrown and the client deals with it.

Francois

Reply all

Reply to author

Forward