Blog post detailing efforts on improving BDB JE for Voldemort

Vinoth Chandar

unread,

Sep 14, 2012, 10:28:57 PM9/14/12

to project-...@googlegroups.com

I have tried to cover all the issues here in detail

http://distributeddreams.blogspot.com/2012/09/improving-bdb-je-storage-for-voldemort.html

Thanks
Vinoth

Francois

unread,

Sep 15, 2012, 3:32:39 AM9/15/12

to project-...@googlegroups.com

Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but that's not so much work to make it generic. Maybe it can be a starting point for a generic hash based storage, purely KV ? It implements a log forward needle structure with cleaning process, a chained bucket table, separate chaining and needle cache, and bucket commit on the fly (slow commit) with a shuttle mechanism. We are not so much used to Github (sorry svn fans) but if you want to give it a try, we can organize a quick phone + zipped mail session

François

Vinoth Chandar

unread,

Sep 17, 2012, 1:17:50 PM9/17/12

to project-...@googlegroups.com

Hi Francois,

It will be great if you send some pointers to architecture documents or presentations. Is this cachestore http://code.google.com/p/cachestore/ ?

Francois

unread,

Sep 21, 2012, 8:23:06 AM9/21/12

to project-...@googlegroups.com

No this is different. Cachestore uses a hashmap, we use a hash function and a bucket table, chaining records on disk.

So the memory footprint is lower for large number of keys: we couldn't live with bdb nor cachestore for RAM reasons.

The bucket table is committed from time to start, but a cold start can be done by rolling back all log files (data is log forward only). We also implement a cost benefit cleaning on log files. Needles behind the same bucket are simply chained on disk, pointer caching allowing faster access on recent ones.

3 basic type of objects: the bucket table, needle files, and needle pointers which are the needles without data. needles and needle pointers are cached with Guava ( typically you cache as much pointers as you can, and few needles because of the linux buffers ).

bucket table is backed up by a concurrent hashmap just for the time of the commit by a shuttle that walks down the bucket table in two passes ( which can be very slow on a hard drive for 4Gkeys, but is fast on SSD ), when the bucket table is partially dirty.

No doc so far (there will a post on our blog when we find the time), but basically this is it.

Francois

Vinoth Chandar

unread,

Sep 24, 2012, 11:04:33 PM9/24/12

to project-...@googlegroups.com

Sounds very interesting. Actually, the idea of chaining on disk sounds familiar. see SkimpyStash .
I did not get a clear picture about the disk layout though.
Looking forward to your blog!

Thanks
Vinoth

Francois

unread,

Sep 27, 2012, 8:10:07 AM9/27/12

to project-voldemort

Basically (in our case), data is stored in numbered files ( numbers
always go forward ), in needles, containing:

Control data & MD5, magic tags

Key, version
Pointer to the next needle on the same bucket: file nbr + needle
offset (reduced to chunks of 256/512/1024 according to your taste, to
make the pointer smaller)
Data

To find data you walk down the chain from the bucket table which only
contains a fileNbr and needle offset, which goes fast thanks to a
cache of needle pointers

To add/delete you find back your way, break the chain and insert the
new data, rebuild the smallest side of the chain

Bucket table commit is the tricky part, because you have to save dirty
writes during commit in a more classic map

That's it.

You can play with needle pointer cache size, full needle cache size,
caching strategy, second level caching on ssd (one of our plans), ...
to take best advantage of the system.
Then you can play by hacking the hashing function of Voldemort to get
related data ( that you mostly need together ) behind the same node,
behind the same diskmap bucket when that's relevant

Concurrent versions could be store on same bucket, in our case
versions known as concurrent throw obsolete at write time to optimize
space ( we have few versionned keys, and many single ones, so we can
manage post read resync )

Cleaning is reading files and rewriting, removing file after
operation. You do it when the cost/benefit is good (=rewrites/dirty)

Francois

On Sep 25, 5:04 am, Vinoth Chandar <mail.vinoth.chan...@gmail.com>
wrote:

> Sounds very interesting. Actually, the idea of chaining on disk sounds

> familiar. see SkimpyStash <http://dl.acm.org/citation.cfm?id=1989327%20> .

> >>>>http://distributeddreams.blogspot.com/2012/09/improving-bdb-je-storag...
>
> >>>> Thanks
> >>>> Vinoth

Vinoth Chandar

unread,

Oct 8, 2012, 3:43:32 PM10/8/12

to project-...@googlegroups.com

I am still not quite grasping the needle part of it. By needle, do you just mean a pointer to the other keys-value blocks that hashed to the same hash bucket?

I guess I will wait for your blog to be out and ask more questions. Seems easier than this.

Thanks
Vinoth

Francois

unread,

Oct 12, 2012, 4:13:16 AM10/12/12

to project-voldemort

Yes, Needle = block of data, with key, vector clock, data, md5, magic
numbers, and linked to the previous KV blocked that hashed to the same
bucket. Hard to find the time to write a decent article (and we'll
have other to write first to support the launch of the platform which
goes way beyond just file sharing), so feel free to mail or phone if
you need more info, we don't even have time to push on Git!

Vinoth Chandar

unread,

Oct 16, 2012, 2:27:49 PM10/16/12

to project-...@googlegroups.com

I will be on vacation for a while. Will sync up once back. Thanks!

Reply all

Reply to author

Forward