Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion Blog post detailing efforts on improving BDB JE for Voldemort

Received: by 10.68.226.41 with SMTP id rp9mr4608641pbc.4.1349725415138;
        Mon, 08 Oct 2012 12:43:35 -0700 (PDT)
X-BeenThere: project-voldemort@googlegroups.com
Received: by 10.68.194.202 with SMTP id hy10ls23561025pbc.0.gmail; Mon, 08 Oct
 2012 12:43:33 -0700 (PDT)
Received: by 10.68.229.231 with SMTP id st7mr4755644pbc.2.1349725413821;
        Mon, 08 Oct 2012 12:43:33 -0700 (PDT)
Date: Mon, 8 Oct 2012 12:43:32 -0700 (PDT)
From: Vinoth Chandar <mail.vinoth.chan...@gmail.com>
To: project-voldemort@googlegroups.com
Message-Id: <601666c6-174a-44f6-97e6-247210c2d74d@googlegroups.com>
In-Reply-To: <4c00de0d-7a32-4fcc-b8b7-a6bfaf7d27cc@r10g2000vby.googlegroups.com>
References: <e48c2f9b-6425-4cb5-a25e-652478f5f236@googlegroups.com>
 <50f65d65-7ac2-484b-b510-b74830ec53c1@googlegroups.com> <ffb7a472-a735-444d-aed1-1d3bb28e7a1f@googlegroups.com>
 <0c6f782a-b40c-430c-8470-075828fadad7@googlegroups.com> <a1f56167-30d9-4360-b86b-1c9c2a7b5bf8@googlegroups.com>
 <4c00de0d-7a32-4fcc-b8b7-a6bfaf7d27cc@r10g2000vby.googlegroups.com>
Subject: Re: Blog post detailing efforts on improving BDB JE for Voldemort
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_86_12577031.1349725412275"

------=_Part_86_12577031.1349725412275
Content-Type: multipart/alternative; 
	boundary="----=_Part_87_22795573.1349725412275"

------=_Part_87_22795573.1349725412275
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I am still not quite grasping the needle part of it.  By needle, do you=20
just mean a pointer to the other keys-value blocks that hashed to the same=
=20
hash bucket?=20

I guess I will wait for your blog to be out and ask more questions. Seems=
=20
easier than this.=20

Thanks
Vinoth

On Thursday, September 27, 2012 5:10:09 AM UTC-7, Francois wrote:
>
> Basically (in our case), data is stored in numbered files ( numbers=20
> always go forward ), in needles, containing:=20
>
> Control data & MD5, magic tags=20
>
>
> Key, version=20
> Pointer to the next needle on the same bucket: file nbr + needle=20
> offset (reduced to chunks of 256/512/1024 according to your taste, to=20
> make the pointer smaller)=20
> Data=20
>
> To find data you walk down the chain from the bucket table which only=20
> contains a fileNbr and needle offset, which goes fast thanks to a=20
> cache of needle pointers=20
>
> To add/delete you find back your way, break the chain and insert the=20
> new data, rebuild the smallest side of the chain=20
>
>
> Bucket table commit is the tricky part, because you have to save dirty=20
> writes during commit in a more classic map=20
>
>
> That's it.=20
>
>
> You can play with needle pointer cache size, full needle cache size,=20
> caching strategy, second level caching on ssd (one of our plans), ...=20
> to take best advantage of the system.=20
> Then you can play by hacking the hashing function of Voldemort to get=20
> related data ( that you mostly need together ) behind the same node,=20
> behind the same diskmap bucket when that's relevant=20
>
> Concurrent versions could be store on same bucket, in our case=20
> versions known as concurrent throw obsolete at write time to optimize=20
> space ( we have few versionned keys, and many single ones, so we can=20
> manage post read resync )=20
>
> Cleaning is reading files and rewriting, removing file after=20
> operation. You do it when the cost/benefit is good (=3Drewrites/dirty)=20
>
> Francois=20
>
> On Sep 25, 5:04 am, Vinoth Chandar <mail.vinoth.chan...@gmail.com>=20
> wrote:=20
> > Sounds very interesting. Actually, the idea of chaining on disk sounds=
=20
> > familiar. see SkimpyStash <http://dl.acm.org/citation.cfm?id=3D1989327%=
20>=20
> .=20
> > I did not get a clear picture about the disk layout though.=20
> > Looking forward to your blog!=20
> >=20
> > Thanks=20
> > Vinoth=20
> >=20
> >=20
> >=20
> >=20
> >=20
> >=20
> >=20
> > On Friday, September 21, 2012 5:23:06 AM UTC-7, Francois wrote:=20
> >=20
> > > No this is different. Cachestore uses a hashmap, we use a hash=20
> function=20
> > > and a bucket table, chaining records on disk.=20
> >=20
> > > So the memory footprint is lower for large number of keys: we couldn'=
t=20
> > > live with bdb nor cachestore for RAM reasons.=20
> >=20
> > > The bucket table is committed from time to start, but a cold start ca=
n=20
> be=20
> > > done by rolling back all log files (data is log forward only). We als=
o=20
> > > implement a cost benefit cleaning on log files. Needles behind the=20
> same=20
> > > bucket are simply chained on disk, pointer caching allowing faster=20
> access=20
> > > on recent ones.=20
> >=20
> > > 3 basic type of objects: the bucket table, needle files, and needle=
=20
> > > pointers which are the needles without data. needles and needle=20
> pointers=20
> > > are cached with Guava ( typically you cache as much pointers as you=
=20
> can,=20
> > > and few needles because of the linux buffers ).=20
> > > bucket table is backed up by a concurrent hashmap just for the time o=
f=20
> the=20
> > > commit by a shuttle that walks down the bucket table in two passes (=
=20
> which=20
> > > can be very slow on a hard drive for 4Gkeys, but is fast on SSD ),=20
> when the=20
> > > bucket table is partially dirty.=20
> >=20
> > > No doc so far (there will a post on our blog when we find the time),=
=20
> but=20
> > > basically this is it.=20
> >=20
> > > Francois=20
> >=20
> > > Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth Chandar:=
=20
> >=20
> > >> Hi Francois,=20
> >=20
> > >> It will be great if you send some pointers to architecture documents=
=20
> or=20
> > >> presentations. Is this cachestorehttp://code.google.com/p/cachestore=
/?=20
>
> >=20
> > >> On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois wrote:=
=20
> >=20
> > >>> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but=
=20
> > >>> that's not so much work to make it generic. Maybe it can be a=20
> starting=20
> > >>> point for a generic hash based storage, purely KV ? It implements a=
=20
> log=20
> > >>> forward needle structure with cleaning process, a chained bucket=20
> table,=20
> > >>> separate chaining and needle cache, and bucket commit on the fly=20
> (slow=20
> > >>> commit) with a shuttle mechanism. We are not so much used to Github=
=20
> (sorry=20
> > >>> svn fans) but if you want to give it a try, we can organize a quick=
=20
> phone +=20
> > >>> zipped mail session=20
> >=20
> > >>> Fran=C3=A7ois=20
> >=20
> > >>> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a =C3=A9=
crit :=20
> >=20
> > >>>> I have tried to cover all the issues here in detail=20
> >=20
> > >>>>
> http://distributeddreams.blogspot.com/2012/09/improving-bdb-je-storag...=
=20
> >=20
> > >>>> Thanks=20
> > >>>> Vinoth=20
>

------=_Part_87_22795573.1349725412275
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

I am still not quite grasping the needle part of it.&nbsp; By needle, do yo=
u just mean a pointer to the other keys-value blocks that hashed to the sam=
e hash bucket? <br><br>I guess I will wait for your blog to be out and ask =
more questions. Seems easier than this. <br><br>Thanks<br>Vinoth<br><br>On =
Thursday, September 27, 2012 5:10:09 AM UTC-7, Francois wrote:<blockquote c=
lass=3D"gmail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-left: 1px=
 #ccc solid;padding-left: 1ex;">Basically (in our case), data is stored in =
numbered files ( numbers
<br>always go forward ), in needles, containing:
<br>
<br>Control data &amp; MD5, magic tags
<br>
<br>
<br>Key, version
<br>Pointer to the next needle on the same bucket: file nbr + needle
<br>offset (reduced to chunks of 256/512/1024 according to your taste, to
<br>make the pointer smaller)
<br>Data
<br>
<br>To find data you walk down the chain from the bucket table which only
<br>contains a fileNbr and needle offset, which goes fast thanks to a
<br>cache of needle pointers
<br>
<br>To add/delete you find back your way, break the chain and insert the
<br>new data, rebuild the smallest side of the chain
<br>
<br>
<br>Bucket table commit is the tricky part, because you have to save dirty
<br>writes during commit in a more classic map
<br>
<br>
<br>That's it.
<br>
<br>
<br>You can play with needle pointer cache size, full needle cache size,
<br>caching strategy, second level caching on ssd (one of our plans), ...
<br>to take best advantage of the system.
<br>Then you can play by hacking the hashing function of Voldemort to get
<br>related data ( that you mostly need together ) behind the same node,
<br>behind the same diskmap bucket when that's relevant
<br>
<br>Concurrent versions could be store on same bucket, in our case
<br>versions known as concurrent throw obsolete at write time to optimize
<br>space ( we have few versionned keys, and many single ones, so we can
<br>manage post read resync )
<br>
<br>Cleaning is reading files and rewriting, removing file after
<br>operation. You do it when the cost/benefit is good (=3Drewrites/dirty)
<br>
<br>Francois
<br>
<br>On Sep 25, 5:04&nbsp;am, Vinoth Chandar &lt;<a>mail.vinoth.chan...@gmai=
l.com</a><wbr>&gt;
<br>wrote:
<br>&gt; Sounds very interesting. Actually, the idea of chaining on disk so=
unds
<br>&gt; familiar. see SkimpyStash &lt;<a href=3D"http://dl.acm.org/citatio=
n.cfm?id=3D1989327%20" target=3D"_blank">http://dl.acm.org/citation.<wbr>cf=
m?id=3D1989327%20</a>&gt; .
<br>&gt; I did not get a clear picture about the disk layout though.
<br>&gt; Looking forward to your blog!
<br>&gt;
<br>&gt; Thanks
<br>&gt; Vinoth
<br>&gt;
<br>&gt;
<br>&gt;
<br>&gt;
<br>&gt;
<br>&gt;
<br>&gt;
<br>&gt; On Friday, September 21, 2012 5:23:06 AM UTC-7, Francois wrote:
<br>&gt;
<br>&gt; &gt; No this is different. Cachestore uses a hashmap, we use a has=
h function
<br>&gt; &gt; and a bucket table, chaining records on disk.
<br>&gt;
<br>&gt; &gt; So the memory footprint is lower for large number of keys: we=
 couldn't
<br>&gt; &gt; live with bdb nor cachestore for RAM reasons.
<br>&gt;
<br>&gt; &gt; The bucket table is committed from time to start, but a cold =
start can be
<br>&gt; &gt; done by rolling back all log files (data is log forward only)=
. We also
<br>&gt; &gt; implement a cost benefit cleaning on log files. Needles behin=
d the same
<br>&gt; &gt; bucket are simply chained on disk, pointer caching allowing f=
aster access
<br>&gt; &gt; on recent ones.
<br>&gt;
<br>&gt; &gt; 3 basic type of objects: the bucket table, needle files, and =
needle
<br>&gt; &gt; pointers which are the needles without data. needles and need=
le pointers
<br>&gt; &gt; are cached with Guava ( typically you cache as much pointers =
as you can,
<br>&gt; &gt; and few needles because of the linux buffers ).
<br>&gt; &gt; bucket table is backed up by a concurrent hashmap just for th=
e time of the
<br>&gt; &gt; commit by a shuttle that walks down the bucket table in two p=
asses ( which
<br>&gt; &gt; can be very slow on a hard drive for 4Gkeys, but is fast on S=
SD ), when the
<br>&gt; &gt; bucket table is partially dirty.
<br>&gt;
<br>&gt; &gt; No doc so far (there will a post on our blog when we find the=
 time), but
<br>&gt; &gt; basically this is it.
<br>&gt;
<br>&gt; &gt; Francois
<br>&gt;
<br>&gt; &gt; Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth C=
handar:
<br>&gt;
<br>&gt; &gt;&gt; Hi Francois,
<br>&gt;
<br>&gt; &gt;&gt; It will be great if you send some pointers to architectur=
e documents or
<br>&gt; &gt;&gt; presentations. Is this cachestorehttp://<a href=3D"http:/=
/code.google.com/p/cachestore/" target=3D"_blank">code.google.<wbr>com/p/ca=
chestore/</a>?
<br>&gt;
<br>&gt; &gt;&gt; On Saturday, September 15, 2012 12:32:39 AM UTC-7, Franco=
is wrote:
<br>&gt;
<br>&gt; &gt;&gt;&gt; Hello Vinoth, as I said diskmap is a bit tweaked to o=
ur needs, but
<br>&gt; &gt;&gt;&gt; that's not so much work to make it generic. Maybe it =
can be a starting
<br>&gt; &gt;&gt;&gt; point for a generic hash based storage, purely KV ? I=
t implements a log
<br>&gt; &gt;&gt;&gt; forward needle structure with cleaning process, a cha=
ined bucket table,
<br>&gt; &gt;&gt;&gt; separate chaining and needle cache, and bucket commit=
 on the fly (slow
<br>&gt; &gt;&gt;&gt; commit) with a shuttle mechanism. We are not so much =
used to Github (sorry
<br>&gt; &gt;&gt;&gt; svn fans) but if you want to give it a try, we can or=
ganize a quick phone +
<br>&gt; &gt;&gt;&gt; zipped mail session
<br>&gt;
<br>&gt; &gt;&gt;&gt; Fran=C3=A7ois
<br>&gt;
<br>&gt; &gt;&gt;&gt; Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Ch=
andar a =C3=A9crit :
<br>&gt;
<br>&gt; &gt;&gt;&gt;&gt; I have tried to cover all the issues here in deta=
il
<br>&gt;
<br>&gt; &gt;&gt;&gt;&gt;<a href=3D"http://distributeddreams.blogspot.com/2=
012/09/improving-bdb-je-storag." target=3D"_blank">http://distributeddreams=
.<wbr>blogspot.com/2012/09/<wbr>improving-bdb-je-storag.</a>..
<br>&gt;
<br>&gt; &gt;&gt;&gt;&gt; Thanks
<br>&gt; &gt;&gt;&gt;&gt; Vinoth
<br></blockquote>
------=_Part_87_22795573.1349725412275--

------=_Part_86_12577031.1349725412275--