Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but that's not so much work to make it generic. Maybe it can be a starting point for a generic hash based storage, purely KV ? It implements a log forward needle structure with cleaning process, a chained bucket table, separate chaining and needle cache, and bucket commit on the fly (slow commit) with a shuttle mechanism. We are not so much used to Github (sorry svn fans) but if you want to give it a try, we can organize a quick phone + zipped mail session
François
Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a écrit :
On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois wrote:
> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but that's > not so much work to make it generic. Maybe it can be a starting point for a > generic hash based storage, purely KV ? It implements a log forward needle > structure with cleaning process, a chained bucket table, separate chaining > and needle cache, and bucket commit on the fly (slow commit) with a shuttle > mechanism. We are not so much used to Github (sorry svn fans) but if you > want to give it a try, we can organize a quick phone + zipped mail session
> François
> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a écrit :
>> I have tried to cover all the issues here in detail
No this is different. Cachestore uses a hashmap, we use a hash function and a bucket table, chaining records on disk.
So the memory footprint is lower for large number of keys: we couldn't live with bdb nor cachestore for RAM reasons.
The bucket table is committed from time to start, but a cold start can be done by rolling back all log files (data is log forward only). We also implement a cost benefit cleaning on log files. Needles behind the same bucket are simply chained on disk, pointer caching allowing faster access on recent ones.
3 basic type of objects: the bucket table, needle files, and needle pointers which are the needles without data. needles and needle pointers are cached with Guava ( typically you cache as much pointers as you can, and few needles because of the linux buffers ).
bucket table is backed up by a concurrent hashmap just for the time of the commit by a shuttle that walks down the bucket table in two passes ( which can be very slow on a hard drive for 4Gkeys, but is fast on SSD ), when the bucket table is partially dirty.
No doc so far (there will a post on our blog when we find the time), but basically this is it.
Francois
Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth Chandar:
> It will be great if you send some pointers to architecture documents or > presentations. Is this cachestore http://code.google.com/p/cachestore/ ?
> On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois wrote:
>> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but that's >> not so much work to make it generic. Maybe it can be a starting point for a >> generic hash based storage, purely KV ? It implements a log forward needle >> structure with cleaning process, a chained bucket table, separate chaining >> and needle cache, and bucket commit on the fly (slow commit) with a shuttle >> mechanism. We are not so much used to Github (sorry svn fans) but if you >> want to give it a try, we can organize a quick phone + zipped mail session
>> François
>> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a écrit :
>>> I have tried to cover all the issues here in detail
Sounds very interesting. Actually, the idea of chaining on disk sounds familiar. see SkimpyStash <http://dl.acm.org/citation.cfm?id=1989327%20> .
I did not get a clear picture about the disk layout though. Looking forward to your blog!
On Friday, September 21, 2012 5:23:06 AM UTC-7, Francois wrote:
> No this is different. Cachestore uses a hashmap, we use a hash function > and a bucket table, chaining records on disk.
> So the memory footprint is lower for large number of keys: we couldn't > live with bdb nor cachestore for RAM reasons.
> The bucket table is committed from time to start, but a cold start can be > done by rolling back all log files (data is log forward only). We also > implement a cost benefit cleaning on log files. Needles behind the same > bucket are simply chained on disk, pointer caching allowing faster access > on recent ones.
> 3 basic type of objects: the bucket table, needle files, and needle > pointers which are the needles without data. needles and needle pointers > are cached with Guava ( typically you cache as much pointers as you can, > and few needles because of the linux buffers ).
> bucket table is backed up by a concurrent hashmap just for the time of the > commit by a shuttle that walks down the bucket table in two passes ( which > can be very slow on a hard drive for 4Gkeys, but is fast on SSD ), when the > bucket table is partially dirty.
> No doc so far (there will a post on our blog when we find the time), but > basically this is it.
> Francois
> Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth Chandar:
>> Hi Francois,
>> It will be great if you send some pointers to architecture documents or >> presentations. Is this cachestore http://code.google.com/p/cachestore/ ?
>> On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois wrote:
>>> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but >>> that's not so much work to make it generic. Maybe it can be a starting >>> point for a generic hash based storage, purely KV ? It implements a log >>> forward needle structure with cleaning process, a chained bucket table, >>> separate chaining and needle cache, and bucket commit on the fly (slow >>> commit) with a shuttle mechanism. We are not so much used to Github (sorry >>> svn fans) but if you want to give it a try, we can organize a quick phone + >>> zipped mail session
>>> François
>>> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a écrit :
>>>> I have tried to cover all the issues here in detail
Basically (in our case), data is stored in numbered files ( numbers
always go forward ), in needles, containing:
Control data & MD5, magic tags
Key, version
Pointer to the next needle on the same bucket: file nbr + needle
offset (reduced to chunks of 256/512/1024 according to your taste, to
make the pointer smaller)
Data
To find data you walk down the chain from the bucket table which only
contains a fileNbr and needle offset, which goes fast thanks to a
cache of needle pointers
To add/delete you find back your way, break the chain and insert the
new data, rebuild the smallest side of the chain
Bucket table commit is the tricky part, because you have to save dirty
writes during commit in a more classic map
That's it.
You can play with needle pointer cache size, full needle cache size,
caching strategy, second level caching on ssd (one of our plans), ...
to take best advantage of the system.
Then you can play by hacking the hashing function of Voldemort to get
related data ( that you mostly need together ) behind the same node,
behind the same diskmap bucket when that's relevant
Concurrent versions could be store on same bucket, in our case
versions known as concurrent throw obsolete at write time to optimize
space ( we have few versionned keys, and many single ones, so we can
manage post read resync )
Cleaning is reading files and rewriting, removing file after
operation. You do it when the cost/benefit is good (=rewrites/dirty)
Francois
On Sep 25, 5:04 am, Vinoth Chandar <mail.vinoth.chan...@gmail.com>
wrote:
> Sounds very interesting. Actually, the idea of chaining on disk sounds
> familiar. see SkimpyStash <http://dl.acm.org/citation.cfm?id=1989327%20> .
> I did not get a clear picture about the disk layout though.
> Looking forward to your blog!
> Thanks
> Vinoth
> On Friday, September 21, 2012 5:23:06 AM UTC-7, Francois wrote:
> > No this is different. Cachestore uses a hashmap, we use a hash function
> > and a bucket table, chaining records on disk.
> > So the memory footprint is lower for large number of keys: we couldn't
> > live with bdb nor cachestore for RAM reasons.
> > The bucket table is committed from time to start, but a cold start can be
> > done by rolling back all log files (data is log forward only). We also
> > implement a cost benefit cleaning on log files. Needles behind the same
> > bucket are simply chained on disk, pointer caching allowing faster access
> > on recent ones.
> > 3 basic type of objects: the bucket table, needle files, and needle
> > pointers which are the needles without data. needles and needle pointers
> > are cached with Guava ( typically you cache as much pointers as you can,
> > and few needles because of the linux buffers ).
> > bucket table is backed up by a concurrent hashmap just for the time of the
> > commit by a shuttle that walks down the bucket table in two passes ( which
> > can be very slow on a hard drive for 4Gkeys, but is fast on SSD ), when the
> > bucket table is partially dirty.
> > No doc so far (there will a post on our blog when we find the time), but
> > basically this is it.
> > Francois
> > Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth Chandar:
> >> Hi Francois,
> >> It will be great if you send some pointers to architecture documents or
> >> presentations. Is this cachestorehttp://code.google.com/p/cachestore/?
> >> On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois wrote:
> >>> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but
> >>> that's not so much work to make it generic. Maybe it can be a starting
> >>> point for a generic hash based storage, purely KV ? It implements a log
> >>> forward needle structure with cleaning process, a chained bucket table,
> >>> separate chaining and needle cache, and bucket commit on the fly (slow
> >>> commit) with a shuttle mechanism. We are not so much used to Github (sorry
> >>> svn fans) but if you want to give it a try, we can organize a quick phone +
> >>> zipped mail session
> >>> François
> >>> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a écrit :
> >>>> I have tried to cover all the issues here in detail
I am still not quite grasping the needle part of it. By needle, do you just mean a pointer to the other keys-value blocks that hashed to the same hash bucket?
I guess I will wait for your blog to be out and ask more questions. Seems easier than this.
On Thursday, September 27, 2012 5:10:09 AM UTC-7, Francois wrote:
> Basically (in our case), data is stored in numbered files ( numbers > always go forward ), in needles, containing:
> Control data & MD5, magic tags
> Key, version > Pointer to the next needle on the same bucket: file nbr + needle > offset (reduced to chunks of 256/512/1024 according to your taste, to > make the pointer smaller) > Data
> To find data you walk down the chain from the bucket table which only > contains a fileNbr and needle offset, which goes fast thanks to a > cache of needle pointers
> To add/delete you find back your way, break the chain and insert the > new data, rebuild the smallest side of the chain
> Bucket table commit is the tricky part, because you have to save dirty > writes during commit in a more classic map
> That's it.
> You can play with needle pointer cache size, full needle cache size, > caching strategy, second level caching on ssd (one of our plans), ... > to take best advantage of the system. > Then you can play by hacking the hashing function of Voldemort to get > related data ( that you mostly need together ) behind the same node, > behind the same diskmap bucket when that's relevant
> Concurrent versions could be store on same bucket, in our case > versions known as concurrent throw obsolete at write time to optimize > space ( we have few versionned keys, and many single ones, so we can > manage post read resync )
> Cleaning is reading files and rewriting, removing file after > operation. You do it when the cost/benefit is good (=rewrites/dirty)
> Francois
> On Sep 25, 5:04 am, Vinoth Chandar <mail.vinoth.chan...@gmail.com> > wrote: > > Sounds very interesting. Actually, the idea of chaining on disk sounds > > familiar. see SkimpyStash <http://dl.acm.org/citation.cfm?id=1989327%20> > . > > I did not get a clear picture about the disk layout though. > > Looking forward to your blog!
> > Thanks > > Vinoth
> > On Friday, September 21, 2012 5:23:06 AM UTC-7, Francois wrote:
> > > No this is different. Cachestore uses a hashmap, we use a hash > function > > > and a bucket table, chaining records on disk.
> > > So the memory footprint is lower for large number of keys: we couldn't > > > live with bdb nor cachestore for RAM reasons.
> > > The bucket table is committed from time to start, but a cold start can > be > > > done by rolling back all log files (data is log forward only). We also > > > implement a cost benefit cleaning on log files. Needles behind the > same > > > bucket are simply chained on disk, pointer caching allowing faster > access > > > on recent ones.
> > > 3 basic type of objects: the bucket table, needle files, and needle > > > pointers which are the needles without data. needles and needle > pointers > > > are cached with Guava ( typically you cache as much pointers as you > can, > > > and few needles because of the linux buffers ). > > > bucket table is backed up by a concurrent hashmap just for the time of > the > > > commit by a shuttle that walks down the bucket table in two passes ( > which > > > can be very slow on a hard drive for 4Gkeys, but is fast on SSD ), > when the > > > bucket table is partially dirty.
> > > No doc so far (there will a post on our blog when we find the time), > but > > > basically this is it.
> > > Francois
> > > Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth Chandar:
> > >> Hi Francois,
> > >> It will be great if you send some pointers to architecture documents > or > > >> presentations. Is this cachestorehttp://code.google.com/p/cachestore/?
> > >> On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois wrote:
> > >>> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but > > >>> that's not so much work to make it generic. Maybe it can be a > starting > > >>> point for a generic hash based storage, purely KV ? It implements a > log > > >>> forward needle structure with cleaning process, a chained bucket > table, > > >>> separate chaining and needle cache, and bucket commit on the fly > (slow > > >>> commit) with a shuttle mechanism. We are not so much used to Github > (sorry > > >>> svn fans) but if you want to give it a try, we can organize a quick > phone + > > >>> zipped mail session
> > >>> François
> > >>> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a écrit :
> > >>>> I have tried to cover all the issues here in detail
Yes, Needle = block of data, with key, vector clock, data, md5, magic
numbers, and linked to the previous KV blocked that hashed to the same
bucket. Hard to find the time to write a decent article (and we'll
have other to write first to support the launch of the platform which
goes way beyond just file sharing), so feel free to mail or phone if
you need more info, we don't even have time to push on Git!
On 8 oct, 21:43, Vinoth Chandar <mail.vinoth.chan...@gmail.com> wrote:
> I am still not quite grasping the needle part of it. By needle, do you
> just mean a pointer to the other keys-value blocks that hashed to the same
> hash bucket?
> I guess I will wait for your blog to be out and ask more questions. Seems
> easier than this.
> Thanks
> Vinoth
> On Thursday, September 27, 2012 5:10:09 AM UTC-7, Francois wrote:
> > Basically (in our case), data is stored in numbered files ( numbers
> > always go forward ), in needles, containing:
> > Control data & MD5, magic tags
> > Key, version
> > Pointer to the next needle on the same bucket: file nbr + needle
> > offset (reduced to chunks of 256/512/1024 according to your taste, to
> > make the pointer smaller)
> > Data
> > To find data you walk down the chain from the bucket table which only
> > contains a fileNbr and needle offset, which goes fast thanks to a
> > cache of needle pointers
> > To add/delete you find back your way, break the chain and insert the
> > new data, rebuild the smallest side of the chain
> > Bucket table commit is the tricky part, because you have to save dirty
> > writes during commit in a more classic map
> > That's it.
> > You can play with needle pointer cache size, full needle cache size,
> > caching strategy, second level caching on ssd (one of our plans), ...
> > to take best advantage of the system.
> > Then you can play by hacking the hashing function of Voldemort to get
> > related data ( that you mostly need together ) behind the same node,
> > behind the same diskmap bucket when that's relevant
> > Concurrent versions could be store on same bucket, in our case
> > versions known as concurrent throw obsolete at write time to optimize
> > space ( we have few versionned keys, and many single ones, so we can
> > manage post read resync )
> > Cleaning is reading files and rewriting, removing file after
> > operation. You do it when the cost/benefit is good (=rewrites/dirty)
> > Francois
> > On Sep 25, 5:04 am, Vinoth Chandar <mail.vinoth.chan...@gmail.com>
> > wrote:
> > > Sounds very interesting. Actually, the idea of chaining on disk sounds
> > > familiar. see SkimpyStash <http://dl.acm.org/citation.cfm?id=1989327%20>
> > .
> > > I did not get a clear picture about the disk layout though.
> > > Looking forward to your blog!
> > > Thanks
> > > Vinoth
> > > On Friday, September 21, 2012 5:23:06 AM UTC-7, Francois wrote:
> > > > No this is different. Cachestore uses a hashmap, we use a hash
> > function
> > > > and a bucket table, chaining records on disk.
> > > > So the memory footprint is lower for large number of keys: we couldn't
> > > > live with bdb nor cachestore for RAM reasons.
> > > > The bucket table is committed from time to start, but a cold start can
> > be
> > > > done by rolling back all log files (data is log forward only). We also
> > > > implement a cost benefit cleaning on log files. Needles behind the
> > same
> > > > bucket are simply chained on disk, pointer caching allowing faster
> > access
> > > > on recent ones.
> > > > 3 basic type of objects: the bucket table, needle files, and needle
> > > > pointers which are the needles without data. needles and needle
> > pointers
> > > > are cached with Guava ( typically you cache as much pointers as you
> > can,
> > > > and few needles because of the linux buffers ).
> > > > bucket table is backed up by a concurrent hashmap just for the time of
> > the
> > > > commit by a shuttle that walks down the bucket table in two passes (
> > which
> > > > can be very slow on a hard drive for 4Gkeys, but is fast on SSD ),
> > when the
> > > > bucket table is partially dirty.
> > > > No doc so far (there will a post on our blog when we find the time),
> > but
> > > > basically this is it.
> > > > Francois
> > > > Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth Chandar:
> > > >> Hi Francois,
> > > >> It will be great if you send some pointers to architecture documents
> > or
> > > >> presentations. Is this cachestorehttp://code.google.com/p/cachestore/?
> > > >> On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois wrote:
> > > >>> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, but
> > > >>> that's not so much work to make it generic. Maybe it can be a
> > starting
> > > >>> point for a generic hash based storage, purely KV ? It implements a
> > log
> > > >>> forward needle structure with cleaning process, a chained bucket
> > table,
> > > >>> separate chaining and needle cache, and bucket commit on the fly
> > (slow
> > > >>> commit) with a shuttle mechanism. We are not so much used to Github
> > (sorry
> > > >>> svn fans) but if you want to give it a try, we can organize a quick
> > phone +
> > > >>> zipped mail session
> > > >>> François
> > > >>> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a écrit :
> > > >>>> I have tried to cover all the issues here in detail
On Friday, October 12, 2012 1:13:18 AM UTC-7, Francois wrote:
> Yes, Needle = block of data, with key, vector clock, data, md5, magic > numbers, and linked to the previous KV blocked that hashed to the same > bucket. Hard to find the time to write a decent article (and we'll > have other to write first to support the launch of the platform which > goes way beyond just file sharing), so feel free to mail or phone if > you need more info, we don't even have time to push on Git!
> On 8 oct, 21:43, Vinoth Chandar <mail.vinoth.chan...@gmail.com> wrote: > > I am still not quite grasping the needle part of it. By needle, do you > > just mean a pointer to the other keys-value blocks that hashed to the > same > > hash bucket?
> > I guess I will wait for your blog to be out and ask more questions. > Seems > > easier than this.
> > Thanks > > Vinoth
> > On Thursday, September 27, 2012 5:10:09 AM UTC-7, Francois wrote:
> > > Basically (in our case), data is stored in numbered files ( numbers > > > always go forward ), in needles, containing:
> > > Control data & MD5, magic tags
> > > Key, version > > > Pointer to the next needle on the same bucket: file nbr + needle > > > offset (reduced to chunks of 256/512/1024 according to your taste, to > > > make the pointer smaller) > > > Data
> > > To find data you walk down the chain from the bucket table which only > > > contains a fileNbr and needle offset, which goes fast thanks to a > > > cache of needle pointers
> > > To add/delete you find back your way, break the chain and insert the > > > new data, rebuild the smallest side of the chain
> > > Bucket table commit is the tricky part, because you have to save dirty > > > writes during commit in a more classic map
> > > That's it.
> > > You can play with needle pointer cache size, full needle cache size, > > > caching strategy, second level caching on ssd (one of our plans), ... > > > to take best advantage of the system. > > > Then you can play by hacking the hashing function of Voldemort to get > > > related data ( that you mostly need together ) behind the same node, > > > behind the same diskmap bucket when that's relevant
> > > Concurrent versions could be store on same bucket, in our case > > > versions known as concurrent throw obsolete at write time to optimize > > > space ( we have few versionned keys, and many single ones, so we can > > > manage post read resync )
> > > Cleaning is reading files and rewriting, removing file after > > > operation. You do it when the cost/benefit is good (=rewrites/dirty)
> > > Francois
> > > On Sep 25, 5:04 am, Vinoth Chandar <mail.vinoth.chan...@gmail.com> > > > wrote: > > > > Sounds very interesting. Actually, the idea of chaining on disk > sounds > > > > familiar. see SkimpyStash <
> http://dl.acm.org/citation.cfm?id=1989327%20> > > > . > > > > I did not get a clear picture about the disk layout though. > > > > Looking forward to your blog!
> > > > Thanks > > > > Vinoth
> > > > On Friday, September 21, 2012 5:23:06 AM UTC-7, Francois wrote:
> > > > > No this is different. Cachestore uses a hashmap, we use a hash > > > function > > > > > and a bucket table, chaining records on disk.
> > > > > So the memory footprint is lower for large number of keys: we > couldn't > > > > > live with bdb nor cachestore for RAM reasons.
> > > > > The bucket table is committed from time to start, but a cold start > can > > > be > > > > > done by rolling back all log files (data is log forward only). We > also > > > > > implement a cost benefit cleaning on log files. Needles behind the > > > same > > > > > bucket are simply chained on disk, pointer caching allowing faster > > > access > > > > > on recent ones.
> > > > > 3 basic type of objects: the bucket table, needle files, and > needle > > > > > pointers which are the needles without data. needles and needle > > > pointers > > > > > are cached with Guava ( typically you cache as much pointers as > you > > > can, > > > > > and few needles because of the linux buffers ). > > > > > bucket table is backed up by a concurrent hashmap just for the > time of > > > the > > > > > commit by a shuttle that walks down the bucket table in two passes > ( > > > which > > > > > can be very slow on a hard drive for 4Gkeys, but is fast on SSD ), > > > when the > > > > > bucket table is partially dirty.
> > > > > No doc so far (there will a post on our blog when we find the > time), > > > but > > > > > basically this is it.
> > > > > Francois
> > > > > Am Montag, 17. September 2012 19:17:50 UTC+2 schrieb Vinoth > Chandar:
> > > > >> Hi Francois,
> > > > >> It will be great if you send some pointers to architecture > documents > > > or > > > > >> presentations. Is this cachestorehttp://
> code.google.com/p/cachestore/?
> > > > >> On Saturday, September 15, 2012 12:32:39 AM UTC-7, Francois > wrote:
> > > > >>> Hello Vinoth, as I said diskmap is a bit tweaked to our needs, > but > > > > >>> that's not so much work to make it generic. Maybe it can be a > > > starting > > > > >>> point for a generic hash based storage, purely KV ? It > implements a > > > log > > > > >>> forward needle structure with cleaning process, a chained bucket > > > table, > > > > >>> separate chaining and needle cache, and bucket commit on the fly > > > (slow > > > > >>> commit) with a shuttle mechanism. We are not so much used to > Github > > > (sorry > > > > >>> svn fans) but if you want to give it a try, we can organize a > quick > > > phone + > > > > >>> zipped mail session
> > > > >>> François
> > > > >>> Le samedi 15 septembre 2012 04:28:57 UTC+2, Vinoth Chandar a > écrit :
> > > > >>>> I have tried to cover all the issues here in detail