After looking at cache store (a nice implementation, that we like
because hash makes sense), we are building a new type of store called
DiskMap , for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
cheap machines with a very low ratio RAM/Disk. The use case is tons of
data, but only 5% live ones.
Our need is not the fastest performance, but a very good, consistent
performance even with 1 to 10 GKeys. This is based on a logged needle
structure (vaguely inspired from haystack)+ a disk mapped hash table
with a dirty cache and an elevator pattern for committing the index.
Unit tests are promising with results on a test workstation with a
very slow disk being > 10 times BDB. In fact as fast as BDB or cache
store when all index fits in memory, which is impossible in our case
with a target of 160TB drive arrays on a 24GB RAM machine (like the
Backblaze storage pod).
With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
get a crazy write performance.
Our app does not need to keep versions, this is done at a higher level
using different keys.
So to optimize this storage strictly dedicated to Voldemort, would it
create problems to clean versions on the fly during puts ?
- Eliminating obsolete versions at write time, like in the bdb
implementation
- Eliminating concurrent versions based on timestamp, unlike the bdb
implementation
Our goal is to keep the hash chain as short as possible to minimize
disk seeks. Any danger doing this ?
Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
running on a host with entry level disks in RAID 1 (so not the optimal
situation).
Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
key depending on the hashmap loading factor (10 runs OK). The rest is
dedicated to needles and needlepointer cache.
Performance can be better with one small SSD for the bucket table, but
this runs fine on 2 HDD (you just need two different axis).
It has been designed to be modified to be partition aware (sub-
partitionning), and if 100% dedicated to Voldemort.
Use case is good performance for very large data sets with 10% live
data (so that last access based caching is efficient). We need a small
effort to make it generic, but if someone has the same use case feel
free to contact me.
François
On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
> After looking at cache store (a nice implementation, that we like
> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
> cheap machines with a very low ratio RAM/Disk. The use case is tons of
> data, but only 5% live ones.
> Our need is not the fastest performance, but a very good, consistent
> performance even with 1 to 10 GKeys. This is based on a logged needle
> structure (vaguely inspired from haystack)+ a disk mapped hash table
> with a dirty cache and an elevator pattern for committing the index.
> Unit tests are promising with results on a test workstation with a
> very slow disk being > 10 times BDB. In fact as fast as BDB or cache
> store when all index fits in memory, which is impossible in our case
> with a target of 160TB drive arrays on a 24GB RAM machine (like the
> Backblaze storage pod).
> With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
> get a crazy write performance.
> Our app does not need to keep versions, this is done at a higher level
> using different keys.
> So to optimize this storage strictly dedicated to Voldemort, would it
> create problems to clean versions on the fly during puts ?
> - Eliminating obsolete versions at write time, like in the bdb
> implementation
> - Eliminating concurrent versions based on timestamp, unlike the bdb
> implementation
> Our goal is to keep the hash chain as short as possible to minimize
> disk seeks. Any danger doing this ?
On Thu, May 10, 2012 at 5:42 AM, Francois <francois.vai...@ezcgroup.net> wrote:
> Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
> running on a host with entry level disks in RAID 1 (so not the optimal
> situation).
> Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
> key depending on the hashmap loading factor (10 runs OK). The rest is
> dedicated to needles and needlepointer cache.
> Performance can be better with one small SSD for the bucket table, but
> this runs fine on 2 HDD (you just need two different axis).
> It has been designed to be modified to be partition aware (sub-
> partitionning), and if 100% dedicated to Voldemort.
> Use case is good performance for very large data sets with 10% live
> data (so that last access based caching is efficient). We need a small
> effort to make it generic, but if someone has the same use case feel
> free to contact me.
> François
> On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
>> After looking at cache store (a nice implementation, that we like
>> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
>> cheap machines with a very low ratio RAM/Disk. The use case is tons of
>> data, but only 5% live ones.
>> Our need is not the fastest performance, but a very good, consistent
>> performance even with 1 to 10 GKeys. This is based on a logged needle
>> structure (vaguely inspired from haystack)+ a disk mapped hash table
>> with a dirty cache and an elevator pattern for committing the index.
>> Unit tests are promising with results on a test workstation with a
>> very slow disk being > 10 times BDB. In fact as fast as BDB or cache
>> store when all index fits in memory, which is impossible in our case
>> with a target of 160TB drive arrays on a 24GB RAM machine (like the
>> Backblaze storage pod).
>> With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
>> get a crazy write performance.
>> Our app does not need to keep versions, this is done at a higher level
>> using different keys.
>> So to optimize this storage strictly dedicated to Voldemort, would it
>> create problems to clean versions on the fly during puts ?
>> - Eliminating obsolete versions at write time, like in the bdb
>> implementation
>> - Eliminating concurrent versions based on timestamp, unlike the bdb
>> implementation
>> Our goal is to keep the hash chain as short as possible to minimize
>> disk seeks. Any danger doing this ?
>> Francois
> --
> You received this message because you are subscribed to the Google Groups "project-voldemort" group.
> To post to this group, send email to project-voldemort@googlegroups.com.
> To unsubscribe from this group, send email to project-voldemort+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.
> Sounds neat Francois. Can you share the code via github?
> Thanks,
> Greg
> On Thu, May 10, 2012 at 5:42 AM, Francois <francois.vai...@ezcgroup.net> wrote:
> > Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
> > running on a host with entry level disks in RAID 1 (so not the optimal
> > situation).
> > Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
> > key depending on the hashmap loading factor (10 runs OK). The rest is
> > dedicated to needles and needlepointer cache.
> > Performance can be better with one small SSD for the bucket table, but
> > this runs fine on 2 HDD (you just need two different axis).
> > It has been designed to be modified to be partition aware (sub-
> > partitionning), and if 100% dedicated to Voldemort.
> > Use case is good performance for very large data sets with 10% live
> > data (so that last access based caching is efficient). We need a small
> > effort to make it generic, but if someone has the same use case feel
> > free to contact me.
> > François
> > On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
> >> After looking at cache store (a nice implementation, that we like
> >> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
> >> cheap machines with a very low ratio RAM/Disk. The use case is tons of
> >> data, but only 5% live ones.
> >> Our need is not the fastest performance, but a very good, consistent
> >> performance even with 1 to 10 GKeys. This is based on a logged needle
> >> structure (vaguely inspired from haystack)+ a disk mapped hash table
> >> with a dirty cache and an elevator pattern for committing the index.
> >> Unit tests are promising with results on a test workstation with a
> >> very slow disk being > 10 times BDB. In fact as fast as BDB or cache
> >> store when all index fits in memory, which is impossible in our case
> >> with a target of 160TB drive arrays on a 24GB RAM machine (like the
> >> Backblaze storage pod).
> >> With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
> >> get a crazy write performance.
> >> Our app does not need to keep versions, this is done at a higher level
> >> using different keys.
> >> So to optimize this storage strictly dedicated to Voldemort, would it
> >> create problems to clean versions on the fly during puts ?
> >> - Eliminating obsolete versions at write time, like in the bdb
> >> implementation
> >> - Eliminating concurrent versions based on timestamp, unlike the bdb
> >> implementation
> >> Our goal is to keep the hash chain as short as possible to minimize
> >> disk seeks. Any danger doing this ?
> >> Francois
> > --
> > You received this message because you are subscribed to the Google Groups "project-voldemort" group.
> > To post to this group, send email to project-voldemort@googlegroups.com.
> > To unsubscribe from this group, send email to project-voldemort+unsubscribe@googlegroups.com.
> > For more options, visit this group athttp://groups.google.com/group/project-voldemort?hl=en.
BTW we're walking away from the Storage Pod model, and go for blades
powered by ATOM processors. With the low cost of ATOM motherboards and
2 X 3 or 4GB disks, the motherboard/box cost overhead is not much and
the form factor us smaller (better for maintenance, + no RAID to
setup). And 4GB RAM is more than enough to drive 8TB storage (at least
with our access pattern).
It's also a good candidate for large blob storage slitted in chunks:
by modifying the Voldemort hashing function to introduce a sub key
which is transparent for the ring and for the DiskMap bucket, you end
up chaining chunks on the same node which are contiguous on disk.
Ultimate goal would be with a sequencial number per sub-partition (=
one disk) to facilitate automatic single disk rebuild without Merkle
trees, but we're not yet there. This would allow to increase the
number of disks per CPU without any RAID overhead (because RAID0 would
generate too large disks sets, with an unmanageable rebuild time).
We're running a bit after time for and that's not yet on GitHub, but
we're looking for partners with similar needs to further developp and
optimize this cost driven concept as a sub-project.
If some are interested with also have a retry manager, which replaces
the updateAction loop on obsolete writes by adding a increasing and
random wait so that processes stop fighting in case of high
concurrency: this is a must if you have some long transactions (e.g.:
20-30ms). Same principle here as CSMA/CD network retries.
On 11 mai, 10:35, Francois <francois.vai...@ezcgroup.net> wrote:
> On 10 mai, 18:04, Greg Moulliet <gr...@conducivetech.com> wrote:
> > Sounds neat Francois. Can you share the code via github?
> > Thanks,
> > Greg
> > On Thu, May 10, 2012 at 5:42 AM, Francois <francois.vai...@ezcgroup.net> wrote:
> > > Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
> > > running on a host with entry level disks in RAID 1 (so not the optimal
> > > situation).
> > > Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
> > > key depending on the hashmap loading factor (10 runs OK). The rest is
> > > dedicated to needles and needlepointer cache.
> > > Performance can be better with one small SSD for the bucket table, but
> > > this runs fine on 2 HDD (you just need two different axis).
> > > It has been designed to be modified to be partition aware (sub-
> > > partitionning), and if 100% dedicated to Voldemort.
> > > Use case is good performance for very large data sets with 10% live
> > > data (so that last access based caching is efficient). We need a small
> > > effort to make it generic, but if someone has the same use case feel
> > > free to contact me.
> > > François
> > > On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
> > >> After looking at cache store (a nice implementation, that we like
> > >> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
> > >> cheap machines with a very low ratio RAM/Disk. The use case is tons of
> > >> data, but only 5% live ones.
> > >> Our need is not the fastest performance, but a very good, consistent
> > >> performance even with 1 to 10 GKeys. This is based on a logged needle
> > >> structure (vaguely inspired from haystack)+ a disk mapped hash table
> > >> with a dirty cache and an elevator pattern for committing the index.
> > >> Unit tests are promising with results on a test workstation with a
> > >> very slow disk being > 10 times BDB. In fact as fast as BDB or cache
> > >> store when all index fits in memory, which is impossible in our case
> > >> with a target of 160TB drive arrays on a 24GB RAM machine (like the
> > >> Backblaze storage pod).
> > >> With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
> > >> get a crazy write performance.
> > >> Our app does not need to keep versions, this is done at a higher level
> > >> using different keys.
> > >> So to optimize this storage strictly dedicated to Voldemort, would it
> > >> create problems to clean versions on the fly during puts ?
> > >> - Eliminating obsolete versions at write time, like in the bdb
> > >> implementation
> > >> - Eliminating concurrent versions based on timestamp, unlike the bdb
> > >> implementation
> > >> Our goal is to keep the hash chain as short as possible to minimize
> > >> disk seeks. Any danger doing this ?
> > >> Francois
> > > --
> > > You received this message because you are subscribed to the Google Groups "project-voldemort" group.
> > > To post to this group, send email to project-voldemort@googlegroups.com.
> > > To unsubscribe from this group, send email to project-voldemort+unsubscribe@googlegroups.com.
> > > For more options, visit this group athttp://groups.google.com/group/project-voldemort?hl=en.
> BTW we're walking away from the Storage Pod model, and go for blades
> powered by ATOM processors. With the low cost of ATOM motherboards and
> 2 X 3 or 4GB disks, the motherboard/box cost overhead is not much and
> the form factor us smaller (better for maintenance, + no RAID to
> setup). And 4GB RAM is more than enough to drive 8TB storage (at least
> with our access pattern).
> It's also a good candidate for large blob storage slitted in chunks:
> by modifying the Voldemort hashing function to introduce a sub key
> which is transparent for the ring and for the DiskMap bucket, you end
> up chaining chunks on the same node which are contiguous on disk.
> Ultimate goal would be with a sequencial number per sub-partition (=
> one disk) to facilitate automatic single disk rebuild without Merkle
> trees, but we're not yet there. This would allow to increase the
> number of disks per CPU without any RAID overhead (because RAID0 would
> generate too large disks sets, with an unmanageable rebuild time).
> We're running a bit after time for and that's not yet on GitHub, but
> we're looking for partners with similar needs to further developp and
> optimize this cost driven concept as a sub-project.
> If some are interested with also have a retry manager, which replaces
> the updateAction loop on obsolete writes by adding a increasing and
> random wait so that processes stop fighting in case of high
> concurrency: this is a must if you have some long transactions (e.g.:
> 20-30ms). Same principle here as CSMA/CD network retries.
> On 11 mai, 10:35, Francois <francois.vai...@ezcgroup.net> wrote:
> > Until it's shared, you can directly call me.
> > On 10 mai, 18:04, Greg Moulliet <gr...@conducivetech.com> wrote:
> > > Sounds neat Francois. Can you share the code via github?
> > > Thanks,
> > > Greg
> > > On Thu, May 10, 2012 at 5:42 AM, Francois <francois.vai...@ezcgroup.net> wrote:
> > > > Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
> > > > running on a host with entry level disks in RAID 1 (so not the optimal
> > > > situation).
> > > > Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
> > > > key depending on the hashmap loading factor (10 runs OK). The rest is
> > > > dedicated to needles and needlepointer cache.
> > > > Performance can be better with one small SSD for the bucket table, but
> > > > this runs fine on 2 HDD (you just need two different axis).
> > > > It has been designed to be modified to be partition aware (sub-
> > > > partitionning), and if 100% dedicated to Voldemort.
> > > > Use case is good performance for very large data sets with 10% live
> > > > data (so that last access based caching is efficient). We need a small
> > > > effort to make it generic, but if someone has the same use case feel
> > > > free to contact me.
> > > > François
> > > > On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
> > > >> After looking at cache store (a nice implementation, that we like
> > > >> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
> > > >> cheap machines with a very low ratio RAM/Disk. The use case is tons of
> > > >> data, but only 5% live ones.
> > > >> Our need is not the fastest performance, but a very good, consistent
> > > >> performance even with 1 to 10 GKeys. This is based on a logged needle
> > > >> structure (vaguely inspired from haystack)+ a disk mapped hash table
> > > >> with a dirty cache and an elevator pattern for committing the index.
> > > >> Unit tests are promising with results on a test workstation with a
> > > >> very slow disk being > 10 times BDB. In fact as fast as BDB or cache
> > > >> store when all index fits in memory, which is impossible in our case
> > > >> with a target of 160TB drive arrays on a 24GB RAM machine (like the
> > > >> Backblaze storage pod).
> > > >> With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
> > > >> get a crazy write performance.
> > > >> Our app does not need to keep versions, this is done at a higher level
> > > >> using different keys.
> > > >> So to optimize this storage strictly dedicated to Voldemort, would it
> > > >> create problems to clean versions on the fly during puts ?
> > > >> - Eliminating obsolete versions at write time, like in the bdb
> > > >> implementation
> > > >> - Eliminating concurrent versions based on timestamp, unlike the bdb
> > > >> implementation
> > > >> Our goal is to keep the hash chain as short as possible to minimize
> > > >> disk seeks. Any danger doing this ?
> > > >> Francois
> > > > --
> > > > You received this message because you are subscribed to the Google Groups "project-voldemort" group.
> > > > To post to this group, send email to project-voldemort@googlegroups.com.
> > > > To unsubscribe from this group, send email to project-voldemort+unsubscribe@googlegroups.com.
> > > > For more options, visit this group athttp://groups.google.com/group/project-voldemort?hl=en.
To your original question about resolving concurrent versions at put
time: I think that part of the reason Voldemort doesn't currently do
this is to allow custom conflict-resolution strategies which operate
client-side, where there may be more information available. (For
instance, a custom conflict resolver might consult some other data
source or use business-logic code not conveniently available
server-side.)
So I think that's the point of the current implementation. But, I
don't know how common it is to actually take advantage of that;
probably most people just use the default conflict resolver and fall
back to timestamps as you are planning to do, so it may not matter.
On the other hand, if your implementation is able to support
duplicates and you are just trying to minimize the data stored, then
you may want to go ahead and keep the concurrent versions as is done
today, on the theory that it's rare to actually have such concurrent
writes, so it would amount to only a tiny amount of data anyway. (Not
sure if that's actually true, but it seems likely.)
On Sat, May 12, 2012 at 12:23 AM, Francois <francois.vai...@ezcgroup.net> wrote:
> And sorry for typos, I didn't find the message edit function !
> On 12 mai, 09:20, Francois <francois.vai...@ezcgroup.net> wrote:
>> BTW we're walking away from the Storage Pod model, and go for blades
>> powered by ATOM processors. With the low cost of ATOM motherboards and
>> 2 X 3 or 4GB disks, the motherboard/box cost overhead is not much and
>> the form factor us smaller (better for maintenance, + no RAID to
>> setup). And 4GB RAM is more than enough to drive 8TB storage (at least
>> with our access pattern).
>> It's also a good candidate for large blob storage slitted in chunks:
>> by modifying the Voldemort hashing function to introduce a sub key
>> which is transparent for the ring and for the DiskMap bucket, you end
>> up chaining chunks on the same node which are contiguous on disk.
>> Ultimate goal would be with a sequencial number per sub-partition (=
>> one disk) to facilitate automatic single disk rebuild without Merkle
>> trees, but we're not yet there. This would allow to increase the
>> number of disks per CPU without any RAID overhead (because RAID0 would
>> generate too large disks sets, with an unmanageable rebuild time).
>> We're running a bit after time for and that's not yet on GitHub, but
>> we're looking for partners with similar needs to further developp and
>> optimize this cost driven concept as a sub-project.
>> If some are interested with also have a retry manager, which replaces
>> the updateAction loop on obsolete writes by adding a increasing and
>> random wait so that processes stop fighting in case of high
>> concurrency: this is a must if you have some long transactions (e.g.:
>> 20-30ms). Same principle here as CSMA/CD network retries.
>> On 11 mai, 10:35, Francois <francois.vai...@ezcgroup.net> wrote:
>> > Until it's shared, you can directly call me.
>> > On 10 mai, 18:04, Greg Moulliet <gr...@conducivetech.com> wrote:
>> > > Sounds neat Francois. Can you share the code via github?
>> > > Thanks,
>> > > Greg
>> > > On Thu, May 10, 2012 at 5:42 AM, Francois <francois.vai...@ezcgroup.net> wrote:
>> > > > Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
>> > > > running on a host with entry level disks in RAID 1 (so not the optimal
>> > > > situation).
>> > > > Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
>> > > > key depending on the hashmap loading factor (10 runs OK). The rest is
>> > > > dedicated to needles and needlepointer cache.
>> > > > Performance can be better with one small SSD for the bucket table, but
>> > > > this runs fine on 2 HDD (you just need two different axis).
>> > > > It has been designed to be modified to be partition aware (sub-
>> > > > partitionning), and if 100% dedicated to Voldemort.
>> > > > Use case is good performance for very large data sets with 10% live
>> > > > data (so that last access based caching is efficient). We need a small
>> > > > effort to make it generic, but if someone has the same use case feel
>> > > > free to contact me.
>> > > > François
>> > > > On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
>> > > >> After looking at cache store (a nice implementation, that we like
>> > > >> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
>> > > >> cheap machines with a very low ratio RAM/Disk. The use case is tons of
>> > > >> data, but only 5% live ones.
>> > > >> Our need is not the fastest performance, but a very good, consistent
>> > > >> performance even with 1 to 10 GKeys. This is based on a logged needle
>> > > >> structure (vaguely inspired from haystack)+ a disk mapped hash table
>> > > >> with a dirty cache and an elevator pattern for committing the index.
>> > > >> Unit tests are promising with results on a test workstation with a
>> > > >> very slow disk being > 10 times BDB. In fact as fast as BDB or cache
>> > > >> store when all index fits in memory, which is impossible in our case
>> > > >> with a target of 160TB drive arrays on a 24GB RAM machine (like the
>> > > >> Backblaze storage pod).
>> > > >> With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
>> > > >> get a crazy write performance.
>> > > >> Our app does not need to keep versions, this is done at a higher level
>> > > >> using different keys.
>> > > >> So to optimize this storage strictly dedicated to Voldemort, would it
>> > > >> create problems to clean versions on the fly during puts ?
>> > > >> - Eliminating obsolete versions at write time, like in the bdb
>> > > >> implementation
>> > > >> - Eliminating concurrent versions based on timestamp, unlike the bdb
>> > > >> implementation
>> > > >> Our goal is to keep the hash chain as short as possible to minimize
>> > > >> disk seeks. Any danger doing this ?
@Jeff: we made the same conclusion and finally did a timestamp based
conflict resolution directly in DiskMap, because in our use case this
is fine. We could keep duplicates, but this would have a negative
impact on performance ( we chain N needles behind the same hash
bucket, so allowing duplicates means you can't stop walking the chain
at the first one, and even with chains of 1 to 4 needles there's a
performance impact when you try to kill disk seeks ). That's what I
meant about "we need to make it more generic". Today this storage is
really designed for our use case, and performance ( to be seen as a
cost factor ) and ease of operations ( log forward structure ) where
the main criteria. If fact in case we can't resolve the conflict, an
exception is thrown and the client deals with it.
Francois
On 12 mai, 22:50, Jeff Clites <jeffs...@gmail.com> wrote:
> To your original question about resolving concurrent versions at put
> time: I think that part of the reason Voldemort doesn't currently do
> this is to allow custom conflict-resolution strategies which operate
> client-side, where there may be more information available. (For
> instance, a custom conflict resolver might consult some other data
> source or use business-logic code not conveniently available
> server-side.)
> So I think that's the point of the current implementation. But, I
> don't know how common it is to actually take advantage of that;
> probably most people just use the default conflict resolver and fall
> back to timestamps as you are planning to do, so it may not matter.
> On the other hand, if your implementation is able to support
> duplicates and you are just trying to minimize the data stored, then
> you may want to go ahead and keep the concurrent versions as is done
> today, on the theory that it's rare to actually have such concurrent
> writes, so it would amount to only a tiny amount of data anyway. (Not
> sure if that's actually true, but it seems likely.)
> JEff
> On Sat, May 12, 2012 at 12:23 AM, Francois <francois.vai...@ezcgroup.net> wrote:
> > And sorry for typos, I didn't find the message edit function !
> > On 12 mai, 09:20, Francois <francois.vai...@ezcgroup.net> wrote:
> >> BTW we're walking away from the Storage Pod model, and go for blades
> >> powered by ATOM processors. With the low cost of ATOM motherboards and
> >> 2 X 3 or 4GB disks, the motherboard/box cost overhead is not much and
> >> the form factor us smaller (better for maintenance, + no RAID to
> >> setup). And 4GB RAM is more than enough to drive 8TB storage (at least
> >> with our access pattern).
> >> It's also a good candidate for large blob storage slitted in chunks:
> >> by modifying the Voldemort hashing function to introduce a sub key
> >> which is transparent for the ring and for the DiskMap bucket, you end
> >> up chaining chunks on the same node which are contiguous on disk.
> >> Ultimate goal would be with a sequencial number per sub-partition (=
> >> one disk) to facilitate automatic single disk rebuild without Merkle
> >> trees, but we're not yet there. This would allow to increase the
> >> number of disks per CPU without any RAID overhead (because RAID0 would
> >> generate too large disks sets, with an unmanageable rebuild time).
> >> We're running a bit after time for and that's not yet on GitHub, but
> >> we're looking for partners with similar needs to further developp and
> >> optimize this cost driven concept as a sub-project.
> >> If some are interested with also have a retry manager, which replaces
> >> the updateAction loop on obsolete writes by adding a increasing and
> >> random wait so that processes stop fighting in case of high
> >> concurrency: this is a must if you have some long transactions (e.g.:
> >> 20-30ms). Same principle here as CSMA/CD network retries.
> >> On 11 mai, 10:35, Francois <francois.vai...@ezcgroup.net> wrote:
> >> > Until it's shared, you can directly call me.
> >> > On 10 mai, 18:04, Greg Moulliet <gr...@conducivetech.com> wrote:
> >> > > Sounds neat Francois. Can you share the code via github?
> >> > > Thanks,
> >> > > Greg
> >> > > On Thu, May 10, 2012 at 5:42 AM, Francois <francois.vai...@ezcgroup.net> wrote:
> >> > > > Our diskmap storage demonstrated a 5000r/w ops/s on a single Core VM
> >> > > > running on a host with entry level disks in RAID 1 (so not the optimal
> >> > > > situation).
> >> > > > Unlike BdB, the minimum memory requirement is between 8 and 16 bytes /
> >> > > > key depending on the hashmap loading factor (10 runs OK). The rest is
> >> > > > dedicated to needles and needlepointer cache.
> >> > > > Performance can be better with one small SSD for the bucket table, but
> >> > > > this runs fine on 2 HDD (you just need two different axis).
> >> > > > It has been designed to be modified to be partition aware (sub-
> >> > > > partitionning), and if 100% dedicated to Voldemort.
> >> > > > Use case is good performance for very large data sets with 10% live
> >> > > > data (so that last access based caching is efficient). We need a small
> >> > > > effort to make it generic, but if someone has the same use case feel
> >> > > > free to contact me.
> >> > > > François
> >> > > > On 17 mar, 19:02, Francois <francois.vai...@ezcgroup.net> wrote:
> >> > > >> After looking at cache store (a nice implementation, that we like
> >> > > >> because hash makes sense), we are building a new type of store calledDiskMap, for 1k to 5k/s Ops/s in write and > 10 k Ops/s in read on
> >> > > >> cheap machines with a very low ratio RAM/Disk. The use case is tons of
> >> > > >> data, but only 5% live ones.
> >> > > >> Our need is not the fastest performance, but a very good, consistent
> >> > > >> performance even with 1 to 10 GKeys. This is based on a logged needle
> >> > > >> structure (vaguely inspired from haystack)+ a disk mapped hash table
> >> > > >> with a dirty cache and an elevator pattern for committing the index.
> >> > > >> Unit tests are promising with results on a test workstation with a
> >> > > >> very slow disk being > 10 times BDB. In fact as fast as BDB or cache
> >> > > >> store when all index fits in memory, which is impossible in our case
> >> > > >> with a target of 160TB drive arrays on a 24GB RAM machine (like the
> >> > > >> Backblaze storage pod).
> >> > > >> With the hash file residing on a low cost ( < 256GB ) SSD, we hope to
> >> > > >> get a crazy write performance.
> >> > > >> Our app does not need to keep versions, this is done at a higher level
> >> > > >> using different keys.
> >> > > >> So to optimize this storage strictly dedicated to Voldemort, would it
> >> > > >> create problems to clean versions on the fly during puts ?
> >> > > >> - Eliminating obsolete versions at write time, like in the bdb
> >> > > >> implementation
> >> > > >> - Eliminating concurrent versions based on timestamp, unlike the bdb
> >> > > >> implementation
> >> > > >> Our goal is to keep the hash chain as short as possible to minimize
> >> > > >> disk seeks. Any danger doing this ?