KV store w/ persistence

Marc Logemann

Nov 9, 2012, 6:14:12 AM
to project-...@googlegroups.com
Hi,

We are looking at alternatives to our current Redis setup. We make heavy use of key-value collections to store aggregated statistics data, so we have heavy writes and not-so-heavy reads on our Redis instance. We are quite happy with the way Redis works, but since we want to embed the NoSQL server in one of our Java-based products, Redis being C-based is a no-go. So we evaluated a lot of projects. Voldemort comes closest to our requirements for now. I wrote about that here: http://www.logemann.org/2012/11/nosql-kv-project-search-goes-on.html

May I ask this group some questions? We really need to get some facts right:

- We certainly don't want to use BDB. Is there an alternative (file-based) store available for Voldemort? We love file-based stores because of the freedom they give us with installations.
- Is an embedded installation scenario with just one node OK? Scaling out to more nodes would just be an option...
- What about range queries for keys?
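
To make the range-query question concrete, here is a minimal sketch of the kind of scan meant: all entries whose keys fall between two bounds, returned in key order. It uses only `java.util.TreeMap` as a stand-in for an ordered store. It is not Voldemort's API, and the class name, method name, and the `stats:<date>` key scheme are assumptions made for illustration.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

class RangeQuerySketch {
    // Returns a view of all entries with fromKey <= key < toKey, in key order.
    // This relies on the store keeping keys sorted; a purely hash-based store
    // would have to visit every key to answer the same question.
    static SortedMap<String, Long> range(NavigableMap<String, Long> store,
                                         String fromKey, String toKey) {
        return store.subMap(fromKey, true, toKey, false);
    }

    public static void main(String[] args) {
        // Hypothetical aggregated statistics, keyed by day.
        NavigableMap<String, Long> stats = new TreeMap<>();
        stats.put("stats:2012-11-08", 120L);
        stats.put("stats:2012-11-09", 340L);
        stats.put("stats:2012-11-10", 95L);
        stats.put("users:42", 1L);

        // Everything from Nov 9 up to (but excluding) Nov 11.
        System.out.println(range(stats, "stats:2012-11-09", "stats:2012-11-11"));
    }
}
```

Whether a given storage engine can serve such a scan efficiently depends on whether it keeps keys ordered on disk, which is exactly what the question above is asking.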

Thanks a lot for the clarification. If we settle on Voldemort, our dev team would be happy to contribute back in some form. We just need to make a sane decision on how to go on, and since it's for two of our products, it would be a long-lasting decision ;-)

Marc
CEO LOGENTIS GmbH

Sunny Gleason

Nov 9, 2012, 9:15:01 AM
to project-...@googlegroups.com
For your use case, I might consider leveldb or persistit.

https://github.com/dain/leveldb
http://www.akiban.com/akiban-persistit

I'd try as hard as possible not to embed voldemort in a
product; but if you really need to, it's not that hard to create
your own storage engine. I'd be happy to give you advice if
you need it; I wrote an innodb storage engine for Voldemort
and it wasn't too horrible.

All the best,

-Sunny


On 11/9/12, Marc Logemann <marc.l...@googlemail.com> wrote:
> Hi,
>
> we are checking alternatives for our current redis way of doing this. We
> heavily use key value collections to store aggregated statistics data. So we
> have heavy writes, not so heavy reads on our redis instance. We are quite
> happy the way how redis works but since we want to implement / embed the
> NoSQL server in one of our java based products, redis being C based is a no
> go. So we evaluated a lot of projects. Voldemort comes nearest in terms of
> our requirements for now. I wrote about that
> there <http://www.logemann.org/2012/11/nosql-kv-project-search-goes-on.html>.
>
> May i ask some questions to this group because we really need to get some
> facts right:
>
> - We certainly dont want to use BDB. Is there an alternative (file based)
> store available for Voldemort? We love file based because of freedom wrt
> installations.
> - Is an embedded installation scenario with just one node ok? Scaling to
> different nodes just as an option...
> - what about range queries for keys?
>
> Thanks a lot for clarification. If we would settle on Voldemort, our dev
> team would be happy to contribute back in some form. We just need to make a
> sane decision how to go on and since its for two of our products, it would
> be a long lasting decision ;-)
>
> Marc
> CEO LOGENTIS <http://www.logentis.de> GmbH

Geir Magnusson Jr.

Nov 9, 2012, 9:21:49 AM
to project-...@googlegroups.com
Sunny,

I don't understand the BDB resistance.... can you explain?

Sunny Gleason

Nov 9, 2012, 9:32:51 AM
to project-...@googlegroups.com
For me, it is due to the cliff function that BDB goes
off of when the data set goes out of RAM -- I've been
placing a higher value on consistent latency than raw
throughput.

The BDB issues hark back to the original Dynamo
paper where the Amazon folks had a bunch of trouble
tuning BDB-JE for consistent performance. When I
measured the performance of the Voldemort BDB
storage engine a year or 2 ago, I found the same
issues.

I wrote the InnoDB plugin because InnoDB has better
buffer pool management than BDB and I could configure
it for more consistent performance. I put a presentation
about it together here:

http://www.slideshare.net/sunnygleason/accelerating-nosql

Granted, it sacrifices throughput for stability. I have it on
my list to write storage engines for the other two products
I mentioned (LevelDB and Persistit); they are promising
in theory but admittedly it's based on superficial analysis--
I don't have any measurements just yet. Based on the
way the voldemort community has atrophied in recent months,
I'm also not sure how much time I want to invest in it
anymore...

-Sunny

Geir Magnusson Jr.

Nov 9, 2012, 9:49:12 AM
to project-...@googlegroups.com
Interesting. I assume that you need access patterns that span the entire keyset at random for the most part?

When I was evaluating the storage options years ago, BDB was the best, and my usage pattern was such that for activity in a given window, the active keyset was a mostly bounded subset of the full keyset. (The use case was e.g. a shopping cart at a flash sale website - while there are millions of customers, the group that shows up on any given day at sale start is much smaller and activity tends to be continuous - they show up, hang out, then go away.) So the result was that things fit very well in the BDB cache and performance was good. What I also really liked about BDB was that it was self-maintaining. We never had to do any ops maintenance on the cluster. We put up three clusters and they only came down when we moved colos.
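
The working-set effect described here can be illustrated with a small simulation. This is only a sketch under stated assumptions: a plain LRU cache built on `java.util.LinkedHashMap` stands in for the BDB cache (BDB's real eviction policy is not modeled), keys are drawn uniformly at random from the active keyset, and the class name and parameters are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

class WorkingSetSketch {
    // Simulates an LRU cache holding at most `capacity` keys and returns the
    // fraction of lookups served from it when keys are drawn uniformly at
    // random from the range [0, activeKeys).
    static double hitRate(int capacity, int activeKeys, int lookups, long seed) {
        // accessOrder=true makes LinkedHashMap iterate least-recently-used first,
        // so removeEldestEntry evicts the LRU key once the cache is over capacity.
        Map<Integer, Boolean> lru = new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                return size() > capacity;
            }
        };
        Random random = new Random(seed);
        int hits = 0;
        for (int i = 0; i < lookups; i++) {
            int key = random.nextInt(activeKeys);
            if (lru.get(key) != null) {
                hits++;                      // served from cache
            } else {
                lru.put(key, Boolean.TRUE);  // simulated disk read, then cached
            }
        }
        return (double) hits / lookups;
    }

    public static void main(String[] args) {
        // Bounded active keyset that fits in the cache vs. one far larger than it.
        System.out.println("fits in cache:    " + hitRate(1_000, 500, 100_000, 42L));
        System.out.println("exceeds cache:    " + hitRate(1_000, 100_000, 100_000, 42L));
    }
}
```

When the active keyset is smaller than the cache, each key can miss at most once, so almost every lookup is a hit. When lookups span a keyset much larger than the cache, most lookups miss, which is the random-access situation discussed in the replies.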

I'm interested in your plugin :)

I guess my question for Marc was why he didn't want to use BDB - he mentioned a love of file-based and BDB is file based...

Thanks

geir

Sunny Gleason

Nov 9, 2012, 10:10:35 AM
to project-...@googlegroups.com
On 11/9/12, Geir Magnusson Jr. <ge...@pobox.com> wrote:
> Interesting. I assume that you need access patterns that span the entire
> keyset at random for the most part?

Yes exactly -- like a user authentication/lookup system, where we
need to say that lookup of user X takes at most N milliseconds;
if it takes N/10 milliseconds today and 2N milliseconds tomorrow,
the consuming services start having drastic performance variability...
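
One way to check a bound like "at most N milliseconds" is to look at tail percentiles of measured lookups rather than the average. Below is a minimal sketch, assuming the latencies have already been collected in microseconds. The class and method names are hypothetical, and the nearest-rank definition of a percentile is just one common choice.

```java
import java.util.Arrays;

class LatencyPercentiles {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMicros, double p) {
        if (samplesMicros.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMicros.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        rank = Math.min(Math.max(rank, 1), sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Mostly fast lookups with one slow outlier (for example, a cache miss).
        long[] samples = {100, 110, 120, 130, 5000};
        System.out.println("p50 = " + percentile(samples, 50) + " us");
        System.out.println("p99 = " + percentile(samples, 99) + " us");
    }
}
```

A store can have an excellent median and still violate the bound at the tail, which is the kind of variability being described.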

> When I was evaluating the storage options years ago, BDB was the best and my
> usage pattern was such that for activity in a given window, the active
> keyset was a mostly bounded subset of the full keyset. (Use case was e.g.
> shopping cart at a flash sale website - while there are millions of
> customers, the group that shows up any given day at sale start is much
> smaller and activity tends to be continuous - they show up, hang out, then
> go away. So the result was things fit very well in the BDB cache and
> performance was good. What I really liked about BDB also was that it was
> self-maintaining. We never had to do any ops maintenance on the cluster.
> We put up three clusters and they only came down when we moved colos .

That's awesome! BDB is definitely solid -- it has stood
the test of time and held up well, it was just the variability
that I couldn't live with (and I also liked having the buffer
pool for the storage engine have a hard fixed-size outside
the JVM).

> I'm interested in your plugin :)

It's very much proof-of-concept-quality right now, but if it
would be a win for other people I'd definitely be interested in
bringing it to production-grade.

> I guess my question for Marc was why he didn't want to use BDB - he
> mentioned a love of file-based and BDB is file based...

Indeed -- that's very true...

> Thanks
>
> geir


-Sunny

Marc Logemann

Nov 9, 2012, 11:36:45 AM
to project-...@googlegroups.com
Hi,

Yeah, technically I have no problems with BDB JE, but I really don't want to pay license fees on every sale. I admit that I have not checked the exact fees... perhaps I am also a little biased because I don't like Oracle too much...

Marc

Geir Magnusson Jr.

Nov 9, 2012, 11:39:24 AM
to project-...@googlegroups.com
Oh, I see- you're going to bundle this in a product....

Yes, I'm not their biggest fan either, but it's really a solid piece of engineering if your use case is satisfied by it.

geir

Marc Logemann

Nov 9, 2012, 11:43:51 AM
to project-...@googlegroups.com, sunny....@gmail.com


On Friday, November 9, 2012 at 15:15:04 UTC+1, Sunny wrote:
> For your use case, I might consider leveldb or persistit.
>
> https://github.com/dain/leveldb
> http://www.akiban.com/akiban-persistit
>
> I'd try as hard as possible not to embed voldemort in a
> product; but if you really need to, it's not that hard to create
> your own storage engine. I'd be happy to give you advice if
> you need it; I wrote an innodb storage engine for Voldemort
> and it wasn't too horrible.


Thanks for the links. leveldb is C-based, but it seems a Java port exists... It seems a little bit rough to me. The second one is also interesting... I need to check.

Why can't I use embedded mode? Is this officially a bad decision? I thought about using Krati as the store. It's still in contrib, which might make me nervous about using it...

Marc

Tatu Saloranta

Nov 9, 2012, 12:30:07 PM
to project-...@googlegroups.com
On Fri, Nov 9, 2012 at 6:15 AM, Sunny Gleason <sunny....@gmail.com> wrote:
> For your use case, I might consider leveldb or persistit.
>
> https://github.com/dain/leveldb
> http://www.akiban.com/akiban-persistit

On this: have you used either extensively? Or has anyone else on the mailing list?
I keep hearing about both (esp. leveldb) as suggestions, but I have
not really seen solid real-world usage expertise, only kicking-the-tires
level of interest. BDB(-JE) is widely used, and so its problems are
much more widely known. But is the same true for the newer alternatives?

I am interested in this for sort of similar reasons; I have a BDB-JE
based storage backend (similar to how Voldemort uses it), but am
interested in alternatives. But I have yet to find something with
enough votes by developers who have extensively used them AND
something else; that is, people who can make comparison based on
usage.

-+ Tatu +-

Tatu Saloranta

Nov 9, 2012, 12:31:19 PM
to project-...@googlegroups.com
On Fri, Nov 9, 2012 at 8:36 AM, Marc Logemann
<marc.l...@googlemail.com> wrote:
> Hi,
>
> yeah, technically i have no problems with BDB JE but i really dont want to
> spend license fees on every sell. I admit that i have not checked the exact
> fees.... perhaps i am also a little bit biased cause i dont like Oracle too
> much....

It's an open source library, isn't it? (Oracle just bought the company
a while ago -- they didn't write it).
If you build backends modularly, you need not embed implementation
there. So is this really a valid point?

-+ Tatu +-

Sunny Gleason

Nov 9, 2012, 12:40:37 PM
to project-...@googlegroups.com
I think the problem is the sleepycat license:

http://en.wikipedia.org/wiki/Sleepycat_License

"The license is a strong form of copyleft because it mandates that
redistributions in any form not only include the source code of
Berkeley DB, but also "any accompanying software that uses the DB
software". It is possible to circumvent this strict licensing policy
through the purchase of a commercial software license from Oracle
Corporation consisting of terms and conditions which are negotiated at
the time of sale. This is an example of dual licensing."

-Sunny

Tatu Saloranta

Nov 9, 2012, 12:42:41 PM
to project-...@googlegroups.com
On Fri, Nov 9, 2012 at 9:40 AM, Sunny Gleason <sunny....@gmail.com> wrote:
> I think the problem is the sleepycat license:
>
> http://en.wikipedia.org/wiki/Sleepycat_License
>
> "The license is a strong form of copyleft because it mandates that
> redistributions in any form not only include the source code of
> Berkeley DB, but also "any accompanying software that uses the DB
> software". It is possible to circumvent this strict licensing policy
> through the purchase of a commercial software license from Oracle
> Corporation consisting of terms and conditions which are negotiated at
> the time of sale. This is an example of dual licensing."

So the usual GPL virality for not-all-open-source software? Which is
why it's less of an issue for my projects, which are OS all the way
typically.

-+ Tatu +-

Sunny Gleason

Nov 9, 2012, 12:47:38 PM
to project-...@googlegroups.com
On 11/9/12, Tatu Saloranta <tsalo...@gmail.com> wrote:
> On Fri, Nov 9, 2012 at 6:15 AM, Sunny Gleason <sunny....@gmail.com>
> wrote:
>> For your use case, I might consider leveldb or persistit.
>>
>> https://github.com/dain/leveldb
>> http://www.akiban.com/akiban-persistit
>
> On this: have you used either extensively? Or anyone else on the mailing
> list?

I haven't used the java version of leveldb much, but I want
to do some performance qualification of it soon. Riak uses
the C version of leveldb heavily (having moved away from
InnoDB and their custom bitcask hash-based store), so I
tend to trust it.

Akiban Persistit is the core of the Akiban Server database
product -- its strength right now, similar to BDB-JE, is with
in-memory performance. I know there's a lot of performance
tuning work going on with it now for disk/SSD-based
performance.

Out of all the transactional java persistence APIs out there, I
like Persistit the best because it has strong typing with schema-free
design. I wrote some helper functions to help with integration
based on my experience with InnoDB and Java:

https://github.com/sunnygleason/persistit-helpers

I'd also love to see a high-performance hash store for Java; I
did some work with static hash-file based storage, but AFAIK
there isn't any great contender out there for hash-oriented
storage in pure java (if there is, I'd love to learn more...).

Also, if anyone out there is doing this type of work / analysis, I'm
always eager to lend a hand...

-Sunny

Sunny Gleason

Nov 9, 2012, 12:48:41 PM
to project-...@googlegroups.com
On 11/9/12, Tatu Saloranta <tsalo...@gmail.com> wrote:
Yes exactly...

-Sunny

Tatu Saloranta

Nov 9, 2012, 1:01:14 PM
to project-...@googlegroups.com
On Fri, Nov 9, 2012 at 9:47 AM, Sunny Gleason <sunny....@gmail.com> wrote:
> On 11/9/12, Tatu Saloranta <tsalo...@gmail.com> wrote:
>> On Fri, Nov 9, 2012 at 6:15 AM, Sunny Gleason <sunny....@gmail.com>
>> wrote:
>>> For your use case, I might consider leveldb or persistit.
>>>
>>> https://github.com/dain/leveldb
>>> http://www.akiban.com/akiban-persistit
>>
>> On this: have you used either extensively? Or anyone else on the mailing
>> list?
>
> I haven't used the java version of leveldb much, but I want

Right -- as far as I know, no one has done this for java-based version
(I did ask around quite a bit).
It seems like one of Dain's hackathon projects unfortunately.

> to do some performance qualification of it soon. Riak uses
> the C version of leveldb heavily (having moved away from
> InnoDB and their custom bitcask hash-based store), so I
> tend to trust it.

Ok good. I did not know Riak uses it.

> Akiban Persistit is the core of the Akiban Server database
> product -- its strength right now, similar to BDB-JE, is with
> in-memory performance. I know there's a lot of performance
> tuning work going on with it now for disk/SSD-based
> performance.
>
> Out of all the transactional java persistence APIs out there, I
> like Persistit the best because it has strong typing with schema-free
> design. I wrote some helper functions to help with integration
> based on my experience with InnoDB and Java:
>
> https://github.com/sunnygleason/persistit-helpers

That can be both strength and weakness; strength for using as-is,
weakness when using as "dumb" datastore (like backend for V). But as
long as overhead is moderate I guess it's an overall plus?

> I'd also love to see a high-performance hash store for Java; I
> did some work with static hash-file based storage, but AFAIK
> there isn't any great contender out there for hash-oriented
> storage in pure java (if there is, I'd love to learn more...).
>
> Also, if anyone out there is doing this type of work / analysis, I'm
> always eager to lend a hand...

Yup. I'll send a separate note wrt my usage (since it's not really
related to Voldemort, i.e. out of scope here). Might actually be
useful for performance testing too.

-+ Tatu +-

Mickey Hsieh

Nov 9, 2012, 1:23:42 PM
to project-...@googlegroups.com
Please take a look at CacheStore, a file-based storage plugin for Voldemort.
It is designed for high performance (within 5 ms) and large scale per node (10M keys).

Mickey

Sunny Gleason

Nov 9, 2012, 1:34:11 PM
to project-...@googlegroups.com
Very neat! Is there any documentation about the actual
on-disk storage format?

-Sunny

Mickey Hsieh

Nov 9, 2012, 1:43:31 PM
to project-...@googlegroups.com
Sunny,

Take a look at the PowerPoint presentation.


If you have more questions, let me know.

Mickey

Sunny Gleason

Nov 9, 2012, 1:56:10 PM
to project-...@googlegroups.com
I guess I'm looking for something more like this, with
specifics about cache eviction and RAM versus disk
usage:

http://www.scribd.com/doc/15014700/InnoDB-Internals-InnoDB-File-Formats-and-Source-Code-Structure

There are so many tradeoffs in designing a persistence
engine, I didn't really get a feel for the use cases where
the CacheStore engine performs well versus poorly.
(Small-keys or values, large blobs, write-once versus
modification, etc)

I think your approach seems solid -- if I understand correctly,
Bitcask has a similar approach where keys are stored in-memory
and point to on-disk values. I know that Bitcask has the tradeoff
of having to read the entire file from disk on startup before the
engine can serve requests. I'm not sure if your format would be
the same?
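
For readers who have not seen this design, the general idea (keys held in memory, each pointing at a value's position in an append-only file, with the whole file scanned at startup to rebuild the in-memory map) can be sketched as follows. This is not Bitcask's or CacheStore's actual format. The record layout (key length, value length, key bytes, value bytes) and all names are assumptions made for illustration, and deletes, checksums, and compaction are left out.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class KeyDirSketch implements AutoCloseable {
    private static final int HEADER_BYTES = 8; // two 4-byte ints: key length, value length

    private final RandomAccessFile log;
    // In-memory "key directory": key -> file offset of the latest record for that key.
    private final Map<String, Long> keyDir = new HashMap<>();

    KeyDirSketch(File file) throws IOException {
        this.log = new RandomAccessFile(file, "rw");
        rebuildKeyDir();
    }

    // Startup cost: every record header and key in the file is read once
    // before the store can answer any request.
    private void rebuildKeyDir() throws IOException {
        long pos = 0;
        long end = log.length();
        while (pos < end) {
            log.seek(pos);
            int keyLen = log.readInt();
            int valueLen = log.readInt();
            byte[] key = new byte[keyLen];
            log.readFully(key);
            // Later records overwrite earlier ones for the same key.
            keyDir.put(new String(key, StandardCharsets.UTF_8), pos);
            pos += HEADER_BYTES + keyLen + valueLen;
        }
    }

    // Appends a new record; old records for the same key stay in the file as garbage.
    void put(String key, byte[] value) throws IOException {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        long pos = log.length();
        log.seek(pos);
        log.writeInt(keyBytes.length);
        log.writeInt(value.length);
        log.write(keyBytes);
        log.write(value);
        keyDir.put(key, pos);
    }

    // One in-memory lookup plus one positioned disk read; returns null if the key is absent.
    byte[] get(String key) throws IOException {
        Long pos = keyDir.get(key);
        if (pos == null) {
            return null;
        }
        log.seek(pos);
        int keyLen = log.readInt();
        int valueLen = log.readInt();
        log.seek(pos + HEADER_BYTES + keyLen); // skip over the key bytes
        byte[] value = new byte[valueLen];
        log.readFully(value);
        return value;
    }

    int size() {
        return keyDir.size();
    }

    @Override
    public void close() throws IOException {
        log.close();
    }
}
```

The tradeoff raised above shows up directly in `rebuildKeyDir`: reads are cheap once the map is in memory, but startup time grows with the size of the file, and every key must fit in RAM.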

Interesting stuff! Thank you for sharing,

Mickey Hsieh

Nov 9, 2012, 2:26:52 PM
to project-...@googlegroups.com
Your point is well taken. Here is some of our production data.
We had two data stores per node, with long-integer keys, and 40 nodes in each data center (West and East).

The first store has 44.5M keys; it takes 70 seconds to load the keys and metadata.
The second store has 5.3M keys; it takes around 16 seconds to load.
The whole bootstrap process takes less than 2 minutes to load 50M keys.


Mickey

Vinoth Chandar

Nov 12, 2012, 2:52:01 PM
to project-...@googlegroups.com
Nice discussion. We here at LNKD did try out leveldb for Voldemort. I think the lack of proper Java bindings is a real concern here.
With large datasets, the latency shot up (20 ms on SSDs), mostly due to the lack of bloom filter support. We have not tried Persistit yet.
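
As background on why missing bloom filters hurt read latency: a bloom filter lets a store skip on-disk files that certainly do not contain a key, at the cost of occasional false positives. The sketch below is a generic illustration, not the filter used by LevelDB or by the Java port. The class name, the sizing parameters, and the FNV-1a-style double hashing are assumptions chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Sets numHashes bit positions derived from the key.
    void add(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    // false means the key was definitely never added (safe to skip the disk read);
    // true means the key may be present, so the caller still has to check.
    boolean mightContain(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th position is h1 + i * h2, reduced into [0, numBits).
    private int index(byte[] data, int i) {
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // FNV-1a-style 32-bit hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }
}
```

Without such a filter, a lookup for a key may have to touch every file that could hold it, so each extra file adds a disk read to the worst case.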

With BDB JE, we found that if you move the data off the heap and rein in the scan jobs as well, performance is consistent for the most part.
(This code is checked into master. We have not made a release yet since we are waiting for a few more changes to stabilize as well:
https://github.com/voldemort/voldemort/compare/master...release-096li8)
The only 'open item', as I understand it at this point, is tuning the cleaner down (moving to lazy migration=false, which is more GC friendly).


Thanks
Vinoth