Calculating record size and limit imposed by skydb / lmdb

harry...@gmail.com

unread,

Jan 14, 2014, 3:31:31 AM1/14/14

to sk...@googlegroups.com

Hello,

Among the test data I am trying to load, some of the rows (events) fail with the following error

2014/01/14 00:20:25 ERROR lmdb txn put error: MDB_BAD_VALSIZE: Too big key/data, key is empty, or wrong DUPFIXED size

I understand I need to look at the data size of the row, but few of the failed events I have looked at had roughly 600 bytes. Where is the limit configured or enforced. ? I see LMDB has limitations documented here http://lmdb.readthedocs.org/en/release/, however those are in 2048 to 4096 range. I am nowhere close to it. I understand key is limited to 511 bytes, but I am not crosssing that either.

Thank you.
--
Harry

Ben Johnson

unread,

Jan 14, 2014, 10:26:56 AM1/14/14

to harry...@gmail.com, sk...@googlegroups.com

2014/01/14 00:20:25 ERROR lmdb txn put error: MDB_BAD_VALSIZE: Too big key/data, key is empty, or wrong DUPFIXED size

This was a very recent change we made. Previously Sky used a long byte array to store all the events for a single object but this made appending new events slower and slower. We switched to use LMDB's DUPSORT feature to get O(log n) insertion time but one downside is that LMDB limits values to the max key size (511 bytes). You can change that when you compile LMDB but I think the upper bound is still ~2048 bytes.

As small as this limit sounds, it's usually not a limitation if you're using Factor types (instead of Strings). Factors get stored in the event as only a couple bytes so you need quite a few to fill up 511 bytes. I'm thinking about just removing support for string types entirely since they're horribly inefficient to query against and Sky is really meant for fields that have low cardinality so factors are always a better choice.

We are adding support for collections in Sky soon which can be problematic with the small value size restriction. We're going to allow spanning of events across multiple LMDB values though when that happens.

--
Ben

On January 14, 2014 at 1:31:33 AM, harry...@gmail.com (harry...@gmail.com) wrote:

Hello,

Among the test data I am trying to load, some of the rows (events) fail with the following error

harry...@gmail.com

unread,

Jan 14, 2014, 2:11:02 PM1/14/14

to sk...@googlegroups.com, harry...@gmail.com

Ben,

Thank you for the quick reply. I will look at my data and see what I can do to make it fit with 511 bytes. However, that said, if you are removing support for strings altogether, that wouldn't work for me at all. I plan to have URL fragments with query arguments stored and those are not low cardinality. Even if I parse out the arguments, I might have a campaign-id or catalog-id or sku-id that can be in millions (cardinality).

Thanks
--
Harry

Ben Johnson

unread,

Jan 14, 2014, 4:59:09 PM1/14/14

to harry...@gmail.com, sk...@googlegroups.com, harry...@gmail.com

However, that said, if you are removing support for strings altogether, that wouldn't work for me at all. I plan to have URL fragments with query arguments stored and those are not low cardinality. Even if I parse out the arguments, I might have a campaign-id or catalog-id or sku-id that can be in millions (cardinality).

High cardinality isn't necessarily the problem. Sky can still handle high cardinality but I meant to say that Sky isn't meant for dumping a bunch of string data that needs to be parsed later. If you're using string data then Sky works well if you're grouping by it or if you're doing an equality filter (== or !=). There's no support for anything like regex or LIKE clauses. Sorry for the confusion.

If your ID fields can be parsed as 64-bit integers then definitely store them as integers. If they need to be strings then factorizing them will incur a lookup cost on insert and in the final step of query processing. They should still be pretty snappy though.

That being said, doing this query will take a long time (assuming sku_id has a million+ cardinality):

SELECT count() GROUP BY sku_id

Then again, that'll probably take a long time in most databases.

A query like this:

WHEN sku_id == '1234' THEN

SELECT COUNT()

END

Should be fast though.

--
Ben

On January 14, 2014 at 12:11:03 PM, harry...@gmail.com (harry...@gmail.com) wrote:

Ben,

Thank you for the quick reply. I will look at my data and see what I can do to make it fit with 511 bytes.

Reply all

Reply to author

Forward