Reason of using lmdb

382 views
Skip to first unread message

Aris Setyawan

unread,
Mar 20, 2014, 9:29:17 AM3/20/14
to hustle...@googlegroups.com
Hi,

Any on-disk btree structure will suffered if ram is very small compared to data size.

What is your reason of using lmdb?
You use it for what?

Aris

Tim Spurway

unread,
Mar 20, 2014, 10:05:58 AM3/20/14
to Aris Setyawan, hustle...@googlegroups.com
Hi Aris, 

You have the honour of being the first post to the group.  Welcome!

LMDB is a memory mapped database that supports multiple b+ trees per mapped file.  We use this capability to keep each Hustle table in many LMDB files, replicated across multiple nodes in Disco's Distributed File System (DDFS).

Within each LMDB file, we organize the data by creating one b+ tree for each column, plus one b+ tree for each index - essentially a 'column oriented' database.  The advantage of this scheme is that we only page data into memory that the query needs.  If the table has 50 columns, but our query only accesses 3 columns, we will never page *any* data from the other 47 columns into memory.

In a typical Hustle query, the 'where' clause is executed first.  The query engine will only access the index columns that appear in the where clause, so only those indexes will be 'paged' into memory.  The indexes are 'bitmap' indexes, which means they are very compact and occupy relatively little space, especially for low cardinality keys.  

During the execution of a single where expression (like table.column == 'hello'), we access all values with a single 'get' operation in the case of equality operation (==), or we access the index using cursors (<, >, <=, etc.).  Either way, it uses memory paging efficiently because it doesn't 'swap' back and forth over the keyspace, but rather accesses it sequentially.

After we have executed the where clause, we have all of the row_ids that we need to fetch for the 'select' part of the query.  We open cursors on the selected columns (and *only* those selected columns), and *sequentially* fetch data from these columns.  Again, only the data required is actually paged into memory.

In general, Hustle queries perform excellently because:
- memory mapped files are handled by the kernel and offer excellent performance
- column orientation ensures we only access the data we need to perform the query
- LMDB is a very (very) fast b+ tree implementation
- compression (bitmap, lz4 and prefix trie) ensures that data we do fetch is very small
- index and data columns are always accessed in a sequential manner (with cursors) as opposed to random access

I hope that answers some of your questions!

cheers,
tim 

--
You received this message because you are subscribed to the Google Groups "hustle-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hustle-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

signature.asc

Aris Setyawan

unread,
Mar 21, 2014, 5:53:37 AM3/21/14
to hustle...@googlegroups.com, Aris Setyawan
Your design is simple. Now you should focus on SQL optimization.

> Within each LMDB file, we organize the data by creating one b+ tree for each column, plus one b+ tree for each index

Can you point to me, the code which responsible for doing this?


How did you use EWAHBoolArray, a c++ library, ini hustle (python)?
Reply all
Reply to author
Forward
0 new messages