Hybrid data structure proposal

2 views
Skip to first unread message

Jak Sprats

unread,
Nov 7, 2010, 10:54:26 AM11/7/10
to redisql-dev
Hi All,

Nail found a bug the other day, and I dove into some parts of the code
where I want to do a pretty sophisticated optimisation and I was
wondering if anyone had opinions/tips on it.

-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
B+Tree/Trie Hybrid data structure
-------------------------------------------

One problem w/ B+trees is they allocate memory in chunks.

If you have a big table it makes a lot of sense to allocate chunks in
4K blocks.

Redisql's indexes are also B+trees and so are the nodes of those
indexes.
For space saving reasons the 3 B+trees do not allocate memory in equal
proportions
1.) Data B+tree 4K
2.) Index B+tree 1K
3.) Node B+tree 128B

The reason that the Node B+tree is small is to not allocate a lot of
memory if e.g. there is an index on lastname and only one person's
last name is "Sprats", so only 128B would be allocated.
The problem w/ this approach is for the "Smith" lastname, that has
100K entries. w/ 128B block sizes, a 100K element B+tree has to split
nodes many times ....

The Node B+trees also dont have to be B+trees, they dont do range
queries, they just need fast insert/delete and low memory usage (so a B
+tree is correct, but not at small sizes)

So I think the truly elegant (although more complicated) is to have a
hybrid data structure that is a Trie when there are less than ?32?
elements and a B+tree w/ 4K block-size afterwards. This means when the
structure gets its 33rd element, it will have to convert from Trie to B
+tree.

At a size of 32 elements, a Trie should be as fast as a B+tree for
range-queries, but a Trie takes up almost no additional memory, so for
the FK on lastname "Sprats" entry and the "Smith" entry, both high
speed and low memory would be the case.

Additionally, in the case where you want to make a table per user, say
100K tables, and each table only has 10 rows, then using this B+tree/
Hash for the Data B+tree, would be a big win.

The only con to this approach (besides code complexity) is when the 32
element Trie has to be converted to a 33 element B+tree, there will be
a performance hiccup.

Thoughts?

- Jak

p.s. redis has similar issues w/ ZSETs
http://groups.google.com/group/redis-db/msg/dd7cad10e21ecf65

Jak Sprats

unread,
Dec 14, 2010, 7:51:53 AM12/14/10
to redisql-dev
Hi all,

on the hybrid Btree topic, I just checked in a simple hybridization
approach

Each Btree (in Alchemy tables are Btrees, and Indexes are Btrees of
Btrees), starts out w/ a block size of 128.

If you have an empty table, it will take up 178 bytes (128+54 for
struct btree)
If you have an empty table, w/ an index -> 354B

On insertion, once a Btree inserts its 32nd entry (this can be in the
table, in the index, or in the index_node) the btree converts itself
to a Btree w/ block-size of 4096.

In practice the Btree size jumps from about 1.1K to 4K, so the
transition is not perfect.

The upsides to the hybridization approach is for small tables
1.) tables w/ less than 32 rows, have Btrees w/ block_size=128
2.) indexes w/ less than 32 members (think fk on STATE and you arent
doing all 50), will have Btrees w/ block_size=128
3.) index_nodes w/ less than 32 members (think less than 32 people
from North Dakota), , will have Btrees w/ block_size=128

Another upside, and this is strange, is that really big tables save
lots of memory because the overhead of the btree_node's is reduced.
Example,
a.) table w/ 50million (id INT, val INT) w/ block_size=128 = 3.7GB
b.) table w/ 50million (id INT, val INT) w/ block_size=4096 = 3.0GB
-> 23% less memory usage

Performance is slightly better on large tables for certain operation
like long FK Range Queries (up to 20% better).

OK, that is it, this is a step in BTree hybridization, the next one
would be having a simple array of pointers for really small table to
reduce the initial overhead of 178B to 64B. This too would be
relatively simple, and help people who have lots of index_nodes w/
less than 8 members.

- Jak
Reply all
Reply to author
Forward
0 new messages