critbit (or radix) trees


Pierre Chapuis

Mar 7, 2011, 1:27:49 PM
to redi...@googlegroups.com
Hi Redis group,

I was thinking about data structures for an unrelated project
and I thought about something: the Redis keyspace is (if I
understand it correctly) a Hash. Yet it is a common practice
to store keys that look like this:

Something1:Something2:Something3

where Something1 will be extremely frequent, Something2 will
be a bit less frequent and Something3 will be the least
frequent.

Knowing that, would it not be a better idea to use something
like a critbit tree instead of a hash? It would save memory
by compressing the keys and it could be faster too. For
instance, Varnish uses critbit trees by default ([1]).

More information about these trees, which are basically
binary radix trees, can be found in the links below.

[1] http://kristianlyng.wordpress.com/2010/07/30/varnish-2-1-3-use-it/
[2] http://cr.yp.to/critbit.html
[3] http://www.imperialviolet.org/binary/critbit.pdf

--
Pierre 'catwell' Chapuis

Josiah Carlson

Mar 7, 2011, 3:00:24 PM
to redi...@googlegroups.com, Pierre Chapuis
Redis seems to have been designed under the KISS principle. The
particular implementation of hashes used by Redis is fairly simple and
straightforward. Replacing them with something less simple doesn't
make sense unless it can be shown that they offer an advantage
significant enough to warrant the extra code, maintenance, etc.

Regards,
- Josiah

> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To post to this group, send email to redi...@googlegroups.com.
> To unsubscribe from this group, send email to
> redis-db+u...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/redis-db?hl=en.
>
>

Dvir Volk

Mar 7, 2011, 3:28:52 PM
to redi...@googlegroups.com, Josiah Carlson, Pierre Chapuis
I've been toying lately with the idea of creating a pseudo-redis slave that indexes all the keys in a critbit tree and can be queried separately from redis.
You could index keys according to patterns in them if you don't want all the keys indexed.
This would still keep Redis KISS, but would allow this feature, which quite a few people need, to work seamlessly.
--
Dvir Volk
System Architect, Do@, http://doat.com

Jak Sprats

Mar 8, 2011, 7:48:38 AM
to Redis DB
Hi Pierre,

a critbit tree is a pretty good idea on some levels. I
think a lot of people using redis do have denormalised data sets w/
highly repetitive key structures, which would be better served by a
critbit tree than a hash table.

Josiah's point about KISS is a good one, and also the redis hash table
is battle tested and has been refactored many times, so anyone looking
to suggest alternatives needs to understand the amount of testing that
such a change would require in practice.

On to Dvir's point about putting up a fork w/ a critbit overriding the
hash-table: this is not that much work to do. It basically comes down
to overriding the functions in dict.c w/ critbit functions. If done
correctly, even if it is not merged into the main branch, it could be
applied as a patch (dict.c does not change much - because it's SOLID
code), meaning it could be applied indefinitely (barring dict.c
changes) and wouldn't turn into bit-rot :)

On a side note, I overrode the central redis hash-table w/ a
btree 4-5 months ago, and even though the complexities of the
respective data structures are O(1) vs. O(logN), the lookup speeds
were about 1 to 0.8 under a very high SET/GET load. My only
explanation was that the lookup itself constituted a fraction of the
entire transaction (tcp packet handling, parsing, lookup, response),
but I did not do enough in-depth profiling, so this information should
be viewed as an observation rather than a conclusion.

- Jak

Dvir Volk

Mar 8, 2011, 10:05:12 AM
to redi...@googlegroups.com, Jak Sprats

> On to Dvir's point about putting up a fork w/ a critbit overriding the
> hash-table, this is not that much work to do. It basically comes down
> to overriding the functions in dict.c w/ critbit functions. If done
> correctly, even if it is not merged into the main branch, it could be
> applied as a patch (dict.c does not change much - cause its SOLID
> code), meaning it could be applied indefinitely (barring dict.c
> changes) and wouldn't turn into bit-rot :)


I was thinking of something more modest, that doesn't really act as a complete server, but rather just lets you run optimized KEYS queries against it.
Say you fire up such a slave, and tell it something like "INDEXKEYS user:* product:*" etc.
Each new key that matches these patterns is inserted into a critbit tree; later you can simply do something like "IKEYS user:a*" to query the critbit tree and just get keynames.
Actually, since the overhead in memory is controlled by the user (and is 0 by default), it might not be such a bad idea to add this to redis itself if enough people want it and the powers that be like it :)
I know I'd use it.

Alexander Gladysh

Mar 8, 2011, 10:16:37 AM
to redi...@googlegroups.com, Dvir Volk, Jak Sprats
On Tue, Mar 8, 2011 at 18:05, Dvir Volk <dvi...@gmail.com> wrote:

> I was thinking of something more modest, that not really acts as a complete
> server, but rather just lets you run optimized KEYS queries against it.
> say you fire up such a slave, and tell it something like "INDEXKEYS user:*
> product:*" etc
> each new key that matches these patterns, is inserted into a critbit. later
> you can simply do something like "IKEYS user:a*" to query the cribit tree
> and just get keynames.
> Actually since the overhead in memory is controlled by the user (and is 0 by
> default), this might not be such a bad idea to add this to redis itself if
> enough people want it and the powers that be like it :)
> I know I'd use it.

I think that I would use it as well. That or nested hashes.

Alexander.

Pierre Chapuis

Mar 9, 2011, 5:06:45 AM
to redi...@googlegroups.com
On Tue, 8 Mar 2011 17:05:12 +0200, Dvir Volk wrote:

> I was thinking of something more modest, that not really acts as a
> complete server, but rather just lets you run optimized KEYS queries
> against it.
> say you fire up such a slave, and tell it something like "INDEXKEYS
> user:* product:*" etc
> each new key that matches these patterns, is inserted into a critbit.
> later you can simply do something like "IKEYS user:a*" to query the
> cribit tree and just get keynames.
> Actually since the overhead in memory is controlled by the user (and
> is 0 by default), this might not be such a bad idea to add this to
> redis itself if enough people want it and the powers that be like it
> :)

> I know I'd use it.

This is an interesting idea but it actually *increases* memory use
instead of reducing it, which is one of my goals.

I use Redis to implement complex data structures such as inverted
indices and trees. A strategy to implement a tree in a flat keyspace
if only the leaves hold values is:

Node keys store an integer which is the number of their sons.
Nodes whose value is 0 are terminal and if their key is "x" then
there is a key called "v:x" that stores the value.
The root is the "1" key.

For instance (key -> value):

1 -> 3
1:1 -> 1
1:1:1 -> 0
v:1:1:1 -> "a"
1:2 -> 0
v:1:2 -> "b"
1:3 -> 2
1:3:1 -> 0
v:1:3:1 -> "c"
1:3:2 -> 1
1:3:2:1 -> 0
v:1:3:2:1 -> "d"

This represents:

1
|- 1:1
|  |- 1:1:1
|     |- "a"
|- 1:2
|  |- "b"
|- 1:3
   |- 1:3:1
   |  |- "c"
   |- 1:3:2
      |- 1:3:2:1
         |- "d"

If your nodes hold a value you can replace the integer with a set of
sub-components. For instance, imagine you want to store web pages:

example.com
|- stuff
|- x.html
|- content for "http://example.com/stuff/x.html"
|- y.html
|- content for "http://example.com/stuff/y.html"
|- about
|- content for "http://example.com/about"

The key example.com:stuff would be a set that contains "x.html"
and "y.html".

All these structures have in common that the keys are prefixed by
stacked namespaces. ORMs do this too, so it's very common.
With such structures, in real life, long keys with their first
80% or more in common are frequent, and critbit trees are
designed to store such data very efficiently, so what I'm saying
is that using them to replace hashes could halve the memory used
by keys.

Moreover, if we go back to the URLs example, most people don't
actually store the content of the page in the tree, they just
want to know that they have it on disk. In this case, the
cumulative length of the keys is of the same order as the
cumulative length of the values, which means that you do want
to reduce the memory usage of the keys.

This has become a very long post, but my point is: yes we would
benefit from what Dvir proposes, but it would not bring all the
advantages of critbit trees as an alternative to hashes.

Dvir Volk

Mar 9, 2011, 5:15:16 AM
to redi...@googlegroups.com, Pierre Chapuis
I'd take the memory hit if it still keeps the O(1) behavior of hashes :)
I have an ORM and indeed right now I'm saving tons of keys and can't really do prefix or partial searches on them.
For example, if I build a key on first name and last name, using SETs named for example "users:name=john:last=doe", I can't use the same set to get all the users named "john doe" and all the users named "john". And if you index more than one field, you get gazillions of keys.
So in my use case, the memory overhead of adding a critbit tree indexing some redis keys would be nothing compared to the memory saving of being able to use partial keys like an RDBMS does.


Salvatore Sanfilippo

Mar 9, 2011, 5:41:45 AM
to redi...@googlegroups.com, Jak Sprats
On Tue, Mar 8, 2011 at 1:48 PM, Jak Sprats <jaks...@gmail.com> wrote:
> On a side note, I overrode the central redis' central hash-table w/ a
> btree 4-5 months ago, and even though the speeds of the respective
> data structures are O(1) vs. O(LogN) the lookup speeds were about 1 to
> 0.8 under a very high SET/GET load. My only explanation was that the
> lookup itself constituted a fraction of the entire transaction (tcp
> packet handling, parsing, lookup, response), but I did not do enough
> in-depth profiling, so this information can be viewed as an
> observation rather than a conclusion.

That's absolutely true, but there is a different issue. In diskstore,
in expires, and in other parts of the Redis core we do sampling to
obtain elements that are good candidates to expire. At this level we
perform a lot of lookups in a loop. In this case the O(1) vs O(logN)
will become a performance hit.

Also we have RANDOMKEY and other randomized operations. The random
sampling itself is based on the ability of getting random elements. To
do this with a tree we need to augment the tree, and use more memory,
and more CPU.

Final note: in a cluster scenario getting ranges of keys is not
possible anyway in a straightforward way.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Pierre Chapuis

Mar 9, 2011, 6:15:45 AM
to redi...@googlegroups.com
On Wed, 9 Mar 2011 11:41:45 +0100, Salvatore Sanfilippo wrote:

> That's absolutely true, but there is a different issue. In diskstore,
> in expires, and in other parts of the Redis core we do sampling to
> obtain elements that are good candidates to expire. At this level we
> perform a lot of lookups in a loop. In this case the O(1) vs O(logN)
> will become a performance hit.

O(x) performance only means something relative to an operation.
Hash operations are actually O(N) (N being the length of the key)
because the hashing function itself is O(N).

> Also we have RANDOMKEY and other randomized operations. The random
> sampling itself is based on the ability of getting random elements.
> To
> do this with a tree we need to augment the tree, and use more memory,
> and more CPU.

That's definitely true.

> Final note: in a cluster scenario getting ranges of keys is not
> possible anyway in a straightforward way.

Not in a straightforward way, that's true. But it could theoretically
be made extremely efficient if Redis tries to regroup subtrees of the
critbit tree on the same instance... That could be a job for a
complementary critbit tree layer.

However it's true that I was only thinking about the single-instance
use case.

--
Pierre 'catwell' Chapuis

Salvatore Sanfilippo

Mar 9, 2011, 6:41:53 AM
to redi...@googlegroups.com, Pierre Chapuis
On Wed, Mar 9, 2011 at 12:15 PM, Pierre Chapuis <cat...@archlinux.us> wrote:
> O(x) performance only means something relative to an operation.
> Hash operations are actually O(N) (N being the length of the key)
> because the hashing function itself is O(N).

I was not clear; let's assume the key is constant size, so you can
optimize this away as a constant factor.

If I do M lookups for random sampling to evict keys on maxmemory, and
I've O(1) lookups, this is O(M).
If instead I've O(logN) lookup complexity, this is O(M*logN),
which can easily be an order of magnitude more: where I planned,
for instance, to block 1 millisecond at max on an entry level
server, instead I see query latencies from time to time on the order
of 10 milliseconds.

Pierre Chapuis

Mar 9, 2011, 9:12:57 AM
to redi...@googlegroups.com
On Wed, 9 Mar 2011 12:41:53 +0100, Salvatore Sanfilippo wrote:

> If I do M lookups for random sampling to evict keys on max memory,
> and
> I've O(1) lookups, this is O(M)

I've had a look at that and it looks strange to me but probably I don't
understand the score well.

According to what I saw:

- Keys that can expire are stored in a hash: db->expires.
  In this hash, the keys are keys in the DB and the values are
  expiration times.
- In activeExpireCycle, for each db, you get a fixed number of random
  keys, and for each of those you delete it if it has expired, i.e. if
  its expiration time is earlier than "now".
- To get random keys in a hash you use dictGetRandomKey, which doesn't
  look like it's O(1). It even looks like its runtime is
  non-deterministic because:

  - first it selects a bucket with something like:

        do {
            h = random() & d->ht[0].sizemask;
            he = d->ht[0].table[h];
        } while(he == NULL);

    This has a probabilistic complexity depending on the number of
    non-empty buckets, no? However I don't know how frequent
    empty buckets are.
  - then it gets a random element in a list, which is O(len(list)),
    but I suspect that most lists have small lengths (otherwise the
    bucket count changes).

What I don't understand is why you use such a hash to store keys
that will expire and not a sorted set with the expiration date as
the score. Then you wouldn't have to do random lookups at all and
you could simply expire the keys with the oldest expiration date...

--
Pierre 'catwell' Chapuis

Pierre Chapuis

Mar 9, 2011, 9:19:24 AM
to redi...@googlegroups.com
On Wed, 09 Mar 2011 15:12:57 +0100, Pierre Chapuis wrote:

> I don't understand the score well.

The code... Although score is a nice metaphor ;)

--
Pierre 'catwell' Chapuis

Pierre Chapuis

Mar 9, 2011, 10:46:35 AM
to redi...@googlegroups.com
On Wed, 9 Mar 2011 15:48:21 +0100, Salvatore Sanfilippo wrote:

> On Wed, Mar 9, 2011 at 3:11 PM, Pierre Chapuis <cat...@archlinux.us>
> wrote:
>> - To get random keys in a hash you use dictGetRandomKey which
>> doesn't
>>  look like it's O(1). It even looks like its runtime is
>>  non-deterministic because:
>
> get random key is O(1) because we guarantee at least a given
> percentage of used buckets.

OK, I thought so. I checked the code and when you resize you
double the number of buckets, so if the best possible number
of operations is X, the average just after a resize is 2X,
and there's always less than a 0.2% chance that it does more
than 10 operations at this step.

Now that doesn't explain why you use a random algorithm instead
of a deterministic one with a sorted data structure.

Also, I don't really understand why you say that lookups would
be O(logN) in a critbit tree. They are not balanced trees and
any lookup for a key K is O(len(K)). Since you factor out len(K)
as a constant it would be O(1).

--
Pierre 'catwell' Chapuis

Salvatore Sanfilippo

Mar 9, 2011, 11:05:30 AM
to redi...@googlegroups.com
On Wed, Mar 9, 2011 at 4:46 PM, Pierre Chapuis <cat...@archlinux.us> wrote:
> Now that doesn't explain why you use a random algorithm instead
> of a deterministic one with a sorted data structure.

Not sure I understand the question, please can you add some detail?

> Also, I don't really understand why you say that lookups would
> be O(logN) in a critbit tree. They are not balanced trees and
> any lookup for a key K is O(len(K)). Since you factor out len(K)
> as a constant it would be O(1).

Now I see they are radix trees, so yes, O(1) lookups (we can consider
the key fixed size for this matter).

Btw the point is, I don't consider range queries in the key space very
important, nor a fundamental operation for data organization in Redis;
actually it is more or less an anti-pattern IMHO.
So I don't want to change a lot of code for no significant gain.

I'm much more interested in providing more range query capabilities
for sorted sets, for instance in the form of a command able to get a
lexicographic range, but since for now I consider the API to be more
or less fixed, I'm not doing too much about it. We really need to
bring forward cluster, diskstore, memory footprint of sorted sets,
and rdb save/load performance for now.

But I'll keep in mind there is this data structure if we'll ever need
some substitute, so thank you for pointing it out.

Jak Sprats

Mar 9, 2011, 11:22:25 AM
to Redis DB
> sampling to
> obtain elements that are good candidates to expire. At this level we
> perform a lot of lookups in a loop. In this case the O(1) vs O(logN)
> will become a performance hit.

Yeah, that makes sense: if you do many lookups sequentially, the
difference between a hash table and a btree quickly becomes evident.

Pierre asked why not use a ZSET to store expiring keys, so it would be
guaranteed that the oldest (not random) keys were getting evicted, and
also walking a ZSET is most likely quicker than
N*random_hash_table_lookups. The ZSET approach would require
additional memory and add an additional data structure, but like
Pierre, I have also asked myself why random expiration is better (and
I am pretty sure I am missing something; there are tons of tradeoffs
in this problem). BTW, I have no clue what data structure memcached
uses to store eviction keys and expire them.

> Final note: in a cluster scenario getting ranges of keys is not
> possible anyway in a straightforward way.

Yeah, true. On a related note: I read your hackernews response to
simonw on ZSET intersections across cluster nodes and was really
impressed at the MIGRATE to temp intersection nodes solution. I think
you really nailed that extremely difficult problem by abstracting it
out of the server, brilliant :)

On Mar 9, 3:41 am, Salvatore Sanfilippo <anti...@gmail.com> wrote:

Pierre Chapuis

Mar 9, 2011, 11:26:17 AM
to redi...@googlegroups.com
On Wed, 9 Mar 2011 17:05:30 +0100, Salvatore Sanfilippo wrote:

>> Now that doesn't explain why you use a random algorithm instead
>> of a deterministic one with a sorted data structure.
>
> Not sure I understand the question, please can you add some detail?

I was thinking you could make db->expires a zset (with score =
expiration date) instead of a hash and simply expire the oldest keys.
Then I looked at redis.io and saw ZRANGEBYSCORE is O(log(N)+M), so it
would not be more efficient. No silver bullet I guess.

> Btw the point is, I don't consider range queries in the key space
> very
> important nor a fundamental operation for data organization in Redis,
> actually it is more or less an anti-pattern IMHO.
> So I don't want to change a lot of code for no significant gain.

I can understand that. I wanted this more for the memory usage gains
than for range queries though.

> I'm much more interested in providing more range queries capabilities
> to sorted sets, for instance in form of a command able to get a
> lexicographic range, but since for now I consider the API to be more
> or less pretty fixed, for now I'm not doing too much about it. We
> really need to bring forward cluster, diskstore, memory footprint of
> sorted sets, rdb save/load performances for now.

Yes, those are real priorities. I'm probably going to need cluster
someday so I agree you should focus on it ;)

--
Pierre 'catwell' Chapuis

Salvatore Sanfilippo

Mar 9, 2011, 11:25:52 AM
to redi...@googlegroups.com, Jak Sprats
On Wed, Mar 9, 2011 at 5:22 PM, Jak Sprats <jaks...@gmail.com> wrote:


> Pierre wrote why not use a ZSET to store expire keys, so it would be
> guaranteed that the oldest (not random) keys were getting evicted and
> also walking a ZSET is most likely quicker than
> N*random_hash_table_lookups. The ZSET approach would require
> additional memory and add an additional data structure, but like
> Pierre, I have also asked myself why random expiration is better (and
> I am pretty sure I am missing something, there are tons of tradeoffs
> in this problem). BTW, what data-structure does memcached do to store
> eviction keys and expire them, I have no clue.

Oh ok, now I get it. The reply is very easy: our current solution has
zero additional memory used per key.


> Yeah, true. On a related note: I read your hackernews response to
> simonw on ZSET intersections across cluster nodes and was really
> impressed at the MIGRATE to temp intersection nodes solution. I think
> you really nailed that extremely difficult problem by abstracting it
> out of the server, brilliant :)

Thanks ;)

Cheers,
Salvatore


Salvatore Sanfilippo

Mar 9, 2011, 11:27:39 AM
to redi...@googlegroups.com, Pierre Chapuis
On Wed, Mar 9, 2011 at 5:26 PM, Pierre Chapuis <cat...@archlinux.us> wrote:
> I was thinking you could make db->expires a zset (with score = expiration
> date) instead of a hash and simply expire the oldest keys. Then I looked
> at redis.io and saw ZRANGEBYSCORE is O(log(N)+M), so it would not be
> more efficient. No silver bullet I guess.

The problem is that we also evict things without an expire set using
LRU, so this solution can't work in the general case without using
additional memory. Given that, better to have a single approach for
all the cases.

Btw from randomized simulations I did, the current approach is pretty
similar to precise LRU in the long run.

Pierre Chapuis

Mar 9, 2011, 11:45:46 AM
to redi...@googlegroups.com
On Wed, 9 Mar 2011 17:27:39 +0100, Salvatore Sanfilippo wrote:

> The problem is that we also evict things without an expire set using
> LRU, so this solution can't work in the general case without using
> additional memory. Given that, better to have a single approach for
> all the cases.

OK. The only problem with the current approach is that if you have,
e.g., 100k keys that expire in 2 days and 1 key that is actually
expired, there is almost no chance that this key ever gets collected.

Dvir Volk

Mar 9, 2011, 12:27:26 PM
to redi...@googlegroups.com, Salvatore Sanfilippo
> Btw the point is, I don't consider range queries in the key space very
> important nor a fundamental operation for data organization in Redis,
> actually it is more or less an anti-pattern IMHO.

Why do you consider it an anti-pattern? Of course today it is, because of performance reasons.
But I can give you many cases where this might be useful and actually save memory.

Salvatore Sanfilippo

Mar 9, 2011, 12:36:29 PM
to Dvir Volk, redi...@googlegroups.com
On Wed, Mar 9, 2011 at 6:27 PM, Dvir Volk <dvi...@gmail.com> wrote:
> Why do you consider it an anti pattern? Of course today it is because of
> performance reasons.
> But I can give you many cases where this might be useful and actually save
> memory.

Because a huge assumption on the Redis key-value model is that keys
are only accessed by fast lookups.
This makes it trivial to distribute in a completely horizontally
scalable fashion, without inter-node communication or other complex
merging.

If this is no longer true we have a different model.

Dvir Volk

Mar 9, 2011, 12:40:49 PM
to Salvatore Sanfilippo, redi...@googlegroups.com
I can tell you that's not entirely how I'm using redis...
that's why I'm so obsessed with sort stuff of course :)

Salvatore Sanfilippo

Mar 9, 2011, 12:44:26 PM
to Dvir Volk, redi...@googlegroups.com
On Wed, Mar 9, 2011 at 6:40 PM, Dvir Volk <dvi...@gmail.com> wrote:
> I can tell you that's not entirely how I'm using redis...
> that's why I'm so obsessed with sort stuff of course :)

It's a good idea to abuse the single-instance model, so I encourage your usage.
To design with the single instance in mind instead at this point is dangerous.
