Re: Modeling a trie in MongoDB

Kevin Matulef

unread,

Sep 30, 2012, 11:08:59 PM9/30/12

to mongod...@googlegroups.com

First, I want to be clear on your requirements. Do you want to match the string that is the longest exact prefix, or the string that has the longest common prefix? For instance, if your database contains

Prefix : 1

Prefix : 122

and your input is 123, do you want to return 1, because it's the only element in the db that's a prefix of 123? Or do you want to return 122, since it shares a longer common prefix of "12" with 123? I think you mean the former, but I want to check.

Also, is there a reasonable bound on the length of the prefixes you're storing (e.g. maximum 100 characters), or can they be arbitrarily long?

-Kevin

On Friday, September 28, 2012 5:43:53 PM UTC-4, Vladut Paiu wrote:

Hello,

Currently, in my application I have the need to perform a longest prefix matching, like the following :

In my DB I have the following :

Prefix : 123
Prefix : 2345
Prefix : 1234
Prefix : 6789

If my input is 123456789, the result should be 1234, since it's the longest common prefix.

Currently, I operate this by loading all the info in memory and relying on tries for fast access. While this has advantages, my dataset is quickly growing large enough that it cannot fit entirely in RAM, so I am looking into MongoDB, and trying to figure out a way to map this usage case to it.

Has anybody got any ideas about how I could do this with Mongo ?

Regards,
Vlad

Vladut Paiu

unread,

Oct 1, 2012, 2:39:07 AM10/1/12

to mongod...@googlegroups.com

Hello,

Thank you very much for your answer.

In your example, I'd return 1, since it's the only element that's an exact prefix of 123.

The input number would be bound to around 15 digits, while the prefixes themselves would be bound to max 10 digits long.

Best Regards,

Vlad

Kevin Matulef

unread,

Oct 1, 2012, 4:35:51 AM10/1/12

to mongod...@googlegroups.com

Got it. In that case, for each input there are only 10 possible prefixes that can match it. So one way to do this is simply to store the prefixes as strings ("123", "2345", etc.) and create an index on the "prefix" field (e.g. db.coll.ensureIndex( { prefix : 1 } ) ). Then you can query for the possible matches using the $in operator, e.g.

// query for prefix matching "123456789", sorted reverse lexicographically, returning just the longest match

> db.coll.find( { prefix : { $in : [ "1" , "12" , "123" , ... , "123456789" ] } } ).sort( { prefix : -1 } ).limit(1)

The array after the $in operator should include the prefixes of your input string (up to 10 of them). In the worst case, the time complexity of this query is equivalent to that of doing 10 indexed equality queries, so it should be pretty efficient. Also note that with the addition of the "sort" and "limit," the query will look for the longest prefixes first, and stop at the first one it finds, so that should also cut down on unnecessary work.

-Kevin

Vladut Paiu

unread,

Oct 1, 2012, 2:27:55 PM10/1/12

to mongod...@googlegroups.com

Hello,

Thank you very much for your reply, very useful.

I have tested this in our QA platform and we are getting reasonable performance with an index on the prefix field, but still a little bit slow for our requirements.

Is there any way we could enhance the query performance ? We've tested various methods, but since we're not familiar with MongoDB internals, none of them seem to be working.

Best Regards,

Vlad

Kevin Matulef

unread,

Oct 2, 2012, 12:12:12 PM10/2/12

to mongod...@googlegroups.com

Are your prefixes and inputs all integers, or can they can be "chars" as well?

-Kevin

Vladut Paiu

unread,

Oct 2, 2012, 12:35:24 PM10/2/12

to mongod...@googlegroups.com

Hello,

The prefixes will be only integers.

Best Regards,

Vlad

Kevin Matulef

unread,

Oct 2, 2012, 1:51:19 PM10/2/12

to mongod...@googlegroups.com

In that case you can use essentially the same strategy, but store the prefixes as numbers instead of strings. If the average length of the prefixes is > ~3 this will save a few bytes, making the index smaller (thus more of it can fit in memory). You'll have to do some division and remainder calculations to construct the prefixes you put in your $in query.

This might help a little, but I wouldn't expect it to help a lot. Be aware that if your set of prefixes doesn't fit in memory, there's no way to avoid the occasional disk seek, so your performance is bound to degrade a bit, whatever data structure you use. Also, the performance of a general-purpose database index designed for range/equality queries will probably never quite match a custom data structure designed exactly for your problem (though I expect it should get close in this case).

-Kevin

Reply all

Reply to author

Forward