A good schema for an email recipient autocomplete search box

Martin Camitz

unread,

May 4, 2012, 7:02:29 AM5/4/12

to mongod...@googlegroups.com

I am in the process of implementing an autocomplete searchbox for message recipients similar to the likes of gmail with MongoDB. The returned set will be based on previous interaction and prefix matches on any number of emails per user and/or full name.

Currently I'm using this schema

{

_id: 1,

Fullname: "Aaa Bbb",

Emails: ["xxx...@zzz.com", "mmm.nnn"@lll.com"],

Keys: ["xxx yyy mmm nnn aaa bbb"],

Connections: [ 19, 21, 32 /*Links to other users*/]

}

A search could look like this

db.User.find({"Connections._id" : 19, "Keys" : /^m/})

where I am user 19 and I've typed "m" into the search box. The above document would match.

Keys are indexed and the array is created on insertion using all the subcomponents of all emails and names.

Connections is also indexed and contain the *reciprocal* connections based on previous interaction such that i A has ever messaged B then A will be in B's Connection array. The idea is to exploit Mongos indexing/sharding to restrict the search set.

The question is, is this a good idea?

An alternative would be to have Connections the other way around i.e. if A ever messaged B then B would be in A's connection array. A search for the same match would then be

db.User.find({"_id" : 19, "Connections.Keys" : /^m/})

I guess it comes down to how Mongo uses indexes. If it's a question of two consecutive lookups then the second approach seems appropriate. However, if there is a more complicated/smart merging of indexes going on then I can imagine the first one performing better.

The first also has the added benifit that Connections doesn't even have to contain linked objects, just ids. I'm not sure of the gain there though.

Brandon Diamond

unread,

May 4, 2012, 1:23:28 PM5/4/12

to mongodb-user

The first approach seems to be superior: you're not storing arrays of
nested, complex objects -- you're just storing a simple string array.

Note that every object will have N entries in the multi-key index
where N is the number of items in the array. If this becomes large,
then you'll have trouble keeping the index in primary memory.

Further, you want to limit document growth. This is more easily
achieved when storing simple strings (provided that there is a
constant limit to the growth such that MongoDB can predict how much
extra space to allocate).

On May 4, 7:02 am, Martin Camitz <martin.cam...@gmail.com> wrote:
> I am in the process of implementing an autocomplete searchbox for message
> recipients similar to the likes of gmail with MongoDB. The returned set
> will be based on previous interaction and prefix matches on any number of
> emails per user and/or full name.
>
> Currently I'm using this schema
>
> {
>
> _id: 1,
>
> Fullname: "Aaa Bbb",
>

> Emails: ["xxx....@zzz.com", "mmm.nnn"@lll.com"],

Martin Camitz

unread,

May 4, 2012, 3:48:17 PM5/4/12

to mongod...@googlegroups.com

Thank you for your answer which I have read several times over. I think what you're saying is that if I like I mentioned towards the end, did not link objects in the Connections array but instead only IDs then there is no question that the first approach is better.

Just to be sure that you understands me properly:

Approach 2, (naive): Searching involves finding the author of the message and then in his/her connection-array find recipients with any Key-array element matching the prefix search string.

Approach 1, (refined): Searching involves finding all users who've at some point received messages from the author by scanning everybody's (indexed) Connection-array for the author id, then in the returned set including only those with matches in the Key-array.

The Key-index promises to be smaller (although on the same order) than would be for instance an index on emails, since emails are unique to individuals whereas the Key-index would be comprised to a significant degrees of regular person names.

Document growth may be an issue. Emails and names tend to be few and fixed but connections can grow though and there'd be a skew in the distribution. In the 2nd approach capping on 10000 connections or even 1000 will not affect usability I would say. In the 1st approach it may. Consider how many you receive emails from vs how many you send to. On the other hand skewness in the number of connections would be lower in the 1st vs the 2nd, I'm not sure what the effect of this is. I'll have to look into mitigating this. Currently I'm rebuilding the entire database in a nightly batch and as long as I can continue with that there won't be a big problem with document growth.

Brandon Diamond

unread,

May 4, 2012, 6:16:01 PM5/4/12

to mongodb-user

Yes - that's correct.

Generally, when deciding which side of the relationship to capture
(using the array -- and ignoring the issue of embedding strings in an
array vs objects for the time being), you'll always want to go with
the option that is most likely to be smaller.

Since people tend to send fewer messages than they receive, that's
likely your best bet when modeling this problem.

Hope this helps,
- Brandon

Martin Camitz

unread,

May 5, 2012, 10:14:25 AM5/5/12

to mongod...@googlegroups.com

Alright, thanks, Brandon!

Martin

Reply all

Reply to author

Forward