Search for multiple values in multiple keys

1,147 views
Skip to first unread message

Bartłomiej "veyDer" Rudzki

unread,
Jan 14, 2011, 11:09:39 PM1/14/11
to mongodb-user
Hi,

I've been struggling with the search functionality for a couple of
days and it seems that I need some help.

Here's the problem:
In the User collection I store 100K+ records with the following
structure:
- first_name
- last_name
- school_name
- city
- state

When a user provides a query "foo bar xyz", I want to return all the
documents satisfying the following criteria:

first_name = "foo" OR last_name = "foo" OR school_name = "foo" OR city
= "foo" OR state = "FO"
AND
first_name = "bar" OR last_name = "bar" OR school_name = "bar" OR city
= "bar" OR state = "BA"
AND
first_name = "xyz" OR last_name = "xyz" OR school_name = "xyz" OR city
= "xyz" OR state = "XY"

("FO", "BA" and "XY" for the state property are the values mapped from
the input keywords, the same may apply to any other property, just an
illustration of the fact that not all the values for a single group
are identical)

The above is just a simplification, in the real-world case I'd be
using the regular expressions /\bfoo/i, /\bbar/i, /\bxyz/i instead of
the literal values.

I know that MongoDB doesn't support multiple ORs and I don't think I
can use the method described here: http://groups.google.com/group/mongodb-user/msg/662fe2fcdeed7771
as the number of the elements of the $or query would be [number of
fields = 5] ^ [number of input words] (for 5 words it would be 3125)

I've tried $where, but it's way too slow (at least 10 times), perhaps
I'm doing something wrong there? If you think the only way to return
the proper results is to use the custom server-side function, what
would be the best way to write it?

Thanks for any ideas,
Bartlomiej "veyDer" Rudzki

Nat

unread,
Jan 14, 2011, 11:43:59 PM1/14/11
to mongod...@googlegroups.com
Can you describe more in detail why you would need to do such search? There might be a better way to do it. I don't see why you should allow that many keyword searches.

If you really want to do, you might use boolean algebra to convert them into

City in [foo,bar] or state in [foo,bar]
--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

Bartłomiej "veyDer" Rudzki

unread,
Jan 15, 2011, 12:07:43 AM1/15/11
to mongodb-user
Hi Nat,

Thanks for your answer.

The problem in more details is that I have different types (e.g. User,
High School) and for each of them I have predefined the list of
properties to look in, for example:
User: first_name, last_name, school name, city, state
High School: name, short_name, city, state

If the visitor is in the context of the user search, the following
searches must be available:
"Bart Rudzki" (first_name + last_name)
"Bart Rudzki Manhattan Beach" (first_name + last_name + city)
"Bart Rudzki California" (first_name + last_name + state)

But in the context of the high school search, the following searches
must be available:
"Manhattan Beach" (city)
"California" (state)
"Mira Costa California" (name + state)
"MCHS California" (short_name + state)

There are the following problems:
1. I don't know which word should be matched to which property ("Bart"
could possibly be the first_name or the last_name or even the state, I
don't know it up front)
2. There are multiple-word keywords (e.g. "Manhattan Beach" or "Mira
Costa") that should be matched to the same property

So as I wrote in my original message, I want to return all the
documents, for which *ALL* of the provided keywords are matched to at
least one of the specified properties.

I can't use
"City in [foo,bar] or state in [foo,bar]"
as it would not guarantee matching all the keywords.

Compare:
first_name in ["Bart","Rudzki"] OR last_name in ["Bart","Rudzki"]
OR ...
would return all the documents with the first name "Bart" or the last
name "Rudzki" (but not necessarily both)

Also notice that neither can I use:
first_name in ["Bart", "Rudzki"] AND last_name in ["Bart","Rudzki"]
AND city in ["Bart", "Rudzki"] ...
as I don't care about the city in this case

So in other words I need something like:
"Bart" in [first_name, last_name, city, school_name, state] AND
"Rudzki" in [first_name, last_name, city, school_name, state]

I've been trying to use boolean algebra for that, but with no results.

Thanks,
Bartlomiej "veyDer" Rudzki

Scott Hernandez

unread,
Jan 15, 2011, 12:37:32 AM1/15/11
to mongod...@googlegroups.com
It sounds like you want something closer to full-text searching with
scores and relevancies. You might want to look at an external
indexer/searcher like solr or elasticsearch.

At the very least you probably want $and:
http://jira.mongodb.org/browse/SERVER-1089

(btw, I graduated from Mira Costa)

Bartłomiej "veyDer" Rudzki

unread,
Jan 15, 2011, 6:32:07 PM1/15/11
to mongodb-user
Hi Scott,

Thanks for your answer.

A stand-alone external indexer/searcher would be too complex to add to
the project at this stage, so I'll try to find some other solutions.
I'll keep that in mind though, in the long-term it may be the way to
go.

Thanks,
Bartlomiej "veyDer" Rudzki
> >> You received this message because you are subscribed to the Google Groups "mongodb-user" group.>> To post to this group, send email tomongo...@googlegroups.com.>> To unsubscribe from this group, send email tomongodb-use...@googlegroups.com.
> >> For more options, visit this group athttp://groups.google.com/group/mongodb-user?hl=en.
>
> > --
> > You received this message because you are subscribed to the Google Groups "mongodb-user" group.> To post to this group, send email tomongo...@googlegroups.com.> To unsubscribe from this group, send email tomongodb-use...@googlegroups.com.

Nat

unread,
Jan 15, 2011, 7:51:03 PM1/15/11
to mongod...@googlegroups.com
One way you can do is to store all of your data as key-value pair array.
{ Data :
[ { Key : "city" , Value : "san francisco" }, { Key : "state" , Value : "CA" }]}

Then you can create and index on data.value or the combination between key and value. With this structure, you can search wildcard like give me all the record containing San francisco anywhere.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.

Bartłomiej Rudzki

unread,
Jan 15, 2011, 11:01:06 PM1/15/11
to mongod...@googlegroups.com
Hi Nat,

Thanks for your answer, that's probably what I'll end up doing, it is
still really slow but a few times faster than using $where.

Hopefully the introduction of the $and operator will allow me to come
up with the long-term solution.

Thanks,
Bartlomiej "veyDer" Rudzki

2011/1/16 Nat <nat....@gmail.com>:

Nat

unread,
Jan 15, 2011, 11:11:50 PM1/15/11
to mongod...@googlegroups.com
It shouldn't be very slow. Can you give what kind of query you are doing?

Bartłomiej Rudzki

unread,
Jan 16, 2011, 12:22:35 AM1/16/11
to mongod...@googlegroups.com
Hi Nat,

Here's an example of the query:

> db.HighSchoolStudent.find({
... search_keywords:{
... $all:[ /\bbart/i, /\brudzki/i, /\bmanhattan/i, /\bbeach/i ]
... }
... }).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 112421,
"nscannedObjects" : 112421,
"n" : 1,
"millis" : 1430,
"indexBounds" : {

}
}

My original query (with $where) took between 5000 and 6000ms, so it's
a step in the good direction.

Now the thing is that I perform ~20-30 additional counts for faceted
search on each search results page. While all other queries take less
time (as each of them is filtered by another indexed property like age
group, region, etc.), the general loading time of the page is
unacceptable (~15000ms).

At the moment I'm modifying the code to use lowercased keywords, so I
can get rid of "i" flag from the regex (which prevents the index),
that should also help.

Nat

unread,
Jan 16, 2011, 12:30:06 AM1/16/11
to mongod...@googlegroups.com
If you can use prefix regex such as /\^xxxx format. It can leverage index to make it faster.
Reply all
Reply to author
Forward
0 new messages