MongoDB full text search?


Sheldon

Apr 4, 2012, 9:54:15 AM
to mongod...@googlegroups.com
MongoDB supports what is defined as 'full text search': http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo

It appears, though, that the text must be broken into keywords for MongoDB to search it. For example:

MongoDB could not find "Mongo" in
    The Mongo multikey feature can automatically index arrays of values
but it could find it in
    tags: [ "values", "automatically", "Mongo" ]

I would hate to break down all of my text for true full text search. Is this the only way that MongoDB can find text, or is there another option?

Dan Crosta

Apr 4, 2012, 10:18:52 AM
to mongodb-user
Currently MongoDB does not support full-text search on string fields,
but, as you discovered, you can simulate a full-text search using an
indexed array field in your documents. To make this work best, you'll
probably want to normalize your keywords to a standard case (lower or
upper, it doesn't matter which), remove stop words (like "the", "and", etc.),
and possibly use a stemming algorithm to normalize variants of words
such as "stopped" vs. "stops" vs. "stop".
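
The normalization steps described above can be sketched in plain JavaScript. This is a minimal sketch, not MongoDB functionality: extractKeywords, crudeStem, STOP_WORDS, and the _keywords field name are all hypothetical, and the suffix-stripping "stemmer" is a toy stand-in for a real stemming algorithm such as Porter's.

```javascript
// Tiny illustrative stop-word list (an assumption; real lists are much longer).
const STOP_WORDS = new Set(["the", "and", "a", "an", "of", "in", "to", "can", "it"]);

// Toy stemmer: strips a few English suffixes, then collapses a doubled
// final consonant ("stopped" -> "stopp" -> "stop"). Not a real stemmer.
function crudeStem(word) {
  if (word.length <= 3) return word;
  const stem = word.replace(/(ing|ed|s)$/, "");
  return stem.replace(/([b-df-hj-np-tv-z])\1$/, "$1");
}

// Lowercase the text, split it into words, drop stop words, stem the rest,
// and remove duplicates.
function extractKeywords(text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const stems = words.filter((w) => !STOP_WORDS.has(w)).map(crudeStem);
  return [...new Set(stems)];
}

const doc = { sentence: "The Mongo multikey feature can automatically index arrays of values" };
doc._keywords = extractKeywords(doc.sentence);
console.log(doc._keywords);
// [ 'mongo', 'multikey', 'feature', 'automatically', 'index', 'array', 'value' ]
```

The resulting array would be stored next to the original text and indexed as an array field. Search terms need to pass through the same function before querying, otherwise "Mongo" would not match the stored "mongo".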

Supporting full-text search is a highly requested feature. See (and
vote on) https://jira.mongodb.org/browse/SERVER-380 if you'd like to
track progress towards implementing it natively in MongoDB.

- Dan

Sam Millman

Apr 4, 2012, 10:18:59 AM
to mongod...@googlegroups.com
Regexing will solve this problem.

I have found that both approaches run at roughly the same speed.

Remember that Mongo is not a search database, so it can only do full text search up to a point. Keep that in mind if you are looking for a Google replacement.

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/mongodb-user/-/UXVhxEvoXEwJ.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

Glenn Maynard

Apr 4, 2012, 10:25:30 AM
to mongod...@googlegroups.com
On Wed, Apr 4, 2012 at 8:54 AM, Sheldon <spor...@gmail.com> wrote:
> MongoDB supports what is defined as 'full text search': http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo

It doesn't support FTS itself; it supports features which allow implementing FTS on top of it.

> It appears, though, that the text must be broken into keywords for MongoDB to search it. For example:
>
> MongoDB could not find "Mongo" in
>     The Mongo multikey feature can automatically index arrays of values
> but it could find it in
>     tags: [ "values", "automatically", "Mongo" ]
>
> I would hate to break down all of my text for true full text search. Is this the only way that MongoDB can find text, or is there another option?

It could be stored more efficiently than Mongo does (by not storing duplicate copies of each word), which is a feature I think Mongo does need (string-pooled collections), but aside from that, any FTS engine is going to do this, storing either words or stems in an indexable container.
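
As a rough illustration of "words in an indexable container", here is a minimal in-memory inverted index in JavaScript. It is a sketch under assumptions: the InvertedIndex class and its methods are made up for this example and are not part of MongoDB. Each distinct word is stored once and maps to the ids of the documents that contain it.

```javascript
// Minimal inverted index: word -> set of document ids.
// Each distinct word appears once, as a Map key.
class InvertedIndex {
  constructor() {
    this.postings = new Map();
  }

  // Split the text into lowercase words and record docId under each one.
  add(docId, text) {
    for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
      if (!this.postings.has(word)) this.postings.set(word, new Set());
      this.postings.get(word).add(docId);
    }
  }

  // Whole-word lookup: a single Map access, no scan over the documents.
  lookup(word) {
    return [...(this.postings.get(word.toLowerCase()) || [])];
  }
}

const textIndex = new InvertedIndex();
textIndex.add(1, "The Mongo multikey feature can automatically index arrays of values");
textIndex.add(2, "Mongo stores documents");

console.log(textIndex.lookup("Mongo"));  // [ 1, 2 ]
console.log(textIndex.lookup("arrays")); // [ 1 ]
console.log(textIndex.lookup("gm"));     // [] (whole words only, no substrings)
```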

On Wed, Apr 4, 2012 at 9:18 AM, Sam Millman <sam.m...@gmail.com> wrote:
> Regexing will solve this problem.
>
> I have found that both are relatively the same speed.

Regex searches will definitely not be as fast as a lookup on a multikey-indexed array containing individual words.  Regex searches cannot (in general) use index lookups; a substring search with a regex will do a full scan over the whole collection (or a full index scan, which isn't the same as an index lookup).  AFAIK, only regexes anchored to a prefix can do any better than that, and substring matches have no such anchor.
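
To illustrate the difference between an index lookup and a scan, the sketch below uses a sorted JavaScript array as a stand-in for an index's ordered keys. This is an assumption for illustration only (a real MongoDB index is a B-tree, and the function names here are hypothetical), but the ordering argument is the same: a prefix identifies one contiguous range of keys, while a substring can occur anywhere.

```javascript
// Sorted array standing in for the ordered keys of an index.
const sortedKeys = ["automatically", "gmail", "hello", "help", "mongo", "values", "world"];

// Binary search for the first position whose key is >= target.
function lowerBound(keys, target) {
  let lo = 0;
  let hi = keys.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (keys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Prefix search: jump to the start of the range, then read keys until the
// prefix stops matching. Only the matching range plus one key is visited.
function prefixSearch(keys, prefix) {
  const hits = [];
  let visited = 0;
  for (let i = lowerBound(keys, prefix); i < keys.length; i++) {
    visited++;
    if (!keys[i].startsWith(prefix)) break;
    hits.push(keys[i]);
  }
  return { hits, visited };
}

// Substring search: ordering does not help, so every key is tested.
function substringSearch(keys, needle) {
  let visited = 0;
  const hits = keys.filter((k) => {
    visited++;
    return k.includes(needle);
  });
  return { hits, visited };
}

console.log(prefixSearch(sortedKeys, "hel"));     // { hits: [ 'hello', 'help' ], visited: 3 }
console.log(substringSearch(sortedKeys, "mail")); // { hits: [ 'gmail' ], visited: 7 }
```

With seven keys the gap is trivial, but the substring search always visits every key, while the prefix search visits only the matching range.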

-- 
Glenn Maynard

Sam Millman

Apr 4, 2012, 10:33:57 AM
to mongod...@googlegroups.com
Well, in terms of user-facing performance my search performs well with regexes, and it actually outperforms the keyword array in terms of usability, since matching a substring directly against an array element is not as effective.

Imagine you have a keyword entry of "gmail" and the user searches for "gm". An array of keywords would still require a regex to match that. Alternatively, you could break the keywords up finely enough to match something, but then the keywords array becomes useless: the fragments carry no context, so you could get thousands of documents that are not even related to what you're searching for.

If you were to do a full table search (not restricted by user_id or some other field that constrains the regex), then you should use a dedicated search technology, because at that point you are trying to perform a site search, something which in my personal opinion MongoDB should not be used for.

Glenn Maynard

Apr 4, 2012, 11:20:31 AM
to mongod...@googlegroups.com
On Wed, Apr 4, 2012 at 9:33 AM, Sam Millman <sam.m...@gmail.com> wrote:
> Well in terms of user performance my search performs well on regexes and actually outperforms in terms of usability since substring directly on an array element is not as effective.
>
> Imagine you have a keyword entry of "gmail" and the user searches "gm". An array of keywords would still require that you regex to get that, or you just break up the keywords enough to find something in which case the keywords array becomes useless because the words cannot be put into a context which means you could get thousands of documents that are not even related to what you're searching for.

Substring searches are a different feature from FTS, and are harder to do efficiently.  We're only talking about FTS here, which typically supports full-word matching and stem matching, both of which can be precalculated into an index.

No substring solution based on regexes is going to scale (short of complete substring indexing, which will get expensive), but that's a separate problem from FTS.

(Prefix matching is easy, since prefix matches are easy to index, but that's a special case.)

--
Glenn Maynard


Sam Millman

Apr 4, 2012, 11:47:23 AM
to mongod...@googlegroups.com
Doesn't the common FTS standard support substring search through the wildcard character (or similar) *? So that:

*gmail*
gmail*
*gmail

Are actually valid methods by which to perform FTS.

Glenn Maynard

Apr 4, 2012, 1:20:02 PM
to mongod...@googlegroups.com
On Wed, Apr 4, 2012 at 10:47 AM, Sam Millman <sam.m...@gmail.com> wrote:
> Doesn't the common FTS standard support substring search through the wildcard character (or similar) *? So that:
>
> *gmail*
> gmail*
> *gmail
>
> Are actually valid methods by which to perform FTS.

These aren't always supported, since they're harder to do efficiently and not all that useful to most users.  There may be domain-specific optimizations for this, but the only generic approach I've seen is to split each word into its suffixes, e.g. ["gmail", "mail", "ail", "il", "l"], which can then be indexed.  That's expensive in terms of storage, of course.
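
The suffix-splitting approach can be sketched in plain JavaScript. This is a minimal sketch under assumptions: suffixes, withSuffixes, and matchesInfix are hypothetical helper names, and the matching is done in memory rather than through a MongoDB index. The point is that an infix query becomes a prefix test against the stored suffixes.

```javascript
// All suffixes of a word, longest first.
function suffixes(word) {
  const out = [];
  for (let i = 0; i < word.length; i++) out.push(word.slice(i));
  return out;
}

// Expand a keyword list into the de-duplicated list of all suffixes,
// which is what would be stored and indexed.
function withSuffixes(keywords) {
  return [...new Set(keywords.flatMap(suffixes))];
}

// An infix query such as "mai" is a prefix of some stored suffix ("mail").
function matchesInfix(storedSuffixes, needle) {
  return storedSuffixes.some((s) => s.startsWith(needle));
}

const stored = withSuffixes(["gmail"]);
console.log(stored);                      // [ 'gmail', 'mail', 'ail', 'il', 'l' ]
console.log(matchesInfix(stored, "mai")); // true
console.log(matchesInfix(stored, "gx"));  // false
```

The storage cost is visible here: a word of n characters produces n entries totalling n(n+1)/2 characters, so the five letters of "gmail" expand to 15.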

But you definitely wouldn't use regex for infix searches, because that would require a complete index scan.  Regexes only help you with prefix searches ("gmail*"), since those can use an index effectively, e.g.:

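// Index the keyword array (a multikey index). Newer shells call this createIndex.
db.test.ensureIndex({_keywords: 1})
db.test.insert({sentence: 'hello world', _keywords: ['hello', 'world']})
// The ^ anchor makes this a prefix match, so it can use the _keywords index.
db.test.find({_keywords: /^he.*/})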

-- 
Glenn Maynard


Sam Millman

Apr 4, 2012, 1:38:01 PM
to mongod...@googlegroups.com
Well, I chose to do it through regexes and it's served me well. I only regex maybe 100k documents, so that's why I didn't really care about splitting my keywords. I use an actual search tech if I want to do substantial searching.
