Stemming does not work properly for MongoDB text index


Michael Smolyak

Mar 31, 2014, 1:04:54 PM
to mongod...@googlegroups.com

I am trying to use the full text search feature of MongoDB and am observing some unexpected behavior. The problem is related to the "stemming" aspect of text indexing. As full text search is described in many articles online, if you have the string "big hunting dogs" in a document field that is part of the text index, you should be able to search on "hunt" or "hunting" as well as on "dog" or "dogs". MongoDB should normalize, or stem, the text both when indexing and when searching. So in my example I would expect it to store the words "dog" and "hunt" in the index and to search for the stemmed version of each query word: if I search for "hunting", MongoDB should search for "hunt".

Well, this is not how it works for me. I am running MongoDB 2.4.8 on Linux with full text search enabled. If my record has the value "big hunting dogs", only a search for "big" returns the document, while searches for "hunt" or "dog" return nothing. It is as if the words that are not in their "normalized" form are not stored in the text index (or are stored in a way that prevents them from being found). Searches using the $regex operator work fine; that is, I am able to find the document by matching a pattern like /hunting/ against the field in question.

I tried dropping and recreating the full text index, but nothing changed. I can only find documents containing words in their "normal" form. Searching for words like "dogs" or "hunting" (or even "dog" or "hunt") produces no results.

Do I misunderstand or misuse the full text search operations or is there a bug in MongoDB?

Michael

horacio...@gmail.com

Mar 31, 2014, 3:38:18 PM
to mongod...@googlegroups.com
I don't believe this is a bug. Read this:

NOTE
If you specify a language value of "none", then the text search has no list of stop words, and the text search does not stem or tokenize the search terms.
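
To make the effect of the language setting concrete, here is a toy sketch of why the same query can match under one language and miss under another. It is only an illustration: the suffix rules and the names `SUFFIX_RULES`, `toyStem`, `toyTerms`, and `toyMatches` are hypothetical, and the real stemmers MongoDB uses are far more elaborate.

```javascript
// Toy illustration only: these suffix rules are hypothetical and much
// simpler than the stemmers MongoDB actually ships with.
const SUFFIX_RULES = {
  english: ["ing", "s"], // strip a trailing "ing" or "s"
  none: []               // no stemming at all
};

// Reduce a word to a crude "stem" by stripping the first matching suffix.
function toyStem(word, language) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIX_RULES[language]) {
    // Keep at least three characters so short words are left alone.
    if (lower.length > suffix.length + 2 && lower.endsWith(suffix)) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
}

// Split text on whitespace and stem each term (what an index would store).
function toyTerms(text, language) {
  return text.split(/\s+/).filter(Boolean).map(w => toyStem(w, language));
}

// A document matches when its stored terms contain the stemmed query term.
function toyMatches(text, query, language) {
  return toyTerms(text, language).includes(toyStem(query, language));
}

console.log(toyTerms("big hunting dogs", "english"));            // stemmed terms
console.log(toyMatches("big hunting dogs", "hunt", "english"));  // true
console.log(toyMatches("big hunting dogs", "hunt", "none"));     // false
```

With the toy "english" rules the stored terms are "big", "hunt", and "dog", so "hunt" matches. With "none" the stored terms stay "big", "hunting", and "dogs", so only the exact word matches.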

I tested with two index languages. Note that with Portuguese as the default language there is no match:

> db.stuffs.getIndexes()

[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.stuffs",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "test.stuffs",
                "name" : "what_text",
                "default_language" : "portuguese",
                "weights" : {
                        "what" : 1
                },
                "language_override" : "language",
                "textIndexVersion" : 1
        }
]

> db.stuffs.runCommand("text", {search: "hunt"}) // stemmed hunting -> hunt? No.
{
        "queryDebugString" : "hunt||||||",
        "language" : "portuguese",
        "results" : [ ],
        "stats" : {
                "nscanned" : 0,
                "nscannedObjects" : 0,
                "n" : 0,
                "nfound" : 0,
                "timeMicros" : 162
        },
        "ok" : 1
}


> db.stuffs.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.stuffs",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "test.stuffs",
                "name" : "what_text",
                "weights" : {
                        "what" : 1
                },
                "default_language" : "english",
                "language_override" : "language",
                "textIndexVersion" : 1
        }
]


> db.stuffs.runCommand("text", {search: "hunt"})  // stemmed hunting -> hunt? yes.
{
        "queryDebugString" : "hunt||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("5339c07c29ec96fbf8631f65"),
                                "what" : "big hunting dogs" // <-------------------------
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "nfound" : 1,
                "timeMicros" : 104
        },
        "ok" : 1
}
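
If your index turns out to have a non-English default language, one way to switch it might look like the sketch below. It assumes the collection `stuffs`, the field `what`, and the index name `what_text` from the output above, and a 2.4 server with text search enabled; the variable names are just for illustration. Newer versions use `createIndex` and the `$text` query operator instead.

```javascript
// Keys and options for a text index on the "what" field with English stemming.
var indexKeys = { what: "text" };
var indexOptions = { name: "what_text", default_language: "english" };

// Arguments for the 2.4 "text" command. The optional "language" field sets
// the language used to stem the search string for this query.
var textArgs = { search: "hunt", language: "english" };

// Only talk to the server when run inside the mongo shell, where `db` exists.
if (typeof db !== "undefined") {
  db.stuffs.dropIndex("what_text");               // drop the existing text index
  db.stuffs.ensureIndex(indexKeys, indexOptions); // rebuild it with English stemming
  printjson(db.stuffs.runCommand("text", textArgs));
}
```

Note that the index has to be rebuilt for a new default language to apply to documents that are already indexed, since the stemmed terms are computed at index time.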


 I hope this helps.

Regards,

Horacio Ibrahim






