Issue with Spanish Text Index stemming

Ruben Gonzalez

Mar 15, 2018, 5:17:19 PM
to mongodb-user
Hi group

I have an issue with stemming in a Spanish text index. (Stemming is the language-specific process of removing some letters from the end of a word.)

> db.Series.find({ "$text": { "$search": 'filologia clasica', "$language": "es" } }, {indexlanguage: 1}).explain("executionStats")
...
"terms" : [
    "filologi",
    "clasic" 
],
...

> db.Series.find({ "$text": { "$search": 'filologia clásica', "$language": "es" } }, {indexlanguage: 1}).explain("executionStats")
...
"terms" : [
    "filologi",
    "clasic" 
],
...

> db.Series.find({ "$text": { "$search": 'filología clásica', "$language": "es" } }, {indexlanguage: 1}).explain("executionStats")
...
"terms" : [
    "filolog",
    "clasic" 
],
...


The result of stemming filologia is filologi and the result of stemming filología (with accent mark) is filolog.

The ending -ía is very common in Spanish, so this issue is critical when people search the text index without accent marks (which is also very common).

Is this issue a bug in the Mongo stemming system, or is it a bad definition of the index? I reproduced this issue with MongoDB 3.2 (textIndexVersion 3) and MongoDB 2.6 (textIndexVersion 2).

> db.User.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "name" : "$**_text",
        "ns" : "test.User",
        "weights" : {
            "$**" : 1
        },
        "default_language" : "english",
        "language_override" : "language",
        "textIndexVersion" : 3
    }
]

Thank you.

Ruben Gonzalez

Jul 4, 2018, 10:38:53 AM
to mongodb-user
Hi group

The workaround of specifying a language value of "none" to avoid stemming is not a perfect solution. Both stemming and stop words depend on the language value, and the stop-words feature is very useful.


> If you specify a language value of "none", then the text index uses simple tokenization with no list of stop words and no stemming.

IMHO, custom stop words would be a great feature for this case.
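
Another possible workaround, not raised in this thread and not tested against MongoDB here, is to strip accent marks in the application before both indexing and searching, so the stemmer always sees the unaccented form and the language's stop words stay in use. The sketch below is plain JavaScript; the helper name `stripAccents` and the extra field `searchText` are hypothetical, and accented stop words may no longer match once their accents are removed.

```javascript
// Hypothetical helper: remove combining accent marks from a string.
// Caveat: this also turns "ñ" into "n", which may not be wanted for Spanish.
function stripAccents(text) {
  return text
    .normalize("NFD")                 // split "í" into "i" + combining accent
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .normalize("NFC");                // recompose whatever is left
}

// Store an unaccented copy next to the original text and index that copy.
// "searchText" is an assumed field name, not something from the thread.
const doc = { title: "Filología clásica" };
doc.searchText = stripAccents(doc.title);

// Normalize the user's search string the same way before querying.
const query = {
  $text: { $search: stripAccents("filología clásica"), $language: "es" }
};
```

With this approach both "filologia" and "filología" reach the stemmer as the same string, so the index and the query would produce the same term.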

Wan Bachtiar

Jul 6, 2018, 2:24:35 AM
to mongodb-user

> Is this issue a bug in the Mongo stemming system, or is it a bad definition of the index?

Hi Ruben,

Based on the getIndexes() output for the collection, the text index's default_language is english. The default language associated with the indexed data determines the rules used to parse word roots (i.e. stemming) and to ignore stop words. To specify a different language, use the default_language option when creating the text index. See Text Search Languages for the languages available for default_language.

See also Specify a Language for Text Index.
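
As a minimal sketch of that option, assuming the `Series` collection from the queries above and a wildcard text index like the one shown in the getIndexes() output, the index could be defined with Spanish rules as follows. The `typeof db` guard is only there so the snippet can also be loaded outside the mongo shell. A collection can have only one text index, so an existing one would have to be dropped first.

```javascript
// Wildcard text index over all string fields, as in the getIndexes() output.
const keys = { "$**": "text" };

// Use Spanish stemming and stop words by default instead of English.
const options = { default_language: "spanish" };

// `db` only exists inside the mongo shell; skip the call anywhere else.
if (typeof db !== "undefined") {
  db.Series.createIndex(keys, options);
}
```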

> I reproduced this issue with MongoDB 3.2 (textIndexVersion 3) and MongoDB 2.6 (textIndexVersion 2).

Additionally, I'd recommend upgrading your MongoDB version. MongoDB v3.2 is close to reaching its end-of-life support (September 2018). See also MongoDB Download Centre.

Regards,
Wan.

Ruben Gonzalez

Aug 8, 2018, 5:27:12 AM
to mongodb-user
Hi

This bug also happens in French. When I was going to file a bug, I found: https://jira.mongodb.org/browse/SERVER-15027

I updated this issue with my data.



Stephen Steneker

Aug 26, 2018, 10:10:25 PM
to mongodb-user
On Wednesday, 8 August 2018 19:27:12 UTC+10, Ruben Gonzalez wrote:
> This bug also happens in French. When I was going to file a bug, I found: https://jira.mongodb.org/browse/SERVER-15027

Hi Ruben,

This is actually a limitation of algorithmic stemming. Stemming algorithms use generic heuristics to reduce words to an expected root form, but don't actually have the context of language or grammar. Accuracy will vary depending on the language, verb conjugation, and the stemming algorithm used.
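
To illustrate how context-free suffix rules can give different stems for accented and unaccented spellings, here is a toy stemmer in plain JavaScript. Its rules are made up for this example and are not the Snowball Spanish rules; they only mimic the output reported earlier in this thread.

```javascript
// Hypothetical suffix list, checked in order; the first match wins.
// NOT the real Snowball rules, just enough to mimic the thread's output.
const SUFFIXES = ["ía", "a", "o"];

function toyStem(word) {
  const w = word.normalize("NFC"); // make sure "í" is a single code point
  for (const suffix of SUFFIXES) {
    // Strip the suffix only if a reasonably long root remains.
    if (w.endsWith(suffix) && w.length > suffix.length + 2) {
      return w.slice(0, -suffix.length);
    }
  }
  return w;
}

console.log(toyStem("filología")); // matches "ía", leaving "filolog"
console.log(toyStem("filologia")); // matches only "a", leaving "filologi"
```

The rule for "ía" never fires on the unaccented spelling, because the stemmer compares characters and has no idea that both spellings are the same word.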

MongoDB (as at 4.0) uses the Snowball stemming library. You can test expected outcomes using the Snowball online demo.

There are other approaches for more accurate inflection which are generally referred to as lemmatization. Lemmatization algorithms are more complex and start heading into the domain of natural language processing. There are many open source (and commercial) toolkits that you may be able to leverage if you want to implement more advanced text search in your application, but these are outside the current scope of the MongoDB text search feature.

For more background, see: Stemming and lemmatization:

> Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Regards, Stennie 
