Is this issue a bug in the Mongo stemming system, or is it a bad definition of the index?
Hi Ruben,
Based on the output of the collection getIndexes(), the text index default_language is english. The default language associated with the indexed data determines the rules to parse word roots (i.e. stemming) and ignore stop words. To specify a different language, use the default_language option when creating the text index. See Text Search Languages for the languages available for default_language.
See also Specify a Language for Text Index.
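As a rough sketch of what setting default_language looks like, the snippet below builds a createIndexes command document for a text index. The collection name (articles), field name (content), and language (spanish) are assumptions made up for illustration, not taken from the original question, and the helper function is hypothetical.

```python
def build_text_index_command(collection, field, language):
    """Return a createIndexes command document for a single-field text index."""
    return {
        # The command name must be the first key in the document.
        "createIndexes": collection,
        "indexes": [
            {
                "key": {field: "text"},          # "text" marks a text index
                "name": f"{field}_text",         # explicit index name
                "default_language": language,    # stemming and stop-word rules
            }
        ],
    }


# Hypothetical names: adjust to your own collection, field, and language.
command = build_text_index_command("articles", "content", "spanish")
print(command)
```

With a driver such as PyMongo, a document like this could be sent with db.command(command). An existing text index would need to be dropped first, since a collection can have only one text index.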
I can reproduce this issue with MongoDB 3.2 (textIndexVersion 3) and MongoDB 2.6 (textIndexVersion 2).
Additionally, I’d recommend upgrading your MongoDB version. MongoDB v3.2 is close to reaching its end-of-life support (September 2018). See also MongoDB Download Centre.
Regards,
Wan.
This bug also happens in French. When I was about to file a bug, I found: https://jira.mongodb.org/browse/SERVER-15027
This is actually a limitation of algorithmic stemming. Stemming algorithms use generic heuristics to reduce words to an expected root form, but don't actually have the context of language or grammar. Accuracy will vary depending on the language, verb conjugation, and the stemming algorithm used.
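To make the limitation concrete, here is a toy suffix-stripping stemmer. It is not the Snowball algorithm and not what MongoDB runs. The rule list and the minimum stem length are assumptions chosen only to show how context-free heuristics can both miss related words and merge unrelated ones.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT Snowball.
SUFFIXES = ("ing", "ies", "es", "ed", "s")  # hypothetical rules, checked in order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Regular inflections collapse to one root, as intended.
print(toy_stem("walking"), toy_stem("walked"), toy_stem("walks"))

# An irregular form is left alone, so it never matches "go".
print(toy_stem("went"))

# An unrelated word is over-stemmed and now collides with "new".
print(toy_stem("news"))
```

Real stemmers use far more rules per language, but the same two failure modes (under-stemming and over-stemming) still apply.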
MongoDB (as at 4.0) uses the Snowball stemming library. You can test expected outcomes using the Snowball online demo.
There are other approaches for more accurate inflection which are generally referred to as lemmatization. Lemmatization algorithms are more complex and start heading into the domain of natural language processing. There are many open source (and commercial) toolkits that you may be able to leverage if you want to implement more advanced text search in your application, but these are outside the current scope of the MongoDB text search feature.
For more background, see: Stemming and lemmatization:
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
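For contrast, a minimal sketch of the dictionary side of lemmatization follows. The tiny vocabulary is a made-up assumption for illustration. Real lemmatizers rely on large lexicons and morphological analysis, and nothing like this is built into MongoDB text search.

```python
# Toy lemmatizer: a vocabulary lookup, with a hypothetical five-word lexicon.
LEMMAS = {
    "went": "go",
    "gone": "go",
    "goes": "go",
    "better": "good",
    "mice": "mouse",
}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


# Irregular forms map to their lemma because the vocabulary knows them.
print(toy_lemmatize("went"), toy_lemmatize("Mice"))

# Unknown words pass through unchanged rather than being chopped.
print(toy_lemmatize("table"))
```

In an application, lemmas produced by an external toolkit could be stored in a separate field and indexed, which keeps the extra processing outside the database.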
Regards, Stennie