$text search with/without diacritics

Martin Klobusnik

Sep 12, 2014, 10:17:11 AM
to mongod...@googlegroups.com
Hi,

I have a field containing strings with localized characters like ľščťž and need a full-text search that finds the matching documents even when the search phrase is typed without diacritics. For example, I have a document:

{
   "city":"Košice"
}

and want to match this document even if I search for "Kosice"...

Is there a way to do this without storing a second, diacritic-free copy of the string in another field and searching that field as well?

Thanks
Martin

Will Berkeley

Sep 15, 2014, 11:13:35 AM
to mongod...@googlegroups.com
If you're just matching the names of cities exactly, then don't use text search. Store two versions of the string and match exactly on one or the other. If you need text search capabilities like stemming and stopword removal, I think storing the string without accents/diacritics is still the best idea if you want to match the "accentless version" of the word. Text indexes currently (2.6) don't transliterate or attempt to remove accents/diacritics.

In general it's dangerous to drop accents, because in most languages removing the accents is at least a spelling error and at worst gives a different word with a different meaning. For example, in French, the word "congres" means "eels" while the word "congrès" means "conference".

You should find an algorithm that implements transliteration between the source language and the target language (English for "accentless"), since some accents and foreign letters are transliterated differently depending on the source and target languages: an umlaut becomes a trailing "e" when coming from German (ü -> ue), but is simply dropped when coming from Swedish.

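
To illustrate why the source language matters, here is a minimal sketch in plain JavaScript (not something MongoDB does for you). The function name `transliterate`, the language codes, and the tiny override tables are assumptions made up for this example and are far from a complete transliteration scheme. Anything not covered by a table falls back to dropping the combining marks after Unicode NFD normalization.

```javascript
// Hypothetical per-language overrides (lowercase only); real tables would be much larger.
const OVERRIDES = {
  de: new Map([
    ["\u00e4", "ae"], // ä
    ["\u00f6", "oe"], // ö
    ["\u00fc", "ue"], // ü
    ["\u00df", "ss"], // ß
  ]),
  sv: new Map(), // Swedish in this sketch: no overrides, accents are simply dropped
};

// Remove the combining marks (U+0300-U+036F) left after canonical decomposition.
function dropMarks(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Apply the language-specific overrides first, then drop whatever accents remain.
function transliterate(text, lang) {
  const table = OVERRIDES[lang] || new Map();
  const mapped = Array.from(text.normalize("NFC"))
    .map((ch) => (table.has(ch) ? table.get(ch) : ch))
    .join("");
  return dropMarks(mapped);
}

console.log(transliterate("Müller", "de")); // Mueller
console.log(transliterate("Malmö", "sv")); // Malmo
```

The same letter (ö) ends up as "oe" or "o" depending on the language you say the text came from, which is why a single language-blind rule can give surprising results.
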
If you do transform the words and store them like

{ "city" : "Košice", "city_nd" : "Kosice" }

then you can search both by creating a text index on both fields

> db.test.ensureIndex({ "city" : "text", "city_nd" : "text" })

> db.test.find({"$text": {$search : "Košice"}})

> db.test.find({"$text": {$search : "Kosice"}})

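One way the second field could be produced in application code before inserting the document is sketched below. This assumes a JavaScript runtime with `String.prototype.normalize`; the helper names `stripDiacritics` and `withAccentlessField` are made up for this example. Note that it is a language-blind approach, and letters with no Unicode decomposition (for example "ł" or "ø") pass through unchanged.

```javascript
// Strip diacritics: decompose to NFD, then drop combining marks (U+0300-U+036F).
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Return a copy of the document with an added "<field>_nd" companion field.
function withAccentlessField(doc, field) {
  return Object.assign({}, doc, { [field + "_nd"]: stripDiacritics(doc[field]) });
}

const doc = withAccentlessField({ city: "Košice" }, "city");
console.log(doc.city_nd); // Kosice
```

The resulting object has both "city" and "city_nd", so it can be inserted as-is and searched with the text index above.
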
-Will

Russell Bateman

Sep 15, 2014, 11:29:23 AM
to mongodb-user
Here's an old machine translation technique you might consider. I would recommend storing the "correct" form, as Will says. For the second form, the one to match on, store it neutralized: remove the diacritics and encode them at the end of the word. That way, you can do an exact match followed by a "beginning match" (a prefix match), applying the same treatment to the search string, then present the caller with a list of possible matches.

Will's example would be to keep:

    congres
    congrès : congres6è

as separate entries (because they're separate words), then to present the two in case of ambiguity. If someone searches for "congrès" then you return

    congrès

because you translated the search string into "congres6è" and just matched it--no ambiguity.

If they search for "congres" you return

    1) congres
    2) congrès
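
A rough sketch of that encoding in JavaScript follows. It assumes the "6" in "congres6è" is the 1-based position of the accented letter, which is my reading of the example rather than something stated above. The names `encodeWord` and `lookup` are made up, and the rule that a prefix match only counts when the remainder starts with a digit is an extra assumption added so that longer words such as "congressiste" are not returned.

```javascript
// Accentless form of a single character (NFD, then drop combining marks).
function plainChar(ch) {
  return ch.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Neutralize a word and append "<position><original letter>" for each accented letter.
function encodeWord(word) {
  let base = "";
  let marks = "";
  Array.from(word.normalize("NFC")).forEach((ch, i) => {
    const plain = plainChar(ch);
    base += plain;
    if (plain !== ch) marks += String(i + 1) + ch;
  });
  return base + marks;
}

// Exact matches on the encoded key first, then "beginning" matches whose
// remainder is an encoded accent suffix (assumed to start with a digit).
function lookup(query, words) {
  const key = encodeWord(query);
  const exact = [];
  const beginning = [];
  for (const word of words) {
    const candidate = encodeWord(word);
    if (candidate === key) {
      exact.push(word);
    } else if (candidate.startsWith(key) && /^\d/.test(candidate.slice(key.length))) {
      beginning.push(word);
    }
  }
  return exact.concat(beginning);
}

const words = ["congres", "congrès", "congressiste"];
console.log(lookup("congrès", words).join(", ")); // congrès
console.log(lookup("congres", words).join(", ")); // congres, congrès
```

In a database you would store the encoded key next to the original word and run the exact and prefix matches as queries; the in-memory loop here only shows the matching logic.
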

Maybe this will give you some ideas that you can adapt to your own use case.

(Hope this helps.)
