Adding a free implementation of Arabic language support to text search

Kefah Issa

Mar 1, 2018, 7:53:37 PM
to mongodb-dev
Hello,

Currently, text search support for Arabic is only available in MongoDB Enterprise, and it depends on a third-party proprietary component that requires a separate license: the Basis Technology Rosette Linguistics Platform (RLP), which performs normalization, word breaking, sentence breaking, and stemming or tokenization, depending on the language.

I would like to champion a free / open source implementation of Arabic text search for MongoDB. The support would include normalization, stemming, word breaking, etc.

To that end, I would like some basic guidance / hints on how this can be done for MongoDB:

1. What are the possible implementation languages: C++, JavaScript?
2. What is the required interface / API / ABI?
3. Is there an existing language codebase that I can use as a skeleton, e.g. English?
4. How can I set up MongoDB to use a custom language support extension, so that I can test it in practice before submitting?

That implementation can easily be extended further, by others, to support other languages such as Farsi (Persian) and Urdu.
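
To make the normalization step concrete, here is a minimal C++ sketch of the kind of character folding I have in mind. The function name normalizeArabic and the specific rules (dropping harakat and tatweel, folding alef variants to bare alef, alef maksura to yeh, and teh marbuta to heh) are my own assumptions, chosen because they are common conventions in Arabic information retrieval. Nothing here is an existing MongoDB interface, and it works on already-decoded UTF-32 text to keep the example short.

```cpp
#include <string>

// Hypothetical helper (not a MongoDB API): folds Arabic text into a
// canonical form so that spelling variants index to the same term.
std::u32string normalizeArabic(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        // Drop harakat (short-vowel diacritics), U+064B through U+0652.
        if (c >= 0x064B && c <= 0x0652) continue;
        // Drop tatweel (the stretching character), U+0640.
        if (c == 0x0640) continue;
        switch (c) {
            case 0x0622:  // alef with madda above
            case 0x0623:  // alef with hamza above
            case 0x0625:  // alef with hamza below
                out.push_back(0x0627);  // bare alef
                break;
            case 0x0649:  // alef maksura
                out.push_back(0x064A);  // yeh
                break;
            case 0x0629:  // teh marbuta
                out.push_back(0x0647);  // heh
                break;
            default:
                out.push_back(c);
        }
    }
    return out;
}
```

Whether every one of these foldings is desirable (teh marbuta to heh in particular) is a design choice that would need discussion; Farsi and Urdu would likely need their own rule sets on top of a shared core.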

Thank you in advance for your help and guidance.

Regards,
- Kefah.

Mark Benvenuto

Mar 2, 2018, 5:32:20 PM
to mongo...@googlegroups.com
To answer your questions:
1. It would need to be C/C++.
2. The basic MongoDB interface is pretty simple because we rely on third-party libraries to do the tokenization and stemming.

The basic tokenizer interface is here:
Language registration is here:

3. The main library we use for English and other languages is Snowball, which is based on Porter's work. MongoDB does not actually have any stemming code itself, just code to integrate Snowball and do scoring. See http://snowballstem.org/. I do not know how well Arabic fits this stemming model.
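
As a rough illustration of that division of labor, here is a minimal, self-contained C++ sketch. It is not MongoDB's actual code: StemFn, toyStem, and tokenizeAndStem are hypothetical names, and toyStem is a deliberately naive suffix stripper standing in for a real third-party stemmer such as Snowball. The point is only that the integration layer splits text into tokens and hands each one to a pluggable stemmer.

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a third-party stemmer entry point.
using StemFn = std::function<std::string(const std::string&)>;

// Toy suffix stripper for illustration only; this is NOT the Snowball
// English algorithm. It removes one of a few suffixes when at least
// three characters of stem would remain.
std::string toyStem(const std::string& word) {
    static const char* const suffixes[] = {"ing", "ed", "s"};
    for (const char* s : suffixes) {
        const std::string suffix(s);
        if (word.size() > suffix.size() + 2 &&
            word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0) {
            return word.substr(0, word.size() - suffix.size());
        }
    }
    return word;
}

// Integration layer: split on whitespace, then delegate stemming to
// whatever stemmer was plugged in.
std::vector<std::string> tokenizeAndStem(const std::string& text, const StemFn& stem) {
    std::vector<std::string> terms;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        terms.push_back(stem(token));
    }
    return terms;
}
```

With this toy stemmer, tokenizeAndStem("indexing indexed documents", toyStem) yields "index", "index", "document". An Arabic implementation would plug a different stemmer into the same slot, assuming the language fits a suffix-stripping model at all.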


Mark
