I am Mostafa, and I am honored to introduce my latest project, "Tihu". It is the name I have given to this Persian NLP toolkit. It is an open-source project and you can find it here:
https://github.com/tihu-nlp

I have been working on this project for more than a year, and I have published it under the GNU GPLv3 license. At the moment Tihu has several parts for processing a text, or as I call it, a corpus. I will give a brief explanation of these parts here:
1- Tokenizer. This is the first step in Tihu. Tihu has a main object named Corpus. Each corpus is made of "words", and each word is made of one or more "entries". For tokenizing Persian text we cannot rely on the spaces in the text, so Tihu sweeps each word letter by letter. This also makes it possible to normalize the Persian text before looking words up in the main lexicon. For this purpose Tihu has a pre-defined Unicode table of standard characters with their normalized values. You can find it here:
https://github.com/tihu-nlp/tihu/blob/master/src/build/data/tokens.txt

2- Part-of-speech tagging. For POS tagging Tihu uses a lexicon of thousands of words tagged with their affixes (suffixes and prefixes). Each word is tagged with the affixes it can take, which helps to find inflected forms of the word in the lexicon. You can find the lexicon and affix documents here:
https://github.com/tihu-nlp/tihu/blob/master/src/build/data/lexicon.aff
https://github.com/tihu-nlp/tihu/blob/master/src/build/data/lexicon.dic

My previous work, which was submitted to Google Chrome, was a dictionary for a Persian spell checker; I used this lexicon to build it. You can find it here:
https://github.com/m-o-s-t-a-f-a/lilak

3- POS disambiguation, or word sense disambiguation. This part is still empty in Tihu! I have several ideas for doing it in Persian. The same techniques that work in other languages like English can be used here, but for Persian we can also use some other ideas. I believe prepositions in Persian can help a lot with POS disambiguation. I hope people will help me develop this part of Tihu.
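As one illustration of that preposition idea, here is a small sketch of my own (this is not code from Tihu; the four prepositions are real Persian words, but the tag names and the voting rule are made up for the example):

```python
# Illustrative heuristic: in Persian, the word right after a
# preposition is very often a noun, so a preceding preposition can
# vote for the NOUN reading of an ambiguous word.
PREPOSITIONS = {"از", "به", "در", "با"}  # az, be, dar, bâ

def disambiguate(tokens, candidates):
    """candidates maps each token index to its possible POS tags;
    return one chosen tag per token index."""
    tags = {}
    for i, token in enumerate(tokens):
        options = candidates.get(i, [])
        if len(options) == 1:
            tags[i] = options[0]
        elif i > 0 and tokens[i - 1] in PREPOSITIONS and "NOUN" in options:
            tags[i] = "NOUN"  # the preposition votes for the noun reading
        elif options:
            tags[i] = options[0]  # otherwise fall back to the first candidate
    return tags
```

A real implementation would combine many such cues with corpus statistics, but even simple rules like this can resolve a good share of ambiguous words.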
4- Diacritics, or Kasre-Ezafe, detection. This part also needs more study before implementation. There are some papers about it, and I found that an amazing library named Hazm has done wonderful work on it. Check it here:
https://github.com/sobhe/hazm

5- Letter to sound. For converting letters to sounds (grapheme-to-phoneme conversion) I used the TensorFlow library. For training the model I used a dictionary with more than 50 thousand Persian words. The trained model can be found here:
https://github.com/tihu-nlp/g2p-seq2seq-tihudict

The word error rate (WER) for this model is around 15%-20%, which is acceptable.
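For reference, here is a sketch of how such an error rate can be measured (this is my own illustration, not the evaluation code of the project; WER here counts a word as an error whenever its predicted pronunciation is not an exact match, and the phoneme-level rate uses edit distance):

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def word_error_rate(references, hypotheses):
    """Fraction of words whose predicted pronunciation differs
    from the reference pronunciation."""
    errors = sum(r != h for r, h in zip(references, hypotheses))
    return errors / len(references)

def phoneme_error_rate(references, hypotheses):
    """Total phoneme edits divided by total reference phonemes."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return edits / sum(len(r) for r in references)
```

With a 15%-20% WER, roughly one word in six gets at least one phoneme wrong; the phoneme-level error rate is usually much lower.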
6- Speech synthesizer. At the moment Tihu uses MBROLA as its speech synthesizer; however, eSpeak can be used as well. I also did some scratch work with the Festvox library, but building new voices via Festvox needs a lot of work and time. I am very hopeful that a team will work on it soon.
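To make the first two steps above more concrete, here is a rough sketch of how normalization during the letter-by-letter sweep and an affix-aware lexicon lookup can fit together. All of the data below is a tiny made-up subset of my own for illustration; the real tables live in tokens.txt, lexicon.dic and lexicon.aff:

```python
# Step 1 (tokenizer): map non-standard code points to normalized
# Persian ones, as the tokens.txt table does (tiny illustrative subset).
NORMALIZE = {
    "\u064a": "\u06cc",  # Arabic Yeh -> Persian Yeh
    "\u0643": "\u06a9",  # Arabic Kaf -> Persian Kaf
}

# Step 2 (POS tagging): each lexicon entry records which affix
# classes the word accepts (made-up entries, not Tihu's format).
LEXICON = {
    "کتاب": {"pos": "NOUN", "affixes": {"PLURAL"}},  # ketâb, "book"
}
SUFFIXES = {
    "ها": "PLURAL",  # -hâ, the common Persian plural suffix
}

def normalize(word):
    """Sweep the word letter by letter, replacing non-standard characters."""
    return "".join(NORMALIZE.get(ch, ch) for ch in word)

def lookup(word):
    """Try the surface form first, then strip known suffixes, accepting
    a stem only if its entry allows that affix class."""
    word = normalize(word)
    if word in LEXICON:
        return LEXICON[word]["pos"]
    for suffix, affix_class in SUFFIXES.items():
        if word.endswith(suffix):
            entry = LEXICON.get(word[: -len(suffix)])
            if entry and affix_class in entry["affixes"]:
                return entry["pos"]
    return None
```

The point of the affix tags shows up in lookup(): a stripped suffix is accepted only when the stem's lexicon entry says that affix class is allowed for it.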
An example: I put an example in the Tihu project. You can check it here:
https://github.com/tihu-nlp/tihu/tree/master/example

In this example you can find the output of each level as an XML file. The final result is printed to the text.lbl file.
Conclusion: Persian is a complex language in terms of natural language processing; it has a lot of exceptions. Processing Persian text cannot be done just by defining a set of morphological rules. Maybe that is one of the reasons Persian is such a literary language.
The most important part of Tihu is providing a simple infrastructure that can be developed by other programmers across the world. Hope to see you there ;)
Thanks,
Mostafa