Arabic, Morphological Locate.cpp, Arabic-Typo-rules.txt

Alexis Neme

Jun 9, 2016, 2:02:38 AM

to Unitex-GramLab

Dear Core Developers C/C++,

I have identified two bugs in the C++ module morphological-locate.cpp, in the procedure explore_dic_in_morpho_mode_arabic:

- one bug related to the omission of hamza on Alif

If the compressed dictionary contains OajonabyG and the corpus contains AajonabyG, the latter is tagged as unknown

(even though "alef hamza above O=YES" is set in Arabic-Typo-rules.txt)

- one bug related to solar assimilation (see below)

NB. All tests may be done in the Latin alphabet, since we use the Buckwalter++ encoding.
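To make the expected behaviour for the hamza case concrete, here is a minimal sketch of the comparison, written on transliterated strings. The function name matches_with_hamza_omission is hypothetical and is not taken from the Unitex sources. It only assumes what the example above states: an 'O' (alef with hamza above) in the dictionary form may appear as a bare 'A' in the corpus.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT the Unitex implementation.
// Returns true if `corpus` equals `dict`, allowing an alef with hamza above
// ('O' in the dictionary form) to be written as a bare alef ('A') in the corpus.
bool matches_with_hamza_omission(const std::string& dict,
                                 const std::string& corpus) {
    if (dict.size() != corpus.size()) return false;
    for (std::size_t i = 0; i < dict.size(); ++i) {
        if (dict[i] == corpus[i]) continue;                 // identical character
        if (dict[i] == 'O' && corpus[i] == 'A') continue;   // hamza omitted in corpus
        return false;                                       // any other mismatch
    }
    return true;
}
```

With this sketch, comparing the dictionary form OajonabyG with the corpus form AajonabyG succeeds, while the reverse direction does not, since the rule only tolerates an omitted hamza.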

If you are willing to work on these issues, please contact me; I will prepare the files for testing and validation,

and I will file the bug in the GitHub tracker.

Thanks for your collaboration in advance,

Let me know,

Alexis

The BUG of solar assimilation

When a token carries agglutinated prefixes (CONJC and PREP) before the definite article Al-, the dico module should still find the corresponding lemma in the dictionary.

Example

samaAdi => AlsGamaAdi / insertion of G (gemination) after the first solar consonant.

waAlsGamaAdi / waAl: agglutinated conjunction wa followed by Al-

Bug Location

Module: Arabic.cpp. The Arabic.cpp module handles the diacritization rules of Arabic script defined in the

configuration file arabic_typo_rules.txt ( ... solar assimilation=YES ... ) in the Arabic directory

Explanation

In Arabic, the consonants are divided into two groups, solar and lunar letters, based on whether or not they assimilate the letter 'l' of a preceding definite article Al-.

The solar letters make up half of the alphabet (the list is in Arabic.cpp).

Given a partially diacritized token in Arabic, the dico program should find the fully diacritized lemma in the dictionary according to the typo rules, in particular when a 'G' is inserted, and even when the token carries agglutinated prefixes.
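The rule described above can be sketched as follows. This is only an illustration, not the code in Arabic.cpp. The solar letter list is an assumption based on the standard Buckwalter transliteration (the authoritative list is the one in Arabic.cpp), 'G' stands for the gemination mark as in the example, and the names is_solar and with_definite_article are hypothetical.

```cpp
#include <string>

// ASSUMPTION: solar consonants as written in standard Buckwalter
// transliteration; the reference list lives in Arabic.cpp.
bool is_solar(char c) {
    static const std::string kSolar = "tvd*rzs$SDTZln";
    return kSolar.find(c) != std::string::npos;
}

// Hypothetical helper: prefixes a transliterated word with the definite
// article "Al" and inserts 'G' (gemination) after a leading solar consonant.
std::string with_definite_article(const std::string& word) {
    if (word.empty()) return word;          // nothing to prefix
    std::string out = "Al";
    out += word[0];
    if (is_solar(word[0])) out += 'G';      // solar assimilation
    out += word.substr(1);
    return out;
}
```

For the example of this message, with_definite_article("samaAdi") yields AlsGamaAdi; for a word starting with a lunar consonant no 'G' is inserted.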

The test cases are listed below:

AR-Token	TB-Token	FOUND (Y/N)
سَمَادِ	samaAdi
السَّمَادِ	AlsGamaAdi	Y
بِالسَّمَادِ	biAlsGamaAdi	Y
كَالسَّمَادِ	kaAlsGamaAdi
لِلسَّمَادِ	lilosGamaAdi	Y
وَالسَّمَادِ	waAlsGamaAdi
فَالسَّمَادِ	faAlsGamaAdi
وَبِالسَّمَادِ	wabiAlsGamaAdi	Y
وَكَالسَّمَادِ	wakaAlsGamaAdi
وَلِلسَّمَادِ	waliAlsGamaAdi	Y
فَبِالسَّمَادِ	fabiAlsGamaAdi	Y
فَكَالسَّمَادِ	fakaAlsGamaAdi
فَلِلسَّمَادِ	faliAlsGamaAdi	Y
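The lookup expected for the test cases above can be sketched as a prefix-stripping step that recovers the dictionary form. This is a simplified illustration under several assumptions: the prefix inventory is limited to the ones appearing in the table (wa, fa, bi, ka, li), the solar letter list is the standard Buckwalter one rather than the list in Arabic.cpp, and the contracted lil- form is not handled. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// ASSUMPTION: solar consonants in standard Buckwalter transliteration.
static bool is_solar(char c) {
    static const std::string kSolar = "tvd*rzs$SDTZln";
    return kSolar.find(c) != std::string::npos;
}

// True if `s` contains `p` starting at position `pos` (pos <= s.size()).
static bool has_at(const std::string& s, std::size_t pos, const std::string& p) {
    return s.compare(pos, p.size(), p) == 0;
}

// Hypothetical helper: strips [wa|fa] [bi|ka|li] Al and the 'G' inserted
// after a solar consonant. On success, stores the candidate lemma in `lemma`.
bool strip_definite_prefixes(const std::string& token, std::string& lemma) {
    std::size_t pos = 0;
    if (has_at(token, pos, "wa") || has_at(token, pos, "fa")) pos += 2;  // CONJC
    if (has_at(token, pos, "bi") || has_at(token, pos, "ka") ||
        has_at(token, pos, "li")) pos += 2;                              // PREP
    if (!has_at(token, pos, "Al")) return false;                         // article
    pos += 2;
    if (pos + 1 >= token.size()) return false;       // need consonant + 'G'
    const char c = token[pos];
    if (!is_solar(c) || token[pos + 1] != 'G') return false;
    lemma = std::string(1, c) + token.substr(pos + 2);  // drop the inserted 'G'
    return true;
}
```

Applied to waAlsGamaAdi, wabiAlsGamaAdi or fakaAlsGamaAdi, this sketch returns samaAdi, which is the form the dictionary lookup would then be expected to find.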

Gilles Vollant

Jun 9, 2016, 3:45:16 AM

to Alexis Neme, Unitex-GramLab

I suggest you prepare a .ULP log file, which is the best way to make sure the same conditions are reproduced.

Regards

Gilles Vollant
