Arabic, Morphological Locate.cpp, Arabic-Typo-rules.txt


Alexis Neme

Jun 9, 2016, 2:02:38 AM
to Unitex-GramLab
Dear C/C++ core developers,

I have identified two bugs in the C++ module morphological-locate.Cpp, in the procedure explore_dic_in_morpho_mode_arabic:

  - one bug related to the omission of hamza on alif:
            If OajonabyG is in the compressed dictionary and AajonabyG appears in the corpus, the latter is tagged as unknown
                       (even though alef hamza above O=YES is set in Arabic-Typo-rules.txt)

  - one bug related to solar assimilation (see below)

NB: All tests can be done in the Latin alphabet, since we use the Buckwalter++ encoding.
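To make the expected behaviour of the first bug concrete, here is a minimal sketch of a hamza-tolerant comparison. It is not the code of explore_dic_in_morpho_mode_arabic; the function name is hypothetical, and it assumes that the "alef hamza above O=YES" rule means a bare alif 'A' in the corpus token may stand for 'O' in the dictionary form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Unitex.
// Returns true when the corpus token equals the dictionary form, or differs
// only by writing bare alif 'A' where the dictionary has alif with hamza
// above 'O' (assumed meaning of the "alef hamza above O=YES" rule).
bool matches_with_hamza_omission(const std::string& token,
                                 const std::string& dict_form) {
    if (token.size() != dict_form.size()) return false;
    for (std::size_t i = 0; i < token.size(); ++i) {
        if (token[i] == dict_form[i]) continue;                 // identical letter
        if (token[i] == 'A' && dict_form[i] == 'O') continue;   // hamza omitted
        return false;                                           // real mismatch
    }
    return true;
}

// Usage example based on the case reported above.
void check_hamza_example() {
    assert(matches_with_hamza_omission("AajonabyG", "OajonabyG"));
}
```

The comparison is deliberately one-directional: a token written with 'O' does not match a dictionary form written with 'A'.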

If you are willing to work on these issues, please contact me. I will prepare the files for testing and validation,
and I will file the bug in the GitHub tracker.
Thanks in advance for your collaboration,

Let me know, 

Alexis 

The solar assimilation bug

When a token carries agglutinated prefixes (CONJC and PREP) before the definite article Al-, the dico module should still find the corresponding lemma in the dictionary.

Example
         samaAdi => AlsGamaAdi            / insertion of G (gemination) after the first solar consonant
                             waAlsGamaAdi       / wa: agglutinated conjunction followed by Al-
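The example above can be sketched as a small function that builds the surface form from a lemma. This is only an illustration: the function names are hypothetical, and the list of solar consonants is my assumption based on the standard Buckwalter transliteration; the authoritative list is the one in Arabic.cpp and may differ in Buckwalter++.

```cpp
#include <cassert>
#include <string>

// Assumption: solar consonants in standard Buckwalter transliteration.
// The real list lives in Arabic.cpp and may use other symbols.
const std::string kSolarLetters = "tvd*rzs$SDTZln";

bool is_solar(char c) {
    return kSolarLetters.find(c) != std::string::npos;
}

// Hypothetical helper: prefixes + "Al" + lemma, with 'G' (gemination)
// inserted after the first consonant of the lemma when it is solar.
std::string definite_form(const std::string& prefixes,
                          const std::string& lemma) {
    std::string out = prefixes + "Al";
    if (!lemma.empty() && is_solar(lemma[0])) {
        out += lemma[0];
        out += 'G';
        out += lemma.substr(1);
    } else {
        out += lemma;
    }
    return out;
}

// Usage example with the forms quoted in this message.
void check_definite_examples() {
    assert(definite_form("", "samaAdi") == "AlsGamaAdi");
    assert(definite_form("wa", "samaAdi") == "waAlsGamaAdi");
}
```

Forms with the preposition li- are left out of this sketch because they are not a plain concatenation with Al-.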
  



Bug Location
    Module: Arabic.cpp. The Arabic.cpp module handles the diacritization rules of Arabic script defined in the configuration file.
    Configuration file: arabic_typo_rules.txt ( ... solar assimilation=YES ... ), in the Arabic directory

Explanation

In Arabic, the consonants are divided into two groups, solar and lunar letters, according to whether or not they assimilate the letter 'l' of a preceding definite article Al-.
Solar letters make up half of the alphabet (the list is in Arabic.cpp).

Given a partially diacritized Arabic token, the dico program should find the fully diacritized lemma in the dictionary according to the typo rules, in particular when a 'G' is inserted, and even when the token carries agglutinated prefixes.
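As an illustration of the lookup described above, here is a hedged sketch that reduces a token to a candidate lemma before consulting the dictionary. It is not how Unitex explores its compressed dictionary. All names are hypothetical, the prefix list is limited to the wa/fa/bi/ka combinations used in the test cases of this thread, and the solar letters are assumed from standard Buckwalter.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: strips an optional prefix chain plus "Al", then the
// gemination mark 'G' after a solar first consonant.
// Returns an empty string when the token has no recognised definite prefix.
std::string strip_definite(const std::string& token) {
    // Longest prefixes first so that "wabi" wins over "wa".
    static const std::vector<std::string> prefixes = {
        "wabi", "waka", "fabi", "faka", "wa", "fa", "bi", "ka", ""};
    // Assumption: standard Buckwalter solar letters (see Arabic.cpp).
    static const std::string solar = "tvd*rzs$SDTZln";
    for (const std::string& p : prefixes) {
        const std::string head = p + "Al";
        if (token.compare(0, head.size(), head) != 0) continue;
        std::string rest = token.substr(head.size());
        if (rest.size() >= 2 && rest[1] == 'G' &&
            solar.find(rest[0]) != std::string::npos) {
            rest.erase(1, 1);  // undo solar assimilation
        }
        return rest;
    }
    return "";
}

// Direct hit first, then a second try on the stripped form.
bool found_in_dictionary(const std::string& token,
                         const std::unordered_set<std::string>& dict) {
    if (dict.count(token) > 0) return true;
    const std::string lemma = strip_definite(token);
    return !lemma.empty() && dict.count(lemma) > 0;
}

// Usage example with a one-entry toy dictionary.
void check_lookup_examples() {
    const std::unordered_set<std::string> dict = {"samaAdi"};
    assert(found_in_dictionary("waAlsGamaAdi", dict));
    assert(found_in_dictionary("fakaAlsGamaAdi", dict));
}
```

With this logic every wa/fa/bi/ka form of the test token resolves to the same lemma. Forms with li- are not covered here.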

      

The test cases are below:

AR-Token | TB-Token | FOUND (Y/N)
سَمَادِ | samaAd |
السَّمَادِ | AlsGamaAdi | Y
بِالسَّمَادِ | biAlsGamaAdi | Y
كَالسَّمَادِ | kaAlsGamaAdi |
لِلسَّمَادِ | lilosGamaAdi | Y
وَالسَّمَادِ | waAlsGamaAdi |
فَالسَّمَادِ | faAlsGamaAdi |
وَبِالسَّمَادِ | wabiAlsGamaAdi | Y
وَكَالسَّمَادِ | wakaAlsGamaAdi |
وَلِلسَّمَادِ | waliAlsGamaAdi | Y
فَبِالسَّمَادِ | fabiAlsGamaAdi | Y
فَكَالسَّمَادِ | fakaAlsGamaAdi |
فَلِلسَّمَادِ | faliAlsGamaAdi | Y


Gilles Vollant

Jun 9, 2016, 3:45:16 AM
to Alexis Neme, Unitex-GramLab

I suggest you prepare a .ULP log file, which is the best way to make sure we reproduce the same conditions.

 

Regards

Gilles Vollant

 

From: unitex-...@googlegroups.com [mailto:unitex-...@googlegroups.com] On behalf of Alexis Neme
Sent: Thursday, June 9, 2016 08:03
To: Unitex-GramLab
Subject: [Unitex-GramLab] Arabic, Morphological Locate.cpp, Arabic-Typo-rules.txt

 
