Core C++ developments: Arabic typo rules


Alexis Neme

Oct 5, 2014, 3:09:26 PM
to unitex-...@googlegroups.com
Dear Core Developers,

The BUG

When a token carries agglutinated prefixes (CONJC and PREP) before the definite article Al-, the dico module should still find the corresponding lemma in the dictionary. At present it does not do so in every case.

Example
         samaAdi => AlsGamaAdi        / insertion of G (gemination) after the initial solar consonant
                    waAlsGamaAdi      / wa + Al-: agglutinated conjunction followed by Al-
  

If you are interested, I will file the bug in the Core Bug Tracker.
The lookup does not fail in all cases (see the table below).

Thanks for your help in advance,
Let me know, 

Alexis 


Bug Location
    Module: Arabic.cpp, which handles the diacritization rules of Arabic script.
    Configuration file: arabic_typo_rules.txt ( ... solar assimilation=YES ... ) in the Arabic directory, where these rules are defined.

Explanation

In Arabic, the consonants are divided into two groups, solar and lunar letters, based on whether or not they assimilate the letter 'l' of a preceding definite article Al-.
Solar letters make up half of the alphabet (the list is in Arabic.cpp).
 
Given a partially diacritized Arabic token, the dico program should find the fully diacritized lemma in the dictionary according to the typo rules, in particular when a 'G' is inserted and even when agglutinated prefixes are present.
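
To make the expected behaviour concrete, here is a minimal sketch of the segmentation I have in mind. It is not the actual Arabic.cpp code. The function names (is_solar, strip_prefix, candidate_base_form) are hypothetical, the solar list assumes standard Buckwalter transliteration (the authoritative list is the one in Arabic.cpp), and the prefix sets are just the ones appearing in the test case (wa, fa, bi, ka, li).

```cpp
#include <string>

// Solar consonants, assuming standard Buckwalter transliteration.
// Assumption: the authoritative list is the one in Arabic.cpp.
bool is_solar(char c) {
    static const std::string solar = "tvd*rzs$SDTZln";
    return solar.find(c) != std::string::npos;
}

// Removes `prefix` from the front of `s` if present; returns true when removed.
bool strip_prefix(std::string& s, const std::string& prefix) {
    if (s.compare(0, prefix.size(), prefix) == 0) {
        s.erase(0, prefix.size());
        return true;
    }
    return false;
}

// Hypothetical helper: strips an optional conjunction (wa, fa), an optional
// preposition (bi, ka, li) and the article Al-, then undoes the gemination
// mark G after an initial solar consonant. On success, stores the candidate
// form for dictionary lookup in `base` and returns true. Returns false when
// no article Al- is found; `base` is then left untouched.
bool candidate_base_form(const std::string& token, std::string& base) {
    std::string rest = token;

    // Optional conjunction.
    if (!strip_prefix(rest, "wa")) {
        strip_prefix(rest, "fa");
    }
    // Optional preposition.
    if (!strip_prefix(rest, "bi") && !strip_prefix(rest, "ka")) {
        strip_prefix(rest, "li");
    }
    // Mandatory definite article.
    if (!strip_prefix(rest, "Al")) {
        return false;
    }
    // Solar assimilation: drop the G inserted after the first consonant.
    if (rest.size() >= 2 && is_solar(rest[0]) && rest[1] == 'G') {
        rest.erase(1, 1);
    }
    base = rest;
    return true;
}
```

With this sketch, AlsGamaAdi, waAlsGamaAdi and fakaAlsGamaAdi all reduce to samaAdi. The prefix stripping is greedy, so a real implementation would have to try every segmentation against the dictionary, and the contracted form lilosGamaAdi (no written A) is deliberately not handled here.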

      

The test case is below:

AR-Token | TB-Token | FOUND (Y/N)
---------|----------|------------
سَمَادِ | samaAd |
السَّمَادِ | AlsGamaAdi | Y
بِالسَّمَادِ | biAlsGamaAdi | Y
كَالسَّمَادِ | kaAlsGamaAdi |
لِلسَّمَادِ | lilosGamaAdi | Y
وَالسَّمَادِ | waAlsGamaAdi |
فَالسَّمَادِ | faAlsGamaAdi |
وَبِالسَّمَادِ | wabiAlsGamaAdi | Y
وَكَالسَّمَادِ | wakaAlsGamaAdi |
وَلِلسَّمَادِ | waliAlsGamaAdi | Y
فَبِالسَّمَادِ | fabiAlsGamaAdi | Y
فَكَالسَّمَادِ | fakaAlsGamaAdi |
فَلِلسَّمَادِ | faliAlsGamaAdi | Y

