How to adapt Dico for Arabic text ?

Noureddine Doumi

Jan 25, 2016, 2:37:09 PM
to Unitex-GramLab
Hi everybody,

I'm trying to adapt the Dico command for Arabic text so that it recognizes unvocalized or partially vocalized words even though the Arabic DELAF contains only fully vocalized words, and so that it also recognizes words containing the kashida/tatweel symbol even though the dictionary contains words without it.
In an Arabic text (corpus), words can appear without short vowels, partially vocalized, or fully vocalized. Only the third case (fully vocalized) is covered by the dictionary; the other two are correctly spelled and easily read by a native Arabic speaker, but they do not exist in the dictionary.
To clarify the problem, here are two examples:
1- The fully vocalized word kataba /wrote/ can be written in a corpus as ktb, ktba or ktaba.
2- In Arabic, the kashida/tatweel symbol is used to justify text by stretching words horizontally: the fully vocalized word kataba can be found in the corpus as ka___ta___ba, kt____b, ka___t___ba and so on; there are plenty of possible forms (here I used the underscore in place of the kashida).
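
The two cases above can be illustrated with a minimal Python sketch. This is not the Java implementation mentioned in this post, and the function names are hypothetical. It removes the tatweel character (U+0640) and strips combining diacritics, so that corpus variants can be compared with a dictionary form at the level of bare letters:

```python
import unicodedata

TATWEEL = "\u0640"                      # ARABIC TATWEEL (kashida)
KAF, TEH, BEH = "\u0643", "\u062A", "\u0628"
FATHA = "\u064E"                        # short vowel "a"

def strip_tatweel(token: str) -> str:
    """Remove every kashida/tatweel character from a token."""
    return token.replace(TATWEEL, "")

def strip_diacritics(token: str) -> str:
    """Remove combining marks (Unicode category Mn), which include the
    Arabic short vowels, tanwin, shadda and sukun."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Mn")

def skeleton(token: str) -> str:
    """Bare letters of a token: no tatweel, no diacritics."""
    return strip_diacritics(strip_tatweel(token))

# Fully vocalized dictionary form: kataba
kataba = KAF + FATHA + TEH + FATHA + BEH + FATHA

variants = [
    KAF + TEH + BEH,                                   # ktb
    KAF + FATHA + TEH + BEH,                           # katb
    KAF + TEH + TATWEEL * 4 + BEH,                     # kt____b
    KAF + FATHA + TATWEEL * 3 + TEH + FATHA + TATWEEL * 3 + BEH + FATHA,
]
for v in variants:
    # Each variant has the same bare-letter skeleton as the entry
    print(skeleton(v) == skeleton(kataba))
```

Comparing skeletons alone is deliberately loose: it would also accept a token carrying a vowel that conflicts with the dictionary entry, so a real lookup needs an additional check on the diacritics that are present.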

I have worked out the algorithm and implemented it in Java, where it works smoothly, but I could not integrate it into the Dico source code, because I don't know how the Dico program accesses the dictionary automaton and how it retrieves the entries.

Any help is appreciated and thanks in advance...

Best regards,
N. Doumi

Alexis Neme

Jan 27, 2016, 7:45:15 AM
to Unitex-GramLab
Hi Noureddine,

The two issues are not related in Unitex.

1 - For the kashida issue, you should create a preprocessing graph in the Replace directory (see Ligatures.grf for French).

2 - For the diacritics issue, your dictionary should be compressed with the Semitic option. Before using Compress:
        - in the Java user interface, go to Preference>language and tick "Semitic language";
        - or use the command line:
                Compress "testflx_AR.dic" --semitic  (cf. User Manual, Compress, 13.8)
  
For example, if kataba (in Arabic script) is the only form in your dictionary, then any token in the Arabic text that is a partially or fully diacritized version of it will be identified:
        - ktb, katab, ktab, etc. will be identified;
        - kutb will not (u is a diacritic incompatible with the a at the same position);
        - kaatib will not either (wrong form with two diacritics).
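
The matching rule described above can be sketched in Python as follows. This is only an illustration of the described behaviour, not Unitex's actual code; `split_units` and `is_compatible` are hypothetical names, and "kaatib" is modelled here as a letter carrying two diacritics. A token is accepted when it has the same letters as the dictionary entry and every diacritic it carries is also present at the same position in the entry:

```python
import unicodedata
from collections import Counter

KAF, TEH, BEH = "\u0643", "\u062A", "\u0628"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"

def split_units(word):
    """Split a word into [base_letter, [marks...]] units, attaching each
    combining mark (category Mn) to the letter that precedes it."""
    units = []
    for ch in word:
        if unicodedata.category(ch) == "Mn" and units:
            units[-1][1].append(ch)
        else:
            units.append([ch, []])
    return units

def is_compatible(token, entry):
    """True if token is a partially or fully diacritized form of entry."""
    tok, ent = split_units(token), split_units(entry)
    if len(tok) != len(ent):
        return False
    for (t_base, t_marks), (e_base, e_marks) in zip(tok, ent):
        if t_base != e_base:
            return False
        # Any mark on the token that the entry lacks at this position
        # (or has fewer times) makes the token incompatible.
        if Counter(t_marks) - Counter(e_marks):
            return False
    return True

kataba = KAF + FATHA + TEH + FATHA + BEH + FATHA          # dictionary entry

print(is_compatible(KAF + TEH + BEH, kataba))                  # ktb
print(is_compatible(KAF + FATHA + TEH + FATHA + BEH, kataba))  # katab
print(is_compatible(KAF + TEH + FATHA + BEH, kataba))          # ktab
print(is_compatible(KAF + DAMMA + TEH + BEH, kataba))          # kutb
print(is_compatible(KAF + FATHA + FATHA + TEH + KASRA + BEH, kataba))  # two marks on k
```

The first three calls accept the token and the last two reject it, which mirrors the ktb/katab/ktab versus kutb/kaatib distinction.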

Hope this will help,

Alexis 

Noureddine Doumi

Jan 27, 2016, 1:04:59 PM
to Unitex-GramLab
Dear Alexis,

First of all, I would like to thank you for the prompt answer.
I applied what you suggested as follows:
1- I compressed my DELAF with the Semitic option:
UnitexToolLogger.exe Compress DELAF_V.dic --semitic -qutf8-no-bom
2- I applied this DELAF on my text (exemple1.txt), containing just one token : كَتب /katb/.
UnitexToolLogger.exe Dico -t example1.snt -a Alphabet.txt -u arabic_typo_rules.txt --semitic DELAF_V.bin -qutf8-no-bom

Unfortunately, it doesn't work and I don't know why: although my DELAF contains the fully diacritized entry كَتَبَ /kataba/, the token in my text is still not identified.

All the best.

Noureddine

Alexis Neme

Feb 23, 2016, 3:26:15 AM
to unitex-...@googlegroups.com
Hello all,

Find below my answer to "How to adapt Dico for Arabic text ?".

Partial diacritization in Arabic was tested:
       - in morphological mode (character mode);
       - using a .grf agglutination grammar (for Arabic verbs, for instance);
       - with the Arabic DELAF (for verbs) made available to the agglutination grammar by declaring it in Preferences>morphological-mode dictionaries.

See below the detailed explanation, with the attached example files, written on 29 January 2016.

Bests,
Alexis 

---------- Forwarded message ----------
From: Alexis Neme <alexi...@gmail.com>
Date: Fri, Jan 29, 2016 at 12:07 AM
Subject: Re: [Unitex-GramLab] Re: How to adapt Dico for Arabic text ?
To: Noureddine Doumi <ndoum...@gmail.com>


Hello Noureddine,

Indeed, I applied the DELAF dictionary directly to "katb", and it does not identify this token.
- Is it necessary? I don't think so.
I never use Dico this way; I never apply a dictionary directly.

I always apply my dictionary in morphological mode, using an .fst2 grammar to formalize agglutination, and it identifies partially diacritized forms (see attached files).
I always use the dictionary in morphological mode, since in Arabic we always have an agglutination grammar for verbs, nouns and adjectives.

Your compressed DELAF dictionary should be declared in Preference>morphological mode Dictionary.


Bests,

Alexis 

PS.
I advise working in UTF-16LE, since it is the native encoding of Unitex; you can switch to UTF-8 later on.
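
If the source files are in UTF-8, a conversion along these lines may help. It is a minimal Python sketch with hypothetical function names, and it assumes that Unitex expects a byte order mark at the start of a UTF-16LE file:

```python
import codecs
from pathlib import Path

def to_utf16le_bytes(text: str) -> bytes:
    """Encode text as UTF-16 little-endian, prefixed with a BOM
    (the BOM requirement is an assumption about what Unitex expects)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

def convert_file(src: str, dst: str) -> None:
    """Read a UTF-8 file (with or without BOM) and write it as UTF-16LE."""
    text = Path(src).read_text(encoding="utf-8-sig")
    Path(dst).write_bytes(to_utf16le_bytes(text))
```

For example, `convert_file("DELAF_V_utf8.dic", "DELAF_V.dic")` would rewrite a dictionary before compression; the first file name is only a placeholder.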



Attached files: an example of a grammar using morphological mode
- put the .fst2 grammar in the Dela directory
- put katb.snt in the Corpus directory
- unzip the .7z in the Corpus directory
- create a DELAF dictionary containing kataba with your attributes,
   compress it in Semitic mode and declare it in the preferences as a morphological dictionary


- after tokenization, apply the lexical resource: prfx_VRB-r.fst2
- check the Filter unknown word with tag.ind (see fig. below)
- words not identified by your lexical resources (here the .fst2) will be in the tags.err file


Inline image 3

execute File>construct-Fst-txt 

Inline image 2
 
 



katb.txt
prfx_VRB-r.grf
prfx_VRB-r.fst2
katb.snt
katb_snt.7z