How to adapt Dico for Arabic text?

Noureddine Doumi

Jan 25, 2016, 2:37:09 PM

to Unitex-GramLab

Hi everybody,

I'm trying to adapt the Dico command for Arabic text so that it recognizes unvocalized or partially vocalized words even though the Arabic DELAF contains only fully vocalized forms, and so that it recognizes words containing the kashida/tatweel character even though the dictionary entries are written without it.
In an Arabic corpus, words may appear without short vowels, partially vocalized, or fully vocalized. Only the third case (fully vocalized) matches the dictionary; the other two are correctly spelled and easily read by native Arabic speakers, but they do not exist in the dictionary.
To clarify the problem, here are two examples:
1- The fully vocalized word kataba /wrote/ can be written in a corpus as ktb, ktba or ktaba.
2- In Arabic, the kashida/tatweel character is used to justify lines of text by stretching words horizontally: the fully vocalized word kataba can be found in the corpus as ka___ta___ba, kt____b, ka___t___ba and so on; there are many possible forms. (Here I used the underscore in place of the kashida.)
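For illustration only, the two variations above can be reduced to a single lookup key by deleting the tatweel character (U+0640) and the combining diacritics (U+064B to U+0652). This Python sketch is not the poster's Java implementation and not how Dico works internally; the function names are hypothetical.

```python
import re

TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)
# Assumption: "diacritics" means U+064B..U+0652 (tanwin, short vowels, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_tatweel(word: str) -> str:
    """Remove every kashida/tatweel character from the word."""
    return word.replace(TATWEEL, "")

def skeleton(word: str) -> str:
    """Remove tatweel and all diacritics, leaving only the base letters."""
    return DIACRITICS.sub("", strip_tatweel(word))

# kataba, fully vocalized: kaf+fatha, teh+fatha, beh+fatha
full = "\u0643\u064E\u062A\u064E\u0628\u064E"
# a stretched, partially vocalized variant: kaf+fatha, tatweel x2, teh, tatweel, beh
stretched = "\u0643\u064E\u0640\u0640\u062A\u0640\u0628"

print(skeleton(full) == skeleton(stretched))  # prints True
```

Stripping every diacritic is lossy: kutb would reduce to the same key as kataba, so a stricter comparison is needed whenever the diacritics present in the token must agree with the dictionary entry.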

I have worked out the algorithm and implemented it in Java, where it works smoothly, but I could not integrate it into the Dico source code because I do not know how the Dico program accesses the dictionary automaton or how it retrieves the entries.

Any help is appreciated and thanks in advance...

Best regards,
N. Doumi

Alexis Neme

Jan 27, 2016, 7:45:15 AM

to Unitex-GramLab

Hi Noureddine,

The two issues are not related in Unitex.

1 - For the kashida issue, you should create a preprocessing graph in the Replace directory. (see Ligatures.grf for French)

2 - For the diacritics issue, your dictionary must be compressed with the Semitic option.

Before using Compress, either tick "Semitic language" in the Java user interface (Preferences > Language),

or use the option on the command line:

Compress "testflx_AR.dic" --semitic (cf. User Manual, section 13.8, Compress)

For example, if kataba (in Arabic script) is the only form in your dictionary,
then any partially or fully diacritized token of it in the Arabic text will be identified.
Here, ktb, katab, ktab, etc. will be identified,
but not kutb (u is incompatible with the diacritic a at the same position)
and not kaatib (a wrong form with two diacritics).
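The matching rule described above can be sketched as follows. This is only a model of the behaviour as described in this message, not the actual Unitex implementation: a token is accepted when it has the same base letters as the dictionary entry and every diacritic it carries also appears on the same letter of the entry. The names `split_units` and `is_compatible` are hypothetical.

```python
# Assumption: diacritics are the code points U+064B..U+0652.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def split_units(word):
    """Split a word into [base_letter, set_of_diacritics] units."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            units[-1][1].add(ch)      # attach the diacritic to the preceding letter
        else:
            units.append([ch, set()])  # start a new unit for a base letter
    return units

def is_compatible(token, entry):
    """True if token is a partially or fully diacritized spelling of entry."""
    t, e = split_units(token), split_units(entry)
    if len(t) != len(e):
        return False
    # Same letter at each position, and the token's diacritics are a subset
    # of the entry's diacritics at that position.
    return all(tl == el and td <= ed for (tl, td), (el, ed) in zip(t, e))

KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"           # fully diacritized entry
print(is_compatible("\u0643\u062A\u0628", KATABA))        # ktb  -> True
print(is_compatible("\u0643\u064E\u062A\u0628", KATABA))  # katb -> True
print(is_compatible("\u0643\u064F\u062A\u0628", KATABA))  # kutb -> False
```

Comparing position by position, rather than stripping all diacritics, is what lets the rule reject a token such as kutb whose diacritic conflicts with the entry.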

Hope this will help,

Alexis

Noureddine Doumi

Jan 27, 2016, 1:04:59 PM

to Unitex-GramLab

Dear Alexis,

First of all, I would like to thank you for the prompt answer.
I just applied what you suggested as follows :
1- I compressed my DELAF with semitic option
UnitexToolLogger.exe Compress DELAF_V.dic --semitic -qutf8-no-bom
2- I applied this DELAF to my text (example1.txt), which contains just one token: كَتب /katb/.
UnitexToolLogger.exe Dico -t example1.snt -a Alphabet.txt -u arabic_typo_rules.txt --semitic DELAF_V.bin -qutf8-no-bom

Unfortunately, it doesn't work, and I don't know why. Although my DELAF contains the fully diacritized entry كَتَبَ /kataba/, the token in my text is still not identified.
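When a token that should match is not identified, one thing that may be worth ruling out is an unexpected character sequence in the text, for example a stray tatweel or a different code point than expected. The following Python sketch is independent of Unitex and simply lists the code points of a token; `describe` is a hypothetical helper name.

```python
import unicodedata

def describe(token):
    """Return one 'U+XXXX NAME' string per character of the token."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in token]

# The token /katb/ as it is expected to be encoded: kaf, fatha, teh, beh
for line in describe("\u0643\u064E\u062A\u0628"):
    print(line)
```

Running the same helper on the token copied from the corpus and on the dictionary entry makes any difference in the underlying characters visible.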

All the best.

Noureddine

Alexis Neme

Feb 23, 2016, 3:26:15 AM

to unitex-...@googlegroups.com

Hello all,

Please find below my answer to "How to adapt Dico for Arabic text?"

Partial diacritization in Arabic is handled:

- in morphological mode (character mode);

- by using a .grf agglutination grammar (for Arabic verbs, for instance);

- with the Arabic DELAF (for verbs) embedded in the agglutination grammar graph by declaring it in Preferences > Morphological-mode dictionaries.

See below the detailed explanation, with example files attached, written on 29 January 2016.

Bests,

Alexis

---------- Forwarded message ----------
From: Alexis Neme <alexi...@gmail.com>
Date: Fri, Jan 29, 2016 at 12:07 AM
Subject: Re: [Unitex-GramLab] Re: How to adapt Dico for Arabic text ?
To: Noureddine Doumi <ndoum...@gmail.com>

Hello Noureddine,

Indeed, I applied the DELAF dictionary directly to "katb", and it does not identify this token.

Is it necessary? I don't think so.
I never use Dico this way.
I never apply a dictionary directly.

I always apply my dictionary in morphological mode, using an .fst2 grammar to formalize agglutination, and it identifies partially diacritized tokens (see attached files).

I always use the dictionary in morphological mode, since Arabic always requires an agglutination grammar for verbs, nouns and adjectives.

Your compressed DELAF dictionary should be declared in Preferences > Morphological-mode dictionaries.

Bests,

Alexis

PS.

I advise working in UTF-16LE, since that is Unitex's native encoding; you can switch to UTF-8 later on.
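If files prepared in UTF-16LE need to be converted to UTF-8 afterwards, a sketch like the following could be used outside Unitex. It assumes the input starts with a byte order mark, which Python's "utf-16" codec uses to pick the byte order; the function names and the file names in the usage comment are hypothetical.

```python
from pathlib import Path

def utf16_bytes_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 bytes (BOM expected) and re-encode them as UTF-8 without BOM."""
    return data.decode("utf-16").encode("utf-8")

def convert_file(src: str, dst: str) -> None:
    """Convert one file; working on raw bytes keeps line endings unchanged."""
    Path(dst).write_bytes(utf16_bytes_to_utf8(Path(src).read_bytes()))

# Usage (hypothetical file names):
# convert_file("katb.txt", "katb_utf8.txt")
```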

Attached files: an example of a grammar used in morphological mode.

- put the .fst2 grammar in the Dela directory

- put katb.snt in the Corpus directory

- unzip the .7z archive in the Corpus directory

- create a DELAF dictionary containing kataba with your attributes, compress it in Semitic mode and declare it in Preferences as a morphological-mode dictionary

- after tokenization, apply the lexical resource prfx_VRB-r.fst2

- check the unknown-word filter with tag.ind (see fig. below)

- words not identified by your lexical resources (here the .fst2) will be in the tags.err file

- execute File>construct-Fst-txt


katb.txt

prfx_VRB-r.grf

prfx_VRB-r.fst2

katb.snt

katb_snt.7z
