Arabic Diacritization


Mona Alshehri

unread,
May 9, 2024, 9:46:22 AM5/9/24
to sig...@googlegroups.com
Dear all

I am looking for a machine diacritization tool that can be used with Python. For example, I have several Arabic files, and I want to add harakat to them so that I can process the text both with and without harakat. Is there a Python library that can be used for this? And could a CAMeLBERT-based model be helpful for this purpose?
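(For context, the "without harakat" side of the processing does not need a model at all: diacritics occupy a fixed range of Unicode code points and can be stripped in plain Python. A minimal sketch of my own, using the standard tashkeel range:)

```python
import re

# Arabic diacritics (tashkeel): U+064B (fathatan) .. U+0652 (sukun),
# plus U+0670 (the superscript/dagger alef).
HARAKAT = re.compile(r"[\u064B-\u0652\u0670]")

def strip_harakat(text: str) -> str:
    """Return the text with all diacritics removed."""
    return HARAKAT.sub("", text)

print(strip_harakat("كَتَبَ"))  # bare form without the fathas
```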


BW
Mona

Abdul-Mageed, Muhammad

unread,
May 9, 2024, 10:13:16 AM5/9/24
to Mona Alshehri, sig...@googlegroups.com

Dear Mona,

 

One thing you could do is use the Octopus toolkit [1].

It is a `wrapper` around the AraT5 model [2] that handles multiple tasks (see image below), including diacritization.

Octopus handles all tasks in a single model.

 

I would note that, for diacritization, we obtained the best results (lowest character error rate [CER]) with a single-task model (see the first row of Table 2 in the Octopus paper [1], p. 236).
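(Aside, for anyone comparing numbers: CER here is the character-level Levenshtein distance divided by the reference length. A minimal sketch of the metric — my own illustration, not the evaluation script used in the paper:)

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance(ref, hyp) / len(ref)."""
    # One-row dynamic program; d[j] = edit distance(ref[:i], hyp[:j]).
    d = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        prev = d[0]          # old d[j-1] (diagonal cell)
        d[0] = i
        for j, hc in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # delete rc
                      d[j - 1] + 1,        # insert hc
                      prev + (rc != hc))   # substitute or match
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

print(cer("كتب", "كتب"))   # identical strings -> 0.0
```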

So, if you need better performance, you could simply take the Octopus model and further finetune it on a diacritized dataset.

A colleague of mine told me earlier this week that he was very pleased when his team finetuned AraT5-v2 and obtained excellent results on a diacritization task.

 

More:

 

Octopus Demo page (with several examples, documentation, etc.).

 

Octopus screenshot


Best,

 

Muhammad

 

 

Refs:

 

[1] Octopus paper: https://aclanthology.org/2023.arabicnlp-1.20/

BibTex:

 

@inproceedings{elmadany-etal-2023-octopus,

    title = "Octopus: A Multitask Model and Toolkit for {A}rabic Natural Language Generation",

    author = "Elmadany, AbdelRahim  and

      Nagoudi, El Moatez Billah  and

      Abdul-Mageed, Muhammad",

    editor = "Sawaf, Hassan  and

      El-Beltagy, Samhaa  and

      Zaghouani, Wajdi  and

      Magdy, Walid  and

      Abdelali, Ahmed  and

      Tomeh, Nadi  and

      Abu Farha, Ibrahim  and

      Habash, Nizar  and

      Khalifa, Salam  and

      Keleg, Amr  and

      Haddad, Hatem  and

      Zitouni, Imed  and

      Mrini, Khalil  and

      Almatham, Rawan",

    booktitle = "Proceedings of ArabicNLP 2023",

    month = dec,

    year = "2023",

    address = "Singapore (Hybrid)",

    publisher = "Association for Computational Linguistics",

    url = "https://aclanthology.org/2023.arabicnlp-1.20",

    doi = "10.18653/v1/2023.arabicnlp-1.20",

    pages = "232--243",

    abstract = "Understanding Arabic text and generating human-like responses is a challenging task. While many researchers have proposed models and solutions for individual problems, there is an acute shortage of a comprehensive Arabic natural language generation toolkit that is capable of handling a wide range of tasks. In this work, we present a robust Arabic text-to-text Transformer model, namely AraT5v2, methodically trained on extensive and diverse data, utilizing an extended sequence length of 2,048 tokens. We explore various pretraining strategies including unsupervised, supervised, and joint pertaining, under both single and multitask settings. Our models outperform competitive baselines with large margins. We take our work one step further by developing and publicly releasing OCTOPUS, a Python-based package and command-line toolkit tailored for eight Arabic generation tasks all exploiting a single model. We provide a link to the models and the toolkit through our public repository.",

}

 

[2] AraT5 paper: https://aclanthology.org/2022.acl-long.47/

BibTex:

 

@inproceedings{nagoudi-etal-2022-arat5,
    title = "{A}ra{T}5: Text-to-Text Transformers for {A}rabic Language Generation",
    author = "Nagoudi, El Moatez Billah  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.47",
    doi = "10.18653/v1/2022.acl-long.47",
    pages = "628--647",
    abstract = "Transfer learning with a unified Transformer framework (T5) that converts all language problems into a text-to-text format was recently proposed as a simple and effective transfer learning approach. Although a multilingual version of the T5 model (mT5) was also introduced, it is not clear how well it can fare on non-English tasks involving diverse data. To investigate this question, we apply mT5 on a language with a wide variety of dialects{--}Arabic. For evaluation, we introduce a novel benchmark for ARabic language GENeration (ARGEN), covering seven important tasks. For model comparison, we pre-train three powerful Arabic T5-style models and evaluate them on ARGEN. Although pre-trained with {\textasciitilde}49 less data, our new models perform significantly better than mT5 on all ARGEN tasks (in 52 out of 59 test sets) and set several new SOTAs. Our models also establish new SOTA on the recently-proposed, large Arabic language understanding evaluation benchmark ARLUE (Abdul-Mageed et al., 2021). Our new models are publicly available. We also link to ARGEN datasets through our repository: \url{https://github.com/UBC-NLP/araT5}.",
}

 

 


Nizar Habash

unread,
May 9, 2024, 3:34:02 PM5/9/24
to Mona Alshehri, sig...@googlegroups.com

Mohamed H.

unread,
May 9, 2024, 5:21:16 PM5/9/24
to sig...@googlegroups.com

Assalamu alaikum sister Mona,

Shukran to brothers Nizar and Abdul Mageed for providing their relevant tools. This is useful knowledge for us all.

The current state-of-the-art model (and, as a byproduct, all other relevant diacritization models) is listed at this link:

https://paperswithcode.com/sota/arabic-text-diacritization-on-tashkeela-1

@Nizar @Abdul Mageed

I do not see your models listed on this website. I hope you both will consider adding your relevant research to show whether your models perform better than the current state of the art.

Shukran,
Mohamed

Mona Alshehri

unread,
May 10, 2024, 2:21:05 AM5/10/24
to sig...@googlegroups.com
Wa alaikum assalam
@Nizar @Abdul Mageed @Mohamed

I am really grateful for each reply and this valuable information.

BW
Mona


Nizar Habash

unread,
May 10, 2024, 2:57:16 AM5/10/24
to Mohamed H., sig...@googlegroups.com
Dear Mohamed H. - Thanks for pointing out the Tashkeela leaderboard.
We have not reported on it in the past; my team will look into it.

Just as a point of clarification that may not be apparent to all:
multiple datasets have been reported on in the area of diacritization,
namely the Penn Arabic Treebank, WikiNews, and Tashkeela. They have
important differences among them in terms of genre, public availability,
richness of annotations, and even diacritization style, which may explain
our community's current fragmentation in reporting on this matter.

Best
Nizar




--
Nizar Habash
Professor of Computer Science
New York University Abu Dhabi
https://www.nizarhabash.com/ 