Arabic Diacritization


Mona Alshehri

unread,
May 9, 2024, 9:46:22 AM5/9/24
to sig...@googlegroups.com
Dear all

I am looking for a machine diacritization tool that can be used with Python. For example, I have several Arabic files, and I want to add harakat to them so that I can process the text both with and without harakat. Is there a Python library that can be used for this? And could a CAMeLBERT-based model be helpful for this purpose?
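(For context, the "without harakat" side of the processing does not need a model at all: diacritics occupy a fixed range of Unicode code points and can be stripped in plain Python. A minimal sketch of my own, using the standard tashkeel range:)

```python
import re

# Arabic diacritics (tashkeel): U+064B (fathatan) .. U+0652 (sukun),
# plus U+0670 (the superscript/dagger alef).
HARAKAT = re.compile(r"[\u064B-\u0652\u0670]")

def strip_harakat(text: str) -> str:
    """Return the text with all diacritics removed."""
    return HARAKAT.sub("", text)

print(strip_harakat("كَتَبَ"))  # bare form without the fathas
```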


BW
Mona

Abdul-Mageed, Muhammad

unread,
May 9, 2024, 10:13:16 AM5/9/24
to Mona Alshehri, sig...@googlegroups.com

Dear Mona,

 

One thing you could do is use the Octopus toolkit [1].

It is a `wrapper` around the AraT5 model [2] that handles multiple tasks (see image below), including diacritization.

Octopus handles all tasks in a single model.

 

I would note that, for diacritization, we obtained the best results (lowest character error rate [CER]) with a single-task model (see the first row of Table 2 in the Octopus paper [1], p. 236).
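(Aside, for anyone comparing numbers: CER here is the character-level Levenshtein distance divided by the reference length. A minimal sketch of the metric — my own illustration, not the evaluation script used in the paper:)

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance(ref, hyp) / len(ref)."""
    # One-row dynamic program; d[j] = edit distance(ref[:i], hyp[:j]).
    d = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        prev = d[0]          # old d[j-1] (diagonal cell)
        d[0] = i
        for j, hc in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # delete rc
                      d[j - 1] + 1,        # insert hc
                      prev + (rc != hc))   # substitute or match
            prev, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

print(cer("كتب", "كتب"))   # identical strings -> 0.0
```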

So, if you need better performance, you could simply take the Octopus model and further finetune it on a diacritized dataset.

A colleague of mine told me earlier this week that he was very pleased when his team finetuned AraT5-v2 and obtained excellent results on a diacritization task.

 

More:

 

Octopus Demo page (with several examples, documentation, etc.).

 

Octopus screenshot


Best,

 

Muhammad

 

 

Refs:

 

[1] Octopus paper: https://aclanthology.org/2023.arabicnlp-1.20/

BibTex:

 

@inproceedings{elmadany-etal-2023-octopus,

    title = "Octopus: A Multitask Model and Toolkit for {A}rabic Natural Language Generation",

    author = "Elmadany, AbdelRahim  and

      Nagoudi, El Moatez Billah  and

      Abdul-Mageed, Muhammad",

    editor = "Sawaf, Hassan  and

      El-Beltagy, Samhaa  and

      Zaghouani, Wajdi  and

      Magdy, Walid  and

      Abdelali, Ahmed  and

      Tomeh, Nadi  and

      Abu Farha, Ibrahim  and

      Habash, Nizar  and

      Khalifa, Salam  and

      Keleg, Amr  and

      Haddad, Hatem  and

      Zitouni, Imed  and

      Mrini, Khalil  and

      Almatham, Rawan",

    booktitle = "Proceedings of ArabicNLP 2023",

    month = dec,

    year = "2023",

    address = "Singapore (Hybrid)",

    publisher = "Association for Computational Linguistics",

    url = "https://aclanthology.org/2023.arabicnlp-1.20",

    doi = "10.18653/v1/2023.arabicnlp-1.20",

    pages = "232--243",

    abstract = "Understanding Arabic text and generating human-like responses is a challenging task. While many researchers have proposed models and solutions for individual problems, there is an acute shortage of a comprehensive Arabic natural language generation toolkit that is capable of handling a wide range of tasks. In this work, we present a robust Arabic text-to-text Transformer model, namely AraT5v2, methodically trained on extensive and diverse data, utilizing an extended sequence length of 2,048 tokens. We explore various pretraining strategies including unsupervised, supervised, and joint pertaining, under both single and multitask settings. Our models outperform competitive baselines with large margins. We take our work one step further by developing and publicly releasing OCTOPUS, a Python-based package and command-line toolkit tailored for eight Arabic generation tasks all exploiting a single model. We provide a link to the models and the toolkit through our public repository.",

}

 

[2] AraT5 paper: https://aclanthology.org/2022.acl-long.47/

BibTex:

 

@inproceedings{nagoudi-etal-2022-arat5,
    title = "{A}ra{T}5: Text-to-Text Transformers for {A}rabic Language Generation",
    author = "Nagoudi, El Moatez Billah  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.47",
    doi = "10.18653/v1/2022.acl-long.47",
    pages = "628--647",
    abstract = "Transfer learning with a unified Transformer framework (T5) that converts all language problems into a text-to-text format was recently proposed as a simple and effective transfer learning approach. Although a multilingual version of the T5 model (mT5) was also introduced, it is not clear how well it can fare on non-English tasks involving diverse data. To investigate this question, we apply mT5 on a language with a wide variety of dialects{--}Arabic. For evaluation, we introduce a novel benchmark for ARabic language GENeration (ARGEN), covering seven important tasks. For model comparison, we pre-train three powerful Arabic T5-style models and evaluate them on ARGEN. Although pre-trained with {\textasciitilde}49 less data, our new models perform significantly better than mT5 on all ARGEN tasks (in 52 out of 59 test sets) and set several new SOTAs. Our models also establish new SOTA on the recently-proposed, large Arabic language understanding evaluation benchmark ARLUE (Abdul-Mageed et al., 2021). Our new models are publicly available. We also link to ARGEN datasets through our repository: \url{https://github.com/UBC-NLP/araT5}.",
}

 

 


Nizar Habash

unread,
May 9, 2024, 3:34:02 PM5/9/24
to Mona Alshehri, sig...@googlegroups.com

Mohamed H.

unread,
May 9, 2024, 5:21:16 PM5/9/24
to sig...@googlegroups.com

Assalamu alaikum sister Mona,

Shukran to brothers Nizar and Abdul Mageed for providing their relevant tools. This is useful knowledge for us all.

The current state-of-the-art model (and, as a byproduct, all other relevant diacritization models) is listed at this link:

https://paperswithcode.com/sota/arabic-text-diacritization-on-tashkeela-1

@Nizar @Abdul Mageed

I do not see your models listed on this website. I hope you both will consider adding your relevant research to show whether your models perform better than the current state of the art.

Shukran,
Mohamed

Mona Alshehri

unread,
May 10, 2024, 2:21:05 AM5/10/24
to sig...@googlegroups.com
Wa alaikum assalam
@Nizar @Abdul Mageed @Mohamed

I am really grateful for each reply and this valuable information.

BW
Mona


Nizar Habash

unread,
May 10, 2024, 2:57:16 AM5/10/24
to Mohamed H., sig...@googlegroups.com
Dear Mohamed H. - Thanks for pointing out the Tashkeela leaderboard.
We have not reported on it in the past; my team will look into it.

Just as a point of clarification that may not be apparent to all:
multiple datasets have been reported on in the area of diacritization,
namely the Penn Arabic Treebank, WikiNews, and Tashkeela. They have
important differences among them in terms of genre, public availability,
richness of annotations, and even diacritization style, which may explain
our community's current fragmentation in reporting on this matter.

Best
Nizar




--
Nizar Habash
Professor of Computer Science
New York University Abu Dhabi
https://www.nizarhabash.com/ 