Dear Mona,
One thing you could do is use the Octopus toolkit [1].
It is a wrapper around the AraT5 model [2] that handles multiple tasks in a single model, including diacritization (see the screenshot linked below).
I would note that, for diacritization, we obtained the best results (lowest character error rate [CER]) with a single-task model (see the first row of Table 2 in the Octopus paper [1], p. 236).
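Since CER is the metric in question, here is a minimal sketch of how it is typically computed: character-level Levenshtein distance normalized by the reference length. (The function names are mine for illustration, not part of the Octopus toolkit.)

```python
def edit_distance(hyp: str, ref: str) -> int:
    """Character-level Levenshtein distance via classic dynamic programming."""
    prev = list(range(len(ref) + 1))
    for i, ch_h in enumerate(hyp, start=1):
        cur = [i]
        for j, ch_r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,                      # deletion
                cur[j - 1] + 1,                   # insertion
                prev[j - 1] + (ch_h != ch_r),     # substitution (0 if match)
            ))
        prev = cur
    return prev[len(ref)]

def cer(hyp: str, ref: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(hyp, ref) / max(len(ref), 1)
```

For example, `cer("ab", "abcd")` gives 0.5, since two edits are needed against a four-character reference.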
So, if you need better performance, you could take the Octopus model and further fine-tune it on a diacritized dataset.
A colleague of mine told me earlier this week that his team fine-tuned AraT5-v2 and was very pleased with the excellent results they obtained on a diacritization task.
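If you go the fine-tuning route, the training pairs are usually built by stripping the diacritics from gold diacritized text, so each example maps undiacritized input to its diacritized target. A minimal sketch, assuming standard Arabic combining marks; the helper names and diacritic list are mine, not part of Octopus or AraT5:

```python
# Common Arabic diacritics: fathatan..sukun (U+064B-U+0652) plus superscript alef.
ARABIC_DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks, leaving the base letters intact."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)

def make_pairs(diacritized_lines: list[str]) -> list[tuple[str, str]]:
    """Build (input, target) fine-tuning pairs from gold diacritized lines."""
    return [(strip_diacritics(line), line) for line in diacritized_lines]
```

These pairs can then be fed to whatever seq2seq fine-tuning setup you prefer, with the undiacritized side as the source and the diacritized side as the target.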
More:
- Octopus demo page (with several examples, documentation, etc.)
- Octopus screenshot

Best,
Muhammad
Refs:
[1] Octopus paper: https://aclanthology.org/2023.arabicnlp-1.20/
BibTeX:
@inproceedings{elmadany-etal-2023-octopus,
title = "Octopus: A Multitask Model and Toolkit for {A}rabic Natural Language Generation",
author = "Elmadany, AbdelRahim and
Nagoudi, El Moatez Billah and
Abdul-Mageed, Muhammad",
editor = "Sawaf, Hassan and
El-Beltagy, Samhaa and
Zaghouani, Wajdi and
Magdy, Walid and
Abdelali, Ahmed and
Tomeh, Nadi and
Abu Farha, Ibrahim and
Habash, Nizar and
Khalifa, Salam and
Keleg, Amr and
Haddad, Hatem and
Zitouni, Imed and
Mrini, Khalil and
Almatham, Rawan",
booktitle = "Proceedings of ArabicNLP 2023",
month = dec,
year = "2023",
address = "Singapore (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.arabicnlp-1.20",
doi = "10.18653/v1/2023.arabicnlp-1.20",
pages = "232--243",
abstract = "Understanding Arabic text and generating human-like responses is a challenging task. While many researchers have proposed models and solutions for individual problems, there is an acute shortage of a comprehensive Arabic natural language generation toolkit that is capable of handling a wide range of tasks. In this work, we present a robust Arabic text-to-text Transformer model, namely AraT5v2, methodically trained on extensive and diverse data, utilizing an extended sequence length of 2,048 tokens. We explore various pretraining strategies including unsupervised, supervised, and joint pretraining, under both single and multitask settings. Our models outperform competitive baselines with large margins. We take our work one step further by developing and publicly releasing OCTOPUS, a Python-based package and command-line toolkit tailored for eight Arabic generation tasks all exploiting a single model. We provide a link to the models and the toolkit through our public repository.",
}
[2] AraT5 paper: https://aclanthology.org/2022.acl-long.47/
BibTeX:
@inproceedings{nagoudi-etal-2022-arat5,
title = "{A}ra{T}5: Text-to-Text Transformers for {A}rabic Language Generation",
author = "Nagoudi, El Moatez Billah and
Elmadany, AbdelRahim and
Abdul-Mageed, Muhammad",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.47",
doi = "10.18653/v1/2022.acl-long.47",
pages = "628--647",
abstract = "Transfer learning with a unified Transformer framework (T5) that converts all language problems into a text-to-text format was recently proposed as a simple and effective transfer learning approach. Although a multilingual version of the T5 model (mT5) was also introduced, it is not clear how well it can fare on non-English tasks involving diverse data. To investigate this question, we apply mT5 on a language with a wide variety of dialects{--}Arabic. For evaluation, we introduce a novel benchmark for ARabic language GENeration (ARGEN), covering seven important tasks. For model comparison, we pre-train three powerful Arabic T5-style models and evaluate them on ARGEN. Although pre-trained with {\textasciitilde}49 less data, our new models perform significantly better than mT5 on all ARGEN tasks (in 52 out of 59 test sets) and set several new SOTAs. Our models also establish new SOTA on the recently-proposed, large Arabic language understanding evaluation benchmark ARLUE (Abdul-Mageed et al., 2021). Our new models are publicly available. We also link to ARGEN datasets through our repository: \url{https://github.com/UBC-NLP/araT5}.",
}
From: <sig...@googlegroups.com> on behalf of Mona Alshehri <msalsh...@gmail.com>
Date: Thursday, May 9, 2024 at 5:46 PM
To: "sig...@googlegroups.com" <sig...@googlegroups.com>
Subject: [SIGARAB] Arabic Diacritization
--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
sigarab+u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/sigarab/CAC6bHr%3DXqf2UTKVG%3Db7MOAZEMyFcuBQHvK6C3zNU4DOTJraTxw%40mail.gmail.com.
--
Assalamu alaikum sister Mona,
Shukran to brothers Nizar and Abdul-Mageed for providing their relevant tools. This is useful knowledge for us all.
The current state-of-the-art model (and, as a byproduct, all other relevant diacritization models) is listed at this link:
https://paperswithcode.com/sota/arabic-text-diacritization-on-tashkeela-1
@Nizar @Abdul-Mageed
I do not see your models listed on this website. I hope you both will consider adding your relevant research there, to indicate whether your models perform better than the current state of the art.
Shukran,
Mohamed