Awesome Arabic NLP – curated hub for serious Arabic ML work


Hesham Haroon

Nov 18, 2025, 7:04:42 PM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
I hope you’re doing well.


I’ve just put together a public GitHub repo called “Awesome Arabic NLP” that might be useful for your work on Arabic language and ML:




The idea is simple: one place that stays up to date with the most relevant Arabic NLP resources, so people don’t have to re-discover the same links every time they start a new project.


The repo currently includes:


Key research groups and labs working on Arabic (AUB MIND, CAMeL Lab, UBC-NLP, SILMA, QCRI, etc.)


Benchmarks and leaderboards (Arabic MTEB, Arabic Broad Leaderboard, Open Arabic LLM Leaderboard, ALUE, BALSAM, and more)


State-of-the-art models (LLMs, multimodal models, embeddings, ASR/TTS, OCR, etc.)


Datasets for text, speech, and vision


Essential tools and preprocessing libraries for Arabic morphology, normalization, tokenization, and analysis



I’m trying to keep it practical and focused on what people actually use in real projects, not just a random link dump.


If you spot anything important that’s missing, I’d be happy to add it or review a PR.
And if you think it can help someone on your team or in your network, feel free to share the link.


Best,
Hesham

Houda Bouamor

Nov 19, 2025, 4:10:58 AM
to Hesham Haroon, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Dear Hesham,

I want to extend a sincere thank you for putting together the Awesome Arabic NLP GitHub repository. 
It’s an incredibly valuable contribution to the community, and we truly appreciate the time and care you’ve invested in curating such a comprehensive and practical resource.

Having a centralized, up-to-date collection of Arabic NLP research groups, benchmarks, models, datasets, and tools will undoubtedly help researchers and practitioners avoid duplication of effort and accelerate their projects. Your initiative fills a real gap in the ecosystem, and we’re grateful for your effort and generosity in making it public and maintaining it.

Thank you again for supporting and strengthening the Arabic NLP community.

Best regards,
Houda


Dhaou Ghoul

Nov 19, 2025, 4:31:11 AM
to Houda Bouamor, Hesham Haroon, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hello Hesham, 
This is a tremendous contribution to the Arabic NLP community. Thank you for putting this together and for sharing it with us.

I've just spent some time browsing the repo, and the structure is excellent. Having the key research groups, benchmarks, and state-of-the-art models all in one place is a huge time-saver. I found the 'State-of-the-Art Models', 'Open Arabic LLM Leaderboard', and 'Datasets' sections immediately useful for a project I'm starting.

I'll make sure to share this widely with my network.  

Thanks again for this great effort.

Sincerely,



--
Kind regards,
Dhaou Ghoul
Research And Development Engineer | Data Scientist @ Highsys, Paris, France.

Hesham Haroon

Nov 19, 2025, 4:48:24 AM
to Dhaou Ghoul, Houda Bouamor, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Thank you both, Houda and Dhaou, for your very kind words and for the enthusiastic support of the "Awesome Arabic NLP" repository.

I am genuinely pleased to hear that you both find the resource valuable, and that the structure, particularly the sections on research groups, benchmarks, and SOTA models, is proving immediately useful for your work. Your positive feedback confirms that the effort to create a centralized and practical hub is indeed meeting a real need within the Arabic NLP community.

I especially appreciate the commitment to share the resource within your respective networks. The goal is maximum utility for the community, and wider visibility is crucial for that.

I will continue to maintain and update the repository to ensure it remains a relevant and high-quality resource for everyone.

Best regards,
Hesham

Omar Najar

Nov 19, 2025, 9:07:40 AM
to Hesham Haroon, Dhaou Ghoul, Houda Bouamor, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Salam Hesham,

Thank you for sharing this — truly great work.
The Awesome Arabic NLP repo is a valuable contribution to the community, and it’s fantastic to see efforts like yours organizing models, datasets, and tools that many researchers rely on.

We definitely need more initiatives that support and advance Arabic NLP, and your repository is a step in the right direction. I’ll share it with colleagues and teams who can benefit from it, and I’ll also let you know if we spot resources worth adding.

From the NAMAA side, we’d be happy to contribute as well, whether by adding our open-source models (AraModernBERT, Saudi dialect embeddings, MT models, and others), sharing our public datasets, proposing new benchmark sections, or submitting PRs for tools and evaluation frameworks we actively use.


Happy to coordinate with you anytime.

Appreciate your contribution — keep it up!

Best regards,
Omar


Hesham Haroon

Nov 19, 2025, 9:18:43 AM
to Omar Najar, Dhaou Ghoul, Houda Bouamor, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Salam Omar,

Thank you very much for your kind words regarding the Awesome Arabic NLP repository; I truly appreciate the support.

I have been following the work coming out of NAMAA, and I find it genuinely fascinating and highly impactful for the community. The open-source models, especially AraModernBERT and the Saudi dialect embeddings, are key resources that would significantly enhance the repository's value.

I am very enthusiastic about your offer to contribute, whether by adding your open-source models, sharing public datasets, or suggesting new benchmark sections. Your team's active involvement with various tools and evaluation frameworks will ensure the repo remains practical and state-of-the-art.

I look forward to receiving your contributions through pull requests or coordinating with you on the best way to integrate NAMAA’s resources into the repository. Please feel free to reach out whenever you are ready to proceed.

Best regards,
Hesham

Ahmed Younes

Nov 20, 2025, 3:47:19 AM
to Hesham Haroon, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Hi Hesham,

Thank you for sharing the Awesome Arabic NLP repository — it’s a fantastic initiative. Having a single, up-to-date place for Arabic NLP resources will genuinely save researchers and practitioners a lot of time.

I also wanted to share a tool I’ve been developing as part of my PhD research, called DeformAR, which I believe could be a relevant addition to the repo.

GitHub repo: https://github.com/ay94/DeformAR
EMNLP/ARR submission: https://openreview.net/forum?id=tBb0po9inH#discussion

I’m currently preparing an expanded version for arXiv — the project is too large to fit into a 9-page format, and the longer version will allow me to properly outline the methodology before condensing the core arguments for the EMNLP-style paper.


What DeformAR Does

A recurring issue in NLP evaluation is that most metrics focus only on high-level performance scores, without explaining why one system performs better than another, or why certain languages (like Arabic) behave differently from English.

During my PhD, I conducted an extensive survey of NER research and identified multiple factors influencing model behaviour, including:

  • architectural and modelling biases

  • English-centric fine-tuning practices

  • data quality issues

  • tokenization artefacts

  • cross-lingual inconsistencies

DeformAR combines quantitative analysis, interpretability, and visual analytics to uncover these underlying causes — particularly why Arabic systems often underperform compared to English.


How It Works

DeformAR is built around a modular, component-based analysis framework:

  • It decomposes any NLP system into components and sub-components.

  • For each component, it computes behavioural metrics and correlation profiles to identify weaknesses, inconsistencies, and interactions in the pipeline.

  • It provides a visual analytics dashboard for exploring all these behaviours in a unified workflow.

At the moment, the system supports a full end-to-end NER pipeline, including data preprocessing, fine-tuning, component extraction, quantitative evaluation, and multi-dimensional visual exploration.
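
To make the component-level idea above concrete, here is a minimal sketch of one behavioural metric and one correlation profile for a tokenization component. It is not DeformAR's actual API: the `TokenRecord` type, the `component_profile` function, the subword counts, and the four sample tokens are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class TokenRecord:
    """One word-level observation (hypothetical structure, not DeformAR's)."""
    word: str
    pieces: int  # assumed number of subword pieces produced by the tokenizer
    gold: str    # gold NER tag
    pred: str    # predicted NER tag


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def component_profile(records):
    """Summarise a tokenization component and relate it to tagging errors."""
    pieces = [r.pieces for r in records]
    errors = [0 if r.gold == r.pred else 1 for r in records]
    return {
        "mean_pieces": sum(pieces) / len(pieces),      # behavioural metric
        "error_rate": sum(errors) / len(errors),       # outcome metric
        "pieces_vs_error": pearson(pieces, errors),    # correlation profile
    }


# Toy data: the subword counts and predictions are invented for illustration.
records = [
    TokenRecord("London", 1, "B-LOC", "B-LOC"),
    TokenRecord("visited", 1, "O", "O"),
    TokenRecord("الإسكندرية", 4, "B-LOC", "O"),
    TokenRecord("زار", 1, "O", "O"),
]

profile = component_profile(records)
print(profile)
```

On real data, a strongly positive `pieces_vs_error` value would suggest that heavily fragmented words are tagged wrongly more often, which is the kind of tokenization artefact the list above refers to. A correlation like this is only a starting point for inspection, not evidence of cause.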

It was designed to be:

  • Extensible — new metrics, components, or diagnostics can be added easily

  • Task-agnostic — while the main case study is NER (Arabic vs. English), the framework can be extended to other sequence-labelling tasks and eventually to classification, sentiment, and dialect analysis

  • Model-flexible — currently built around BERT models, but switching to other transformer variants is straightforward

The visual interface can explore up to five variables simultaneously, making it possible to uncover structural patterns such as tokenization behaviour, ambiguity, and inconsistency, and to compare them across Arabic and English systems, as well as many other low-resource languages.

My PhD thesis presents the full theoretical and empirical foundation. An arXiv release is in progress, and a version of the work previously received encouraging (though mixed) feedback during the EMNLP/ARR cycle.


Collaboration

I believe DeformAR could be useful for broader diagnostic and analytical work in Arabic NLP. I’m very open to collaborating — whether on dialect tasks, more complex pipelines, or extending the framework to different models and domains. I’m also building an MLOps-ready version and a microservice-oriented version for cloud deployment.

If you think this tool would be a good fit for the Awesome Arabic NLP repo, I’d be happy to prepare a clear description or contribute a PR.

Thanks again for sharing the repo — really great work.

Best regards,
Ahmed Younes

