--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sigarab+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/sigarab/CAJ91-UMc540r3q13ex-7x69DLMvpUtw36N-bsLRHZ4c_5HTLXQ%40mail.gmail.com.
Salam Hesham,
Thank you for sharing this — truly great work.
The Awesome Arabic NLP repo is a valuable contribution to the community, and it’s fantastic to see efforts like yours organizing models, datasets, and tools that many researchers rely on.
We definitely need more initiatives that support and advance Arabic NLP, and your repository is a step in the right direction. I’ll share it with colleagues and teams who can benefit from it, and I’ll also let you know if we spot resources worth adding.
From the NAMAA side, we'd be happy to contribute as well, whether by adding our open-source models (AraModernBERT, Saudi dialect embeddings, MT models, and others), sharing our public datasets, proposing new benchmark sections, or submitting PRs for tools and evaluation frameworks we actively use.
Happy to coordinate with you anytime.
Appreciate your contribution — keep it up!
Best regards,
Omar
Hi Hesham,
Thank you for sharing the Awesome Arabic NLP repository — it’s a fantastic initiative. Having a single, up-to-date place for Arabic NLP resources will genuinely save researchers and practitioners a lot of time.
I also wanted to share a tool I’ve been developing as part of my PhD research, called DeformAR, which I believe could be a relevant addition to the repo.
GitHub repo: https://github.com/ay94/DeformAR
EMNLP/ARR submission: https://openreview.net/forum?id=tBb0po9inH#discussion
I’m currently preparing an expanded version for arXiv — the project is too large to fit into a 9-page format, and the longer version will allow me to properly outline the methodology before condensing the core arguments for the EMNLP-style paper.
A recurring issue in NLP evaluation is that most metrics focus only on high-level performance scores, without explaining why one system performs better than another, or why certain languages (like Arabic) behave differently from English.
During my PhD, I conducted an extensive survey of NER research and identified multiple factors influencing model behaviour, including:
architectural and modelling biases
English-centric fine-tuning practices
data quality issues
tokenization artefacts
cross-lingual inconsistencies
DeformAR combines quantitative analysis, interpretability, and visual analytics to uncover these underlying causes — particularly why Arabic systems often underperform compared to English.
DeformAR is built around a modular, component-based analysis framework:
It decomposes any NLP system into components and sub-components.
For each component, it computes behavioural metrics and correlation profiles to identify weaknesses, inconsistencies, and interactions in the pipeline.
It provides a visual analytics dashboard for exploring all these behaviours in a unified workflow.
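As a rough illustration of the kind of behavioural metric and correlation profile described above, here is a toy sketch. It is not DeformAR's actual API: the records are made-up values, and the names `pearson` and `error_rate_by_tag` are hypothetical helpers chosen for this example. It assumes each token record carries an entity tag, the number of subword pieces the tokenizer produced, and whether the prediction was correct.

```python
from collections import defaultdict
from math import sqrt

# Toy, made-up records: (entity_tag, subword_count, prediction_correct).
# Illustrative values only; they are not output from DeformAR or any real model.
records = [
    ("PER", 1, True),
    ("PER", 3, False),
    ("LOC", 2, True),
    ("LOC", 4, False),
    ("ORG", 1, True),
    ("ORG", 2, True),
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def error_rate_by_tag(rows):
    """Fraction of incorrectly predicted tokens for each entity tag."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for tag, _subwords, correct in rows:
        totals[tag] += 1
        if not correct:
            errors[tag] += 1
    return {tag: errors[tag] / totals[tag] for tag in totals}


# Behavioural metric: how strongly does subword fragmentation track errors?
subword_counts = [subwords for _tag, subwords, _correct in records]
error_flags = [0 if correct else 1 for _tag, _subwords, correct in records]
fertility_error_corr = pearson(subword_counts, error_flags)

# Per-component breakdown: error rate for each entity tag.
per_tag_error = error_rate_by_tag(records)

print(f"subword count vs. error correlation: {fertility_error_corr:.3f}")
print(f"error rate by tag: {per_tag_error}")
```

On real data, the same idea would be applied per component (tokenizer, encoder layer, classifier head, and so on) with whichever metrics are relevant, so the numbers here only show the shape of the computation.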
At the moment, the system supports a full end-to-end NER pipeline, including data preprocessing, fine-tuning, component extraction, quantitative evaluation, and multi-dimensional visual exploration.
It was designed to be:
Extensible — new metrics, components, or diagnostics can be added easily
Task-agnostic — while the main case study is NER (Arabic vs. English), the framework can be extended to other sequence-labelling tasks and eventually to classification, sentiment, and dialect analysis
Model-flexible — currently built around BERT models, but switching to other transformer variants is straightforward
The visual interface can explore up to five variables simultaneously, making it possible to uncover structural patterns such as tokenization behaviour, ambiguity, and inconsistency, and to compare them across Arabic and English systems (as well as many other low-resource languages).
My PhD thesis presents the full theoretical and empirical foundation. An arXiv release is in progress, and a version of the work previously received encouraging (though mixed) feedback during the EMNLP/ARR cycle.
I believe DeformAR could be useful for broader diagnostic and analytical work in Arabic NLP. I’m very open to collaborating — whether on dialect tasks, more complex pipelines, or extending the framework to different models and domains. I’m also building an MLOps-ready version and a microservice-oriented version for cloud deployment.
If you think this tool would be a good fit for the Awesome Arabic NLP repo, I’d be happy to prepare a clear description or contribute a PR.
Thanks again for sharing the repo — really great work.
Best regards,
Ahmed Younes
--