Introducing NeoAraBERT: a Novel Embedding Model

Fadi Zaraket

May 6, 2026, 2:31:55 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing

Hello everyone,


We are happy to share with you NeoAraBERT, a state-of-the-art, open-source Arabic text-embedding model built on the NeoBERT architecture.

This is the result of a collaboration between the Digital Arabic Technologies unit at the American University of Beirut (AUB) and the Unit for Research in Arabic Social and Digital Spaces (U4RASD) at the Arab Center for Research and Policy Studies (ACRPS).


We pretrained NeoAraBERT on diverse open-source and internal datasets covering Modern Standard, classical, and dialectal Arabic. These include the much-appreciated Assafir corpora, which boosted training of the model.


We guided our design choices with Arabic-tailored ablation studies covering text normalization, light stemming, and diacritics-aware tokenization, as well as POS-aware token masking and learning-rate scheduling. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including Muradif, a synonym-based task that directly assesses embedding quality. NeoAraBERT variants rank first on 18 of the 23 tasks and improve average performance across the full benchmark suite.
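
To make the preprocessing ablations concrete, here is a minimal sketch of the kind of normalization step such studies compare. The specific rules shown (alef unification, tatweel removal, optional diacritic stripping) are illustrative assumptions, not the exact pipeline from the paper:

    import re

    # Arabic diacritics (tashkeel) occupy U+064B..U+0652; tatweel is U+0640.
    DIACRITICS = re.compile(r"[\u064B-\u0652]")

    def normalize(text: str, keep_diacritics: bool = True) -> str:
        """Illustrative normalization: unify alef variants, drop tatweel,
        and optionally strip diacritics (a diacritics-aware ablation
        would toggle keep_diacritics)."""
        text = text.replace("\u0640", "")                       # remove tatweel
        text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
        text = text.replace("\u0649", "\u064A")                 # alef maqsura -> yaa
        if not keep_diacritics:
            text = DIACRITICS.sub("", text)
        return text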


Links to the checkpoints and the ACL 2026 Findings paper are available on our website:

https://acr.ps/neoarabert


We also provide a Google Colab example showing how to use the models.





Best Regards,


The NeoAraBERT Team



Omar Najar

May 10, 2026, 6:37:32 AM
to Fadi Zaraket, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Dear Fadi,

I saw the impressive work of the NeoAraBERT team (ACRPS/U4RASD and AUB) and decided to build a small step on top of it by training what is, to the best of my knowledge, the first sentence-embedding head for NeoAraBERT_MSA.

The model, NeoAraBERT-MSA-Synonym-Matryoshka-V1 [Model on HF], was trained using a multi-source contrastive recipe. The result: the model reaches around 80% accuracy on Muradif, while existing Arabic sentence-transformer models built on AraBERT-vocab backbones drop to around 30% on the same benchmark. The gap appears to come largely from tokenization: AraBERT-style vocabularies struggle to represent diacritized Arabic words effectively, while NeoAraBERT’s diacritics-aware backbone gives the embedding model a much stronger base.
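
For anyone who wants to try a similar recipe, here is a rough sketch of Matryoshka-style contrastive training with the sentence-transformers v3 API. The checkpoint id, the toy synonym pairs, and the dimension list (which assumes a 768-dim encoder) are placeholders, not my actual training setup:

    from datasets import Dataset
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
    from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

    # Placeholder backbone id; NeoBERT-style checkpoints typically need
    # trust_remote_code. Toy (anchor, positive) synonym pairs stand in for
    # the multi-source training data.
    model = SentenceTransformer("acrps/NeoAraBERT_MSA", trust_remote_code=True)
    pairs = Dataset.from_dict({
        "anchor":   ["كَلِمَة", "سَعِيد"],
        "positive": ["لَفْظَة", "فَرِح"],  # synonyms serve as positives
    })

    base = MultipleNegativesRankingLoss(model)  # other in-batch pairs act as negatives
    loss = MatryoshkaLoss(model, base, matryoshka_dims=[768, 512, 256, 128, 64])
    SentenceTransformerTrainer(model=model, train_dataset=pairs, loss=loss).train()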

This matters especially for classical Arabic, religious texts, and dictionary/thesaurus/synonym-retrieval applications, where diacritics are not just decoration; they carry meaning.
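
You can probe the vocabulary issue directly by tokenizing a diacritized word; a quick check along these lines (exact output varies by checkpoint) tends to show AraBERT-style vocabularies shattering the word into many pieces or falling back to [UNK]:

    from transformers import AutoTokenizer

    word = "كِتَابٌ"  # "a book", fully diacritized
    tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
    # Expect heavy fragmentation or [UNK] pieces, since this vocabulary
    # was built largely from undiacritized text.
    print(tok.tokenize(word))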

For more details: https://www.linkedin.com/posts/omarnj_aepaesaetaecaeraffaesaer-neoarabert-aepaesaehaezaeqaey-activity-7459189624100462592-DoiX?utm_source=share&utm_medium=member_desktop&rcm=ACoAAChp0cYBTrzMocuvjgI_4ZoZdyVDnmyHYJA

Huge thanks to the NeoAraBERT authors for releasing this important backbone. 

Best, 

Omar



Fadi Zaraket

May 11, 2026, 3:26:14 PM
to Omar Najar, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Thank you, Omar. We are happy that you found the model useful; the team will celebrate again now :). I noticed that your STS results with the model differ from those of a trial we are conducting now. We will share it shortly, and perhaps we can discuss the differences.

Cheers, 

Fadi 


Omar Najar

May 11, 2026, 4:28:29 PM
to Fadi Zaraket, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Thank you for your kind message. 

I would be very interested to compare the STS results with the trial you are currently conducting. The results I shared were based on a zero-shot MTEB evaluation setup, where the model was loaded directly from Hugging Face without additional fine-tuning. For NeoAraBERT specifically, since it is a bare encoder rather than a SentenceTransformer model, I used a custom wrapper around AutoModel and AutoTokenizer with CLS/mean pooling before passing the embeddings to MTEB.
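
Concretely, the wrapper looks roughly like this; the checkpoint id is a placeholder, and this is a sketch of the idea rather than my exact script (MTEB essentially needs an encode() method returning a 2-D array):

    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    class BareEncoder:
        """Minimal MTEB-compatible wrapper around a bare encoder."""
        def __init__(self, name: str, pooling: str = "mean"):
            self.tok = AutoTokenizer.from_pretrained(name)
            self.model = AutoModel.from_pretrained(name, trust_remote_code=True).eval()
            self.pooling = pooling

        @torch.no_grad()
        def encode(self, sentences, batch_size: int = 32, **kwargs) -> np.ndarray:
            out = []
            for i in range(0, len(sentences), batch_size):
                batch = self.tok(list(sentences[i:i + batch_size]), padding=True,
                                 truncation=True, return_tensors="pt")
                h = self.model(**batch).last_hidden_state  # (batch, seq, dim)
                if self.pooling == "cls":
                    emb = h[:, 0]                          # [CLS] vector
                else:
                    m = batch["attention_mask"].unsqueeze(-1)
                    emb = (h * m).sum(1) / m.sum(1)        # mean over real tokens
                out.append(emb.cpu().numpy())
            return np.concatenate(out)

    # e.g. mteb.MTEB(tasks=...).run(BareEncoder("acrps/NeoAraBERT_MSA"))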

I will send the code for full transparency. It would be great to discuss any differences in setup, pooling strategy, preprocessing, or evaluation configuration, as these may explain the variation in STS scores.

Thank you again, and I look forward to comparing notes.

Best regards,
Omar
