How Smart Are LLMs at Arabic Knowledge, Really?


سري السباعي

Aug 3, 2025, 4:22:55 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing

Subject: New Paper on Arabic LLM Evaluation: From Guidelines to Practice

Dear SIG Arabic NLP community,

I’m happy to share my latest research:
"From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation."

🔍 What’s new?

  • We lay out comprehensive theoretical guidelines for evaluating Arabic LLMs.

  • We introduce ADMD, a carefully designed dataset of 490 challenging questions across 10 major domains and 42 sub-domains—focused on cultural depth and linguistic precision.

  • We evaluated top models (GPT-4, Claude 3.5, Gemini 1.5, Command R, Qwen), and the results reveal serious gaps, especially in culturally nuanced or domain-specific tasks.

📊 Key insight: Claude 3.5 Sonnet achieved the highest accuracy at just 30%, showing how much progress is still needed.

We hope this work inspires better tools and deeper conversations around Arabic LLM evaluation.

Read the paper here: https://arxiv.org/abs/2506.01920
Explore the dataset here: https://huggingface.co/datasets/riotu-lab/ADMD
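
If you want a quick look at the data, here is a minimal loading sketch using the Hugging Face `datasets` library. The split name and column names below are placeholders rather than the confirmed schema, so please check the dataset card for the exact fields:

```python
# Minimal sketch: browse ADMD with the Hugging Face `datasets` library.
# The split and column names are placeholders; see the dataset card at
# https://huggingface.co/datasets/riotu-lab/ADMD for the exact schema.
from datasets import load_dataset

admd = load_dataset("riotu-lab/ADMD", split="train")  # split name assumed
print(admd)  # shows the features and row count

for row in admd.select(range(3)):
    # "domain" and "question" are hypothetical column names.
    print(row.get("domain"), "->", row.get("question"))
```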

Best regards,
Serry Sibaee - Riotu Labs - PSU


Hesham Haroon

Aug 5, 2025, 5:35:53 AM
to سري السباعي, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Dear Serry,

Thank you for sharing your latest research, "From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation." This paper indeed makes a valuable contribution to the field of Arabic NLP evaluation, and the ADMD dataset is a very interesting development.

I have a few recommendations that I believe could further strengthen this important work:

  * **Expand the Dataset:** To enhance the statistical reliability of the evaluation, I recommend increasing the number of questions to at least 50-100 per domain. This would provide a more robust basis for assessing model performance across the diverse topics covered; a quick confidence-interval calculation after this list illustrates why.

  * **Improve Evaluation Methodology:**

      * It would be beneficial to implement clear rubrics for the 4-level evaluation scheme to ensure consistency and transparency in scoring.
      * Reporting inter-annotator agreement (e.g., weighted Cohen's kappa; see the sketch after this list) would also be crucial to validate the reliability of human judgments.
      * Additionally, consider exploring automated evaluation metrics where applicable, as they could complement the human-based assessment.

  * **Enhance Analysis:**

      * Providing qualitative error analysis with specific examples would offer deeper insights into the types of mistakes LLMs are making.
      * Analyzing cultural failures and linguistic failures separately could highlight distinct areas for improvement.
      * Including statistical significance testing when comparing model performances (e.g., a paired McNemar test; a sketch follows this list) would add further rigor to the findings.

  * **Validate Guidelines:** It would be highly valuable to systematically show how ADMD adheres to the proposed theoretical guidelines. This would demonstrate the practical application and effectiveness of your framework.

  * **Broader Evaluation:** Expanding the evaluation to include more Arabic-specific models and exploring various prompting strategies could offer a more comprehensive understanding of the current landscape of Arabic LLMs.
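
To make the sample-size point concrete, here is a rough sketch of the 95% confidence interval around a per-domain accuracy estimate at the current size (49 questions per domain, assuming the 490 questions are split evenly across the 10 domains) versus 100 questions. The 30% figure simply reuses the paper's headline accuracy as an illustrative value:

```python
# Rough sketch: 95% normal-approximation CI half-width for per-domain accuracy.
# n = 49 assumes 490 questions spread evenly over 10 domains; p = 0.30 reuses
# the headline accuracy purely as an illustrative value.
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (49, 100):
    print(f"n={n}: accuracy 30% +/- {ci_half_width(0.30, n) * 100:.1f} points")
# n=49 gives roughly +/- 12.8 points, n=100 roughly +/- 9.0, which is why
# more questions per domain would tighten cross-model comparisons.
```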
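
For the agreement point, a minimal sketch using scikit-learn's Cohen's kappa is below. Quadratic weights suit an ordinal 4-level rubric; the two label lists are fabricated examples, not real annotations:

```python
# Minimal sketch: inter-annotator agreement on a 4-level ordinal rubric.
# The label lists are fabricated examples, not real ADMD annotations.
from sklearn.metrics import cohen_kappa_score

annotator_a = [3, 2, 0, 1, 3, 2, 1, 0, 2, 3]  # scores on a 0-3 scale
annotator_b = [3, 1, 0, 1, 2, 2, 1, 0, 3, 3]

# Quadratic weighting penalizes large disagreements more than adjacent ones,
# matching the ordinal nature of a 4-level scheme.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Quadratically weighted Cohen's kappa: {kappa:.2f}")
```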
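
And since every model answers the same 490 questions, model comparisons are paired, so an exact McNemar test on the discordant pairs is a natural choice for significance testing. A sketch using only SciPy follows; the counts are invented for illustration:

```python
# Sketch: exact McNemar test for two models evaluated on the same questions.
# Only discordant pairs matter: questions where exactly one model is correct.
# The counts below are invented for illustration, not results from the paper.
from scipy.stats import binomtest

b = 31  # questions model A got right and model B got wrong (hypothetical)
c = 17  # questions model B got right and model A got wrong (hypothetical)

# Under H0 (equal accuracy) each discordant pair is a fair coin flip, so the
# exact McNemar test reduces to a two-sided binomial test on b out of b + c.
result = binomtest(b, b + c, p=0.5, alternative="two-sided")
print(f"Exact McNemar p-value: {result.pvalue:.3f}")
```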

I hope these recommendations are helpful and contribute to the continued development of this promising research.

Best regards,
Hesham

