How Smart Are LLMs at Arabic Knowledge, Really?


سري السباعي

Aug 3, 2025, 4:22:55 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing

Subject: New Paper on Arabic LLM Evaluation: From Guidelines to Practice

Dear SIG Arabic NLP community,

I’m happy to share my latest research:
"From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation."

🔍 What’s new?

  • We lay out comprehensive theoretical guidelines for evaluating Arabic LLMs.

  • We introduce ADMD, a carefully designed dataset of 490 challenging questions across 10 major domains and 42 sub-domains—focused on cultural depth and linguistic precision.

  • We evaluated top models (GPT-4, Claude 3.5, Gemini 1.5, Command R, Qwen), and the results reveal serious gaps, especially in culturally nuanced or domain-specific tasks.

📊 Key insight: Claude 3.5 Sonnet achieved the highest accuracy at just 30%, showing how much progress is still needed.

We hope this work inspires better tools and deeper conversations around Arabic LLM evaluation.

Read the paper here: https://arxiv.org/abs/2506.01920
Explore the dataset here: https://huggingface.co/datasets/riotu-lab/ADMD
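
If you want a quick look at the data, here is a minimal loading sketch using the Hugging Face `datasets` library. The split name and column names below are placeholders rather than the confirmed schema, so please check the dataset card for the exact fields:

```python
# Minimal sketch: browse ADMD with the Hugging Face `datasets` library.
# The split and column names are placeholders; see the dataset card at
# https://huggingface.co/datasets/riotu-lab/ADMD for the exact schema.
from datasets import load_dataset

admd = load_dataset("riotu-lab/ADMD", split="train")  # split name assumed
print(admd)  # shows the features and row count

for row in admd.select(range(3)):
    # "domain" and "question" are hypothetical column names.
    print(row.get("domain"), "->", row.get("question"))
```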

Best regards,
Serry Sibaee - Riotu Labs - PSU


Hesham Haroon

Aug 5, 2025, 5:35:53 AM
to سري السباعي, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Dear Serry,

Thank you for sharing your latest research, "From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation." This paper indeed makes a valuable contribution to the field of Arabic NLP evaluation, and the ADMD dataset is a very interesting development.

I have a few recommendations that I believe could further strengthen this important work:

  * **Expand the Dataset:** To enhance the statistical reliability of the evaluation, I recommend increasing the number of questions to at least 50-100 per domain. This would provide a more robust basis for assessing model performance across the diverse topics covered; a quick confidence-interval calculation after this list illustrates why.

  * **Improve Evaluation Methodology:**

      * It would be beneficial to implement clear rubrics for the 4-level evaluation scheme to ensure consistency and transparency in scoring.
      * Reporting inter-annotator agreement (e.g., weighted Cohen's kappa; see the sketch after this list) would also be crucial to validate the reliability of human judgments.
      * Additionally, consider exploring automated evaluation metrics where applicable, as they could complement the human-based assessment.

  * **Enhance Analysis:**

      * Providing qualitative error analysis with specific examples would offer deeper insights into the types of mistakes LLMs are making.
      * Analyzing cultural failures and linguistic failures separately could highlight distinct areas for improvement.
      * Including statistical significance testing when comparing model performances (e.g., a paired McNemar test; a sketch follows this list) would add further rigor to the findings.

  * **Validate Guidelines:** It would be highly valuable to systematically show how ADMD adheres to the proposed theoretical guidelines. This would demonstrate the practical application and effectiveness of your framework.

  * **Broader Evaluation:** Expanding the evaluation to include more Arabic-specific models and exploring various prompting strategies could offer a more comprehensive understanding of the current landscape of Arabic LLMs.
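
To make the sample-size point concrete, here is a rough sketch of the 95% confidence interval around a per-domain accuracy estimate at the current size (49 questions per domain, assuming the 490 questions are split evenly across the 10 domains) versus 100 questions. The 30% figure simply reuses the paper's headline accuracy as an illustrative value:

```python
# Rough sketch: 95% normal-approximation CI half-width for per-domain accuracy.
# n = 49 assumes 490 questions spread evenly over 10 domains; p = 0.30 reuses
# the headline accuracy purely as an illustrative value.
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (49, 100):
    print(f"n={n}: accuracy 30% +/- {ci_half_width(0.30, n) * 100:.1f} points")
# n=49 gives roughly +/- 12.8 points, n=100 roughly +/- 9.0, which is why
# more questions per domain would tighten cross-model comparisons.
```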
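
For the agreement point, a minimal sketch using scikit-learn's Cohen's kappa is below. Quadratic weights suit an ordinal 4-level rubric; the two label lists are fabricated examples, not real annotations:

```python
# Minimal sketch: inter-annotator agreement on a 4-level ordinal rubric.
# The label lists are fabricated examples, not real ADMD annotations.
from sklearn.metrics import cohen_kappa_score

annotator_a = [3, 2, 0, 1, 3, 2, 1, 0, 2, 3]  # scores on a 0-3 scale
annotator_b = [3, 1, 0, 1, 2, 2, 1, 0, 3, 3]

# Quadratic weighting penalizes large disagreements more than adjacent ones,
# matching the ordinal nature of a 4-level scheme.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Quadratically weighted Cohen's kappa: {kappa:.2f}")
```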
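
And since every model answers the same 490 questions, model comparisons are paired, so an exact McNemar test on the discordant pairs is a natural choice for significance testing. A sketch using only SciPy follows; the counts are invented for illustration:

```python
# Sketch: exact McNemar test for two models evaluated on the same questions.
# Only discordant pairs matter: questions where exactly one model is correct.
# The counts below are invented for illustration, not results from the paper.
from scipy.stats import binomtest

b = 31  # questions model A got right and model B got wrong (hypothetical)
c = 17  # questions model B got right and model A got wrong (hypothetical)

# Under H0 (equal accuracy) each discordant pair is a fair coin flip, so the
# exact McNemar test reduces to a two-sided binomial test on b out of b + c.
result = binomtest(b, b + c, p=0.5, alternative="two-sided")
print(f"Exact McNemar p-value: {result.pvalue:.3f}")
```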

I hope these recommendations are helpful and contribute to the continued development of this promising research.

Best regards,
Hesham

