Subject: New Paper on Arabic LLM Evaluation: From Guidelines to Practice
Dear SIG Arabic NLP community,
I’m happy to share my latest research:
"From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation."
🔍 What’s new?
- We lay out comprehensive theoretical guidelines for evaluating Arabic LLMs.
- We introduce ADMD, a carefully designed dataset of 490 challenging questions across 10 major domains and 42 sub-domains, focused on cultural depth and linguistic precision.
- We evaluated top models (GPT-4, Claude 3.5, Gemini 1.5, CommandR, Qwen), and the results reveal serious gaps, especially in culturally nuanced or domain-specific tasks.
📊 Key insight: Claude 3.5 Sonnet achieved the highest accuracy at just 30%, showing how much progress is still needed.
We hope this work inspires better tools and deeper conversations around Arabic LLM evaluation.
Read the paper here: https://arxiv.org/abs/2506.01920
View the dataset here: https://huggingface.co/datasets/riotu-lab/ADMD
Best regards,
Serry Sibaee - Riotu Labs - PSU