We are pleased to announce the NAKBA NLP 2026: Arabic Manuscript Understanding Shared Task, a research competition advancing Arabic manuscript OCR and transcription technologies. We invite researchers, practitioners, and students to participate in this initiative that provides an open, curated benchmark for Arabic manuscript processing.
The shared task features a substantial dataset drawn from the Omar Al-Saleh Memoir Collection, spanning 16 historical documents from 1951 to 1965:
6,395 high-resolution manuscript pages
1,597,025 words
50,685 sentences
50,672 paragraphs
Expert-verified, line-level transcriptions
Subtask 1: Manual transcription of unseen manuscript pages at the line level, to enrich the benchmark with high-quality ground truth.
Each team receives ~500 line images (mandatory batch)
Additional batches available for bonus contributions
CodaBench submission: https://acr.ps/1L9F2KW
Evaluation Criteria:
Coverage completeness
Transcription accuracy (CER, WER)
Quality of submitted transcription guidelines
Important: No generative AI tools may be used for transcription in this track.
Subtask 2: Development of automatic OCR/HTR systems for Arabic manuscripts.
Dataset splits:
Training set: ~15,962 line images with gold transcriptions
Development set: ~1,774 line images with gold transcriptions
Test set: ~2,095 line images (held out)
CodaBench submission: https://acr.ps/1L9F2LL
Evaluation Metrics:
Primary: Character Error Rate (CER)
Secondary: Word Error Rate (WER)
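For teams who want to sanity-check their outputs locally before submitting, the two metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the official scorer: it takes CER as the character-level Levenshtein distance divided by the reference length, and WER as the same ratio over whitespace-separated words. The function names (edit_distance, cer, wer) are illustrative, and any text normalization applied by the CodaBench scorer (diacritics, whitespace, punctuation) is unknown here and not reproduced.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edits / reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word edits / number of reference words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical line pair: the hypothesis drops one letter.
    reference = "كتاب قديم"
    hypothesis = "كتاب قدم"
    print(f"CER = {cer(reference, hypothesis):.3f}")
    print(f"WER = {wer(reference, hypothesis):.3f}")
```

Note that this sketch scores one line at a time. A corpus-level score may instead sum edits and reference lengths over all lines before dividing, which generally gives a different number than averaging per-line rates.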
Baseline Resources:
HuggingFace: https://huggingface.co/U4RASD/ar-ms-baseline
January 1, 2026 – Call for Participation
January 10, 2026 – Release of Training Data / Transcription Phase Begins
February 10, 2026 – Evaluation Period Begins (Subtask 2)
February 17, 2026 – Evaluation Period Ends (Both Subtasks)
February 21, 2026 – Results Announcement
March 1, 2026 – System Paper Submission Deadline
March 15, 2026 – Acceptance Notification
March 21, 2026 – Camera-Ready Deadline
All participating teams must submit a 4-page system description paper
Teams may participate in one or both subtasks
Registration required to receive dataset access
Submissions will undergo peer review via OpenReview
Selected papers will be published in NAKBA NLP 2026 proceedings
Contributed transcriptions will be released under CC-BY-4.0 (where licensing permits)
Fadi Zaraket (Arab Center for Research and Policy Studies / American University of Beirut)
Bilal Shalash (Arab Center for Research and Policy Studies)
Hadi Hamoud (Arab Center for Research and Policy Studies)
Ahmad Chamseddine (Arab Center for Research and Policy Studies)
Firas Ben Abid (Zinki AI)
Mustafa Jarrar (Hamad Bin Khalifa University / Birzeit University)
Chadi Abou Chakra (Arab Center for Research and Policy Studies)
Andrew Naaem (Arab Center for Research and Policy Studies)
Bernard Ghanem (King Abdullah University of Science and Technology)
Monther Salahat (Birzeit University)
Register via the application form
Access the datasets upon approval (via Google Drive)
Submit results through the CodaBench page for each subtask
Submit a system description paper by March 1, 2026
For questions, please contact: ar...@dohainstitute.edu.qa