Vision-Language Dataset for Arabic


Mohamed Khenchouch

Nov 20, 2025, 12:45:04 PM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hi everyone, I'm looking for an open-source vision-language dataset (specifically image captions) for Arabic.
Thank you for considering my request for help.
Best regards

Firoj Alam

Nov 20, 2025, 12:58:23 PM
to Mohamed Khenchouch, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hi Mohamed,
Please check this paper. If this is something you are interested in, please let me know. We would be happy to collaborate.
Firoj


................
Firoj Alam, PhD
https://firojalam.one



Abdul-Mageed, Muhammad

Nov 20, 2025, 1:20:33 PM
to Mohamed Khenchouch, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Mohamed,

Thank you for this.

We have recently released several vision-language (multimodal) models and datasets that are relevant to your request, particularly for image captioning and visual question answering in Arabic and its dialects.

Here are the details for our relevant work (PEARL, Peacock, and Dallah), including links to the papers, code, and models:

1. PEARL: A Multimodal Culturally-Aware Arabic Instruction Dataset

PEARL is a large-scale multimodal dataset explicitly designed for cultural understanding and visual question answering; a minimal loading sketch follows the list below.

2. Peacock: A Family of Arabic Multimodal Large Language Models

Peacock is a family of Arabic MLLMs with strong vision and language capabilities. This project also introduces Henna, a benchmark for assessing cultural aspects in multimodal models.

3. Dallah: A Dialect-Aware Multimodal Large Language Model

Dallah is an advanced multimodal assistant specifically tailored for a number of Arabic dialects.
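
As a minimal loading sketch (assuming these resources are on the Hugging Face Hub; the repository ID and split below are illustrative, so please verify them against each paper and repository):

    # Sketch: load an Arabic vision-language dataset with the Hugging Face
    # `datasets` library. NOTE: "UBC-NLP/PEARL" and split="test" are
    # assumptions; check the PEARL paper/repo for the authoritative Hub ID
    # and available splits before use.
    from datasets import load_dataset

    pearl = load_dataset("UBC-NLP/PEARL", split="test")

    # Field names vary by dataset, so inspect one example's schema first.
    example = pearl[0]
    print(example.keys())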

For a complete list of our work, you can also visit our main profiles here:

BibTeX entries are listed below.

I hope these resources are helpful for your work!

Best regards,

Muhammad Abdul-Mageed
Canada Research Chair in Natural Language Processing and Machine Learning
Associate Professor
Chair, Minor in Informatics (iSchool)
Linguistics and School of Information (cross-appointed); Computer Science (courtesy)
The University of British Columbia | Vancouver Campus


BibTeX Citations

@inproceedings{alwajih-etal-2025-pearl,
    title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
    author = "Alwajih, Fakhraddin  and
      Magdy, Samar M.  and
      El Mekki, Abdellah  and
      Nacar, Omer  and
      Nafea, Youssef  and
      Abdelfadil, Safaa Taher  and
      Yahya, Abdulfattah Mohammed  and
      Luqman, Hamzah  and
      Almarwani, Nada  and
      Aloufi, Samah  and
      Qawasmeh, Baraah  and
      Atou, Houdaifa  and
      Sibaee, Serry  and
      Alsayadi, Hamzah A.  and
      Al-Dhabyani, Walid  and
      Al-shaibani, Maged S.  and
      El aatar, Aya  and
      Qandos, Nour  and
      Alhamouri, Rahaf  and
      Ahmad, Samar  and
      AL-Ghrawi, Mohammed Anwar  and
      Yacoub, Aminetou  and
      AbuHweidi, Ruwa  and
      Lemin, Vatimetou Mohamed  and
      Abdel-Salam, Reem  and
      Bashiti, Ahlam  and
      Ammar, Adel  and
      Alansari, Aisha  and
      Ashraf, Ahmed  and
      Alturayeif, Nora  and
      Alcoba Inciarte, Alcides  and
      Elmadany, AbdelRahim A.  and
      Tourad, Mohamedou Cheikh  and
      Berrada, Ismail  and
      Jarrar, Mustafa  and
      Shehata, Shady  and
      Abdul-Mageed, Muhammad",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1254/",
    doi = "10.18653/v1/2025.findings-emnlp.1254",
    pages = "23048--23079",
    ISBN = "979-8-89176-335-7",
    abstract = "Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available."
}

@inproceedings{alwajih-etal-2024-peacock,
    title = "Peacock: A Family of {A}rabic Multimodal Large Language Models and Benchmarks",
    author = "Alwajih, Fakhraddin  and
      Nagoudi, El Moatez Billah  and
      Bhatia, Gagan  and
      Mohamed, Abdelrahman  and
      Abdul-Mageed, Muhammad",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.689/",
    doi = "10.18653/v1/2024.acl-long.689",
    pages = "12753--12776",
    abstract = "Multimodal large language models (MLLMs) have proven effective in a wide range of tasks that require complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, the success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, even those with large speaker populations, such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed *Peacock*, with strong vision and language capabilities. Through comprehensive qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce *Henna*, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, setting the first stone for culturally-aware Arabic MLLMs. The GitHub repository for the *Peacock* project is available at [https://github.com/UBC-NLP/peacock](https://github.com/UBC-NLP/peacock)."
}

@inproceedings{alwajih-etal-2024-dallah,
    title = "Dallah: A Dialect-Aware Multimodal Large Language Model for {A}rabic",
    author = "Alwajih, Fakhraddin  and
      Bhatia, Gagan  and
      Abdul-Mageed, Muhammad",
    editor = "Habash, Nizar  and
      Bouamor, Houda  and
      Eskander, Ramy  and
      Tomeh, Nadi  and
      Abu Farha, Ibrahim  and
      Abdelali, Ahmed  and
      Touileb, Samia  and
      Hamed, Injy  and
      Onaizan, Yaser  and
      Alhafni, Bashar  and
      Antoun, Wissam  and
      Khalifa, Salam  and
      Haddad, Hatem  and
      Zitouni, Imed  and
      AlKhamissi, Badr  and
      Almatham, Rawan  and
      Mrini, Khalil",
    booktitle = "Proceedings of the Second Arabic Natural Language Processing Conference",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.arabicnlp-1.27/",
    doi = "10.18653/v1/2024.arabicnlp-1.27",
    pages = "320--336",
    abstract = "Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed ***Dallah***, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. ***Dallah*** demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning six Arabic dialects, ***Dallah*** showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, ***Dallah*** has the potential to pave the way for further development of dialect-aware Arabic MLLMs."
}

Mohamed Khenchouch

Nov 20, 2025, 4:44:45 PM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Thank you for your assistance. Your suggestions have been extremely valuable, and I have learned a great deal from the contributions you shared.
@Firoj Alam, I find your research highly inspiring, and I would welcome the opportunity to collaborate with you if you are willing.

Hanan Aldarmaki

Nov 21, 2025, 8:34:49 AM
to Mohamed Khenchouch, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Check out this dataset: 

It is open source and contains both image captions (in MSA and dialects) and question/answer pairs. The images are also sourced from the Arab region.
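
If it helps, here is a rough sketch of how one might separate the caption annotations from the question/answer pairs after loading it (the Hub ID and field names below are hypothetical placeholders, since they depend on the dataset's actual schema):

    # Sketch only: split caption examples from question/answer examples.
    # "example-org/arabic-vl-dataset" and the field names are hypothetical
    # placeholders -- substitute the dataset's real Hub ID and schema.
    from datasets import load_dataset

    ds = load_dataset("example-org/arabic-vl-dataset", split="train")

    captions = ds.filter(lambda ex: ex.get("caption") is not None)
    qa_pairs = ds.filter(lambda ex: ex.get("question") is not None)
    print(len(captions), len(qa_pairs))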



Best,

Hanan


