Recommendations for Arabic Dialect Classification Models


Ali Al-laith

Aug 28, 2024, 7:26:08 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Dear all,

I am seeking your recommendations for models trained to classify Arabic dialects (at the country or city level).
Currently, I am evaluating the predictions of 7 models on 300,000 Arabic text samples from different dialects, to understand and analyze the variations in their predictions. I am using the following models:

# model_huggingface = 'lafifi-24/arbert_arabic_dialect_identification'
# model_huggingface = 'CAMeL-Lab/bert-base-arabic-camelbert-msa-did-madar-twitter5'
# model_huggingface = 'CAMeL-Lab/bert-base-arabic-camelbert-mix-did-nadi'
# model_huggingface = 'AMR-KELEG/ADI-NADI-2023'
# model_huggingface = 'Ammar-alhaj-ali/arabic-MARBERT-dialect-identification-city'
# model_huggingface = 'Abdelrahman-Rezk/bert-base-arabic-camelbert-msa-finetuned-Arabic_Dialect_Identification_model_1'
# model_huggingface = 'AMR-KELEG/NADI2024-baseline'
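
A minimal sketch of how the predictions of several such models could be compared, assuming each model can be loaded through the Hugging Face `transformers` text-classification pipeline. The helper names `load_classifier` and `pairwise_agreement` are hypothetical, and the toy predictions are made up for illustration; they are not real model outputs.

```python
from itertools import combinations


def load_classifier(model_id):
    """Wrap a Hugging Face text-classification pipeline as a text -> label function."""
    # Imported lazily so the agreement helper below works without transformers installed.
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id)
    return lambda text: clf(text, truncation=True)[0]["label"]


def pairwise_agreement(predictions):
    """Fraction of samples on which each pair of models outputs the same label.

    predictions: dict mapping a model name to its list of predicted labels,
    with all lists aligned to the same samples.
    """
    agreement = {}
    for a, b in combinations(sorted(predictions), 2):
        same = sum(x == y for x, y in zip(predictions[a], predictions[b]))
        agreement[(a, b)] = same / len(predictions[a])
    return agreement


# Toy, made-up predictions for three samples (NOT real model outputs).
toy_predictions = {
    "model_a": ["EG", "SD", "MA"],
    "model_b": ["EG", "EG", "MA"],
    "model_c": ["SD", "SD", "DZ"],
}
toy_agreement = pairwise_agreement(toy_predictions)
```

The listed models may not share one label set (one model name refers to city-level labels, for example), so their outputs would likely need to be mapped to a common scheme before agreement scores are meaningful.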


If you have any suggestions for additional models that are effective in dialect classification, I would greatly appreciate it.

Best regards,
Ali Al-Laith

Nizar Habash

Aug 28, 2024, 7:44:42 AM
to Ali Al-laith, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hi Ali - you can compare against classical methods such as the model of Salameh et al. (2018), which is now implemented in CAMeL Tools.
Cheers
Nizar




--
Nizar Habash
Professor of Computer Science
New York University Abu Dhabi
https://www.nizarhabash.com/ 

Hamdy S. Hussein

Aug 28, 2024, 7:52:46 AM
to Nizar Habash, Ali Al-laith, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Salam Ali,

 

We shared 540K tweets classified at the country level here:

https://alt.qcri.org/resources/qadi/

Paper: https://aclanthology.org/2021.wanlp-1.1.pdf

 

Please let me know if you need the tweet texts, as we shared only the tweet IDs.

 

Best,

Hamdy

 

Hamdy S. Hussein

Principal Software Engineer

Qatar Computing Research Institute

+974 445 41679

www.hbku.edu.qa 

 


Ali Al-laith

Aug 29, 2024, 6:40:38 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Thank you, Nizar and Hamdy. I will definitely incorporate the classical models into my experiments as well.
@Hamdy, could you please share the tweet texts with me? I'm finding it challenging to extract them using their IDs.

Best,
Ali

Amr Keleg

Aug 29, 2024, 11:31:46 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hi Ali,

I am glad to hear about your interest in Arabic Dialect Identification. I just wanted to share some thoughts about the task formulation, mainly the limitations of framing it as single-label classification. For instance, we found that ~70% of the samples in NADI 2024's evaluation set are valid in more than one country-level dialect (please note that we only covered 9 dialects, so the true percentage can only be larger).
I have also added more information about developing a competitive baseline for multi-label Dialect Identification here: https://huggingface.co/AMR-KELEG/NADI2024-baseline#baseline-i-top-90
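
As a rough illustration of how a single-label classifier's output could be turned into a multi-label prediction, here is a minimal sketch of cumulative-probability ("top-p") label selection. It assumes that "Top 90%" means keeping the most probable dialects until their probabilities sum to at least 0.9; the model card linked above is the authoritative description. The function name `top_p_labels` and the example probabilities are hypothetical.

```python
def top_p_labels(probs, p=0.9):
    """Return the smallest set of top-ranked labels whose probabilities sum to >= p.

    probs: dict mapping a dialect label to the classifier's probability for it.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for label, prob in ranked:
        selected.append(label)
        cumulative += prob
        if cumulative >= p:
            break
    return selected


# Made-up probabilities for one sentence (NOT real model output).
example_probs = {"Egypt": 0.60, "Sudan": 0.32, "Algeria": 0.05, "Yemen": 0.03}
example_labels = top_p_labels(example_probs, p=0.9)
```

With this scheme a confident prediction yields a single dialect, while a flatter distribution yields several, which matches the observation that many sentences are valid in more than one dialect.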

You can find more details in the following papers:
* Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification (Keleg & Magdy, ArabicNLP-WS 2023)
* NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task (Abdul-Mageed et al., ArabicNLP-WS 2024)

Happy to discuss these ideas further if you are interested.
Wishing you all the best!

Best Regards,
Amr


Amr Keleg (عمرو قلج)
PhD student - CDT in NLP
University of Edinburgh
https://amr-keleg.github.io/

Abdul-Mageed, Muhammad

Aug 29, 2024, 11:47:07 AM
to Amr Keleg, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Adding to the discussion: as people work on this and similar tasks, I believe they should keep in mind issues related to “language production” vs. “language perception”. For example, a text produced by a speaker of a particular dialect can be perceived by an annotator as belonging to another dialect. 

Best,
Muhammad 


Hamdy S. Hussein

Aug 29, 2024, 1:04:31 PM
to Abdul-Mageed, Muhammad, Amr Keleg, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Agree. We mentioned this in our QADI paper:

https://aclanthology.org/2021.wanlp-1.1.pdf

 

“… Similar to the results observed for both the Gulf and Levant regions, the Maghrebi dialects (MA, DZ, LY, TN) exhibit a similar pattern.

MA and DZ account for considerable confusion. For instance, the tweet الله يبارك فيك خويا (God bless you, brother!!), could be used in both dialects.

As for the Nile Basin dialects, Egyptian (EG) and Sudanese (SD) could also be confused with one another.

The tweet التويتة دي معدلة فوتوشوب (This tweet is modified in Photoshop), is equally valid in both dialects.”

 

Given only a dialectal text (especially a short one), in many cases it is hard to assign it to a single dialect.

If we add voice, the task will be easier.

Asking native speakers to pronounce the sentences that have more than one country-label can be very useful for detailed comparative studies.

 

Best,

Hamdy

 


Amr Keleg

Aug 29, 2024, 1:31:53 PM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Sure, Hamdy!
I definitely enjoyed reading the error analysis section of the QADI paper. In addition to this observation, we also found that the overlap does not only occur between dialects spoken in neighboring countries; it can also extend to those that are perceived to be different from each other.

For instance:
وين يلعب هذا ماشفته - valid in Algeria, Palestine, Yemen
لمن الحياة ترسل ليك رسالة - valid in Sudan, Palestine, Yemen 
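
One way to account for such sentences when scoring a single-label model is to count a prediction as correct if it falls anywhere in the set of valid dialects. The sketch below illustrates this idea using the two sentences above; the function name `accuracy_against_valid_sets` and the predicted labels are hypothetical, and this is not necessarily the metric used in NADI 2024.

```python
def accuracy_against_valid_sets(predicted, valid_sets):
    """Share of predictions that fall inside the set of valid dialects for their sentence."""
    hits = sum(pred in valid for pred, valid in zip(predicted, valid_sets))
    return hits / len(predicted)


# Valid dialects for the two example sentences, as listed above.
valid_dialects = {
    "وين يلعب هذا ماشفته": {"Algeria", "Palestine", "Yemen"},
    "لمن الحياة ترسل ليك رسالة": {"Sudan", "Palestine", "Yemen"},
}

# Made-up single-label predictions (NOT real model output), one per sentence.
predicted_labels = ["Yemen", "Egypt"]
score = accuracy_against_valid_sets(predicted_labels, list(valid_dialects.values()))
```

Under strict single-label scoring with, say, Algeria as the only gold label for the first sentence, the "Yemen" prediction would be counted as an error even though the sentence is valid there.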
------------------------------------------------

I like the idea of asking speakers of different dialects to pronounce written sentences that are known to be valid in those countries, and studying how adding a speech signal makes these sentences more distinguishable from each other. That would be a great extension to NADI 2024's evaluation dataset. However, collecting these recordings is expected to be a bit tricky :)
 
Best,
Amr

Omer Said

Aug 30, 2024, 5:28:15 AM
to Amr Keleg, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hello everyone,
Professor Hamdy, it seems that audio recording will be difficult, as you all noted in your enriching discussion. It also seems that the goal of adding recordings is to distinguish between texts that can be used in several dialects. This could be achieved by studying the stress and intonation of each dialect and encoding them as specific markers. To give one example among many, the difference between declarative sentences and other sentences, such as interrogative ones, shows up through punctuation marks: أحمد في البيت. (Ahmed is at home.) versus أحمد في البيت؟ (Is Ahmed at home?). The two examples have different intonation when spoken, and the difference became visible in the written examples once we added the mark.

Regards,
Omer Al-Shahri (عمر الشحري)
Independent linguistics researcher

Rania Bouaziz

Sep 29, 2024, 11:36:22 AM
to Omer Said, Amr Keleg, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hello everyone
I'm looking for a dialect database based on expressions and texts, not just words. Is there one available?
Thank you very much
Rania Bouaziz
University of Manouba

Amr Keleg

Oct 2, 2024, 12:00:12 PM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Hi Rania,

May I ask about the goal you are trying to achieve?
If you are interested in studying expressions from different Arabic-speaking countries then this site might be useful: https://en.mo3jam.com/term/%D8%A7%D8%AD%D8%A8%D9%83%D9%85%20%D8%A8%D8%B2%D8%A7%D9%81#Moroccan (Please note that the site has lots of dialectal swear words).
If you want to build a Dialect Identification model/system then I would refer you to the NADI 2024 shared task (https://nadi.dlnlp.ai/), for which I can ask the co-organizers for permission to share the task's training and development sets with you.

Best,
Amr