Release: 𝟑𝟬 𝗼𝗽𝗲𝗻-𝘀𝗼𝘂𝗿𝗰𝗲 𝗱𝗮𝘁𝗮𝘀𝗲𝘁𝘀 for Arabic NLP


Mustafa Jarrar

Dec 18, 2025, 5:04:41 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing

In honor of World Arabic Language Day, we are excited to release 30 open-source datasets
for Arabic NLP, language computing, and artificial intelligence.

𝐋𝐞𝐱𝐢𝐜𝐨𝐠𝐫𝐚𝐩𝐡𝐲 & 𝐒𝐞𝐦𝐚𝐧𝐭𝐢𝐜𝐬
Arabic Ontology: 17K concepts, a formal Arabic WordNet. (CC-BY-4.0)
Lexicographic DB: 150 lexicons with a lexicographic search engine (3M lemmas, 1M glosses, 1.5M translation pairs).
Qabas Lexicon: 60K lemmas, linked with 110 lexicons and corpora (2M tokens). (CC-BY-ND-4.0)
Salma WSD: Arabic sense-annotated corpus, 34K tokens. Multilevel annotation: single-word senses, multi-word senses, and NER. (CC-BY-4.0)
Synonyms: Synonyms dataset annotated in parallel by 4 linguists, with fuzzy values. (CC-BY-4.0)
ArabGlossBERT: 167K context-gloss pairs labeled True/False to train a TSV BERT model for WSD. (CC-BY-4.0)


𝐂𝐥𝐚𝐬𝐬𝐢𝐜𝐚𝐥 𝐀𝐫𝐚𝐛𝐢𝐜
QuranMorph: Morphological tagging of the Quran (POS, lemma, etc.); each word is linked with a lemma in Qabas. (CC-BY-4.0)

𝐃𝐢𝐚𝐥𝐞𝐜𝐭𝐬 & 𝐌𝐨𝐫𝐩𝐡𝐨𝐥𝐨𝐠𝐲
Curras: Palestinian dialect corpus, 56K tokens with morphological annotations. (CC-BY-4.0)
Baladi: Lebanese dialect corpus, 10K tokens with morphological annotations. (CC-BY-4.0)
Nabra: Syrian dialect corpus, 60K tokens with morphological annotations. (CC-BY-4.0)
Lisan-Iraqi: Iraqi dialect corpus, 45K tokens with morphological annotations. (CC-BY-4.0)
Lisan-Libyan: Libyan dialect corpus, 51K tokens with morphological annotations. (CC-BY-4.0)
Lisan-Sudanese: Sudanese dialect corpus, 52K tokens with morphological annotations. (CC-BY-4.0)
Lisan-Yemeni: Yemeni dialect corpus, 1.05M tokens with morphological annotations. (CC-BY-4.0)


𝐈𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 𝐄𝐱𝐭𝐫𝐚𝐜𝐭𝐢𝐨𝐧 
WojoodOntology: 55 classes (entity types), 43 relationships; covers NLP tagsets, Wikidata, and Schema.org. (CC-BY-4.0)
Named Entity Recognition
Wojood NER: 550K tokens, MSA, nested, 21 entity types, multi-domain. (CC-BY-4.0)
Konooz NER: 770K tokens, 15 dialects × 10 domains, nested, 21 entity types. (CC-BY-4.0)
WojoodFine: Fine-grained NER corpus, extending Wojood with 31 entity subtypes. (CC-BY-4.0)
WojoodGaza: NER corpus, 60K tokens about the Israeli War on Gaza, using Wojood guidelines. (CC-BY-4.0)
Relation Extraction
WojoodRelations: 550K tokens, annotated with 40 relation types
WojoodHadath: Event-relation extraction corpus, extending Wojood with relations. (CC-BY-4.0)


𝐋𝐋𝐌𝐬 & 𝐌𝐮𝐥𝐭𝐢𝐦𝐨𝐝𝐚𝐥𝐢𝐭𝐢𝐞𝐬
Pearl: Multimodal instruction dataset - 13K+135K captioned images and 16K+309K question-answer pairs. (CC-BY-NC-ND-4.0)
ImageEval: Image caption datasets - 4.5K+4K manually captioned images. (CC-BY-4.0)
Palm: LLM instructions - ~17K QA pairs in MSA and dialects, 22 countries. (CC-BY-NC-ND-4.0)
PalGeoLLM: LLM instructions - 4.6K+15.9K QA pairs about Palestinian geography and history. (CC-BY-4.0)
Casablanca: Multidialectal Arabic speech recognition datasets - 8 dialects. (CC-BY-NC-ND-4.0)


𝐂𝐡𝐚𝐭𝐛𝐨𝐭𝐬 & 𝐃𝐢𝐚𝐥𝐞𝐜𝐭 𝐓𝐫𝐚𝐧𝐬𝐥𝐚𝐭𝐢𝐨𝐧
ArBanking77: Parallel corpora - 15K questions in MSA, Palestinian, Moroccan, Saudi, and Tunisian Arabic, labeled with banking intents. (CC-BY-SA-4.0)

𝐒𝐨𝐜𝐢𝐚𝐥 𝐂𝐨𝐦𝐩𝐮𝐭𝐢𝐧𝐠
LLMs political bias: 1.8K QAs to measure political bias related to Palestine-Israel.
Offensive Hebrew: 16K tweets labeled with hate, violence, racism, and pornography. (CC-BY-4.0)
FigNews: 12K Facebook posts annotated with bias and propaganda in Arabic, Hebrew, English, French, and Hindi. (CC-BY-4.0)


𝐒𝐢𝐧𝐚𝐓𝐨𝐨𝐥𝐬
Open-source Toolkit for Arabic NLP (Outperformed all related tools in all tasks)
Download: https://sina.birzeit.edu/sinatools/ 
Modules: 
  * Morphology Analyser for MSA and dialects
  * Word Sense Disambiguation
  * Named Entity Recognition 
  * Relation Extraction
  * Synonyms (Extend and Evaluate)
  * Diacritic-Based Matching
  * Corpus Tokenizer
  * Text Duplication Detector
  * Similarity Functions (Jaccard, Cosine, etc.)
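
For readers unfamiliar with the similarity measures named in the last module, here is a minimal plain-Python sketch of Jaccard and cosine similarity over token lists. It is not the SinaTools API: the function names `jaccard` and `cosine` and the whitespace tokenization are assumptions made only for illustration.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: shared unique tokens divided by all unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty inputs count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product only needs the tokens present in both inputs.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty input has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical example sentences, split on whitespace (an assumption;
# a real pipeline would use a proper Arabic tokenizer).
sentence_a = "اللغة العربية جميلة".split()
sentence_b = "اللغة العربية غنية".split()
print(jaccard(sentence_a, sentence_b))
print(cosine(sentence_a, sentence_b))
```

Jaccard ignores how often a token occurs, while cosine weights repeated tokens, so the two can rank the same pair of texts differently.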


Enjoy!


SinaLab 
for Computational Linguistics and Artificial Intelligence
  Birzeit University, Palestine
  Hamad Bin Khalifa University, Qatar






