Open Source Arabic Corpora (OSAC) have been released! You can download them from
The corpora include:
- BBC Arabic corpus: collected from bbcarabic.com, includes 4,763 text documents. Each text document belongs to 1 of 7 categories (Middle East News 2356, World News 1489, Business & Economy 296, Sports 219, International Press 49, Science & Technology 232, Art & Culture 122). The corpus contains 1,860,786 (1.8M) words and 106,733 distinct keywords after stop-word removal.
- CNN Arabic corpus: collected from cnnarabic.com, includes 5,070 text documents. Each text document belongs to 1 of 6 categories (Business 836, Entertainments 474, Middle East News 1462, Science & Technology 526, Sports 762, World News 1010). The corpus contains 2,241,348 (2.2M) words and 144,460 distinct keywords after stop-word removal.
- Open Source Arabic Corpus: collected from multiple sites, includes 22,429 text documents. Each text document belongs to 1 of 10 categories (Economics, History, Entertainments, Education & Family, Religious and Fatwas, Sports, Health, Astronomy, Law, Stories, Cooking Recipes). The corpus contains 18,183,511 (about 18M) words and 449,600 distinct keywords after stop-word removal.
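For anyone who wants to compute comparable figures on their own copy of the texts, here is a minimal Python sketch of the two statistics quoted above: total words and distinct keywords after stop-word removal. It is not the procedure used to build OSAC. The function name corpus_stats, the tiny stop-word list, and the tokenization rule (runs of Arabic letters, undiacritized text, no normalization or stemming) are all assumptions made for illustration, as is the choice to count total words before stop-word removal.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only);
# a real run would load a full Arabic stop-word list instead.
STOPWORDS = {"في", "من", "على", "إلى", "عن"}

# Assumption: a token is a run of Arabic letters (U+0621..U+064A).
# Diacritics are not handled, so undiacritized text is assumed.
TOKEN_RE = re.compile(r"[\u0621-\u064A]+")


def corpus_stats(documents, stopwords=STOPWORDS):
    """Return (total_words, distinct_keywords) for an iterable of strings.

    total_words counts every token, stop words included.
    distinct_keywords counts unique tokens that are not stop words.
    """
    total_words = 0
    vocabulary = set()
    for text in documents:
        tokens = TOKEN_RE.findall(text)
        total_words += len(tokens)
        vocabulary.update(t for t in tokens if t not in stopwords)
    return total_words, len(vocabulary)


if __name__ == "__main__":
    sample = ["ذهب الولد إلى المدرسة", "الولد في البيت"]
    total, distinct = corpus_stats(sample)
    print(total, distinct)
```

The function takes any iterable of document strings, so the caller decides how the files are read and which encoding they use. Counts obtained this way will differ from the published ones unless the same stop-word list and tokenization are used.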
Contributions are welcome!
I would also like to remind you that Arabic morphological analysis tools (stemming and light stemming) are now available in both RapidMiner and Weka.
Motaz K. Saad
Faculty of Information Technology
IT Building, Room: I319
Islamic University of Gaza
P.O. Box 108, Gaza, Palestine
Fax: +970 2860 800