OSAC: Open Source Arabic Corpora


Motaz K. Saad

Oct 29, 2010, 5:27:47 PM
to Motaz Saad
Assalamo Alikom,

The Open Source Arabic Corpora (OSAC) have been released! You can download them from

http://ar-text-mining.sourceforge.net

The corpora include:
- BBC Arabic corpus: collected from bbcarabic.com, includes 4,763 text documents. Each text document belongs to 1 of 7 categories (Middle East News 2356, World News 1489, Business & Economy 296, Sports 219, International Press 49, Science & Technology 232, Art & Culture 122). The corpus contains 1,860,786 (1.8M) words and 106,733 distinct keywords after stopword removal.

- CNN Arabic corpus: collected from cnnarabic.com, includes 5,070 text documents. Each text document belongs to 1 of 6 categories (Business 836, Entertainments 474, Middle East News 1462, Science & Technology 526, Sports 762, World News 1010). The corpus contains 2,241,348 (2.2M) words and 144,460 distinct keywords after stopword removal.

- Open Source Arabic Corpus: collected from multiple sites, includes 22,429 text documents. Each text document belongs to 1 of 10 categories (Economics, History, Entertainments, Education & Family, Religious and Fatwas, Sports, Health, Astronomy, Law, Stories, Cooking Recipes). The corpus contains about 18,183,511 (18M) words and 449,600 distinct keywords after stopword removal.
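
For anyone who wants to reproduce figures like the ones above (documents per category, total words, distinct keywords after stopword removal), here is a minimal Python sketch. It is only an illustration, not the script used to build OSAC: the directory layout (one sub-folder per category, one .txt file per document), the file encoding, the tokenizer, and the tiny stopword list are all assumptions, so the numbers it gives will not necessarily match the ones reported here.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical, tiny stopword list for illustration only;
# it is NOT the stopword list used to produce the OSAC figures.
STOPWORDS = {"في", "من", "على", "إلى", "عن"}

# Assumed tokenizer: a token is a run of Arabic letters (U+0621 to U+064A).
TOKEN_RE = re.compile(r"[\u0621-\u064A]+")


def tokenize(text):
    """Split text into Arabic-letter tokens."""
    return TOKEN_RE.findall(text)


def corpus_stats(docs_by_category, stopwords=STOPWORDS):
    """Count documents per category, total words, and distinct keywords.

    docs_by_category maps a category name to a list of document strings.
    A "keyword" here is any token that is not in the stopword set.
    """
    documents = Counter()
    total_words = 0
    keywords = set()
    for category, docs in docs_by_category.items():
        for text in docs:
            tokens = tokenize(text)
            documents[category] += 1
            total_words += len(tokens)
            keywords.update(t for t in tokens if t not in stopwords)
    return {
        "documents": dict(documents),
        "total_words": total_words,
        "distinct_keywords": len(keywords),
    }


def load_corpus(root, encoding="utf-8"):
    """Read documents from an assumed layout: root/<category>/<doc>.txt.

    The encoding is an assumption; change it if the files use another one.
    """
    docs = {}
    for path in sorted(Path(root).glob("*/*.txt")):
        text = path.read_text(encoding=encoding, errors="replace")
        docs.setdefault(path.parent.name, []).append(text)
    return docs
```

With the files unpacked locally, something like corpus_stats(load_corpus("path/to/corpus")) would print the per-category document counts and the two word totals, where "path/to/corpus" is a placeholder for wherever you extracted the download.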

Contributions are welcome!

I would also like to remind you that Arabic morphological analysis tools (stemming and light stemming) are now available in both RapidMiner and Weka.

Again, contributions are welcome!
Best Regards,
--
Motaz K. Saad
Faculty of Information Technology
IT Building, Room: I319
Islamic University Of Gaza
P.O.BOX 108, Gaza, Palestine
Fax: +970 2860 800
http://sites.google.com/site/MotazSite