word frequency of nouns and verbs in Standard Arabic

Althubaiti,Kholoud

Sep 20, 2024, 4:58:06 AM

to sig...@googlegroups.com

Hello colleagues,

Could you recommend a free tool that gives the frequency of lexical items in Standard Arabic? I conduct experimental research on the second language acquisition of Arabic, and I need to compile a list of lexical items (nouns and verbs) ranked by their frequency in language use. Frequency is one of the variables I need to factor into the analysis of my experiment.

Thanks in advance for your help!

Kholoud Al-Thubaiti

--

K.Althubaiti

Mirko Vogel

Sep 20, 2024, 6:08:48 AM

to SIGARAB

Hi Kholoud,

Which kind of corpus should the frequency list be based on? I could share a dump of the lemma database of the online collocation dictionary Muraija, which is based on the el-khair corpus, that is, mostly news. It currently holds 772k lemmas; the dump would be in JSON format:

{
"lemma": "مُنْتَخَب",
"pos": "NOUN",
"freq": 195586,
"surface_forms": {
    "المُنْتَخَباتُ": 1006,
    "المُنْتَخَباتِ": 9057,
    "المُنْتَخَبانِ": 1312,
    "المُنْتَخَبَ": 8201,
    "المُنْتَخَبَةَ": 212,
    "المُنْتَخَبَةِ": 1616,
    "المُنْتَخَبَيْنِ": 3576,
    "المُنْتَخَبُ": 25113,
    "المُنْتَخَبُونَ": 269,
    "المُنْتَخَبِ": 72780,
    "المُنْتَخَبِينَ": 1184,
    "مُنْتَخَبا": 609,
    "مُنْتَخَباتٌ": 180,
    "مُنْتَخَباتٍ": 2332,
    "مُنْتَخَباتُ": 684,
    "مُنْتَخَباتِ": 3513,
    "مُنْتَخَباً": 1368,
    "مُنْتَخَبٌ": 953,
    "مُنْتَخَبٍ": 3918,
    "مُنْتَخَبَ": 6274,
    "مُنْتَخَبَةٌ": 299,
    "مُنْتَخَبَةٍ": 1596,
    "مُنْتَخَبَيْ": 2242,
    "مُنْتَخَبَيْنِ": 372,
    "مُنْتَخَبُ": 10916,
    "مُنْتَخَبُونَ": 101,
    "مُنْتَخَبِ": 34814,
    "مُنْتَخَبِي": 73,
    "مُنْتَخَبِينَ": 269,
    "مُنْتَخَب": 118,
    "مُنْتَخَبانِ": 73,
    "مُنْتَخَبَةً": 175,
    "المُنْتَخَبَةُ": 312,
    "مُنْتَخَبُو": 20,
    "المُنْتَخَبَتَيْنِ": 1,
    "المُنْتَخَب": 40,
    "مُنْتَخَبَتَيْنِ": 4,
    "مُنْتَخَبَةِ": 3,
    "مُنْتَخَبَتا": 1
}
}
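
A minimal sketch of how a dump in this shape could be turned into a ranked noun/verb list. The field names (`lemma`, `pos`, `freq`, `surface_forms`) are taken from the record above; the JSON-array wrapper, the helper names, and the toy records are assumptions made for illustration, not the actual layout of the dump.

```python
import json


def load_records(text):
    """Parse a JSON array of lemma records (the array wrapper is an assumption)."""
    return json.loads(text)


def rank_by_freq(records, wanted_pos=("NOUN", "VERB")):
    """Keep records whose POS tag is wanted and sort them, most frequent first."""
    kept = [r for r in records if r["pos"] in wanted_pos]
    return sorted(kept, key=lambda r: r["freq"], reverse=True)


def surface_total(record):
    """Sum the counts of all surface forms listed for one lemma."""
    return sum(record["surface_forms"].values())


# Toy records in the same shape as the example above (values are made up).
sample_text = """[
  {"lemma": "a", "pos": "NOUN", "freq": 30, "surface_forms": {"a1": 10, "a2": 20}},
  {"lemma": "b", "pos": "PROPN", "freq": 99, "surface_forms": {"b1": 99}},
  {"lemma": "c", "pos": "VERB", "freq": 45, "surface_forms": {"c1": 45}}
]"""

ranked = rank_by_freq(load_records(sample_text))
for rec in ranked:
    print(rec["lemma"], rec["pos"], rec["freq"], surface_total(rec))
```

In the record shown above, the surface-form counts add up to exactly the `freq` value (195586), so `surface_total` can double as a sanity check; whether that holds for every record in the dump is not something this sketch can confirm.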

Best,
Mirko

--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sigarab+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sigarab/CAL-0UTSnjcbM8ysTS-tm52TyYCuLRjaqMvGk7Az1YYHZBTircA%40mail.gmail.com.

Mustafa Jarrar

Sep 20, 2024, 7:12:37 AM

to k.alth...@gmail.com, SIGARAB: Special Interest Group on Arabic Natural Language Processing, Mirko Vogel

Dear Kholoud,

You can check wordform frequencies in the 12 corpora (MSA and dialects, 2.3 million tokens) that are manually linked with the Qabas lexicographic database (6K lemmas in Qabas):

https://sina.birzeit.edu/qabas/lemma/2021848500

Screenshot attached (image not preserved).

Best
--Mustafa
__________________________
Mustafa Jarrar, PhD
Professor of Artificial Intelligence
Chair, PhD Program in Computer Science
Birzeit University, Palestine
WhatsApp:+972599662258

Page: http://www.jarrar.info

SinaLab: https://sina.birzeit.edu


Nizar Habash

Sep 20, 2024, 7:38:42 AM

to k.alth...@gmail.com, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Dear Kholoud -

Given your interest in language acquisition, you may want to check out the lexicon of the SAMER project (http://samer.camel-lab.com/) [download form] [paper]. It includes lemma (dictionary entry) frequencies from two sources (novels/Hindawi and news/Gigaword), with readability levels on a five-point scale (defined in the paper linked above). The project site has other utilities that build on it.

Sample from the top of the file: [screenshot: Screen Shot 2024-09-20 at 3.33.32 PM.png]

We also have a plain list of word forms with frequencies (16.1M unique words from a 17.3B-word corpus):

https://github.com/CAMeL-Lab/Camel_Arabic_Frequency_Lists
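
For use as an experimental variable, raw counts from such a list are often normalized to occurrences per million words. Below is a rough sketch that assumes each line of the list holds a word and its count separated by a tab; that layout is an assumption and should be checked against the repository, and the function names are hypothetical. The corpus size is the 17.3B figure quoted above.

```python
import math

# Corpus size quoted in this message (17.3B words).
CORPUS_TOKENS = 17_300_000_000


def parse_line(line):
    """Split one 'word<TAB>count' line (assumed layout) into (word, count)."""
    word, count = line.rstrip("\n").split("\t")
    return word, int(count)


def per_million(count, corpus_tokens=CORPUS_TOKENS):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_tokens


def log_per_million(count, corpus_tokens=CORPUS_TOKENS):
    """Base-10 log of the per-million frequency, a common regression predictor."""
    return math.log10(per_million(count, corpus_tokens))


# Made-up example line; real counts come from the downloaded list.
word, count = parse_line("كتاب\t17300\n")
print(word, per_million(count), log_per_million(count))
```

Log-transformed frequency is usually preferred over raw counts in statistical models because word frequencies are heavily skewed.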

Best

Nizar


--

Nizar Habash

Professor of Computer Science

New York University Abu Dhabi
https://www.nizarhabash.com/

Mirko Vogel

Sep 20, 2024, 7:41:46 AM

to SIGARAB

Hi Nizar,

The overlap with Camel Tools lemmas should be 100% (minus fallback analyses, of course), because that's what I use for morphological analysis. :-) Actually, most of the lemmas are proper names, numbers, or words written in Latin letters. The number of relevant nouns, verbs, and adjectives is much lower:

PROPN	556528
NUM	89659
X	47410
NOUN	16300
VERB	7234
ADJ	5032
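
As a quick check of how small the relevant share is, the counts above can be tallied as follows. The numbers are copied from the table; the variable names and the grouping of NOUN, VERB, and ADJ as "content" tags are illustrative assumptions.

```python
from collections import Counter

# Lemma counts per POS tag, copied from the table above.
pos_counts = Counter({
    "PROPN": 556528,
    "NUM": 89659,
    "X": 47410,
    "NOUN": 16300,
    "VERB": 7234,
    "ADJ": 5032,
})

content_tags = ("NOUN", "VERB", "ADJ")
content_total = sum(pos_counts[tag] for tag in content_tags)
listed_total = sum(pos_counts.values())

# Share of nouns, verbs and adjectives among the lemmas listed here.
content_share = content_total / listed_total
print(content_total, listed_total, round(100 * content_share, 2))
```

The six tags listed add up to 722,163 lemmas, somewhat fewer than the roughly 772k mentioned earlier, so the remainder presumably carries other tags. Nouns, verbs, and adjectives together account for 28,566 lemmas, about 4% of those listed.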

Currently, I'm using the buggy r13 db; the data will become more reliable as soon as I manage to integrate Camel Morph MSA into the parsing pipeline (I'm waiting for UD + CATiB POS tags to be added, see https://github.com/CAMeL-Lab/camel_morph/issues/2).

Best,
Mirko

On 9/20/24 12:32, Nizar Habash wrote:

Hi Mirko - how do you define the lemma? 772k sounds like a very large number. Are you including digits in the counts?

I'm also curious about overlap with Camel Tools lemmas... would be great if we can compare.

Thanks

N


Kholoud Althubaiti

Sep 20, 2024, 9:09:03 AM

to Mustafa Jarrar, SIGARAB: Special Interest Group on Arabic Natural Language Processing, Mirko Vogel

Thanks, Mustafa, for this suggestion. I will check it out and hope it works for me.

Kholoud

