word frequency of nouns and verbs in Standard Arabic


Althubaiti,Kholoud

Sep 20, 2024, 4:58:06 AM
to sig...@googlegroups.com
Hello colleagues,

Could you recommend a free tool that gives the frequency of lexical items in Standard Arabic? I am conducting experimental research on the second language acquisition of Arabic, and I need to compile a list of lexical items (nouns and verbs) ranked by their frequency in language use. Frequency is one of the variables I need to factor into the analysis of my experiment.

Thanks in advance for your help!
Kholoud Al-Thubaiti



--
K.Althubaiti

Mirko Vogel

Sep 20, 2024, 6:08:48 AM
to SIGARAB

Hi Kholoud,

Which kind of corpus should this frequency list be based on? I could share with you a dump of the lemma database of the online collocation dictionary Muraija, which is based on the el-khair corpus, that is, mostly news. It currently holds 772k lemmas, and the dump would be in JSON format:

{
  "lemma": "مُنْتَخَب",
  "pos": "NOUN",
  "freq": 195586,
  "surface_forms": {
    "المُنْتَخَباتُ": 1006,
    "المُنْتَخَباتِ": 9057,
    "المُنْتَخَبانِ": 1312,
    "المُنْتَخَبَ": 8201,
    "المُنْتَخَبَةَ": 212,
    "المُنْتَخَبَةِ": 1616,
    "المُنْتَخَبَيْنِ": 3576,
    "المُنْتَخَبُ": 25113,
    "المُنْتَخَبُونَ": 269,
    "المُنْتَخَبِ": 72780,
    "المُنْتَخَبِينَ": 1184,
    "مُنْتَخَبا": 609,
    "مُنْتَخَباتٌ": 180,
    "مُنْتَخَباتٍ": 2332,
    "مُنْتَخَباتُ": 684,
    "مُنْتَخَباتِ": 3513,
    "مُنْتَخَباً": 1368,
    "مُنْتَخَبٌ": 953,
    "مُنْتَخَبٍ": 3918,
    "مُنْتَخَبَ": 6274,
    "مُنْتَخَبَةٌ": 299,
    "مُنْتَخَبَةٍ": 1596,
    "مُنْتَخَبَيْ": 2242,
    "مُنْتَخَبَيْنِ": 372,
    "مُنْتَخَبُ": 10916,
    "مُنْتَخَبُونَ": 101,
    "مُنْتَخَبِ": 34814,
    "مُنْتَخَبِي": 73,
    "مُنْتَخَبِينَ": 269,
    "مُنْتَخَب": 118,
    "مُنْتَخَبانِ": 73,
    "مُنْتَخَبَةً": 175,
    "المُنْتَخَبَةُ": 312,
    "مُنْتَخَبُو": 20,
    "المُنْتَخَبَتَيْنِ": 1,
    "المُنْتَخَب": 40,
    "مُنْتَخَبَتَيْنِ": 4,
    "مُنْتَخَبَةِ": 3,
    "مُنْتَخَبَتا": 1
  }
}
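
For anyone who wants to turn a dump like this into a noun/verb frequency list, here is a minimal Python sketch. It assumes the dump is a JSON array of objects shaped like the entry above; the actual file layout was not specified in the thread, so that is a guess. The first sample entry is abridged from the message, while `placeholder_verb` and `placeholder_name` are invented stand-ins, and `frequency_list` is a hypothetical helper name.

```python
import json

# Assumed layout: a JSON array of lemma objects shaped like the sample above.
# The first entry is abridged from the thread; the other two are made-up
# placeholders (their lemma, pos, and freq values are invented).
SAMPLE_DUMP = """
[
  {"lemma": "مُنْتَخَب", "pos": "NOUN", "freq": 195586,
   "surface_forms": {"المُنْتَخَبِ": 72780, "مُنْتَخَبِ": 34814}},
  {"lemma": "placeholder_verb", "pos": "VERB", "freq": 100,
   "surface_forms": {}},
  {"lemma": "placeholder_name", "pos": "PROPN", "freq": 5000,
   "surface_forms": {}}
]
"""


def frequency_list(entries, pos_tags=("NOUN", "VERB")):
    """Return (lemma, pos, freq) tuples for the given POS tags,
    sorted from most to least frequent."""
    rows = [
        (entry["lemma"], entry["pos"], entry["freq"])
        for entry in entries
        if entry["pos"] in pos_tags
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


entries = json.loads(SAMPLE_DUMP)
ranked = frequency_list(entries)

# Print one tab-separated line per lemma, e.g. for pasting into a spreadsheet.
for lemma, pos, freq in ranked:
    print(f"{lemma}\t{pos}\t{freq}")
```

Filtering on the `pos` field first keeps proper names and numbers out of the list, which matters for a news-based corpus.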


Best,
Mirko

Mustafa Jarrar

Sep 20, 2024, 7:12:37 AM
to k.alth...@gmail.com, SIGARAB: Special Interest Group on Arabic Natural Language Processing, Mirko Vogel
Dear Kholoud,

You can check the frequency of word forms in the 12 corpora (MSA and dialects, 2.3 million tokens) that are manually linked with the Qabas lexicographic database (6K lemmas in Qabas).
A screenshot is attached:
PastedGraphic-1.png

Best
--Mustafa
__________________________
Mustafa Jarrar, PhD
Professor of Artificial Intelligence
Chair, PhD Program in Computer Science
Birzeit University, Palestine 
WhatsApp:+972599662258 

Nizar Habash

Sep 20, 2024, 7:38:42 AM
to k.alth...@gmail.com, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Dear Kholoud -

Given your interest in language acquisition, you may want to check out the lexicon of the SAMER project (http://samer.camel-lab.com/) [download form] [paper]. It includes frequencies from two sources (novels/Hindawi and news/Gigaword) in lemma (dictionary entry) form, with readability levels on a five-point scale (defined in the paper mentioned above). The project site has other utilities that build on it.

Sample of top of the file:
Screen Shot 2024-09-20 at 3.33.32 PM.png

We also have a plain list of word forms with frequencies (16.1M unique words from a 17.3B-word corpus).

Best
Nizar





--
Nizar Habash
Professor of Computer Science
New York University Abu Dhabi
https://www.nizarhabash.com/ 

Mirko Vogel

Sep 20, 2024, 7:41:46 AM
to SIGARAB

Hi Nizar,

The overlap with Camel Tools lemmas should be 100% (minus fallback analyses, of course) because that's what I use for MA. :-) Actually, most of the lemmas are proper names, numbers, or words written in Latin letters. The number of relevant nouns, verbs, and adjectives is much lower:

PROPN 556528
NUM 89659
X 47410
NOUN 16300
VERB 7234
ADJ 5032
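
To put these counts in proportion, here is a small Python tally using only the six figures listed above. Note that the six tags sum to 722,163, which is below the roughly 772k lemmas mentioned earlier, so the remainder presumably falls under tags not shown here; that is an inference, not something stated in the thread.

```python
# Lemma counts per POS tag, copied from the list above.
pos_counts = {
    "PROPN": 556528,
    "NUM": 89659,
    "X": 47410,
    "NOUN": 16300,
    "VERB": 7234,
    "ADJ": 5032,
}

# Total over the six listed tags only (other tags are not given).
listed_total = sum(pos_counts.values())

# Nouns, verbs, and adjectives combined.
content_total = sum(pos_counts[tag] for tag in ("NOUN", "VERB", "ADJ"))

# Share of content-word lemmas among the listed tags.
share = content_total / listed_total

print(f"listed total:  {listed_total}")
print(f"NOUN+VERB+ADJ: {content_total}")
print(f"share:         {share:.1%}")
```

By this count, nouns, verbs, and adjectives together make up about 4% of the listed lemmas.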

Currently, I'm using the buggy r13 db; the data will become more reliable as soon as I manage to integrate Camel Morph MSA into the parsing pipeline (I'm waiting for UD + CATiB POS tags to be added: https://github.com/CAMeL-Lab/camel_morph/issues/2).

Best,
Mirko



On 9/20/24 12:32, Nizar Habash wrote:
Hi Mirko - how do you define the lemma? 772k sounds like a very large number. Are you including digits in the counts?
I'm also curious about overlap with Camel Tools lemmas... would be great if we can compare.

Thanks
N

Kholoud Althubaiti

Sep 20, 2024, 9:09:03 AM
to Mustafa Jarrar, SIGARAB: Special Interest Group on Arabic Natural Language Processing, Mirko Vogel
Thanks, Mustafa, for this suggestion. I will check it out and hope it works for me.
Kholoud 