Hi Khouloud,
which kind of corpus should this frequency list be based on? I
could share with you a dump of the lemma database of the
online collocation dictionary Muraija,
which is based on the el-khair corpus, that is, mostly news. It
currently has 772k lemmas; the dump would be in JSON format:
{
"lemma": "مُنْتَخَب",
"pos": "NOUN",
"freq": 195586,
"surface_forms": {
"المُنْتَخَباتُ": 1006,
"المُنْتَخَباتِ": 9057,
"المُنْتَخَبانِ": 1312,
"المُنْتَخَبَ": 8201,
"المُنْتَخَبَةَ": 212,
"المُنْتَخَبَةِ": 1616,
"المُنْتَخَبَيْنِ": 3576,
"المُنْتَخَبُ": 25113,
"المُنْتَخَبُونَ": 269,
"المُنْتَخَبِ": 72780,
"المُنْتَخَبِينَ": 1184,
"مُنْتَخَبا": 609,
"مُنْتَخَباتٌ": 180,
"مُنْتَخَباتٍ": 2332,
"مُنْتَخَباتُ": 684,
"مُنْتَخَباتِ": 3513,
"مُنْتَخَباً": 1368,
"مُنْتَخَبٌ": 953,
"مُنْتَخَبٍ": 3918,
"مُنْتَخَبَ": 6274,
"مُنْتَخَبَةٌ": 299,
"مُنْتَخَبَةٍ": 1596,
"مُنْتَخَبَيْ": 2242,
"مُنْتَخَبَيْنِ": 372,
"مُنْتَخَبُ": 10916,
"مُنْتَخَبُونَ": 101,
"مُنْتَخَبِ": 34814,
"مُنْتَخَبِي": 73,
"مُنْتَخَبِينَ": 269,
"مُنْتَخَب": 118,
"مُنْتَخَبانِ": 73,
"مُنْتَخَبَةً": 175,
"المُنْتَخَبَةُ": 312,
"مُنْتَخَبُو": 20,
"المُنْتَخَبَتَيْنِ": 1,
"المُنْتَخَب": 40,
"مُنْتَخَبَتَيْنِ": 4,
"مُنْتَخَبَةِ": 3,
"مُنْتَخَبَتا": 1
}
}
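Since the surface forms in such an entry are fully diacritized, a frequency list over undiacritized word forms would need the counts merged first. A minimal sketch of that step, assuming each entry is loaded as a Python dict with the fields shown above; the function names are hypothetical, the sample entry is abbreviated to five of the forms above, and the diacritic set (U+064B to U+0652 plus U+0670) is an assumption about which marks should be dropped:

```python
import re
from collections import Counter

# Assumed diacritic set: tanwin, short vowels, shadda, sukun (U+064B-U+0652)
# and the superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove the diacritics listed above from an Arabic string."""
    return DIACRITICS.sub("", text)


def merge_surface_forms(entry: dict) -> Counter:
    """Sum the counts of all surface forms that share an undiacritized spelling."""
    merged = Counter()
    for form, count in entry["surface_forms"].items():
        merged[strip_diacritics(form)] += count
    return merged


# Abbreviated sample: only five of the surface forms from the entry above.
entry = {
    "lemma": "مُنْتَخَب",
    "pos": "NOUN",
    "freq": 195586,
    "surface_forms": {
        "المُنْتَخَبُ": 25113,
        "المُنْتَخَبِ": 72780,
        "المُنْتَخَبَ": 8201,
        "مُنْتَخَبٌ": 953,
        "مُنْتَخَبٍ": 3918,
    },
}

merged = merge_surface_forms(entry)
for form, count in merged.most_common():
    print(form, count)
```

In this sample the five diacritized forms collapse into two undiacritized spellings, one with the definite article and one without.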
--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sigarab+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sigarab/CAL-0UTSnjcbM8ysTS-tm52TyYCuLRjaqMvGk7Az1YYHZBTircA%40mail.gmail.com.

Hi Nizar,
the overlap with Camel Tools lemmas should be 100% (minus
fallback analyses, of course) because that's what I use for
morphological analysis (MA). :-) Actually, most of the lemmas are
proper names, numbers or words written in Latin letters. The number
of relevant nouns, verbs and adjectives is much lower:
| PROPN | 556528 |
| NUM | 89659 |
| X | 47410 |
| NOUN | 16300 |
| VERB | 7234 |
| ADJ | 5032 |
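To put these numbers in proportion, here is a small sketch that sums the counts for the three content-word tags and relates them to the six tags listed. The counts are copied from the table; the variable names are hypothetical, and the share is computed only over the listed tags, not over the full 772k lemmas:

```python
# Lemma counts per POS tag, copied from the table above.
pos_counts = {
    "PROPN": 556528,
    "NUM": 89659,
    "X": 47410,
    "NOUN": 16300,
    "VERB": 7234,
    "ADJ": 5032,
}

# Tags treated as content words in this sketch.
CONTENT_TAGS = ("NOUN", "VERB", "ADJ")

content_total = sum(pos_counts[tag] for tag in CONTENT_TAGS)
listed_total = sum(pos_counts.values())
share = content_total / listed_total

print(f"content lemmas: {content_total}")
print(f"lemmas in listed tags: {listed_total}")
print(f"share of listed tags: {share:.2%}")
```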
Currently, I'm using the buggy r13 db; the data will become more reliable as soon as I manage to integrate Camel Morph MSA into the parsing pipeline (waiting for UD + CATiB POS tags to be added / https://github.com/CAMeL-Lab/camel_morph/issues/2).
Best,
Mirko
Hi Mirko - how do you define the lemma? 772k sounds like a very large number. Are you including digits in the counts? I'm also curious about the overlap with Camel Tools lemmas... it would be great if we could compare.
Thanks
N
On 20 Sep 2024, at 2:12 PM, Mustafa Jarrar <mustaf...@gmail.com> wrote:
Dear Khouloud,
You may check the frequency of wordforms in the 12 corpora (MSA and dialects, 2.3 million tokens) manually linked with the Qabas lexicographic database (6K lemmas in Qabas). Attached is a screenshot: