Release: Konooz corpus مدونة كنوز

Mustafa Jarrar

Jul 28, 2025, 3:36:30 PM

to SIGARAB: Special Interest Group on Arabic Natural Language Processing

Dear colleagues,

We are pleased to announce the Konooz corpus for recognizing named entities in Modern Standard Arabic and dialectal texts; it covers 16 dialects across 10 domains.

It is our pleasure to announce the release of the Konooz corpus, a multi-domain, multi-dialect corpus for Named Entity Recognition (NER), presented today at #ACL2025. Konooz comprises 16 dialects across 10 domains, totaling 777K tokens, all manually collected and annotated. Konooz enables rich NER, as well as domain adaptation and transfer learning from one domain or dialect to another.

Download/Demo (CC-BY): https://sina.birzeit.edu/wojood

Article: https://www.jarrar.info/publications/HKJ25.pdf

Best Regards,

—Mustafa

Kalmasoft

Jul 29, 2025, 9:21:55 AM

to Mustafa Jarrar, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Hello,

I tried a vertical list of 20 Arabic words (one word per line). It took a long time to process and then failed.

Perhaps this is a technical problem on the server, or the internal logic expects specific text formats.

It would also be nice to join entities, so that I get

بورصة فلسطين

as one entity rather than separate highlighting for each of the two tokens. This could go on recursively if not well controlled.
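
To illustrate the joining described above, here is a minimal sketch. It assumes the tagger emits one BIO-style label per token (for example `B-ORG` followed by `I-ORG`). The tag values and the `merge_entities` helper are assumptions made for illustration, not the demo's actual output format.

```python
from typing import List, Optional, Tuple


def merge_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Join consecutive B-X / I-X tokens into single (text, label) entities."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: flush the one that was open, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the open entity with the same label.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not match) closes the open entity;
            # the current token itself is not kept.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Flush an entity that runs to the end of the input.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical tagger output for a three-token input.
tokens = ["بورصة", "فلسطين", "مغلقة"]
tags = ["B-ORG", "I-ORG", "O"]
print(merge_entities(tokens, tags))
```

With this grouping, the two tokens of بورصة فلسطين come back as a single `ORG` entity instead of two separately highlighted words.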

Thanks

Kalmasoft

Jul 29, 2025, 10:36:34 AM

to Mustafa Jarrar, SIGARAB: Special Interest Group on Arabic Natural Language Processing

My second attempt:

Basically, the internal logic takes text formatting into account, while it shouldn't. Whatever the text format is, all input should be converted to one long single string, with all line breaks converted to spaces.
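
A minimal sketch of this kind of normalization, in Python. The helper name `flatten_text` is hypothetical and is not part of the demo or of SinaTools. It only shows one way to turn line breaks and other whitespace runs into single spaces before tagging.

```python
def flatten_text(text: str) -> str:
    """Collapse every run of whitespace, including line breaks, into one space."""
    # str.split() with no arguments splits on any whitespace (spaces, tabs,
    # \n, \r\n, ...) and drops empty pieces, so joining with " " yields one
    # single-line string with no leading or trailing space.
    return " ".join(text.split())


# A "vertical list" input: one word per line, with mixed line endings.
vertical_list = "بورصة\nفلسطين\r\n\nرام\tالله\n"
print(flatten_text(vertical_list))
```

Applied before the text is sent to a tagger, this makes a one-word-per-line list and a single-line sentence look identical to the downstream logic.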

Mustafa Jarrar

Aug 2, 2025, 12:02:44 PM

to Kalmasoft, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Dear Sir,

Please use the SinaTools NER module instead of the online NER demo page.

Best, Mustafa
