
Using / Improving the open Arabic MA (camel_morph)


Mirko Vogel

Apr 28, 2025, 4:38:32 AM
to SIGARAB

Good morning,

This mail is to ask who is using camel_morph, the open morphological analyzer and generator from the CAMeL Lab, and who might be interested in improving it.

I am using it as part of my parsing pipeline for the Arabic collocation dictionary Muraija, as well as for generating surface forms for collocations. Throwing massive amounts of data at a tool and using it in a production setting surfaces issues that might pass unnoticed in a research setting - I filed a couple of them on GitHub.
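
For anyone curious, here is a rough sketch of what such a pipeline step can look like with the camel_tools morphology API, which (as I understand it) camel_morph databases plug into; the database flags, lemma, and feature values below are illustrative assumptions rather than my exact setup:

    # Rough sketch (illustrative, not the actual Muraija pipeline):
    # analyze a word and generate surface forms with camel_tools.
    # Requires the camel_tools morphology data packages to be installed first.
    from camel_tools.morphology.database import MorphologyDB
    from camel_tools.morphology.analyzer import Analyzer
    from camel_tools.morphology.generator import Generator

    analyzer = Analyzer(MorphologyDB.builtin_db(flags='a'))
    generator = Generator(MorphologyDB.builtin_db(flags='g'))

    # Analysis: each result is a feature dict (lemma, POS, clitics, ...).
    for analysis in analyzer.analyze('والكتاب'):
        print(analysis['lex'], analysis['pos'])

    # Generation: inflected surface forms for a lemma and a feature bundle
    # (lemma spelling and feature keys here are assumed for the example).
    for form in generator.generate('كِتاب', {'pos': 'noun', 'num': 'p'}):
        print(form['diac'])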

Do you use, or are you considering using, camel_morph as well? For which use cases? A sufficient number of affirmative responses might convince the CAMeL Lab to allocate more resources to its development. :-)

@Rawan, @Nizar, maybe you can share some of your thoughts / plans on the future of camel_morph? Or do you think that LLMs are the nail in the coffin for morphological analyzers?

Best,
Mirko

Nizar Habash

Apr 28, 2025, 1:03:14 PM
to Mirko Vogel, SIGARAB
Dear Mirko,

Thank you so much for your continued support and for sharing your experiences with Camel Morph. It is fantastic to see it being used in real-world applications like your Muraija dictionary project. We really appreciate the issues you have raised on GitHub; they are already part of the updates we are preparing for our next release. Apologies for not responding sooner. Your feedback is incredibly valuable to us and helps make Camel Morph better.

On your question about LLMs and morphological analyzers: while LLMs are powerful, we firmly believe tools like Camel Morph remain essential, especially for applications that require precision, transparency, and efficiency. Optimized and explainable resources will continue to have an important role in NLP, and we are excited to keep building in this direction with the Arabic NLP community.

We are currently preparing a new release of Camel Morph as part of the Camel Tools 2.0 launch. We welcome collaborations on all fronts, including refining and improving the Standard Arabic version by identifying issues to correct, as well as extending Camel Morph to new Arabic dialects. If you or anyone else in the community would like to get involved, please feel free to reach out to me directly at nizar....@nyu.edu.

Best regards,
Nizar



--
Nizar Habash
Professor of Computer Science
New York University Abu Dhabi
https://www.nizarhabash.com/ 

Zaid Alyafeai

Apr 28, 2025, 1:19:21 PM
to Nizar Habash, Mirko Vogel, SIGARAB
LLMs are mostly good at generative tasks. They are worse than a 2018 fine-tuned BERT model on most of the remaining NLP tasks.

Alexis Neme

May 3, 2025, 2:30:56 AM
to Mirko Vogel, Nizar Habash, SIGARAB, Zaid Alyafeai, البتول محمد صالح اباالخيل‎

Dear Mirko,

In response to your question regarding LLMs and the integration of morphological analyzers such as CAMeL, please find below an excerpt from our publication "Beyond the Determiner 'Al-': Expanding the Determiner Class in Arabic, and Elimination of Lexical Ambiguities by Grammars" (22 pages, on IEEE Xplore), specifically from the Discussion, Subsection C:

Cheers,

Alexis Neme, PhD - Paris

Senior NLP Developer, www.dalr.me

EN-FR-AR-PT (DE, Cebuano) 

 


C. A hybrid approach: Rule-based with LLM

Currently, natural language processing has achieved high accuracy in areas such as semantic distinctions, syntactic variations, and the identification of specific words. However, these claims remain debatable, particularly for Arabic and related grammatical tasks (see Appendix 8, LLM experience with ChatGPT-4o). While significant algorithmic advancements have been made for English, largely due to the extensive textual data used for training, languages with rich morphology and limited resources, such as Arabic, still pose significant challenges. Arabic, for instance, requires fine-grained labeling with over 1,000 morphosyntactic categories (excluding semantic features), making it difficult for LLMs to achieve the same level of accuracy as they do for English. To mitigate these shortcomings, we propose a hybrid approach that enhances LLM capabilities in such complex linguistic settings.

Traditional rule-based systems reveal significant limitations when applied to real-world problems. In complex scenarios, the number of rules can easily reach into the thousands, though the sheer volume is not the primary concern. A more critical issue is the exponential dependency among rules: many outcomes depend on the precise sequence in which rules are executed. Introducing just one or two new rules to handle additional cases can trigger a combinatorial explosion in dependencies, potentially resulting in memory overflows or infinite processing loops. Consequently, such systems are nondeterministic, and defining a consistent and effective rule order becomes a major challenge. In practice, manageable rule-based systems rarely scale beyond a few hundred rules.
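
A toy illustration of this order dependence (not taken from the paper; the rewrite rules below are purely hypothetical):

    # Toy example: the output of a small rewrite-rule cascade depends on
    # the order in which the rules fire.
    RULES = [
        ("al", "DET+"),             # hypothetical rule 1: segment the determiner
        ("DET+kitab", "DET+NOUN"),  # hypothetical rule 2: tag the noun after a determiner
    ]

    def apply_rules(token, rules):
        """Apply each rewrite rule once, in the given order."""
        for pattern, replacement in rules:
            token = token.replace(pattern, replacement)
        return token

    print(apply_rules("alkitab", RULES))        # -> DET+NOUN
    print(apply_rules("alkitab", RULES[::-1]))  # -> DET+kitab (rule 2 never fires)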

With the help of large datasets, LLMs address the problem of numerous rules (often in the millions) and their firing order by implicitly prioritizing the most probable patterns during inference. This effectively mitigates technical issues such as memory overflow and infinite loops. However, it does not resolve the nondeterministic nature of LLMs, commonly referred to as hallucinations.

In contexts where explainability is crucial—such as in linguistics, which aims to understand language scientifically—nondeterministic systems with billions of parameters remain largely opaque to human interpretation. Likewise, in high-stakes contexts where errors are critical, such as legal proceedings or pedagogical applications, LLMs offer limited reliability and are often unsuitable. 


