Arabic ChainBank Released - A New Resource for Arabic NLP


reham marzouk

Jun 25, 2025, 8:08:20 AM
to sig...@googlegroups.com

We are thrilled to announce the first official release of the Arabic ChainBank, the latest tool developed by CAMeL Lab at New York University Abu Dhabi.

The Arabic ChainBank is a derivational resource for Modern Standard Arabic (MSA). It is designed to systematically link all derivatives belonging to the same derivational family in a sequential manner (chain), starting from the root and progressing through each derived form. 

Explore Arabic ChainBank here

We would be happy to hear your thoughts, feedback, and suggestions. Your input will help us improve and grow the resource in future releases.

Best wishes, 

Reham Marzouk

Mirko Vogel

Jun 25, 2025, 1:22:42 PM
to reham marzouk, SIGARAB

Hi Reham,

Thanks for sharing. This is indeed a very useful resource, and I appreciate you releasing it!

Until now I was using Otakar Smrž's ElixirFM for this purpose, but it seems to be no longer maintained ... maybe because the core is written in Haskell, which is not widely spoken. :-) Almost 10 years ago, when I was studying Arabic and struggling to grasp the derivational morphology, I used (parts of) the ElixirFM data to populate a graph database modelling the derivational chains. It never got beyond an alpha version, though: https://shabaka.muraija.org/tx/search?q=إبداعي

Just to get an idea, you can try  جمهورية or لامتناهي or ز خ ر ف (as graph).

I'm not writing this mail to say "your stuff is cool, but look at my stuff"... :-), but because I was wondering - without having read your paper in detail yet - if the ElixirFM data could be useful to smoke test ChainBank. ElixirFM is about morphology, so edges do not have semantic labels, but simply generating a list of edges known to Elixir but not to ChainBank might be useful for debugging.
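A minimal sketch of the smoke test suggested above, assuming both resources can be exported as lists of (base, derivative) pairs. The function names, the pair representation, and the toy edges are illustrative assumptions; they are not taken from ElixirFM or ChainBank.

```python
import unicodedata


def normalize_edge(edge):
    """Apply Unicode NFC to both ends of an edge so that equivalent
    encodings of the same Arabic string compare equal."""
    src, dst = edge
    return (unicodedata.normalize("NFC", src), unicodedata.normalize("NFC", dst))


def missing_edges(reference_edges, candidate_edges):
    """Return the edges found in reference_edges but not in candidate_edges,
    as a sorted list of (base, derivative) pairs."""
    candidate = {normalize_edge(e) for e in candidate_edges}
    reference = {normalize_edge(e) for e in reference_edges}
    return sorted(reference - candidate)


# Toy data for illustration only (hypothetical, not real resource content).
elixir_edges = [("جمهور", "جمهورية"), ("إبداع", "إبداعي")]
chainbank_edges = [("جمهور", "جمهورية")]

# Edges known to the reference list but absent from the candidate list.
gaps = missing_edges(elixir_edges, chainbank_edges)
```

The comparison is an exact string match after normalization, so it only makes sense if both edge lists follow the same diacritization conventions.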

In case you want to try it out, feel free to use a version of the ElixirFM data I prepared to generate "morphological variants of collocations" for Muraija, which is adapted to the camel_morph's diacritization conventions: el-khair-camel.derivations.json

That's my five cents for the moment. Unfortunately, I have to leave now before turning this email into something more elaborate :-)

Best,
Mirko

--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sigarab+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/sigarab/CABZ1H3nqoeYzeWDrVZYX3CcWKG1g4vr1WOZ2gC3TX2dMu%3DbLuw%40mail.gmail.com.
el-khair-camel.derivations.json.html

reham marzouk

Jun 26, 2025, 2:41:05 AM
to Mirko Vogel, SIGARAB

Hi Mirko,

Thank you very much for your interest in our new model, "Arabic ChainBank". I must also say that I truly appreciate your contributions and always enjoy following your valuable discussions on SIGARAB.

Let me briefly share a few points about the Arabic ChainBank. At first glance, it may appear to be a morphological model similar to existing models that follow the root-and-pattern approach for derivation. However, the Arabic ChainBank introduces a unique perspective, positioning itself as a morphosemantic model.

The distinctiveness of the Arabic ChainBank lies in its conceptualization of form and meaning. It captures the derivational behavior of Arabic by organizing derivatives into chains, where each chain forms a path from the root to the most complex derivational form. This structure reflects not only the morphological relations between words but also the semantic specification and shift that occur as words are derived.

In this sense, the ChainBank provides deeper linguistic insight that cannot be revealed without such a structured organization. It also highlights phenomena like the interaction between derivation and inflection, and the role of affixation in shaping derivational meaning.
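As a rough illustration of the chain idea (a path from the root through successively derived forms), here is a small sketch. The `Chain` class, its fields, and the example chain are assumptions made for illustration; they do not reflect the actual ChainBank data format or its contents.

```python
from dataclasses import dataclass, field


@dataclass
class Chain:
    """A derivational chain: a root followed by its derivatives,
    ordered from the simplest form to the most complex one."""
    root: str
    forms: list[str] = field(default_factory=list)

    def links(self) -> list[tuple[str, str]]:
        """Return consecutive (base, derivative) pairs, starting at the root."""
        nodes = [self.root, *self.forms]
        return list(zip(nodes, nodes[1:]))

    def depth(self, form: str) -> int:
        """Return the number of derivational steps from the root to `form`."""
        return self.forms.index(form) + 1


# Hypothetical example chain, not taken from ChainBank.
chain = Chain(root="ج م ه ر", forms=["جمهور", "جمهوري", "جمهورية"])
steps = chain.links()
```

A semantic label on each link, as a morphosemantic model would need, could be added as a third element of each pair; it is left out here because the label inventory is not described in this thread.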

Using other resources to test the ChainBank, even those that are only about morphology, such as ElixirFM, is an excellent idea and an essential part of the process. It is useful not only for debugging but also for checking the completeness of the chains.
Thank you again for your valuable feedback and the shared data. The project is still in its early stages and will benefit from further studies to support its continued development and improvement.

Kalmasoft

Jun 26, 2025, 4:20:15 AM
to reham marzouk, Mirko Vogel, SIGARAB

It looks like an inflectional paradigm to me, whether conjugation, declension, or both in one "chain".

That all the surface forms originating from one root are semantically related is expected, though; it is a basic feature of the "Semitic" family of languages.

This may not work as expected for Arabic, which descends from larger languages and has a considerable number of loanwords from other language families as well. There are two reasons:

1. Semantic shift, within the same root paradigm, please refer to the attached image.
2. Morphosemantic dispersion, a cross-root semantic relationship.

For the second reason, I will mention only a few examples:

- أسر، عصر، حصر، هصر
- حكم، عكم، عقم
- أدب، هذب

I would have these relationships checked, and I suggest the same for Arabic ChainBank as well.

P.S. The data source includes forms that do not exist in the Arabic vocabulary.

Regards 




Mohamed H.

Jun 26, 2025, 6:10:39 AM
to sig...@googlegroups.com

Assalamu alaikum,

This is more of a design/linguistic question: why did you choose to establish the three-letter root as the core from which all derivations follow, instead of the masdar?

Shukran,

Ahmed, H.I.A.A. (Hossam)

Jun 26, 2025, 7:40:18 AM
to Mohamed H., sig...@googlegroups.com

Linguistically, we understand المصدر (the verbal noun) to be related to verbs, which are themselves derived from roots. If you start with verbs or verbal nouns, you will have a hard time starting derivations for words like شجرة، ورقة، أسنان، طريف.

reham marzouk

Jun 26, 2025, 9:45:05 AM
to Ahmed, H.I.A.A. (Hossam), Mohamed H., sig...@googlegroups.com

Hi Mohamed,

Thank you for this question. The origin of Arabic words has long been a subject of debate among classical grammarians from Basra and Kufa. The Arabic ChainBank was not designed to resolve this historical controversy. Instead, it is a resource that links words by tracing their development from the simplest to the most complex forms. Its goal is to provide relational information among these words, shedding light on issues such as the canonicity between word forms and their meanings, morphological syncretism, and other phenomena related to Arabic derivational structure.

Best, 
Reham

Kalmasoft

Oct 19, 2025, 12:16:18 AM
to reham marzouk, SIGARAB: Special Interest Group on Arabic Natural Language Processing

Further to my last comment on this thread, we are announcing the start of compilation of the Database of Homophonic Synonyms in Arabic.

This database, along with others, is dedicated to the academic community and will be available on the Kalmasoft website, GitHub, and Kaggle.

Regards 
