Contributing over 150 Northeastern Neo-Aramaic dialects


Matthew Nazari

Dec 13, 2023, 10:37:54 PM
to unim...@googlegroups.com
There are over 150 dialects of Northeastern Neo-Aramaic (NENA), a diverse group of varieties spoken by marginalized Christian and Jewish communities from northwestern Iran, northern Iraq, and southeastern Turkiye.

The issue is that NENA is not like other languages that UniMorph supports. It does not have a prestige dialect that can represent all of them, and it does not even have appropriate ISO 639-3 codes.

What can the UniMorph project do to support languages like NENA, languages of communities like mine?

Kat Vylomova

Dec 14, 2023, 2:38:47 AM
to Matthew Nazari, unim...@googlegroups.com, Michael Gasser, omer goldman, Salam Khalifa, Antonis Anastasopoulos
Dear Matthew, 

Thank you for your interest! Wow, that's quite impressive! UniMorph has a group of annotators who work(ed) on Semitic languages; you may have a look at the languages that we have annotated so far, the feature set, and the issues we faced: https://docs.google.com/spreadsheets/d/1CEkZW2RdZpAFD6Go8SG3lQJFkhLsLeec_wBzCt8NdRI/edit#gid=0 (I CC'ed some annotators as they might be interested as well). You may also check our (more general) annotation instructions over here: https://unimorph.github.io/doc/unimorph-schema.pdf and particular examples, e.g. for Hebrew https://github.com/unimorph/heb
At some point we also created a Google group for annotators of Semitic languages (for annotation-related discussions). I am not sure how active it is, but it is worth a try: https://groups.google.com/g/unimorph-semitic

Unfortunately, I cannot access the data on the website. What does it provide? Texts in those languages, or morphological paradigms? Do you have any annotations already? I am happy to help or advise on further steps if you provide a bit more information on the data you have. So far, the UniMorph database has been enriched with data from the English edition of Wiktionary, various inflection tables (full paradigms), FSTs (full paradigms), and glossed texts (partial paradigms).

Regarding ISO 639-3 codes: As far as I know, it is possible to submit a request to register a language/dialect. I recall Antonis (also CC'ed) did this for Pomak (?), so he might be able to suggest something.

In any case, we would be happy to have you as a part of the team! :-) 

Warm regards,
Kat



--
You received this message because you are subscribed to the Google Groups "unimorph" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unimorph+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/unimorph/CADYt9QG1e%2BQ2QSHYf8bFnXSTbTr499Hmn3wka%3Drg2xyk%2BKfgcQ%40mail.gmail.com.

Matthew Nazari

Dec 15, 2023, 3:50:02 AM
to Kat Vylomova, unim...@googlegroups.com, Michael Gasser, omer goldman, Salam Khalifa, Antonis Anastasopoulos
Hi Kat,

Thank you very much for your informative and encouraging response.

I am excited to hear about the work done by UniMorph, especially in Semitic languages. The resources you shared are extremely valuable and I will dig into them in the upcoming days and weeks.

Regarding my existing data: I am currently involved with the Northeastern Neo-Aramaic Database Project at the University of Cambridge. Our database includes a comprehensive description of morphological features in most dialects. Additionally, extensive grammatical descriptions, each running to about 2000 pages, have been produced for several dialects. These descriptions often include texts and lexicons. It is this data that made a complete corpus of surface forms for the Christian Urmi dialect possible. I will look at the resources you provided and confirm whether or not I can continue this work through UniMorph. However, this would require a conversation about how to support a language that is not recognized with appropriate ISO codes and exists as a spectrum of many dialects.

Regarding ISO 639-3 codes: I would appreciate any information on how to register new ISO codes, if Antonis or Pomak has any information. 

Kindly,
Matthew

Christian Chiarcos

Dec 15, 2023, 5:53:28 PM
to Matthew Nazari, unim...@googlegroups.com
Dear Matthew,

If the problem is to find appropriate language tags, then a good practice would be the following (for any language):

- provide the BCP47 language code: this is the ISO 639-1 two-letter code, if applicable, the ISO 639-2/3 three-letter code for other languages, or mis for a language not covered by ISO 639 at all. The IANA provides a list of registered/authorized codes (https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry), but you can build your own using the principles described in https://www.rfc-editor.org/info/bcp47.
- optionally, append information about country and writing system (BCP47)
- append -x- to the language code (this is a BCP47 feature that allows you to append user-specific information)
- append the Glottolog language code (https://glottolog.org/), these should be sufficiently fine-grained ... and you can browse the tree
- if there is no Glottolog language code, register a new one and follow the procedure above. Note that this might take a while, but it's much less cumbersome than creating novel ISO 639-3 language codes. I guess the official procedure is to use their contact form. Alternatively, you can also create a GitHub issue in their repository (github.com/glottolog/glottolog).
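
The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name build_tag is made up here, the check on the Glottolog code only tests its general shape (four alphanumeric characters followed by four digits), and the script and region values in the second call are placeholders, not a claim about how any dialect is written.

```python
import re

# General shape of a Glottolog code: four alphanumeric characters, then four digits.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")


def build_tag(base, glottocode, script=None, region=None):
    """Join a base code, optional script/region, and a Glottolog code into one tag.

    Hypothetical helper for illustration; it does not check the base code
    against the IANA registry.
    """
    if not GLOTTOCODE.match(glottocode):
        raise ValueError(f"not shaped like a Glottolog code: {glottocode!r}")
    parts = [base.lower()]
    if script:
        parts.append(script.title())  # BCP47 convention: script in title case
    if region:
        parts.append(region.upper())  # BCP47 convention: region in upper case
    parts += ["x", glottocode]        # "x" opens the private-use part of the tag
    return "-".join(parts)


print(build_tag("hrt", "umra1238"))                              # hrt-x-umra1238
print(build_tag("hrt", "umra1238", script="syrc", region="tr"))  # hrt-Syrc-TR-x-umra1238
```

A Glottolog code is eight characters long, which is also the maximum length BCP47 allows for a single subtag, so it fits after -x- without any abbreviation.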

Note that Glottolog language IDs are persistent in the sense that a code once registered won't be re-assigned and will be resolvable. However, codes may be deprecated and/or the language description may be updated.

For Neo-Aramaic, I guess the following ISO 639-3 codes would be relevant:

aii Assyrian Neo-Aramaic
amw Western Neo-Aramaic
bhn Bohtan Neo-Aramaic
bjf Barzani Jewish Neo-Aramaic
cld Chaldean Neo-Aramaic
hrt Hertevin
syr Syriac
(there may be more)

For, say, the Umraya dialect of Hertevin, the corresponding language tag would be
hrt-x-umra1238
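
One practical consequence of this layout is that software unaware of Glottolog can still read the standard part of the tag and ignore the rest. A minimal sketch, assuming a made-up helper name split_tag and doing no validation beyond locating the -x- marker:

```python
def split_tag(tag):
    """Split a tag into its standard part and its private-use subtags.

    Hypothetical helper: everything after the first "x" subtag is treated
    as private use, which is how BCP47 defines it.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []
    i = lowered.index("x")
    return "-".join(subtags[:i]), subtags[i + 1:]


standard, private = split_tag("hrt-x-umra1238")
print(standard)  # hrt
print(private)   # ['umra1238']
```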

For constructing such a language code from Glottolog:
- search the language variety in Glottolog
- if the language variety has an ISO 639-3 tag in Glottolog, use that, otherwise consult its parent groups until you find one. If you find none, use mis
- if you have an ISO 639-3 code, consult the official ISO 639-3 code tables (https://iso639-3.sil.org/code_tables/639/data) to check whether they provide an ISO 639-1 code. Use that one. If not, but there is an ISO 639-2 code, use that one. If neither, use the ISO 639-3 code.
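
As a rough sketch of this lookup: the three dictionaries below are toy stand-ins for Glottolog's tree and the ISO 639 code tables, and the parent entry "made0001" is a made-up placeholder rather than a real Glottolog code. A real implementation would load these tables from the sources linked above.

```python
# Toy data, for illustration only.
PARENT = {"umra1238": "made0001"}  # child -> parent; "made0001" is a made-up placeholder
ISO3 = {"made0001": "hrt"}         # Glottolog node -> ISO 639-3 code, where one exists
SHORTER = {"eng": "en"}            # ISO 639-3 -> ISO 639-1 (or 639-2) code, where one exists


def base_code(glottocode):
    """Walk up the tree until a node with an ISO 639-3 code is found.

    Returns the shorter ISO 639-1/639-2 code if the table lists one,
    otherwise the ISO 639-3 code, otherwise "mis".
    """
    node = glottocode
    while node is not None:
        iso3 = ISO3.get(node)
        if iso3:
            return SHORTER.get(iso3, iso3)
        node = PARENT.get(node)  # None once the top of the toy tree is reached
    return "mis"


print(base_code("umra1238"))  # hrt
print(base_code("unkn0000"))  # mis
```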

Now, whether or not UniMorph adopts these practices is another thing to be discussed. But it's way better than inventing something from scratch.

Best,
Christian



Christian Chiarcos

Dec 15, 2023, 5:53:32 PM
to Matthew Nazari, Kat Vylomova, unim...@googlegroups.com, Michael Gasser, omer goldman, Salam Khalifa, Antonis Anastasopoulos
On Fri, Dec 15, 2023 at 09:50, Matthew Nazari <matthe...@college.harvard.edu> wrote:
Regarding ISO 639-3 codes: I would appreciate any information on how to register new ISO codes, if Antonis or Pomak has any information.


But note that this is a slow process. If you look at the undecided change requests at SIL (https://iso639-3.sil.org/code_changes/change_request_index/data), you will see that in some cases no decision has been reached after more than a decade. The oldest request still pending (so neither rejected nor approved) right now is from 2006. Also note that they tend to reject dialects and to push them into ISO 639-6 (https://www.iso.org/standard/43380.html), but this standard has been withdrawn (or rather, dissolved along with its maintainer). The closest replacement for ISO 639-6 is Glottolog, hence the suggestion of using BCP47 language codes with a custom Glottolog extension.

Best,
Christian
 

omer goldman

Dec 19, 2023, 6:50:15 AM
to Matthew Nazari, Kat Vylomova, unim...@googlegroups.com, Michael Gasser, Salam Khalifa, Antonis Anastasopoulos
Hi Matthew, Kat,

The work on these dialects is really interesting and important, I'd be happy to help if needed. I was involved in improving the data and adapting resources for several languages, including Hebrew.

Regarding the language codes: if I understand correctly, some lects already have an ISO code (see Wikipedia), although I am not sure that mapping those codes onto the actual groups of lects listed on your website is straightforward.
Another option, which may be complementary, is to have all dialects share one repo with a single ISO code; see the Karelian repo for an example.
In any event, I think that the exact organization of the data and the codes used should not hinder the inclusion of the data in UniMorph. We could find additional solutions if needed.

Best wishes,
Omer
