Semi-automatically generating unimorph data with rule-based morphologies?


Flammie Pirinen (Flammie)

Nov 8, 2023, 6:45:20 AM
to unimorph
Hi all,

I've been experimenting with generating UniMorph-style data from the morphological analyser-generators we've been working on for the last few decades (cf. https://github.com/giellalt/). Generating the data seems quite straightforward, but after talking with colleagues, we don't want to blindly generate everything from everything without control. I'd also like to add documentation and explanations of the tag selections to the datasets, in the same style as Universal Dependencies does. Where should I start?
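
One way to picture generation "with control" is an explicit list of source tag strings that are allowed to produce output, each paired with its UniMorph feature bundle. The sketch below is only an illustration under stated assumptions: `generate` is a hypothetical stand-in for a call to a real generator, the toy lexicon holds a single Finnish noun, and the tag mapping is an example, not an agreed conversion.

```python
# Hypothetical mapping from Giella-style tag strings to UniMorph feature
# bundles. Both the selection and the target labels are illustrative only.
TAG_MAP = {
    "+N+Sg+Nom": "N;NOM;SG",
    "+N+Sg+Gen": "N;GEN;SG",
    "+N+Pl+Nom": "N;NOM;PL",
}


def generate(lemma, tags):
    """Stand-in for a real morphological generator (an assumption).

    Returns a list of surface forms, or an empty list when there is no
    output for this lemma and tag string.
    """
    toy_lexicon = {
        ("talo", "+N+Sg+Nom"): ["talo"],
        ("talo", "+N+Sg+Gen"): ["talon"],
        ("talo", "+N+Pl+Nom"): ["talot"],
    }
    return toy_lexicon.get((lemma, tags), [])


def unimorph_rows(lemmas, tag_map=TAG_MAP):
    """Yield (lemma, form, features) only for explicitly listed tag strings."""
    for lemma in lemmas:
        for source_tags, um_features in tag_map.items():
            for form in generate(lemma, source_tags):
                yield lemma, form, um_features


if __name__ == "__main__":
    # Print tab-separated triples in the usual lemma, form, features order.
    for row in unimorph_rows(["talo"]):
        print("\t".join(row))
```

Anything not listed in `TAG_MAP` is simply never generated, so the mapping table doubles as the record of what was selected.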

Kat Vylomova

Nov 8, 2023, 7:05:58 AM
to Flammie Pirinen (Flammie), unimorph
Dear Flammie, 

Regarding tag selection, I would recommend starting with the UM documentation provided here: https://unimorph.github.io/doc/unimorph-schema.pdf. Kyle Gorman has also developed a script to check the tags: https://github.com/unimorph/um-canonicalize.

Keeping tag conversions and paradigm structures consistent can be tricky in some cases. At some point we had working groups to agree on the conversion for languages within each language family. If you specify which languages you are working on, I can provide you with the spreadsheets (if they exist). For instance, Mans Hulden worked on Uralic: https://docs.google.com/spreadsheets/d/1RjO_J22yDB5FH5C24ej7sGGbeFAjcIadJA6ML55tsOI/edit#gid=201864700 (Nominal Classes)

Warm regards,
Kat
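
The two consistency concerns above (valid tags, uniform paradigm structure) can be sketched as simple checks over converted rows. This is not the um-canonicalize script, only a rough illustration: `KNOWN_FEATURES` is a tiny excerpt of labels chosen for the example, and both function names are hypothetical.

```python
from collections import defaultdict

# Small excerpt of UniMorph feature labels, for illustration only. A real
# check would load the full inventory from the schema documentation.
KNOWN_FEATURES = {"N", "V", "ADJ", "NOM", "GEN", "SG", "PL"}


def unknown_features(rows, known=KNOWN_FEATURES):
    """Return the sorted feature labels that are not in the known inventory."""
    seen = set()
    for _lemma, _form, features in rows:
        seen.update(features.split(";"))
    return sorted(seen - known)


def missing_cells(rows):
    """Map each lemma to the paradigm cells it lacks compared with the rest."""
    cells = defaultdict(set)
    for lemma, _form, features in rows:
        cells[lemma].add(features)
    all_cells = set().union(*cells.values())
    return {
        lemma: sorted(all_cells - have)
        for lemma, have in cells.items()
        if all_cells - have
    }
```

A lemma reported by `missing_cells` is not necessarily wrong (defective paradigms exist), so the output is better read as a list of places to look than as a list of errors.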

Flammie

Nov 8, 2023, 3:25:25 PM
to unim...@googlegroups.com
Thanks. I work mostly on the Uralic languages myself, but we have others for possible future work (full list: https://giellalt.github.io/LanguageModels.html). I have a list of notes like the ones on the notes tab of that spreadsheet, and many more. I think many of these questions, such as the naming of forms, are open questions even within morphology as a linguistic field and may never be fully agreed upon. So I'd rather we have some documentation to go with each language that contains just these notes: which label we selected to present each affix, and what it corresponds to in other linguistic grammars. This matters also because the difference between the Uralic tradition and UniMorph is quite big, as one can see from the tables there. It would be a bit like what UD has at https://universaldependencies.org/sme/index.html, but more about which tags were mapped and what was included and excluded.
--
Flammie <https://flammie.github.io>
apologies for top-posting, html etc. this mail was written on mobile gmail app or web-app
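
The per-language notes described above could be kept as structured records and rendered into a plain-text document next to the data. The following is a minimal sketch under assumptions: the `TagDecision` fields, the `render_notes` function and the two example decisions are all hypothetical and do not reflect an actual mapping.

```python
from dataclasses import dataclass


@dataclass
class TagDecision:
    source_tag: str  # tag as used in the source analyser
    unimorph: str    # chosen UniMorph label, empty string if excluded
    included: bool   # whether forms carrying this tag are generated
    note: str        # rationale, or pointer to other grammar traditions


def render_notes(language, decisions):
    """Render the decisions as a short plain-text document."""
    lines = [f"Tag mapping notes: {language}", ""]
    for d in decisions:
        status = "included" if d.included else "excluded"
        target = d.unimorph if d.unimorph else "(none)"
        lines.append(f"{d.source_tag} -> {target} [{status}]")
        lines.append(f"    {d.note}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative decisions only, not an agreed mapping.
    example = [
        TagDecision("+Ine", "IN+ESS", True,
                    "Inessive case, written as a combination of IN and ESS."),
        TagDecision("+Foc/kin", "", False,
                    "Clitic particle, left out to keep paradigms small."),
    ]
    print(render_notes("Finnish", example))
```

Keeping the excluded tags in the same table means the document records what was left out as well as what was mapped.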

Ivan Ubaleht

Nov 11, 2023, 7:18:41 PM
to Flammie, unim...@googlegroups.com
Hello Flammie and all! We are currently documenting the Siberian Ingrian Finnish language (https://en.wikipedia.org/wiki/Siberian_Ingrian_Finnish, https://github.com/ubaleht/SiberianIngrianFinnish). We have about 80 .doc files containing 40,000 tokens without morphological annotation, and we have a similar problem with the annotation style.

We propose a client-server architecture. Our rule-based morphological analyzer will save lemmas, word forms, tokens and morphological annotations into a relational database. One table in the database will hold the correspondences between the main annotation styles (UniMorph, UD and the style used in Russian linguistic publications on Finno-Ugric languages). By using a relational database and server-side code, we separate the representation of the morphological markup from the data and can switch between styles.

A similar architecture is used in the open corpus of the Veps and Karelian languages: https://arxiv.org/ftp/arxiv/papers/2206/2206.03870.pdf
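
The style switch proposed above can be sketched with an in-memory SQLite database. This is a rough illustration, not the schema of either project: the table and column names are assumptions, the two annotated forms are a toy Finnish example, and only two styles are filled in (a third would just be more rows in `tag_label`).

```python
import sqlite3


def build_db():
    """Create an in-memory database with style-independent annotations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tag_label (
            tag_id INTEGER REFERENCES tag(id),
            style  TEXT,
            label  TEXT,
            PRIMARY KEY (tag_id, style)
        );
        CREATE TABLE annotation (
            form TEXT, lemma TEXT, tag_id INTEGER REFERENCES tag(id)
        );
        """
    )
    # Internal tags are stored once; each style only adds a label for them.
    conn.executemany("INSERT INTO tag VALUES (?, ?)",
                     [(1, "noun_nom_sg"), (2, "noun_gen_sg")])
    conn.executemany("INSERT INTO tag_label VALUES (?, ?, ?)", [
        (1, "unimorph", "N;NOM;SG"),
        (2, "unimorph", "N;GEN;SG"),
        (1, "ud", "Case=Nom|Number=Sing"),
        (2, "ud", "Case=Gen|Number=Sing"),
    ])
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)",
                     [("talo", "talo", 1), ("talon", "talo", 2)])
    return conn


def annotations_in_style(conn, style):
    """Return (form, lemma, label) rows rendered in the requested style."""
    cur = conn.execute(
        """
        SELECT a.form, a.lemma, l.label
        FROM annotation AS a
        JOIN tag_label AS l ON l.tag_id = a.tag_id
        WHERE l.style = ?
        ORDER BY a.form
        """,
        (style,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = build_db()
    for style in ("unimorph", "ud"):
        print(style, annotations_in_style(db, style))
```

Because the annotations reference internal tag ids rather than labels, changing or adding a style never touches the annotated data itself.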

Best regards,
Ivan Ubaleht
