Hello Flammie and all! We are currently documenting the Siberian Ingrian Finnish language
(https://en.wikipedia.org/wiki/Siberian_Ingrian_Finnish, https://github.com/ubaleht/SiberianIngrianFinnish). We have about 80 .doc files with 40,000 tokens in these files, none of them morphologically annotated yet, and we face a similar problem with the annotation style.
We propose a client-server architecture. Our rule-based morphological analyzer will store lemmas, word forms, tokens and morphological annotations in a relational database. One table in the database will hold the correspondence between the main annotation styles (UniMorph, UD, and the style used in Russian linguistic publications on Finno-Ugric languages). By using a relational database and server-side code, we will separate the representation of the morphological markup from the data itself, which will let us switch between styles.
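To make the idea of a style-correspondence table more concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is only an illustration under my own assumptions, not our actual schema: the table names (`feature`, `style_label`, `annotation`), the `render` function and the separator choices are all hypothetical, and the example word is standard Finnish "talon" (genitive singular of "talo"), not Siberian Ingrian Finnish data. Only the UniMorph and UD labels are shown; a third style would simply be more rows in `style_label`. Part-of-speech tags are left out for brevity.

```python
import sqlite3

def build_db():
    """Create a toy schema: annotations point at style-neutral features,
    and a separate table maps each feature to a label per style."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL          -- style-neutral internal name
        );
        CREATE TABLE style_label (
            feature_id INTEGER NOT NULL REFERENCES feature(id),
            style      TEXT NOT NULL,          -- e.g. 'unimorph', 'ud'
            label      TEXT NOT NULL,
            PRIMARY KEY (feature_id, style)
        );
        CREATE TABLE annotation (
            token      TEXT NOT NULL,
            position   INTEGER NOT NULL,       -- order of features in output
            feature_id INTEGER NOT NULL REFERENCES feature(id)
        );
    """)
    conn.executemany("INSERT INTO feature VALUES (?, ?)",
                     [(1, "genitive"), (2, "singular")])
    conn.executemany("INSERT INTO style_label VALUES (?, ?, ?)", [
        (1, "unimorph", "GEN"), (1, "ud", "Case=Gen"),
        (2, "unimorph", "SG"),  (2, "ud", "Number=Sing"),
    ])
    # The analysis of "talon" is stored once, independent of any style.
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)",
                     [("talon", 1, 1), ("talon", 2, 2)])
    return conn

# Separator used when joining labels; an assumption of this sketch.
SEPARATORS = {"unimorph": ";", "ud": "|"}

def render(conn, token, style):
    """Return the annotation of `token` written in the requested style."""
    rows = conn.execute(
        """SELECT s.label
           FROM annotation a
           JOIN style_label s ON s.feature_id = a.feature_id
           WHERE a.token = ? AND s.style = ?
           ORDER BY a.position""",
        (token, style),
    ).fetchall()
    return SEPARATORS[style].join(label for (label,) in rows)

conn = build_db()
print(render(conn, "talon", "unimorph"))
print(render(conn, "talon", "ud"))
```

The point of the design is that the stored annotation never mentions a style: the server picks the labels at query time, so adding or correcting a style only touches the `style_label` rows.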
A similar architecture is used in the open corpus of the Veps and Karelian languages:
https://arxiv.org/ftp/arxiv/papers/2206/2206.03870.pdf

Best regards,
Ivan Ubaleht