import TEI-Lex 0 data and correct the text & annotations

Ondrej Tichy

Sep 25, 2021, 12:04:50 PM
to inception-users
Hi, 

We are digitising a large onomasiological dictionary. We have TEI Lex-0 (https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html) data produced by a machine learning tool (Grobid-dictionaries), and we need to correct both the XML annotations and the text itself.

We are looking for a tool where about 20 annotators can work on this - would Inception be a good match? I can't seem to get the info anywhere on whether the underlying text itself can be edited in inception as well, or only the annotations.

Also, how difficult would it be to import the TEI-XML files and the schema?

Thanks for any help or pointers,

Ondrej

Richard Eckart de Castilho

Sep 25, 2021, 1:13:35 PM
to inception-users
Hi Ondrej,

> On 25. Sep 2021, at 18:04, Ondrej Tichy <ondrej...@gmail.com> wrote:
>
> We are looking for a tool where about 20 annotators can work on this - would Inception be a good match? I can't seem to get the info anywhere on whether the underlying text itself can be edited in inception as well, or only the annotations.

INCEpTION assumes that the text is fixed; you cannot change it. Imagine one annotator changed the text: what would that mean for another annotator working independently on the same text? Since they work independently, the other annotator would not, and should not, see the changes. Worse, the other annotator might make a different change. Eventually, the two annotators would end up annotating different texts. How would you then compare their annotations, e.g. to calculate inter-rater agreement?

In an INCEpTION-based workflow, you have two options for dealing with this situation:

1) edit the text before importing it into INCEpTION, so that all annotators have the
same text
2) instead of actually modifying the text, annotators make annotations that indicate
how they would modify it. These change suggestions can be compared and curated.
The agreed-upon changes are then applied to the texts, and a second round of
annotation starts on the corrected texts, so that again all annotators have the
same text
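
To illustrate the last step of option 2, here is a minimal sketch of how curated change suggestions might be applied to a text once they have been exported. It assumes each suggestion boils down to a character span plus a replacement string; the `Suggestion` class and `apply_suggestions` function are hypothetical names for illustration and are not part of INCEpTION.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """A curated change suggestion (hypothetical structure, not an INCEpTION type)."""
    begin: int        # character offset where the span to replace starts
    end: int          # character offset where it ends (exclusive)
    replacement: str  # text that should replace the span


def apply_suggestions(text: str, suggestions: list[Suggestion]) -> str:
    """Apply non-overlapping suggestions to the text and return the corrected text."""
    # Work from the end of the text towards the start, so that the offsets of
    # suggestions that have not been applied yet stay valid.
    ordered = sorted(suggestions, key=lambda s: s.begin, reverse=True)
    upper_bound = len(text)
    for s in ordered:
        # Reject spans that are out of range or overlap the previously applied one.
        if not (0 <= s.begin <= s.end <= upper_bound):
            raise ValueError(f"invalid or overlapping suggestion: {s}")
        text = text[:s.begin] + s.replacement + text[s.end:]
        upper_bound = s.begin
    return text


original = "Teh quick brwn fox"
curated = [Suggestion(0, 3, "The"), Suggestion(10, 14, "brown")]
corrected = apply_suggestions(original, curated)
```

The corrected text would then be imported as a new document for the second annotation round. Overlapping suggestions are rejected here on the assumption that curation has already resolved such conflicts.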

> Also, how difficult would it be to import the TEI-XML files and the schema?

INCEpTION has basic support for a few linguistic layers defined in TEI. We use the DKPro Core TEI reader implementation for this [1]. But TEI Lex-0 looks more like a knowledge resource to me than like a text annotation format, so you might rather consider converting your TEI Lex-0 files into e.g. a simple SKOS ontology. You could import that ontology into INCEpTION and then link annotations in the texts to the entries defined in it.

If you have special needs for importing TEI into INCEpTION, the best way to go at the moment is to implement a custom Python script which reads your TEI dialect, extracts the relevant information and writes it out in the UIMA CAS XMI format. The DKPro Cassis [2] library makes it easy to work with XMI files in Python. You can then import these files into INCEpTION.

We do not have a ready-made example of such a custom conversion script for TEI, but we do have one for Word files which may be reasonably easy to adapt [3]. It outlines the process of preparing an XMI file compatible with INCEpTION, but starting from a Word file instead of a TEI file. There are also a couple of other examples [4] you might find interesting.
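
As a rough idea of what the first half of such a script could look like, here is a sketch using only the Python standard library. It flattens a TEI fragment into plain text and records a character span for every element. The sample entry and the function names are made up for illustration, and a real TEI Lex-0 file will need more care (attributes, whitespace handling, which elements to keep). Writing the text and spans into a CAS with DKPro Cassis is deliberately left out; see [3] for that part.

```python
import xml.etree.ElementTree as ET

# Made-up sample entry for illustration; real TEI Lex-0 entries are richer.
SAMPLE = (
    '<entry xmlns="http://www.tei-c.org/ns/1.0">'
    '<form type="lemma"><orth>cat</orth></form> '
    '<sense><def>a small feline</def></sense>'
    '</entry>'
)


def _flatten(elem, parts, spans, offset):
    """Append the text inside elem to parts, record its span, return the new offset."""
    begin = offset
    if elem.text:
        parts.append(elem.text)
        offset += len(elem.text)
    for child in elem:
        offset = _flatten(child, parts, spans, offset)
        # Text following a child element belongs to the parent's content.
        if child.tail:
            parts.append(child.tail)
            offset += len(child.tail)
    # Strip the "{namespace}" prefix that ElementTree puts in front of tag names.
    local_name = elem.tag.split("}", 1)[-1]
    spans.append((begin, offset, local_name))
    return offset


def tei_to_text_and_spans(xml_string):
    """Return the plain text and a sorted list of (begin, end, tag) spans."""
    root = ET.fromstring(xml_string)
    parts, spans = [], []
    _flatten(root, parts, spans, 0)
    return "".join(parts), sorted(spans)


text, spans = tei_to_text_and_spans(SAMPLE)
for begin, end, tag in spans:
    print(f"{tag:6} [{begin}:{end}] {text[begin:end]!r}")
```

Each span could then become one annotation over the document text in the XMI file, with the layer and feature names chosen to match the layers configured in your INCEpTION project.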

Best,

-- Richard

[1] https://dkpro.github.io/dkpro-core/releases/2.2.0/docs/format-reference.html#format-Tei
[2] https://github.com/dkpro/dkpro-cassis
[3] https://colab.research.google.com/github/inception-project/inception/blob/main/notebooks/annotated_word_files_to_cas_xmi.ipynb
[4] https://inception-project.github.io/example-projects/python/

Ondrej Tichy

Sep 26, 2021, 2:44:24 PM
to inception-users
Richard,

Thank you very much for the swift and detailed reply!
Adding annotations instead of editing the text directly has been on my mind, but we will have to see whether that makes the whole exercise too cumbersome.

Best

Ondrej
