Export format issues: TEI, CoNLL-U


Piroska Lendvai

Oct 7, 2021, 12:16:39 PM
to inception-users
Dear All,

I have three questions regarding the export format.

1. I toy-annotated a document in the online demo (again imported from HTML), and the TEI export did not include the annotations.
The link to the file is: https://morbo.ukp.informatik.tu-darmstadt.de/p/2073/annotate?2#!d=6815&f=1

2. 
I toy-annotated another document (also imported from HTML) in a local INCEpTION, and the annotations are present (as <rs type="..."> elements). 
I fed this TEI XML document to a converter tool that expects TEI5, and it failed. 
I randomly chose a TEI validation method: https://trafilatura.readthedocs.io/en/latest/tutorial2.html
and the file did not validate.
I wonder what version of TEI is used at export.

3. On both files, when exporting to CoNLL-U, I did not see the annotations in the exported data anymore.

Thank you very much in advance for any information to find out more,
Best regards,
Piroska

Richard Eckart de Castilho

Oct 7, 2021, 12:41:29 PM
to incepti...@googlegroups.com
Hi.

> On 7. Oct 2021, at 18:16, Piroska Lendvai <piro...@gmail.com> wrote:
>
> 1. I toy-annotated a document in the online demo (again imported from HTML), and the TEI export did not include the annotations.
> The link to the file is:
> https://morbo.ukp.informatik.tu-darmstadt.de/p/2073/annotate?2#!d=6815&f=1

The TEI export does not currently support custom layers; only specific built-in layers are supported:

* Lemma
* POS (xpos)
* Named entities (value)

See: https://morbo.ukp.informatik.tu-darmstadt.de/doc/user-guide.html#sect_formats_tei

> 2.
> I toy-annotated another document (also imported from HTML) in a local INCEpTION, and the annotations are present (as <rs type="..."> elements).
> I fed this TEI XML document to a converter tool that expects TEI5, and it failed.
> I randomly chose a TEI validation method: https://trafilatura.readthedocs.io/en/latest/tutorial2.html
> and the file did not validate.
> I wonder what version of TEI is used at export.

We use the DKPro Core TEI reader / writer, which supports only a subset of TEI. The supported elements are listed here:

https://dkpro.github.io/dkpro-core/releases/2.2.0/docs/format-reference.html#format-Tei

The reader / writer were developed using various TEI files from different sources as test material. If you have particular problems with data not validating, you can report this as an issue in the INCEpTION or DKPro Core GitHub issue trackers.

> 3. On both files, when exporting to CoNLL-U, I did not see the annotations in the exported data anymore.

CoNLL-U also only supports specific built-in layers:

* Lemma
* POS
* dependencies (basic & enhanced)
* surface form

See: https://morbo.ukp.informatik.tu-darmstadt.de/doc/user-guide.html#sect_formats_conllu
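To make the limitation concrete: CoNLL-U defines ten fixed, tab-separated columns per token, and none of them is dedicated to entity spans or custom layers. The following minimal sketch splits one token line into those columns. The sample line and the helper name `parse_token_line` are made up for illustration and are not taken from an INCEpTION export.

```python
# The ten columns defined by the CoNLL-U format, in order.
CONLLU_COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a column -> value dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONLLU_COLUMNS):
        raise ValueError(f"expected {len(CONLLU_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(CONLLU_COLUMNS, fields))


# Hypothetical token line; "_" marks an empty column.
line = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t2:nsubj\t_"
token = parse_token_line(line)

for column in CONLLU_COLUMNS:
    print(f"{column:7} {token[column]}")
```

The columns line up with the layers listed above: FORM (surface form), LEMMA, UPOS/XPOS (POS), and HEAD/DEPREL/DEPS (basic and enhanced dependencies).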

You can add these layers to a project via the add-layer dropdown in the project's layer settings. Alternatively, you can create a project that already contains these built-in layers by pressing the "new project" button on the project overview page or by choosing "standard project" from the quick-project-template dropdown.

For working with custom layers, the viable formats are UIMA CAS XMI and WebAnno TSV 3.

Cheers,

-- Richard

Piroska Lendvai

Oct 7, 2021, 2:41:42 PM
to inception-users
Dear Richard,

Thank you very much for the useful pointers.

Would there be some Python recipe for how to work with UIMA CAS XMI and the cassis module, e.g. how to extract gazetteer lists from annotated entities?
The code snippet at https://github.com/dkpro/dkpro-cassis#selecting-annotations
did not work for me on the UIMA CAS XMI I exported (please find the files attached); I got: Type with name [cassis.Sentence] not found!
I suspect this is because my text/HTML was not analyzed/tokenized by me before import (or was it, after import, by default? please bear with me, I understand the INCEpTION workflow very roughly only).
Can one nevertheless extract the entities?

I also attach the XML (format: FoLiA, https://proycon.github.io/folia/) from where the HTML was created. I wonder if it would be feasible to import FoLiA XML directly to INCEpTION. What do you think?

Thank you, 
Piroska
FA-MBK-4-3_035245008_0019_abpproc_entries.ucto.folia.xml
FA-MBK-4-3_035245008_0019_abpproc_entries.xmi
TypeSystem.xml

Richard Eckart de Castilho

Oct 8, 2021, 2:17:15 AM
to incepti...@googlegroups.com
Hi Piroska,

> On 7. Oct 2021, at 20:41, Piroska Lendvai <piro...@gmail.com> wrote:
>
> Would there be some python recipe for how to work with UIMA CAS XMI and the cassis module, e.g. how to extract gazetteer lists from annotated entities?
> The code snippet at https://github.com/dkpro/dkpro-cassis#selecting-annotations
> did not work for me on the UIMA CAS XMI I exported (please find the files attached); I got: Type with name [cassis.Sentence] not found!
> I suspect this is because my text/HTML was not analyzed/tokenized by me before import (or was it, after import, by default? please bear with me, I understand the INCEpTION workflow very roughly only).
> Can one nevertheless extract the entities?

The type names that you can use for looking up types can be found in the layer settings of INCEpTION. Select a layer and then look for "internal name" in the "technical information" panel. You can also find the names in the TypeSystem.xml. However, note that the TypeSystem.xml contains a ton of types that are not really used by INCEpTION... we should probably clean up that file a bit...

```
from pathlib import Path
from cassis import load_typesystem

# Load the type system exported alongside the XMI file
typesystem = load_typesystem(Path("piroska/TypeSystem.xml"))

# Print the fully qualified name of every type it defines
for t in typesystem.get_types():
    print(t.name)
```

Once you know the type, you can retrieve the annotations of that type like this:

```
from pathlib import Path
from cassis import load_cas_from_xmi, load_typesystem

# The XMI file can only be read together with its type system
typesystem = load_typesystem(Path("piroska/TypeSystem.xml"))
cas = load_cas_from_xmi(Path("piroska/FA-MBK-4-3_035245008_0019_abpproc_entries.xmi"), typesystem=typesystem)

# Internal name of the built-in named entity layer
named_entity_name = "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"
for s in cas.select(named_entity_name):
    print(s.get_covered_text())
```

> I also attach the XML (format: FoLiA, https://proycon.github.io/folia/) from where the HTML was created. I wonder if it would be feasible to import FoLiA XML directly to INCEpTION. What do you think?

It would require implementing a reader for that format in Java. From a rough look, I see words and sentences, so that should be relatively easy. I'd have to look in more detail at other aspects. But there seems to be a Python library for FoLiA, so maybe you could whip something up that reads FoLiA and uses cassis to write out XMI.
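A converter along these lines has to turn FoLiA's words and sentences into character offsets, because annotations in a CAS are anchored by begin/end positions in the document text. The sketch below covers only that bookkeeping step, under the assumption that the tokens have already been read from the FoLiA file by some other means. It joins tokens with single spaces, so the original spacing is not preserved, and the function name `layout_tokens` is hypothetical.

```python
def layout_tokens(sentences):
    """Join tokenized sentences into one text and record character offsets.

    `sentences` is a list of sentences, each a list of token strings.
    Returns (text, token_spans, sentence_spans), where each span is a
    (begin, end) pair of character offsets into `text`.
    """
    parts = []
    token_spans = []
    sentence_spans = []
    position = 0

    for sentence in sentences:
        if not sentence:
            continue  # ignore empty sentences
        if parts:
            # Separate sentences with a newline
            parts.append("\n")
            position += 1
        sentence_begin = position
        for index, token in enumerate(sentence):
            if index > 0:
                # Separate tokens with a single space
                parts.append(" ")
                position += 1
            token_spans.append((position, position + len(token)))
            parts.append(token)
            position += len(token)
        sentence_spans.append((sentence_begin, position))

    return "".join(parts), token_spans, sentence_spans


# Made-up input standing in for tokens read from a FoLiA document
text, token_spans, sentence_spans = layout_tokens(
    [["Hello", "world", "."], ["Bye", "."]]
)
for begin, end in token_spans:
    print(begin, end, repr(text[begin:end]))
```

The resulting text would become the document text of the CAS, and each span would become a token or sentence annotation before the CAS is written out as XMI.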

Cheers,

-- Richard

Piroska Lendvai

Oct 8, 2021, 6:06:53 AM
to inception-users
Dear Richard,

Thank you, the details are very helpful.

> named_entity_name = "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"
> for s in cas.select(named_entity_name):
>     print(f"{s.get_covered_text()}")


How do I access the value of the entity annotation (e.g. 'person')? 
Where should I look in the cassis code (or its documentation, is there a manual)?

I also wonder what the difference is between using a custom entity layer, or using a built-in entity layer to which one adds their custom entities (this is possible, if I saw it right).
 
> I also attach the XML (format: FoLiA, https://proycon.github.io/folia/) from where the HTML was created. I wonder if it would be feasible to import FoLiA XML directly to INCEpTION. What do you think?

> It would require implementing a reader for that format in Java. From a rough look, I see words and sentences, so that should be relatively easy. I'd have to look in more detail at other aspects. But there seems to be a Python library for FoLiA, so maybe you could whip something up that reads FoLiA and uses cassis to write out XMI.

Thanks, I will ask the FoLiA developer meanwhile.

Regarding tokenization, is there a default behavior of INCEpTION? Does it need tokenization, does it perform better if it gets pretokenized text, or does it tokenize internally anyway in some scenarios?

Thanks very much,
Piroska
 

Richard Eckart de Castilho

Oct 8, 2021, 2:38:58 PM
to incepti...@googlegroups.com
> On 8. Oct 2021, at 12:06, Piroska Lendvai <piro...@gmail.com> wrote:
>
> How do I access the value of the entity annotation (e.g. 'person')?
> Where should I look in the cassis code (or its documentation, is there a manual)?

You use the feature name. Mind that the name shown in the UI in INCEpTION may not directly correspond to the feature name in the XMI file.

```
# "value" is the feature that holds the entity label (e.g. person)
for s in cas.select(named_entity_name):
    print(f"{s.get_covered_text()}: {s.value}")
```

You can get the list of features like this:

```
named_entity_name = "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"
# List every feature of the type, including inherited ones
for f in typesystem.get_type(named_entity_name).all_features:
    print(f.name)
```
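For the gazetteer use case raised earlier in the thread, the covered text and the `value` feature can be grouped by label. This is a minimal sketch rather than an official cassis recipe: `build_gazetteer` is a hypothetical helper that accepts any objects offering `get_covered_text()` and a `value` attribute, and `FakeEntity` is a stand-in so that the example runs without an XMI file.

```python
from collections import defaultdict


def build_gazetteer(annotations):
    """Group the covered texts of annotations by entity label, without duplicates."""
    gazetteer = defaultdict(set)
    for annotation in annotations:
        label = annotation.value
        text = annotation.get_covered_text().strip()
        # Skip annotations that have no label or cover only whitespace
        if label and text:
            gazetteer[label].add(text)
    # Sort labels and entries so the output is stable
    return {label: sorted(texts) for label, texts in sorted(gazetteer.items())}


class FakeEntity:
    """Hypothetical stand-in for an annotation returned by cas.select(...)."""

    def __init__(self, text, value):
        self._text = text
        self.value = value

    def get_covered_text(self):
        return self._text


# Made-up annotations; the last one has no label and is ignored
demo = [
    FakeEntity("Budapest", "LOC"),
    FakeEntity("Budapest", "LOC"),
    FakeEntity("Piroska", "PER"),
    FakeEntity("unlabelled span", None),
]
for label, entries in build_gazetteer(demo).items():
    print(label, entries)
```

With a loaded CAS, the result of `cas.select(named_entity_name)` from the snippets above could be passed to `build_gazetteer` in place of `demo`.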

> I also wonder what the difference is between using a custom entity layer, or using a built-in entity layer to which one adds their custom entities (this is possible, if I saw it right).

The built-in layer has two features: "value" and "identifier". You can associate the value with a tagset - I think that is what you mean by "custom entities".

You can also create a custom layer with the same features as the built-in layer if you want.

The difference between the two is that importers/exporters such as TEI/CoNLL/etc. will only know about the built-in layer and won't be able to work with the custom layer.

Also, you cannot add additional features to the built-in layer.

Cheers,

-- Richard

Piroska Lendvai

Oct 12, 2021, 10:22:57 AM
to inception-users
Dear Richard, 
Many thanks, your answer has been very helpful.
Cheers,
Piroska
