Hi Piroska,
> On 7. Oct 2021, at 20:41, Piroska Lendvai <piro...@gmail.com> wrote:
>
> Would there be some python recipe for how to work with UIMA CAS XMI and the cassis module, e.g. how to extract gazetteer lists from annotated entities?
> The code snippet at https://github.com/dkpro/dkpro-cassis#selecting-annotations did not work for me on the UIMA CAS XMI I exported (please find the files attached); I got: Type with name [cassis.Sentence] not found!
> I suspect this is because my text/HTML was not analyzed/tokenized by me before import (or was it, after import, by default? please bear with me, I understand the INCEpTION workflow very roughly only).
> Can one nevertheless extract the entities?
The type names that you can use for looking up types can be found in the layer settings of INCEpTION. Select a layer and look for "internal name" in the "technical information" panel. You can also find the names in the TypeSystem.xml. Note, however, that the TypeSystem.xml contains a ton of types that are not really used by INCEpTION; we should probably clean up that file a bit. To list all the type names defined in that file:
```
from pathlib import Path

from cassis import load_typesystem

# Load the type system that was exported together with the XMI file.
typesystem = load_typesystem(Path("piroska/TypeSystem.xml"))

# Print the fully qualified name of every type it defines.
for t in typesystem.get_types():
    print(t.name)
```
Once you know the type, you can retrieve the annotations of that type like this:
```
from pathlib import Path

from cassis import load_cas_from_xmi, load_typesystem

typesystem = load_typesystem(Path("piroska/TypeSystem.xml"))
cas = load_cas_from_xmi(Path("piroska/FA-MBK-4-3_035245008_0019_abpproc_entries.xmi"), typesystem=typesystem)

# Internal name of the named entity layer, as shown in the INCEpTION layer settings.
named_entity_name = "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"

# Print the text covered by each named entity annotation.
for entity in cas.select(named_entity_name):
    print(entity.get_covered_text())
```
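For the gazetteer part of your question, a minimal sketch could look like the following. It groups the distinct covered texts by entity label. I am assuming here that the label is stored in a feature called `value` (which is what the DKPro Core NamedEntity type uses); `build_gazetteer` and the `UNLABELED` bucket are just names I made up for illustration.

```python
from collections import defaultdict


def build_gazetteer(annotations):
    """Group the distinct covered texts of annotations by their label.

    Assumes every annotation offers get_covered_text() and keeps its label
    in a feature named "value"; adjust the feature name if your layer differs.
    """
    gazetteer = defaultdict(set)
    for annotation in annotations:
        # Annotations without a label go into a hypothetical "UNLABELED" bucket.
        label = getattr(annotation, "value", None) or "UNLABELED"
        text = annotation.get_covered_text().strip()
        if text:
            gazetteer[label].add(text)
    # Sort the entries so that the result is stable across runs.
    return {label: sorted(texts) for label, texts in gazetteer.items()}


# Usage with the CAS loaded in the snippet above (not executed here):
#   gazetteer = build_gazetteer(cas.select(named_entity_name))
#   for label, entries in gazetteer.items():
#       print(label, entries)
```

Collecting into sets removes duplicates, so each entity string appears once per label.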
> I also attach the XML (format: FoLiA, https://proycon.github.io/folia/) from where the HTML was created. I wonder if it would be feasible to import FoLiA XML directly to INCEpTION. What do you think?
It would require implementing a reader for that format in Java. From a rough look, I see words and sentences, so that should be relatively easy. I'd have to look in more detail at other aspects. But there seems to be a Python library for FoLiA, so maybe you could whip something up that reads FoLiA and uses cassis to write out XMI.
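The fiddly part of such a converter is probably rebuilding the document text and the character offsets, because a CAS stores sentences and tokens as begin/end offsets into a single text. Here is a rough sketch of that step only. It deliberately leaves out the FoLiA reading and the cassis writing, and it assumes that words are joined by a single space and sentences by a newline, which may not match the spacing information in your FoLiA files. The function name `layout_sentences` is hypothetical.

```python
def layout_sentences(sentences):
    """Compute the document text plus (begin, end) offsets for sentences and tokens.

    `sentences` is a list of sentences, each given as a list of word strings.
    Assumption: words are separated by one space, sentences by one newline.
    """
    parts = []
    sentence_spans = []
    token_spans = []
    pos = 0
    for words in sentences:
        if not words:
            continue  # skip empty sentences
        if parts:
            # Separate this sentence from the previous one.
            parts.append("\n")
            pos += 1
        sentence_begin = pos
        for i, word in enumerate(words):
            if i > 0:
                parts.append(" ")
                pos += 1
            token_spans.append((pos, pos + len(word)))
            parts.append(word)
            pos += len(word)
        sentence_spans.append((sentence_begin, pos))
    return "".join(parts), sentence_spans, token_spans


text, sentence_spans, token_spans = layout_sentences([["Hello", "world", "."], ["Bye"]])
print(repr(text))
print(sentence_spans)
print(token_spans)
```

The resulting text would become the document text of the CAS, and each span would become one Sentence or Token annotation before the CAS is written out as XMI.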
Cheers,
-- Richard