Hi,
> On 8. Feb 2022, at 10:07, Maria Lidén <maria....@gmail.com> wrote:
>
> If I use the IMS CWB format, does it matter that the tagset used by Treetagger is not the same as the tagset used by the IMS Open Corpus Workbench?
The tag set does not matter. IMS CWB is a generic corpus management tool, so, like INCEpTION, it is tag set agnostic.
> Anyhow, I think that the easiest solution might be the UIMA XMI CAS format. The Python lib would thus be very appreciated.
You can use DKPro Cassis [1] to generate XMI files. For the sentences, tokens, POS and lemmas that you need for your task, you can initialize your CAS objects (documents) with the DKPro Core type system [2], which INCEpTION is compatible with.
To create a word annotated with POS and lemma, you need to:
* create a Token annotation for the word
* create a POS annotation for the word and set the `PosValue` property on it
* create a Lemma annotation for the word and set the `value` property on it
* set the `pos` property of the Token to the POS annotation
* set the `lemma` property of the Token to the Lemma annotation
Alternatively, instead of using the built-in layers of INCEpTION (i.e. the DKPro Core types), you could work with custom layers. In that case, the "Use pre-tokenized and pre-annotated documents in INCEpTION" Python notebook might be helpful [3].
Best,
-- Richard
[1] https://github.com/dkpro/dkpro-cassis
[2] https://github.com/dkpro/dkpro-cassis#dkpro-core-integration
[3] https://inception-project.github.io/example-projects/python/