INCEpTION 22.2: import TreeTagger format

Maria Lidén

Feb 8, 2022, 2:48:33 AM

to inception-users

Hi,

I am wondering whether it is possible to upload documents in Inception where the text has been segmented and parsed with Treetagger. If not, is there any suitable document type that I can convert these documents to?

Regards,

Maria

Richard Eckart de Castilho

Feb 8, 2022, 2:51:38 AM

to incepti...@googlegroups.com

On 8. Feb 2022, at 08:48, Maria Lidén <maria....@gmail.com> wrote:
>
> I am wondering whether it is possible to upload documents in Inception where the text has been segmented and parsed with Treetagger. If not, is there any suitable document type that I can convert these documents to?

What does your output look like? Can you provide an example, best with two short sentences?

-- Richard

Maria Lidén

Feb 8, 2022, 2:56:45 AM

to incepti...@googlegroups.com

Hi,

The document has three columns.

Column 1= token

Column 2= POStag

Column 3= lemma

I am currently working with a French file. It looks like this:

Elle PRO:PER elle
habitait VER:impf habiter
dans PRP dans
cette PRO:DEM ce
maison NOM maison
depuis PRP depuis
longtemps NOM longtemps
. SENT .

Se PRO:PER se
cacher VER:infi cacher
, PUN ,
voilà ADV voilà
un DET:ART un
jeu NOM jeu
merveilleux ADJ merveilleux
! SENT !

//Maria

--
You received this message because you are subscribed to a topic in the Google Groups "inception-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/inception-users/-5gMraHzPCQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to inception-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/inception-users/FB0D87CC-BBBB-42E3-847D-07F293EE2BEF%40gmail.com.

Richard Eckart de Castilho

Feb 8, 2022, 3:44:29 AM

to incepti...@googlegroups.com

Hi Maria,

the closest we have to that format is the "IMS CWB" format:

https://inception-project.github.io/releases/22.3/docs/user-guide.html#sect_formats_imscwb

<text id="http://www.epguides.de/nikita.htm">
<s>
Nikita NE Nikita
( $( (
La FM La
Femme NN Femme
Nikita NE Nikita
) $( )
........
. $. .
</s>
</text>

I believe you'd need to make sure though that none of your actual words start with a `<` and end with a `>`.

INCEpTION needs the `<s> </s>` tags to see where your sentences start and end. I don't know off the top of my head whether the `<text>` tags are also mandatory. In any case, INCEpTION requires that there is only one text per file.
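
To illustrate the conversion described above, here is a minimal Python sketch that wraps three-column TreeTagger output in `<text>` and `<s>` elements. It rests on a few assumptions that are not confirmed in this thread: the columns are whitespace-separated on input, the output columns are tab-separated, a sentence ends at a row tagged `SENT` (as in the French sample), and the function name and `text_id` value are made up for illustration.

```python
def treetagger_to_imscwb(lines, text_id="doc1", sentence_end_tag="SENT"):
    """Wrap TreeTagger rows (token, POS, lemma) in <text> and <s> elements.

    Assumptions: columns are separated by whitespace, and a row whose POS
    equals `sentence_end_tag` closes the current sentence.
    """
    out = [f'<text id="{text_id}">']
    sentence = []

    def flush():
        # Emit the collected rows as one <s> ... </s> block.
        if sentence:
            out.append("<s>")
            out.extend(sentence)
            out.append("</s>")
            sentence.clear()

    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, pos, lemma = parts
        # A token that looks like a tag would be confused with markup.
        if token.startswith("<") and token.endswith(">"):
            raise ValueError(f"token looks like a tag: {token!r}")
        sentence.append("\t".join((token, pos, lemma)))
        if pos == sentence_end_tag:
            flush()

    flush()  # trailing rows without a sentence-final tag
    out.append("</text>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = [
        "Elle PRO:PER elle",
        "habitait VER:impf habiter",
        ". SENT .",
        "",
        "Se PRO:PER se",
        "cacher VER:infi cacher",
        "! SENT !",
    ]
    print(treetagger_to_imscwb(sample))
```

Since INCEpTION expects one text per file, the function would be called once per input document and each result written to its own file.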

Btw. TreeTagger should also be able to process files like these if you supply the "-sgml" option:

<text id="http://www.epguides.de/nikita.htm">
<s>
Nikita
(
La
Femme
Nikita
)
........
.
</s>
</text>

Does that help? If you are comfortable with Python, we could also point you to a Python library that you could use to convert your data to the UIMA XMI CAS format, which gives you more control over your input to INCEpTION.

Best,

-- Richard

Maria Lidén

Feb 8, 2022, 4:08:08 AM

to incepti...@googlegroups.com

Hi Richard,

If I use the IMS CWB format, does it matter that the tagset used by Treetagger is not the same as the tagset used by the IMS Open Corpus Workbench?

Anyhow, I think that the easiest solution might be the UIMA XMI CAS format. The Python lib would thus be very appreciated.

Thank you for your help.

Regards,

Maria


Richard Eckart de Castilho

Feb 8, 2022, 4:29:23 AM

to incepti...@googlegroups.com

Hi,

> On 8. Feb 2022, at 10:07, Maria Lidén <maria....@gmail.com> wrote:
>
> If I use the IMS CWB format, does it matter that the tagset used by Treetagger is not the same as the tagset used by the IMS Open Corpus Workbench?

The tag set does not matter. IMS CWB is a generic corpus management tool, so, like INCEpTION, it is tag set agnostic.

> Anyhow, I think that the easiest solution might be the UIMA XMI CAS format. The Python lib would thus be very appreciated.

You can use DKPro Cassis [1] to generate XMI files. For the sentences, tokens, POS tags and lemmas that you need for your task, you can initialize your CAS objects (documents) with the DKPro Core type system [2], which INCEpTION is compatible with.

To create a word annotated with POS and lemma, you need to:

* create a Token annotation for the word
* create a POS annotation for the word and set the `PosValue` property on it
* create a Lemma annotation for the word and set the `value` property on it
* set the `pos` property of the Token to the POS annotation
* set the `lemma` property of the Token to the Lemma annotation
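
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the document text is rebuilt by joining tokens with single spaces and sentences with newlines (the original spacing is not in the TreeTagger output), the helper names `layout_sentences` and `build_xmi` are made up, and the calls used (`load_dkpro_core_typesystem`, `Cas`, `cas.add`, `cas.to_xmi`) are those of a recent DKPro Cassis release; older releases name the add method `add_annotation`.

```python
# DKPro Core type names for the built-in layers.
SENTENCE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"
LEMMA = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma"
POS = "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS"


def layout_sentences(sentences):
    """Rebuild a document text and compute character offsets.

    `sentences` is a list of sentences, each a list of (token, pos, lemma).
    Returns (text, spans); spans holds, per sentence, a list of
    (begin, end, pos, lemma) tuples. Spacing is an assumption: one space
    between tokens, one newline after each sentence.
    """
    pieces, spans, offset = [], [], 0
    for sentence in sentences:
        sent_spans = []
        for i, (token, pos, lemma) in enumerate(sentence):
            if i > 0:
                pieces.append(" ")
                offset += 1
            sent_spans.append((offset, offset + len(token), pos, lemma))
            pieces.append(token)
            offset += len(token)
        spans.append(sent_spans)
        pieces.append("\n")
        offset += 1
    return "".join(pieces), spans


def build_xmi(sentences):
    """Create a CAS with Sentence, Token, POS and Lemma; return XMI text."""
    # Imported here so layout_sentences stays usable without cassis
    # installed (pip install dkpro-cassis).
    from cassis import Cas, load_dkpro_core_typesystem

    ts = load_dkpro_core_typesystem()
    Sentence, Token = ts.get_type(SENTENCE), ts.get_type(TOKEN)
    Pos, Lemma = ts.get_type(POS), ts.get_type(LEMMA)

    text, spans = layout_sentences(sentences)
    cas = Cas(typesystem=ts)
    cas.sofa_string = text

    for sent in spans:
        if not sent:
            continue
        cas.add(Sentence(begin=sent[0][0], end=sent[-1][1]))
        for begin, end, pos_value, lemma_value in sent:
            # POS and Lemma are annotations of their own; the Token
            # points at them through its `pos` and `lemma` features.
            pos = Pos(begin=begin, end=end, PosValue=pos_value)
            lemma = Lemma(begin=begin, end=end, value=lemma_value)
            cas.add(pos)
            cas.add(lemma)
            cas.add(Token(begin=begin, end=end, pos=pos, lemma=lemma))
    return cas.to_xmi()
```

Calling `build_xmi([[("Elle", "PRO:PER", "elle"), ("habitait", "VER:impf", "habiter")]])` should return the XMI as a string, which can then be written to a `.xmi` file and imported into INCEpTION.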

Or alternatively, instead of using the built-in layers of INCEpTION (i.e. the DKPro Core types), you could work with custom layers. Then the "Use pre-tokenized and pre-annotated documents in INCEpTION" Python notebook might be helpful [3].

Best,

-- Richard

[1] https://github.com/dkpro/dkpro-cassis
[2] https://github.com/dkpro/dkpro-cassis#dkpro-core-integration
[3] https://inception-project.github.io/example-projects/python/
