Re: [inception-users] Corrupt Data when Exporting


Richard Eckart de Castilho

Oct 4, 2021, 2:05:28 PM
to 'SoimulPatriei' via inception-users
Hi Kemal,

> On 4. Oct 2021, at 15:07, Kemal Araz <ke...@getdrop.com> wrote:
>
> We have been using INCEpTION for more than 2 months. We have an error that keeps happening. When exporting a project in the WebAnno TSV 3.3 format, in some files several lines appear to be combined into one line, which makes them hard to parse (an example source file as txt and annotation file as tsv is shared: round5_20). In addition, for some files, instead of having the text in one cell followed by the words and their annotations, the text seems to be divided into several parts that are spread over multiple rows rather than one (two examples I shared: the first is round6_314 and the second is round6_0). Can you help me with that?

For better or worse, WebAnno TSV is a sentence-based format. It consists of blocks, each block representing a sentence.
It appears that you have imported your texts using the default "Plain text" format which causes
INCEpTION to apply a simple tokenizer and sentence detector. As you can see in your files, the
sentence detector decided to include everything up to the first full stop into the first "sentence".
I.e. the whole ingredient lists went to the first sentence. INCEpTION does support alternative
plain text formats such as "Plain text (one sentence per line)" and "Plain text (space-separated tokens, one sentence per line)"
which might be better suited for your particular data - at least if you have specific needs regarding sentence (and token) boundaries.
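
A minimal sketch of preparing a text for the "Plain text (one sentence per line)" import, assuming each ingredient line should stay a sentence of its own and the remaining lines can be split at sentence-final punctuation. The function name `to_one_sentence_per_line` and the regular expression are illustrative assumptions, not part of INCEpTION; the naive split will break on abbreviations such as "tsp.", so a proper sentence splitter may be needed for the steps section.

```python
import re

# Naive assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_one_sentence_per_line(text: str) -> str:
    """Return the text with exactly one sentence per line and no blank lines."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines
        # A line without sentence-final punctuation (e.g. an ingredient)
        # is kept as a single sentence.
        sentences.extend(part for part in _SENTENCE_END.split(line) if part)
    return "\n".join(sentences) + "\n"


recipe = "2 cups flour\n1 tsp salt\n\nMix well. Bake for 20 minutes.\n"
prepared = to_one_sentence_per_line(recipe)
print(prepared)
```

The resulting text could then be saved to a file and imported with the one-sentence-per-line format.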

When you say "it is hard to parse", what do you mean by that?
Are you trying to implement your own parser for the WebAnno TSV format?

If you are working with Python, I would suggest you try exporting in the UIMA CAS XMI
format and load/process the exported data using DKPro Cassis [1]. If you are working
with Java, the UIMA CAS XMI files can be loaded and processed using the Apache UIMA Java SDK.

Cheers,

-- Richard

[1] https://github.com/dkpro/dkpro-cassis


Richard Eckart de Castilho

Oct 5, 2021, 2:38:18 AM
to incepti...@googlegroups.com
On 4. Oct 2021, at 21:56, Kemal Araz <ke...@getdrop.com> wrote:
>
> Hello again Richard,
>
> 1- I am sending you the type system and an XMI file, but I cannot make cassis work, especially for selecting annotations. I am getting a type-not-found error.

```
from pathlib import Path

from cassis import load_cas_from_xmi, load_typesystem

typesystem = load_typesystem(Path("getdrop/TypeSystem.xml"))
cas = load_cas_from_xmi(Path("getdrop/recipe_annotation_round6_92.xmi"), typesystem=typesystem)

# Print the offsets and covered text of every annotation of the custom layer
for entity in cas.select("webanno.custom.DropRecipeUnderstandingEntityTags"):
    print(f"[{entity.begin}-{entity.end}]: {entity.get_covered_text()}")
```

> 2- I am going to build NER, relation extraction and entity disambiguation models. You have seen the structure of the recipes. There is an ingredients section with no sentence pattern, so it is hard to split. There is also another section with the steps, which are mostly regular sentences. What format should I use to import my txt files into the system, and what export format should I use? I am using the CoNLL 2002 format to train NER and SemEval 2010 Task 8 for relation classification, and I haven't decided on entity disambiguation yet. Can you recommend import and export formats?

Depends on the level of sophistication you want to reach. In the optimal approach, you would prepare your texts before importing them into INCEpTION such that

* lines in the ingredients section are marked as sentences
* in the main section, you use a sentence splitter of your choice to mark the sentences
* if your ML training is starting from a base model which was trained using a particular tokenizer, you may want to use the same tokenizer to mark tokens

You would then integrate all of the above into a Python script which reads your text and uses DKPro Cassis to write it out as XMI with the marked sentences and tokens.

In that way, you'll have exactly the sentences and tokens that you'll expect when you later retrieve the data again from INCEpTION - plus the annotations you made in INCEpTION of course.

> 3- Yes, I am writing a parser for WebAnno outputs, so my first priority is exporting as WebAnno TSV. What format should I use for import in order to get an error-free export from WebAnno? If you say XMI and cassis are better, I can use those in the long run; however, I couldn't get cassis to work well enough, especially for getting the annotations.

See script snippet above.

Cheers,

-- Richard

Kemal Araz

Oct 5, 2021, 8:23:02 AM
to incepti...@googlegroups.com
Hello Richard,

Thanks a lot for your detailed explanation and quick response. When you say mark sentences what do you mean by that? Can you give an example?

Regards




Richard Eckart de Castilho

Oct 5, 2021, 8:34:51 AM
to incepti...@googlegroups.com
Hi,

> On 5. Oct 2021, at 14:22, Kemal Araz <ke...@getdrop.com> wrote:
>
> Thanks a lot for your detailed explanation and quick response. When you say mark sentences what do you mean by that? Can you give an example?

A simple example using spaCy as a sentence splitter. You'll likely have to use your own logic or configure spaCy to deal with your ingredient lists.

---
from spacy.lang.en import English
from pathlib import Path

from cassis import load_cas_from_xmi, load_typesystem, Cas

typesystem = load_typesystem(Path("getdrop/TypeSystem.xml"))
cas = Cas(typesystem=typesystem)
cas.sofa_string = """This is a sentence. This is another sentence."""

nlp = English()
nlp.add_pipe('sentencizer')
doc = nlp(cas.sofa_string)

sentence_type_name = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
Sentence = typesystem.get_type(sentence_type_name)
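# Add one Sentence annotation per sentence detected by spaCy
for s in doc.sents:
    cas.add(Sentence(begin=s.start_char, end=s.end_char))

# Verify by printing the text covered by each Sentence annotation
for s in cas.select(sentence_type_name):
    print(f"{s.get_covered_text()}")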
---
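
For the ingredient lists, where every line should become its own sentence, the offsets could also be computed without a sentence splitter. A minimal sketch under stated assumptions: `line_spans`, `token_spans` and `add_spans` are hypothetical helper names, whitespace tokenization is only a stand-in for whatever tokenizer your base model expects, and the type system is assumed to contain the DKPro Core `Sentence` and `Token` types.

```python
import re
from typing import List, Tuple

SENTENCE_TYPE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
TOKEN_TYPE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"

Span = Tuple[int, int]


def line_spans(text: str) -> List[Span]:
    """Return (begin, end) character offsets of every non-blank line."""
    spans = []
    offset = 0
    for line in text.splitlines(keepends=True):
        content = line.strip()
        if content:
            # Skip leading whitespace so the span covers only the content
            begin = offset + (len(line) - len(line.lstrip()))
            spans.append((begin, begin + len(content)))
        offset += len(line)
    return spans


def token_spans(text: str, span: Span) -> List[Span]:
    """Return offsets of whitespace-separated tokens inside one span."""
    begin, end = span
    return [(begin + m.start(), begin + m.end())
            for m in re.finditer(r"\S+", text[begin:end])]


def add_spans(cas, typesystem, type_name: str, spans: List[Span]) -> None:
    """Add one annotation of the given type per span (same calls as above)."""
    AnnotationType = typesystem.get_type(type_name)
    for begin, end in spans:
        cas.add(AnnotationType(begin=begin, end=end))


ingredients = "2 cups flour\n1 tsp salt\n\n3 eggs, beaten\n"
sentences = line_spans(ingredients)
tokens = [t for s in sentences for t in token_spans(ingredients, s)]
print([ingredients[b:e] for b, e in sentences])
print([ingredients[b:e] for b, e in tokens])
```

With `cas` and `typesystem` created as in the snippet above and `cas.sofa_string = ingredients`, calling `add_spans(cas, typesystem, SENTENCE_TYPE, sentences)` and `add_spans(cas, typesystem, TOKEN_TYPE, tokens)` should add the annotations, and `cas.to_xmi(...)` can then write the XMI file for import.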

Cheers,

-- Richard