On 4. Oct 2021, at 21:56, Kemal Araz <ke...@getdrop.com> wrote:
>
> Hello again Richard,
>
> 1- I am sending you the type system and an XMI file, but I cannot make cassis work, especially for selecting annotations. I am getting a type-not-found error.
```
from pathlib import Path
from cassis import load_cas_from_xmi, load_typesystem
typesystem = load_typesystem(Path("getdrop/TypeSystem.xml"))
cas = load_cas_from_xmi(Path("getdrop/recipe_annotation_round6_92.xmi"), typesystem=typesystem)
# Select by the fully qualified type name as declared in the type system
for entity in cas.select("webanno.custom.DropRecipeUnderstandingEntityTags"):
    print(f"[{entity.begin}-{entity.end}]: {entity.get_covered_text()}")
```
> 2- I am going to build NER, relation extraction and entity disambiguation models. You have seen the structure of the recipes: the ingredients section has no sentence pattern, so it is hard to segment, while the steps section consists mostly of regular sentences. What format should I use to import my txt files into the system, and what export format should I use? I am using the CoNLL 2002 format to train NER and SemEval 2010 Task 8 for relation classification, and I haven't decided on entity disambiguation yet. Can you recommend import and export formats?
Depends on the level of sophistication you want to reach. In the optimal approach, you would prepare your texts before importing them into INCEpTION such that
* lines in the ingredients section are marked as sentences
* in the main section, you use a sentence splitter of your choice to mark the sentences
* if your ML training is starting from a base model which was trained using a particular tokenizer, you may want to use the same tokenizer to mark tokens
Then integrate all of the above into a Python script which reads your text and uses DKPro Cassis to write it out as XMI with the sentences and tokens marked.
That way, you'll have exactly the sentences and tokens you expect when you later retrieve the data from INCEpTION, plus the annotations you made in INCEpTION of course.
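A rough sketch of such a script could look like the one below. Treat it as a starting point under a few assumptions, not as a finished solution: `steps_start` (the character offset where the steps section begins) is a hypothetical parameter you would have to determine from your recipe layout, the regex sentence splitter and tokenizer are naive stand-ins for whatever splitter and tokenizer you actually choose, and the `write_xmi` part assumes a recent DKPro Cassis version that provides `load_dkpro_core_typesystem` and `Cas.add`.

```python
import re

# DKPro Core segmentation types (assumed to be the ones INCEpTION should pick up)
SENTENCE_TYPE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
TOKEN_TYPE = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


def line_spans(text, start, end):
    """One trimmed (begin, end) span per non-empty line in text[start:end]."""
    return [(start + m.start(), start + m.end())
            for m in re.finditer(r"\S(?:[^\n]*\S)?", text[start:end])]


def sentence_spans(text, start, end):
    """Naive splitter: a sentence ends at '.', '!' or '?'. Replace with a real one."""
    spans = []
    for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text[start:end]):
        begin = start + m.start()
        spans.append((begin, begin + len(m.group().rstrip())))
    return spans


def build_spans(text, steps_start):
    """Ingredient lines become sentences; the steps section is sentence-split."""
    sentences = (line_spans(text, 0, steps_start)
                 + sentence_spans(text, steps_start, len(text)))
    # Naive tokenizer: word characters or single punctuation marks.
    tokens = [(b + m.start(), b + m.end())
              for b, e in sentences
              for m in re.finditer(r"\w+|[^\w\s]", text[b:e])]
    return sentences, tokens


def write_xmi(text, sentences, tokens, path):
    """Write the text with sentence and token annotations to an XMI file."""
    # Imported here so the span helpers above work without cassis installed.
    from cassis import Cas, load_dkpro_core_typesystem

    typesystem = load_dkpro_core_typesystem()
    Sentence = typesystem.get_type(SENTENCE_TYPE)
    Token = typesystem.get_type(TOKEN_TYPE)

    cas = Cas(typesystem=typesystem)
    cas.sofa_string = text
    cas.sofa_mime = "text/plain"
    for begin, end in sentences:
        cas.add(Sentence(begin=begin, end=end))
    for begin, end in tokens:
        cas.add(Token(begin=begin, end=end))
    cas.to_xmi(path)
```

You would call `build_spans(text, steps_start)` on each recipe and pass the result to `write_xmi(text, sentences, tokens, "recipe.xmi")`, where the output file name is just a placeholder. Keeping the span computation separate from the cassis part makes it easy to swap in the tokenizer of your base model later.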
> 3- Yes, I am writing a parser for WebAnno output, so my first priority is exporting as WebAnno TSV. But which format should I use for import in order to get an error-free export from WebAnno? If you say XMI and cassis are better, I can use those in the long run; however, I couldn't get cassis to work well enough, especially for retrieving the annotations.
See script snippet above.
Cheers,
-- Richard