Best Practices for Importing Multiple BioC Annotation Files for Curation

leo cd

Jul 5, 2025, 9:00:12 AM
to inception-users

Dear Developers,

Apologies if this has been asked before.

We are working with a large collection of documents (hundreds), and for each document, we have three different annotation files from different annotators in BioC XML format.

What would be the best practice to import and curate these in INCEpTION? 

Thank you.

Kind regards,

Leo

Richard Eckart de Castilho

Jul 5, 2025, 9:09:31 AM
to incepti...@googlegroups.com
Hi Leo,

> On 4. Jul 2025, at 11:25, leo cd <leo...@gmail.com> wrote:
>
> Apologies if this has been asked before,
> We are working with a large collection of documents (hundreds), and for each document, we have three different annotation files from different annotators in BioC XML format.
> What would be the best practice to import and curate these in INCEpTION?

Via the browser, you can only upload "source documents", not "annotation documents". "Annotation documents" are those that
contain the different annotators' annotations, while the "source document" is the document that new annotators get a copy of
when they start annotating.

You can, however, upload "annotation documents" via the remote API.
INCEpTION will only let you import annotations if the text in the annotated file exactly matches the "source document" text.

So you can:

* enable the remote API (https://inception-project.github.io/releases/37.0/docs/admin-guide.html#sect_remote_api)
* add the "ROLE_REMOTE" role to your admin user in the user management
* then log out and back in as that user
* now you should see an extra icon in the menu bar that gives you access to the remote API swagger interface
* alternatively, look at
- https://openminted.github.io/releases/aero-spec/1.0.0/omtd-aero/
- https://pycaprio.readthedocs.io/en/latest/

Once you have the remote API enabled, you can try importing your texts as "source documents" and then manually
import a BioC XML document via the remote API swagger interface. If that works for you - great.

When you have hundreds of already annotated documents though, you'd want to write a script to upload them via
the remote API. The pycaprio Python library should be a great help here.
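
Such a script could look roughly like the sketch below. It is only an illustration under several assumptions: the directory layout (one folder of source files plus one folder per annotator, with matching file names) is hypothetical, the format identifier "bioc" should be checked against the formats your INCEpTION version lists, and the `create_document` / `create_annotation` calls follow the pycaprio documentation as I understand it, so please verify them against the version you install. Annotator names are assumed to match existing INCEpTION user names.

```python
from pathlib import Path


def build_plan(source_dir, annotation_dir, annotators):
    """Pair each source document with one annotation file per annotator.

    Hypothetical layout: <source_dir>/<name>.xml and
    <annotation_dir>/<annotator>/<name>.xml
    """
    plan = []
    for source in sorted(Path(source_dir).glob("*.xml")):
        files = {a: Path(annotation_dir) / a / source.name for a in annotators}
        missing = [a for a, p in files.items() if not p.is_file()]
        if missing:
            # Skip incomplete documents instead of uploading a partial set.
            print(f"skipping {source.name}: no annotation file for {missing}")
            continue
        plan.append((source, files))
    return plan


def upload(client, project_id, plan, fmt="bioc"):
    """Upload each source document, then every annotator's version of it.

    `client` is assumed to behave like a pycaprio.Pycaprio instance.
    `fmt` is an assumed format identifier; adjust it to your installation.
    """
    for source, files in plan:
        with open(source, "rb") as fh:
            doc = client.api.create_document(
                project_id, source.name, fh, document_format=fmt
            )
        for annotator, path in files.items():
            with open(path, "rb") as fh:
                client.api.create_annotation(
                    project_id, doc, annotator, fh, annotation_format=fmt
                )


# Possible usage (host, credentials, project ID and names are placeholders):
#   from pycaprio import Pycaprio
#   client = Pycaprio("http://localhost:8080", authentication=("remote-user", "password"))
#   plan = build_plan("sources", "annotations", ["annotator1", "annotator2", "annotator3"])
#   upload(client, 1, plan)
```

Building the plan first and uploading second makes it easy to review which documents would be skipped before anything is sent to the server.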

I hope that helps.

Cheers,

-- Richard



leo cd

Jul 21, 2025, 6:05:36 AM
to inception-users
Thanks, Richard!

I ended up using Pycaprio, and it works as intended in some cases. However, for some other files it doesn't,
and they all fail with the same error:
HTTP 500: {"messages":[{"level":"ERROR","message":"Internal server error: CasDoctor found 2 issues:\nCAS contains no CASMetadata. Cannot check concurrent access.\n[de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence] [The metata…ux (Fig.​]@[2722-2861] ends with whitespace\n"}]}
I couldn’t find any reference to this in either the Pycaprio or the INCEpTION documentation. Do you happen to know what might be causing it? Especially the "no CASMetadata" part?
The dataset seems fine: if I upload it manually through the web interface (as a document), it works without any issues. However, I need to upload only the annotations programmatically. Also, there is no \n whitespace in the text itself.

Richard Eckart de Castilho

Jul 21, 2025, 2:03:12 PM
to incepti...@googlegroups.com
Hi,

> On 21. Jul 2025, at 12:05, leo cd <leo...@gmail.com> wrote:
>
> HTTP 500: {"messages":[{"level":"ERROR","message":"Internal server error: CasDoctor found 2 issues:\nCAS contains no CASMetadata. Cannot check concurrent access.

The missing CASMetadata message should not be an error unless you are using INCEpTION 38.0-SNAPSHOT. In prior versions, this should just be a warning.
I'm not sure how this ends up inside an error message. But it is still good that you mention this, because I should check why this occurs when uploading
annotated files via the remote API.

> [de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence] [The metata…ux (Fig.​]@[2722-2861] ends with whitespace\n"}]}

This part of the message comes from the CAS Doctor, which checks the uploaded files for consistency with INCEpTION's expectations.
One of these expectations is that annotations (including sentences and tokens) must not start or end with whitespace characters.
So what this error is telling you is that the sentence at offsets [2722-2861] starting with "The metata…" and ending with "…ux (Fig.​]"
ends with a whitespace. Since we don't *see* a whitespace in "[The metata…ux (Fig.​]" it must be a zero-width whitespace. In fact, if
you copy this text into a text editor and step through it with the cursor keys, you will note that when you step from "." to "]" you have
to press the cursor-right key twice in order to move forward one step. That is because of the zero-width character there.
When you prepare your annotations externally, make sure they are trimmed properly, removing any leading or trailing whitespace, even invisible characters.
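
A minimal sketch of such trimming in Python, in case it helps. The set of invisible characters is an assumption on my side and may not match exactly what the CAS Doctor treats as whitespace. The function also assumes that `begin` and `end` are indices into a Python string; if your offsets count something else (e.g. bytes or UTF-16 code units), convert them first.

```python
# Zero-width / invisible characters that str.isspace() does not report.
# This set is an assumption; extend it if your data contains others.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def is_blank(ch: str) -> bool:
    """True for regular whitespace and for the invisible characters above."""
    return ch.isspace() or ch in INVISIBLE


def trim_span(text: str, begin: int, end: int) -> tuple[int, int]:
    """Shrink [begin, end) until it neither starts nor ends with a blank."""
    while begin < end and is_blank(text[begin]):
        begin += 1
    while end > begin and is_blank(text[end - 1]):
        end -= 1
    return begin, end


# Example: a span that ends with a zero-width space followed by a space.
text = "x The flux.\u200b y"
begin, end = trim_span(text, 1, 13)
print(repr(text[begin:end]))
```

A span that consists only of blanks collapses to an empty span (`begin == end`); you probably want to drop such annotations rather than upload them.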

Cheers,

-- Richard