Importing documents for annotation

Anna Baczkowska

Jan 21, 2026, 12:45:25 AM
to inception-users
Hello,
I'm trying to import documents into INCEpTION for annotation. My data consists of social media comments in the following format:
```
1
username
datetime
text (possibly multiline)
1.1
username
datetime
text
```
Root-level comments have ids like 1, 2, 3, and replies have ids like 1.1 or 12.8.2 (nested reply).

I'd like to import this data so that the metadata (id, username, datetime) is pre-annotated. As far as I understand, the comments should each be a separate document?

I browsed the supported formats but didn't find one that would let me import a single file and have it act as multiple documents. Could you point me in the right direction? The formats are generally far too granular for me (they operate at the word/token level). Ideally, I'd like something like this:
```
<doc id="1" username="abc" datetime="01.01.01.1970 00:01">text \n text</doc>
<doc> ....</doc>
```
where a <doc> tag acts as a whole (possibly multiline) comment. Unless it's fine to have the comments all be in one big file? But even then, it'd be helpful to have the metadata as annotations instead of inline, mixed with the actual text.

Thank you in advance for your reply.

Richard Eckart de Castilho

Jan 21, 2026, 1:12:21 AM
to incepti...@googlegroups.com
Hello Anna,

> On 20. Jan 2026, at 22:50, Anna Baczkowska <anna.k.b...@gmail.com> wrote:

> I browsed the supported formats but I didn't find one that seemed to allow me to import a single file with it acting as multiple documents. Could you point me in the right direction?

INCEpTION requires each document to be a separate file. It does not support any file format
where multiple documents are encoded in the same file.

> I'd ideally like to have it be something like this:
> ```
> <doc id="1" username="abc" datetime="01.01.01.1970 00:01">text \n text</doc>
> <doc> ....</doc>
> ```
> where a <doc> tag acts as a whole (possibly multiline) comment. Unless it's fine to have the comments all be in one big file? But even then, it'd be helpful to have the metadata as annotations instead of inline, mixed with the actual text.


In principle, INCEpTION has flexible support for XML files.

(caution, deep dive follows... tl;dr at end)

However, this is an experimental feature that needs to be activated explicitly via the
configuration file. It is advanced functionality that requires the user to be familiar with
CSS and to provide a CSS stylesheet to INCEpTION. And finally, it will not extract your
metadata as editable annotations. You could, however, render your metadata as unobtrusive
information using the CSS stylesheet.

https://inception-project.github.io/releases/39.4/docs/user-guide.html#sect_formats_xml_custom

The only formats that currently support importing pre-annotated document-level metadata are:

* UIMA CAS JSON (recommended)
* UIMA CAS XMI

If you want to get an idea of what these files look like, I would recommend setting up
the document metadata layer(s) as you expect them to be, annotating a document on those layers
and then exporting that document in one of the formats above.

While it is in theory possible to create or edit such files manually in a text editor, it is
also rather easy to get them wrong. So I would recommend creating/editing them only
programmatically, e.g. in Python using the dkpro-cassis library (or using tools like INCEpTION).

Note that having your metadata as annotations would really only be necessary if you expect your
annotators to be able to edit that metadata. If you just wanted to keep it, it could simply stay
in the XML structure or you could just place it into the text itself as a heading or something
like that.
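
As a rough illustration of those two options, the sketch below serializes one comment either as a
small XML file with the metadata as attributes or as plain text with a heading line. The `Comment`
class, the function names, the sample values and the `doc` element layout are assumptions made for
this example (the element follows the `<doc>` idea from the question); nothing here is prescribed
by INCEpTION.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Comment:
    # Hypothetical container for one comment; the fields mirror the question.
    id: str
    username: str
    datetime: str
    text: str


def to_xml(comment: Comment) -> str:
    """Serialize one comment as a <doc> element with metadata as attributes."""
    doc = ET.Element(
        "doc",
        id=comment.id,
        username=comment.username,
        datetime=comment.datetime,
    )
    doc.text = comment.text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(doc, encoding="unicode")


def to_text_with_heading(comment: Comment) -> str:
    """Serialize one comment as plain text with the metadata in a heading line."""
    heading = f"[{comment.id}] {comment.username} ({comment.datetime})"
    return f"{heading}\n\n{comment.text}\n"


# Made-up sample data, only for demonstration.
example = Comment("1.1", "abc", "1970-01-01 00:01", "first line\nsecond <3 line")
print(to_xml(example))
print(to_text_with_heading(example))
```

Going through an XML library rather than string concatenation keeps characters such as `<` or `&`
in user comments from breaking the file.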

Note that annotated XML files imported via the XML mechanism can also only sensibly be
exported again in the UIMA CAS formats mentioned above, because during import the entire
XML structure is converted into internal (not editable) annotations from the system's
point of view. During export, the system would not know how to encode your custom annotations
into that original structure, so the CAS file will include both the XML structure and
the custom annotations. Again, using dkpro-cassis, a script could be written in Python that
transforms this CAS structure into some other structure (e.g. an XML representation).

tl;dr

There are a lot of (advanced) possibilities for maintaining all (meta)data in a
proper machine-readable way.

However, all that said, many people would probably just split up the documents into separate
files using a text editor and then import them as plain text files (even though they are XML)
and annotate those.
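
For more than a handful of comments, the splitting could also be scripted. The following is a
minimal sketch, assuming the flat layout from the question (id line, username line, datetime line,
then the text) and assuming that any line consisting only of dot-separated numbers starts a new
comment. The function names, the file naming scheme and the sample data are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumption: a line holding only dot-separated numbers (1, 1.1, 12.8.2) is an id.
# A text line that happens to look like that would be misread as a new comment.
ID_LINE = re.compile(r"^\d+(\.\d+)*$")


def parse_comments(raw: str) -> list[dict]:
    """Split the flat export into dicts with id, username, datetime and text."""
    lines = raw.splitlines()
    comments = []
    i = 0
    while i < len(lines):
        # Skip lines that are not ids, and ids too close to the end of the input
        # to be followed by a username and a datetime line.
        if not ID_LINE.match(lines[i].strip()) or i + 2 >= len(lines):
            i += 1
            continue
        comment = {
            "id": lines[i].strip(),
            "username": lines[i + 1].strip(),
            "datetime": lines[i + 2].strip(),
        }
        i += 3
        text_lines = []
        # Everything up to the next id line belongs to the comment text.
        while i < len(lines) and not ID_LINE.match(lines[i].strip()):
            text_lines.append(lines[i])
            i += 1
        comment["text"] = "\n".join(text_lines).strip()
        comments.append(comment)
    return comments


def write_comments(comments: list[dict], out_dir: Path) -> list[Path]:
    """Write one plain-text file per comment, with the metadata on the first line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for c in comments:
        path = out_dir / f"comment_{c['id']}.txt"
        heading = f"{c['id']} | {c['username']} | {c['datetime']}"
        path.write_text(f"{heading}\n\n{c['text']}\n", encoding="utf-8")
        paths.append(path)
    return paths


# Made-up sample data, only for demonstration.
sample = """1
alice
2026-01-20 10:00
First comment,
spanning two lines.
1.1
bob
2026-01-20 10:05
A reply.
"""

comments = parse_comments(sample)
with tempfile.TemporaryDirectory() as tmp:
    for path in write_comments(comments, Path(tmp)):
        print(path.name, "->", repr(path.read_text(encoding="utf-8")))
```

Each resulting file can then be imported as its own document. Keeping the comment id in the file
name makes it possible to map the annotations back to the original thread later.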

The question is maybe what you want to do with the annotations afterwards - and which level of
(programming) sophistication you would like to achieve.

Cheers,

-- Richard

