DIAMs are forever ?


Grégoire Montcheuil

Mar 30, 2026, 7:29:40 AM
to inception-users
Dear INCEpTION (developers) team,

We are currently exploring the possibility of developing an "External Editor" with custom document rendering based on some document metadata and/or a special (hidden) annotation layer.

We have had a look at the External Editors section of the developer guide
  and at the INCEpTION Doccano Sequence Editor Plugin.
So we understand, more or less, that a good solution would be to implement a "document-rendering editor" that communicates with the INCEpTION server using the DIAM Ajax API to get the document information and interact with the annotations.

Unfortunately, while the TypeScript declarations in `inception-js-api/src/diam/DiamAjax.ts` tell us the names and (input) data of the API methods, we have not (yet) found concrete documentation for this API.
Some methods do have self-explanatory names, mainly the annotation manipulation ones (`createSpanAnnotation(...)`, `moveSpanAnnotation(...)`, etc.).
But others are more cryptic (e.g. `loadLazyDetails(...)`, `triggerExtensionAction(...)`).

We also understand that the `loadAnnotations(...)` method seems to be a good entry point for starting the editor life cycle (the Doccano sample calls it twice), and that an important parameter is the format.
Thanks to the Doccano sample we know that "Brat" is one (core) accepted format, but is there a list somewhere of the formats that are accepted (and can be expected to remain so)?

So we would appreciate it if you could point us to more detailed documentation on the DIAM API, and on how it will evolve and preserve compatibility.

Thank you in advance.

Best regards,

Grégoire Montcheuil

Richard Eckart de Castilho

Mar 31, 2026, 3:58:56 PM
to inception-users
Hello Grégoire,

> On 30. Mar 2026, at 13:29, Grégoire Montcheuil <gregoire....@gmail.com> wrote:
>
> Unfortunately, while the TypeScript declarations in `inception-js-api/src/diam/DiamAjax.ts` tell us the names and (input) data of the API methods, we have not (yet) found concrete documentation for this API.
> Some methods do have self-explanatory names, mainly the annotation manipulation ones (`createSpanAnnotation(...)`, `moveSpanAnnotation(...)`, etc.).
> But others are more cryptic (e.g. `loadLazyDetails(...)`, `triggerExtensionAction(...)`).

The lazy details are used by the annotation popover:

https://github.com/inception-project/inception/blob/main/inception/inception-js-api/src/main/ts/src/widget/AnnotationDetailPopOver.svelte

The extension action is a special action that is usually mapped to a double-click and which is provided by a potentially active AnnotationEditorExtension:

https://github.com/inception-project/inception/blob/main/inception/inception-api-editor/src/main/java/de/tudarmstadt/ukp/inception/editor/AnnotationEditorExtension.java

(e.g. curation editor extension, recommender editor extension, etc.)

> We also understand that the `loadAnnotations(...)` method seems to be a good entry point for starting the editor life cycle (the Doccano sample calls it twice), and that an important parameter is the format.
> Thanks to the Doccano sample we know that "Brat" is one (core) accepted format, but is there a list somewhere of the formats that are accepted (and can be expected to remain so)?

You should use the format "compact_v2". If you want a pretty complete and modern implementation of an editor, look at the Apache Annotator editor in the INCEpTION codebase:

https://github.com/inception-project/inception/tree/main/inception/inception-html-apache-annotator-editor/src/main/ts (uses DIAM AJAX API)
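
As a rough illustration of the call discussed here, the sketch below requests annotations in the "compact_v2" format. `DiamAjaxLike`, the `range` option and the stub object are assumptions made to keep the example self-contained; the authoritative signatures are the ones in `DiamAjax.ts`.

```typescript
// Hypothetical, trimmed-down view of the DIAM Ajax API. Only the method name
// `loadAnnotations` and the `format` parameter are taken from this thread;
// check DiamAjax.ts for the real interface.
interface LoadAnnotationsOptions {
  format?: string
  range?: [number, number] // assumed: character window to render
}

interface DiamAjaxLike {
  loadAnnotations(options?: LoadAnnotationsOptions): Promise<unknown>
}

// Ask the server for the annotations of the current document.
async function fetchAnnotations(
  ajax: DiamAjaxLike,
  range?: [number, number]
): Promise<unknown> {
  return ajax.loadAnnotations({ format: 'compact_v2', range })
}

// Stub standing in for the object INCEpTION hands to an editor plugin.
const requestedFormats: string[] = []
const stubAjax: DiamAjaxLike = {
  async loadAnnotations(options) {
    requestedFormats.push(options?.format ?? '(default)')
    return {} // a real server would return the serialized document here
  }
}
```

Keeping the format string in one helper like this makes it easy to follow a later change of the recommended wire format.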

It does not support relations though. You can look at the annotation sidebar or recogito editor for more inspiration, including relations (recogito may be a bit outdated):

https://github.com/inception-project/inception/tree/main/inception/inception-diam-editor/src/main/ts/src (uses DIAM WebSocket API)
https://github.com/inception-project/inception/tree/main/inception/inception-html-recogito-editor/src/main/ts

The frontend API implementation is in:

https://github.com/inception-project/inception/tree/main/inception/inception-diam/src/main/ts/src/diam

The actual backend implementations are distributed over various Java classes, all implementing the EditorAjaxRequestHandler interface, e.g.:

loadAnnotations -> de.tudarmstadt.ukp.inception.diam.editor.actions.LoadAnnotationsHandler
https://github.com/inception-project/inception/blob/main/inception/inception-diam/src/main/java/de/tudarmstadt/ukp/inception/diam/editor/actions/LoadAnnotationsHandler.java
loadLazyDetails -> de.tudarmstadt.ukp.inception.diam.editor.actions.LazyDetailsHandler
https://github.com/inception-project/inception/blob/main/inception/inception-diam/src/main/java/de/tudarmstadt/ukp/inception/diam/editor/actions/LazyDetailsHandler.java

(for historic reasons, the actual command strings in the API/handler may look a bit odd, e.g. "normData" for the loadLazyDetails).

> So we would appreciate it if you could point us to more detailed documentation on the DIAM API, and on how it will evolve and preserve compatibility.

The API evolves occasionally as needed. I try to stay backwards compatible because that saves work updating the existing editors and examples ;)
The current wire format is "compact_v2" and it also has a bunch of shared decoder helpers in the JS API.
The API tends to be more stable than the wire format itself, i.e. implementations should rely on the
API and the shared decoder helpers for the compact_v2 format rather than implement their own decoders.

https://github.com/inception-project/inception/tree/main/inception/inception-js-api/src/main/ts/src/model
https://github.com/inception-project/inception/tree/main/inception/inception-js-api/src/main/ts/src/model/compact_v2

Cheers,

-- Richard

Grégoire Montcheuil

Apr 8, 2026, 11:37:24 AM
to inception-users
Dear Richard,

Thank you very much for your very instructive response.

After a look at the back-end code (LoadAnnotationsHandler.java and its WebSocket counterpart DiamWebsocketController.java), I think we may not be on the right track...
If we have understood correctly, the back-end handlers work in 3 steps:
1. Get the current document (based on the request) and collect some information to build a RenderRequest
2. Call the renderingPipeline to produce the VDocument
3. Use the format-specific serializer to convert it into a JSON object

The information used to build the RenderRequest includes the document CAS and two lists of layers (allLayers and visibleLayers), but it seems that the resulting VDocument keeps only the annotation data (spans and arcs) of the visible layers (cropped to the window).
It's understandable as the main purpose of an (external) editor is to work with the visible layers.
But in our case, even if we use the better "compact_v2" format, getting the "document metadata and/or a special layer of annotations" for our custom rendering via `loadAnnotations(...)` or the WebSocket requires us to put them in a visible layer.
Sadly, this approach is quite annoying for the end user, as all the special annotations appear in the left side of the annotator interface ;-/
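
To make this reading of the three steps concrete, here is a minimal TypeScript sketch. Every name in it (`Ann`, `RenderRequest`, `render`, `serialize`, the layer names) is a hypothetical stand-in and not actual INCEpTION code, and cropping to the window is simplified to an overlap test.

```typescript
// Hypothetical stand-ins for the Java-side concepts (RenderRequest,
// rendering pipeline, VDocument, format-specific serializer).
interface Ann { layer: string, begin: number, end: number, label: string }

interface RenderRequest {
  annotations: Ann[]        // stands in for the document CAS
  allLayers: string[]
  visibleLayers: string[]
  window: [number, number]  // character range currently shown
}

// Step 2: keep only annotations on visible layers that overlap the window.
function render(req: RenderRequest): Ann[] {
  const [winBegin, winEnd] = req.window
  return req.annotations.filter(a =>
    req.visibleLayers.includes(a.layer) && a.begin < winEnd && a.end > winBegin)
}

// Step 3: a format-specific serializer turns the result into JSON.
function serialize(vdoc: Ann[]): string {
  return JSON.stringify(vdoc)
}

// Step 1: the request is assembled from the current document.
const request: RenderRequest = {
  annotations: [
    { layer: 'NamedEntity', begin: 0, end: 5, label: 'PER' },
    { layer: 'HiddenMeta', begin: 0, end: 5, label: 'speaker=A' }
  ],
  allLayers: ['NamedEntity', 'HiddenMeta'],
  visibleLayers: ['NamedEntity'],
  window: [0, 100]
}
const wire = serialize(render(request))
```

In this model the `HiddenMeta` annotation never reaches `wire`, which is the behaviour that forces the extra information into a visible layer.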

It would be better if this extra information stayed invisible to the end user, so we are looking for another way to access it.
As there is some similarity, we had a look at the PDF Editor and discovered that the back end (PdfDocumentIFrameView.java) adds two behaviors (~endpoints): one to get the PDF file and the other to get the VModel, which is based on hidden layers (containing annotations of type org.dkpro.core.api.pdf.type.PdfPage/PdfChunk).

Is there a similar way for an external editor to access the non-visible layers of the current document?
And if not, do you think it would be a reasonable feature for external editors?
(By reasonable we mean that it would neither open a security problem nor require a major, hard-to-maintain redesign.)

Thank you in advance.

Best regards,

Grégoire Montcheuil







Richard Eckart de Castilho

Apr 8, 2026, 4:08:51 PM
to incepti...@googlegroups.com
Hello Grégoire,

> On 8. Apr 2026, at 17:37, Grégoire Montcheuil <gregoire....@gmail.com> wrote:
>
> It's understandable as the main purpose of an (external) editor is to work with the visible layers.

Right, the DIAM layer is for dealing with (visible) annotations. It is a visual and interaction layer, not a data layer.

> But in our case, even if we use the better "compact_v2" format, getting the "document metadata and/or a special layer of annotations" for our custom rendering via `loadAnnotations(...)` or the WebSocket requires us to put them in a visible layer.
> Sadly, this approach is quite annoying for the end user, as all the special annotations appear in the left side of the annotator interface ;-/
>
> It would be better if this extra information stayed invisible to the end user, so we are looking for another way to access it.
> As there is some similarity, we had a look at the PDF Editor and discovered that the back end (PdfDocumentIFrameView.java) adds two behaviors (~endpoints): one to get the PDF file and the other to get the VModel, which is based on hidden layers (containing annotations of type org.dkpro.core.api.pdf.type.PdfPage/PdfChunk).

Right. As long as your extra information is static, that would work. And it would put you into the "Custom XML Format" territory.
So here the idea is that you represent your text to be annotated as an XML document. Into that document, you can put as much extra
data as you want. This can be data used for layout (which can then be styled using a CSS stylesheet associated with your custom format).

Examples: https://github.com/inception-project/inception-xml-formats-examples

You can even pick up on that data in your custom editor and do additional transformations on it. You just need to make sure that
any transformations you do do not change the character offsets within the part of the DOM in which the XML is rendered.
By combining a custom editor with a custom XML format, pretty awesome extensions are possible without having to touch a line of
Java code and without recompiling INCEpTION.
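
A minimal sketch of the offset constraint, assuming that annotation offsets are counted over the concatenated text nodes of the rendered XML. The `XNode` type, the `speaker` attribute and both transformations are made up for illustration; a real editor would operate on DOM nodes.

```typescript
// Minimal stand-in for a DOM subtree.
type XNode = string | { tag: string, attrs: Record<string, string>, children: XNode[] }

// Concatenate all text nodes in document order; annotation offsets are
// assumed to refer to positions in this string.
function textOf(node: XNode): string {
  return typeof node === 'string' ? node : node.children.map(textOf).join('')
}

// A transformation is offset-safe if it leaves the concatenated text untouched.
function isOffsetSafe(before: XNode, after: XNode): boolean {
  return textOf(before) === textOf(after)
}

// Safe: only adds an attribute derived from existing metadata.
function addSpeakerClass(node: XNode): XNode {
  if (typeof node === 'string') return node
  const attrs = node.attrs['speaker']
    ? { ...node.attrs, class: `speaker-${node.attrs['speaker']}` }
    : node.attrs
  return { tag: node.tag, attrs, children: node.children.map(addSpeakerClass) }
}

// Unsafe: injects visible text, shifting every offset after it.
function prependSpeakerLabel(node: XNode): XNode {
  if (typeof node === 'string') return node
  const label: XNode[] = node.attrs['speaker'] ? [`${node.attrs['speaker']}: `] : []
  return { ...node, children: [...label, ...node.children.map(prependSpeakerLabel)] }
}

const doc: XNode = {
  tag: 'u',
  attrs: { speaker: 'A' },
  children: ['Hello ', { tag: 'b', attrs: {}, children: ['world'] }]
}
```

Here `addSpeakerClass(doc)` passes the `isOffsetSafe` check while `prependSpeakerLabel(doc)` fails it, because the latter adds characters to the text.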

Cheers,

-- Richard

Grégoire Montcheuil

Apr 9, 2026, 5:34:59 AM
to inception-users
Dear Richard,

Thank you for pointing out the custom XML format feature; we had not seen its potential when we browsed the documentation.

We will take a deeper look at this option, but at first glance we see an important barrier with this approach, namely the import/export limitations:
- The custom XML format only allows import and does not support any layers, so we cannot import pre-annotations.
- We see that we can export to a UIMA CAS-based format (JSON or XMI) and even keep (part of) the initial XML structure (the org.dkpro.core.api.xml.type.Xml... annotations).
  But I tried re-importing a JSON CAS export and this initial XML structure seems to have vanished :-/


Best regards,

Grégoire Montcheuil

Richard Eckart de Castilho

Apr 9, 2026, 4:36:09 PM
to incepti...@googlegroups.com
Hi Grégoire,

> On 9. Apr 2026, at 11:34, Grégoire Montcheuil <gregoire....@gmail.com> wrote:
>
> We will take a deeper look at this option, but at first glance we see an important barrier with this approach, namely the import/export limitations:
> - The custom XML format only allows import and does not support any layers, so we cannot import pre-annotations.

Right, custom XML does not support pre-annotations other than the XML structure itself.
But if you only want to channel through hidden data that your annotators shouldn't be able to edit/see anyway,
then you could possibly channel that through the XML structure instead of using pre-annotation.

> - We see that we can export to a UIMA CAS-based format (JSON or XMI) and even keep (part of) the initial XML structure (the org.dkpro.core.api.xml.type.Xml... annotations).
> But I tried re-importing a JSON CAS export and this initial XML structure seems to have vanished :-/

I have just tested this:

- imported an HTML file
- opened the file in the annotation editor
- added some annotation to it
- exported it from the editor as "UIMA CAS JSON 0.4.0 (XML/PDF structure)"
- imported that file again via the documents panel in the project settings as "UIMA CAS JSON 0.4.0"
- opened the file in the annotation editor
- made sure the editor was explicitly switched to "HTML (Apache Annotator)" (because the file type is now JSON and no longer HTML, so the editor does not auto-activate)
- and I could see the HTML structure again as well as the annotation I had added to it

So while there is certainly some room for improvement, in general it worked.

Maybe you could try that approach?
What have you tried exactly?

Cheers,

-- Richard

Grégoire Montcheuil

Apr 10, 2026, 4:12:18 AM
to inception-users
Hello Richard,

Thank you for your test.

Yesterday my test was:
- Clone the TMX (and TTML) sample project to my `xml-formats` sub-folder
- Import the tmx-example.xml file in a project (with span and relation layers)
- Open it.
  It was with the "HTML (RecogitoJS)" editor (perhaps I switched to it, I do not remember).
  It was not the "HTML (Apache Annotator)" because I was able to add a relation between two annotations.
- Export it from the editor as "UIMA CAS JSON 0.4.0 (XML/PDF structure)"
- Import the JSON file again via the documents panel in the project settings as "UIMA CAS JSON 0.4.0"
- Open the file in the annotation editor
  => the automatically selected editor looks like "brat (sentence-oriented)", and if I switch to any of the "HTML" editors, I get a blank area in place of the document content
- Export the file again via the documents panel in the project settings as "UIMA CAS JSON 0.4.0"

Yesterday I thought the original XML structure had vanished, but after a fresh look at the two JSON files, it seems that in reality it was rewritten, as I have more org.dkpro.core.api.xml.type.Xml... annotations in the second export.

Best regards,

Grégoire

Richard Eckart de Castilho

Apr 13, 2026, 1:14:14 AM
to inception-users
Hi Grégoire,

> On 10. Apr 2026, at 10:12, Grégoire Montcheuil <gregoire....@gmail.com> wrote:
>
> Yesterday my test was:
> - Clone the TMX (and TTML) sample project to my `xml-formats` sub-folder
> - Import the tmx-example.xml file in a project (with span and relation layers)
> - Open it.
> It was with the "HTML (RecogitoJS)" editor (perhaps I switched to it, I do not remember).
> It was not the "HTML (Apache Annotator)" because I was able to add a relation between two annotations.
> - Export it from the editor as "UIMA CAS JSON 0.4.0 (XML/PDF structure)"
> - Import the JSON file again via the documents panel in the project settings as "UIMA CAS JSON 0.4.0"
> - Open the file in the annotation editor
> => the automatically selected editor looks like "brat (sentence-oriented)", and if I switch to any of the "HTML" editors, I get a blank area in place of the document content
> - Export the file again via the documents panel in the project settings as "UIMA CAS JSON 0.4.0"

Right. I think I know what's going on here. I did a test with an HTML file - the HTML policy is the default fallback policy, so
as long as a file contains (certain) HTML tags, those always make it through to the frontend.

In your case, you used a TMX file which contains no standard HTML tags. And because the UIMA CAS JSON file format is not associated
with a special XML policy and neither are the generic Recogito or Apache Annotator editors, all data from the TMX file
is filtered out before it reaches the frontend. I have to think about what might be a good way to fix this, e.g.

- allow changing the fallback XML policy per project
- allow changing the XML policy per document
- allow editor plugins to override the file format XML policy
- ...?

Preferences or alternative suggestions welcome.

> Yesterday I thought the original XML structure had vanished, but after a fresh look at the two JSON files, it seems that in reality it was rewritten, as I have more org.dkpro.core.api.xml.type.Xml... annotations in the second export.

Good find. The XML/PDF file-handling code makes sure the XML/PDF structure does not end up in the
annotator's data (to save space/speed up rendering). But the CAS import code did not have that filter.
So on export with the CAS+structure format, the structure was merged in a second time, causing the
duplication. This should be fixed by the following PR now:

https://github.com/inception-project/inception/pull/5966

Cheers,

-- Richard

Richard Eckart de Castilho

Apr 13, 2026, 1:43:40 AM
to inception-users

> On 13. Apr 2026, at 07:13, Richard Eckart de Castilho <richard...@gmail.com> wrote:
>
> - allow changing the fallback XML policy per project
> - allow changing the XML policy per document
> - allow editor plugins to override the file format XML policy
> - ...?
>
> Preferences or alternative suggestions welcome.

I believe the best solution would be to allow custom editor plugins to
override the XML policy. They should know best which tags they support.

So I'd make it such that a custom XML file format can continue to supply
a policy and stylesheet for use with generic HTML editors, but when an
editor plugin supplies a policy, that would take precedence over the
policy provided by the file format, since the editor will know best
which content it supports.
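
The precedence I have in mind could be sketched like this. It is only a model of the idea, not existing INCEpTION code: `XmlPolicy`, `resolvePolicy` and the element lists are made up for illustration.

```typescript
// Hypothetical model of the proposed precedence; not existing INCEpTION code.
interface XmlPolicy { name: string, allowedElements: string[] }

// Stand-in for the built-in HTML fallback policy (element list is illustrative).
const DEFAULT_HTML_POLICY: XmlPolicy = {
  name: 'default-html',
  allowedElements: ['p', 'div', 'span']
}

// Editor plugin policy wins over file format policy, which wins over the fallback.
function resolvePolicy(editorPolicy?: XmlPolicy, formatPolicy?: XmlPolicy): XmlPolicy {
  return editorPolicy ?? formatPolicy ?? DEFAULT_HTML_POLICY
}

const tmxFormatPolicy: XmlPolicy = { name: 'tmx-format', allowedElements: ['tu', 'tuv', 'seg'] }
const customEditorPolicy: XmlPolicy = { name: 'custom-editor', allowedElements: ['tu', 'seg'] }

// TMX file opened in a generic HTML editor: the format policy applies.
const genericOnTmx = resolvePolicy(undefined, tmxFormatPolicy)
// Re-imported CAS JSON in a generic HTML editor: only the fallback is left.
const genericOnCasJson = resolvePolicy(undefined, undefined)
// Re-imported CAS JSON in a custom editor: the editor policy still applies.
const customOnCasJson = resolvePolicy(customEditorPolicy, undefined)
```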

That wouldn't fix your test per se because when you export the TMX
as JSON CAS and re-import it again, INCEpTION doesn't know anymore
that the original format was TMX, so the connection to the policy and
stylesheet associated with the TMX format is lost. However, if you
are going down the road to a custom editor anyway, that wouldn't be
a problem for you.

Regarding the problem of what to do when importing a CAS representation
of an originally non-CAS file (like TMX) and losing the special
properties of the original format in that step - still not sure how
to best fix this (if at all).

-- Richard

Grégoire Montcheuil

Apr 16, 2026, 6:09:01 AM
to inception-users
Dear Richard,

Sorry for the delay.
To be honest, we have not yet found the time to look deeper into the custom XML possibilities.

But let me try to sum up what we have understood; please do not hesitate to correct me.
1. UIMA CAS is more or less the native format that INCEpTION manipulates
2. When we import a document into INCEpTION, the platform keeps a copy of the original document and builds a UIMA CAS representation of the document.
3. To build the CAS representation, in the case of a custom XML:
    a. The policy.yaml defines the elements/attributes to keep/drop, in the same way as the external editors' policies.
    b. After applying this filter, we obtain the CAS SofA by concatenating the remaining text nodes.
    c. The org.dkpro.core.api.xml.type.Xml... annotations encode the (filtered) XML structure.
4. In a similar way, for a PDF document:
    a. The extracted text forms the CAS SofA
    b. The org.dkpro.core.api.pdf.type.Pdf... annotations encode the PDF structure - with the position of every glyph.
5. For all kinds of documents, if they are not yet present:
    a. The backend also computes the segmentation (sentences and tokens) of the CAS SofA, i.e. the de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.* annotations.
    b. And in the case of custom XML, the blockElements (defined in the plugin.json) prevent sentences from spanning several blocks.
6. When we open a custom XML document in an HTML editor (or any external editor with the `"view": "iframe:cas+xhtml+xml"` setting):
    a. The backend first applies the custom XML policies to the original document and/or reuses the org.dkpro.core.api.xml.type.Xml... annotations
    b. To this filtered XML, it also applies the editor policies to obtain the X(HT)ML content of the iframe
        When you said "(certain) HTML tags", I suppose you were referring to the minimal policies described in the documentation:
        > There are several elements like script, meta, applet, link, iframe as well as a which are and JavaScript event attributes always filtered out.
    c. In the header of the iframe X(HT)ML, it also injects the custom format stylesheets (defined in the plugin.json)
        And, with the magic of CSS, the document shows up (more or less) as expected ;-)
7. The case of importing a CAS representation (XML or JSON) differs a little, as:
    a. Sometimes some of the annotations are already present and so they are not recomputed: segmentation, XML/PDF structure, etc. and of course user-defined layers (pre-)annotations.
    b. The original document is missing.
8. This absence of the original document has various repercussions:
   a. for the HTML/external editors, only step 6b can be applied to the saved XML structure, so no custom CSS (6c).
   b. the PDF Annotation Editor becomes useless.

Perhaps I am being a little obtuse, but these observations take me back to a previous question:
> Is there a similar way for an external editor to access the non-visible layers of the current document ?
> And if not, do you think it would be a reasonable feature for external editors?
Let me reword this in more explicit (and perhaps more correct) terms (as I discovered that the layer concept is not exactly native to the CAS format).
Could we imagine an (at first experimental) feature that gives external editors read-only access to the underlying CAS information?
More exactly:
- a method, say getCASTypesystem(...), to consult (part of) the underlying type system, possibly filtering on some prefix or on types that have at least one annotation present.
- a method, say getCASAnnotations(...), to consult the annotations of some types, with some range filtering like loadAnnotations(...), etc.

My intuition is that these two methods could resolve some problems:
1. The editor could have dynamic and reactive policies based on the types of annotations present in the CAS.
2. Directly importing a CAS representation would not reduce the capabilities of the editor.
  E.g. an editor that knows a human-friendly way to render a PDF document using only the org.dkpro.core.api.pdf.type.Pdf... annotations would work the same way with an imported PDF document and with the exported and re-imported CAS representation.

Another advantage, in my opinion, is that it is not necessary to find a way to fit all the information into one XML tree, which may not be easy if you need several layers of possibly overlapping annotations... especially since the internal representation already offers these layers of standoff annotations.

Best regards,

Grégoire

Richard Eckart de Castilho

Apr 16, 2026, 1:29:12 PM
to incepti...@googlegroups.com
Hi,

> On 16. Apr 2026, at 12:09, Grégoire Montcheuil <gregoire....@gmail.com> wrote:
>
> But let me try to sum up what we have understood; please do not hesitate to correct me.
> 1. UIMA CAS is more or less the native format that INCEpTION manipulates
> 2. When we import a document into INCEpTION, the platform keeps a copy of the original document and builds a UIMA CAS representation of the document.

Correct.

> 3. To build the CAS representation, in the case of a custom XML:
> a. The policy.yaml defines the elements/attributes to keep/drop, in the same way as the external editors' policies.
> b. After applying this filter, we obtain the CAS SofA by concatenating the remaining text nodes.
> c. The org.dkpro.core.api.xml.type.Xml... annotations encode the (filtered) XML structure.

Not quite. The full XML structure is encoded using the Xml... annotations.
The policy.yaml is only used during rendering to control which elements/attributes
are being sent to the browser. Its original use was to act as security policy to
ensure that e.g. no JavaScript gets injected from the custom XML documents into the
browser. And there is also a base-layer policy for that purpose that is always used
by INCEpTION which you cannot turn off. Additionally, custom formats need to declare
which elements of the custom XML they want to be sent to the browser. That is what
the policy.yaml in the plugin controls.
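
As a rough model of how the two levels combine at rendering time (an illustration only, with the format policy reduced to a plain allow-list, which is simpler than what policy.yaml can express; the always-filtered names are the ones listed in the documentation excerpt quoted further down):

```typescript
// Elements the base-layer policy always filters out. The real base policy
// also covers JavaScript event attributes, which this sketch ignores.
const ALWAYS_FILTERED = new Set(['script', 'meta', 'applet', 'link', 'iframe'])

// Hypothetical helper: decide whether an element is sent to the browser.
function isSentToBrowser(element: string, formatAllowList: Set<string>): boolean {
  if (ALWAYS_FILTERED.has(element)) return false // cannot be turned off
  return formatAllowList.has(element)            // must be declared by the format
}

// Illustrative allow-list of a TMX-like custom format.
const tmxAllowList = new Set(['tu', 'tuv', 'seg', 'script'])

const sendsSeg = isSentToBrowser('seg', tmxAllowList)
const sendsScript = isSentToBrowser('script', tmxAllowList)
const sendsNote = isSentToBrowser('note', tmxAllowList)
```

Here `sendsSeg` is true, while `sendsScript` stays false even though the format lists it, and `sendsNote` is false because the format never declared it.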

> 4. In a similar way, for a PDF document:
> a. The extracted text forms the CAS SofA
> b. The org.dkpro.core.api.pdf.type.Pdf... annotations encode the PDF structure - with the position of every glyph.

Correct.

> 5. For all kinds of documents, if they are not yet present:
> a. The backend also computes the segmentation (sentences and tokens) of the CAS SofA, i.e. the de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.* annotations.
> b. And in the case of custom XML, the blockElements (defined in the plugin.json) prevent sentences from spanning several blocks.

Correct.

> 6. When we open a custom XML document in an HTML editor (or any external editor with the `"view": "iframe:cas+xhtml+xml"` setting):
> a. The backend first applies the custom XML policies to the original document and/or reuses the org.dkpro.core.api.xml.type.Xml... annotations

Correct, the policy is applied during rendering.

> b. To this filtered XML, it also applies the editor policies to obtain the X(HT)ML content of the iframe
> When you said "(certain) HTML tags", I suppose you were referring to the minimal policies described in the documentation:
> > There are several elements like script, meta, applet, link, iframe as well as a which are and JavaScript event attributes always filtered out.

Correct, known harmful elements/attributes are always filtered out.

> c. In the header of the iframe X(HT)ML, it also injects the custom format stylesheets (defined in the plugin.json)
> And, with the magic of CSS, the document shows up (more or less) as expected ;-)

Correct.

> 7. The case of importing a CAS representation (XML or JSON) differs a little, as:
> a. Sometimes some of the annotations are already present and so they are not recomputed: segmentation, XML/PDF structure, etc. and of course user-defined layers (pre-)annotations.
> b. The original document is missing.

The CAS representation becomes the original document. For documents with an XML structure, that is not too much of an issue
since the rendering can be done based entirely on the XML structure embedded in the CAS file. But it requires manually
switching to an appropriate editor plugin because INCEpTION doesn't know anymore that an HTML-based editor is better suited
for this given CAS file than, say, the default brat editor.

However, it is an issue for PDF files because the original PDF is needed during rendering.
So currently, we lack a good way of importing pre-annotated PDF files. The issue with PDF files is more complex anyway
though because you need INCEpTION to derive the visual structure annotations and base text from the PDF file. Only when
you have that, you can even start annotating based on the offsets in the extracted text. So theoretically, you'd
have to import the PDF into INCEpTION to have it visually analyzed, then you'd export that as CAS, add your pre-annotations
and then re-import. And in particular that last step of reimporting the PDF + analysis is currently lacking.

> 8. This absence of the original document has various repercussions:
> a. for the HTML/external editors, only step 6b can be applied to the saved XML structure, so no custom CSS (6c).

If the custom CSS is associated with an editor plugin instead of the custom file format, then switching to that particular
editor plugin would still apply the CSS. But you'd need that editor plugin in addition to the custom XML format.
Thinking about this, maybe we could create a special import format which takes a CAS as input and recovers the XML contained
in it during import and then stores that as the original source... we'd also need to save the format ID in the CAS then...
maybe some edge cases... but it might be quite a viable strategy going forward.

> b. the PDF Annotation Editor becomes useless.

Yes. You could theoretically manually manipulate the database to correct the file type and replace the source file with
the PDF... but it's not for the faint of heart. We'd need a better workflow here.

> Perhaps I am being a little obtuse, but these observations take me back to a previous question:
> > Is there a similar way for an external editor to access the non-visible layers of the current document ?
> > And if not, do you think it would be a reasonable feature for external editors?
> Let me reword this in more explicit (and perhaps more correct) terms (as I discovered that the layer concept is not exactly native to the CAS format).
> Could we imagine an (at first experimental) feature that gives external editors read-only access to the underlying CAS information?
> More exactly:
> - a method, say getCASTypesystem(...), to consult (part of) the underlying type system, possibly filtering on some prefix or on types that have at least one annotation present.
> - a method, say getCASAnnotations(...), to consult the annotations of some types, with some range filtering like loadAnnotations(...), etc.

We already have an option to show/hide certain layers via the layer visibility sidebar.

You can disable layers in the project settings - disabled layers are not rendered.

Maybe you can explain in more detail if/why you need fine-grained access to the type system and annotation types in the frontend and during rendering?

> My intuition is that these two methods could resolve some problems:
> 1. The editor could have dynamic and reactive policies based on the types of annotations present in the CAS.

What would be the purpose of such reactivity?

> 2. Directly importing a CAS representation would not reduce the capabilities of the editor.
> E.g. an editor that knows a human-friendly way to render a PDF document using only the org.dkpro.core.api.pdf.type.Pdf... annotations would work the same way with an imported PDF document and with the exported and re-imported CAS representation.

The Pdf... annotations are necessary for the PDF editor to render annotations, but they are not sufficient.
They only provide location information for the text. They do not provide the actual image rendered from the PDF.
So in case of the PDF, we *always* need the original PDF file.

> Another advantage, in my opinion, is that it is not necessary to find a way to fit all the information into one XML tree, which may not be easy if you need several layers of possibly overlapping annotations... especially since the internal representation already offers these layers of standoff annotations.

Annotations should indeed not go into the XML tree. They should be modelled as annotations in the CAS.
But annotations are visible to the user - because INCEpTION is an annotation tool ;) And editors do not
get access to annotations unless they are visible already because INCEpTION uses server-side pre-rendering.
The information encoded in the XML structure is document structure or metadata (but not really annotations)
and editor plugins have easy access to both. But they are less well suited if you need them to represent
overlapping structures.

We actually do already send the layer with the CompactV2 format, e.g.:

[[122317,"1918",[[0,10]],{"l":"SOF","c":"#80b1d3"}]

Here the 122317 is the layer ID. However, we currently don't expose a "loadSchema()" call to the editors which
would allow them to resolve the layer ID to something more interpretable. I guess that's where your "getCASTypesystem()"
idea might come in. Although instead of explaining the full CAS type system, I'd tend to rather just provide information
that the editor can correlate. Since the editor does not get the full CAS annotations, I would probably mostly expose
only the layer name and maybe description through a "loadSchema()" call.
So theoretically, a custom editor could already now choose to not render certain annotations from a set of well-known
layers (as long as the layer ID is known). With a "loadSchema()" it could replace the layer ID with a set of well-known
layer names that would be sent to the editor but which the editor may choose to keep hidden.
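
A minimal sketch of that idea, for illustration only: an editor could skip spans from well-known layers before rendering. The tuple shape used here (layer ID, annotation ID, offset ranges, attributes) is an assumption inferred from the single example above, and `CompactSpan`, `visibleSpans` and the layer ID 999 are hypothetical names and values, not part of the DIAM API.

```typescript
// Assumed shape of one span, inferred from the single example in this thread:
// [layerId, annotationId, [[begin, end], ...], { l: label, c: color }]
type CompactSpan = [number, string, Array<[number, number]>, { l?: string; c?: string }];

// Hypothetical helper: keep only the spans whose layer ID is not hidden.
function visibleSpans(spans: CompactSpan[], hiddenLayerIds: number[]): CompactSpan[] {
  return spans.filter(([layerId]) => hiddenLayerIds.indexOf(layerId) < 0);
}

const received: CompactSpan[] = [
  [122317, "1918", [[0, 10]], { l: "SOF", c: "#80b1d3" }],
  [999, "1919", [[0, 4]], { l: "bold" }], // made-up layout layer
];

const rendered = visibleSpans(received, [999]);
console.log(rendered.map(([, id]) => id)); // only the span "1918" is left
```

Without a "loadSchema()" call, the hidden layer IDs would have to be known to the editor in advance.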

As for getCASAnnotations... I could imagine that we might provide more detailed information about features and feature
values than just a server-side pre-rendered "l" (label). But I don't think I would like to provide full access to a
server-side CAS annotation. Admins/managers may turn off file export for annotators, and I don't want a creative annotator to
get ideas and circumvent the disabled export by simply using a getCASAnnotations endpoint. Also the admin/manager
may have disabled certain layers they don't want the users to interact with, and I wouldn't like those to be exposed either.
Finally, a CAS annotation (feature structure) is a graph that potentially could have quite a big footprint (e.g. the entire
XML structure in case of a XmlDocument annotation). The flat rendered annotations do not have that problem.
So I think we could enrich the flat rendered annotations with more (optional) information, but access to the backend CAS seems a bit
too much I think.

-- Richard


Richard Eckart de Castilho

Apr 17, 2026, 2:17:51 AM
to incepti...@googlegroups.com
I have opened an issue for this:

https://github.com/inception-project/inception/issues/5970

-- Richard

Grégoire Montcheuil

Apr 17, 2026, 12:20:09 PM
to inception-users
Dear Richard,

Once again, thank you for your corrections and clarifications.

> Maybe you can explain in more detail if/why you need fine-grained access to the type system and annotation types in the frontend and during rendering?

To clarify my remarks, perhaps I need to resolve some vocabulary clashes around "annotation" and "layer".
I agree with you, INCEpTION is a platform to manage human annotation, and the parameters of these various "layers" of annotations are at the discretion of the project manager.

But as you confirmed, under the hood INCEpTION manipulates a CAS representation of each document that includes both the annotations of the various project layers (visible or not) and some internal information - we already mentioned the segmentation (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*), the PDF (org.dkpro.core.api.pdf.type.Pdf...) or XML structure (org.dkpro.core.api.xml.type.Xml...) - also encoded as annotations.
Let's call this latter kind of annotation "internal annotations" and, to avoid the word "layer", use "families" - i.e. a (coherent) list of types - for the internal annotations.

With the exception of the segmentation family, which could possibly be editable, these internal annotations are helpful (and a priori read-only) information for the document rendering and/or the human annotation task.
As evidence, the HTML/server-side view editors are based on the XML structure internal annotations and the PDF Annotation editor on the PDF structure internal annotations, so each of them uses its own custom access to this information.
In the case of the HTML/server-side view editors, they get the information after the various XML filters that build the server-side view have been applied.
And in the case of the PDF Annotation editor, it defines extra behaviours.

So, to be clear, the purpose of getCASTypesystem(...)/getCASAnnotations(...) was not to get different access to the project layer annotations, but new, perhaps more generic, access to the internal annotations.

> The Pdf... annotations are necessary for the PDF editor to render annotations, but they are not sufficient.
> They only provide location information for the text. They do not provide the actual image rendered from the PDF.
> So in case of the PDF, we *always* need the original PDF file.


The PDF annotation case is an excellent illustration of part of our objectives.
We work with critical data, health data, that must be pseudonymized before entering the human annotation process.
So having the original PDF is not exactly a good option in our pipeline.
Nonetheless, we believe that a PDF-like rendering of the documents is a great help for the humans "in the loop".
So we found a way to save enough layout information to be able to re-render the document after the pseudonymization.

Technically, the information we saved is very similar to the org.dkpro.core.api.pdf.type.PdfPage/PdfChunk annotations,
  except that we cannot save the individual position of each glyph (the number of glyphs could change during the pseudonymization),
  but in return we also saved extra information (like bold, italic, underline and possibly the font name and color) that is not present in PdfChunk.

We already have some code to re-render the document (within a <canvas> or as a styled HTML sub-tree) using just the text content (the CAS SofA) and our layout information.
So, as a first test, we put our layout information as project layer annotations in a CAS import, and we can re-render the document based on the DIAM loadAnnotations() results.
Even without the human annotation interaction implemented, this first result is quite interesting and encouraging...
But the first two disadvantages we encountered are:
- all the layout annotations appear to the user, and so pollute the left-side part of the annotator UI,
- the various pieces of layout information have to be encoded inside the annotation label - so all useful information has to appear there, and be re-parsed :-/.

So, if we could move our layout annotations to the internal annotations and access them from the external editor, we would resolve these two problems...

As of today, the only solution seems to be to find a way to encode our layout information inside a custom XML tree and to define the right policies so that it comes through the server-side view filters intact.
As I told you, we haven't yet found the time to explore this option...
But we have already discussed in this thread the (current) import/export limitations of this approach.
And, if I'm not mistaken, I foresee another complication: the server-side view is limited by the page size setting that "controls how many sentences are visible in the annotation area"... and the sentence ranges will not exactly match the (Pdf)Page ranges.
 
> What would be the purpose of such reactivity?

Let's follow our PDF-like annotation use case.
Say we define our ad-hoc layout information type system, foo.bar.LayoutPage/LayoutChunk, as it differs a little from org.dkpro.core.api.pdf.type.PdfPage/PdfChunk.
The external editor can first check if there are foo.bar.Layout... internal annotations and, if so, use them to re-render the document (with bold, italic, color, etc.).
But if there are no foo.bar.Layout... annotations, it can fall back to org.dkpro.core.api.pdf.type.Pdf... internal annotations and use them to re-render the document (with less style).
And so on with other families of internal annotations... to finally, if none of them are present, use a default rendering ;-/
So the same editor could transparently manage/switch between various families of internal annotations - without having to find the right way to encode any of them inside an XML tree.

We can also see other uses of this reactivity.
Suppose we have another feature that uses other types of internal annotations to display some useful patient information (or any other metadata).
We can easily define this feature as an independent module, as it only needs to know how its information is encoded in the internal annotations, and not in the resulting XML server view that also combines some other families of internal annotations.
Based on the presence of these internal annotations, we can dynamically (de)activate the feature.
And so on with every new feature.

> I don't want a creative annotator to get ideas and circumvent the disabled export by simply using a getCASAnnotations endpoint

We agree with that.
I suppose we could imagine some policies that ensure the getCASAnnotations() endpoint limits its results to some families of internal annotations - possibly with the option to refine these policies at the project level.
E.g. for our PDF-like annotation use case:
- the external editor would declare in its policies two families of internal annotations, foo.bar.LayoutPage/LayoutChunk and org.dkpro.core.api.pdf.type.PdfPage/PdfChunk,
  and would never know about or have access to other types of annotations through the getCASAnnotations() endpoint.
- the project manager could allow the annotators to switch to this external editor, but possibly disable the foo.bar.LayoutPage/LayoutChunk family because the project only uses imported PDFs (or for any other reason).

> Finally, a CAS annotation (feature structure) is a graph that potentially could have quite a big footprint (e.g. the entire
> XML structure in case of a XmlDocument annotation). The flat rendered annotations do not have that problem.

We also understand this point; that is why we proposed that getCASAnnotations(...) have "some range filtering like loadAnnotations(...)".

And this also argues a little in favour of my reluctance towards the XML-tree solution.
Not only is XML not the format with the smallest footprint, but the editor also has to define policies that put all the information it (may) need inside the server-side view, in order to receive it in the initial request that gives it the view (of the current sentence range).
With separate access to the internal annotations, we can reduce the footprint of the initial request and the footprint of the getCASAnnotations(...) calls.
And if some internal annotations are not always necessary (based on some user setting), we can also reduce the number/size of the getCASAnnotations(...) calls.

I hope this clarifies my proposal further.

Best regards,

Grégoire Montcheuil

Richard Eckart de Castilho

Apr 18, 2026, 5:25:51 AM
to inception-users
Hello Grégoire,

> On 17. Apr 2026, at 18:20, Grégoire Montcheuil <gregoire....@gmail.com> wrote:
>
> > Maybe you can explain in more detail if/why you need fine-grained access to the type system and annotation types in the frontend and during rendering?
>
> To enlighten my remarks, perhaps I need to resolve some vocabulary clashes around "annotation" and "layer".
> I agree with you, INCEpTION is a platform to manage human annotation, and the parameters of this various "layers" of annotations are at the discretion of the project manager.

Layers are a concept of INCEpTION which covers all kinds of behavioural and other aspects that are not present in the underlying UIMA type system.
A layer typically maps to exactly one UIMA type, but can also map to multiple types (e.g. in the case of chain layers).

Annotations are a bit ambiguous.
On the one hand, annotations are spans, relations or chain elements at the level of INCEpTION.
On the other hand, Annotation is a special type of UIMA feature structure which has a begin/end.
Typically, an INCEpTION annotation maps to a particular UIMA annotation.

Features are likewise ambiguous.
In INCEpTION, a feature is a property of a span/relation/chain element that is user visible.
In UIMA, an annotation or feature structure may have additional features that are not visible to the user.
Also, e.g. in the case of INCEpTION link features, an INCEpTION feature may map to additional UIMA features and feature structures.

Think of INCEpTION being the high-level language and UIMA being the assembler language being used below it.

> But as you confirm me, under the hood INCEpTION manipulate a CAS representation of each document, that include both the various project layers annotations (visible or not) and some internal information - we yet mentioned the segmentation (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*), the PDF (org.dkpro.core.api.pdf.type.Pdf...) or XML structure (org.dkpro.core.api.xml.type.Xml...) - also encoded as annotations.
> Lets call this latest kind of annotation "internal annotations", and to avoid the "layer" word, we will use "families" - i.e. a (coherent) list of types - for the internal annotations.

Yes, we have those internal annotations. However, the fact that those are represented as annotations is
almost entirely an internal matter to the INCEpTION backend. They are not managed like other INCEpTION-level
annotations. They are used only during rendering. The reason why they are in the CAS is that this makes it
easier to correlate true annotation positions or text positions with layout information. In the case of PDF,
the layout was originally kept separately from the CAS. It was only merged into the CAS during a major refactoring
so we have it cached. Before, we generated the layout information on-the-fly when a PDF was loaded.

For a long time, the layout information was saved in each annotator's CAS copy. But in particular for
large documents, that impacted the reaction, save and render times quite a bit. So now layout is only
saved in the "INITIAL_CAS" once and loaded from there during rendering.

At that point, I found out that some users were relying on the layout information in the CAS, e.g. when
running external recommenders in order to know in which parts of the document to make or not to make suggestions.
Thus, the option to include the document structure in exports and when talking to the recommender was
reintroduced.

And that is where we are now.

> With the exception of the segmentation family that eventually could be editable, this internal annotations are helpful (and a priori read-only) information for the document rendering and/or human annotation task.
> Proof of this, the HTML/server-side view editors are based on the XML structure internal annotations, and the PDF Annotation editor on the PDF structure internal annotations, and so they both use their custom access to this information.
> In the case of the HTML/server-side view editors, it get the information after the various XML filters that build the server-side view.
> And it the case of the PDF Annotation editor, it defined extra behaviours.

I'm not sure what you mean by "extra behaviors".

The XHTML-XML/PDF server-side views are special server-side components whose purpose it is to
render the document and send it to the browser. They are part of the XML/PDF backend subsystems
that own the Xml... and Pdf... UIMA types. So they are privileged in that sense. They also have
certain responsibilities, e.g. to make the data safe for consumption by the browser and to make
it compact if possible (cf. XML policy).

> So, to be clear, the purpose of getCASTypesystem(...)/getCASAnnotations(...), was not to get a different access to the project layers annotations, but a new, perhaps more generic, access to the internal annotations.

Sure, but those annotations are intentionally not accessible via the editor DIAM API.
The document rendering happens server-side via the backend ApacheAnnotatorHtmlAnnotationEditor or
PdfAnnotationEditor. Those provision the document data to the frontend via the views
and the editor running in the frontend can then rely on the document being there and
being overlayable by them.

The brat or doccano editors work a bit differently. They have no view component.
Instead, they are free to define their own document structure and render it
semantically as they like. For this purpose, they have access to sentence and
token information.

Maybe you are looking for a concept of "style" annotations that are neither
part of the document layout nor of the editable annotations, but which would
be provided to editors such as brat or doccano as well so they could flavour
up their semantic rendering with e.g. bold or italic font styles.

On the other hand, my feeling is that such style is inherently part of the
document layout, so I would tend towards using a (XHTML-XML) view-based
editor in this case and post-process the backend-rendered document in the
browser-side editor code before it is shown to the user if necessary,
e.g. by introducing custom styling around sentences or such. Or alternatively
by re-rendering the document around sentences and then superimposing some
of the style information that was provided by the view on it. Those are just
examples. On some non-public editors I have been working on, I have done some
quite extensive post-processing of the view-provided XML. The only important
thing here is that any post-processing must ensure that character offsets remain
stable.
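
The offset-stability requirement can be checked mechanically. The following is a minimal sketch under stated assumptions: `plainText` and `keepsOffsets` are hypothetical helpers, the markup samples are made up, and the tag-stripping regex is a shortcut that ignores entities, comments and CDATA sections.

```typescript
// Text as the annotation offsets see it: all markup removed. Simplified for
// this sketch: entities, comments and CDATA sections are not handled.
function plainText(markup: string): string {
  return markup.replace(/<[^>]*>/g, "");
}

// Hypothetical guard: post-processing is offset-safe if it leaves the text untouched.
function keepsOffsets(before: string, after: string): boolean {
  return plainText(before) === plainText(after);
}

const fromView = "<p>Dose was <b>5 mg</b> daily.</p>";
const wrapped = '<p><span class="sentence">Dose was <b>5 mg</b> daily.</span></p>';
const labelled = "<p>[S1] Dose was <b>5 mg</b> daily.</p>";

console.log(keepsOffsets(fromView, wrapped));  // true: only markup was added
console.log(keepsOffsets(fromView, labelled)); // false: the inserted text shifts every offset
```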

> > The Pdf... annotations are necessary for the PDF editor to render annotations, but they are not sufficient.
> > They only provide location information for the text. They do not provide the actual image rendered from the PDF.
> > So in case of the PDF, we *always* need the original PDF file.
>
> The PDF annotation case is excellent to illustrate part of our objectives.
> We worked on critical data, health data, that should be pseudonymized before to enter in the human annotation process.
> So have the original PDF is not exactly an good option on our pipeline.
> Nonetheless, we believe that a PDF-like rendering of the documents is a great help for the humans "in the loop".
> So we found a way to save enough layout information to be able to re-render the document after the pseudonymization.
>
> Technically, the information we saved is very similar to the org.dkpro.core.api.pdf.type.PdfPage/PdfChunk annotation,
> except we cannot save individual position of each glyphs (the number of glyphs could change during the pseudonymization),

In terms of INCEpTION, it is essential that the glyphs and the base document text in the CAS match.
Otherwise annotations could not be rendered properly. So my assumption would be that pseudonymization
would happen prior to the import to INCEpTION, so that INCEpTION only sees the final pseudonymized text.
And if you were to include any layout information, that layout information would also relate to the pseudonymized text.

> but in return we also saved extra information (like bold, italic, underline and eventually the font name and color) that are not present in PdfChunk.

Yes, those are not included in PdfChunk - because visual information is encoded in the PDF itself.

> We have yet some code to re-render the document (within a <canvas> or as styled html sub-tree) using just the text content (the CAS SofA) and our layout information.
> So, as a first test, we put our layout information as project layer annotations in a CAS import, and we can re-render the document based on the DIAM loadAnnotation() results.
> Even without the human annotations interaction coded, this first result is quite interesting and encouraging...

I wonder... if you have a way of pseudonymizing your PDFs while retaining layout information etc.
have you considered rendering a new pseudonymized PDF from your original PDF and then importing that?
Or have you considered rendering your pseudonymized PDF layout into (X)HTML and then importing that
into INCEpTION? In both PDF-rendering and HTML-rendering, you would be able to represent your
style and layout information.

> But the two first disadvantages we encounter are:
> - all the layout annotations appear to the user, and so pollute the left-side part of the annotator UI,
> - the various layout information have to be encoded inside the annotation label - so all useful information should appear, and be re-parsed :-/.

If you created a PDF version of your data or an (X)HTML version of your document, the layout information could be
transported to the frontend naturally. Going the (X)HTML route, you could even use e.g. CSS classes or "data" attributes
to encode special (layout) information which you would like your custom editor plugin to react to. Otherwise, if it is
just about the visual aspect, just use a CSS stylesheet or pre-render the visuals into a pseudonymized PDF.

I still do not see why the layout information should be reaching the editor in the form of layers/annotations at all.

> So, if we can move our layout annotations to the internal annotations and access them from the external editor, we will resolve this two problems...
>
> At the day, the only solution seems to find a way to encode our layout information inside an custom XML tree and define the right policies to find them intact after the server-side view filters.

That I think is the best solution. As I said, theoretically, you could also just pre-render your data into pseudonymized PDFs and then import them.
But honestly, the (X)HTML-XML route is very likely to cause much fewer headaches.

Btw. nobody is forcing you to use a document-oriented XML where the tags are inline with the text.
Theoretically, you could happily define a standoff format like:

<doc>
<text>...</text>
<layout>
...
</layout>
</doc>

And then in your editor, you could render the text into a canvas according to the layout information.
Make sure you don't have any text nodes in the layout section (i.e. you only use elements and attributes),
then none of the data from the layout section ends up in the CAS base text. If you remove any whitespace
and line breaks outside the text element, the CAS base text would consist of just the string inside the text
element. When you interact with DIAM, make sure that your editor is able to recover the begin/end offsets
from within the text element (or the doc element if you do not remove whitespace).
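
The offset bookkeeping described above can be sketched as follows. This assumes, following the description above, that the base text is the concatenation of all text nodes in document order. The `chunk` element and its attributes are made up for the example, `textContentOf` and `textElementOffset` are hypothetical helpers, and the tag-stripping regex is a shortcut that ignores entities, comments and CDATA sections.

```typescript
// Concatenate all text nodes by stripping the tags (sketch-level shortcut).
function textContentOf(xml: string): string {
  return xml.replace(/<[^>]*>/g, "");
}

// Offset at which the content of the first <text> element starts within the
// concatenated text of the whole document, or -1 if there is no such element.
function textElementOffset(xml: string): number {
  const open = xml.indexOf("<text>");
  if (open < 0) return -1;
  return textContentOf(xml.slice(0, open)).length;
}

// Same document, once without and once with whitespace between the elements.
const compactDoc =
  '<doc><text>Hello world</text><layout><chunk begin="0" end="5" bold="true"/></layout></doc>';
const prettyDoc =
  '<doc>\n  <text>Hello world</text>\n  <layout>\n    <chunk begin="0" end="5" bold="true"/>\n  </layout>\n</doc>';

console.log(textContentOf(compactDoc));     // Hello world
console.log(textElementOffset(compactDoc)); // 0
console.log(textElementOffset(prettyDoc));  // 3 (the newline and two spaces before <text>)
```

With whitespace left in place, offsets computed inside the text element have to be shifted by that amount before they are sent to DIAM.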

> As I said you, we haven't yet found the time to explore this option...
> But we have already discussed in this thread the (current) import/export limitations of this approach.
> And, if I'm not mistaken, I foresee another complication: the server-side view is limited by the page size setting that "controls how many sentences are visible in the annotation area"... and the sentences ranges will not exactly match the (Pdf)Page ranges.

The setting for how many sentences are visible in the annotation area applies only to semantic editors
like the brat editor - not to layout oriented editors like the PDF or XML editors.

PDF/XML editors keep track of what part of the document the user is currently viewing in the browser.
Based on that, they then request all annotations from the backend that overlap with the visible part
of the document, and those are then rendered. This is necessary in order to keep the rendering performance
manageable and to keep the editors reactive. If they were always to load all the data from the backend,
they would be terribly slow.
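
Viewport-driven loading of this kind could look roughly like the sketch below. It is not the code of the PDF or XML editors: `createViewportLoader`, the `Loader` callback and the `margin` parameter are hypothetical, and a real editor would call the DIAM API inside the callback. The margin is an added design choice so that small scroll movements do not trigger a new request.

```typescript
interface OffsetWindow { begin: number; end: number }

// Hypothetical loader callback; a real editor would request annotations here.
type Loader = (w: OffsetWindow) => void;

// Returns a handler to call whenever the visible part of the document changes.
function createViewportLoader(load: Loader, margin: number): (visible: OffsetWindow) => void {
  let loaded: OffsetWindow | null = null;
  return function (visible: OffsetWindow): void {
    if (loaded !== null && visible.begin >= loaded.begin && visible.end <= loaded.end) {
      return; // still inside the window that was already requested
    }
    loaded = { begin: Math.max(0, visible.begin - margin), end: visible.end + margin };
    load(loaded);
  };
}

const requests: OffsetWindow[] = [];
const handleViewport = createViewportLoader((w) => { requests.push(w); }, 100);

handleViewport({ begin: 0, end: 500 });   // first view: requests 0..600
handleViewport({ begin: 50, end: 550 });  // small scroll: covered, no request
handleViewport({ begin: 400, end: 900 }); // larger scroll: requests 300..1000
console.log(requests.length); // 2
```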

> > What would be the purpose of such reactivity?
>
> Let follow our PDF-like annotation use case.
> Said we define our ad-hoc layout information type system, foo.bar.LayoutPage/LayoutChunk, as they differs a little to org.dkpro.core.api.pdf.type.PdfPage/PdfChunk.
> The external editor can first check if there are foo.bar.Layout... internal annotations and so use them to re-render the document (with bold, italic, color, etc.).
> But if there is no foo.bar.Layout... annotations, it can fall back to org.dkpro.core.api.pdf.type.Pdf... internal annotations and use them to re-render the document (with less style).
> And so on with other families of internal annotations... to finally, if any of them are present, use a default rendering ;-/
> So the same editor could transparently manage/switch between various families of internal annotations - without to find the right way to encode any of them inside an XML tree.

That sounds fragile to me.
Why not define a proper set of datatypes that the editor can rely on being there?
Why not encode those data types into the XML structure of the document as outlined above?

> We also can see other usage of this reactivity.
> Suppose we have another feature that use other types of internal annotations to display some useful patient information (or any other meta-data).
> We can easily define this feature as an independent module as it should only know how its information is encoded in the internal annotations, and not in the resulting XML server view that also combine some other families of internal annotation.
> Based on the presence of this internal annotation, we can dynamically (des)activate the feature.
> And so one with every new feature.

You could include additional metadata into the XML as well:

<doc>
<patient>
<identity givenName="Jack" familyName="Black"/>
</patient>
<text>...</text>
<layout>
...
</layout>
</doc>

And your browser-side editor could check if this information is there and if so render it **separately** from the data in the text element.
You could maybe show it as an infobox above the actual document. Your editor could ensure that data in this info-box is not selectable and
thus not annotatable.

If you wanted the metadata to be editable though, the right thing to do would be to create a document-level annotation layer for it.
Then the annotator could view and edit the data via the document-level annotation sidebar.

In that case, however, you would no longer import your XML file above directly. Rather you would create a UIMA CAS file (XML or JSON)
which would encode the layout part (doc element above) as Xml... annotations and your editable metadata as a respective document-layer
type. Then you would import that UIMA CAS file into INCEpTION. In your project settings, you would hard-set the annotation editor to
your custom annotation editor which knows how to deal with the XML data. In that way, when your annotators open a document for annotation,
they would immediately be looking at your editor.

> > I don't want a creative annotator to get ideas and circumvent the disabled export by simply using a getCASAnnotations endpoint
>
> We agree with that.
> I suppose we can have imagine some policies that ensure the getCASAnnotations() endpoint limits its results to some families of internal annotations - and eventually with the possibility to refine this policies at the project level.
> B.e. for our pdf-like annotation use case:
> - the external editor will declare in its policies two families of internal annotations, foo.bar.LayoutPage/LayoutChunk and org.dkpro.core.api.pdf.type.PdfPage/PdfChunk,
> and wouldn't never know about or have access to other types of annotations with the getCASAnnotations() endpoint.
> - the project manager could allow the annotators to switch to this external editor, but eventually disable the foo.bar.LayoutPage/LayoutChunk family as the project only use imported PDF (or for any other reason).

I am not sure this is necessary given what I have outlined above.

> > Finally, a CAS annotation (feature structure) is a graph that potentially could have quite a big footprint (e.g. the entire
> > XML structure in case of a XmlDocument annotation). The flat rendered annotations do not have that problem.
>
> We also understand this point, that why we propose that getCASAnnotations(...) have "some range filtering like loadAnnotations(...)"

The document/layout structure is static. I don't really see any need to load that on-demand.
It can be loaded once when a document is opened and then remain in the browser for the lifetime
of the editor.

The PDF is only loaded once from the server - although pdf.js in the browser will render it page-by-page and load annotations only for currently visible pages.
The XML is only loaded once from the server - although editors like Apache or recogito only load annotations for the currently visible part on demand.

> And this also advocates a little for my reluctance to the XML-tree solution.
> Not only XML is not the format with the smallest footprint, but the editor have to define a policies that put all the information it (may) need inside the server-side view to receive it in the initial request that give it the (current sentences range) view.
> With separate access to the internal annotations, we can reduce the footprint of the initial request and the footprint of the getCASAnnotations(...) calls.
> And if some internal annotations are not always necessary (based on some user setting), we can also reduce the numbers/size of the getCASAnnotations(...) calls.

While you might reduce the footprint of the initial request, reloading layout information as you go along will over time likely exceed the initial savings.

Furthermore, if you have no global information about the document and its layout, how could the browser know which part of the document it is looking at
and which parts of the document it might have to load when you scroll up/down?

From my experience, the initial loading when the document is opened can happily have a larger footprint and take a bit longer.
What is more important is that the subsequent interaction with the document after opening (adding annotations, scrolling, etc.)
needs to be fast. Annotators are more likely to wait a little longer during document opening, but complain bitterly when the actual annotation
task is sluggish.

Cheers,

-- Richard

Grégoire Montcheuil

Apr 20, 2026, 10:26:47 AM
to inception-users
Dear Richard,

Thank you for your response.

> Layers are a concept of INCEpTION which covers all kinds of behavioural and other aspects that are not present in the underlying UIMA type system.
> (...)
> Think of INCEpTION being the high-level language and UIMA being the assembler language being used below it.

Yes, my purpose was not to enter into the details and complexity of the project's "layers" of "annotations", but just to clearly separate them from the "families" of "internal annotations".

> Yes, we have those internal annotations. (...) The reason why they are in the CAS is that this makes it easier to correlate true annotation positions or text positions with layout information.

And that is why our first intuition was: if we have more (position-related) information to store, and if UIMA CAS is the underlying format, why not add it directly to the imported CAS?

> In the case of PDF (...)
> For a long time, the layout information was saved in each annotators CAS copy. But in particular for
> large documents, that impacted the reaction, save and render times quite a bit. So now layout is only
> saved in the "INITIAL_CAS" once and loaded from there during rendering.

Yes, I remember seeing this "look for PDF structure in the CAS or fall back to re-computing it" logic in the code of the PDF Annotation editor backend.

> At that point, I found out that some users were relying on the layout information in the CAS, e.g. when
> running external recommenders in order to know in which parts of the document to make or not to make suggestions.
> Thus, the option to include the document structure in exports and when talking to the recommender was
> reintroduced.

So, perhaps this possibility could also be extended to external editors? ;-)

> I'm not sure what you mean by "extra behaviors".

By "extra behaviors" I mean the two endpoints added as "org.apache.wicket.behavior.AbstractAjaxBehavior" in the code of the backend, one for the PDF binary and another for the PDF structure.
I didn't want to go into technical details, but just to show that, depending on their features, the various editors have different needs for internal information.

> > Technically, the information we saved is very similar to the org.dkpro.core.api.pdf.type.PdfPage/PdfChunk annotation,
> > except we cannot save individual position of each glyphs (the number of glyphs could change during the pseudonymization),
> In terms of INCEpTION, it is essential that the glyphs and the base document text in the CAS match.
> Otherwise annotations could not be rendered properly. So my assumption would be that pseudonymization
> would happen prior to the import to INCEpTION, so that INCEpTION only sees the final pseudonymized text.
> And if you were to include any layout information, that layout information would also relate to the pseudonymized text.
> > but in return we also saved extra information (like bold, italic, underline and eventually the font name and color) that are not present in PdfChunk.
> Yes, those are not included in PdfChunk - because visual information is encoded in the PDF itself.

Yes, we fully understand why the position of each glyph is needed by the PDF Annotation editor.
I was just trying to illustrate how different needs imply slightly different internal information.

>  I wonder... if you have a way of pseudonymizing your PDFs while retaining layout information etc.
> have you considered rendering a new pseudonymized PDF from your original PDF and then importing that?


Yes, we have already tested that.

I have already mentioned one limitation: (currently) we cannot load pre-annotations this way ;-/
I see that your description of issue #5970 includes a possible way to resolve that point.

Another difficulty is that, since we do not control exactly how INCEpTION extracts the text from the PDF, the extracted text can become misaligned with our "reference text".
We can realign them after the human annotation... but some annotation spans could be affected (a continuous span in the extracted text could be split into discontinuous spans in the reference text).
(This has nothing to do with our topic, but I fully agree with the statement "it’s crucial to get your data out of PDFs as early as possible" in the introduction of this interesting blog post on NLP.)
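
To make the span-splitting effect concrete, here is a minimal TypeScript sketch. It assumes a hypothetical character-level alignment array, where `alignment[i]` is the reference-text offset of extracted character `i`, or -1 if that character has no counterpart. Neither this alignment format nor the `projectSpan` helper comes from INCEpTION; they are only an illustration.

```typescript
// Hypothetical type for illustration; not part of INCEpTION or DIAM.
type Span = { begin: number; end: number }; // end is exclusive

/**
 * Projects a span of the extracted text onto the reference text.
 * alignment[i] is the reference offset of extracted character i, or -1 if unaligned.
 * Returns one span per contiguous run of reference offsets.
 */
function projectSpan(span: Span, alignment: number[]): Span[] {
  const result: Span[] = [];
  let current: Span | null = null;
  for (let i = span.begin; i < span.end; i++) {
    const ref = alignment[i];
    if (ref === undefined || ref < 0) continue; // no counterpart in the reference text
    if (current !== null && ref === current.end) {
      current.end = ref + 1; // the run continues
    } else {
      current = { begin: ref, end: ref + 1 }; // a gap in the reference text: new run
      result.push(current);
    }
  }
  return result;
}

// Six extracted characters; the reference text has two extra characters in the middle.
const pieces = projectSpan({ begin: 0, end: 6 }, [0, 1, 2, 5, 6, 7]);
// pieces is [{ begin: 0, end: 3 }, { begin: 5, end: 8 }]
```

A single annotation on the extracted text thus comes back as two spans on the reference text, which is the case we would have to handle after annotation.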

> Or have you considered rendering your pseudonymized PDF layout into (X)HTML and then importing that
> into INCEpTION? In both PDF-rendering and HTML-rendering, you would be able to represent your
> style and layout information.  
  

We also experimented with importing our documents as HTML but, perhaps because we misunderstood the HTML editors, not all of the useful styles were preserved in our first tests ;-(

> I still do not see why the layout information should be reaching the editor in the form of layers/annotations at all.

No, not in the form of visible project "layers/annotations", but as some form of internal information, preferably transparent/invisible to the user.
And since the CAS format makes it possible to import/export both the (pre-)annotations of the project layers and many other families of internal annotations, we consider it a good candidate for encoding and delivering what we need.
As we said, we have not yet thoroughly explored the many possibilities offered by the custom-XML format.
And thank you for the various options you highlighted for us...

> > And, if I'm not mistaken, I foresee another complication: the server-side view is limited by the page size setting that "controls how many sentences are visible in the annotation area"... and the sentences ranges will not exactly match the (Pdf)Page ranges.
> The setting for how many sentences are visible in the annotation area applies only to semantic editors
> like the brat editor - not to layout oriented editors like the PDF or XML editors.
>
> PDF/XML editors keep track of what part of the document the user is currently viewing in the browser.

> (...)
> The XML is only loaded once from the server - although editors like Apache or recogito only load annotations for the currently visible part on demand.

Thank you for clarifying this point.
I had wrongly assumed that the server-side view sent to the XML editor was also "cropped" to the currently visible sentences.


> > > What would be the purpose of such reactivity?
> >
> > Let's follow our PDF-like annotation use case.
> > Say we define our own ad-hoc layout type system, foo.bar.LayoutPage/LayoutChunk, since it differs a little from org.dkpro.core.api.pdf.type.PdfPage/PdfChunk.
> > The external editor can first check whether there are foo.bar.Layout... internal annotations and, if so, use them to re-render the document (with bold, italic, color, etc.).
> > But if there are no foo.bar.Layout... annotations, it can fall back to the org.dkpro.core.api.pdf.type.Pdf... internal annotations and use them to re-render the document (with less style).
> > And so on with other families of internal annotations... and finally, if none of them are present, use a default rendering ;-/
> > So the same editor could transparently manage/switch between various families of internal annotations, without having to find the right way to encode each of them inside an XML tree.
> That sounds fragile to me.
> Why not define a proper set of datatypes that the editor can rely on being there?

I do not share your point of view.
I can understand that accepting only one "proper set of datatypes" and "relying on it being there" could seem safer (for the developer)...
But anticipating and adapting to different nuances of information does not seem like a weakness to me.
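
To illustrate the fallback order described in the use case quoted above, here is a minimal TypeScript sketch. The `InternalAnnotation` shape, the renderer names and the `chooseRenderer` helper are all hypothetical; the DIAM API does not currently deliver internal annotations to an editor in this form.

```typescript
// Hypothetical shape of an internal annotation as an editor might receive it.
interface InternalAnnotation {
  type: string;
  begin: number;
  end: number;
}

// Annotation families in order of preference: the richest layout information first.
// The renderer names are placeholders for illustration only.
const LAYOUT_FAMILIES: { prefix: string; renderer: string }[] = [
  { prefix: "foo.bar.Layout", renderer: "styled-layout" },
  { prefix: "org.dkpro.core.api.pdf.type.Pdf", renderer: "pdf-layout" },
];

/** Returns the renderer for the first family that is present, or "default" if none is. */
function chooseRenderer(annotations: InternalAnnotation[]): string {
  for (const family of LAYOUT_FAMILIES) {
    if (annotations.some((a) => a.type.startsWith(family.prefix))) {
      return family.renderer;
    }
  }
  return "default";
}
```

The point of the ordered list is that supporting a new family of internal annotations only means adding one entry, while the default rendering remains the last resort.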


> If you wanted the metadata to be editable though, the right thing to do would be to create a document-level annotation layer for it.
> Then the annotator could view and edit the data via the document-level annotation sidebar.

> In that case, however, you would no longer import your XML file above directly. Rather you would create a UIMA CAS file (XML or JSON)
> which would encode the layout part (doc element above) as Xml... annotations and your editable metadata as a respective document-layer
> type. Then you would import that UIMA CAS file into INCEpTION. In your project settings, you would hard-set the annotation editor to
> your custom annotation editor which knows how to deal with the XML data. In that way, when your annotators open a document for annotation,
> they would immediately be looking at your editor.

Thank you for adding this possibility.
We do not (yet) plan to make the document metadata editable... but it illustrates how all roads seem to lead (us) to the UIMA CAS representation ;-)

> Furthermore, if you have no global information about the document and its layout, how could the browser know which part of the document it is looking at
> and which parts of the document it might have to load when you scroll up/down.

Not "no global information", but just enough to start...

If we take the example of a PDF-like editor:
At the start, we only need the PdfPage annotations to know which page(s) are currently visible, i.e. which ones correspond to the (focused) offset range.
We can then load the PdfChunk annotations for the visible page(s) only.
As PdfChunk annotations are static, we can also cache them so that we do not have to re-request them every time we need them.
And, if it becomes critical, we can also pre-load the PdfChunks of the next (2-3) page(s) in the background, as a (music/video) streaming service does.

If we understand correctly, this strategy is already what all editors do for the project layer annotations (except for the caching/pre-loading parts).
And this strategy is an important element for maintaining the platform's performance with big documents.
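
As a rough TypeScript sketch of the page-based strategy above (load the visible pages, cache the static chunks, pre-load a few pages ahead): the `Page` and `Chunk` shapes, the `ChunkFetcher` callback and the `PageChunkCache` class are assumptions made for illustration. They are not the DKPro PdfPage/PdfChunk schema, and no DIAM endpoint with this behavior exists today.

```typescript
// Hypothetical types; the field names are illustrative only.
interface Page { index: number; begin: number; end: number } // end is exclusive
interface Chunk { begin: number; end: number; text: string }

// Assumed to be supplied by the editor, e.g. as a wrapper around some server request.
type ChunkFetcher = (page: Page) => Promise<Chunk[]>;

class PageChunkCache {
  private cache = new Map<number, Promise<Chunk[]>>();
  private pages: Page[];
  private fetchChunks: ChunkFetcher;
  private lookahead: number;

  constructor(pages: Page[], fetchChunks: ChunkFetcher, lookahead = 2) {
    this.pages = pages;
    this.fetchChunks = fetchChunks;
    this.lookahead = lookahead;
  }

  /** Pages that overlap the offset range [begin, end). */
  visiblePages(begin: number, end: number): Page[] {
    return this.pages.filter((p) => p.begin < end && p.end > begin);
  }

  /** Chunks of one page; requested at most once because the chunks are static. */
  private load(page: Page): Promise<Chunk[]> {
    let pending = this.cache.get(page.index);
    if (pending === undefined) {
      pending = this.fetchChunks(page);
      this.cache.set(page.index, pending);
    }
    return pending;
  }

  /** Loads the visible pages and starts pre-loading the next few in the background. */
  async chunksFor(begin: number, end: number): Promise<Chunk[]> {
    const visible = this.visiblePages(begin, end);
    if (visible.length === 0) return [];
    const pendingVisible = visible.map((p) => this.load(p));
    const last = visible[visible.length - 1].index;
    for (const p of this.pages) {
      if (p.index > last && p.index <= last + this.lookahead) {
        // Background pre-load; a failure only surfaces if the page is requested later.
        this.load(p).catch(() => undefined);
      }
    }
    const perPage = await Promise.all(pendingVisible);
    return ([] as Chunk[]).concat(...perPage);
  }
}
```

Only the list of pages is needed up front; everything else is requested on demand, and scrolling back to a page that was already seen costs no further request.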

> From my experience, the initial loading when the document is opened can happily have a larger footprint and take a bit longer.
> What is more important is that the subsequent interaction with the document after opening (adding annotations, scrolling, etc.)
> need to be fast. Annotators are more likely to wait a little longer during document opening, but complain bitterly when the actual annotation
> task is sluggish.

We have no doubt you are right about the annotators' preferences.
But nothing prevents the client side of a streaming service from first downloading all the data it may need before it "shows up".

In the case of an external editor that has some direct access to the internal annotations, this would be just one "strategy" among others...
In the case of the custom-XML approach, it looks more like an obligation.

When we find the time, rest assured that we will seriously explore the various custom XML-based possibilities that could meet our needs, and come back to you with - we hope - good news.

But, right now, my opinion hasn't changed much: we will end up needing the CAS representation to import/export our data.
And our internal information would pass through two encoding steps: first as the custom-XML tags and then as the XML structure in the CAS representation.

Best regards,

Grégoire Montcheuil


Richard Eckart de Castilho

unread,
Apr 21, 2026, 12:36:37 PM (5 days ago) Apr 21
to incepti...@googlegroups.com
I should mention some more things:

- custom UIMA annotations are only retained in the CAS by INCEpTION if there is an appropriate layer definition for them
- any custom UIMA annotations that are not built-in types like Xml... / Pdf... or DKPro Core types are removed
- meanwhile, you can package up any data in the built-in Xml... types, and that will be preserved and can be obtained via the respective view

-- Richard
