Hello Grégoire,
> On 17. Apr 2026, at 18:20, Grégoire Montcheuil <gregoire....@gmail.com> wrote:
>
> > Maybe you can explain in more detail if/why you need fine-grained access to the type system and annotation types in the frontend and during rendering?
>
> To enlighten my remarks, perhaps I need to resolve some vocabulary clashes around "annotation" and "layer".
> I agree with you, INCEpTION is a platform to manage human annotation, and the parameters of this various "layers" of annotations are at the discretion of the project manager.
Layers are a concept of INCEpTION which covers all kinds of behavioural and other aspects that are not present in the underlying UIMA type system.
A layer typically maps to exactly one UIMA type, but can also map to multiple types (e.g. in the case of chain layers).
Annotations are a bit ambiguous.
On the one hand, annotations are spans, relations or chain elements at the level of INCEpTION.
On the other hand, Annotation is a special type of UIMA feature structure which has a begin/end.
Typically, an INCEpTION annotation maps to a particular UIMA annotation.
Features are likewise ambiguous.
In INCEpTION, a feature is a property of a span/relation/chain element that is user visible.
In UIMA, an annotation or feature structure may have additional features that are not visible to the user.
Also, e.g. in the case of INCEpTION link features, an INCEpTION feature may map to additional UIMA features and feature structures.
Think of INCEpTION as the high-level language and UIMA as the assembly language used below it.
> But as you confirm me, under the hood INCEpTION manipulate a CAS representation of each document, that include both the various project layers annotations (visible or not) and some internal information - we yet mentioned the segmentation (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*), the PDF (org.dkpro.core.api.pdf.type.Pdf...) or XML structure (org.dkpro.core.api.xml.type.Xml...) - also encoded as annotations.
> Lets call this latest kind of annotation "internal annotations", and to avoid the "layer" word, we will use "families" - i.e. a (coherent) list of types - for the internal annotations.
Yes, we have those internal annotations. However, the fact that those are represented as annotations is
almost entirely an internal matter to the INCEpTION backend. They are not managed like other INCEpTION-level
annotations. They are used only during rendering. The reason why they are in the CAS is that this makes it
easier to correlate true annotation positions or text positions with layout information. In the case of PDF,
the layout was originally kept separately from the CAS. It was only merged into the CAS during a major refactoring
so we have it cached. Before, we generated the layout information on-the-fly when a PDF was loaded.
For a long time, the layout information was saved in each annotator's CAS copy. But in particular for
large documents, that impacted the reaction, save and render times quite a bit. So now layout is only
saved in the "INITIAL_CAS" once and loaded from there during rendering.
At that point, I found out that some users were relying on the layout information in the CAS, e.g. when
running external recommenders in order to know in which parts of the document to make or not to make suggestions.
Thus, the option to include the document structure in exports and when talking to the recommender was
reintroduced.
And that is where we are now.
> With the exception of the segmentation family that eventually could be editable, this internal annotations are helpful (and a priori read-only) information for the document rendering and/or human annotation task.
> Proof of this, the HTML/server-side view editors are based on the XML structure internal annotations, and the PDF Annotation editor on the PDF structure internal annotations, and so they both use their custom access to this information.
> In the case of the HTML/server-side view editors, it get the information after the various XML filters that build the server-side view.
> And it the case of the PDF Annotation editor, it defined extra behaviours.
I'm not sure what you mean by "extra behaviors".
The XHTML-XML/PDF server-side views are special server-side components whose purpose it is to
render the document and send it to the browser. They are part of the XML/PDF backend subsystems
that own the Xml... and Pdf... UIMA types. So they are privileged in that sense. They also have
certain responsibilities, e.g. to make the data safe for consumption by the browser and to make
it compact if possible (cf. XML policy).
> So, to be clear, the purpose of getCASTypesystem(...)/getCASAnnotations(...), was not to get a different access to the project layers annotations, but a new, perhaps more generic, access to the internal annotations.
Sure, but those annotations are intentionally not accessible through the editor DIAM API.
The document rendering happens server-side via the backend ApacheAnnotatorHtmlAnnotationEditor or
PdfAnnotationEditor. Those provision the document data to the frontend via the views
and the editor running in the frontend can then rely on the document being there and
being overlayable by them.
The brat or doccano editors work a bit differently. They have no view component.
Instead, they are free to define their own document structure and render it
semantically as they like. For this purpose, they have access to sentence and
token information.
Maybe you are looking for a concept of "style" annotations that are neither
part of the document layout nor of the editable annotations, but which would
be provided to editors such as brat or doccano as well so they could flavour
up their semantic rendering with e.g. bold or italic font styles.
On the other hand, my feeling is that such style is inherently part of the
document layout, so I would tend towards using a (XHTML-XML) view-based
editor in this case and post-process the backend-rendered document in the
browser-side editor code before it is shown to the user if necessary,
e.g. by introducing custom styling around sentences or such. Or alternatively
by re-rendering the document around sentences and then superimposing some
of the style information that was provided by the view on it. Those are just
examples. On some non-public editors I have been working on, I have done some
quite extensive post-processing of the view-provided XML. The only important
thing here is that any post-processing must ensure that character offsets remain
stable.
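
To illustrate what I mean by stable offsets, here is a minimal sketch. It is in Python
only so it is easy to run; in a real editor this kind of post-processing would happen in
the browser-side code. The helper names (text_content, wrap_range) and the sample
paragraph are made up for illustration and are not part of INCEpTION.

```python
import xml.etree.ElementTree as ET


def text_content(elem):
    """Concatenate all text nodes below elem in document order."""
    return "".join(elem.itertext())


def wrap_range(elem, begin, end, tag="span", attrib=None):
    """Wrap elem.text[begin:end] in a new child element.

    Assumption for this sketch: elem contains only text and no child elements.
    """
    text = elem.text or ""
    wrapper = ET.Element(tag, attrib or {})
    wrapper.text = text[begin:end]
    wrapper.tail = text[end:]      # text after the wrapped range
    elem.text = text[:begin]       # text before the wrapped range
    elem.insert(0, wrapper)
    return wrapper


# Hypothetical view fragment as it might arrive from the server-side view.
p = ET.fromstring("<p>First sentence. Second one.</p>")
before = text_content(p)

# Add custom styling around the first sentence.
wrap_range(p, 0, 15, attrib={"class": "sentence"})

# Only markup was added, so every character keeps its offset.
assert text_content(p) == before
```

As long as the post-processing only adds or removes markup and never inserts, drops or
reorders characters, the begin/end offsets exchanged with the backend stay valid.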
> > The Pdf... annotations are necessary for the PDF editor to render annotations, but they are not sufficient.
> > They only provide location information for the text. They do not provide the actual image rendered from the PDF.
> > So in case of the PDF, we *always* need the original PDF file.
>
> The PDF annotation case is excellent to illustrate part of our objectives.
> We worked on critical data, health data, that should be pseudonymized before to enter in the human annotation process.
> So have the original PDF is not exactly an good option on our pipeline.
> Nonetheless, we believe that a PDF-like rendering of the documents is a great help for the humans "in the loop".
> So we found a way to save enough layout information to be able to re-render the document after the pseudonymization.
>
> Technically, the information we saved is very similar to the org.dkpro.core.api.pdf.type.PdfPage/PdfChunk annotation,
> except we cannot save individual position of each glyphs (the number of glyphs could change during the pseudonymization),
In terms of INCEpTION, it is essential that the glyphs and the base document text in the CAS match.
Otherwise annotations could not be rendered properly. So my assumption would be that pseudonymization
would happen prior to the import to INCEpTION, so that INCEpTION only sees the final pseudonymized text.
And if you were to include any layout information, that layout information would also relate to the pseudonymized text.
> but in return we also saved extra information (like bold, italic, underline and eventually the font name and color) that are not present in PdfChunk.
Yes, those are not included in PdfChunk - because visual information is encoded in the PDF itself.
> We have yet some code to re-render the document (within a <canvas> or as styled html sub-tree) using just the text content (the CAS SofA) and our layout information.
> So, as a first test, we put our layout information as project layer annotations in a CAS import, and we can re-render the document based on the DIAM loadAnnotation() results.
> Even without the human annotations interaction coded, this first result is quite interesting and encouraging...
I wonder... if you have a way of pseudonymizing your PDFs while retaining layout information etc.
have you considered rendering a new pseudonymized PDF from your original PDF and then importing that?
Or have you considered rendering your pseudonymized PDF layout into (X)HTML and then importing that
into INCEpTION? In both PDF-rendering and HTML-rendering, you would be able to represent your
style and layout information.
> But the two first disadvantages we encounter are:
> - all the layout annotations appear to the user, and so pollute the left-side part of the annotator UI,
> - the various layout information have to be encoded inside the annotation label - so all useful information should appear, and be re-parsed :-/.
If you created a PDF version of your data or an (X)HTML version of your document, the layout information could be
transported to the frontend naturally. Going the (X)HTML route, you could even use e.g. CSS classes or "data" attributes
to encode special (layout) information which you would like your custom editor plugin to react to. Otherwise, if it is
just about the visual aspect, just use a CSS stylesheet or pre-render the visuals into a pseudonymized PDF.
I still do not see why the layout information should be reaching the editor in the form of layers/annotations at all.
> So, if we can move our layout annotations to the internal annotations and access them from the external editor, we will resolve this two problems...
>
> At the day, the only solution seems to find a way to encode our layout information inside an custom XML tree and define the right policies to find them intact after the server-side view filters.
That I think is the best solution. As I said, theoretically, you could also just pre-render your data into pseudonymized PDFs and then import them.
But honestly, the (X)HTML-XML route is very likely to cause far fewer headaches.
Btw. nobody is forcing you to use a document-oriented XML where the tags are inline with the text.
Theoretically, you could happily define a standoff format like:
<doc>
<text>...</text>
<layout>
...
</layout>
</doc>
And then in your editor, you could render the text into a canvas according to the layout information.
If you make sure there are no text nodes in the layout section (i.e. you only use elements and attributes),
then none of the data from the layout section ends up in the CAS base text. If you also remove any whitespace
and line breaks outside the text element, the CAS base text ends up being just the string inside the text
element. When you interact with DIAM, make sure that your editor is able to recover the begin/end offsets
relative to the text element (or relative to the doc element if you do not remove whitespace).
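
Here is a minimal sketch of that idea, in Python only so it is easy to try out. It
assumes the base text is simply the concatenation of all text nodes in document order.
The chunk element and its begin/end/bold attributes are made up for illustration; you
would define your own layout vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical standoff document without whitespace between the elements.
DOC = (
    '<doc><text>Jack has a cold.</text>'
    '<layout><chunk begin="0" end="4" bold="true"/>'
    '<chunk begin="5" end="16"/></layout></doc>'
)


def base_text(root):
    """Concatenate all text nodes in document order (assumed base text)."""
    return "".join(root.itertext())


def styled_spans(root):
    """Resolve every layout chunk to the text it covers and its bold flag."""
    text = root.findtext("text")
    return [
        (text[int(c.get("begin")):int(c.get("end"))], c.get("bold") == "true")
        for c in root.iter("chunk")
    ]


root = ET.fromstring(DOC)

# The layout section has only elements and attributes, so it adds no characters.
assert base_text(root) == root.findtext("text")

for covered, bold in styled_spans(root):
    print(repr(covered), "bold" if bold else "regular")
```

If the file were pretty-printed, the whitespace between the elements would show up in
base_text as well, and the chunk offsets would have to be shifted accordingly.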
> As I said you, we haven't yet found the time to explore this option...
> But we have already discussed in this thread the (current) import/export limitations of this approach.
> And, if I'm not mistaken, I foresee another complication: the server-side view is limited by the page size setting that "controls how many sentences are visible in the annotation area"... and the sentences ranges will not exactly match the (Pdf)Page ranges.
The setting for how many sentences are visible in the annotation area applies only to semantic editors
like the brat editor - not to layout-oriented editors like the PDF or XML editors.
PDF/XML editors keep track of what part of the document the user is currently viewing in the browser.
Based on that, they then request all annotations from the backend that overlap with the visible part
of the document, and those are then rendered. This is necessary in order to keep the rendering performance
manageable and to keep the editors reactive. If they were always to load all the data from the backend,
they would be terribly slow.
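
Conceptually, the request boils down to an overlap test between each annotation and the
visible character range. The following is only a sketch of that test, not the actual
INCEpTION code; the Span class and the overlapping function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    begin: int
    end: int
    label: str


def overlapping(spans, window_begin, window_end):
    """Return the spans that overlap the half-open window [window_begin, window_end)."""
    return [s for s in spans if s.begin < window_end and s.end > window_begin]


# Made-up annotations; the user is currently looking at characters 100..200.
spans = [Span(0, 4, "PER"), Span(120, 130, "LOC"), Span(95, 105, "ORG")]
visible = overlapping(spans, 100, 200)

for s in visible:
    print(s.label, s.begin, s.end)
```

Note that the ORG span is included even though it starts before the window, because it
reaches into the visible part.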
> > What would be the purpose of such reactivity?
>
> Let follow our PDF-like annotation use case.
> Said we define our ad-hoc layout information type system, foo.bar.LayoutPage/LayoutChunk, as they differs a little to org.dkpro.core.api.pdf.type.PdfPage/PdfChunk.
> The external editor can first check if there are foo.bar.Layout... internal annotations and so use them to re-render the document (with bold, italic, color, etc.).
> But if there is no foo.bar.Layout... annotations, it can fall back to org.dkpro.core.api.pdf.type.Pdf... internal annotations and use them to re-render the document (with less style).
> And so on with other families of internal annotations... to finally, if any of them are present, use a default rendering ;-/
> So the same editor could transparently manage/switch between various families of internal annotations - without to find the right way to encode any of them inside an XML tree.
That sounds fragile to me.
Why not define a proper set of datatypes that the editor can rely on being there?
Why not encode those data types into the XML structure of the document as outlined above?
> We also can see other usage of this reactivity.
> Suppose we have another feature that use other types of internal annotations to display some useful patient information (or any other meta-data).
> We can easily define this feature as an independent module as it should only know how its information is encoded in the internal annotations, and not in the resulting XML server view that also combine some other families of internal annotation.
> Based on the presence of this internal annotation, we can dynamically (des)activate the feature.
> And so one with every new feature.
You could include additional metadata into the XML as well:
<doc>
<patient>
<identity givenName="Jack" familyName="Black"/>
</patient>
<text>...</text>
<layout>
...
</layout>
</doc>
And your browser-side editor could check if this information is there and if so render it **separately** from the data in the text element.
You could maybe show it as an infobox above the actual document. Your editor could ensure that data in this info-box is not selectable and
thus not annotatable.
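
A minimal sketch of such a check, again in Python only for the sake of a runnable
example (your editor would do the equivalent on the DOM in the browser). The function
name patient_infobox is made up; the element and attribute names are the ones from the
example above.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<doc><patient><identity givenName="Jack" familyName="Black"/></patient>'
    '<text>...</text><layout/></doc>'
)


def patient_infobox(root):
    """Return the patient identity attributes, or None if the metadata is absent."""
    identity = root.find("patient/identity")
    if identity is None:
        return None
    return dict(identity.attrib)


root = ET.fromstring(DOC)
info = patient_infobox(root)

if info is not None:
    # Rendered separately from the text element, e.g. as a non-selectable infobox.
    print("Patient:", info["givenName"], info["familyName"])

# The metadata lives in attributes only, so it adds nothing to the text nodes.
assert "".join(root.itertext()) == root.findtext("text")
```

Documents without a patient element simply yield None, so the editor can skip the
infobox without any special configuration.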
If you wanted the metadata to be editable though, the right thing to do would be to create a document-level annotation layer for it.
Then the annotator could view and edit the data via the document-level annotation sidebar.
In that case, however, you would no longer import your XML file above directly. Rather you would create a UIMA CAS file (XML or JSON)
which would encode the layout part (doc element above) as Xml... annotations and your editable metadata as a respective document-layer
type. Then you would import that UIMA CAS file into INCEpTION. In your project settings, you would hard-set the annotation editor to
your custom annotation editor which knows how to deal with the XML data. In that way, when your annotators open a document for annotation,
they would immediately be looking at your editor.
> > I don't want a creative annotator to get ideas and circumvent the disabled export by simply using a getCASAnnotations endpoint
>
> We agree with that.
> I suppose we can have imagine some policies that ensure the getCASAnnotations() endpoint limits its results to some families of internal annotations - and eventually with the possibility to refine this policies at the project level.
> B.e. for our pdf-like annotation use case:
> - the external editor will declare in its policies two families of internal annotations, foo.bar.LayoutPage/LayoutChunk and org.dkpro.core.api.pdf.type.PdfPage/PdfChunk,
> and wouldn't never know about or have access to other types of annotations with the getCASAnnotations() endpoint.
> - the project manager could allow the annotators to switch to this external editor, but eventually disable the foo.bar.LayoutPage/LayoutChunk family as the project only use imported PDF (or for any other reason).
I am not sure this is necessary given what I have outlined above.
> > Finally, a CAS annotation (feature structure) is a graph that potentially could have quite a big footprint (e.g. the entire
> > XML structure in case of a XmlDocument annotation). The flat rendered annotations do not have that problem.
>
> We also understand this point, that why we propose that getCASAnnotations(...) have "some range filtering like loadAnnotations(...)"
The document/layout structure is static. I don't really see any need to load that on-demand.
It can be loaded once when a document is opened and then remain in the browser for the lifetime
of the editor.
The PDF is only loaded once from the server - although pdf.js in the browser will render it page-by-page and load annotations only for currently visible pages.
The XML is only loaded once from the server - although editors like Apache or recogito only load annotations for the currently visible part on demand.
> And this also advocates a little for my reluctance to the XML-tree solution.
> Not only XML is not the format with the smallest footprint, but the editor have to define a policies that put all the information it (may) need inside the server-side view to receive it in the initial request that give it the (current sentences range) view.
> With separate access to the internal annotations, we can reduce the footprint of the initial request and the footprint of the getCASAnnotations(...) calls.
> And if some internal annotations are not always necessary (based on some user setting), we can also reduce the numbers/size of the getCASAnnotations(...) calls.
While you might reduce the footprint of the initial request, reloading layout information as you go along will over time likely exceed the initial savings.
Furthermore, if you have no global information about the document and its layout, how could the browser know which part of the document it is looking at
and which parts of the document it might have to load when you scroll up/down?
From my experience, the initial loading when the document is opened can happily have a larger footprint and take a bit longer.
What is more important is that the subsequent interaction with the document after opening (adding annotations, scrolling, etc.)
needs to be fast. Annotators are more likely to wait a little longer during document opening, but complain bitterly when the actual annotation
task is sluggish.
Cheers,
-- Richard