practice paper ‘Automatic Subject Cataloguing at the German National Library’


Jan Jacobs

Apr 14, 2025, 7:22:39 AM
to Annif Users
Dear Annif users,

Our practice paper ‘Automatic Subject Cataloguing at the German National Library’ has now been published in LIBER Quarterly:
https://doi.org/10.53377/lq.19422

It provides a solid overview of our use cases for automatic subject cataloguing and of the central role Annif plays in them.

Christoph Poley and team at the German National Library (posted by Jan Jacobs)

Abstract:

The German National Library (DNB) began developing solutions for automatic subject cataloguing 15 years ago. The main reason for this was the huge and ever-growing number of digital media works that needed to be indexed. Today, the DNB uses open source algorithms and frameworks to assign various types of thematic meta information in this way.

This practice paper provides a deeper insight into automatic subject cataloguing at the DNB. We look at the data and vocabularies used as well as at the different methods and approaches. The vocabulary for classification is based on the Dewey Decimal Classification (DDC). For verbal subject indexing we use the German Integrated Authority File (GND).

The use case of automatic classification is divided into the assignment of DDC Subject Categories and DDC Short Numbers. Due to the large size of the GND vocabulary, the use case of automatic indexing is an extreme multi-label classification (XMLC) problem. A brief report is given about the construction and the performance of our models.

Based on these use cases, we present some implementation aspects of our “subject cataloguing machine” EMa, the environment for automatic subject cataloguing in productive use. We point out the basic feature set and provide a high-level introduction of the productive EMa system. The modular design of the EMa software architecture with the open source software Annif as a central toolkit is described.

The development of EMa is an ongoing task at the DNB. It requires continuous development and maintenance, technological and human resources. Applied research activities in the DNB's AI project are closely related to the EMa ensuring that relevant scientific findings get integrated into its development.

João Oliveira Lima

Jul 28, 2025, 2:42:07 AM
to Annif Users
Dear Jan Jacobs (and Annif Users),

Congratulations on the publication of the paper "Automatic Subject Cataloguing at the German National Library" in LIBER Quarterly! It's excellent to see how the Annif toolkit plays such a central role in your approach to automated subject cataloguing.

Given your extensive experience with automatic classification systems, I'd like to pose a question to the group that bridges practical implementation and theoretical considerations:

From your perspective and experience with methods like Omikuji, SVC and the Annif toolkit, how effectively could embedding-based approaches address the challenge of the subjective indeterminacy of 'subject' (as discussed by Patrick Wilson in his work, 'Two Kinds of Power' (1968))? Do you see potential in embeddings complementing your current methods, especially by capturing semantic nuances that might be difficult for discrete classification systems or lexical approaches alone?

Looking forward to your insights!

Best regards,

João Lima

Christoph Poley

Jul 29, 2025, 5:29:57 AM
to Annif Users

Dear João Lima,

Thank you very much for your positive feedback on our paper in LIBER Quarterly!

Embeddings cover a broad field. At the DNB, we have already been considering the opportunities offered by embedding-based approaches.

As is well known, machine learning approaches can only predict topics that occur in the gold standard annotations used for training. They cannot make predictions for zero-shot labels. Lexical approaches, on the other hand, can find any topic that is part of the vocabulary. Unfortunately, they often produce a large number of false positives, because matching input texts against the vocabulary solely by string representation does not capture semantic context. Disambiguating topics with similar string representations is a further problem in this context.

As part of our AI project (https://www.dnb.de/EN/Professionell/ProjekteKooperationen/Projekte/KI/KI.html), we have run some experiments with embedding-based matching, which consists of enhancing lexical matching with sentence transformer embeddings. These embeddings can capture the semantic context of the input text and enable vector-based matching that is not (only) based on the string representation.
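A minimal sketch of what vector-based matching means, assuming hand-made three-dimensional toy vectors in place of real sentence transformer embeddings. The labels, the vectors and the `match` function are hypothetical illustrations, not the DNB implementation:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy embeddings of vocabulary labels. A real system would
# obtain these vectors from a sentence-embedding model.
LABEL_VECTORS = {
    "Bank (financial institution)": [0.9, 0.1, 0.0],
    "Bank (river)": [0.1, 0.9, 0.1],
    "Library": [0.0, 0.1, 0.9],
}


def match(text_vector, label_vectors, limit=2):
    """Rank vocabulary labels by cosine similarity to the text vector."""
    scored = [
        (label, cosine(text_vector, vector))
        for label, vector in label_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical embedding of a text about loans and interest rates.
text_vector = [0.8, 0.2, 0.1]
ranked = match(text_vector, LABEL_VECTORS)
```

Both "Bank" labels share the same string, so a purely lexical matcher cannot tell them apart. In this toy setup the ranking is decided by vector similarity, which places the financial sense first.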

Based on our tests so far, embedding-based matching shows strengths in zero-shot label prediction and can also predict different true positives (unfortunately, false positives too) than, e.g., Omikuji or MLLM. Our assumption is that using embedding-based matching in an ensemble can bring a semantic perspective into our predictions and thereby improve the results. More extensive testing for production is still pending.
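The ensemble idea can be sketched as a weighted average of per-backend scores. This is only an illustration under stated assumptions: the backend names, labels, scores and weights are hypothetical, and the `ensemble` function is not Annif's own code:

```python
def ensemble(suggestions_by_backend, weights):
    """Combine per-backend label scores into one ranked list.

    Each backend contributes its score multiplied by its normalised
    weight. A label that a backend did not suggest counts as 0 there.
    """
    total = sum(weights[name] for name in suggestions_by_backend)
    combined = {}
    for name, suggestions in suggestions_by_backend.items():
        for label, score in suggestions.items():
            share = weights[name] * score / total
            combined[label] = combined.get(label, 0.0) + share
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores: the trained backend only knows labels seen in
# training, the embedding backend also proposes a zero-shot label.
suggestions = {
    "trained": {"seen-label-1": 0.8, "seen-label-2": 0.4},
    "embedding": {"seen-label-1": 0.6, "zero-shot-label": 0.5},
}
weights = {"trained": 1.0, "embedding": 1.0}

result = ensemble(suggestions, weights)
```

In this toy data the zero-shot label, which only the embedding backend proposes, still reaches the combined ranking, while a label supported by both backends stays on top.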

We are currently implementing embedding-based matching as a new backend for Annif, but the work will take a while longer. In this context, I invite you to have a look at the issue "Development of an embedding-based matching backend": https://github.com/NatLibFi/Annif/issues/855.

Greetings, Christoph (for the EMa team)

anna.k...@googlemail.com

Jul 29, 2025, 8:57:51 AM
to Annif Users
Dear João Lima,

You might also be interested in this PR: https://github.com/NatLibFi/Annif/pull/798

Greetings
Argie (from the ZBW team)