Dear João Lima,
Thank you very much for your positive feedback on our paper in Liber Quarterly!
Embeddings cover a broad field. At the DNB, we have already been thinking about the opportunities of embedding-based approaches.
As is well known, machine learning approaches can only predict topics that occur in the gold standard annotations used for training; they cannot make predictions for zero-shot labels. Lexical approaches, on the other hand, can find any topic that is part of the vocabulary. Unfortunately, lexical approaches often produce a large number of false positives, as the matching of input texts against the vocabulary (based solely on their string representations) does not capture semantic context. Disambiguating topics with similar string representations is also a problem in this setting.
As part of our AI project (https://www.dnb.de/EN/Professionell/ProjekteKooperationen/Projekte/KI/KI.html), we have run some experiments with embedding-based matching, which enhances lexical matching with the power of sentence-transformer embeddings. These embeddings capture the semantic context of the input text and enable vector-based matching that is not (only) based on the string representation.
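To illustrate the idea, here is a minimal sketch of vector-based matching. The `embed` function below is only a deterministic toy stand-in so the example runs on its own; in a real setup it would call a sentence-transformer model (e.g. `model.encode(text)`), and the vocabulary labels are invented for illustration:

```python
import hashlib

import numpy as np


def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy stand-in for a sentence-transformer encoder.

    Derives a pseudo-embedding from a stable hash of the text, so the
    sketch is self-contained. A real implementation would return the
    model's sentence embedding instead.
    """
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity


def match(doc: str, vocabulary: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank vocabulary concepts by cosine similarity to the document."""
    doc_vec = embed(doc)
    scores = [(label, float(doc_vec @ embed(label))) for label in vocabulary]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]


vocabulary = ["machine learning", "library science", "botany"]
ranked = match("Automated subject indexing with neural networks", vocabulary)
```

Because similarity is computed in the embedding space rather than on strings, a concept can match even when none of its label words appear in the input text, which is what makes zero-shot predictions possible.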
Based on our tests so far, embedding-based matching shows strengths in zero-shot label prediction and also predicts different true positives (and, unfortunately, different false positives) than e.g. Omikuji or MLLM. Our assumption is that using embedding-based matching in an ensemble can add a semantic perspective to our predictions and, as a positive effect, improve the results. More extensive testing for production is still pending.
Currently, we are working on implementing embedding-based matching as a new backend for Annif. However, the work will take a while longer. In the meantime, I invite you to follow the issue "Development of an embedding-based matching backend": https://github.com/NatLibFi/Annif/issues/855.
Greetings, Christoph (for the EMa team)