Annif awarded at the LLMs4Subjects challenge

Annif Users

May 8, 2025, 3:43:54 AM

to Annif Users

Dear all,

We recently took part in the LLMs4Subjects challenge at the SemEval-2025 workshop. The task was to use large language models (LLMs) to generate good-quality subject indexing for bibliographic records (titles and abstracts) from the bilingual (English and German) TIBKAT database of technical literature, using the large German-language GND subject vocabulary. Fourteen participating teams developed their own solutions for generating subject headings, and the output of each system was assessed using both quantitative and qualitative evaluations. Research papers about most of the systems will be published around the time of the workshop in late July, and many preprints are already available.

We applied Annif together with several LLMs, which we used to preprocess the data sets: we translated the GND vocabulary terms into English, translated bibliographic records into English and German as required, and generated additional synthetic training data. After the preprocessing, we used the traditional machine learning algorithms in Annif as well as the experimental XTransformer algorithm, which is based on language models. We also combined the subject suggestions generated from the English and German language records in a novel way.
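
To give a rough idea of what combining suggestions across languages can look like, here is a minimal Python sketch. It is only an illustration under our own simplifying assumptions, not the method described in the preprint: it averages the scores that two language-specific models assign to each subject and treats a subject missing from one language as scoring zero there. The function name `merge_suggestions`, the `gnd:A`-style identifiers, and the scores are all hypothetical.

```python
from collections import defaultdict


def merge_suggestions(per_language, limit=5):
    """Average subject scores across languages.

    per_language maps a language code to a dict of {subject_id: score}.
    A subject that is absent from one language contributes 0 for it.
    Illustrative assumption only, not the combination used in the paper.
    """
    totals = defaultdict(float)
    for suggestions in per_language.values():
        for subject_id, score in suggestions.items():
            totals[subject_id] += score

    n_languages = len(per_language)
    merged = {sid: total / n_languages for sid, total in totals.items()}

    # Highest score first; ties are broken by subject id for stable output.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical scores for one record, from an English and a German model.
english = {"gnd:A": 0.75, "gnd:B": 0.5}
german = {"gnd:A": 0.5, "gnd:C": 0.25}

ranked = merge_suggestions({"en": english, "de": german}, limit=2)
for subject_id, score in ranked:
    print(subject_id, score)
```

With plain averaging, a subject that both language models agree on (here `gnd:A`) tends to rise above subjects suggested by only one of them.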

We are glad to report that our system was ranked 1st in the category where the full vocabulary was used and 2nd in the smaller vocabulary category. Our system was ranked 4th in the qualitative evaluations. More information can be found in our system description preprint:

Suominen, O., Inkinen, J., & Lehtinen, M. (2025). Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs. arXiv. https://doi.org/10.48550/arXiv.2504.19675 (Pre-print)

More information about the task and an overview of the participating systems and their results are available in:

D'Souza, J., Sadruddin, S., Israel, H., Begoin, M., & Slawig, D. (2025). SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog. arXiv. https://doi.org/10.48550/arXiv.2504.07199 (Pre-print)

We also encourage you to read the preprints of other teams on arXiv. Currently we are aware of:

Bayrami Asl Tekanlou, H., Razmara, J., Sanaei, M., Rahgouy, M., & Babaei Giglou, H. (2025). Homa at SemEval-2025 Task 5: Aligning Librarian Records with OntoAligner for Subject Tagging. arXiv. https://doi.org/10.48550/arXiv.2504.21474 (Pre-print)

Dorkin, A., & Sirts, K. (2025). TartuNLP at SemEval-2025 Task 5: Subject Tagging as Two-Stage Information Retrieval. arXiv. https://doi.org/10.48550/arXiv.2504.21547 (Pre-print)

Islam, B., Ahmad, N., Barbhuiya, F. A., & Dey, K. (2025). NBF at SemEval-2025 Task 5: Light-Burst Attention Enhanced System for Multilingual Subject Recommendation. arXiv. https://doi.org/10.48550/arXiv.2505.03711 (Pre-print)

Kluge, L., & Kähler, M. (2025). DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing. arXiv. https://doi.org/10.48550/arXiv.2504.21589 (Pre-print)