Hi all,
I wanted to share a new resource that may be useful for people experimenting with AI-assisted cataloging, subject indexing, or metadata enrichment.
We recently released TIB-SID, a dataset of 136,569 real library catalog records in English and German, linked to the GND (Gemeinsame Normdatei) authority file, together with a machine-actionable version of the subject taxonomy. The dataset frames subject indexing as a realistic extreme multi-label classification problem over controlled vocabulary terms.
The resource was originally introduced through the LLMs4Subjects shared tasks (SemEval 2025 and GermEval 2025), where more than a dozen teams developed and evaluated automated subject tagging systems using the dataset. The approaches explored ranged from embedding-based retrieval pipelines to LLM prompting and hybrid extreme multi-label text classification (XMTC) systems.
Resources:
Dataset
https://github.com/sciknoworg/tib-sid
Preprint
https://arxiv.org/abs/2603.10876
Shared task pages
https://sites.google.com/view/llms4subjects
https://sites.google.com/view/llms4subjects-germeval
If anyone is experimenting with automated subject indexing, authority control, or multilingual metadata, we would be very interested to hear how the dataset works in other settings.
We would also be happy to hear from others working on similar problems or interested in collaborating on future evaluations or extensions of the dataset.
Best,
Jennifer D’Souza
TIB – Leibniz Information Centre for Science and Technology