[Dataset] TIB-SID – bilingual library subject indexing dataset (GND, 136k records)

Jennifer D'Souza

Mar 16, 2026, 12:48:17 PM

to ai4lam

Hi all,

I wanted to share a new resource that may be useful for people experimenting with AI-assisted cataloging, subject indexing, or metadata enrichment.

We recently released TIB-SID, a dataset of 136,569 real library catalog records (English/German) linked to the GND authority file, together with a machine-actionable version of the subject taxonomy. The dataset frames subject indexing as a realistic extreme multi-label classification problem over controlled vocabulary terms.
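To make the multi-label framing concrete, here is a minimal sketch of how a ranked list of predicted subject terms could be scored against a record's gold subjects using recall@k, one common metric for this kind of task. The record layout, field names, and identifiers below are hypothetical placeholders and are not taken from the dataset's actual schema.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Fraction of gold subject IDs that appear among the top-k ranked predictions."""
    gold_set = set(gold)
    if not gold_set:
        # No gold subjects to recover; define recall as 0.0 to avoid dividing by zero.
        return 0.0
    hits = len(gold_set.intersection(ranked[:k]))
    return hits / len(gold_set)


# Hypothetical record: field names and IDs are placeholders, not the dataset's schema.
record = {
    "title": "Example title",
    "subjects": ["gnd:A", "gnd:B", "gnd:C"],
}

# Hypothetical system output: subject IDs ranked from most to least likely.
predicted = ["gnd:B", "gnd:X", "gnd:A", "gnd:Y", "gnd:C"]

score = recall_at_k(predicted, record["subjects"], k=3)
print(f"recall@3 = {score:.3f}")
```

In a real setting the candidate label space would be the full set of controlled vocabulary terms, so the ranking step, not the scoring step, is where most of the difficulty lies.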

The resource was originally introduced through the LLMs4Subjects shared tasks (SemEval 2025 and GermEval 2025), where more than a dozen teams developed and evaluated automated subject tagging systems using the dataset. The tasks explored approaches ranging from embedding-based retrieval pipelines to LLM prompting and hybrid XMTC systems.

Resources:

Dataset
https://github.com/sciknoworg/tib-sid

Preprint
https://arxiv.org/abs/2603.10876

Shared task pages
https://sites.google.com/view/llms4subjects
https://sites.google.com/view/llms4subjects-germeval

If anyone is experimenting with automated subject indexing, authority control, or multilingual metadata, we would be very interested to hear how the dataset works in other settings.

We would also be happy to hear from others working on similar problems or interested in collaborating on future evaluations or extensions of the dataset.

Best,
Jennifer D’Souza
TIB – Leibniz Information Centre for Science and Technology
