A brief analysis of MLLM and its runtime


Clemens Rietdorf

Dec 10, 2024, 8:15:21 AM12/10/24
to Annif Users
Dear Annif users, dear Annif team,

The Annif environment has been in use at the German National Library (DNB) for some time now, and with it the MLLM backend, which is used for indexing documents. The index terms come from the German Integrated Authority File (GND), which currently contains about 9.6 million standardized German descriptors and is growing continuously. A subset of around 1.4 million GND descriptors is marked for use in subject indexing, and the MLLM backend is provided with this entire subset. MLLM does an important job in our indexing process as part of our productive ensemble backend and provides valuable results, but it is also responsible for a large part of the runtime of the whole process. Indexing most documents with MLLM takes one to three seconds, but for some documents the runtime exceeds two minutes. MLLM can therefore be seen as the bottleneck of our whole indexing process.

We conducted a study into this problem some time ago and tried to find a correlation between the structure and features of problematic documents and the long runtime needed to process them with MLLM. In this brief report, we present our latest findings and the results of our analysis of the MLLM backend itself, looking at its overall runtime and the hotspots that contribute most to it. We have also tested two solutions to the runtime problem and can offer some advice to Annif users who may encounter a similar problem.

Please find the full report attached.

Best regards
Clemens Rietdorf
A_brief_analysis_of_MLLM.pdf

Osma Suominen

Dec 11, 2024, 2:28:49 AM12/11/24
to annif...@googlegroups.com
Hello Clemens,

thank you very much for the detailed report and analysis!

I think you've identified the main factors behind the slow performance.
When it comes to the calculation of the ambiguity feature, the current
algorithm is very simplistic and, as you pointed out, it doesn't scale
well when there are many matches. It should be relatively easy to
implement it more efficiently and thus mitigate the problem to some
degree. I can already think of a few different approaches; however, in
order to choose a good solution, we would need a bit more empirical
information.

Could you please do the following:

1. Make sure you have a trained MLLM model that is trained using the
full GND vocabulary set you use (1.4M subjects) and a document (as a
text file) that takes a long time to process.

2. Add these lines to the _find_subj_ambiguity method:

import json

# append the token sets for this call as one JSON line
with open('tsets.jsonl', 'a') as f:
    json.dump([[int(tid) for tid in tset] for tset in tsets], f)
    f.write('\n')

3. Perform the "annif suggest" operation on the slow document. This
should write some debugging information about the token sets into the
tsets.jsonl file.

4. Open an issue on GitHub referencing this discussion, with a title
like "Slow calculation of ambiguity feature in MLLM", and attach the
tsets.jsonl file to the issue.
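Once such a tsets.jsonl file exists (one JSON array of token-id arrays per line), alternative implementations could be benchmarked against it along the following lines. This is only an illustrative sketch under an assumed, simplified ambiguity definition (each token set's ambiguity taken as the number of other token sets that contain it); the actual definition inside MLLM may differ, and the function names here are hypothetical:

```python
import json
from collections import Counter


def load_tsets(path):
    """Read a tsets.jsonl file: one JSON array of token-id arrays per
    line, as written by the debugging snippet above."""
    with open(path) as f:
        return [[frozenset(ts) for ts in json.loads(line)] for line in f]


def naive_ambiguity(tsets):
    """Quadratic baseline: for each token set, count the other token
    sets (by position) that contain it as a subset."""
    return [sum(1 for j, other in enumerate(tsets) if j != i and ts <= other)
            for i, ts in enumerate(tsets)]


def grouped_ambiguity(tsets):
    """Same result, but deduplicates identical token sets first, which
    helps when many candidate matches share the same tokens."""
    counts = Counter(tsets)
    # proper supersets plus other occurrences of the identical set
    amb = {ts: counts[ts] - 1 +
               sum(c for other, c in counts.items() if ts < other)
           for ts in counts}
    return [amb[ts] for ts in tsets]
```

Timing both variants over one call's worth of real token sets would show whether simple deduplication already helps, or whether a fundamentally different data structure is needed.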


Having the real world token sets would allow testing and benchmarking a
few different alternative methods for calculating the ambiguity values.
If we can make the calculation more efficient within the Annif codebase,
that should improve the performance of MLLM for all users, especially
those who are working with large vocabularies such as GND or LCSH.

Thanks,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi