Hello Clemens,
thank you very much for the detailed report and analysis!
I think you've identified the main factors behind the slow performance.
The current algorithm for calculating the ambiguity feature is very
simplistic and, as you pointed out, it doesn't scale well when there are
many matches. It should be relatively easy to implement it more
efficiently and thus mitigate the problem to some degree. I can already
think of a few different approaches. However, in order to choose a good
solution, we would need a bit more empirical information.
Could you please do the following:
1. Make sure you have an MLLM model that has been trained using the
full GND vocabulary set you use (1.4M subjects) and a document (as a
text file) that takes a long time to process.
2. Add these lines to the _find_subj_ambiguity method:
    import json
    with open('tsets.jsonl', 'a') as f:
        json.dump([[int(tid) for tid in tset] for tset in tsets], f)
        f.write('\n')
3. Perform the "annif suggest" operation on the slow document. This
should write some debugging information about the token sets into the
tsets.jsonl file.
4. Open an issue on GitHub referencing this discussion, with a title
like "Slow calculation of ambiguity feature in MLLM", and attach the
tsets.jsonl file to the issue.
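
Before attaching the file, it may be worth checking that it looks sane. The following is only a rough sketch: the function name summarize_tsets is made up for illustration, and it assumes the file has exactly the shape produced by the lines in step 2 (one JSON array per line, each containing lists of integer token IDs).

```python
import json


def summarize_tsets(path):
    """Return (number of records, size of the largest record) for a JSONL dump.

    Each non-empty line is expected to be one JSON array of token sets,
    where every token set is a list of integer token IDs.
    """
    n_records = 0
    max_size = 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            for tset in record:
                if not all(isinstance(tid, int) for tid in tset):
                    raise ValueError("token set contains a non-integer ID")
            n_records += 1
            max_size = max(max_size, len(record))
    return n_records, max_size
```

Calling summarize_tsets('tsets.jsonl') should then give the number of dumped records and the number of token sets in the largest one, which is probably the record that dominates the processing time.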
Having the real-world token sets would allow testing and benchmarking a
few alternative methods for calculating the ambiguity values.
If we can make the calculation more efficient within the Annif codebase,
that should improve the performance of MLLM for all users, especially
those who are working with large vocabularies such as GND or LCSH.
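
To give an idea of what such a benchmark could look like, here is a minimal sketch. Please treat it as an illustration only: all function names are hypothetical, and it assumes that the ambiguity of a token set is the number of other token sets in the same record that contain all of its tokens. That definition is a stand-in for discussion, not a copy of the actual Annif code. The first function compares every pair of sets; the second uses an inverted index from token ID to set positions so that only sets sharing tokens are considered.

```python
import time


def count_supersets_naive(tsets):
    """Hypothetical baseline: pairwise comparison of all token sets."""
    sets = [frozenset(ts) for ts in tsets]
    return [
        sum(1 for j, other in enumerate(sets) if i != j and ts <= other)
        for i, ts in enumerate(sets)
    ]


def count_supersets_indexed(tsets):
    """Hypothetical alternative: intersect posting lists of an inverted index."""
    sets = [frozenset(ts) for ts in tsets]
    index = {}  # token ID -> positions of the sets containing it
    for pos, ts in enumerate(sets):
        for tid in ts:
            index.setdefault(tid, set()).add(pos)
    counts = []
    for ts in sets:
        if not ts:
            # An empty set is contained in every other set.
            counts.append(len(sets) - 1)
            continue
        # Start from the shortest posting list to keep intersections small.
        postings = sorted((index[tid] for tid in ts), key=len)
        candidates = set(postings[0])
        for posting in postings[1:]:
            candidates &= posting
        counts.append(len(candidates) - 1)  # exclude the set itself
    return counts


def time_method(func, records):
    """Run func on every record; return (elapsed seconds, list of results)."""
    start = time.perf_counter()
    results = [func(record) for record in records]
    return time.perf_counter() - start, results
```

With the records loaded from tsets.jsonl (one json.loads call per line), time_method can be run once per candidate function, and the results compared for equality before looking at the timings.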
Thanks,
Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi