Comparing Maui with MLLM

Sandro Uhlmann

Apr 22, 2021, 7:26:57 AM
to Annif Users

Hello Osma, hello Annif team and users, 

the German National Library (DNB) has evaluated Annif over the last year. We are currently preparing to go live with a first Annif workflow for automatic subject indexing, using the Integrated Authority File (GND) as the vocabulary to enrich the metadata of German online publications. 

Our favourite candidate for a production backend is an ensemble of Omikuji Bonsai and Maui. We are therefore very interested and pleased that the new Maui-like lexical matching backend (MLLM) is now part of Annif 0.52: it gives us one more tool to improve quality, and a sustainable solution that reduces the footprint of the Annif installation. 

Here are some very "fresh" results of our first small evaluation: Comparing Maui with MLLM.  

Vocabulary: 

1.3 million GND descriptors modelled in SKOS (simple version with prefLabel and altLabel; no relations, etc.) 

Training data: 

8559 German-language tables of contents. 

Test data:

Test set A = 1261 German-language online publications

Test set B = 937 German-language tables of contents 


Results:

| Test set                         | F1@5 Maui | F1@5 MLLM |
| Test set A (online publications) | 0.174     | 0.196     |
| Test set B (tables of contents)  | 0.178     | 0.205     |

See the full eval metrics in the attachment _DNB_Sketch_Comparision_Maui_vs_MLLM_20210421.pdf_
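For context, F1@5 is the harmonic mean of precision and recall computed over the top five suggested subjects per document, averaged over the test set. A minimal sketch of the per-document score (the GND-style identifiers below are made up for illustration; Annif's eval command computes this for you):

```python
# Sketch of a per-document F1@k score -- illustrative only.

def f1_at_k(suggested, gold, k=5):
    """F1 between the top-k suggested subjects and the gold-standard set."""
    top_k = set(suggested[:k])
    gold = set(gold)
    tp = len(top_k & gold)          # true positives among the top-k
    if tp == 0:
        return 0.0
    precision = tp / len(top_k)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical identifiers, not real GND records:
suggested = ["gnd:4011882-4", "gnd:4033447-8", "gnd:4123456-7", "gnd:4059276-5"]
gold = ["gnd:4011882-4", "gnd:4059276-5", "gnd:4999999-9", "gnd:4888888-8"]
print(f1_at_k(suggested, gold))  # 2 of 4 correct on both sides: P = R = 0.5, F1 = 0.5
```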


Conclusion:

Under the same conditions, the MLLM backend produces better results out of the box than Maui!

This is very encouraging and motivates us to spend more time testing the properties of the new MLLM backend in detail. We'll optimize it for our purpose step by step, especially the backend-specific parameters. Furthermore, we are going to enrich our GND SKOS file (adding relations, hidden labels and collections).
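For illustration, an ensemble like the one described can be declared in Annif's projects.cfg roughly as follows. The project names and options here are made up for the sketch; the authoritative syntax is documented in the Annif wiki:

```ini
# Illustrative projects.cfg fragment -- project names are hypothetical.
[gnd-mllm-de]
name=GND MLLM German
language=de
backend=mllm
vocab=gnd
analyzer=snowball(german)

[gnd-ensemble-de]
name=GND MLLM + Omikuji ensemble
language=de
backend=ensemble
vocab=gnd
# source projects, optionally weighted as name:weight
sources=gnd-mllm-de,gnd-omikuji-de
```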

We are also going to evaluate the stwfsa backend soon, as one more promising backend based on a lexical method.

Thanks to Osma and his colleagues for MLLM, well done! 

Greetings,

Christoph, Sandro and the Team at DNB

DNB_Sketch_Comparision_Maui_vs_MLLM_20210421.pdf

Osma Suominen

Apr 22, 2021, 9:17:18 AM
to annif...@googlegroups.com
Hello Sandro,

Wow, that was quick! :)

Thank you very much for reporting your results. I'm very happy to hear
that MLLM works for you as well. We've had similar initial results with
our data sets - generally MLLM produces better results than Maui with
the same training data. Though when combined into an ensemble, the
difference diminishes somewhat.

Some things to test:
- try the built-in hyperparameter optimization (see MLLM wiki page)
- try the new token_min_length setting for the analyzer, e.g.
token_min_length=2 (though German probably doesn't have very many short
words!)
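Sketched concretely, the two suggestions might look roughly like this; the exact analyzer syntax and hyperopt options should be verified against the Annif wiki, so treat this as an assumption-laden sketch:

```ini
# Illustrative fragment -- verify the exact syntax on the Annif wiki.
[gnd-mllm-de]
language=de
backend=mllm
vocab=gnd
# drop very short tokens before matching
analyzer=snowball(german,token_min_length=2)
```

The built-in hyperparameter optimization is then run from the CLI, e.g. `annif hyperopt gnd-mllm-de <validation-documents>`, which searches for good values of the backend-specific parameters.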

I see that you only used TOCs for training, but evaluated both on TOCs
and online publications. It might make sense to include also some online
publications in the training set, because the training documents should
be as similar as possible to the documents you evaluate on.

MLLM generally takes longer to train than Maui - perhaps 5x as long but
it depends. I hope that's not a problem for you. It's still usually much
faster to train than e.g. Omikuji because it needs much less training data.


Cheers,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Sandro Uhlmann

Jul 2, 2021, 8:47:06 AM
to Annif Users

Dear Osma, Dear Annif Team, 

I would like to add something about the performance of MLLM. As you wrote in the post above, MLLM needs a relatively long time to train. From DNB's point of view this is not a big problem (faster would of course always be welcome ;-), because these processes run in separate, non-time-critical workflows. However - and this is why I am taking up the topic again - MLLM also needs a very, very long time to process documents. With a view to productive use of MLLM (standalone or combined in an ensemble), this is a considerable disadvantage.

For comparison, here is the processing time for an electronic document (text length 30000 characters) processed in a Docker container with Annif 0.52:

[INFO ] 2021/05/19 13:32:18: 1162771593

[INFO ] 2021/05/19 13:32:18: Use: gnd-maui-en-0.52-1

[INFO ] 2021/05/19 13:32:18: Use: gnd-mllm-en-0.52-1

[INFO ] 2021/05/19 13:32:24: Use: gnd-omikuji-bonsai-en-0.52-1

[INFO ] 2021/05/19 13:32:25: Use: gnd-ensemble-en-0.52-1 (Maui + omikuji-bonsai)

[INFO ] 2021/05/19 13:32:28: Use: gnd-ensemble-en-0.52-3 (MLLM + omikuji-bonsai)

[INFO ] 2021/05/19 13:32:38: 1167595939

...

Maui takes less than a second per document, omikuji-bonsai takes about one second, and MLLM takes six seconds. An ensemble of Maui + omikuji-bonsai takes three seconds; an ensemble of MLLM + omikuji-bonsai takes ten seconds.

The ensemble of MLLM + omikuji-bonsai produces very good quality results and is therefore a hot candidate for productive use, but it is also a bottleneck. On average, 3000 online publications (monographs or articles) arrive at DNB every day, so for our use case it makes a difference whether processing takes 9000 seconds (Maui + omikuji-bonsai) or 30000 seconds (MLLM + omikuji-bonsai): 30000 seconds is a little more than 8 hours each day. Given that some days bring more than 10000 fresh online publications, this becomes critical, so the problem should be tackled at the source. Reducing the processing time of MLLM, or allowing more parallel workflows inside Annif, would help us process more online publications each day and avoid complex workarounds in automatic indexing systems that use Annif with MLLM.
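The arithmetic behind these numbers, as a quick sanity check (per-document timings taken from the log excerpt above):

```python
# Back-of-the-envelope daily processing load for the two ensembles.
SECONDS_PER_DOC = {"Maui + omikuji-bonsai": 3, "MLLM + omikuji-bonsai": 10}
AVG_DOCS, PEAK_DOCS = 3000, 10000   # average and peak daily accessions at DNB

for ensemble, s in SECONDS_PER_DOC.items():
    print(f"{ensemble}: {AVG_DOCS * s / 3600:.1f} h on an average day, "
          f"{PEAK_DOCS * s / 3600:.1f} h on a peak day")
```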

It would be very helpful if a reduction of the processing time of MLLM could be technically realized. 

Thanks and greetings,

Christoph & Sandro (Team DNB)

Osma Suominen

Aug 2, 2021, 5:11:55 AM
to annif...@googlegroups.com
Dear Sandro and Christoph,

Thank you very much for your observations! As MLLM is a new algorithm, these kinds of issues are not unexpected, and I agree that runtime performance is very important for production workflows, while training time is less critical (though still interesting).

One possible explanation why MLLM is slower than Maui (which is
obviously a very similar algorithm) is that MLLM is implemented in pure
Python (though with bits from scikit-learn, SciPy and NumPy) while Maui
is in Java. But there are also other differences that could be relevant.
I'm a bit surprised that a relatively short document (30000 characters)
takes six seconds to process, and I think this should be investigated
further. I suspect it could be related to your vocabulary - GND with
more than 1 million potential subjects (IIRC) is very big compared to
for example YSO that we mainly use - which isn't small either.

Would it be possible for you to provide the data files for an Annif MLLM
project trained with your vocabulary (i.e. the contents of the data
directory, including both the project and the vocab), as well as one or
a few example documents (as .txt files) that take a long time to
process? This would make it easier to investigate where the time is
being spent.

Cheers,
Osma

Sandro Uhlmann

Aug 18, 2021, 3:15:38 AM
to Annif Users

Dear Osma, Dear Annif Team, 

thanks for the feedback and willingness to look for a solution!

We will pack up the required data and send it to you by email. 

Many greetings, Christoph and Sandro

Osma Suominen

Aug 18, 2021, 9:16:00 AM
to annif...@googlegroups.com
Dear Sandro,

Thank you, I have received the data and performed some testing. I already found a bug in the operation that converts between vector and list representations of subjects. Its effect was not noticeable with a small vocabulary, but with GND this bug caused an extra slowdown of around 4 seconds per document!

I have just merged the fix (which was really trivial once the issue was
identified) into the master branch:
https://github.com/NatLibFi/Annif/pull/517
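To illustrate why that conversion matters at GND scale, here is a generic sketch (not Annif's actual code; the real fix is in the PR above): a dense score vector over 1.3 million subjects should be turned into a hit list by touching only the nonzero entries, not by scanning every entry in Python.

```python
# Generic sketch of the vector -> list conversion cost, not Annif's code.
import numpy as np

n_subjects = 1_300_000          # roughly the size of GND
scores = np.zeros(n_subjects)
scores[[7, 42, 99_999]] = [0.9, 0.5, 0.3]

# Naive: a Python loop that inspects every one of the 1.3M entries.
naive = [i for i, s in enumerate(scores) if s > 0]

# Vectorized: let NumPy find the nonzero entries directly.
fast = np.flatnonzero(scores).tolist()

assert naive == fast == [7, 42, 99999]
```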

It will be included in the Annif 0.54 release. The next release will
most likely also contain optimizations to MLLM training, making use of
parallel processing on multiple CPUs to save time:
https://github.com/NatLibFi/Annif/pull/511
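The general pattern behind such parallel document processing can be sketched as follows (illustrative only, not Annif's actual implementation; the function and documents are made up):

```python
# Illustration of distributing per-document work across worker processes.
from multiprocessing import Pool

def extract_features(doc):
    # Stand-in for per-document work such as candidate extraction;
    # here it just counts tokens.
    return len(doc.split())

docs = ["erstes Dokument mit Text", "zweites Dokument", "drittes"] * 200

if __name__ == "__main__":
    # The number of workers would correspond to Annif's --jobs parameter.
    with Pool(processes=4) as pool:
        results = pool.map(extract_features, docs, chunksize=50)
    # Same output as the serial version, computed in parallel.
    assert results == [extract_features(d) for d in docs]
    print(sum(results))
```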

There may be other performance issues as well in MLLM and I will
continue my investigation, but this simple fix should already help a lot
with the processing time.

-Osma

Sandro Uhlmann

Aug 18, 2021, 9:49:54 AM
to Annif Users
Hi Osma,

that sounds great, thank you for the investigation and bug fixing. Four seconds less processing time per document is a big win. Further performance improvements are of course welcome, but this is already a really big step! 

Parallel processing of the training on multiple CPUs is also very welcome. 

We are looking forward to the Annif 0.54 release (we were anyway, but now even more so ;-) 

Thank you and Best regards, Sandro

Osma Suominen

Aug 24, 2021, 3:24:53 AM
to annif...@googlegroups.com
Hi Sandro,

Annif 0.54 was just released (see previous post). It contains five(!)
different improvements that should speed up MLLM:

1. The SKOS vocabulary is now stored as a dump, which is faster to load
than parsing it again from a Turtle file.

2. Processing the training documents is done in parallel on multiple CPU
cores (affected by the new --jobs parameter).

3. Useless work is avoided during vector to list conversions.

4. The token index used by MLLM is now constructed in a way that makes
token matching faster.

5. A small optimization to limit_mask creation when filtering results.
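The idea behind a token index like the one in improvement 4 can be sketched generically (this is an illustrative inverted index with made-up labels, not Annif's actual data structure):

```python
# Generic inverted-index sketch: token -> subjects whose label contains it.
from collections import defaultdict

labels = {                       # hypothetical subject labels
    "gnd:1": "digital library",
    "gnd:2": "national library",
    "gnd:3": "machine learning",
}

# Build once at load time.
index = defaultdict(set)
for subject_id, label in labels.items():
    for token in label.split():
        index[token].add(subject_id)

# At match time, only subjects sharing a token with the document are
# considered, instead of comparing every document token against all labels.
doc_tokens = ["library", "learning", "annif"]
candidates = set().union(*(index.get(t, set()) for t in doc_tokens))
print(sorted(candidates))  # ['gnd:1', 'gnd:2', 'gnd:3']
```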


The improvements 1-2 reduce training time, while 3-5 mainly speed up use
of the model (e.g. the suggest method and eval command). Old MLLM models
should still work, but to benefit from the token index improvements, you
have to retrain the model anyway.

I'm looking forward to hearing back from you once you've tested the new
release :)

-Osma

Sandro Uhlmann

Sep 1, 2021, 5:40:30 AM
to Annif Users

Hi Mona, Hi Juho, Hi Osma,

thank you for the research and improvements on the speed of MLLM! We have set up a small test series, here are the results:


====================*Base*===========================================

 ----------------------------------------------------------------------------------------------

#Hardware

| CLI tests were run on a VM with 16 CPUs, Ubuntu 18.04.4 LTS, 32 GB RAM and 640 GB disk

| Docker tests were run on a VM with 8 CPUs, Ubuntu 20.04.2, 64 GB RAM and 126 GB disk

  (each test used all available cores; the --jobs option was not set)

 

#Models

| gnd-mllm-de-0.53: MLLM Model based on Annif 0.53_1

| gnd-mllm-de-0.54: MLLM Model based on Annif 0.54

| gnd-ensemble-de-0.54-0 (MLLM Model based on Annif 0.54 + omikuji-bonsai Model based on Annif 0.52)

 

#Vocabulary

| Integrated Authority File (GND): 1.3 million GND descriptors modelled in SKOS (simple version with prefLabels and altLabels, no relations) 

 

#Trainset

| 20,998 German-language full texts, 30000 characters each

 

#Testsets

| Small Testset: 47 German-language full texts, 30000 characters each

| Big Testset: 928 German-language full texts, 30000 characters each 

 

#Single Test Documents

| Single doc 1/2/3: 3 single German-language full texts, 30000 characters each


  

==============*Test series (results per command)*===============================

 ----------------------------------------------------------------------------------------------

#train

| gnd-mllm-de-0.53 | Time total: 1478m7.066s (24 hours 38 minutes)
| gnd-mllm-de-0.54 | Time total: 108m2.953s (1 hour 48 minutes)

 

 ----------------------------------------------------------------------------------------------

#eval

| gnd-mllm-de-0.53 | Small Testset | Time total: 10m53.712s | Average per document: 13.91s
| gnd-mllm-de-0.54 | Small Testset | Time total: 7m36.942s | Average per document: 9.72s

| gnd-mllm-de-0.53 | Big Testset | Time total: 111m39.175s | Average per document: 7.22s
| gnd-mllm-de-0.54 | Big Testset | Time total: 42m21.261s | Average per document: 2.74s

  

----------------------------------------------------------------------------------------------

#index

| gnd-mllm-de-0.53 | Small Testset | Time total: 9m44.783s | Average per document: 12.44s
| gnd-mllm-de-0.54 | Small Testset | Time total: 6m26.651s | Average per document: 8.23s

| gnd-mllm-de-0.53 | Big Testset | Time total: 91m34.165s | Average per document: 5.92s
| gnd-mllm-de-0.54 | Big Testset | Time total: 21m55.500s | Average per document: 1.42s

  

----------------------------------------------------------------------------------------------

#suggest

| gnd-mllm-de-0.53 | Single doc 1 | Time total: 5m24.054s
| gnd-mllm-de-0.54 | Single doc 1 | Time total: 5m31.013s

| gnd-mllm-de-0.53 | Single doc 2 | Time total: 5m18.673s
| gnd-mllm-de-0.54 | Single doc 2 | Time total: 5m25.215s

| gnd-mllm-de-0.53 | Single doc 3 | Time total: 5m15.833s
| gnd-mllm-de-0.54 | Single doc 3 | Time total: 5m27.852s

  

---------------------------------------------------------------------------------------------

# Single doc processed in a Docker container version with Annif 0.54

(For comparison, see also my post of Jul 2, 2021 above.)

 

[INFO ] 2021/08/30 16:52:53: 1162771593

[INFO ] 2021/08/30 16:52:53: Use: gnd-maui-de-0.52-1

[INFO ] 2021/08/30 16:52:53: Use: gnd-mllm-de-0.54-0

[INFO ] 2021/08/30 16:52:54: Use: gnd-omikuji-bonsai-de-0.52-1

[INFO ] 2021/08/30 16:52:55: Use: gnd-ensemble-de-0.52-1 (Maui + omikuji-bonsai)

[INFO ] 2021/08/30 16:52:56: Use: gnd-ensemble-de-0.54-0 (MLLM 0.54 + omikuji-bonsai)

[INFO ] 2021/08/30 16:52:59: 1167595939


  

============*Conclusion performance MLLM Annif 0.53_1 vs. MLLM Annif 0.54*========================

The training time for MLLM has decreased in our case by a factor of 13.7 (using all 16 CPUs). Wow, that's much faster. Great!

The eval command with MLLM 0.54 processes a document of the Big Testset (928 docs) in an average time of 2.74s, which is 4.49 seconds faster than MLLM 0.53_1. The same applies to the Small Testset (47 docs), where the new release is 4.19 seconds faster per document.

The same goes for the index command: MLLM 0.54 indexes a document of the Big Testset in an average time of 1.42s, which is 4.5 seconds faster than MLLM 0.53_1. On the Small Testset, the new release is 4.2 seconds faster per document.

When processing a single document on the CLI with cat plus the suggest command, MLLM 0.53_1 is surprisingly around 7 to 12 seconds faster than MLLM 0.54 (?). We were therefore all the more interested to see how MLLM 0.54 behaves with suggest under Docker via the REST API.

When using suggest under Docker via the REST API, the tests happily show a performance improvement again: MLLM 0.54 processes a document in an average time of 1s, where MLLM under 0.52 took six seconds, so MLLM 0.54 is 5 seconds faster. An ensemble of MLLM 0.52 + omikuji-bonsai took 10 seconds; an ensemble of MLLM 0.54 + omikuji-bonsai needs 3 seconds. That's 7 seconds faster!

In summary: MLLM is much faster and thus better suited for productive use, even with a very large vocabulary. Thanks for the willingness to invest here, and for pulling it off!

Monet terveiset (many greetings),

 

Christoph & Sandro

Osma Suominen

Sep 2, 2021, 5:52:22 AM
to annif...@googlegroups.com
Hi Sandro,

Thank you very much for reporting such detailed test results! It's very
nice to hear that the recent MLLM fixes are having such a dramatic
effect on performance. This will benefit other users of MLLM too, even
though the performance issues are much more acute in the case of
extremely large vocabularies such as GND. Having access to your trained
model that you sent was crucial here, as it was very easy to debug the
performance issues with a good test setup.

Regarding the CLI suggest command for single documents, which you found
to be a few seconds slower in 0.54: I wouldn't worry too much about
this, as the difference is most likely due to loading the model, which
may be a bit larger after the token index changes. In any case this way
of using Annif on such a large model is extremely inefficient, as the
overwhelming majority of time is spent initializing the model and only
at the very end, during the last few seconds, is the actual document
processed - and then the model is thrown away as the process shuts down.

That said, I think there is a bit of extra data (key tokens) that gets
stored into the token index which wouldn't actually be necessary as it's
only useful during training. So maybe it could be removed from the index
that gets stored to disk to make the model a little bit smaller again,
and thus perhaps restore the performance of the suggest CLI command.

-Osma