Hi Osma, hi all,
in January we provided a brief report on the processing time of MLLM. We found that in some cases the MLLM backend of Annif needs a very long time to suggest GND terms. In your answer you gave us some homework, and we did some investigations to find ideas for how to get better performance.
1. Analysis of MLLM suggestions
We tried to find out what happens when MLLM produces as many GND descriptors as possible for the document with the very long processing time (limit=100000): MLLM suggested 5,029 GND descriptors (see file res_1155862600.txt).
_Are there repeating patterns?_
Yes, there are very many.
Examples
MLLM suggested GND descriptors:
Bernhard (2. H. 15. Jh.)
Bernhard, B. (20./21. Jh.)
Bernhard, Ferdinand (1873-1968)
Bernhard, Johann Adam (1688-1771)
Bernhard, Lucian (1883-1972)
Bernhard, Maria Ludwika (1908-1998)
Bernhard, Oskar (1861-1939)
- The text contains "Bernhard Kutzler" three times, but no other Bernhard appears. It is therefore obvious that all of these Bernhards were found on the basis of these three occurrences of "Bernhard Kutzler".
Blum, Carl (1786-1844)
Blum, Erhard
Blum, Harry (1944-2000)
Blum, Joachim Christian (1739-1790)
Blum, Thierry (20./21. Jh.)
Blum, Ulrich
Blum, Walter (1937-2013)
Blume, Bernhard (1937-2011)
Blume, Gernot
Blume, Heinrich (1788-1856)
Blume, Helmut (1920-2008)
Blumer, Johann Jakob (1819-1875)
Blumer-Ris, Hans (1855-1916)
Blūms, Elmārs
- "Blum" appears five times in the text, each time as a citation "Leiß & Blum 2008". A GND descriptor (Blum, Erhard) has the alternative label Blum, E. and could possibly also be found due to this synonym.
Braun
Braun Aktiengesellschaft
Braun von Fernwald, Carl (1822-1891)
Braun, Adam (1748-1827)
Braun, Adolf (1862-1929)
Braun, Alexander (1805-1877)
Braun, Alexander Karl Hermann (1807-1868)
Braun, Andrzej (1923-2008)
Braun, Christina von
Braun, Edmund Wilhelm (1870-1957)
Braun, Ferdinand (1850-1918)
Braun, Friedrich Eberhard (1774-1848)
Braun, Georg Christian (1785-1834)
Braun, Günter (1928-2008)
Braun, Heinrich (1862-1934)
Braun, Herbert (1903-1991)
Braun, Heribert (1921-2005)
Braun, Hermann (1862-1908)
Braun, Jean Daniel (1728-1740)
Braun, Johann W. J. (1801-1863)
Braun, Johanna (1929-2008)
Braun, Karl Friedrich Wilhelm (1800-1864)
Braun, Karl Guido (1841-1909)
Braun, Karlheinz
Braun, M. (1772-1844)
Braun, Otto (1900-1974)
Braun, P. (19. Jh./20. Jh.)
Braun, Placidus (1756-1829)
Braun, Theodor (1833-1911)
Braun, Wilhelm von (1813-1860)
Braunfels, Walter (1882-1954)
Brauns, Josef
- The text contains "Braun" exactly once (!); no further candidates with this word stem occur in the text.
Many more examples can be seen in the file res_1155862600.txt (and the comparison with the text in document 1155862600.txt).
However, it can also be noted that other result files whose processing time is in the "normal" range of 1 to 3 seconds contain a similarly high number of GND descriptors sharing a word stem; see for example the files res_1190087464.txt or res_1173699678.txt. A large number of matching candidates _on the terminology side_ therefore does not necessarily slow down processing, since none of these documents takes anywhere near as long as 1155862600.txt.
The many short words (formula fragments or abbreviations like "dx", or acronyms like "CAS") are obviously another cause of the long processing time, as you already suspected: they lead to a much higher number of candidates for matching _on the text side_. In an experiment we left out all words of up to 3 characters, and the processing time dropped to 9 seconds.
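For reference, the short-word filter used in this experiment can be sketched roughly like this (a simplified stand-in for our actual pre-processing script; splitting on whitespace is an assumption):

```python
def drop_short_tokens(text: str, min_len: int = 4) -> str:
    """Remove all whitespace-separated tokens shorter than min_len characters."""
    return " ".join(tok for tok in text.split() if len(tok) >= min_len)

# Short fragments like "dx" or "CAS" disappear before MLLM sees the text:
print(drop_short_tokens("the CAS dx integral converges"))  # → integral converges
```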
Both together (many candidates on the text side plus many candidates on the terminology side) possibly lead to the large number of GND candidates entering the final relevance ranking.
2. Identification of the piece of the document that is slow to process
Concerning your second proposed test: we did as you suggested and repeatedly split the text in half, compared the processing speed of the two halves, and continued splitting the half with the longer processing time. Indeed, the part with the math expressions was the most problematic one with respect to processing time.
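The bisection procedure can be sketched as follows (`suggest` stands for whatever call runs the MLLM backend on a snippet, e.g. a wrapper around the Annif REST API; it is a placeholder here, not Annif's actual API):

```python
import time

def slowest_part(text, suggest, min_chars=2000):
    """Repeatedly halve the text and keep the half that takes longer to process."""
    while len(text) > min_chars:
        mid = len(text) // 2
        halves = (text[:mid], text[mid:])
        timings = []
        for half in halves:
            start = time.perf_counter()
            suggest(half)
            timings.append(time.perf_counter() - start)
        # Continue bisecting whichever half was slower.
        text = halves[0] if timings[0] >= timings[1] else halves[1]
    return text
```

Applied to our document, this procedure homed in on the part containing the math expressions.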
We constructed a text with the same number of characters, i.e. 30,000, by doubling the first half of the original text; MLLM processing took 50% longer than for the original text. The doubled second half of the original text took only 45% of the original's time, but nonetheless longer than expected.
We also replicated the most problematic 1,875 characters up to 30,000 characters, and this text took MLLM more than twice as long as the original text.
The problematic words do indeed include the variable names of the math expressions (cf. 1155862600_AABA.txt). The subjects found partly denote wrong chemical substances, alongside correct mathematical expressions.
3. Further investigations
In another experiment, we completely removed all short terms of up to 3 characters from the text and saw remarkably better performance.
We also found that we can speed up MLLM by deleting the dots in the underlying text. In our example this manipulation reduced the processing time from 100 seconds to 14 seconds.
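For reproducibility: the manipulation is literally nothing more than stripping the period characters before passing the text to MLLM:

```python
def remove_dots(text: str) -> str:
    """Delete all '.' characters from the text before it is handed to MLLM."""
    return text.replace(".", "")

print(remove_dots("f(x) = 0.5 * x. See Eq. 3."))  # → f(x) = 05 * x See Eq 3
```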
Last but not least, we ran some mass experiments with another text corpus to get more information about the expected processing time. Each of the texts has a length of approx. 80,000 characters. For 80 to 90 percent of the suggestions MLLM needs no more than 7 seconds (fig. 1).
We assume that the required processing time depends on the length of the given text string, with a couple of exceptions (fig. 2).
4. Summary of observations and corresponding processing times
1. Text with the very long processing time:
- original text: 110 s
- text without dots: 14 s
- text without formulas and dots: 14 s
- text without words of up to 3 characters: 9 s
2. Another text with a long processing time:
- original text: 126 s
- text without dots: 12 s
So that you can retrace the data and analysis, we will pack up the data of our test cases and send it to you via e-mail.
So long and best regards,
Claudia Grote
Jan-Helge Jacobs
Christoph Poley
Sandro Uhlmann
Attachments:
- fig 1 (mllm_fig1.png): processing time of texts with approx. 80,000 characters (x-axis: text length in characters, y-axis: processing time in sec.)
- fig 2 (mllm_fig2.png): processing time of uncut full-text strings (x-axis: text length in characters, y-axis: processing time in sec.)
Hi Osma,
we finally want to give you and the Annif users some final results from our MLLM test suite. An intern (thanks to Katja Konermann) had a closer look at our MLLM problem and the underlying Annif code.
At first we made a few assumptions:
( 1 ) The processing time depends on the number of candidates that MLLM finds in a text – correlation: 0.81
Have a look at assumption_1.png
( 2 ) When the processing time of a document changes with different minimum token lengths, the number of matches per candidate changes as well – correlation: 0.6
Have a look at assumption_2.png
( 3 ) The candidates MLLM extracts correspond to sentences; documents containing a very long sentence have a longer processing time – correlation: 0.79
Have a look at assumption_3.png
Last but not least we have two more remarks:
( 1 ) For our purposes we only want to process German documents, but a lot of the suspicious documents have English parts inside. This might have an effect on the processing time.
( 2 ) The share of invalid tokens (numbers, special characters) has no significant effect on the processing time.
Conclusions:
In our opinion it is not easy to find a single reason for the very long MLLM processing times on some of our full texts. Rather, a bunch of factors are at play, e.g. the terminology, the minimum token size, and the type of labels, instead of one single dependency. We think it is important to know that in some cases the processing time is unpredictable. We have to monitor the effects in our production Erschließungsmaschine (subject indexing engine) to make further decisions. And we are working on ways to get rid of problematic labels in the terminology that we use with MLLM.
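One idea for cleaning the terminology is to filter out labels that are prone to spurious matches, e.g. very short labels or bare "surname plus initial" person labels like "Blum, E.". A rough sketch (the length threshold, the regex, and the `URI<TAB>label` vocabulary layout are assumptions for illustration, not settings we have validated):

```python
import re

INITIAL_LABEL = re.compile(r",\s*[A-Z]\.$")  # matches e.g. "Blum, E."

def keep_label(label: str, min_len: int = 4) -> bool:
    """Heuristic: drop very short labels and bare surname-plus-initial labels."""
    if len(label) < min_len:
        return False
    if INITIAL_LABEL.search(label):
        return False
    return True

def filter_vocab(lines):
    """Yield only the vocabulary lines (URI<TAB>label) whose label we keep."""
    for line in lines:
        _uri, _tab, label = line.rstrip("\n").partition("\t")
        if keep_label(label):
            yield line
```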
Regards
Christoph
Hi Osma,
we did the same investigations as you did. And surprise, surprise: we also got a processing time of about 14 seconds when we left out the dots. However, can we hope for a new release of MLLM with a sentence tokenizer fix included?
Regards,
Christoph
Hi Juho,
at the moment we include the MLLM backend as part of our production EMa service. For this we use the Docker images. So I think it makes sense to wait a little for the nltk fix; it does not impair our operations.
And thank you for the detailed answers.
Christoph