Hi Juho,
thank you for your quick response!
As I went through the tutorial, the computation times stated there were far higher than those I observed. So I gave it a shot with 600,000 documents right from the get-go, and it worked really fast for all backends in the current version of Annif. Interestingly, the better-performing backends (MLLM and Omikuji) also required less computational time.
I'm still amazed at how well Annif works. I love this project.
My current evaluation step is to figure out the best-performing backends/ensembles. So for now the resources are irrelevant, but they will be considered when choosing the right combination for production use. I'm not sure a 1% better metric justifies using backends with much higher hardware requirements/consumption.
The nn-ensemble with the Omikuji (.39) and X-Transformer (.61) backends was built after optimizing the basic ensemble with hyperopt and evaluating it. Since the current "champion" (an nn-ensemble of Omikuji (.86) and MLLM (.14)) outperforms the corresponding basic ensemble by around 3%-5% across the board, I ended up evaluating the nn-ensemble before adding MLLM as a third backend. Today the training of the nn-ensemble with the Omikuji and X-Transformer backends finished. As you expected, the overall metrics did not improve: some were better, some worse.
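For reference, my setup looks roughly like this (a sketch, not my exact configuration: project IDs, vocabulary, language and corpus paths are placeholders; only the 0.39/0.61 weights come from the hyperopt run):

```ini
# projects.cfg excerpt -- everything except the backends and the weights is a placeholder
[ensemble-omi-xtr]
name=Basic ensemble Omikuji + X-Transformer
language=en
backend=ensemble
vocab=my-vocab
sources=omikuji-project:0.39,xtransformer-project:0.61

[nn-ensemble-omi-xtr]
name=NN ensemble Omikuji + X-Transformer
language=en
backend=nn_ensemble
vocab=my-vocab
sources=omikuji-project,xtransformer-project
```

The weights on the basic ensemble are the ones suggested by `annif hyperopt ensemble-omi-xtr <validation-docs>`; the nn-ensemble learns its own combination during `annif train nn-ensemble-omi-xtr <train-docs>`, and I score everything with `annif eval <project> <test-docs>` (paths are placeholders).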
A short follow-up for clarification: can more documents hurt the results, or does training just take longer?
In our data there are some subjects with only very few occurrences. My thought process is: the more examples, the better.
Bear with me here, I'm not a scientist and not familiar with the algorithms: does balancing the subjects even matter? E.g., I have 1,000 documents, 950 of them have the subject "X" and 10 of them have the subject "Y". Is it best to train with all 1,000, or should I pick 10 examples of "X" and 10 examples of "Y"? I've read that Annif is used with 300,000 subjects, so how do you train with only ~10,000 documents?
For my data: among the 600,000 documents there are some subjects with fewer than
So an ensemble of Omikuji, MLLM and X-Transformer is next on the menu. Your spoiler sure sounds promising :)
Regards,
Sven
PS:
Please note: a "document" should better be called "article" since it is quite short. I'm not talking about 500 page-books here, but more 1-2 pages.