Hi Lucas!
Thanks a lot for the details.
Do you have separate validation and/or test sets for evaluating the
quality? I'm asking because you seem to have already trained several
base projects (tfidf, fasttext, mllm, omikuji-parabel) and then used
these as sources for the NN ensemble, with somewhat different weights.
Did you evaluate the quality of each of these separately against your
test set?
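(If you do have such a test set, evaluating each source is just a
matter of running "annif eval" on each project against it, e.g.

  annif eval tfidf /path/to/test-documents/
  annif eval omikuji-parabel /path/to/test-documents/

and comparing the F1 and nDCG figures it reports. The corpus path is
just a placeholder here; adjust it and the project IDs to your own
setup.)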
In my experience, it's best to spend some time first trying to squeeze
the best possible results (in terms of F1 score and/or nDCG) from each
individual project before combining them into ensembles. This can
involve adjusting analyzers or trying different hyperparameters (e.g.
Omikuji Parabel vs. Bonsai, or enabling 2-grams).
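For example, a Bonsai-style Omikuji project with word 2-grams could
look roughly like this in projects.cfg (I'm writing this from memory,
so please double-check the parameter names against the Annif wiki;
the project ID, analyzer and vocab are placeholders for your own):

  [omikuji-bonsai]
  name=Omikuji Bonsai
  language=en
  backend=omikuji
  analyzer=snowball(english)
  vocab=your-vocab
  ngram=2
  cluster_balanced=False
  cluster_k=100
  max_depth=3

You can then train and evaluate it side by side with your existing
omikuji-parabel project and keep whichever scores better.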
Once you have the base projects set up and working well, the next
step is to make a basic ensemble and see whether it improves results
over the individual base projects (usually it does, but not always).
One useful step here is to use "annif hyperopt" to optimize the
weights of the ensemble;
you have clearly not done this yet since your source weights are just 1
or 2. If the hyperopt result gives some source project a very low
weight, you can consider dropping it altogether (often TF-IDF isn't very
useful in practice as it's more of a toy model). Once you have
configured the basic ensemble with the optimized weights, evaluate once
more. That should be your baseline score before moving into advanced
ensembles.
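Concretely, that could look something like this (the ensemble project
ID and the paths are just placeholders):

  annif hyperopt your-ensemble /path/to/validation-documents/ --trials 100

It reports the best source weights it found; copy those into the
sources= line of the ensemble in projects.cfg and then run

  annif eval your-ensemble /path/to/test-documents/

to get the baseline figure.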
Now that you know the baseline, you can check whether an NN ensemble
(with the same source weights) improves the scores, and by how much.
If you have set aside a separate validation set, I would recommend
using that to train the NN ensemble, so it's genuinely "fresh" data
instead of something that the source backends have already been
exposed to. But if you don't have that, training on a sample of the
records that were already used for training could still be better
than nothing.
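A rough sketch of what that could look like, reusing the same sources
and the weights you got from hyperopt (the project IDs, vocab and
parameter values below are just illustrative, not recommendations):

  [nn-ensemble]
  name=NN ensemble
  language=en
  backend=nn_ensemble
  vocab=your-vocab
  sources=tfidf:1,fasttext:1,mllm:1,omikuji-parabel:1
  nodes=100
  dropout_rate=0.2
  epochs=10

  annif train nn-ensemble /path/to/validation-documents/

followed by "annif eval nn-ensemble ..." against the test set as
before.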
In my experience, the NN ensemble is most useful for correcting bias
caused by using different kinds of training data for different backends.
For example, in our YSO models we've trained the Omikuji and fastText
projects using a large number of short text metadata records from
Finna, while MLLM has been trained on a much smaller number of longer
fulltext documents. We mostly want to apply Annif to fulltext
documents. So
we've trained the NN ensemble with fulltext documents as well, and that
helps to "adapt" the ensemble for fulltext even though the source
projects were trained mostly on metadata records. In your case, I
understood that you only had one type of data (records from your
library catalog), so I'm not sure whether the NN ensemble will
provide much improvement over a basic averaging ensemble. But you
should try it and find out! You can even train it first with a small
number of records (maybe a thousand or two) and then evaluate. If
that helps, try training with more records. An incremental approach
is usually much better than charging head first in an unknown
direction!
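If your training data is in a single TSV file, the small-scale
experiment can be as simple as taking a random couple of thousand
lines first (or using the --docs-limit option of "annif train", if
your Annif version has it); the file names below are just
placeholders:

  shuf -n 2000 training-data.tsv > training-sample.tsv
  annif train nn-ensemble training-sample.tsv
  annif eval nn-ensemble /path/to/test-documents/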
Hope this helps,
Osma