Variable training data for backends


Parthasarathi Mukhopadhyay

Jan 3, 2025, 5:36:08 AM
to Annif Users
Dear all

Happy New Year 2025 !

We are exploring the use of Annif for organizing documentary resources in the subject domain of education, using the ERIC Thesaurus (https://eric.ed.gov/eric_thesaurus2023.zip) as the vocabulary.

The training data was collected from the ERIC database, where subject descriptors from the ERIC Thesaurus are assigned by trained information professionals.

We have gathered almost 1 million indexed records and have so far curated around 0.87 million of them.

During training we found that the two best-performing backends, as measured by F1@5 and NDCG, are fastText and Omikuji (Bonsai). The results of our learning-curve experiments for these two backends are attached (01_LearningCurve).
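For reference, a learning curve of this kind can be scripted as a loop over growing subsets of the training corpus, evaluated against one fixed held-out test corpus. Below is a minimal sketch driving the Annif CLI from Python; it assumes a short-text TSV corpus with one record per line, and the project IDs, file names and subset sizes are illustrative rather than our exact configuration:

# Sketch of a learning-curve loop: train each backend on growing
# prefixes of the training corpus, then evaluate on one fixed,
# held-out test corpus.
import subprocess

CORPUS = "train-full.tsv"  # full curated training corpus (TSV)
TEST = "test.tsv"          # fixed held-out test corpus
PROJECTS = ["fasttext-eric", "omikuji-bonsai-eric"]  # illustrative IDs
SIZES = [50_000, 100_000, 300_000, 600_000, 870_000]

with open(CORPUS, encoding="utf-8") as f:
    lines = f.readlines()

for size in SIZES:
    subset = f"train-{size}.tsv"
    with open(subset, "w", encoding="utf-8") as out:
        out.writelines(lines[:size])
    for project in PROJECTS:
        # "annif train" rebuilds the project's model from the corpus;
        # "annif eval" prints F1@5, NDCG and other metrics to stdout.
        subprocess.run(["annif", "train", project, subset], check=True)
        subprocess.run(["annif", "eval", project, TEST], check=True)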

The question arises whether a different test dataset may legitimately be used for each backend when measuring its performance. For instance, the attached file (02_Evaluation) shows that with distinct, carefully selected test datasets, each backend reaches its optimum at a different data point: fastText achieves peak performance (both F1@5 and NDCG) with a test dataset of 10,000 records, whereas Omikuji (Bonsai) reaches its best scores at 20,000 records.

In most Annif-related presentations and journal articles, researchers employ a single, consistent dataset for evaluating different backends. This leaves some ambiguity about the proper approach: (1) using a uniform test dataset across all backends for comparative evaluation, or (2) employing distinct test datasets tailored to achieve the best possible score for each backend.


Best regards


Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

https://orcid.org/0000-0003-0717-9413

01_LearningCurve.pdf
02-Evaluation.pdf

Osma Suominen

Jan 7, 2025, 6:32:47 AM
to annif...@googlegroups.com
Hi Parthasarathi,

Happy New Year to you as well!

Thank you for the interesting results. Especially the learning curves
were very informative! I noticed that for Omikuji, the F1@5 score
reached a near-plateau (~0.37) at around 300k training records, but the
NDCG score kept increasing all the way to the end. I think this is a
good reminder that not all metrics are equal: sometimes NDCG can show
subtle differences in quality that will be hidden by the F1 metric.
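To make that concrete, here is a toy calculation (plain Python, not
Annif's actual evaluation code): two suggestion lists containing the
same top-five concepts get an identical F1@5, but the list that ranks
the correct concepts higher gets a clearly better NDCG.

# Toy example: F1@5 treats the top five suggestions as an unordered
# set, while NDCG also rewards ranking the correct concepts higher.
import math

def f1_at_5(ranked, relevant):
    hits = len(set(ranked[:5]) & relevant)
    if hits == 0:
        return 0.0
    precision = hits / 5
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def ndcg_at_5(ranked, relevant):
    dcg = sum(1 / math.log2(i + 2)
              for i, c in enumerate(ranked[:5]) if c in relevant)
    ideal = sum(1 / math.log2(i + 2)
                for i in range(min(len(relevant), 5)))
    return dcg / ideal

relevant = {"A", "B"}
good_rank = ["A", "B", "x", "y", "z"]  # correct concepts ranked first
poor_rank = ["x", "y", "z", "A", "B"]  # same set, correct ones last

for ranked in (good_rank, poor_rank):
    print(f1_at_5(ranked, relevant), round(ndcg_at_5(ranked, relevant), 3))
# Both lists get F1@5 = 0.571, but NDCG drops from 1.0 to about 0.5.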

Regarding the choice of test set: you are right that usually in
experiments like this, the test set is chosen beforehand and remains
identical throughout all evaluations. The reason is that the test set
tries to represent new, unseen data. If you choose different test sets
for each backend, then it's questionable how well the test set can
represent new records, perhaps ones that haven't even been created yet.
The test set should be chosen so that it is as realistic as possible,
not to maximize evaluation score (for a humorous take on this, see the
paper "Data set selection" published in 2003 in the parody journal
"Journal of Machine Learning Gossip". The journal is long gone now, but
the paper can be found e.g. via Google Scholar.)

In the case of Annif evaluation results, most of the metrics are simply
averages calculated over all the evaluation documents. If fastText is
doing well with the first 10,000 documents, it means that some of those
documents are very easy for that particular model to classify (and since
the first 5,000 gave a relatively low score, the easy documents must be
in the 5,000-10,000 range). It might make sense to shuffle the test set
documents, which should even out the scores over various ranges. But
although the evaluation curves look dramatic, if you look closely, the
absolute differences between the different limits seem to be very small.
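Shuffling can be a one-off script with a fixed seed, so the shuffled
order stays reproducible across runs. A minimal sketch, assuming a
short-text TSV test corpus with one document per line (file names are
just examples):

# Shuffle a TSV test corpus once, so that "first N documents"
# subsets become random samples rather than whatever order the
# records happened to be exported in.
import random

random.seed(42)  # fixed seed keeps the shuffled order reproducible

with open("test.tsv", encoding="utf-8") as f:
    docs = f.readlines()

random.shuffle(docs)

with open("test-shuffled.tsv", "w", encoding="utf-8") as out:
    out.writelines(docs)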

-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Parthasarathi Mukhopadhyay

Jan 11, 2025, 9:38:32 AM
to annif...@googlegroups.com

Hello Osma,

Thank you for referring me to the paper! Its humorous intent made for an enjoyable read, and it cleverly highlighted an important cautionary message about dataset selection practices.

Regarding test dataset selection, your explanation clearly articulates the primary purpose of a fixed, held-out test set. We now better understand the rationale, and we have observed that the difference between the best-case and worst-case scores for fastText is only 0.6%, which puts the evaluation curves in perspective.

Thank you for sharing these insights!

Best regards,


Parthasarathi


