The question arises whether the principle of varying the dataset for each backend can also be applied when measuring their evaluation performance. For instance, the attached file (02_Evaluation) shows that, with distinct and carefully selected test datasets, each backend reaches its best scores at a different test set size: fastText peaks (in both F1@5 and NDCG) with a test dataset of 10,000 records, whereas Omikuji (Bonsai) reaches its best scores with 20,000 records.
In most Annif-related presentations and journal articles, researchers evaluate the different backends on a single shared dataset. This leaves it unclear which approach is preferable: (1) using a uniform test dataset across all backends for a comparative evaluation, or (2) using a separate test dataset for each backend, chosen to give that backend its best possible score. A sketch of how the two approaches differ in practice is included below.
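For concreteness, here is a minimal Python sketch of how the two approaches could be scripted around the "annif eval" command. The project IDs (fasttext-en, omikuji-bonsai-en) and test file names (test-10k.tsv, test-20k.tsv) are only placeholders for whatever one's own Annif setup uses; the script simply runs the same evaluation command for each backend/test-set combination and prints the reported metrics.

import subprocess

# Hypothetical project IDs and test files; adjust to your own Annif setup.
PROJECTS = ["fasttext-en", "omikuji-bonsai-en"]

# Approach (1): one shared test set for every backend.
SHARED_TEST = "test-10k.tsv"

# Approach (2): a per-backend test set chosen to maximise each backend's score.
BEST_TEST = {
    "fasttext-en": "test-10k.tsv",
    "omikuji-bonsai-en": "test-20k.tsv",
}

def evaluate(project: str, test_file: str) -> None:
    """Run `annif eval` for one backend/test-set combination and print its metric report."""
    print(f"=== {project} on {test_file} ===")
    result = subprocess.run(
        ["annif", "eval", project, test_file],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

# Comparative evaluation: every backend on the shared test set.
for project in PROJECTS:
    evaluate(project, SHARED_TEST)

# Best-case evaluation: each backend on its own preferred test set.
for project, test_file in BEST_TEST.items():
    evaluate(project, test_file)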
Parthasarathi Mukhopadhyay
Professor, Department of Library and Information Science,
University of Kalyani, Kalyani - 741 235 (WB), India
Hello Osma,
Thank you for referring me to the paper! Its humorous framing made for an enjoyable read, and it cleverly conveys an important cautionary message about dataset selection practices.
Regarding test dataset selection, your explanation makes clear the primary purpose of evaluating a given backend. We now understand the rationale better, and we have noted that the difference between the best-case and worst-case results for fastText is only 0.6%, which puts things in perspective.
Thank you for sharing these insights!
Best regards,
Parthasarathi