High NDCG vs. Low F1@5


Parthasarathi Mukhopadhyay

Jan 11, 2026, 3:40:05 AM
to Annif Users
Dear all

I need some advice on the following matter:

In our experiment on categorizing Indian research output by SDG, we are seeing high NDCG but low F1@5 at different data points (learning curve from 25K to 500K records) for almost all models.

[Image: Efficacy in terms of F1@5 and NDCG of FastText]

Does this indicate that the models rank relevant items well relative to non-relevant ones, but that the cut-off is wrong, or that the ground truth rarely has 5 labels?
We need to explain why this gap exists. 
If the average document only has 1 to 3 SDG labels, is F1@5 naturally penalized?
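(As a rough illustration: for a document with a single gold label and a perfect ranking, precision@5 is at most 1/5 while recall is 1/1, so F1@5 cannot exceed about 0.33.)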
Is there any way in Annif to compute F1@1, F1@2 or F1@3 in cases like this?

Thanks and regards

Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

https://orcid.org/0000-0003-0717-9413

juho.i...@helsinki.fi

Jan 12, 2026, 9:36:27 AM
to Annif Users
Dear Parthasarathi,

Yes, if there are fewer than 5 gold-standard labels in the documents, F1@5 is naturally penalized while NDCG is not. You can use the `--limit N` option of the `annif eval` command to consider only the best N subjects in the evaluation and to obtain e.g. F1@1, F1@2 or F1@3 values.
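For illustration, here is a small Python sketch of the effect (this is not Annif's own evaluation code; the label counts and scores are made up, and it uses scikit-learn's `ndcg_score`). With only 2 gold labels and a perfect ranking, F1@5 is capped well below 1.0 while NDCG@5 is still 1.0:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Toy example (hypothetical numbers): 10 candidate SDG labels, a document
# with only 2 gold labels (indices 3 and 7), and a model that ranks those
# two labels highest, i.e. a perfect ranking.
n_labels = 10
gold = np.zeros(n_labels)
gold[[3, 7]] = 1.0                         # binary relevance vector

scores = np.linspace(0.5, 0.05, n_labels)  # low scores for the other labels
scores[[3, 7]] = [0.9, 0.8]                # gold labels get the top scores

k = 5
top_k = np.argsort(scores)[::-1][:k]       # indices of the top-5 suggestions
hits = gold[top_k].sum()                   # gold labels found in the top 5

precision_at_5 = hits / k                  # 2 / 5 = 0.40
recall_at_5 = hits / gold.sum()            # 2 / 2 = 1.00
f1_at_5 = 2 * precision_at_5 * recall_at_5 / (precision_at_5 + recall_at_5)

ndcg_at_5 = ndcg_score([gold], [scores], k=k)

print(f"F1@5   = {f1_at_5:.2f}")   # ~0.57: capped by the fixed cut-off of 5
print(f"NDCG@5 = {ndcg_at_5:.2f}") # 1.00: the ranking itself is perfect
```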

A side note on the results table: it seems to be based on evaluations on the training set, because the NDCG reaches almost 1.0. For real evaluations, a data set different from the one used to train the model should be used, to avoid over-optimistic estimates.

Regards (and sorry about the cross-posting, first only personally and now via annif-users)
 -Juho

Parthasarathi Mukhopadhyay

Jan 12, 2026, 11:28:06 AM
to Annif Users
Dear Juho and Micsik 

Heartfelt thanks for your kind response and sound guidance as always. 

Yes Juho, this is the final result set from the learning-curve script we are using to determine the optimum training data requirements for the different deployed backends (500K training records, short corpus, increasing the dataset by 25K in each cycle, records spanning the years 2016 to 2024). The evaluation dataset will be based on 2025 publications only, to avoid any data leakage. It's an ongoing project.

Best regards

-Parthasarathi 
