metrics in Annif

Elisabeth Mecking

Jul 10, 2023, 5:23:17 AM
to Annif Users
Hi, I have a question about the scores in the evaluation. The Annif wiki says about NDCG: "getting the top ranked (highest score) result right will matter more than getting the 2nd or 3rd right." How are the results ranked, and is that ranking score the same one that is used for the threshold?
Thank you for your help.
Elisabeth

juho.i...@helsinki.fi

Jul 10, 2023, 12:40:49 PM
to Annif Users
Hi Elisabeth!

The results are ranked by the scores given by the algorithm, and yes, they are the same scores to which the threshold is applied.

The code for calculating the NDCG metric is here, implementing the formulas shown on the Wikipedia page.
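
For a quick illustration, here is the textbook DCG/NDCG calculation in Python (a minimal sketch of those formulas, not the actual Annif code; the relevance lists are made up):

    import math

    def dcg(relevances):
        # Discounted cumulative gain: each result's relevance is
        # discounted by the log of its rank, so the top-ranked
        # result contributes the most.
        return sum(rel / math.log2(rank + 2)
                   for rank, rel in enumerate(relevances))

    def ndcg(relevances):
        # Normalize by the DCG of the ideal ordering, so a
        # perfect ranking scores 1.0.
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Binary relevance of five ranked suggestions: hits at ranks 1 and 3.
    print(ndcg([1, 0, 1, 0, 0]))  # ~0.92
    # The same two hits pushed down to ranks 2 and 4 score lower:
    print(ndcg([0, 1, 0, 1, 0]))  # ~0.65

This is why getting the top-ranked result right matters more than getting the 2nd or 3rd right.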

-Juho

Elisabeth Mecking

Jul 10, 2023, 1:35:14 PM
to Annif Users
Hi Juho,
thank you so much for your answer. Could you explain a little more? I have mostly used annif eval, which gives me a variety of scores, but I know that when you run annif suggest there's a score for each subject. Is that the one that is used for ranking? How is it calculated?
Thanks for your help
Elisabeth

juho.i...@helsinki.fi

Jul 11, 2023, 4:04:52 AM
to Annif Users
Hi!

Yes, the roles and meanings of the various numerical values could be explained more clearly in the Annif wiki.

The "annif suggest" command operates on one document at time. It gives a list of subjects suggestions with numerical values, i.e. the suggestion scores. They come from the backend that the project uses, and the exact way how the backend and its algorithm calculates the scores is complicated. Generally it is not possible to track how the score is calculated. However, the score values are between 0 and 1; and the higher value, the more relevant the suggestion is to the document (or this is what the algorithm thinks). The threshold option of the suggest command applies to the suggestion scores.

The "annif eval" command operates on multiple documents that already have human-selected,  "gold-standard" subjects attached. It gives numerical values for many metrics, which are calculated by comparing the subject suggestions to the gold-standard subjects using all given documents. There are many ways how to exactly do the comparison and metric calculation, which is why many metrics exist. Each metric emphasizes a different aspect of "correctness" of the suggestions. We usually aim to optimize the F1@5 score when developing models for Finto AI service.

Hope this helps,
-Juho

Elisabeth Mecking

Jul 11, 2023, 4:44:32 AM
to Annif Users
Hi Juho,

thanks for the detailed explanation, that helps.
Elisabeth