Hi Parthasarathi,
We will try to answer your questions below inline.
Best regards,
Annif-team
Dear all,

We are learning Annif to introduce it to students of a course on AI/ML for Libraries. We are now in the phase of evaluating machine-generated indexing efficiency through retrieval metrics. I have a few basic questions about the evaluation of indexing results:

1. Is the score provided by Annif against each suggested term/concept (annif suggest command) based on NDCG? If so, is it NDCG@5?
The score provided by Annif for each suggestion is an estimate of the “goodness” of that suggestion, and is given by the backend algorithm used. The scores from different algorithms are not comparable. However, for every algorithm the scores are in the range from 0 to 1.0, and a bigger value means the algorithm considers the suggestion better.
The scores of the suggestions are not (directly) related to the evaluation metrics, which quantify how well the suggested subjects correspond to the gold-standard subjects. Note that the metrics concern the whole set of suggestions given for a text (restricted to the 5 highest-scored suggestions in the case of NDCG@5 or F1@5). This is in contrast to the subject suggestion score, which is a property of each individual subject suggestion.
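To make the difference concrete, here is a minimal pure-Python sketch of F1@5 for a single document. This is an illustration, not Annif's actual implementation, and the URIs and scores are made up:

```python
# Sketch of F1@5 for one document: the set of the five highest-scored
# suggestions is compared against the gold-standard subject set.
# All URIs and scores below are invented examples.

def f1_at_k(suggestions, gold, k=5):
    """suggestions: list of (uri, score) pairs; gold: set of gold URIs."""
    top_k = {uri for uri, _ in sorted(suggestions, key=lambda s: -s[1])[:k]}
    tp = len(top_k & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(top_k)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

suggestions = [("uri:a", 0.9), ("uri:b", 0.8), ("uri:c", 0.6),
               ("uri:d", 0.4), ("uri:e", 0.3), ("uri:f", 0.2)]
gold = {"uri:a", "uri:c", "uri:f"}
print(round(f1_at_k(suggestions, gold), 3))  # 2 of the top 5 are correct -> 0.5
```

Note that the individual scores only determine *which* suggestions make the top 5; the metric itself is computed over the whole set.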
2. The 'eval' command gives us an array of metrics, one of which is called LRAP. Is it based on the 'average precision' metric?
Label ranking average precision (LRAP) “is the average over each ‘ground truth label’ assigned to each sample, of the ratio of true vs. total labels with lower score.” (see scikit-learn documentation). We usually rely more on other metrics in our testing, typically F1@5.
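For reference, that definition can be spelled out in plain Python. This is an illustrative re-implementation of the scikit-learn formula, not Annif's code, and the label matrix and scores are invented:

```python
# Sketch of label ranking average precision (LRAP), following the
# scikit-learn definition quoted above: for each true label, the ratio
# of true labels to all labels ranked at or above it, averaged over the
# true labels of a sample and then over the samples.

def lrap(y_true, y_score):
    """y_true: 0/1 label lists per sample; y_score: score lists per sample."""
    sample_scores = []
    for truth, scores in zip(y_true, y_score):
        ratios = []
        for j, is_true in enumerate(truth):
            if not is_true:
                continue
            # Labels ranked at or above label j (score >= score_j) ...
            rank = sum(1 for s in scores if s >= scores[j])
            # ... and how many of those are true labels.
            true_at_or_above = sum(1 for t, s in zip(truth, scores)
                                   if t and s >= scores[j])
            ratios.append(true_at_or_above / rank)
        sample_scores.append(sum(ratios) / len(ratios) if ratios else 1.0)
    return sum(sample_scores) / len(sample_scores)

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(round(lrap(y_true, y_score), 3))  # 0.417
```

The best possible LRAP is 1.0, reached when every gold-standard subject is ranked above every incorrect one.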
3. What is the utility of the 'optimize' command in the evaluation of indexing efficiency?
This command will use different limit (maximum number of subjects) and score threshold values when assigning subjects to each document given and compare the results against the gold standard subjects in the documents. The output is a list of parameter combinations and their scores. From the output, you can determine the optimum limit and threshold parameters depending on which measure you want to target.
The default limit is 10, so Annif gives at most 10 subject suggestions (if the input text is very short, the lexical backends cannot give that many suggestions as they search the text for possible subjects). But 10 suggestions can be too many, if you want to have only suggestions that are “more probably” right, so in that case the limit can be set lower. By choosing the limit value you can balance between precision and recall, i.e. between how many suggested subjects are correct and how many of the correct subjects are suggested. A metric that takes both precision and recall into account is the F1-score.
So the F1 score is different if you select the top 5 suggestions than if you take the top 10. You have to test different limit settings to find the one that achieves the best F1 score, and the optimize command does this testing for you.
For the same purpose as the limit you can also set a threshold value, which might be a better approach for a fully automated setup, that is, when there is no human choosing from the suggestions and they are used exactly as Annif gives them. The optimize command can also be used to find the most suitable threshold value.
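What the optimize command explores can be sketched in a few lines of Python. This is a toy emulation over one document with invented URIs and scores; the real command evaluates the whole test set and reports more metrics:

```python
# Toy emulation of the parameter sweep behind `annif optimize`: how
# precision, recall and F1 change with the limit (top-k cutoff).
# The suggestions and gold subjects below are invented.

def prf_at_limit(suggestions, gold, limit):
    top = {uri for uri, _ in sorted(suggestions, key=lambda s: -s[1])[:limit]}
    tp = len(top & gold)
    p = tp / len(top) if top else 0.0
    r = tp / len(gold)
    f1 = 2 * p * r / (p + r) if tp else 0.0
    return p, r, f1

suggestions = [("uri:a", 0.9), ("uri:b", 0.8), ("uri:c", 0.6),
               ("uri:d", 0.4), ("uri:e", 0.3), ("uri:f", 0.2)]
gold = {"uri:a", "uri:c", "uri:f"}

for limit in (1, 3, 5, 10):
    p, r, f1 = prf_at_limit(suggestions, gold, limit)
    print(f"limit={limit:2d}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

As the limit grows, precision tends to fall while recall rises; the optimize output lets you pick the combination that maximizes whichever measure you target.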
4. Is the following test plan the right one?

A. Create a training dataset: we created a training dataset with 15,00,000 records in the agricultural domain (AGRIS dataset and AGROVOC controlled vocabulary) in the format Text corpus (Title+Abstract) | Term URI (TSV format).
B. Save 1,00,000 records as a test dataset (TSV format) and train Annif with the remaining 14,00,000 records (the records in the test dataset include representative samples for all 40K unique AGROVOC term descriptors).
C. This test dataset (T1) of 1,00,000 records (text corpus + term URIs in TSV format) is human-indexed using AGROVOC, and we are using the 'eval' command to generate retrieval metrics (mainly F1@5, NDCG@5 and LRAP) for T1.
D. Remove the human-indexed term URI(s) column from the test dataset (T2) and use this test dataset in OpenRefine with only the text corpus of records to produce suggested indexing terms from Annif via an API-based access mechanism, then extract the term URI(s) for each record in T2.
E. Generate retrieval metrics for T2 by using the 'eval' command.
F. Compare the results of the 'eval' command for T1 (human-indexed) and T2 (machine-indexed).
Steps A-C are correct, but if I understood right, steps D-F are redundant and measure the wrong thing. In step D, when you obtain the subject suggestions with Annif for each record, they are the same suggestions that the eval in step E internally produces, so you would get near-perfect retrieval metrics: you would be comparing the subject suggestions by Annif to (the same) subject suggestions by Annif. This is different from the correct eval in step C, where you are comparing the subject suggestions to the gold-standard subjects, which are known to be “right” because they were given by a human.
In step F you would find that the eval results for T2 are better than for T1. But only the eval results for T1 from step C would have the correct meaning.
You received this message because you are subscribed to the Google Groups "Annif Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to annif-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/annif-users/ff5f8e7c-189d-47c6-97de-5636d1ad1e3en%40googlegroups.com.
Dear Team@annif,

Heartfelt thanks for these lucid explanations related to the evaluation process. As per your prediction, the T2 dataset (steps D-F) is producing abnormally good 'eval' results (we are using the tf-idf backend algorithm for this learning exercise):

            T1                  T2
F1@5:    0.304895340770341   0.878593650793651
NDCG:    0.445248086825471   0.970556527331224
NDCG@5:  0.384222165394685   0.96784455303469
NDCG@10: 0.445262527880237   0.970556527331224

Now we understand the reason: it is actually comparing Annif vs. Annif.
We understand now that the 'eval' command for the dataset T1 (steps A-C) is actually comparing the human-indexed terms (the gold standard) with the Annif (tfidf)-suggested terms, thereby giving us the efficiency of the machine-generated indexing (e.g. NDCG: 0.445248086825471).
Is this a correct understanding?
One more question. We are presently feeding the T1 dataset (short-text document corpus) as a single TSV file containing all 1,00,000 records/rows:

annif eval project_id /path/to/T1.tsv

Would it be a better approach to split the T1.tsv file into 1,00,000 records (one record per row), then gzip the TSV file and set the path to the gzipped TSV file?

annif eval project_id /path/to/T1.gz
Actually, we are slightly confused by this instruction given under the Evaluation section in https://github.com/NatLibFi/Annif/wiki/Getting-started :

"If you have several documents with gold standard subjects, you can evaluate how well Annif works using the eval command. First you need to place the documents as text files in a directory and store the subjects in TSV files with the same basename."

Could you please suggest steps to prepare a test dataset as instructed above from T1 (T1 with two columns: Text corpus (Title+Abstract) | Term URI, in TSV format)?
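Assuming T1.tsv follows Annif's short-text TSV layout (document text in the first column, space-separated subject URIs in angle brackets in the second), a conversion along these lines would produce the text/TSV file pairs the wiki instruction describes. The file-naming scheme here is our own invention, and you should check the Annif corpus-format documentation for whether the per-document TSV lines also need a label column:

```python
# Sketch: split a two-column TSV (text <TAB> space-separated subject URIs)
# into per-document .txt/.tsv pairs with matching basenames, as described
# in the Annif wiki's evaluation instructions. The input layout is an
# assumption about T1.tsv; adjust the parsing to your actual file.
import csv
import os

def split_corpus(tsv_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(tsv_path, newline="", encoding="utf-8") as infile:
        for i, row in enumerate(csv.reader(infile, delimiter="\t")):
            text, uris = row[0], row[1].split()
            base = os.path.join(out_dir, f"doc-{i:06d}")
            with open(base + ".txt", "w", encoding="utf-8") as txt:
                txt.write(text)
            with open(base + ".tsv", "w", encoding="utf-8") as subj:
                # One subject URI per line; in the short-text TSV format
                # the URIs already carry angle brackets, so they are
                # written through unchanged.
                subj.write("\n".join(uris) + "\n")

# Example (hypothetical paths):
# split_corpus("/path/to/T1.tsv", "t1-corpus")
```

The resulting directory can then be passed to `annif eval` in place of the single TSV file.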