Hi Parthasarathi,
We will try to answer your questions below inline.
Best regards,
Annif-team
Dear all,

We are learning Annif to introduce it to students of a course on AI/ML for Libraries. We are now in the phase of evaluating machine-generated indexing efficiency through retrieval metrics. I have a few basic questions about the evaluation of indexing results:

1. Is the score provided by Annif against each suggested term/concept (annif suggest command) based on NDCG? If so, is it NDCG@5?
The score provided by Annif for each suggestion is an estimate of the “goodness” of that suggestion, and is given by the backend algorithm used. The scores from different algorithms are not comparable. However, for every algorithm the scores are in the range from 0 to 1.0, and a bigger value means the algorithm considers the suggestion better.
The scores of the suggestions are not (directly) related to the evaluation metrics, which quantify how well the suggested subjects correspond to the gold-standard subjects. Note that the metrics concern the whole set of suggestions given for a text (restricted to the 5 highest-scored suggestions in the case of NDCG@5 or F1@5). This is in contrast to the subject suggestion score, which is a property of each individual subject suggestion.
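To make the difference concrete, here is a minimal pure-Python sketch of F1@5 for a single document. This is an illustration, not Annif's actual implementation, and the URIs and scores are made up:

```python
# Sketch of F1@5 for one document: the set of the five highest-scored
# suggestions is compared against the gold-standard subject set.
# All URIs and scores below are invented examples.

def f1_at_k(suggestions, gold, k=5):
    """suggestions: list of (uri, score) pairs; gold: set of gold URIs."""
    top_k = {uri for uri, _ in sorted(suggestions, key=lambda s: -s[1])[:k]}
    tp = len(top_k & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(top_k)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

suggestions = [("uri:a", 0.9), ("uri:b", 0.8), ("uri:c", 0.6),
               ("uri:d", 0.4), ("uri:e", 0.3), ("uri:f", 0.2)]
gold = {"uri:a", "uri:c", "uri:f"}
print(round(f1_at_k(suggestions, gold), 3))  # 2 of the top 5 are correct -> 0.5
```

Note that the individual scores only determine *which* suggestions make the top 5; the metric itself is computed over the whole set.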
2. The 'eval' command gives us an array of metrics, one of which is called LRAP. Is it based on the 'average precision' metric?
Label ranking average precision (LRAP) “is the average over each ‘ground truth label’ assigned to each sample, of the ratio of true vs. total labels with lower score.” (see scikit-learn documentation). We usually rely more on other metrics in our testing, typically F1@5.
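For reference, that definition can be spelled out in plain Python. This is an illustrative re-implementation of the scikit-learn formula, not Annif's code, and the label matrix and scores are invented:

```python
# Sketch of label ranking average precision (LRAP), following the
# scikit-learn definition quoted above: for each true label, the ratio
# of true labels to all labels ranked at or above it, averaged over the
# true labels of a sample and then over the samples.

def lrap(y_true, y_score):
    """y_true: 0/1 label lists per sample; y_score: score lists per sample."""
    sample_scores = []
    for truth, scores in zip(y_true, y_score):
        ratios = []
        for j, is_true in enumerate(truth):
            if not is_true:
                continue
            # Labels ranked at or above label j (score >= score_j) ...
            rank = sum(1 for s in scores if s >= scores[j])
            # ... and how many of those are true labels.
            true_at_or_above = sum(1 for t, s in zip(truth, scores)
                                   if t and s >= scores[j])
            ratios.append(true_at_or_above / rank)
        sample_scores.append(sum(ratios) / len(ratios) if ratios else 1.0)
    return sum(sample_scores) / len(sample_scores)

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(round(lrap(y_true, y_score), 3))  # 0.417
```

The best possible LRAP is 1.0, reached when every gold-standard subject is ranked above every incorrect one.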
3. What is the utility of the 'optimize' command in the evaluation of indexing efficiency?
This command will use different limit (maximum number of subjects) and score threshold values when assigning subjects to each document given and compare the results against the gold standard subjects in the documents. The output is a list of parameter combinations and their scores. From the output, you can determine the optimum limit and threshold parameters depending on which measure you want to target.
The default limit is 10, so Annif gives at most 10 subject suggestions (if the input text is very short, the lexical backends cannot give that many suggestions as they search the text for possible subjects). But 10 suggestions can be too many, if you want to have only suggestions that are “more probably” right, so in that case the limit can be set lower. By choosing the limit value you can balance between precision and recall, i.e. between how many suggested subjects are correct and how many of the correct subjects are suggested. A metric that takes both precision and recall into account is the F1-score.
So the F1 score is different if you select the top 5 suggestions than if you take the top 10. You have to test different limit settings to find the one that achieves the best F1 score, and the optimize command does this testing for you.
For the same purpose as the limit you can also set a threshold value, which might be a better approach for a fully automated setup, that is, when there is no human choosing from the suggestions and they are used exactly as Annif gives them. The optimize command can also be used to find the most suitable threshold value.
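What the optimize command explores can be sketched in a few lines of Python. This is a toy emulation over one document with invented URIs and scores; the real command evaluates the whole test set and reports more metrics:

```python
# Toy emulation of the parameter sweep behind `annif optimize`: how
# precision, recall and F1 change with the limit (top-k cutoff).
# The suggestions and gold subjects below are invented.

def prf_at_limit(suggestions, gold, limit):
    top = {uri for uri, _ in sorted(suggestions, key=lambda s: -s[1])[:limit]}
    tp = len(top & gold)
    p = tp / len(top) if top else 0.0
    r = tp / len(gold)
    f1 = 2 * p * r / (p + r) if tp else 0.0
    return p, r, f1

suggestions = [("uri:a", 0.9), ("uri:b", 0.8), ("uri:c", 0.6),
               ("uri:d", 0.4), ("uri:e", 0.3), ("uri:f", 0.2)]
gold = {"uri:a", "uri:c", "uri:f"}

for limit in (1, 3, 5, 10):
    p, r, f1 = prf_at_limit(suggestions, gold, limit)
    print(f"limit={limit:2d}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

As the limit grows, precision tends to fall while recall rises; the optimize output lets you pick the combination that maximizes whichever measure you target.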
4. Is the following test plan the right one?

A. Create a training dataset: we created a training dataset with 15,00,000 records in the agricultural domain (AGRIS dataset and AGROVOC controlled vocabulary) in the format Text corpus (Title+Abstract) | Term URI (TSV format).
B. Save 1,00,000 records as a test dataset (TSV format) and train Annif with the remaining 14,00,000 records (the records in the test dataset include representative samples for all 40K unique AGROVOC term descriptors).
C. This test dataset (T1) of 1,00,000 records (text corpus + term URIs in TSV format) is human-indexed using AGROVOC, and we are using the 'eval' command to generate retrieval metrics (mainly F1@5, NDCG@5 and LRAP) for T1.
D. Remove the human-indexed term URI(s) column from the test dataset (T2) and use this test dataset in OpenRefine with only the text corpus of records to produce suggested indexing terms from Annif via an API-based access mechanism, then extract the term URI(s) for each record in T2.
E. Generate retrieval metrics for T2 by using the 'eval' command.
F. Compare the results of the 'eval' command for T1 (human-indexed) and T2 (machine-indexed).
Steps A-C are correct, but if I understood right, steps D-F are redundant and measure the wrong thing. In step D, when you obtain the subject suggestions with Annif for each record, they are the same suggestions that the eval in step E internally produces, so you would get near-perfect retrieval metrics: you would be comparing the subject suggestions by Annif to (the same) subject suggestions by Annif. This is different from the correct eval in step C, where you are comparing the subject suggestions to the gold-standard subjects, which are known to be “right” because they were given by a human.
In step F you would find that the eval results for T2 are better than for T1. But only the eval results for T1 from step C would have the correct meaning.
You received this message because you are subscribed to the Google Groups "Annif Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to annif-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/annif-users/ff5f8e7c-189d-47c6-97de-5636d1ad1e3en%40googlegroups.com.
Dear Team@annif,

Heartfelt thanks for these lucid explanations related to the evaluation process. As per your prediction, the T2 dataset (steps D-F) is producing abnormally good 'eval' results (we are using the tf-idf backend algorithm for this learning exercise):

            T1                  T2
F1@5:    0.304895340770341   0.878593650793651
NDCG:    0.445248086825471   0.970556527331224
NDCG@5:  0.384222165394685   0.96784455303469
NDCG@10: 0.445262527880237   0.970556527331224

Now we understand the reason: it is actually comparing Annif vs. Annif.
We understand now that the 'eval' command for the dataset T1 (steps A-C) is actually comparing the human-indexed terms (the gold standard) with the Annif (tfidf)-suggested terms, thereby giving us the efficiency of the machine-generated indexing (e.g. NDCG: 0.445248086825471).
Is this a correct understanding?
One more question. We are presently feeding the T1 dataset (short-text document corpus) as a single TSV file containing all 1,00,000 records/rows:

annif eval project_id /path/to/T1.tsv

Would it be a better approach to split the T1.tsv file into 1,00,000 records (one record per row), then gzip the TSV file and set the path to the gzipped TSV file?

annif eval project_id /path/to/T1.gz
Actually, we are slightly confused by this instruction given under the Evaluation section in https://github.com/NatLibFi/Annif/wiki/Getting-started :

"If you have several documents with gold standard subjects, you can evaluate how well Annif works using the eval command. First you need to place the documents as text files in a directory and store the subjects in TSV files with the same basename."

Could you please suggest steps to prepare a test dataset as instructed above from T1 (T1 with two columns: Text corpus (Title+Abstract) | Term URI, in TSV format)?
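Assuming T1.tsv follows Annif's short-text TSV layout (document text in the first column, space-separated subject URIs in angle brackets in the second), a conversion along these lines would produce the text/TSV file pairs the wiki instruction describes. The file-naming scheme here is our own invention, and you should check the Annif corpus-format documentation for whether the per-document TSV lines also need a label column:

```python
# Sketch: split a two-column TSV (text <TAB> space-separated subject URIs)
# into per-document .txt/.tsv pairs with matching basenames, as described
# in the Annif wiki's evaluation instructions. The input layout is an
# assumption about T1.tsv; adjust the parsing to your actual file.
import csv
import os

def split_corpus(tsv_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(tsv_path, newline="", encoding="utf-8") as infile:
        for i, row in enumerate(csv.reader(infile, delimiter="\t")):
            text, uris = row[0], row[1].split()
            base = os.path.join(out_dir, f"doc-{i:06d}")
            with open(base + ".txt", "w", encoding="utf-8") as txt:
                txt.write(text)
            with open(base + ".tsv", "w", encoding="utf-8") as subj:
                # One subject URI per line; in the short-text TSV format
                # the URIs already carry angle brackets, so they are
                # written through unchanged.
                subj.write("\n".join(uris) + "\n")

# Example (hypothetical paths):
# split_corpus("/path/to/T1.tsv", "t1-corpus")
```

The resulting directory can then be passed to `annif eval` in place of the single TSV file.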