Assessors made three judgments per document: a relevance judgment, an effectiveness judgment, and a credibility judgment. Relevance was judged on a three-way scale:
0: not relevant
1: relevant
2: highly relevant
The other two judgments were made only for documents judged relevant (1) or highly relevant (2).
Effectiveness judgments can have the following values:
-2: should have been judged but mistakenly was not
-1: relevance was 0, so not judged
0: judged as no info
1: judged ineffective
2: judged inconclusive
3: judged effective
Credibility judgments can have the following values:
-2: should have been judged, but mistakenly was not
-1: relevance was 0, so not judged
0: not credible
1: credible
The qrels file containing these raw judgments is posted to the Decision track section of the tracks page in the active participants' part of the TREC web site. It is in the format
topicid 0 docid relevance-judgment effectiveness-judgment credibility-judgment
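A line in this format can be split into its fields as in the following sketch (the field names and the sample document id are illustrative, not taken from the actual qrels):

```python
# Parse one line of the raw qrels file, assuming the six-column
# format described above:
#   topicid 0 docid relevance effectiveness credibility
def parse_qrels_line(line):
    topicid, _, docid, rel, eff, cred = line.split()
    return {
        "topic": topicid,
        "doc": docid,
        "relevance": int(rel),       # 0, 1, or 2
        "effectiveness": int(eff),   # -2 through 3
        "credibility": int(cred),    # -2 through 1
    }

# Hypothetical example line (docid is made up for illustration):
judgment = parse_qrels_line("1 0 clueweb12-0001wb-05-12345 2 3 1")
```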
Two other variants of the qrels are also posted. "qrels_relevance" is a standard trec_eval qrels file containing only the relevance judgment. "qrels_correctness" is a three-judgment qrels file in which the effectiveness judgment has been mapped to a correctness aspect. This qrels file is the judgment file to use with the extended trec_eval program that computes three-aspect measures. (Correctness is a match between the generally accepted medical opinion for the question asked in the topic and the document's claim about the treatment's effectiveness. A file containing the accepted opinions is also posted to the website as "topics_efficacy". In that file, -1 means the treatment is believed to be Not Helpful, 0 means the evidence is Inconclusive, and 1 means the treatment is believed to be Helpful.)
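The mapping from effectiveness to correctness could be sketched as follows. This is an illustrative reading of the description above, not the official mapping code: it assumes a document is "correct" when its claimed effectiveness agrees with the accepted stance in topics_efficacy, and that documents with no claim (or no judgment) get no correctness label.

```python
# Assumed pairing of a document's effectiveness judgment with a
# stance on the same -1/0/1 scale used in topics_efficacy:
#   1 (ineffective) -> -1, 2 (inconclusive) -> 0, 3 (effective) -> 1
CLAIM_TO_STANCE = {1: -1, 2: 0, 3: 1}

def correctness(effectiveness_judgment, accepted_stance):
    """Return 1 if the document's claim agrees with the accepted
    medical opinion, 0 if it disagrees, and None when the document
    made no claim (judged 0) or was not judged (-1, -2)."""
    if effectiveness_judgment not in CLAIM_TO_STANCE:
        return None
    return 1 if CLAIM_TO_STANCE[effectiveness_judgment] == accepted_stance else 0
```

For example, a document judged effective (3) on a topic whose accepted opinion is Helpful (1) would be correct, while a document judged ineffective (1) on the same topic would be incorrect.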
The judgment files contain judgments for 50 topics; the assessing budget ran out before one topic (topic 14) could be completed. Judgment pools were originally built to depth 75 over all submitted runs. However, once it became clear that those pools were too large to get judged, the remaining topics had their pools reduced to depth 60.
There are two score reports per run. One, "eval.treceval", is the standard trec_eval report that uses only the relevance judgments. The second, "eval.extended", reports two measures that take into account three judgments (using the correctness aspect rather than raw document effectiveness judgment). The extended version of trec_eval that computes these measures will eventually be released.