Assessors made three judgments per document: a relevance judgment, an effectiveness judgment, and a credibility judgment. Relevance was judged on a three-way scale:
0: not relevant
1: relevant
2: highly relevant
The other two judgments were made only for documents judged relevant (1) or highly relevant (2).
Effectiveness judgments can have the following values:
-2: should have been judged but mistakenly was not
-1: relevance was 0, so not judged
0: judged as no info
1: judged ineffective
2: judged inconclusive
3: judged effective
Credibility judgments can have the following values:
-2: should have been judged, but mistakenly was not
-1: relevance was 0, so not judged
0: not credible
1: credible
The qrels file containing these raw judgments is posted to the Decision track section of the tracks page in the active participants' part of the TREC web site. It is in the format
topicid 0 docid relevance-judgment effectiveness-judgment credibility-judgment
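A line in this format can be split into its fields as in the following sketch (the field names and the sample document id are illustrative, not taken from the actual qrels):

```python
# Parse one line of the raw qrels file, assuming the six-column
# format described above:
#   topicid 0 docid relevance effectiveness credibility
def parse_qrels_line(line):
    topicid, _, docid, rel, eff, cred = line.split()
    return {
        "topic": topicid,
        "doc": docid,
        "relevance": int(rel),       # 0, 1, or 2
        "effectiveness": int(eff),   # -2 through 3
        "credibility": int(cred),    # -2 through 1
    }

# Hypothetical example line (docid is made up for illustration):
judgment = parse_qrels_line("1 0 clueweb12-0001wb-05-12345 2 3 1")
```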
Two other variants of the qrels are also posted. "qrels_relevance" is a standard trec_eval qrels file containing only the relevance judgment. "qrels_correctness" is a three-judgment qrels file in which the effectiveness judgment has been mapped to a correctness aspect. This qrels file is the judgment file to use with the extended trec_eval program that computes three-aspect measures. (Correctness is a match between the generally accepted medical opinion for the question asked in the topic and the document's claim about the treatment's effectiveness. A file containing the accepted opinions is also posted to the website as "topics_efficacy". In that file, -1 means the treatment is believed to be Not Helpful, 0 means the evidence is Inconclusive, and 1 means the treatment is believed to be Helpful.)
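The mapping from effectiveness to correctness could be sketched as follows. This is an illustrative reading of the description above, not the official mapping code: it assumes a document is "correct" when its claimed effectiveness agrees with the accepted stance in topics_efficacy, and that documents with no claim (or no judgment) get no correctness label.

```python
# Assumed pairing of a document's effectiveness judgment with a
# stance on the same -1/0/1 scale used in topics_efficacy:
#   1 (ineffective) -> -1, 2 (inconclusive) -> 0, 3 (effective) -> 1
CLAIM_TO_STANCE = {1: -1, 2: 0, 3: 1}

def correctness(effectiveness_judgment, accepted_stance):
    """Return 1 if the document's claim agrees with the accepted
    medical opinion, 0 if it disagrees, and None when the document
    made no claim (judged 0) or was not judged (-1, -2)."""
    if effectiveness_judgment not in CLAIM_TO_STANCE:
        return None
    return 1 if CLAIM_TO_STANCE[effectiveness_judgment] == accepted_stance else 0
```

For example, a document judged effective (3) on a topic whose accepted opinion is Helpful (1) would be correct, while a document judged ineffective (1) on the same topic would be incorrect.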
The judgment files contain judgments for 50 topics; the assessing budget ran out before one topic (topic 14) could be completed. Judgment pools were originally built to depth 75 over all submitted runs. However, once it became clear that those pools were too large to get judged, the remaining topics had their pools reduced to depth 60.
There are two score reports per run. One, "eval.treceval", is the standard trec_eval report that uses only the relevance judgments. The second, "eval.extended", reports two measures that take into account three judgments (using the correctness aspect rather than raw document effectiveness judgment). The extended version of trec_eval that computes these measures will eventually be released.