Measures to use for TREC RF papers

Chris Buckley

Oct 18, 2008, 3:29:53 PM
to trec-r...@googlegroups.com
I've been asked, given the plethora of measures and evaluation
techniques for the RF track, whether there are particular ones that
people should focus on in their papers and presentations, so that
their work can be easily compared to others.

The short answer is that you should use whatever measures on whatever
sets will best demonstrate the features of your system.

The medium answer is that the Pool10 evaluation on the 31 Terabyte
topics is the one that offers the greatest variety and depth of
analysis, mostly because nearly 400 docs were judged on average per
topic, rather than the 32 or 64 docs for the 237 topics evaluated
MQ-style. MAP is the official measure there, so unless you have some
other reason, I would take MAP on the 31 Terabyte topics as the first measure
you should be looking at and presenting. If you are doing your own
evaluations, remember that all evaluations should be done on only
the first 1000 docs retrieved for each topic (after the appropriate docs
have been removed - normally Set E).
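
If it helps, here is a rough Python sketch of that preparation step. It
is not an official script; the file names, and the assumption that Set E
is available as plain "topic docid" lines, are mine. It drops the Set E
docs from a run in the standard TREC format, keeps the top 1000
remaining docs per topic, and writes a run you can then score with MAP
(e.g. via trec_eval).

# Rough sketch, not an official script.  File names and the Set E file
# format (plain "topic docid" lines) are assumptions.
from collections import defaultdict

def load_set_e(path):
    """Read assumed 'topic docid' lines into a {topic: set(docids)} map."""
    fb = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, docid = line.split()[:2]
            fb[topic].add(docid)
    return fb

def filter_run(run_path, set_e, out_path, depth=1000):
    """Remove Set E docs, re-rank, and keep only `depth` docs per topic."""
    ranked = defaultdict(list)              # topic -> [(score, docid, tag)]
    with open(run_path) as f:
        for line in f:
            topic, _q0, docid, _rank, score, tag = line.split()
            if docid not in set_e.get(topic, set()):
                ranked[topic].append((float(score), docid, tag))
    with open(out_path, "w") as out:
        for topic, docs in ranked.items():
            docs.sort(key=lambda d: d[0], reverse=True)
            for rank, (score, docid, tag) in enumerate(docs[:depth], 1):
                out.write(f"{topic} Q0 {docid} {rank} {score} {tag}\n")

# filter_run("myrun.txt", load_set_e("setE.txt"), "myrun.filtered.txt")
# then score myrun.filtered.txt against the qrels to get MAP.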

For the long answer, I'll go into a bit more detail on what was done
in the assessments and why.

There were two separate evaluations done. Both of them pooled and
evaluated not only the 118 RF runs, but also the 25 MQ runs submitted
in June (so we have outside base cases for comparison). All official
evaluations on all runs had Set E docs removed before any pooling took
place.

The first evaluation was MQ-style, judging either 32 or 64 (50-50
split) documents for up to 237 topics. Two different MQ measures were
calculated: statMAP from NEU (on 208 topics), and expectedMAP from
UMass (on 237 topics). These measures are intended to give the same
ranking as MAP would if the runs had been fully judged, but they
algorithmically sample only a small number of docs. The purpose of
MQ-style evaluation is to be able to evaluate a much larger number of
topics for the same judging effort as the usual TREC topN pooling.
That's important for RF, since the topic variability of results is
affected not only by the normal inherent topic difficulty and user
interpretation of relevance (both always present in ad hoc
evaluations), but also by whether the docs used as RF input are
representative. The topics potentially to be evaluated MQ-style were
the 214 topics originally from the TREC MQ 2007 track, plus 25 topics
from the 3 years of TREC Terabyte tracks.
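
For intuition only - this is not the NEU or UMass code, and both real
estimators are more involved - here is a toy Python sketch of the
sampling idea: judge only a small sample of docs, record each doc's
inclusion probability, and use inverse-probability weighting to
estimate a measure (precision at 10 here; the real measures extend
this kind of estimate to MAP).

# Toy illustration only: not statMAP or expectedMAP, just the general
# inverse-probability (Horvitz-Thompson) idea behind sampling-based
# evaluation.  Each sampled relevant doc stands in for 1/p relevant docs.

def estimated_precision_at_k(ranking, sample, k=10):
    """
    ranking : docids of one topic in ranked order.
    sample  : {docid: (is_relevant, inclusion_probability)} for the judged
              sample only; unsampled docs contribute nothing.
    """
    total = 0.0
    for docid in ranking[:k]:
        if docid in sample:
            rel, p = sample[docid]
            if rel:
                total += 1.0 / p
    return total / k

# Example: if 2 of the top 10 docs were sampled with probability 0.25 each
# and both were judged relevant, the estimate is 2 * (1/0.25) / 10 = 0.8.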

The second evaluation was a normal TREC pooling evaluation, initially
done on the other 25 topics from the Terabyte track (all submissions
covered 264 topics). There was extra assessor time available after the
two evaluations; that time was used to judge a few more topics - thus
there are 6 Terabyte topics that were judged twice (both MQ-style and
Pool10-style). Given limited resources, the pool of docs to be judged
for each topic consisted of the top 10 docs from every run. With
overlap, that amounted to an average of a bit less than 400 docs per
topic being judged.
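
For anyone curious what that pooling step looks like mechanically, here
is a rough sketch (file names and the standard six-column run format
are assumed) that builds the depth-10 pool as the union of the top 10
docs from every run:

# Rough sketch of depth-10 pooling (standard TREC run format assumed).
from collections import defaultdict

def depth_pool(run_paths, depth=10):
    pools = defaultdict(set)                 # topic -> docids to judge
    for path in run_paths:
        per_topic = defaultdict(list)        # topic -> [(score, docid)]
        with open(path) as f:
            for line in f:
                topic, _q0, docid, _rank, score, _tag = line.split()
                per_topic[topic].append((float(score), docid))
        for topic, docs in per_topic.items():
            docs.sort(key=lambda d: d[0], reverse=True)
            pools[topic].update(docid for _score, docid in docs[:depth])
    return pools

# With 118 RF runs plus 25 MQ runs, the 143 * 10 = 1430 pooled entries
# per topic collapse, because of overlap, to a bit under 400 unique docs.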

The Pool10 evaluation is an approximation of the normal TREC evaluation
strategy, and should allow ranking of systems by any of the standard
evaluation measures. As always, the values of the measures may differ
from what full judging would have given, but system comparisons should
still be valid. This evaluation should allow investigation of whether
the effects of the RF were concentrated on just the top retrieved docs
or were more recall-oriented. (One ever-present RF question for a
particular system is whether the benefit is due to just finding a
couple of good query expansion terms, or due to a lot of expansion
terms establishing a useful context.)
The MQ evaluation measures should not be used for this sort of investigation.
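
If you want to do that kind of check yourself on the Pool10 topics,
something along these lines (a sketch, assuming the standard qrels
format and a run that already has the Set E docs removed) gives
per-topic P@10 and recall at 1000, so you can see whether your RF gain
is top-heavy or recall-oriented:

# Sketch only: per-topic P@10 and recall@1000 from standard-format files,
# computed on a run that already has the Set E docs removed.
from collections import defaultdict

def load_qrels(path):
    rel = defaultdict(set)                   # topic -> relevant docids
    with open(path) as f:
        for line in f:
            topic, _iteration, docid, judgment = line.split()
            if int(judgment) > 0:
                rel[topic].add(docid)
    return rel

def per_topic_scores(run_path, qrels):
    ranked = defaultdict(list)
    with open(run_path) as f:
        for line in f:
            topic, _q0, docid, _rank, score, _tag = line.split()
            ranked[topic].append((float(score), docid))
    scores = {}
    for topic, docs in ranked.items():
        relset = qrels.get(topic, set())
        if not relset:
            continue
        docs.sort(key=lambda d: d[0], reverse=True)
        ids = [docid for _score, docid in docs[:1000]]
        p10 = sum(d in relset for d in ids[:10]) / 10.0
        recall1000 = sum(d in relset for d in ids) / len(relset)
        scores[topic] = (p10, recall1000)
    return scores

# scores = per_topic_scores("myrun.filtered.txt", load_qrels("qrels.pool10"))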

When analyzing results, remember that the Set E sets of documents
were much larger for the original Terabyte topics. Thus Set E improvement
as compared to Set D might be expected to be larger in the Pool10
evaluation (all Terabyte topics) than in the MQ evaluation (only 25 of
the 208+ topics are Terabyte topics). That will be an artifact of the
experimental setup rather than showing anything about the two evaluations.

Ideally, most reports should include some failure analysis and examples of
topics that worked and topics that didn't.

Chris
