
Official TREC 2009 Relevance Feedback Guidelines


Chris Buckley

May 21, 2009, 12:26:04 PM
to trec-r...@googlegroups.com
Relevance Feedback (RF) track in TREC 2009
Guidelines May 21, 2009


Last year's TREC Relevance Feedback (RF) track concentrated on just
the RF algorithm itself: given a topic and a set of judged documents
for that topic, how does a system take advantage of the judgments in
order to return more docs that will be useful to the user? All groups
used the same sets of relevance judgments, and evaluation focused on
the improvement in MAP obtained by using the relevance judgments and
on how the improvement changed as the number of judgments in the
given relevance sets increased.

This year, the track will evaluate how well systems can find good docs
to be judged, as well as the improvement due to the RF algorithm.
There will be two phases to the track. In the first phase, for each
topic groups will identify a small number of docs (5) for which they
wish relevance judgments. The docs will be judged by NIST assessors.
In the second phase, groups will run their RF algorithms based on
different sets of judged docs from the first phase, submitting results
based both on their own phase 1 docs and on several other groups'
phase 1 docs. Evaluation can then compare the intrinsic quality of the
phase 1 docs, as well as look at the coupling between the choice of
phase 1 docs and each group's RF algorithm.

Depending on the approach of the group, the phase 1 docs might be
based on
1. the probability of relevance of the docs (trying to get as many
relevant docs as possible.)
2. docs which try to draw the line between relevant and non-relevant.
3. docs which represent different aspects of relevance.
4. docs which represent different interpretations of a
possibly ambiguous topic statement.
5. docs which may not be relevant in themselves, but may offer good
general background (and thus expansion terms) in the area of the
topic.

All of the above approaches, and undoubtedly others, are reasonable
approaches in some circumstances. The track this year will be an
initial attempt to compare these approaches.
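
Purely as an illustration of how different approaches 1 and 3 above
might look in practice, here is a hypothetical sketch. The scored
ranking and per-document term sets are assumed inputs, and nothing
here is part of the track or of any participant's actual system.

# Hypothetical sketch contrasting approach 1 (most probably relevant)
# with a crude version of approach 3 (cover different aspects).
def pick_top5(scored_ranking):
    # Approach 1: take the 5 highest-scoring docs.
    # scored_ranking: list of (doc_id, score), best first
    return [doc for doc, _ in scored_ranking[:5]]

def pick_diverse5(scored_ranking, doc_terms):
    # Approach 3 (roughly): greedily prefer docs whose terms are not
    # already covered by the docs picked so far.
    # doc_terms: doc_id -> set of terms
    picked, covered = [], set()
    for doc, _ in scored_ranking:
        terms = doc_terms[doc]
        if not terms or len(terms & covered) < len(terms) / 2:
            picked.append(doc)
            covered |= terms
        if len(picked) == 5:
            break
    return picked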


GOALS for 2009:

1. As last year: evaluate and compare just the RF algorithm - how
much improvement does RF give over a run without the extra information.

2. Compare approaches to finding docs to be used for RF.

3. Attempt to evaluate how coupled the two phases of RF really are.
At our current stage of RF understanding, are good initial docs good
for all RF algorithms, or can the group which supplied the initial docs
make the best use of them?

DOCUMENTS:

The experimental test bed will be the new ClueWeb09 collection. This
is a very large (1,000,000,000 pages) web collection that is
the first real attempt to have a test collection be representative of
the entire web. It is being used for 4 TREC tracks this year.

The size of the ClueWeb09 collection presents major obstacles for
the participation of some groups. Because of this, phase 1 (finding
initial docs) will use only the B subset of the collection -
all pages included in the ClueWeb09_English_1 set. This includes
a full crawl of Wikipedia.

In phase 2, groups can either continue to use the B subset, or use
the complete English subset (all pages in ClueWeb09_English_[1-10]),
which is half of the entire ClueWeb09 collection.

Note that the B subset is still quite large - over 3 times the size
of the Terabyte track's GOV2 collection. Indexing it will still take
substantial time.

TOPICS:

NIST assessors will choose 50 topics from a query log for which they
feel qualified to judge relevance. The topics will all be short
(mostly 1-3 words) and are not necessarily unambiguous.

These short topics will be used in both phase 1 and phase 2. As they
judge docs, the NIST assessors will develop longer versions of the
topics (TREC description and narrative sections) that more fully
describe relevance criteria, but those will not be distributed to RF
participants. The only knowledge of relevance for the RF systems will
be the short topics, and the phase 1 judgments.

All topics will be in single line Million Query (MQ) format:
topic_id: topic
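
As a rough illustration, topics in this format can be read with a few
lines of code. The file name below is only a placeholder, not an
official distribution name.

# Minimal sketch: read single-line MQ-format topics ("topic_id: topic").
# The file name "rf09.topics" is only a placeholder.
def read_mq_topics(path="rf09.topics"):
    topics = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            topic_id, _, text = line.partition(":")
            topics[topic_id.strip()] = text.strip()
    return topics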


PHASE 1 SUBMISSIONS:

After receiving the topics (end of May), participants should determine
up to 5 docs for each topic that they desire judged. Participants
will submit these docs to NIST as a TREC "run" in standard ranked TREC
results format. The WARC-TREC-ID field in each document is the
equivalent of the normal TREC DOCNO field.
Phase 1 runs are due June 22.
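
As a sketch of the expected file layout, a phase 1 run can be written
as below. The selection structure, scores, and run tag are purely
illustrative assumptions; only the six-column TREC results format is
prescribed.

# Minimal sketch: write up to 5 docs per topic in standard TREC results
# format: "topic_id Q0 WARC-TREC-ID rank score run_tag".
def write_phase1_run(selections, run_tag, path):
    # selections: topic_id -> list of (WARC-TREC-ID, score), best first
    with open(path, "w", encoding="utf-8") as out:
        for topic_id in sorted(selections):
            for rank, (doc_id, score) in enumerate(selections[topic_id][:5], start=1):
                out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}\n")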

Phase 1 runs can either be automatic or manual. A run is manual
if there is any topic dependent human interaction in coming up
with the 5 docs. It is anticipated that almost all phase 1 runs
will be automatic. At submission time, you will be asked to identify
runs as either automatic or manual.

Participants may submit a second "run" if they desire to test alternate
approaches. However, there is no guarantee that the second run will
be judged. There is a fixed assessor "budget" for the phase 1
judgments, and we have no idea how many groups will be
submitting phase 1 runs, nor how much document overlap
there will be between runs of different groups - overlap is expected
to be smaller than in past tracks given the size of the collection.

Depending on the resulting pool size, NIST may form pools from
(in order of decreasing desirability):
1. top 5 docs from one or two runs per group
2. top 5 docs from one run per group
3. top 3 docs from one run per group

Thus the ranking of the submitted 5 docs per topic may be important,
in that the last 2 docs may not be judged. We
expect that we will be able to form a pool using choice 1 above, but
need to have the above backup plans in case pools exceed the judging
budget.
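
For those unfamiliar with pooling, the pool for a topic is simply the
union of the top-ranked docs across the selected runs. A rough sketch
(not NIST's actual code) follows.

# Minimal sketch of pool formation for one topic: union of the top k
# docs from each selected run; k and the choice of runs follow whichever
# fallback option above fits the judging budget.
def form_pool(runs_for_topic, k=5):
    # runs_for_topic: list of ranked doc-id lists, one per selected run
    pool = set()
    for ranking in runs_for_topic:
        pool.update(ranking[:k])
    return pool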

Phase 1 submission runs should be named
<basename>.[12]
where basename includes something identifying the group, and the
suffix 1 or 2 indicates first or second run. The basename should be 3
or 4 characters long. Thus "SAB.2" might be a good run name for the
second run of Sabir Research.


PHASE 2 INPUTS:

The results of phase 1 will be used as judged document RF input
for phase 2 of the track.

The judged docs will be distributed as TREC qrels files. Each qrels
file has judgments for all 50 topics, and each line in the qrels file
will be of the ASCII text form
topic_id 0 WARC-TREC-ID rel_judgment
where rel_judgment will be an integer from 0 to 2. 0 indicates
nonrelevant, 1 indicates relevant, and 2 indicates highly relevant.

These input qrels files will be named q<phase 1 submission>. Thus
"qSAB.2" will be the qrels file corresponding to the earlier "SAB.2"
submission.
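
A qrels file in this format can be loaded with something like the
following sketch; the helper name is ours and not part of any
distributed tooling.

# Minimal sketch: load a phase 1 qrels file (e.g. "qSAB.2") into
# topic_id -> {WARC-TREC-ID: judgment}, with judgment 0, 1, or 2.
def read_qrels(path):
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue
            topic_id, _, doc_id, judgment = parts
            qrels.setdefault(topic_id, {})[doc_id] = int(judgment)
    return qrels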

Each participant will be given 5 to 6 qrels files to use as RF input
for phase 2 runs. This set will include the phase 1 submission
qrels of that participant, as well as at least 4 qrels from other
participants.

The participant will then use their RF algorithm on each of their
assigned qrels files and submit a run for each qrels, as well as
submitting a base case run which uses no RF. Each run should be in
standard TREC ranked results format giving the top 2500 ranked docs
returned by the system.

The main evaluation will be done on the top 1000 docs for each
topic that were not included in the qrels input docs for that
run. Thus the main comparisons can only be made between runs
that used the same input qrels.

There will be a secondary evaluation where NIST will remove
any document that occurred in any qrels file, and then evaluate
over the top remaining 1000 docs. This evaluation can be used
to compare between any two runs, but will be less accurate
since many fewer relevant docs will be used in the evaluation.

Note that this implies you can ignore or remove the RF docs used for a
particular run when you submit that run, but you should NOT remove the
docs from any of the other qrels used as inputs for other runs.
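
To make the note above concrete, here is a rough sketch of the
permitted filtering, reusing the read_qrels() helper sketched under
PHASE 2 INPUTS; the function and argument names are only illustrative.

# Minimal sketch: when submitting the run that used a particular qrels
# file as RF input, a system may drop exactly those judged docs from
# its ranking; docs judged only in other groups' qrels must be left in.
def filter_own_input(ranking, own_qrels, topic_id, depth=2500):
    # ranking: list of WARC-TREC-IDs for one topic, best first
    judged_here = set(own_qrels.get(topic_id, {}))
    return [d for d in ranking if d not in judged_here][:depth]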

Runs should be named
<run_basename>.<qrels_file>
Thus a set of runs from Sabir Research might be named
SAB09RF.base
SAB09RF.qSAB.1
SAB09RF.qSAB.2
SAB09RF.qABC.2
SAB09RF.qDEF.1
SAB09RF.qGHI.1
SAB09RF.qJKL.2

At the moment, we are planning to allow only one set of runs
per participant, but this may change. Let us know what you
would like. Allowing two sets of runs will mean less judging
resources per run, which may be a problem with a collection
of this size.

PHASE 2 ONLY:

Given the new collection and the tight schedule, we may allow some
groups to participate in Phase 2 only. They will not be able to use
their own Phase 1 docs, so groups are strongly encouraged to
participate in Phase 1. Groups wanting to do this will have to
convince the track organizers that their system can be running well on
the collection by mid-July - we will probably require groups to submit
a sample retrieval run. (Given that not all groups are running
everybody else's qrels, the analysis of results is made much more
difficult by no-shows.) A final decision on this has not been made.

EVALUATION:

NIST will perform additional relevance judgments on all
50 topics. Topics will be evaluated in both the MQ track
style (only 40 documents per topic will be judged) and
the top N pool style. Pools will be shared between the
RF track, the MQ track, and the Web track, and there will
probably be different pools for the B subset and the
full ClueWeb09 English set.

The primary measure will be MAP at 1000 documents.
Other standard measures will also be reported.
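
For reference, MAP at 1000 docs is computed in the usual TREC way; the
sketch below treats judgments of 1 and 2 both as relevant, which is an
assumption about how the graded judgments will be collapsed.

# Minimal sketch of average precision and MAP at a 1000-doc cutoff.
def average_precision(ranking, judged, cutoff=1000):
    # ranking: doc ids for one topic, best first; judged: doc -> judgment
    relevant = {d for d, j in judged.items() if j > 0}
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranking[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)

def mean_average_precision(runs, qrels, cutoff=1000):
    # runs: topic_id -> ranked doc ids; qrels: topic_id -> {doc: judgment}
    aps = [average_precision(runs[t], qrels.get(t, {}), cutoff) for t in runs]
    return sum(aps) / len(aps) if aps else 0.0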


EXPECTED SCHEDULE:

1. End of May: Topics will be available.
2. June 22, 2009: Phase 1 document results due.
3. July 8, 2009: Phase 1 qrels made available.
4. August 24, 2009: All Phase 2 runs are due.
5. End of September: NIST will finish new relevance judgments, and
evaluation results will be sent to the participants.
6. Third week of October: Notebook paper due.
7. November 17-20: TREC 2009.
