TREC Relevance Feedback 2009 - Collection released


Chris Buckley

Mar 27, 2009, 4:55:45 PM
to trec-r...@googlegroups.com
The ClueWeb09 collection which will be used for several TREC
tracks this year, including our Relevance Feedback
track, has now been released. You can get collection
details at http://boston.lti.cs.cmu.edu/Data/clueweb09

Many thanks to Jamie Callan and his crew for getting
this out. It looks like it will be a major research resource
for the next several years. There are two features of the
collection that should make it very usable this year for
our track; I want to especially thank Jamie et al. for doing
the extra work to add them late in the process. The first
feature is the category B subset, a reasonably broad but much
smaller subset that will enable groups to fully participate in
this year's track without ramping up to the full crawl. The
second is the inclusion of a complete crawl of
Wikipedia, which several groups participating in this track
have said they would like to use for feedback.

There is a quite reasonable charge for the collection (~$800 for
the full collection on 4 disks or $250 for just the subset), but
as always it will take a bit of time to get all the paperwork done to
get the collection shipped to you. I would urge all potential track
participants to start working on the paperwork right away. There
is a 1.6 MByte (not 1 GByte as the web page says) sample set of
documents at
http://boston.lti.cs.cmu.edu/Data/clueweb09/samples/ClueWeb09_English_Sample_File.warc.gz
so that folks can make the (hopefully minor) tweaks needed to
index the new document format.
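
For anyone who wants to start on that now, here is a rough (untested) sketch
of iterating over the sample file in Python. The WARC-TREC-ID header name is
my assumption for the record identifier and should be checked against the
sample file itself:

# A minimal sketch of walking the sample WARC file to check that an indexer
# can read the new format. Header names should be verified against the file.
import gzip

def iter_warc_records(path):
    """Yield (headers, body_bytes) for each record in a gzipped WARC file."""
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break                      # end of file
            if not line.startswith(b"WARC/"):
                continue                   # skip padding between records
            headers = {}
            while True:
                hline = f.readline().rstrip(b"\r\n")
                if not hline:
                    break                  # blank line ends the header block
                key, _, value = hline.partition(b":")
                headers[key.strip().lower()] = value.strip()
            length = int(headers.get(b"content-length", b"0"))
            body = f.read(length)
            yield headers, body

for headers, body in iter_warc_records("ClueWeb09_English_Sample_File.warc.gz"):
    docid = headers.get(b"warc-trec-id", b"").decode()
    if docid:                              # skip the warcinfo record
        print(docid, len(body))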

As discussed at TREC last year, the general focus of this year's
track will be finding good documents upon which to base
relevance feedback. This implies a two-stage track. In the first
stage, groups will submit a small number of documents (5?) in
response to each of a set of 50-100 topics. Those docs will
be judged for each group. This will take place in June.
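
For concreteness, a first-stage submission might look something like the
sketch below, assuming the usual TREC run-file format (topic, Q0, docid,
rank, score, run tag). The retrieve() function and the run tag are
placeholders for each group's own system; the actual format will be pinned
down in the guidelines:

# A minimal sketch of writing a first-stage run: at most 5 docs per topic,
# in standard TREC run-file format. retrieve() is a hypothetical stand-in
# for each group's own first-stage retrieval.
NUM_DOCS = 5            # "a small number of documents (5?)" per topic
RUN_TAG = "myGroupRF1"  # hypothetical run identifier

def write_first_stage_run(topics, retrieve, out_path):
    with open(out_path, "w") as out:
        for topic_id in topics:
            # retrieve() is assumed to return (docid, score) pairs, best first
            for rank, (docid, score) in enumerate(retrieve(topic_id)[:NUM_DOCS], start=1):
                out.write(f"{topic_id} Q0 {docid} {rank} {score:.4f} {RUN_TAG}\n")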

In the second stage, each group will run their feedback
algorithms based both on their own small set of judged
docs and, in separate runs, on several other groups'
sets of judged docs. From this we will get an evaluation of the
intrinsic quality of the sets of judged docs, as well as how
well each system can use its own set of docs. This second
stage will take place in August.
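
The feedback method itself is entirely up to each group; purely as an
illustration, a Rocchio-style expansion over a set of judged docs might look
like the sketch below. The tf-idf vectors and the alpha/beta/gamma weights
are illustrative assumptions, not anything the track prescribes:

# A Rocchio-style sketch of second-stage feedback: expand the query vector
# using judged docs (a group's own set, or another group's). The vector
# representation and weights are illustrative only.
from collections import defaultdict

def rocchio(query_vec, rel_vecs, nonrel_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query vector from judged relevant/non-relevant docs."""
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for vec in rel_vecs:
        for term, w in vec.items():
            new_q[term] += beta * w / max(len(rel_vecs), 1)
    for vec in nonrel_vecs:
        for term, w in vec.items():
            new_q[term] -= gamma * w / max(len(nonrel_vecs), 1)
    # keep only positively weighted terms for the re-run query
    return {t: w for t, w in new_q.items() if w > 0}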

Given the time and complexity of indexing the collection,
I propose that the first-stage docs be required to be in the
category B subset. NIST will judge the first N docs in
each submission that occur in the B subset. In the second
stage, groups can choose to run on either the B subset or
the English-only subset of the full collection. If there are
enough folks interested in multi-lingual RF, we may also offer
the possibility of submitting from the full multi-lingual crawl,
with the understanding that NIST is not going to be able to
judge the non-English docs this year. The multi-lingual
submissions could be regarded as a pilot for future research,
even if the non-English docs will not contribute to the evaluation.

I welcome comments on the rough plan above; I'll write up
a fuller preliminary guideline proposal by the end of next week.
I don't have any further details on what is included in the category
B subset yet; that may affect some of what we do (e.g., Wikipedia).

Chris

李思

Mar 28, 2009, 10:33:15 PM
to trec-r...@googlegroups.com
If we buy the full collection, how can we identify the Category B subset within it?

--
李思
Pattern Recognition & Intelligent System Lab
Beijing University of Posts and Telecommunications
Tel:010-62285019-1012
Cell Phone:134-6639-0040

Le Zhao

Mar 29, 2009, 2:10:20 PM
to trec-r...@googlegroups.com
Quoted from Mark Hoy:

the "Category B" data will be just the data from the directory
"ClueWeb09_English_1" from the entire subset. (i.e. the first 50 million
documents of the English corpus)
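
So, assuming the full-collection disks keep that directory layout and the
records carry the usual WARC-TREC-ID header, a rough way to collect the
Category B docids from a full copy would be something like the sketch below
(it reuses the iter_warc_records() helper sketched earlier in this thread):

# A sketch of building the set of Category B docids from a full-collection
# copy, following Mark Hoy's description that Category B is exactly the
# contents of the ClueWeb09_English_1 directory. The directory layout and the
# WARC-TREC-ID header name are assumptions to verify against the shipped disks.
# iter_warc_records() is the helper from the sample-file sketch earlier.
import os

def category_b_docids(root):
    """Collect WARC-TREC-IDs of all records under ClueWeb09_English_1."""
    docids = set()
    cat_b_dir = os.path.join(root, "ClueWeb09_English_1")
    for dirpath, _, filenames in os.walk(cat_b_dir):
        for name in filenames:
            if name.endswith(".warc.gz"):
                for headers, _ in iter_warc_records(os.path.join(dirpath, name)):
                    docid = headers.get(b"warc-trec-id", b"").decode()
                    if docid:
                        docids.add(docid)
    return docids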

Cheers,
Le

Stephen Robertson

Apr 23, 2009, 9:58:09 AM
to trec-r...@googlegroups.com
A few comments on Chris's "rough plan" sent out 27 March:

I think the first-stage restriction to the Category B subset is reasonable. Do we envisage that the second-stage judgements will not be restricted to Category B? If they are not, it might be necessary to ensure that the Category B subset is nevertheless well represented in the judged set (otherwise participants who are restricted to Category B for technical reasons will not be well evaluated). This might not be a problem overall, but it may be for some queries. It would also interact with the methods for choosing documents for judgement at the second stage.

The requirement to run feedback routines using documents returned in the first stage by other participating systems means that when the first-stage qrels are issued, they will need to carry labels indicating which of the participating systems retrieved them.

It is not clear whether every participant will run feedback with the first-stage documents from every other participant; that probably depends on the scale of participation. If not, some randomization might be good from a global perspective. From a local perspective, though, one might want to select which first-stage sets to run with. That would suggest knowing something about the first-stage runs from the other participants (e.g. whether they were retrieved in a particular way in order to help feedback). However, this might be difficult within the schedule, so maybe it has to wait till afterwards. If randomization is required, it might have to be done centrally.
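
If central randomization is wanted, something as simple as the following
sketch might do. The number of other groups' sets assigned to each
participant (k) and the group names are of course hypothetical:

# A sketch of central randomization for the second stage: each participant is
# assigned the first-stage document sets of k randomly chosen other groups to
# run feedback with.
import random

def assign_feedback_sets(groups, k, seed=2009):
    """Return {group: list of k other groups whose judged sets it must use}."""
    rng = random.Random(seed)   # fixed seed so the assignment is reproducible
    assignments = {}
    for g in groups:
        others = [o for o in groups if o != g]
        assignments[g] = rng.sample(others, min(k, len(others)))
    return assignments

print(assign_feedback_sets(["groupA", "groupB", "groupC", "groupD"], k=2))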

Stephen
