update on TREC RF track 2/26/11

Matt Lease

Feb 26, 2011, 1:06:20 PM
to trec-r...@googlegroups.com
Hi all,

Since track participant papers are due Monday (2/28), we'd hoped to be
able to provide you with better information before then. We have a
little more information to offer, though not a lot; see the three
points below.

--------
1. By design, we only pooled submissions to depth 10 to reduce
assessment costs. This means any (topic, document) pair not returned
in the top 10 by at least one team has not been judged. For
evaluation, this suggests truncating your rankings at depth 10 before
running trec_eval, or using an evaluation metric which only considers
judged documents.
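
If helpful, here is a minimal sketch of the depth-10 truncation in
Python (the function name and file handling are mine; it assumes the
standard run format "topic Q0 docid rank score tag", with lines
already in rank order within each topic):

import sys
from collections import defaultdict

def truncate_run(in_path, out_path, depth=10):
    # Keep only the first `depth` lines seen for each topic.
    kept = defaultdict(int)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            topic = line.split()[0]
            if kept[topic] < depth:
                dst.write(line)
                kept[topic] += 1

if __name__ == "__main__":
    truncate_run(sys.argv[1], sys.argv[2])

The truncated run can then be scored with trec_eval as usual.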

A specific issue is that the following 11 documents for topic 20642 were
in the depth 10 pool but omitted from judging in error:

clueweb09-en0026-31-03796
clueweb09-en0094-51-14012
clueweb09-en0094-72-27093
clueweb09-en0095-49-07074
clueweb09-en0095-49-07077
clueweb09-en0001-51-12444
clueweb09-en0001-51-20886
clueweb09-en0001-61-15150
clueweb09-en0011-49-11008
clueweb09-en0011-49-11316
clueweb09-en0011-49-11468

So results for this topic should take this error into account.
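
One possible way to do so (a sketch only; the mitigation and the
helper name are mine) is to drop these unjudged documents from your
topic 20642 ranking before scoring, so trec_eval does not silently
count them as non-relevant:

# The 11 document IDs listed above, omitted from judging in error.
MISSING_20642 = {
    "clueweb09-en0026-31-03796",
    "clueweb09-en0094-51-14012",
    "clueweb09-en0094-72-27093",
    "clueweb09-en0095-49-07074",
    "clueweb09-en0095-49-07077",
    "clueweb09-en0001-51-12444",
    "clueweb09-en0001-51-20886",
    "clueweb09-en0001-61-15150",
    "clueweb09-en0011-49-11008",
    "clueweb09-en0011-49-11316",
    "clueweb09-en0011-49-11468",
}

def filter_run(in_path, out_path):
    # Copy a run file, skipping topic 20642 lines that name one of the
    # unjudged documents (run format: topic Q0 docid rank score tag).
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if fields[0] == "20642" and fields[2] in MISSING_20642:
                continue
            dst.write(line)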


-----------
2. We recently received a question on how to use the judgments
distributed on February 1st with trec_eval. The file "judgments-turk"
looks like:

20002 clueweb09-en0000-66-24091 -1 1
20002 clueweb09-en0001-31-15410 -1 1
20002 clueweb09-en0000-05-22942 -1 0
20002 clueweb09-en0000-05-22943 -1 0
20002 clueweb09-en0006-85-33191 2 0

where the "README" indicates:

# turk judgments
3rd column = prior NIST judgment (-1 = none)
4th column = new judgment (consensus of judgments by mturk workers)

The short answer is to swap the 2nd and 3rd columns: trec_eval's qrels
format is "topic unused docid judgment", and the second column is
ignored, so the swapped file is valid input. I've attached this
reordered file for convenience.
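
For concreteness, a minimal Python sketch of the swap (the output file
name mirrors the attachment):

with open("judgments-turk") as src, \
     open("judgments-turk.reordered", "w") as dst:
    for line in src:
        topic, docid, nist, turk = line.split()
        # qrels columns: topic, (ignored), docid, judgment
        dst.write(f"{topic} {nist} {docid} {turk}\n")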


----------
3. Besides missing coverage for the 11 documents noted above, our
primary concern with the initial turk judgments was relatively low
agreement with prior NIST judgments. If we ignore the rows of
judgments-turk where NIST=-1 (i.e. not previously judged by NIST), there
are 3277 (topic,document) pairs for which we can compute the following
confusion matrix:


            Turk=0    Turk=1    Turk=2       Sum
NIST=0     0.34239   0.11443   0.00122   0.45804
NIST=1     0.11962   0.14312   0.00061   0.26335
NIST=2     0.07446   0.20110   0.00305   0.27861
Sum        0.53647   0.45865   0.00488   1.00000


where NIST judgments index the rows and turker judgments the columns.
Summing down the diagonal, agreement between turkers and NIST is
34.2% + 14.3% + 0.3% ≈ 48.8%, rather lower than we would like to see.
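
For anyone who wants to recompute this, a short Python sketch under
the same assumption about the judgments-turk format shown in point 2:

from collections import Counter

pairs = Counter()
with open("judgments-turk") as f:
    for line in f:
        _, _, nist, turk = line.split()
        if nist != "-1":  # keep only pairs NIST judged previously
            pairs[(nist, turk)] += 1

total = sum(pairs.values())  # 3277 pairs
agreement = sum(n for (a, b), n in pairs.items() if a == b) / total
print(f"{total} pairs, raw agreement = {agreement:.1%}")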

I don't believe this low quality reflects any inherent limitation of
the MTurk platform or of the turkers' abilities; rather, it reflects
the need for fairly rigorous quality-control mechanisms, on top of
what the platform natively provides, to keep the few "rotten apples"
out there (i.e. spammers) from degrading quality. There are also some
challenges inherent to crowdsourcing that have to be managed. For
example, whereas NIST assessors judge webpages in a known,
security-patched browser, we have to render webpages as images to
protect workers from malicious attack pages. As a result, NIST
assessors and MTurk workers judge different representations of the
same content.

While we've been working on improving the quality of the judgments, a
series of compounding logistical complications has slowed our
progress. We have been jointly pursuing two things: collecting more
judgments and improving our quality-control mechanisms. On top of the
original 98K turker judgments we collected to produce the labels we've
distributed, we've since collected another 32K turker judgments,
restricting this round to US-only workers for quality. Nonetheless,
spammers are so far still degrading quality.

While I am confident we can improve the quality of the judgments, it
seems very unlikely we'll have something better by Monday. What we can
offer is to keep everyone posted on our progress; if you want to
revise your paper later on, we can host updated versions on the RF
Track website. The track overview paper we'll write is due in a month,
by which time we hope to have higher-quality judgments.

Thanks again for everyone's patience and understanding.
Matt Lease


judgments-turk.reordered.bz2