RF results sent

Ian Soboroff

Oct 15, 2009, 12:50:58 PM
to trec-r...@googlegroups.com

I have sent out the evaluations for the relevance feedback track.

They actually got sent twice... the latter mailing includes evaluations for
the *.base runs. Please disregard the earlier results that are missing that file.
Sorry for the confusion.

I hope to have the evaluation tools and relevance judgments files posted
to the TREC website later today.

Ian

Jonathan Elsas

Oct 15, 2009, 2:41:49 PM
to trec-r...@googlegroups.com
Ian -- A few questions about the evaluation data:
- Is the stAP calculated over only those documents not selected for
judgement for this run, or across all the phase 1 runs?
- Can you distribute summary statistics about the other groups' evaluations
(max/min/median evaluation scores, as well as the "Score" from the phase 1
set evaluation)?

Chris Buckley

Oct 15, 2009, 5:17:01 PM
to trec-r...@googlegroups.com
Jonathan,

Ian just forwarded my evaluation files to you (I'm the one who's
responsible for forgetting to put the base file evaluations in the
first batch), so here are a couple of answers. I'm writing up a fuller
description of how all the measures were calculated, which should
be out in a couple of days.

On Thu, Oct 15, 2009 at 2:41 PM, Jonathan Elsas <jel...@cs.cmu.edu> wrote:
>
> Ian -- A few questions about the evaluation data:
> - Is the stAP calculated over only those documents not selected for
> judgement for this run, or across all the phase 1 runs?

All evaluations are calculated over the residual collection for that run's
input only (no union of the phase 1 runs, at least for this primary
evaluation). That introduces some problems and solves others; I'll discuss
that later.
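
As a minimal sketch of what scoring over the residual collection means
(illustrative Python only, not the official statAP/emap tooling; the function
name and data layout are assumptions), the documents a run received phase 1
judgments for are dropped from both its phase 2 ranking and the relevance
judgments before a measure like average precision is computed:

def residual_average_precision(ranking, qrels, feedback_docs):
    """Average precision over the residual collection.

    ranking       -- phase 2 ranked list of doc ids for one topic
    qrels         -- dict doc_id -> relevance (0/1) for that topic
    feedback_docs -- doc ids this run received judgments for in phase 1

    Illustrative sketch only, not the official track evaluation code.
    """
    feedback_docs = set(feedback_docs)
    # Residual collection: drop every doc this run already saw in phase 1.
    residual_ranking = [d for d in ranking if d not in feedback_docs]
    residual_rel = {d for d, r in qrels.items()
                    if r > 0 and d not in feedback_docs}
    if not residual_rel:
        return 0.0
    hits, sum_prec = 0, 0.0
    for rank, doc in enumerate(residual_ranking, start=1):
        if doc in residual_rel:
            hits += 1
            sum_prec += hits / rank
    return sum_prec / len(residual_rel)

statAP itself estimates AP from sampled judgments rather than computing plain
AP over full qrels, but the residual filtering step is the same idea.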

> - Can you distribute summary statistics about the other groups' evaluations
> (max/min/median evaluation scores, as well as the "Score" from the phase 1
> set evaluation)?

I decided not to distribute summary statistics for the actual phase 2 runs; I
want folks concentrating on the goals of the track, which are about finding
good documents. Thus the comparison among the phase 1 sets that your group ran
on is the important issue: which docs were important for good retrieval, and
was it just the number of relevant docs that mattered for your runs?

If people really want them, I can prepare those phase 2 stats and make them
available, but they're only going to be approximations (I'm not going to break
them down by input set of docs), and really shouldn't be all that useful. Just
for general ballpark info:
emap (cat B) ranged from .0168 to .0536
statAP (cat B) ranged from .0434 to .2638
map (cat A) ranged from .0272 to .2414
P_10 (cat A) ranged from .0939 to .5082

I will prepare some summary stats for the phase 1 comparative "Score", at
least the per-query low, median, and high.
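
For anyone who wants to compute that themselves once the per-run "Score" files
are posted, here is a small sketch of a per-query low/median/high summary (the
scores_by_run layout below is a hypothetical example, not the actual file
format):

import statistics
from collections import defaultdict

def per_query_summary(scores_by_run):
    """Per-query (low, median, high) across runs.

    scores_by_run -- dict run_id -> dict query_id -> score
    (hypothetical layout, not the actual TREC file format)
    """
    by_query = defaultdict(list)
    for run_scores in scores_by_run.values():
        for qid, score in run_scores.items():
            by_query[qid].append(score)
    return {qid: (min(vals), statistics.median(vals), max(vals))
            for qid, vals in by_query.items()}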

Chris
