> I think that all metrics could be used to compare across systems if
> every metric used each system's entire run instead of that system's
> TPs from the previous layer.
That's probably true in theory.
> Why isn't each system's entire run used for all metrics?
IIRC, in practice the numbers were indistinguishable from zero: recall
was already so low that the filtering at each layer left too few TPs
to be statistically meaningful.
Instead of an error analysis, it became an anecdotal success analysis.
It's also important to keep in mind that the assessment task was very
hard and therefore had incomplete coverage, especially when compared
with the much better recall on the CCR task.
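
To make the shrinkage concrete, here is a rough sketch with made-up
numbers (these are NOT the actual evaluation figures; the item count,
the per-layer rates, and the function names are all hypothetical). It
shows how few TPs survive a few layers of filtering when the first
layer's recall is low, and how wide a 95% Wilson score interval is on a
count that small.

```python
import math


def cascade_counts(n_items, layer_rates):
    """Expected number of items surviving after each successive layer.

    layer_rates[i] is the fraction of the previous layer's survivors
    that layer i keeps (hypothetical values, for illustration only).
    """
    counts = []
    remaining = float(n_items)
    for rate in layer_rates:
        remaining *= rate
        counts.append(remaining)
    return counts


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return (max(0.0, center - half), min(1.0, center + half))


if __name__ == "__main__":
    # Hypothetical: 1000 true items, 5% recall at the first layer,
    # then two later layers that each keep half of what they receive.
    survivors = cascade_counts(1000, [0.05, 0.5, 0.5])
    print("survivors per layer:", survivors)

    # Hypothetical: 2 correct out of the ~12 items reaching the last layer.
    low, high = wilson_interval(2, 12)
    print("95%% interval on 2/12: %.3f to %.3f" % (low, high))
```

With counts that small, the interval on any downstream metric is wide
enough that differences between systems can't be told apart.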
jrf