KBAers,
A couple of people have pointed out that the truth data from 07-11 had fewer
entities that met the CCR query criterion than the 10-15 truth data. The
criterion is simply that there be at least 5 vitals in the time range of
the corpus. 20% gets used as training data, which in the minimal case is
just one of five.
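For anyone who wants the criterion spelled out, here is a minimal sketch in shell. The function names are hypothetical, and rounding the 20% split down is an assumption that merely agrees with the one-of-five minimal case; this is not the tooling that produced the truth data.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the CCR query criterion.
# MIN_VITALS comes from the criterion stated above (at least 5 vitals).
MIN_VITALS=5

# Succeeds (exit status 0) if an entity with $1 vitals in the corpus
# time range meets the criterion.
meets_ccr_criterion() {
    [ "$1" -ge "$MIN_VITALS" ]
}

# Prints how many vitals would fall in the training portion.
# ASSUMPTION: 20% rounded down, which yields 1 for the minimal case of 5.
training_vitals() {
    echo $(( $1 / 5 ))
}
```

With these definitions, `meets_ccr_criterion 4` fails, `meets_ccr_criterion 5` succeeds, and `training_vitals 5` prints 1.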
In the email linked below, I said "If the training_time_range_end is
``null``, then that entity is not a CCR query target." I should have said
something different, like "they are not scored yet, but use them anyway."
https://groups.google.com/d/msg/trec-kba/dQNWPxBLmfs/oxOavFh0aQQJ
Given that miscommunication, the official scoring for TREC this year will
only use the CCR entities that met that criterion in 07-11. Anyone using
the truth data for new systems in the future should consider using all the
entities.
Here are the counts of entities matching the criterion for the
two sets of truth data:
$ grep training_time trec-kba-2014-10-15-ccr-and-ssf-query-topics.json | grep -v null | wc -l
74
$ grep training_time trec-kba-2014-07-11-ccr-and-ssf-query-topics.json | grep -v null | wc -l
71
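To repeat the same count on another copy of the query topics file, the pipeline above can be wrapped in a small function. This is only a sketch: `count_ccr_targets` is a hypothetical name, and it relies on the same assumption as the commands above, namely that each entity contributes exactly one line containing `training_time`.

```shell
#!/bin/sh
# Hypothetical wrapper around the grep pipeline shown above.
# Prints the number of lines in file $1 that mention training_time
# and do not contain "null".
count_ccr_targets() {
    grep training_time "$1" | grep -v -c null
}

# Example usage (file names as in the commands above):
#   count_ccr_targets trec-kba-2014-10-15-ccr-and-ssf-query-topics.json
#   count_ccr_targets trec-kba-2014-07-11-ccr-and-ssf-query-topics.json
```

Going by the counts above, the 10-15 file has three more matching entities than the 07-11 file.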
jrf