Vincent,
Thanks for highlighting these issues.
> I believe that the filtered corpus has been built using the surface-form
> names of entities. Is it only the canonical name (from the yaml file), or
> did you also consider manually set variant names?
We considered name variants and also all the strings that appeared in the
slot fills.
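In rough terms, that means looking for any of those strings in each
document's text. Here is a minimal sketch in Python, where the YAML field
names (canonical_name, name_variants, slot_fills) are illustrative rather
than the exact schema:

    import yaml

    def load_surface_forms(entity_yaml_path):
        """Collect canonical names, manually set variants, and slot-fill
        strings for each query entity (field names are assumptions)."""
        surface_forms = set()
        with open(entity_yaml_path) as f:
            for entity in yaml.safe_load(f):
                surface_forms.add(entity["canonical_name"].lower())
                for variant in entity.get("name_variants", []):
                    surface_forms.add(variant.lower())
                for fill in entity.get("slot_fills", []):
                    surface_forms.add(fill.lower())
        return surface_forms

    def mentions_any(clean_visible_text, surface_forms):
        """True if a document's text contains any surface form."""
        text = clean_visible_text.lower()
        return any(name in text for name in surface_forms)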
There *are* documents in the larger corpus that probably refer to the
target entities. We had to drop some of them in our semi-principled
sampling (and hacking) to cut the corpus size down from 11TB to 639GB,
because we want people to have enough cycles (mental and computational)
to actually try to use the Serif tagging data, which is quite good.
> Is it worth using the full corpus to search with our own variant names
> ...? If so, what happens if some documents from the full corpus end up
> in my run?
You can include stream_ids from the full corpus in your run submissions.
We may filter them out before scoring.
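If you'd rather pre-filter your own runs to the official corpus, something
like this rough sketch would do it; the position of the stream_id column is
an assumption, so adjust it to the actual run format:

    def filter_run(run_path, official_ids_path, out_path, stream_id_col=2):
        """Keep only run rows whose stream_id is in the official corpus."""
        with open(official_ids_path) as f:
            official = {line.strip() for line in f if line.strip()}
        with open(run_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                if line.startswith("#"):  # pass comment/header lines through
                    fout.write(line)
                elif line.split()[stream_id_col] in official:
                    fout.write(line)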
> Are those documents going to be evaluated manually, or will they just
> be ignored?
They will be ignored in the second round of pooled assessing. I think we
have to do that in order to spend the assessor budget most fairly. If it
turns out that we have "extra" assessor time, we could consider adding them
in. Either way, we'd have to keep them separate from the official eval sets
when scoring all systems.
The kba-streamcorpus-2014-v0_3_0-kba-filtered corpus is the *official*
corpus for the CCR and SSF tasks.
For the Accelerate & Create task--- anything goes! Use the full 16TB
corpus, if you'd like :-)
> Last question on entities. There is a field external_profile in the yaml
> file. I guess we are allowed to use those web pages to build an entity
> profile from them. Could you maybe provide the html documents so that
> everybody starts with the same information? We never know whether a
> document is going to change depending on when you crawl it, don't you
> think?
See the two examples in the README.txt:
"... As with all of the KBA 2014 query entities, this entity is *defined*
by the documents the refer to it (rating=1 or rating=2) before the
training_time_range_end and also by its home page, which is provided as a
URL in the external_profile. A CCR system may only use the URLs in
external_profile, if it can ensure that it obeys the "no future info" rule
by accessing only versions of the content from before each date hour being
processed. This rule is straightforward to obey for the 27 entities with
Wikipedia URLs, and may also possible to obey using tools like the WayBack
Machine
http://archive.org/help/wayback_api.php
"This second example is a person entity that did not have enough vitals to
be used as a query target in Vital Filtering (CCR). However, his profile
contains many slot fills, so he is a good SSF query. For SSF, this entity
is defined by *all* of the truth data that mentions him, and also the
external profiles. It is valid for an SSF system to use all of the truth
data from before and after the cutoff and also external profiles. (If we
find that this is sufficient to make SSF "easy," then there will be much
rejoicing and we will make it harder next year :-)"
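On the WayBack Machine point in the first example: the availability API
takes a timestamp parameter, which makes the "no future info" rule
mechanical to check. A minimal sketch; the endpoint and JSON shape are the
real API, but the cutoff policy and the "2011-10-05-00" date-hour format
are our assumptions:

    import json
    from urllib.parse import quote
    from urllib.request import urlopen

    def snapshot_before(url, date_hour):
        """Return an archived copy of url captured at or before
        date_hour (e.g. "2011-10-05-00"), or None if none is known."""
        ts = date_hour.replace("-", "") + "0000"  # pad to YYYYMMDDhhmmss
        api = ("http://archive.org/wayback/available?url=%s&timestamp=%s"
               % (quote(url, safe=""), ts))
        with urlopen(api) as resp:
            closest = (json.load(resp)
                       .get("archived_snapshots", {}).get("closest"))
        # The API returns the *closest* snapshot, which can postdate the
        # requested timestamp; reject anything after the date hour.
        if (closest and closest.get("available")
                and closest["timestamp"] <= ts):
            return closest["url"]
        return None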
Does that answer your question?
John