Dear Ugo, Diego, #Microposts2014 NEEL Challenge participants,
> In my opinion it is still not clear what kind of annotations must be
> included in the output that all participants have to submit. Namely,
> must participants filter the results before submission using the
> taxonomy you have provided in the guidelines, or will you handle this
> task yourselves before running the evaluation scripts? Actually,
> (almost) all of the annotators that participants are using do not rely
> on that taxonomy, and most likely they will annotate tweets with DBpedia
> concepts that are not contained in it. This doesn't mean that the
> annotator is not working properly, but this challenge is focused on a
> reduced set of DBpedia concepts.
The taxonomy we have provided is not normative, hence you may follow it
for tuning your system, or simply discard it.
This is part of the challenge.
> And this is perfectly fine, but the issue is that you will evaluate both
> precision and recall, so it is important to remove all DBpedia URIs that
> are not contained in that taxonomy. Otherwise, even if the annotator has
> correctly found a relevant DBpedia URI, a URI that is not part of that
> taxonomy will be counted as a false positive, hence affecting the
> annotator's precision and in turn the overall F1 score.
> Could you please clarify which URIs should be included in the TSV files
> to be submitted?
The evaluation is based on checking whether each pair of i) entity
mention (surface form) and ii) URI matches the corresponding pair in the
gold standard (GS).
The evaluation will not compute *breakdown figures per category*; in
fact, the taxonomy will not be considered in the evaluation process.
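For example (a hypothetical pair, just for illustration): if the GS
contains the pair ("Obama", http://dbpedia.org/resource/Barack_Obama)
and your output contains the same mention and URI, it counts as a match
regardless of whether that URI falls under a concept of the taxonomy.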
Again, the taxonomy is given for two main reasons: i) for the sake of
completeness, and ii) for tuning your system (if your system needs it).
We expect a TAB-separated file, where each record (one per tweet)
consists of the following fields:
tweet_id
entity_mention_1
entity_uri_1
...
entity_mention_n
entity_uri_n
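As an illustration (hypothetical tweet id, mentions and URIs, assuming
the record layout above), a tweet mentioning two entities would yield a
single record such as:

123456789<TAB>Obama<TAB>http://dbpedia.org/resource/Barack_Obama<TAB>NASA<TAB>http://dbpedia.org/resource/NASA

where <TAB> denotes the tab character.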
> Additionally, I think you should clarify the metric you will use for the
> evaluation. Previously you stated that you will consider as a true
> positive only "spot-entity" pairs matching the ground truth. But how do
> you combine all the figures (true positives, false negatives and false
> positives)? Will you use micro or macro measures? I think that, since
> there is a significant proportion (about 30%) of tweets that do not
> contain any annotations at all, macro measures don't make sense and the
> only applicable ones are micro measures.
The evaluation will be based on micro-averaged measures, using the
following definitions:
- pair = (entity mention, URI)
- GS = set of pairs in the gold standard
- TS = set of pairs in the submitted annotation output
- TP = number of relevant pairs(*) in TS
- FN = number of pairs in GS not spotted in TS
- FP = number of irrelevant pairs(**) spotted in TS
(*) relevant pair = a pair in TS that matches a pair in GS. In the case
of multiple pairs in a tweet, only correctly sorted pairs will account
for a full score.
(**) irrelevant pair = a pair in TS that does not match any pair in GS
precision = TP / (TP+FP)
recall = TP / (TP+FN)
F1 = 2 * precision * recall / (precision + recall)
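As an illustration only (this is a sketch written for this email, not
the actual evaluation script; the function name and the example pairs
are made up), the figures above can be computed in Python from two sets
of (mention, URI) pairs as follows; in practice each pair would also be
keyed by its tweet_id:

# Minimal sketch of the micro-averaged evaluation defined above.
# A "pair" is a (mention, uri) tuple; tweet ids are omitted for brevity.
def micro_scores(gs_pairs, ts_pairs):
    tp = len(ts_pairs & gs_pairs)   # relevant pairs in TS
    fp = len(ts_pairs - gs_pairs)   # irrelevant pairs spotted in TS
    fn = len(gs_pairs - ts_pairs)   # GS pairs not spotted in TS
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical example: one correct pair, one wrong URI, one missed pair.
gs = {("Obama", "http://dbpedia.org/resource/Barack_Obama"),
      ("NASA", "http://dbpedia.org/resource/NASA")}
ts = {("Obama", "http://dbpedia.org/resource/Barack_Obama"),
      ("NASA", "http://dbpedia.org/resource/Nasa")}
print(micro_scores(gs, ts))  # -> (0.5, 0.5, 0.5)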
More information and examples about the evaluation process will follow
in an upcoming email.
#Microposts2014 Challenge crew