Clarifications about the annotation output to be submitted and evaluation metric


Ugo Scaiella

Feb 16, 2014, 7:16:13 PM
to micropo...@googlegroups.com
Dear Chairs,

I have already posted a similar question in another thread, but I think it may be useful to start a new one.

In my opinion it is still not clear what kind of annotations must be included in the output that all participants have to submit. Namely, must participants filter their results before submission using the taxonomy you have provided in the guidelines, or will you handle this filtering yourselves before running the evaluation scripts? (Almost) all of the annotators that participants are using do not rely on that taxonomy, so they will most likely annotate tweets with DBpedia concepts that are not contained in it. This does not mean that the annotator is not working well; it simply means that the challenge is focused on a reduced set of DBpedia concepts.

And this is perfectly fine, but the issue is that you will evaluate both precision and recall, so it is important to remove all DBpedia URIs that are not contained in that taxonomy. Otherwise, even if the annotator has correctly found a relevant DBpedia URI, when that URI is not part of the taxonomy the annotation will be counted as a false positive, lowering the annotator's precision and in turn the overall F1 score.
Could you please clarify what URIs should be included in the TSV files to be submitted?

In case you do not apply such a filter yourselves, i.e. participants have to filter out URIs that are not part of the taxonomy, I think you should clarify how to match a DBpedia URI against the taxonomy, because it is still not clear (see the related thread "Entities not matching Dbpedia taxonomy").

Additionally, I think you should clarify the metric you will use for the evaluation. You have previously stated that you will consider as true positives only "spot-entity" pairs matching the ground truth. But how do you combine the resulting figures (true positives, false negatives and false positives)? Will you use micro or macro measures? Since a significant share of the tweets (about 30%) contain no annotations at all, I think macro measures do not make sense and the only applicable ones are micro measures.
Could you please confirm/clarify this? 

Regards,
-- Ugo Scaiella

Diego Ceccarelli

Feb 17, 2014, 11:09:10 AM
to micropo...@googlegroups.com
Dear Chairs,

I agree with all the issues that Ugo pointed out in the previous mail.

It would be really useful to receive clarification about how the evaluation is performed (ideally, by having access to the evaluation framework itself).

The way performance is evaluated, and any differences between the KB used to generate the annotations and the KB used for the evaluation (in this case, the entities in the taxonomy), could seriously affect the measured performance of an annotation method.

It would also be really useful if you could provide a file containing all the entities in your taxonomy; a plain list of URIs, one per line, would be ideal.
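For example, a couple of (purely illustrative) lines of such a file could look like:

http://dbpedia.org/resource/Barack_Obama
http://dbpedia.org/resource/Super_Bowl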

Regards,
Diego



--
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 2984
Fax: +39 050 315 2040
________________________________________

#Microposts2014 Chairs

Feb 18, 2014, 10:25:32 AM
to micropo...@googlegroups.com
Dear Ugo, Diego, #Microposts2014 NEEL Challenge participants,

> In my opinion it is still not clear what kind of annotations must be
> included in the output that all participants have to submit. Namely,
> must participants filter their results before submission using the
> taxonomy you have provided in the guidelines, or will you handle this
> filtering yourselves before running the evaluation scripts? (Almost)
> all of the annotators that participants are using do not rely on that
> taxonomy, so they will most likely annotate tweets with DBpedia
> concepts that are not contained in it. This does not mean that the
> annotator is not working well; it simply means that the challenge is
> focused on a reduced set of DBpedia concepts.

The taxonomy we have provided is not normative; you may follow it for tuning your system, or simply disregard it. This is part of the challenge.

> And this is perfectly fine, but the issue is that you will evaluate
> both precision and recall, so it is important to remove all DBpedia
> URIs that are not contained in that taxonomy. Otherwise, even if the
> annotator has correctly found a relevant DBpedia URI, when that URI is
> not part of the taxonomy the annotation will be counted as a false
> positive, lowering the annotator's precision and in turn the overall
> F1 score.
> Could you please clarify what URIs should be included in the TSV files
> to be submitted?

The evaluation is based on checking whether the pair of i) entity mention (surface form) and ii) URI matches the corresponding pair in the GS.
The evaluation will not compute *breakdown figures per category*; in fact, the taxonomy will not be considered in the evaluation process at all.
Again, the taxonomy is given for two main reasons: i) for the sake of completeness, and ii) for tuning your system (if your system needs it).

We expect a TAB-separated file, where each record (i.e. each line, one per tweet) consists of:
tweet_id
entity_mention_1
entity_uri_1
...
entity_mention_n
entity_uri_n
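For illustration only (the tweet id and annotations below are made up, not taken from the GS), a record with two annotations would therefore be a single line such as:

123456789<TAB>Barack Obama<TAB>http://dbpedia.org/resource/Barack_Obama<TAB>Super Bowl<TAB>http://dbpedia.org/resource/Super_Bowl

where <TAB> stands for a literal tab character.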

> Additionally, I think you should clarify the metric you will use for
> the evaluation. You have previously stated that you will consider as
> true positives only "spot-entity" pairs matching the ground truth. But
> how do you combine the resulting figures (true positives, false
> negatives and false positives)? Will you use micro or macro measures?
> Since a significant share of the tweets (about 30%) contain no
> annotations at all, I think macro measures do not make sense and the
> only applicable ones are micro measures.

The evaluation will be based on micro-averaged measures, using the following definitions:
- pair = (entity mention, uri)
- GS = set of pairs in the gold standard
- TS = set of pairs in the submitted annotation output
- TP = number of relevant pairs(*) in TS
- FN = number of GS pairs not spotted in TS
- FP = number of irrelevant pairs(**) in TS
(*) relevant pair = a pair in TS that matches a pair in GS. In the case of multiple pairs in a tweet, only correctly sorted pairs will account for a full score.
(**) irrelevant pair = a pair in TS that does not match any pair in GS

precision = TP / (TP+FP)
recall = TP / (TP+FN)
F1 = 2 * precision * recall / (precision + recall)
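As a minimal illustrative sketch of the above (not the actual evaluation script, which also handles the ordering requirement in (*)), assuming GS and TS are represented as Python sets of (tweet_id, mention, uri) tuples:

# Illustrative sketch only, not the official evaluation script.
# gs and ts are assumed to be sets of (tweet_id, mention, uri) tuples;
# the ordering requirement in (*) above is not modelled here.
def micro_scores(gs, ts):
    tp = len(ts & gs)    # relevant pairs in TS
    fp = len(ts - gs)    # irrelevant pairs in TS
    fn = len(gs - ts)    # GS pairs not spotted in TS
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1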

More information and examples about the evaluation process will follow in an upcoming email.

#Microposts2014 Challenge crew

Ugo Scaiella

Feb 18, 2014, 11:17:39 AM
to micropo...@googlegroups.com
Dear Chairs,

Thanks for your clarifications about the evaluation process. I fully agree with it.

However, I think the issue related to the taxonomy is still not clear.
Say we have a tweet like this:

"barack obama watches the superbowl"

Now, suppose your taxonomy contains a category like "politicians" but no category for "sport" or "american football"... in that case the human annotators would not annotate 'superbowl', because it is not covered by the taxonomy.
However, a generic automatic annotation process would most likely tag the spot 'superbowl' with the relevant DBpedia URI.
If you do not filter such an annotation using the taxonomy, it will be counted as a false positive, hurting the annotator's precision (and hence the overall F1 measure) even though the annotator is working perfectly fine.
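To make the effect concrete with the definitions from your previous message: if the system returns the two pairs ('barack obama', dbpedia:Barack_Obama) and ('superbowl', dbpedia:Super_Bowl) while the GS contains only the first one, then TP = 1 and FP = 1, so precision = 1 / (1 + 1) = 0.5, even though both links are correct in themselves.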

Beyond that toy example, I can find several cases in the training set where the human annotation process did not identify relevant topics, and I suppose this happened simply because the missing topics are not part of the taxonomy.

91936326982696960       "Art is making something out of nothing and selling it." 'art' could be annotated...
92020065662271489       "Rt if u addicted to twitter.....honesty is the first step! Lol" 'twitter' could be annotated
92068856931155968       "JK Rowling's life story to be made into TV movie"     'TV movie' or 'movie' could be annotated
92100829905035264       "Yes it's the start of that singular season in Vancouver -- folk music festival time"   folk music or music festival could be annotated
and so on.

I think most of these topics were not annotated because they are not part of the referenced taxonomy, but it seems really unfair to penalize an annotator that has correctly tagged those spots.

So, I think that you should either
  (1) filter out from participant submissions those annotations that are not part of your taxonomy,
OR
  (2) provide a whitelist of DBpedia URIs that are part of the taxonomy, so that participants can filter their results before submission.
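In case of option (2), the filtering on the participant side would be trivial. As a rough sketch (the file name and data layout here are just hypothetical, since no whitelist exists yet), assuming annotations are held as (tweet_id, [(mention, uri), ...]) records:

# Rough sketch only; "taxonomy_uris.txt" is a hypothetical whitelist
# file with one DBpedia URI per line (no such file has been provided yet).
def filter_submission(records, whitelist_path="taxonomy_uris.txt"):
    with open(whitelist_path) as f:
        allowed = {line.strip() for line in f if line.strip()}
    filtered = []
    for tweet_id, pairs in records:  # pairs = [(mention, uri), ...]
        kept = [(mention, uri) for mention, uri in pairs if uri in allowed]
        filtered.append((tweet_id, kept))
    return filtered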

Regards,
-- Ugo Scaiella