Problems with entity annotation in training data set 1.2


Daniel Dahlmeier

Jan 27, 2014, 4:47:39 AM
to micropo...@googlegroups.com
Hi,

First of all, thanks again for organizing this task.

When I was trying to automatically match the annotations to their occurrences in the tweets, I encountered a number of entities that could not be matched. Some of the problems are simply due to the nature of tweets, such as spelling mistakes; others seem to be mistakes in the annotation.

Here is a list of problems I have encountered:

1. Typo in the entity mention in the tweet, while the annotation uses the correct spelling:
   Examples:
     92717757963059200 : @McIlroyRoy (tweet) vs @McllroyRoy (annotation)  (L instead of I)
     92640767331418113 : St.Clair (tweet) vs St. Clair (annotation)  (space is missing)

2. Typo in the annotation of the mention:
   Example:
     92791449166422017 : church (tweet) vs curch (annotation)

3. Annotation is accidentally repeated, but the entity appears only once in the tweet:
   Examples:
     94521161421041664 : "NASA Briefing To Preview Upcoming Mission To Jupiter:  Webmaster: http://bit.ly/pDA74c @dcottle"  (NASA is annotated twice)
     95182527685337089 : Thai tourist murder video 'claim': Thai police are asked to investigate a man named on a Youtube video as a poss... http://bbc.in/nyZtIG  (Thai police is annotated twice)

4. Differences in casing between tweet and annotation:
   Examples:
     95910369452769280 : Rep. David Wu resigns from Congress. Now he can wear the tiger suit on Chatroulette all he wants [NewsFlash]:   ... http://fk.cm/6420425  (chatroulette (annotation) vs Chatroulette (tweet))
     96190924983508992 : Thousands cancel their PayPal accounts in protest over treatment of WikiLeaks & its defenders. Join #OpPaypal and boycott the bastards.  (OpPaypal (tweet) vs OpPayPal (annotation))

5. Order of the annotations is wrong:
   Example:
     96314478047662080 : A marvellous curling shot by from outside the area by Thiago gave the keeper no chance. Barça 2 Bayern munich 0. #fcblive  (annotations: Thiago, Barca, keeper, Bayern munich)

6. Hashtags or mentions are included in the annotation:
   Examples:
     96665996416397313 : Breaking: The #Cardinals have completed their deal with the #Eagles to get QB Kevin Kolb http://t.co/zlYo69H  (annotations: The #Cardinals, the #Eagles)
     93059420296183808 : 27 years in jail, 5 years as President of #SouthAfrica, 93 years to prove there is no black and white. Happy Birthday Nelson Mandela!  (annotation: President of #South Africa)

7. Missing words, where the mention in the tweet is not the canonical form of the entity:
   Example:
     97081523962003456 : Namibia, Kenya, Ireland make it two in two: A round-up of the second day's matches in the ICC Under-19 World Cup ... http://es.pn/qEzszn  (ICC Under-19 World Cup (tweet) vs ICC Under-19 Cricket World Cup (annotation))
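
To make the categories above concrete, here is a minimal sketch of the kind of matching involved. It is only an illustration under my own assumptions, not the challenge's tooling: the function names (`strip_accents`, `squash`, `classify_match`) and the category labels are hypothetical. It tries an exact substring match first, then a case-insensitive one, then a comparison after removing accents, whitespace, and punctuation (including `#` and `@`).

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks so that 'Barça' compares equal to 'Barca'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def squash(text):
    """Lowercase, strip accents, and keep only ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]", "", strip_accents(text).lower())


def classify_match(tweet, mention):
    """Return a (hypothetical) label describing how `mention` matches `tweet`."""
    if mention in tweet:
        return "exact"
    if mention.lower() in tweet.lower():
        return "case"        # problem 4: casing differs
    if squash(mention) in squash(tweet):
        return "normalized"  # problems 1 (spacing), 5 (accents), 6 (hashtags)
    return "unmatched"       # genuine typos or missing words (problems 1, 2, 7)


tweet = "Join #OpPaypal and boycott the bastards."
print(classify_match(tweet, "OpPayPal"))  # case
```

Because the last step ignores word boundaries, it can produce false positives on short mentions, so anything labelled "normalized" or "unmatched" would still need a manual check.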




Could you please check which are considered mistakes in the annotation and update the data accordingly?

Thanks,

Daniel


MSM

Jan 28, 2014, 11:10:34 AM
to micropo...@googlegroups.com, Daniel Dahlmeier
Hi Daniel,

Thanks very much for your comments. We have tried to address them all in our new gold standard.

Many thanks,
#Microposts2014 Challenge crew