[twitter-api-announce] parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

Raffi Krikorian

May 13, 2010, 5:25:27 PM
to twitter-ap...@googlegroups.com, twitter-deve...@googlegroups.com
tweet text can mention other users and lists, and can contain URLs and hashtags -- in fact, something like 50% of tweets contain at least one of those.  developers who want to understand the tweet text have to parse it to extract those entities (which gets really hard when dealing with unicode characters) and then potentially make another REST call to resolve that data.  matt sanford (@mzsanford) on our internationalization team released the twitter-text library (http://github.com/mzsanford/twitter-text-rb) to help make parsing easier and more standardized (in fact, we use this library ourselves), but we on the Platform team wondered if we could make this even easier for our developers.

as part of our JSON and XML payloads, we are going to start supporting an entities attribute that will contain this parsed and structured data.  you'll see it like so:

{
 "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com #hot",
 ...
 "entities" : {
  "user_mentions" : [
    {
      "id" : 8285392,
      "screen_name" : "raffi",
      "indices" : [4, 9]
    },
    {
      "id" : 3191321,
      "screen_name" : "noradio",
      "indices" : [16, 23]
    }
  ],
  "urls" : [
    {
      "url" : "http://dev.twitter.com",
      "indices" : [38, 59]
    }
  ],
  "hashtags" : [
    {
      "text" : "#hot",
      "indices" : [61, 64]
    }
  ]
 }
 ...
}

or like so

<status>
  <text>hey @raffi tell @noradio to check out http://dev.twitter.com #hot</text>
  ...
  <entities>
    <user_mentions>
      <user_mention start="4" end="9">
        <id>8285392</id>
        <screen_name>raffi</screen_name>
      </user_mention>
      <user_mention start="16" end="23">
        <id>3191321</id>
        <screen_name>noradio</screen_name>
      </user_mention>
    </user_mentions>
    <urls>
      <url start="38" end="59">
        <url>http://dev.twitter.com</url>
      </url>
    </urls>
    <hashtags>
      <hashtag start="61" end="64">
        <text>#hot</text>
      </hashtag>
    </hashtags>
  </entities>
  ...
</status>

as shown above, we'll be parsing out all mentioned users, all lists, all included URLs, and all hashtags.  in the case of users, we'll provide you their user ID, and for hashtags we'll provide you the query you can run against the search API.  and, for all of them, we'll also tell you the character offsets at which the entity starts and stops -- that should really take the burden of parsing the text properly off you guys.
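
to make the start/stop offsets concrete, here is a minimal python sketch (not part of the twitter-text library or any official client) that reads the user_mentions entities from both payload shapes and slices the mention out of the tweet text.  the sample payloads are trimmed copies of the example above; the function names (span, mentions_from_json, mentions_from_xml) are hypothetical.  it assumes the second offset is inclusive, which is what the [4, 9] pair for "@raffi" suggests, and that offsets count unicode characters rather than bytes -- neither is confirmed here, so treat both as assumptions.

```python
import json
import xml.etree.ElementTree as ET

TEXT = "hey @raffi tell @noradio to check out http://dev.twitter.com #hot"

# Sample payloads trimmed to the user_mentions entities from the example.
JSON_PAYLOAD = json.dumps({
    "text": TEXT,
    "entities": {
        "user_mentions": [
            {"id": 8285392, "screen_name": "raffi", "indices": [4, 9]},
            {"id": 3191321, "screen_name": "noradio", "indices": [16, 23]},
        ]
    },
})

XML_PAYLOAD = """<status>
  <text>hey @raffi tell @noradio to check out http://dev.twitter.com #hot</text>
  <entities>
    <user_mentions>
      <user_mention start="4" end="9">
        <id>8285392</id>
        <screen_name>raffi</screen_name>
      </user_mention>
      <user_mention start="16" end="23">
        <id>3191321</id>
        <screen_name>noradio</screen_name>
      </user_mention>
    </user_mentions>
  </entities>
</status>"""


def span(text, start, end):
    """Return the entity text, assuming `end` is the last character (inclusive)."""
    return text[start:end + 1]


def mentions_from_json(raw):
    """Return (user id, screen name, matched text) for each mention in a JSON status."""
    status = json.loads(raw)
    text = status["text"]
    found = []
    for mention in status.get("entities", {}).get("user_mentions", []):
        start, end = mention["indices"]
        found.append((mention["id"], mention["screen_name"], span(text, start, end)))
    return found


def mentions_from_xml(raw):
    """Return the same tuples for the XML form, where offsets are attributes."""
    status = ET.fromstring(raw)
    text = status.findtext("text")
    found = []
    for mention in status.findall("./entities/user_mentions/user_mention"):
        start, end = int(mention.get("start")), int(mention.get("end"))
        found.append(
            (int(mention.findtext("id")), mention.findtext("screen_name"), span(text, start, end))
        )
    return found


if __name__ == "__main__":
    print(mentions_from_json(JSON_PAYLOAD))
    print(mentions_from_xml(XML_PAYLOAD))
```

the point of the sketch is that no regular expression touches the tweet text: the client only slices at the offsets it was given, so the hard unicode-aware matching stays on the server side.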

this entities block will probably be extended later, and these entities are just the start.  have we missed anything?  is there anything else you would like to see?  as always - just drop us a note, and look for these entities to start slowly rolling out.

--
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi

--
Twitter API documentation and resources: http://apiwiki.twitter.com
API updates via Twitter: http://twitter.com/twitterapi
Change your membership to this group: http://groups.google.com/group/twitter-api-announce?hl=en