parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)
Raffi Krikorian  
From: Raffi Krikorian <ra...@twitter.com>
Date: Thu, 13 May 2010 22:25:27 +0100
Local: Thurs, May 13 2010 5:25 pm
Subject: [twitter-api-announce] parsing out entities from tweets (a.k.a. parsing out hashtags is hard!)

tweet text can mention other users and lists, and can contain URLs and
hashtags -- in fact, something like 50% of tweets contain at least
one of those.  developers who want to understand the tweet text have to
parse the text to extract those entities (which gets really hard when
dealing with unicode characters) and then potentially have to make
another REST call to resolve that data.  matt sanford
(@mzsanford) on our internationalization team released the twitter-text
library (http://github.com/mzsanford/twitter-text-rb) to help make parsing
easier and more standardized (in fact, we use this library ourselves), but
we on the Platform team wondered if we could make this even easier for our
developers.
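
to illustrate why hand-rolled parsing is fragile, here is a minimal python sketch. the three regular expressions are hypothetical and deliberately naive; they are not the rules twitter-text applies. offsets are reported as start and inclusive end.

```python
import re

# Hypothetical, deliberately naive patterns (assumptions for illustration);
# these are NOT the rules the twitter-text library uses.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://[^\s]+")


def naive_entities(text):
    """Return (kind, matched_text, start, inclusive_end) tuples, sorted by start."""
    found = []
    patterns = (("mention", MENTION_RE), ("hashtag", HASHTAG_RE), ("url", URL_RE))
    for kind, pattern in patterns:
        for match in pattern.finditer(text):
            # match.end() is exclusive, so subtract 1 for an inclusive end offset
            found.append((kind, match.group(0), match.start(), match.end() - 1))
    return sorted(found, key=lambda entity: entity[2])


text = "hey @raffi tell @noradio to check out http://dev.twitter.com#hot"
entities = naive_entities(text)
for entity in entities:
    print(entity)
```

on this text the naive URL pattern swallows the trailing "#hot", and the same mention pattern would also fire on the "@example" inside an email address such as me@example.com. edge cases like these are what a shared library, or server-side parsing, is meant to absorb.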

as part of our JSON and XML payloads, we are going to start supporting an
entities attribute that will contain this parsed and structured data.
 you'll see it like so:

{
 "text" : "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
 ...
 "entities" : {
  "user_mentions" : [
    {
      "id" : 8285392,
      "screen_name" : "raffi",
      "indices" : [4, 9]
    },
    {
      "id" : 3191321,
      "screen_name" : "noradio",
      "indices" : [16, 23]
    }
  ],
  "urls" : [
    {
      "url" : "http://dev.twitter.com",
      "indices" : [38, 64]
    }
  ],
  "hashtags" : [
    {
      "text" : "#hot",
      "indices" : [66, 69],
      "url" : "http://search.twitter.com/search?q=%23hot"
    }
  ]
 }
 ...

}

or, in XML, like so:

<status>
  <text>hey @raffi tell @noradio to check out http://dev.twitter.com#hot</text>
  ...
  <entities>
    <user_mentions>
      <user_mention start="4" end="9">
        <id>8285392</id>
        <screen_name>raffi</screen_name>
      </user_mention>
      <user_mention start="16" end="23">
        <id>3191321</id>
        <screen_name>noradio</screen_name>
      </user_mention>
    </user_mentions>
    <urls>
      <url start="38" end="64">
        <url>http://dev.twitter.com</url>
      </url>
    </urls>
    <hashtags>
      <hashtag start="66" end="69">
        <text>#hot</text>
        <url>http://search.twitter.com/search?q=%23hot</url>
      </hashtag>
    </hashtags>
  </entities>
  ...
</status>
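
a minimal sketch of reading the XML form with python's standard xml.etree.ElementTree. the sample is a trimmed copy of the payload above with the elided "..." parts dropped so that it parses; the helper name user_mentions and the choice to normalize into the JSON-style shape are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML sample above; the elided "..." parts are left out
# so the document is well-formed.
SAMPLE = """<status>
  <text>hey @raffi tell @noradio to check out http://dev.twitter.com#hot</text>
  <entities>
    <user_mentions>
      <user_mention start="4" end="9">
        <id>8285392</id>
        <screen_name>raffi</screen_name>
      </user_mention>
      <user_mention start="16" end="23">
        <id>3191321</id>
        <screen_name>noradio</screen_name>
      </user_mention>
    </user_mentions>
  </entities>
</status>"""


def user_mentions(status_xml):
    """Hypothetical helper: return mentions in the same shape as the JSON payload."""
    root = ET.fromstring(status_xml)
    mentions = []
    for node in root.findall("./entities/user_mentions/user_mention"):
        mentions.append({
            "id": int(node.findtext("id")),
            "screen_name": node.findtext("screen_name"),
            # in XML the offsets are the start/end attributes on the element
            "indices": [int(node.get("start")), int(node.get("end"))],
        })
    return mentions


mentions = user_mentions(SAMPLE)
print(mentions)
```

normalizing both formats into one shape means the rest of a client does not need to care which payload it asked for.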

as shown above, we'll be parsing out all mentioned users, all lists, all
included URLs, and all hashtags.  in the case of users, we'll provide you
their user ID, and for hashtags we'll provide you the query you can run
against the search API.  and, for all of them, we'll also tell you the
character positions at which the entity starts and stops in the tweet
text -- that should really take the burden of parsing the text properly
off you guys.
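
a minimal python sketch of using those start and stop positions on the JSON form. it assumes the end index is inclusive, which is what the two user_mentions entries in the sample suggest ([4, 9] covers "@raffi"), and that positions count characters the same way python string indices do. the helper names and the profile link format are assumptions for illustration.

```python
import json

# Trimmed copy of the JSON sample above, user_mentions only.
PAYLOAD = json.loads("""
{
  "text": "hey @raffi tell @noradio to check out http://dev.twitter.com#hot",
  "entities": {
    "user_mentions": [
      {"id": 8285392, "screen_name": "raffi", "indices": [4, 9]},
      {"id": 3191321, "screen_name": "noradio", "indices": [16, 23]}
    ]
  }
}
""")


def entity_text(text, entity):
    """Slice the entity out of the text (assumption: end index is inclusive)."""
    start, end = entity["indices"]
    return text[start:end + 1]


def link_mentions(tweet):
    """Hypothetical helper: wrap each mention in a link, with no re-parsing."""
    text = tweet["text"]
    mentions = tweet["entities"].get("user_mentions", [])
    # Replace from right to left so earlier offsets stay valid after each edit.
    for mention in sorted(mentions, key=lambda m: m["indices"][0], reverse=True):
        start, end = mention["indices"]
        link = '<a href="http://twitter.com/%s">%s</a>' % (
            mention["screen_name"], text[start:end + 1])
        text = text[:start] + link + text[end + 1:]
    # Real code should also HTML-escape the surrounding text before rendering.
    return text


print(entity_text(PAYLOAD["text"], PAYLOAD["entities"]["user_mentions"][0]))
print(link_mentions(PAYLOAD))
```

working right to left is the main design choice here: inserting markup changes the length of the string, so going left to right would shift every later offset.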

this entities block will probably be extended later, and these entities are
just the start.  have we missed anything?  is there anything else you would
like to see?  as always - just drop us a note, and look for these entities
to start slowly rolling out.

--
Raffi Krikorian
Twitter Platform Team
http://twitter.com/raffi

--
Twitter API documentation and resources: http://apiwiki.twitter.com
API updates via Twitter: http://twitter.com/twitterapi
Change your membership to this group: http://groups.google.com/group/twitter-api-announce?hl=en


 