Tim McNamara
Apr 11, 2011, 12:00:24 AM
to nltk-...@googlegroups.com
Many people are interested in the real-time web. However, most people's models are not trained to deal with tweets. Here are some suggestions that should help parsers trained on 'normal' prose interpret tweets more effectively:
Remove hashtags and links at the end of a message. Hashtags at the end of a tweet are generally topic areas/categorisation, rather than content.
If a tweet begins with one or more @mentions, convert the tweet to a sentence like: <Author> tweeted to <Recipient>, "message content". If you process tweets in this way, they will be treated in a similar way to speech within prose, which is how they're being used.
Normalise "@person" to "Person" when it appears in the middle of a message. That way, your parser will tend to treat @person as a proper noun.
Remember that many tweets are part of a conversation. Use Twitter's API to pull the whole conversation down, so that your parser can interpret everything in context.
Remove emoticons. They're metadata. Your parser only knows how to deal with data.
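The text-only suggestions above (everything except fetching the conversation) could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the function name `normalise_tweet`, the helper `as_name`, the regular expressions, and the deliberately tiny emoticon set are all illustrative and are not part of NLTK or Twitter's API. Joining several recipients with "and" is also an assumption.

```python
import re

# Hypothetical patterns for illustration; real tweets will need broader coverage.
# A small set of Western-style emoticons that stand alone as a token.
EMOTICON = re.compile(r"(?<!\S)[:;=][-']?[)(DPp/\\|\]\[](?!\S)")
# A run of hashtags and/or links at the very end of the message.
TRAILING_TAGS_LINKS = re.compile(r"(?:\s*(?:#\w+|https?://\S+))+\s*$")
# One or more @mentions at the very start of the message.
LEADING_MENTIONS = re.compile(r"^((?:@\w+\s+)+)")
# A single @mention anywhere in the text.
MENTION = re.compile(r"@(\w+)")


def as_name(handle):
    """Turn 'person' into 'Person' without touching the other letters."""
    return handle[:1].upper() + handle[1:]


def normalise_tweet(author, text):
    """Rewrite a tweet so it reads more like ordinary prose."""
    # Remove emoticons first, so trailing hashtags are still found afterwards.
    text = EMOTICON.sub("", text)
    # Drop hashtags and links at the end of the message.
    text = TRAILING_TAGS_LINKS.sub("", text)
    # Collapse the whitespace left behind by the removals.
    text = " ".join(text.split())

    # Peel off leading @mentions; they are the recipients.
    recipients = []
    match = LEADING_MENTIONS.match(text)
    if match:
        recipients = [as_name(h) for h in MENTION.findall(match.group(1))]
        text = text[match.end():]

    # Mid-message @person becomes Person, so it looks like a proper noun.
    text = MENTION.sub(lambda m: as_name(m.group(1)), text)

    if recipients:
        # Wrap the message as reported speech.
        return '{} tweeted to {}, "{}"'.format(
            author, " and ".join(recipients), text
        )
    return text


print(normalise_tweet(
    "Tim", "@alice @bob have you tried nltk on tweets? :) #nlp http://example.com/x"
))
# prints: Tim tweeted to Alice and Bob, "have you tried nltk on tweets?"
```

Hashtags in the middle of a message are left alone, since only those at the end are being treated as categorisation.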
I have also been thinking of a few ways to handle retweeting. However, I'll work on their accuracy and let you know how I get on.
Tim
@timClicks