Twitter parsing, some tips

Tim McNamara

Apr 11, 2011, 12:00:24 AM
to nltk-...@googlegroups.com
Many people are interested in the real-time web. However, most people's models are not trained to deal with tweets. Here are some suggestions to help parsers trained on 'normal' prose interpret tweets more effectively:

Remove hashtags and links at the end of a message. Hashtags at the end of a tweet are generally topic areas/categorisation, rather than content.
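
As a rough illustration of the tip above, here is a minimal sketch that peels hashtags and links off the end of a tweet while leaving any that appear mid-sentence. The function name and the regular expression are assumptions made for this example, and the pattern only recognises plain `#word` tags and `http(s)://` links.

```python
import re

# Assumed pattern: one trailing hashtag or http(s) link, preceded by
# whitespace or the start of the string, with optional trailing whitespace.
_TRAILING = re.compile(r"(?:^|\s+)(?:#\w+|https?://\S+)\s*$")


def strip_trailing_tags_and_links(text):
    """Repeatedly remove hashtags/links from the end of the tweet."""
    previous = None
    while previous != text:
        previous = text
        text = _TRAILING.sub("", text)
    return text.strip()


tweet = "Loving the new #nltk release http://example.com #python #nlp"
print(strip_trailing_tags_and_links(tweet))
```

The hashtag in the middle of the message survives because the pattern is anchored to the end of the string.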

If a tweet begins with one or more @mentions, convert the tweet to a sentence like: <Author> tweeted to <Recipient>, "message content". If you process tweets in this way, they will be treated in a similar way to speech within prose, which is how they're being used.
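
One possible way to do this rewrite is sketched below. The helper name is hypothetical, the author's name is assumed to be supplied by the caller, and joining several recipients with "and" is my own choice rather than anything prescribed above.

```python
import re

# Assumed pattern: one or more @mentions at the very start of the tweet,
# each followed by whitespace.
_LEADING_MENTIONS = re.compile(r"^\s*((?:@\w+\s+)+)")


def as_reported_speech(author, text):
    """Rewrite '@a @b message' as 'Author tweeted to a and b, "message"'."""
    match = _LEADING_MENTIONS.match(text)
    if not match:
        return text  # no leading mentions, leave the tweet untouched
    recipients = [name.lstrip("@") for name in match.group(1).split()]
    message = text[match.end():].strip()
    return '{} tweeted to {}, "{}"'.format(
        author, " and ".join(recipients), message
    )


print(as_reported_speech("Tim", "@sander @nltk do you have code examples?"))
```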

Normalise "@person" to "Person" when it appears in the middle of a message. That way, your parser will tend to treat @person as a proper noun. 
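
A small sketch of this normalisation, assuming a hypothetical helper name and a simple regular expression. It does not check where in the message the mention sits, so it is meant to be run on the message body once any leading mentions have been dealt with. The lookbehind is there so that email addresses are left alone.

```python
import re

# Assumed pattern: an @mention that is not glued to a preceding word
# character, so "tim@example.com" is not treated as a mention.
_MENTION = re.compile(r"(?<![\w@])@(\w+)")


def normalise_mentions(text):
    """Turn '@person' into 'Person' so it reads like a proper noun."""
    def capitalise(match):
        name = match.group(1)
        return name[:1].upper() + name[1:]

    return _MENTION.sub(capitalise, text)


print(normalise_mentions("Great talk by @timClicks at the meetup"))
```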

Remember that many tweets are part of a conversation. Use Twitter's API to pull the whole conversation down, so that your parser can interpret each tweet in context.
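
Fetching the tweets is left out here, since it depends on the API client and credentials you use. As a sketch of the stitching step only, suppose the tweets have already been downloaded into a dict keyed by id, and each tweet is a dict with `id`, `text` and `in_reply_to_status_id` keys. That layout is an assumption for this example (modelled on the reply field in Twitter's tweet objects), so check it against the data you actually receive.

```python
def conversation_thread(tweet_id, tweets_by_id):
    """Walk reply links back to the root and return the thread oldest first.

    tweets_by_id is an assumed, already-downloaded mapping of id -> tweet dict.
    """
    thread = []
    seen = set()  # guards against accidental reply cycles in the data
    current = tweets_by_id.get(tweet_id)
    while current is not None and current["id"] not in seen:
        seen.add(current["id"])
        thread.append(current)
        current = tweets_by_id.get(current.get("in_reply_to_status_id"))
    thread.reverse()
    return thread


tweets = {
    1: {"id": 1, "in_reply_to_status_id": None, "text": "Anyone parsing tweets?"},
    2: {"id": 2, "in_reply_to_status_id": 1, "text": "@a Yes, with NLTK"},
    3: {"id": 3, "in_reply_to_status_id": 2, "text": "@b Any tips?"},
}
for tweet in conversation_thread(3, tweets):
    print(tweet["text"])
```

The walk stops as soon as a parent tweet is missing from the dict, so a partially downloaded conversation yields a partial thread rather than an error.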

Remove emoticons. They're metadata. Your parser only knows how to deal with data. 
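
For illustration, a minimal sketch of emoticon removal. The pattern below is an assumption that covers only a handful of common Western-style emoticons such as `:)`, `;-)`, `:-D`, `:P` and `<3`; a real list would need to be much longer.

```python
import re

# Assumed pattern: eyes, optional nose, mouth (or "<3"), standing alone
# between whitespace so that "3:1" or "http://" are not touched.
_EMOTICON = re.compile(r"(?<!\S)(?:[:;=][-']?[)(\]\[DPp/\\|]|<3)(?!\S)")


def remove_emoticons(text):
    """Drop stand-alone emoticons and tidy up the leftover whitespace."""
    without = _EMOTICON.sub("", text)
    return " ".join(without.split())


print(remove_emoticons("Just shipped the parser :) so happy :-D"))
```

If the emoticons are useful to you as metadata (for sentiment, say), you could collect the matches with `_EMOTICON.findall(text)` before stripping them.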


I have also been thinking of a few ways to handle retweeting. However, I'll work on their accuracy and let you know how I get on.

Tim
@timClicks

Sander Stepanov

May 13, 2015, 5:41:31 AM
to nltk-...@googlegroups.com
Do you have code examples?