A quick update on my work parsing and filtering the TtT-tagged tweets
I've been grabbing from Twitter.
Identifying duplicates - I've been using the MD5 hash of appropriate
text strings to act as a unique identifier for that string and one
that can be used as a key in a database. I've tried three approaches
so far:
1. hash of the full tweet text
2. hash of the hash tags in the tweet
3. hash of the keywords in the tweet
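As a rough sketch of approaches 1 and 2 (not my exact code): the
function names, the lower-casing and the space-joined sorted tag list
are assumptions made for illustration.

```python
import hashlib
import re

def md5_hex(text):
    """Return the MD5 hex digest of a text string (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def full_text_hash(tweet):
    # Approach 1: hash the whole tweet text, so only exact copies match.
    return md5_hex(tweet)

def hashtag_hash(tweet):
    # Approach 2: hash only the hash tags. Lower-casing, de-duplicating
    # and sorting are assumptions so that tag order and case are ignored.
    tags = sorted(set(t.lower() for t in re.findall(r"#(\w+)", tweet)))
    return md5_hex(" ".join(tags))
```

Two tweets with different wording but the same tags get the same
hashtag hash and different full text hashes.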
#1 lets you spot exact duplicates, but it's quite a coarse filter as a
result: any change to the text, however small, gives a different hash.
#2 is a bit too promiscuous, as there are many tweets which use the
same collections of tags (though this will be a useful way for us to
see which tag combinations were most popular). #3 does seem to be
doing a reasonable job of identifying real duplicates, but it's still
not perfect. Here's some more info:
On the assumption that the content of the tweet is what makes it
unique, I'm trying to boil the tweet down to its key parts. I take the
tweet and remove hash tags and @names, take out any 'non-informative'
words using an English stop word list (which removes things like the,
at, and, on, etc.), and ignore 'words' that are composed only of
non-alphanumeric characters. This gives me a list of keywords, which
I sort alphabetically, and I then create an MD5 hash of that text
string.
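The steps above could be sketched roughly like this. It is a minimal
illustration, not the code running on the site: the tiny STOP_WORDS
set, the whitespace tokenising and the punctuation stripping are all
assumptions.

```python
import hashlib
import re

# Tiny illustrative stop word list (an assumption); a real English
# stop word list is much longer.
STOP_WORDS = {"the", "at", "and", "on", "a", "an", "for", "with",
              "or", "of", "to", "we", "this", "have", "in", "is"}

def keywords(tweet):
    """Boil a tweet down to a sorted list of keywords."""
    words = []
    for token in tweet.lower().split():
        # Drop hash tags and @names.
        if token.startswith("#") or token.startswith("@"):
            continue
        # Strip leading/trailing punctuation; a token made up only of
        # non-alphanumeric characters becomes empty and is ignored.
        token = re.sub(r"^[^a-z0-9]+|[^a-z0-9]+$", "", token)
        if not token or token in STOP_WORDS:
            continue
        words.append(token)
    return sorted(words)

def keyword_hash(tweet):
    # Hash the sorted keywords joined into one text string.
    text = " ".join(keywords(tweet))
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Because the keywords are sorted before hashing, two tweets that use
the same keywords in a different order end up with the same hash.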
I store this MD5 hash in the db and I can then identify if other
tweets have the same set of keywords somewhere in the text (in any
order/position).
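A minimal sketch of that lookup, assuming an SQLite table (the table
name, column names and tweet ids here are hypothetical, and my actual
db layout differs):

```python
import sqlite3

def make_db():
    # In-memory database purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweets (tweet_id TEXT PRIMARY KEY, keyword_hash TEXT)"
    )
    # Index the hash column so duplicate lookups stay fast.
    conn.execute("CREATE INDEX idx_keyword_hash ON tweets (keyword_hash)")
    return conn

def store(conn, tweet_id, keyword_hash):
    conn.execute(
        "INSERT INTO tweets (tweet_id, keyword_hash) VALUES (?, ?)",
        (tweet_id, keyword_hash),
    )

def other_copies(conn, tweet_id, keyword_hash):
    """Count the other tweets in the db that share this keyword hash."""
    row = conn.execute(
        "SELECT COUNT(*) FROM tweets WHERE keyword_hash = ? AND tweet_id != ?",
        (keyword_hash, tweet_id),
    ).fetchone()
    return row[0]
```

The count returned by other_copies is the kind of figure that could be
reported as "N other copies in db".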
I've updated the tweetneed.org site to show these hash values, so you
can see how well each approach identifies duplicate tweets. You will
see a section like this:
Keyword Hash: 26152e348728b50745ecdaac57150ebb 2 other copies in db
Full text Hash: 822d5b872a7454b2fd56ad07f7055dcf 0 other full text
copies in db
Hash tag Hash: cfa28ce09bf62402f169c2d234c42c9d 1 other tweets with
these hashtags in db
And you could link to the 2 other tweets that it thinks are dups based
on keywords or the other tweet that has identical hash tags, etc.
This isn't a perfect solution, but it does give a reasonable
indication of whether something has been seen before. I'm applying a
similar approach to parse things out of the various sections of tweets
identified by the hash tags, to get down to the key info and get rid of
the less useful words and phrases that creep in. You can see how this
is working by looking at the RDF triples below each tweet - here's a
recent example:
Original Tweet:
seaswells: rt @benatdap: we still #need help with data entry/online
research for #haiti reliefoversight.org! have a few hrs today or this
week? #volunteer
Tags: haiti need volunteer
Keyword Hash: 26152e348728b50745ecdaac57150ebb 2 other copies in db
Full text Hash: 822d5b872a7454b2fd56ad07f7055dcf 0 other full text
copies in db
Hash tag Hash: cfa28ce09bf62402f169c2d234c42c9d 1 other tweets with
these hashtags in db
IsRetweet: RETWEET
<tweet_8595395189> <tn:tweet_url> "http://twitter.com/seaswells/status/8595395189" .
<tweet:8595395189> <tn:tweet_id> "8595395189" .
<tweet:8595395189> <tn:from_user> "seaswells" .
<tweet:8595395189> <tn:to_user_id> "" .
<tweet:8595395189> <tn:created_at> "2010-02-03 17:14:40 UTC" .
<tweet:8595395189> <tn:is_retweet> "true" .
<tweet:8595395189> <tn:has_need> "help" .
<tweet:8595395189> <tn:has_need> "data" .
<tweet:8595395189> <tn:has_need> "entry" .
<tweet:8595395189> <tn:has_need> "online" .
<tweet:8595395189> <tn:has_need> "research" .
<tweet:8595395189> <tn:has_hash> "need" .
<tweet:8595395189> <tn:has_hash> "volunteer" .
<tweet:8595395189> <tn:has_haiti> "reliefoversight.org! have a few hrs today or this week?" .
<tweet:8595395189> <tn:has_hash> "haiti" .
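For anyone wanting to produce lines in the same shape, here is a small
sketch. The helper names are hypothetical and the escaping of quotes
and backslashes is an assumption; it simply mirrors the layout of the
triples shown above.

```python
def triple(tweet_id, predicate, value):
    """Format one triple line in the <tweet:ID> <tn:predicate> "value" . shape."""
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '<tweet:%s> <tn:%s> "%s" .' % (tweet_id, predicate, escaped)

def need_triples(tweet_id, need_words):
    # One has_need triple per keyword pulled out of the #need section.
    return [triple(tweet_id, "has_need", word) for word in need_words]
```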
I am also looking at using stemming (from NLP) to make this keyword
generation a bit less susceptible to variations in the text. A
dictionary of common Twitter abbreviations for words of interest might
be handy too - one might not exist yet, but it could perhaps be
generated by some manual reviews of the library of Haiti-related
tweets that are being collected by various people.
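To give a feel for the idea, here is a deliberately crude sketch. The
suffix stripping is only a stand-in for a real stemmer (such as the
Porter stemmer) and will get plenty of words wrong, and the
abbreviation entries are hypothetical examples rather than a reviewed
list.

```python
# Hypothetical abbreviation dictionary; a real one would come from
# manually reviewing collected tweets.
ABBREVIATIONS = {"hrs": "hours", "pls": "please", "plz": "please",
                 "ppl": "people"}

# Crude suffix list, checked in order. Not a proper stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalize(word):
    """Lower-case a word, expand known abbreviations, then stem it."""
    word = word.lower()
    word = ABBREVIATIONS.get(word, word)
    return crude_stem(word)
```

With this, "needed" and "needs" both reduce to "need", and "hrs" and
"hours" both reduce to "hour", so tweets differing only in those forms
would produce the same keyword list.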
Any thoughts or feedback most welcome!
Simon.