Proposal / Design for "track" data collector


M. Edward (Ed) Borasky

Jan 29, 2010, 3:50:11 AM
I'm looking at the Twitter Streaming API at the moment. I've got the
basic Ruby code running, and I wanted to put forward a "proposal /
design" and see if anyone has an interest. There are pieces of this I
know how to do, and other pieces where I'd need some help. The
Streaming API documentation is at Here's the
relevant description of how the "track" filtering works:


'Specifies keywords to track. Keywords are specified by a comma
separated list. Queries are subject to Track Limitations, described in
Track Limiting and subject to access roles, described in the
statuses/filter method. Track keywords are case-insensitive logical
ORs. Terms are exact-matched, and also exact-matched ignoring
punctuation. Phrases, keywords with spaces, are not supported.
Keywords containing punctuation will only exact match tokens. Some
UTF-8 keywords will not match correctly- this is a known temporary

'Track examples: The keyword Twitter will match all public statuses
with the following comma delimited tokens in their text field:
TWITTER, twitter, "Twitter", twitter., #twitter and @twitter. The
following tokens will not be matched: TwitterTracker and, The phrase, excluding quotes, "hard alee"
won't match anything. The keyword "helm's-alee" will match helm's-alee
but not #helm's-alee.

'Values: Strings separated by commas. Each string must be between 1
and 30 bytes, inclusive.

'Methods: statuses/filter

'Example: Create a file called 'tracking' that contains, exactly and
excluding the quotation marks:
"track=basketball,football,baseball,footy,soccer". Execute: curl -d
-uAnyTwitterUser:Password.You will receive JSON updates about various
crucial sportsball topics and events.'
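
To make the quoted matching rules concrete, here is a minimal Ruby sketch that approximates them. It is only a reading of the description above, not Twitter's actual matcher: the method name `track_match?` is made up for illustration, and splitting the tweet text on whitespace is an assumption about what the documentation means by "tokens".

```ruby
# Rough approximation of the documented "track" rules.
# Assumption: a "token" is a whitespace-delimited chunk of the tweet text.
NON_ALNUM = /[^[:alnum:]]/

# Returns true if any keyword matches any token in text (logical OR).
def track_match?(text, keywords)
  tokens = text.downcase.split
  keywords.any? do |keyword|
    kw = keyword.downcase
    if kw =~ NON_ALNUM
      # Keywords containing punctuation (or spaces) only exact-match tokens.
      tokens.include?(kw)
    else
      # Plain keywords match exactly, or after stripping punctuation
      # from the token ("#twitter", "twitter.", "@twitter").
      tokens.any? { |t| t == kw || t.gsub(NON_ALNUM, '') == kw }
    end
  end
end
```

With this sketch, `track_match?('I love #twitter', ['Twitter'])` is true, while `TwitterTracker` and `#helm's-alee` are not matched by `Twitter` and `helm's-alee` respectively, which agrees with the examples in the documentation. Edge cases the documentation does not spell out may behave differently on the real service.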

How much data will this return? That will depend on the keywords we
choose. We can have up to 200 with the default public access. As the
document notes, tweets matching the logical OR of the keywords will be
returned, and the keyword 'haiti' will match the hashtag '#haiti',
etc. If our keywords are selected carefully, we should get *all*
matches from the public timeline! To quote the document:

"Reasonably focused track predicates will return all occurrences in
the full Firehose stream of public statuses. Overly broad track
predicates will cause the output to be periodically limited."
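
Before sending a keyword list, it may help to check it against the limits quoted above. The following Ruby sketch builds the `track=` line under those limits (1 to 30 bytes per keyword, no phrases, at most 200 keywords with default access). The method name `build_track_param` and the constants are hypothetical, and the sketch does not URL-encode, so non-ASCII keywords would need extra handling.

```ruby
MAX_KEYWORDS = 200       # default public access limit mentioned above
MAX_KEYWORD_BYTES = 30   # from the documented "Values" rule

# Validates keywords and returns the "track=..." request body.
def build_track_param(keywords)
  cleaned = keywords.map(&:strip).uniq
  if cleaned.size > MAX_KEYWORDS
    raise ArgumentError, "too many keywords (#{cleaned.size})"
  end
  cleaned.each do |kw|
    unless (1..MAX_KEYWORD_BYTES).cover?(kw.bytesize)
      raise ArgumentError, "keyword must be 1-30 bytes: #{kw.inspect}"
    end
    # Phrases (keywords with spaces) are not supported by track.
    raise ArgumentError, "phrases are not supported: #{kw.inspect}" if kw =~ /\s/
    # A comma would be read as a separator between two keywords.
    raise ArgumentError, "comma inside keyword: #{kw.inspect}" if kw.include?(',')
  end
  "track=#{cleaned.join(',')}"
end
```

For example, `build_track_param(%w[haiti earthquake])` returns `"track=haiti,earthquake"`, and a multi-word city name is rejected up front rather than silently matching nothing.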

So here's the proposal:

1. The first step is to decide whether we want to do this, and what
keywords we should use. Certainly "haiti" would be one, but I believe
there are others - major cities, at least the ones with single-word
names, for example.

2. The recommended architecture in the documentation consists of three
processes. One simply monitors the raw stream and inserts the tweets
into a queue. The second process parses the messages coming out of the
queue and stores them in some persistent data store. And the third
performs any downstream filtering and analysis. If we decide to do
this, I can code the initial tweet collection and queuing software
given the keywords, and determine the volume we are getting and
whether we would need to request elevated access.

3. Once we know the volume, and see what some of the messages look
like, we can decide if it's worth building the rest of the tool set.
That would involve a persistence mechanism. For a variety of reasons,
mostly familiarity, I prefer PostgreSQL, but if the volume is
sufficient, I could be persuaded to let someone else implement
persistence in something a little more "modern", like a key-value or
"big table" database. If I were to build it myself today it would just
about have to be PostgreSQL because that's all I know.

4. Given collection / queuing and persistence, the third piece of the
tool set is wrapping the preceding pieces as a service that can be
polled by other components, or as a feed, or as a POST to a URL. If
it's just a PostgreSQL database, it's certainly possible to just make
the server publicly accessible, but that would require a DBA to be


1. Should we do this? I think we should, because the Search API is
more highly filtered by Twitter than Streaming. If we try to do this
using the Search API, we will be losing tweets.

2. If we do decide to do this, what do we want to use for persistence
and what do we want to deliver to whom?

I'm going to experiment with just the single keyword "Haiti" in the
next day or so, just to see what the volume is. If anyone has any
other keywords they want to throw into this, send me an email and I'll
add them.
M. Edward (Ed) Borasky

"I've always regarded nature as the clothing of God." ~Alan Hovhaness

M. Edward (Ed) Borasky

Jan 29, 2010, 3:52:34 PM
I have it running and it's been collecting data since "Fri Jan 29
09:18:21 +0000 2010". Last timestamp I checked was "Fri Jan 29
20:43:36 +0000 2010", so I have about 11 hours 25 minutes of tweets. The tweet
count for that period is 42,266.
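
The arrival rate implied by those two timestamps and the count can be worked out directly. This is a small Ruby sketch; the method name `arrival_rate` and the format constant are made up here, and the format string is an assumption based on how the two timestamps above are written.

```ruby
require 'time'

# Assumed layout of the timestamps quoted above.
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

# Returns elapsed seconds and average tweets per hour and per minute.
def arrival_rate(first_stamp, last_stamp, tweet_count)
  first = Time.strptime(first_stamp, TWITTER_TIME_FORMAT)
  last  = Time.strptime(last_stamp, TWITTER_TIME_FORMAT)
  elapsed = last - first   # seconds, as a Float
  {
    seconds: elapsed,
    per_hour: tweet_count / elapsed * 3600,
    per_minute: tweet_count / elapsed * 60
  }
end

rate = arrival_rate('Fri Jan 29 09:18:21 +0000 2010',
                    'Fri Jan 29 20:43:36 +0000 2010',
                    42_266)
puts format('%.2f hours, %.0f tweets/hour, %.1f tweets/minute',
            rate[:seconds] / 3600, rate[:per_hour], rate[:per_minute])
```

The window is 41,115 seconds (11 h 25 min 15 s, about 11.4 hours), which works out to roughly 3,700 tweets per hour, or about 62 per minute, for the single keyword.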

Right now I'm just collecting the tweets and dumping them in YAML, so
I have some sample data. I'm going to shut it down for the moment -
plenty of other people are tracking Haiti Twitter data.

I'll probably write another script that just monitors the tweet
arrival rate, and if I can find a way to serve that up on the web,
I'll post it for the world to watch. Any Sinatra gurus out there want
to give me a quick tutorial? It's too simple for a full Rails app. ;-)
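
A rate monitor like the one described could start from a sliding-window counter. The Ruby class below is only a sketch of that idea, not the script mentioned above: `ArrivalRateMonitor` is a made-up name, and it assumes arrivals are recorded in time order. Serving the number over the web is left out.

```ruby
# Counts arrivals inside a sliding time window and reports a per-minute rate.
class ArrivalRateMonitor
  def initialize(window_seconds = 60)
    @window = window_seconds
    @arrivals = []
  end

  # Call once per tweet received; `at` defaults to the current time.
  def record(at = Time.now)
    @arrivals << at
    prune(at)
  end

  # Average arrivals per minute over the window ending at `now`.
  def per_minute(now = Time.now)
    prune(now)
    @arrivals.size * 60.0 / @window
  end

  private

  # Drop arrivals that have fallen out of the window.
  def prune(now)
    cutoff = now - @window
    @arrivals.shift while @arrivals.first && @arrivals.first <= cutoff
  end
end
```

The collector would call `record` for each tweet, and whatever serves the page would call `per_minute` on request. A longer window gives a smoother number at the cost of reacting more slowly to bursts.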

Sam Churchill

Jan 29, 2010, 4:44:46 PM
Just wanted to say how much I am enjoying your ideas (even if I don't
understand them). Keep 'em coming, Ed.
Sam Churchill

Jan 29, 2010, 6:40:46 PM
to Sam Churchill
Ditto, Sam. This could be a course on this stuff.