Wow, thanks for the great response, Fenn. I'll try to build a basic
implementation today! One more brief question, though: how does the PEAR
lib System_Daemon fit into all this? Does that mean I can spawn the daemon
process from within a standard PHP script, or have I missed the point
there as well?
Again, thanks so much for the useful and prompt reply!
Cheers,
Chris
On Jun 9, 6:17 pm, Fenn Bailey <fenn.bai...@gmail.com> wrote:
> Hey Chris,
>
> I'll have a go at answering your question - you're definitely on the right
> track.
>
> First off - you're absolutely right: simply having a cron task that does a
> few REST requests seems like the most logical and simple way to acquire a
> few tweets matching some criteria, and it is.
>
> However, for a bunch of non-obvious reasons, the Streaming API is the way
> forward. It's a little trickier to use at first, but ultimately more
> powerful and scalable for everyone.
>
> The first thing to do is understand "decoupled collection and processing".
> This matters because of the spiky (and growing) nature of Twitter traffic.
>
> The intuitive thing to do is to connect to the stream and just decode/insert
> your tweets into the database as they arrive. This is fine if you're getting
> 1-2 tweets per second, but what about 10, 20 or 200 tweets per second (which
> can easily happen with Twitter)? Also, what happens when your database has
> 100 million tweets in it? I can guarantee you 99% of MySQL databases can't
> sustain even 20 inserts per second once they have 100 million rows in them.
>
> The problem with this is that your streaming connection can become
> "backed up" by the volume: the tweets queue up, your client lags further
> and further behind, and eventually the Streaming API will disconnect (and
> potentially, eventually, ban) you.
>
> So, what's the answer to all this? Well, the secret is to *decouple* the
> collection from the processing. Processing can be slow, but collection
> (i.e. just receiving/storing the tweets) should remain fast.
>
> That's why it's recommended to queue them to something like a flat file,
> which doesn't slow down as it grows. Once you have tweets being collected,
> you can switch back to the old approach of a cron task that runs every X
> minutes and consumes the tweets (off the file), just like you would with
> the REST interface.
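>
> To give you an idea, the collection side can be as simple as this. It's a
> rough, untested sketch - Phirehose hands each raw status to an
> enqueueStatus() callback that you define, and the queue file path here is
> just an example:
>
> <?php
> require_once('Phirehose.php');
>
> /**
>  * Minimal collector: append each raw status to a flat file and do
>  * nothing else, so the stream never backs up.
>  */
> class QueueingCollector extends Phirehose
> {
>     public function enqueueStatus($status)
>     {
>         // $status is the raw JSON for one tweet - no decoding, no
>         // database work, just get it onto disk as fast as possible
>         file_put_contents('/tmp/tweet-queue.txt', $status . "\n",
>             FILE_APPEND | LOCK_EX);
>     }
> }
>
> $collector = new QueueingCollector('username', 'password', Phirehose::METHOD_FILTER);
> $collector->setTrack(array('keyword1', 'keyword2'));
> $collector->consume();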
>
> So, in summary:
>
> 1. Set up a process (using something like Phirehose) to *collect* tweets,
> store them somewhere that won't slow down, and do *nothing else* (this
> part has to be fast).
> 2. Set up a separate script (cron task, daemon, whatever) to process the
> tweets from the file into your database (this part can be as slow as it
> needs to be).
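>
> The processing side might then look something like this - again, just a
> sketch: I'm assuming the queue file from the collector above and a simple
> tweet table, so adjust the schema and credentials to whatever you're using.
> It renames the queue file first, so the collector keeps appending to a
> fresh file while the batch is worked through:
>
> <?php
> $queue = '/tmp/tweet-queue.txt';
> $batch = $queue . '.processing';
>
> // Nothing queued (or a previous run is mid-batch)? Bail out quietly.
> if (!file_exists($queue) || !rename($queue, $batch)) {
>     exit;
> }
>
> $db = new PDO('mysql:host=localhost;dbname=twitter', 'user', 'pass');
> $insert = $db->prepare('INSERT INTO tweet (status_id, text) VALUES (?, ?)');
>
> foreach (file($batch, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
>     $tweet = json_decode($line, true); // one raw status per line
>     if ($tweet !== null && isset($tweet['id'], $tweet['text'])) {
>         $insert->execute(array($tweet['id'], $tweet['text']));
>     }
> }
>
> unlink($batch); // batch fully processed, safe to remove
>
> This part can run as slowly as it likes - if it falls behind for a while,
> the only thing that grows is the file, not your connection lag.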