Quite new to this daemon concept and would love some help


Chris Bull

Jun 9, 2010, 1:12:37 AM
to Phirehose Users
Hey guys,

I'm a PHP developer building an app that will take tweets from
Twitter and display them in our app. I thought this would be easy
with a simple REST call and a cron job... then discovered that
approach is being deprecated for this purpose and replaced with the
Streaming API. I understand what I have to do, but it's the specifics
that are really confusing me.

With our app we will be monitoring our Twitter account and then
publishing any direct replies to it. As I understand it, this means I
should use Phirehose running as a daemon to collect our tweets. It's
what happens from there that confuses me...

I'm building the app in the SilverStripe (nee Sapphire) PHP framework
and want each tweet entered into the database. Where I'm getting
confused is something I read yesterday suggesting you either write
the tweets to a file (which sounds annoying and slow) or use a
memory-based database such as Redis. The Redis route seems sensible
if rather confusing... is there an easy way to use Phirehose to add
records to a MySQL table? Am I doing it wrong? Have I misunderstood
the implications of the Streaming API for PHP? If I do use something
like Redis, will I be able to migrate the data from Redis to MySQL?
And if I can just use MySQL, how do I deal with the problem of two
PHP processes writing to the database at the same time?

Thanks in advance, I really appreciate any advice you guys can provide
to help me better understand this.

Cheers,
Chris

Fenn Bailey

Jun 9, 2010, 2:17:18 AM
to phireho...@googlegroups.com
Hey Chris,

I'll have a go at answering your question - You're definitely on the right track.

First off - you're absolutely right: a cron task that makes a few REST requests seems like the most logical and simple way to acquire tweets matching some criteria, and it is.

However, for a bunch of non-obvious reasons, the Streaming API is the way forward. It's a little trickier to use at first, but ultimately more powerful and scalable for everyone.

The first thing to do is understand "decoupled collection and processing". This matters because of the spiky (and growing) nature of Twitter traffic.

The intuitive thing to do is to connect to the stream and just decode/insert your tweets into the database. This is fine if you're getting 1-2 tweets per second, but what about 10, 20 or 200 tweets per second? (All of which can easily happen with Twitter.) Also, what happens when your database has 100 million tweets in it? I can guarantee that 99% of MySQL databases can't insert even 20 tweets per second once they're holding 100 million rows.

The problem is that your streaming connection can become "backed up" by the volume: the tweets queue up, your client lags behind, and eventually the Streaming API will disconnect (and potentially, eventually, ban) you.

So, what's the answer to all this? Well, the secret is to decouple the collection from the processing. Processing can be slow, but collection (ie: just receiving and storing the tweets) must stay fast.

That's why it's recommended to queue them to something like a file, which doesn't slow down as it grows. Once you have tweets being collected, you can switch back to the old approach: a cron task that runs every X minutes and consumes the tweets (off the file) just as it would from a REST interface.
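As a very rough sketch of the collection side (the class name, account credentials, queue path and track keyword below are all just placeholders, but subclassing Phirehose and implementing enqueueStatus() is the intended pattern):

  <?php
  require_once('phirehose.php');

  // Collection only: append each raw status to a queue file and return
  // as fast as possible, so the stream never backs up.
  class QueueingCollector extends Phirehose
  {
      public function enqueueStatus($status)
      {
          // LOCK_EX so the consumer never reads a half-written line
          file_put_contents('/tmp/tweet-queue.txt', $status . "\n",
              FILE_APPEND | LOCK_EX);
      }
  }

  $collector = new QueueingCollector('twitter_user', 'twitter_pass',
      Phirehose::METHOD_FILTER);
  $collector->setTrack(array('ourkeyword'));
  $collector->consume(); // blocks forever, handling reconnects for you

The only per-tweet work is a single append, which stays fast no matter how large the queue file gets.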

So, in summary:
  1. Set up a process (using something like Phirehose) to collect tweets, store them somewhere that won't slow down, and do nothing else (this has to be fast).
  2. Set up a separate script (cron task, daemon, whatever) to process the tweets and remove them from the queue - this can be as slow as you want and run however you want (a sketch follows below).
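For the processing side, a cron-run script might look something like this (the table name, columns and database credentials are made up - adjust them to your own schema):

  <?php
  // Rotate the queue first so the collector and this script never touch
  // the same file at once: rename() is atomic on the same filesystem,
  // and the collector simply recreates the queue on its next append.
  $queue = '/tmp/tweet-queue.txt';
  $batch = '/tmp/tweet-queue.processing';

  if (!file_exists($queue)) {
      exit; // nothing collected since the last run
  }
  rename($queue, $batch);

  $db   = new PDO('mysql:host=localhost;dbname=ourapp', 'dbuser', 'dbpass');
  $stmt = $db->prepare(
      'INSERT INTO tweets (status_id, screen_name, text) VALUES (?, ?, ?)');

  foreach (file($batch, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
      $tweet = json_decode($line, true);
      if ($tweet === null) {
          continue; // skip anything malformed
      }
      $stmt->execute(array(
          $tweet['id'],
          $tweet['user']['screen_name'],
          $tweet['text'],
      ));
  }

  unlink($batch); // batch fully processed

The rotate-then-process trick also answers your question about two PHP processes hitting things at the same time: only the collector ever writes the live queue file, and only the processor ever reads the rotated batch.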
Make sense?

The other confusing part is that most PHP devs aren't used to the idea of writing background processes/daemons in PHP; they think of PHP as something a webserver executes to display a web page.

While that's true, it's important to understand that PHP is a general-purpose scripting language, quite capable of running as a command-line script or background process that has nothing to do with web pages.

Phirehose is written with the assumption that it will run this way (it's actually impossible to run it properly from within a webserver).

There's a bunch of general information about PHP command-line stuff located here: http://www.php-cli.com/

Hopefully that helps steer you in the right direction. Let us know if you have any more questions - 

Cheers,

  Fenn. 

Chris Bull

Jun 9, 2010, 6:11:06 PM
to Phirehose Users
Wow, thanks for the great response, Fenn. Will try and build a basic
implementation today! One more brief question, though: how does the
PEAR lib System_Daemon fit into all this? Does it mean I can spawn
the daemon process from within a standard PHP script, or have I
missed the point with that as well?

Again, thanks so much for the useful and prompt reply!

Cheers,
Chris


Fenn Bailey

Jun 9, 2010, 8:58:24 PM
to phireho...@googlegroups.com
Hey Chris,

There are a couple of things I should have clarified -

I'm somewhat assuming you'll be running your script on a unix-like system (ie: Linux or Mac OS X). While the same general principles hold true on Windows, I don't personally know the best way to handle long-running PHP processes on a Windows box.

Also, you shouldn't actually need any of the fancy PHP daemon stuff (like PEAR System_Daemon). I should have mentioned that a daemon is just any old command-line script (ie: shell, Perl, PHP, anything) running in the background (ie: not interactively with a user).

Any command can be "daemonized" in a few ways. A decent starting point on backgrounding (ie: "daemonizing") processes is here: http://www.cyberciti.biz/tips/nohup-execute-commands-after-you-exit-from-a-shell-prompt.html

So if you can run your script at a command line by executing something like:

bash:~$ php phirehose-consume-tweets.php

Then you can very easily turn it into a daemon just by using the nohup technique above. 
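For example (the log file path is just illustrative):

bash:~$ nohup php phirehose-consume-tweets.php > collector.log 2>&1 &

nohup keeps the script running after you log out, the redirection captures its output, and the trailing & puts it in the background.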

This is a pretty basic way of doing things - most Linux distributions have a "proper" mechanism that lets you start processes on boot, manage them, etc, but this is more than enough to get you started.

Cheers!

  Fenn.

Chris Bull

Jun 16, 2010, 10:46:44 PM
to Phirehose Users
Cheers Fenn!

Been going through it now - got the stream consuming and am just
fine-tuning the processing :D. I am on unix (Mac OS running an Ubuntu
virtual machine for developing this app, because it seemed like a
good idea to really learn the server environment for when I deploy!).
I guess I have a question then that hopefully you won't mind
answering: why isn't PEAR System_Daemon suitable for this? I've
assumed that running "$ php StreamCollector.php &" is the correct way
to start the daemon in the most basic sense? When would I use
something like System_Daemon? Thanks for the nohup tip too - looks
very useful!

I really appreciate your answers!

Thanks for putting up with my n00b questions!

Chris


Fenn Bailey

Jun 16, 2010, 11:03:44 PM
to phireho...@googlegroups.com
Hi Chris,

Good to hear it's all working for you. To be honest, I hadn't looked at PEAR's System_Daemon for a long time (just had a quick look) and it's pretty neat. Particularly if it helps you create init scripts etc, it could be well suited to what you want.

Basically, System_Daemon and nohup (and all the other ways) achieve much the same thing (daemonizing/backgrounding your process), just in different ways, so use whichever one makes the most sense to you.

To be honest, System_Daemon looks nicely documented and easy to use as it's very "PHP-y", so if it works well for you, it could be a great option.
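For what it's worth, wrapping a Phirehose collector in it would look roughly like this - an untested sketch based on System_Daemon's documented start/stop pattern, with 'tweetcollector' as a made-up app name and the QueueingCollector class from the earlier sketch assumed to live in its own file:

  <?php
  require_once('System/Daemon.php');
  require_once('phirehose.php');
  require_once('QueueingCollector.php'); // collector class from the earlier sketch

  System_Daemon::setOption('appName', 'tweetcollector');
  System_Daemon::start(); // forks into the background, detaches from the terminal

  $collector = new QueueingCollector('twitter_user', 'twitter_pass',
      Phirehose::METHOD_FILTER);
  $collector->setTrack(array('ourkeyword'));
  $collector->consume(); // blocks here until the daemon is stopped

  System_Daemon::stop();

The upside over plain nohup is that you get pid-file handling and logging for free, and it's easier to hook into an init script later.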

Cheers!

  Fenn.

Chris Bull

Jun 17, 2010, 8:33:16 PM
to Phirehose Users
Thanks again Fenn!

Chris