how are people collecting spritzer/gardenhose?


Brendan O'Connor

May 25, 2009, 6:02:50 PM
to twitter-deve...@googlegroups.com
spritzer is great!  well done folks.

I'm wondering how other people are collecting the data.  I'm saving the json-per-line raw output to a flatfile, just using a restarting curl, then processing later.

Something as simple as this seems to work for me:

while true; do
  date; echo "starting curl"
  curl -s -u user:pass http://stream.twitter.com/spritzer.json >> tweets.$(date --iso)
  sleep 1
done |& tee curl.log

... and also, to force file rotation once in a while:

while true; do
  date; echo "forcing curl restart"
  killall curl
  sleep $((60*60*5))
done |& tee kill.log


anyone else?

-Brendan

pplante

May 26, 2009, 1:38:28 PM
to Twitter Development Talk
I am using python to implement a process which listens to the stream
and places all incoming data onto a message queue service. A few
other worker processes in the background work off the queue and store
the data. The message queue is not fault tolerant at this time;
however, a simple switch to an enterprise-grade MQ service would
achieve that.
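
As a rough illustration of that listener-plus-workers layout, here is a minimal Python sketch. It is not the poster's actual code: the standard library's in-process `queue.Queue` stands in for the message queue service, a plain list stands in for the data store, and the names `listen`, `worker`, and `run` are hypothetical.

```python
import json
import queue
import threading


def listen(lines, q):
    """Producer: push each non-empty raw line from the stream onto the queue."""
    for line in lines:
        line = line.strip()
        if line:
            q.put(line)


def worker(q, store):
    """Consumer: parse queued lines and store them until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        try:
            store.append(json.loads(item))
        except ValueError:
            pass  # skip truncated or malformed lines


def run(lines, n_workers=2):
    """Wire one listener to n_workers consumers and return what was stored."""
    q = queue.Queue()   # assumption: stands in for a real MQ service
    store = []          # assumption: stands in for the real data store
    workers = [threading.Thread(target=worker, args=(q, store))
               for _ in range(n_workers)]
    for t in workers:
        t.start()
    listen(lines, q)
    for _ in workers:
        q.put(None)     # one shutdown sentinel per worker
    for t in workers:
        t.join()
    return store
```

Like the setup described above, this sketch is not fault tolerant: anything still sitting in the in-memory queue is lost if the process dies.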

You are essentially doing the same thing via some bash scripts and
flatfiles. How are you parsing and indexing the data once it's
collected?


Brendan O'Connor

May 27, 2009, 4:28:30 PM
to twitter-deve...@googlegroups.com
On Tue, May 26, 2009 at 10:38 AM, pplante <pplan...@gmail.com> wrote:
> You are essentially doing the same thing via some bash scripts and
> flatfiles.  How are you parsing and indexing the data once it's
> collected?

Python simplejson, a custom tokenizer & other text analysis, then lots of Tokyo Cabinet/Tyrant.
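
The parsing step could look roughly like the sketch below. It makes several assumptions: the standard library `json` module is used in place of simplejson, a lowercase whitespace split stands in for the custom tokenizer (which is not shown in this thread), lines without a `text` field are skipped as non-status messages, and the function names are hypothetical. The Tokyo Cabinet/Tyrant storage step is left out.

```python
import json
from collections import Counter


def iter_tweets(lines):
    """Yield parsed status objects from json-per-line input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except ValueError:
            continue  # a killed curl can leave a truncated final line
        # assumption: anything without a "text" field is not a status
        if isinstance(obj, dict) and "text" in obj:
            yield obj


def count_tokens(lines):
    """Count tokens across all statuses (placeholder whitespace tokenizer)."""
    counts = Counter()
    for tweet in iter_tweets(lines):
        counts.update(tweet["text"].lower().split())
    return counts
```

For a flatfile written by the curl loop, usage would be along the lines of `with open("tweets.2009-05-25") as f: counts = count_tokens(f)`.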

--
Brendan O'Connor - http://anyall.org

M. Edward (Ed) Borasky

Jun 11, 2009, 5:10:53 PM
to Twitter Development Talk
Right now, I'm collecting spritzer data with a simple shell script
"curl <magic incantations> | bzip2 -c > <yyyymmddhhmmss>.bz2". A cron
job checks every minute and restarts the script if it crashes. The
rest is simple ETL. :)
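
For the extract step, files compressed this way can be read back line by line without unpacking them first. A minimal Python sketch, assuming one JSON object per line inside the .bz2 file (the function name `read_compressed` is hypothetical, and the example filename only mirrors the timestamp pattern above):

```python
import bz2
import json


def read_compressed(path):
    """Return the parsed JSON objects from a bzip2-compressed json-per-line file."""
    tweets = []
    # "rt" mode decompresses on the fly and yields text lines
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                tweets.append(json.loads(line))
    return tweets
```

Usage would be something like `read_compressed("20090611171053.bz2")`.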

David Fisher

Jun 11, 2009, 6:28:42 PM
to Twitter Development Talk
I'm just using a realtime JSON parser in Ruby written as a native C
extension (http://github.com/brianmario/yajl-ruby/tree/master).
It's really simple to use and well documented.

I'm just storing everything in a Postgres database, and then using
other scripts to query it. Note: with gardenhose, at least, you get a
LOT of data fast. After just a few days, I already have a database of 4 GB or more.
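
To illustrate the store-then-query pattern, here is a minimal sketch. It deliberately swaps in Python and the standard library's SQLite bindings for the Ruby/yajl and Postgres setup described above; the table layout and the name `store_tweets` are assumptions made for the example.

```python
import json
import sqlite3


def store_tweets(conn, lines):
    """Insert json-per-line statuses keyed by id; return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, raw TEXT NOT NULL)"
    )
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if isinstance(obj, dict) and "id" in obj:
            rows.append((obj["id"], line))
    with conn:  # commit all inserts as one transaction
        # OR IGNORE drops rows whose id is already present
        conn.executemany("INSERT OR IGNORE INTO tweets (id, raw) VALUES (?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Keeping the raw JSON next to the id means other scripts can re-parse a row later if they need more fields than were anticipated when the table was created.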