Production Ready?

Waleed

Nov 12, 2009, 8:11:31 PM
to Pubsubhubbub
I have an app in production that pulls 250K feeds, and I'm hoping to
offload my feed pulling to PSHB and start to get updates faster. A few
questions for the experts here:

1. I remember hearing that Google's hub at pubsubhubbub.appspot.com
does polling every 3 hours for feeds that don't ping the hub. Is that
accurate?

2. If the above is correct, then I'm planning to start by simply
subscribing to all my 250K feeds on the Google hub. I'll get instant
updates for some feeds, and 3-hour delayed updates for others, which
is okay with me. Does anyone see a problem with this plan?

3. Is the Google Hub production ready? Can I expect, say, 99%
reliability or higher? As in, expect 99% or more of new updates to be
delivered to me?

4. Does it handle non-English content?
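
For reference, a minimal sketch of what one subscription request body might look like. It assumes the parameter names from the PubSubHubbub 0.3 draft (`hub.mode`, `hub.topic`, `hub.callback`, `hub.verify`); the function name and the example URLs are hypothetical, and the sketch only builds the form-encoded body without sending it to any hub.

```python
from urllib.parse import urlencode


def build_subscribe_body(topic_url, callback_url, verify="async"):
    """Build the form-encoded body for a subscription request.

    Assumption: parameter names follow the PubSubHubbub 0.3 draft.
    The body would be POSTed to the hub's subscription endpoint.
    """
    params = [
        ("hub.mode", "subscribe"),      # ask the hub to start a subscription
        ("hub.topic", topic_url),       # the feed URL to subscribe to
        ("hub.callback", callback_url), # where the hub should deliver updates
        ("hub.verify", verify),         # "sync" or "async" verification
    ]
    return urlencode(params)


# Hypothetical feed and callback URLs, for illustration only.
body = build_subscribe_body(
    "http://example.com/feed.xml",
    "http://sub.example.com/cb",
)
print(body)
```

With 250K feeds this would be one request per feed, so the subscriber would likely want to spread the requests out over time.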


Thanks,
Waleed

Brad Fitzpatrick

Nov 12, 2009, 8:32:22 PM
to pubsub...@googlegroups.com
On Thu, Nov 12, 2009 at 5:11 PM, Waleed <wal...@ninua.com> wrote:

> I have an app in production that pulls 250K feeds, and I'm hoping to
> offload my feed pulling to PSHB and start to get updates faster. A few
> questions for the experts here:
>
> 1. I remember hearing that Google's hub at pubsubhubbub.appspot.com
> does polling every 3 hours for feeds that don't ping the hub. Is that
> accurate?
>
> 2. If the above is correct, then I'm planning to start by simply
> subscribing to all my 250K feeds on the Google hub. I'll get instant
> updates for some feeds, and 3-hour delayed updates for others, which
> is okay with me. Does anyone see a problem with this plan?


Falling back to polling isn't a required part of the PSHB spec, and while the Google hub *can* do it, it's not great, so it's off in production.  Plus we don't want to confuse people as to what Hubbub does.  So no, the Google hub can't be your feed poller.
 
> 3. Is the Google Hub production ready? Can I expect, say, 99%
> reliability or higher? As in, expect 99% or more of new updates to be
> delivered to me?

There are no guarantees, but it's run on App Engine, which is pretty reliable.
 

> 4. Does it handle non-English content?


Of course.  It's all either treated as just bytes, or Unicode, depending on the codepaths.

 

> Thanks,
> Waleed

Julien Genestoux

Nov 12, 2009, 8:38:45 PM
to pubsub...@googlegroups.com, pubsub...@googlegroups.com
Waleed, you may want to check http://superfeedr.com. We do polling if we don't find any other way to get the feed updates in real time (pubsubhubbub, rsscloud, sup...)

We send you the notifications thru pubsubhubbub, so if at any point you want to move away from us, that's easy.

Julien

--
Julien Genestoux

Sent from my iPhone

Jeff Lindsay

Nov 12, 2009, 9:04:47 PM
to pubsub...@googlegroups.com
Waleed, Superfeedr is the service I was telling you about.
--
Jeff Lindsay
http://webhooks.org -- Make the web more programmable
http://shdh.org -- A party for hackers and thinkers
http://tigdb.com -- Discover indie games
http://progrium.com -- More interesting things

Waleed Abdulla

Nov 12, 2009, 9:21:56 PM
to pubsub...@googlegroups.com
> Falling back to polling isn't a required part of the PSHB spec, and while the Google hub *can* do it, it's not great, so it's off in production.  Plus we don't want to confuse people as to what Hubbub does.  So no, the Google hub can't be your feed poller.

Thanks Brad. So, then, my backup plan is to use the pshb code provided by Google and run my own version of the hub, in which I enable polling. Any issues here? I understand that I will run into the task queue limit of 10K tasks a day, but assuming I get a raised quota or implement my own queue with external pinging, is there any other problem I might run into?

And, when you say "it's not great", what exactly do you mean? Is it not well tested yet? Or are there any app engine restrictions that make it problematic? 


Waleed Abdulla

Nov 12, 2009, 9:37:39 PM
to pubsub...@googlegroups.com
> Waleed, you may want to check http://superfeedr.com. We do polling if we don't find any other way to get the feed updates in real time (pubsubhubbub, rsscloud, sup...)

Julien, this is interesting. I'd definitely consider it, except for the cost factor. Although you charge a mere $0.0005 per entry, it adds up to about $2K a month for me, which is not a big number for the service, but it is big for a small startup like mine. I'll ping you offline to discuss this further.
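
As a rough check on that figure, here is a back-of-the-envelope sketch. It only uses the numbers quoted in this thread ($0.0005 per entry, about $2K a month, 250K feeds); the variable names are illustrative, and the results are implied volumes, not measured ones.

```python
from decimal import Decimal

# Figures quoted in the thread; the $2K monthly total is approximate.
price_per_entry = Decimal("0.0005")  # dollars per delivered entry
monthly_cost = Decimal("2000")       # dollars per month
feed_count = 250_000                 # feeds being tracked

# Entry volume implied by the quoted price and monthly total.
entries_per_month = monthly_cost / price_per_entry
entries_per_feed = entries_per_month / feed_count

print(f"implied entries per month: {entries_per_month:,.0f}")
print(f"implied entries per feed per month: {entries_per_feed:.0f}")
```

So the quoted total corresponds to roughly 4 million new entries a month, or about 16 per feed on average.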


Julien Genestoux

Nov 12, 2009, 11:39:57 PM
to pubsub...@googlegroups.com
Waleed,

Our approach is that "money" should never be an issue. If you can "prove" to us that polling is cheaper for you, then: 1- we'll match that, 2- we want to learn how :)

For us "legacy > currency" (tm Garyvee!), and putting a big dent in the "polling" world is a big win for everybody.

Also, we do feed normalization, which means you shouldn't see the issue that you seem to get with TC's feed.

Ju

Waleed Abdulla

Nov 14, 2009, 1:21:30 AM
to pubsub...@googlegroups.com
> Our approach is that "money" should never be an issue. If you can "prove" to us that polling is cheaper for you, then: 1- we'll match that, 2- we want to learn how :)

I'm currently doing it for 250K feeds on a Unix box with 8GB RAM. I use the SimplePie PHP library to parse the feeds, and it does a decent job of isolating me from the oddities of each format.

I have a PHP file that picks a few hundred feeds at a time from the db and fetches them in a loop. And I run this PHP file from a cron job that runs every 10 minutes. But I run 50 instances of that file at a time (my crontab file has 50 copies of the line that runs that PHP file). That's my poor man's approach to multi-threading, but it works. Each feed has an integer ID, and I use the mod of the ID and the mod of the time in a clever equation to decide which feeds to pull at each point in time. This allows me to not have to keep track of the last time a feed was pulled, which saves quite a bit of db access.
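
A minimal sketch of how an ID-and-time scheme like this could work, written in Python rather than PHP. The actual equation isn't given in this thread, so everything specific here is an assumption for illustration: the function name `feeds_for_run`, the 18 time slots (one full pass every 3 hours at a 10-minute cron interval), and the way the 50 instances divide each slot among themselves.

```python
import time


def feeds_for_run(feed_ids, now_s, worker,
                  interval_s=600, num_slots=18, num_workers=50):
    """Pick the feed IDs one worker should fetch during one cron run.

    Assumed scheme (not the original equation):
      - the current time selects one of `num_slots` slots,
      - a feed belongs to slot `feed_id % num_slots`,
      - within a slot, feeds are spread over the workers by
        `(feed_id // num_slots) % num_workers`.
    No "last fetched" timestamp is stored anywhere.
    """
    slot = (int(now_s) // interval_s) % num_slots
    return [
        fid for fid in feed_ids
        if fid % num_slots == slot
        and (fid // num_slots) % num_workers == worker
    ]


# Example: worker 0 of 50, run at the current time, over IDs 1..250000.
all_ids = range(1, 250_001)
batch = feeds_for_run(all_ids, time.time(), worker=0)
print(len(batch))
```

With these assumed numbers, 250K feeds split over 18 slots and 50 workers come to roughly 280 feeds per worker per run, and every feed is visited exactly once per 3-hour cycle. Dividing by `num_slots` before taking the worker modulus keeps the two splits independent even when the slot and worker counts share a common factor.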

It's not a very sophisticated setup, but it works. Effectively, the ongoing cost is the cost of renting an 8GB box (about $200/month). I'm reaching the limits of what I can do on one box, though. Not because of the polling process itself, but mostly due to the size of the data and the disk-swapping that accompanies that. So I can either get a bigger box, shard my data, or move to App Engine. I prefer App Engine because I've had good experience scaling other projects on it in the recent past, and because I want to solve the scaling problem once and for all rather than delay it.


Julien Genestoux

Nov 14, 2009, 2:20:49 AM
to pubsub...@googlegroups.com
Waleed,

That really looks like what most people build. Can I ask what the "thruput" of your whole system is? How long does it take you to fetch all the feeds in the DB? (BTW, maybe you want to get this conversation out of that mailing list).

If you want to go with GAE because you think that you'll get good enough results with that and don't care about maintenance and all that jazz, that's your decision eventually, but I'd suggest that you at least give our solution a try, so you can benchmark/test (and even use it for free while we're still in beta :D)

Let me know,

Julien

--
Julien Genestoux,

http://twitter.com/julien51
http://superfeedr.com

+1 (415) 254 7340
+33 (0)9 70 44 76 29
Sent from San Francisco, CA, United States