Are fat pings efficient?

Alex Barth

Oct 26, 2009, 12:12:07 PM

to Pubsubhubbub

I am *very* excited about the pubsubhubbub work I'm seeing. I am considering
making it a mainstay of our aggregation infrastructure.

Reading the spec and some of the issues on the project page, my main
question is:

Why does PuSH POST the entire feed to subscribers?

To me it would seem more efficient for the hub to expose the updated
feed at a URL and then POST only this URL to the subscribers. The
subscribers would then GET the feed from the hub.

The amount of data to be posted would be a fraction, and the updated feed
hosted by the hub could be cached with a reverse proxy like Varnish or
Squid. Subscribers could queue URLs neatly, then work them off
asynchronously.

Further, allowing POSTing a URL where updated data can be fetched
would open Pubsubhubbub to be applied in fields where the data feeds
are large (look at http://data.gov).

What are the reasons behind the design decision on PuSH posting fat
pings? Is there an option to post light pings that I am overlooking?
Are there threads I should be reading up on?

Alex

--
I'm one of the geeks at http://developmentseed.org and as such I do a
lot of work with aggregation for news tracking and Open Data in
Drupal. Recently we launched an open source news tracker called
Managing News http://managingnews.com. I maintain and have helped
maintain 3 aggregators for Drupal (e. g. http://drupal.org/project/feedapi
and its reincarnation: http://drupal.org/project/feeds).

Alexis Richardson

Oct 26, 2009, 12:29:43 PM

to pubsub...@googlegroups.com

Alex

PSHB is not using fat pings. There are use cases for fat pings that
are under discussion, but fat pings are not in the spec at this time.

alexis

Brett Slatkin

Oct 26, 2009, 12:32:14 PM

to pubsub...@googlegroups.com

PubSubHubbub has two legs: 1) a light ping from publisher to the hub,
2) a fat ping (or content push) from the hub to the subscriber.

Alexis is referring to #1 not being fat. The story thus far has been
that if a publisher wants to fat ping, they should integrate a Hub into
their CMS.
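
For illustration, leg #1 (the light ping) might look roughly like the sketch below. It assumes the form-encoded publish request described in the PubSubHubbub draft spec (`hub.mode=publish` plus `hub.url`). The function name and both URLs are made-up placeholders, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request


def build_light_ping(hub_url, topic_url):
    """Build (but do not send) the publisher -> hub notification.

    The body carries only the topic URL, never the feed content.
    """
    body = urllib.parse.urlencode({
        "hub.mode": "publish",
        "hub.url": topic_url,
    }).encode("ascii")
    return urllib.request.Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder URLs, for illustration only.
ping = build_light_ping("https://hub.example.com/",
                        "https://blog.example.com/feed.atom")
print(ping.get_method(), len(ping.data), "bytes")
```

The ping stays a few dozen bytes no matter how large the feed is. The hub then fetches the feed itself and does the fat push in leg #2.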

Otherwise, the benefits of fat content pushing versus URL forwarding
are discussed in some detail here:

http://code.google.com/p/pubsubhubbub/wiki/ComparingProtocols

Pádraic Brady

Oct 26, 2009, 2:00:43 PM

to pubsub...@googlegroups.com

Actually PSHB does use fat pings - subscribers are POSTed the delta of any feed when a Publisher notifies a Hub that a change has taken place.

As to efficiency, I think the caching argument is slightly off track. Serving a cached delta feed and sending a delta feed (which presumably is generated just once) are fairly equivalent. There are, however, differences in the backend request serving: whether the servicing is done using a proxy or a full application. In most cases, I'd assume the second is offloaded as much as possible to a basic task and not run through an application proper, so as to lower the request cost.

To your second point, Subscribers should never synchronously process updates. They should be dumped immediately to a job queue for asynchronous processing. This will help spread the processing load more evenly over time instead of being clumped together which I gather is what you're against. So it's receive update, verify it is an update (input validation), dump update to queue, and respond with a 200 code.
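
A minimal sketch of that flow (receive, validate, enqueue, respond), using only the Python standard library. The in-process `queue.Queue` stands in for a real job queue, the content-type check is only a placeholder for proper input validation, and all names here are illustrative assumptions.

```python
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

jobs = queue.Queue()  # in-process stand-in for a real job queue


def accept_update(content_type, body):
    """Cheap validation only; returns the HTTP status to send back."""
    if not body or "xml" not in (content_type or ""):
        return 400
    jobs.put(body)  # defer parsing and storage to the worker
    return 200


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        status = accept_update(self.headers.get("Content-Type"), body)
        self.send_response(status)
        self.end_headers()


def worker(process):
    """Drain the queue outside the request/response cycle."""
    while True:
        body = jobs.get()
        process(body)
        jobs.task_done()


def serve(port=8080):
    """Run the callback endpoint; not invoked automatically."""
    threading.Thread(target=worker, args=(print,), daemon=True).start()
    HTTPServer(("127.0.0.1", port), CallbackHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The hub gets its response as soon as the body is queued, regardless of how long `process` takes.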

So, I think overall it's still quite an efficient system. The main thing is making sure each party is being efficient about it which is, of course, an implementation point the specification won't be commenting on. I think this will be the biggest mental block over time - web developers are pretty bad at thinking asynchronously ;).

Paddy

Pádraic Brady

http://blog.astrumfutura.com
http://www.survivethedeepend.com
OpenID Europe Foundation Irish Representative


Jeff Lindsay

Oct 26, 2009, 2:01:07 PM

to pubsub...@googlegroups.com

In short, it actually *is* more efficient to send fat pings. There are fewer requests, CPU cycles, and bytes transferred in total in the end. However, I don't even care about that. It's just super convenient to HAVE the data when your code is triggered by an update event. And in practice (because I've been using and watching this model for a long time, more generally known as webhooks), it can really suck to not know what changed when you get the ping. If you pass the relevant data on the event, there is no question, and it's right there, AND it happens to be more efficient in the end all things considered.

Although it seems like this breaks things like caching, you can still do your own caching as the subscriber -- just not HTTP level caching. But really, if you think about it, it doesn't really make sense to cache because you're always dealing with content updates.

There is also a very intentional decision to make both ends as simple as possible, and making the subscriber queue and fetch and diff is really pushing a lot of the complexity and work off to them ... when there is really no need to.

Hopefully that mostly or generally answers your question!

-jeff

--
Jeff Lindsay
http://webhooks.org -- Make the web more programmable
http://shdh.org -- A party for hackers and thinkers
http://tigdb.com -- Discover indie games
http://progrium.com -- More interesting things

Jeff Lindsay

Oct 26, 2009, 2:14:40 PM

to pubsub...@googlegroups.com

> To your second point, Subscribers should never synchronously process updates. They should be dumped immediately to a job queue for asynchronous processing. This will help spread the processing load more evenly over time instead of being clumped together which I gather is what you're against. So it's receive update, verify it is an update (input validation), dump update to queue, and respond with a 200 code.

Actually I don't see anything wrong with handling the event synchronously. While queuing is courteous to the hub, the hubs will HAVE to be able to handle synchronous subscribers because that's just how most people will do it. From the subscriber perspective, a job queue is unnecessary because their web server should already be handling the request asynchronously. Apache is generally already a big worker pool using incoming HTTP requests as the job queue.

Pádraic Brady

Oct 26, 2009, 4:41:52 PM

to pubsub...@googlegroups.com

It's sort of an expectations game with the question being what does the Hub expect. Ideally, it's expecting to POST a delta, get a 2xx response, and move on to the next Subscriber. If, however, the Subscriber acts synchronously then the Hub is carrying the cost of maintaining a connection while the Subscriber does all the update processing work before sending a 2xx response.

Should the Hub be stuck waiting for a response because the Subscriber is doing work of absolutely no impact to the expected 2xx response? Personally, I don't think so. That clashes with a web developer's instinct to treat all work within a single request as being essential to the response, which is why most will (I agree absolutely) not use a separate queue. Doesn't make it correct or efficient though. Subscribers should give the Hub the expected response once its request needs are met - i.e. the Subscriber received the update and verified it as being valid. Anything outside that is not essential to the Hub response.

I think it's important because synchronous processing will land Hubs with the impact of a Subscriber's ill-advised practices - clumped requests taking forever since the server is bogged down in swap, inefficient database ops, slow processing, etc. If I were running a Hub, I'd put a 10 second max timeout on my connections and make it abundantly clear to Subscribers that not meeting that timeout is their problem to solve.
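
On the hub side, such a timeout could be sketched as follows. This is not taken from any existing hub implementation: the `deliver` function, its `opener` parameter (there only so the network call can be swapped out), and the Atom content type are assumptions made for illustration.

```python
import urllib.request

DELIVERY_TIMEOUT = 10.0  # seconds, the figure suggested above


def deliver(callback_url, payload, opener=urllib.request.urlopen):
    """POST a delta to one subscriber; True only on a timely 2xx."""
    request = urllib.request.Request(
        callback_url,
        data=payload,
        headers={"Content-Type": "application/atom+xml"},
        method="POST",
    )
    try:
        with opener(request, timeout=DELIVERY_TIMEOUT) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers timeouts, refused connections and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError).
        # The hub moves on and can retry this subscriber later.
        return False
```

A slow Subscriber then costs the Hub at most ten seconds per delivery attempt instead of an open-ended connection.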

Maybe I'm being harsh though ;). I just don't like building it into practices that poor implementations can get away with bogging down other parties for no good reason. It's practically begging for people to do the wrong thing because it's actively tolerated. As a wise man once said, programming really is the one discipline where we seem unfathomably obsessed with making life easier for the less skilled of its members.

Paddy

Pádraic Brady

http://blog.astrumfutura.com
http://www.survivethedeepend.com
OpenID Europe Foundation Irish Representative


Jeff Lindsay

Oct 26, 2009, 4:49:06 PM

to pubsub...@googlegroups.com

Luckily, probably 80% of subscribers will be doing something so simple that it's not even an issue. I do agree Hubs should have a timeout to encourage good practice. I don't think keeping the connection open is a big deal if you're doing it right. Hell, some people let you keep a connection open indefinitely, even "at scale" (see Twitter Stream API). You'll see in my upcoming hub implementation...

Alex Barth

Oct 27, 2009, 12:56:48 PM

to Pubsubhubbub

Thanks for your answers. Understanding better now, still grappling
with some questions.

How do you guys see the advantages/disadvantages of POSTing feed data
in these scenarios:

1. Hub does not serve delta feed. In my mind, this can be
interesting for 3 reasons: a) building simple hubs that don't inspect
feeds at all, b) building hubs that are completely agnostic to their
feed formats, c) hubs convert feed to standard format, subscribers
pull the first feed data from hub, not from original publisher (heck,
how do the superfeedr guys do that?)
2. Large data sets (e.g. DC's 2009 crime feed has 1.2MB) http://data.octo.dc.gov/
3. Many, frequently changing subscribers - wouldn't this lead to
large POST requests being sent unnecessarily to subscribers that
don't actually exist anymore?

Alex


Brett Slatkin

Oct 27, 2009, 2:07:06 PM

to pubsub...@googlegroups.com

Hey Alex,

On Tue, Oct 27, 2009 at 9:56 AM, Alex Barth <lxb...@gmail.com> wrote:
> How do you guys see the advantages/disadvantages of POSTing feed data
> in these scenarios:
>
> 1. Hub does not serve delta feed. In my mind, this is can be
> interesting for 3 reasons: a) building simple hubs that don't inspect
> feeds at all b) building hubs that are completely agnostic to their
> feed formats, c) hubs convert feed to standard format, subscribers
> pull the first feed data from hub, not from original publisher (heck,
> how do the superfeedr guys do that?)

I think (a) isn't too compelling. We're going to have a few, very
well-tested hub implementations that people can run or use as a hosted
service.

We would like (b) to be part of the core spec eventually, with other
secondary specs that explain how to do differential updates for
secondary content types (if necessary).

For (c), Superfeedr is acting as a federated hub, meaning they
subscribe to all other hubs' updates and proxy them to their
subscribers. This allows for composition and data transformation.

> 2. Large data sets (i. e. DC's 2009 crime feed has 1.2MB) http://data.octo.dc.gov/

I think distributing just the changes is significantly more efficient
for large feeds. Instead of pushing 1.2MB each time the feed changes
to 1000+ subscribers, you can just send the newest 2KB update.
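
A quick back-of-the-envelope check of those numbers, assuming exactly 1000 subscribers and binary units (1MB = 1024 * 1024 bytes). The figures come from the paragraph above; everything else is illustrative.

```python
SUBSCRIBERS = 1000
FULL_FEED_BYTES = 1.2 * 1024 * 1024   # the 1.2MB feed
DELTA_BYTES = 2 * 1024                # one 2KB update


def fanout_bytes(payload_bytes, subscribers=SUBSCRIBERS):
    """Total bytes the hub sends out for one feed change."""
    return payload_bytes * subscribers


full = fanout_bytes(FULL_FEED_BYTES)
delta = fanout_bytes(DELTA_BYTES)
print(f"full feed:  {full / 1024**2:.0f} MB per change")
print(f"delta only: {delta / 1024**2:.2f} MB per change")
print(f"ratio:      {full / delta:.0f}x")
```

That is roughly 1.2GB of outbound traffic per change for the full feed versus about 2MB for the delta, a difference of around 600x.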

Combined with the Atom Tombstoning draft spec
(http://www.ietf.org/id/draft-snell-atompub-tombstones-06.txt) we
should be able to get Hubbub to communicate new and deleted content in
the same way.

> 3. Many and often changing subscribers - wouldn't this lead to
> unnecessarily sent large POST requests to subscribers that actually
> don't exist anymore?

Subscriptions in the hub have a lease period and must be checked for
validity every so often. This allows the hub to prune old/bad
subscribers that aren't receiving the feed anymore. Again, this lets
the data flow be streamlined to the minimum bandwidth possible.
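
A toy sketch of lease-based pruning, assuming an in-memory table keyed by callback URL. The class and method names are hypothetical, and a real hub would also re-verify a subscription with the subscriber before renewing its lease.

```python
import time


class SubscriptionTable:
    """Toy in-memory store mapping callback URL -> lease expiry time."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._expires = {}

    def subscribe(self, callback_url, lease_seconds):
        # Called for a new subscription and again whenever one is renewed.
        self._expires[callback_url] = self._clock() + lease_seconds

    def prune(self):
        """Drop subscriptions whose lease has run out; return them."""
        now = self._clock()
        expired = [url for url, t in self._expires.items() if t <= now]
        for url in expired:
            del self._expires[url]
        return expired

    def active(self):
        """Callback URLs that should still receive pushes."""
        return sorted(self._expires)
```

Only the callbacks returned by `active()` receive content pushes, so a subscriber that disappears stops costing bandwidth once its lease expires.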

Hope that helps,

-Brett

Julien

Oct 28, 2009, 12:39:40 AM

to Pubsubhubbub

Hey,

On Oct 27, 11:07 am, Brett Slatkin <bslat...@gmail.com> wrote:
> Hey Alex,
>

> On Tue, Oct 27, 2009 at 9:56 AM, Alex Barth <lxba...@gmail.com> wrote:
> > How do you guys see the advantages/disadvantages of POSTing feed data
> > in these scenarios:
>
> > 1. Hub does not serve delta feed. In my mind, this is can be
> > interesting for 3 reasons: a) building simple hubs that don't inspect
> > feeds at all b) building hubs that are completely agnostic to their
> > feed formats, c) hubs convert feed to standard format, subscribers
> > pull the first feed data from hub, not from original publisher (heck,
> > how do the superfeedr guys do that?)
>
> I think (a) isn't too compelling. We're going to have a few, very
> well-tested hub implementations that people can run or use as a hosted
> service.

Agreed... and also, I think the protocol was built to stay simple.
Based on that, we should avoid having lighter (non-compatible)
implementations.
However, I think the "diffing" should not be part of the protocol
itself, but stay "vague" or at least open to other data than RSS/Atom.

>
> We would like (b) to be part of the core spec eventually, with other
> secondary specs that explain how to do differential updates for
> secondary content types (if necessary).

Agreed!

>
> For (c), Superfeedr is acting as a federated hub, meaning they
> subscribe to all other hubs' updates and proxy them to their
> subscribers. This allows for composition and data transformation.

Yes... that is exactly what we do. And more than just "formats" we
also map other protocols into PubSubHubbub, like RSSCloud or (soon...)
SUP, but also streams from app such as identica/twitter... etc.

> > c) hubs convert feed to standard format, subscribers
> > pull the first feed data from hub, not from original publisher (heck,
> > how do the superfeedr guys do that?)

Well, we just parse the new content and map that into a consistent
form. Then, we push the updates to our subscribers. In our case, we do
not store anything, which means that (as the protocol works anyway),
nobody pulls from us.

>
> > 2. Large data sets (i. e. DC's 2009 crime feed has 1.2MB) http://data.octo.dc.gov/

>
> I think distributing just the changes is significantly more efficient
> for large feeds. Instead of pushing 1.2MB each time the feed changes
> to 1000+ subscribers, you can just send the newest 2KB update.
>
> Combined with the Atom Tombstoning draft spec
> (http://www.ietf.org/id/draft-snell-atompub-tombstones-06.txt) we
> should be able to get Hubbub to communicate new and deleted content in
> the same way.

Not sure why large feeds would be different from smaller feeds.

Bob Wyman

Oct 28, 2009, 11:58:37 AM

to pubsub...@googlegroups.com

On Wed, Oct 28, 2009 at 12:39 AM, Julien <julien.g...@gmail.com> wrote:
> Not sure why large feeds would be
> different from smaller feeds.

A feed may be "large" because it has many entries. If only one entry in the feed has changed, it is inefficient to copy all the entries of the feed since you will discard most of them.
A feed may also be "large" because it contains even a small number of "large" entries. It will still be inefficient to receive the entire feed since you will be discarding a large number of previously seen bytes.

Receiving only updated entries is more efficient since you don't suffer the waste that is inherent in polling multi-entry feed documents that contain previously seen entries.
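
One way a hub might pick out just the updated entries, sketched under the assumption that each entry carries an id and an `updated` timestamp (as Atom entries do). The function name and the tuple representation are illustrative only.

```python
def new_or_updated(entries, seen):
    """Return entries not yet pushed, and remember them.

    entries: list of (entry_id, updated) tuples from the latest fetch.
    seen: dict of entry_id -> last `updated` value already delivered.
    """
    delta = []
    for entry_id, updated in entries:
        if seen.get(entry_id) != updated:
            delta.append((entry_id, updated))
            seen[entry_id] = updated
    return delta
```

Only the returned entries need to be pushed, so the cost of a notification follows the size of the change rather than the size of the feed.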

bob wyman

Alex Barth

Oct 31, 2009, 12:50:24 PM

to pubsub...@googlegroups.com

Thank you for your responses. This was *very* helpful for me and, I
hope, for others who are trying to understand the reasoning behind a
fat-pings-only approach.

I still have some concerns around feed format agnostic hubs and heavy
payloads but I actually need to sit down and do more homework on this.
It may well be the case - as many posters argued here - that these
concerns are not relevant in practice.

Alex
