Burstiness -- Original Sin in Protocol Design...


Bob Wyman

Sep 16, 2009, 12:32:21 PM
to pubsub...@googlegroups.com
One of the most profound sins that can be committed by a protocol designer is to introduce sources of burstiness in a distributed system, i.e. to increase the sources of variance in traffic patterns. In general, a push-based system, like PSHB, can be clearly shown to generate less traffic variance than a ping/poll system like the one recently proposed by the rssCloud advocate. This issue is related to the "thundering herd" problem and should, I think, be covered in the protocol comparison section of the wiki.

As has been noted by others, a PSHB hub avoids the thundering herd problem by pushing fat pings to its clients. Thus, the rate of propagation for information flowing through the hub is largely under the control of the hub. If the hub needs to distribute 1,000,000 messages, it can do so at a rate which is under its control and which makes the most of the resources available to it (network, CPU, etc.). This is unlike the case with a ping/poll protocol like rssCloud, which limits the hub to notifying its clients of new content and then relies on the clients to poll for the content. Those who have built such systems in the past are well aware that "thundering herds" (i.e. bursty traffic) appear whenever such systems are deployed.
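As a minimal sketch of what "a rate which is under its control" can look like, the following assumes a hub that simply assigns each queued message to a send slot. The function name and the numbers are illustrative assumptions, not part of PSHB or of any hub implementation.

```python
def paced_send_times(message_count, sends_per_second):
    """Return the second (offset from start) at which each message is pushed.

    The hub alone decides the pace, so the load in any one second never
    exceeds sends_per_second, no matter how many messages are queued.
    """
    if sends_per_second <= 0:
        raise ValueError("sends_per_second must be positive")
    return [i // sends_per_second for i in range(message_count)]


times = paced_send_times(message_count=10, sends_per_second=4)
print(times)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

The point of the sketch is only that the schedule depends on nothing outside the hub; a real hub would also have to deal with slow or failing subscribers.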

The advocates of ping/poll systems like rssCloud would argue that just as a push-based system can shape traffic patterns by controlling the rate at which it pushes content to clients, a ping/poll server can do the same. A ping/poll system can "bleed out" its update notifications over time and thus shape the response traffic and tame the thundering herd... Well, while this is certainly true, what is missed by most who argue this point is that a ping/poll system that throttles notifications in order to shape traffic must *always* use a tighter throttle than a pure push system would. As a result, information will propagate more slowly and average resource utilization on the hub server will tend to be lower (i.e. you'll have more idle time for the same hardware investment).
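A toy illustration of the tighter-throttle argument, under stated assumptions: time is counted in whole seconds, the server can absorb `capacity` requests per second, and each subscriber polls a fixed number of seconds after being pinged. The delay values are made up for the example, and one contrived case does not prove the "always" in the paragraph above.

```python
from collections import Counter


def poll_arrivals(ping_rate, client_delays):
    """Count polls arriving at the server in each second.

    Subscriber i is pinged in second i // ping_rate and polls
    client_delays[i] seconds later (a delay the server does not control).
    """
    arrivals = Counter()
    for i, delay in enumerate(client_delays):
        arrivals[i // ping_rate + delay] += 1
    return arrivals


def max_safe_ping_rate(capacity, client_delays):
    """Largest ping rate whose resulting poll peak stays within capacity."""
    for rate in range(capacity, 0, -1):
        if max(poll_arrivals(rate, client_delays).values()) <= capacity:
            return rate
    return 0


# Hypothetical delays: early subscribers are slow to poll, later ones are fast,
# so their polls pile up in the same second.
delays = [1, 1, 1, 1, 0, 0, 0, 0]
print(max_safe_ping_rate(capacity=4, client_delays=delays))  # 2
```

A push hub with the same capacity could send 4 messages per second; in this example the ping/poll server has to drop to 2 pings per second to keep the resulting polls within capacity.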

The reason that ping/poll systems propagate more slowly and need to be more richly configured is that ping/poll systems suffer more sources of traffic variance than do push-based systems. A push-based system is only subject to variance in the ability to establish an outbound connection and transfer any particular fat ping into a remote TCP/IP buffer. On the other hand, the ping/poll system has not only the outbound variance associated with the ping but also the variance that results from scheduling of a poll within the remote client (subscriber) and the variance in sending and serving the eventual polling request. In *any* distributed system (as we learn from queuing theory, etc.) we know that increasing variance *always* means lower throughput and greater burstiness of traffic.
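The extra sources of variance can be tallied in a small sketch. It assumes each stage's delay is independent and uniformly distributed, which is a simplification chosen only so the arithmetic is exact; the delay ranges are hypothetical and do not come from any measurement.

```python
def uniform_variance(low, high):
    """Variance of a delay uniformly distributed between low and high seconds."""
    return (high - low) ** 2 / 12


# Hypothetical delay ranges in seconds for each stage of delivery.
push_stages = {"outbound fat ping": (0.0, 0.6)}
poll_stages = {
    "outbound ping": (0.0, 0.6),
    "client poll scheduling": (0.0, 6.0),
    "poll request and response": (0.0, 1.2),
}

# For independent stages, the variance of the total delay is the sum
# of the per-stage variances.
push_variance = sum(uniform_variance(lo, hi) for lo, hi in push_stages.values())
poll_variance = sum(uniform_variance(lo, hi) for lo, hi in poll_stages.values())
print(push_variance < poll_variance)  # True
```

Because variances of independent delays add, every extra stage in the ping/poll path can only increase the spread of arrival times.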

If the only problem here were a reduction in throughput and thus greater propagation times, some would argue that this is an acceptable cost for a system that they consider easier to deploy. They might also argue that total system throughput can still be maximized by overlapping the distribution of multiple messages. Once again, this fails to consider the full effects of the ping/poll design...

Networked servers need to be configured to handle not their *average* load but their *peak* load. Since increased variance increases burstiness, the more bursty a system is, the greater the peaks in traffic and thus the more powerful the server configuration needed to handle that traffic. What you'll find is that for two systems that have exactly the same throughput but use different protocols, it takes more resources to support the more bursty protocol. The difference in resource configuration is, of course, simply waste -- if there is a smoother protocol available.
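The peak-versus-average point can be put in numbers with a short sketch. Both traffic traces below are invented for illustration; they carry the same total number of requests and differ only in how that load is spread over time.

```python
def peak_to_average(requests_per_second):
    """Ratio of the busiest second to the mean; 1.0 means perfectly smooth."""
    average = sum(requests_per_second) / len(requests_per_second)
    return max(requests_per_second) / average


# Two hypothetical traces carrying the same 1,000 requests in 10 seconds.
smooth = [100] * 10
bursty = [400, 300, 150, 100, 50, 0, 0, 0, 0, 0]

assert sum(smooth) == sum(bursty) == 1000
print(peak_to_average(smooth))  # 1.0
print(peak_to_average(bursty))  # 4.0
```

Under these assumed traces, the bursty system would need roughly four times the capacity of the smooth one to serve the same throughput without dropping requests.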

These subtle effects are, of course, not apparent in low volume systems or in toy-like demonstrations and prototypes. Thus, experimenters, weekend coders and those who provide unsuccessful services don't notice this sort of thing. It is the kind of effect that really only appears when you are building systems under load -- systems that have been successful in gaining users... This may be, in fact, why the advocates of rssCloud appear to be oblivious to these basic, fundamental aspects of distributed protocol design. Until the recent complete redesign of the rssCloud protocol, virtually nobody had ever implemented rssCloud even though a version of it had been defined almost 8 years ago. The only implementations were in rather unsuccessful and now completely defunct blogging systems. As far as I can tell, even the recent rewrite of the rssCloud protocol was done purely as a speculative activity and does not, in fact, benefit from any experience with real traffic volumes or careful analysis. But, we're all used to this sort of amateur design on the web. The great thing about the web is that it gives everyone voice -- but, let's do our best to ensure that the protocols we design to distribute those voices are as efficient, effective and burst-free as possible.

Burstiness is evil. rssCloud results in more burstiness than does PSHB. The conclusion should be obvious...

bob wyman

Matthew Terenzio

Sep 16, 2009, 1:19:04 PM
to pubsub...@googlegroups.com
If you'd like to speak in that tone, I'd appreciate it if you did it on your blog. I don't come here to get lectures but to discuss protocols. While you obviously have a lot to contribute, please dispense with the subtle political jabs.

Alexis Richardson

Sep 16, 2009, 1:20:13 PM
to pubsub...@googlegroups.com
Nice post Bob.

Relatedly:

Isn't it also true that a PSHB hub ("pshub"?) can be more efficient
than rssCloud because it can manage push on behalf of multiple
publishers? A pshub which was pushing data for (say) 1000 publishers
could conceivably be 'quite busy most of the time'. A form of
multitenancy would be in effect, and enable efficiencies from the
pshub being a shared resource.

Whereas with rssCloud as currently defined, my understanding (maybe
wrong) is that when the herd thunders, it comes to the publisher. Not
only does this lead to the effects you describe below - high resource
needs at the peak - but there is no way for the publisher to perform
useful work when it is idle. Unlike in PSHB, those extra resources
cannot be utilised. The rssCloud publisher, unlike the pshub, does
not get to share its extra resources with other publishers.

alexis

Matthew Terenzio

Sep 16, 2009, 1:23:28 PM
to pubsub...@googlegroups.com
Nevermind Bob. I apologize. You are and should be free to say what you want. If I find something wrong with it, I'm free to ignore it or unsubscribe. Sorry, again.

On Wed, Sep 16, 2009 at 12:32 PM, Bob Wyman <b...@wyman.us> wrote:

Bob Wyman

Sep 16, 2009, 1:36:34 PM
to pubsub...@googlegroups.com, mter...@gmail.com
Matthew,
You wrote: "Nevermind Bob. I apologize..."
Not a problem... But please know that I do regret if you took my note badly. I tried hard to provide a good bit of content and substance. The issue of burstiness and traffic shaping is an important one. But yes, I probably did let loose a bit more than the optimal amount of emotion at the end there. Please understand that these technical discussions touch on issues and debates that have been going on for many years and some of us are very frustrated by the issues and people that have prevented us from deploying the best technologies and architectures to serve the needs of the many millions of people who unknowingly rely on the quality of our work...

bob wyman

Matthew Terenzio

Sep 16, 2009, 1:50:56 PM
to Bob Wyman, pubsub...@googlegroups.com
On Wed, Sep 16, 2009 at 1:36 PM, Bob Wyman <bobw...@gmail.com> wrote:
> Matthew,
> You wrote: "Nevermind Bob. I apologize..."
> Not a problem... But please know that I do regret if you took my note badly. I tried hard to provide a good bit of content and substance. The issue of burstiness and traffic shaping is an important one. But yes, I probably did let loose a bit more than the optimal amount of emotion at the end there. Please understand that these technical discussions touch on issues and debates that have been going on for many years and some of us are very frustrated by the issues and people that have prevented us from deploying the best technologies and architectures to serve the needs of the many millions of people who unknowingly rely on the quality of our work...


I don't know why it bothered me. I apologize. It's all good stuff.

Just because someone supports something doesn't mean they are oblivious to more elegant solutions, or just never had enough success to know better.

If it wasn't for crappy protocols and weekend coders, I just don't think many things I (we all) hold dear would have become possible.

To keep in the theme of original sin, there is often a "felix culpa" which is necessary to move things forward. That is, a "worse solution" that gains wider acceptance in order that better ones can arrive. HTML is the poster child there.

I understand what Bob says about better solutions having been stopped for wrong reasons over the years.

I really do believe that things will move forward for reasons outside of these discussions and despite whichever technology is better or whichever has a better name ; )

Bob Wyman

Sep 16, 2009, 2:27:48 PM
to pubsub...@googlegroups.com
On Wed, Sep 16, 2009 at 1:20 PM, Alexis Richardson <alexis.r...@gmail.com> wrote:
> enable efficiencies from the
> pshub being a shared resource...

> Whereas with rssCloud as currently defined, my
> understanding (maybe wrong) is that when the
> herd thunders, it comes to the publisher.

It's hard to say what rssCloud actually does or doesn't do. Are we talking about rssCloud as defined in the formal and frozen RSS V2.0 specification or is it the list of proposed changes that have recently been made on some blogs?

It has always been my reading of the "formal" definition of rssCloud (circa 2003) that rssCloud would send the thundering herds back to the publisher. The impact, of course, is that every publisher needs to be configured to handle bursts of traffic in response to notifications as well as the traffic needed to create and maintain subscriptions. In such a system, we would expect to see many nodes being over-configured to handle bursts and thus would expect a much less efficient use of resources than we would in a system with shared hubs. But, we should look at this in more detail:

Every rssCloud publisher needs to (at least):
  • Respond to notification request XML-RPC calls
  • Maintain a persistent list of current subscribers
  • Handle the timed expiration of subscriptions every 25 hours
  • Send out notifications when changes are made to the "channel"
  • Handle requests from subscribers for changed content announced in notifications
This is, of course, a great deal more burden on the publisher than what you get with PSHB where most of the work of dealing with subscribers is off-loaded from publishers to the shared PSHB hub. As you point out, that shared hub, as a result of the aggregation, ends up being able to be vastly more efficient than any collection of publishers would be.
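To give a sense of the bookkeeping in that list, here is a minimal sketch of just the subscriber list and its 25-hour expiry. The class, method names and callback URLs are hypothetical, storage is in memory rather than persistent, and the XML-RPC handling, notification sending and feed serving are left out entirely.

```python
SUBSCRIPTION_LIFETIME_SECONDS = 25 * 60 * 60  # the 25-hour expiry from the list above


class SubscriberRegistry:
    """Hypothetical in-memory stand-in for the publisher's subscriber list."""

    def __init__(self):
        self._registered_at = {}  # callback URL -> time of last (re)registration

    def register(self, callback_url, now):
        """Record a notification request; re-registering renews the subscription."""
        self._registered_at[callback_url] = now

    def active(self, now):
        """Drop expired subscriptions and return the callbacks still to be notified."""
        self._registered_at = {
            url: t
            for url, t in self._registered_at.items()
            if now - t < SUBSCRIPTION_LIFETIME_SECONDS
        }
        return sorted(self._registered_at)


registry = SubscriberRegistry()
registry.register("http://subscriber-a.example/notify", now=0)
registry.register("http://subscriber-b.example/notify", now=20 * 3600)
print(registry.active(now=26 * 3600))  # ['http://subscriber-b.example/notify']
```

Even this small piece has to be kept running and kept consistent by every publisher, which is the burden a shared hub takes over.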

Of course, the rssCloud folk could argue that many of the benefits of a shared hub are clearly possible even if you are using rssCloud. For instance, what it means to be a "publisher" isn't strictly defined in the 2003 rssCloud specification. We might initially assume that each "blog" would be a publisher; however, if those blogs were aggregated together on a common server (like Wordpress.com) or via an aggregating intermediary (like FeedBurner), then a single server could work on behalf of many virtual publishers. So, one can argue that a real difference between rssCloud and PSHB is that rssCloud, by not defining "publisher", is open to shared hubs, while PSHB assumes that shared hubs are used.

So, we can't get too hung up in making assumptions about who the publisher is or the ratio of channels to publishers. Certainly, if the ratio is low, then there will probably be terrible inefficiencies; however, there is nothing in the rssCloud spec (2003) that requires a low ratio. Having said that, I don't think it really matters. The reality is that, as described in my earlier note, the very nature of the ping/poll pattern results in unacceptable inefficiencies due to the large number of sources of traffic variance and the resulting burstiness. These problems get worse if each publisher maintains its own server; however, they remain significant even if you aggregate many rssCloud publishers onto common servers.

bob wyman

Matthew Terenzio

Sep 16, 2009, 2:45:04 PM
to pubsub...@googlegroups.com
I've implemented an RSS Cloud and it is not linked with the publisher at all. In fact, it handles subscription management and notifications. The publisher only needs to add the <cloud> element to the feed and ping the cloud when there is new content.

Jay Rossiter

Sep 16, 2009, 2:50:49 PM
to pubsub...@googlegroups.com
On 9/16/2009 11:45 AM, Matthew Terenzio wrote:
> I've implemented an RSS Cloud and it is not linked with the publisher
> at all. In fact, it handles subscription management and notifications.
> The publisher only needs to add the cloud element to the feed and
> ping the cloud when there is new content.

Management and notification, but not delivery... requests for the
content, once the subscriber has been notified, go to the publisher.


Matthew Terenzio

Sep 16, 2009, 3:04:16 PM
to pubsub...@googlegroups.com

> Management and notification, but not delivery... requests for the
> content, once the subscriber has been notified, go to the publisher

Yes they do. Some might want it that way, despite knowing that there is "burstiness".

Others might want to let the cloud handle their feed for them in a FeedBurner-like way. That, in and of itself, does nothing to stop the burstiness issue.

Of course, there is nothing saying that a namespace can't be defined for RSS that allows for partial feed "pushes" and then this whole conversation gets flipped and people start asking, "Hey, why doesn't PSHB allow us to decide whether we can have the clients go directly to us. My site is small, and I don't want anyone in between me and my subscribers."

Or is someone on this list against giving the publishers more control of how their content gets syndicated, even if they are weekend coders playing with toys? (just kidding Bob)

Bob Wyman

Sep 16, 2009, 3:06:44 PM
to pubsub...@googlegroups.com
On Wed, Sep 16, 2009 at 2:50 PM, Jay Rossiter <jros...@gmail.com> wrote:
> Management and notification, but not delivery...
> requests for the content, once the subscriber
> has been notified, go to the publisher.
You are right, of course. The only way to relieve the publisher of the polling load in an rssCloud-like ecosystem is if the server that implements the rssCloud service *becomes* the publisher. Feedburner is probably the best example here. For every Feedburner feed URL, there is another URL that you don't typically see. It is the URL that Feedburner reads in order to build the feed that other servers read. Subscribers would then poll the FeedBurner-like system, not the real publisher. Of course, in such a system, you have to "give up" on maintaining your own URLs and thus may lose some "branding" value.

Given a system like PSHB, I could run a feed read by millions on a tiny computer that only gets one or two hits whenever I post new content. And I could use my own domain on the feed's URL. Heck, I could even turn the thing off for most of the day! To run the same feed with rssCloud -- without letting some other service rewrite my feed and use its own URL on my feed -- I'd have to have one or more machines in a computer room somewhere...

bob wyman

Jay Rossiter

Sep 16, 2009, 3:17:56 PM
to pubsub...@googlegroups.com
I'm not sure how anyone is "in between [you] and [your] subscribers" when using PSHB.   You still host your content - the hub only performs subscription management, and notifications/pushes of updates.  The biggest differences between PSHB and rssCloud are in how the updates are made - whether they require the subscriber to do the extra request to the publisher or not.

As a publisher you tell PSHB that your content is updated, and PSHB in turn notifies the subscribers.  I'm not seeing the loss of control.

There is no "burstiness" (client-generated) in PSHB, because the hub itself controls the rate at which update notifications are sent, and the amount of data being sent is potentially a lot smaller.  With rssCloud, every single subscriber must poll the publisher for the new content, and receive the entire feed as an update, whereas PSHB [can] send only the updated content.

Imagine that a client polls the feed during the same moment they're notified there's an update.  They're immediately going to go poll that feed again because there's no way for them to know that it wasn't updated during those milliseconds.

If rssCloud were to enable partial feed pushes, I think the question would have to be "Why is Dave rewriting PSHB..."

--

Jay Rossiter | Software Engineer/System Administrator
Pioneering RSS Advertising Solutions

jros...@pheedo.com | Phone: 503.896.6187 | Fax: 503.235.2216
Website: www.pheedo.com | RSS: www.pheedo.info/index.xml

Matthew Terenzio

Sep 16, 2009, 3:33:49 PM
to pubsub...@googlegroups.com
I UNDERSTAND how both protocols work.

I said a namespace could be created to allow for what is a valid but not insurmountable criticism of another spec.

I can write the namespaced extension. You could.

I've been working on RSS Cloud for years so don't give me, "Why is Dave rewriting PSHB..."

As if to say I can't apply these good ideas to existing systems.

Awww I give up. It's clear this list is getting too political for my blood.

I'll just listen in from now on and you won't have to hear from toy creators like me.


Bob Wyman

Sep 16, 2009, 3:39:04 PM
to pubsub...@googlegroups.com
On Wed, Sep 16, 2009 at 3:04 PM, Matthew Terenzio <mter...@gmail.com> wrote:
> even if they are weekend coders playing with toys. (just kidding Bob)
Hey! Come on! I've got nothing against "weekend coders" -- I do it myself on a regular basis... There are all sorts of quick hacks and demonstrations that are really, really useful in teaching us an interface or a code pattern that might otherwise be hard to visualize. It seems most things -- even the most complex bits of software -- actually start as ugly and quick hacks. The thing that bothers me is when we don't see the difference between a hack or demonstration and a real, production-ready, scalable design. Each of these two disciplines (quick hacks and production) is "hard" to do well. But they are also very different from each other and require different skill sets. "Easy to build" or "simple" etc. are great attributes for many things, but they don't always give us the best solutions to problems of Internet scale.


> Or is someone on this list against giving the
> publishers more control of how their content
> gets syndicated,
I am often criticized for supporting unnecessary complexity... Thus, you shouldn't be surprised that I personally would gladly support options that allowed more flexibility in the way that publishers interwork with PSHB. I also recognize that there may be some particularly good reasons why some publishers would want to retain the burden of serving their own feeds. However, I think the reality is that most of the rest of the PSHB community is much less tolerant of complexity than I am and they are trying really hard to keep the core specification as simple as possible.

But, I'm not too worried about PSHB ending up "too simple" to get the job done. There are alternative and potentially superior ways to handle most of these issues -- without actually modifying PSHB. For instance, if we really want a way for a publisher to say: "My content has changed but you need to visit me to see it!," then it is entirely possible that what we should be doing is focusing on the content of the message rather than wiring anything into the protocol. For instance, we could add a tag to Atom that tells you explicitly that you should fetch the original. Or, we could adopt a convention that says that if a post has an atom.summary but has no atom.content, then you are expected to poll the source feed... Or, we could define an Atom extension that was more expressive... Basically, I think we should use the message content as the means for a publisher to communicate what he wants rather than wiring implied statements into the protocol. (Note: This is what XMPP's XEP-0060 does. The difference between notifications and "content" is in the message, not the protocol.)
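The summary-without-content convention floated above could be checked on the subscriber side along these lines. This is a sketch of that suggestion only: the function name is hypothetical, and nothing here is defined by Atom or PSHB beyond the standard Atom namespace and the `summary` and `content` elements.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def should_poll_source(entry_xml):
    """Return True when an Atom entry carries a summary but no content.

    This implements only the convention suggested in the discussion;
    it is not part of the Atom or PSHB specifications.
    """
    entry = ET.fromstring(entry_xml)
    has_summary = entry.find(ATOM_NS + "summary") is not None
    has_content = entry.find(ATOM_NS + "content") is not None
    return has_summary and not has_content


notification_only = (
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<title>New post</title><summary>Teaser text</summary></entry>"
)
full_entry = (
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<title>New post</title><summary>Teaser text</summary>"
    "<content>Full text of the post</content></entry>"
)
print(should_poll_source(notification_only))  # True
print(should_poll_source(full_entry))         # False
```

The decision lives entirely in the message, so the delivery protocol is the same whether the publisher sends full content or only a pointer back to the source.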

Of course, if we come up with ways for publishers to broadcast notifications rather than the updated content, then publishers who send out notifications will probably have to recognize that they may not be able to benefit from at least some services that will grow around the PSHB ecosystem... For instance, while today we think mostly about "topic-based" distribution -- i.e. subscribing to known feeds by name -- in the future, people might like to subscribe to "concepts" or "words" that appear in the content of updates. Rather than saying "Tell me whenever Tom's feed changes!", you might like to say: "Tell me whenever any feed mentions PSHB." In that case, downstream systems are going to want to have the content (not just a notification of change) in order to match updates to subscriptions. Yes, they could get the notifications, then fetch the original feed to match against, and then forward the notifications that would then set off the thundering herds... There is always a way...
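A minimal sketch of the "tell me whenever any feed mentions PSHB" idea, assuming the downstream system already holds the text of the update. The function name, the subscriber names and the whole-word matching rule are illustrative assumptions, not a description of any existing service.

```python
def matching_subscribers(update_text, keyword_subscriptions):
    """Return subscribers whose keyword appears as a whole word in the update.

    keyword_subscriptions maps a subscriber name to the keyword it asked for.
    Matching is only possible because the content itself is at hand.
    """
    words = {word.strip(".,!?\"'()").lower() for word in update_text.split()}
    return sorted(
        subscriber
        for subscriber, keyword in keyword_subscriptions.items()
        if keyword.lower() in words
    )


subscriptions = {"alice": "PSHB", "bob": "rssCloud", "carol": "XMPP"}
update = "Notes on PSHB and rssCloud traffic shaping."
print(matching_subscribers(update, subscriptions))  # ['alice', 'bob']
```

With only a change notification in hand, the matcher would have nothing to inspect until it fetched the feed itself.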

bob wyman

Brett Slatkin

Sep 16, 2009, 3:43:53 PM
to pubsub...@googlegroups.com
Hey Matt,

> Of course, there is nothing saying that a namespace can't be defined for RSS
> that allows for partial feed "pushes" and then this whole conversation gets
> flipped and people start asking, "Hey, why doesn't PSHB allow us to decide
> whether we can have the clients go directly to us. My site is small, and I
> don't want anyone in between me and my subscribers."

I think this is a reasonable use-case and may be the only option for
private feeds. We'll see how that develops. So far the focus has been
on public feeds, which don't require this sensitivity.

> Or is someone on this list against giving the publishers more control of how
> their content gets syndicated, even if they are weekend coders playing with
> toys. (just kidding Bob)

What about the case where the publisher is merged with the hub?
Pádraic Brady is implementing a Hub library for Zend that would enable
any Zend application to function as its own hub. Then publishers get
the best of all things: control, speed, flexibility.

Do you think with the right libraries that weekend coders would be
well served by running their own hubs? There are already three hub
implementations out there for people to use. I'd love to see more.

-Brett

Jay Rossiter

Sep 16, 2009, 3:46:51 PM
to pubsub...@googlegroups.com
On 9/16/2009 12:33 PM, Matthew Terenzio wrote:
> I UNDERSTAND how both protocols work.
> I said a namespace could be created to allow for what is a valid but not insurmountable criticism of another spec.
> I can write the namespaced extension. You could.

    When you get down to it, isn't that basically what we're all on this list doing?


> I've been working on RSS Cloud for years so don't give me, "Why is Dave rewriting PSHB..."
> As if to say I can't apply these good ideas to existing systems.

    Nobody's saying that you can't add these, or any other, ideas to rssCloud... but when the features you want to add will end up making it virtually indistinguishable from an already available (developing) standard, why is it not better to just combine efforts and work on the one?  So, yes... I consider doing that "rewriting PSHB".  rssCloud also only applies to RSS, while PSHB works with Atom (and potentially any other format that may come to exist).  I just can't see any benefit to the community in it.


> Awww I give up. It's clear this list is getting too political for my blood.
> I'll just listen in from now on and you won't have to hear from toy creators like me.

    I don't think anyone's being political.  The message that began the thread detailed the reasons about WHY rssCloud never took off.  Even the one major site (Userland) that did implement it had to ditch it because of the exact issues described here.

Matthew Terenzio

Sep 16, 2009, 4:02:37 PM
to pubsub...@googlegroups.com


> Do you think with the right libraries that weekend coders would be
> well served by running their own hubs? There are already three hub
> implementations out there for people to use. I'd love to see more.
>
> -Brett


Good question. Honestly, I once said, "Feedburner, who will use that when it's so easy to generate a feed and serve it yourself."

So it may be that practically no one wants to run their own hub.

I personally didn't find the spec difficult to implement. I think it actually is very approachable for weekend coders.

Pádraic Brady

Sep 16, 2009, 4:59:10 PM
to pubsub...@googlegroups.com
I think running your own Hub will be relatively rare but it depends on how useful it is. I can certainly see myself using a Hub on a smaller scale to serve a specific niche, bundling it with additional options and features catering to that niche. I think a lot also depends on how easy it is to get a Hub working and, weekend coders aside (whatever they are), the open source community is sure to roll out ready-to-use implementations that need very little work to set up.

Open source aside, I found Pubsubhubbub to be extremely easy to implement. The Hub is trickier to implement but not overly so - the only reason I haven't finished the Zend Hub implementation yet is because I'm working on related components that I want to complete first, and getting a Subscriber and Publisher released is far more pressing to start with. It's just not a complicated programming task.

In regards to politics, I don't see anything overly political. A few slips into emotional statements buried at the end of a compelling argument hardly qualifies. The list is at least partially biased towards Pubsubhubbub ;).

Paddy

Pádraic Brady

http://blog.astrumfutura.com
http://www.survivethedeepend.com
OpenID Europe Foundation Irish Representative



From: Matthew Terenzio <mter...@gmail.com>
To: pubsub...@googlegroups.com
Sent: Wednesday, September 16, 2009 9:02:37 PM
Subject: [pubsubhubbub] Re: Burstiness -- Original Sin in Protocol Design...

Thomas Lord

Sep 16, 2009, 5:06:02 PM
to pubsub...@googlegroups.com
On Wed, 2009-09-16 at 13:59 -0700, Pádraic Brady wrote:
> I think running your own Hub will be relatively rare

I hope not! Then we are just back to centralization!

-t


Pádraic Brady

Sep 16, 2009, 5:12:32 PM
to pubsub...@googlegroups.com
Relatively ;). Like I said it depends on different factors: possible applications, benefits, ease of implementation, available implementations, etc. It will take time for people to realise its potential and in the meantime we will be faced with a certain assumption that PSHB requires big centralised Hubs. Breaking that assumption will take time (and many blog posts ;)).


From: Thomas Lord <lo...@emf.net>
To: pubsub...@googlegroups.com
Sent: Wednesday, September 16, 2009 10:06:02 PM

Subject: [pubsubhubbub] Re: Burstiness -- Original Sin in Protocol Design...