1M Fanout seems unlikely in the real-world (was: Re: [pubsubhubbub] Re: Privacy and PSHB)

Bob Wyman

Aug 3, 2009, 3:07:40 PM
to pubsub...@googlegroups.com
On Mon, Aug 3, 2009 at 11:29 AM, Brett Slatkin <bsla...@gmail.com> wrote:
> On Mon, Aug 3, 2009 at 8:14 AM, JoeCascio<joec...@gmail.com> wrote:
>> ... How would one design a hub, or a series of hubs to
>> distribute 1,000,000 notifications in a few seconds every time he
>> updates?
> This is the "fan-out problem" that people talk about. It's hard to
> solve.
We should consider how likely it is that we're actually going to see PSHB systems that need to do 1:1,000,000 fan-out. Personally, I think it is exceptionally unlikely. 
Just as sending 1M messages puts a great deal of load on the hub, it is likely to put unsustainable load on the Subscriber nodes which are receiving those messages. While there may be some "single-user" Subscriber nodes, I think most Subscriber nodes will actually be supporting many end users. (i.e. just like GMail or Hotmail serve many users.) If a Subscribing server is supporting 100K users, it will NOT want to receive 100K distinct HTTP posts from the hub every time that a popular feed is updated. In all probability, what the Subscriber node would most like to see would be a single message that it could then match against its own local distribution lists or a single message with a list of "subscription ids" (potentially many of them), that it could use to do its own fan-out. (Note: The single-message, multiple-recipients approach is used in XEP-0060 PubSub as well as by SMTP email systems -- precisely to reduce the load caused by this kind of fan-out. This kind of "batch" delivery results naturally in a multi-level, distributed fan-out system that optimizes everyone's resource utilization for the price of some minimal additional complexity.)

Subscribing systems that support multiple users can, of course, reduce the number of messages they receive given the PSHB protocol as it stands today. They need only maintain an internal mapping of subscriptions to users and then only register new subscriptions with the hub when they don't already have an outstanding subscription that matches. As long as PSHB remains a "topic-based" protocol, where there is likely to be substantial and easily detected overlap between subscriptions, that is probably all that is necessary. In theory, a PSHB hub might need to do 1:1M fanout, but in reality it probably won't happen. (Note: It might make sense to update the "Best Practices" section to explicitly suggest that multi-user Subscribers do, in fact, try to conserve their resources by eliminating duplicate subscriptions to a hub.)
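
A minimal sketch of the duplicate-elimination idea, assuming a subscriber that keeps an in-memory map from topic (feed URL) to local users. The class name, the `subscribe_at_hub` callable and the example URLs are hypothetical; no real hub is contacted.

```python
from collections import defaultdict


class SubscriptionRegistry:
    """Tracks which local users follow which topic (feed URL)."""

    def __init__(self, subscribe_at_hub):
        # subscribe_at_hub is a hypothetical callable standing in for the
        # real subscription request; injecting it keeps the sketch offline.
        self._subscribe_at_hub = subscribe_at_hub
        self._users_by_topic = defaultdict(set)

    def add(self, user, topic):
        """Record the user's interest; contact the hub only for new topics."""
        is_new_topic = topic not in self._users_by_topic
        self._users_by_topic[topic].add(user)
        if is_new_topic:
            self._subscribe_at_hub(topic)
        return is_new_topic

    def users_for(self, topic):
        """Return the local users interested in a topic."""
        return set(self._users_by_topic.get(topic, ()))


hub_requests = []  # records every topic we would have sent to the hub
registry = SubscriptionRegistry(hub_requests.append)
for user in ("alice", "bob", "carol"):
    registry.add(user, "http://example.com/popular.atom")
registry.add("alice", "http://example.com/niche.atom")
print(len(hub_requests), "hub subscriptions for 4 user subscriptions")
```

With this shape, the number of hub subscriptions grows with the number of distinct topics rather than with the number of users.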

(*** The following discusses stuff that PSHB does not support and might never support even though one could easily build a custom hub to implement what is discussed below.)
If, on the other hand, PSHB were to be extended to support "content-based" subscriptions then it would probably make sense to consider supporting some sort of batch delivery. (You get "content-based" subscriptions when you're not just subscribing to feeds but rather to items within feeds whose content matches some pattern -- e.g. "All entries that contain the term 'foobar'".) The reason is that many subscriptions from a single Subscriber node could match a single feed entry. In such a case, you would probably want a multi-user Subscriber to be able to identify that it was such a system and only get one or a small number of messages from the "content-based matching" hub even if it needed to distribute those messages to thousands of users. The easiest way to do this would be to have the Subscriber provide a "subscription_id" and "batch_id" for each of its subscriptions and for the hub to group outgoing messages by Subscriber+batch_id. The hub would then append a list of all matching subscription_ids to a single message sent to each batch_id. Thus, even with content-based systems, you can still get really easy distributed fan-out which benefits both the hub and the subscriber.
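
The grouping step could look roughly like the sketch below. PSHB defines none of this: the tuple layout, the function name and the example URLs are assumptions made only to illustrate batching by Subscriber+batch_id.

```python
from collections import defaultdict


def batch_matches(matches):
    """Group matching subscriptions so each (callback, batch_id) gets one message.

    `matches` is an iterable of (callback_url, batch_id, subscription_id)
    tuples for a single feed entry. All three fields are hypothetical.
    """
    batches = defaultdict(list)
    for callback_url, batch_id, subscription_id in matches:
        batches[(callback_url, batch_id)].append(subscription_id)
    return dict(batches)


matches = [
    ("http://reader.example/cb", "b1", "sub-17"),
    ("http://reader.example/cb", "b1", "sub-42"),
    ("http://reader.example/cb", "b2", "sub-99"),
    ("http://other.example/cb", "b1", "sub-3"),
]
outgoing = batch_matches(matches)
for (callback_url, batch_id), subscription_ids in outgoing.items():
    # One message per batch, carrying the list of matching subscription ids.
    print(callback_url, batch_id, subscription_ids)
```

Here four matching subscriptions collapse into three outgoing messages; the saving grows with the number of subscriptions that share a batch.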

bob wyman

On Mon, Aug 3, 2009 at 11:29 AM, Brett Slatkin <bsla...@gmail.com> wrote:

Hey Joe,

I took a stab at what support for private and authenticated feeds
would look like in PubSubHubbub in this thread:

http://groups.google.com/group/pubsubhubbub/msg/3108fe7fdc740482

Otherwise, your best bet right now is to use obfuscated feed URLs.
These are URLs that are hard to guess (i.e., the URL is a password).
When accessed over SSL they are reasonably secure in keeping the
content private. These can be used right now with Hubbub for
non-public feeds. In the future we'd like to have a full authorization
scheme as I describe above.

Let me know what you think!

On Mon, Aug 3, 2009 at 8:14 AM, JoeCascio<joec...@gmail.com> wrote:
> How does
> PSHB address a publisher that pumps out 100 new updates a second to
> tens of millions of subscribers? Also consider an Ashton Kutcher-scale
> publisher. How would one design a hub, or a series of hubs to
> distribute 1,000,000 notifications in a few seconds every time he
> updates?

This is the "fan-out problem" that people talk about. It's hard to
solve. The simple approach is to incrementally send the notifications
to any subscribers at a sustained rate. So say we did 10,000 per
second; for the 1M subscriber case we'd need ~100 seconds to push the
updates to all subscribers. This indicates that the primary pivot here
is load versus latency.

Theoretically we could push 100,000 updates per second and get this
done in 10 seconds, but the amount of load that puts on our servers
(and HTTP connections) is so high that it's really not worth it.
Instead, we pick a load level we're okay with and then eventually
replicate the data out to all subscribers. That means that as your
feed gains subscribers you sacrifice latency for volume. This is the
reality of the situation. I believe it's no different on existing
services out there. The good news is Hubbub can handle this case in
theory-- I plan to further load-test our hub with the 1M subscriber
case to see how fast we can go.

Otherwise, it's good to keep things in perspective. For the 10,000
subscriber case, we can deliver all notifications in 1 second at this
same rate. So for practical situations, this scales just fine for the
vast majority of users.
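
The load-versus-latency figures above are simple division. A small sketch that reproduces them, assuming a constant sustained delivery rate and ignoring retries and connection overhead (the function name is illustrative only):

```python
def delivery_seconds(subscribers, posts_per_second):
    """Time to notify every subscriber at a constant sustained rate."""
    return subscribers / posts_per_second


scenarios = [
    (1_000_000, 10_000),   # 1M subscribers at 10,000 posts/second
    (1_000_000, 100_000),  # same audience at ten times the load
    (10_000, 10_000),      # the more typical 10,000 subscriber feed
]
for subscribers, rate in scenarios:
    seconds = delivery_seconds(subscribers, rate)
    print(f"{subscribers:>9,} subscribers at {rate:>7,}/s -> {seconds:.0f} s")
```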

-Brett

Julien Genestoux

Aug 3, 2009, 3:34:23 PM
to pubsub...@googlegroups.com, pubsub...@googlegroups.com


--
Julien Genestoux

Sent from my iPhone


On Aug 3, 2009, at 12:07, Bob Wyman <bobw...@gmail.com> wrote:

On Mon, Aug 3, 2009 at 11:29 AM, Brett Slatkin <bsla...@gmail.com> wrote:
> On Mon, Aug 3, 2009 at 8:14 AM, JoeCascio<joec...@gmail.com> wrote:
>> ... How would one design a hub, or a series of hubs to
>> distribute 1,000,000 notifications in a few seconds every time he
>> updates?
> This is the "fan-out problem" that people talk about. It's hard to
> solve.
> We should consider how likely it is that we're actually going to see PSHB systems that need to do 1:1,000,000 fan-out. Personally, I think it is exceptionally unlikely.

Indeed... Specifically if we CAN achieve a PSHB federation where hubs subscribe to other hubs and subscribers to their very own hub... However, there is a design problem in this case, since currently a subscriber should only subscribe to the 'linked' hub...

 
> Just as sending 1M messages puts a great deal of load on the hub, it is likely to put unsustainable load on the Subscriber nodes which are receiving those messages. While there may be some "single-user" Subscriber nodes, I think most Subscriber nodes will actually be supporting many end users. (i.e. just like GMail or Hotmail serve many users.) If a Subscribing server is supporting 100K users, it will NOT want to receive 100K distinct HTTP posts from the hub every time that a popular feed is updated.


That shouldn't happen indeed. A hub should subscribe just once to another hub on behalf of all its users... So instead of having a 1=>1M, we should have 1=>1k 1001 times, which is probably easier to scale.
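
The "1=>1k, 1001 times" figure can be checked with a short calculation. This is only a sketch, assuming one origin hub, evenly loaded intermediary hubs and 1,000 subscribers per intermediary; the function name is made up for the illustration.

```python
import math


def two_level_fanout(end_subscribers, subscribers_per_hub):
    """Return (intermediary_hubs, fanout_operations, widest_fanout)."""
    intermediary_hubs = math.ceil(end_subscribers / subscribers_per_hub)
    # One fan-out from the origin hub to the intermediaries, plus one
    # fan-out per intermediary hub to its own subscribers.
    fanout_operations = 1 + intermediary_hubs
    # The widest single fan-out any one hub has to perform.
    widest_fanout = max(intermediary_hubs, min(subscribers_per_hub, end_subscribers))
    return intermediary_hubs, fanout_operations, widest_fanout


hubs, operations, widest = two_level_fanout(1_000_000, 1_000)
print(f"{hubs} intermediary hubs, {operations} fan-outs, widest is 1:{widest}")
```

No single hub sends more than 1,000 notifications per update, at the cost of one extra hop of latency.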

> In all probability, what the Subscriber node would most like to see would be a single message that it could then match against its own local distribution lists or a single message with a list of "subscription ids" (potentially many of them), that it could use to do its own fan-out. (Note: The single-message, multiple-recipients approach is used in XEP-0060 PubSub as well as by SMTP email systems -- precisely to reduce the load caused by this kind of fan-out. This kind of "batch" delivery results naturally in a multi-level, distributed fan-out system that optimizes everyone's resource utilization for the price of some minimal additional complexity.)
>
> Subscribing systems that support multiple users can, of course, reduce the number of messages they receive given the PSHB protocol as it stands today. They need only maintain an internal mapping of subscriptions to users and then only register new subscriptions with the hub when they don't already have an outstanding subscription that matches. As long as PSHB remains a "topic-based" protocol, where there is likely to be substantial and easily detected overlap between subscriptions, that is probably all that is necessary. In theory, a PSHB hub might need to do 1:1M fanout but in reality, it probably won't happen. (Note: It might make sense to update the "Best Practices" section to explicitly suggest that multi-user Subscriber do, in fact, try to conserve their resources by eliminating duplicate subscriptions to a hub.)
 I will add that.


I think we should change the subscription process a little. I think that subscribers should NOT subscribe to the hub identified in the feed but rather to one (or two) hubs of their choice, and the hubs themselves would subscribe to the hubs shown in the feed...
That would obviously imply that hubs should keep an internal list of subscribers, but also that they're able to parse feeds to identify the 'official' hub...

Alexis Richardson

Aug 3, 2009, 4:34:02 PM
to pubsub...@googlegroups.com
Sorry to topline this.

The justification for a symmetric / federated factoring is to chain
hubs easily, so that fanout can be ramified as Julien suggests.

Brett Slatkin

Aug 4, 2009, 11:06:18 AM
to pubsub...@googlegroups.com
Hi Bob,

On Mon, Aug 3, 2009 at 12:07 PM, Bob Wyman<bobw...@gmail.com> wrote:
> We should consider how likely it is that we're actually going to see PSHB
> systems that need to do 1:1,000,000 fan-out. Personally, I think it is
> exceptionally unlikely.
> Just as sending 1M messages puts a great deal of load on the hub, it is
> likely to put unsustainable load on the Subscriber nodes which are receiving
> those messages. While there may be some "single-user" Subscriber nodes, I
> think most Subscriber nodes will actually be supporting many end users.
> (i.e. just like GMail or Hotmail serve many users.) If a Subscribing server
> is supporting 100K users, it will NOT want to receive 100K distinct HTTP
> posts from the hub every time that a popular feed is updated. In all
> probability, what the Subscriber node would most like to see would be a
> single message that it could then match against its own local distribution
> lists or a single message with a list of "subscription ids" (potentially
> many of them), that it could use to do its own fan-out. (Note: The
> single-message, multiple-recipients approach is used in XEP-0060 PubSub as
> well as by SMTP email systems -- precisely to reduce the load caused by this
> kind of fan-out. This kind of "batch" delivery results naturally in a
> multi-level, distributed fan-out system that optimizes everyone's resource
> utilization for the price of some minimal additional complexity.)

You're missing a key aspect of the PubSubHubbub protocol as it relates
to Atom. Hubbub requires support for aggregation of notification
delivery for a single subscriber callback URL. That means the hub can
spool 1000 messages for a single receiver across 1000 feeds and turn
them into a single HTTP post:

http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.1.html#aggregatedistribution

This can make delivery more efficient for publishers and subscribers,
but it does not obviate the need for a subscriber to do local fan-out
itself. To my knowledge, such aggregation is not supported in RSS,
which impacts the scalability of that subset of the protocol. But I'm
not worried about this.
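
As a rough illustration of that aggregation, a hub could spool pending notifications and group them by callback URL before posting. This is a sketch under assumptions, not the reference hub's code: the queue layout, function name and URLs are hypothetical, and building the aggregated Atom document itself is left out.

```python
from collections import defaultdict


def aggregate_by_callback(pending):
    """Turn spooled (callback_url, topic, entry_id) items into one POST per callback."""
    posts = defaultdict(list)
    for callback_url, topic, entry_id in pending:
        posts[callback_url].append((topic, entry_id))
    return dict(posts)


# 1000 updates across 1000 feeds, all destined for the same receiver...
pending = [
    ("http://reader.example/cb", f"http://feeds.example/{n}.atom", f"entry-{n}")
    for n in range(1000)
]
# ...plus one update for a single-user subscriber.
pending.append(("http://solo.example/cb", "http://feeds.example/0.atom", "entry-0"))

posts = aggregate_by_callback(pending)
print(len(pending), "spooled notifications ->", len(posts), "HTTP posts")
```

The receiver behind `http://reader.example/cb` still has to map each (topic, entry) pair to its own end users, which is the local fan-out mentioned above.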

> Subscribing systems that support multiple users can, of course, reduce the
> number of messages they receive given the PSHB protocol as it stands today.
> They need only maintain an internal mapping of subscriptions to users and
> then only register new subscriptions with the hub when they don't already
> have an outstanding subscription that matches. As long as PSHB remains a
> "topic-based" protocol, where there is likely to be substantial and easily
> detected overlap between subscriptions, that is probably all that is
> necessary. In theory, a PSHB hub might need to do 1:1M fanout but in
> reality, it probably won't happen. (Note: It might make sense to update the
> "Best Practices" section to explicitly suggest that multi-user Subscriber
> do, in fact, try to conserve their resources by eliminating duplicate
> subscriptions to a hub.)

Yes, this would be a good addition to the best practices section.

> (*** The following discusses stuff that PSHB does not support and might
> never support even though one could easily build a custom hub to implement
> what is discussed below.)
> If, on the other hand, PSHB were to be extended to support "content-based"
> subscriptions then it would probably make sense to consider supporting some
> sort of batch delivery. (You get "content-based" subscriptions when you're
> not just subscribing to feeds but rather to items within feeds whose content
> matches some pattern -- i.e. "All entries that contain the term 'foobar'".)
>  The reason is that many subscriptions from a single Subscriber hub could
> match a single feed entry. In such a case, you would probably want a
> multi-user Subscriber to be able to identify that it was such a system and
> only get one or a small number of messages from the "content-based matching"
> hub even if it needed to distribute those messages to thousands of users.
> The easiest way to do this would be to have the Subscriber provide a
> "subscription-id" and "batch-id" for each of its subscriptions and for the
> hub to group outgoing messages by Subscriber+batch_id. The hub would then
> append a list of all matching subscription_ids to a single message sent to
> each batch_id. Thus, even with content-based systems, you can still get
> really easy distributed fan-out which benefits both the hub and the
> subscriber.

I disagree with the idea that the Hub should collate end-user
recipients on behalf of a subscriber. Subscribers should be smart
enough to track subscriptions and do their own local fan-out to
end-users. Baking this into the PubSubHubbub spec would be a mistake
because it forces application developers to write their systems a
certain way. The design philosophy here is to implement the minimum
possible. The functionality you describe here is out of scope.

-Brett

Brett Slatkin

Aug 4, 2009, 11:19:35 AM
to pubsub...@googlegroups.com
> I think we should change the subscription process a little. I
> think that subscribers should NOT subscribe to the hub identified in the
> feed but rather to one (or two) hubs of their choice, and the hubs themselves
> would subscribe to the hubs shown in the feed...
> That would obviously imply that hubs should keep an internal list of
> subscribers, but also that they're able to parse feeds to identify the
> 'official' hub...

I understand your suggestion, Julien, but I don't think the protocol
needs to change to accommodate it.

In the present world the publisher is in control of how *they* want to
distribute their content. They delegate trust through an
auto-discoverable mechanism (HTML -> <link rel="alternate"> ->
<atom:link rel="hub">). Thus, new subscribers who know nothing about
the publisher or the hub can quickly establish a trust relationship
and receive updates from the Hub on behalf of the publisher.
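
The last step of that discovery chain, reading the hub link out of an Atom feed, can be sketched with the standard library. The sample feed and its URLs are invented for the example, and fetching the feed or following the HTML `<link rel="alternate">` is not shown.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# A made-up feed; only the rel="hub" link matters for discovery.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link rel="self" href="http://publisher.example/feed.atom"/>
  <link rel="hub" href="http://hub.example/"/>
  <entry>
    <title>First post</title>
    <link rel="alternate" href="http://publisher.example/posts/1"/>
  </entry>
</feed>"""


def find_hub_urls(atom_xml):
    """Return the href of every feed-level <link rel="hub"> element."""
    root = ET.fromstring(atom_xml)
    return [
        link.get("href")
        for link in root.findall(f"{ATOM_NS}link")  # direct children of <feed> only
        if link.get("rel") == "hub"
    ]


print(find_hub_urls(SAMPLE_FEED))
```

A subscriber that knows nothing else about the publisher can use the returned URL as the hub to which it sends its subscription request.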

What you're suggesting is another link in the chain to act as an
intermediary between the Hub and the subscriber. This implies that the
subscriber's hub would be trusted by the subscriber, and thus the
trust chain would have another link on it. For simple use-cases this
would require another moving part and make it harder for simple
subscribers to get up and running. This is why I'm against suggesting
this as the "right way" to do things.

That said, I see opportunities for these intermediary hubs to provide
a lot of value for subscribers (e.g., reliability, content
translation, stats). So it's the subscriber's choice and up to the
semantics of their relationship with their intermediary hub to decide
how they want to subscribe. It doesn't need to be baked into the
protocol at all.

-Brett

Bob Wyman

Aug 4, 2009, 11:43:51 AM
to pubsub...@googlegroups.com
On Tue, Aug 4, 2009 at 11:06 AM, Brett Slatkin <bsla...@gmail.com> wrote:
> You're missing a key aspect of the PubSubHubbub
> protocol as it relates to Atom.
Clearly, I'm not communicating well... No, I didn't miss that Atom provides for aggregation... 

Let me explain a bit further...
In one method for setting up a subscriber to support multiple end users, the subscriber would generate user-specific callback URLs. Doing this would allow the subscriber to determine which user should receive an entry by simply looking at which callback URL was invoked. Thus, if the subscriber supported 10,000 accounts, it would use up to 10,000 callback URLs with the hub -- one for each user. (Some developers might want to do this in order to reduce the amount of state they need to maintain or, in the case of something like AppEngine, to reduce the rate of database quota consumption. In some cases, this might even allow a shared subscriber to eliminate or defer some parsing. Developers of small systems are particularly likely to find this approach attractive; however, they will get bitten badly by floods of traffic once their user base grows.) If such an approach became widespread, then we shouldn't be surprised to see hubs needing to do massive fanout.

The more intelligent and scalable way for a subscriber to support multiple end users is for the subscriber to maintain a list of feeds currently subscribed to and a mapping of feeds to end-users. Such a subscriber would only use a single callback URL and, on receiving a new entry, would map to the set of users interested in the specific URL. In such a setup, arbitrarily large amounts of fanout would be distributed from the hub to the subscriber and thus the hub, network, etc. would be relieved of a great deal of work. The subscriber would also receive fewer updates from the hub. If such an approach becomes widespread and if a large percentage of end-users rely on shared subscribers, then the amount of fanout from hubs would be significantly reduced.
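
To make the difference between the two designs concrete, here is a small sketch. It assumes every user of one subscriber service follows the same feed; the function names, the topic-to-users mapping and the URLs are hypothetical.

```python
def hub_posts_per_update(followers, per_user_callbacks):
    """POSTs the hub sends to one subscriber service for one feed update."""
    if per_user_callbacks:
        return followers          # one callback URL, and so one POST, per user
    return min(followers, 1)      # a single shared callback URL


def handle_notification(topic, entry, users_by_topic):
    """Local fan-out behind the shared callback: map the topic to its users."""
    return {user: entry for user in users_by_topic.get(topic, ())}


users_by_topic = {
    "http://example.com/popular.atom": {"alice", "bob", "carol"},
}

print(hub_posts_per_update(10_000, per_user_callbacks=True), "posts with per-user callbacks")
print(hub_posts_per_update(10_000, per_user_callbacks=False), "post with a shared callback")
deliveries = handle_notification(
    "http://example.com/popular.atom", "entry-1", users_by_topic
)
print(sorted(deliveries))
```

The work does not disappear in the second design; it moves from the hub and the network to the subscriber, which already holds the user list.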

Developers of shared subscribers -- even small ones -- should be encouraged to avoid using callback URLs that embed distribution or routing data.

bob wyman

Pádraic Brady

Aug 4, 2009, 12:19:52 PM
to pubsub...@googlegroups.com
Given your explanation, this makes perfect sense. To be honest, I would have assumed a Subscriber would not encode this kind of data in its callback URLs, since it's obvious (at least when explained up front) what havoc it could create for a Hub. It's something for the best practices section of the Wiki, if not a direct mention in the spec, to make it clear how unacceptable it is considered to be.
 
Pádraic Brady

http://blog.astrumfutura.com
http://www.survivethedeepend.com
OpenID Europe Foundation Irish Representative




igrigorik

Aug 4, 2009, 10:25:24 PM
to Pubsubhubbub
Just to play devil's advocate... PSHB can be the enabling technology
that actually makes it possible to achieve the 1M subscriber use case.
Namely, one of the problems for any edge-based news client / aggregator is
the fact that it has to poll every resource repeatedly. This wastes
the client's bandwidth and quickly overwhelms the actual publishers.
Instead we've evolved to a use case where one aggregator (Google
Reader, Bloglines, etc.) takes on the burden and then redistributes
the content.

But with PSHB that problem is gone because any edge client can just
subscribe to the updates, so it's actually possible to start thinking
about apps like TweetDeck and others as potential clients for a PSHB
hub.

Of course, there is one major flaw in my argument: NAT and firewalls.
Having said that, there is still some possibility of this being
solved: reverse HTTP, private hubs, etc.

ig

On Aug 3, 3:07 pm, Bob Wyman <bobwy...@gmail.com> wrote:
> On Mon, Aug 3, 2009 at 11:29 AM, Brett Slatkin <bslat...@gmail.com> wrote:
> > On Mon, Aug 3, 2009 at 8:14 AM, JoeCascio<joec0...@gmail.com> wrote:
> >> ... How would one design a hub, or a series of hubs to
> >> distribute 1,000,000 notifications in a few seconds every time he
> >> updates?
> > This is the "fan-out problem" that people talk about. It's hard to
> > solve.
> We should consider how likely it is that we're actually going to see
> On Mon, Aug 3, 2009 at 11:29 AM, Brett Slatkin <bslat...@gmail.com> wrote:
>
> > Hey Joe,
>
> > I took a stab at what support for private and authenticated feeds
> > would look like in PubSubHubbub in this thread:
>
> > http://groups.google.com/group/pubsubhubbub/msg/3108fe7fdc740482
>
> > Otherwise, your best bet right now is to use obfuscated feed URLs.
> > These are URLs that are hard to guess (i.e., the URL is a password).
> > When accessed over SSL they are reasonably secure in keeping the
> > content private. These can be used right now with Hubbub for
> > non-public feeds. In the future we'd like to have a full authorization
> > scheme as I describe above.
>
> > Let me know what you think!
>

kugutsumen

Aug 6, 2009, 5:27:23 AM
to Pubsubhubbub


On Aug 5, 9:25 am, igrigorik <i...@aiderss.com> wrote:
>
> But, with PSHB that problem is gone because any edge client can just
> subscribe to the updates, so it's actually possible to start thinking
> about apps like tweetdeck and others as a potential client for a PSHB
> hub.
>
> Of course, there is one major flaw in my argument: nat and firewalls.
> Having said that, there is still some possibility of this being
> solved: reverse http, private hubs, etc.

Most home routers support UPnP by default, so a desktop application
could easily tell the router to listen on a port and DNAT incoming
connections to your machine. (I always disable UPnP for security
reasons, but most people don't know what it is and leave it enabled.)

To support roaming users and dynamic IP addresses, PSHB should
let subscribers update their callback URL.

It won't work in most office environments, and some ISPs completely
firewall their consumers' address blocks. In this
particular case the subscriber should be allowed to poll the hub.