Hey Joe,
I took a stab at what support for private and authenticated feeds
would look like in PubSubHubbub in this thread:
http://groups.google.com/group/pubsubhubbub/msg/3108fe7fdc740482
Otherwise, your best bet right now is to use obfuscated feed URLs.
These are URLs that are hard to guess (i.e., the URL is a password).
When accessed over SSL they are reasonably secure in keeping the
content private. These can be used right now with Hubbub for
non-public feeds. In the future we'd like to have a full authorization
scheme as I describe above.
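
For illustration only, here is a minimal sketch of minting such a hard-to-guess feed URL. The function name, the base URL, and the `/feed.atom` path layout are assumptions made for this example; nothing here is part of Hubbub itself. It relies on Python's standard `secrets` module for the random token.

```python
import secrets


def make_private_feed_url(base_url: str, nbytes: int = 32) -> str:
    """Return an HTTPS feed URL that embeds an unguessable random token.

    Hypothetical helper: the URL layout is an assumption, not a Hubbub rule.
    """
    # The URL acts as the password, so refuse to build one for plain HTTP.
    if not base_url.startswith("https://"):
        raise ValueError("obfuscated feed URLs should only be served over SSL")
    token = secrets.token_urlsafe(nbytes)  # URL-safe text from nbytes random bytes
    return f"{base_url.rstrip('/')}/{token}/feed.atom"


url = make_private_feed_url("https://example.com/feeds")
print(url)
```

Anyone who learns the URL can read the feed, so it has to be handled like a password (kept out of logs, referrers, and public pages).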
Let me know what you think!
On Mon, Aug 3, 2009 at 8:14 AM, JoeCascio<joec...@gmail.com> wrote:
> How does
> PSHB address a publisher that pumps out 100 new updates a second to
> tens of millions of subscribers? Also consider an Ashton Kutcher-scale
> publisher. How would one design a hub, or a series of hubs to
> distribute 1,000,000 notifications in a few seconds every time he
> updates?
This is the "fan-out problem" that people talk about. It's hard to
solve. The simple approach is to incrementally send the notifications
to any subscribers at a sustained rate. So say we did 10,000 per
second; for the 1M subscriber case we'd need ~100 seconds to push the
updates to all subscribers. This indicates that the primary pivot here
is load versus latency.
Theoretically we could push 100,000 updates per second and get this
done in 10 seconds, but the amount of load that puts on our servers
(and HTTP connections) is so high that it's really not worth it.
Instead, we pick a load level we're okay with and then eventually
replicate the data out to all subscribers. That means that as your
feed gains subscribers you sacrifice latency for volume. This is the
reality of the situation. I believe it's no different on existing
services out there. The good news is Hubbub can handle this case in
theory-- I plan to further load-test our hub with the 1M subscriber
case to see how fast we can go.
Otherwise, it's good to keep things in perspective. For the 10,000
subscriber case, we can deliver all notifications in 1 second at this
same rate. So for practical situations, this scales just fine for the
vast majority of users.
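
The arithmetic behind these numbers is just subscribers divided by the sustained delivery rate. A small sketch, using only the figures quoted in this message (the function name is made up for the example):

```python
def fanout_seconds(subscribers: int, rate_per_second: int) -> float:
    """Time to notify every subscriber at a fixed sustained delivery rate."""
    return subscribers / rate_per_second


# (subscribers, notifications per second), taken from the message above
cases = [(1_000_000, 10_000), (1_000_000, 100_000), (10_000, 10_000)]
for subscribers, rate in cases:
    seconds = fanout_seconds(subscribers, rate)
    print(f"{subscribers:>9,} subscribers at {rate:>7,}/s -> {seconds:g} s")
```

This ignores retries, slow subscribers, and connection setup, so it is a lower bound on real delivery time rather than a prediction.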
-Brett
On Mon, Aug 3, 2009 at 11:29 AM, Brett Slatkin <bsla...@gmail.com> wrote:
> On Mon, Aug 3, 2009 at 8:14 AM, JoeCascio<joec...@gmail.com> wrote:
>> ... How would one design a hub, or a series of hubs to
>> distribute 1,000,000 notifications in a few seconds every time he
>> updates?
> This is the "fan-out problem" that people talk about. It's hard to
> solve.
We should consider how likely it is that we're actually going to see PSHB systems that need to do 1:1,000,000 fan-out. Personally, I think it is exceptionally unlikely.
Just as sending 1M messages puts a great deal of load on the hub, it is likely to put unsustainable load on the Subscriber nodes which are receiving those messages. While there may be some "single-user" Subscriber nodes, I think most Subscriber nodes will actually be supporting many end users. (i.e. just like GMail or Hotmail serve many users.) If a Subscribing server is supporting 100K users, it will NOT want to receive 100K distinct HTTP posts from the hub every time that a popular feed is updated.
In all probability, what the Subscriber node would most like to see would be a single message that it could then match against its own local distribution lists or a single message with a list of "subscription ids" (potentially many of them), that it could use to do its own fan-out. (Note: The single-message, multiple-recipients approach is used in XEP-0060 PubSub as well as by SMTP email systems -- precisely to reduce the load caused by this kind of fan-out. This kind of "batch" delivery results naturally in a multi-level, distributed fan-out system that optimizes everyone's resource utilization for the price of some minimal additional complexity.)
Subscribing systems that support multiple users can, of course, reduce the number of messages they receive given the PSHB protocol as it stands today. They need only maintain an internal mapping of subscriptions to users and then only register new subscriptions with the hub when they don't already have an outstanding subscription that matches. As long as PSHB remains a "topic-based" protocol, where there is likely to be substantial and easily detected overlap between subscriptions, that is probably all that is necessary. In theory, a PSHB hub might need to do 1:1M fan-out but in reality, it probably won't happen. (Note: It might make sense to update the "Best Practices" section to explicitly suggest that multi-user Subscribers do, in fact, try to conserve their resources by eliminating duplicate subscriptions to a hub.)
The justification for a symmetric / federated factoring is to chain
hubs easily, so that fan-out can be ramified as Julien suggests.
On Mon, Aug 3, 2009 at 12:07 PM, Bob Wyman<bobw...@gmail.com> wrote:
> We should consider how likely it is that we're actually going to see PSHB
> systems that need to do 1:1,000,000 fan-out. Personally, I think it is
> exceptionally unlikely.
> Just as sending 1M messages puts a great deal of load on the hub, it is
> likely to put unsustainable load on the Subscriber nodes which are receiving
> those messages. While there may be some "single-user" Subscriber nodes, I
> think most Subscriber nodes will actually be supporting many end users.
> (i.e. just like GMail or Hotmail serve many users.) If a Subscribing server
> is supporting 100K users, it will NOT want to receive 100K distinct HTTP
> posts from the hub every time that a popular feed is updated. In all
> probability, what the Subscriber node would most like to see would be a
> single message that it could then match against its own local distribution
> lists or a single message with a list of "subscription ids" (potentially
> many of them), that it could use to do its own fan-out. (Note: The
> single-message, multiple-recipients approach is used in XEP-0060 PubSub as
> well as by SMTP email systems -- precisely to reduce the load caused by this
> kind of fan-out. This kind of "batch" delivery results naturally in a
> multi-level, distributed fan-out system that optimizes everyone's resource
> utilization for the price of some minimal additional complexity.)
You're missing a key aspect of the PubSubHubbub protocol as it relates
to Atom. Hubbub requires support for aggregation of notification
delivery for a single subscriber callback URL. That means the hub can
spool 1000 messages for a single receiver across 1000 feeds and turn
them into a single HTTP post:
http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.1.html#aggregatedistribution
This can make delivery more efficient for publishers and subscribers,
but it does not obviate the need for a subscriber to do local fan-out
itself. To my knowledge, such aggregation is not supported in RSS,
which impacts the scalability of that subset of the protocol. But I'm
not worried about this.
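
To make the aggregation idea concrete, here is a rough sketch of how a hub might spool pending notifications and group them by subscriber callback URL so that each callback gets one HTTP POST. This is not the reference hub's code: the tuple layout, the function name, and the example URLs are all assumptions, and building the combined Atom document is left out.

```python
from collections import defaultdict


def group_by_callback(pending):
    """Group spooled notifications so each callback URL gets one batch.

    `pending` is an iterable of (callback_url, topic_url, entry) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    batches = defaultdict(list)
    for callback_url, topic_url, entry in pending:
        batches[callback_url].append((topic_url, entry))
    return dict(batches)


pending = [
    ("https://sub.example/cb", "https://a.example/feed", "<entry>a1</entry>"),
    ("https://sub.example/cb", "https://b.example/feed", "<entry>b1</entry>"),
    ("https://other.example/cb", "https://a.example/feed", "<entry>a1</entry>"),
]

batches = group_by_callback(pending)
for callback_url, items in batches.items():
    # One POST per callback, carrying every spooled entry for that receiver.
    print(f"{callback_url}: 1 POST carrying {len(items)} entries")
```

The subscriber still has to map each delivered entry back to its own end users, which is the local fan-out mentioned above.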
> Subscribing systems that support multiple users can, of course, reduce the
> number of messages they receive given the PSHB protocol as it stands today.
> They need only maintain an internal mapping of subscriptions to users and
> then only register new subscriptions with the hub when they don't already
> have an outstanding subscription that matches. As long as PSHB remains a
> "topic-based" protocol, where there is likely to be substantial and easily
> detected overlap between subscriptions, that is probably all that is
> necessary. In theory, a PSHB hub might need to do 1:1M fanout but in
> reality, it probably won't happen. (Note: It might make sense to update the
> "Best Practices" section to explicitly suggest that multi-user Subscribers
> do, in fact, try to conserve their resources by eliminating duplicate
> subscriptions to a hub.)
Yes, this would be a good addition to the best practices section.
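
As a sketch of what that best practice could look like on the subscriber side: keep a map from topic URL to local users, and only talk to the hub when the first user asks for a topic. The class and method names are invented for this example, and `subscribe_to_hub` stands in for whatever code actually sends the subscription request.

```python
class SubscriptionTable:
    """Tracks which local users follow which topic (hypothetical helper)."""

    def __init__(self, subscribe_to_hub):
        # Callable that performs the real hub subscription for a topic URL.
        self._subscribe_to_hub = subscribe_to_hub
        self._users_by_topic = {}

    def add(self, topic_url, user_id):
        users = self._users_by_topic.setdefault(topic_url, set())
        if not users:
            # No outstanding subscription for this topic yet: register once.
            self._subscribe_to_hub(topic_url)
        users.add(user_id)

    def users_for(self, topic_url):
        """Local fan-out list for a notification about `topic_url`."""
        return set(self._users_by_topic.get(topic_url, ()))


hub_calls = []  # records what would have been sent to the hub
table = SubscriptionTable(hub_calls.append)
for user in ("alice", "bob", "carol"):
    table.add("https://popular.example/feed", user)

print(len(hub_calls), "hub subscription for", len(table.users_for("https://popular.example/feed")), "users")
```

With this in place the hub sees one subscription per topic per subscriber node, no matter how many end users sit behind it.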
> (*** The following discusses stuff that PSHB does not support and might
> never support even though one could easily build a custom hub to implement
> what is discussed below.)
> If, on the other hand, PSHB were to be extended to support "content-based"
> subscriptions then it would probably make sense to consider supporting some
> sort of batch delivery. (You get "content-based" subscriptions when you're
> not just subscribing to feeds but rather to items within feeds whose content
> matches some pattern -- i.e. "All entries that contain the term 'foobar'".)
> The reason is that many subscriptions from a single Subscriber hub could
> match a single feed entry. In such a case, you would probably want a
> multi-user Subscriber to be able to identify that it was such a system and
> only get one or a small number of messages from the "content-based matching"
> hub even if it needed to distribute those messages to thousands of users.
> The easiest way to do this would be to have the Subscriber provide a
> "subscription-id" and "batch-id" for each of its subscriptions and for the
> hub to group outgoing messages by Subscriber+batch_id. The hub would then
> append a list of all matching subscription_ids to a single message sent to
> each batch_id. Thus, even with content-based systems, you can still get
> really easy distributed fan-out which benefits both the hub and the
> subscriber.
I disagree with the idea that the Hub should collate end-user
recipients on behalf of a subscriber. Subscribers should be smart
enough to track subscriptions and do their own local fan-out to
end-users. Baking this into the PubSubHubbub spec would be a mistake
because it forces application developers to write their systems a
certain way. The design philosophy here is to implement the minimum
possible. The functionality you describe here is out of scope.
-Brett
I understand your suggestion, Julien, but I don't think the protocol
needs to change to accommodate it.
In the present world the publisher is in control of how *they* want to
distribute their content. They delegate trust through an
auto-discoverable mechanism (HTML -> <link rel="alternate"> ->
<atom:link rel="hub">). Thus, new subscribers who know nothing about
the publisher or the hub can quickly establish a trust relationship
and receive updates from the Hub on behalf of the publisher.
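
For illustration, the last step of that discovery chain (finding the hub link inside the Atom feed) might be sketched like this. It covers only the `<atom:link rel="hub">` lookup, not the HTML auto-discovery step; the function name and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_hub_urls(atom_xml: str) -> list:
    """Return the href of every feed-level <link rel="hub"> in an Atom feed."""
    root = ET.fromstring(atom_xml)
    return [
        link.get("href")
        for link in root.findall(f"{ATOM_NS}link")
        if link.get("rel") == "hub" and link.get("href")
    ]


# Hypothetical feed a publisher might serve.
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link rel="self" href="https://example.com/feed.atom"/>
  <link rel="hub" href="https://hub.example.com/"/>
</feed>"""

hubs = find_hub_urls(feed)
print(hubs)
```

A subscriber that has never heard of the publisher or the hub only needs the feed URL to get this far, which is the point about delegated trust.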
What you're suggesting is another link in the chain to act as an
intermediary between the Hub and the subscriber. This implies that the
subscriber's hub would be trusted by the subscriber, and thus the
trust chain would have another link on it. For simple use-cases this
would require another moving part and make it harder for simple
subscribers to get up and running. This is why I'm against suggesting
this as the "right way" to do things.
That said, I see opportunities for these intermediary hubs to provide
a lot of value for subscribers (e.g., reliability, content
translation, stats). So it's the subscriber's choice and up to the
semantics of their relationship with their intermediary hub to decide
how they want to subscribe. It doesn't need to be baked into the
protocol at all.
-Brett