Now with Blogger support!


Brad Fitzpatrick

Aug 13, 2009, 4:26:04 PM
to pubsub...@googlegroups.com
Blogger now has PubSubHubbub support!


(we forgot the RSS discovery tags, but we'll get those in as soon as possible....)

igrigorik

Aug 14, 2009, 9:27:23 AM
to Pubsubhubbub
Now that is awesome! Great job guys.

Hmm, so.. If one wanted to consume _all_ of blogger alerts, any
mechanics in pshb to make that 'easy' as a subscription? Is it
possible to issue a "give me everything you've got" type of request?

Brett Slatkin

Aug 14, 2009, 2:04:38 PM
to pubsub...@googlegroups.com
On Fri, Aug 14, 2009 at 6:27 AM, igrigorik<il...@aiderss.com> wrote:
>
> Now that is awesome! Great job guys.

Thanks!

> Hmm, so.. If one wanted to consume _all_ of blogger alerts, any
> mechanics in pshb to make that 'easy' as a subscription? Is it
> possible to issue a "give me everything you've got" type of request?

The "firehose": there's nothing like this for Blogger yet. The easiest way
to get something close is to crawl the site for feed links and
subscribe to them all. In the future it may make sense to have an
aggregated combined feed, similar to the "Atom Stream" that Six Apart
runs, with an Atom-aggregated view of all updates.
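For anyone scripting the crawl-and-subscribe approach, here's a rough sketch of the subscribe request a consumer would send for each discovered feed. The parameter names come from the spec's subscription section; the hub, feed, and callback URLs are made up for illustration:

```python
import urllib.parse

def subscription_request(topic_url, callback_url):
    """Form body for a PSHB subscribe POST, one per discovered feed.

    A real subscriber would POST this to the hub endpoint the feed
    advertises; the hub then verifies the callback before subscribing."""
    return urllib.parse.urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
        "hub.verify": "async",
    })

# One request per feed link found by the crawl (URLs illustrative):
feeds = ["http://example.blogspot.com/feeds/posts/default"]
bodies = [subscription_request(f, "http://consumer.example/callback")
          for f in feeds]
```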

-Brett

Kevin Marks

Aug 14, 2009, 2:09:33 PM
to pubsub...@googlegroups.com

Doesn't Blogger still publish the changes.xml updates pages? Not a flow, but a regularly changing document.


Brett Slatkin

Aug 14, 2009, 2:11:16 PM
to pubsub...@googlegroups.com
On Fri, Aug 14, 2009 at 11:09 AM, Kevin Marks<kevin...@gmail.com> wrote:
> Doesn't blogger still publish the changes.xml updates pages? Not a flow, but
> a regularly changing document

I'm not sure if there's a blogger-specific one, but Google Blog Search
has one here:

http://www.google.com/help/blogsearch/pinging_API.html#xml-format

David Recordon

Aug 14, 2009, 2:30:29 PM
to pubsub...@googlegroups.com
We've been thinking about how to make a "firehose" as well if we
supported PSHB at Six Apart, and we would love to collaborate with
others on it. Right now we offer http://updates.sixapart.com/.

--David

Brett Slatkin

Aug 14, 2009, 2:35:36 PM
to pubsub...@googlegroups.com
On Fri, Aug 14, 2009 at 11:30 AM, David Recordon<reco...@gmail.com> wrote:
>
> We've been thinking about how to make a "firehose" as well if we
> supported PSHB at Six Apart and would love to collaborate with other
> on it.  Right now we offer http://updates.sixapart.com/.

I talked briefly with Tatsuhiko about this last week, and I think this
part of the spec makes it possible:

http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.1.html#aggregatedistribution

Should be very close to how Atom Stream works, but in reverse. The
hard part is figuring out how to properly throttle pushing to clients
who fall behind. The nice thing about separate HTTP posts is you can
very easily get a success/response status from the subscriber and then
know if you should drop the packet, retry, etc. I'd assume Atom Stream
does this right now, but with TCP throttling indicators.
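A toy version of that hub-side decision, just to make the success/retry/drop logic concrete (the status-code handling and the retry cap here are assumptions for illustration, not anything from the spec):

```python
def delivery_action(status_code, attempts, max_attempts=5):
    """What a hub might do after POSTing content to a subscriber callback.

    A 2xx response means the subscriber took the packet; anything else
    counts as a failure, retried until an assumed cap, then dropped."""
    if 200 <= status_code < 300:
        return "done"
    if attempts + 1 >= max_attempts:
        return "drop"
    return "retry"
```

The per-request status is exactly what a raw TCP stream doesn't give you, which is the point Brett is making about separate HTTP posts.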

David Recordon

Aug 14, 2009, 2:38:08 PM
to pubsub...@googlegroups.com
While that section describes how we'd format the notifications, the
piece I don't think we've worked out is how to make subscriptions
work. I want to be able to subscribe to every TypePad blog, every Vox
blog, or both without having to know the URLs of every blog's feed.
So while we could make up a syntax such as subscribing to
http://*.typepad.com/, it would be nice to find a more generic
mechanism that works across multiple hubs.

--David

Brett Slatkin

Aug 14, 2009, 2:39:59 PM
to pubsub...@googlegroups.com
Oh, yeah, so what we discussed in that case was having the content
publisher define a meta-feed URL that contains the whole aggregate
content. For example: http://typepad.com/feeds/atom/everything. Then
subscribers would use that URL to get everything. I don't think this
needs to be in the Hubbub spec. Each publisher can define these
virtual feed URLs however they want, right? Standardizing URL schemes
is usually pretty scary business.
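To sketch what consuming such a virtual feed might look like: per the spec's aggregate-distribution section, each entry in a combined feed carries an atom:source identifying its origin, so a subscriber can fan entries back out by blog. The feed below is invented for illustration:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical aggregate feed: each entry names its origin feed
# via atom:source, as the aggregate-distribution section describes.
doc = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>A</title>
    <source><id>http://alpha.typepad.com/atom.xml</id></source></entry>
  <entry><title>B</title>
    <source><id>http://beta.typepad.com/atom.xml</id></source></entry>
</feed>"""

def entries_by_origin(xml_text):
    """Group entry titles by the origin feed named in atom:source."""
    by_feed = {}
    for entry in ET.fromstring(xml_text).findall(ATOM + "entry"):
        origin = entry.find(ATOM + "source/" + ATOM + "id").text
        title = entry.find(ATOM + "title").text
        by_feed.setdefault(origin, []).append(title)
    return by_feed
```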

Alexis Richardson

Aug 14, 2009, 2:45:10 PM
to pubsub...@googlegroups.com
FWIW

Adding hub-managed queries to the topic model in PSHB would more or
less force the introduction of a way to cache the results of queries
... i.e. to introduce queues. This could break the lightweight
'pubsub only' model that IMHO helps to make PSHB interesting.

A better way is for hubs to remain silent about how they manage state,
enabling people to choose between:

a) endpoint based approaches, eg Brett's below
b) middleman approaches, eg sticking MQ under PSHB like RabbitHub does

HTH

alexis

Kevin Marks

Aug 14, 2009, 3:46:24 PM
to pubsub...@googlegroups.com
On Fri, Aug 14, 2009 at 11:11 AM, Brett Slatkin<bsla...@gmail.com> wrote:
>
> On Fri, Aug 14, 2009 at 11:09 AM, Kevin Marks<kevin...@gmail.com> wrote:
>> Doesn't blogger still publish the changes.xml updates pages? Not a flow, but
>> a regularly changing document
>
> I'm not sure if there's a blogger-specific one, but Google Blog Search
> has one here:
>
> http://www.google.com/help/blogsearch/pinging_API.html#xml-format

The blogger one is still there too at:

http://www.blogger.com/changes.xml

not sure of its update frequency.

Bob Wyman

Aug 14, 2009, 10:22:26 PM
to pubsub...@googlegroups.com
On Fri, Aug 14, 2009 at 2:45 PM, Alexis Richardson <alexis.r...@gmail.com> wrote:
> Adding hub-managed queries to the topic model
> in PSHB would more or less force the introduction
> of a way to cache the results of queries

PSHB is a wonderfully simple implementation of the "topic-based" pubsub pattern -- where "topics" are feeds. Anything that introduces queries to PSHB is going to go a long way towards breaking PSHB's simplicity and thus, I fear, seriously detract from one of PSHB's best attributes today.

The problem of handling queries that construct aggregate feeds by looking deeper into the content of any input feed than its URL is the general problem solved by the "other" kind of pubsub system -- content-based pubsub. I strongly recommend that PSHB stick to what it does well -- topic-based pubsub -- and leave the problem of content-based pubsub to a different specification and probably different implementations.

A content-based system would probably consume a firehose feed from a PSHB system and then apply various continuous query or matching algorithms against that firehose. Thus, a content based hub would allow you to select based on source URL for entries even if the entries had been published in an aggregate feed. Similarly, some content-based systems would allow you to do things like text search that gave you all entries containing some specified pattern of text. Clearly, there would be room for a variety of such systems and for a variety of query syntaxes (optimized for various purposes) as well. But, this should all be happening downstream from PSHB hubs, or at least defined in a new specification and not considered a capability required of a "standard" PSHB hub.

Let's keep PSHB simple -- and focused on the topic-based pubsub problem. Later or other efforts can address the content-based issue independently and without making PSHB more complex.
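To make the downstream division of labor concrete, a content-based matcher sitting below a hub could start as simple as this naive keyword sketch (real systems such as PubSub.com used far more sophisticated continuous-query matching):

```python
def match_queries(entry_text, queries):
    """Naive continuous-query matcher run downstream of a PSHB hub.

    The hub delivers entries; this consumer-side step decides which
    standing keyword queries each entry satisfies."""
    words = set(entry_text.lower().split())
    return [q for q in queries if q.lower() in words]
```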

bob wyman

igrigorik

Aug 15, 2009, 10:04:05 AM
to Pubsubhubbub
I think content matching, etc. would be a bit of overkill (better
handled on the consumer side).

What we're looking for here is not unlike a "collection node" in XMPP,
or a "topic exchange" in AMQP. Define a single key (the URL), and match
on that. If that's too much on the server side, then go with the
simplest possible thing: expose a firehose of _all_ content.
Furthermore, don't worry about caching and clients that drop off... if
you're offline, you'll miss some content.

I'd be in favor of simple and fast, rather than trying to work out all
the kinks for a larger topic subscription spec with caching, etc.

ig


Kevin Marks

Aug 15, 2009, 3:36:34 PM
to pubsub...@googlegroups.com

A query-based service could also generate a feed itself that it could
send into a hub. I agree that applying queries to hubs seems
problematic, but if you think of an AtomPub-like query feed - e.g.
Google Contacts - in principle it could ping a hub whenever that query
adds a new matching result, so address book updates could flow over
PSHB rapidly too, rather than relying on polling sync.
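The ping Kevin describes is just the standard publisher notification; here's a sketch of building it, with made-up URLs (hub.mode=publish and hub.url are the spec's parameter names for a publisher ping):

```python
import urllib.parse

def publish_ping(hub_url, updated_feed_url):
    """(hub endpoint, form body) for a publisher ping: sent whenever
    the query feed gains a new matching result, so the hub knows to
    re-fetch the feed and push the change to subscribers."""
    body = urllib.parse.urlencode({
        "hub.mode": "publish",
        "hub.url": updated_feed_url,
    })
    return hub_url, body

# Illustrative: ping after the Contacts-style query feed updates.
hub, body = publish_ping("http://hub.example/",
                         "http://contacts.example/query.atom")
```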

Bob Wyman

Aug 15, 2009, 4:17:43 PM
to pubsub...@googlegroups.com
On Sat, Aug 15, 2009 at 3:36 PM, Kevin Marks <kevin...@gmail.com> wrote:
> A query-based service could also generate
> a feed itself, that could send into a hub too.
Certainly, this is possible and certainly, it will be done. However, Best Practices with PSHB should probably explicitly discourage it since it is not a pattern that can scale. The problem is that if systems upstream of PSHB are generating synthetic feeds by filtering other feeds, then you will end up with a potential explosion in the number of feeds that PSHB needs to consume.

Take Six Apart as an extreme example. Let us imagine that Six Apart allows us not only to access individual feeds but also to create synthetic feeds whose content is defined by full-text queries. Thus, tens of thousands of people could all have "personalized" Six Apart feeds composed of entries from any Six Apart blog that contains keywords in which they are interested. If all those feeds are fed into PSHB, then the creation of just one new blog entry could produce *many* copies of that entry -- one for each query-based feed. Of course, each of those feeds would need to be processed by PSHB. This is a thundering-herd problem... (Note: when I was at PubSub.com, we offered a real-time prospective search system that matched keywords and phrases against blog entries. Each entry we processed matched an average of 13 unique queries, and some entries matched tens of thousands of queries... You don't want to generate all those duplicates upstream of PSHB...) Also, note that this problem exists for every topic-based pubsub system that has ever been built. It is, in part, why content-based pubsub exists...

> but if you think of an AtomPub-like query
> feed - eg Google Contacts - in principle it
> could ping a hub whenever that query adds
> a new result that matches, so addressbook
> updates could flow over PSHB rapidly too,
> rather than rely on polling sync.
The Google Contacts example would only be reasonable if, in fact, the query syntax was limited in expressiveness. If, for instance, the only thing I could say was: "Update whenever <username's> details change." then the maximum number of duplicate entries is small. (You might have one entry on the "all entries" fire-hose feed and one on a username specific query feed.) However, if you increase the expressiveness of the query language you will rapidly get to the point where you have a large number of duplicates again. (i.e. queries for each of fullname, firstname, lastname, street, city, and an infinite number of geographical proximity queries...) Thus, the problem is related to the expressiveness of the query language supported.

Query matching is best done outside of and downstream of PSHB or any similar topic/channel/feed based system. If query matching is done upstream of a topic-based pubsub system, you get topic-explosion. Thus, if query processing *must* be done upstream of PSHB, then Best Practice should argue that such processing should be limited in scope and should not result in high, pre-PSHB fanout for any single new source feed entry.

bob wyman

Alexis Richardson

Aug 16, 2009, 2:32:38 PM
to pubsub...@googlegroups.com
Bob,


On Sat, Aug 15, 2009 at 9:17 PM, Bob Wyman<bobw...@gmail.com> wrote:
>
> The Google Contacts example would only be reasonable if, in fact, the query
> syntax was limited in expressiveness. If, for instance, the only thing I
> could say was: "Update whenever <username's> details change." then the
> maximum number of duplicate entries is small. (You might have one entry on
> the "all entries" fire-hose feed and one on a username specific query feed.)
> However, if you increase the expressiveness of the query language you will
> rapidly get to the point where you have a large number of duplicates again.
> (i.e. queries for each of fullname, firstname, lastname, street, city, and
> an infinite number of geographical proximity queries...) Thus, the problem
> is related to the expressiveness of the query language supported.

Just one side question: do you mean 'duplicates' in the sense that any
given subscriber may see more than one copy of a given message,
because more than one query generates the same result? Or are you
referring to some other issue - e.g. duplication across two users?

alexis

Bob Wyman

Aug 17, 2009, 12:30:46 AM
to pubsub...@googlegroups.com
On Sun, Aug 16, 2009 at 2:32 PM, Alexis Richardson <alexis.r...@gmail.com> wrote:
> do you mean 'duplicates' in the sense that any
> given subscriber may see more than one copy of
> a given message, because more than one query
> generates the same result?  Or, are you referring
> to some other issue - eg. duplication across two users.

What I'm concerned about is the load on the hub that comes from reading duplicate copies of entries in many feeds. Imagine that some blog hosting site allowed one to create query-based feeds. Thus, a feed url might look like http://example.com/firehose.atom?q="foobar" if you wanted a feed of only those new posts that contained the keyword foobar. It would not be good if each of these feeds was independently served by a PSHB hub. The reason, of course, is that for any single new original feed entry you might end up with updates to hundreds or even thousands of query feeds. (If queries are limited to single words, the maximum number of feeds updated for each new entry would be equal to the number of unique words in the new entry. If phrases were supported in queries, then the maximum number of feeds updated per new entry could be astronomical... i.e. a number equal to all permutations of all words in the new entry... As the expressiveness of the query syntax increases, the potential fan-out increases.) Thus, in the worst case, a query-based feed generator can cause a tremendous flood of essentially identical updates (which only differ in that they come from different feeds) that need to be processed by the hub. This is not good.
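A back-of-the-envelope version of that fan-out arithmetic (the phrase bound implements the "all permutations of all words" worst case described above; it is an illustrative upper bound, not a measurement):

```python
from math import factorial

def single_word_fanout(entry_text):
    """With single-word query feeds, one feed updates per unique word."""
    return len(set(entry_text.lower().split()))

def phrase_fanout_upper_bound(n_unique_words):
    """Worst-case count of ordered word sequences of every length drawn
    from the entry -- the "all permutations of all words" figure, which
    grows explosively as the query syntax gets more expressive."""
    return sum(factorial(n_unique_words) // factorial(n_unique_words - k)
               for k in range(1, n_unique_words + 1))
```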

What is also not good here is that users would not be directly exposed to the downside of a system that had up-stream query generated feeds. Each user would only get the data that they were expecting and wouldn't be exposed to the fact that their subscription to a query-generated feed was causing a problem for the hub. This is essentially a set-up for a "tragedy of the commons" situation. Feed generators will be happy to provide more features in order to attract more readers. Readers will be happy to use query-generated feeds since they like the customization. However, the hub (which is the "commons" that makes this little eco-system work) would be getting flooded with unnecessary load. We should have learned enough about the "tragedy of the commons" as it applies to network services to avoid setting up yet another commons to be trashed...

Given this, I argue that we should discourage people from supporting query-based feed generation up-stream of a PSHB hub. Such processing should happen down-stream from the hub and closer to the end user.

bob wyman

igrigorik

Aug 18, 2009, 9:30:42 AM
to Pubsubhubbub
> Given this, I argue that we should discourage people from supporting
> query-based feed generation up-stream of a PSHB hub. Such processing should
> happen down-stream from the hub and closer to the end user.

Agreed.



>