Content-Based Filtering for PubSubHubbub (A proposal)


Bob Wyman

Jun 29, 2010, 6:09:15 PM
to pubsubhubbub
Now that we've got substantial experience with topic-based PubSubHubbub, I think it is time we began considering how we would extend the current protocol to support content-based filtering of topics. We've casually discussed this in the past; I suggest that we now address the subject seriously.

In a topic-based subscription system, a subscriber receives all messages published to any topic to which they have subscribed.
With content-based filtering, the subscriber is able to associate one or more filters with a topic and will then only receive those messages which are published to the topic and which match at least one of the associated filters. Thus, given a topic based on a feed of weather-related announcements, I might create a subscription with a filter that indicates that I'm only interested in messages that are "Storm Alerts" and that are intended for the New York City area. I might also subscribe to "Air Pollution Alerts" in the same area. Given such subscriptions, I would only receive those messages published to the topic that matched one of my filters even if there were many orders of magnitude more messages actually published to the topic.

I would like to suggest the following extensions to support content-based filtering:
  • In "6.1. Subscriber Sends Subscription Request": Add an optional "hub.filter" parameter, specified in mode "subscribe", which would contain a textual filter expression (syntax to be discussed later).
  • In "6.1 Subscriber Sends Subscription Request" Add an optional "hub.filterid" which would be used in mode "unsubscribe".
  • In "6.1.2 Subscription Response Details": The hub's response should include the header field X-Hub-FilterId which would contain a unique id for the filter. I believe we should require that this be an MD5 hash of the query.
  • In "6.2 Hub Verifies Intent of the Subscriber" Include hub.filter in the request and provide X-Hub-FilterId in the header
  • In "7.3 Content Distribution" For each subscriber/callback-url that had registered a filter on a topic, a single POST would be generated -- as long as one or more of those filters matched published content. In each POST request from hub to subscriber's callback URL, there would be an X-Hub-FilterId field that lists all of the filter-ids for filters that match the published data.
    Also, when processing feeds which contain more than one entry or item, the feeds should be "broken up" so that content-based matching and thus distribution is done against only one entry or item at a time.
  • In "2. Definitions": Add definitions for "filter" and "filter-id".
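To make the first extension concrete, here is a minimal sketch of what the subscription request body of section 6.1 might look like with the proposed "hub.filter" parameter added. The filter expression syntax shown is purely illustrative (the proposal deliberately leaves it undefined), and the hostnames are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical 6.1 subscription request body with the proposed
# optional hub.filter parameter. All other parameters are from the
# existing spec; the filter syntax here is illustrative only.
params = {
    "hub.mode": "subscribe",
    "hub.callback": "http://subscriber.example.com/callback",
    "hub.topic": "http://publisher.example.com/weather.atom",
    "hub.verify": "sync",
    "hub.filter": 'category="Storm Alert" AND area="New York City"',
}

# This string would be sent as the application/x-www-form-urlencoded
# body of the POST to the hub endpoint.
body = urlencode(params)
print(body)
```

Note that because the filter travels in the POST body rather than the topic URL, it is not logged the way query strings typically are, which is the privacy point made below.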

Q&A: (In which I try to anticipate the obvious questions...)
  • Why not put the filter query in the Topic URL and eliminate the hub.filter? 
    • Any attempt to use the Topic URL would put constraints on those URLs.
    • We already have PSHB subscriptions to URLs which contain query parameters and thus it would be hard to distinguish PSHB filter parameters from query parameters.
    • Using a distinct URL for each subscription would result in a distinct delivery for each matching message. Thus, if you had thousands of distinct filters on a single topic, you would receive thousands of deliveries.
    • Embedding the filter in the query will inevitably lead to it being transmitted over the network repeatedly as well as being logged in various log files. However, filter queries often contain very sensitive information whose disclosure could compromise either privacy or security. Not including it in the URL somewhat reduces the opportunity for inadvertently disclosing or logging filter queries. 
  • Why use FilterId instead of just using the actual query?
    • In many cases, users' filters will expose data that is private or sensitive. If filters are frequently transmitted between machines, they are more likely to be intercepted by malicious or unintended intermediaries. Privacy and security are enhanced by only passing the actual filter text when it is absolutely necessary.
    • Using FilterId for delivery and for unsubscribe means that the actual query is never exposed except during the initial subscription exchange. Thus, it is possible to implement a "secure" system for creating subscriptions and a somewhat less secure system for distributing the results.
    • While queries will initially probably be very simple (simple keywords), we can expect that queries will become more complex over time. Thus, the length of a query can become very large. Using a FilterId makes handling these things much simpler.
    • While using the query would simplify debugging, the reality is that running systems will be more easily coded if they can rely on the ids of matched filters being of fixed size and numeric -- to facilitate use as database keys, etc.
  • Why list all FilterIds with the delivered data?
    • Doing so makes it possible to deliver a single response that covers a potentially large number of filters. A single feed update might, for a shared subscriber, match tens, hundreds, even thousands of filters. In the case of many filter matches, delivering a unique response for each filter would waste vast amounts of network and processing resources.
    • Note: Subscribers who want a distinct response for each matched filter would just subscribe each filter with a distinct call-back URL. 
  • Why use an MD5 hash for the FilterId?
    • MD5 is commonly available on all platforms and in all useful languages and it is easy and cheap to compute.
    • If users are free to provide their own FilterIds, they will inevitably abuse the system and use very long ids that tax the system.
    • MD5 hashes are of known sizes and properties and are system independent. Thus, using them makes it easier to design, implement and deploy across systems, develop open-source implementations, etc..
  • Why not have the subscriber pass the FilterId to the hub?
    • If the FilterId is a simple MD5 hash, it is not necessary for the subscriber to pass it in.
    • If the subscriber can pass in the FilterId, then they will inevitably abuse this -- as discussed above.
  • Why return the FilterId as a header on the Subscription Response?
    • To ensure that both Server and Subscriber agree on the FilterId
    • To accommodate those coders who can't figure out how to compute MD5 hashes... (i.e. don't compute it, just pull it out of the response.)
  • Why use hub.filterid for unsubscribing instead of hub.filter?
    • In some cases, the subscriber will have "forgotten" the mapping from filter to FilterId but may still be getting results for a previous subscription. In order to unsubscribe, and thus stop the unexpected messages, they need to provide the FilterId.
  • Why not specify the filter syntax now?
    • I believe the filter syntax discussion is orthogonal to the discussion of the framework for registering subscriptions and for getting responses.
    • Ideally, folk would use some reasonable subset of XPath filtering. However, that might be overwhelming for some.
  • What about firehose feeds?
    • I'm suggesting that this extension apply to all feeds, not just to the still-undefined firehose feeds, even though content-based filtering will probably be most useful when, in fact, used with firehose feeds. If we assume that a firehose feed is identified by a topic (perhaps a ".well-known" topic...) then what we define for normal topics should also apply to firehose feeds.
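Pulling the FilterId and delivery answers above together, the hub side might be sketched roughly as follows. The FilterId-as-MD5 rule is from the proposal; the filter evaluation itself is a toy stand-in (simple substring match), since the real syntax is left open:

```python
import hashlib

def filter_id(query):
    """FilterId as proposed: the MD5 hash of the filter query text."""
    return hashlib.md5(query.encode("utf-8")).hexdigest()

def matches(query, entry):
    # Toy stand-in for real filter evaluation: match if the keyword
    # appears anywhere in the entry text.
    return query.lower() in entry.lower()

# One subscriber has registered two filters on the same topic.
filters = ["storm alert", "air pollution alert"]
entry = "Storm Alert: severe thunderstorms expected for New York City"

# Per 7.3 above: a single POST per callback URL, whose X-Hub-FilterId
# field lists every filter that matched the published entry.
matched = [filter_id(q) for q in filters if matches(q, entry)]
headers = {"X-Hub-FilterId": ",".join(matched)}
print(headers)
```

The comma-separated header format is an assumption for illustration; the proposal only says the field "lists all of the filter-ids". Note also that the subscriber never needs to send the id: both sides can derive it independently from the query text.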
I would appreciate any comments that you might have. I think that the proposal here is about as simple as it can get without introducing serious bandwidth or processing loads on servers and/or subscribers.

bob wyman

Alexis Richardson

Jun 29, 2010, 6:30:23 PM
to pubsub...@googlegroups.com
On Tue, Jun 29, 2010 at 11:09 PM, Bob Wyman <b...@wyman.us> wrote:
> Now that we've got substantial experience with topic-based PubSubHubbub,

With all due respect - I don't think we have enough yet. The spec is
still unstable.

alexis

Brett Slatkin

Jun 29, 2010, 7:31:43 PM
to pubsub...@googlegroups.com
On Tue, Jun 29, 2010 at 3:30 PM, Alexis Richardson
<alexis.r...@gmail.com> wrote:
> On Tue, Jun 29, 2010 at 11:09 PM, Bob Wyman <b...@wyman.us> wrote:
>> Now that we've got substantial experience with topic-based PubSubHubbub,
>
> With all due respect - I don't think we have enough yet.  The spec is
> still unstable.

Specific wording aside, Alexis, it's a fine time to be talking about
these ideas, right?

Alexis Richardson

Jun 29, 2010, 7:35:02 PM
to pubsub...@googlegroups.com

I don't wish to diss Bob's push in this direction. But I'd like to
see the topic model 'settle down' before looking at content. I don't
believe we are there yet.

alexis

Brett Slatkin

Jun 29, 2010, 7:44:08 PM
to pubsub...@googlegroups.com

Gotcha. That's fine. I think everyone wants things to move slowly.
This is the first I've seen of Bob's ideas in this direction and I've
got to let it marinate in my brain a bit. But I'd imagine some new
concerns about the existing spec may fall out of it, which is probably
a good thing, even if we all collectively decide that they should or
should not be in scope.

-Brett

Danny Briere

Jun 30, 2010, 5:30:38 PM
to Pubsubhubbub
This is interesting, because I think we may be more concerned with the
"topic model" than the content filtering. So we have some questions
here:

1) What is the current thinking / discussion / issues about the topic
model? I've searched the recent postings trying to find references
and more specifics... is there a URL I can look at?

2) In the current spec, a topic is the same as a feed URL from the
publisher. But could the hub aggregate and / or segregate content
from incoming feed data and create its own topics?

(For example, in item 2, our hub would aggregate press releases and
the segregate them based on industry / language / geography into
multiple topics that users can subscribe to).

3) Separately, we're also interested in being able to filter by tagged
field values to create filtered feeds.

-Danny


Ivan Zuzak

Jul 1, 2010, 7:24:52 AM
to Pubsubhubbub
I think filtering is a great idea and that it should end up in PSHB
one way or the other. So there's no time like now to start talking
about possibilities, while the actual implementation will crystallize
later.

I myself would like filtering (and processing) to go the way of
external filtering services -- the hub would 1) receive a subscription
together with a set of URIs pointing to filtering services, 2) receive
a notification from the publisher, 3) pass the notification to the
filtering services, 4) receive the response from the filtering services
and forward it to the subscriber. Step 3) is intentionally a bit vague
since there are a lot of ways of doing this, but one could imagine it as
"notifying an intermediate subscriber and receiving a new notification
in return". I've written about some of these ideas a few months back
-- http://izuzak.wordpress.com/2010/01/11/real-time-feed-processing-and-filtering/
(pardon the self-promotion). This would make for a more powerful and
open filtering/processing system which of course has its own
drawbacks, so it's maybe something to think about a bit later, after
simple filtering is implemented within the hub.
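As a rough illustration of that four-step flow, here is a minimal in-process sketch (all names are hypothetical; in practice each filtering service would live at one of the registered URIs and be called over HTTP, and a service could rewrite the notification rather than just pass or drop it):

```python
# Step 3/4 as a single function: a filtering service receives a
# notification and returns it (possibly transformed), or None to drop it.
def storm_filter_service(notification):
    return notification if "storm" in notification.lower() else None

def hub_deliver(notification, filter_services, deliver):
    # Steps 2-4 from the hub's point of view: take the publisher's
    # notification, pass it to each filtering service registered with
    # the subscription, and forward whatever comes back.
    for service in filter_services:
        result = service(notification)
        if result is not None:
            deliver(result)

# Simulate one delivery cycle: the subscriber's callback just records
# what it receives.
received = []
hub_deliver("Storm warning for tonight", [storm_filter_service], received.append)
hub_deliver("Sunny all week", [storm_filter_service], received.append)
print(received)
```

The open question, as noted, is how step 3 is framed on the wire; this sketch treats the service as a pure pass/drop/transform function.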

Cheers,
Ivan

Alexis Richardson

Jul 1, 2010, 8:22:53 AM
to pubsub...@googlegroups.com
On Wed, Jun 30, 2010 at 10:30 PM, Danny Briere <danny....@gmail.com> wrote:
> This is interesting, because I think we may be more concerned with the
> "topic model" than the content filtering.  So we have some questions
> here:
>
> 1) What is the current thinking / discussion / issues about the topic
> model?  I've searched the recent postings trying to find references
> and more specifics....is there a URL I can look at?
>
> 2) In the current spec, a topic is the same as a feed URL from the
> publisher.  But could the hub aggregate and / or segregate content
> from incoming feed data and create its own topics?

For my part, I would see that as a natural extension. How would such
topics be created and managed?

alexis

John Panzer

Jul 1, 2010, 1:04:13 PM
to pubsub...@googlegroups.com
A topic: URI scheme?
--
John Panzer / Google
jpa...@google.com / abstractioneer.org / @jpanzer

Julien Genestoux

Jul 1, 2010, 1:13:51 PM
to pubsub...@googlegroups.com
Hey, I don't mean to interrupt... but maybe this is part of the conversation: Superfeedr announced its track feature today: http://blog.superfeedr.com/track/filter/xmpp/pubsubhubbub/track/

You can subscribe to keywords across all the feeds that we host (more than 800 hubs now) or the feeds we have for the default hub.

We want to get more filtering done... but it's obviously quite a complex task with high-frequency publishing!

Cheers,

julien

Monica Keller

Jul 6, 2010, 6:45:38 PM
to Pubsubhubbub
Good stuff, Julien!

My 2c:
When I was working at MySpace we designed an in-depth filtering
mechanism, external to the PubSubHubbub spec. You can see it at:
http://wiki.developer.myspace.com/index.php?title=Stream_Subscription_Query_Spec
We spent significant resources
iterating with developers on whether their permutations of filters
were valid and never got around to providing a test tool.
Facebook started with a far simpler subscription model that only
supports specifying 2 fields to filter on
(http://developers.facebook.com/docs/api/realtime) but also has no test tool.

After seeing developers waste time figuring out how to filter/
subscribe to things, my personal recommendation is:
#1 To use the topic URL, so it's testable with a browser.
#2 For hubs to provide "virtual" topic URLs, which can be done outside
of the spec. One example of this would be a firehose URL with
parameters for filtering. Also outside the scope of the spec would be
for hubs to promote their virtual topic URLs like Julien just did.

This will be helpful as subscribers don't have to subscribe to raw
topic urls one by one and can use the virtual or aggregated topic urls
from the Service Provider
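For instance, a hub might expose a firehose endpoint whose query string carries the filter, making the filtered stream browser-testable and usable directly as hub.topic (the host and parameter names below are hypothetical, just to show the shape):

```python
from urllib.parse import urlencode

# Hypothetical "virtual" topic URL: the hub's firehose endpoint plus
# filter parameters. A subscriber can open this in a browser to test
# the filter, then use the same URL unchanged as hub.topic in an
# ordinary PubSubHubbub subscription.
base = "http://hub.example.com/firehose"
query = urlencode({"keyword": "storm alert", "lang": "en"})
virtual_topic = base + "?" + query
print(virtual_topic)
```

The trade-off relative to Bob's hub.filter proposal is noted earlier in the thread: the filter now travels in the URL (and so in logs), and each distinct virtual topic is a distinct subscription with its own deliveries.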

Thoughts?


Alexis Richardson

Jul 7, 2010, 8:36:39 AM
to pubsub...@googlegroups.com
Monica

On Tue, Jul 6, 2010 at 11:45 PM, Monica Keller <monica...@gmail.com> wrote:
> for hubs to provide
> "Virtual" Topic Urls. One example of this would be a firehose url with
> parameters for filtering. Also outside the scope of the spec would be
> for hubs to promote their virtual topic urls like Julien just did.
>
> This will be helpful as subscribers don't have to subscribe to raw
> topic urls one by one and can use the virtual or aggregated topic urls
> from the Service Provider

I think "virtual topics" are a very good idea and a way to open the
door to (federatable) subscriptions for filtered streams of many
kinds.

alexis

Julien Genestoux

Jul 7, 2010, 1:35:28 PM
to pubsub...@googlegroups.com
Indeed, the virtual feed approach is definitely our preferred option and the one we took for our track feeds. This way, they do not interfere in any way with the core protocol.

Bob Wyman

Jul 7, 2010, 6:11:59 PM
to pubsub...@googlegroups.com, Julien Genestoux
Julien Genestoux <julien.g...@gmail.com> wrote:

> Indeed, the virtual feed approach is definitely our preferred
> option and the one we took for our track feeds.

In your implementation, if your hub is processing a multi-entry feed and a filter I specify matches only one of the potentially many updated entries in the feed, what will you deliver to my call-back URL? Will you deliver the entire feed or just the single entry that matched the filter?

If an update to a single source feed matches many filters, it would seem that you would need to deliver a fresh copy of the matched data for each of the filters that matched since each filter defines a distinct virtual feed. Is that correct? 

bob wyman

Julien Genestoux

Jul 8, 2010, 10:27:15 AM
to Bob Wyman, pubsub...@googlegroups.com
Hey Bob,


On Thu, Jul 8, 2010 at 12:11 AM, Bob Wyman <b...@wyman.us> wrote:
> Julien Genestoux <julien.g...@gmail.com> wrote:
>
> > Indeed, the virtual feed approach is definitely our preferred
> > option and the one we took for our track feeds.
>
> In your implementation, if your hub is processing a multi-entry feed and a filter I specify matches only one of the potentially many updated entries in the feed, what will you deliver to my call-back URL? Will you deliver the entire feed or just the single entry that matched the filter?

No, just the matching entry.
 

> If an update to a single source feed matches many filters, it would seem that you would need to deliver a fresh copy of the matched data for each of the filters that matched since each filter defines a distinct virtual feed. Is that correct?

Yes, you would get a notification for each matched filter.

The virtual feed approach basically means that if a subscriber were rebuilding the feed from the entries we'd sent him, each of the entries would belong to the URL's 'logic'.
Thanks,

Danny Briere

Jul 9, 2010, 11:45:39 AM
to Pubsubhubbub
In our usage, we (the hub owners) would define or create the topics
according to our own rules, and then aggregate or segregate content
from multiple incoming feeds (again, using rules we would define),
then make these new topics available to subscribers.
