HTTP URL clustering/truncating proposal (based on me links)

Brad Fitzpatrick

Jul 7, 2008, 11:26:07 AM
to social-g...@googlegroups.com
While most URLs in the Social Graph API are canonicalized using the open source sgnodemapper code, a fair number of nodes in the graph aren't, and never will be.  Namely, "vanity domains".

Currently, these are all unique nodes in the graph:

http://bradfitz.com/
http://bradfitz.com/foaf.xml
http://identi.ca/bradfitz
http://identi.ca/bradfitz/foaf
http://factoryjoe.com/
http://factoryjoe.com/hcard.html
http://factoryjoe.com/blog
http://factoryjoe.com/blog/2006/02/10/uspto-to-hold-open-source-meeting/
http://factoryjoe.com/blog/2006/07/25/hresume-plugin-now-available/
......

And so on.

I'd like to cluster the three logical sets above, truncated as follows:

http://bradfitz.com/
http://identi.ca/bradfitz
http://factoryjoe.com/

But where to do the truncation?  Nobody likes brittle heuristics like hacky one-off regexp rules or similar.

Fortunately we have a much better data source:  "me" links (whether they're XFN, an OpenID delegate tag, an RSS/Atom/FOAF link, etc.).

Talking to Tantek Çelik a while back, he'd mentioned there's an implicit me link from a URL to its parent (http://foo.com/bar/ ---me--> http://foo.com/), but not vice-versa (which might seem more intuitive) because (as he roughly said), "A root must always be able to partition its namespace."  Consider that if http://foo.com/ implied a me link to http://foo.com/users/attacker/ , then user "attacker" could me link back to foo.com and cluster the whole site together.

Unfortunately, I don't see an explanation of this at http://gmpg.org/xfn/11 so I'm afraid I might be remembering it wrong.

But ideally what I'd like to do, if I'm not grossly confused:

If a URL ${prefix} has a me link to URL ${prefix} + ${suffix}, and the number of path components in the latter URL is greater than the number in the former, then truncate at ${prefix}.

That is, whenever a site http://foo.com/ has a me link (XFN or otherwise: RSS/Atom/FOAF) to http://foo.com/anything, we truncate at http://foo.com/ and any links in the graph to http://foo.com/* now become http://foo.com/

The path component part is necessary because of all the sites which, for what I imagine are aesthetic reasons, have their URLs like this:
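
A minimal sketch of the rule as stated, not anything from the actual Social Graph API or sgnodemapper code. The function names `path_components` and `is_truncation_point` are made up for illustration, and it assumes that "path components" means the non-empty segments between slashes and that "prefix" means a plain string prefix.

```python
from urllib.parse import urlsplit


def path_components(url):
    """Return the non-empty path segments of a URL (assumed definition)."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


def is_truncation_point(source, target):
    """Return True if a me link source -> target makes `source` a truncation point.

    Hypothetical helper: `target` must extend `source` as a string, and
    must have strictly more path components than `source`.
    """
    if not target.startswith(source):
        return False
    return len(path_components(target)) > len(path_components(source))


# A me link from the root to a deeper page qualifies.
root_to_page = is_truncation_point("http://foo.com/", "http://foo.com/anything")
# The reverse direction (page to its parent) does not.
page_to_root = is_truncation_point("http://foo.com/bar/", "http://foo.com/")
```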

http://identi.ca/bradfitz

... instead of what one could argue is a bit more technically correct, like this:

http://identi.ca/bradfitz/

So considering that people are going to use things like /username as the URL, we need to guard against this case:

http://foo.com/dude
http://foo.com/dude2_unrelated

If the rule were purely prefix-based, then the first dude, being naive or malicious, could "me"-link to dude2_unrelated and cluster with him, stealing all his outgoing and incoming edges, dirtying up the data.
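
To make the difference concrete, here is a rough side-by-side of a purely prefix-based rule and the same rule with the path component guard. Both function names are hypothetical, and counting only non-empty path segments is an assumption about what "path components" means.

```python
from urllib.parse import urlsplit


def component_count(url):
    """Count non-empty path segments, so a trailing slash changes nothing."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def naive_rule(source, target):
    """Purely prefix-based: any longer URL starting with `source` qualifies."""
    return target != source and target.startswith(source)


def guarded_rule(source, target):
    """Prefix-based, but the target must also have more path components."""
    return (target.startswith(source)
            and component_count(target) > component_count(source))


dude = "http://foo.com/dude"
victim = "http://foo.com/dude2_unrelated"

# The naive rule lets dude cluster with the unrelated user; the guard does not.
naive_allows_attack = naive_rule(dude, victim)
guard_allows_attack = guarded_rule(dude, victim)
```

With or without the trailing slash, http://identi.ca/bradfitz counts as one path component under this definition, which is why the guard compares counts rather than raw string lengths.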

If this is technically sound, then http://factoryjoe.com/ will have one node in the graph for his site, rather than the hundreds or more he has today.  Likewise, a lot of people with a domain + FOAF file (like me) will have one node on their vanity domain, not two, when doing simple fme=1 queries from it.

Thoughts?

- Brad

Joseph Smarr

Jul 7, 2008, 12:02:02 PM
to social-g...@googlegroups.com
I like it, and I think it's well-founded (I also heard this sub-path argument from Tantek, and I buy it). Couldn't you also Is that for domains like flickr? I think I could make the same argument there. js

Brad Fitzpatrick

Jul 7, 2008, 12:03:46 PM
to social-g...@googlegroups.com
On Mon, Jul 7, 2008 at 9:02 AM, Joseph Smarr <jsm...@gmail.com> wrote:
> I like it, and I think it's well-founded (I also heard this sub-path argument from Tantek, and I buy it). Couldn't you also Is that for domains like flickr? I think I could make the same argument there. js

ls that?



artemy tregubenko

Jul 7, 2008, 12:23:49 PM
to social-g...@googlegroups.com
I think people will always be able to shoot themselves in the foot, whether
through social engineering or their own mistakes. Another point is that an
attacker "owning" a root domain won't own all URLs on it, because truncation
should be done only within a set of pages that are me-linked both ways. Also,
as SGAPI isn't a trust system, "owning" a root domain won't do real harm.

On Mon, 07 Jul 2008 19:26:07 +0400, Brad Fitzpatrick <brad...@google.com>
wrote:

> http://foo.com/users/attacker/, then user "attacker" could me link
> back to foo.com and cluster the whole site together.
>
> Unfortunately, I don't see an explanation of this at
> http://gmpg.org/xfn/11 so I'm afraid I might be remembering it wrong.
>
> But ideally what I'd like to do, if I'm not grossly confused:
>
> If a url ${prefix} has a me link to url ${prefix} + ${suffix}, and the
> number of path components in the latter URL are greater than those of the
> former, then truncate at ${prefix}.
>
> That is, whenever a site http://foo.com/ has a me link (XFN or otherwise:
> RSS/Atom/FOAF) to http://foo.com/anything, we truncate at
> http://foo.com/ and any links in the graph too

--
arty ( http://arty.name )

Joseph Smarr

Jul 7, 2008, 12:28:47 PM
to social-g...@googlegroups.com
I meant "do that" for domains like flickr.

Martin Atkins

Jul 7, 2008, 1:54:32 PM
to social-g...@googlegroups.com
> to foo.com <http://foo.com> and cluster the whole site together.

>
> Unfortunately, I don't see an explanation of this at
> http://gmpg.org/xfn/11 so I'm afraid I might be remembering it wrong.

The profile page says:

There is an implicit "me" relation from a subdirectory to all of its
contents.

This seems to be the exact opposite of what Tantek said. However, I
agree that the way you describe makes more sense. Just yesterday I was
temporarily hosting someone else's content on my personal site, chock
full of XFN links; it'd suck if that was implicitly linked to my
personal site.

> But ideally what I'd like to do, if I'm not grossly confused:
>
> If a url ${prefix} has a me link to url ${prefix} + ${suffix}, and the
> number of path components in the latter URL are greater than those of
> the former, then truncate at ${prefix}.
>
> That is, whenever a site http://foo.com/ has a me link (XFN or
> otherwise: RSS/Atom/FOAF) to http://foo.com/anything, we truncate at
> http://foo.com/ and any links in the graph too http://foo.com/* now
> become http://foo.com/
>
> The path component part is necessary because of all the sites which for
> what I imagine are aesthetic reasons have their URLs like this:
>
> http://identi.ca/bradfitz
>
> ... instead of what one could argue is a bit more technically correct,
> like this:
>
> http://identi.ca/bradfitz/
>
> So considering that people are going to use things like /username as the
> URL, we need to guard against this case:
>
> http://foo.com/dude
> http://foo.com/dude2_unrelated
>
> If the rule were purely prefix-based, then the first dude, being naive
> or malicious, could "me"-link to dude2_unrelated and cluster with him,
> stealing all his outgoing and incoming edges, dirtying up the data.

Sounds reasonable, assuming I'm following correctly.

So this would collapse http://martin.atkins.me.uk/friends/ into
http://martin.atkins.me.uk/ as long as I have a "me" link from the
latter to the former. (which I do.)

Presumably though this isn't going to make my blog entries be "me"
unless I add a rel="me" to my permalinks. I think that's a good thing,
since my permalink URLs "represent" the blog entry, not myself. Am I
right in thinking that the craziness on factoryjoe.com is just because
he publishes "me" links to his individual entry URLs?


Brad Fitzpatrick

Jul 7, 2008, 2:06:50 PM
to social-g...@googlegroups.com
On Mon, Jul 7, 2008 at 10:54 AM, Martin Atkins <ma...@degeneration.co.uk> wrote:


> This seems to be the exact opposite of what Tantek said. However, I
> agree that the way you describe makes more sense. Just yesterday I was
> temporarily hosting someone else's content on my personal site, chock
> full of XFN links; it'd suck if that was implicitly linked to my
> personal site.

Let's not design for that.  :-)  If you're hosting somebody's content on your domain in your URL space, indexable by spiders, anything's game.  Give your friend his own {sub,}domain name or port or mark it not crawlable... something.
 

Once it learns a truncation point (from seeing http://foo.com/ --me--> http://foo.com/bar/), then everything under http://foo.com/ would be truncated, including http://foo.com/something/else/2008/06/21/whatever.html.
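
As a rough illustration of applying learned truncation points, the sketch below rewrites a URL to the truncation point it falls under. The `truncate` function and its `truncation_points` argument are hypothetical. Matching whole path components (so http://identi.ca/bradfitz does not swallow http://identi.ca/bradfitz2) and preferring the shortest matching point are assumptions, not something specified in this thread.

```python
from urllib.parse import urlsplit


def _key(url):
    """Split a URL into (scheme, host, tuple of non-empty path segments)."""
    parts = urlsplit(url)
    segments = tuple(seg for seg in parts.path.split("/") if seg)
    return parts.scheme, parts.netloc, segments


def truncate(url, truncation_points):
    """Rewrite `url` to the truncation point it falls under, if any.

    Assumption: when several points match, the one with the fewest
    path components (the outermost one) wins.
    """
    scheme, host, segments = _key(url)
    best = None
    for point in truncation_points:
        p_scheme, p_host, p_segments = _key(point)
        if (p_scheme, p_host) != (scheme, host):
            continue
        # The point's path components must be a leading run of the URL's.
        if segments[:len(p_segments)] != p_segments:
            continue
        if best is None or len(p_segments) < len(_key(best)[2]):
            best = point
    return best if best is not None else url


points = {"http://foo.com/", "http://identi.ca/bradfitz"}
deep = truncate("http://foo.com/something/else/2008/06/21/whatever.html", points)
```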

- Brad

