scrapers and apis

Will Kahn-Greene

Dec 30, 2011, 5:07:35 PM

I've been reading through the browserid documentation and the code
wondering:

1. how would I write a web scraper for a site that uses browserid?

2. if I had a site that used browserid for authentication, how would I
be able to reuse the authentication code for non-browser apps
accessing the API for my site?

I haven't seen any documentation or other material on either of these
questions.

I ask these for a couple of reasons:

1. web scrapers are pretty useful

2. I'm thinking about implementing browserid for MediaGoblin and other
projects I work on, but we want to support non-browser applications as
well that probably can't run JavaScript, so I'm not sure how that'd
work.

Has there been any thought on either of these? If so, where are
things at?

/will

Henry Story

Dec 30, 2011, 5:24:26 PM

to Will Kahn-Greene, dev-id...@lists.mozilla.org

yes, for that you're certainly better off using WebID

http://webid.info/spec

as it requires only TLS client certificate authentication, which most TLS libraries implement. It
also works at the resource level, so each resource can ask the client for its ID at
connection time.

At some point in the future when the W3C gets the crypto api into the browsers
then BrowserId will be both distributed and able to work with TLS certificates
too.

http://www.w3.org/2011/11/webcryptography-charter.html

In which case, as argued in the Stack Exchange article below, the two protocols will just be slight variations
on each other.

http://security.stackexchange.com/questions/5406/what-are-the-main-advantages-and-disadvantages-of-webid-compared-to-browserid

Henry


Social Web Architect
http://bblfish.net/

Ian Bicking

Dec 30, 2011, 5:34:58 PM

to Will Kahn-Greene, dev-id...@lists.mozilla.org

On Fri, Dec 30, 2011 at 4:07 PM, Will Kahn-Greene <wi...@bluesock.org> wrote:
> I've been reading through the browserid documentation and the code
> wondering:
>
> 1. how would I write a web scraper for a site that uses browserid?
>
> 2. if I had a site that used browserid for authentication, how would I
> be able to reuse the authentication code for non-browser apps
> accessing the API for my site?
>
> I haven't seen any documentation or other material on either of these
> questions.
>
> I ask these for a couple of reasons:
>
> 1. web scrapers are pretty useful

If you want to scrape a site that's not being helpful (which is
usually the case if you are scraping) then generally you have to
reverse engineer the login process. Of course a simple login/password
form makes this pretty easy, but sites that use iframes and the like
(generally more secure sites) are similar to browserid in their
difficulty. You can use some similar techniques, like scraping the
cookie jar of a browser after you log in (usable then for however long
the cookies are valid).
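
A minimal sketch of the cookie-jar approach, assuming you have logged in with a real browser and exported its cookies to a Netscape/Mozilla-format cookies.txt file (the file path, the helper name, and the URL in the usage comment are placeholders, not anything a particular site provides):

```python
import http.cookiejar
import urllib.request


def opener_from_browser_cookies(cookies_txt_path):
    """Build a urllib opener that replays cookies exported from a browser.

    Assumption: the file is in Netscape/Mozilla cookies.txt format.
    """
    jar = http.cookiejar.MozillaCookieJar(cookies_txt_path)
    # Keep session cookies (ignore_discard) but skip ones already expired.
    jar.load(ignore_discard=True, ignore_expires=False)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


# Usage (placeholder path and URL):
#   opener, jar = opener_from_browser_cookies("cookies.txt")
#   html = opener.open("https://example.org/private/page").read()
```

This only works for as long as the site keeps honoring the exported session cookies; once they expire you have to log in through the browser again.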

The general/non-hacky answer would mean a way of delegating your login
credentials to another user agent besides your browser (i.e., the
scraper). I don't think there's an answer in this case, and
supporting non-browser user agents (like mobile apps) is still
something browserid developers are working through (actively, but I
think still inconclusively).

> 2. I'm thinking about implementing browserid for MediaGoblin and other
> projects I work on, but we want to support non-browser applications as
> well that probably can't run JavaScript, so I'm not sure how that'd
> work.

How do you want the non-browser applications to interact with these
applications? OAuth is still applicable for an app that uses
browserid. If it's a question of something that's essentially a full
user agent (but without some of the capabilities necessary for
browserid support) then I guess we're back to the same questions as
scraping.

Ian

Dan Mills

Dec 30, 2011, 5:39:36 PM

to Will Kahn-Greene, dev-id...@lists.mozilla.org

Hi Will,

As the spec stands today, JS is required, and so your scrapers/clients would need a JS environment. That isn't super hard (there are libraries you can use), though it's not super easy, either.

There is a problem that prevents you from fully automating the process, however: there is no way to initiate the sign-in process from the outside. Content must request an assertion and provide a callback. There has been some talk about moving to an event-based API that would allow the sign-in process to be started from the outside (e.g., we could add a "sign in" button to browser chrome, or you could add a sign-in step to your scraper).

Note that for the JS requirement, it would be possible for us to standardize on a protocol for the browser to directly deliver an assertion to the server. That would allow you to write a client without a JS interpreter. However, so far demand for that has been low, so it hasn't been high up on our (or at least my) list of stuff to think about.

Dan

Ian Bicking

Dec 30, 2011, 5:47:21 PM

to Dan Mills, dev-id...@lists.mozilla.org, Will Kahn-Greene

On Fri, Dec 30, 2011 at 4:39 PM, Dan Mills <thu...@mozilla.com> wrote:
> Note that for the JS requirement, it would be possible for us to standardize on a protocol for the browser to directly deliver an assertion to the server. That would allow you to write a client without a JS interpreter. However, so far demand for that has been low, so it hasn't been high up on our (or at least my) list of stuff to think about.

I believe Firefox Sync BrowserID support will be exploring something
like this option, as there's no "frontend" to sync (that is, there's
no HTML page that represents "Sync" that you would log into).
Discussion of this has just begun in the last week or so.

Ian

Will Kahn-Greene

Dec 30, 2011, 5:59:54 PM

Also, we've got a similar situation with SUMO--I had forgotten we're
thinking of doing API stuff, too.

I'll take a look at OAuth and see where that leads with regard to
supporting non-browser applications.

Thanks!

/will

Dan Mills

Dec 30, 2011, 6:30:08 PM

to Ian Bicking, dev-id...@lists.mozilla.org, Will Kahn-Greene

On Friday, December 30, 2011 at 2:47 PM, Ian Bicking wrote:

Yes, though for websites at large we'd need some additional pieces such as, for example, the ability to discover where to send assertions to based on markup.

Dan

Ian Bicking

Dec 30, 2011, 6:54:04 PM

to Dan Mills, dev-id...@lists.mozilla.org, Will Kahn-Greene

On Fri, Dec 30, 2011 at 5:30 PM, Dan Mills <thu...@mozilla.com> wrote:

> On Friday, December 30, 2011 at 2:47 PM, Ian Bicking wrote:
>
>> On Fri, Dec 30, 2011 at 4:39 PM, Dan Mills <thu...@mozilla.com> wrote:
>>
>>> Note that for the JS requirement, it would be possible for us to
>>> standardize on a protocol for the browser to directly deliver an assertion
>>> to the server. That would allow you to write a client without a JS
>>> interpreter. However, so far demand for that has been low, so it hasn't
>>> been high up on our (or at least my) list of stuff to think about.
>>
>> I believe Firefox Sync BrowserID support will be exploring something
>> like this option, as there's no "frontend" to sync (that is, there's
>> no HTML page that represents "Sync" that you would log into).
>> Discussion of this has just begun in the last week or so.
>
> Yes, though for websites at large we'd need some additional pieces such
> as, for example, the ability to discover where to send assertions to based
> on markup.

This seems to be the expectation, but I wonder if it would be fine to
simply send the assertion on the first request (for any resource), like:
"Authorization: BrowserID assertion=XXX" - it's not something you want to
send with each request (e.g., assertion expiration doesn't seem to mean the
same thing as what you'd expect in this context), but having a
hello/handshake request doesn't seem necessary (maybe even wasteful).
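
A rough sketch of what that first request could look like from a non-browser client. The header format is the one floated above and is not part of any BrowserID spec; the helper name and URL are made up for illustration, and obtaining the assertion itself is out of scope here:

```python
import urllib.request


def first_request_with_assertion(url, assertion):
    """Build (but do not send) a request carrying a BrowserID assertion.

    Assumption: the server understands the proposed, non-standard
    "Authorization: BrowserID assertion=..." header.
    """
    req = urllib.request.Request(url)
    req.add_header("Authorization", "BrowserID assertion=" + assertion)
    return req


# Usage (placeholder URL and assertion):
#   req = first_request_with_assertion("https://example.org/api/items", "XXX")
#   response = urllib.request.urlopen(req)
```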

The missing part would be what to do after that first request, how to tell
the client to stop sending the assertion, and send something else (probably
a token?). The Sync discussion is moving towards signing the entire
request (in some OAuthish manner), but that's mostly so the
assertion-sniffing attack area is limited to that initial request - so I
don't see a big difference in security if you allow that the first request
actually does something, and then treat subsequent requests differently.
It just means any endpoint is capable of doing the handshake, which is
generally easy to implement. The first request could even just do the
handshake and ask the client to retry the request.
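
To make the "any endpoint can do the handshake" idea concrete, here is a server-side sketch under heavy assumptions: verify_assertion is a hypothetical stand-in (a real one must check the signature, audience, and expiry), the "Token" scheme for follow-up requests is invented for this example, and sessions live in a plain dict:

```python
import secrets

SESSIONS = {}  # token -> email; in-memory only, for illustration


def verify_assertion(assertion):
    """Hypothetical stand-in for real assertion verification.

    Accepts only strings of the form "demo:<email>"; a real verifier
    must validate the signature, audience, and expiry.
    """
    if assertion.startswith("demo:"):
        return assertion[len("demo:"):]
    return None


def authenticate(headers):
    """Return (email, new_token) for a request's headers.

    new_token is set only when this request performed the handshake;
    the client is expected to send it on later requests instead of
    the assertion.
    """
    auth = headers.get("Authorization", "")
    scheme, _, rest = auth.partition(" ")
    if scheme == "BrowserID" and rest.startswith("assertion="):
        email = verify_assertion(rest[len("assertion="):])
        if email is None:
            return None, None
        token = secrets.token_urlsafe(32)
        SESSIONS[token] = email
        return email, token
    if scheme == "Token":  # invented scheme for follow-up requests
        return SESSIONS.get(rest), None
    return None, None
```

Because authenticate runs in front of every endpoint, the first request can do real work and hand back a token in the same response, or the server can return just the token and ask the client to retry.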

Or maybe you are thinking about the scraping use case, in which case yes
you'd definitely need to figure out where to send the assertion, and in
what format, and websites don't necessarily care to make it easy, or
standardized, or even reliable.

Ian

Dan Mills

Dec 30, 2011, 7:07:15 PM

to Ian Bicking, dev-id...@lists.mozilla.org, Will Kahn-Greene

Right, I was talking about the scraper use-case (which is similar to what the browser would need in order to add a sign-in button in chrome). One straightforward and relatively webby way to do that is to add a little markup that tells the browser that it's supported and where to send assertions.
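
As a sketch of how a scraper might consume such markup: no such markup has been defined, so the <link rel="browserid-assertion-endpoint"> element below is purely hypothetical, as are the helper names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Invented for this sketch; no such rel value is standardized.
HYPOTHETICAL_REL = "browserid-assertion-endpoint"


class EndpointFinder(HTMLParser):
    """Remember the href of the first matching <link> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (
            tag == "link"
            and attrs.get("rel") == HYPOTHETICAL_REL
            and self.href is None
        ):
            self.href = attrs.get("href")


def find_assertion_endpoint(page_url, html):
    """Return the absolute endpoint URL advertised by the page, or None."""
    finder = EndpointFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    return urljoin(page_url, finder.href)
```

A scraper (or browser chrome) would fetch the page, call find_assertion_endpoint, and post the assertion there; the request format would still need to be agreed on separately.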

Dan

On Friday, December 30, 2011 at 3:54 PM, Ian Bicking wrote:

> On Fri, Dec 30, 2011 at 5:30 PM, Dan Mills <thu...@mozilla.com (mailto:thu...@mozilla.com)> wrote:
> > On Friday, December 30, 2011 at 2:47 PM, Ian Bicking wrote: