
Replacing Gecko's URL parser


Anne van Kesteren

Jul 1, 2013, 12:43:01 PM
to dev-pl...@lists.mozilla.org
I'd like to discuss the implications of replacing/morphing Gecko's URL
parser with/into something that conforms to
http://url.spec.whatwg.org/

The goal is to get URL parsing to the level of quality of our CSS and
HTML parsers and to converge with other browsers over time, since URL
parsing currently differs considerably between browsers.

I'm interested in hearing what people think. I outlined two issues
below, but I'm sure there are more. By the way, independently of the
parser bit, we are proceeding with implementing the URL API as drafted
in the URL Standard in Gecko, which should make testing URL parsing
easier.


Idempotent: Currently Gecko's parser and the URL Standard's parser are
not idempotent. E.g. http://@/mozilla.org/ becomes
http:///mozilla.org/ which when parsed becomes http://mozilla.org/
which is somewhat bad for security. My plan is to change the URL
Standard to fail parsing empty host names. I'll have to research
whether there are other cases that are not idempotent.

File URLs: As far as I know, parsing file URLs in Gecko is
platform-specific, so the URL object you get back will have
platform-specific characteristics. In the URL Standard I tried to
align parsing mostly with Windows, leaving interpretation of the file
URL up to the platform. This means platform-specific badness is
exposed, which is risky.


--
http://annevankesteren.nl/

Boris Zbarsky

Jul 1, 2013, 1:01:58 PM
to
On 7/1/13 12:43 PM, Anne van Kesteren wrote:
> I outlined two issues
> below, but I'm sure there are more.

Another big one I'm aware of is the issue of how to treat '\\' in URLs.

-Boris

Gijs Kruitbosch

Jul 1, 2013, 1:20:21 PM
to
We also have issues with hashes and URI-encoding (
https://bugzilla.mozilla.org/show_bug.cgi?id=483304 ).

~ Gijs

Benjamin Smedberg

Jul 1, 2013, 1:58:54 PM
to Anne van Kesteren, dev-pl...@lists.mozilla.org
On 7/1/2013 12:43 PM, Anne van Kesteren wrote:
>
> I'm interested in hearing what people think. I outlined two issues
> below, but I'm sure there are more. By the way, independently of the
> parser bit, we are proceeding with implementing the URL API as drafted
> in the URL Standard in Gecko, which should make testing URL parsing
> easier.
Currently protocol handlers are extensible and so parsing is spread
throughout the tree. I expect that extensible protocol handling is a
non-goal, and that there are just a few kinds of URI parsing that we
need to support. Is it your plan to replace extensible parsing with a
single mechanism?
>
>
> Idempotent: Currently Gecko's parser and the URL Standard's parser are
> not idempotent. E.g. http://@/mozilla.org/ becomes
> http:///mozilla.org/ which when parsed becomes http://mozilla.org/
> which is somewhat bad for security. My plan is to change the URL
> Standard to fail parsing empty host names. I'll have to research if
> there's other cases that are not idempotent.
I don't actually know what this means. Are you saying that
"http://@/mozilla.org/" sometimes resolves to one URI and sometimes another?

>
> File URLs: As far as I know in Gecko parsing file URLs is
> platform-specific so the URL object you get back will have
> platform-specific characteristics. In the URL Standard I tried to
> align parsing mostly with Windows, allowing interpretation of the file
> URL up to the platform. This means platform-specific badness is
> exposed, but is risky.

Files are inherently platform-specific. What are the specific risks you
are trying to mitigate?

--BDS

Gavin Sharp

Jul 1, 2013, 2:30:46 PM
to Benjamin Smedberg, dev-platform
On Mon, Jul 1, 2013 at 10:58 AM, Benjamin Smedberg
<benj...@smedbergs.us> wrote:
>> Idempotent: Currently Gecko's parser and the URL Standard's parser are
>> not idempotent. E.g. http://@/mozilla.org/ becomes
>> http:///mozilla.org/ which when parsed becomes http://mozilla.org/
>> which is somewhat bad for security. My plan is to change the URL
>> Standard to fail parsing empty host names. I'll have to research if
>> there's other cases that are not idempotent.
>
> I don't actually know what this means. Are you saying that
> "http://@/mozilla.org/" sometimes resolves to one URI and sometimes another?

function makeURI(str) { return ioSvc.newURI(str, null, null); } // ioSvc: the nsIIOService

makeURI("http://@/mozilla.org/").spec -> http:///mozilla.org/
makeURI("http:///mozilla.org/").spec -> http://mozilla.org/

In other words,

makeURI(makeURI(str).spec).spec does not always equal str.

Gavin

Axel Hecht

Jul 1, 2013, 3:05:00 PM
to
nitpicking, that's not "not idempotent". It's not round-tripping, but it
looks like it's idempotent.

Axel

Ms2ger

Jul 1, 2013, 3:17:07 PM
to
Actually, the issue is that makeURI(makeURI(str).spec).spec !=
makeURI(str).spec.

HTH
Ms2ger
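[As an illustration, not part of the thread: the property Ms2ger states can be probed in a few lines against any WHATWG-conformant parser. This sketch uses the standard `URL` constructor (e.g. the global in modern Node.js) as a stand-in for Gecko's `makeURI`, and assumes a parser that rejects empty hosts for special schemes, as Anne proposes.]

```javascript
// A parse is "stable" when serializing it and re-parsing yields the same
// serialization -- serialize(parse(serialize(parse(s)))) === serialize(parse(s)) --
// even if neither equals the original input s.
function reserializesStably(str) {
  const once = new URL(str).href;   // parse, then serialize
  const twice = new URL(once).href; // parse the serialization again
  return once === twice;
}

function parses(str) {
  try { new URL(str); return true; } catch (e) { return false; }
}

// Normalization is fine as long as it is stable:
const stable = reserializesStably("http://example.org/a/../b");

// With empty hosts rejected, Gavin's problematic input fails outright
// instead of silently re-parsing into a different URL:
const emptyHostRejected = !parses("http://@/mozilla.org/");
```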

Patrick McManus

Jul 1, 2013, 3:30:56 PM
to Anne van Kesteren, dev-platform
On Mon, Jul 1, 2013 at 12:43 PM, Anne van Kesteren <ann...@annevk.nl> wrote:

> I'd like to discuss the implications of replacing/morphing Gecko's URL
> parser with/into something that conforms to
> http://url.spec.whatwg.org/
>

I know it's not your motivation, but the lack of thread safety in the
various nsIURIs is a common roadblock for me and something I'd love to
see solved in a rewrite... but as Benjamin mentions there are a lot of
pre-existing implementations.

Anne van Kesteren

Jul 1, 2013, 4:47:48 PM
to Benjamin Smedberg, dev-pl...@lists.mozilla.org
On Mon, Jul 1, 2013 at 6:58 PM, Benjamin Smedberg <benj...@smedbergs.us> wrote:
> Currently protocol handlers are extensible and so parsing is spread
> throughout the tree. I expect that extensible protocol handling is a
> non-goal, and that there are just a few kinds of URI parsing that we need to
> support. Is it your plan to replace extensible parsing with a single
> mechanism?

Yes. Basically all non-blessed schemes would end up with scheme,
scheme data, query, and fragment components (blessed schemes would
also have username, password, host, port, and path segments). Any
further parsing would have to be done through scheme-specific
processing and cannot cause URL parsing to fail. More concretely,
data:blah would not fail the URL parser, but http://test:test/ would.

The idea here is to provide consistency with regards to URL parsing as
far as it's exposed to the web and to remain compatible with the
current web.
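[For illustration, not part of the thread: the blessed/non-blessed split Anne describes is observable in a WHATWG-conformant parser today. This sketch assumes the standard `URL` constructor, e.g. the global in modern Node.js.]

```javascript
// A non-blessed (non-special) scheme: everything after "data:" is kept
// as opaque scheme data, and parsing does not fail.
const d = new URL("data:blah");
const dataParts = [d.protocol, d.host, d.pathname]; // ["data:", "", "blah"]

// A blessed (special) scheme: the authority is actually parsed, so a
// non-numeric port like "test" is a hard parse failure.
let httpFailed = false;
try {
  new URL("http://test:test/");
} catch (e) {
  httpFailed = true;
}
```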


>> Idempotent: Currently Gecko's parser and the URL Standard's parser are
>> not idempotent. E.g. http://@/mozilla.org/ becomes
>> http:///mozilla.org/ which when parsed becomes http://mozilla.org/
>> which is somewhat bad for security. My plan is to change the URL
>> Standard to fail parsing empty host names. I'll have to research if
>> there's other cases that are not idempotent.
>
> I don't actually know what this means. Are you saying that
> "http://@/mozilla.org/" sometimes resolves to one URI and sometimes another?

I'm saying that if you parse and serialize it and then parse it again,
mozilla.org is suddenly the host rather than the path. Non-idempotency
of the URL parser has caused security issues, though I'm not really at
liberty to discuss them here.


>> File URLs: As far as I know in Gecko parsing file URLs is
>> platform-specific so the URL object you get back will have
>> platform-specific characteristics. In the URL Standard I tried to
>> align parsing mostly with Windows, allowing interpretation of the file
>> URL up to the platform. This means platform-specific badness is
>> exposed, but is risky.
>
> Files are inherently platform-specific. What are the specific risks you are
> trying to mitigate?

I want the object you get out of new URL("file://C:/test") to be
consistent across platforms. I don't want JavaScript APIs to become
platform-specific, especially one as core as URL. (That the algorithm
that uses the URL to retrieve the data uses a platform-specific code
path is fine, that part is not observable.)
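[Illustration, not part of the thread: the cross-platform consistency Anne wants is what the URL Standard's file parsing produces. Sketch assuming a WHATWG-conformant `URL`, e.g. Node.js's, which behaves the same on every platform.]

```javascript
// "C:" after "file://" looks like a Windows drive letter, so the spec
// treats it as the start of the path rather than as a host -- on every
// platform, so the parsed object is platform-independent.
const f = new URL("file://C:/test");
const fileParts = [f.protocol, f.host, f.pathname];
// expected: ["file:", "", "/C:/test"]
```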


I cannot really comment on adding thread safety other than that it
seems good to have for the workers implementation.


--
http://annevankesteren.nl/

Mike Hommey

Jul 1, 2013, 7:51:59 PM
to Anne van Kesteren, dev-pl...@lists.mozilla.org
On Mon, Jul 01, 2013 at 05:43:01PM +0100, Anne van Kesteren wrote:
> I'd like to discuss the implications of replacing/morphing Gecko's URL
> parser with/into something that conforms to
> http://url.spec.whatwg.org/
>
> The goal is to get URL parsing to the level of quality of our CSS and
> HTML parsers and get convergence over time with other browsers as at
> the moment it's quite different between browsers.
>
> I'm interested in hearing what people think. I outlined two issues
> below, but I'm sure there are more. By the way, independently of the
> parser bit, we are proceeding with implementing the URL API as drafted
> in the URL Standard in Gecko, which should make testing URL parsing
> easier.
>
>
> Idempotent: Currently Gecko's parser and the URL Standard's parser are
> not idempotent. E.g. http://@/mozilla.org/ becomes
> http:///mozilla.org/ which when parsed becomes http://mozilla.org/
> which is somewhat bad for security. My plan is to change the URL
> Standard to fail parsing empty host names. I'll have to research if
> there's other cases that are not idempotent.

Note that some "custom" schemes may be relying on empty host names. In
Gecko, we have about:foo as well as resource:///foo. In both cases, foo
is the path part.

Mike

Neil

Jul 2, 2013, 4:40:52 AM
to
Mike Hommey wrote:

>Note that some "custom" schemes may be relying on empty host names. In Gecko, we have about:foo as well as resource:///foo. In both cases, foo is the path part.
>
>
about:foo is actually an nsSimpleURI, not an nsStandardURL, so it just
throws when you try to access its host.

On the other hand, chrome:// URIs are currently handed off to
nsStandardURL too, which means that all chrome package names have to be
lower case, since the host part is used for that.

--
Warning: May contain traces of nuts.

Anne van Kesteren

Jul 2, 2013, 6:55:22 AM
to Mike Hommey, dev-pl...@lists.mozilla.org
On Tue, Jul 2, 2013 at 12:51 AM, Mike Hommey <m...@glandium.org> wrote:
> Note that some "custom" schemes may be relying on empty host names. In
> Gecko, we have about:foo as well as resource:///foo. In both cases, foo
> is the path part.

about URLs don't have a host name. resource URLs might, I suppose. Do
resource URLs support relative URLs? The main problem with the URL
Standard is that support for relative URLs is limited to a fixed list
of schemes. But given the way URLs in browsers work, there's not much
we can do about that.

There's a couple of things we can do for such URLs:

a) Standardize them and add their parsing behavior to the URL Standard.
b) Rework Gecko to no longer require them.
c) Rework such URLs to no longer be ones that support relative URLs.


--
http://annevankesteren.nl/

Benjamin Smedberg

Jul 2, 2013, 7:09:04 AM
to Anne van Kesteren, Mike Hommey, dev-pl...@lists.mozilla.org
On 7/2/2013 6:55 AM, Anne van Kesteren wrote:
> On Tue, Jul 2, 2013 at 12:51 AM, Mike Hommey <m...@glandium.org> wrote:
>> Note that some "custom" schemes may be relying on empty host names. In
>> Gecko, we have about:foo as well as resource:///foo. In both cases, foo
>> is the path part.
> about URLs don't have a host name. resource URLs might I suppose. Do
> resource URLs support relative URLs?
Both resource: and chrome: have host names and need to support relative
URIs. Neither of them is a candidate for standardization, though. We
should just add them as special known schemes in our parser.

--BDS

Mike Hommey

Jul 2, 2013, 7:57:20 AM
to Benjamin Smedberg, dev-pl...@lists.mozilla.org
Note that the host name is optional in resource: URLs (for which no
host name is the same as host == app), but not in chrome: URLs. There
is also no host in about: URLs, as I mentioned, but about: URLs never
have a slash before the path.

Mike

Cameron Kaiser

Jul 2, 2013, 10:06:22 PM
to
On 7/1/13 4:51 PM, Mike Hommey wrote:
>> Idempotent: Currently Gecko's parser and the URL Standard's parser are
>> not idempotent. E.g. http://@/mozilla.org/ becomes
>> http:///mozilla.org/ which when parsed becomes http://mozilla.org/
>> which is somewhat bad for security. My plan is to change the URL
>> Standard to fail parsing empty host names. I'll have to research if
>> there's other cases that are not idempotent.
>
> Note that some "custom" schemes may be relying on empty host names. In
> Gecko, we have about:foo as well as resource:///foo. In both cases, foo
> is the path part.

I'll bet some extensions implement custom protocol handlers that do the
same thing. (In fact, I'm sure of at least one; OverbiteFF uses null
host names to denote pseudo URLs internal to the protocol handler.) This
might be considered gauche, but probably not an isolated case.

Cameron Kaiser

Anne van Kesteren

Jul 3, 2013, 2:49:39 AM
to Benjamin Smedberg, Mike Hommey, dev-pl...@lists.mozilla.org
On Tue, Jul 2, 2013 at 12:09 PM, Benjamin Smedberg
<benj...@smedbergs.us> wrote:
> Both resource: and chrome: have host names and need to support relative
> URIs. Neither of them is a candidate for standardization, though. We should
> just add them as special known schemes in our parser.

Well, either we have to standardize their parsing behavior, limit
their parsing behavior to chrome, or think of some third alternative.
We do not want

url = new URL(rel, base)

to differ across engines for any rel or base.
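[Illustration, not part of the thread: for a scheme the standard knows about, the two-argument form Anne shows is fully specified, so every engine must produce the same result. Sketch assuming a WHATWG-conformant `URL`, e.g. Node.js's.]

```javascript
// Relative resolution against a base is part of the parser itself:
// pop the base path's last segment, apply "..", append the rest.
const resolved = new URL("../icons/go.png", "http://example.org/a/b/c").href;
// expected: "http://example.org/a/icons/go.png"
```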


--
http://annevankesteren.nl/

Axel Hecht

Jul 3, 2013, 8:17:20 AM
to
How many odd protocols and assumptions on how they work do we still have
in mailnews' abuse of RDF? Not sure what's lurking in localstore.rdf and
mimeTypes.rdf.

Also, sorry, can't offer more help than asking these days.

Axel

Benjamin Smedberg

Jul 3, 2013, 10:08:12 AM
to Anne van Kesteren, Mike Hommey, dev-pl...@lists.mozilla.org
On 7/3/2013 2:49 AM, Anne van Kesteren wrote:
> On Tue, Jul 2, 2013 at 12:09 PM, Benjamin Smedberg
> <benj...@smedbergs.us> wrote:
>> Both resource: and chrome: have host names and need to support relative
>> URIs. Neither of them is a candidate for standardization, though. We should
>> just add them as special known schemes in our parser.
> Well, either we have to standardize their parsing behavior, limit
> their parsing behavior to chrome, or think of some third alternative.
> We do not want
>
> url = new URL(rel, base)
>
> to differ across engines for any rel or base

I don't understand why it matters. chrome: and resource: are both
gecko-specific extensions and we have no desire to standardize them.
Chromium uses a different scheme for their chrome: protocol.

Web content typically is not allowed to link or load chrome resources,
although there is an ancient exception for chrome://global that was
included for remote XUL and may not be necessary any more. But I don't
think we should either try to standardize these protocols, nor should we
try to change URL parsing behavior depending on whether we're chrome or
content.

--BDS

Bobby Holley

Jul 3, 2013, 12:50:42 PM
to Benjamin Smedberg, Mike Hommey, dev-pl...@lists.mozilla.org
On Wed, Jul 3, 2013 at 8:08 AM, Benjamin Smedberg <benj...@smedbergs.us> wrote:

> We do not want
>>
>> url = new URL(rel, base)
>>
>> to differ across engines for any rel or base
>>
>
> I don't understand why it matters. chrome: and resource: are both
> gecko-specific extensions and we have no desire to standardize them.
> Chromium uses a different scheme for their chrome: protocol.
>
> Web content typically is not allowed to link or load chrome resources,
> although there is an ancient exception for chrome://global that was
> included for remote XUL and may not be necessary any more. But I don't
> think we should either try to standardize these protocols, nor should we
> try to change URL parsing behavior depending on whether we're chrome or
> content.
>

Is there ever a reason for content to do |new URL(foo)| for some
resource:// or chrome:// foo? If so, why can't we just check the subject
principal in the constructor and forbid it? Seems like good
defense-in-depth to me.

bholley

Anne van Kesteren

Jul 4, 2013, 4:45:00 AM
to Benjamin Smedberg, Mike Hommey, dev-pl...@lists.mozilla.org
On Wed, Jul 3, 2013 at 4:08 PM, Benjamin Smedberg <benj...@smedbergs.us> wrote:
> I don't understand why it matters. chrome: and resource: are both
> gecko-specific extensions and we have no desire to standardize them.
> Chromium uses a different scheme for their chrome: protocol.

Because doing so would be a violation of the standard. It's pretty
clear on what the results must be for any given input. Having
differences for some unknown-in-advance schemes would be bad.


--
http://annevankesteren.nl/

Kyle Huey

Jul 4, 2013, 11:17:24 AM
to Anne van Kesteren, Mike Hommey, Benjamin Smedberg, dev-pl...@lists.mozilla.org

Presumably we could have a blacklist of the handful of protocols that are
internal to browsers and have compat issues. "It violates the standard"
isn't a very compelling argument when the standard is in the process of
being written and nobody implements it.

- Kyle

Anne van Kesteren

Jul 4, 2013, 12:22:38 PM
to Kyle Huey, Mike Hommey, Benjamin Smedberg, dev-pl...@lists.mozilla.org
On Thu, Jul 4, 2013 at 5:17 PM, Kyle Huey <m...@kylehuey.com> wrote:
> Presumably we could have a blacklist of the handful of protocols that are
> internal to browsers and have compat issues. "It violates the standard"
> isn't a very compelling argument when the standard is in the process of
> being written and nobody implements it.

Then we might as well define them. It's not like their parsing
behavior is particularly hard. And this does not just affect new
URL(); it affects how e.g. <a> behaves as well. And a large part of
the reason for defining this is the current mess of nobody following
the IETF standards (which were not web-compatible; the URL Standard
hopefully is)...

In any event, this issue seems relatively minor compared to what's
involved in getting Gecko to switch URL parsers.


--
http://annevankesteren.nl/

Gervase Markham

Jul 5, 2013, 6:22:35 AM
to Anne van Kesteren, Kyle Huey, Mike Hommey, Benjamin Smedberg, dev-pl...@lists.mozilla.org
On 04/07/13 17:22, Anne van Kesteren wrote:
> On Thu, Jul 4, 2013 at 5:17 PM, Kyle Huey <m...@kylehuey.com> wrote:
>> Presumably we could have a blacklist of the handful of protocols that are
>> internal to browsers and have compat issues. "It violates the standard"
>> isn't a very compelling argument when the standard is in the process of
>> being written and nobody implements it.
>
> Then we might as well define them. It's not like their parsing
> behavior is particularly hard.

Didn't bsmedberg say that chrome:// in Mozilla is different to chrome://
in Chromium? Who wins?

Gerv


Anne van Kesteren

Jul 15, 2013, 3:14:05 PM
to Boris Zbarsky, dev-pl...@lists.mozilla.org
On Mon, Jul 1, 2013 at 1:01 PM, Boris Zbarsky <bzba...@mit.edu> wrote:
> Another big one I'm aware of is the issue of how to treat '\\' in URLs.

In the specification this is resolved in favor of how
WebKit/Chromium/Trident go about it, which is to treat it identically
to "/" but flag it with a warning in the console, if applicable.
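[Illustration, not part of the thread: the WebKit/Chromium/Trident behavior being adopted, sketched against a WHATWG-conformant `URL` such as Node.js's.]

```javascript
// For special schemes, every "\" in the input is treated as "/".
// (In this JS string literal, \\ denotes a single backslash character.)
const href = new URL("http:\\\\example.org\\a\\b").href;
// expected: "http://example.org/a/b"
```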


What tactic should we use for fixing URL parsing?

a) Have side-by-side implementations as with HTML and slowly switch
over callers.
b) Incrementally fix the existing URL parser (file bugs on specific
issues such as "\")
c) Both?

I have an implementation in JavaScript: https://github.com/annevk/url
Next I'm going to get the test suite into better shape. I guess what's
unclear is how to track the Gecko bits.


--
http://annevankesteren.nl/

Ehsan Akhgari

Jul 15, 2013, 4:20:05 PM
to Anne van Kesteren, Boris Zbarsky, dev-pl...@lists.mozilla.org
On 2013-07-15 3:14 PM, Anne van Kesteren wrote:
> On Mon, Jul 1, 2013 at 1:01 PM, Boris Zbarsky <bzba...@mit.edu> wrote:
>> Another big one I'm aware of is the issue of how to treat '\\' in URLs.
>
> In the specification this is resolved in favor of how
> WebKit/Chromium/Trident go about it. Which is to treat it identically
> to "/" but flag it with a warning in the console, if applicable.
>
>
> What should we use as tactic for fixing URL parsing?
>
> a) Have side-by-side implementations as with HTML and slowly switch
> over callers.
> b) Incrementally fix the existing URL parser (file bugs on specific
> issues such as "\")

The second approach seems like less work, but of course we should hide
changes behind pref kill switches in case things go wrong with web compat.

Ehsan
