I want url::Canonicalize but harder


Nigel Tao

Apr 12, 2023, 8:42:45 AM
to chromium-dev
How can I canonicalize a URL when url::Canonicalize isn't canonical enough?

```
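#include <string>

#include "base/logging.h"
#include "url/url_canon_stdstring.h"
#include "url/url_util.h"

// Canonicalizes src with url::Canonicalize and logs the result.
void canonicalize(const std::string& src) {
  constexpr bool trim_path_end = false;
  std::string dst;
  url::StdStringCanonOutput o(&dst);
  url::Parsed p;
  url::Canonicalize(src.data(), src.size(), trim_path_end, nullptr, &o, &p);
  o.Complete();  // Must be called before dst is read.
  LOG(ERROR) << dst;
}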

// etc

canonicalize("http://a.example.com/x-y=z");
canonicalize("http://b.example.com/x%2Dy%3Dz");
```

prints

```
http://a.example.com/x-y=z
http://b.example.com/x-y%3Dz
```

so the "%2D" and "%3D" are treated differently. I'm guessing because
"=" / "%3D" is a reserved character:
https://en.wikipedia.org/wiki/URL_encoding#Types_of_URI_characters

It looks deliberate, although it's not obvious why:
https://source.chromium.org/chromium/chromium/src/+/main:url/url_canon_unittest.cc;l=1301;drc=ed519e442491476fbf09e2e419efb27716a94bed

Neither "url/url_canon.h", "url/url_util.h" nor
https://source.chromium.org/chromium/chromium/src/+/main:url/README.md
gives detail on what "URL canonicalization" means exactly.

My problem is that I have URLs whose path contains what looks like a
base-64 encoded something, and base-64 uses "=" for padding. Sometimes
my URLs have "=" and other times they have "%3D" and I'd like to
canonicalize them so I can compare for equality.

Converting a std::string to a GURL and then calling path() doesn't
help, since that basically just invokes url::Canonicalize().

This is for ChromiumOS Fusebox
(https://source.chromium.org/chromium/chromium/src/+/main:chrome/browser/ash/fusebox/README.md)
drag-and-drop, where third party JS can offer filesystem URLs as
drag-and-drop data sources. I want to compare these URLs (and their
prefixes) with an allow-list, and thought that canonicalization would
facilitate that, but url::Canonicalize doesn't collapse "=" and "%3D"
to the same thing.

I'm writing C++ (not JS), so I don't have access to JS's
decodeURIComponent() or decodeURI().

"url/url_util.h" does offer a DecodeURLEscapeSequences C++ function
but I'm hesitant to use it because, IIUC, it's not idempotent. Given
"%2541" input, DecodeURLEscapeSequences'ing it once gives "%41" but
DecodeURLEscapeSequences'ing it twice gives "A".
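
To make the idempotence concern concrete, here is a minimal standalone sketch. PercentDecodeOnce and HexDigitValue are hypothetical helpers written for this illustration; they are not url::DecodeURLEscapeSequences and they skip the UTF-8 handling a real decoder would need. The sketch only shows that a single generic percent-decoding pass can give a different answer each time it is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not a Chromium API): returns 0-15 for a hex digit,
// or -1 for any other character.
int HexDigitValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical stand-in for one percent-decoding pass over the input.
// Malformed escapes such as "%4" or "%zz" are copied through unchanged.
std::string PercentDecodeOnce(std::string_view in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size()) {
      int hi = HexDigitValue(in[i + 1]);
      int lo = HexDigitValue(in[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2;  // Skip the two hex digits just consumed.
        continue;
      }
    }
    out.push_back(in[i]);
  }
  return out;
}

// Each extra pass can change the result, so one pass is not idempotent.
void DemoDoubleDecode() {
  assert(PercentDecodeOnce("%2541") == "%41");
  assert(PercentDecodeOnce("%41") == "A");
}
```

Because "%25" decodes to "%", the output of one pass can itself contain a fresh escape sequence, which is why decoding is a poor fit for a canonical form unless the number of passes is fixed.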

Do I have to roll my own URL canonicalization, separate from
url::Canonicalize from "url/url_util.h"?

Nick Harper

Apr 12, 2023, 10:36:40 AM
to nige...@chromium.org, chromium-dev
In the general case "%3D" and "=" in a URL aren't the same thing. Consider a URL with a query string, e.g. http://example.com/path?x=1&y=2, which is different from http://example.com/path?x=1&y%3D2. The former has a key named "y" with a value of "2", while the latter has a key named "y=2" with no value.

https://www.rfc-editor.org/rfc/rfc3986#section-2.2 makes reference to percent-encoding a reserved character potentially changing how the application will interpret the URL.
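
As a rough illustration of the query-string point, the sketch below splits a query on "&" and then on the first literal "=" in each field, before any percent-decoding. SplitQuery and QueryPairs are hypothetical names for this example, not Chromium APIs, and real query parsers differ in details. It only shows why "y=2" and "y%3D2" end up as different key/value pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using QueryPairs = std::vector<std::pair<std::string, std::string>>;

// Hypothetical helper: splits a raw (still percent-encoded) query string
// into key/value pairs. Only a literal '=' acts as the separator; "%3D"
// stays inside the key or value as ordinary data.
QueryPairs SplitQuery(std::string_view query) {
  QueryPairs pairs;
  std::size_t start = 0;
  while (start <= query.size()) {
    std::size_t amp = query.find('&', start);
    if (amp == std::string_view::npos) amp = query.size();
    std::string_view field = query.substr(start, amp - start);
    if (!field.empty()) {
      std::size_t eq = field.find('=');
      if (eq == std::string_view::npos) {
        // No literal '=': the whole field is the key, with an empty value.
        pairs.emplace_back(std::string(field), std::string());
      } else {
        pairs.emplace_back(std::string(field.substr(0, eq)),
                           std::string(field.substr(eq + 1)));
      }
    }
    start = amp + 1;
  }
  return pairs;
}
```

With this sketch, SplitQuery("x=1&y=2") yields the pairs ("x", "1") and ("y", "2"), while SplitQuery("x=1&y%3D2") yields ("x", "1") and ("y%3D2", ""). Decoding the second key afterwards gives "y=2", matching the description above.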


Nigel Tao

Apr 12, 2023, 5:52:02 PM
to Nick Harper, chromium-dev
On Thu, Apr 13, 2023 at 12:35 AM Nick Harper <nha...@chromium.org> wrote:
> In the general case "%3D" and "=" in a URL aren't the same thing. Consider a URL with a query string, e.g. http://example.com/path?x=1&y=2, which is different from http://example.com/path?x=1&y%3D2. The former has a key named "y" with a value of "2", while the latter has a key named "y=2" with no value.

In the general case, yes, but IIUC before the ?, or if there isn't a
?, they mean the same thing.

In the ChromeOS Files App, I get FileSystem URLs that look like this:

(1) filesystem:chrome://file-manager/external/arc-documents-provider/com.example.foo/ABCsomebase64thingXYZ==

(2) filesystem:chrome://file-manager/external/arc-documents-provider/com.example.foo/ABCsomebase64thingXYZ%3D%3D

There is no ? query component, but there is trailing "=" vs "%3D" in
the base-64 encoded thing in the URL path.

I'd like to canonicalize these two to be the same. If
url::Canonicalize won't do it (and it doesn't take any options), do I
have to roll my own canonicalization?
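
For illustration only, a hand-rolled version of that narrow normalization could look like the sketch below. UnescapeEqualsBeforeQuery is a hypothetical name, not a Chromium API. It assumes, as discussed above, that "=" and "%3D" are interchangeable before the first "?" or "#" for these particular URLs; that assumption does not hold for URLs in general.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: rewrites "%3D" (or "%3d") as "=" in the part of the
// spec before the first '?' or '#'. The query and fragment are copied
// through unchanged, since "=" is significant there.
std::string UnescapeEqualsBeforeQuery(std::string_view spec) {
  std::size_t end = spec.find_first_of("?#");
  if (end == std::string_view::npos) end = spec.size();

  std::string out;
  out.reserve(spec.size());
  std::size_t i = 0;
  while (i < end) {
    if (spec[i] == '%' && i + 2 < end && spec[i + 1] == '3' &&
        (spec[i + 2] == 'D' || spec[i + 2] == 'd')) {
      out.push_back('=');
      i += 3;
    } else {
      out.push_back(spec[i]);
      ++i;
    }
  }
  out.append(spec.substr(end));  // Query and fragment, untouched.
  return out;
}
```

Unlike a general decoding pass, this rewrite never produces a new "%3D", so applying it twice gives the same result as applying it once. Both (1) and (2) above map to the form in (1).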

Peter Kasting

Apr 12, 2023, 6:13:58 PM
to nige...@chromium.org, Nick Harper, chromium-dev
On Wed, Apr 12, 2023 at 2:50 PM Nigel Tao <nige...@chromium.org> wrote:
> On Thu, Apr 13, 2023 at 12:35 AM Nick Harper <nha...@chromium.org> wrote:
> > In the general case "%3D" and "=" in a URL aren't the same thing. Consider a URL with a query string, e.g. http://example.com/path?x=1&y=2, which is different from http://example.com/path?x=1&y%3D2. The former has a key named "y" with a value of "2", while the latter has a key named "y=2" with no value.
>
> In the general case, yes, but IIUC before the ?, or if there isn't a
> ?, they mean the same thing.

I believe per the WHATWG URL spec '=' is percent-encoded only in the "userinfo" section ("authority state", the username and password).

This could be a bug in url::Canonicalize(), being too aggressive about percent-encoding these everywhere. You could attempt to tweak the behavior and add relevant tests.

PK

Brett Wilson

Apr 12, 2023, 6:22:17 PM
to nige...@chromium.org, Nick Harper, chromium-dev
The current canonicalizer does as much as it can without changing the meaning of a URL (*). Since servers can mostly do whatever they want, the amount of canonicalization is somewhat limited in the path section, and very limited in the query section.

Since you're writing the files app, that's effectively the "server" from the browser's perspective, which is a different use-case than normal. In your case you probably want to unescape everything (at least in the path component). There is some code to do this somewhere (I forget where). You might start by looking at what components/url_formatter/elide_url.h does, since it may do some path formatting.

Brett

(*) This is given the understanding of URL requirements at the time of writing. But it's quite difficult to change: even if the spec says something is OK, it doesn't mean it can be changed without breaking some web sites. So careful study is required for every change.

Nigel Tao

Apr 13, 2023, 3:20:32 AM
to Brett Wilson, Nick Harper, chromium-dev
For the record, when starting with a std::string, converting to a GURL
and then a storage::FileSystemURL and then back to a GURL (and then a
std::string) does sufficient canonicalization (see the
storage::FileSystemURL::ToGURL implementation link below) for my
immediate problem. That only works for "filesystem:" URLs, but that's
all I need right now.

https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:storage/browser/file_system/file_system_url.cc;l=136-157;drc=9329f990b89052f1e7f82e0a1a4b298b72359263