Well, this one seems to be the latest RFC regarding URLs, modulo a bit
of HTTP-specific stuff like authority-form appearing in rfc7230. Is
there something newer?
> How do you feel about the WhatWG URL specification at https://url.spec.whatwg.org?
Quite frankly, I hate it. This "specification" manages to organize and
present the information in the most obtuse manner possible. I find it
hostile to implementers like myself.
> It has a large body of tests ...that you may consider looking at.
Yep, it does! Integrating them is on my to-do list:
<https://github.com/CPPAlliance/url/tree/c6c4b433c3b1057161b6ce50bb4fba0b5f59b4ee/test/wpt>
> Do you have any general plan for a strategy for handling non-ASCII input?
Yes, the plan is to reject such input. Strings containing non-ASCII
characters are not valid URLs. And even some ASCII characters are not
allowed to appear in a URL, for example all control characters.
> I haven’t tested it yet, but what do you plan to do if someone passes a UTF-8
> encoded non-ASCII string...
I think what you're asking is, what if someone supplies a URL which
has escaped characters which, when percent-decoding is applied, become
valid UTF-8 code point sequences? That's perfectly fine.
Percent-encoded URL parts are in fact "ASCII strings."
> ...into the constructor?
You can't construct a URL from a string, you have to go through one of
the parsing functions. This is because the library recognizes several
variations of URL grammar, and does not favor any particular grammar
by choosing one to support construction. See Table 1.1:
<https://master.url.cpp.al/url/parsing.html>
Thanks
Do you have any general plan for a strategy for handling non-ASCII input? I haven’t tested it yet, but what do you plan to do if someone passes a UTF-8 encoded non-ASCII string into the constructor?
>> Do you have any general plan for a strategy for handling non-ASCII input?
>
> Yes, the plan is to reject such input. Strings containing non-ASCII
> characters are not valid URLs. And even some ASCII characters are not
> allowed to appear in a URL, for example all control characters.
That is certainly a choice you can make, but at some point you may run into issues with people trying to give your library input like http://example.com/💩 and expecting the URL parser to normalize it to http://example.com/%F0%9F%92%A9 for you, like some other URL libraries do. I see Punycode encoding and decoding don't seem to be in the scope of this library; that may be fine for your use cases and not fine for others. It seems like you're aware of this design choice, though.
Assuming this is supported, it raises the question of what the parser would
be expected to do with http://example.com/💩/%F0%9F%92%A9.
Should it encode the percents, under the assumption that they are literal
because everything is non-encoded? Or should it leave the percents as-is
and only encode the emoji?
If the use case is "the user typed or pasted something into the address bar",
I suppose a best-effort DWIM (e.g. option 2) makes more sense. But in
other scenarios, option 1 might.
I plan to pick through the WhatWG specification and see if there are
any tidbits that could have value. The procedural exposition (append
this character, execute this algorithm, output this string) makes it
very difficult to grasp the higher level semantics of the thing. The
BNF in the RFC is way easier to grok, e.g.:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
and in fact this is exactly how I have decomposed the parsing, into
each individual named element described by the RFC.
> That is certainly a choice you can make, but at some point you
> may run into issues with people trying to give your library input
> like http://example.com/💩
I'm not sure how someone would give the library that input. Is this
expressible in a string_view?
> I see Punycode encoding and decoding doesn’t seem to be in the scope of this library,
> and for your use cases that may be fine and for others that might not be fine.
I actually have all the punycode algorithms ready including tests:
<https://github.com/CPPAlliance/url/tree/punycode>
But after giving it some thought, I couldn't see the use-case for it.
Callers who want to perform name resolution on a host can't pass a
utf-8 string; they have to pass the punycode-encoded string. The only
use-case that I can discern for punycode is for display purposes which
is out of scope. Unless, do you know of any other use-case?
> It seems like you’re aware of this design choice, though.
Yes
Actually, come to think of it - there is a use-case for punycode
encoding and that is to take an international domain name string in
utf-8 encoding and apply punycode encoding. I think... I have no
experience with this so some guidance would be helpful.
--
Regards,
Vinnie
Follow me on GitHub: https://github.com/vinniefalco
Okay, I think what you're saying is that you will have this string literal:
string_view s = "http://example.com/\xf0\x9f\x92\xa9";
Unfortunately, this is not a valid URL and I don't think that the
library should accept this input. However, you could write this:
url u = parse_uri( "http://example.com" ).value();
u.set_path( "/\xf0\x9f\x92\xa9" );
This will produce:
assert( u.encoded_url() == "http://example.com/%f0%9f%92%a9" );
Is this what you meant? I'm guessing that the turd emoji is inserted
into the C++ source file as a utf-8 encoded code point, so that's what
you get in the string literal.
Thanks
Not so fast, I think that this can be decided objectively.
A URL in the context of Boost.URL refers to "URI" in the rfc3986
sense. I use URL because most people have never heard of URI.
What you are thinking of as a "valid URL parser input" is actually an
Internationalized Resource Identifier, which supports the broader
universal character set instead of just ASCII and is abbreviated by
the even more obscure acronym "IRI." It is covered by rfc3987:
<https://datatracker.ietf.org/doc/html/rfc3987>
Translating your comment, I think you're saying "Boost.URL should
support Internationalized Resource Identifiers." That is unfortunately
out of scope for the library, as Boost.URL is mostly designed for the
exchange of URLs between machines or programs and not necessarily for
display to users. Perhaps someday, the entire world will have switched
to IRIs (maybe after IPv4 is no longer in use) but we are not there
yet, and most systems require IRIs to be mapped to their URI
equivalent:
<https://datatracker.ietf.org/doc/html/rfc3987#section-3>
There is some value to IRIs but not as much as there is for the ASCII
URLs, which fill a tremendous user need (HTTP/WebSocket clients and
servers using Beast or Asio).
> On Oct 12, 2021, at 3:15 PM, Vinnie Falco <vinnie...@gmail.com> wrote:
>
> On Tue, Oct 12, 2021 at 1:02 PM Alex Christensen <achris...@apple.com> wrote:
>> at some point you may run into issues with people trying to give
>> your library input like http://example.com/💩 and expecting the
>> URL parser to normalize it to http://example.com/%F0%9F%92%A9
>
> Okay, I think what you're saying is that you will have this string literal:
>
> string_view s = "http://example.com/\xf0\x9f\x92\xa9";
>
> Unfortunately, this is not a valid URL and I don't think that the
> library should accept this input.
It is perfectly valid input that some URL libraries I work with accept and percent encode, and some URL libraries I work with reject it as an invalid URL. I think it’s a valid URL parser input that ought to produce a valid URL, but not everyone agrees on this yet.
> However, you could write this:
>
> url u = parse_uri( "http://example.com" ).value();
>
> u.set_path( "/\xf0\x9f\x92\xa9" );
>
> This will produce:
>
> assert( u.encoded_url() == "http://example.com/%f0%9f%92%a9" );
>
> Is this what you meant?
Welcome to the crazy world of URL parsing!
I would not say so. I see two different use cases:
* parse a uri: the uri must be properly encoded
* programmatically build a uri: the different components shall not be
encoded.
-> u.set_path("%f0%9f%92%a9") will then produce
"http://example.com/%25f0%259f%2592%25a9".
However, that seems inconsistent with the way set_host works (the docs
say it needs to be encoded), so I tend to agree with the red flag here
(or maybe these are just documentation issues, since there is also
set_encoded_host). And then there's the issue of the '/' character in
set_path if you take unencoded strings (not sure how this should be
handled...)
Regards,
Julien
>> And then there's the issue of the '/' character in
>> set_path if you take unencoded strings (not sure how this should be
>> handled...)
>
> Well, there's not much of an issue there. set_path() treats slashes as
> segment separators. If that's not what you want, you can set the
> individual unencoded segments through the container returned by
> url::segments(). The same goes for the query parameters (call
> url::params()).
I'm not at ease with that. If I build a mailto: uri, and call set_path
with an email address containing a / character in it (such as
valid/em...@example.com), it will result in a uri containing two
segments, which is nonsense for a mailto scheme (and obviously not what
I expected).
I understand the API is fitted toward web URIs, but I see there a
potential for hard-to-find bugs due to API misuse. Maybe it's just the
naming that needs to be improved.
Regards,
Julien
Exactly, it's like returning a string_view from a function - which
you should never do, at least not without an obvious indication...
> Maybe having parse_url_as_view(str) and a safer parse_url(str) using
> the former but not returning a view would satisfy you?
Yes, that's exactly what I suggested.
The alternative is to try to find some way to return something that
can't be assigned to auto. This may be possible with something like
a private_url_view class that wraps a view and makes everything private:
url u = parse_url(s);
// parse_url() returns a private_url_view; url is constructible from private_url_view
url_view v = parse_url(s);
// parse_url() returns a private_url_view; url_view is constructible from private_url_view
auto u = parse_url(s);
// Can we make that fail to compile by making private_url_view's ctors private?
// If not...
auto p = u.path();
// ...we can prevent the user from using it for anything by hiding all of
// its methods.
// But:
url u2 = u;
func(u);
// We need to prevent those too. I think we can prevent assigning from
// a private_view to a url[_view] by qualifying url[_view]'s assignment
// and copy ctor with &&:
url::operator=(const private_url_view& v) = delete;
url::operator=(private_url_view&&) { ... }
I'm sure that smarter people than me already know the correct pattern for
doing this. Searching I found p0672r0 which addresses how to fix the
similar issues for proxy types and expression templates but doesn't mention
views.
I also found blog posts by Arthur O'Dwyer talking about string_view, e.g.
https://quuxplusone.github.io/blog/2018/03/27/string-view-is-a-borrow-type/
Quote:
A function may have a borrow type as its return type, but if so, the
function must be explicitly annotated as returning a potentially dangling
reference. That is, the programmer must explicitly acknowledge responsibility
for the annotated function’s correctness.
Regardless, if f is a function so annotated, the result of f must not
be stored into any named variable except a function parameter or for-loop
control variable. For example, auto x = f() must still be diagnosed as a
violation.
Regards, Phil.
Segment iteration is not going to be compatible. In addition to adding
an initial "/" segment for absolute paths, Filesystem also collapses
consecutive / separators. So iterating "/foo//bar//baz///" produces
"/" │ "foo" │ "bar" │ "baz" │ ""
(https://godbolt.org/z/EsjKzc5f1)
A design goal of URL seems to be that the information that the accessors
give accurately reflects the contents of the string (and that there's no
hidden metadata that the string doesn't reflect.)
So the segments of the above path are
{ "foo", "", "bar", "", "baz", "", "", "" }
because otherwise the segments of the above and "/foo/bar/baz/" will
be the same, which means that it won't be possible to reconstruct the string
from the information the URL accessors give.
That is, if you have a string, and you construct an URL object from it, then
take its state as returned by the accessors, copy it into another empty URL,
the resultant string should be the same as the original. And similarly, if you
take an empty URL, set its parts to some values, take the string, create
another URL object from the string, the accessors should give the original
values.
Destructive transformations such as what Filesystem does above cannot
work in this model.
Again, what about the case where the original input URL contained that
leading dot? You can't argue "we must report it unchanged" when by
definition there are conditions when you are changing it.
The only mechanism that seems sane to me is that encoded_url() and
friends are documented to normalise (or at least to partially normalise,
limited to adding/removing the path prefix) the URL before returning a
string, at which point segments() may change content. (But it's
important that it doesn't break if you push_back each segment
individually instead of assigning it all at once.)
> If we then remove the scheme, I think the library needs to remove the
> prefix that it added. Not a full "normalization" (that's a separate
> member function that the user calls explicitly). The rationale is that
> if the library can add something extra to make things consistent, it
> should strive to remove that something extra when it can do so. Yes
> this means that if the user adds the prefix themselves and then
> performs a mutation, the library could end up removing the thing that
> the user added. I think that's ok though, because the (up to 2
> character) path prefixes that the library can add are all semantic
> no-ops even in the filesystem cases.
I don't disagree with this, but I do disagree with the iteration methods
trying to "hide" elements that are actually present in the URL.
On 19/10/2021 08:37, Peter Dimov wrote:
> Segment iteration is not going to be compatible. In addition to
> adding an initial "/" segment for absolute paths, Filesystem also
> collapses consecutive / separators. So iterating "/foo//bar//baz///"
> produces
>
> "/" │ "foo" │ "bar" │ "baz" │ ""
Fair point; I hadn't considered that one. That's unfortunate. I agree
that URL cannot collapse adjacent separators.
The rationale is the URI standard RFC. It is not permitted for URI
parsers to perform such collapsing because such URIs are entirely legal.
Unusual, certainly, especially for http[s] URIs that usually eventually
end up at a filesystem that would collapse consecutive slashes. Most
web servers (regardless of whether evaluating vs a filesystem or not)
will indeed collapse consecutive slashes too, at least while matching
against resource rules.
But URIs are not required to wind up leading to a filesystem or to be
served by a web server -- it's entirely possible that some other scheme
may treat additional consecutive slashes as significant for some purpose.
It's a constant thorn in Ranges' side, though, especially with all the
various kinds of non-owning filters. But by the same token people
should start getting more used to these sorts of considerations and be
less prone to making such errors.
And the static analysis tooling is a bit better now at spotting such
things, though not perfect.
> c_str does, but virtually nobody makes this mistake for some reason.
I recall some older (broken) C++ compiler (or rather: optimizer)
implementations where the temporary in (rvalue).c_str() was sometimes
destroyed after calling c_str() instead of after the entire
full-expression, which caused all sorts of Fun™ for the common cases of
passing this to a method call, which in turn led to defensively creating
a lot of otherwise unnecessary lvalues.
Thankfully modern compilers don't have that problem any more.