[boost] Boost.URL: Something Wicked This Way Comes! (Was: Needs Review)

507 views
Skip to first unread message

Vinnie Falco via Boost

unread,
Oct 12, 2021, 10:55:15 AM10/12/21
to boost@lists.boost.org List, Vinnie Falco
Folx:

Well, I was on a pause but I picked Boost.URL back up and whipped it
into shape. The good news, this Belle is ready for the Ball ! I invite
you to kick the tires and light the fires on the next major component
in my line of network and protocol libraries:

Repository
<https://github.com/CPPAlliance/url/>

Documentation
<https://master.url.cpp.al/>

Docs are still being worked on, of course (docs are never "done").
`url_view`, `url`, `static_url`, and `authority_view` should be
completely solid and ready to get beat up. And I'm ready for all that
lovely negative feedback !!!

If a kind soul would like to volunteer to be the review manager, that
would be swell.

I'm very happy to answer any questions about the library, and happy to
get feedback regarding the design and implementation.

--
Regards,
Vinnie

Follow me on GitHub: https://github.com/vinniefalco

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Ivan Matek via Boost

unread,
Oct 12, 2021, 1:17:14 PM10/12/21
to boost, Ivan Matek
On Tue, Oct 12, 2021 at 4:55 PM Vinnie Falco via Boost <
bo...@lists.boost.org> wrote:

> And I'm ready for all that
> lovely negative feedback !!!
>
>
> Beginner question:
I see examples work with std::cout, would it work (in C++20 compilers) with
fmt::format?

Vinnie Falco via Boost

unread,
Oct 12, 2021, 1:27:23 PM10/12/21
to Ivan Matek, Vinnie Falco, boost
On Tue, Oct 12, 2021 at 10:17 AM Ivan Matek <libb...@gmail.com> wrote:
> I see examples work with std::cout, would it work (in C++20 compilers) with fmt::format?

Good question! I'm not sure... the library provides operator<< for all
important types. If fmt::format uses operator<< internally, then I
suppose it should work. But if it doesn't work, and there's something
special that Boost.URL needs to do to make it work, there's no problem
adding it - conditionally compiled when C++20 is available.

Thanks

Peter Dimov via Boost

unread,
Oct 12, 2021, 1:33:24 PM10/12/21
to bo...@lists.boost.org, Ivan Matek, Peter Dimov
Vinnie Falco wrote:
> On Tue, Oct 12, 2021 at 10:17 AM Ivan Matek <libb...@gmail.com> wrote:
> > I see examples work with std::cout, would it work (in C++20 compilers) with
> fmt::format?
>
> Good question! I'm not sure... the library provides operator<< for all
> important types. If fmt::format uses operator<< internally, then I suppose it
> should work.

No, it doesn't. std::format has its own customization mechanism.

Vinnie Falco via Boost

unread,
Oct 12, 2021, 2:17:50 PM10/12/21
to Alex Christensen, Vinnie Falco, boost@lists.boost.org List
On Tue, Oct 12, 2021 at 10:57 AM Alex Christensen
<achris...@apple.com> wrote:
> Why did you use RFC 3986 as your specification?

Well, this one seems to be the latest RFC regarding URLs, modulo a bit
of HTTP-specific stuff like authority-form appearing in rfc7230. Is
there something newer?

> How do you feel about the WhatWG URL specification at https://url.spec.whatwg.org?

Quite frankly, I hate it. This "specification" manages to organize and
present the information in the most obtuse manner possible. I find it
hostile to implementers like myself.

> It has a large body of tests ...that you may consider looking at.

Yep, it does! Integrating them is on my to-do- list:

<https://github.com/CPPAlliance/url/tree/c6c4b433c3b1057161b6ce50bb4fba0b5f59b4ee/test/wpt>

> Do you have any general plan for a strategy for handling non-ASCII input?

Yes, the plan is to reject such input. Strings containing non-ASCII
characters are not valid URLs. And even some ASCII characters are not
allowed to appear in a URL, for example all control characters.

> I haven’t tested it yet, but what do you plan to do if someone passes a UTF-8
> encoded non-ASCII string...

I think what you're asking is, what if someone supplies a URL which
has escaped characters which, when percent-decoding is applied, become
valid UTF-8 code point sequences? That's perfectly fine.
Percent-encoded URL parts are in fact "ASCII strings."

> ...into the constructor?
You can't construct a URL from a string, you have to go through one of
the parsing functions. This is because the library recognizes several
variations of URL grammar, and does not favor any particular grammar
by choosing one to support construction. See Table 1.1:

<https://master.url.cpp.al/url/parsing.html>


Thanks

Vinnie Falco via Boost

unread,
Oct 12, 2021, 2:20:06 PM10/12/21
to Peter Dimov, Vinnie Falco, boost@lists.boost.org List
On Tue, Oct 12, 2021 at 10:32 AM Peter Dimov <pdi...@gmail.com> wrote:
> No, it doesn't. std::format has its own customization mechanism.

The issue may be tracked here:

<https://github.com/CPPAlliance/url/issues/62>

Thanks

Alex Christensen via Boost

unread,
Oct 12, 2021, 3:27:16 PM10/12/21
to bo...@lists.boost.org, Alex Christensen
Why did you use RFC 3986 as your specification? How do you feel about the WhatWG URL specification at https://url.spec.whatwg.org <https://url.spec.whatwg.org/>? It has a large body of tests at https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json <https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json> that you may consider looking at.

Do you have any general plan for a strategy for handling non-ASCII input? I haven’t tested it yet, but what do you plan to do if someone passes a UTF-8 encoded non-ASCII string into the constructor?

Alex Christensen via Boost

unread,
Oct 12, 2021, 4:02:12 PM10/12/21
to Vinnie Falco, Alex Christensen, boost@lists.boost.org List
> On Oct 12, 2021, at 12:17 PM, Vinnie Falco <vinnie...@gmail.com> wrote:
>
> On Tue, Oct 12, 2021 at 10:57 AM Alex Christensen
> <achris...@apple.com> wrote:
>> Why did you use RFC 3986 as your specification?
>
> Well, this one seems to be the latest RFC regarding URLs, modulo a bit
> of HTTP-specific stuff like authority-form appearing in rfc7230. Is
> there something newer?
I would say that the WhatWG URL specification is that something newer, but I sympathize. It is difficult to get started with.

>> Do you have any general plan for a strategy for handling non-ASCII input?
>
> Yes, the plan is to reject such input. Strings containing non-ASCII
> characters are not valid URLs. And even some ASCII characters are not
> allowed to appear in a URL, for example all control characters.

That is certainly a choice you can make, but at some point you may run into issues with people trying to give your library input like http://example.com/💩 <http://example.com/%F0%9F%92%A9> and expecting the URL parser to normalize it to http://example.com/%F0%9F%92%A9 <http://example.com/%F0%9F%92%A9> for you like it does in some other URL libraries. I see Punycode encoding and decoding doesn’t seem to be in the scope of this library, and for your use cases that may be fine and for others that might not be fine. It seems like you’re aware of this design choice, though.

Peter Dimov via Boost

unread,
Oct 12, 2021, 4:49:33 PM10/12/21
to bo...@lists.boost.org, Vinnie Falco, Peter Dimov
Alex Christensen wrote:
> That is certainly a choice you can make, but at some point you may run into
> issues with people trying to give your library input like
> http://example.com/💩 <http://example.com/%F0%9F%92%A9> and
> expecting the URL parser to normalize it to
> http://example.com/%F0%9F%92%A9
> <http://example.com/%F0%9F%92%A9> for you like it does in some other
> URL libraries.

Assuming this is supported, it raises the question of what the parser would
be expected to do with http://example.com/💩/%F0%9F%92%A9.

Should it encode the percents, under the assumption that they are literal
because everything is non-encoded? Or should it leave the percents as-is
and only encode the emoji?

If the use case is "the user typed or pasted something into the address bar",
I suppose a best-effort DWIM (e.g. option 2) makes more sense. But in
other scenarios, option 1 might.

Vinnie Falco via Boost

unread,
Oct 12, 2021, 5:01:53 PM10/12/21
to Alex Christensen, Vinnie Falco, boost@lists.boost.org List
On Tue, Oct 12, 2021 at 1:02 PM Alex Christensen <achris...@apple.com> wrote:
> I would say that the WhatWG URL specification is that something
> newer, but I sympathize. It is difficult to get started with.

I plan to pick through the WhatWG specification and see if there are
any tidbits that could have value. The procedural exposition (append
this character, execute this algorithm, output this string) makes it
very difficult to grasp the higher level semantics of the thing. The
BNF in the RFC is way easier to grok, e,g:

scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

and in fact this is exactly how I have decomposed the parsing, into
each individual named element described by the RFC.

> That is certainly a choice you can make, but at some point you
> may run into issues with people trying to give your library input
> like http://example.com/💩

I'm not sure how someone would give the library that input. Is this
expressible in a string_view?

> I see Punycode encoding and decoding doesn’t seem to be in the scope of this library,
> and for your use cases that may be fine and for others that might not be fine.

I actually have all the punycode algorithms ready including tests:

<https://github.com/CPPAlliance/url/tree/punycode>

But after giving it some thought, I couldn't see the use-case for it.
Callers who want to perform name resolution on a host can't pass a
utf-8 string they have to pass the punycode-encoded string. The only
use-case that I can discern for punycode is for display purposes which
is out of scope. Unless, do you know of any other use-case?

> It seems like you’re aware of this design choice, though.

Yes

Actually, come to think of it - there is a use-case for punycode
encoding and that is to take an international domain name string in
utf-8 encoding and apply puny-code encoding. I think... I have no
experience with this so some guidance would be helpful.

--
Regards,
Vinnie

Follow me on GitHub: https://github.com/vinniefalco

_______________________________________________

Vinnie Falco via Boost

unread,
Oct 12, 2021, 5:15:53 PM10/12/21
to Alex Christensen, Vinnie Falco, boost@lists.boost.org List
On Tue, Oct 12, 2021 at 1:02 PM Alex Christensen <achris...@apple.com> wrote:
> at some point you may run into issues with people trying to give
> your library input like http://example.com/💩 and expecting the

> URL parser to normalize it to http://example.com/%F0%9F%92%A9

Okay, I think what you're saying is that you will have this string literal:

string_view s = "http://example.com/\xf0\x9f\x92\xa9";

Unfortunately, this is not a valid URL and I don't think that the
library should accept this input. However, you could write this:

url u = parse_uri( "http://example.com" ).value();

u.set_path( "/\xf0\x9f\x92\xa9" );

This will produce:

assert( u.encoded_url() == "http://example.com/%f0%9f%92%a9" );

Is this what you meant? I'm guessing that the turd emoji is inserted
into the C++ source file as a utf-8 encoded code point, so that's what
you get in the string literal.

Thanks

Gavin Lambert via Boost

unread,
Oct 12, 2021, 6:28:16 PM10/12/21
to bo...@lists.boost.org, Gavin Lambert
On 13/10/2021 10:15, Vinnie Falco wrote:
> Okay, I think what you're saying is that you will have this string literal:
>
> string_view s = "http://example.com/\xf0\x9f\x92\xa9";
>
> Unfortunately, this is not a valid URL and I don't think that the
> library should accept this input. However, you could write this:
>
> url u = parse_uri( "http://example.com" ).value();
>
> u.set_path( "/\xf0\x9f\x92\xa9" );

Why would set_path accept encoding that parse_uri does not? That sounds
like a red flag.

I note in the docs that there is also a set_encoded_path, which implies
that set_path might be "unencoded", which might explain it -- but
set_path has no documentation, so this is unclear.

On an unrelated note, set_query_part seems wrong, it should have a
leading question mark but the BNF specifies a leading hash. Also it
claims its input is an encoded string, which seems inconsistently named
with set_encoded_query (which in turn mentions fragments in its docs,
which also seems wrong).

Vinnie Falco via Boost

unread,
Oct 12, 2021, 6:39:10 PM10/12/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Tue, Oct 12, 2021 at 3:28 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> I note in the docs that there is also a set_encoded_path, which implies
> that set_path might be "unencoded", which might explain it -- but
> set_path has no documentation, so this is unclear.

The docs could use work as usual. Yes, there are two functions

set_encoded_path
set_path

For almost all functions that set a string, there is the "encoded"
version which requires a valid percent-encoded string, and the plain
version which accepts any string and then escapes the reserved
characters as needed.

> On an unrelated note, set_query_part seems wrong, it should have a
> leading question mark but the BNF specifies a leading hash. Also it
> claims its input is an encoded string, which seems inconsistently named
> with set_encoded_query (which in turn mentions fragments in its docs,
> which also seems wrong).

Little bit of under-construction going on there, it will be resolved
today, thanks for looking!

Regards

Vinnie Falco via Boost

unread,
Oct 12, 2021, 9:06:55 PM10/12/21
to Alex Christensen, Vinnie Falco, boost@lists.boost.org List
On Tue, Oct 12, 2021 at 2:52 PM Alex Christensen <achris...@apple.com> wrote:
> It is perfectly valid input that some URL libraries I work with accept and
> percent encode, and some URL libraries I work with reject it as an invalid
> URL. I think it’s a valid URL parser input that ought to produce a valid URL,
> but not everyone agrees on this yet.

Not so fast, I think that this can be decided objectively.

A URL in the context of Boost.URL refers to "URI" in the rfc3986
sense. I use URL because most people never heard of URI.

What you are thinking of as a "valid URL parser input" is actually an
Internationalized Resource Identifier, which supports the broader
universal character set instead of just ASCII and is abbreviated by
the even more obscure acronym "IRI." It is covered by rfc3987:

<https://datatracker.ietf.org/doc/html/rfc3987>

Translating your comment, I think you're saying "Boost.URL should
support Internationalized Resource Identifiers." That is unfortunately
out of scope for the library, as Boost.URL is mostly designed for the
exchange of URLs between machines or programs and not necessarily for
display to users. Perhaps someday, the entire world will have switched
to IRIs (maybe after IPv4 is no longer in use) but we are not there
yet, and most systems require IRIs to be mapped to their URI
equivalent:

<https://datatracker.ietf.org/doc/html/rfc3987#section-3>

There is some value to IRIs but not as much as there is for the ASCII
URLs, which fill a tremendous user need (HTTP/WebSocket clients and
servers using Beast or Asio).

Alex Christensen via Boost

unread,
Oct 12, 2021, 9:48:35 PM10/12/21
to Vinnie Falco, Alex Christensen, boost@lists.boost.org List

> On Oct 12, 2021, at 3:15 PM, Vinnie Falco <vinnie...@gmail.com> wrote:
>
> On Tue, Oct 12, 2021 at 1:02 PM Alex Christensen <achris...@apple.com> wrote:
>> at some point you may run into issues with people trying to give
>> your library input like http://example.com/💩 and expecting the
>> URL parser to normalize it to http://example.com/%F0%9F%92%A9
>
> Okay, I think what you're saying is that you will have this string literal:
>
> string_view s = "http://example.com/\xf0\x9f\x92\xa9";
>
> Unfortunately, this is not a valid URL and I don't think that the
> library should accept this input.

It is perfectly valid input that some URL libraries I work with accept and percent encode, and some URL libraries I work with reject it as an invalid URL. I think it’s a valid URL parser input that ought to produce a valid URL, but not everyone agrees on this yet.

> However, you could write this:


>
> url u = parse_uri( "http://example.com" ).value();
>
> u.set_path( "/\xf0\x9f\x92\xa9" );
>
> This will produce:
>
> assert( u.encoded_url() == "http://example.com/%f0%9f%92%a9" );
>
> Is this what you meant?

Welcome to the crazy world of URL parsing!

Vinnie Falco via Boost

unread,
Oct 12, 2021, 9:49:24 PM10/12/21
to Alex Christensen, Vinnie Falco, boost@lists.boost.org List
On Tue, Oct 12, 2021 at 6:06 PM Vinnie Falco <vinnie...@gmail.com> wrote:
> What you are thinking of as a "valid URL parser input" is actually an
> Internationalized Resource Identifier, which supports the broader
> universal character set instead of just ASCII and is abbreviated by
> the even more obscure acronym "IRI." It is covered by rfc3987:
>
> <https://datatracker.ietf.org/doc/html/rfc3987>

We looked over this RFC and I think, it would be possible to support
IRIs simply by providing a new set of parsing functions, for example

void parse_iri ( string_view, error_code&, url& );
void parse_irelative_ref ( string_view, error_code&, url& );
void parse_absolute_iri ( string_view, error_code&, url& );
void parse_iri_reference ( string_view, error_code&, url& );

It wouldn't be possible to parse into a url_view, since UTF-8 encoded
characters have to be converted to percent-encoded escapes. But this
could be made to work, and it fits neatly into the current
implementation. There would be an additional function to take a url
and convert it back into its IRI string, which is mostly just decoding
percent-escaped characters. The library would disallow invalid UTF-8.

Julien Blanc via Boost

unread,
Oct 13, 2021, 2:24:25 AM10/13/21
to bo...@lists.boost.org, Julien Blanc, Gavin Lambert
Le mercredi 13 octobre 2021 à 11:27 +1300, Gavin Lambert via Boost a
écrit :

> On 13/10/2021 10:15, Vinnie Falco wrote:
> > Okay, I think what you're saying is that you will have this string
> > literal:
> >
> >      string_view s = "http://example.com/\xf0\x9f\x92\xa9";
> >
> > Unfortunately, this is not a valid URL and I don't think that the
> > library should accept this input. However, you could write this:
> >
> >      url u = parse_uri( "http://example.com" ).value();
> >
> >      u.set_path( "/\xf0\x9f\x92\xa9" );
>
> Why would set_path accept encoding that parse_uri does not?  That
> sounds like a red flag.

I would not say so. I see 2 different use cases :
* parse an uri : the uri must be properly encoded
* programatically build an uri: the different components shall not be
encoded.

-> u.set_path("%f0%9f%92%a9") will then produce
"http://example.com/%25f0%259f%2592%25a9".

However, that seems inconsistent with the way set_host works (the docs
says it needs to be encoded), so i tend to agree with the red flag here
(or maybe they're just documentation issues, since there is also
set_encoded_host). And then there's the issue of the '/' character in
set_path if you take unencoded strings (not sure how this should be
handled...)

Regards,

Julien

Gavin Lambert via Boost

unread,
Oct 13, 2021, 2:46:30 AM10/13/21
to bo...@lists.boost.org, Gavin Lambert
On 13/10/2021 18:56, Julien Blanc wrote:
> And then there's the issue of the '/' character in
> set_path if you take unencoded strings (not sure how this should be
> handled...)

Exactly, that's a fundamental problem with the concept of accepting
not-yet-encoded strings -- the URL specification does not have uniform
encoding; it is not possible to take an arbitrary URL and round-trip it
between encoded and decoded forms as a whole (even ignoring equivalent
variants like %20 vs + or %-encoding more characters than strictly
required).

Where a / occurs inside a single path-component, it must be escaped so
as to not be seen as a path-component-separator. And as such, it's not
possible to (reliably) pass multiple unencoded path-components as a
single string.

The same problems occur with query parameters and &= characters, or with
?# characters appearing anywhere.

(Where the authority component is not a DNS hostname there may be issues
with :@/ characters appearing there too.)


I think the only sane choice here is (if supported at all) to only
provide access to unencoded values at the smallest possible unit (i.e.
only for a single parameter or a single path component, or a collection
of such, but never as a single string). (But to still provide easy
access to the encoder/decoder for arbitrary external usage.)

Phil Endecott via Boost

unread,
Oct 13, 2021, 10:10:11 AM10/13/21
to bo...@lists.boost.org, Phil Endecott
Hi Vinnie,

Vinnie Falco wrote:
> Well, I was on a pause but I picked Boost.URL back up and whipped it
> into shape.

Thanks for posting that. Assorted comments follow.

Back in 2005-ish I decided to write a URL parser based on the RFC BNF
using Boost.Spirit. At that time the most recent RFCs were RFC2616
(1999) and RFC2396 (1998). I did a fairly direct translation of the
BNF to Spirit and ... it didn't work. It turned out that the BNF in
the RFCs was wrong, and had been for 6 years at that point. So I
advise that you are careful about directly using the RFC BNF.

Have you looked at the RFC 3986 errata at
https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0
? For example, I note that your rule for relative-part on the
"Parsing" page doesn't seem to include the path-abempty alternative
added in erratum 5428.

I would be interested to see how compile time, run time, and
lines-of-code compare between your implementation, a Spirit parser,
and a regexp. You also have the choice of (a) parsing and storing
pointers to each component as I believe you do, or (b) parsing to
validate but storing only the complete string, and doing a partial
parse when necessary to extract components. (b) could be better if
you're mostly just passing around complete URLs, storing them in
containers, etc.

There's also the question of worst-case complexity; does your parser
backtrack? If so, is there a pathological input that causes it to
take exponential time? Can you add a complexity guarantee to the
docs? (I posted about this in the context of regular expression
libraries a few weeks ago.)

This worries me:

auto url = parse_url(str);

The problem is that url is a view of the string parameter, but
nothing warns the user about that and...

str = something_else;

..now url is dangling. Am I right, or is there something that
prevents this?

I believe that functions that returns views should be named so
that that is obvious, and in cases like this the function with the
simplest name should be the safer version that returns a value-type.

(There was a similar discussion for fixed_string, which originally
proposed to return a view from its substr() method.)

Also I note that parse_uri() returns a result<> and doesn't throw.
My preference would be to name that e.g. try_parse_uri() and for the
simpler-named function to be a wrapper that returns a plain uri
or throws on error. My rationale is to make the simplest-named
functions provide the easiest-to-use functionality, so that
*teaching C++ to a Javascript programmer is not so embarrassing*,
while retaining the more performant API variants for those who
need them.

(Is it parse_uri() or parse_url()? I thought you had decided on
URL.)

"pct" doesn't immediately mean "percent" to me. I'd prefer
percent_encode() to pct_encode(). If that's too long for you,
truncate "encode" to "enc". Or maybe I would call this "escaping"
rather than "encoding". Actually I think the RFC terminology
might be "URL encoding"; I don't know if I prefer that but if
you want "strict adherence to the RFCs"...

It doesn't look like your url type is comparable. I think it
should be. The comparisons need to be careful about case-sensitivity.
And it should be hashable.

I recently had to implement request signing for AWS. This requires
canonicalisation of URLs. There are some subtleties in percent
encoding: you need to be able to control exactly which characters
are escaped (I think you allow this); you need to be able to
control whether + is used to escape space (do you?) and you need
to be able to control whether the hex characters are upper or
lower case (do you?).

This code also needed to sort the query parameters by key; that
raises the question of whether the query is a sequence or an
associative container of key-value pairs, and if an associative
container, what the comparison function is.

Actually you might like to have a look at the AWS SDK URI class:
http://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_http_1_1_u_r_i.html
One thing that I believe it does is to set port to a default
based on the scheme, if it's not specified in the url string.
I'd say that's a helpful thing to do. Are there any other URI
classes that you could compare/contrast with?

I would hope that the path would be similar to a std::filesystem::path
i.e. it should use the same vocabulary where possible (or could
even actually be a std::filesystem::path, on C++17).

Conversion between std::filesystem::path and file: URLs would be
nice to have.

In your table 1.7, "Return the type of host specified, if any."
should I think say "Return the host, if any."

Your percent-decoding methods return strings. I'm a bit surprised
that you've not tried to return lazy decoding views, at least as an
option.

Anyway that's enough ramblings for now....


Regards, Phil.

Vinnie Falco via Boost

unread,
Oct 13, 2021, 10:55:23 AM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Tue, Oct 12, 2021 at 11:24 PM Julien Blanc via Boost
<bo...@lists.boost.org> wrote:
> However, that seems inconsistent with the way set_host works (the docs
> says it needs to be encoded), so i tend to agree with the red flag here
> (or maybe they're just documentation issues, since there is also
> set_encoded_host).

Probably doc issue, there are two functions:

url::set_host( string_view )
url::set_encoded_host( string_view )

As with all the APIs, functions with "set_encoded" in the name, accept
a percent-encoded string and throw an exception on invalid input.
"set_host" will apply percent-encoding to the input as needed.

> And then there's the issue of the '/' character in
> set_path if you take unencoded strings (not sure how this should be
> handled...)

Well, there's not much of an issue there. set_path() treats slashes as
segment separators. If that's not what you want, you can set the
individual unencoded segments through the container returned by
url::segments(). The same goes for the query parameters (call
url::params()).

Thanks

Vinnie Falco via Boost

unread,
Oct 13, 2021, 11:14:44 AM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Tue, Oct 12, 2021 at 11:46 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> it is not possible to take an arbitrary URL and round-trip it
> between encoded and decoded forms as a whole (even ignoring equivalent
> variants like %20 vs + or %-encoding more characters than strictly
> required).

That's right.

> Where a / occurs inside a single path-component, it must be escaped so
> as to not be seen as a path-component-separator. And as such, it's not
> possible to (reliably) pass multiple unencoded path-components as a
> single string.

Not exactly. If you call set_path() it treats slash as a separator,
since it works with the entire path portion of the URL. If you call
segments().push_back( "my/slash" ) then you get the slash
percent-encoded. The library APIs are biased towards interpreting the
path hierarchically but you can still treat the path as a monolithic
string using set_path and set_encoded_path.

> The same problems occur with query parameters and &= characters, or with
> ?# characters appearing anywhere.

There isn't any actual ambiguity here. set_query() treats '&' and '='
as separators. If that is not your intent you can use
params().insert() or params().push_back() which will percent-encode
these symbols. If you call set_query() with a literal string you will
get that literal string as a percent-encoded query in the final URL.
If you then call query() you will get back the original unencoded
string. Modulo the wrinkle where "+" becomes a space on decoding, but
even that can be controlled by the user:

<https://master.url.cpp.al/url/ref/boost__urls__pct_decode_opts.html>

> (Where the authority component is not a DNS hostname there may be issues
> with :@/ characters appearing there too.)

Well, there is no set_authority() function yet, only
set_encoded_authority(). I haven't looked closely at it but it would
similarly to the others. However, I'm not convinced it is necessary.
If you call set_host() you can put any string you want in there and it
will be correctly percent-encoded. If there is no user, password, or
port, then that percent-encoded string is effectively the entire
"authority" - so there is already a way to set the authority to an
arbitrary string.

Thanks

Vinnie Falco via Boost

unread,
Oct 13, 2021, 12:03:43 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Phil Endecott
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
<bo...@lists.boost.org> wrote:
> Have you looked at the RFC 3986 errata at
> https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0

ruh roh.. I have not gotten to that yet but yeah. I see problems :)
Not difficult problems, but there are relevant errata that must be
fixed before any review. Thanks!

> ? For example, I note that your rule for relative-part on the
> "Parsing" page doesn't seem to include the path-abempty alternative
> added in erratum 5428.

Thanks I need to update the javadocs and exposition. Although, I think
the parsing is still correct despite not showing path-abempty. I will
add more tests.

> I would be interested to see how compile time, run time, and
> lines-of-code compare between your implementation, a Spirit parser,
> and a regexp.

Someone volunteered to do that so I might explore it at one point.

> You also have the choice of (a) parsing and storing
> pointers to each component as I believe you do,

Right, the library keeps a small table of offsets to various parts.
For the modifiable containers I plan to also keep a table for the
segments and query parameters, to allow O(1) access of indexed
elements (right now its linear). I think this is a good balance for
accessing parts of the URL and helping speed up mutation operations.

> (b) could be better if
> you're mostly just passing around complete URLs, storing them in
> containers, etc.

Well, in url the table adds overhead. If you don't want that overhead
then... just store the URL string :) and parse it again when you need
it, this is equivalent to your (b), and less impact on the library.

> There's also the question of worst-case complexity; does your parser
> backtrack? If so, is there a pathological input that causes it to
> take exponential time?

I have done no profiling or optimization to speak of except some minor
design work to facilitate the use of character-based SIMD algorithms
later. I don't have a monolithic "parser" per-se, there are classes
called "bnf elements" which can be used in the generic bnf::parse
function. These elements call parse on child elements recursively. I
do backtrack. For example, URI-reference is defined as:

URI-reference = URI / relative-ref

The class uri_reference_bnf implements this by first attempting to
parse the scheme and colon, and if that fails then it backtracks and
attempts parsing a "relative-ref." You can see this here:

https://github.com/CPPAlliance/url/blob/680f6db547ffc3367e1ba93bae29e4c9278fd0e7/include/boost/url/rfc/impl/uri_reference_bnf.ipp#L35

I'm not really sure how this can be improved, you need to know if
there's a scheme and that can only be determined by scanning forward
until reaching a colon or an invalid character for scheme. In theory
this could scan the entire input. Thus I have not put any effort into
making this better.

> Can you add a complexity guarantee to the
> docs? (I posted about this in the context of regular expression
> libraries a few weeks ago.)

Maybe... I did it for JSON because performance is critical and the
difference between a poorly performing library and a greatly
performing library is massive. I'm not sure that URL parsing is the
same. URLs tend to be short and it is sort of difficult to really get
bad performance out of parsing them even if you go a character at a
time. Unlike JSON you don't need to convert between base 2 and base 10
for numbers (which is a headache). Mutations always give the strong
exception safety guarantee and it also guarantees only one memmove, I
will probably document that.

I can give you a better answer in the future when it is at the point
where benchmarks are explored.

> auto url = parse_url(str);
>
> The problem is that url is a view of the string parameter, but
> nothing warns the user about that and...

Well, it is documented profusely. I agree there is some danger here
but with great danger comes great power. Or something about great
responsibility?

> ..now url is dangling. Am I right, or is there something that
> prevents this?

No you are right, in the statement below `uv` references the string
and does not take ownership:

url_view uv = parse_uri( "https://example.com/" ).value();

However, url is constructible from url_view and url DOES make a copy
with ownership, so this statement also works and doesn't have the
problem of dangling:

url u = parse_uri( "https://example.com/" ).value();

> I believe that functions that returns views should be named so
> that that is obvious, and in cases like this the function with the
> simplest name should be the safer version that returns a value-type.

These algorithms only return views, because there are a few different
owning URL containers and they can all be constructed from the views.
Overloading on return type doesn't really work and some containers may
have template parameters or allocators which makes APIs to parse into
them awkward. If we're debating the name though, it means we have
accepted the general design of the library :)

> Also I note that parse_uri() returns a result<> and doesn't throw.
> My preference would be to name that e.g. try_parse_uri() and for the
> simpler-named function to be a wrapper that returns a plain uri
> or throws on error. My rationale is to make the simplest-named
> functions provide the easiest-to-use functionality, so that
> *teaching C++ to a Javascript programmer is not so embarrassing*,
> while retaining the more performant API variants for those who
> need them.

Well the whole point of boost::system::result (the new type that
Boost.URL uses) is to cut down on those extra names and overloads.
Your suggestion would keep the additional complexity of result and
also increase the number of overloads which isn't sounding like a
clear cut win to me... although, this "result" return type is sort of
new.

> (Is it parse_uri() or parse_url()? I thought you had decided on
> URL.)

I use the term URL generally, but I use the precise wording when
referring to grammar. "parse_uri" uses the URI grammar specified in
rf3986, modulo any errata fixes which I have yet to apply.

> "pct" doesn't immediately mean "percent" to me.

The grammar says "pct-encode", I followed that. I think there is
sufficient use of the abbreviation that it is justified but I could be
convinced otherwise if there is new data.

> It doesn't look like your url type is comparable. I think it
> should be. The comparisons need to be careful about case-sensitivity.>

> And it should be hashable.

You can track those issues here:

<https://github.com/CPPAlliance/url/issues/64>
<https://github.com/CPPAlliance/url/issues/65>

> I recently had to implement request signing for AWS. This requires
> canonicalisation of URLs. There are some subtleties in percent
> encoding: you need to be able to control exactly which characters
> are escaped (I think you allow this); you need to be able to
> control whether + is used to escape space (do you?)

Yes and yes.

> and you need
> to be able to control whether the hex characters are upper or
> lower case (do you?).

But no to this... we can add it though, there's a "pct_encode_opts" structure:

<https://master.url.cpp.al/url/ref/boost__urls__pct_encode_opts.html>

> This code also needed to sort the query parameters by key; that
> raises the question of whether the query is a sequence or an
> associative container of key-value pairs, and if an associative
> container, what the comparison function is.

The library treats the query parameters as a ForwardRange (url_view)
or RandomAccessRange (url) of key/value pairs, allowing for duplicate
keys. The functions find/find_next can be used to visit all values
with the same key. If you are using the "encoded" params container (by
calling encoded_params()) then comparisons are bitwise. If you are
using the "plain" params container (by calling params()) which decodes
everything for you, then the comparison is made using notional
percent-decoding. That is, your decoded key is compared against each
container key as-if it had percent-decoding applied.

> I would hope that the path would be similar to a std::filesystem::path
> i.e. it should use the same vocabulary where possible (or could
> even actually be a std::filesystem::path, on C++17).

OOF I hope never to see this. std::filesystem is kind of a mess. These
are the path containers:

https://master.url.cpp.al/url/ref/boost__urls__segments.html
https://master.url.cpp.al/url/ref/boost__urls__segments_encoded.html
https://master.url.cpp.al/url/ref/boost__urls__segments_view.html
https://master.url.cpp.al/url/ref/boost__urls__segments_encoded_view.html

"view" containers are read-only and reference the character buffer
without ownership, these containers are obtained by calling functions
on url_view. The other containers allow modification (the iterators
are still read-only, because a proxy-assignment implementation proved
to be clumsy) and these containers are obtained by calling functions
on url.

> Conversion between std::filesystem::path and file: URLs would be
> nice to have.

Yeah something like that is possible, falling back to
boost::filesystem on C++11.

> In your table 1.7, "Return the type of host specified, if any."
> should I think say "Return the host, if any."

hmm, yeah - the function "host_type()" is missing from that list (it
returns the type of host).

> Your percent-decoding methods return strings. I'm a bit surprised
> that you've not tried to return lazy decoding views, at least as an
> option.

Now that is a very interesting idea! I didn't think of it. But, its
hard to see the use case, what would you do with a lazy view other
than unroll it into a linear buffer?

Thanks

Gavin Lambert via Boost

unread,
Oct 13, 2021, 6:52:22 PM10/13/21
to bo...@lists.boost.org, Gavin Lambert
On 14/10/2021 05:02, Vinnie Falco via Boost wrote:
> On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott wrote:
>> Have you looked at the RFC 3986 errata at
>> https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0
>
> ruh roh.. I have not gotten to that yet but yeah. I see problems :)
> Not difficult problems, but there are relevant errata that must be
> fixed before any review. Thanks!

Bear in mind that most of those are Reported and not Verified, so they
should be treated as non-authoritative comments.

For example, the suggested amendment for parsing too many .. path
components seems incorrect and the original is more correct. (Unmatched
.. components should not be left in a canonical URL.)

Vinnie Falco via Boost

unread,
Oct 13, 2021, 6:57:39 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Phil Endecott
On Wed, Oct 13, 2021 at 7:10 AM Phil Endecott via Boost
<bo...@lists.boost.org> wrote:
> Have you looked at the RFC 3986 errata at
> https://www.rfc-editor.org/errata_search.php?rfc=3986&rec_status=0?

Okay, this is taken care of now!!!

Thanks

Vinnie Falco via Boost

unread,
Oct 13, 2021, 7:00:06 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Wed, Oct 13, 2021 at 3:52 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> Bear in mind that most of those are Reported and not Verified, so they
> should be treated as non-authoritative comments.

Never a dull moment!!

> For example, the suggested amendment for parsing too many .. path
> components seems incorrect and the original is more correct. (Unmatched
> .. components should not be left in a canonical URL.)

I'm still working on that part. But I think that the reg-name change is legit?

The addition of path-abempty doesn't actually change anything it
seems, but I have updated the documentation.

Thanks

Gavin Lambert via Boost

unread,
Oct 13, 2021, 7:09:06 PM10/13/21
to bo...@lists.boost.org, Gavin Lambert
On 14/10/2021 11:59, Vinnie Falco wrote:
> On Wed, Oct 13, 2021 at 3:52 PM Gavin Lambert wrote:
>> Bear in mind that most of those are Reported and not Verified, so they
>> should be treated as non-authoritative comments.
>
> Never a dull moment!!
>
>> For example, the suggested amendment for parsing too many .. path
>> components seems incorrect and the original is more correct. (Unmatched
>> .. components should not be left in a canonical URL.)
>
> I'm still working on that part. But I think that the reg-name change is legit?
>
> The addition of path-abempty doesn't actually change anything it
> seems, but I have updated the documentation.

I have no particular opinion on path-abempty; that one might be ok.

But the suggested change to .. parsing is actively harmful and you
should not do it -- a canonical path should never contain .. components,
especially not leading ones.

Vinnie Falco via Boost

unread,
Oct 13, 2021, 7:53:34 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Wed, Oct 13, 2021 at 4:09 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> ...

Given the following path:

/path/to/file.txt

What should iterating the segments return?

{ "path", "to", "file.txt" }
or
{ "/path", "/to", "/file.txt" }

or something else?

Given this path:

my/download/folder/

What should iterating the segments return?

{ "my", "download", "folder", "" }
or
{ "my/", "download/", "folder/" }
or
{ "my", "/download", "/folder", "/" }

or something else?

Given an empty relative-ref (empty string) in a url u, what is the
encoded path after executing this statement?

u.segments().push_back( "index.htm" );

Anyone feel free to answer!

Thanks

Andrey Semashev via Boost

unread,
Oct 13, 2021, 8:26:17 PM10/13/21
to bo...@lists.boost.org, Andrey Semashev
On 10/14/21 2:52 AM, Vinnie Falco via Boost wrote:
> On Wed, Oct 13, 2021 at 4:09 PM Gavin Lambert via Boost
> <bo...@lists.boost.org> wrote:
>> ...
>
> Given the following path:
>
> /path/to/file.txt
>
> What should iterating the segments return?
>
> { "path", "to", "file.txt" }
> or
> { "/path", "/to", "/file.txt" }
>
> or something else?
>
> Given this path:
>
> my/download/folder/
>
> What should iterating the segments return?
>
> { "my", "download", "folder", "" }
> or
> { "my/", "download/", "folder/" }
> or
> { "my", "/download", "/folder", "/" }
>
> or something else?
>
> Given an empty relative-ref (empty string) in a url u, what is the
> encoded path after executing this statement?
>
> u.segments().push_back( "index.htm" );
>
> Anyone feel free to answer!

If std::filesystem::path is an example to follow then path elements
should not contain path separators, except the root directory. The
trailing separator is indicated by an empty final element. So:

/my/download/folder/

would correspond to:

{ "/", "my", "download", "folder", "" }

and

/path/to/file.txt

to

{ "/", "path", "to", "file.txt" }

I think, URL paths must always have the root directory as the leading
slash is a necessary separator from the host and port, so you could
probably omit it from the list of elements.

Andrey Semashev via Boost

unread,
Oct 13, 2021, 8:32:34 PM10/13/21
to bo...@lists.boost.org, Andrey Semashev
Two corner cases to consider is when you have no path at all, and when
you have a root directory as a path. Those two cases have to be distinct
in terms of observed sequence of elements. I.e.

https:://example.com?foo=42

should be different from

https:://example.com/?foo=42


in terms of observed elements of the path.

Gavin Lambert via Boost

unread,
Oct 13, 2021, 8:56:18 PM10/13/21
to bo...@lists.boost.org, Gavin Lambert
On 14/10/2021 12:52, Vinnie Falco wrote:
> Given the following path:
>
> /path/to/file.txt
>
> What should iterating the segments return?
>
> { "path", "to", "file.txt" }
> or
> { "/path", "/to", "/file.txt" }
>
> or something else?

It would have to be the former, because supplying the latter as input to
the unencoded variants would presumably try to %-encode the slash (or if
it didn't, you'd have problems dealing with slashes elsewhere in the
string). Also part of the point of splitting paths into subcomponents
is so that you don't have to think about the path separators.

Some URI libraries will also explicitly iterate "" or "/" as the first
component (by itself) to indicate the difference between absolute and
relative URIs, although that could be indicated another way instead.

In which case it might be best to render the absolute path as:

{ "/", "path", "to", "file.txt" }

> Given this path:
>
> my/download/folder/
>
> What should iterating the segments return?
>
> { "my", "download", "folder", "" }
> or
> { "my/", "download/", "folder/" }
> or
> { "my", "/download", "/folder", "/" }
>
> or something else?

Again, the first.

> Given an empty relative-ref (empty string) in a url u, what is the
> encoded path after executing this statement?
>
> u.segments().push_back( "index.htm" );
>
> Anyone feel free to answer!

I would imagine if u.segments() == { "my", "download", "folder", "" }
before the push_back then after == { "my", "download", "folder",
"index.htm" }.

But that would be a special case only when the trailing component is empty.

The app would have to subsequently push_back("") if it was intending to
leave a trailing final slash (or just push_back the next component if it
wasn't finished yet). But either way it's the app's responsibilty to
explicitly indicate whether the URL has a trailing slash or not; they
are semantically different and a library cannot decide that for it.

Basically it ought to behave somewhat consistently with the URI
resolution rules in section 5 of the RFC (except for not replacing a
trailing non-empty component). (On that note, url::resolve is
undocumented and doesn't seem to have the correct signature -- having
both a base and relative parameter only makes sense if it's static or
free, but it appears to be declared as instance. Unless the instance is
only used as a completely-replaced output, but that seems weird and a
free function with return value makes more sense to me.)

Vinnie Falco via Boost

unread,
Oct 13, 2021, 9:10:54 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Wed, Oct 13, 2021 at 5:56 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> (On that note, url::resolve is
> undocumented and doesn't seem to have the correct signature -- having
> both a base and relative parameter only makes sense if it's static or
> free, but it appears to be declared as instance. Unless the instance is
> only used as a completely-replaced output, but that seems weird and a
> free function with return value makes more sense to me.)

Yes `this` is used as a "completely-replace output." A free function
with a return value doesn't work, because there are a few different
flavors of url (url, static_url, and possibly a couple more if users
request them). They differ in how they obtain their storage. But they
use url as a base class. So having `url::resolve` a as member function
makes it all work out.

Vinnie Falco via Boost

unread,
Oct 13, 2021, 9:13:49 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco
On Wed, Oct 13, 2021 at 5:25 PM Andrey Semashev via Boost
<bo...@lists.boost.org> wrote:
> { "/", "my", "download", "folder", "" }
> ...
> { "/", "path", "to", "file.txt" }

These are not really great because then elements are no longer
homogeneous. That is, "/" is allowed in the front element but not the
others. So

u.segments().insert( u.segments().begin(), "/" );

is okay, but these are not:

u.segments().insert( u.segments().begin(), "home" );

u.segments().push_back( "/" );

Thanks

Vinnie Falco via Boost

unread,
Oct 13, 2021, 9:15:31 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco
On Wed, Oct 13, 2021 at 5:32 PM Andrey Semashev via Boost
<bo...@lists.boost.org> wrote:
> https:://example.com?foo=42
>
> should be different from
>
> https:://example.com/?foo=42

Well... in your specific case not, because normalization of the https
scheme inserts "/" when the path is zero length. Although this is
scheme-specific. Generally speaking, there is always a path, it can be
zero length. This is different from the query and fragment - which may
or may not exist. A query can exist and be zero length, and the query
can not exist.

Thanks

Vinnie Falco via Boost

unread,
Oct 13, 2021, 9:19:59 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Wed, Oct 13, 2021 at 5:56 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> part of the point of splitting paths into subcomponents
> is so that you don't have to think about the path separators.

Yeah.

> Some URI libraries will also explicitly iterate "" or "/" as the first
> component (by itself) to indicate the difference between absolute and
> relative URIs, although that could be indicated another way instead.

Currently this is done with

// returns true if the path has a leading slash ('/')
bool segments_view::is_absolute() const noexcept;

> In which case it might be best to render the absolute path as:
>...
> { "/", "path", "to", "file.txt" }

The problem comes with defining the mutation API. How does the user
put a URL into the state above using the segments container interface?

> I would imagine if u.segments() == { "my", "download", "folder", "" }
> before the push_back then after == { "my", "download", "folder",
> "index.htm" }.
>
> But that would be a special case only when the trailing component is empty.

I didn't even think of this, but that's yet another wrinkle.. lol

> The app would have to subsequently push_back("") if it was intending to
> leave a trailing final slash (or just push_back the next component if it
> wasn't finished yet). But either way it's the app's responsibilty to
> explicitly indicate whether the URL has a trailing slash or not; they
> are semantically different and a library cannot decide that for it.

Given the current container-like interface for modifiable encoded segments:

<https://master.url.cpp.al/url/ref/boost__urls__segments_encoded.html>

What might that API look like that allows the caller to indicate where
the slashes are?

Thanks

Vinnie Falco via Boost

unread,
Oct 13, 2021, 9:28:50 PM10/13/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Wed, Oct 13, 2021 at 6:19 PM Vinnie Falco <vinnie...@gmail.com> wrote:
> ...

Thinking out loud here, it seems that the segments container models the path as

vector< string >

When it actually needs to model the container as

struct
{
bool leading_slash;
vector< string >
};

Should I just add this function

void segments::set_absolute( bool path_should_be_absolute );

and keep

bool segments::is_absolute() const noexcept;

?

Gavin Lambert via Boost

unread,
Oct 13, 2021, 9:53:05 PM10/13/21
to bo...@lists.boost.org, Gavin Lambert
On 14/10/2021 14:28, Vinnie Falco wrote:
> Thinking out loud here, it seems that the segments container models the path as
>
> vector< string >
>
> When it actually needs to model the container as
>
> struct
> {
> bool leading_slash;
> vector< string >
> };
>
> Should I just add this function
>
> void segments::set_absolute( bool path_should_be_absolute );
>
> and keep
>
> bool segments::is_absolute() const noexcept;
>
> ?

I guess the answer to that depends on how you want
clear-and-multiple-push_back to behave. Does that preserve or alter
absoluteness of the path?


Most of my experience with manipulating URIs (other than just using
hard-coded strings) is via the .NET Uri class, and FWIW with those:

1. Instances are immutable; the only way to get a new Uri is to parse a
new one from scratch or resolve a relative one from a base. (Both the
base and the 'relative' can actually be either absolute or relative.)

2. Its 'Segments' property does include trailing slashes in the
components -- so { "/", "foo/", "bar/", "index.html" } and is also
read-only. (And also doesn't support relative URIs, though I don't know
why not as they wouldn't be ambiguous.)

I'm not saying these are necessarily good choices for your
implementation as well; just as a point of comparison.

Julien Blanc via Boost

unread,
Oct 14, 2021, 3:35:14 AM10/14/21
to Vinnie Falco, Julien Blanc, boost@lists.boost.org List
Le 2021-10-13 16:54, Vinnie Falco a écrit :
> On Tue, Oct 12, 2021 at 11:24 PM Julien Blanc via Boost

>> And then there's the issue of the '/' character in


>> set_path if you take unencoded strings (not sure how this should be
>> handled...)
>
> Well, there's not much of an issue there. set_path() treats slashes as
> segment separators. If that's not what you want, you can set the
> individual unencoded segments through the container returned by
> url::segments(). The same goes for the query parameters (call
> url::params()).

I'm not at ease with that. If i build a mailto: uri, and call set_path
with a email address containing a / character in it (such as
valid/em...@example.com), it will result in a uri containing two
segments, which is nonsense for a mailto scheme (and obviously not what
i expected).

I understand the api is fitted toward web uris, but i see there a
potential for hard to find bugs due to api misuse. Maybe it's just the
naming that needs to be improved.

Regards,

Julien

Andrey Semashev via Boost

unread,
Oct 14, 2021, 6:47:25 AM10/14/21
to boost@lists.boost.org List, Andrey Semashev
On 10/14/21 4:13 AM, Vinnie Falco wrote:
> On Wed, Oct 13, 2021 at 5:25 PM Andrey Semashev via Boost
> <bo...@lists.boost.org> wrote:
>> { "/", "my", "download", "folder", "" }
>> ...
>> { "/", "path", "to", "file.txt" }
>
> These are not really great because then elements are no longer
> homogeneous. That is, "/" is allowed in the front element but not the
> others. So
>
> u.segments().insert( u.segments().begin(), "/" );
>
> is okay, but these are not:
>
> u.segments().insert( u.segments().begin(), "home" );
>
> u.segments().push_back( "/" );

As I said, you're probably fine omitting the root directory.

Andrey Semashev via Boost

unread,
Oct 14, 2021, 6:51:21 AM10/14/21
to boost@lists.boost.org List, Andrey Semashev
On 10/14/21 4:15 AM, Vinnie Falco wrote:
> On Wed, Oct 13, 2021 at 5:32 PM Andrey Semashev via Boost
> <bo...@lists.boost.org> wrote:
>> https:://example.com?foo=42
>>
>> should be different from
>>
>> https:://example.com/?foo=42
>
> Well... in your specific case not, because normalization of the https
> scheme inserts "/" when the path is zero length. Although this is
> scheme-specific. Generally speaking, there is always a path, it can be
> zero length. This is different from the query and fragment - which may
> or may not exist. A query can exist and be zero length, and the query
> can not exist.

Ok. But do other schemes make the same requirement? And does Boost.URL
support those?

My point is that if there is a case supported by Boost.URL where an
empty path and a path of "/" are considered distinct, Boost.URL has to
recognize and expose that distinction in its path elements sequence. For
example, an empty path would correspond to an empty sequence, and the
path of "/" would correspond to a single element sequence, with the
element being "".

Phil Endecott via Boost

unread,
Oct 14, 2021, 8:04:17 AM10/14/21
to bo...@lists.boost.org, Phil Endecott
data: URLs are a good example of long URLs that actually exist in
practice. But what I'm really thinking about is guarding against
malicious malformed input. You don't want your parser to have N^2
let alone exp(N) complexity when the input could legitimately be
a data: URL of a few kbytes.

And there's no need for it to have worse than linear complexity,
because I'm pretty sure that the URL syntax is a Regular Language (in
the computer science sense, not in the "regular expression is a
synonym for search pattern" I'm-a-perl-programmer sense), and all
regular languages can by definition be parsed in linear time.

Regarding your example, i.e. "http://foo.com/" vs the relative URL
"http.html", the point is that when the parser has reached the 'p'
it will be in a state meaning "either this is a scheme or the first
segment of the path of a relative URL". The transformation from
BNF to a no-backtracking state machine of that sort is what tools
like flex and libraries like RE2 do.

Now having said all that, I do suspect that the particular grammar
of URLs can be parsed even by a backtracking parser in linear time
(with some constant factor) - but it will depend on the rules.
I think you aspire to making the library "secure", and if that
includes not having pathological performance in the presence of
malicious input, you need to understand this stuff!


>> auto url = parse_url(str);
>>
>> The problem is that url is a view of the string parameter, but
>> nothing warns the user about that and...
>
> Well, it is documented profusely. I agree there is some danger here
> but with great danger comes great power. Or something about great
> responsibility?

I don't think this is acceptable. What do others think?


>> I would hope that the path would be similar to a std::filesystem::path
>> i.e. it should use the same vocabulary where possible (or could
>> even actually be a std::filesystem::path, on C++17).
>
> OOF I hope never to see this. std::filesystem is kind of a mess.

Well it's not my top favourite library either. But it's familiar,
and the authors have considered many of the issues that have been
mentioned in other threads i.e. whether path components include
directory separators, distinguishing absolute and relative paths,
etc.


Having mentioned data: URLs above, I have tried to understand how
these would be handled by your library. (mailto: also.) Note that
a data: URL will generally contains /s, e.g. in the content type
e.g. image/jpeg and also in the base64 data but these are not
directory separators.

I think what's really needed is a specific getter/setter for the
body of an "unconventional" URL scheme like these. It may be
possible to use your set_encoded_path() but it's not obvious.


Please fix the docs for set[_encoded]_query() which currently
seem confused with the operations for fragments.


Regards, Phil.

Dominique Devienne via Boost

unread,
Oct 14, 2021, 9:01:47 AM10/14/21
to bo...@lists.boost.org, Dominique Devienne
On Thu, Oct 14, 2021 at 2:04 PM Phil Endecott via Boost <
bo...@lists.boost.org> wrote:

> >> auto url = parse_url(str);
> >>
> >> The problem is that url is a view of the string parameter, but
> >> nothing warns the user about that and...
> >
> > Well, it is documented profusely. I agree there is some danger here
> > but with great danger comes great power. Or something about
> great responsibility?
>
> I don't think this is acceptable. What do others think?


I don't mind. It's no different than std::string and std::string_view.
You pay for copies only if you really want to. Maybe having
parse_url_as_view(str)
and a safer parse_url(str) using the former but not returning a view would
satisfy you?

Vinnie Falco via Boost

unread,
Oct 14, 2021, 10:02:19 AM10/14/21
to boost@lists.boost.org List, Vinnie Falco
On Thu, Oct 14, 2021 at 6:02 AM Dominique Devienne via Boost
<bo...@lists.boost.org> wrote:
> You pay for copies only if you really want to. Maybe having
> parse_url_as_view(str)
> and a safer parse_url(str) using the former but not returning a view would
> satisfy you?

Remember that there are potentially many url types that have ownership
of the buffer:

url
static_url
dynamic_url<Allocator>
pmr_url

Which one would the safe_parse_url() return?

Thanks

Robert Ramey via Boost

unread,
Oct 14, 2021, 2:35:53 PM10/14/21
to Phil Endecott via Boost, Robert Ramey
On 10/13/21 7:09 AM, Phil Endecott via Boost wrote:
> Hi Vinnie,
>
> Vinnie Falco wrote:
>> Well, I was on a pause but I picked Boost.URL back up and whipped it
>> into shape.
>
> Thanks for posting that. Assorted comments follow.
>
> Back in 2005-ish I decided to write a URL parser based on the RFC BNF
> using Boost.Spirit. At that time the most recent RFCs were RFC2616
> (1999) and RFC2396 (1998). I did a fairly direct translation of the
> BNF to Spirit and ... it didn't work. It turned out that the BNF in
> the RFCs was wrong, and had been for 6 years at that point. So I
> advise that you are careful about directly using the RFC BNF.
...

Too me, all this argues for the usage of boost spirit as a url parser.
Maintenance, Verification, etc. Of a protocol with the problems you
mention is much easier when the specification of the syntax is separated
from the code which does the parsing.

Before one says "performance" don't bother - even if it were true in
this case - which I actually doubt.

Vinnie Falco via Boost

unread,
Oct 14, 2021, 2:38:36 PM10/14/21
to boost@lists.boost.org List, Vinnie Falco
On Thu, Oct 14, 2021 at 11:35 AM Robert Ramey via Boost
<bo...@lists.boost.org> wrote:
> Too me, all this argues for the usage of boost spirit as a url parser.

Never going to happen. I would never take a dependency on Spirit for
anything network related or that will have security reviews. And users
don't want this dependency either.

Thanks

Vinnie Falco via Boost

unread,
Oct 14, 2021, 4:13:59 PM10/14/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Wed, Oct 13, 2021 at 8:14 AM Vinnie Falco <vinnie...@gmail.com> wrote:
> ...

Yeah you all uncovered some lurking problems so I am going back to the
drawing board to come up with a comprehensive set of fixes. Thanks!!!

Robert Ramey via Boost

unread,
Oct 14, 2021, 5:00:36 PM10/14/21
to bo...@lists.boost.org, Robert Ramey
On 10/14/21 11:38 AM, Vinnie Falco via Boost wrote:
> On Thu, Oct 14, 2021 at 11:35 AM Robert Ramey via Boost
> <bo...@lists.boost.org> wrote:
>> Too me, all this argues for the usage of boost spirit as a url parser.
>
> Never going to happen. I would never take a dependency on Spirit for
> anything network related or that will have security reviews. And users
> don't want this dependency either.

Hmmm - how about this idea. Create all your tests and examples first.
Basically this will consist of a long list of urls - each one more
elaborate and indecipherable then the last. Write the spirit parse to
run in these tests. When you get all this working, write your
"hand-rolled" parser (this will take a while) and test against the same
list of urls and verify that that the parsing matches the spirit
generated ones. Then you'll have 2 things:

a) verifiable, maintainable, confirmation that the test parsing is very
correct
b) test results which show that your hand rolled parser is either
correct or difficult to prove otherwise.

Robert Ramey

Ivan Matek via Boost

unread,
Oct 14, 2021, 11:32:51 PM10/14/21
to boost, Ivan Matek, Robert Ramey
On Thu, Oct 14, 2021 at 11:00 PM Robert Ramey via Boost <
bo...@lists.boost.org> wrote:

> On 10/14/21 11:38 AM, Vinnie Falco via Boost wrote:
> > On Thu, Oct 14, 2021 at 11:35 AM Robert Ramey via Boost
> > <bo...@lists.boost.org> wrote:
> >> Too me, all this argues for the usage of boost spirit as a url parser.
> >
> > Never going to happen. I would never take a dependency on Spirit for
> > anything network related or that will have security reviews. And users
> > don't want this dependency either.
>
> Hmmm - how about this idea. Create all your tests and examples first.
> Basically this will consist of a long list of urls - each one more
> elaborate and indecipherable then the last. Write the spirit parse to
> run in these tests. When you get all this working, write your
> "hand-rolled" parser (this will take a while) and test against the same
> list of urls and verify that that the parsing matches the spirit
> generated ones. Then you'll have 2 things:
>
> a) verifiable, maintainable, confirmation that the test parsing is very
> correct
> b) test results which show that your hand rolled parser is either
> correct or difficult to prove otherwise.
>
> I think maybe even better is to take some other implementations from other
languages(if they exist), maybe C# or Rust have official/unofficial
equivalents of Boost.URL so one could do a "diff" against them.

Phil Endecott via Boost

unread,
Oct 15, 2021, 7:41:40 AM10/15/21
to bo...@lists.boost.org, Phil Endecott
Robert Ramey wrote:
> On 10/13/21 7:09 AM, Phil Endecott via Boost wrote:
>> Back in 2005-ish I decided to write a URL parser based on the RFC BNF
>> using Boost.Spirit. At that time the most recent RFCs were RFC2616
>> (1999) and RFC2396 (1998). I did a fairly direct translation of the
>> BNF to Spirit and ... it didn't work. It turned out that the BNF in
>> the RFCs was wrong, and had been for 6 years at that point. So I
>> advise that you are careful about directly using the RFC BNF.
> ...
>
> Too me, all this argues for the usage of boost spirit as a url parser.
> Maintenance, Verification, etc. Of a protocol with the problems you
> mention is much easier when the specification of the syntax is separated
> from the code which does the parsing.

Spirit is also a backtracking parser, so can't promise
linear complexity - unless you can show that it does for
the particular grammar that you are using.

This isn't a performance issue, it's a security issue i.e.
you don't want your application to suffer denial of service
when it encounters malicious input.

Phil Endecott via Boost

unread,
Oct 15, 2021, 8:16:03 AM10/15/21
to bo...@lists.boost.org, Phil Endecott
Dominique Devienne wrote:
> On Thu, Oct 14, 2021 at 2:04 PM Phil Endecott via Boost <
> bo...@lists.boost.org> wrote:
>
>> >> auto url = parse_url(str);
>> >>
>> >> The problem is that url is a view of the string parameter, but
>> >> nothing warns the user about that and...
>> >
>> > Well, it is documented profusely. I agree there is some danger here
>> > but with great danger comes great power. Or something about
>> great responsibility?
>>
>> I don't think this is acceptable. What do others think?
>
>
> I don't mind. It's no different than std::string and std::string_view.

Exactly, it's like returning a string_view from a function - which
you should never do, at least not without an obvious indication...

> Maybe having parse_url_as_view(str) and a safer parse_url(str) using
> the former but not returning a view would satisfy you?

Yes, that's exactly what I suggested.

The alternative is to try to find some way to return something that
can't be assigned to auto. This may be possible with something like
a private_url_view class that wraps a view and makes everything private:

url u = parse_url(s);
// parse_url() returns a private_url_view, url is constructable from private_view

url_view v = parse_url(s);
// parse_url() returns a private_url_view, url_view is constructable from private_view

auto u = parse_url(s);
// Can we make that fail to compile by making private_url_view's ctors private?
// If not...
auto p = u.path();
// ...we can prevent the user from using it for anything by hiding all of
// its methods.
// But:
url u2 = u;
func(u);
// We need to prevent those too. I think we can prevent assigning from
// a private_view to a url[_view] by qualifying url[_view]'s assignment
// and copy ctor with &&:
url::operator=(const private_url_view& v) = delete;
url::operator=(private_url_view&&) { ... }

I'm sure that smarter people than me already know the correct pattern for
doing this. Searching I found p0672r0 which addresses how to fix the
similar issues for proxy types and expression templates but doesn't mention
views.

I also found blog posts by Arthur O'Dwyer talking about string_view, e.g.
https://quuxplusone.github.io/blog/2018/03/27/string-view-is-a-borrow-type/

Quote:

A function may have a borrow type as its return type, but if so, the
function must be explicitly annotated as returning a potentially dangling
reference. That is, the programmer must explicitly acknowledge responsibility
for the annotated function’s correctness.

Regardless, if f is a function so annotated, the result of f must not
be stored into any named variable except a function parameter or for-loop
control variable. For example, auto x = f() must still be diagnosed as a
violation.


Regards, Phil.

Vinnie Falco via Boost

unread,
Oct 15, 2021, 11:05:30 AM10/15/21
to boost@lists.boost.org List, Vinnie Falco, Robert Ramey
On Thu, Oct 14, 2021 at 8:33 PM Ivan Matek via Boost
<bo...@lists.boost.org> wrote:
> a) verifiable, maintainable, confirmation that the test parsing is very
> correct
> b) test results which show that your hand rolled parser is either
> correct or difficult to prove otherwise.

I already have all that... this is like HTTP parsing redux. The
parsing of URLs is the least interesting and least difficult aspect of
this project, but seems to be what people focus on. Where there is
difficulty is in the API for presenting and modifying the path
segments and the query parameters as a container. Have you had a look
at the query parameters container value_type?

https://github.com/CPPAlliance/url/blob/d866b5b09f08d12d597f783da295be823ad6f81c/include/boost/url/params.hpp#L70

Thanks

Vinnie Falco via Boost

unread,
Oct 15, 2021, 11:14:13 AM10/15/21
to boost@lists.boost.org List, Vinnie Falco, Phil Endecott
On Fri, Oct 15, 2021 at 5:15 AM Phil Endecott via Boost
<bo...@lists.boost.org> wrote:
> Exactly, it's like returning a string_view from a function - which
> you should never do, at least not without an obvious indication...

All functions in url_view which return encoded parts have a
string_view return type...

> > Maybe having parse_url_as_view(str) and a safer parse_url(str) using
> > the former but not returning a view would satisfy you?
>
> Yes, that's exactly what I suggested.

How would you convert this code to use "safer parse_url(str)"?

url_view u1 = parse_relative_ref( "/" ).value();
url u2 = parse_relative_ref( "/" ).value();
static_url<1024> u3 = parse_relative_ref( "/" ).value();
pmr_url u4 = parse_relative_ref( "/" ).value();
dynamic_url<
std::allocator<char>> u5 = parse_relative_ref( "/" ).value();

> The alternative is to try to find some way to return something that
> can't be assigned to auto. This may be possible with something like
> a private_url_view class that wraps a view and makes everything private:

This sounds like an awful lot of try-hard. string_view is one of those
controversial types which is going to split people into two camps. I'm
of the camp that you have to be careful how you hold the knife. Rather
than put the library through contortions with questionable degrees of
effectiveness at solving the "problem" I think it is just easier and
more straightforward to say that ownership of strings is simply not
transferred using parse functions or url_view.

Thanks

Rainer Deyke via Boost

unread,
Oct 15, 2021, 12:03:56 PM10/15/21
to bo...@lists.boost.org, Rainer Deyke
On 15.10.21 17:13, Vinnie Falco via Boost wrote:
> How would you convert this code to use "safer parse_url(str)"?
>
> url_view u1 = parse_relative_ref( "/" ).value();
> url u2 = parse_relative_ref( "/" ).value();
> static_url<1024> u3 = parse_relative_ref( "/" ).value();
> pmr_url u4 = parse_relative_ref( "/" ).value();
> dynamic_url<
> std::allocator<char>> u5 = parse_relative_ref( "/" ).value();

The following would do the trick:

auto u1 = parse_relative_ref<url_view>( "/" ).value();
auto u2 = parse_relative_ref<url>( "/" ).value();
auto u3 = parse_relative_ref<static_url<1024>>( "/" ).value();
auto u4 = parse_relative_ref<pmr_url>( "/" ).value();
auto u5 = parse_relative_ref<dynamic_url<
std::allocator<char>>>( "/" ).value();


--
Rainer Deyke (rai...@eldwood.com)

Vinnie Falco via Boost

unread,
Oct 15, 2021, 12:35:33 PM10/15/21
to boost@lists.boost.org List, Vinnie Falco, Rainer Deyke
On Fri, Oct 15, 2021 at 9:04 AM Rainer Deyke via Boost
<bo...@lists.boost.org> wrote:
> The following would do the trick:
>
> auto u1 = parse_relative_ref<url_view>( "/" ).value();
> auto u2 = parse_relative_ref<url>( "/" ).value();
> auto u3 = parse_relative_ref<static_url<1024>>( "/" ).value();
> auto u4 = parse_relative_ref<pmr_url>( "/" ).value();
> auto u5 = parse_relative_ref<dynamic_url<
> std::allocator<char>>>( "/" ).value();

Yeah, kinda. This is not expert friendly but also, does not
accommodate this use-case:

struct connection {
url u_;

void process( string_view s )
{
u_ = parse_uri( s );
...
}
};

In the code above, the capacity for u_ is reused for each I/O cycle.

Thanks

Rainer Deyke via Boost

unread,
Oct 15, 2021, 1:25:29 PM10/15/21
to bo...@lists.boost.org, Rainer Deyke
On 15.10.21 18:34, Vinnie Falco via Boost wrote:
> On Fri, Oct 15, 2021 at 9:04 AM Rainer Deyke via Boost
> <bo...@lists.boost.org> wrote:
>> The following would do the trick:
>>
>> auto u1 = parse_relative_ref<url_view>( "/" ).value();
>> auto u2 = parse_relative_ref<url>( "/" ).value();
>> auto u3 = parse_relative_ref<static_url<1024>>( "/" ).value();
>> auto u4 = parse_relative_ref<pmr_url>( "/" ).value();
>> auto u5 = parse_relative_ref<dynamic_url<
>> std::allocator<char>>>( "/" ).value();
>
> Yeah, kinda. This is not expert friendly but also, does not
> accommodate this use-case:
>
> struct connection {
> url u_;
>
> void process( string_view s )
> {
> u_ = parse_uri( s );
> ...
> }
> };
>
> In the code above, the capacity for u_ is reused for each I/O cycle.

You can always do this:

struct connection {
url u_;

void process( string_view s )
{
u_ = parse_uri<url_view>( s );
...
}
};



--
Rainer Deyke (rai...@eldwood.com)

Phil Endecott via Boost

unread,
Oct 16, 2021, 11:02:41 AM10/16/21
to bo...@lists.boost.org, Phil Endecott
Vinnie Falco wrote:
> On Fri, Oct 15, 2021 at 5:15 AM Phil Endecott via Boost
> <bo...@lists.boost.org> wrote:
>> Exactly, it's like returning a string_view from a function - which
>> you should never do, at least not without an obvious indication...
>
> All functions in url_view which return encoded parts have a
> string_view return type...

Right. I think it's OK for a type whose name is *_view to have
getters that return views. Actually Arthur O'Dwyer's blog post
that I linked to before suggested that it would be OK for a
getter method of any object to return a view whose lifetime is
the same as the object - I'm not sure that I agree with that:

auto old_host = url.host();
url.set_host( new_host );
// oops, old_host is invalid because it was a view.


> How would you convert this code to use "safer parse_url(str)"?
>
> url_view u1 = parse_relative_ref( "/" ).value();
> url u2 = parse_relative_ref( "/" ).value();
> static_url<1024> u3 = parse_relative_ref( "/" ).value();
> pmr_url u4 = parse_relative_ref( "/" ).value();
> dynamic_url<
> std::allocator<char>> u5 = parse_relative_ref( "/" ).value();

There are plenty of possibilities, including:

1. Verbose names that indicate the return type.
2. Out parameters for the result.
3. Template parameter for the result type.
4. Make these constructors for the url_* types, not a free function.

These all have pros and cons, none are perfect. Here's another idea
that has just occurred to me:

5. Base the return type on the input string type. I.e. if the
input is a string_view, return a url_view; if the input is a
string with a particular allocator, return a url with the same
allocator etc. (Does this avoid all the possible nasty surprises?)

In particular, if the input is a string&&, you can move that into
the string stored by the url:

url parse_url(std::string&& s)
{
url u;
u.str = std::move(s);
... parse u.str ...
return u;
}

In this case, if the caller writes:

auto u = parse_url( get_string_from_somewhere() );

then there is no allocation, and u is a "safe" non-view type.

This seems to work for this example:

> struct connection {
> url u_;
>
> void process( string_view s )
> {
> u_ = parse_uri( s );
> ...
> }
> };

s is a string_view, so the result of parse is a url_view, which
you can assign to the url without re-allocation (subject to capacity).


Regards, Phil

Vinnie Falco via Boost

unread,
Oct 16, 2021, 11:15:19 AM10/16/21
to boost@lists.boost.org List, Vinnie Falco, Phil Endecott
On Sat, Oct 16, 2021 at 8:02 AM Phil Endecott via Boost
<bo...@lists.boost.org> wrote:
> ...

To all, thanks for all the feedback. I'm not ignoring it, but we
discovered a super duper annoying problem with how the path segments
and query parameters containers represent their respective parts of
the URL, and fixing it is requiring some heroics so I am fixing that
up. To keep things spicy this is the unit test that I am trying to
make work:

void
testPathContainer()
{
auto const check = [](
string_view s,
std::initializer_list<
string_view> init,
bool abs)
{
url_view u =
parse_uri_reference(
s).value();
auto const segs =
u.encoded_segments();
BOOST_TEST(abs ==
u.is_path_absolute());
if(! BOOST_TEST(
segs.size() == init.size()))
return;
BOOST_TEST(equal(
segs.begin(), segs.end(),
init.begin(), init.end()));
};

auto const abs = [&check](
string_view s,
std::initializer_list<
string_view> init)
{
check(s, init, true);
};

auto const rel = [&check](
string_view s,
std::initializer_list<
string_view> init)
{
check(s, init, false);
};

rel("", {});
rel("./", { "" });
rel("index.htm", { "index.htm" });
rel("path/to/file.txt", { "path", "to", "file.txt" });
rel("//example.com", {} );
rel("x:y:z", { "y:z" });
rel("x:y:z/", { "y:z", "" });
rel("./y:z", { "y:z" });
rel("./y:z/", { "y:z", "" });

abs("/", {});
abs("/./", { "" });
abs("//example.com/", {} );
abs("//example.com/./", { "" } );
abs("/index.htm", { "index.htm" });
abs("/home/", { "home", "" });
abs("//x//", { "", "" });
abs("/.//", { "", "" });
abs("//x/y", { "y" });
abs("/././/", { ".", "", "" });
abs("/.//", { "", "" });
abs("x:/.//", { "", "" });
}

So yeah, if you are reading it correctly it means that if you have the
relative-ref "./" and you iterate the path segments you will get just
one element, an empty string { "" }. This might seem counterintuitive
but it is necessary to provide the invariant that the segments
container acts like vector<string>. If you push_back a series of
elements to an empty path, then iterating the container will give you
exactly those strings back. And the implementation preserves the
"absoluteness" of the path. That is, if the path previously started
with a slash then it will continue to start with a slash. If the path
previously did not start with a slash, then the new path will not
start with a slash - modulo the provision that any URL with an
authority can never be relative. There's a new function
set_absolute_path(bool) that adjusts the path accordingly. Making this
all work seamlessly is a chore..

Thanks

Phil Endecott via Boost

unread,
Oct 16, 2021, 2:10:21 PM10/16/21
to boost@lists.boost.org List, Phil Endecott
Vinnie Falco wrote:
> To all, thanks for all the feedback. I'm not ignoring it, but we
> discovered a super duper annoying problem with how the path segments
> and query parameters containers represent their respective parts of
> the URL

I would be very interested to see a comparison of how your
decomposition of paths into segments, and the reverse, compares
to what std::filesystem::path does, and rationale for the
differences.

> abs("/././/", { ".", "", "" });

Well that's an interesting one. Why is the middle element "" not
"."? Can you clarify?

I might perhaps hope that something like this would work:

for (auto dir: path.segments) {
chdir(dir.c_str());
}

i.e. I'd end up in the same directory as chdir(path.c_str()) would
get me to. But chdir("") is an error.

I believe that std::filesystem::path merges adjacent directory separators,
so the "" segments disappear.


Regards, Phil.

Vinnie Falco via Boost

unread,
Oct 16, 2021, 2:54:33 PM10/16/21
to boost@lists.boost.org List, Vinnie Falco, Phil Endecott
On Sat, Oct 16, 2021 at 11:10 AM Phil Endecott via Boost
<bo...@lists.boost.org> wrote:
> I would be very interested to see a comparison of how your
> decomposition of paths into segments, and the reverse, compares
> to what std::filesystem::path does, and rationale for the
> differences.

To be honest I have no idea, I have never used std::filesystem to any
meaningful extent.

> > abs("/././/", { ".", "", "" });
>
> Well that's an interesting one. Why is the middle element "" not
> "."? Can you clarify?

So yeah there are some basic rules. For example if you call

u.encoded_segments() = { ... }; // initializer_list<string_view>

Then the result of

for( auto it : u.encoded_segments() )
...

Should yield the same list of strings used in the initializer list. If
we accept that a path of "" is an empty relative path, and a path of
"/" is an empty absolute path, then iterating either of those paths
should produce the empty set: {}

A goal of the library is to guarantee that all mutations leave the URL
in a syntactically valid state. Consider the following sequence of
operations:

url u = parse_uri( "https://example.com//index.htm" ).value();

u.remove_authority();

What should the resulting encoded URL be? Well a naive algorithm might
leave us with "https://index.htm" but this is clearly wrong, as
index.htm is now the host! To fix this, the library prepends "/." to
the path (this is guidance from rfc3986):

assert( u.encoded_url() == "https:/.//index.htm" );

There are a handful of other cases like this (for example, removing
the scheme from a relative-ref whose first segment has a colon).
Coming back to:

abs("/././/", { ".", "", "" });

We treat a leading "/." as not appearing in the segments, to make the
behavior of the library doing these syntactic adjustments transparent
and satisfy the rule that assignments from segments produce the same
result when iterated.

> I might perhaps hope that something like this would work:
>
> for (auto dir: path.segments) {
> chdir(dir.c_str());
> }

I believe it will work for the cases where it should work, although I
can't say with certainty until I have finished making these changes.

> I believe that std::filesystem::path merges adjacent directory separators,
> so the "" segments disappear.

Yep, that needs to be a supported operation but it should not be
called push_back or insert, because those operations which have the
same names as their vector equivalent, should behave exactly like
vector. And vector<string> doesn't delete a trailing "" when you
push_back a non-empty string. I would have to add more functions such
as (bikeshedding aside) merge_back or merge_at.

Thanks

Gavin Lambert via Boost

unread,
Oct 17, 2021, 5:52:18 PM10/17/21
to bo...@lists.boost.org, Gavin Lambert
On 17/10/2021 07:53, Vinnie Falco wrote:
> What should the resulting encoded URL be? Well a naive algorithm might
> leave us with "https://index.htm" but this is clearly wrong, as
> index.htm is now the host! To fix this, the library prepends "/." to
> the path (this is guidance from rfc3986):
>
> assert( u.encoded_url() == "https:/.//index.htm" );

I assume this was intended to be "https://./index.htm"?

> There are a handful of other cases like this (for example, removing
> the scheme from a relative-ref whose first segment has a colon).
> Coming back to:
>
> abs("/././/", { ".", "", "" });
>
> We treat a leading "/." as not appearing in the segments, to make the
> behavior of the library doing these syntactic adjustments transparent
> and satisfy the rule that assignments from segments produce the same
> result when iterated.

If you're stripping leading ./ then shouldn't the result just be "/" alone?

Same reason that "/../../foo/../bar/" should become "/bar/".

Vinnie Falco via Boost

unread,
Oct 17, 2021, 8:03:57 PM10/17/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Sun, Oct 17, 2021 at 2:52 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> > assert( u.encoded_url() == "https:/.//index.htm" );
>
> I assume this was intended to be "https://./index.htm"?

Nope, it was correct as I wrote it. You managed to produce an
authority with a single dot :)

> > abs("/././/", { ".", "", "" });
> >
> > We treat a leading "/." as not appearing in the segments, to make the
> > behavior of the library doing these syntactic adjustments transparent
> > and satisfy the rule that assignments from segments produce the same
> > result when iterated.
>
> If you're stripping leading ./ then shouldn't the result just be "/" alone?
> Same reason that "/../../foo/../bar/" should become "/bar/".

Well no, there's a difference between what is in the value returned by
url_view::encoded_path() and what you get when you iterate the
segments. Leading "/." or "./" stays in the encoded segments but is
not returned by iterating segments, for the reason that it is
considered "metadata" about the path that keeps it regular without
changing the meaning.

I think you are getting confused with "normalization" which is a
different thing entirely. Given the following URL:

https:/.//index.htm

Normalization would leave it as-is. Given:

https:/././/index.htm

normalization would return

https:/.//index.htm

If you start with the URL above and add an authority:

url u = parse_uri( "https:/.//index.htm" ).value();

u.set_encoded_authority( "example.com" );

The result is

assert( u.encoded_url() == "https://example.com//index.htm" );

So there are three things at play here:

1. Modifying the path for normalization
2. Tweaking the path to match the grammar
3. Tweaking the path to provide segments() container invariants

Number 1 above is what people are mostly familiar with, for example
collapsing double dotted segments ".." safely.

Number 2 is understood by fewer people but is a consequence of the
grammar in the RFC. For example, if you have an authority, and a path
that starts with double slash "//", then if you remove the authority
you have to prepend "/." to the path. Another one, if you have a
scheme and a relative path whose first segment contains a colon, and
you remove the scheme then you have to prepend "./" to the path. These
tweaks let the library guarantee that all mutation operations leave
the URL in a syntactically valid state without having to do weird
things like throw exceptions, return error codes, ignore the request,
or worse impose additional semantic changes to the URL (for example,
turning an absolute path into a relative one in a case where the user
didn't explicitly request it).

Number 3 is the most controversial and unintuitive, it falls out as a
consequence of making the segments and encoded_segments containers
behave exactly like vector<string>. For example, if you call clear()
on the container, then it should return an empty list:

url u = parse_relative_ref( "path/to/file.txt" ).value();

u.segments().clear();

assert( u.segments().begin() == u.segments().end() );

However, what if you have an absolute path?

url u = parse_relative_ref( "/path/to/file.txt" );

assert( u.segments() == { "path", "to", "file.txt" } );

Okay so far so good but what if you clear?

u.segments().clear();

assert( u.encoded_url() == "/" );

Wait, that's not clear, there's still a path segment! Well of course
there is, if you clear an absolute path you should get back an
absolute path. But the segments container should be empty:

assert( u.segments().begin() == u.segments().end() ); // has to pass

See what's happening here? Lets start with a relative path

url u = parse_relative_ref( "index.htm" );

Now lets reassign the path:

u.segments() = { "my:file.txt" };

Well, we can't leave the URL as "my:file.txt" because that would be
interpreted as a scheme. So the library prepends "./":

assert( u.encoded_segments() == "./my:file.txt" );

But we have an invariant, after you assign the segments you have to
get the same thing back:

assert( u.segments() = { "my:file.txt" } );

To enforce this invariant we have to treat some path prefixes as if
they weren't there. "/" by itself, "./", and "/.".

And that's how its done

Thanks

Gavin Lambert via Boost

unread,
Oct 17, 2021, 9:05:42 PM10/17/21
to bo...@lists.boost.org, Gavin Lambert
On 18/10/2021 13:01, Vinnie Falco wrote:
>>> assert( u.encoded_url() == "https:/.//index.htm" );
>>
>> I assume this was intended to be "https://./index.htm"?
>
> Nope, it was correct as I wrote it. You managed to produce an
> authority with a single dot :)

Yes, that was my intent, from your description of replacing the
authority with a dot.

Though I see the issue now. I was reading the input as:

"https://example.com/index.htm"

For which removing the authority should result in:

"https:/index.htm"

(Although this is unusual; typically relative URIs will omit the scheme
as well.)


For the input:

"https://example.com//index.htm"

Then it does make sense at a purely-URL-level to transform this to:

"https:/.//index.htm"

(Although most web servers would treat either as illegal, but you could
envisage some not-HTTP protocol that requires such syntax.)


Adding the authority back to this URL should result in
"https://example.com/.//index.htm", however -- it should not be
"ignoring" the prefix once it exists. At least not until the URL is
normalised. (Unless you're documenting that URLs are always stored in
normalised form, or that setters will automatically normalise.)

> We treat a leading "/." as not appearing in the segments, to make the
> behavior of the library doing these syntactic adjustments transparent
> and satisfy the rule that assignments from segments produce the same
> result when iterated.

So what about the input "https://example.com/./index.htm"? Unless
you're documenting automatic normalisation, this should still iterate
the "." and "index.htm" path components separately.

Gavin Lambert via Boost

unread,
Oct 17, 2021, 10:04:20 PM10/17/21
to bo...@lists.boost.org, Gavin Lambert
On 17/10/2021 07:53, Vinnie Falco wrote:
> On Sat, Oct 16, 2021 at 11:10 AM Phil Endecott wrote:
>> I would be very interested to see a comparison of how your
>> decomposition of paths into segments, and the reverse, compares
>> to what std::filesystem::path does, and rationale for the
>> differences.
>
> To be honest I have no idea, I have never used std::filesystem to any
> meaningful extent.

Getting back to this point, as I mentioned in another branch it looks
like {std,boost}::filesystem breaks down "/foo/bar/baz.htm" to { "/",
"foo/", "bar/", "baz.htm" }.

Like it or hate it, this is the standard now, and it's likely that
people would expect that Boost.Url would use similar segmentation.


On a related note, you should strongly consider cross-compatibility with
{std,boost}::filesystem::path anyway, since inevitably an URL-parsing
library will hit one of the following scenarios sooner or later:

- conversion of an absolute file:/// uri to a filesystem path and back
- conversion of a relative uri path (only) to a relative filesystem
path and back (or perhaps better: conversion of [absolute uri, base uri,
base path] to absolute path)

While you could get away with doing the second kind of interop only with
complete strings, having an explicit method for it is better as it
allows you to encourage security checks such as considering it illegal
to use a relative path that would escape the base path.

Vinnie Falco via Boost

unread,
Oct 17, 2021, 10:06:50 PM10/17/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Sun, Oct 17, 2021 at 7:04 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> Getting back to this point, as I mentioned in another branch it looks
> like {std,boost}::filesystem breaks down "/foo/bar/baz.htm" to { "/",
> "foo/", "bar/", "baz.htm" }.

I'm not sure that's viable or correct for URL paths though.

> On a related note, you should strongly consider cross-compatibility with
> {std,boost}::filesystem::path anyway, since inevitably an URL-parsing
> library will hit one of the following scenarios sooner or later:

Yeah, I agree that there needs to be something to facilitate working
with paths that are formed from URLs. I haven't given it any thought
yet because I am still trying to finish the baseline functionality.

Thanks

Vinnie Falco via Boost

unread,
Oct 17, 2021, 10:15:58 PM10/17/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Sun, Oct 17, 2021 at 6:05 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> So what about the input "https://example.com/./index.htm"? Unless
> you're documenting automatic normalisation, this should still iterate
> the "." and "index.htm" path components separately.

Well I wouldn't say that the library is "doing automatic
normalization." It is just a special case treatment of up to 2 leading
characters of the path. As a general rule I think the library needs to
make a best-effort to preserve the components making up the path
iteration. That means, if you have a URL with index.htm and for
whatever reason the library needs to put this prefix "./" in front of
it, then iterating the path should still only give index.htm. And
furthermore that the implementation does not hide any "state"
information outside of the string itself.

If you have "//x//y" and you remove the authority to get ".//y", then
add the authority back to get "//x/.//y" and iterating would produce {
".", "", "", "y" ) then we have changed what iterating the segments
returns without the user asking for it.

So I think the library has to treat up to 2 leading characters of the
path special ("/", "./", and "/."), adding and removing them as needed
to preserve syntactic correctness and semantic equivalence when
performing mutation operations. To ensure the stability of iterated
segments, this means that we shouldn't return that malleable prefix of
the path.

Thanks

Gavin Lambert via Boost

unread,
Oct 17, 2021, 11:56:03 PM10/17/21
to bo...@lists.boost.org, Gavin Lambert
On 18/10/2021 15:06, Vinnie Falco wrote:
> On Sun, Oct 17, 2021 at 7:04 PM Gavin Lambert wrote:
>> Getting back to this point, as I mentioned in another branch it looks
>> like {std,boost}::filesystem breaks down "/foo/bar/baz.htm" to { "/",
>> "foo/", "bar/", "baz.htm" }.
>
> I'm not sure that's viable or correct for URL paths though.

Actually, sorry, I was getting my wires crossed. The above is how
.NET's System.Uri class behaves. (So Microsoft evidently thinks it's
viable for URLs.)

{std,boost}::filesystem will break down "/foo/bar/baz.htm" to { "/",
"foo", "bar", "baz.htm" } (and "/foo/bar/" to { "/", "foo", "bar", "" }
-- except boost:: returns "." for the final element).

Filesystems perhaps have a slight advantage here as AFAIK no filesystem
accepts an escaped path separator within a filename (modulo Linux
allowing backslashes; including WSL, but the latter actually stores a
different character), while there's no rule against %2F appearing inside
an URL path (although it's more common in a query string). But this
isn't insurmountable.

You can perhaps argue that the leading "/" element is not needed if
there is another method to indicate whether the URI is absolute or not;
Filesystem includes it for lexicographic sorting & comparison purposes,
but those don't entirely apply since we're only considering part of the
URL here anyway. So that is more justifiable.

With that in mind, it's probably reasonable for this iteration:

- "/foo/bar/baz.htm" => { "foo", "bar", "baz.htm" }
- "/foo/bar/" => { "foo", "bar", "" }
- "/foo/bar" => { "foo", "bar" }
- "/.//foo/bar" => { ".", "", "foo", "bar" }

But as I said before, this makes it impossible to convert between
relative and absolute paths via the path-segment API alone (as the
initial "/" isn't exposed, and instead a different method must be used).
Whether this is a feature or a bug is in the eye of the beholder.

>> On a related note, you should strongly consider cross-compatibility with
>> {std,boost}::filesystem::path anyway, since inevitably an URL-parsing
>> library will hit one of the following scenarios sooner or later:
>
> Yeah, I agree that there needs to be something to facilitate working
> with paths that are formed from URLs. I haven't given it any thought
> yet because I am still trying to finish the baseline functionality.

It's worthwhile considering these things from the start, as they can
inform design of your baseline (such as compatibility of path segment
iteration).

Vinnie Falco via Boost

unread,
Oct 18, 2021, 12:11:19 PM10/18/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Sun, Oct 17, 2021 at 8:56 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> You can perhaps argue that the leading "/" element is not needed if
> there is another method to indicate whether the URI is absolute or not;

Yep, you're right that is needed and we have these functions

bool url_view::is_path_absolute() const noexcept;

// returns true on success
bool url::set_path_absolute( bool should_be_absolute );

set_path_absolute(false) can fail in the one case where there is an
authority and at least one segment.

> - "/.//foo/bar" => { ".", "", "foo", "bar" }

The list looks fine except for the above, which I think has to be {
"", "foo", "bar" } for the reason that assigning the path should give
you back the same results when iterated:

url u = parse_uri( "http:" ).value();

u.segments() = { "", "foo", "bar" };

assert( u.encoded_url() == "http:/.//foo/bar" );
assert( u.segments() == { "", "foo", "bar" } ); // same list

If we then remove the scheme, I think the library needs to remove the
prefix that it added. Not a full "normalization" (that's a separate
member function that the user calls explicitly). The rationale is that
if the library can add something extra to make things consistent, it
should strive to remove that something extra when it can do so. Yes
this means that if the user adds the prefix themselves and then
performs a mutation, the library could end up removing the thing that
the user added. I think that's ok though, because the (up to 2
character) path prefixes that the library can add are all semantic
no-ops even in the filesystem cases.

> It's worthwhile considering these things from the start, as they can
> inform design of your baseline (such as compatibility of path segment
> iteration).

Library-added prefixes are semantic no-ops anyway, even for the
filesystem case, so they should not change anything in terms of
filesystem compatibility. And they don't appear in segment iteration
so it is like they were never there, thus zero impact.

Thanks

Peter Dimov via Boost

unread,
Oct 18, 2021, 3:38:24 PM10/18/21
to bo...@lists.boost.org, Peter Dimov
> On Sun, Oct 17, 2021 at 8:56 PM Gavin Lambert via Boost
> > It's worthwhile considering these things from the start, as they can
> > inform design of your baseline (such as compatibility of path segment
> > iteration).

Segment iteration is not going to be compatible. In addition to adding
an initial "/" segment for absolute paths, Filesystem also collapses
consecutive / separators. So iterating "/foo//bar//baz///" produces

"/" │ "foo" │ "bar" │ "baz" │ ""

(https://godbolt.org/z/EsjKzc5f1)

A design goal of URL seems to be that the information that the accessors
give accurately reflects the contents of the string (and that there's no
hidden metadata that the string doesn't reflect.)

So the segments of the above path are

{ "foo", "", "bar", "", "baz", "", "", "" }

because otherwise the segments of the above and "/foo/bar/baz/" will
be the same, which means that it won't be possible to reconstruct the string
from the information the URL accessors give.

That is, if you have a string, and you construct an URL object from it, then
take its state as returned by the accessors, copy it into another empty URL,
the resultant string should be the same as the original. And similarly, if you
take an empty URL, set its parts to some values, take the string, create
another URL object from the string, the accessors should give the original
values.

Destructive transformations such as what Filesystem does above cannot
work in this model.

Vinnie Falco via Boost

unread,
Oct 18, 2021, 4:12:44 PM10/18/21
to boost@lists.boost.org List, Vinnie Falco, Peter Dimov
On Mon, Oct 18, 2021 at 12:38 PM Peter Dimov via Boost
<bo...@lists.boost.org> wrote:
> That is, if you have a string, and you construct an URL object from it, then
> take its state as returned by the accessors, copy it into another empty URL,
> the resultant string should be the same as the original

Yes, and note that the "absoluteness" of the path is a property of the
URL which is reflected in the url API and not the segments:

bool url_view::is_path_absolute() const noexcept;
bool url::set_path_absolute( bool should_be_absolute );

Thanks

Gavin Lambert via Boost

unread,
Oct 18, 2021, 7:42:34 PM10/18/21
to bo...@lists.boost.org, Gavin Lambert
On 19/10/2021 05:10, Vinnie Falco wrote:
>> - "/.//foo/bar" => { ".", "", "foo", "bar" }
>
> The list looks fine except for the above, which I think has to be {
> "", "foo", "bar" } for the reason that assigning the path should give
> you back the same results when iterated:
>
> url u = parse_uri( "http:" ).value();
>
> u.segments() = { "", "foo", "bar" };
>
> assert( u.encoded_url() == "http:/.//foo/bar" );
> assert( u.segments() == { "", "foo", "bar" } ); // same list

Again, what about the case where the original input URL contained that
leading dot? You can't argue "we must report it unchanged" when by
definition there are conditions when you are changing it.

The only mechanism that seems sane to me is that encoded_url() and
friends are documented to normalise (or at least to partially normalise,
limited to adding/removing the path prefix) the URL before returning a
string, at which point segments() may change content. (But it's
important that it doesn't break if you push_back each segment
individually instead of assigning it all at once.)

> If we then remove the scheme, I think the library needs to remove the
> prefix that it added. Not a full "normalization" (that's a separate
> member function that the user calls explicitly). The rationale is that
> if the library can add something extra to make things consistent, it
> should strive to remove that something extra when it can do so. Yes
> this means that if the user adds the prefix themselves and then
> performs a mutation, the library could end up removing the thing that
> the user added. I think that's ok though, because the (up to 2
> character) path prefixes that the library can add are all semantic
> no-ops even in the filesystem cases.

I don't disagree with this, but I do disagree with the iteration methods
trying to "hide" elements that are actually present in the URL.


On 19/10/2021 08:37, Peter Dimov wrote:
> Segment iteration is not going to be compatible. In addition to
> adding an initial "/" segment for absolute paths, Filesystem also
> collapses consecutive / separators. So iterating "/foo//bar//baz///"
> produces
>
> "/" │ "foo" │ "bar" │ "baz" │ ""

Fair point; I hadn't considered that one. That's unfortunate. I agree
that URL cannot collapse adjacent separators.

Phil Endecott via Boost

unread,
Oct 19, 2021, 9:12:56 AM10/19/21
to bo...@lists.boost.org, Phil Endecott
Peter Dimov wrote:
>> On Sun, Oct 17, 2021 at 8:56 PM Gavin Lambert via Boost
>> > It's worthwhile considering these things from the start, as they can
>> > inform design of your baseline (such as compatibility of path segment
>> > iteration).
>
> Segment iteration is not going to be compatible. In addition to adding
> an initial "/" segment for absolute paths, Filesystem also collapses
> consecutive / separators. So iterating "/foo//bar//baz///" produces
>
> "/" ? "foo" ? "bar" ? "baz" ? ""
>
> (https://godbolt.org/z/EsjKzc5f1)
>
> A design goal of URL seems to be that the information that the accessors
> give accurately reflects the contents of the string (and that there's no
> hidden metadata that the string doesn't reflect.)
>
> So the segments of the above path are
>
> { "foo", "", "bar", "", "baz", "", "", "" }
>
> because otherwise the segments of the above and "/foo/bar/baz/" will
> be the same, which means that it won't be possible to reconstruct the string
> from the information the URL accessors give.

Right. But why has it chosen that goal, rather than the alternatives?
What's the rationale?

It seems to me that a URL with redundant /s (e.g. http://foo.com/path/////to/file)
is either (a) malicious or erroneous input, or (b) equivalent to the
versions without the redundant /s. So a user might want to (a) get an
exception or error, or (b) ignore the redundant segments. Under what
circumstances would a user want to see the empty segments between those
/s?


Here's an alternative:

- Skip over duplicate adjacent '/' when iterating segments.

- Return "/" as the first segment for absolute paths.

- Return "" as the last segment for paths with a trailing "/".

- Give p.push_back(s) a precondition that s must not be empty if
p.back() is empty.

I think this gives pretty sane behaviour. The invariant that push_back()ing
a series of segments and then iterating returns the same strings holds.


Vinnie Falco wrote:
> note that the "absoluteness" of the path is a property of the
> URL which is reflected in the url API and not the segments:

You're saying this because that's what the BNF says. Your URL api
doesn't have to exactly mirror the BNF. If it would make sense for
"absoluteness" to be a property of the path rather than the URL from
the point of view of the library user, you can do that.


Two other things to consider:

- What is your operator== going to do about redundant /s?
- How does this all work with data: and mailto: URLs?


Regards, Phil.

Gavin Lambert via Boost

unread,
Oct 19, 2021, 5:55:35 PM10/19/21
to bo...@lists.boost.org, Gavin Lambert
On 20/10/2021 02:12, Phil Endecott wrote:
> Right. But why has it chosen that goal, rather than the alternatives?
> What's the rationale?
>
> It seems to me that a URL with redundant /s (e.g.
> http://foo.com/path/////to/file)
> is either (a) malicious or erroneous input, or (b) equivalent to the
> versions without the redundant /s. So a user might want to (a) get  an
> exception or error, or (b) ignore the redundant segments. Under what
> circumstances would a user want to see the empty segments between those
> /s?

The rationale is the URI standard RFC. It is not permitted for URI
parsers to perform such collapsing because such URIs are entirely legal.

Unusual, certainly, especially for http[s] URIs that usually eventually
end up at a filesystem that would collapse consecutive slashes. Most
web servers (regardless of whether evaluating vs a filesystem or not)
will indeed collapse consecutive slashes too, at least while matching
against resource rules.

But URIs are not required to wind up leading to a filesystem or to be
served by a web server -- it's entirely possible that some other scheme
may treat additional consecutive slashes as significant for some purpose.

Ahmed Charles via Boost

unread,
Oct 20, 2021, 8:37:07 PM10/20/21
to bo...@lists.boost.org, Ahmed Charles, Phil Endecott
On Friday, October 15th, 2021 at 5:14 AM, Phil Endecott via Boost <bo...@lists.boost.org> wrote:
>
> Exactly, it's like returning a string_view from a function - which
> you should never do, at least not without an obvious indication...

My primary objection to this line of thinking is that the standard begin/end functions have the same dangerous implications as this function does. The input can be a string and the output is something that becomes invalid when that string is destroyed.

C++ doesn't have a good solution to this situation and clearly begin/end don't give an obvious indication.

Peter Dimov via Boost

unread,
Oct 20, 2021, 8:42:45 PM10/20/21
to bo...@lists.boost.org, Peter Dimov
Ahmed Charles wrote:
> On Friday, October 15th, 2021 at 5:14 AM, Phil Endecott via Boost
> <bo...@lists.boost.org> wrote:
> >
> > Exactly, it's like returning a string_view from a function - which you
> > should never do, at least not without an obvious indication...
>
> My primary objection to this line of thinking is that the standard begin/end
> functions have the same dangerous implications as this function does. The
> input can be a string and the output is something that becomes invalid when
> that string is destroyed.

begin/end don't have this problem in practice because you need to call both,
and that's hard to do on the same temporary.

c_str does, but virtually nobody makes this mistake for some reason.

Gavin Lambert via Boost

unread,
Oct 20, 2021, 8:58:20 PM10/20/21
to bo...@lists.boost.org, Gavin Lambert
On 21/10/2021 13:42, Peter Dimov wrote:
> begin/end don't have this problem in practice because you need to call both,
> and that's hard to do on the same temporary.

It's a constant thorn in Range's side, though, especially with all the
various kinds of non-owning filters. But by the same token people
should start getting more used to these sorts of considerations and be
less prone to making such errors.

And the static analysis tooling is a bit better now at spotting such
things, though not perfect.

> c_str does, but virtually nobody makes this mistake for some reason.

I recall some older (broken) C++ compiler (or rather: optimizer)
implementations where the temporary in (rvalue).c_str() was sometimes
destroyed after calling c_str() instead of after the entire
full-expression, which caused all sorts of Fun™ for the common cases of
passing this to a method call, which in turn led to defensively creating
a lot of otherwise unnecessary lvalues.

Thankfully modern compilers don't have that problem any more.

Peter Dimov via Boost

unread,
Oct 20, 2021, 9:13:57 PM10/20/21
to bo...@lists.boost.org, Peter Dimov
Gavin Lambert wrote:
> And the static analysis tooling is a bit better now at spotting such things,
> though not perfect.

Not really.

https://godbolt.org/z/P8xb4Ern4
https://godbolt.org/z/Y56KnxGeT
https://godbolt.org/z/M7aGb7jq8
https://pdimov.github.io/blog/2020/05/17/state-of-c-static-analysis-circa-2020/
https://godbolt.org/z/Pba64YjYv

Well, there is https://godbolt.org/z/nz9Pvzfoj, I suppose.

Oh, -fanalyzer has improved: https://godbolt.org/z/xhnr88ah1

It wasn't very useful for C++ last I checked, but it works now.

Peter Dimov via Boost

unread,
Oct 20, 2021, 9:22:02 PM10/20/21
to bo...@lists.boost.org, Peter Dimov
> Oh, -fanalyzer has improved: https://godbolt.org/z/xhnr88ah1
>
> It wasn't very useful for C++ last I checked, but it works now.

Or not. That's a false positive. :-)

Gavin Lambert via Boost

unread,
Oct 20, 2021, 9:24:40 PM10/20/21
to bo...@lists.boost.org, Gavin Lambert
On 21/10/2021 14:13, Peter Dimov wrote:
> Gavin Lambert wrote:
>> And the static analysis tooling is a bit better now at spotting such things,
>> though not perfect.
>
> Not really.

Yes, this was admittedly a small value of "a bit".

Vinnie Falco via Boost

unread,
Oct 20, 2021, 9:44:33 PM10/20/21
to boost@lists.boost.org List, Vinnie Falco, Gavin Lambert
On Wed, Oct 20, 2021 at 5:58 PM Gavin Lambert via Boost
<bo...@lists.boost.org> wrote:
> people should start getting more used to these sorts of
> considerations and be less prone to making such errors.

This is my position as well.

string_view is "dangerous" but you get a corresponding improvement in
power. It really isn't much different than calling async functions
with unowned buffers, e.g.:

std::string s;

sock.async_write_some( asio::buffer(s), ... );

There isn't any realistic way to protect from this which is why I
think that the proposed schemes for helping the users avoid shooting
themselves in the foot are not good solutions.

Thanks

Phil Endecott via Boost

unread,
Oct 21, 2021, 9:47:09 AM10/21/21
to bo...@lists.boost.org, Phil Endecott
Ahmed Charles wrote:
> On Friday, October 15th, 2021 at 5:14 AM, Phil Endecott via Boost <bo...@lists.boost.org> wrote:
>> Exactly, it's like returning a string_view from a function - which
>> you should never do, at least not without an obvious indication...
>
> My primary objection to this line of thinking is that the
> standard begin/end functions have the same dangerous
> implications as this function does. The input can be a
> string and the output is something that becomes invalid
> when that string is destroyed.

I'd be happy with the parsing function returning a view if
its name reflected that. My complaint is that the short name
parse_url() doesn't give the user that "obvious indication".

There are a couple of differences with begin()/end(). Firstly,
"everyone knows" that begin() and end() return iterators so
the names do give the required clue. Secondly, these are
methods, not free functions; you can argue that it's OK to
return iterator/pointer/view types from getter methods if
the lifetime of the pointee is the same as the object. (Yes
I'm conveniently overlooking the begin()/end() free functions.)

I'd also suggest that the existence of foot-guns already in the
language is not an excuse for adding more!

These are not absolute, yes/no decisions; there are trade-offs
to be made. The only benefit of naming this view-returning
function parse_url() is less typing. I don't think that's
worthwhile.


Regards, Phil.

Vinnie Falco via Boost

unread,
Oct 21, 2021, 10:53:15 AM10/21/21
to boost@lists.boost.org List, Vinnie Falco, Phil Endecott
On Thu, Oct 21, 2021 at 6:47 AM Phil Endecott via Boost
<bo...@lists.boost.org> wrote:
> I'd be happy with the parsing function returning a view if
> its name reflected that. My complaint is that the short name
> parse_url() doesn't give the user that "obvious indication".

In C++ the "name" of a function includes its return type
and parameter types, so actually the name is:

result<url_view> parse_uri( string_view s ) noexcept;

Thus it already reflects that a view is returned. If you think
that the function name is just "parse_uri," then we have
even bigger usability problems, as users are going to be
in for a surprise to discover they have to call has_error(),
value(), or error() on the returned result<url_view> to get
something recognizable.

Thanks
Reply all
Reply to author
Forward
0 new messages