[boost] Boost.URL review

91 views
Skip to first unread message

Zach Laine via Boost

unread,
Aug 21, 2022, 3:24:13 PM8/21/22
to Boost mailing list, Zach Laine
- Does this library bring real benefit to C++ developers for real
world use-case?

Yes.

- Do you have an application for this library?

No.

- Does the API match with current best practices?

See details below.

- Is the documentation helpful and clear?

In Overview:

- s/will be easy to read/is easy to read (no need to use the future tense)
- s/1.78/1.81 (can't use 1.78 if URL depends on TOT, which is stated
elsewhere in the review announcement)
- The bullet list is an awkward mix of verb- and noun-phrases (e.g.
"Require only C++11" and "Fast compilation, few templates"). I
suggest all noun-phrases, like "Requires only C++11").
- s/1.80/1.81 -- same reason as above.
- I think that flow in tutorial-style documentation is important. In
the Overview, you introduce what the library is at a high level, then
jump into some build- and compiler-support weeds, only to then follow
with the Quick Look section. IMO, you should move the bits after the
bullet list to a section called Compilers and Build Considerations or
similar, and place it much later in the docs.

In Containers:

The text at the top makes the relative/absolute distinction sound
important, but the rest of the page does not define or show examples
of the difference between the two.

I suggest single quotes around "/" in "The reason is any decoded
character / could make", since on first reading I thought it was part
of the documentation text, not in code-font. Maybe do it like you do
for the mention of double slashes a few lines later?

When the ambiguity of the string
"https://www.example.com/get-customer.php?name=John%26Doe" is
mentioned, the docs say that if the query string represents
parameters, the quoted "&" is not ambiguous. Then it shows an example
of getting all the parameters from the string
"https://www.example.com/get-customer.php?id=409&name=Joe&individual".
This begs the question, what happens when there's a "%26" in the
string, and you use .params()? Please show this as part of the
example, or at least explain it.

Examples:

It would be nice if these were on separate pages. Even nicer would be
some context, besides just a bunch of code. I had to Google "magnet
link", and I still don't really know what it is. A brief explanation
would be helpful.

API Docs:

The utl::params API docs refer to it as a random-access view, but it
is neither random-access, nor a view.

- Did you try to use it? What problems or surprises did you encounter?

I did not.

- What is your evaluation of the implementation?

I did not look too much at the implementation. but what I did look at
was somewhat hard to follow. I went looking for where a result<> is
filled in during the parse, and it took me a solid 45 minutes to find
it.


Design:

Overall, I quite like the design. Lazy, lazy, lazy. The fact that
only views into the underlying sequence are returned from the parse is
great, and the fact that there are both owning (url::url and
fixed-capcity owning url::static_url) and view versions of the URL
representations is exactly what I would need as a user.

When I call parse_uri(), I get a result of type result<url_view>? Why
isn't the function spelled parse_url(), or the return type spelled
result<uri_view>? This seems needlessly inconsistent, given the
assertion in the docs that URL is fine to use in place of URI.

The assign_to() API seems awkward to me (as in std::string str;
u.query().assign_to(str);). This would be better as a conversion
function with a name, like string(), to_string(), or as_string().
Assignment is fundamental, and the compiler knows what to do with it
(how to optimize it away, for one thing). Assignment through an
out-param is awkward at best, and is actually more likely to lead to
more pessimal code, since RVO cannot be applied.

The append_to() API is really very nice. Most libraries don't support
appending as a first-order operation, but it is essential for string
perf in many use patterns.

However, the containers are frankly a mess. What are they, exactly,
at a high level of abstraction? For instance, is params a
singly-linked list? It has a forward iterator, so it kinda looks like
one. Is it an associative container? It has key/value pairs (where,
sure the value might be empty, but it still exists at the level of the
container), so it looks like one of those too. Is it array-like? The
keys are not sorted, and the order corresponds to the order in a URL,
so it sure looks array-like. This is important to users who want to
use your containers in generic code, including the standard algorithms
(which often have a pessimal implementation specific to
forward_iteretors). Also, your containers have about 10% of the API
of any of the kinds of containers listed above, meaning generic code
will have to special-case use of your container if it's missing some
of the SequenceContainer or AssociativeContainer API that the generic
code relies on. I recommend that you implement params in terms of a
std::vector<std::pair<string_type, string_type>> (using whatever
string_type you're already using), and then add the API that is
URL-specific on top of it, and have two members:

std::vector<std::pair<string_type, string_type>> & storage();
std::vector<std::pair<string_type, string_type>> const & storage() const;

Then make the iterators RA. Similar changes seem appropriate for the
other containers as well.

"Remove" is not really a term used in the STL. I think your
"remove*()" container members should be "erase*()" instead.

The replace() API is nice. I only wish it were range-capable (that
is, I wish there were an overload taking two iterators to erase, and
two iterators to insert).

Why did you define your own find_if() and find_if_not(), instead of
using the std ones? And why the hell does CharSet have these two as
members? That's crazy!

I find the creation of a mini-parsing library and the expectation that
users will use this mini-library to write their own parsers to be
somewhat problematic. No one is going to use your mini-library for
serious parsing work (except perhaps for modifications to the way URL
does its parsing). It does have proper error reporting, for one
thing. There are other equally obvious limitations. If the parsing
were an implementation detail, or if you were only providing the
parsing lib for extensions to your existing parsing, what you've
provided would be enough. However, you also claim that you can use
URL as part of a larger parser, and I don't think that's realistic.

I vote to ACCEPT. I think I've raised legitimate concerns about the
design here, but none of them rises to the level of a blocker.

Zach

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Vinnie Falco via Boost

unread,
Aug 21, 2022, 4:09:10 PM8/21/22
to bo...@lists.boost.org, Vinnie Falco
On Sun, Aug 21, 2022 at 12:24 PM Zach Laine via Boost
<bo...@lists.boost.org> wrote:
> Yes.

Thank you for taking the time to review the library!

> - Is the documentation helpful and clear?
> ...

Oh I very much like feedback about documentation, especially when it
is actionable (i.e. offers specific advice instead of just "should be
better"). I agree with everything you wrote. I do struggle with
documentation, it seems it is never as good as I want it to be no
matter how much time I put into it. We will copy this into an issue
and get to work on it!

> IMO, you should move the bits after the
> bullet list to a section called Compilers and Build Considerations or
> similar, and place it much later in the docs.

Well, the Overview page is designed to tell prospective users of the
library everything they need to know to make a determination on
whether or not the library will suit their needs. And it also serves
double-duty as a promotional page. We want the library to look as good
as possible - showing that we have taken the time to support all those
compilers and build configurations seems to me to be a signal of
quality. We can probably shuffle some things around and refine the
grammar though.

> The text at the top makes the relative/absolute distinction sound
> important, but the rest of the page does not define or show examples
> of the difference between the two.

Yes... the docs need work.

> ...
> Please show this as part of the example, or at least explain it.
> ...

Agree with everything there.

> It would be nice if these were on separate pages. Even nicer would be
> some context, besides just a bunch of code. I had to Google "magnet
> link", and I still don't really know what it is. A brief explanation
> would be helpful.

Yes, we agree with everything. And we need many more small
self-contained examples of doing common things.

> The utl::params API docs refer to it as a random-access view, but it
> is neither random-access, nor a view.

This is what we in the biz call a "bug" in the documentation :) It is
a mutable BidirectionalRange which references the original url's
character buffer.

> When I call parse_uri(), I get a result of type result<url_view>? Why
> isn't the function spelled parse_url(), or the return type spelled
> result<uri_view>? This seems needlessly inconsistent, given the
> assertion in the docs that URL is fine to use in place of URI.

Yes good question, and I think this is explained in the docs. When we
refer to the general family of strings we call it "URL." Containers
are all capable of holding the generic syntax (i.e. ALL valid
productions of the grammar). thus containers are named url, url_view,
and static_url.

However, when we refer to specific grammars from rfc3986 we always use
the exact name of the grammar. Names of parsing functions follow a
consistent pattern: "parse_" followed by the grammar which they
implement:

// URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
parse_uri();

// absolute-URI = scheme ":" hier-part [ "?" query ]
parse_absolute_uri();

// relative-ref = relative-part [ "?" query ] [ "#" fragment ]
parse_relative_ref();

// URI-reference = URI / relative-ref
parse_uri_reference();

// origin-form = absolute-path [ "?" query ]
parse_origin_form(); // rfc7230

However, they all return the same container, which is url_view,
because url_view is capable of holding all of these things. We briefly
explored baking different grammars into the type system (e.g.
absolute_uri_view) but it becomes totally unergonomic with no
corresponding benefit.

We do have a separate type, authority_view - because an authority by
itself does not satisfy the generic syntax. And of course the function
is suitably named:

// authority = [ userinfo "@" ] host [ ":" port ]
result< authority_view > rv = parse_authority( s );

> The assign_to() API seems awkward to me (as in std::string str;
> u.query().assign_to(str);). This would be better as a conversion
> function with a name, like string(), to_string(), or as_string().

I've resisted an implicit conversion to std::string but of course
Ramey Rule kicks in and steers you elsewhere. We are very likely going
to add it, because the API impedance is too high otherwise. So you
will just be able to assign it to your string. And we can trivially
add member to_string() or whatever name is in fashion these days.

> Assignment is fundamental, and the compiler knows what to do with it
> (how to optimize it away, for one thing). Assignment through an
> out-param is awkward at best, and is actually more likely to lead to
> more pessimal code, since RVO cannot be applied.
> The append_to() API is really very nice. Most libraries don't support
> appending as a first-order operation, but it is essential for string
> perf in many use patterns.

Hey thanks for noticing this!

Yes it was entirely our intent to create an API that is suitable for
working with allocators, without having to specify Allocators if you
know what I mean. assign_to and append_to are for reusing the capacity
in existing strings. We have assign_to because the MutableString
concept doesn't specify a way of clearing the string, and we want to
be able to express algorithms without a precondition "the mutable
string must have a zero size." I think the MutableString concept needs
some more work though, we are open to a design discussion on how to
improve it. The idea is to support algorithms that require temporary
buffers to operate.

> However, the containers are frankly a mess. What are they, exactly,
> at a high level of abstraction? For instance, is params a
> singly-linked list? It has a forward iterator, so it kinda looks like
> one.

It is a lazy flat_map without random access but with location-based
insertion and erase. Segments is a lazy list<string>.

> This is important to users who want to use your containers in
> generic code, including the standard algorithms (which often
> have a pessimal implementation specific to forward_iteretors).

Most users should be able to use these lazy containers as-is. For
example they might simply iterate over the params, ignore the ones
that they don't care about, and take action on the ones they want.

> Also, your containers have about 10% of the API
> of any of the kinds of containers listed above, meaning generic code
> will have to special-case use of your container if it's missing some
> of the SequenceContainer or AssociativeContainer API that the generic
> code relies on.

The containers are designed to model as closely as possible standard
containers, where it makes sense to do so. Because they are lazy and
not node based, some concessions have to be made of course. The
priority in all cases is to be lazy (avoid allocation). The escape
hatch is that users who need the exact APIs offered by a standard
container, are free to copy the lazy range into their standard
container and then do everything they need to do. And then they should
be able to assign the result back, since these lazy ranges support
assignment of the entire range.

> I recommend that you implement params in terms of a
> std::vector<std::pair<string_type, string_type>> (using whatever
> string_type you're already using), and then add the API that is
> URL-specific on top of it, and have two members:

Uhh... well... that's not lazy :) and it allocates memory. If users
want this, they have to opt-in to it. I would be reluctant to design
anything that doesn't give the user control over the allocator. But if
you think about it, these lazy ranges give you that. Users can use
standard algorithms like copy with a back_insert_iterator into any
container they want, regardless of allocator.

> std::vector<std::pair<string_type, string_type>> & storage();
> std::vector<std::pair<string_type, string_type>> const & storage() const;
>
> "Remove" is not really a term used in the STL. I think your
> "remove*()" container members should be "erase*()" instead.

I'm not sure where you're seeing this? There are no "remove" functions
in these containers.

>
> The replace() API is nice. I only wish it were range-capable (that
> is, I wish there were an overload taking two iterators to erase, and
> two iterators to insert).

That is something we could do, but there was no obvious use-case for
removing ranges of query params.

> Why did you define your own find_if() and find_if_not(), instead of
> using the std ones? And why the hell does CharSet have these two as
> members? That's crazy!

They are optional members. A particular charset might know how to
implement find_if much better than a generic algorithm. The free
function grammar::find_if will call the member find_if on the CharSet
if it exists, otherwise it uses a standard algorithm. For example
grammar::lut_chars::find_if uses an SSE2 implementation when
available. And it can be further optimized but we just haven't
bothered to do it yet.

> I find the creation of a mini-parsing library and the expectation that
> users will use this mini-library to write their own parsers to be
> somewhat problematic. No one is going to use your mini-library for
> serious parsing work (except perhaps for modifications to the way URL
> does its parsing).

I agree that this is not going to be for most users but if even 1 or 2
users find utility (for example to further validate custom
scheme-specific grammars) then I would be happy with it. I am using
this extensively in downstream libraries, for "serious parsing work"
so I disagree that it is not serious. This is a mini-library for now
so downstream libraries can use it ergonomically, but once it gets
some more practical usage in more varied settings (and becomes stable
API-wise) I would rather propose it as a new library.

> It does have proper error reporting, for one thing.

Did you mean that it "doesn't" have proper error reporting? Where do
you see this lack of reporting (if that's what you meant)?

> I vote to ACCEPT. I think I've raised legitimate concerns about the
> design here, but none of them rises to the level of a blocker.

Thanks!!

Regards

Vinnie Falco via Boost

unread,
Aug 21, 2022, 7:47:52 PM8/21/22
to Zach Laine, boost@lists.boost.org List, Vinnie Falco
On Sun, Aug 21, 2022 at 4:03 PM Zach Laine <whatwasth...@gmail.com> wrote:
> ...

It wouldn't be a proper review without at least one person forgetting
to use "Reply All" :) So here it is, from Zach. I have quoted his
message:

> This makes sense. The larger issue I'm raising is that a user is
> going to have a harder time remembering this, if the names don't
> match. A way to mitigate that is to make this naming rationale
> explicit, so that user
> expectations will adjust. I didn't get that this was the pattern from
> the documentation, though I think you did say it. A call-out
> specifically noting this would porbably be enough.

Yeah we can do that.

> Hm. By lazy flat_map, do you mean that the positions are computed one
> step at a time during iteration? is that why they have
> forward_iterators?

Yes. operator++ and operator-- intelligently find the next param or
segment using a non-validating parse algorithm (since we know that the
underlying character buffer has already been validated).

> If so, please document that. It might make a
> pretty big difference whether I decided to keep using a params object
> that I need to update a lot, or copy everything over to a proper
> flat_map or vector to do my work.

Yes I agree, but I thought we did document that. Here, no?

<https://master.url.cpp.al/url/ref/boost__urls__params_encoded_view.html#url.ref.boost__urls__params_encoded_view.complexity>

We could probably add something here too:

<https://master.url.cpp.al/url/ref/boost__urls__params_encoded_view/iterator.html>

> This works for an immutable container. It is *very surprising* for a
> mutable one. The rest of my notes about the containers are predicated
> on a false assumption that mutability implied non-laziness in the
> container itself. I think all of these container would be better off
> referred to as views, and made immutable. All you really need to
> support is begin/end, plus maybe key-lookup. I suggest you drop the
> mutability entirely. With mutability, these containers are trying to
> do two things at once.

Well, that's another way to do it and it certainly would remove
complexity and simplify the names. Right now we have 8 containers:

segments
segments_view
segments_encoded
segments_encoded_view
params
params_view
params_encoded
params_encoded_view

All of these containers reference the underlying character buffer of
the URL (although you can also construct the views from separately
parsed string views but that's basically the same thing in terms of
the ownership model being reference-like).

If we remove mutability as you suggest, then we can cut the number of
containers down to 4:

params_view
segments_view
params_encoded_view
segments_encoded_view

But this leaves us with a problem - how do you modify individual path
segments and query params? How do you do position-based insert and
erase of elements using an iterator? These features seem kind of
useful, I'm not sure I want to lose them.

> > > "Remove" is not really a term used in the STL. I think your
> > > "remove*()" container members should be "erase*()" instead.
> >
> > I'm not sure where you're seeing this? There are no "remove" functions
> > in these containers.
>
> params::remove_value() is one. I think(?) there are others.

Oh. Well... remove_value() only removes the value part of the param.
It does not erase the entire param. There's no such thing as
insert_value. The javadoc is wrong unfortunately (sorry about that).
We do use the term erase() when the operation removes the entire
element.

> > Did you mean that it "doesn't" have proper error reporting? Where do
> > you see this lack of reporting (if that's what you meant)?
>
> By "proper" I mean that if the parse fails it should provide an
> end-user-friendly diagnostic, and optionally quote the line and put a
> caret under the position that the parse failed (a la the Clang
> diagnostics). Your non-end-user error reporting is fine.

Oh... hmm.. well, each rule does leave the iterator out-param at the
spot where an error occurred. But I'm not quite sure how to emit the
diagnostic - surely you aren't suggesting we write to std::cout or
std::cerr? Are we using curses or something? Or maybe just plain ASCII
art? How exactly would this look?

Thanks to Peter's work on boost::system::error_code, when a parse
error occurs the error is augmented with file name, line, and function
name information. And I have worked on adding visualizers so these are
nicely presented in the debugger. So at least we have that.

Thanks

Zach Laine via Boost

unread,
Aug 22, 2022, 10:47:37 AM8/22/22
to Vinnie Falco, Boost mailing list, Zach Laine
On Mon, Aug 22, 2022 at 9:41 AM Zach Laine <whatwasth...@gmail.com> wrote:
> Right, and I understood how those worked. However, I had no idea that
> the owning containers were based on the same lazy approach (unless I
> misunderstand what you said in this thread...).
> Easy -- you just copy them into a flat_map, vector, or whatever. The
> 90% use case is just to look at the result of the parse. Also making
> the parsing library responsible for mutations you do to the data later
> seems unnecessary. If you provide parsing -- including the lazy views
> -- and then writing of URLs, and the writing can be done via a generic
> container, that's all your library needs, IMO. To be clear:
>
> Parsing -> lazy_containers -> user mutations outside of Boost.URL ->
> writing URLs
>
> That last step only needs to understand the different kinds of data it
> might write, abstractly. That abstraction might be a params_view, a
> flat_map, or a std::vector<std::pair<strd::string, std::string>>.
>
> > How do you do position-based insert and
> > erase of elements using an iterator? These features seem kind of
> > useful, I'm not sure I want to lose them.
>
> You won't lose them, because the STL exists.
>
> > > > Did you mean that it "doesn't" have proper error reporting? Where do
> > > > you see this lack of reporting (if that's what you meant)?
> > >
> > > By "proper" I mean that if the parse fails it should provide an
> > > end-user-friendly diagnostic, and optionally quote the line and put a
> > > caret under the position that the parse failed (a la the Clang
> > > diagnostics). Your non-end-user error reporting is fine.
> >
> > Oh... hmm.. well, each rule does leave the iterator out-param at the
> > spot where an error occurred. But I'm not quite sure how to emit the
> > diagnostic - surely you aren't suggesting we write to std::cout or
> > std::cerr? Are we using curses or something? Or maybe just plain ASCII
> > art? How exactly would this look?
>
> The diagnostics go to some optional output channel, which might be a
> stream or file or logger. If you don't set the output, the diagnostic
> is never generated. You set this output in the call to parse*(), as
> an optional param -- usually a std::function<void(std::string const
> &)>.
>
> > Thanks to Peter's work on boost::system::error_code, when a parse
> > error occurs the error is augmented with file name, line, and function
> > name information. And I have worked on adding visualizers so these are
> > nicely presented in the debugger. So at least we have that.
>
> That's really nice for programmers, but it leaves the programmer with
> a lot of work to do to present the error to the user.
>
> Zach

Forgot to reply-all *again*.

Zach

Vinnie Falco via Boost

unread,
Aug 22, 2022, 11:41:00 AM8/22/22
to Zach Laine, Vinnie Falco, Boost mailing list
On Mon, Aug 22, 2022 at 7:47 AM Zach Laine <whatwasth...@gmail.com> wrote:
> > > But this leaves us with a problem - how do you modify individual path
> > > segments and query params?
> >
> > Easy -- you just copy them into a flat_map, vector, or whatever.
> > ...
> > Parsing -> lazy_containers -> user mutations outside of Boost.URL ->
> > writing URLs

It sounds like what you are proposing, is that Boost.URL should remove
its algorithms for performing modifications to the URL. Or at least,
remove the ability to modify the path segments and query parameters in
terms of their equivalent sequences (BidirectionalRange).

The modifiable containers in Boost.URL maintain an important
invariant: modifications to the container will always leave the URL in
a valid state. The implication is that the stored string is always in
its "serialized" form.

> > That last step only needs to understand the different kinds of data it
> > might write, abstractly. That abstraction might be a params_view, a
> > flat_map, or a std::vector<std::pair<strd::string, std::string>>.

This isn't workable at all, for several reasons. First of all when the
user passes an encoded parameter to a modification function (named
with the suffix "_encoded") the library performs validation on the
input to preserve the invariant. Second, the library remembers the
decoded size so that it doesn't have to re-parse that portion of the
URL again. Third, when the user passes a decoded parameter to a
modification function the library applies whatever percent-escapes are
necessary to make the string valid for the part being changed, and
stores that. And finally, there are certain modifications to the URL
which require that the library adjusts the result in order to both
satisfy the user's request and preserve the invariant that all stored
URLs are valid. For example, consider these statements:

url u( "ldap:local:userdb/8675309" );
u.remove_scheme();

What should the resulting URL be? Well, it can't be
"local:userdb/8675309" because that still has a scheme. The library
has to produce:

u == "local%3a/8675309";

In other words we go from an absolute-URI to a relative-ref. In
addition to preserving the invariant that all modifications produce
valid URLs, the library also ensures that it can satisfy every
possible user-requested modification (presuming that no
incorrectly-encoded strings are passed to functions which accept
encoded parameters). There is a fair bit of cleverness going on behind
the scenes to bring this convenience to the user. That would all be
lost if we outsourced mutation operations to std containers.

> > The diagnostics go to some optional output channel, which might be a
> > stream or file or logger. If you don't set the output, the diagnostic
> > is never generated. You set this output in the call to parse*(), as
> > an optional param -- usually a std::function<void(std::string const
> > &)>.

huh...idk about all that. This library was designed for program to
program communication and not really user-input and user display.
Although it could be used for that. If someone wants to print the URL
and show a squiggly line near the part that failed, they can always do
that by calling the parse function that uses an iterator as an
out-param, indicating where the error occurred. I don't know that I
want the URL library to go out of its way to offer this functionality,
especially adding a `std::function` to an API.

Regards

Zach Laine via Boost

unread,
Aug 23, 2022, 11:44:40 AM8/23/22
to Vinnie Falco, Zach Laine, Boost mailing list
Ok, I'm convinced. I am still not convinced that the containers that
maintain these invariants should be lazy. That still seems weird to
me. If they own the data, and are regular types, they should probably
be eager. The larger issue to me is that they have a subset of the
expected STL API.

> > > The diagnostics go to some optional output channel, which might be a
> > > stream or file or logger. If you don't set the output, the diagnostic
> > > is never generated. You set this output in the call to parse*(), as
> > > an optional param -- usually a std::function<void(std::string const
> > > &)>.
>
> huh...idk about all that. This library was designed for program to
> program communication and not really user-input and user display.
> Although it could be used for that. If someone wants to print the URL
> and show a squiggly line near the part that failed, they can always do
> that by calling the parse function that uses an iterator as an
> out-param, indicating where the error occurred. I don't know that I
> want the URL library to go out of its way to offer this functionality,
> especially adding a `std::function` to an API.

That's kind of my point. You have a parsing minilib that is useful
for URL parsing, but not *general use*. If that's the case, I think
you should present it as that, and not a general use parsing lib.

(The use of std::function is incidental; it could be a template
parameter instead.)

Zach

Vinnie Falco via Boost

unread,
Aug 23, 2022, 12:42:54 PM8/23/22
to Zach Laine, Vinnie Falco, Boost mailing list
On Tue, Aug 23, 2022 at 8:44 AM Zach Laine <whatwasth...@gmail.com> wrote:
> Ok, I'm convinced.

The URL classes are kind of weird in the sense that they have three parts:

1. the getter/setter functions for the singular pieces of the URL
2. the container-like interface for the segments
3. the container-like interface for the params

Because each URL effectively exposes two different containers/ranges
they are turned into separate types. It wouldn't make sense to have
url::begin() and url::end().

Just so that we are on the same page, and anyone who is reading this
now or in the future has clarity, when you write

url u( "http://www.example.com/path/to/file.txt" );

segments us = u.segments();

The value `us` models a lazy, modifiable BidirectionalRange which
references the underlying `url`. That is to say, that when you invoke
modifiers on us, such as:

us.pop_back();

it is the underlying `url` (or `static_url`) which changes. `segments`
is a lightweight type which has the semantics of a reference. If you
were to, say, make a copy of `us` you are just getting another
reference to the same underlying url. A `segments` cannot be
constructed by itself.

When we say that the range is lazy, this means that the increment and
decrement operation of its iterators executes in linear time rather
than constant time. And of course there is no random access. The
laziness refers to the fact that incrementing a path iterator requires
finding the next slash ('/') character in the underlying URL.

> I am still not convinced that the containers that maintain these invariants
> should be lazy. That still seems weird to me. If they own the data, and
> are regular types, they should probably be eager.

Here is where I am a little lost. When you say "the containers that
maintain these invariants" are you referring to `segments` or `url`?
Because `segments` does not actually implement any of the business
logic required to modify the path. All of that is delegated to private
implementation details of `url` (or more correctly: `url_base`).

Perhaps when you say "if they own the data", the term "they" refers to
the `url`? Even in that case, laziness is required, because this
library only stores URLs in their serialized form. There were earlier
designs which did it differently but it became apparent very quickly
that the tradeoffs were not favorable.

I'm not quite sure what "eager" means in this context.

> The larger issue to me is that they have a subset of the expected STL API.

Right, well we implemented as much STL-conforming API as possible
under the constraint that the URL is stored in its serialized form.
Really I consider myself as more of an explorer than a designer,
because once we made the design choice that the URL would be stored
serialized, the remainder of the API design and implementation was
more of an exercise in discovering what the consequences of that
design choice would be and how familiar to standard containers we
could make the interfaces become.

Matching the STL API would require giving up one or more things that
we currently have. This is possible, but we leave it up to the user to
make this decision (by copying the data into a new std container).

> That's kind of my point. You have a parsing minilib that is useful
> for URL parsing, but not *general use*. If that's the case, I think
> you should present it as that, and not a general use parsing lib.

Yes you are right about this, it is not for general use. It is
specifically designed for implementing the ABNF grammars found in
protocol-related RFCs such as rfc3986 which defines URL grammars used
in Boost.URL, non-well-known schemes, HTTP messages, HTTP fields,
Websocket fields. For example consider this grammar (from rfc7230)

Transfer-Encoding = 1#transfer-coding
transfer-coding = "chunked" ; Section 4.1
/ "compress" ; Section 4.2.1
/ "deflate" ; Section 4.2.2
/ "gzip" ; Section 4.2.3
/ transfer-extension
transfer-extension = token *( OWS ";" OWS transfer-parameter )
transfer-parameter = token BWS "=" BWS ( token / quoted-string )

A downstream library like not-yet-proposed-for-boost.HTTP.Proto could
use this minilib thusly:

constexpr auto transfer_encoding_rule = list_rule( transfer_coding_rule, 1 );

<https://github.com/CPPAlliance/http_proto/blob/f2382d8eab8be2e9d6e6e14c5502d90ccf55e95f/include/boost/http_proto/rfc/transfer_encoding_rule.hpp#L117>

There's a lot going on here behind the scenes. HTTP defines the
list-rule which is a comma separate sequence of elements where due to
legacy reasons you can have extra unnecessary comments and whitespace
anywhere between the elements. In ABNF the list-rule is denoted by the
hash character in the Transfer-Encoding grammar above ( "one or more
of transfer-coding" )

Boost.URL provides the lazy range which allows the downstream library
to express the list-rule as a ForwardRange of transfer_coding. This
allows the caller to iterate the list elements in the
Transfer-Encoding value without allocating memory for each element.
There is a recurring theme here - I use lazy ranges to defer memory
allocation :) This goes back to Beast which offers a crude and quite
frankly inelegant set of lazy parsing primitives. I took that concept
and formalized it in Boost.URL and used the principle to let users
opt-in to interpreting the path and query as ranges of segments and
params respectively.

Now this is a downstream library so you might wonder what this has to
do with URLs. Well, Boost.URL is designed to handle ALL URLs. This
includes the well-known hierarchical schemes like http and file but
also opaque schemes, of which there are uncountably many as often
these schemes are private or unpublished. However, the library can't
possibly know how to decompose URLs into the parts defined by these
schemes. In order to do this, a user has to write a parsing component
which understands the scheme.

We will use the mailto scheme as an example. First let me point out
that ALL URLs which use the mailto scheme, are still URLs. They follow
the generic syntax, and Boost.URL is capable of parsing them with no
problem - since it can parse all URLs no matter the scheme.

But users who want to do a deep-dive into the mailto scheme can't be
satisfied merely with parsing a mailto URL. They want it decomposed,
and obviously Boost.URL can't do that in the general case because
every scheme is different. Here is the syntax of a mailto URI:

mailtoURI = "mailto:" [ to ] [ hfields ]
to = addr-spec *("," addr-spec )
hfields = "?" hfield *( "&" hfield )
hfield = hfname "=" hfvalue
hfname = *qchar
hfvalue = *qchar
addr-spec = local-part "@" domain
local-part = dot-atom-text / quoted-string
domain = dot-atom-text / "[" *dtext-no-obs "]"
dtext-no-obs = %d33-90 / ; Printable US-ASCII
%d94-126 ; characters not including
; "[", "]", or "\"
qchar = unreserved / pct-encoded / some-delims
some-delims = "!" / "$" / "'" / "(" / ")" / "*"
/ "+" / "," / ";" / ":" / "@"

To begin, a user might write this function:

result< url_view > parse_mailto_uri( string_view s );

This is easy to implement at first because all mailto URIs are URLs.
We might start with this:

result< url_view > rv
parse_mailto_uri( string_view s )
{
auto rv = parse_uri( s );
if( ! rv )
return rv.error();
if( ! grammar:ci_is_equal( rv->scheme(), "mailto" ) )
return error::scheme_mismatch;
...
return *rv;
}

This is a good start but it is unsatisfying, because we are getting
the "to" fields in the path part of the URL, and Boost.URL doesn't
know how to split up the recipients of the mailto since they are comma
separated and not slash separated. Remember though, that this is still
a valid URL and that Boost.URL can represent it.

So now we want to implement this grammar:

to = addr-spec *("," addr-spec )

If you expand the addr-spec ABNF rule you will see that it has
unreserved character sets, percent-encoding possibilities, and quoted
strings. I can't get into all this here (perhaps it would make a good
example for a contributor) but you might start like this:

constexpr auto addr_spec_rule = grammar::tuple_rule(
local_part_rule, squelch(grammar::delim_rule('@')), domain_rule );

then you would continue to define each of those rules, and eventually
you would be able to 1. validate that a particular mailto URL matches
the scheme, and 2. decompose the elements of the mailto URL based on
the requirements of the scheme itself.

The idea here is to incubate grammar/ in URL, as more downstream
libraries get field experience with it, and then propose it as its own
library. I'm hoping to see people implement custom schemes, but maybe
that's wishful thinking.

Phew...

Thanks

Alan de Freitas via Boost

unread,
Aug 23, 2022, 1:08:24 PM8/23/22
to bo...@lists.boost.org, Alan de Freitas
>
> Ok, I'm convinced. I am still not convinced that the containers that
> maintain these invariants should be lazy. That still seems weird to

me. If they own the data, and are regular types, they should probably
> be eager. The larger issue to me is that they have a subset of the
> expected STL API.


Yes. If I understand the discussion correctly, the container that maintains
the invariant is not lazy but the documentation could probably be better at
explaining some other things.
It's the urls::url that maintains the invariants and owns the data in

url u( "ldap:local:userdb/8675309" );
u.remove_scheme();
(u == "local%3a/8675309");

The urls::segment and urls::params views are just references to part of
this urls::url. The invariants belong to the url.
They are helpers to operate on the urls::url segments and parameters at an
individual level, while the underlying urls::url adjusts itself to keep the
invariants:

url u( "" );
u.set_path("//index.htm");
(u == "/.//index.htm");

url u( "" );
u.is_absolute(true);
u.segments().append({"", "index.htm"});
(u == "/.//index.htm");

So the eager urls::url is the one maintaining the invariants, but the
segment or parameter views are a better way to operate on the URL when
dealing with individual subcomponents.
Thus, the functions are not exactly a subset of the STL because the
operations are ensuring the invariants of the underlying url remain valid.
(The names "segments" and "params" might be a little confusing because it
sounds like it owns the data. Some other review has recommended
mutable_segments_view/mutable_params_view.)

The views are useful because it gives us a single contiguous string with
the complete URL at the end.
This buffer is always ready to be used by another application. All
consecutive URL components are also always immediately available as a
contiguous buffer.

boost::asio::connect(sock, r.resolve(u.encoded_host));

or

boost::beast::http::request<http::string_body> req{http::verb::get, target,
version};
req.set(http::field::host, u.encoded_host());

Besides performance reasons, this contiguous representation is quite useful
because we don't have to reassemble the URLs or combinations of components
in another string to be able to use it.
They can go straight in a buffer sequence. The documentation should
probably highlight that design a little more: the idea that we try to keep
things in a single contiguous buffer and why.

An isolated container owning only the segments or parameters wouldn't be
very useful because it wouldn't do much more than a std::vector or a
std::map would.
These containers would maintain all the invariants we need in decoded
segments and params and some other function could then percent-encode the
data in these containers to paste that into a url.
But what the library is attempting to achieve is avoiding this process.

--
Alan Freitas
https://github.com/alandefreitas

Zach Laine via Boost

unread,
Aug 24, 2022, 10:59:57 AM8/24/22
to bo...@lists.boost.org, Zach Laine
This seems to be the heart of the matter. I expect params to be a
real flat_map or map or sorted vec, and it's not. If my expectation
is wrong because of the lack of a _view suffix in the name, that's
fine -- but please add that to make things clear. However, the _view
suffix in std C++ is used exclusively for immutable types, so maybe
something else is indicated, like _facade or some variation of span.

Zach

Vinnie Falco via Boost

unread,
Aug 24, 2022, 11:37:43 AM8/24/22
to bo...@lists.boost.org, Vinnie Falco
On Wed, Aug 24, 2022 at 7:59 AM Zach Laine via Boost
<bo...@lists.boost.org> wrote:
> This seems to be the heart of the matter. I expect params to be a
> real flat_map or map or sorted vec, and it's not.

Yeah, I see what you mean. "params" sounds like its a first-class
object, but it isn't.

> If my expectation is wrong because of the lack of a _view suffix in
> the name, that's fine -- but please add that to make things clear.

There are 8 containers which need names:

segments
segments_view
segments_encoded
segments_encoded_view
params
params_view
params_encoded
params_encoded_view

We prefix the type name with the type of element "segment" or "param",
so that it sorts nicely. We could have went with "encoded_segments"
which also makes sense but then they don't group together as well and
it hurts discoverability in the docs (Ramey Rule).

The _encoded suffix is self-explanatory. We chose the suffix _view to
distinguish from the read-only containers from the mutating ones.
Someone suggested putting "const" and "mutable" in the name but in my
opinion that is ugly:

const_segments_view
mutable_segments_view
const_encoded_segments_view
mutable_encoded_segments_view
const_params_view
mutable_params_view
const_params_view
mutable_params_view

> However, the _view suffix in std C++ is used exclusively for immutable
> types, so maybe something else is indicated, like _facade or some
> variation of span.

As if we didn't have a hard enough time coming up with 8 names that
have synergy and aesthetics :) How about the suffix _ref?

segments_ref
segments_cref
segments_encoded_ref
segments_encoded_cref
params_ref
params_cref
params_encoded_ref
params_encoded_cref

Or we could mix _ref and _view?

segments_ref
segments_view
segments_encoded_ref
segments_encoded_view
params_ref
params_view
params_encoded_ref
params_encoded_view

We did the best we could with the current names and I know they aren't
perfect. We're more than happy to hear feedback on whether any of
these alternatives are resonating with folks, or if someone wants to
propose a new set of 8 identifier names.

Thanks

Zach Laine via Boost

unread,
Aug 24, 2022, 1:16:24 PM8/24/22
to Vinnie Falco, Zach Laine, bo...@lists.boost.org
For this case, I definitely prefer ref to view.

Zach

Vinnie Falco via Boost

unread,
Aug 24, 2022, 3:02:40 PM8/24/22
to Zach Laine, Vinnie Falco, bo...@lists.boost.org
On Wed, Aug 24, 2022 at 10:16 AM Zach Laine
<whatwasth...@gmail.com> wrote:
> For this case, I definitely prefer ref to view.

\phew... progress :)

So, to be explicit - does everyone agree that this set of container
names is currently the best alternative thus far?

segments_ref
segments_view
segments_encoded_ref
segments_encoded_view
params_ref
params_view
params_encoded_ref
params_encoded_view

Zach Laine via Boost

unread,
Aug 24, 2022, 3:23:42 PM8/24/22
to Vinnie Falco, Zach Laine, bo...@lists.boost.org
Ship it.

Zach

Seth via Boost

unread,
Aug 24, 2022, 5:44:43 PM8/24/22
to Boost List, Seth
On Wed, Aug 24, 2022, at 9:02 PM, Vinnie Falco via Boost wrote:
> On Wed, Aug 24, 2022 at 10:16 AM Zach Laine
> <whatwasth...@gmail.com> wrote:
>> For this case, I definitely prefer ref to view.
>
> \phew... progress :)
>
> So, to be explicit - does everyone agree that this set of container
> names is currently the best alternative thus far?

Even though this is solliciting to become design-by-committee (there will never be full consensus) I will go on record that I think the `_ref`/`_cref` suffixes are reasonable to me. At least they're relatively short and consistent (ref/cref is a closer pair than mutable/const)

In my code I would ironically be tempted to use auto more to hide the crufty names because I know they're views anyways. Which is probably why I don't mind the old names. In terms of minimizing surprise these names do a better job.

Vinnie Falco via Boost

unread,
Aug 24, 2022, 7:42:49 PM8/24/22
to bo...@lists.boost.org, Vinnie Falco
On Wed, Aug 24, 2022 at 2:44 PM Seth via Boost <bo...@lists.boost.org> wrote:
> Even though this is solliciting to become design-by-committee (there
> will never be full consensus) I will go on record that I think the `_ref`/`_cref`
> suffixes are reasonable to me. At least they're relatively short and consistent
> (ref/cref is a closer pair than mutable/const)

I prefer to think of it as "focus group sampling." This is on the table now:

params_view
params_const_view
params_encoded_view
params_encoded_const_view
segments_view
segments_const_view
segments_encoded_view
segments_encoded_const_view

Alan brings up a good point that while these are longer they are more
memorable and it leaves no doubt as to meaning.

Thanks
Reply all
Reply to author
Forward
0 new messages