Parsing Numbers

Olaf van der Spek

May 18, 2015, 2:34:07 PM
to std-pr...@isocpp.org
Let's get the party started.

What have we got?

We've got functions like strtol and stoi which take a const char* or std::string and return a number. 

long strtol(const char*, char **str_end, int base);
int  stoi(const std::string&, std::size_t* pos = 0, int base = 10);

What do we want?

Input should not be required to be null terminated, so string_view seems like a suitable input type.
Error detection should be simpler, but not everyone is a fan of exceptions. 

And IMO skipping spaces should not be part of the parse function.
There's also the question of what to do when not the entire input can be parsed. Return an error or not.


So, what about this one?

optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);

An alternative could be:

error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);
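
To make the first signature concrete, here is a rough sketch. It rests on assumptions not settled above: it uses std::optional and std::from_chars (a C++17 facility) as the backend, *pos receives the number of characters consumed (as with std::stoi), and the name parse_number is made up for illustration.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical sketch of: optional<T> parse(string_view, size_t* pos, int base).
// Assumption: *pos receives the number of characters consumed.
template <class T>
std::optional<T> parse_number(std::string_view s, std::size_t* pos = nullptr, int base = 10) {
    T value{};
    const char* first = s.data();
    const char* last = first + s.size();
    // from_chars needs no null terminator and does not skip leading whitespace.
    auto [ptr, ec] = std::from_chars(first, last, value, base);
    if (ec != std::errc{}) {
        return std::nullopt;  // no digits, or value out of range for T
    }
    if (pos) {
        *pos = static_cast<std::size_t>(ptr - first);
    }
    return value;
}

inline void example() {
    std::size_t pos = 0;
    auto n = parse_number<int>("123abc", &pos);
    assert(n && *n == 123 && pos == 3);  // trailing "abc" is left for the caller
    assert(!parse_number<int>(" 42"));   // leading space is not skipped
}
```

In this sketch trailing characters are not an error; whether they should be is exactly the open question raised above.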



Vicente J. Botet Escriba

May 18, 2015, 3:52:19 PM
to std-pr...@isocpp.org
On 18/05/15 20:34, Olaf van der Spek wrote:
> Let's get the party started.
>
> What have we got?
>
> We've got functions like strtol and stoi which take a const char* or
> std::string and return a number.
>
> long strtol(const char*, char **str_end, int base);
> int stoi(const std::string&, std::size_t* pos = 0, int base = 10);
>
> What do we want?
>
> Input should not be required to be null terminated, so string_view
> seems like a suitable input type.
> Error detection should be simpler, but not everyone is a fan of
> exceptions.
>
> And IMO skipping spaces should not be part of the parse function.
> There's also the question of what to do when not the entire input can
> be parsed. Return an error or not.
>
>
> So, what about this one?
>
> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);

Or

expected<T,error_code> parse(string_view, std::size_t* pos = 0, int base = 10);
>
> An alternative could be:
>
> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);
>
>
No please.

Vicente

Jeffrey Yasskin

May 18, 2015, 4:11:39 PM
to std-pr...@isocpp.org
Woot!

On Mon, May 18, 2015 at 11:34 AM, Olaf van der Spek
<olafv...@gmail.com> wrote:
> Let's get the party started.
>
> What have we got?
>
> We've got functions like strtol and stoi which take a const char* or
> std::string and return a number.
>
> long strtol(const char*, char **str_end, int base);
> int stoi(const std::string&, std::size_t* pos = 0, int base = 10);
>
> What do we want?
>
> Input should not be required to be null terminated, so string_view seems
> like a suitable input type.
> Error detection should be simpler, but not everyone is a fan of exceptions.

Also, errors in parsing aren't generally exceptional.

> And IMO skipping spaces should not be part of the parse function.

+1. We should have a way to skip a string_view past spaces, but I
agree that normal parsing should probably insist that the number
appear at the start of the string.

> There's also the question of what to do when not the entire input can be
> parsed. Return an error or not.

I believe "not", so that these functions can be used in parsing larger formats.

> So, what about this one?
>
> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
>
> An alternative could be:
>
> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);

I assume *pos gets the last position that was part of the parsed number?

In the paper that proposes this, it'd be good to see examples of
parsing code using each of the possible interfaces. That'll help
produce a more informed decision than just looking at the interfaces
abstractly.

Jeffrey

Jeffrey Yasskin

May 18, 2015, 4:12:22 PM
to std-pr...@isocpp.org
Er, the position one after that.
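
For reference, this is how the existing std::stoi reports it: *pos receives the number of characters processed, which is the index one past the last character of the number. A small sketch (the helper name is made up):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns what std::stoi stores in *pos for the given input.
// std::stoi throws std::invalid_argument or std::out_of_range on failure.
inline std::size_t chars_consumed_by_stoi(const std::string& s) {
    std::size_t pos = 0;
    (void)std::stoi(s, &pos);
    return pos;
}

inline void example() {
    assert(chars_consumed_by_stoi("123abc") == 3);  // index of 'a'
    assert(chars_consumed_by_stoi("  7;") == 3);    // skipped whitespace counts as processed
}
```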

Jens Maurer

May 18, 2015, 5:08:11 PM
to std-pr...@isocpp.org
On 05/18/2015 08:34 PM, Olaf van der Spek wrote:
> So, what about this one?
>
> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
>
> An alternative could be:
>
> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);

I would appreciate it if there were a compile-time choice for base 2,
base 8, base 10, base 16, not (only) a facility with a runtime parameter.

Jens

Magnus Fromreide

May 18, 2015, 5:16:39 PM
to std-pr...@isocpp.org
On Mon, May 18, 2015 at 11:34:07AM -0700, Olaf van der Spek wrote:
> Let's get the party started.
>
> What have we got?
>
> We've got functions like strtol and stoi which take a const char* or
> std::string and return a number.
>
> long strtol(const char*, char **str_end, int base);
> int stoi(const std::string&, std::size_t* pos = 0, int base = 10);
>
> What do we want?
>
> Input should not be required to be null terminated, so string_view seems
> like a suitable input type.

I think iterators/ranges are a better input type. Why should we require that
the input is consecutive?

> Error detection should be simpler, but not everyone is a fan of exceptions.
>
> And IMO skipping spaces should not be part of the parse function.

Here I agree fully - skipping spaces only prevents the use of the parse
function in contexts where space skipping shouldn't happen.

> There's also the question of what to do when not the entire input can be
> parsed. Return an error or not.

Return the result and an iterator that refers to the input position after
the number?

>
> So, what about this one?
>
> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
>
> An alternative could be:
>
> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);

Here I agree with Vicente, expected is a good return type.

expected<pair<T, Iterator>, error_code>
parse(Iterator, Iterator, int base = 10);

And yes, the content of the expected shouldn't be a pair but rather a more
descriptive type.
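
A rough sketch of the iterator-based shape, to show it working on non-contiguous input. Everything here is an assumption for illustration: parse_result is a made-up descriptive type standing in for expected<pair<T, Iterator>, error_code>, and only unsigned decimal numbers are handled.

```cpp
#include <cassert>
#include <limits>
#include <list>
#include <system_error>

// Hypothetical result type: the value, the position after it, and an error code.
template <class T, class Iterator>
struct parse_result {
    T value;
    Iterator next;
    std::errc ec;  // std::errc{} means success
};

// Parses an unsigned decimal number from any forward iterator range.
// No whitespace skipping, no sign, no base parameter in this sketch.
template <class T, class Iterator>
parse_result<T, Iterator> parse_unsigned(Iterator first, Iterator last) {
    static_assert(std::numeric_limits<T>::is_integer && !std::numeric_limits<T>::is_signed,
                  "sketch handles unsigned integers only");
    T value = 0;
    Iterator it = first;
    for (; it != last && *it >= '0' && *it <= '9'; ++it) {
        const T digit = static_cast<T>(*it - '0');
        // value * 10 + digit would overflow exactly when this holds.
        if (value > (std::numeric_limits<T>::max() - digit) / 10) {
            return {T{}, it, std::errc::result_out_of_range};
        }
        value = static_cast<T>(value * 10 + digit);
    }
    if (it == first) {
        return {T{}, it, std::errc::invalid_argument};  // no digits at all
    }
    return {value, it, std::errc{}};
}

inline void example() {
    std::list<char> in{'4', '2', 'x'};  // deliberately non-contiguous storage
    auto r = parse_unsigned<unsigned>(in.begin(), in.end());
    assert(r.ec == std::errc{} && r.value == 42u && *r.next == 'x');
}
```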

/MF

Matthew Fioravante

May 18, 2015, 5:19:19 PM
to std-pr...@isocpp.org


On Monday, May 18, 2015 at 2:34:07 PM UTC-4, Olaf van der Spek wrote:
> Let's get the party started.
>
> What have we got?
>
> We've got functions like strtol and stoi which take a const char* or std::string and return a number.
>
> long strtol(const char*, char **str_end, int base);
> int  stoi(const std::string&, std::size_t* pos = 0, int base = 10);
>
> What do we want?
>
> Input should not be required to be null terminated, so string_view seems like a suitable input type.
> Error detection should be simpler, but not everyone is a fan of exceptions.

Exceptions must be optional. For high availability systems that should not crash one often is forced to disable exceptions.

> And IMO skipping spaces should not be part of the parse function.

I agree with this very strongly. Let users write wrappers if they want whitespace handling. By default not processing white space is more efficient. It's also easier and more natural to add parsing rules rather than trying to awkwardly back out of the defaults.

> There's also the question of what to do when not the entire input can be parsed. Return an error or not.

For maximum efficiency the base implementation must allow a non-empty tail string and also return it. If the tail string is handled by an out parameter, we could have 2 overloads:

modern_return_thing<T> parse(string_view&& tail, string_view s); // Parses a T from s. Sets tail to the end of the string.
modern_return_thing<T> parse(string_view s); // Parses a T from s. Error if there are extra characters after the parsed string.
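
As a rough sketch of how the two overloads could relate, under assumptions of my own: std::optional stands in for modern_return_thing, std::from_chars (C++17) is the backend, the tail is taken by lvalue reference for simplicity, and the names parse_tail and parse_all are made up.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Parses a T from the start of s and stores the unparsed remainder in tail.
// On failure tail is left untouched.
template <class T>
std::optional<T> parse_tail(std::string_view& tail, std::string_view s) {
    T value{};
    const char* first = s.data();
    auto [ptr, ec] = std::from_chars(first, first + s.size(), value);
    if (ec != std::errc{}) {
        return std::nullopt;
    }
    tail = s.substr(static_cast<std::size_t>(ptr - first));
    return value;
}

// Parses a T from s; any trailing characters are an error.
template <class T>
std::optional<T> parse_all(std::string_view s) {
    std::string_view tail;
    auto result = parse_tail<T>(tail, s);
    if (result && !tail.empty()) {
        return std::nullopt;
    }
    return result;
}

inline void example() {
    std::string_view tail;
    auto a = parse_tail<int>(tail, "12,34");
    assert(a && *a == 12 && tail == ",34");
    assert(!parse_all<int>("12,34"));  // strict overload rejects the ",34"
}
```

The strict overload is a thin wrapper over the tail-returning one, which suggests the tail-returning form is the real building block.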


We may also want to think generically. Can clients easily implement efficient std::parse<T> routines for their own user defined types?

Matthew Woehlke

May 18, 2015, 5:40:24 PM
to std-pr...@isocpp.org
On 2015-05-18 14:34, Olaf van der Spek wrote:
> Let's get the party started.

That ship has sailed long ago :-). Please make sure you are up to speed
on the previous discussion on this topic.

> So, what about this one?
>
> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
>
> An alternative could be:
>
> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);

Neither. As Vicente pointed out, use std::expected. The original
discussion on this topic was the source of std::expected in the first
place; it would be rather disingenuous to not use it.

(Also... please spell invalid pointers as "nullptr" :-).)

> There's also the question of what to do when not the entire input can be
> parsed. Return an error or not.

I'll lean toward "not", but the user needs this information, either to
add that assertion themselves, or because after a number is consumed
they need to know where to resume parsing. So at minimum we need to know
how much was parsed.

You didn't specify, but I am guessing that maybe that's what you meant
by 'pos'? I would strongly consider however returning instead a new
string_view with the text that was not consumed. (If needed/useful, the
size_t* flavor can be a convenience overload, or the other way around.
It should be possible to implement one in terms of the other with
minimal overhead.)

Actually, I agree with the other Matthew Fioravante's suggestion of
mutating the input string_view / iterators. (Maybe we should just
support this like 'parse(in, &in)' and making sure that is efficient.)

--
Matthew

Nicol Bolas

May 18, 2015, 5:57:03 PM
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net


On Monday, May 18, 2015 at 5:40:24 PM UTC-4, Matthew Woehlke wrote:
> On 2015-05-18 14:34, Olaf van der Spek wrote:
> > Let's get the party started.
>
> That ship has sailed long ago :-). Please make sure you are up to speed
> on the previous discussion on this topic.
>
> > So, what about this one?
> >
> > optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
> >
> > An alternative could be:
> >
> > error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);
>
> Neither. As Vicente pointed out, use std::expected. The original
> discussion on this topic was the source of std::expected in the first
> place; it would be rather disingenuous to not use it.
>
> (Also... please spell invalid pointers as "nullptr" :-).)

Before making this proposal dependent on `expected`, we should find out where the committee is with that. I don't much care for `expected`, but if it will get the error codes people off our API backs, I'll take it.

I just want to see the "sto*" functions get string_view ASAP. We shouldn't wait on `expected` just to accomplish that.

Jeffrey Yasskin

May 18, 2015, 6:09:39 PM
to std-pr...@isocpp.org
On Mon, May 18, 2015 at 2:40 PM, Matthew Woehlke
<mw_t...@users.sourceforge.net> wrote:
> Actually, I agree with the other Matthew Fioravante's suggestion of
> mutating the input string_view / iterators. (Maybe we should just
> support this like 'parse(in, &in)' and making sure that is efficient.)

This one gathered an objection at
https://groups.google.com/a/isocpp.org/d/msg/std-proposals/Hs1s2329FCo/dl9N2GnXfxQJ,
that the potential aliasing between the const char* and the
string_view itself can cause problems. I suspect any problems can be
fixed with a careful implementation, but I'm not certain, so it'd be a
good thing for the proposal to show tests of.

Jeffrey

Olaf van der Spek

May 18, 2015, 6:11:46 PM
to std-pr...@isocpp.org
2015-05-19 0:09 GMT+02:00 'Jeffrey Yasskin' via ISO C++ Standard -
Future Proposals <std-pr...@isocpp.org>:
I don't get that one, isn't aliasing only an issue in the presence of writes?


--
Olaf

Olaf van der Spek

May 18, 2015, 6:16:27 PM
to std-pr...@isocpp.org
2015-05-18 22:11 GMT+02:00 'Jeffrey Yasskin' via ISO C++ Standard -
Future Proposals <std-pr...@isocpp.org>:
>> There's also the question of what to do when not the entire input can be
>> parsed. Return an error or not.
>
> I believe "not", so that these functions can be used in parsing larger formats.

Right, perhaps both should be supported. Require entire input to be
parsed if !pos


>> So, what about this one?
>>
>> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
>>
>> An alternative could be:
>>
>> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);
>
> I assume *pos gets the last position that was part of the parsed number?

Yes, I copied it from http://en.cppreference.com/w/cpp/string/basic_string/stol


> In the paper that proposes this, it'd be good to see examples of
> parsing code using each of the possible interfaces. That'll help
> produce a more informed decision than just looking at the interfaces
> abstractly.

The disadvantage of not having a variant with the number as an out
parameter is the requirement of specifying the type.
I've got a lot of cases where the number is stored into a pre-existing variable.
Then again, perhaps this is just a basic building block on which other
variants can be built.




--
Olaf

Olaf van der Spek

May 18, 2015, 6:17:09 PM
to std-pr...@isocpp.org
Why? Performance?
Shouldn't the compiler take care of that?


--
Olaf

Jeffrey Yasskin

May 18, 2015, 6:29:10 PM
to std-pr...@isocpp.org
I could imagine an implementation like:

double parse(string_view& s) {
  for (; !s.empty(); s.remove_prefix(1)) {
    whatever(s.front());
  }
}

in which the compiler has to assume the s.front() is modified by the
s.remove_prefix(1) call. On the other hand, it's easy enough for the
implementation to act more like:

double parse(string_view& s) {
  auto b = s.begin(), e = s.end();
  for (; b != e; ++b) {
    whatever(*b);
  }
  s = string_view(b, e);
}

which does seem to avoid any aliasing concerns. That's why it'd be
good for the proposal to include some tests.
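
A runnable variant of the second shape, as a sketch only: it parses unsigned decimal digits instead of a double, ignores overflow, and the name parse_digits is made up. The point it illustrates is that the loop works on local pointers and s is written exactly once, at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Consumes leading decimal digits from s and advances s past them.
// Overflow is not checked in this sketch.
inline unsigned parse_digits(std::string_view& s) {
    const char* b = s.data();
    const char* e = b + s.size();
    unsigned value = 0;
    for (; b != e && *b >= '0' && *b <= '9'; ++b) {
        value = value * 10 + static_cast<unsigned>(*b - '0');
    }
    s = std::string_view(b, static_cast<std::size_t>(e - b));  // single store to s
    return value;
}

inline void example() {
    std::string_view in = "123 rest";
    assert(parse_digits(in) == 123u);
    assert(in == " rest");
}
```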

On Mon, May 18, 2015 at 3:16 PM, Olaf van der Spek <olafv...@gmail.com> wrote:
> 2015-05-18 22:11 GMT+02:00 'Jeffrey Yasskin' via ISO C++ Standard -
>> In the paper that proposes this, it'd be good to see examples of
>> parsing code using each of the possible interfaces. That'll help
>> produce a more informed decision than just looking at the interfaces
>> abstractly.
>
> The disadvantage of not having a variant with the number as an out
> parameter is the requirement of specifying the type.
> I've got a lot of cases where the number is stored into a pre-existing variable.
> Then again, perhaps this is just a basic building block on which other
> variants can be built.

Sure, and for primitive types it doesn't really matter, but for
symmetry when people write their own parsing functions, it'd be nice
to let folks parse non-default-constructible types. Code examples will
help make either case.

If the implementer writes special cases for a few bases, and arranges
their function boundaries suitably, the compiler can inline just the
top level in order to delete the inactive options. This is something
to test though.

Jeffrey

Jeffrey Yasskin

May 18, 2015, 6:30:38 PM
to std-pr...@isocpp.org
On Mon, May 18, 2015 at 2:57 PM, Nicol Bolas <jmck...@gmail.com> wrote:
> I just want to see the "sto*" functions get string_view ASAP. We shouldn't
> wait on `expected` just to accomplish that.

That seems like a straightforward paper to get through the committee.
If you write it, I'll try to prevent the group from creeping its
scope.

Jeffrey

Vicente J. Botet Escriba

May 18, 2015, 7:48:26 PM
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
On 18/05/15 23:57, Nicol Bolas wrote:
> On Monday, May 18, 2015 at 5:40:24 PM UTC-4, Matthew Woehlke wrote:
> > On 2015-05-18 14:34, Olaf van der Spek wrote:
> > > Let's get the party started.
> >
> > That ship has sailed long ago :-). Please make sure you are up to speed
> > on the previous discussion on this topic.
> >
> > > So, what about this one?
> > >
> > > optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
> > >
> > > An alternative could be:
> > >
> > > error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);
> >
> > Neither. As Vicente pointed out, use std::expected. The original
> > discussion on this topic was the source of std::expected in the first
> > place; it would be rather disingenuous to not use it.
> >
> > (Also... please spell invalid pointers as "nullptr" :-).)
>
> Before making this proposal dependent on `expected`, we should find out where the committee is with that. I don't much care for `expected`, but if it will get the error codes people off our API backs, I'll take it.

It was designed just for that :) but it seems this is not enough. Having two (or more) alternative ways to report errors is not something the standard should support, as it would be confusing for users. In other words, as some say, "this is not the C++ way".

> I just want to see the "sto*" functions get string_view ASAP. We shouldn't wait on `expected` just to accomplish that.

You are right, expected should not (and could not) block any proposal. expected is not in the standard, so we couldn't use it.
We must continue making and adopting new non-controversial proposals with a well-defined scope, using the available abstractions. Maybe the OP could alternatively use the FileSystem TS way, as today we don't have anything else.

Again, my apologies for going outside the scope of the OP's proposal.

Vicente

Nicol Bolas

May 18, 2015, 9:01:29 PM
to std-pr...@isocpp.org

Actually, that paper already exists (N4015), with a revision (N4109). According to Vicente Escriba, who both wrote those proposals and replied here, scope creep has already started. Also, N4109 dates from a while ago (almost a year now), with no follow-up paper from any discussions based on it.

I think at this point, we should focus on getting the core feature: parsing strings via string_view. And though many malign the FileSystem TS solution, it is prior art on dealing with error codes in standard library C++.

Jeffrey Yasskin

May 18, 2015, 9:17:29 PM
to std-pr...@isocpp.org
You said you want the "sto*" functions to get string_view. You should
write the paper proposing that the "sto*" functions get string_view.
You should not write the expected<> paper.

Thiago Macieira

May 19, 2015, 2:49:32 AM
to std-pr...@isocpp.org
On Monday 18 May 2015 23:16:35 Magnus Fromreide wrote:
> > long strtol(const char*, char **str_end, int base);
> > int stoi(const std::string&, std::size_t* pos = 0, int base = 10);
> >
> > What do we want?
> >
> > Input should not be required to be null terminated, so string_view seems
> > like a suitable input type.
>
> I think iterators/ranges are a better input type. Why should we require that
> the input is consecutive?

Because at least one instantiation of those functions is not inline and will
require contiguous memory. Also note the requirement for char, not char16_t,
char32_t, MyChar, etc. I'd even say that wchar_t need not be included.

All the unsigned instantiations can be implemented inline by calling an
out-of-line uintmax_t instantiation and the same goes for the signed versions
and intmax_t. Or, depending on the behaviour, all instantiations call a backend
that takes (u)intmax_t min and max of the type in question.
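
A sketch of that last layout for the signed case. The names are made up, std::from_chars (C++17) is used only as a convenient backend, and requiring the whole input to be consumed is an assumption of the sketch, not part of the argument above.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>
#include <system_error>

namespace detail {
// In a real library this single function would be compiled out of line; it is
// marked inline here only so that the sketch fits in one file.
inline std::optional<std::intmax_t> parse_signed(std::string_view s,
                                                 std::intmax_t min, std::intmax_t max) {
    std::intmax_t value = 0;
    const char* first = s.data();
    const char* last = first + s.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) return std::nullopt;  // bad or partial input
    if (value < min || value > max) return std::nullopt;        // out of range for the target type
    return value;
}
}  // namespace detail

// Thin inline wrapper: one instantiation per signed type, all sharing the backend.
template <class T>
std::optional<T> parse_int(std::string_view s) {
    static_assert(std::numeric_limits<T>::is_integer && std::numeric_limits<T>::is_signed,
                  "sketch covers signed integer types only");
    auto r = detail::parse_signed(s, std::numeric_limits<T>::min(),
                                  std::numeric_limits<T>::max());
    if (!r) return std::nullopt;
    return static_cast<T>(*r);
}

inline void example() {
    assert(*parse_int<std::int8_t>("127") == 127);
    assert(!parse_int<std::int8_t>("128"));  // fits intmax_t, but not int8_t
}
```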

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Jens Maurer

May 19, 2015, 2:53:39 AM
to std-pr...@isocpp.org
Yes. In most use cases, you know at compile-time which base you
expect. I believe an appropriate interface should allow me to
convey that compile-time information to the callee.

Even if you don't know at compile-time, it's likely that you're
making a decision before the call (e.g. when parsing C-style
number prefixes such as 0x, 0b, 0). There is no need for the
callee to decide again.

> Shouldn't the compiler take care of that?

These parser functions are not templates, and at least for
the base-10 case, the required code is large, so I anticipate
these functions to be out-of-line.

http://www.cesura17.net/~will/Professional/Research/Papers/howtoread.pdf

From a modularity standpoint, the compile-time-base functions
can be invoked from the generic one with no unanticipated overhead,
but not vice versa.
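
To illustrate that layering, a minimal sketch with made-up names: parse_base<Base> fixes the base at compile time, and the generic parse_any_base dispatches to it. It assumes an ASCII-like character set, requires the whole input to be digits, and omits overflow checks.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Value of a digit character, or -1 if it is not alphanumeric.
// Assumes 'a'..'z' and 'A'..'Z' are contiguous, as in ASCII.
constexpr int digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'z') return c - 'a' + 10;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
    return -1;
}

// Base known at compile time.
template <int Base>
std::optional<unsigned> parse_base(std::string_view s) {
    static_assert(Base >= 2 && Base <= 36, "unsupported base");
    if (s.empty()) return std::nullopt;
    unsigned value = 0;
    for (char c : s) {
        const int d = digit_value(c);
        if (d < 0 || d >= Base) return std::nullopt;
        value = value * Base + static_cast<unsigned>(d);
    }
    return value;
}

// Generic entry point with a runtime base, built on the compile-time ones.
inline std::optional<unsigned> parse_any_base(std::string_view s, int base) {
    switch (base) {
        case 2:  return parse_base<2>(s);
        case 8:  return parse_base<8>(s);
        case 10: return parse_base<10>(s);
        case 16: return parse_base<16>(s);
        default: return std::nullopt;  // other bases left out of this sketch
    }
}

inline void example() {
    assert(*parse_base<16>("ff") == 255u);
    assert(*parse_any_base("101", 2) == 5u);
}
```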

Jens

Thiago Macieira

May 19, 2015, 2:55:41 AM
to std-pr...@isocpp.org
On Monday 18 May 2015 14:19:19 Matthew Fioravante wrote:
> We may also want to think generically. Can clients easily implement
> efficient std::parse<T> routines for their own user defined types?

I don't think so.

Integer parsing may be acceptable, but it's borderline already -- the FreeBSD
implementations of strtoull and strtoll are around 60 lines of C code, the
glibc implementation is 300 lines and supports parsing of locale-specified
number grouping. But floating point parsing can't reasonably be done inline.

Olaf van der Spek

May 19, 2015, 6:34:26 AM
to std-pr...@isocpp.org
2015-05-19 8:53 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>> Why? Performance?
>
> Yes. In most use cases, you know at compile-time which base you
> expect. I believe an appropriate interface should allow me to
> convey that compile-time information to the callee.

IMO the optimizer should take care of that.

> Even if you don't know at compile-time, it's likely that you're
> making a decision before the call (e.g. when parsing C-style
> number prefixes such as 0x, 0b, 0). There is no need for the
> callee to decide again.

Parsing the prefix is currently part of the number parser..

>> Shouldn't the compiler take care of that?
>
> These parser functions are not templates, and at least for
> the base-10 case, the required code is large, so I anticipate
> these functions to be out-of-line.

If the base is known at compile-time it shouldn't matter (to a good
compiler) whether it's a template or not.

> http://www.cesura17.net/~will/Professional/Research/Papers/howtoread.pdf

ETIMEOUT

> From a modularity standpoint, the compile-time-base functions
> can be invoked from the generic one with no unanticipated overhead,
> but not vice versa.

True


--
Olaf

Matthew Woehlke

May 19, 2015, 10:42:39 AM
to std-pr...@isocpp.org
On 2015-05-18 18:09, 'Jeffrey Yasskin' via ISO C++ Standard - Future
Proposals wrote:
> On Mon, May 18, 2015 at 2:40 PM, Matthew Woehlke wrote:
>> Actually, I agree with the other Matthew Fioravante's suggestion of
>> mutating the input string_view / iterators. (Maybe we should just
>> support this like 'parse(in, &in)' and making sure that is efficient.)
>
> This one gathered an objection at
> https://groups.google.com/a/isocpp.org/d/msg/std-proposals/Hs1s2329FCo/dl9N2GnXfxQJ,
> that the potential aliasing between the const char* and the
> string_view itself can cause problems.

That sounds like a QOI issue. It also sounds like something that vendors
should be able to fix / work around.

> I suspect any problems can be fixed with a careful implementation,

Yes. Don't modern compilers have a way to tell the compiler to assume
that two entities do not alias?

--
Matthew

Thiago Macieira

May 19, 2015, 11:29:27 AM
to std-pr...@isocpp.org
On Tuesday 19 May 2015 10:42:23 Matthew Woehlke wrote:
> > I suspect any problems can be fixed with a careful implementation,
>
> Yes. Don't modern compilers have a way to tell the compiler to assume
> that two entities do not alias?

C99 restricted pointers may help here.

Jeffrey Yasskin

May 19, 2015, 11:37:58 AM
to std-pr...@isocpp.org
On Tue, May 19, 2015 at 3:34 AM, Olaf van der Spek <olafv...@gmail.com> wrote:
> 2015-05-19 8:53 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>>> Why? Performance?
>>
>> Yes. In most use cases, you know at compile-time which base you
>> expect. I believe an appropriate interface should allow me to
>> convey that compile-time information to the callee.
>
> IMO the optimizer should take care of that.

Your opinion (and Jens's opinion) don't actually matter here. Whoever
writes the paper should investigate what optimizers actually do and
describe it in the paper.

Jeffrey

Matthew Fioravante

May 19, 2015, 12:33:32 PM
to std-pr...@isocpp.org


On Tuesday, May 19, 2015 at 11:37:58 AM UTC-4, Jeffrey Yasskin wrote:
> On Tue, May 19, 2015 at 3:34 AM, Olaf van der Spek <olafv...@gmail.com> wrote:
> > 2015-05-19 8:53 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
> >>> Why? Performance?
> >>
> >> Yes.  In most use cases, you know at compile-time which base you
> >> expect.  I believe an appropriate interface should allow me to
> >> convey that compile-time information to the callee.

If the base were a template parameter, then we would need to provide a second version which accepts base as a runtime parameter for those use cases when we don't know base at compile time. Having one version of parse with template parameters and another with normal runtime parameters seems like it would be bloated and confusing.

> > IMO the optimizer should take care of that.
>
> Your opinion (and Jens's opinion) don't actually matter here. Whoever
> writes the paper should investigate what optimizers actually do and
> describe it in the paper.

Performance of these routines is paramount. I've had big data processing applications that ended up spending a large portion of their runtime inside strtod().

Almost every time, we know the base at compile time so the parse routine can just be an inline wrapper which calls the specific optimized versions if they exist.

namespace detail {
  ret<int> optimized_parse_int_base10(string_view s); // defined out of line
  ret<int> optimized_parse_int_base16(string_view s); // defined out of line
  ret<int> generic_parse_int(string_view s, int base); // defined out of line
}

// The default argument (base = 10) goes on the primary template's declaration;
// an explicit specialization may not repeat it.
template <>
inline ret<int> parse<int>(string_view s, int base) {
  if (base == 10) {
    return detail::optimized_parse_int_base10(s);
  }
  if (base == 0x10) {
    return detail::optimized_parse_int_base16(s);
  }
  return detail::generic_parse_int(s, base);
}


Removing the check for base==N and calling the correct underlying function directly is very easy for any modern optimizer. My guess is that 99% of use cases will specify the base as a compile time constant so this optimization is probably a good bet for performance in a generic interface.

Cases where this approach *might* actually be a performance loser:
- If the client is parsing numbers of many different bases, it might be faster to just always use the generic routine even if it does a bit more computation than the optimized methods. The reason being that the one generic routine will stay hot in the Icache while many different optimized routines can occupy different cache lines, all of which will be less hot and therefore may be swapped back out to main memory more often. I've worked with several projects where code size differences like this have a measurable impact on performance.
- If the client is passing a base whose value is truly determined at runtime (the compiler can't prove anything about it). In this scenario, it's possible that the overhead of checking for the optimized values of base could be more expensive than the gains from the optimized routines themselves. Even if it still makes sense to check for and call optimized routines, it may be better to do the dispatch out of line to avoid code bloat at all of the call sites (maybe the inliner can figure this out?).

It's easy enough to measure and figure out these performance concerns for a single project but I'm not sure how you'd do it for a generic interface intended to be used by the whole world.

If the overhead of dispatching turns out to be a real concern, then 2 functions can be introduced.
ret<T> parse_generic<T>(string_view s, int base); // parses a T from s
ret<T> parse<T>(string_view, int base); // same as parse_generic(), but may do inline dispatch to optimized routines for specific values of base

But of course now we have a rather expensive feature creep for an arguably dubious performance concern. If we had constexpr overloading or some method of constexpr parameter detection, the implementation could choose to do the inline dispatch only when base is constexpr (if that makes sense to do).

Aren't all of these points QOI issues anyway?

Thiago Macieira

May 19, 2015, 12:37:23 PM
to std-pr...@isocpp.org
On Tuesday 19 May 2015 09:33:32 Matthew Fioravante wrote:
> > >> Yes. In most use cases, you know at compile-time which base you
> > >> expect. I believe an appropriate interface should allow me to
> > >> convey that compile-time information to the callee.
>
> If the base was a template parameter, then we would need to provide a
> second version which accepts base as a runtime parameter for those use
> cases when we don't know base at compile time. Having one version of parse
> with template parameters and another with normal runtime parameters seems
> like it would be bloated and confusing.

Agreed. Jens's use-case seems to be easily solved with std::bind,
std::function and/or lambdas.
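
For instance, a lambda can pin the base once so that call sites no longer pass it. This is only a sketch: std::stoi stands in for the proposed parse function, and the name parse_hex is made up.

```cpp
#include <cassert>
#include <string>

// The base is fixed in the lambda; callers just pass the text.
const auto parse_hex = [](const std::string& s) {
    return std::stoi(s, nullptr, 16);  // throws on invalid or out-of-range input
};

inline void example() {
    assert(parse_hex("ff") == 255);
    assert(parse_hex("0x1A") == 26);  // with base 16, stoi accepts an optional 0x prefix
}
```

Whether the optimizer then specializes the callee for base 16 is a separate question that would need measuring.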

Matthew Fioravante

May 19, 2015, 10:41:05 PM
to std-pr...@isocpp.org
If out parameters are to be used and their presence (or lack thereof) changes behavior, then we should pass them by rvalue-reference.

See example:

ret<T> parse(string_view&& tail, string_view s); // Parses a T from s. Sets tail to the end of the string.
ret<T> parse(string_view s); // Parses a T from s. Error if there are extra characters after the parsed string.

auto a = parse<int>(tail, str); //Parse an int from str, storing the tail of the string in tail
auto b = parse<int>(str); //Parse an int from str,is an error if str has trailing characters after the value
auto c = parse<int>(string_view{},str); //Parse an int from str and ignore any characters after the value

The last example would not be possible if tail was passed by lvalue reference.

Using an rvalue reference allows this kind of mistake to pass the compiler however.

string_view readStr();
//Oops
auto a = parse<int>(readStr(),tail);


If the API worked this way:
ret<T> parse(string_view& tail, string_view s); // Parses a T from s. Sets tail to the end of the string.
ret<T> parse(string_view s); // Parses a T from s and ignores the remaining characters.

Then there is no reason to ever pass an rvalue tail so an lvalue reference is probably more appropriate as it makes the above bug a compiler error.
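
The binding rules behind this can be checked at compile time. The sketch below uses declarations only, with int standing in for ret<T> and made-up names. One caveat it makes visible: a plain string_view&& parameter (not a forwarding reference) accepts temporaries but not a named tail variable, so the rvalue-reference design would need a forwarding reference or an extra overload to support both call styles.

```cpp
#include <string_view>
#include <type_traits>

// Declarations only; int stands in for ret<T>.
int parse_rref(std::string_view&& tail, std::string_view s);
int parse_lref(std::string_view& tail, std::string_view s);

using sv = std::string_view;

// A temporary binds to the rvalue-reference tail, so parse(readStr(), tail) compiles...
static_assert(std::is_invocable_v<decltype(&parse_rref), sv, sv&>);
// ...but a named tail variable does not bind to string_view&&.
static_assert(!std::is_invocable_v<decltype(&parse_rref), sv&, sv>);

// With an lvalue reference, a named tail works and a temporary is rejected.
static_assert(std::is_invocable_v<decltype(&parse_lref), sv&, sv>);
static_assert(!std::is_invocable_v<decltype(&parse_lref), sv, sv>);
```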


Olaf van der Spek

May 20, 2015, 8:07:45 AM
to std-pr...@isocpp.org
2015-05-18 23:16 GMT+02:00 Magnus Fromreide <ma...@lysator.liu.se>:
> I think iterators/ranges are a better input type. Why should we require that
> the input is consecutive?

Simplicity?
I don't know, I presume iostream internals support parsing from
iterators so it might be good to expose that.

On the other hand the input is often contiguous and existing functions
mostly require contiguous input.



--
Olaf

Olaf van der Spek

May 20, 2015, 8:08:25 AM
to std-pr...@isocpp.org
The proposed function is like: parse(string_view in, size_t* pos =
nullptr, int base = 10);
Nothing is passed by lvalue (or rvalue) reference.

Using string_view* tail might make sense in certain use cases but not
in others..

Let's have a look at some real-world use cases.

m_downloaded = to_int(value);
m_uploaded = to_int(value);
auto u = find_user(to_int(req_["u"]));
int y0 = to_int(req_["y0"]);

In most cases the entire input has to be a valid number, so pos / tail
isn't needed.
In most cases the number is used to initialize a new variable, so an
out parameter wouldn't work for the output number.
0 is used to signal a failed conversion.

I think my use cases are best served by an atoi-like convenience
function, so the actual interface of the real parse function wouldn't
matter.
Perhaps atoi/strtol-like convenience functions should be proposed too.


--
Olaf

Vicente J. Botet Escriba

May 20, 2015, 10:35:40 AM5/20/15
to std-pr...@isocpp.org
Le 18/05/15 20:34, Olaf van der Spek a écrit :
> Let's get the party started.
>
> What have we got?
>
> We've got functions like strtol and stoi which take a const char* or
> std::string and return a number.
>
> long strtol(const char*, char **str_end, int base);
> int stoi(const std::string&, std::size_t* pos = 0, int base = 10);
>
> What do we want?
>
> Input should not be required to be null terminated, so string_view
> seems like a suitable input type.
Maybe instead of using string_view the function should work on any model
of a given ParserState concept. What operations does a parser need
from this ParserState?
> Error detection should be simpler, but not everyone is a fan of
> exceptions.
We can ask ourselves which interface we would have if exceptions were
acceptable. Leaving the base aside, should we have

template< class T, class ParserState>
T parse(ParserState& state);

or

template< class T, class ParserState>
pair<T, ParserState> parse(ParserState state);

I would prefer the second, as it composes better, but I'm biased by the
functional approach.

An alternative is to define a parser object and then call a member function extract on it:

parser p(...);
p.extract<int>();

Would uniform syntax allow the following?

extract<int>(p);

>
> And IMO skipping spaces should not be part of the parse function.
> There's also the question of what to do when not the entire input can
> be parsed. Return an error or not.
Parsing is not the same as matching. When you parse, you want to parse
several things, so you need a new ParserState to which the parse
function can be applied again. When you match, the whole input
must be consumed.
I suggest using a different function for this use case.
>
>
> So, what about this one?
>
> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
>
> An alternative could be:
>
> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);
>
>
>
Sorry, but what is the pos parameter used for?

When exceptions cannot be used, we need to add an output for an error code.

The Filesystem TS adds it as an output parameter:

template< class T, class ParserState>
pair<T, ParserState> parse(error_code&, ParserState state, int base = 10);


But why add error_code as an out parameter? Is it because we are
used to it (C-style)? Is it because it is more efficient? What is wrong with

template< class T, class ParserState>
tuple<T, error_code, ParserState> parse(ParserState state, int base = 10);
?

If we end up adopting variant, would the following be a better interface?

template< class T, class ParserState>
pair<variant<T, error_code>, ParserState> parse(ParserState state, int base = 10);

Would a more specific type be preferable?

Vicente

Vicente J. Botet Escriba

May 20, 2015, 10:44:48 AM5/20/15
to std-pr...@isocpp.org
Le 20/05/15 14:08, Olaf van der Spek a écrit :
How would you report errors?
>
> Using string_view* tail might make sense in certain use cases but not
> in others..
>
> Let's have a look at some real-world use cases.
>
> m_downloaded = to_int(value);
> m_uploaded = to_int(value);
> auto u = find_user(to_int(req_["u"]));
> int y0 = to_int(req_["y0"]);
>
> In most cases the entire input has to be a valid number, so pos / tail
> isn't needed.
> In most cases the number is used to initialize a new variable, so an
> out parameter wouldn't work for the output number.
Agreed.
> 0 is used to signal a failed conversion.
Ugh, I see now how you report errors. I'm almost sure this wouldn't
be accepted by the C++ standard committee.
>
> I think my use cases are best served by an atoi-like convenience
> function, so the actual interface of the real parse function wouldn't
> matter.
> Perhaps atoi/strtol-like convenience functions should be proposed too.
>
>
You lost me here.

Vicente

Matthew Fioravante

May 20, 2015, 11:27:54 AM5/20/15
to std-pr...@isocpp.org

Even if the base implementation uses iterators, string_view (or a string_view compatible) overloads should be provided.

Because of that, we could just do a string_view proposal for now and if later someone wants to propose an iterator version the string_view functions could be just reimplemented using the iterator version. It would be really great if this library could make it into the standard at the same time string_view does. Not having string_view compatible number parsing is a major hole in the string_view API.
 

On Wednesday, May 20, 2015 at 8:08:25 AM UTC-4, Olaf van der Spek wrote:
The proposed function is like: parse(string_view in, size_t* pos =
nullptr, int base = 10);
Nothing is passed by lvalue (or rvalue).

Using string_view* tail might make sense in certain use cases but not
in others..

Not sure I follow you here.
Do you have any specific use case where a size_t* (or char*) is more appropriate? A string_view object is a better tail string representation because it is a range with invariants built in from the start.

Consider the following example, which would be a lot more clumsy if you used a size_t* instead of a string_view*.

string_view s = "1 2 3";
auto a = parse<int>(s, &s);
s.pop_front();
auto b = parse<int>(s, &s);
s.pop_front();
auto c = parse<int>(s, &s);

Using a size_t* just means I'm probably going to be constructing a string_view on the next line and now I have to awkwardly deal with stuff like { in.begin() + pos, in.end() } without making mistakes.

The tail out param could also be passed as a string_view*, with nullptr signifying "allow tail strings but throw them away".
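A minimal sketch of the string_view* tail, assuming C++17: parse_int and sum_numbers are hypothetical names, std::from_chars is only one possible way to do the conversion, and remove_prefix(1) plays the role of pop_front() from the example above.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical sketch: parse a base-10 int from the front of s.
// If tail is non-null it receives the unparsed remainder; a null tail
// means "allow trailing characters but throw them away".
inline std::optional<int> parse_int(std::string_view s, std::string_view* tail = nullptr) {
    int value = 0;
    const char* first = s.data();
    const char* last = first + s.size();
    auto res = std::from_chars(first, last, value);
    if (res.ec != std::errc{}) return std::nullopt;  // no digits, or out of range
    if (tail) *tail = s.substr(static_cast<std::size_t>(res.ptr - first));
    return value;
}

// Sums space-separated integers such as "1 2 3"; stops at the first
// token that is not a number.
inline int sum_numbers(std::string_view s) {
    int total = 0;
    while (auto v = parse_int(s, &s)) {
        total += *v;
        if (!s.empty() && s.front() == ' ') s.remove_prefix(1);  // skip one separator
    }
    return total;
}
```

Passing &s as the tail means no offset arithmetic appears at the call site, which is the point made above about size_t*.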


Let's have a look at some real-world use cases.

m_downloaded = to_int(value);
m_uploaded = to_int(value);
auto u = find_user(to_int(req_["u"]));
int y0 = to_int(req_["y0"]);

In most cases the entire input has to be a valid number, so pos / tail
isn't needed.

I agree that retrieving the tail should be optional and not get in the way if you don't need it. This means that if tail is an out parameter we should provide an overload without it. If tail is part of the returned object, it should be easy to extract the value and error_code and ignore the tail.
 
In most cases the number is used to initialize a new variable,

This is a major problem with expected, optional, etc. It's really nice to just be able to say auto x = parse<int>(s); and have x be an int.

 
so an
out parameter wouldn't work for the output number.
0 is used to signal a failed conversion.

Generally, using 0 or any other perfectly valid value to signal failure is a really bad idea. It only makes sense when your use case is "Parse the value or give me some default if it fails". In that case, 0 may not be the default value you want, so it's better to be able to actually specify it.

This use case is why I invented parse_or() for my libraries. Regardless of how the base version is implemented, this wrapper can still be added for atoi()-like convenience.

template <typename T>
T parse_or(string_view s, T val_if_error);

auto x = parse_or(s, 1.0);
//decltype(x) == double

Notice how also because of the default, you don't even need to specify the type T. It can be deduced from the constant used to initialize val_if_error.

Using such a method emphasizes the value you are returning and that you don't care about errors.
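A sketch of how such a parse_or might look, under assumptions: it is restricted to integer types and built on C++17 std::from_chars, because floating-point from_chars may not be available in every standard library. The 1.0 example above would need a floating-point backend that this sketch does not provide.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <type_traits>

// Hypothetical sketch of parse_or for integer types only.
// T is deduced from the fallback value, so callers never spell it out.
template <typename T>
T parse_or(std::string_view s, T val_if_error) {
    static_assert(std::is_integral_v<T>, "this sketch only handles integers");
    T value{};
    const char* first = s.data();
    const char* last = first + s.size();
    auto res = std::from_chars(first, last, value);
    // Bad input, overflow, and trailing characters are all treated as errors.
    if (res.ec != std::errc{} || res.ptr != last) return val_if_error;
    return value;
}
```

With this shape, parse_or(s, 0) yields an int and parse_or(s, 0L) yields a long, without naming either type at the call site.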

 

I think my use cases are best served by an atoi-like convenience
function, so the actual interface of the real parse function wouldn't
matter.
Perhaps atoi/strtol-like convenience functions should be proposed too.

I have this use case often as well, but many other times I'm more focused on correctness and want to report errors to users if they incorrectly specify a number. Both idioms should be easily supported. My parse_or() (or something similar) should serve the atoi() use case with minimal complexity. Do you see any situation where it would not?

I think what is needed is for the paper to survey several possible interfaces and show examples of all of the common use cases with each one, carefully pointing out the pros and cons. Only when we have all of the data in front of us, with the trade-offs in safety, convenience, and readability, can we choose one.

If you're planning to actually write this paper and want to collaborate, I'd be happy to help with this.

What are all of the use cases for an API like this? Here is what I can think of:
- Parse a number and throw an exception if error occurs
- Parse a number and let me handle errors without exceptions
- Parse a number and give me a default value if an error occurs
- Parse a number and give me the tail string so I can continue parsing the next object.
- Initialize a variable using parse(), preferably using auto / type deduction.
- Assign to a pre-existing variable using parse(), preferably with type deduction.

Matthew Woehlke

May 20, 2015, 11:43:11 AM5/20/15
to std-pr...@isocpp.org
On 2015-05-20 11:27, Matthew Fioravante wrote:
> Generally, using 0 or any other perfectly valid value to signal failure is
> a really bad idea. It only makes sense when your use case is "Parse the
> value or give me some default if it fails". In that case, 0 may not be the
> default value you want so its better to be able to actually specify it.

Doesn't expected already handle this?

> This is a major problem with expected, optional, etc... Its really nice to
> just be able to say auto x = parse<int>(s); and have x be an int.

auto x = parse<int>(s).value_or(0);

Slightly more typing, but less API complexity. And you could trivially
write an inline parse_or that wraps this.

> What are all of the use cases for an API like this? Here is what I can
> think of:
> - Parse a number and throw an exception if error occurs
> - Parse a number and let me handle errors without exceptions
> - Parse a number and give me a default value if an error occurs

Returning an expected covers all of these. If you want the exception,
just blindly take the value of the expected; it will throw if there is
no value. If you want to check for errors without exceptions, check if
the expected contains a value. If you don't care about errors, use
expected::value_or (or whatever it's called).

> - Parse a number and give me the tail string so I can continue parsing the
> next object.

This is only possible with a default value? With a default value, an
inline wrapper can trivially provide it.

> - Initialize a variable using parse(), preferably using auto / type
> deduction.

Also possible with a trivial inline wrapper. (I don't think we need to
worry about having parse() write to the existing variable directly;
we're talking about numeric types; assigning the return value to an
existing variable is not expensive.)

> - Assign to a pre-existing variable using parse(), preferably with type
> deduction.

It seems to me that 'expected<T> parse(string_view in, size_t* pos, int
base)', or similar with some tweaking of how we communicate what was
consumed, covers all of the above use cases (with the addition of some
trivial inline convenience wrappers built on top of it).
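The three error-handling styles can be sketched on top of a single primitive. This is an illustration only: std::optional (C++17) stands in for expected, and parse_int, parse_or_throw, try_parse and parse_or are hypothetical names.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical primitive: the whole input must be a base-10 int.
inline std::optional<int> parse_int(std::string_view s) {
    int value = 0;
    const char* first = s.data();
    const char* last = first + s.size();
    auto res = std::from_chars(first, last, value);
    if (res.ec != std::errc{} || res.ptr != last) return std::nullopt;
    return value;
}

// 1. Throwing: value() throws std::bad_optional_access when empty.
inline int parse_or_throw(std::string_view s) { return parse_int(s).value(); }

// 2. Non-throwing: test the result before using it.
inline bool try_parse(std::string_view s, int& out) {
    auto r = parse_int(s);
    if (r) out = *r;
    return r.has_value();
}

// 3. Defaulting: value_or supplies the fallback.
inline int parse_or(std::string_view s, int fallback) {
    return parse_int(s).value_or(fallback);
}
```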

--
Matthew

Vicente J. Botet Escriba

May 20, 2015, 12:29:51 PM5/20/15
to std-pr...@isocpp.org
Le 20/05/15 04:41, Matthew Fioravante a écrit :
If out parameters are to be used and their presence (or lack thereof) changes behavior, then we should pass them by rvalue-reference.

See example:

ret<T> parse(string_view&& tail, string_view s); // Parses a T from s. Sets tail to the end of the string.
ret<T> parse(string_view s); // Parses a T from s. Error if there are extra characters after the parsed string.


What about returning the tail as well? Why pass it by rvalue or lvalue reference at all?

pair<ret<T>, string_view> parse(string_view s); // Parses a T from s. Returns the tail.
ret<T> parse_exact(string_view s); // Parses a T from s. Error if there are extra characters after the parsed string.

auto a = parse<int>(tail, str); //Parse an int from str, storing the tail of the string in tail
auto b = parse<int>(str); //Parse an int from str,is an error if str has trailing characters after the value
auto c = parse<int>(string_view{},str); //Parse an int from str and ignore any characters after the value

tie(a, str) = parse<int>(str);
auto b = parse_exact<int>(str); // Parse an int from str; it is an error if str has trailing characters after the value
tie(c, ignore) = parse<int>(str); // Parse an int from str and ignore the characters after the value

Not yet there, but I would find it quite readable to assign multiple values and even declare them in situ as well

{auto a, str} = parse<int>(str);

{auto c, ignore} = parse<int>(str);

The advantage of a functional interface (no out/in-out parameters)

parse<T> :: S -> (T, S) 

is that this function can be composed using fold-like functions that consume from the parser state and fold, e.g., over a list.

The tail out parameter makes the signature too specific

parse<T> :: S&, S -> T 

that composes less easily.
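A sketch of the functional shape, assuming C++17: parse_int and parse_list are hypothetical names, std::optional represents a missing value, and the loop in parse_list is a hand-written fold that threads the state through each step.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical sketch of parse<T> :: S -> (T, S) for T = int.
// Returns the parsed value (if any) together with the remaining input.
inline std::pair<std::optional<int>, std::string_view> parse_int(std::string_view s) {
    int value = 0;
    const char* first = s.data();
    auto res = std::from_chars(first, first + s.size(), value);
    if (res.ec != std::errc{}) return {std::nullopt, s};  // state unchanged on failure
    return {value, s.substr(static_cast<std::size_t>(res.ptr - first))};
}

// Folds the parser over comma-separated input such as "1,2,3".
inline std::vector<int> parse_list(std::string_view s) {
    std::vector<int> out;
    for (;;) {
        auto [value, rest] = parse_int(s);
        if (!value) break;
        out.push_back(*value);
        s = rest;
        if (s.empty() || s.front() != ',') break;
        s.remove_prefix(1);  // consume the separator
    }
    return out;
}
```

Because parse_int has no out parameters, each step depends only on the state returned by the previous one.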


The last example would not be possible if tail was passed by lvalue reference.

Using an rvalue reference allows this kind of mistake to pass the compiler however.

string_view readStr();
//Oops
auto a = parse<int>(readStr(),tail);

Having only input parameters makes this error impossible, as tie expects a reference:

string_view readStr();
//Oops
tie(a, readStr()) = parse<int>(tail); // compile error



If the API worked this way:
ret<T> parse(string_view& tail, string_view s); // Parses a T from s. Sets tail to the end of the string.
ret<T> parse(string_view s); // Parses a T from s and ignores the remaining characters.

Then there is no reason to ever pass an rvalue tail so an lvalue reference is probably more appropriate as it makes the above bug a compiler error.

Yes, the rvalue tail parameter doesn't seem like a good idea.

Vicente

Vicente J. Botet Escriba

May 20, 2015, 12:30:12 PM5/20/15
to std-pr...@isocpp.org
Le 20/05/15 17:27, Matthew Fioravante a écrit :
>
>
> If you're planning to actually write this paper and want to
> collaborate, I'd be happy to help with this.
>
> What are all of the use cases for an API like this? Here is what I can
> think of:
> - Parse a number and throw an exception if error occurs
Do we really want this?
> - Parse a number and let me handle errors without exceptions
Agreed.
> - Parse a number and give me a default value if an error occurs
Agreed with the use case, not sure we need something specific..
> - Parse a number and give me the tail string so I can continue parsing
> the next object.
This seems to me the normal case.
> - Initialize a variable using parse(), preferably using auto / type
> deduction.
Do you mean a variable of type int e.g.?

auto i = parse<int>(p); // decltype(i) int
> - Assign to a pre-existing variable using parse(), preferably with
> type deduction.
You are requesting that the following should be valid

i = parse<int>(p);

?
I don't think I understand what you want here.

Vicente

Olaf van der Spek

May 20, 2015, 2:16:14 PM5/20/15
to std-pr...@isocpp.org
2015-05-20 16:35 GMT+02:00 Vicente J. Botet Escriba <vicent...@wanadoo.fr>:
> Maybe instead of using string_view the function should work on any model of
> a given ParserState Concept. What are the operations a parser need from this
> ParserState?

Maybe, what's a ParserState?

>>
>> Error detection should be simpler, but not everyone is a fan of
>> exceptions.
>
> We can question ourselves which interface we will had if exceptions were
> acceptable.

We could but what's the point?

> Parsing is not the same than matching. When you parse, you want to parse
> several things, so you need a new state of the ParserState on which apply
> again the function parser. When you want to match the whole input must be
> consumed.
> I suggest to use a different function for this use case.

What should such a function be named?

> Sorry, but what the pos parameter is used for?

http://en.cppreference.com/w/cpp/string/basic_string/stol




--
Olaf

Matthew Fioravante

May 20, 2015, 2:28:14 PM5/20/15
to std-pr...@isocpp.org


On Wednesday, May 20, 2015 at 12:29:51 PM UTC-4, Vicente J. Botet Escriba wrote:
Not yet there, but I would find it quite readable to assign multiple values and even declare them in situ as well

{auto a, str} = parse<int>(str);

{auto c, ignore} = parse<int>(str);
It's quite a different proposal, but I think a syntax like this is badly needed. We have tie(), which is only 3 characters, but it's actually 8 characters because in pretty much all of my code I'd be saying std::tie(). Also, the tuple/tie solution is not as well known as it should be as a way to return multiple values. This kind of thing is good enough to be a core language feature IMO.

Maybe not braces, since the current use of braces always introduces a scope and that auto a lives in the outer scope. Maybe [] can be reused?

[ auto a, str ] = parse<int>(str);


 
> is that this function can be composed using fold like functions, that consume from the parser state and folds e.g. on a list.

It would probably be good to have a specific example or 2 in the paper showing the strengths of the functional programming approach.


> Yes, the rvalue tail parameter doesn't seem like a good idea.

The tail can also be included in the returned object and maybe that is the superior approach. If for whatever reason out parameter is deemed the best solution, I tried to outline some concerns as to whether it should be passed by lvalue or rvalue.
 

On Wednesday, May 20, 2015 at 12:30:12 PM UTC-4, Vicente J. Botet Escriba wrote:
Le 20/05/15 17:27, Matthew Fioravante a écrit :
>
>
> If you're planning to actually write this paper and want to
> collaborate, I'd be happy to help with this.
>
> What are all of the use cases for an API like this? Here is what I can
> think of:
> - Parse a number and throw an exception if error occurs
Do we really want this?

I think so, at least as a wrapper if nothing else. The use case of "I'm just going to try to parse a bunch of stuff and bail out if any one fails" is not uncommon.
 
> - Parse a number and let me handle errors without exceptions
Agreed.
> - Parse a number and give me a default value if an error occurs
Agreed with the use case, not sure we need something specific..
> - Parse a number and give me the tail string so I can continue parsing
> the next object.
This seems to me the normal case.
> - Initialize a variable using parse(), preferably using auto / type
> deduction.
Do you mean a variable of type int e.g.?

auto i = parse<int>(p); // decltype(i) int

Yes. The idea is that if we specify the type int in the parse() invocation then we should be able to take advantage of type deduction and skip it in the declaration of the result variable. Using your multiple return values can achieve this goal as well.
 
> - Assign to a pre-existing variable using parse(), preferably with
> type deduction.
You are requesting that the following should be valid

i = parse<int>(p);

?
I believe that I don't understand what do you want here?

Even more nice to have. I mean like this:

template <typename T>
error_code parse(T& val, string_view s);

double x = 1.0;
if (parse(x, s)) { // an error_code converts to true when an error is set
    //handle error
}

Here we did not need to specify the type we want to parse because it's deduced from the out parameter.

If I later change x to a float or int or whatever, the parsing code will update itself to be correct, vs this scenario:

//double x; //old code
int x; //new code
x = parse<double>(s); //oops!

x = parse<decltype(x)>(s); //Workaround, but ugly
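A sketch of the out-parameter form, under assumptions: std::error_code is used as the return type, the conversion is done with C++17 std::from_chars, and trailing characters are reported as invalid_argument. Only integer targets are exercised here, since floating-point from_chars may not be available everywhere.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>

// Hypothetical sketch: the target type is deduced from the out parameter.
// Returns a default (falsy) error_code on success; val is untouched on failure.
template <typename T>
std::error_code parse(T& val, std::string_view s) {
    T tmp{};
    const char* first = s.data();
    const char* last = first + s.size();
    auto res = std::from_chars(first, last, tmp);
    if (res.ec != std::errc{}) return std::make_error_code(res.ec);
    if (res.ptr != last) return std::make_error_code(std::errc::invalid_argument);
    val = tmp;
    return {};
}
```

Changing the declaration of the target from int to long changes the parse along with it, with no type named at the call site.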



Olaf van der Spek

May 20, 2015, 2:28:27 PM5/20/15
to std-pr...@isocpp.org
> On Wednesday, May 20, 2015 at 8:08:25 AM UTC-4, Olaf van der Spek wrote:
> Not sure I follow you here.
> Do you have any specific use case where a size_t* (or char*) is more
> appropriate? A string_view object is a better tail string representation
> because it is a range with invariants built in from the start.
>
> Consider the following example, which would be a lot more clumsy if you used
> a size_t* instead of a string_view*.
>
> string_view s = "1 2 3";
> auto a = parse<int>(s, &s);
> s.pop_front();
> auto b = parse<int>(s, &s);
> s.pop_front();
> auto c = parse<int>(s, &s);
>
> Using a size_t* just means I'm probably going to be constructing a
> string_view on the next line and now I have to awkwardly deal with stuff
> like { in.begin() + pos, in.end() } without making mistakes.

s.drop_front(pos); ? Or was drop/pop_front(N) dropped from string_view?

If one wants the input view to be updated it makes sense to have a
function that takes it by reference though.

I'm just wondering whether a string_view* tail makes sense if the
input isn't a string_view.

> The tail out param could also be passed as a string_view*, which nullptr
> signifying "allow tail strings but throw them away".

I'd go for don't allow tails in that case.

>> 0 is used to signal a failed conversion.
>
>
> Generally, using 0 or any other perfectly valid value to signal failure is a
> really bad idea. It only makes sense when your use case is "Parse the value

Of course. The default could be passed in via an optional parameter.
These functions only make sense if there's such a default.

> I have this use case often as well, but many other times I'm more focused on
> correctness and want to report errors to users if they incorrectly specify a
> number. Both idioms should be easily supported. My parse_or() (or something
> similar) should serve the atoi() use case with minimal complexity. Do you
> see any situation where it would not?

parse_or() seems good.

> I think what is needed is for the paper to survey possible several
> interfaces and show examples of all of the common use cases with each one,
> carefully pointing out the pros and cons. Only when we have all of the data
> in front of us with their trade offs of safety, convenience, and readability
> can we choose one.
>
> If you're planning to actually write this paper and want to collaborate, I'd
> be happy to help with this.

Sounds like a plan.


--
Olaf

Olaf van der Spek

May 20, 2015, 2:49:49 PM5/20/15
to std-pr...@isocpp.org
2015-05-20 17:43 GMT+02:00 Matthew Woehlke <mw_t...@users.sourceforge.net>:
> On 2015-05-20 11:27, Matthew Fioravante wrote:
>> Generally, using 0 or any other perfectly valid value to signal failure is
>> a really bad idea. It only makes sense when your use case is "Parse the
>> value or give me some default if it fails". In that case, 0 may not be the
>> default value you want so its better to be able to actually specify it.
>
> Doesn't expected already handle this?

Does expected exist?
And does it have something like test_and_set()?

What would a parse date (Y-M-D) function look like?

// returning expected or optional
optional<Date> parse_date1(string_view is)
{
    int year;
    int month;
    int day;
    return true
        && test_and_set(year, parse<decltype(year)>(is, &is))
        && parse_separator(is, &is)
        && test_and_set(month, parse<decltype(month)>(is, &is))
        && parse_separator(is, &is)
        && test_and_set(day, parse<decltype(day)>(is, &is))
        ? optional<Date>(Date{ year, month, day })
        : optional<Date>();
}

// returning error_code
optional<Date> parse_date2(string_view is)
{
    int year;
    int month;
    int day;
    return true
        && !parse(year, is, &is)
        && !parse_separator(is, &is)
        && !parse(month, is, &is)
        && !parse_separator(is, &is)
        && !parse(day, is, &is)
        ? optional<Date>(Date{ year, month, day })
        : optional<Date>();
}

One of these functions looks cleaner to me. Did I do something wrong?
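For comparison, a third variant can be sketched in which each step takes the input view by reference and advances it. This is only an illustration under assumptions: Date, parse_int and this parse_separator are hypothetical, std::optional and std::from_chars come from C++17, and no range checks are made on the fields.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

struct Date { int year, month, day; };  // hypothetical; no range validation

// Parses a base-10 int from the front of s and advances s past it.
inline std::optional<int> parse_int(std::string_view& s) {
    int value = 0;
    const char* first = s.data();
    auto res = std::from_chars(first, first + s.size(), value);
    if (res.ec != std::errc{}) return std::nullopt;
    s.remove_prefix(static_cast<std::size_t>(res.ptr - first));
    return value;
}

// Consumes a single '-' from the front of s.
inline bool parse_separator(std::string_view& s) {
    if (s.empty() || s.front() != '-') return false;
    s.remove_prefix(1);
    return true;
}

// Parses "Y-M-D"; the whole input must be consumed.
inline std::optional<Date> parse_date(std::string_view s) {
    auto year = parse_int(s);
    if (!year || !parse_separator(s)) return std::nullopt;
    auto month = parse_int(s);
    if (!month || !parse_separator(s)) return std::nullopt;
    auto day = parse_int(s);
    if (!day || !s.empty()) return std::nullopt;
    return Date{*year, *month, *day};
}
```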

--
Olaf

Matthew Fioravante

May 20, 2015, 3:02:31 PM5/20/15
to std-pr...@isocpp.org


On Wednesday, May 20, 2015 at 2:28:27 PM UTC-4, Olaf van der Spek wrote:
> On Wednesday, May 20, 2015 at 8:08:25 AM UTC-4, Olaf van der Spek wrote:
> Not sure I follow you here.
> Do you have any specific use case where a size_t* (or char*) is more
> appropriate? A string_view object is a better tail string representation
> because it is a range with invariants built in from the start.
>
> Consider the following example, which would be a lot more clumsy if you used
> a size_t* instead of a string_view*.
>
> string_view s = "1 2 3";
> auto a = parse<int>(s, &s);
> s.pop_front();
> auto b = parse<int>(s, &s);
> s.pop_front();
> auto c = parse<int>(s, &s);
>
> Using a size_t* just means I'm probably going to be constructing a
> string_view on the next line and now I have to awkwardly deal with stuff
> like { in.begin() + pos, in.end() } without making mistakes.

s.drop_front(pos); ? Or was drop/pop_front(N) dropped from string_view?

If one wants the input view to be updated it makes sense to have a
function that takes it by reference though.

I think passing s as the tail param is cleaner than using pop_front(n).

If the tail is going to be returned then definitely a string_view is better than just a size_t. We can pass that string_view along directly and it will compose better.
 

I'm just wondering whether a string_view* tail makes sense if the
input isn't a string_view.

Sure it makes sense. Just as you get a string_view when you want to efficiently take a substring of any string type such as std::string. That's all the tail is: a substring of the original.

Also, the input is converted to a string_view, so the interface only knows the input string is string_view compatible, and that's the only guarantee it can provide on the way back out.
 

> The tail out param could also be passed as a string_view*, which nullptr
> signifying "allow tail strings but throw them away".

I'd go for don't allow tails in that case.

Not sure that's a good approach. We completely change the behavior of the function if the tail string pointer happens to be null.
I imagine such an interface could result in very surprising bugs.

I think a better way to require an exact parse is with an overload which doesn't accept a tail. Now the rules are described by the interface instead of the value of the parameters passed to it.

Jens Maurer

May 20, 2015, 3:18:24 PM5/20/15
to std-pr...@isocpp.org
On 05/18/2015 08:34 PM, Olaf van der Spek wrote:
> Let's get the party started.
>
> What have we got?
>
> We've got functions like strtol and stoi which take a const char* or std::string and return a number.
>
> long strtol(const char*, char **str_end, int base);
> int stoi(const std::string&, std::size_t* pos = 0, int base = 10);
>
> What do we want?
>
> Input should not be required to be null terminated, so string_view seems like a suitable input type.
> Error detection should be simpler, but not everyone is a fan of exceptions.
>
> And IMO skipping spaces should not be part of the parse function.
> There's also the question of what to do when not the entire input can be parsed. Return an error or not.
>
>
> So, what about this one?
>
> optional<T> parse(string_view, std::size_t* pos = 0, int base = 10);
>
> An alternative could be:
>
> error_code parse(T&, string_view, std::size_t* pos = 0, int base = 10);


My suggestion:

const char * ret = parse(T& v, const char * first, const char * last, int base, error_code&);

[ret, last[ is the unparsed part of the string. If ret == first, v is not
overwritten and we have an error, otherwise v contains the parsed value.
(Feel free to switch the base vs. error_code& parameters to allow a defaulted base.)


This allows some symmetry with similar output operations:

char * p = output(char * first, char * last, T v);

which outputs "v" into the space provided by [first, last[,
returning the remaining space as [p, last[.

(This doesn't work with string_view, because it's read-only.)
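A sketch of the output side, assuming C++17: output is implemented here with std::to_chars, which is only one possible backend, and write_pair is a hypothetical helper showing how the returned pointer chains into the next call.

```cpp
#include <charconv>
#include <cstddef>
#include <system_error>

// Hypothetical sketch of output(): writes v as base-10 text into [first, last)
// and returns the start of the remaining space, or first if v does not fit.
inline char* output(char* first, char* last, int v) {
    auto res = std::to_chars(first, last, v);
    return res.ec == std::errc{} ? res.ptr : first;
}

// Writes "a,b" into buf and returns the number of characters written
// (0 if the buffer is too small).
inline std::size_t write_pair(char* buf, std::size_t size, int a, int b) {
    char* const last = buf + size;
    char* p = output(buf, last, a);
    if (p == buf || p == last) return 0;  // a did not fit, or no room for ','
    *p++ = ',';
    char* q = output(p, last, b);
    if (q == p) return 0;  // b did not fit
    return static_cast<std::size_t>(q - buf);
}
```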


This does not use arbitrary iterators: It doesn't make a lot of sense
to have "char" buffers that are not (at least) partially contiguous.
std::list<char>? No, thanks.

This does not use or consider locales: That's for iostreams to deal with.


Here's some (totally untested) code that shows a composition: Parse
a comma-separated list of "int"s into a std::vector<int>. It seems
having some space-skipping function and a "parse this expected
sequence of chars" function might be helpful.


#include <vector>

struct error_code;

const char * parse(int&, const char * first, const char * last, int base, error_code& ec);

const char * parse(std::vector<int>& res, const char * first, const char * last, int base, error_code& ec)
{
    res.clear();
    while (first != last) {
        int v;
        const char * p = parse(v, first, last, base, ec);
        if (p == first) // error; ec is already set
            return p;
        res.push_back(v);
        if (p == last)
            break;
        if (*p != ',') { // error
            ec = /* whatever */;
            return p;
        }
        first = p + 1;
    }
    return last; // the whole input was consumed
}
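The scalar overload is only declared above. A possible implementation, as a sketch under assumptions: std::error_code stands in for the forward-declared error_code, and C++17 std::from_chars does the conversion.

```cpp
#include <charconv>
#include <system_error>

// Hypothetical implementation of the scalar overload.
// Returns the first unparsed character; a return value equal to first
// signals an error, in which case v is left untouched and ec is set.
inline const char* parse(int& v, const char* first, const char* last, int base,
                         std::error_code& ec) {
    int tmp = 0;
    auto res = std::from_chars(first, last, tmp, base);
    if (res.ec != std::errc{}) {
        ec = std::make_error_code(res.ec);
        return first;
    }
    v = tmp;
    return res.ptr;
}
```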


Jens

Olaf van der Spek

May 20, 2015, 3:38:32 PM5/20/15
to std-pr...@isocpp.org
2015-05-20 21:18 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
> On 05/18/2015 08:34 PM, Olaf van der Spek wrote:
> My suggestion:
>
> const char * ret = parse(T& v, const char * first, const char * last, int base, error_code&);
>
> If ret == first, v is not
> overwritten and we have an error, otherwise v contains the parsed value.

Your function fails these requirements..

--
Olaf

Miro Knejp

May 20, 2015, 4:19:56 PM5/20/15
to std-pr...@isocpp.org
The Functional Template Library ( https://github.com/beark/ftl ) has an example of monadic parser generators inspired by Haskell.
https://github.com/beark/ftl/blob/master/docs/Parsec-I.md

Now that library also has overloads of operator>>= and others, which are not part of the "turn a string into an int" problem, but with all parsing functions returning parser monads, composition is much easier to do.

It starts off by introducing the monad itself:
template<typename T>
    using parser = ftl::eitherT<error,ftl::function<T(std::istream&)>>;

and a function to execute the actual parser
template<typename T>
    ftl::either<error,T> run(parser<T> p, std::istream& is);

Each parsing function then returns a parser object.
parser<int> parseNatural();

This obviously serves more than a simple "turn a string into an int" but is a prime example of composability that really shines with the combining operators like >> or << etc. It makes things like this easy
parser<std::vector<int>> parseLispList() {
    using namespace ftl;
    return parseChar('(')
        >> parseList()
        << parseChar(')');
}

I thought I'd throw this in just as an example. We had these discussions earlier without any consensus and the only new viewpoint brought in this time is the functional approach mentioned by Vicente J. Whether this is something the standard library can/should follow, or provides the performance people need, or it can solve all the use cases people can come up with and make everyone happy I don't know.

Vicente J. Botet Escriba

May 20, 2015, 6:00:21 PM5/20/15
to std-pr...@isocpp.org
Le 20/05/15 20:49, Olaf van der Spek a écrit :
2015-05-20 17:43 GMT+02:00 Matthew Woehlke <mw_t...@users.sourceforge.net>:
On 2015-05-20 11:27, Matthew Fioravante wrote:
Generally, using 0 or any other perfectly valid value to signal failure is
a really bad idea. It only makes sense when your use case is "Parse the
value or give me some default if it fails". In that case, 0 may not be the
default value you want so its better to be able to actually specify it.
Doesn't expected already handle this?
Does expected exist?
Do you mean an implementation to play with? Yes (https://github.com/ptal/expected)

And does it have something like test_and_set()?
No. What parameters would test_and_set take, and what would its effect be?
Using a tail as an out parameter, it should be something like

expected<Date, error_code> parse_date(string_view is, string_view& out)
{
  return make_date % // fmap
	( parse<int>(is, &is) 	// year
		>> parse_separator(is, &is) ) * 
	( parse<int>(is, &is)  	// month
		>> parse_separator(is, &is) ) *
	  parse<int>(is, &out)	// day
	; 
}
or, using the functional form with an in-out parameter, it could be
expected<Date> parse_date(string_view& is)
{
  return fmap(make_date, 
		mdo( parse<int>(is),	// year
		     parse_separator(is) ), 
		mdo( parse<int>(is),  	// month
		     parse_separator(is) ),
		parse<int>(is)  	// day
	     );
}
There are other possibilities (e.g. using Parser), as Miro pointed out.

Vicente

Vicente J. Botet Escriba

unread,
May 20, 2015, 6:07:32 PM5/20/15
to std-pr...@isocpp.org
Le 20/05/15 20:16, Olaf van der Spek a écrit :
> 2015-05-20 16:35 GMT+02:00 Vicente J. Botet Escriba <vicent...@wanadoo.fr>:
>> Maybe instead of using string_view the function should work on any model of
>> a given ParserState Concept. What are the operations a parser need from this
>> ParserState?
> Maybe, what's a ParserState?
As said just above, it is the Concept of a parser state. string_view could
be a model of this concept.
>
>>> Error detection should be simpler, but not everyone is a fan of
>>> exceptions.
>> We can question ourselves which interface we will had if exceptions were
>> acceptable.
> We could but what's the point?
Interfaces using exceptions compose quite well. The interface that
doesn't use exceptions should be based on it. We just need to see how to
report the error_code: as a result, as an out parameter, or via TLS.
>
>> Parsing is not the same than matching. When you parse, you want to parse
>> several things, so you need a new state of the ParserState on which apply
>> again the function parser. When you want to match the whole input must be
>> consumed.
>> I suggest to use a different function for this use case.
> What should such a function be named?
I would not be against parse_exact.
>
>> Sorry, but what the pos parameter is used for?
> http://en.cppreference.com/w/cpp/string/basic_string/stol
>
>
Oh, I missed this link. Like Matthew, I don't see the point of using it if
there is an output of a ParserState or a string_view.

Vicente

Vicente J. Botet Escriba

unread,
May 20, 2015, 6:28:07 PM5/20/15
to std-pr...@isocpp.org
Le 20/05/15 20:28, Matthew Fioravante a écrit :


On Wednesday, May 20, 2015 at 12:29:51 PM UTC-4, Vicente J. Botet Escriba wrote:
Not yet there, but I would find it quite readable to assign multiple values and even declare them in situ as well

{auto a, str} = parse<int>(str);

{auto c, ignore} = parse<int>(str);
It's quite a different proposal, but I think a syntax like this is badly needed. We have tie(), which is only 3 characters, but it's actually 8 characters because in pretty much all of my code I'd be saying std::tie(). Also, the tuple/tie solution is not that well known as a good way to return multiple values. This kind of thing is good enough to be a core language feature IMO.

Maybe not braces, since the current use of braces always introduces a scope, and that auto a lives in the outer scope. Maybe [] can be reused?

[ auto a, str ] = parse<int>(str);


Forget it, it is out of scope.

 
> is that this function can be composed using fold like functions, that consume from the parser state and folds e.g. on a list.

It would probably be good to have a specific example or 2 in the paper showing the strengths of the functional programming approach.
The FTL library is a good example of what can be done. I have a lot to learn from it.


> Yes, the rvalue tail parameter doesn't seem like a good idea.

The tail can also be included in the returned object and maybe that is the superior approach. If for whatever reason out parameter is deemed the best solution, I tried to outline some concerns as to whether it should be passed by lvalue or rvalue.
Understood.

 
On Wednesday, May 20, 2015 at 12:30:12 PM UTC-4, Vicente J. Botet Escriba wrote:
Le 20/05/15 17:27, Matthew Fioravante a écrit :
>
>
> If you're planning to actually write this paper and want to
> collaborate, I'd be happy to help with this.
>
> What are all of the use cases for an API like this? Here is what I can
> think of:
> - Parse a number and throw an exception if error occurs
Do we really want this?

I think so, at least as a wrapper if nothing else. The use case of "I'm just going to try to parse a bunch of stuff and bail out if any one fails" is not uncommon.
The idiom in this case would be to get the value

auto r = parse(...).value(); // throws if there is no value :)
 
> - Assign to a pre-existing variable using parse(), preferably with
> type deduction.
You are requesting that the following should be valid

i = parse<int>(p);

?
I don't think I understand what you want here.

Even more nice to have. I mean like this:

template <typename T>
error_code parse(T& val, string_view s);

double x = 1.0;
if (parse(x, s)) { // an error_code converts to true when it holds an error
  // handle error
}

Here we did not need to specify the type we want to parse because it's deduced from the out parameter.

If I later change x to a float or int or whatever, the parsing code will update itself to be correct, vs this scenario:

//double x; //old code
int x; //new code
x = parse<double>(s); //oops!

x = parse<decltype(x)>(s); //Workaround, but ugly

I will not buy it, but I have no major problem with it, as this can be built on top of the basic interface. It is just a different syntax.

Vicente

Vicente J. Botet Escriba

unread,
May 20, 2015, 6:41:01 PM5/20/15
to std-pr...@isocpp.org
Le 20/05/15 21:18, Jens Maurer a écrit :
An alternative solution using no raw loops, with FTL [1,2], could be something like

parser<std::vector<int>> parseVectorInt() {
    return curry(cons) 
        % parse<int>()
        * option(whitespace() >> lazy(parseList), std::vector<int>());
}

where 


parser<std::string> whitespace() {
    return many1(oneOf(" \t\r\n"));
}
or using functional notation

parser<std::vector<int>> parseVectorInt() {
    return fmap(curry(cons),
        parse<int>(),
        option(mdo(whitespace(), lazy(parseList)), std::vector<int>()));
}

Vicente

[1] https://github.com/beark/ftl/blob/master/docs/Parsec-I.md
[2] https://github.com/beark/ftl/blob/master/docs/Parsec-II.md


Matthew Woehlke

unread,
May 20, 2015, 6:55:11 PM5/20/15
to std-pr...@isocpp.org
On 2015-05-20 14:28, Matthew Fioravante wrote:
> On Wednesday, May 20, 2015 at 12:29:51 PM UTC-4, Vicente J. Botet Escriba
> wrote:
>
>> Not yet there, but I would find quire readable to assign multiple values
>> and even declare them in situ as well
>>
>> {auto a, str} = parse<int>(str);
>>
>> {auto c, ignore} = parse<int>(str);
>
> Its quite a different proposal, but I think a syntax like this is badly
> needed.

Agreed. Unfortunately it came up before and IIRC did not get very far.

Nit: I think it would be good to allow a declaration 'void' (that is, no
name) to ignore a value. Example:

[auto result, void] = parse<int>(str);

So, each item in the [] list is either an assignable declaration,
assignable expression, or 'void'. (I could continue, but it's off topic
for this thread.)

> Maybe not braces since current use of braces always introduce a scope and
> that auto a lives in the outer scope. Maybe [] can be reused?
>
> [ auto a, str ] = parse<int>(str);

Works for me, FWIW.

--
Matthew

Jeffrey Yasskin

unread,
May 20, 2015, 7:00:03 PM5/20/15
to std-pr...@isocpp.org
There's a significant risk here that if the proposal is too
complicated, nothing will get accepted.

Jens' suggestion has an advantage that it's clearly sufficient and in
line with the rest of the library, even if the interface might not be
as convenient as some other options. Even the interface, though, isn't
too bad when you look at the code using it.

Matthew Fioravante

unread,
May 20, 2015, 7:04:00 PM5/20/15
to std-pr...@isocpp.org


On Wednesday, May 20, 2015 at 6:55:11 PM UTC-4, Matthew Woehlke wrote:
On 2015-05-20 14:28, Matthew Fioravante wrote:
> On Wednesday, May 20, 2015 at 12:29:51 PM UTC-4, Vicente J. Botet Escriba
> wrote:
>
>> Not yet there, but I would find quire readable to assign multiple values
>> and even declare them in situ as well
>>
>> {auto a, str} = parse<int>(str);
>>
>> {auto c, ignore} = parse<int>(str);
>
> Its quite a different proposal, but I think a syntax like this is badly
> needed.

Agreed. Unfortunately it came up before and IIRC did not get very far.

Nit: I think it would be good to allow a declaration 'void' (that is, no
name) to ignore a value. Example:

  [auto result, void] = parse<int>(str);

+1 much better than std::ignore.

But we are off topic! :)

On Wednesday, May 20, 2015 at 7:00:03 PM UTC-4, Jeffrey Yasskin wrote:
There's a significant risk here that if the proposal is too
complicated, nothing will get accepted.

That's my biggest fear as well, and also the reason why the last 2 threads about this died with no action. Even if the standard version is pretty simple, it's very easy to write your favorite flavor of wrapper on top of it.

Matthew Fioravante

unread,
May 20, 2015, 11:18:42 PM5/20/15
to std-pr...@isocpp.org


On Wednesday, May 20, 2015 at 3:18:24 PM UTC-4, Jens Maurer wrote:
My suggestion:

const char * ret = parse(T& v, const char * first, const char * last, int base, error_code&);

What is the value add of using char* pairs and a char* return vs string_view? Symmetry with iterator algorithms?

string_view parse(T& v, string_view s, int base, error_code&);

//If you have char* pointers
char* cb;
char* ce;
error_code ec;
int value;
auto tail = parse(value, {cb,ce}, 10, ec);

 

[ret, last[ is the unparsed part of the string.  If ret == first, v is not
overwritten and we have an error, otherwise v contains the parsed value.
(Feel free to switch the base vs. error_code& parameters to allow a defaulted base.)


You could also add an overload:

template <typename T> inline const char* parse(T& v, const char* b, const char* e, error_code& ec) { return parse(v,b,e,10,ec); }

 

This allows some symmetry with similar output operations:

char * p = output(char * first, char * last, T v);  

which outputs "v" into the space provided by [first, last[,
returning the remaining space as [p, last[.

(This doesn't work with string_view, because it's read-only.)

This kind of API looks like it could be another use case for mstring_view. array_view<char> would also work.

Olaf van der Spek

unread,
May 21, 2015, 1:10:07 AM5/21/15
to std-pr...@isocpp.org
2015-05-21 0:00 GMT+02:00 Vicente J. Botet Escriba <vicent...@wanadoo.fr>:
> Le 20/05/15 20:49, Olaf van der Spek a écrit :
>
> 2015-05-20 17:43 GMT+02:00 Matthew Woehlke <mw_t...@users.sourceforge.net>:
>
> On 2015-05-20 11:27, Matthew Fioravante wrote:
>
> Generally, using 0 or any other perfectly valid value to signal failure is
> a really bad idea. It only makes sense when your use case is "Parse the
> value or give me some default if it fails". In that case, 0 may not be the
> default value you want so its better to be able to actually specify it.
>
> Doesn't expected already handle this?
>
> Does expected exist?
>
> Do you mean an implementation to play with? Yes
> (https://github.com/ptal/expected)

No, I mean accepted by the committee and in a TS.

> And does it have something like test_and_set()?
>
> No. What test_and_set would have as parameters and what would be the effect?

bool test_and_set(T& out, optional<T> opt)
{
  if (!opt)
    return false;
  out = *opt;
  return true;
}


> Using a tail as out parameter it should be someting like
>
> expected<Date, error_code> parse_date(string_view is, string_view& out)
> {
> return make_date % // fmap
> ( parse<int>(is, &is) // year
> >> parse_separator(is, &is) ) *
> ( parse<int>(is, &is) // month
> >> parse_separator(is, &is) ) *
> parse<int>(is, &out) // day
> ;
> }
>
> or using the functional form with a parameter in-out it could be
>
> expected<Date> parse_date(string_view& is)
> {
> return fmap(make_date,
> mdo( parse<int>(is), // year
> parse_separator(is) ),
> mdo( parse<int>(is), // month
> parse_separator(is) ),
> parse<int>(is) // day
> );
> }
>
> There other possibilities (e.g. using Parser) as Miro pointed out.

Looks interesting, but IMO that's higher-level than the basic level
we're aiming for here.
This can easily be built on top of the basic level, can't it?

> Interface using exceptions compose quite well. The interface that doesn't use exceptions should be based on it. We need to see just how to report the error_code. As a result, as out parameter or as a TLS.

IMO it should be the other way around. It's cleaner to build a
throwing interface on top of a non-throwing interface.
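
As a rough illustration of that layering (a sketch only: `parse_or_throw` and this particular `parse` are hypothetical names, limited to non-negative base-10 `int`, and written against C++17's `std::string_view`), the throwing form can be a thin wrapper over the non-throwing one:

```cpp
#include <cassert>
#include <limits>
#include <string_view>
#include <system_error>

// Hypothetical non-throwing primitive (an assumption, not a proposed API):
// base 10, non-negative int, and the whole input must consist of digits.
inline std::error_code parse(int& out, std::string_view s)
{
    if (s.empty())
        return std::make_error_code(std::errc::invalid_argument);
    constexpr int max = std::numeric_limits<int>::max();
    int value = 0;
    for (char c : s) {
        if (c < '0' || c > '9')
            return std::make_error_code(std::errc::invalid_argument);
        const int digit = c - '0';
        if (value > (max - digit) / 10)   // value * 10 + digit would overflow
            return std::make_error_code(std::errc::result_out_of_range);
        value = value * 10 + digit;
    }
    out = value;                          // only written on success
    return {};
}

// Throwing convenience layer built purely on top of the primitive above.
template <typename T>
T parse_or_throw(std::string_view s)
{
    T value{};
    if (const std::error_code ec = parse(value, s))
        throw std::system_error(ec, "parse_or_throw");
    return value;
}
```

With this shape, `parse_or_throw<int>("123")` yields 123 and `parse_or_throw<int>("12x")` throws `std::system_error`, while code that cannot use exceptions keeps calling `parse` directly.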

> Oh, I missed this link. as Matthew, I don't see the point to use it if there is an output of a ParserState or a string_view.

Right
--
Olaf

Magnus Fromreide

unread,
May 21, 2015, 3:51:21 AM5/21/15
to std-pr...@isocpp.org
On Mon, May 18, 2015 at 11:34:07AM -0700, Olaf van der Spek wrote:
> Let's get the party started.
>
> What have we got?
>
> We've got functions like strtol and stoi which take a const char* or
> std::string and return a number.
>
> long strtol(const char*, char **str_end, int base);
> int stoi(const std::string&, std::size_t* pos = 0, int base = 10);
>
> What do we want?
>
> Input should not be required to be null terminated, so string_view seems
> like a suitable input type.
> Error detection should be simpler, but not everyone is a fan of exceptions.
>
> And IMO skipping spaces should not be part of the parse function.
> There's also the question of what to do when not the entire input can be
> parsed. Return an error or not.

I have one more thing to throw into the air.

If the type asked for is unsigned then I think parsing of negative values
should result in a parse error.

"-1" should not be parseable when converting to unsigned T, the fail position
is before the -.

/MF

Olaf van der Spek

unread,
May 21, 2015, 2:31:58 PM5/21/15
to std-pr...@isocpp.org
Could we have some more real-world use cases please?

Magnus Fromreide

unread,
May 21, 2015, 3:53:49 PM5/21/15
to std-pr...@isocpp.org
On Thu, May 21, 2015 at 08:31:55PM +0200, Olaf van der Spek wrote:
> Could we have some more real-world use cases please?

I do not know if this is real-world enough for you?

unsigned int number_of_things = parse<unsigned int>("-1").value();

Why should parse<unsigned int> accept signed values?

What one has to do in C to parse an unsigned value is to use strtol and then
check that the result is greater than or equal to zero. Couldn't C++ do better
here?
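
A minimal sketch of the stricter behaviour, assuming C++17's `std::optional` and `std::string_view`; `parse_unsigned` is a hypothetical name, not a proposed API, and it treats anything other than a run of decimal digits as a failure:

```cpp
#include <cassert>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical helper (an assumption for illustration): parses a base-10
// unsigned value and requires the whole input to be digits. A leading '-'
// or '+', whitespace, and trailing characters all count as failure.
inline std::optional<unsigned> parse_unsigned(std::string_view s)
{
    if (s.empty())
        return std::nullopt;
    constexpr unsigned max = std::numeric_limits<unsigned>::max();
    unsigned value = 0;
    for (char c : s) {
        if (c < '0' || c > '9')
            return std::nullopt;          // "-1" fails here, on the '-'
        const unsigned digit = static_cast<unsigned>(c - '0');
        if (value > (max - digit) / 10)
            return std::nullopt;          // value * 10 + digit would overflow
        value = value * 10 + digit;
    }
    return value;
}
```

Here `parse_unsigned("-1")` returns an empty optional, so calling `.value()` on it throws `std::bad_optional_access` instead of silently producing a huge number.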

/MF

Olaf van der Spek

unread,
May 21, 2015, 4:27:48 PM5/21/15
to std-pr...@isocpp.org
Really? http://www.cplusplus.com/reference/cstdlib/strtoul/


> Couldn't C++ do better
> here?

IMO your example should throw an out-of-range exception.



--
Olaf

Olaf van der Spek

unread,
May 21, 2015, 4:28:40 PM5/21/15
to std-pr...@isocpp.org
2015-05-21 21:53 GMT+02:00 Magnus Fromreide <ma...@lysator.liu.se>:
> On Thu, May 21, 2015 at 08:31:55PM +0200, Olaf van der Spek wrote:
>> Could we have some more real-world use cases please?
>
> I do not know if this is real-world enough for you?

I don't know, is that what you're currently using? If so, could you
share your parse implementation?



--
Olaf

Thiago Macieira

unread,
May 21, 2015, 5:08:23 PM5/21/15
to std-pr...@isocpp.org
On Thursday 21 May 2015 22:27:45 Olaf van der Spek wrote:
> > What one have to do in C to parse an unsigned value is to use strtol and
> > then check that the result is greater or equal than zero.
>
> Really? http://www.cplusplus.com/reference/cstdlib/strtoul/

Really. strtoul will parse negative numbers too.

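#include <cstdlib>
#include <iostream>

int main()
{
    // strtoul accepts a leading '-' and negates the value as unsigned long
    std::cout << std::strtoul("-1", nullptr, 0) << std::endl;
}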

18446744073709551615
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Magnus Fromreide

unread,
May 22, 2015, 1:25:48 AM5/22/15
to std-pr...@isocpp.org
On Thu, May 21, 2015 at 10:27:45PM +0200, Olaf van der Spek wrote:
> 2015-05-21 21:53 GMT+02:00 Magnus Fromreide <ma...@lysator.liu.se>:
> > On Thu, May 21, 2015 at 08:31:55PM +0200, Olaf van der Spek wrote:
> >> Could we have some more real-world use cases please?
> >
> > I do not know if this is real-world enough for you?
> >
> > unsigned int number_of_things = parse<unsigned int>("-1").value();
> >
> > Why should parse<unsigned int> accept signed values?
> >
> > What one have to do in C to parse an unsigned value is to use strtol and then
> > check that the result is greater or equal than zero.
>
> Really? http://www.cplusplus.com/reference/cstdlib/strtoul/

Sadly. http://pubs.opengroup.org/onlinepubs/9699919799/functions/strtoul.html

> > Couldn't C++ do better here?
>
> IMO your example should throw an out-of-range exception.

I kind of agree - I am not sure about the exact exception, one could imagine
invalid-argument like if the input string had been ";", but I do think that
the tail should be "-1".

/MF

Jens Maurer

unread,
May 22, 2015, 2:24:18 AM5/22/15
to std-pr...@isocpp.org
Right, that specification was intended for the "elementary" parse
operations only. The specification needs to be weaker for the
compound parses. Something like "if ret == first, the value of
"v" is unspecified, otherwise v is the result of the (partial) parse,
with error_code containing the error encountered at the end (if any)."

Note that functions like that are NOT intended for building
large parsers without additional helper layers in between,
although showing some code that performs compounding is
(in my opinion) helpful to determine whether the interface
is somewhat reasonable.

My focus here is to make the low-level functions available to
users. I'm not against providing some high-level functions, too.
(The C++ standard currently does not offer low-level functions;
strtoul is locale-dependent, which slows it down substantially.
See also N4412.)

Jens

Olaf van der Spek

unread,
May 22, 2015, 4:25:31 AM5/22/15
to std-pr...@isocpp.org
2015-05-22 8:24 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
> My focus here is to make the low-level functions available to
> users. I'm not against providing some high-level functions, too.
> (The C++ standard currently does not offer low-level functions;
> strtoul is locale-dependent, which slows it down substantially.
> See also N4412.)

Right, locales..
Should a locale-aware variant be provided?
A locale-unaware one?
Both?
Should the locale be a parameter?

Some probably want an ASCII-only variant for simplicity and performance.




--
Olaf

Matthew Woehlke

unread,
May 22, 2015, 10:06:00 AM5/22/15
to std-pr...@isocpp.org
On 2015-05-22 04:25, Olaf van der Spek wrote:
> 2015-05-22 8:24 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>> My focus here is to make the low-level functions available to
>> users. I'm not against providing some high-level functions, too.
>> (The C++ standard currently does not offer low-level functions;
>> strtoul is locale-dependent, which slows it down substantially.
>> See also N4412.)
>
> Right, locales..
> Should a locale-aware variant be provided?
> A locale-unaware one?
> Both?
> Should the locale be a parameter?

IMHO, yes (both). Locale-aware because dealing with locales is hard, and
we should not require the user to do this. ASCII-only because there will
be cases where that is all that is needed (reading machine-written data)
and because the performance is much, much better.

I would probably go with something like 'parse(...)' (ASCII-only) and
'parse_l(..., std::locale* = nullptr)' which uses the current locale if
none is provided (and possibly falls back on the fast ASCII-only if the
locale is 'C').

--
Matthew

Matthew Fioravante

unread,
May 22, 2015, 1:05:42 PM5/22/15
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
How exactly do locales come into play for this algorithm? Choosing . or , for the decimal point? What else?

Most big data processing for numeric text is going to be in simple ascii format, so an optimized routine for that use case would help a lot of people.

Matthew Woehlke

unread,
May 22, 2015, 1:23:12 PM5/22/15
to std-pr...@isocpp.org, Matthew Fioravante
On 2015-05-22 13:05, Matthew Fioravante wrote:
> How exactly do locales come into play for this algorithm? Choosing . or ,
> for the decimal point? What else?

Digit grouping separators. Potentially digits themselves (e.g. should
"六万七千七百四十三" be parsed? "תשסד"?).

> Most big data processing for numeric text is going to be in simple ascii
> format, so an optimized routine for that use case would help a lot of
> people.

Yes, and that's why I think an ASCII-only version is crucial. However, I
also think it makes sense to have a version that accepts any reasonable
user input, e.g. "13", "2.569,86", "伍佰肆拾弐" (okay, that one may be a
little dubious as it uses some obsolete forms), etc.

--
Matthew

Jens Maurer

unread,
May 22, 2015, 1:25:04 PM5/22/15
to std-pr...@isocpp.org
On 05/22/2015 10:25 AM, Olaf van der Spek wrote:
> 2015-05-22 8:24 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>> My focus here is to make the low-level functions available to
>> users. I'm not against providing some high-level functions, too.
>> (The C++ standard currently does not offer low-level functions;
>> strtoul is locale-dependent, which slows it down substantially.
>> See also N4412.)
>
> Right, locales..
> Should a locale-aware variant be provided?

No.

> A locale-unaware one?

Yes, basic execution character set only.
(Remember, we're dealing with parsing elementary items
such as ints and doubles. The latter is remarkably
hard to do correctly. It's not totally out of the
question to convert your locale-obliterated string
into a "plain" ASCII string for parsing at the
application level. Or use C++ locale's num_get<>
facet, if that does what you want.)

> Should the locale be a parameter?

No. (Not for the functions I'm talking about, anyway.)

> Some probably want an ASCII-only variant for simplicity and performance.

Indeed.

Locale support is hard, and what we have in C++ is insufficient.
The proper way forward is probably ICU with a decent C++ "wrapper"
of some sort. Anyway, adding locales to this discussion is sure
to balloon the scope beyond manageability.

Jens

Matthew Fioravante

unread,
May 22, 2015, 1:28:55 PM5/22/15
to std-pr...@isocpp.org
I agree with Jens here. Adding support for locales, parsing Asian numbers, etc. is way out of scope. strtol() doesn't even do that.

Let's focus on the simple common use case that benefits 99% of users first. If a fully locale-aware variant is desired, that can be done later as a follow-up proposal.

Matthew Woehlke

unread,
May 22, 2015, 1:46:31 PM5/22/15
to std-pr...@isocpp.org
On 2015-05-22 13:28, Matthew Fioravante wrote:
> I agree with Jens here. Adding support for locales and parsing asian
> numbers etc.. is way out of scope. strtol() doesn't even do that.

Previous comments notwithstanding, I'll agree also; having locale
support is good, but not worth derailing getting the version that only
works on ASCII / C-locale support. The latter is much more important
(and more clearly in scope).

--
Matthew

Vicente J. Botet Escriba

unread,
May 22, 2015, 6:04:06 PM5/22/15
to std-pr...@isocpp.org
Le 21/05/15 07:10, Olaf van der Spek a écrit :
> 2015-05-21 0:00 GMT+02:00 Vicente J. Botet Escriba <vicent...@wanadoo.fr>:
>> Le 20/05/15 20:49, Olaf van der Spek a écrit :
>>
>> 2015-05-20 17:43 GMT+02:00 Matthew Woehlke <mw_t...@users.sourceforge.net>:
>>
>> On 2015-05-20 11:27, Matthew Fioravante wrote:
>>
>> Generally, using 0 or any other perfectly valid value to signal failure is
>> a really bad idea. It only makes sense when your use case is "Parse the
>> value or give me some default if it fails". In that case, 0 may not be the
>> default value you want so its better to be able to actually specify it.
>>
>> Doesn't expected already handle this?
>>
>> Does expected exist?
>>
>> Do you mean an implementation to play with? Yes
>> (https://github.com/ptal/expected)
> No, I mean accepted by the committee and in a TS.
No, not yet ;-)
>
>> And does it have something like test_and_set()?
>>
>> No. What test_and_set would have as parameters and what would be the effect?
> bool test_and_set(T& out, optional<T> opt)
> {
> if (!opt)
> return false;
> out = *opt;
> return true;
> }
>
The proposal doesn't include this function, but it can be useful in an
imperative paradigm.
>> Using a tail as out parameter it should be someting like
>>
>> expected<Date, error_code> parse_date(string_view is, string_view& out)
>> {
>> return make_date % // fmap
>> ( parse<int>(is, &is) // year
>> >> parse_separator(is, &is) ) *
>> ( parse<int>(is, &is) // month
>> >> parse_separator(is, &is) ) *
>> parse<int>(is, &out) // day
>> ;
>> }
>>
>> or using the functional form with a parameter in-out it could be
>>
>> expected<Date> parse_date(string_view& is)
>> {
>> return fmap(make_date,
>> mdo( parse<int>(is), // year
>> parse_separator(is) ),
>> mdo( parse<int>(is), // month
>> parse_separator(is) ),
>> parse<int>(is) // day
>> );
>> }
>>
>> There other possibilities (e.g. using Parser) as Miro pointed out.
> Looks interesting but IMO that's higher-level then the basic level
> we're aiming for here.
I see.
> This can easily be build on top of the basic level can't it?
Sure, but what is the reason to have a lower level? Performance?
>
>> Interface using exceptions compose quite well. The interface that doesn't use exceptions should be based on it. We need to see just how to report the error_code. As a result, as out parameter or as a TLS.
> IMO it should be the other way around. It's cleaner to build a
> throwing interface on top of a non-throwing interface..
I'm not talking about implementation, but about design of the interface.

Vicente

Olaf van der Spek

unread,
May 24, 2015, 6:29:30 AM5/24/15
to std-pr...@isocpp.org
2015-05-23 0:04 GMT+02:00 Vicente J. Botet Escriba <vicent...@wanadoo.fr>:
>> Looks interesting but IMO that's higher-level then the basic level
>> we're aiming for here.
>
> I see.
>>
>> This can easily be build on top of the basic level can't it?
>
> Sure but what is the reason to have a lower-level? Performances?

Yes, simplicity and performance. A much more complex proposal takes
longer to write and much much longer to agree on.


--
Olaf

Vicente J. Botet Escriba

unread,
May 24, 2015, 5:24:59 PM5/24/15
to std-pr...@isocpp.org
Le 24/05/15 12:29, Olaf van der Spek a écrit :
I agree with you on the last point.

Why do you think that the lower level interface would perform better than the higher level?
Is it the simplicity of the proposal or the simplicity of the user code that you are referring to?

Vicente

Olaf van der Spek

unread,
May 25, 2015, 9:02:18 AM5/25/15
to std-pr...@isocpp.org
2015-05-24 23:24 GMT+02:00 Vicente J. Botet Escriba <vicent...@wanadoo.fr>:
> Why do you think that the lower level interface would perform better than
> the higher level?

Ideally all abstractions would be free but that's not always the case.

> Is simplicity of the proposal or simplicity of the user code that you are
> referring to?

Both, though IMO performance > user code simplicity > proposal simplicity

Olaf van der Spek

unread,
May 29, 2015, 2:39:47 AM5/29/15
to std-pr...@isocpp.org
2015-05-22 8:24 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
> On 05/20/2015 09:38 PM, Olaf van der Spek wrote:
>> 2015-05-20 21:18 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>>> On 05/18/2015 08:34 PM, Olaf van der Spek wrote:
>>> My suggestion:
>>>
>>> const char * ret = parse(T& v, const char * first, const char * last, int base, error_code&);
>>>
>>> If ret == first, v is not
>>> overwritten and we have an error, otherwise v contains the parsed value.
>>
>> Your function fails these requirements..
>
> Right, that specification was intended for the "elementary" parse
> operations only. The specification needs to be weaker for the
> compound parses. Something like "if ret == first, the value of
> "v" is unspecified, otherwise v is the result of the (partial) parse,
> with error_code containing the error encountered at the end (if any)."

Consistency is king. So do we check for errors via ret == first or via ec?

Would this one work for you?
error_code parse(T&, string_view&, int base = 10);

> Note that functions like that are NOT intended for building
> large parsers without additional helper layers in between,

Why not let it work without helpers right away?



--
Olaf

Jens Maurer

unread,
May 29, 2015, 3:06:11 AM5/29/15
to std-pr...@isocpp.org
On 05/29/2015 08:39 AM, Olaf van der Spek wrote:
> 2015-05-22 8:24 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>> On 05/20/2015 09:38 PM, Olaf van der Spek wrote:
>>> 2015-05-20 21:18 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>>>> On 05/18/2015 08:34 PM, Olaf van der Spek wrote:
>>>> My suggestion:
>>>>
>>>> const char * ret = parse(T& v, const char * first, const char * last, int base, error_code&);
>>>>
>>>> If ret == first, v is not
>>>> overwritten and we have an error, otherwise v contains the parsed value.
>>>
>>> Your function fails these requirements..
>>
>> Right, that specification was intended for the "elementary" parse
>> operations only. The specification needs to be weaker for the
>> compound parses. Something like "if ret == first, the value of
>> "v" is unspecified, otherwise v is the result of the (partial) parse,
>> with error_code containing the error encountered at the end (if any)."
>
> Consistency is king. So do we check for errors via ret == first or via ec?

Check for errors using "ec", since ret == first doesn't seem the
natural outcome for partially-successful parses of lists.

That doesn't exclude a more strict specification for certain functions,
such as when parsing a simple "double" or "int".
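
To make that concrete, here is a sketch of an elementary parse in the iterator style, under stated assumptions: it handles only `unsigned`, fixes the base at 10 (so the `base` parameter is dropped), and the choice of `std::errc` values is an illustration, not part of the suggestion. On failure it returns `first` and leaves `v` untouched; callers still test `ec`.

```cpp
#include <cassert>
#include <limits>
#include <system_error>

// Sketch only: an elementary, locale-independent parse for unsigned in base 10.
// Returns a pointer to the first unparsed character; [ret, last) is the tail.
inline const char* parse(unsigned& v, const char* first, const char* last,
                         std::error_code& ec)
{
    constexpr unsigned max = std::numeric_limits<unsigned>::max();
    ec.clear();
    unsigned value = 0;
    const char* p = first;
    for (; p != last && *p >= '0' && *p <= '9'; ++p) {
        const unsigned digit = static_cast<unsigned>(*p - '0');
        if (value > (max - digit) / 10) {  // value * 10 + digit would overflow
            ec = std::make_error_code(std::errc::result_out_of_range);
            return first;
        }
        value = value * 10 + digit;
    }
    if (p == first) {                      // no digits at all, e.g. "-1" or ","
        ec = std::make_error_code(std::errc::invalid_argument);
        return first;
    }
    v = value;                             // only written on success
    return p;
}
```

A caller chains parses by feeding the returned pointer back in as the next `first`, which keeps the "where am I" state visible in the calling code.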

> Would this one work for you?
> error_code parse(T&, string_view&, int base = 10);

I prefer the iterator-style approach. The style is well-established
in the standard library, can be extended to more general iterators
if someone feels so inclined (not me), has fewer issues with over-
eager aliasing assumptions that prevent compiler optimization,
and can be made to look parallel with output. Output doesn't work
with string_view at all, because its elements are const.

Regarding the aliasing assumptions: It's easy to say "that's the
implementer's job", but I have been unable to coerce gcc into
doing the right thing when examining similar style variations for
output operations, even with gcc's __restrict extension.
Maybe I'm just stupid.

>> Note that functions like that are NOT intended for building
>> large parsers without additional helper layers in between,
>
> Why not let it work without helpers right away?

It does work, but it's certainly more inconvenient to use for a
large-scale parser compared to a parser-builder meta-language
(as shown elsewhere on this thread), and that argument applies
to both iterator-style and string_view variants.

Jens

Olaf van der Spek

unread,
May 29, 2015, 3:22:05 AM5/29/15
to std-pr...@isocpp.org
2015-05-29 9:06 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>> Would this one work for you?
>> error_code parse(T&, string_view&, int base = 10);
>
> I prefer the iterator-style approach. The style is well-established
> in the standard library, can be extended to more general iterators
> if someone feels so inclined (not me), has fewer issues with over-

Iterator-style is well-established indeed but range-style is more
convenient IMO.
Can't the string_view one be easily generalized too?

> eager aliasing assumptions that prevent compiler optimization,

I still don't get the aliasing problem. Isn't that only an issue with writes?
Wouldn't doing
auto it = is.begin();
auto last = is.end();
be enough?
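
Roughly what that would look like, as a sketch (a hypothetical `parse` for `unsigned`, base 10, using C++17's `std::string_view`): the begin/end pointers are copied into locals, the loop works only on the locals, and `is` is written exactly once at the end. Whether that is enough to satisfy the optimizer is the open question; the sketch only shows the shape.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string_view>
#include <system_error>

// Sketch only: in-out string_view variant that touches `is` once, on success.
inline std::error_code parse(unsigned& out, std::string_view& is)
{
    const char* const first = is.data();
    const char* it = first;                     // local copy of begin
    const char* const last = first + is.size(); // local copy of end
    constexpr unsigned max = std::numeric_limits<unsigned>::max();
    unsigned value = 0;
    for (; it != last && *it >= '0' && *it <= '9'; ++it) {
        const unsigned digit = static_cast<unsigned>(*it - '0');
        if (value > (max - digit) / 10)         // would overflow
            return std::make_error_code(std::errc::result_out_of_range);
        value = value * 10 + digit;
    }
    if (it == first)                            // nothing parsed; `is` unchanged
        return std::make_error_code(std::errc::invalid_argument);
    out = value;
    is.remove_prefix(static_cast<std::size_t>(it - first)); // single write-back
    return {};
}
```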

> and can be made to look parallel with output.
> Output doesn't work
> with string_view at all, because its elements are const.

True, but IMO we shouldn't punish input for that.

Jens Maurer

unread,
May 29, 2015, 4:29:54 AM5/29/15
to std-pr...@isocpp.org
On 05/29/2015 09:22 AM, Olaf van der Spek wrote:
> 2015-05-29 9:06 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>>> Would this one work for you?
>>> error_code parse(T&, string_view&, int base = 10);
>>
>> I prefer the iterator-style approach. The style is well-established
>> in the standard library, can be extended to more general iterators
>> if someone feels so inclined (not me), has fewer issues with over-
>
> Iterator-style is well-established indeed but range-style is more
> convenient IMO.

Possibly. I find reference parameters where read-modify-write
happens a lot less transparent in the calling code than pass-by-value.
For me, the main purpose of a parser is to advance the state
"where am I", producing parsed values and possibly an error code
while doing so. For my taste, hiding "where am I" in a
read-modify-write parameter is too obtuse if I can help it.

(Constant ranges, i.e. ranges whose extent is not modified,
are great, though. It's always been painful to pass
<expression>.begin(), <expression>.end()
to standard algorithms, where <expression> might be non-short.)

> Can't the string_view one be easily generalized too?

Which way? Have a range_view that takes a pair of arbitrary
iterators of type It? Sure. This feels even farther away
from established STL precedent.

>> eager aliasing assumptions that prevent compiler optimization,
>
> I still don't get the aliasing problem. Isn't that only an issue with writes?
> Wouldn't doing
> auto it = is.begin();
> auto last = is.end();
> be enough?

Consider a sequence of parses; let's take "int" for exposition:

int i1, i2;
string_view s( /* whatever */ );
parse(i1, s);
parse(i2, s);

The "string_view" contains "const char *" internally, which is allowed
to alias anything.

Let's assume the "parse" function is inlined. Ideally, string_view's
components should be kept in registers and never hit memory.
Now, the first "parse" call changes the value of the member of "s"
holding s.begin() (let's call it s.first). This write must hit
memory, because s.first might point to itself or to s.end under
the aliasing rules.

In the second parse call, the "auto it = is.begin()" and
"is.end()" must read from memory afresh, because those values
might have been changed by the write to s.first of the first
parse call.

End result: string_view's components are not kept in registers.

When I looked at the issue a while ago in an output() context,
my gcc wasn't clever enough in its points-to analysis to understand
that s.first does not point to the string_view itself. Points-to
analysis is probably fairly brittle anyway, i.e. has to pessimize
quite often. Maybe that has changed meanwhile.

When s.first and s.end are separate local variables, it's obvious
to the compiler that an externally-supplied value cannot possibly
point to them.
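
To make the two interface styles under discussion concrete, here is a minimal sketch of both. The name parse_uint and the use of std::errc are assumptions made for illustration only, C++17 std::string_view stands in for the string_view discussed in this thread, and overflow checking is left out to keep it short. The iterator-style overload passes and returns the position by value; the range-style overload is a thin wrapper that updates the view in place.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical iterator-style parser: the position goes in and comes out by value.
// Returns the new position; on failure returns `first` and sets ec.
inline const char* parse_uint(const char* first, const char* last,
                              unsigned& out, std::errc& ec) {
    const char* it = first;
    unsigned value = 0;
    while (it != last && *it >= '0' && *it <= '9') {
        value = value * 10 + static_cast<unsigned>(*it - '0');  // overflow ignored in this sketch
        ++it;
    }
    if (it == first) {  // no digit consumed
        ec = std::errc::invalid_argument;
        return first;
    }
    ec = std::errc{};
    out = value;
    return it;
}

// Hypothetical range-style wrapper: "where am I" lives in the string_view,
// which is modified in place (the read-modify-write parameter).
inline std::errc parse_uint(unsigned& out, std::string_view& s) {
    std::errc ec{};
    const char* first = s.data();
    const char* last = s.data() + s.size();
    const char* next = parse_uint(first, last, out, ec);
    s.remove_prefix(static_cast<std::size_t>(next - first));
    return ec;
}
```

With the first form the caller threads the returned pointer into the next call; with the second the same string_view object is passed to each call in turn.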

>> and can be made to look parallel with output.
>> Output doesn't work
>> with string_view at all, because its elements are const.
>
> True, but IMO we shouldn't punish input for that.

When we're done, we should have a set of elementary parse()
functions in the standard, and a corresponding set of output()
functions. Considering some in isolation is fine up to a point,
but we shouldn't lose track of the big picture.

Jens


Olaf van der Spek

unread,
May 29, 2015, 6:12:14 AM5/29/15
to std-pr...@isocpp.org
2015-05-29 10:29 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>> Can't the string_view one be easily generalized too?
>
> Which way? Have a range_view that takes a pair of arbitrary
> iterators of type It? Sure. This feels even farther away
> from established STL precedence.

No, I mean adding an overload taking two or three iterators..
I guess we'll include it anyway due to the aliasing concerns.

File read() and write() also update the file pointer in a hidden way..
is doing so really that bad?
If we get unified call syntax we even might be able to write is.parse()

> Let's assume the "parse" function is inlined. Ideally, string_view's
> components should be kept in registers and never hit memory.
> Now, the first "parse" call changes the value of the member of "s"
> holding s.begin() (let's call it s.first). This write must hit
> memory, because s.first might point to itself or to s.end under
> the aliasing rules.
>
> In the second parse call, the "auto it = is.begin()" and
> "is.end()" must read from memory afresh, because those values
> might have been changed by the write to s.first of the first
> parse call.

is.begin will almost surely have been changed.

> End result: string_view's components are not kept in registers.

True, but how big of a problem is one or two reads from L1 cache?

> When s.first and s.end are separate local variables, it's obvious
> to the compiler that an externally-supplied value cannot possibly
> point to them.

Hmm, I'm inclined to call this a quality of implementation issue.

>>> and can be made to look parallel with output.
>>> Output doesn't work
>>> with string_view at all, because its elements are const.
>>
>> True, but IMO we shouldn't punish input for that.
>
> When we're done, we should have a set of elementary parse()
> functions in the standard, and a corresponding set of output()
> functions. Considering some in isolation is fine up to a point,
> but we shouldn't lose track of the big picture.

Output / format functions are a whole different story. I'm not sure
it's worth it to worry about that for this proposal.


--
Olaf

Jens Maurer

unread,
May 29, 2015, 7:06:12 AM5/29/15
to std-pr...@isocpp.org
On 05/29/2015 12:12 PM, Olaf van der Spek wrote:
> 2015-05-29 10:29 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>>> Can't the string_view one be easily generalized too?
>>
>> Which way? Have a range_view that takes a pair of arbitrary
>> iterators of type It? Sure. This feels even farther away
>> from established STL precedence.
>
> No, I mean adding an overload taking two or three iterators..

I'm looking for the most basic abstraction to be standardized.

We've standardized quite a few non-basic abstractions, e.g.
iostreams or num_get<>(), which leaves a performance gap between
"what the standard provides" and "what the user could write himself
when not needing the bells + whistles". I don't want to have a
gap discussion ever again in the area we're talking about.

I have no sustained objection to standardizing various additional
overloads in addition to standardizing the basic abstraction,
if people feel like it. (I'm not really in favor, because it
blows up the std.lib interface, which makes it a little harder to
learn each time we do that.)

> I guess we'll include it anyway due to the aliasing concerns.
>
> File read() and write() also update the file pointer in a hidden way..
> is doing so really that bad?

File I/O has hidden state, yes. Is that bad? Maybe. We also
have pread and pwrite for those that don't like the hidden state.

> If we get unified call syntax we even might be able to write is.parse()

Yes, if string_view is the first parameter (which it currently isn't).

>> Let's assume the "parse" function is inlined. Ideally, string_view's
>> components should be kept in registers and never hit memory.
>> Now, the first "parse" call changes the value of the member of "s"
>> holding s.begin() (let's call it s.first). This write must hit
>> memory, because s.first might point to itself or to s.end under
>> the aliasing rules.
>>
>> In the second parse call, the "auto it = is.begin()" and
>> "is.end()" must read from memory afresh, because those values
>> might have been changed by the write to s.first of the first
>> parse call.
>
> is.begin will almost surely have been changed.

Yes, sorry.

>> End result: string_view's components are not kept in registers.
>
> True, but how big of a problem is one or two reads from L1 cache?

That might not be the only effect. Memory writes to an arbitrary
location (from the viewpoint of the compiler) might inhibit
code motion / scheduling between the two calls to parse, too,
producing more pipeline stalls etc.

>> When s.first and s.end are separate local variables, it's obvious
>> to the compiler that an externally-supplied value cannot possibly
>> point to them.
>
> Hmm, I'm inclined to call this a quality of implementation issue.

As I said earlier, I failed to achieve the QoI for the "output"
case and a similar interface structure.

>>>> and can be made to look parallel with output.
>>>> Output doesn't work
>>>> with string_view at all, because its elements are const.
>>>
>>> True, but IMO we shouldn't punish input for that.
>>
>> When we're done, we should have a set of elementary parse()
>> functions in the standard, and a corresponding set of output()
>> functions. Considering some in isolation is fine up to a point,
>> but we shouldn't lose track of the big picture.
>
> Output / format functions are a whole different story. I'm not sure
> it's worth it to worry about that for this proposal.

My use case is a (say) JSON input / output library.
I don't want this in the standard right now, but I do want to have
the basic building blocks available so that I can write my own
in C++ whose performance blows everything else out of the water while
using standard library facilities for the basic building blocks.

The status quo is that I won't even try, because I can't get rid of
locale-dependent parsing / output formatting of numbers when using
the standard library.

Thanks,
Jens

Olaf van der Spek

unread,
May 29, 2015, 7:11:10 AM5/29/15
to std-pr...@isocpp.org
2015-05-29 13:06 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
> I'm looking for the most basic abstraction to be standardized.

> My use case is a (say) JSON input / output library.
> I don't want this in the standard right now, but I do want to have
> the basic building blocks available so that I can write my own
> in C++ whose performance blows everything else out of the water while
> using standard library facilities for the basic building blocks.
>
> The status quo is that I won't even try, because I can't get rid of
> locale-dependent parsing / output formatting of numbers when using
> the standard library.

I heard RapidJSON is quite.. quick.

That said, I'm also aiming for functions without unnecessary overhead.


--
Olaf

Miro Knejp

unread,
May 29, 2015, 8:38:40 AM5/29/15
to std-pr...@isocpp.org
I think this argumentation is backwards. You are not doing a write operation through the char pointer, you are replacing the pointer object itself, a member of s. Even if the char pointer does alias the string_view there is no modifying operation taking place on the dereferenced char pointer that would change the string_view object itself. At least if the full function definition is available to the translation unit. So keeping with your assumption that the parse() calls are fully inlined the compiler *can* keep s in registers because of read-only accesses of the char pointer.

Jens Maurer

unread,
May 29, 2015, 10:47:30 AM5/29/15
to std-pr...@isocpp.org
On 05/29/2015 01:11 PM, Olaf van der Spek wrote:
> 2015-05-29 13:06 GMT+02:00 Jens Maurer <Jens....@gmx.net>:
>> I'm looking for the most basic abstraction to be standardized.
>
>> My use case is a (say) JSON input / output library.
>> I don't want this in the standard right now, but I do want to have
>> the basic building blocks available so that I can write my own
>> in C++ whose performance blows everything else out of the water while
>> using standard library facilities for the basic building blocks.
>>
>> The status quo is that I won't even try, because I can't get rid of
>> locale-dependent parsing / output formatting of numbers when using
>> the standard library.
>
> I heard RapidJSON is quite.. quick.

https://github.com/miloyip/rapidjson

This is a good example why we need fast low-level parsing
and output in the standard.

For floating-point, it does:
// This is a C++ header-only implementation of Grisu2 algorithm from the publication:
// Loitsch, Florian. "Printing floating-point numbers quickly and accurately with
// integers." ACM Sigplan Notices 45.6 (2010): 233-243.

For integer, it has its own
inline char* u32toa(uint32_t value, char* buffer);
and
inline char* u64toa(uint64_t value, char* buffer);

These are general-purpose functions, yet the author has seen fit
to duplicate the implementation (and not use strtod or similar)?
That can't be right.

> That said, I'm also aiming for functions without unnecessary overhead.

Good. :-)

Jens

Olaf van der Spek

unread,
May 29, 2015, 11:42:52 AM5/29/15
to std-pr...@isocpp.org
That's output, not input..
I do agree we need better output functions too but again that's not in
scope for this proposal.


--
Olaf

Jeffrey Yasskin

unread,
May 29, 2015, 12:32:17 PM5/29/15
to std-pr...@isocpp.org
FWIW, I think Jens' arguments are going to be convincing to the
committee, so it's probably a good idea to follow them in the first
iteration of this paper.

Thiago Macieira

unread,
May 29, 2015, 4:15:47 PM5/29/15
to std-pr...@isocpp.org
Indeed, but we probably need it even more badly than parsing numbers.

We have strtol/strtod from C99 and POSIX to parse numbers and we even have
strtol_l from POSIX to parse it independent of locale. But the string
conversion still requires sprintf in C or ostringstream in C++. That's a hell
of an overhead...

Matthew Fioravante

unread,
May 29, 2015, 11:30:23 PM5/29/15
to std-pr...@isocpp.org
Output is a much harder problem because you don't know the size of the resulting string required to store the textual representation.

The easiest way is to just dynamically allocate like to_string() but of course this is not an optimal interface because you have to allocate a separate buffer for every conversion.

template <typename T>
string serialize(const T& val);

The slightly less convenient way is passing string or vector<char> or some other "container" via out parameter and have the method resize it as needed. This is more efficient because you can reuse the same buffer for multiple parses (see std::getline()). This is easy to use but it creates a hard dependency on std::string or whatever is used as the buffer class type. An interface like this could be templated to allow any container supporting push_back().

template <typename T>
void serialize(std::string& buf, const T& val);

The most low level interface is to provide a fixed size buffer (via pointers) but then what happens if you run out of space? Truncate and report an error? How do you know how much space is actually needed? Can you restart the operation from the exact point where it failed or do you have to restart the serialization algorithm from scratch again?

Pushing the buffer size question back onto the client allows for the most generic interface possible. The client's response to a full buffer could vary depending on the situation. Example behaviors include truncating, allocating more space and continuing, or writing the current buffer to a stream and then continuing again from the start. Such an interface could allow class authors to write out-of-line class serialization kernels that would work efficiently for all use cases (memory, files, network, etc.). For example, a higher-level interface which serializes to files could run this kernel function directly on the internal file buffer and then flush it to disk when full.

template <typename T>
char* serialize(char* b, char*e, const T& val, error_code& ec);
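
As a rough sketch of what the fixed-buffer variant could look like for one concrete type, the following writes a 32-bit unsigned value in decimal. The behavior on a full buffer (write nothing, return b, set ec) is only one of the possible answers to the questions above and is an assumption of this sketch, as is the choice of std::errc::value_too_large.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <system_error>

// Hypothetical fixed-buffer serializer: writes the decimal digits of val into [b, e).
// Returns one past the last character written. If the buffer is too small,
// nothing is written, b is returned and ec is set.
inline char* serialize(char* b, char* e, std::uint32_t val, std::error_code& ec) {
    char tmp[10];  // a 32-bit value has at most 10 decimal digits
    int n = 0;
    do {  // produce digits in reverse order
        tmp[n++] = static_cast<char>('0' + val % 10);
        val /= 10;
    } while (val != 0);
    if (e - b < n) {
        ec = std::make_error_code(std::errc::value_too_large);
        return b;
    }
    ec.clear();
    while (n > 0) *b++ = tmp[--n];  // copy out in the right order
    return b;
}
```

Because nothing is written on failure, the caller can flush or grow the buffer and simply call again with the same value.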

Bjorn Reese

unread,
May 30, 2015, 5:04:33 AM5/30/15
to std-pr...@isocpp.org
On 05/29/2015 04:47 PM, Jens Maurer wrote:

> These are general-purpose functions, yet the author has seen fit
> to duplicate the implementation (and not use strtod or similar)?
> That can't be right.

Adherence to specification is one of the main reasons for
implementing your own conversion functions for textual protocols (of
which JSON is just one example.) These protocols usually define exactly
how, say, floating-point numbers must be represented, and these
representations may not correspond to those chosen by C++.

One of the requirements mentioned in this thread is the absence of
locale. This means that it becomes difficult to use the proposed parsing
functions for a wider set of protocols, because there is variation in
how they represent numbers. For example, XSLT uses thousands
separators, while JSON does not. As another example, NaN and infinity are
represented as "null" in JSON.

If you investigate JSON or XML parsers you will find that they even
implement their own ctypes. For instance, a whitespace in JSON does not
match the C locale std::isspace().

Olaf van der Spek

unread,
May 30, 2015, 7:35:47 AM5/30/15
to std-pr...@isocpp.org
2015-05-29 18:31 GMT+02:00 'Jeffrey Yasskin' via ISO C++ Standard -
Future Proposals <std-pr...@isocpp.org>:
> FWIW, I think Jens' arguments are going to be convincing to the
> committee, so it's probably a good idea to follow them in the first
> iteration of this paper.

The aliasing issue (which shouldn't be an issue) or the iterator-based
interface?

Jens Maurer

unread,
May 30, 2015, 3:32:22 PM5/30/15
to std-pr...@isocpp.org
On 05/30/2015 11:04 AM, Bjorn Reese wrote:
> On 05/29/2015 04:47 PM, Jens Maurer wrote:
>
>> These are general-purpose functions, yet the author has seen fit
>> to duplicate the implementation (and not use strtod or similar)?
>> That can't be right.
>
> Adherence to specification is one of the most main reasons for
> implementing your own conversion functions for textual protocols (of
> which JSON is just one example.) These protocols usually define exactly
> how, say, floating-point numbers must be represented, and these
> representations may not correspond to those chosen by C++.

I haven't seen a textual protocol where thousand separators
would be mandatory.

For floating-point parsing, doing a pre-parse and removing the
thousand separators and switching the decimal point is an option;
another option would be to make these parameters of the standard
parse functions for floating-point. I'm not seeing a large
use for that.

> One of the requirements mentioned in this thread is the absence of
> locale. This means that it becomes difficult to use the proposed parsing
> functions for a wider set of protocols, because there is variation in
> how they represent numbers.

From what I've seen so far, if a protocol chooses decimal text
representations, the representation of integers is the
"obvious" one.
For floating-point, people seem to want exact round-trip capability
with minimal text used. C++ doesn't guarantee this right now, but
it seems reasonable to require this from new parser/output
functions. The algorithms for these are non-trivial, but
well-known.

http://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf
http://www.cesura17.net/~will/Professional/Research/Papers/howtoread.pdf
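
For reference, a round trip can already be approximated today with the C library by printing max_digits10 significant digits, as in the sketch below. The helper name round_trip is made up for illustration. Note the caveats: both snprintf and strtod consult the global C locale (this assumes the default "C" locale), the result relies on the implementation rounding correctly, and 17 digits is not the minimal text the algorithms above aim for.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <limits>

// Hypothetical helper: formats x with max_digits10 significant digits
// and parses the text back into a double.
inline double round_trip(double x) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g",
                  std::numeric_limits<double>::max_digits10, x);
    return std::strtod(buf, nullptr);  // locale-dependent, like snprintf
}
```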

> For example, XSLT uses thousands
> separators,

I can't find support for this claim.

http://www.w3.org/TR/xslt#section-Expressions
says XSLT is using XPath expressions, and
http://www.w3.org/TR/xpath/#NT-Number
doesn't seem to allow thousand-separators.

> while JSON does not. Another example, NaN and infinity is
> represented as "null" in JSON.

Good point about NaN and infinity. We should probably offer
parsers for these separate from the "main" floating-point
parser to allow for more flexibility for the caller.

Since JSON can't represent NaN and infinity (which one is
it if you get a "null"?), that's a separate challenge for
the JSON parser.

> If you investigate JSON or XML parsers you will find that they even
> implement their own ctypes. For instance, a whitespace in JSON does not
> match the C locale std::isspace().

Agreed. A JSON parser is still a JSON parser and is probably bad
at parsing XSLT. I do not anticipate standard-provided number
parsers to skip whitespace (deviating from strtod's behavior).

Depending on how much of a "complete" solution we want to offer
in the standard, a function such as

parse_and_skip("set-of-chars-to-skip")

might be helpful. (For null-terminated strings, that function
seems to be called strspn, btw.)
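
A length-aware counterpart to strspn is short to write on top of string_view. The sketch below uses C++17 std::string_view and a made-up name, skip_chars; it returns the remainder of the input instead of modifying it.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: returns s with all leading characters that occur in
// `set` removed. Unlike strspn, no null terminator is required.
inline std::string_view skip_chars(std::string_view s, std::string_view set) {
    const auto pos = s.find_first_not_of(set);
    if (pos == std::string_view::npos)
        return s.substr(s.size());  // every character was in the set
    return s.substr(pos);
}
```

A JSON parser would call it with " \t\n\r", the four whitespace characters JSON allows, which is a smaller set than what std::isspace accepts in the C locale.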

Jens

Olaf van der Spek

unread,
Jun 12, 2015, 9:51:55 AM6/12/15
to std-pr...@isocpp.org
Some more Qs:
Should (unsigned) short be supported?
Should (unsigned) char be supported?
Bool?

Should octal input be supported if base = 0?

Matthew Woehlke

unread,
Jun 12, 2015, 10:03:42 AM6/12/15
to std-pr...@isocpp.org
On 2015-06-12 09:51, Olaf van der Spek wrote:
> Should octal input be supported if base = 0?

IMHO, yes, if by "base = 0" you mean "detect base from prefix in input
string". Particularly, we should support at least what strtol does.

--
Matthew

Matthew Fioravante

unread,
Jun 12, 2015, 11:49:20 AM6/12/15
to std-pr...@isocpp.org


On Friday, June 12, 2015 at 9:51:55 AM UTC-4, Olaf van der Spek wrote:
Some more Qs:
Should (unsigned) short be supported?
Should (unsigned) char be supported?

I would support all integral types. I also believe unsigned types should be supported because if you really want that high order bit without triggering an overflow error you can't use a signed type. I would also make it an error to try to specify a negative literal (i.e. '-' prefix) if T is unsigned integral.
 
Bool?

This is tricky because valid strings could be (0, 1), (true, false), (True, False), (T,F), etc. If the parsing function is named something like str_to_num() then it makes sense to support bool with "0" and "1". If it's more generic like parse(), then the question of valid inputs becomes more ambiguous.

 

Should octal input be supported if base = 0?

I would at minimum support all of the prefixes supported by the core language integer literals. That is, hex (0x), decimal (no prefix), octal (0), and binary (0b).

Olaf van der Spek

unread,
Jun 12, 2015, 1:14:17 PM6/12/15
to std-pr...@isocpp.org
2015-06-12 17:49 GMT+02:00 Matthew Fioravante <fmatth...@gmail.com>:
>
>
> On Friday, June 12, 2015 at 9:51:55 AM UTC-4, Olaf van der Spek wrote:
>>
>> Some more Qs:
>> Should (unsigned) short be supported?
>>
>> Should (unsigned) char be supported?
>
>
> I would support all integral types. I also believe unsigned types should be
> supported because if you really want that high order bit without triggering
> an overflow error you can't use a signed type.

Of course

> I would also make it an error
> to try to specify a negative literal (i.e. '-' prefix) if T is unsigned
> integral.

What about "-0"?
"+1"?

>>
>> Bool?
>
>
> This is tricky because valid strings could be (0, 1), (true, false), (True,
> False), (T,F), etc.... If the parsing function is named something like
> str_to_num() then it makes sense to support bool with "0" and "1". If its
> more generic like parse(), then the question of valid inputs becomes more
> ambiguous.

True
There's also the question of what to do with for example -1 and 2. Map
to true or return invalid input?

>> Should octal input be supported if base = 0?
>
>
> I would at minimum support all of the literals supported by the core
> language string literals. That is, hex (0x), decimal (), octal (0), and
> binary (0b).

Does strtol support 0b?
0 as prefix is problematic as 09 for example is probably not intended
as an octal number.


--
Olaf

Matthew Fioravante

unread,
Jun 12, 2015, 1:58:10 PM6/12/15
to std-pr...@isocpp.org


On Friday, June 12, 2015 at 1:14:17 PM UTC-4, Olaf van der Spek wrote:
2015-06-12 17:49 GMT+02:00 Matthew Fioravante <fmatth...@gmail.com>:
>
>
> On Friday, June 12, 2015 at 9:51:55 AM UTC-4, Olaf van der Spek wrote:
>>
>> Some more Qs:
>> Should (unsigned) short be supported?
>>
>> Should (unsigned) char be supported?
>
>
> I would support all integral types. I also believe unsigned types should be
> supported because if you really want that high order bit without triggering
> an overflow error you can't use a signed type.

Of course

> I would also make it an error
> to try to specify a negative literal (i.e. '-' prefix) if T is unsigned
> integral.

What about "-0"?

Good question. Having an exception for zero seems odd. It could depend on how the error condition is defined. If "-235" results in a kind of "out of range" error then one could argue that "-0" is in range so it should be valid. On the other hand if "-235" is considered a "parsing" error because '-' is not in the grammar, then I guess "-0" should be rejected as well.

Another argument is that for integers "-0" is not an actual value but more like taking 0 and negating it with operator-(int). In other words, this is an expression and not a raw literal integer value. These functions only parse raw numbers, not mathematical expressions, so "-0" is invalid.

I'd be fine either way. This is borderline bikeshed material but should definitely be brought up in a proposal.

"+1"?

If "+" prefix is supported for signed, I would support it for unsigned also.


>>
>> Bool?
>
>
> This is tricky because valid strings could be (0, 1), (true, false), (True,
> False), (T,F), etc.... If the parsing function is named something like
> str_to_num() then it makes sense to support bool with "0" and "1". If its
> more generic like parse(), then the question of valid inputs becomes more
> ambiguous.

True
There's also the question of what to do with for example -1 and 2. Map
to true or return invalid input?

What's the valid range? 32 bits? 64 bits? sizeof(bool) * CHAR_BIT bits?

I would say invalid.

If you want to support parsings like -1 and 2 into bool then parse the string into an int and then convert the int to bool yourself. I think we should leave type conversions up to the user and be very strict with parsing only valid values of the given type.

The parser itself should not do type conversions. This is also why I think unsigned parsing should not allow negative values.

Since std::is_integral<bool>::value == true, one could argue that we should treat bools just like ints. That is, the parse overload for bool has a base parameter (which actually does nothing since the only valid values are 0 and 1 in all bases). A base of "0" (auto-detect) could support all of the same prefixes as well as the strings "true" and "false" since those are literals in C++.

//Ignore errors for this example
template <typename T> T parse(string_view s, int base = 10);

for (int i = 2; i < 16; ++i) {
  assert(!parse<bool>("0", i));
  assert(parse<bool>("1", i));
}

//hex prefix
assert(!parse<bool>("0x0", 0));
assert(parse<bool>("0x1", 0));

//decimal no prefix
assert(!parse<bool>("0", 0));
assert(parse<bool>("1", 0));

//octal prefix
assert(!parse<bool>("00", 0));
assert(parse<bool>("01", 0));

//binary prefix
assert(!parse<bool>("0b0", 0));
assert(parse<bool>("0b1", 0));

//C++ true and false literals
assert(!parse<bool>("false", 0));
assert(parse<bool>("true", 0));





 

>> Should octal input be supported if base = 0?
>
>
> I would at minimum support all of the literals supported by the core
> language string literals. That is, hex (0x), decimal (), octal (0), and
> binary (0b).

Does strtol support 0b?

http://en.cppreference.com/w/cpp/string/byte/strtol

No, but neither did C or C++ when the interface was defined. Binary literals are useful and should be included. We are already breaking compatibility by removing support for leading white space. I don't see any strong reason why we have to conform to strtol(). I'd rather use the integer literal prefixes supported by the core language as a guide. Compatibility with the core language in this respect makes the interface easier to understand for novices because they learn one set of numeric prefixes and it works the same way everywhere. Eventually strtol() and its ilk will go to the dustbin of history. People will not be referring back to strtol() to figure out how to use this new interface.
 
0 as prefix is problematic as 09 for example is probably not intended
as an octal number.

If the user wants to accept 09 as a decimal number, they should call parse<int>(s, 10).

The base=0 is a simple helper utility for the most common use case. If the user for some reason has unusual requirements like only supporting hex and decimal but not octal and binary, then they can write a wrapper which parses the prefix.
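
As an illustration of what prefix auto-detection could look like, here is a small sketch following the core language integer literal prefixes. The names base_and_digits and detect_base are invented for this example, and C++17 std::string_view is assumed. It only splits off the prefix; checking that the remaining digits are valid for the detected base is left to the digit parser.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical result type: the detected base and the input with the prefix removed.
struct base_and_digits {
    int base;                 // 16, 10, 8 or 2
    std::string_view digits;  // remaining characters after the prefix
};

// Hypothetical prefix detection: 0x/0X is hex, 0b/0B is binary,
// any other leading 0 followed by more characters is octal, the rest is decimal.
inline base_and_digits detect_base(std::string_view s) {
    if (s.size() >= 2 && s[0] == '0') {
        if (s[1] == 'x' || s[1] == 'X') return {16, s.substr(2)};
        if (s[1] == 'b' || s[1] == 'B') return {2, s.substr(2)};
        return {8, s.substr(1)};
    }
    return {10, s};  // includes a lone "0"
}
```

With these rules "09" is classified as octal with digits "9", so a strict digit parser would then reject it. A wrapper that wants hex and decimal only would drop the binary and octal branches.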

If you want to give the user maximum control, make a third defaulted out argument which is a bitmask of what prefixes are requested. This would also allow users to turn off prefix support when parsing a non-zero base (see below).

auto hex_prefix = 1u << 16;
auto dec_prefix = 1u << 10;
auto oct_prefix = 1u << 8;
auto bin_prefix = 1u << 2;

//All prefixes are checked by default
template <typename T>
  T parse(string_view s, int base=10, unsigned int prefixes_to_check=~0u);

auto hex_or_dec = hex_prefix | dec_prefix;

parse<int>("0x9", 0, hex_or_dec); //Ok == 9
parse<int>("9" , 0, hex_or_dec); //Ok == 9
parse<int>("09", 0, hex_or_dec); //Ok == 9 (parsed as decimal)
parse<int>("0b1001", 0, hex_or_dec); //Error, bad input character "b"

parse<int>("F", 16); //Ok == 15
parse<int>("0xF", 16); //Ok == 15
parse<int>("F", 16, 0); //Ok == 15
parse<int>("0xF", 16, 0); //Error, bad input character "x"


Thiago Macieira

unread,
Jun 12, 2015, 6:33:38 PM6/12/15
to std-pr...@isocpp.org
On Friday 12 June 2015 10:58:10 Matthew Fioravante wrote:
> If you want to give the user maximum control, make a third defaulted out
> argument which is a bitmask of what prefixes are requested. This would also
> allow users to turn off prefix support when parsing a non-zero base (see
> below).

That requires a 64-bit parameter for the bitfield mask, as there are 35
options (bases 2 to 36). Not the end of the world, but will require two
parameter slots on 32-bit systems.

Magnus Fromreide

unread,
Jun 12, 2015, 6:53:38 PM6/12/15
to std-pr...@isocpp.org
On Fri, Jun 12, 2015 at 10:58:10AM -0700, Matthew Fioravante wrote:
>
>
> If "+" prefix is supported for signed, I would support it for unsigned
> also.

I see a big if looming there.

Leading pluses make it impossible to know what input the parse routine got
since it creates two possible inputs that give the same output.

> > >> Bool?
> > >
> > >
> > > This is tricky because valid strings could be (0, 1), (true, false),
> > (True,
> > > False), (T,F), etc.... If the parsing function is named something like
> > > str_to_num() then it makes sense to support bool with "0" and "1". If
> > its
> > > more generic like parse(), then the question of valid inputs becomes
> > more
> > > ambiguous.
> >
> > True
> > There's also the question of what to do with for example -1 and 2. Map
> > to true or return invalid input?
> >
>
> Since std::is_Integral<bool> == true, one could argue that we should treat
> bools just like ints. That is, the parse overload for bool has a base
> parameter (which actually does nothing since the only valid values are 0
> and 1 in all bases). A base of "0" (auto-detect) could support all of the
> same prefixes as well as the strings "true" and "false" since those are
> literals in C++.

As long as we make dead sure that nothing tries to drag in the localization
code, and that is why I am mildly against support for true/false.

/MF

Olaf van der Spek

unread,
Jun 13, 2015, 8:14:50 AM6/13/15
to std-pr...@isocpp.org
2015-06-13 0:33 GMT+02:00 Thiago Macieira <thi...@macieira.org>:
> On Friday 12 June 2015 10:58:10 Matthew Fioravante wrote:
>> If you want to give the user maximum control, make a third defaulted out
>> argument which is a bitmask of what prefixes are requested. This would also
>> allow users to turn off prefix support when parsing a non-zero base (see
>> below).
>
> That requires a 64-bit parameter for the bitfield mask, as there are at 35
> options (bases 2 to 36). Not the end of the world, but will require two
> parameter slots on 32-bit systems.

Why's that? Most of the bases don't have a prefix so can only be
selected manually.


--
Olaf

Matthew Fioravante

Jun 13, 2015, 9:38:53 AM6/13/15
to std-pr...@isocpp.org


On Friday, June 12, 2015 at 6:53:38 PM UTC-4, Magnus Fromreide wrote:
On Fri, Jun 12, 2015 at 10:58:10AM -0700, Matthew Fioravante wrote:
>
>
> If "+" prefix is supported for signed, I would support it for unsigned
> also.

I see a big if looming there.

Leading pluses make it impossible to know what input the parse routine got
since it creates two possible inputs that give the same output.

Why is it important to know which input string produced a value? And anyway you already have this problem because you don't know what base was used for the input string.
 

> > >> Bool?
> > >
> > >
> > > This is tricky because valid strings could be (0, 1), (true, false),
> > (True,
> > > False), (T,F), etc.... If the parsing function is named something like
> > > str_to_num() then it makes sense to support bool with "0" and "1". If
> > its
> > > more generic like parse(), then the question of valid inputs becomes
> > more
> > > ambiguous.
> >
> > True
> > There's also the question of what to do with for example -1 and 2. Map
> > to true or return invalid input?
> >
>
> Since std::is_integral<bool>::value == true, one could argue that we should treat
> bools just like ints. That is, the parse overload for bool has a base
> parameter (which actually does nothing since the only valid values are 0
> and 1 in all bases). A base of "0" (auto-detect) could support all of the
> same prefixes as well as the strings "true" and "false" since those are
> literals in C++.

As long as we make dead sure that nothing tries to drag in the localization
code, and that is why I am mildly against support for true/false.

Agreed, if true/false means locales and all that, I'd rather skip it. But we already have an 'x' and a 'b' for the hex and binary prefixes.
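
A minimal sketch of what a locale-free bool parse could look like, assuming exact ASCII spellings only. The function name `parse_bool` and the `allow_words` switch are hypothetical, not from any proposal.

```cpp
#include <optional>
#include <string_view>

// Hypothetical helper: never touches <locale>, compares bytes only.
// "0" and "1" are always accepted; the C++ literals "true" and "false"
// are accepted only when the caller opts in.
std::optional<bool> parse_bool(std::string_view s, bool allow_words = false) {
    if (s == "0") return false;
    if (s == "1") return true;
    if (allow_words) {
        if (s == "false") return false;
        if (s == "true") return true;
    }
    // "2", "-1", "True", "T" and the empty string are all invalid input.
    return std::nullopt;
}
```

In this sketch, -1 and 2 are reported as invalid input rather than mapped to true, which is one of the two options raised above.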

Miro Knejp

Jun 13, 2015, 10:16:29 AM6/13/15
to std-pr...@isocpp.org
There should definitely be a way to disable the "+" prefix and treat it as error. Not all data or interchange formats allow a leading plus and in some aviation protocols I know (but can't talk about) anything that is not set in stone in the specification is an error. A number parser where the plus sign cannot be disabled is useless in such situations as it would not pass certification. Sure, I could check for the presence of "+" manually first, but why do the same work twice?

The thread so far revolves heavily around parsing C-syntax numbers, as seen in the prefix discussion. This is quite restrictive and should only be an addition on top of the absolute basic number parsing. The numeric (ASCII) decimal/hex digits are the same in every programming/scripting language and textual file format; prefixes and other stuff are not. If these functions are to be the basic foundation for all number parsing, making it unnecessary for everyone to roll their own as is currently the case, then all these extras must be optional.
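
A minimal sketch of such a bare core, assuming ASCII input and an unsigned result. The name `parse_digits` is hypothetical; signs, prefixes and whitespace skipping would be optional layers on top of it.

```cpp
#include <cassert>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical core routine: digits only, no sign, no prefix, no whitespace.
// The whole input must be consumed, otherwise the result is empty.
// Assumes an ASCII-compatible encoding for the letter digits.
std::optional<unsigned> parse_digits(std::string_view s, int base = 10) {
    assert(base >= 2 && base <= 36);
    if (s.empty()) return std::nullopt;
    const unsigned max = std::numeric_limits<unsigned>::max();
    const unsigned b = static_cast<unsigned>(base);
    unsigned value = 0;
    for (char c : s) {
        int digit = -1;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'z') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'Z') digit = c - 'A' + 10;
        if (digit < 0 || digit >= base) return std::nullopt;  // not a digit in this base
        const unsigned d = static_cast<unsigned>(digit);
        if (value > (max - d) / b) return std::nullopt;       // would overflow
        value = value * b + d;
    }
    return value;
}
```

With this core, "+42", " 42" and "0x10" are all errors, so a format that forbids them needs no extra checks before calling it.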

Jens Maurer

Jun 14, 2015, 5:10:30 PM6/14/15
to std-pr...@isocpp.org
On 06/13/2015 04:17 PM, Miro Knejp wrote:
> The thread so far is heavily revolving around parsing C-syntax
> numbers as seen by the prefix discussion. This is quite restrictive
> and should only be an addition on top of the absolute basic number
> parsing. The numeric (ASCII) decimal/hex digits are the same in every
> programming/scripting language or textual file formats, prefixes and
> other stuff are not. If these functions are to be the basic
> foundation for all number parsing that makes it obsolete for everyone
> to roll their own as is currently the case then all these extras must
> be optional.

Yes, fully agreed.

Jens

Olaf van der Spek

Jun 15, 2015, 7:10:36 AM6/15/15
to std-pr...@isocpp.org
2015-06-13 16:17 GMT+02:00 Miro Knejp <miro....@gmail.com>:
> There should definitely be a way to disable the "+" prefix and treat it as
> error. Not all data or interchange formats allow a leading plus and in some
> aviation protocols I know (but can't talk about) anything that is not set in
> stone in the specification is an error. A number parser where the plus sign
> cannot be disabled is useless in such situations as it would not pass
> certification. Sure, I could check for the presence of "+" manually first,
> but why do the same work twice?

What about minus? Leading zeros?
Disallowing things is certainly necessary in some cases but I'm afraid
it makes the proposal more complex.
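
To give a feel for the extra surface area, here is a minimal sketch, assuming the policy is expressed as a small options struct. `parse_options` and `check_policy` are hypothetical names, and the defaults are arbitrary.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical option flags; each thing that can be disallowed adds one.
struct parse_options {
    bool allow_plus = false;          // accept a leading '+'
    bool allow_minus = true;          // accept a leading '-'
    bool allow_leading_zeros = true;  // accept "007"
};

// Checks only the sign and leading-zero policy. Returns the number of sign
// characters to skip, or an empty optional if the input violates the policy.
// Validating the digits themselves is left to the actual parser.
std::optional<std::size_t> check_policy(std::string_view s, parse_options opt) {
    std::size_t pos = 0;
    if (!s.empty() && (s[0] == '+' || s[0] == '-')) {
        if (s[0] == '+' && !opt.allow_plus) return std::nullopt;
        if (s[0] == '-' && !opt.allow_minus) return std::nullopt;
        pos = 1;
    }
    // A lone "0" is fine; "00" or "07" counts as a leading zero.
    if (!opt.allow_leading_zeros && s.size() > pos + 1 && s[pos] == '0')
        return std::nullopt;
    return pos;
}
```
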

> The thread so far is heavily revolving around parsing C-syntax numbers as
> seen by the prefix discussion. This is quite restrictive and should only be
> an addition on top of the absolute basic number parsing. The numeric (ASCII)
> decimal/hex digits are the same in every programming/scripting language or
> textual file formats, prefixes and other stuff are not. If these functions
> are to be the basic foundation for all number parsing that makes it obsolete
> for everyone to roll their own as is currently the case then all these
> extras must be optional.

Auto-detection of base when base = 0 is only a small part of this function.