String to T conversions - getting it right this time


Matt Fioravante

Jan 26, 2014, 11:25:02 AM
to std-pr...@isocpp.org
String to T (int, float, etc.) conversions seem like a rather easy task (aside from floating point round trip issues), and yet for the life of C and C++ the standard library has consistently failed to provide a decent interface.

Let's review:

int atoi(const char* s); // and atol, atoll, atof, etc.

What's wrong with this?
  • Returns 0 on parsing failure, making it impossible to tell a failed parse from a string that really is "0". This already renders this function effectively useless and we can skip the rest of the bullet points right here.
  • It discards leading whitespace, which has several problems of its own:
    • If we want to check whether the string is strictly a numeric string, we have to add our own check that the first character is a digit. This makes the interface clumsy to use and easy to screw up.
    • std::isspace() is locale dependent and requires an indirect function call (try it on gcc.godbolt.org). This makes what could be a very simple and inlinable conversion potentially expensive. It also prevents constexpr.
    • From a design standpoint, this whitespace handling is a very narrow use case. It does too many things and in my opinion is a bad design. I often do not have whitespace delimited input in my projects.
  • No atod() for doubles or atold() for long doubles.
  • No support for unsigned types, although this may not actually be a problem.
  • Uses horrible C interface (type suffixes in names) with no overloading or template arguments. What function do we use if we want to parse an int32_t?
long strtol(const char* str, char **str_end, int base);

What's wrong with this one?
  • Again it has this silly leading whitespace behavior (see above).
  • It's not obvious how to correctly determine whether or not parsing failed. Every time I use this function I have to look it up again to make sure I get it exactly right and have covered all of the corner cases.
  • Uses 0/T_MAX/T_MIN to denote errors, when these could be validly parsed from strings. Checking whether or not these values were parsed or are representing errors is clumsy.
  • Again C interface issues (see above).
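
For reference, here is a minimal sketch of the checks a strict caller of strtol() has to perform. The helper name parse_long_strict is hypothetical; only std::strtol, errno and ERANGE come from the standard library. Note that even with all of these checks, leading whitespace is still silently accepted.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical helper: true only if s (after any leading whitespace that
// strtol skips) consists entirely of a number that fits in a long.
inline bool parse_long_strict(const char* s, long& out, int base = 10) {
    if (s == nullptr || *s == '\0') return false;
    errno = 0;                          // strtol sets errno but never clears it
    char* end = nullptr;
    const long v = std::strtol(s, &end, base);
    if (end == s) return false;         // no digits were consumed
    if (*end != '\0') return false;     // trailing characters after the number
    if (errno == ERANGE) return false;  // out of range; v was clamped
    out = v;
    return true;
}
```

Four separate conditions are needed to tell "parsed 0" apart from "failed", which is the clumsiness described above.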

At this point, I think we are ready to define a new set of int/float parsing routines.

Design goals:
  • Easy to use, usage is obvious.
  • No assumptions about use cases, we just want to parse strings. This means none of this automatic whitespace handling.
  • Efficient and inline
  • constexpr
Here is a first attempt for an integer parsing routine.

//Attempts to parse s as an integer. The valid integer string consists of the following:
//* '+' or '-' sign as the first character (- only acceptable for signed integral types)
//* prefix (0) indicating octal base (applies only when base is 0 or 8)
//* prefix (0x or 0X) indicating hexadecimal base (applies only when base is 16 or 0).
//* All of the rest of the characters MUST be digits.
//Returns true if an integral value was successfully parsed and stores the value in val,
//otherwise returns false and leaves val unmodified. 
//Sets errno to ERANGE if the string was an integer but would overflow type integral.
template <typename integral>
constexpr bool strto(string_view s, integral& val, int base);

//Same as the previous, except that instead of trying to parse the entire string, we only parse the integral part. 
//The beginning of the string must be an integer as specified above. Will set tail to point to the end of the string after the integral part.
template <typename integral>
constexpr bool strto(string_view s, integral& val, int base, string_view& tail);


First off, all of these return bool which makes it very easy to check whether or not parsing failed.

While the interface does not allow this idiom:

int x = atoi(s);

It works with this idiom which in all of my use cases is much more common:
int val;
if(!strto(s, val, 10)) {
  throw some_error();
}
printf("We parsed %d!\n", val);

Some examples:

int val;
string_view sv= "12345";
assert(strto(sv, val, 10));
assert(val == 12345);
sv = "123 456";
val = -2;
assert(!strto(sv, val, 10));
assert(val == -2);
assert(strto(sv, val, 10, sv));
assert(val == 123);
assert(sv == " 456");
sv.remove_prefix(1); //chop off the " ";
assert(sv == "456");
assert(strto(sv, val, 10));
assert(val == 456);
val = 0;
assert(strto(sv, val, 10, sv));
assert(val == 456);
assert(sv == "");
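
To make the examples above concrete, here is one possible sketch of the two overloads. It is an illustration under assumptions, not the proposed implementation: it uses std::string_view from C++17, accepts bases 2 to 36, does not handle the 0/0x prefixes, assumes ASCII letters, and reports overflow by returning false rather than through errno. The helper digit_value is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string_view>
#include <type_traits>

// Hypothetical helper: value of c as a digit, or -1 (assumes ASCII letters).
constexpr int digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'z') return c - 'a' + 10;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
    return -1;
}

// Parses the integer at the start of s; tail receives whatever follows it.
template <typename Integral>
constexpr bool strto(std::string_view s, Integral& val, int base,
                     std::string_view& tail) {
    static_assert(std::is_integral_v<Integral>);
    if (base < 2 || base > 36) return false;
    using limits = std::numeric_limits<Integral>;
    const Integral b = static_cast<Integral>(base);
    std::size_t i = 0;
    bool negative = false;
    if (i < s.size() && (s[i] == '+' || s[i] == '-')) {
        negative = (s[i] == '-');
        if (negative && !std::is_signed_v<Integral>) return false;
        ++i;
    }
    const std::size_t first = i;
    Integral result = 0;
    for (; i < s.size(); ++i) {
        const int d = digit_value(s[i]);
        if (d < 0 || d >= base) break;
        const Integral digit = static_cast<Integral>(d);
        if (negative) {  // accumulate downwards so that min() is reachable
            if (result < (limits::min() + digit) / b) return false;
            result = static_cast<Integral>(result * b - digit);
        } else {
            if (result > (limits::max() - digit) / b) return false;
            result = static_cast<Integral>(result * b + digit);
        }
    }
    if (i == first) return false;  // no digits at all
    val = result;                  // val is only written on success
    tail = s.substr(i);
    return true;
}

// Succeeds only if the entire string is an integer.
template <typename Integral>
constexpr bool strto(std::string_view s, Integral& val, int base) {
    Integral tmp{};
    std::string_view tail;
    if (!strto(s, tmp, base, tail) || !tail.empty()) return false;
    val = tmp;
    return true;
}
```

Because s is taken by value, passing the same string_view as both s and tail (as in the examples) is safe.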


Similarly we can define this for floating point types. We may also want null terminated const char* versions, as converting a const char* to string_view requires a call to strlen().

dgutson .

Jan 26, 2014, 11:54:47 AM
to std-pr...@isocpp.org
On Sun, Jan 26, 2014 at 1:25 PM, Matt Fioravante <fmatth...@gmail.com> wrote:
> string to T (int, float, etc..) conversions seem like to rather easy task
> (aside from floating point round trip issues), and yet for the life of C and
> C++ the standard library has consistently failed to provide a decent
> interface.
>
> Lets review:

Why didn't you include stringstream in your review? E.g. something
like https://code.google.com/p/mili/source/browse/mili/string_utils.h#267




Miro Knejp

Jan 26, 2014, 12:29:09 PM
to std-pr...@isocpp.org

Am 26.01.2014 17:25, schrieb Matt Fioravante:
string to T (int, float, etc..) conversions seem like to rather easy task (aside from floating point round trip issues), and yet for the life of C and C++ the standard library has consistently failed to provide a decent interface.

Lets review:

int atoi(const char* s); //and atoll,atol,atoll, atof etc..

Whats wrong with this?
  • Returns 0 on parsing failure, making it impossible to parse 0 strings. This already renders this function effectively useless and we can skip the rest of the bullet points right here.
  • It discards leading whitespace, this has several problems of its own:
    • If we want to check whether the string is strictly a numeric string, we have to add our own check that the first character is a digit. This makes the interface clumsy to use and easy to screw up.
    • std::isspace() is locale dependent and requires an indirect function call (try it on gcc.godbolt.org). This makes what could be a very simple and inlinable conversion potentially expensive. It also prevents constexpr.
    • From a design standpoint, this whitespace handling is a very narrow use case. It does too many things and in my opinion is a bad design. I often do not have whitespace delimited input in my projects.
  • No atod() for doubles or atold() for long doubles.
  • No support for unsigned types, although this may not actually be a problem.
  • Uses horrible C interface (type suffixes in names) with no overloading or template arguments. What function do we use if we want to parse an int32_t?
long strtol(const char* str, char **str_end, int base);

Whats wrong with this one?
  • Again it has this silly leading whitespace behavior (see above).
  • Its not obvious how to correctly determine whether or not parsing failed. Every time I use this function I have to look it up again to make sure I get it exactly right and have covered all of the corner cases.
  • Uses 0/T_MAX/T_MIN to denote errors, when these could be validly parsed from strings. Checking whether or not these values were parsed or are representing errors is clumsy.
  • Again C interface issues (see above).
I am currently facing the same problems while working on my format proposal: https://groups.google.com/a/isocpp.org/forum/?fromgroups#!topic/std-proposals/CIlWCTOe5kc
All the existing functions work on null-terminated strings only which is totally useless for my use cases as I am parsing substrings in-place. I intend to design the string processing stuff that I'm using general enough so it can be used independently but for now I just want to make it work in the first place.


At this point, I think we are ready to define a new set of int/float parsing routines.

Design goals:
  • Easy to use, usage is obvious.
  • No assumptions about use cases, we just want to parse strings. This means none of this automatic whitespace handling.
  • Efficient and inline
  • constexpr
Here is a first attempt for an integer parsing routine.

//Attempts to parse s as an integer. The valid integer string consists of the following:
//* '+' or '-' sign as the first character (- only acceptable for signed integral types)
//* prefix (0) indicating octal base (applies only when base is 0 or 8)
//* prefix (0x or 0X) indicating hexadecimal base (applies only when base is 16 or 0).
//* All of the rest of the characters MUST be digits.
//Returns true if an integral value was successfully parsed and stores the value in val,
//otherwise returns false and leaves val unmodified. 
//Sets errno to ERANGE if the string was an integer but would overflow type integral.
template <typename integral>
constexpr bool strto(string_view s, integral& val, int base);
Please no ERxxx nonsense. optional, expected, exceptions, pairs, whatever but no ER codes, that's even more silly C. I currently base mine on iterators and provide string_view as convenience overloads.


//Same as the previous, except that instead of trying to parse the entire string, we only parse the integral part. 
//The beginning of the string must be an integer as specified above. Will set tail to point to the end of the string after the integral part.
template <typename integral>
constexpr bool strto(string_view s, integral& val, int base, string_view& tail);

With iterators it could return the iterator to the first element not part of the integer. A pair<optional<int>, Iter> or similar is a possibility. Certainly not the best concept, but I'd prefer it to checking errno any day, and depending on the combination of whether the optional is engaged and the iterator position you can determine whether it failed and, if so, why. Well, just giving some spontaneous food for thought. I only very recently started with the number parsing part of my proposal, so the interface will probably be very unstable for quite a while. And then there's locales, and together with them a whole new world of problems...
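
A rough sketch of how the pair<optional<int>, Iter> idea could look is given below. The function name parse_int and the exact failure convention are assumptions made for illustration: a disengaged optional with the iterator still at first means "no digits", while a disengaged optional with the iterator advanced past the digits means "overflow". Signs and non-decimal bases are left out to keep it short.

```cpp
#include <cassert>
#include <limits>
#include <optional>
#include <string>
#include <utility>

// Hypothetical sketch: parse a run of decimal digits from [first, last).
template <typename Iter>
std::pair<std::optional<int>, Iter> parse_int(Iter first, Iter last) {
    constexpr int max = std::numeric_limits<int>::max();
    Iter it = first;
    int result = 0;
    bool overflow = false;
    for (; it != last && *it >= '0' && *it <= '9'; ++it) {
        const int d = *it - '0';
        if (!overflow && result > (max - d) / 10) overflow = true;
        if (!overflow) result = result * 10 + d;
    }
    if (it == first || overflow) {
        // Failure: the iterator position tells the caller why.
        return {std::optional<int>(), it};
    }
    return {std::optional<int>(result), it};
}
```

Since only comparison with '0' and '9' and subtraction are used, the same template also accepts iterators over wchar_t, char16_t or char32_t.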

Thiago Macieira

Jan 26, 2014, 1:29:01 PM
to std-pr...@isocpp.org
On domingo, 26 de janeiro de 2014 08:25:02, Matt Fioravante wrote:
> At this point, I think we are ready to define a new set of int/float
> parsing routines.
>
> Design goals:
>
> - Easy to use, usage is obvious.

I'm sure that everyone intended that for their functions. That survived only
until the first round of feedback or encounter with reality...

> - No assumptions about use cases, we just want to parse strings. This
> means none of this automatic whitespace handling.

Fair enough. It's easier to compose with other space checkers if you need to
than to remove functionality.

> - Efficient and inline
> - constexpr

Efficient, definitely. Inline and constexpr? Forget it, it can't be done. Have
you ever looked at the source of a string-to-double function? They're huge!
This might be left as a suggestion to compilers to implement this as an
intrinsic.

> //Attempts to parse s as an integer. The valid integer string consists of
> the following:
> //* '+' or '-' sign as the first character (- only acceptable for signed
> integral types)

But no U+2212?

> //* prefix (0) indicating octal base (applies only when base is 0 or 8)
> //* prefix (0x or 0X) indicating hexadecimal base (applies only when base
> is 16 or 0).
> //* All of the rest of the characters MUST be digits.

Where, by "digits", we understand the regular ASCII digits 0 to 9 and the
letters that compose digits on this base, both in uppercase and lowercase.

> //Returns true if an integral value was successfully parsed and stores the
> value in val,
> //otherwise returns false and leaves val unmodified.
> //Sets errno to ERANGE if the string was an integer but would overflow type
> integral.

What if it failed to parse? What's the return condition?

As others have said, using errno is too C, but then again this kind of
function should be done in conjunction with the C people. Any improvements we
need, they probably need too.

> template <typename integral>
> constexpr bool strto(string_view s, integral& val, int base);

Replace string_view with a pair of InputIterators.

Do you know what this means? Parsing char16_t, char32_t and wchar_t too.

> //Same as the previous, except that instead of trying to parse the entire
> string, we only parse the integral part.
> //The beginning of the string must be an integer as specified above. Will
> set tail to point to the end of the string after the integral part.
> template <typename integral>
> constexpr bool strto(string_view s, integral& val, int base, string_view&
> tail);

Same as above.

> First off, all of these return bool which makes it very easy to check
> whether or not parsing failed.

That's the opposite of what most people want. Most people want to get the
parsed number, not whether it succeeded or failed. Maybe invert the logic?

That's what we do for {QString,QByteArray,QLocale}::to{Int,Double,etc.}. And
one suggestion I received a few weeks ago was to add the overload that returns
the end pointer and does not fail if there's more stuff after it.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Vicente J. Botet Escriba

Jan 26, 2014, 4:26:57 PM
to std-pr...@isocpp.org
Le 26/01/14 18:29, Miro Knejp a écrit :

Am 26.01.2014 17:25, schrieb Matt Fioravante:
string to T (int, float, etc..) conversions seem like to rather easy task (aside from floating point round trip issues), and yet for the life of C and C++ the standard library has consistently failed to provide a decent interface.


At this point, I think we are ready to define a new set of int/float parsing routines.

Design goals:
  • Easy to use, usage is obvious.
  • No assumptions about use cases, we just want to parse strings. This means none of this automatic whitespace handling.
  • Efficient and inline
  • constexpr
Here is a first attempt for an integer parsing routine.

//Attempts to parse s as an integer. The valid integer string consists of the following:
//* '+' or '-' sign as the first character (- only acceptable for signed integral types)
//* prefix (0) indicating octal base (applies only when base is 0 or 8)
//* prefix (0x or 0X) indicating hexadecimal base (applies only when base is 16 or 0).
//* All of the rest of the characters MUST be digits.
//Returns true if an integral value was successfully parsed and stores the value in val,
//otherwise returns false and leaves val unmodified. 
//Sets errno to ERANGE if the string was an integer but would overflow type integral.
template <typename integral>
constexpr bool strto(string_view s, integral& val, int base);
Please no ERxxx nonsense. optional, expected, exceptions, pairs, whatever but no ER codes, that's even more silly C.
+1

Vicente

Matt Fioravante

Jan 26, 2014, 5:10:53 PM
to std-pr...@isocpp.org
On Sunday, January 26, 2014 11:54:47 AM UTC-5, dgutson . wrote:

Why didn't you include stringstream in your review? E.g. something 
like https://code.google.com/p/mili/source/browse/mili/string_utils.h#267 

Sure, I'll review it right now.
Stringstream is slow. It is also painful to use, as you are forced to have your string stored within a stringstream object. This makes it even slower, as you have to copy your string data into a string stream, possibly also allocating memory. It's a non-starter.

On Sunday, January 26, 2014 12:29:09 PM UTC-5, Miro Knejp wrote:
I am currently facing the same problems while working on my format proposal: https://groups.google.com/a/isocpp.org/forum/?fromgroups#!topic/std-proposals/CIlWCTOe5kc
All the existing functions work on null-terminated strings only which is totally useless for my use cases

Agree this is a completely unacceptable restriction. One day I hope null terminated strings will just go away and we will all use string_view.
 
as I am parsing substrings in-place. I intend to design the string processing stuff that I'm using general enough so it can be used independently but for now I just want to make it work in the first place.

Please no ERxxx nonsense. optional, expected, exceptions, pairs, whatever but no ER codes, that's even more silly C. I currently base mine on iterators and provide string_view as convenience overloads.

Fair enough, errno does seem to be pretty loathed. The question then is: how do you tell the user why it failed? Does the user even care? Perhaps instead of returning bool, you do the old style of returning an int. 0 means success and different non-zero values can be used to represent different reasons for failure such as parsing errors and range errors. We can reuse errno tags for the return value or create an enum.

int rc;
if((rc = strto(s, val, 10)) != 0) {
  if(rc == ERANGE) {
    printf("Out of range!\n");
  } else {
    printf("Parsing error!\n");
  }
}
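
The enum variant of this idea could be sketched as follows. The names parse_status and parse_int_status are hypothetical. To keep the sketch short it delegates to std::from_chars, which was only standardized later (C++17) and is used here purely as a convenient parser; it does not accept a leading '+' or whitespace.

```cpp
#include <cassert>
#include <charconv>
#include <string_view>
#include <system_error>

// Hypothetical status codes returned instead of bool or errno.
enum class parse_status { ok, invalid, out_of_range };

// Parses the whole of s as an int; val is only written on success.
inline parse_status parse_int_status(std::string_view s, int& val, int base = 10) {
    int tmp = 0;
    const char* const last = s.data() + s.size();
    const auto res = std::from_chars(s.data(), last, tmp, base);
    if (res.ec == std::errc::result_out_of_range) return parse_status::out_of_range;
    if (res.ec != std::errc() || res.ptr != last) return parse_status::invalid;
    val = tmp;
    return parse_status::ok;
}
```

A caller can then switch on the result to distinguish range errors from malformed input without touching any global state.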

Exceptions are possible but rather heavy weight. Constructing an exception usually means also constructing a string error message. Not only do you have to pay for the allocation of this string, it may not match the kind of error reporting you'd like to provide to the caller, if any at all. The exception would also need to provide a kind of error code for quickly detecting why the conversion failed if you want to specially handle different failure modes.

In many cases users might want to throw an exception but we should not force that because exceptions can be too expensive for red hot parsing code. We might want to also add strto_e() or something that simply wraps strto() and throws an exception for convenience when throwing on parsing error makes sense.

With iterators it could return the iterator to the first element not part of the integer. a pair<optional<int>, Iter> or similar is a possibility.

Seems rather complicated to stuff all of that into the return value, no? Returning a std::optional<int> would be ok because you can directly check the return value with operator bool(). It still doesn't provide information on why the failure occurred though.

I kind of like my idea of having an overload where you pass a string_view/iterator by reference to get the end position if you want it. Not all out parameters have to be in the return value. Return values make it easy to write expressions. Using out parameters is perfectly fine too. This also makes it very easy to parse a string that is supposed to exactly match a number.

Certainly not the best concept but I'd prefer it to checking errno anyday and depending on the combination of optional's engaged and the iterator position you can determine whether it failed and if so, why. Well, just giving some spontaneous food for thought. I only very recently started with the number parsing part of my proposal, so the interface will probbaly be very unstable for quite a while. And then there's locales, and together with them a whole new world of problems...

std::isdigit() at least on gcc linux is inlined, so it looks like digits don't require locales for the int conversions. Floating point will require more careful handling with the comma vs period.

On Sunday, January 26, 2014 1:29:01 PM UTC-5, Thiago Macieira wrote:

I'm sure that everyone intended that for their functions. That survived only
until the first round of feedback or encounter with reality...

That's probably true, although I still can't imagine the thought process that went into the 0 return value for atoi(), or gets() (but that's another story).
 
>    - No assumptions about use cases, we just want to parse strings. This
>    means none of this automatic whitespace handling.

Fair enough. It's easier to compose with other space checkers if you need to
than to remove functionality.

Better to add functionality than to have to remove stuff you don't want.

>    - Efficient and inline
>    - constexpr

Efficient, definitely. Inline and constexpr? Forget it, it can't be done. Have
you ever looked at the source of a string-to-double function? They're huge!
This might be left as a suggestion to compilers to implement this as an
intrinsic.
It might be nice to have compile time string -> double conversions, but I agree that for floating point it's a huge, complicated problem.

For int conversions, inline/constexpr might be doable (but not if we're using errno).


> //Attempts to parse s as an integer. The valid integer string consists of
> the following:
> //* '+' or '-' sign as the first character (- only acceptable for signed
> integral types)

But no U+2212?

We could consider unicode as well. That's a good question. 

> //* prefix (0) indicating octal base (applies only when base is 0 or 8)
> //* prefix (0x or 0X) indicating hexadecimal base (applies only when base
> is 16 or 0).
> //* All of the rest of the characters MUST be digits.

Where, by "digits", we understand the regular ASCII digits 0 to 9 and the
letters that compose digits on this base, both in uppercase and lowercase.

Yes that's right. 

Maybe we should add an extra boolean argument (defaulted to true) that can be used to disable the hex and octal prefixes. Sometimes you really want to just parse a hex string without the 0x prefix. Adding an extra false to the parameter list is nicer than doing this check yourself. It's similar to disabling the leading whitespace check of strtol().
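
A small sketch of what such a flag could look like for the hex case is shown below. The function name parse_hex and the parameter name allow_prefix are made up for illustration, and std::from_chars (C++17, which never accepts a 0x prefix itself) is used only as the underlying digit parser.

```cpp
#include <cassert>
#include <charconv>
#include <string_view>
#include <system_error>

// Hypothetical: parse all of s as hexadecimal. When allow_prefix is true an
// optional leading "0x" or "0X" is skipped; when false it is a parse error.
inline bool parse_hex(std::string_view s, unsigned& val, bool allow_prefix = true) {
    if (allow_prefix && s.size() >= 2 && s[0] == '0' &&
        (s[1] == 'x' || s[1] == 'X')) {
        s.remove_prefix(2);
    }
    unsigned tmp = 0;
    const char* const last = s.data() + s.size();
    const auto res = std::from_chars(s.data(), last, tmp, 16);
    if (res.ec != std::errc() || res.ptr != last) return false;
    val = tmp;
    return true;
}
```

With the flag set to false, "0x1f" fails because parsing stops at the 'x' and the input is not fully consumed.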


> //Returns true if an integral value was successfully parsed and stores the
> value in val,
> //otherwise returns false and leaves val unmodified.
> //Sets errno to ERANGE if the string was an integer but would overflow type
> integral.

What if it failed to parse? What's the return condition?

As others have said, using errno is too C, but then again this kind of
function should be done in conjunction with the C people. Any improvements we
need, they probably need too.

With overloading, templates, iterators, string_view, etc., it's not so C compatible. Do we really care so much anyway? I don't like the idea of handicapping C++ interfaces in the name of C compatibility.

As mentioned above, it's a question of how to tell the user why the failure occurred. If not through errno then it must be through the return value.

> template <typename integral>
> constexpr bool strto(string_view s, integral& val, int base);

Replace string_view with a pair of InputIterators.

Agree, although I'd still want a string_view wrapper for convenience.
 

Do you know what this means? Parsing char16_t, char32_t and wchar_t too.

Yes, but that's not so difficult.
 

> //Same as the previous, except that instead of trying to parse the entire
> string, we only parse the integral part.
> //The beginning of the string must be an integer as specified above. Will
> set tail to point to the end of the string after the integral part.
> template <typename integral>
> constexpr bool strto(string_view s, integral& val, int base, string_view&
> tail);

Same as above.

> First off, all of these return bool which makes it very easy to check
> whether or not parsing failed.

That's the opposite of what most people want. Most people want to get the
parsed number, not whether it succeded or failed. Maybe invert the logic?

I think that's a voting/bikeshed question (or use std::optional). I much prefer the pass/fail in the return value because I can wrap the call to strto and error check in a single line. Any code related to parsing is always very heavy with error checking conditionals.

This:

int x;
if(!strto(s, x, 10)) {
  throw error;
}

vs. this:

bool rc;
int x = strto(s, rc, 10);
if(!rc) {
  throw error;
}

The first version is more compact and in my opinion easier to read. Parsing code almost always requires very careful error handling, unless you know a priori at compile time that the string really is an integer string and you just need a conversion (which is an incredibly rare case).

The boolean return value also emphasizes that you must be diligent about checking user input, and I'd argue it encourages this behavior for novice programmers. It's very easy for a beginner to just write x = atoi(s); and move on, only to later have to track down some bug that shows up somewhere else because they didn't check the result.

The only thing you're buying with the value itself being returned is being able to use strto() in an expression.

bool rc;
int y = (x + (5 * strto(s, rc, 10))) / 25;
if(!rc) {
  throw error;
}

I don't like the above idiom for several reasons:
  • The error checking comes after the whole expression, forcing the person reading the code to expend more mental energy to link the error check with the strto() call buried in all of that math. Multiple strto() calls in the same expression with multiple boolean variables is even more fun. Error checking should occur right after the thing being checked or, even better, within a single expression. Also, it may be more efficient to avoid the unnecessary math in the case of an error (although the compiler is likely to reorder for you anyway).
  • This kind of expression is actually somewhat dangerous for floating point. The result coming from a strto<float>() parsing error could cause a signaling NaN to be generated within the following expression and then throw an exception the programmer was not expecting. More fun debugging that one. You can avoid this, but again that means the programmer has to think before using strto(). strtol() and friends already require too much thinking, as I've already explained.



That's what we do for {QString,QByteArray,QLocale}::to{Int,Double,etc.}. And
one suggestion I received a few weeks ago was to add the overload that returns
the end pointer and does not fail if there's more stuff after it.

That's pretty much the same as my second overload. I think both use cases are very valid. In many situations you have a long string and you want to parse the int at the beginning and then continue on with whatever is supposed to come after it. Finding the position after the int of course requires parsing out the int.
 

Bengt Gustafsson

Jan 26, 2014, 5:13:47 PM
to std-pr...@isocpp.org
What we really need for the input string is "something that begin() and end() functions work for". This avoids the tedious sending of two iterators, while covering vectors, string literals (I hope), strings and string_views without overloading.

In the fairly high percentage of cases when we want to do skipspace we need to have an easy way to do that. One may be:

auto skipspace(const RNG& src)->string_view<decltype(*begin(src))> { ... }

I think that string_view<T> will work for at least MOST of the stuff that begin() and end() work for, so it is a logical return type choice, although a more generic range<T> could also be used, or even _implementation_dependent_, although that puts us in a place where we need to define what this undefined type can be used for.

Anyhow, this helper allows us to write

strto(dest, skipspace(src));

Which I think is a decent syntax. (and I do think that the destination should be the first parameter).

However, we still do not allow trailing spaces, so maybe skipspace should be like the classic strip instead, i.e. be able to handle both ends of the string by flags or like this sketch:

strip_front(src)
strip_back(src)
strip (src) { return strip_back(strip_front(src)); }

With the begin(src)/end(src) usage of the source string, the type of the value indicating the final position should be decltype(begin(src))&, I guess. This is not a range, so the reassembly of the range is up to the caller, which is not optimal. The alternative is to use the same template parameter for s and tail. The problem with this is that we need to be able to set its contents, but the only thing we know of this type is that it has begin() and end() defined for it...

I have no particularly good solution to this problem. I had a similar proposal as this one in the works, but it also mandates a string_view as the source. The main difference was that there were two different function names, and one of them modified the string_view in situ. I think that while this does not solve the problem with updating a generic range, it gives a nicer type of code at the call site for the parsing case (use only one string_view as the "cursor" of parsing):

bool from_string(T& dest, const string_view<C>& src);  // requires the src to only contain the string representation of T
bool parse_string(T& dest, string_view<C>& src);         // updates src to reflect how many chars were consumed converting to T
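
As an illustration of the "cursor" style, here is a minimal sketch of the two functions for T = int and plain std::string_view (the templated string_view<C> of the declarations above is simplified away). The implementation via std::from_chars (C++17) is an assumption chosen for brevity, not part of the original idea.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Consumes an int from the front of src and advances src past it.
inline bool parse_string(int& dest, std::string_view& src) {
    int tmp = 0;
    const auto res = std::from_chars(src.data(), src.data() + src.size(), tmp);
    if (res.ec != std::errc()) return false;  // src and dest stay untouched
    dest = tmp;
    src.remove_prefix(static_cast<std::size_t>(res.ptr - src.data()));
    return true;
}

// Requires src to contain nothing but the representation of an int.
inline bool from_string(int& dest, std::string_view src) {
    int tmp = 0;
    if (!parse_string(tmp, src) || !src.empty()) return false;
    dest = tmp;
    return true;
}
```

With this shape, parsing "10,20" is a matter of calling parse_string, removing the one-character separator from the same string_view, and calling parse_string again.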

Matt Fioravante

Jan 26, 2014, 5:25:39 PM
to std-pr...@isocpp.org


On Sunday, January 26, 2014 5:13:47 PM UTC-5, Bengt Gustafsson wrote:
What we really need for the input string is "something that begin() and end() functions work for". This avoids the tedious sending of two iterators, while covering vectors, string literals (I hope), strings and string_views without overloading.

In the fairly high percentage of cases when we want to do skipspace we need to have a easy way to do that. One may be:

auto skipspace(const RNG& src)->string_view<decltype(*begin(src)> { ... }

I think a general facility for stripping spaces, quotes etc.. would be nice. But that's a separate proposal.

template <typename F>
string_view lstrip(string_view s, F conditional);
string_view lstrip(string_view s, char c);
//along with rstrip() and strip() for both sides. 

Now you can do strto(lstrip(s, std::isspace), val, 10); if you want the whitespace behavior.
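
A possible sketch of these helpers, assuming std::string_view from C++17, is shown below. One caveat: passing std::isspace directly is fragile (it is overloaded, and calling it with a negative char value is undefined), so the sketch uses a small hypothetical predicate is_ascii_space instead.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical locale-independent whitespace test.
constexpr bool is_ascii_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\f' || c == '\v';
}

// Drops leading characters for which pred returns true.
template <typename Pred>
constexpr std::string_view lstrip(std::string_view s, Pred pred) {
    while (!s.empty() && pred(s.front())) s.remove_prefix(1);
    return s;
}

// Drops leading occurrences of the single character c.
constexpr std::string_view lstrip(std::string_view s, char c) {
    while (!s.empty() && s.front() == c) s.remove_prefix(1);
    return s;
}

// Drops trailing characters for which pred returns true.
template <typename Pred>
constexpr std::string_view rstrip(std::string_view s, Pred pred) {
    while (!s.empty() && pred(s.back())) s.remove_suffix(1);
    return s;
}

// Strips both ends.
template <typename Pred>
constexpr std::string_view strip(std::string_view s, Pred pred) {
    return rstrip(lstrip(s, pred), pred);
}
```

None of these copy or allocate; they only narrow the view, so composing them with a strict parser keeps the whitespace policy in the caller's hands.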

Bengt Gustafsson

Jan 26, 2014, 5:26:09 PM
to std-pr...@isocpp.org
I have thought about proposing an error_return class which throws in its destructor unless operator bool() has been executed. It has an ignore() method for the case where you don't care about errors. This class could be augmented with an error code/error string or something that indicates what the cause of the error was. Also, it could have a rethrow() method you can call if you explicitly want a failed conversion to throw. Use cases:

from_string(x, "123").ignore();  // Ignore error

if (!from_string(...))
    handle error;

from_string(...).rethrow();    // Ask for an exception if conversion fails. (Bikeshed warning here!)

from_string(...);       // Throw on first call, even if conversion worked. Or preferably (if possible to implement) generate a static_assert.

This would be a general class useful in these kinds of cases throughout the standard (and proprietary) C++ libraries. It could be templated on the error information's type, I guess.
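
A minimal sketch of such a class is given below, under several assumptions: it needs C++17 (guaranteed copy elision and std::uncaught_exceptions), the exception types are arbitrary choices, and checked_from_string is a hypothetical stand-in for from_string built on std::from_chars. The static_assert variant mentioned above is not attempted.

```cpp
#include <cassert>
#include <charconv>
#include <exception>
#include <stdexcept>
#include <string_view>
#include <system_error>

class error_return {
public:
    explicit error_return(bool ok) noexcept : ok_(ok) {}
    error_return(const error_return&) = delete;
    error_return& operator=(const error_return&) = delete;

    // Throws if nobody looked at the result. Skipped during stack unwinding,
    // where a second exception would terminate the program.
    ~error_return() noexcept(false) {
        if (!checked_ && std::uncaught_exceptions() == 0)
            throw std::logic_error("error_return was never checked");
    }

    explicit operator bool() noexcept { checked_ = true; return ok_; }
    void ignore() noexcept { checked_ = true; }
    void rethrow() {
        checked_ = true;
        if (!ok_) throw std::runtime_error("conversion failed");
    }

private:
    bool ok_;
    bool checked_ = false;
};

// Hypothetical conversion function returning the checked result type.
inline error_return checked_from_string(int& dest, std::string_view src) {
    int tmp = 0;
    const char* const last = src.data() + src.size();
    const auto res = std::from_chars(src.data(), last, tmp);
    const bool ok = (res.ec == std::errc() && res.ptr == last);
    if (ok) dest = tmp;
    return error_return(ok);
}
```

Throwing from a destructor is unusual and has to be opted into with noexcept(false), which is probably the most contentious part of this design.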

Matt Fioravante

Jan 26, 2014, 5:33:47 PM
to std-pr...@isocpp.org
And just to further emphasize the question of returning pass/fail vs returning the value.

My general philosophy with parsing is to emphasize error handling first and then the actual results second. The success or failure of the parse should be thrown right in your face, forcing you to deal with it. This helps remind us to write more correct code. I'd be happy to know if people agree or not.

Roland Bock

Jan 26, 2014, 5:41:55 PM
to std-pr...@isocpp.org
On 2014-01-26 23:33, Matt Fioravante wrote:
An just to further emphasize the question of returning pass/fail vs returning the value.

My general philosophy with parsing is to emphasize error handling first and then the actual results second. The success or failure of the parse should be thrown right in your face, forcing you to deal with it. This helps remind us to write more correct code. I'd be happy to know if people agree or not.
--
It really depends on the use case.

  1. If 0 (or some other value) is an acceptable fall-back result, I don't want to litter my code with error handling.
  2. If a parse error has to be recorded or has to provoke some other action (e.g. ask the user to re-enter data), then I want to be forced to deal with the errors.

A good interface supports at least these two options, I'd say.

Just my 2t...

Cheers,

Roland

Thiago Macieira

Jan 26, 2014, 5:45:54 PM1/26/14
to std-pr...@isocpp.org
On domingo, 26 de janeiro de 2014 14:10:53, Matt Fioravante wrote:
> Exceptions are possible but rather heavy weight. Constructing an exception
> usually means also constructing a string error message. Not only do you
> have to pay for the allocation of this string

Exceptions are heavy weight, indeed, but allocating memory for the string
message is usually a big no-no. Rule of thumb for exceptions: don't allocate
memory in order to throw (or how would you report an OOM situation?).

> > > - Efficient and inline
> > > - constexpr
> >
> > Efficient, definitely. Inline and constexpr? Forget it, it can't be done.
> > Have
> > you ever looked at the source of a string-to-double function? They're
> > huge!
> > This might be left as a suggestion to compilers to implement this as an
> > intrinsic.
>
> It might be nice to have compile time string -> double conversions, but I
> agree for floating point its a huge complicated problem.
> http://www.exploringbinary.com/how-strtod-works-and-sometimes-doesnt/
>
> For int conversions, inline/constexpr might be doable (but not if we're
> using errno).

Only after we get a way to have constexpr code that is only used when
expanding constexpr arguments at compile time. The code for executing a
constexpr integer conversion will probably be larger than the optimised non-
constexpr version. Since the vast majority of the uses of this function will
be to parse strings not known at compile time, I much prefer that they be
efficient for runtime operation.

> > As others have said, using errno is too C, but then again this kind of
> > function should be done in conjunction with the C people. Any improvements
> > we
> > need, they probably need too.
>
> With overloading, templates, iterators, string_view, etc.. its not so C
> compatible. Do we really care so much anyway? I don't like the idea of
> handicapping C++ interfaces in the name of C compatibility.

C11 has generics, so that solves the problem of the templates and the
iterators.

But it might be that this C++ function get implemented with calls to strtol,
strtoul, strtoll, strtoull, strtod, etc. anyway. So maybe the C guys already
have what they need, except for the C Generic version.

And the char16_t, char32_t and wchar versions.

> > Do you know what this means? Parsing char16_t, char32_t and wchar_t too.
>
> Yes, but that's not so difficult.

That depends on whether locale parsing is performed. We need functions that
don't depend on the locale, in which case a conversion from char16_t and
char32_t to the execution charset can be done quite quickly (again something
for which the constexpr version would be slower than the runtime optimised
version). After the conversion is done, the char variant can be called.

That's how QString::to{Int,Double} is implemented: first, fast-convert from
UTF-16 to Latin 1, then call the internal strtoll / strtod.

> > That's the opposite of what most people want. Most people want to get the
> > parsed number, not whether it succeded or failed. Maybe invert the logic?
>
> I think that's a voting/bikeshed question (or use std::optional).

Agreed, which is why I'm not going to continue this part of the discussion :-)

Thiago Macieira

Jan 26, 2014, 5:51:04 PM1/26/14
to std-pr...@isocpp.org
On domingo, 26 de janeiro de 2014 14:33:47, Matt Fioravante wrote:
> My general philosophy with parsing is to emphasize error handling first and
> then the actual results second. The success or failure of the parse should
> be thrown right in your face, forcing you to deal with it. This helps
> remind us to write more correct code. I'd be happy to know if people agree
> or not.

You're asking for the bikeshed discussion.

I don't agree. Sometimes you already know that the data is well-formed and you
don't need the error status. Therefore, emphasising the actual data is more
important.

Given use of exceptions, the philosophy I described in the paragraph seems to
apply to the Standard Library.

Miro Knejp

Jan 26, 2014, 6:32:43 PM1/26/14
to std-pr...@isocpp.org

Seems rather complicated to stuff all of that into the return value, no? Returning a std::optional<int> would be OK because you can directly check the return value with operator bool(). It still doesn't provide information on why the failure occurred though.
It really depends on what you want to do. If you use iterators in the first place, you very likely have a scenario where you want to know where your number ends for further processing. If you consider iterators noise in your current use case, use the non-iterator overloads such as the string_view one.

Just for exposition:
optional<int> x;
tie(x, from) = string_to<int>(from, to);
if(x) ...

- or -

auto x = string_to<int>(from, to).first;
if(x) ...

If you only care whether there was a valid number or not then test x and continue parsing/throwing/whatever. It may look weird but it's in line with the spirit of the standard library. Afaik there is not a single std function that uses iterators as out-arguments. They are always taken and returned by value. Deviating from that would need good reasons. Multiple return values would be such a nice thing right now (without tupling everything)...



> //Attempts to parse s as an integer. The valid integer string consists of
> the following:
> //* '+' or '-' sign as the first character (- only acceptable for signed
> integral types)

But no U+2212?

We could consider unicode as well. That's a good question. 

> //* prefix (0) indicating octal base (applies only when base is 0 or 8)
> //* prefix (0x or 0X) indicating hexadecimal base (applies only when base
> is 16 or 0).
> //* All of the rest of the characters MUST be digits.

Where, by "digits", we understand the regular ASCII digits 0 to 9 and the
letters that compose digits on this base, both in uppercase and lowercase.

Yes that's right. 

Maybe we should add an extra boolean argument (defaulted to true) that can be used to disable the hex and octal prefixes. Sometimes you really want to just parse a hex string without the 0x prefix. Adding an extra false to the parameter list is nicer than doing this check yourself. It's similar to disabling the leading whitespace check of strtol().
Bools aren't very descriptive. I'd prefer a more explicit syntax for overloads. For example:

to_string<int>(s); // Use current locale
to_string<int>(s, no_locale); // Tag type: ASCII only, fast path with no facets, no virtuals
to_string<int>(s, myLocale);

Makes sense? It's something I'm playing around with currently.

Matt Fioravante

Jan 26, 2014, 6:33:55 PM1/26/14
to std-pr...@isocpp.org

On Sunday, January 26, 2014 5:41:55 PM UTC-5, Roland Bock wrote:
On 2014-01-26 23:33, Matt Fioravante wrote:
And just to further emphasize the question of returning pass/fail vs returning the value.

My general philosophy with parsing is to emphasize error handling first and then the actual results second. The success or failure of the parse should be thrown right in your face, forcing you to deal with it. This helps remind us to write more correct code. I'd be happy to know if people agree or not.
-- 
It really depends on the use case. 

  1. If 0 (or some other value) is an acceptable fall-back result, I don't want to litter my code with error handling.
Your fallback value may be 0, that guy's fallback might be 1, the other guy's is INT_MIN, and maybe mine is -1. One interface cannot account for all of these possibilities, and assuming one of them is the same sin as parsing white space before the string. It's easy enough to write a wrapper to suit your specific requirements.

constexpr int kFallback = 0;

int mystrto(string_view s, int base) {
  int x;
  return strto(s, x, base) ? x : kFallback;
}

If strto() is inline, the compiler can remove the boolean return value and the conditional in mystrto, having the resulting code return kFallback on error, resulting in no runtime overhead for your wrapper.
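For illustration, here is one way the bool-returning strto() and the wrapper could fit together. This is only a sketch under assumptions: it is built on C++17's std::from_chars (which did not exist when this thread was written), it requires the whole string to be consumed, and it accepts neither leading whitespace, a leading '+', nor 0x/0 prefixes:

```cpp
#include <charconv>
#include <string_view>
#include <system_error>

// Sketch of the proposed strto(): returns true only if all of s is a valid,
// in-range integer. val is left untouched on failure.
template <typename Integral>
bool strto(std::string_view s, Integral& val, int base = 10) {
    Integral tmp{};
    const char* const last = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(s.data(), last, tmp, base);
    if (ec != std::errc{} || ptr != last) return false;
    val = tmp;  // commit only after a complete, successful parse
    return true;
}

constexpr int kFallback = 0;

// The wrapper from the message above: hides the error and returns a fallback.
inline int mystrto(std::string_view s, int base) {
    int x;
    return strto(s, x, base) ? x : kFallback;
}
```

Parsing into a temporary first is what keeps val unchanged for inputs like "12x", where digits are consumed before the failure is detected.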

  1. If a parse error has to be recorded or has to provoke some other action (e.g. ask the user to re-enter data), then I want to be forced to deal with the errors.

A good interface supports at least these two options, I'd say.

We have 2 differing use cases here. Can one interface cleanly support both?

C++14 constexpr is pretty liberal, but I agree that runtime performance is absolutely paramount. constexpr can always be tacked on later if it's deemed feasible and useful to someone. Let's forget the constexpr question for now.


> > As others have said, using errno is too C, but then again this kind of 
> > function should be done in conjunction with the C people. Any improvements 
> > we 
> > need, they probably need too. 

> With overloading, templates, iterators, string_view, etc.. its not so C 
> compatible. Do we really care so much anyway? I don't like the idea of 
> handicapping C++ interfaces in the name of C compatibility. 

C11 has generics, so that solves the problem of the templates and the 
iterators. 

But it might be that this C++ function get implemented with calls to strtol, 
strtoul, strtoll, strtoull, strtod, etc. anyway. So maybe the C guys already 
have what they need, except for the C Generic version. 

I say we focus on making the best C++ interface possible. If the C community steps up and shows interest, we can then consider changing things for compatibility. Or they can make their own that works best for C. Like it or not the 2 languages are diverging.

 

And the char16_t, char32_t and wchar versions. 

> > Do you know what this means? Parsing char16_t, char32_t and wchar_t too. 

> Yes, but that's not so difficult. 

That depends on whether locale parsing is performed. We need functions that 
don't depend on the locale, in which case a conversion from char16_t and 
char32_t to the execution charset can be done quite quickly (again something 
for which the constexpr version would be slower than the runtime optimised 
version). After the conversion is done, the char variant can be called. 

That's how QString::to{Int,Double} is implemented: first, fast-convert from 
UTF-16 to Latin 1, then call the internal strtoll / strtod. 

constexpr aside, all QoI issues I believe.
 

> > That's the opposite of what most people want. Most people want to get the 
> > parsed number, not whether it succeded or failed. Maybe invert the logic? 

> I think that's a voting/bikeshed question (or use std::optional). 

Agreed, which is why I'm not going to continue this part of the discussion :-) 

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org 
   Software Architect - Intel Open Source Technology Center 
      PGP/GPG: 0x6EF45358; fingerprint: 
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358 
 
On Sunday, January 26, 2014 5:51:04 PM UTC-5, Thiago Macieira wrote:
On domingo, 26 de janeiro de 2014 14:33:47, Matt Fioravante wrote:
> My general philosophy with parsing is to emphasize error handling first and
> then the actual results second. The success or failure of the parse should
> be thrown right in your face, forcing you to deal with it. This helps
> remind us to write more correct code. I'd be happy to know if people agree
> or not.

You're asking for the bikeshed discussion.
 
Haha ok bring it on :)
 

I don't agree. Sometimes you already know that the data is well-formed and you
don't need the error status. Therefore, emphasizing the actual data is more
important.

I don't know about you or others, but the majority of the time when I need to do these conversions I am getting my string from the command line, a file, a network socket, and/or another user input of some kind etc.. All of those require strict error checking. 


Even if you know the string is supposed to be parsed correctly, it doesn't hurt to throw in an assert() or debug mode error message check in there in case someone (or you yourself) made a mistake earlier and broke your invariant. 

You're a Qt developer, so I suppose the use case you're mentioning is a GUI box which already checks the input before passing it down? Self-validating UI is one very common use case, but not the only one, as I've mentioned already.
 
What has come out of this are 2 distinct use cases, error checking emphasis vs results emphasis. We have 3 options:
  1. Come up with an interface that somehow satisfies both
  2. Make 2 interfaces, one for each situation
  3. Prioritize one over the other
I'm not sure how to do (1) and (2) seems like it could be confusing to have 2 interfaces that do the exact same thing with slightly different calling conventions. So that leaves us with (3), and I obviously stand firm in the safety camp.

Here is another project which seems to agree with my philosophy. It's a sample library designed to teach Linux kernel developers how to write good userspace C libraries.

Functions should return int and negative errors instead of NULL
  - Return NULL in malloc() is fine, return NULL in fopen() is not!
  - Pass allocated objects as parameter (yes, ctx_t** is OK!)
  - Returning kernel style negative <errno.h> error codes is cool in
    userspace too. Do it!

This is a C API, but the concept still translates to C++. Here they encourage returning error conditions, and stuffing the results into out parameters. While I might not do this for everything, I certainly would for error heavy routines such as parsing.


Given use of exceptions, the philosophy I described in the paragraph seems to
apply to the Standard Library.

Maybe so, but I'd rather come up with the safest, most expressive, most easy to use, and most efficient interface possible. Regardless of past precedents.

Matt Fioravante

Jan 26, 2014, 6:42:31 PM1/26/14
to std-pr...@isocpp.org
Also for the UI validation case, you can reuse the same function to do the validation, in which case the error handling becomes paramount and must be efficient (no exceptions). You can even cache the parsed result into your widget and then reuse it later instead of parsing the input string twice, if that makes sense to do.

Matt Fioravante

Jan 26, 2014, 6:46:29 PM1/26/14
to std-pr...@isocpp.org


On Sunday, January 26, 2014 6:32:43 PM UTC-5, Miro Knejp wrote:
Bools aren't very descriptive. I'd prefer a more explicit syntax for overloads. For example:
 

to_string<int>(s); // Use current locale
to_string<int>(s, no_locale); // Tag type: ASCII only, fast path with no facets, no virtuals
to_string<int>(s, myLocale);

Makes sense? It's something I'm playing around with currently.

I've had ideas in this direction as well, but that's a separate proposal. In one instance I benchmarked a hand-written
inline bool ascii::isspace(char c) which proved to be almost 30% faster than std::isspace(). Those indirect function calls are horribly expensive.
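The benchmarked function is not shown in the thread; a locale-independent version might look roughly like the sketch below. It assumes an ASCII-compatible execution character set, where '\t', '\n', '\v', '\f' and '\r' are contiguous:

```cpp
namespace ascii {

// True for the six characters std::isspace accepts in the "C" locale:
// space, '\t', '\n', '\v', '\f', '\r'. No locale lookup, no indirect call.
constexpr bool isspace(char c) noexcept {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

}  // namespace ascii
```

Because it is constexpr and has no dependency on the global locale, a compiler is free to inline it at every call site.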

Roland Bock

Jan 27, 2014, 2:50:29 AM1/27/14
to std-pr...@isocpp.org
On 2014-01-27 00:33, Matt Fioravante wrote:

A good interface supports at least these two options, I'd say.

We have 2 differing uses cases here. Can one interface cleanly support both?
Overloads.
One with, one without fall-back value.

Or additional methods like parse_xy_with_default()...





I don't agree. Sometimes you already know that the data is well-formed and you
don't need the error status. Therefore, emphasizing the actual data is more
important.

I don't know about you or others, but the majority of the time when I need to do these conversions I am getting my string from the command line, a file, a network socket, and/or another user input of some kind etc.. All of those require strict error checking.
RPC input, database input might have other checks, e.g. checksums for the whole message. In that case it makes no sense to check each conversion.
And you might /expect/ broken input and have a fall-back.


Even if you know the string is supposed to be parsed correctly, it doesn't hurt to throw in an assert() or debug mode error message check in there in case someone (or you yourself) made a mistake earlier and broke your invariant.
Yeah, that's fine. You could add a is_parseable(...) or so to be used in asserts().


What has come out of this are 2 distinct use cases, error checking emphasis vs results emphasis. We have 3 options:
  1. Come up with an interface that somehow satisfies both
  2. Make 2 interfaces, one for each situation
  3. Prioritize one over the other
I'm not sure how to do (1) and (2) seems like it could be confusing to have 2 interfaces that do the exact same thing with slightly different calling conventions. So that leaves us with (3), and I obviously stand firm in the safety camp.
1 and 2 are basically the same. I'd go that way...


Given use of exceptions, the philosophy I described in the paragraph seems to
apply to the Standard Library.

Maybe so, but I'd rather come up with the safest, most expressive, most easy to use, and most efficient interface possible. Regardless of past precedents.
 
Well, you have received quite a lot of feedback, try to come up with that :-)


Bjorn Reese

Jan 27, 2014, 4:23:45 AM1/27/14
to std-pr...@isocpp.org
On 01/26/2014 05:25 PM, Matt Fioravante wrote:
> string to T (int, float, etc..) conversions seem like to rather easy
> task (aside from floating point round trip issues), and yet for the life
> of C and C++ the standard library has consistently failed to provide a
> decent interface.

You need to specify more clearly what T is. For instance, you mention
integral types but does that include bool or char? Does it include
arbitrary classes? (as handled by the proposal linked below)

What about char_traits?

What about locale?

> Lets review:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1973.html

Matthew Woehlke

Jan 27, 2014, 12:27:11 PM1/27/14
to std-pr...@isocpp.org
On 2014-01-26 11:25, Matt Fioravante wrote:
> //* prefix (0x or 0X) indicating hexadecimal base (applies only when base
> is 16 or 0).

If this also works for float/double I will be a happy man :-). (I have a
project for which I intend to use base-16 to store double values on disk
(textual format) in order to avoid rounding issues.)

> //* All of the rest of the characters MUST be digits.

Do you support locale-specific digits? E.g. will you parse "二千二十五"
in a Japanese locale? What about locale-specific digit grouping and
decimal separators?

> //Returns true if an integral value was successfully parsed and stores the
> value in val,

Why not return a std::optional? I've never been a fan of (non-const)
by-ref parameters; they make it hard to impossible to store the "output"
value in a const local.

> //Sets errno to ERANGE if the string was an integer but would overflow type
> integral.

Floating point can overflow also, or do you return +/- infinity in that
case? (Maybe do both?)

> template <typename integral>
> constexpr bool strto(string_view s, integral& val, int base);

'strto' seems to be missing something (though if you return an optional,
one would have to write 'strto<type>' which would be better). You might
also consider something like 'parse_number'; I wouldn't name it like
'strto' just because there are already functions named similarly.

> First off, all of these return bool which makes it very easy to check
> whether or not parsing failed.
>
> While the interface does not allow this idiom:
>
> int x = atoi(s);

Again, if you instead returned a std::optional, I believe both of these
would be covered. Referring to later discussion, std::optional would
satisfy both the 'check the result' case and the 'I have a default that
I can silently fall back on' (via value_or) case.

The only downside is you can't store the result in a const *and* write
the call to strto inside the if() to check the result. (Though with your
proposal, you can't store the result in a const at all...)
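To make the std::optional variant concrete, here is a small sketch. The name strto<T> follows the suggestion above; the implementation is an assumption built on C++17's std::from_chars, and port_or_default() is a hypothetical caller added only to show both usage styles:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Sketch: returns the parsed value, or an empty optional if s is not
// entirely a valid, in-range integer.
template <typename T>
std::optional<T> strto(std::string_view s, int base = 10) {
    T value{};
    const char* const last = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(s.data(), last, value, base);
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}

// Hypothetical caller showing the 'check the result' style.
inline int port_or_default(std::string_view text) {
    if (const auto port = strto<int>(text)) {
        return *port;  // parse succeeded; port is a const local
    }
    return 8080;       // parse failed; handle the error here
}
```

The 'silent fallback' style collapses to a single expression, for example const int retries = strto<int>(text).value_or(3);.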

--
Matthew

Bengt Gustafsson

Jan 29, 2014, 11:18:13 AM1/29/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
Regarding default values produced when the conversion fails this is another argument for this style:

<error return type> from_string(T& dest, string_view& src)

Now the standard can specify that the function shall not touch dest if conversion fails. The default value is the previous value of the variable!

The error return type can be the error_return type I mentioned above, which actually should contain an exception_ptr:

class error_return {
public:
    error_return() : m_handled(false) {}   // Ok case: No exception
    error_return(exception_ptr ex) : m_handled(false), m_exception(ex) {}

    // Destructors are noexcept by default since C++11, so this one must opt out
    // in order to throw. Throwing during stack unwinding still calls terminate().
    ~error_return() noexcept(false) {
        if (m_handled)
            return;
        if (m_exception)
            rethrow_exception(m_exception);
        else
            throw logic_error("return value not checked");
    }

    void ignore() { m_handled = true; }   // Use to explicitly ignore errors
    void rethrow() { m_handled = true; if (m_exception) rethrow_exception(m_exception); }
    operator bool() { m_handled = true; return !m_exception; }      // for if-type check. true (good) if m_exception is null.
private:
    bool m_handled;
    exception_ptr m_exception;
};

Usage:

error_return from_string(T& dest, string_view& src) {
    if (... could convert ...)
        return error_return();
    else
        return make_exception_ptr(runtime_error("Could not convert"));
}


// No check required
from_string(dest, "123").ignore();

// Check using if
if (!from_string(dest, "123"))
    ... handle error;

// Throw on error
from_string(dest, "123").rethrow();

// Programming error!
from_string(dest, "123");    // throws on first call even if conversion can be made!

Nevin Liber

Jan 29, 2014, 11:23:59 AM1/29/14
to std-pr...@isocpp.org
On 29 January 2014 10:18, Bengt Gustafsson <bengt.gu...@beamways.com> wrote:
Regarding default values produced when the conversion fails this is another argument for this style:

<error return type> from_string(T& dest, string_view& src)

Now the standard can specify that the function shall not touch dest if conversion fails.

How exactly do you guarantee that if an exception is thrown while dest is being mutated?
-- 
 Nevin ":-)" Liber  <mailto:ne...@eviloverlord.com>  (847) 691-1404

Matthew Woehlke

Jan 29, 2014, 11:49:40 AM1/29/14
to std-pr...@isocpp.org
On 2014-01-29 11:18, Bengt Gustafsson wrote:
> Regarding default values produced when the conversion fails this is another
> argument for this style:
>
> <error return type> from_string(T& dest, string_view& src)
>
> Now the standard can specify that the function shall not touch dest if
> conversion fails. The default value is the previous value of the variable!

So... not only can I still not assign the result to a const local, now
'dest' potentially contains uninitialized memory? I don't see how that's
an improvement.

If it is really necessary to have a description of the failure type (and
errno is not suitable; personally I find nothing wrong with using
errno), then maybe a return type that is similar to std::optional with
an additional 'why it is disengaged' could be created. (Maybe even
subclass std::optional and call it e.g. std::result?)
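A rough sketch of such an "optional plus reason" type follows. The name parse_result and the use of std::errc as the reason are assumptions made for illustration (the thread only floats the idea), and from_string<T> is implemented here on top of C++17's std::from_chars:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical result type: an optional value plus the reason it is empty.
template <typename T>
struct parse_result {
    std::optional<T> value;  // engaged only on success
    std::errc error{};       // std::errc{} on success, otherwise the cause

    explicit operator bool() const { return value.has_value(); }
    T value_or(T fallback) const { return value.value_or(fallback); }
};

template <typename T>
parse_result<T> from_string(std::string_view s, int base = 10) {
    T parsed{};
    const char* const last = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(s.data(), last, parsed, base);
    if (ec != std::errc{}) return {std::nullopt, ec};
    // Trailing characters after the number are reported as an invalid argument.
    if (ptr != last) return {std::nullopt, std::errc::invalid_argument};
    return {parsed, std::errc{}};
}
```

Callers who only want a fallback write from_string<int>(s).value_or(12), while callers who care can compare the error member against std::errc::result_out_of_range or std::errc::invalid_argument.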

> // No check required
> from_string(dest, "123").ignore();

You omitted the declaration and initialization of 'dest'. IOW:

// your proposal
auto dest = int{12};
from_string(dest, "34").ignore();
foo(dest);

- vs. -

// std::optional as return type
foo(from_string<int>("34").value_or(12));

Using std::optional, I (in the above example, anyway) avoided even
having a named variable to receive the value. And if I wanted one, I
could make it const, which I couldn't do with your version.

--
Matthew

Miro Knejp

Jan 29, 2014, 2:56:14 PM1/29/14
to std-pr...@isocpp.org
I am now using the following interface in the format parser implementation:

pair<optional<T>, Iter> parse_integer<T>(Iter first, Iter last, int
radix = 10)

and the convenience overload

optional<T> parse_integer<T>(string_view s, int radix = 10);

which could, using internal tag dispatching, be reduced to

parse<T>(...)

The signatures are very easy to use and give me all I need. Both
greedily consume as many valid characters as possible (even on
over/underflow) so parsing can continue past the (in)valid input. Sure,
you don't get a detailed error report but how important is it really?
All I care about is whether the number was valid or not. If the optional
is disengaged I can *guess* what happened using the iterator overload.
If the returned iterator equals first, then there was no number to begin
with, otherwise the number format was wrong or over/underflow occurred.
In any case the returned iterator points to the first character past the
number. Not sure how important it really is to distinguish
over/underflow from an invalid pattern though. I had no use for that
information so I would be interested to hear about scenarios where it
really does matter.

This version does not skip whitespaces or parse the radix prefix, for
that I have separate

pair<int, Iter> parse_radix_prefix(Iter first, Iter last) // Consume
"0x", "0X", "0b", "0B" or "0" and return 16, 2, 8 or 0 and iterator to next
pair<optional<T>, Iter> parse_prefixed_integer<T>(Iter first, Iter last,
int radix = 10) // Accepts 0x123, 0b111, -0123, -0b111 and uses "radix"
if no prefix was found

The negative sign is only accepted if T is signed and they currently
completely ignore anything locale specific and are ASCII only. In my
opinion it's better to have separate overloads to control whether
parsing is done locale-agnostic or not and in my current use case
language neutral parsing had higher priority, but that's just a matter
of bikeshedding and overloads doing locale supported parsing to detect
culture based sign, grouping, decimal, etc. should be available, too.

Instead of using "pair" one might have a custom utility struct where the
members aren't named "first" and "second" for clarity.


Vicente J. Botet Escriba

Jan 30, 2014, 2:10:29 AM1/30/14
to std-pr...@isocpp.org
Le 29/01/14 20:56, Miro Knejp a écrit :
Am 29.01.2014 17:49, schrieb Matthew Woehlke:
On 2014-01-29 11:18, Bengt Gustafsson wrote:
Regarding default values produced when the conversion fails this is another
argument for this style:

<error return type> from_string(T& dest, string_view& src)

Now the standard can specify that the function shall not touch dest if
conversion fails. The default value is the previous value of the variable!

So... not only can I still not assign the result to a const local, now 'dest' potentially contains uninitialized memory? I don't see how that's an improvement.

If it is really necessary to have a description of the failure type (and errno is not suitable; personally I find nothing wrong with using errno), then maybe a return type that is similar to std::optional with an additional 'why it is disengaged' could be created. (Maybe even subclass std::optional and call it e.g. std::result?)

// No check required
from_string(dest, "123").ignore();

You omitted the declaration and initialization of 'dest'. IOW:

// your proposal
auto dest = int{12};
from_string(dest, "34").ignore();
foo(dest);

- vs. -

// std::optional as return type
foo(from_string<int>("34").value_or(12));

Using std::optional, I (in the above example, anyway) avoided even having a named variable to receive the value. And if I wanted one, I could make it const, which I couldn't do with your version.

I am now using the following interface in the format parser implementation:

pair<optional<T>, Iter> parse_integer<T>(Iter first, Iter last, int radix = 10)



Given a function

In Boost.Expected we have an example with something like

pair< Iter, expected<T, std::ios_base::iostate>> parse_integer<T>(Iter first, Iter last);

or 

expected< pair< Iter, T>, pair<Iter, std::ios_base::iostate>> parse_integer<T>(Iter first, Iter last);

A parse_integer_range could be implemented as
 
expected< pair< Iter, pair<T,T>>, pair<Iter, std::ios_base::iostate>> parse_integer_range<T>(Iter s, Iter e) {
    auto f = parse_integer<T>(s, e); RETURN_IF_UNEXPECTED(f);
    auto m = parse_string("..", f.first, e); RETURN_IF_UNEXPECTED(m);
    auto l = parse_integer<T>(m, e); RETURN_IF_UNEXPECTED(l);
    return make_expected(make_pair(l.first, make_pair(f.second, l.second)));
}

where

#define RETURN_IF_UNEXPECTED(f) if (! f) return f.get_exceptional();

Note that we can also see pair< Iter, expected<T, std::ios_base::iostate>> as equivalent to expected< pair< Iter, T>, pair<Iter, std::ios_base::iostate>> and so make it a monad also.

I would like to be able to write it just as

expected< pair< Iter, pair<T,T>>, pair<Iter, std::ios_base::iostate>> parse_integer_range<T>(Iter s, Iter e) {
    auto f = await parse_integer<T>(s, e);
    auto m = await parse_string("..", f.first, e);
    auto l = await parse_integer<T>(m, e);
    return make_pair(l.first, make_pair(f.second, l.second));
}

The keyword await could be subject to discussion.
The advantage here is that we write the code as if parse_integer threw an exception in case of errors.
The await operator would make parse_integer_range return early if the expression on its right has an error stored.
The returned value would have the return type of parse_integer_range and hold the stored error.

and the convenience overload

optional<T> parse_integer<T>(string_view s, int radix = 10);
How would I use this overload? Could you define parse_integer_range with it?
Or is the intent to match the whole string, in which case the name should be match_integer?


which could, using internal tag dispatching, be reduced to

parse<T>(...)

The signatures are very easy to use and give me all I need. Both greedily consume as many valid characters as possible (even on over/underflow) so parsing can continue past the (in)valid input. Sure, you don't get a detailed error report but how important is it really? All I care about is whether the number was valid or not. If the optional is disengaged I can *guess* what happened using the iterator overload. If the returned iterator equals first, then there was no number to begin with, otherwise the number format was wrong or over/underflow occurred. In any case the returned iterator points to the first character past the number. Not sure how important it really is to distinguish over/underflow from an invalid pattern though. I had no use for that information so I would be interested to hear about scenarios where it really does matter.

Maybe your application doesn't care about the detailed error, but when designing a library it is better to provide as much information as has been obtained so that the user can do whatever she needs.


Best,
Vicente

Miro Knejp

Jan 30, 2014, 4:21:25 AM1/30/14
to std-pr...@isocpp.org


Given a function

In Boost.Expected we have an example with something like

pair< Iter, expected<T, std::ios_base::iostate>> parse_integer<T>(Iter first, Iter last);
iostate is probably not very helpful since there might not be any streams involved so all you'd get is failbit or goodbit.


or 

expected< pair< Iter, T>, pair<Iter, std::ios_base::iostate>> parse_integer<T>(Iter first, Iter last);

A parse_integer_range could be implemented as
 
expected< pair< Iter, pair<T,T>>, pair<Iter, std::ios_base::iostate>> parse_integer_range<T>(Iter s, Iter e) {
    auto f = parse_integer<T>(s, e); RETURN_IF_UNEXPECTED(f);
    auto m = parse_string("..", f.first, e); RETURN_IF_UNEXPECTED(m);
    auto l = parse_integer<T>(m.first, e); RETURN_IF_UNEXPECTED(l);
    return make_expected(make_pair(l.first, make_pair(f.second, l.second)));
}

where

#define RETURN_IF_UNEXPECTED(f) if (! f) return f.get_exceptional();

Note that we can also see pair< Iter, expected<T, std::ios_base::iostate>> as equivalent to expected< pair< Iter, T>, pair<Iter, std::ios_base::iostate>> and so make it a monad also.
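
The early-return chaining above can be tried out without Boost.Expected. What follows is only a minimal sketch under stated assumptions: std::optional stands in for expected (so the error payload is lost), const char* is used as the iterator type, and parse_uint, parse_literal, RETURN_IF_EMPTY and parse_uint_range are hypothetical names made up for illustration.

```cpp
#include <optional>
#include <string_view>
#include <utility>

using Iter = const char*;

// Hypothetical stand-in for RETURN_IF_UNEXPECTED: bail out on a disengaged optional.
#define RETURN_IF_EMPTY(x) if (!(x)) return std::nullopt

// Parses one or more decimal digits. No sign handling, no overflow check (sketch only).
std::optional<std::pair<Iter, unsigned>> parse_uint(Iter first, Iter last) {
    if (first == last || *first < '0' || *first > '9') return std::nullopt;
    unsigned value = 0;
    for (; first != last && *first >= '0' && *first <= '9'; ++first)
        value = value * 10 + static_cast<unsigned>(*first - '0');
    return std::make_pair(first, value);
}

// Succeeds only if [first, last) starts with lit; returns the position after it.
std::optional<Iter> parse_literal(std::string_view lit, Iter first, Iter last) {
    for (char c : lit) {
        if (first == last || *first != c) return std::nullopt;
        ++first;
    }
    return first;
}

// Parses "<uint>..<uint>", e.g. "3..17", returning the end position and both values.
std::optional<std::pair<Iter, std::pair<unsigned, unsigned>>>
parse_uint_range(Iter s, Iter e) {
    auto f = parse_uint(s, e);                  RETURN_IF_EMPTY(f);
    auto m = parse_literal("..", f->first, e);  RETURN_IF_EMPTY(m);
    auto l = parse_uint(*m, e);                 RETURN_IF_EMPTY(l);
    return std::make_pair(l->first, std::make_pair(f->second, l->second));
}
```

Each step receives the iterator produced by the previous one, and the first failing step ends the whole function, which is the behaviour the macro version is after.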

I would like to be able to write it just as

expected< pair< Iter, pair<T,T>>, pair<Iter, std::ios_base::iostate>> parse_integer_range<T>(Iter s, Iter e) {
    auto f = await parse_integer<T>(s, e);
    auto m = await parse_string("..", f.first, e);
    auto l = await parse_integer<T>(m.first, e);
    return make_pair(l.first, make_pair(f.second, l.second));
}

The keyword await could be subject to discussion.
The advantage here is that we are writing the code as if the parse_integer functions threw an exception in case of errors.
The await operator would make parse_integer_range return if the expression on its right has an error stored.
The returned value would have the return type of parse_integer_range and carry the stored error.
and the convenience overload

optional<T> parse_integer<T>(string_view s, int radix = 10);
How would one use this overload? Could you define a parse_integer_range with it?
No. That's what the iterator overloads are for. Also, what makes parse_integer_range so special?

Or is the intent to match the whole string and so the name should be match_integer?
It's called a *convenience* overload for a reason. The idea was to provide a simple (novice friendly) interface for the very basic and introductory/trivial use cases. All it does is call parse_xxx(begin(s), end(s), ...) and discard the returned iterator. Alternatively one might return a new string_view with the remainder but that doesn't really add to its simplicity. If the intention was to match the string exactly I would have called it match_xxx. Which of course shouldn't mean there's no use for a match_xxx-like interface but its implementation is trivial once you have parse_xxx.


which could, using internal tag dispatching, be reduced to

parse<T>(...)

The signatures are very easy to use and give me all I need. Both greedily consume as many valid characters as possible (even on over/underflow) so parsing can continue past the (in)valid input. Sure, you don't get a detailed error report but how important is it really? All I care about is whether the number was valid or not. If the optional is disengaged I can *guess* what happened using the iterator overload. If the returned iterator equals first, then there was no number to begin with, otherwise the number format was wrong or over/underflow occurred. In any case the returned iterator points to the first character past the number. Not sure how important it really is to distinguish over/underflow from an invalid pattern though. I had no use for that information so I would be interested to hear about scenarios where it really does matter.
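
To make the greedy behaviour concrete, here is one way the iterator overload could look. This is a sketch, not the implementation from the format library: digit_value is a hypothetical helper, the radix is assumed to be in 2..36, and the letter handling assumes an ASCII-like character set.

```cpp
#include <limits>
#include <optional>
#include <type_traits>
#include <utility>

// Digit value of c in the given radix, or -1 if c is not a valid digit.
inline int digit_value(char c, int radix) {
    int d = -1;
    if (c >= '0' && c <= '9') d = c - '0';
    else if (c >= 'a' && c <= 'z') d = c - 'a' + 10;
    else if (c >= 'A' && c <= 'Z') d = c - 'A' + 10;
    return d < radix ? d : -1;
}

template <class T, class Iter>
std::pair<std::optional<T>, Iter> parse_integer(Iter first, Iter last, int radix = 10) {
    static_assert(std::is_integral<T>::value && !std::is_same<T, bool>::value,
                  "T must be a non-bool integer type");
    Iter it = first;
    bool negative = false;
    // '-' is only accepted for signed types.
    if (it != last && (*it == '+' || (*it == '-' && std::is_signed<T>::value))) {
        negative = (*it == '-');
        ++it;
    }
    if (it == last || digit_value(*it, radix) < 0)
        return {std::nullopt, first};            // no number at all: iterator == first

    using U = std::make_unsigned_t<T>;
    // Largest magnitude that fits: max() for positives, max() + 1 for negatives.
    const U limit = negative ? U(std::numeric_limits<T>::max()) + 1u
                             : U(std::numeric_limits<T>::max());
    U magnitude = 0;
    bool overflow = false;
    for (; it != last; ++it) {                    // keep consuming digits even after overflow
        const int d = digit_value(*it, radix);
        if (d < 0) break;
        if (!overflow && magnitude > (limit - U(d)) / U(radix)) overflow = true;
        if (!overflow) magnitude = magnitude * U(radix) + U(d);
    }
    if (overflow) return {std::nullopt, it};      // disengaged, but iterator moved past the digits
    return {negative ? T(U(0) - magnitude) : T(magnitude), it};
}
```

With this shape, a disengaged optional together with an iterator equal to first means "no number here", while a disengaged optional with an advanced iterator means the digits were there but did not fit into T.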

Maybe your application doesn't care about the detailed error, but when designing a library it is better to provide as much information as has been obtained so that the user can do whatever she needs.
I didn't mean to imply nobody had a use for it. If the error type used in expected<> is not a convoluted object like an exception (which usually requires allocation of an error message) I'm all for adding it. Maybe errc with values like errc::value_too_large or errc::invalid_argument. We have this enum now so why not make use of it. As long as there are no interactions/side effects with errno. Whatever the return type is, I think it is beneficial for all if the syntax

x = parse_something(...)
if(x) // or x.first depending on the overload used
    ...

is well formed and intuitive.

Bengt Gustafsson

Jan 30, 2014, 7:49:58 AM
to std-pr...@isocpp.org
Nevin: I was only thinking of simple types when I contemplated the ignore() paradigm. My bad, we should strive for making this useful even for cases when copying a T after making sure the actual parsing went ok can in itself throw. Thus, it seems better with an API that returns something that contains T or an error code. optional<T> is similar to what we need but only contains a bool, not the code. The value_or method of optional<T> seems like a good functionality to say "I'm handling any possible error by using this default value", which is one important use case.

The Boost::expected template suggested by Miro comes close to what I want, but does not make sure that the error code was checked in its dtor, which I think was the main feature of my proposed error_return class. I don't know whether there is a proposal for a std::expected but if so I suggest that it should do the destructor error check that error_return does.

Nevin Liber

Jan 30, 2014, 11:44:54 AM
to std-pr...@isocpp.org
On 30 January 2014 06:49, Bengt Gustafsson <bengt.gu...@beamways.com> wrote:


The Boost::expected template suggested by Miro comes close to what I want, but does not make sure that the error code was checked in its dtor, which I think was the main feature of my proposed error_return class.

You do know that it is extremely unlikely for a class which throws from its destructor to be approved by the committee anytime soon, right?  First you would need to address the issues brought up in the article and discussion at <http://cpp-next.com/archive/2012/08/evil-or-just-misunderstood/>.

Besides, it seems to me it ought to assert, not throw, as I can't imagine any circumstance where this isn't a programming bug.

Bengt Gustafsson

Jan 31, 2014, 6:00:38 PM
to std-pr...@isocpp.org
You are right, it should be an assert of course. In the return code football thread I just wrote about an outlandish feature that would make this a static_assert even! This is a check to make sure that the conversion error code is being checked, so it is really a static feature of the code at the call site.

I won't repeat myself here. I would rather defer discussions on the error return method to that thread:

Olaf van der Spek

Feb 3, 2014, 6:13:23 AM
to std-pr...@isocpp.org
On Sunday, January 26, 2014 5:25:02 PM UTC+1, Matthew Fioravante wrote:
Here is a first attempt for an integer parsing routine.

//Attempts to parse s as an integer. The valid integer string consists of the following:
//* '+' or '-' sign as the first character (- only acceptable for signed integral types)
//* prefix (0) indicating octal base (applies only when base is 0 or 8)

I'd prefer 07 to be parsed as 7. Most non-dev people probably expect this as well.
Is octal still being used?
 
Similarly we can define this for floating point types. We may also want null terminated const char* versions as converting a const char* to string_view requires a call to strlen().

What's the problem with strlen()? 

Olaf van der Spek

Feb 3, 2014, 6:18:44 AM
to std-pr...@isocpp.org
On Wednesday, January 29, 2014 8:56:14 PM UTC+1, Miro Knejp wrote:
I am now using the following interface in the format parser implementation:

pair<optional<T>, Iter> parse_integer<T>(Iter first, Iter last, int
radix = 10)

and the convenience overload

optional<T> parse_integer<T>(string_view s, int radix = 10);

which could, using internal tag dispatching, be reduced to

parse<T>(...)

The signatures are very easy to use and give me all I need. Both

 Do you perhaps have a link to a project using this interface?

Matthew Woehlke

Feb 3, 2014, 11:20:34 AM
to std-pr...@isocpp.org
On 2014-02-03 06:13, Olaf van der Spek wrote:
> On Sunday, January 26, 2014 5:25:02 PM UTC+1, Matthew Fioravante wrote:
>> Here is a first attempt for an integer parsing routine.
>>
>> //Attempts to parse s as an integer. The valid integer string consists of
>> the following:
>> //* '+' or '-' sign as the first character (- only acceptable for signed
>> integral types)
>> //* prefix (0) indicating octal base (applies only when base is 0 or 8)
>
> I'd prefer 07 to be parsed as 7. Most non-dev people probably expect this
> as well.
> Is octal still being used?

You can do this by passing base = 10. '0' as a prefix only means base 8
when passing base = 0 (i.e. detect from prefix).

Or rather, I would hope/expect the above is true. Reading closer it
isn't obvious if leading 0's are permitted when the base is explicitly
specified. Probably they should be.

>> Similarly we can define this for floating point types. We may also want
>> null terminated const char* versions as converting a const char* to
>> string_view requires a call to strlen().
>
> What's the problem with strlen()?

It requires additional execution cycles that don't provide any real
benefit. And yes, that *does* matter; there are definitely cases where
string to number conversion is a performance bottleneck, e.g. when
reading large files of data in textual format. (I say this from actual
real-world personal experience.)

--
Matthew

Olaf van der Spek

Feb 3, 2014, 11:25:46 AM
to std-pr...@isocpp.org
On Mon, Feb 3, 2014 at 5:20 PM, Matthew Woehlke
<mw_t...@users.sourceforge.net> wrote:
>> I'd prefer 07 to be parsed as 7. Most non-dev people probably expect this
>> as well.
>> Is octal still being used?
>
>
> You can do this by passing base = 10. '0' as a prefix only means base 8 when
> passing base = 0 (i.e. detect from prefix).

What if I want dec and hex but no octal? ;)

>> What's the problem with strlen()?
>
>
> It requires additional execution cycles that don't provide any real benefit.
> And yes, that *does* matter; there are definitely cases where string to
> number conversion is a performance bottleneck, e.g. when reading large files
> of data in textual format. (I say this from actual real-world personal
> experience.)

Numbers in text files are not nul-terminated.

Matthew Woehlke

Feb 3, 2014, 12:36:45 PM
to std-pr...@isocpp.org
On 2014-02-03 11:25, Olaf van der Spek wrote:
> On Mon, Feb 3, 2014 at 5:20 PM, Matthew Woehlke
> <mw_t...@users.sourceforge.net> wrote:
>>> I'd prefer 07 to be parsed as 7. Most non-dev people probably expect this
>>> as well.
>>> Is octal still being used?
>>
>>
>> You can do this by passing base = 10. '0' as a prefix only means base 8 when
>> passing base = 0 (i.e. detect from prefix).
>
> What if I want dec and hex but no octal? ;)

That's a fair question :-). (But so is if we should throw out the 0
prefix as indicating octal.)

I suppose you could test if the string starts with '0x' and call with
either in,base=10 or in+2,base=16. Not saying that's ideal, though.
(Even if I suspect that performance-wise it would be similar to base=0.)
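
For what it's worth, the "check for 0x yourself" workaround might look like the sketch below. It is only an illustration under assumptions: parse_dec_or_hex is a made-up name, std::from_chars (C++17) stands in for the parsing routine under discussion, signs on hex input are not handled, and the whole string is required to be consumed.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Accepts decimal ("07" -> 7, never octal) and 0x/0X-prefixed hex ("0x1F" -> 31).
inline std::optional<long> parse_dec_or_hex(std::string_view s) {
    int base = 10;
    if (s.size() >= 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        s.remove_prefix(2);   // std::from_chars does not understand the prefix itself
        base = 16;
    }
    long value = 0;
    const char* last = s.data() + s.size();
    const auto res = std::from_chars(s.data(), last, value, base);
    if (res.ec != std::errc{} || res.ptr != last) return std::nullopt;
    return value;
}
```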

>>> What's the problem with strlen()?
>>
>> It requires additional execution cycles that don't provide any real benefit.
>> And yes, that *does* matter; there are definitely cases where string to
>> number conversion is a performance bottleneck, e.g. when reading large files
>> of data in textual format. (I say this from actual real-world personal
>> experience.)
>
> Numbers in text files are not nul-terminated.

They are if I'm using a CSV or XML parsing library that yields
NUL-terminated char*. (And I seem to recall that such do exist, i.e.
they take a char* buffer and substitute NUL at the end of "values".)

"What's the problem with strlen()" is that it has potential performance
implications given char const* input data. I'm not convinced that trying
to determine if there is any reasonable instance where one has char
const* data in a situation that is performance sensitive qualifies as
reason to disregard that.

--
Matthew

Jeffrey Yasskin

Feb 3, 2014, 12:46:27 PM
to std-pr...@isocpp.org
FWIW, your CSV or XML parsing library should be changed to return a
string_view or equivalent. It has the size, and is throwing it out,
forcing your number parser to do redundant checks for '\0', which
slows you down.

Yes, I know it will take time to propagate the new interface through
system libraries.

> "What's the problem with strlen()" is that it has potential performance
> implications given char const* input data. I'm not convinced that trying to
> determine if there is any reasonable instance where one has char const* data
> in a situation that is performance sensitive qualifies as reason to
> disregard that.

Thiago Macieira

Feb 3, 2014, 12:58:09 PM
to std-pr...@isocpp.org
Em seg 03 fev 2014, às 12:36:45, Matthew Woehlke escreveu:
> > Numbers in text files are not nul-terminated.
>
> They are if I'm using a CSV or XML parsing library that yields
> NUL-terminated char*. (And I seem to recall that such do exist, i.e.
> they take a char* buffer and substitute NUL at the end of "values".)

No, they're not. None of my CSV and XML files on disk have NULs.

If you're getting a NUL, it means your library actually did malloc() to
allocate memory just so it could set a \0 there, which totally offsets the cost
of strlen. If your library is doing that, then strlen() performance is not the
issue.

Matthew Woehlke

Feb 3, 2014, 1:41:27 PM
to std-pr...@isocpp.org
On 2014-02-03 12:46, Jeffrey Yasskin wrote:
> On Mon, Feb 3, 2014 at 9:36 AM, Matthew Woehlke wrote:
>> On 2014-02-03 11:25, Olaf van der Spek wrote:
>>> Numbers in text files are not nul-terminated.
>>
>> They are if I'm using a CSV or XML parsing library that yields
>> NUL-terminated char*. (And I seem to recall that such do exist, i.e. they
>> take a char* buffer and substitute NUL at the end of "values".)
>
> FWIW, your CSV or XML parsing library should be changed to return a
> string_view or equivalent.

What if it's a C library? :-)

> It has the size, and is throwing it out,
> forcing your number parser to do redundant checks for '\0', which
> slows you down.

I'm not sure it's redundant... if I pass a char*, then the
implementation must check for NUL but does not need to do any sort of
index check. OTOH if I pass a string_view, presumably it is going to
stop parsing if it finds a NUL anyway, same as it would stop for e.g.
'!', random-control-character, etc.

So actually, it may be that string_view implementation does everything
that the char* implementation does *plus* an index check. (Which, in
fact, means that the char* implementation may be preferred if you know
that your string is NUL-terminated, even if you *have* a string_view...)
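
To illustrate the two loop shapes being compared, here is a deliberately tiny sketch (digits only, no sign, no overflow detection; both function names are made up). Whether the extra comparison is measurable is something only a benchmark can tell.

```cpp
#include <cstddef>
#include <string_view>

// NUL-terminated input: the digit test doubles as the end-of-input test,
// because '\0' is not a digit.
inline unsigned parse_digits_cstr(const char* s) {
    unsigned value = 0;
    for (; *s >= '0' && *s <= '9'; ++s)
        value = value * 10 + static_cast<unsigned>(*s - '0');
    return value;
}

// Sized input: every step needs the bounds check in addition to the digit test.
inline unsigned parse_digits_view(std::string_view s) {
    unsigned value = 0;
    for (std::size_t i = 0; i < s.size() && s[i] >= '0' && s[i] <= '9'; ++i)
        value = value * 10 + static_cast<unsigned>(s[i] - '0');
    return value;
}
```

The sized version can stop in the middle of a longer buffer, which the NUL-terminated version cannot do without modifying the buffer.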

--
Matthew

Matthew Woehlke

Feb 3, 2014, 1:44:23 PM
to std-pr...@isocpp.org
On 2014-02-03 12:58, Thiago Macieira wrote:
> Em seg 03 fev 2014, às 12:36:45, Matthew Woehlke escreveu:
>>> Numbers in text files are not nul-terminated.
>>
>> They are if I'm using a CSV or XML parsing library that yields
>> NUL-terminated char*. (And I seem to recall that such do exist, i.e.
>> they take a char* buffer and substitute NUL at the end of "values".)
>
> No, they're not. None of my CSV and XML files on disk have NULs.
>
> If you're getting a NUL, it means your library actually did malloc() to
> allocate memory just so it could set a \0 there, which totally offsets the cost
> of strlen. If your library is doing that, then strlen() performance is not the
> issue.

I'm talking about libraries that require a mutable input buffer¹ and
replace ends-of-"values" in that buffer with NUL. (I can't think what it
was, offhand, but pretty sure I came across a library that did exactly
this.)

(¹ or is doing the file I/O itself and so has a mutable input buffer.)

--
Matthew

Thiago Macieira

Feb 3, 2014, 2:45:13 PM
to std-pr...@isocpp.org
Em seg 03 fev 2014, às 13:44:23, Matthew Woehlke escreveu:
> > If you're getting a NUL, it means your library actually did malloc() to
> > allocate memory just so it could set a \0 there, which totally offsets the
> > cost of strlen. If your library is doing that, then strlen() performance
> > is not the issue.
>
> I'm talking about libraries that require a mutable input buffer¹ and
> replace ends-of-"values" in that buffer with NUL. (I can't think what it
> was, offhand, but pretty sure I came across a library that did exactly
> this.)
>
> (¹ or is doing the file I/O itself and so has a mutable input buffer.)

Those are rare. Anyway, that's just the cost of strlen(), unless you can
modify the library to return the pointer to where it set the NUL. I imagine
that is a useful feature anyway, since you may want to continue parsing from
where you stopped.

Bengt Gustafsson

Feb 3, 2014, 3:47:41 PM
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
I know of an XML parser that sets nul in the buffer to save time (apart from the one I made 15 years ago). Forgot the name now.

I think we should try to find a good solution to the problem of using a Range concept (same prerequisites as for range-based for) as the input while preserving the information about how many characters were consumed.

Creating a range type which stops at nul is simple, for instance let end() return nullptr and begin() return the input pointer wrapped in an object whose operator++ sets the wrapped pointer to nullptr when a nul is encountered. This of course only works as a forward iterator, but that's ok for this case.
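
A rough sketch of such a range, assuming nothing beyond what the paragraph above describes (cstr_iterator, cstr_range and count_leading_digits are hypothetical names; end() wraps nullptr in the iterator type so that begin() and end() have the same type, as range-based for required before C++17):

```cpp
#include <cstddef>

class cstr_iterator {
public:
    // A null pointer or an empty string immediately compares equal to end().
    explicit cstr_iterator(const char* p) : p_(p != nullptr && *p != '\0' ? p : nullptr) {}
    char operator*() const { return *p_; }
    cstr_iterator& operator++() {
        ++p_;
        if (*p_ == '\0') p_ = nullptr;   // reached the terminator: become equal to end()
        return *this;
    }
    friend bool operator==(const cstr_iterator& a, const cstr_iterator& b) { return a.p_ == b.p_; }
    friend bool operator!=(const cstr_iterator& a, const cstr_iterator& b) { return a.p_ != b.p_; }
private:
    const char* p_;
};

class cstr_range {
public:
    explicit cstr_range(const char* s) : s_(s) {}
    cstr_iterator begin() const { return cstr_iterator(s_); }
    cstr_iterator end() const { return cstr_iterator(nullptr); }
private:
    const char* s_;
};

// Example consumer: counts leading decimal digits without calling strlen().
inline std::size_t count_leading_digits(const char* s) {
    std::size_t n = 0;
    for (char c : cstr_range(s)) {
        if (c < '0' || c > '9') break;
        ++n;
    }
    return n;
}
```

One thing this sketch makes visible: once the iterator has turned into nullptr, the position of the terminator is gone, so a design that needs to report how many characters were consumed would have to keep that information separately.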

Here's an idea of how to design the from_string() without using a ref parameter as the result:

The return type has three members:

a) the value
b) a RANGE representing the rest of the input
c) an exception_ptr or similar error information

The idea is if you just do:

auto v = from_string<int>("123");

You get an exception both if exception_ptr is non-null and if the returned range is non-empty. To avoid the dreaded exception in destructor these checks must be done in the return object's operator T.

This however needs operator bool() to be called before operator T, and (to handle the remaining source character case) also retrieve the remaining range to show that you handle it. This does not provide for any neat call sites.

No, I don't see how this can be done in a reasonably neat way without using a T& dest in the parameter list. As for the destructor exception, the return value can now be just a wrapper around an exception_ptr; it should be possible to design it so that its dtor can throw the encapsulated exception_ptr without creating problems, as there are no other members.

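class exception_enforcer {
public:
     // Destructors are implicitly noexcept since C++11, so throwing needs noexcept(false).
     ~exception_enforcer() noexcept(false) {
            if (ex_ptr)
                  rethrow_exception(ex_ptr);    // I think this will destroy the ex_ptr while unwinding, while having taken a copy of the pointee
     }
     // True if an error is stored; calling this marks the error as checked.
     operator bool() { bool ret = (ex_ptr != nullptr); ex_ptr = nullptr; return ret; }
private:
     exception_ptr ex_ptr;
};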

template<typename T, typename RANGE> exception_enforcer parse_string(T& dest, RANGE& src);  // Note non-const ref to range. from_string instead has a const ref and requires all of src to be eaten.

Miro Knejp

Feb 3, 2014, 3:48:45 PM
to std-pr...@isocpp.org
The format implementation can be found here: https://github.com/mknejp/std-format
The definitions of the parse_xxx methods (of which there currently are only integer versions) are in include/std-format/detail/parse_tools.hpp and they are used in include/std-format/detail/format_parser.hpp.

Not sure how serious an example that is as for processing the actual format string I only need to parse integers at two locations in format_parser.hpp and there is no error reporting (only success/failure). However, at some point I also need to start parsing the format options for all the builtin and std types and there it will become clearer how useful the interface really is. Last time I had to write number parsing myself was some 10 years ago so please don't mind if the implementation of the parse methods isn't perfect.

Considering the debate about octal numbers and prefixes I went along and split the methods up depending on use case, so I have:

parse_integer(...) <- accepts [+-]?[0-9a-zA-Z]+

parse_radix_prefix(...) <- accepts (0[xXbB]?)? returning the radix 2, 8, 16 or 0 if the pattern doesn't apply

parse_prefixed_integer(...) <- accepts [+-]?(0[xXbB]?)?[0-9]+ and if the prefix is not recognized uses the radix passed as argument

The actual range of valid characters in [0-9a-zA-Z] depends on the radix and the minus sign is only accepted for signed integer types. None of them skip any whitespace characters on any end of the string. They consume all valid characters even if overflow occurs. Feel free to replace any character with culture specific signs and digits when locales apply.
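
As an illustration of the prefix step only, a function matching the (0[xXbB]?)? pattern could be sketched as follows. The signature is an assumption: the post only says the function returns the radix, so returning the iterator alongside it is a guess made here for the example.

```cpp
#include <utility>

// Returns {radix, position after the prefix}. The radix is 16 for "0x"/"0X",
// 2 for "0b"/"0B", 8 for a bare leading "0", and 0 (nothing consumed)
// if the input does not start with '0'.
template <class Iter>
std::pair<int, Iter> parse_radix_prefix(Iter first, Iter last) {
    if (first == last || *first != '0') return {0, first};
    Iter next = first;
    ++next;                                   // past the leading '0'
    if (next != last) {
        if (*next == 'x' || *next == 'X') return {16, ++next};
        if (*next == 'b' || *next == 'B') return {2, ++next};
    }
    return {8, next};
}
```

Note that with the pattern as written a lone "0" is classified as an octal prefix, so a caller combining this with parse_integer would presumably have to treat "octal prefix followed by no digits" as the value zero.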

Then I was thinking some more about the return/error dilemma. Inspired by the mentioning of match_integer three use case scenarios come to mind:
  1. Parsing a longer text. At some point you determine that at position i should be a number. This is the case where you probably need the most information: success/failure, error description if it failed and in both cases an iterator to the next position so you can continue processing the remaining source. I guess this is what I tried to cover with my parse_xxx interface.
  2. You have a string and it has to contain a number. In this case the source has to match exactly with zero tolerance. This would be the case for a match_xxx interface where invalid characters at the end cause a failed conversion.
  3. You have a string of some description and expect it to begin with a number. Invalid characters at the end of the source range do not cause an error. This might involve skipping of whitespaces before the number. I see this use case occurring especially in exercises and introductory courses when dealing with basic user input for the first time. I don't really have a fitting name for such an interface. from_string maybe? No idea.

I see number 1 as the most fundamental. The interface is based on iterators and can thus work with almost any type of input. 2 and 3 can be implemented in terms of 1 and a string_view overload would probably be used more often there. As soon as locales are involved all three should automatically recognize the correct grouping, separating and decimal characters.
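
A sketch of how use cases 2 and 3 could be layered on 1. Assumptions: std::from_chars (C++17) stands in for the fundamental parse routine, so '+' signs and locales are not handled; leading_integer is a placeholder name for the third interface, which has no agreed name above; and only spaces and tabs are skipped.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>
#include <utility>

// Use case 1 (stand-in): value plus the position where parsing stopped.
template <class T>
std::pair<std::optional<T>, const char*> parse_integer(const char* first, const char* last) {
    T value{};
    const auto res = std::from_chars(first, last, value);
    if (res.ec != std::errc{}) return {std::nullopt, res.ptr};
    return {value, res.ptr};
}

// Use case 2: the whole string must be the number, zero tolerance.
template <class T>
std::optional<T> match_integer(std::string_view s) {
    const char* last = s.data() + s.size();
    const auto r = parse_integer<T>(s.data(), last);
    if (r.second != last) return std::nullopt;    // trailing characters: reject
    return r.first;
}

// Use case 3: skip leading blanks, ignore whatever follows the number.
template <class T>
std::optional<T> leading_integer(std::string_view s) {
    while (!s.empty() && (s.front() == ' ' || s.front() == '\t')) s.remove_prefix(1);
    return parse_integer<T>(s.data(), s.data() + s.size()).first;
}
```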

Matthew Woehlke

Feb 3, 2014, 4:23:51 PM
to std-pr...@isocpp.org
On 2014-02-03 15:48, Miro Knejp wrote:
> Then I was thinking some more about the return/error dilemma. Inspired
> by the mentioning of match_integer three use case scenarios come to mind:

FWIW... when considering Thiago's (OT) question re: Qt's API, I actually
ended up suggesting that the end iterator be a method/member of the
return type. IOW the return is "essentially" a tuple of value, status,
end_iter, but with convenience methods to access those rather than being
literally a std::tuple.

If you care about neither the status nor end_iter, you can use this like:

use(parse<type>(in).value());
-or-
use(parse<type>(in).value_or(default));

(Optional: operator T to implicitly obtain value... but conflicts with
operator bool to implicitly obtain success/failure.)

If you do care about the status and/or end_iter, I don't see how you
would possibly avoid at least one local variable declaration (unless
passing to a function that takes the result type, in which case the
result type already must contain everything). Returning everything in
the result type allows that local to be 'auto const' and avoids the
various pitfalls of a by-reference output parameter.

If using operator bool to check status, you can also store the result
inside an if() (assuming you only need the value and/or end_iter if
parsing succeeded).

After thinking about it a lot, I don't see any way for the API to be
simpler than that. Any other variation requires similar or additional
declarations. (The only minor downside I see is that the conversion type
must be provided as an explicit template parameter¹ rather than inferred
from a previous declaration. However I don't see this as being such a
bad thing, as it makes it more clear what is the expected output type.
As was mentioned elsewhere, code is written once and read many times.)

(¹ Which also means that the function must be a template rather than
overloaded, though I don't think that is an issue?)
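
A minimal sketch of a result type along these lines, with assumptions spelled out: parse_result and its members are invented for illustration, std::errc is used as the status, std::from_chars (C++17) does the actual parsing, and the input is limited to integers in a string_view.

```cpp
#include <charconv>
#include <stdexcept>
#include <string_view>
#include <system_error>

template <class T>
class parse_result {
public:
    parse_result(T value, std::errc status, const char* end)
        : value_(value), status_(status), end_(end) {}

    explicit operator bool() const { return status_ == std::errc{}; }
    std::errc status() const { return status_; }
    const char* end() const { return end_; }       // first unconsumed character

    T value() const {                               // throws if parsing failed
        if (!*this) throw std::invalid_argument("parse failed");
        return value_;
    }
    T value_or(T fallback) const { return *this ? value_ : fallback; }

private:
    T value_;
    std::errc status_;
    const char* end_;
};

template <class T>
parse_result<T> parse(std::string_view in) {
    T value{};
    const auto res = std::from_chars(in.data(), in.data() + in.size(), value);
    return parse_result<T>(value, res.ec, res.ptr);
}
```

Because operator bool is explicit, the result can still be declared inside a condition, e.g. `if (auto const r = parse<int>(in)) use(r.value());`, while the value itself is never converted implicitly.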

--
Matthew

Matthew Woehlke

Feb 3, 2014, 5:01:05 PM
to std-pr...@isocpp.org
On 2014-02-03 15:47, Bengt Gustafsson wrote:
> I think we should try to find a good solution to the problem of using a
> Range-concept for input (same prerequisites as to be able to use range
> based for) as the input and preserve
> the information about how many characters were consumed.
>
> Creating a range type which stops at nul is simple, for instance let end()
> return nullptr and begin() return the input pointer wrapped in an object
> whose operator++ sets the wrapped pointer to nullptr when a nul is
> encountered. This of course only works as a forward iterator, but that's ok
> for this case.

Yes, I believe that would work. Next question: can you trivially create
a string_view from this? If yes, then the problem is solved :-).

(If no, maybe taking (only) string_view is not the best API? I wonder
now if there should be a string_forward_view? Or would it be sufficient
to have an iterator type whose operator== returns true for a magic end
iterator when the iterator points to NUL?)

> Here's an idea of how to design the from_string() without using a ref
> parameter as the result:
>
> The return type has three members:
>
> a) the value
> b) a RANGE representing the rest of the input
> c) an exception_ptr or similar error information

That's exactly how I would do it; see also my reply to Miro.

> This however needs operator bool() to be called before operator T, and (to
> handle the remaining source character case) also retrieve the remaining
> range to show that you handle it. This does not provide for any neat call
> sites.

Or don't provide implicit conversion operators for both the value and
status, at least in case of the conversion type == bool.

> No, I don't see how this can be done in a reasonably neat way without using
> a T& dest in the parameter list.

Why is it so horrible to write either '.okay()' or '.value()'?

If you do care about more than exactly one of the three possible output
information parts, I don't see any way to avoid having at least one
local variable. So what is wrong with:

auto const result = from_string<type>(in);
if (result)
{
use(result.value());
// optional: do something with result.last_consumed()
}
else
{
// handle error
}

...?

The above uses exactly one local variable which can be declared 'const'
and provides both the status and information on consumed characters (and
the value, of course). For type != bool, I believe that '.value()' can
even still be implicit. And it also allows direct
hand-off to a function that uses the status and/or iterator information
(which is more awkward with out params and, in case of implicit value
conversion, precludes switching between the full result and just the
value solely by changing the signature of the use function).

Compare the above to:

type out; // not const :-(
if (auto result = from_string(in, out))
{
// handle error
}
else
{
use(out);
}

...which is more logic (and more characters, if you don't count the
'const' in the above which is missing here, even with '.value()' above),
and requires that operator bool return the opposite of the expected
result. Besides that the 'handle the exception first' logic flow feels
unnatural to me :-).

(Or requires a language enhancement to allow 'if (!(auto result =
...))', which of course adds even more logic and characters...)

Plus, if use() changes to want also 'result', the above must be
refactored because in its current form, 'result' is unavailable at the
call site.


It's even worse if you want to know about consumed characters:

unknown_t last_consumed; // um... what's the type of this?
type out;
if (auto result = from_string(in, out, last_consumed))
{
// handle error
}
else
{
use(out);
// do something with last_consumed
}

Now I have an entire additional local variable, which not only can't be
'const', but has a non-trivial type declaration that can't trivially be
written as "auto".


And of course, there's the case that we don't care about the status:

type out = default;
from_string(in, out).ignore();
use(out);

- versus -

use(from_string<type>(in).value_or(default));

--
Matthew

gmis...@gmail.com

Feb 3, 2014, 5:35:26 PM
to std-pr...@isocpp.org
Hi everyone

Thanks for the interesting read. My ten cents on this subject:

It seems parsing/conversion means looking for numbers/values which sometimes might not be present, and even if they are, they may be expectedly or unexpectedly followed by other stuff; and any value that is present may still be outside of the expected range. Theoretically, something so unexpected might even happen that we can't parse enough to be sure whether a value is present or correct at all.

Input from a file, command line or configuration file is often like that:

"100"      - a value
"100;"     - a value followed by something/anything here it's a ;
"hello"    - no value, but something e.g. the "h" from "hello"
"257"      - a failed value (for a given type) e.g. this is too big for an 8 bit unsigned char
"-1;2"     - failed value (say for unsigned int) followed by something, in this case the ";" from ";2"
""         - nothing  - eof / empty range passed in.

Considering all of this, it suggests the set of all possibilities might be this (represented here as an enum):

enum parse_status // Ultimate outcome of converting a string to a type.
{
    got_value,                      // success, a value, nothing more
    got_something,                  // something, but not a value, probable failure
    got_value_and_something,        // got a value and something else. Success or failure likely determined by caller.
    got_failed_value,               // got an unusable value out of range or whatever.
    got_failed_value_and_something, // got an unusable value and something else too.
    got_nothing,                    // nothing, empty input range etc.
    got_error                       // worse than anything above. Invalid argument etc.
    // anything else?
};

When parsing fails to get a value, the reason is known and it's helpful to be able to report something detailed.
e.g. number too big, too small, not a number, nothing, not integral

Even if the caller's code is wrong and they've passed an invalid argument to the parse routine etc.

Whether parsing failed or succeeded, I often want to know how far parsing got, so I can continue.

Putting all of this together too, leads me to think a structure like this is needed to report things:

struct conversion_result
{
    parse_status status;        // got nothing, a value/bad value and/or something else or some other error.
    std::error_code   code;     // What exactly went wrong: e.g. like ERANGE / EINVAL/ invalid arg. etc.
    InputIterator     next;     // Points to something if indicated else end
};
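
To see how the status values would fall out of an actual parse, here is a small sketch. It departs from the declarations above in ways that should be read as assumptions: std::from_chars (C++17) does the parsing, std::errc replaces std::error_code, an offset replaces the iterator, got_error is left out because this sketch never produces it, and a leading '-' on an unsigned type is reported as got_something rather than as a failed value (so "-1;2" is classified differently from the list above).

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

enum class parse_status {
    got_value, got_something, got_value_and_something,
    got_failed_value, got_failed_value_and_something, got_nothing
};

struct conversion_result {
    parse_status status;
    std::errc    code;   // std::errc{} on success
    std::size_t  next;   // offset of the first unconsumed character
};

template <class T>
conversion_result parse(T& value, std::string_view in) noexcept {
    const char* first = in.data();
    const char* last = first + in.size();
    if (first == last)
        return {parse_status::got_nothing, std::errc::invalid_argument, 0};

    const auto res = std::from_chars(first, last, value);  // value untouched on failure
    const std::size_t next = static_cast<std::size_t>(res.ptr - first);
    const bool rest = res.ptr != last;                      // anything after the number?

    if (res.ptr == first)                                   // no digits matched at all
        return {parse_status::got_something, res.ec, next};
    if (res.ec == std::errc::result_out_of_range)
        return {rest ? parse_status::got_failed_value_and_something
                     : parse_status::got_failed_value, res.ec, next};
    return {rest ? parse_status::got_value_and_something
                 : parse_status::got_value, res.ec, next};
}
```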

A key question (to me at least) is whether it might be possible to do away with parse_status completely. But the parser routine is aware of the exact details anyway, so is it good to throw that away? It also helps if we need to examine the return value and error code as little as possible.

I'm keen to see which is more readable, looking at tests on status codes or code that re-creates those tests by if'ing on different error code and iterator values to (re) deduce these facts.

In conclusion, I was thinking that an interface exhibiting some of the traits of this one is needed:

// never throws (imagine range/iterator pair versions as you see fit):

conversion_result parse( signed char& value, Range range ) noexcept;
conversion_result parse( char& value, Range range ) noexcept;
conversion_result parse( unsigned char& value, Range range ) noexcept;
conversion_result parse( int& value, Range range ) noexcept;
conversion_result parse( unsigned int& value, Range range ) noexcept;
conversion_result parse( long& value, Range range ) noexcept;
conversion_result parse( unsigned long& value, Range range ) noexcept;
conversion_result parse( long long& value, Range range ) noexcept;
conversion_result parse( unsigned long long& value, InputRange range ) noexcept;
conversion_result parse( float& value, Range range ) noexcept;
conversion_result parse( double& value, Range range ) noexcept;
conversion_result parse( long double& value, Range range ) noexcept;

Not everybody wants or needs to handle codes, or wants to manually map the results to exceptions and lose info.
So I think an exception version or mapping function is needed too, maybe something like this:

// highly likely to throw.
enum conversion_options;
void parse_checked( int& value, InputRange range, conversion_options check_options = allow_value);

enum conversion_options // bit mask, probably can be simplified, but conveys the idea
{
    allow_value,
    allow_value_and_something,
    allow_failed_value,
    allow_failed_value_and_something,
    allow_errors,
    allow_nothing
};

// highly likely to throw.
void parse_checked( int& value, InputRange range, conversion_options check_options = allow_value);

By default checking is strict, it only allows an exact value and nothing more without an exception.
Accepting a value followed by something else, an empty input, or an out-of-range value can be done through the options.
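A rough sketch of how such a throwing wrapper could sit on top of a non-throwing primitive. It uses C++17 std::from_chars as a stand-in for the noexcept parse above, reduces the option set to three real bit flags, and returns the unparsed remainder instead of void; all three choices are assumptions made for illustration, not part of the proposal.

```cpp
#include <charconv>
#include <cstddef>
#include <stdexcept>
#include <string_view>
#include <system_error>

enum conversion_options : unsigned {
    allow_value               = 0,        // strict: exactly one value, nothing else
    allow_value_and_something = 1u << 0,  // trailing characters are tolerated
    allow_failed_value        = 1u << 1,  // out of range is tolerated, value untouched
    allow_nothing             = 1u << 2   // empty input is tolerated, value untouched
};

// Returns whatever was not consumed; throws unless the outcome is permitted.
inline std::string_view parse_checked(int& value, std::string_view input,
                                      unsigned options = allow_value) {
    if (input.empty()) {
        if (options & allow_nothing) return input;
        throw std::invalid_argument("parse_checked: empty input");
    }
    const char* first = input.data();
    const char* last  = first + input.size();
    int tmp = 0;   // parse into a temporary so value changes only on success
    const std::from_chars_result r = std::from_chars(first, last, tmp);

    if (r.ec == std::errc::invalid_argument)
        throw std::invalid_argument("parse_checked: not a number");
    const bool failed_value = (r.ec == std::errc::result_out_of_range);
    if (failed_value && !(options & allow_failed_value))
        throw std::out_of_range("parse_checked: value out of range");
    if (r.ptr != last && !(options & allow_value_and_something))
        throw std::invalid_argument("parse_checked: trailing characters");

    if (!failed_value) value = tmp;
    return std::string_view(r.ptr, static_cast<std::size_t>(last - r.ptr));
}
```

With the default options only an exact value passes; passing allow_value_and_something lets a caller continue from the returned remainder.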

Some open questions I have:
* I'm looking at the template interface ideas and I can't decide if they are genius or excessive. But that's something I always think about templates.
* Are we really saying C can't even do this task sufficiently well? Kind of sad! Won't "they" revisit this, and won't we get that later too?
* Should +/- symbols be rejected for unsigned types? If they aren't necessary/useful, why accept them?
* Should .00 be accepted as a float of 0.00?
* Locales: is 1,100 a 1 followed by something (a comma), or 1100 in some locale? Can a user override that to pick either interpretation?

Design choices I think make sense:
* Don't skip leading whitespace as skipping tabs, lfs, crs etc. can be surprising, unwanted and slow, and can be explicitly done easily.
* Have exception and exception free versions to allow noexcept optimizations and routines useful in exception free environments.
* Errors should return/raise detailed and more identifiable errors than using atoi etc.
* Return value should aid in diagnosing where to continue to parse next.
* Don't use errno etc. as it seems to raise questionable concurrency questions and doesn't appear a clean solution anyway.
* Use overloaded names, hopefully organised for more usability in generic code and more memorable than guessing whether it is strtol/strtoul etc.
* Takes conversion value by parameter rather than return to allow setting a default and simplification of other things.
* Leaves Hex/octal conversion operations to other interfaces so as to reduce interface complexity.

An interesting acid test might be to see how usable such routines would be to parse a comma delimited list of values for any locale, or a Windows-style .ini file of any types, and compare that to using strtoul etc.

That's my input for now. I am enjoying reading what everybody else has to say on the subject.
It's fascinating to find out how simple or not or flexible or not the end result will be.

Thanks. 

Olaf van der Spek

Feb 3, 2014, 6:02:12 PM2/3/14
to std-pr...@isocpp.org
On Mon, Feb 3, 2014 at 11:01 PM, Matthew Woehlke
<mw_t...@users.sourceforge.net> wrote:
> If you do care about more than exactly one of the three possible output
> information parts, I don't see any way to avoid having at least one local
> variable. So what is wrong with:

user_t& u = ...
if (parse(u.age, input)) // or !parse, depending on return type
return / throw

No local var required, no type duplication

if (auto err = parse(u.age, is)) // or !parse, depending on return type
return err / throw

> unknown_t last_consumed; // um... what's the type of this?

(const) iterator perhaps, though it's also possible to update the
string_view in-place

> And of course, there's the case that we don't care about the status:
>
> type out = default;
> from_string(in, out).ignore();
> use(out);
>
> - versus -
>
> use(from_string<type>(in).value_or(default));

That one liner is nice and would require a wrapper indeed.
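A minimal sketch of such a wrapper, assuming an out-parameter primitive with a bool return. The parse below is only a stand-in built on C++17 std::from_chars, and from_string is the hypothetical wrapper name used in the quoted message.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Stand-in for the out-parameter primitive: true on success, value untouched on failure.
inline bool parse(int& value, std::string_view input) noexcept {
    int tmp = 0;
    const char* last = input.data() + input.size();
    const auto r = std::from_chars(input.data(), last, tmp);
    if (r.ec != std::errc() || r.ptr != last) return false;   // strict: whole input
    value = tmp;
    return true;
}

// The wrapper: turns the out-parameter form into a value-returning one.
template <typename T>
std::optional<T> from_string(std::string_view input) {
    T value{};
    if (parse(value, input)) return value;
    return std::nullopt;
}
```

The out-parameter form keeps `if (!parse(u.age, input)) return;` free of locals, while `from_string<int>(in).value_or(7)` gives the one-liner.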


--
Olaf

Matthew Woehlke

Feb 3, 2014, 7:20:03 PM2/3/14
to std-pr...@isocpp.org
On 2014-02-03 18:02, Olaf van der Spek wrote:
> On Mon, Feb 3, 2014 at 11:01 PM, Matthew Woehlke
> <mw_t...@users.sourceforge.net> wrote:
>> If you do care about more than exactly one of the three possible output
>> information parts, I don't see any way to avoid having at least one local
>> variable. So what is wrong with:
>
> user_t& u = ...
> if (parse(u.age, input)) // or !parse, depending on return type
> return / throw
>
> No local var required, no type duplication

Okay. This is the first decent argument I've seen in favor of using an
out param. However I still think the out param is sub-optimal except in
this case (which I suspect is not the more common case).

Maybe we should just provide both...

--
Matthew

Bengt Gustafsson

Feb 3, 2014, 8:57:12 PM2/3/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
@Matthew, regarding strlen avoidance: By RANGE I meant a template type which has the same API as required by a range based for (which is kind of a built-in template function). This means that you don't have to create a string_view, any type of range will do. Here, for instance, is a char_ptr_range for this case:

template<typename T> class zero_terminate_iterator {
public:
    zero_terminate_iterator() : ptr(nullptr) {}
    zero_terminate_iterator(T* p) : ptr(p) {}

    bool operator==(const zero_terminate_iterator<T>& rhs) const {
        if (ptr == rhs.ptr)
            return true;    // includes the case that both are nullptr
        if (rhs.ptr == nullptr && *ptr == 0)
            return true;    // this iterator has reached the terminator
        if (ptr == nullptr && *rhs.ptr == 0)
            return true;    // rhs has reached the terminator
        return false;
    }
    // etc...
private:
    T* ptr;
};

auto char_ptr_range(const char* p) { return range<zero_terminate_iterator<const char>>(zero_terminate_iterator<const char>(p), zero_terminate_iterator<const char>()); }  // Uses C++14 function return type deduction.

//Now you can write:
const char* p = "1234";
for(auto c : char_ptr_range(p))
    process_char(c);

// and you can write
auto x = parse<int>(char_ptr_range(p));  // or whatever API we end up with.
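For reference, a compilable sketch of the same idea with the parts elided by "etc..." filled in. The range template, its first/last member names, and the count_chars helper are assumptions made for illustration; only the equality rule (a default-constructed iterator compares equal to one pointing at the terminator) is taken from the text above.

```cpp
#include <cstddef>

template <typename T>
class zero_terminate_iterator {
public:
    zero_terminate_iterator() : ptr(nullptr) {}              // the end marker
    explicit zero_terminate_iterator(T* p) : ptr(p) {}

    T& operator*() const { return *ptr; }
    zero_terminate_iterator& operator++() { ++ptr; return *this; }

    bool operator==(const zero_terminate_iterator& rhs) const {
        if (ptr == rhs.ptr) return true;                     // includes both being nullptr
        if (rhs.ptr == nullptr) return *ptr == 0;            // reached the terminator
        if (ptr == nullptr) return *rhs.ptr == 0;
        return false;
    }
    bool operator!=(const zero_terminate_iterator& rhs) const { return !(*this == rhs); }

private:
    T* ptr;
};

// Hypothetical minimal range: just enough for range-based for and empty().
template <typename Iterator>
struct range {
    Iterator first, last;
    Iterator begin() const { return first; }
    Iterator end() const { return last; }
    bool empty() const { return first == last; }
};

inline range<zero_terminate_iterator<const char>> char_ptr_range(const char* p) {
    return {zero_terminate_iterator<const char>(p), zero_terminate_iterator<const char>()};
}

// Walks the string once without ever calling strlen.
inline std::size_t count_chars(const char* p) {
    std::size_t n = 0;
    for (char c : char_ptr_range(p)) { (void)c; ++n; }
    return n;
}
```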


Note that as we are aiming for an extensible set of conversions including user-defined types, say WGS84 geospatial coordinates, there is also an open set of error codes, so an enum or int value is not enough. (The from_string may be wrapped in a template function which can't be expected to know the interpretation of an int error code for any T it may be instantiated for!)

Now I want to check some common use cases for the solution with a return triplet. To complete the use cases we can also add a fourth member skipped which is true if we had to skip spaces. I think that the smartest way to solve this may be to provide a set of value() functions, but no cast operators:

Tentatively I call the "states" of the return value:

strict - no spaces skipped. All of the string could be converted.
complete - space skipping ok, but no trailing junk.
ok - space skipping ok, and trailing junk.
bad - no parsing was possible, even after skipping spaces.

// The actual parsing is always the same:
const auto r = parse<int>(char_ptr_range("123"));

int x = r.strict_value(); // throws if r is not strict.
int y = r.complete_value(); // throws if r is not complete
int z = r.value(); // throws if r is not ok
int w = r.value_or(17); // never throws

r would also have is_complete(), is_strict(), is_ok() for those who want to handle errors without throwing.
An iterator to the next char can be returned by r.next()
The error code can be returned by r.error().

Something like that, what do you think about that?
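A small sketch of such a result object for int over a string_view, to make the four states tangible. The names int_result and parse_int are hypothetical, std::from_chars (C++17) does the digit work, and only ' ' and '\t' count as skippable space to stay locale independent.

```cpp
#include <charconv>
#include <cstddef>
#include <stdexcept>
#include <string_view>
#include <system_error>

struct int_result {
    bool             has_value;  // false corresponds to the "bad" state
    bool             skipped;    // true when leading blanks were skipped
    int              parsed;
    std::string_view rest;       // unconsumed input

    bool is_ok() const       { return has_value; }
    bool is_complete() const { return has_value && rest.empty(); }
    bool is_strict() const   { return is_complete() && !skipped; }

    int value() const          { return require(is_ok()); }
    int complete_value() const { return require(is_complete()); }
    int strict_value() const   { return require(is_strict()); }
    int value_or(int fallback) const noexcept { return has_value ? parsed : fallback; }

    int require(bool ok) const {
        if (!ok) throw std::runtime_error("parse result does not meet the requested level");
        return parsed;
    }
};

// The parse itself is always the same; the caller picks the strictness afterwards.
inline int_result parse_int(std::string_view input) {
    std::size_t i = 0;
    while (i < input.size() && (input[i] == ' ' || input[i] == '\t')) ++i;
    const bool skipped = i > 0;

    const char* first = input.data() + i;
    const char* last  = input.data() + input.size();
    int v = 0;
    const auto r = std::from_chars(first, last, v);
    if (r.ec != std::errc())
        return {false, skipped, 0, input.substr(i)};
    return {true, skipped, v, input.substr(static_cast<std::size_t>(r.ptr - input.data()))};
}
```

Note that the leading blanks are consumed before the caller has said whether they are acceptable, which matters for single-pass input.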

We need to take a look at the parse use case. For instance to parse a comma separated list of ints stored as a nul terminated string:

const char* p = "123, 456 ,789"; 
auto range = char_ptr_range(p);
vector<int> numbers;

while (!range.empty()) {
    const auto r = from_string<int>(range);
    numbers.emplace_back(r.value());   // requires at least one parsable char
    range.first = r.last();
    skipspace(range);     // Assume we have a skip function which works on the range in situ
    if (range.empty())
        break;            // the last value is not followed by a comma
    if (*range.first != ',')
       throw "no comma";
    range.first++;
}

This assumes that range<T> has an empty method and that the starting iterator is called first.

The only main wart would be that we must manually update the range's first member. It would be possible to store the end iterator to be able to return a RANGE.

Then I could throw in a must_be(RANGE, token) helper which performs the last three rows, for a new appearance like this:

const char* p = "123, 456 ,789"; 
auto range = char_ptr_range(p);
vector<int> numbers;

while (!range.empty()) {
    const auto r = from_string<int>(range);
    numbers.emplace_back(r.value());   // requires at least one parsable char, but allows leading white space
    range = r.rest();
    skipspace(range);     // skip spaces between number and comma
    if (range.empty())
        break;            // the last value is not followed by a comma
    must_be(range, ',');
}

With a lazy split this could also be written, without losing performance:

for (auto str : lazy_split(char_ptr_range(p), ','))  // str is probably a string_view now, but we really don't need to know that.
    numbers.emplace_back(from_string<int>(str).complete_value());

No, not really the same, now we don't allow space between number and comma. But we can introduce a new level which allows leading and trailing space. However, this can't be a const method of 'r' or we can't defer skipping of trailing spaces until we ask whether there are any. BTW this mandates the entire range to be stored in 'r' or the skipping could run wild...
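A sketch of what a lazy split could look like over a string_view, assuming a single-character separator. The lazy_split class and parse_int_list are hypothetical, and std::from_chars stands in for from_string with manual skipping of leading blanks, so a field such as "456 " is rejected exactly as described above.

```cpp
#include <charconv>
#include <cstddef>
#include <stdexcept>
#include <string_view>
#include <system_error>
#include <vector>

// Yields one field at a time; nothing is copied or allocated.
class lazy_split {
public:
    class iterator {
    public:
        iterator(std::string_view rest, char sep, bool done)
            : rest_(rest), sep_(sep), done_(done) {}
        std::string_view operator*() const { return rest_.substr(0, rest_.find(sep_)); }
        iterator& operator++() {
            const std::size_t pos = rest_.find(sep_);
            if (pos == std::string_view::npos) done_ = true;   // that was the last field
            else rest_.remove_prefix(pos + 1);
            return *this;
        }
        bool operator!=(const iterator& other) const { return done_ != other.done_; }
    private:
        std::string_view rest_;
        char sep_;
        bool done_;
    };

    lazy_split(std::string_view text, char sep) : text_(text), sep_(sep) {}
    iterator begin() const { return iterator(text_, sep_, false); }
    iterator end() const { return iterator(std::string_view(), sep_, true); }
private:
    std::string_view text_;
    char sep_;
};

inline std::vector<int> parse_int_list(std::string_view text) {
    std::vector<int> numbers;
    for (std::string_view field : lazy_split(text, ',')) {
        while (!field.empty() && field.front() == ' ') field.remove_prefix(1);  // leading blanks
        int v = 0;
        const char* last = field.data() + field.size();
        const auto r = std::from_chars(field.data(), last, v);
        if (r.ec != std::errc() || r.ptr != last)   // "complete": nothing may follow the digits
            throw std::invalid_argument("field is not a complete int");
        numbers.push_back(v);
    }
    return numbers;
}
```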

Note that decltype(r) could be dependent on the type being converted (and of course the RANGE type). It is a concept rather than a class. This means, I guess, that the return type of error() could differ depending on T, but using that feature of course limits usability in template code.

Miro Knejp

Feb 3, 2014, 9:30:54 PM2/3/14
to std-pr...@isocpp.org

Am 04.02.2014 02:57, schrieb Bengt Gustafsson:
> @Matthew, regarding strlen avoidance: By RANGE I meant a template type which has the same API as required by a range based for (which is kind of a built-in template function). This means that you don't have to create a string_view, any type of range will do. Here, for instance, is a char_ptr_range for this case:

...


When we are talking ranges is it the same stuff SG9 is working on? I'm not sure if it's really important whether the input to the parsing methods are iterators or ranges, since in the end the latter should always be somehow convertible to the former for interoperability with the remaining standard library and there seems to be more disagreement on how to provide the results of the conversion, not the inputs. If all goes south a pair of iterators will always do the job...

> Note that as we are aiming for an extensible set of conversions including user-defined types, say WGS84 geospatial coordinates, there is also an open set of error codes, so an enum or int value is not enough. (The from_string may be wrapped in a template function which can't be expected to know the interpretation of an int error code for any T it may be instantiated for!)
>
> Now I want to check some common use cases for the solution with a return triplet. To complete the use cases we can also add a fourth member skipped which is true if we had to skip spaces. I think that the smartest way to solve this may be to provide a set of value() functions, but no cast operators:
>
> Tentatively I call the "states" of the return value:
>
> strict - no spaces skipped. All of the string could be converted.
> complete - space skipping ok, but no trailing junk.
> ok - space skipping ok, and trailing junk.
> bad - no parsing was possible, even after skipping spaces.
>
> // The actual parsing is always the same:
> const auto r = parse<int>(char_ptr_range("123"));
>
> int x = r.strict_value(); // throws if r is not strict.
> int y = r.complete_value(); // throws if r is not complete
> int z = r.value(); // throws if r is not ok
> int w = r.value_or(17); // never throws

Does the actual parsing happen in the parse() method or in r.*value*()? If the latter, then what happens if the range consists of input iterators (e.g. istreambuf_iterator, i.e. single-pass) and I try to call an r.*value*() method multiple times? Will it prevent me from doing that or is the sky just going to fall down on my head? If the parsing happens inside parse(), then how do I prevent it from consuming leading whitespace from input iterators if leading whitespace is not allowed in my use case? It seems that in order to distinguish r.complete_value() and r.strict_value() the whitespace was already consumed, otherwise they couldn't provide that information.

Magnus Fromreide

Feb 4, 2014, 2:48:14 AM2/4/14
to std-pr...@isocpp.org
On Mon, Feb 03, 2014 at 09:48:45PM +0100, Miro Knejp wrote:
>
> As soon as locales are involved all three should automatically recognize
> the correct grouping, separating and decimal characters.

I think the use of locales should be optional and - importantly - follow the
"if you don't use it, you don't pay for it" rule.

/MF

Bengt Gustafsson

Feb 4, 2014, 10:33:42 AM2/4/14
to std-pr...@isocpp.org
Miro:

Yes, leading whitespace is always consumed in parse. If you don't allow this you lose some performance as you actually convert the number when you could know that it was an error as soon as you saw the first space. I don't think this is a big problem. The idea is to skip the space and set a flag in the return value if there was some space to skip. The strict_value() function checks this flag and throws if it was set.

However, trailing space can not be skipped until we ask for it. I don't think this is a big deal as we already have the remaining range in the result object.

The Range concept we use here should be synchronized with what the range working group are doing. Apart from the obvious begin() and end() functions that are anyway required for range based for we only need:

- iterators must be copyable
- A Range must be constructible from two iterators (begin/end).

I feel pretty safe that this will be part of the range concept the working group comes up with (although the copying part may be debatable with input iterators). Of course passing the range in by non-const reference solves this problem elegantly but all such "mutating" suggestions seem to be hard to get approval for here. For me it would be the easiest route to take: less copying of ranges, easier to use, fewer requirements on the Range concept. (The returned value will have another flag value indicating if the range ended, but the late trailing space check is of course hard to implement).

Magnus:

When it comes to locales, how important is this for parsing numbers? Can't we rely on narrow chars being ASCII and wide chars being Unicode, i.e. demand that multibyte data is either converted to Unicode strings before parsing or that the range's iterators do the conversion on the fly, in a lazy style? For other types this may be more important, but I still think that the locale is more of a thing for the stream than the conversion function.

We will of course need some adaptor so that cin and similar can be used as the source range. Maybe an overloaded function is needed as an istream does not comply to the Range concept and we don't want to have to write from_string(istream_range(cin)) at every call, do we? BTW: This shows that a new bytestream type should be Range concept compatible!

Matthew Woehlke

Feb 4, 2014, 1:39:54 PM2/4/14
to std-pr...@isocpp.org
This seems like overkill. Either the text is well-formed or it isn't.
I'd say that should be the first check. (Probably considering "-1" for
unsigned as 'not well formed'.)

If the text is well-formed, then and only then would I get into other
reasons the parse might have failed, e.g. because the value would overflow.

This corresponds loosely with the output iterator and whether or not all
possible text was consumed.

Users that really care about the empty input case can check that
themselves easily enough.

> enum conversion_options // bit mask, probably can be simplified, but
> conveys the idea
> {
> allow_value,
> allow_value_and_something,
> allow_failed_value,
> allow_failed_value_and_something,
> allow_errors,
> allow_nothing
> };

While I like the idea, I don't think this set of options is all that
useful. Instead I would suggest:

enum class option // magic syntax to make bits? ;-)
{
accept_whitespace,
accept_trailing_characters,
accept_overflow,
// others?
}
STD_FLAGS(options, option) // ;-)

The accept_overflow flag would only be honored when the numeric type is
real (float, double, etc.... "numeric type" here meaning e.g. also if we
had a std::complex version). The effect would be to return ±inf if
overflow occurs.

(Vaguely related note: the real flavors should also accept e.g. "-inf"
and related NaN forms.)

> By default checking is strict, it only allows an exact value and nothing
> more without an exception.

...which should/would be the behavior of 'options opts = {}'.

> Accepting a value followed by something else or a value or nothing or an
> out of range value can be done through the options.

I don't think an empty string should ever parse successfully... what
would be the resulting value? I suspect anyone inclined to use that
feature would do better to use a default-value on any invalid input.

> * Are we really saying C can't even do this task sufficiently well. Kind of
> sad! Won't "they" revist this and won't we get that later too.

Inasmuch as we'd like to use iterators and string_view and such, I think
that might be hard :-). Maybe C will implement new API's based on these,
resulting in C++ standards being adopted into C for a change :-).

> * should .00 be accepted as a float of 0.00?

IMHO yes; it's valid in source code after all. (What does strtod do?)
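On the strtod question: the C99 definition of strtod accepts a digit sequence with an optional radix character, so ".00" converts to 0.0, and it also accepts "inf"/"infinity" and "nan" forms regardless of case. A short check, assuming the default "C" locale; strtod_accepts is a hypothetical helper name.

```cpp
#include <cstdlib>

// True when std::strtod consumed all of `text` as one number.
inline bool strtod_accepts(const char* text, double& out) {
    char* end = nullptr;
    out = std::strtod(text, &end);
    return end != text && *end == '\0';
}
```

A lone "." is rejected because at least one digit is required.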

> * locales. is 1,100 is that 1 followed by someting (a comma), or 1100
> locale.

I would implement two versions: one locale-aware and one not (equivalent
to using the "C" locale). The answer then depends on the locale; if ","
is the group separator, then "1,100" is equivalent to "1100".

> Can a user override that to pick either interpretation.

Yes, by passing a specific locale to the locale-aware version. (Note
that in some locales, "1,100" == 11e-1.)
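The three readings of "1,100" can be reproduced with today's stream machinery, which gives a feel for what a locale-aware overload would have to do. This is only a sketch: comma_groups and comma_decimal are hypothetical std::numpunct facets and parse_with is a hypothetical helper, not a proposed API.

```cpp
#include <locale>
#include <sstream>
#include <string>

// ',' groups thousands, '.' stays the decimal point.
struct comma_groups : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// ',' is the decimal point; grouping is switched off.
struct comma_decimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return ""; }
};

// Extracts one T under `loc` and reports the rest of the first line that was not consumed.
template <typename T>
bool parse_with(const std::locale& loc, const std::string& text, T& out, std::string& rest) {
    std::istringstream in(text);
    in.imbue(loc);
    rest.clear();
    if (!(in >> out)) return false;
    if (!in.eof()) std::getline(in, rest);
    return true;
}
```

With std::locale::classic() the int parse stops at the comma (1, rest ",100"); with comma_groups it yields 1100; with comma_decimal a double parse yields 1.1. The locale takes ownership of a facet passed as `new comma_groups`.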

> Design choices I think make sense:
> * Don't skip leading whitespace as skipping tabs, lfs, crs etc. can be
> surprising, unwanted and slow, and can be explicitly done easily.

Yes, if accept_whitespace is not used.

> * Return value should aid in diagnosing where to continue to parse next.

I'm guessing you're talking about providing the position where parsing
"quit"?

> * Use overloaded names hopefully organised for more useability in generic
> code and more memorable than guessing is it strtol/stroul etc.

Speaking of "why can't C do it first"... :-)

> * Leaves Hex/octal conversion operations to other interfaces so as to
> reduce interface complexity.

Not sure about this. Which reminds me, your proposed APIs don't take a
base parameter. IMO that is mandatory; there *will* be users that need
to parse a number in e.g. base 16. (Parsing bases other than 2, 8, 10
and 16 is probably unusual, however.)
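For comparison, the C++17 facility std::from_chars takes the base as a trailing parameter (2 to 36), skips no whitespace, recognises no "0x" prefix, and rejects a minus sign for unsigned types. A minimal sketch of a strict wrapper around it; parse_unsigned is a hypothetical name.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Parses the whole of `text` as an unsigned integer in the given base, or returns nullopt.
inline std::optional<unsigned> parse_unsigned(std::string_view text, int base = 10) {
    unsigned value = 0;
    const char* first = text.data();
    const char* last  = first + text.size();
    const auto r = std::from_chars(first, last, value, base);
    if (r.ec != std::errc() || r.ptr != last) return std::nullopt;
    return value;
}
```

Under these rules "-1" is simply not well formed for an unsigned target, and "0xff" in base 16 fails the whole-input check because parsing stops at the 'x'.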

--
Matthew

Matthew Woehlke

Feb 4, 2014, 5:08:31 PM2/4/14
to std-pr...@isocpp.org
On 2014-02-03 20:57, Bengt Gustafsson wrote:
> Tentatively I call the "states" of the return value:
>
> strict - no spaces skipped. All of the string could be converted.
> complete - space skipping ok, but no trailing junk.
> ok - space skipping ok, and trailing junk.
> bad - no parsing was possible, even after skipping spaces.
>
> // The actual parsing is always the same:
> const auto r = parse<int>(char_ptr_range("123"));
>
> int x = r.strict_value(); // throws if r is not strict.
> int y = r.complete_value() // throws if r is not complete
> int z = r.value() // throws if r is not ok
> int w = r.value_or(17); // never throws

Unfortunately, that makes it much more awkward to say 'use the parse
value if a strict parse succeeded, else use a default value'. Sure, I
can write 'r.is_strict() ? r.value() : default', but that's much more
awkward than if I could always write 'r.value_or(default)'.

I still think it makes more sense to tell the parse up front how
tolerant it should be. Having said that, I could possibly see still
using your 'strict' and 'complete' terminology. I would probably then
name the third option something like 'relaxed'.

> An iterator to the next char can be returned by r.next()
> The error code can be returned by r.error().

Except for the above comments, I like.

Another reason to move the strict/relaxed/etc. to input parameters is
that it would allow the return type to more trivially subclass
std::optional. Basically, you would be adding error() and next() to
std::optional, without having to redefine/overload value() and related bits.

> We need to take a look at the parse use case. For instance to parse a comma
> separated list of ints stored as a nul terminated string:
> [example snipped]

With my above suggestions, I believe the code in your example is the
same except that you would also pass 'relaxed' to from_string.

(Loosely related: I realize we haven't been trying to pin down the name
of the function itself, but I would prefer something more like
'string_to', so that the name plus template type makes a natural phrase.)

> The only main wart would be that we must manually update the range's first
> member. It would be possible to store the end iterator to be able to return
> a RANGE.

Agreed, we should probably (also?) have that. (I first thought to call
it "remaining" / "remainder", but "rest" is okay too.)

> must_be(range, ',');

Not really relevant, but I would name this something else to indicate
that it modifies the range... usually I tend to "consume" or "chomp".
(You might also consider returning the modified range rather than
modifying it in-place, just for the sake of clarity.)

> With a lazy split this could also be written, without loosing performance:
>
> for (auto str : lazy_split(char_ptr_range(p), ',')) // str is probably a
> string_view now, but we really don't need to know that.
> numbers.emplace_back(from_string<int>(str).complete_value());
>
> No, not really the same, now we don't allow space between number and comma.

I would argue that 'complete' should also eat trailing space :-). (This
is another reason to pass the mode as an input parameter; in relaxed
mode we don't care what follows the number, but in complete mode spaces
should be consumed so that we can tag the parse as successful. This
wouldn't need to be a different mode.)

> Note that decltype(r) could be dependant on the type being converted (and
> of course the RANGE type). It is a concept rather than a class. This means,
> I guess, that the return type of error() could differ depending on T, but
> using that feature of course limits usability in template code.

Um. That could work. Or you could specify that it returns int and that
the allowed values depend on the value type. (I'm not convinced that's
an issue... it's still extensible; any one type can use millions of
possible results, and the values can be declared to be dependent on the
value type, i.e. 5 for parsing an int may mean something different than
5 for parsing a user_t. Reasonable implementations would stick to some
common subset of well-known values. Template code, regardless, must also
stick to that subset or else be conditional on the value type.)

You might be able to do both; have error() return an int but allow
specializations for user types to return a subclass / specialization of
the "common" result type that adds additional members.

--
Matthew

Bengt Gustafsson

Feb 4, 2014, 6:30:39 PM2/4/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
I think you are right, Matthew. The most logical choice is probably to send the parsing options in as flags. The big bike shed regards their names, polarities and defaults of course...

Here's a sketch:


// expect holds a value or an exception_ptr. I think this is basically the same as boost::expect, which in turn was inspired by Andrei Alexandrescu's idea.
// All ways of getting at the value except value_or() throw any
// pending exception inside.
// This should be refined as optional to avoid relying on a default
// ctor in the error case.
template <typename T> class expect {
public:
    expect(exception_ptr ex) : m_exception(ex) {}
    expect(const T& val) : m_value(val) {}

    operator bool() { return !m_exception; }
    exception_ptr exception() { return m_exception; }

    operator T() { return value(); }        // specialization removes this for bool (or follow optional<bool>'s example)
    T value() {
        if (m_exception)
            rethrow_exception(m_exception);

        return m_value;
    }
    T value_or(const T& defval) {
        if (m_exception)
            return defval;

        return m_value;
    }

private:
    T m_value;
    exception_ptr m_exception;
};


enum StrToFlags {
    noleading = 1,
    notrailing = 2,
    complete = 4,
    strict = 7
};


// Maybe better to get flags as a template parameter? Or offer both
// versions? Not having to test flags on each call saves time and they
// are going to be fixed per call site 99% of the time.
template<typename T, typename RANGE> expect<T> str_to(RANGE& range, StrToFlags flags)
{
    if (flags & noleading) {
        if (isspace(*begin(range)))
            return make_exception_ptr("No leading space allowed");
    }
    else
        skipspace(range);

    T tmp;
    ... Do the conversion into tmp and return on errors ...;
    
    if (flags & notrailing) {
        if (isspace(*begin(range)))
            return make_exception_ptr("No trailing space allowed");
    }
    else
        skipspace(range);

    if (flags & complete && begin(range) != end(range))
        return make_exception_ptr("Junk after value");
        
    return tmp;
}


// Use cases

// Throw on any error:
int val = str_to<int>(char_ptr_range("123"), strict);

// Accept any error
int v2 = str_to<int>(char_ptr_range("123")).value_or(17);

// Parse comma separated using lazy split:
for (auto str : lazy_split(char_ptr_range("123, 234 ,345,34"), ','))
    numbers.push_back(str_to<int>(str, complete));       // allow leading and trailing space, but nothing else trailing except the comma.

// Use templated flag version to save time:
for (auto str : lazy_split(char_ptr_range("123, 234 ,345,34"), ','))
    numbers.push_back(str_to<int, complete>(str));       // allow leading and trailing space, but nothing else trailing except the comma.


Magnus Fromreide

Feb 4, 2014, 6:46:57 PM2/4/14
to std-pr...@isocpp.org
On Tue, Feb 04, 2014 at 01:39:54PM -0500, Matthew Woehlke wrote:
> On 2014-02-03 17:35, gmis...@gmail.com wrote:
>
> This seems like overkill. Either the text is well-formed or it
> isn't. I'd say that should be the first check. (Probably considering
> "-1" for unsigned as 'not well formed'.)
>
> If the text is well-formed, then and only then would I get into
> other reasons the parse might have failed, e.g. because the value
> would overflow.
>
> This corresponds loosely with the output iterator and whether or not
> all possible text was consumed.
>
> Users that really care about the empty input case can check that
> themselves easily enough.
>
> >enum conversion_options // bit mask, probably can be simplified, but
> >conveys the idea
> >{
> > allow_value,
> > allow_value_and_something,
> > allow_failed_value,
> > allow_failed_value_and_something,
> > allow_errors,
> > allow_nothing
> >};
>
> While I like the idea, I don't think this set of options are all
> that useful. Instead I would suggest:

I dislike this idea - I would prefer to separate each responsibility in
order to make each function easier to understand and combine as building
blocks.

end_of_input(number(args));
number(args);
end_of_input(optional(number(args)));
optional(number(args));
/* not clear what allow_errors implies */
end_of_input(args);

This is incidentally pretty close to the design used in Boost.Spirit, save that
they overload operators to change the appearance of the method calls and build
a parse tree from them.
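A rough sketch of how such building blocks might compose, assuming each block takes and returns a small state value. The parse_state type is hypothetical, optionally stands in for the optional(...) block above (renamed only to avoid confusion with std::optional), and std::from_chars does the digit parsing.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

struct parse_state {
    std::string_view   rest;    // input not yet consumed
    std::optional<int> value;   // set once a number was parsed
    bool               ok;      // false as soon as one block fails
};

// Consumes a decimal int at the front of the input; leaves the input untouched on failure.
inline parse_state number(std::string_view in) {
    int v = 0;
    const char* last = in.data() + in.size();
    const auto r = std::from_chars(in.data(), last, v);
    if (r.ec != std::errc()) return {in, std::nullopt, false};
    return {std::string_view(r.ptr, static_cast<std::size_t>(last - r.ptr)), v, true};
}

// Turns "no match" into success without a value; the input was not consumed anyway.
inline parse_state optionally(parse_state s) {
    s.ok = true;
    return s;
}

// Succeeds only if everything before it succeeded and no input is left.
inline parse_state end_of_input(parse_state s) {
    s.ok = s.ok && s.rest.empty();
    return s;
}
```

Each of the six combinations listed above then becomes a nested call, for example `end_of_input(optionally(number(text)))` for "a value or nothing, and nothing after it".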

Paul Tessier

Feb 4, 2014, 6:58:56 PM2/4/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net

It seems that the function that parses the least parses best.  It is always possible to compose more complex parse functions from simpler building blocks, but the reverse is not always possible, or desirable.  A regex of one's choosing can be used to skip any kind of prefix or suffix surrounding the section to be parsed.  Adding additional prefix elimination to the parse routine just complicates the behaviour and makes the composition more cumbersome.

Locales must be taken into consideration, as "-9" and "(9)" can both mean the same thing; similarly "510,023.34" or "51.0023,34" may be equivalent depending upon the locale chosen.

To the problems of the interface.  It would seem that out parameters would be the best base upon which to build.  An interface with out parameters can be composed into any of the other interfaces discussed so far.  The same cannot be said for the other interfaces.  Therefore I propose that at minimum an interface that accepts a range (iterators/actual range), an out parameter for the parsed result, and a locale (most likely defaulted), and returns an enum/error code, would solve all problems listed to date.

Example Interface:
int parse<T,U>( range<U> r, T& value, locale loc = default_locale);

Value Returning Example:
T parse_or_zero<T,U>( range<U> r, bool skip_white = true, locale loc = default_locale) {
  if( skip_white ) { r = skip_white_space(r); }
  T retval = 0;
  parse( r, retval, loc );
  return retval;
}

Expected Returning Example:
expected<T> parse_expected<T,U>( range<U> r, bool skip_white = true, locale loc = default_locale) {
  if( skip_white ) { r = skip_white_space(r); }
  expected<T> retval;
  parse_err err = parse( r, retval.value, loc );
  if( err != parse_err::success ) { retval.set_exception( some_exception( err ) ); }
  return retval;
}

It's much easier to compose other interfaces to fit the tastes of the user if out parameters are available.  Whether the other interfaces should be supplied in the standard should be debated, but at minimum the out parameter interface must be in the standard, from what has been discussed so far, to allow others to have their way too: even if not standardized, at least available by simple composition.

Miro Knejp

Feb 4, 2014, 7:07:11 PM2/4/14
to std-pr...@isocpp.org

Am 04.02.2014 16:33, schrieb Bengt Gustafsson:
> Miro:
>
> Yes, leading whitespace is always consumed in parse. If you don't
> allow this you loose some performance as you actually convert the
> number when you could know that it was an error as soon as you saw the
> first space. I don't think this is a big problem. The idea is to skip
> the space and set a flag in the return value if there was some space
> to skip. The strict_value() function checks this flag and throws if it
> was set.
Well what if my use case doesn't allow leading whitespaces? Wasn't that
one of the initial concerns that started this whole thing? If it's a
single pass input iterator this is a big deal and the parser must not
consume any invalid characters I didn't tell it to as they are forever
lost to the caller. What if it consumes the whitespaces and then
realizes they are not followed by a number? The actual whitespace
content is lost. What if I needed this information to increment a line
counter after parsing the number? Plus, expected behavior should not
differ depending on the iterator category. I further fail to see how
that has any impact on performance. If the first character is not a
valid part of the number then parsing immediately fails. Where's the
lost performance? If the user wants to skip whitespaces that operation
can be prepended or composed on top of parse(). Adding the implicit
whitespace skipping only limits its range of applicability. I don't
think parse() should attempt to be too clever.
>
> However, trailing space can not be skipped until we ask for it. I
> don't think this is a big deal as we already have the remaining range
> in the result object.
>
> The Range concept we use here should be synchronized with what the
> range working group are doing. Apart from the obvious begin() and
> end() functions that are anyway required for range based for we only need:
>
> - iterators must be copyable
> - A Range must be constructible from two iterators (begin/end).
>
> I feel pretty safe that this will be part of the range concept the
> working group comes up with (although the copying part may be
> debatable with input iterators). Of course passing the range in by
> non-const reference solves this problem elegantly but all such
> "mutating" suggestions seem to be hard to get approval of here. For me
> it would be the easiest route to take: less copying of ranges, easier
> to use, less requirements on the Range concept. (The returned value
> will have another flag value indicating if the range ended, but the
> late trailing space check is of course hard to implement).
So far all standard iterators are copyable. The only concern is too many
increments on input iterators as copies, but that's an implementation
topic. I think this whole range talk isn't really helpful or getting us
anywhere. How the methods get their input is the least of my concerns to
be honest. Where to place the result, error and next_iter (or next
subrange for that matter) is the biggest disagreement we have.
>
> Magnus:
>
> When it comes to locales, how important is this for parsing numbers?
> Can't we rely on narrow chars being ascii and wide chars being
> unicode, i.e. demand that multibyte data is either converted to
> unicode strings before parsing or that the range's iterators do the
> conversion on the fly, in a lazy style. For other types this may be
> more important, but I still think that the locale is more of a thing
> for the stream than the conversion function.
It's not just about codecvt. Take 1,000.00 versus 1.000,00 versus
1'000.00. Or stuff like ௧௨௩௪. If it can't correctly interpret multibyte
numerals in a UTF-8 string (given an appropriate locale) it's not of
much use in an international environment.

For this I was thinking along the lines of:
parse(...) <- use current global locale (or from an istream's locale if
that's passed as source).
parse(..., locale) <- use provided locale object
parse(..., no_locale) <- no_locale_t tag type to do it with the "C"
locale but without actually using the (virtual!) methods of facets but
an optimized language neutral fast-path version instead
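
A rough sketch of that overload set, to make the tag-dispatch idea concrete. Everything here is an assumption for illustration: the `no_locale_t` tag, the restriction to non-negative `int` over `std::string_view`, the crude nine-digit overflow guard, and the way the locale-aware overload treats group separators. It only handles narrow-character digits, so multibyte numerals such as the ones mentioned above are out of its reach.

```cpp
#include <cstddef>
#include <locale>
#include <optional>
#include <string_view>

// Hypothetical tag type selecting the locale-free fast path.
struct no_locale_t {};
inline constexpr no_locale_t no_locale{};

// Fast path: ASCII digits only, no facet (virtual) calls.
inline std::optional<int> parse(std::string_view s, no_locale_t) {
    if (s.empty() || s.size() > 9) return std::nullopt;  // crude overflow guard
    int result = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return std::nullopt;
        result = result * 10 + (c - '0');
    }
    return result;
}

// Locale-aware path: consults the ctype and numpunct facets of `loc`.
inline std::optional<int> parse(std::string_view s, const std::locale& loc) {
    const auto& ct = std::use_facet<std::ctype<char>>(loc);
    const auto& np = std::use_facet<std::numpunct<char>>(loc);
    const bool grouped = !np.grouping().empty();
    const char sep = np.thousands_sep();
    int result = 0;
    std::size_t digits = 0;
    for (char c : s) {
        if (grouped && c == sep) continue;  // separator positions are not validated here
        if (!ct.is(std::ctype_base::digit, c)) return std::nullopt;
        if (++digits > 9) return std::nullopt;  // crude overflow guard
        result = result * 10 + (c - '0');       // assumes digits are '0'..'9'
    }
    if (digits == 0) return std::nullopt;
    return result;
}

// Default: use a copy of the current global locale.
inline std::optional<int> parse(std::string_view s) {
    return parse(s, std::locale());
}
```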


Matthew Woehlke

Feb 4, 2014, 7:09:16 PM2/4/14
to std-pr...@isocpp.org
On 2014-02-04 18:30, Bengt Gustafsson wrote:
> I think you are right, Matthew. The most logical choice is probably to send
> the parsing options in as flags. The big bike shed regards their names,
> polarities and defaults of course...

I may be wrong, but my impression is that it's been generally felt that
the default should be strict.

> Here's a sketch:
> [snip definition of expect]

Qt seems to be in the process of implementing a similar API. In that
case, I've suggested resolving the ambiguity issues in case value_t ==
bool by simply not providing implicit conversion to value_t, but relying
entirely on operator* instead. The overhead is only one character ('*')
to access the value, but it avoids all sorts of issues.

(Actually looking at std::optional now, it looks like that does actually
do as above.)

If possible I would encourage subclassing std::optional. This will
provide a bunch of needed functionality 'for free' and allow the result
to be passed to something expecting a std::optional.

> enum StrToFlags {
> noleading = 1,
> notrailing = 2,
> complete = 4,
> strict = 7
> };

Per first comment, 'strict = 0', additional bits relax constraints.
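
To illustrate the swapped polarity, here is a minimal sketch in which zero means strict and each bit relaxes one constraint. The enumerator names, the `str_to_int` name, the space/tab-only notion of whitespace and the nine-digit overflow guard are all assumptions made for the example, not part of the quoted proposal.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical flag set: 0 is strict, each bit relaxes one constraint.
enum str_to_flags : unsigned {
    strict         = 0,
    allow_leading  = 1u << 0,
    allow_trailing = 1u << 1,
    relaxed        = allow_leading | allow_trailing
};

// Locale-independent notion of whitespace, chosen only for this sketch.
inline bool is_blank(char c) { return c == ' ' || c == '\t'; }

inline std::optional<int> str_to_int(std::string_view s, unsigned flags = strict) {
    std::size_t i = 0;
    if (flags & allow_leading)
        while (i < s.size() && is_blank(s[i])) ++i;
    const std::size_t first_digit = i;
    int result = 0;
    while (i < s.size() && s[i] >= '0' && s[i] <= '9') {
        if (i - first_digit >= 9) return std::nullopt;  // crude overflow guard
        result = result * 10 + (s[i] - '0');
        ++i;
    }
    if (i == first_digit) return std::nullopt;          // no digits at all
    if (flags & allow_trailing)
        while (i < s.size() && is_blank(s[i])) ++i;
    if (i != s.size()) return std::nullopt;             // unconsumed input remains
    return result;
}
```

With this polarity a caller who passes nothing gets the strict behavior, and the final "was everything consumed" check covers both the leading and trailing cases without a separate isspace test up front.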

> // Maybe better to get flags as a template parameter? Or offer both
> // versions? Not having to test flags on each call saves time and they
> // are going to be fixed per call site 99% of the time.

Good points. I wouldn't object to that. You might even be able to write
the implementation as if they were a runtime parameter and rely on the
compiler to optimize out irrelevant code.

> template<typename T, typename RANGE> expect<T> str_to(RANGE& range,
> StrToFlags flags)
> {
> if (flags & noleading) {
> if (isspace(*begin(range)))
> return make_exception_ptr("No leading space allowed");

I wonder if this isn't overkill... if leading space not allowed, just
continue to parsing the value and fail when the leading space isn't a
valid character?

> if (flags & notrailing) {
> if (isspace(*begin(range)))
> return make_exception_ptr("No trailing space allowed");

Related to above, I don't think this is quite right. I think first you'd
be checking if the entire input was consumed, and acting accordingly
depending on if the flags allowed trailing "stuff".

Anyway, these are implementation details that aren't critical.

Except that I miss the remaining range in the result type, I don't think
I'm seeing anything in the API I don't like that I haven't commented on
above. (Mostly swapping the meaning of the flags...)

> // Use cases
> [snipped]

...all look good :-)

--
Matthew

Matthew Woehlke

Feb 4, 2014, 7:29:00 PM2/4/14
to std-pr...@isocpp.org
On 2014-02-04 18:58, Paul Tessier wrote:
> Locales *must* be taken into consideration as "-9" and "(9)" can both mean
> the same thing, similarly "510,023.34" or "51.0023,34" may be equivalent
> depending upon locale chosen.

Definitely agreed. However I don't think there is much that needs to be
discussed here besides bikeshedding the actual name of the method.

There should be a "C" locale version. We should get that API right
first. We want this for performance reasons, as C locale is going to be
a common use case, and being able to ignore locale issues likely has a
non-trivial impact on performance.

Then there should be an l_foo¹/foo_l version of the same that takes an
optional locale and is locale-aware. What "locale aware" means is I
think mainly an implementation detail that doesn't affect the API.

(¹ While foo_l would be the usual convention, that breaks reading in
case of e.g. string_to<int>, where the _l suffix would awkwardly break
into what is otherwise a natural phrase. Plus in that case, "locale
string" fits the natural phrasing. Other options: lfoo, locale_foo, etc.)

> To the problems of the interface. It would seem that out parameters would
> be the best for the base upon which to build. An interface with out
> parameters can be composed into any of the other interfaces discussed so
> far. The same cannot be said for the other interfaces.

By "composed", do you mean I can write a returns-everything version from
an out-param version? (If yes, I assure you I can do the converse as
well; see below.)

> Therefore I propose that at minimum an interface that accepts a range
> (iterators/actual range), an out parameter for the parsed result, a
> locale (most likely defaulted), and returns an enum/error code would
> solve all problems listed to date.

Does not allow the value to be assigned to a const or passed directly to
a user. And output parameters are just awkward to work with in general;
making the one parameter that will ALWAYS be used an out parameter is
IMHO the worst possible design that's been proposed.

Conversely, there are all sorts of advantages to returning a (subclass
of) std::optional...

I do see you proposed to provide both, which is good. If so, I think it
should be left as an implementation detail which is the 'real'
implementation and which are just wrappers. And I expect the
everything-via-return is going to be most used. (Definitely it's the
only one *I* would use...)

> Example Interface:
> int parse<T,U>( range<U> r, T& value, locale loc = default_locale);

int parse<T, U>(range<U> r, T& value, locale loc = default_locale)
{
auto const result = parse<T>(r, loc);
if (result) value = *result;
return // um... int? from whence do I get an int?
}

That wasn't so hard.

That said... I notice here that there is no way to return the remaining
range. But this is fixed if the return type is parse_result<void>. (Then
we just need a clever way to move-construct the parse_result<T> into
parse_result<void> and we're good... maybe parse_result<void> could have
a specialized ctor to take any other parse_result and just throw out the
value.)
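
A sketch of what that could look like, so the idea is concrete. The layout of `parse_result`, the member names `value` and `rest`, the use of `std::string_view` for the remaining range and the `parse_int` functions are all assumptions for illustration; for brevity the void specialization copies the remainder rather than moving from the source.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type: the parsed value (if any) plus the unparsed remainder.
template <typename T>
struct parse_result {
    std::optional<T> value;
    std::string_view rest;
    explicit operator bool() const { return value.has_value(); }
};

// Specialization that keeps only success/failure and the remainder.
template <>
struct parse_result<void> {
    bool ok;
    std::string_view rest;
    // Converting constructor: accept any parse_result<T> and throw out the value.
    template <typename T>
    parse_result(const parse_result<T>& other)
        : ok(other.value.has_value()), rest(other.rest) {}
    explicit operator bool() const { return ok; }
};

// Everything-via-return version: consumes leading decimal digits.
inline parse_result<int> parse_int(std::string_view s) {
    std::size_t i = 0;
    int result = 0;
    // Stops after 9 digits as a crude overflow guard.
    while (i < s.size() && i < 9 && s[i] >= '0' && s[i] <= '9') {
        result = result * 10 + (s[i] - '0');
        ++i;
    }
    if (i == 0) return {std::nullopt, s};
    return {result, s.substr(i)};
}

// Out-parameter version written as a wrapper around the one above.
inline parse_result<void> parse_int(std::string_view s, int& out) {
    parse_result<int> r = parse_int(s);
    if (r) out = *r.value;
    return r;  // converts to parse_result<void>, dropping the value
}
```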

--
Matthew

Matthew Woehlke

Feb 4, 2014, 7:40:30 PM2/4/14
to std-pr...@isocpp.org
On 2014-02-04 19:07, Miro Knejp wrote:
> On 04.02.2014 16:33, Bengt Gustafsson wrote:
>> Yes, leading whitespace is always consumed in parse. If you don't
>> allow this you lose some performance as you actually convert the
>> number when you could know that it was an error as soon as you saw the
>> first space. I don't think this is a big problem. The idea is to skip
>> the space and set a flag in the return value if there was some space
>> to skip. The strict_value() function checks this flag and throws if it
>> was set.
>
> Well what if my use case doesn't allow leading whitespaces? Wasn't that
> one of the initial concerns that started this whole thing? If it's a
> single pass input iterator this is a big deal and the parser must not
> consume any invalid characters I didn't tell it to as they are forever
> lost to the caller.

Do you mean you actually have a real example of an iterator that cannot
be dereferenced more than once? (What on earth would create such a thing?)

> The actual whitespace
> content is lost. What if I needed this information to increment a line
> counter after parsing the number?

...then don't tell the parser to eat whitespace. (Note: the parser
*must* have a strict mode... so I agree with you there. A mode that eats
everything possible and then tells you how far it got may also be
required. Anything else is probably in the 'nice to have' category.)

--
Matthew

Paul Tessier

Feb 4, 2014, 7:55:57 PM2/4/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net

Except that with an out parameter no copies need be made, and depending on the cost of copying the type, copying may be a bottleneck. Your version of an out parameter composed from a value-returning version forces a copy regardless of the need for one. The reverse composition does not have this problem.

For const, initializing a const object from another object always requires a copy.

big_int cache;
parse<big_int>( r, cache ); // avoids copying

big_int const val = parse_expected<big_int>( r ).value(); // copies regardless

These two versions do the least amount of work. Including both is perfectly fine, but the assumption of equal composition is false. In the above, supplying parse<T> for any custom type could be seen as the base for the other versions, and therefore the only requirement for supporting them. This is helpful if constructing and copying T is expensive, as this expense can be avoided by only using the out parameter version. If the out parameter version is not the base, then there is no way to avoid the cost and still use the structure provided by the standard, and extra implementations of the other versions will be required to achieve performance.
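
To make the allocation argument concrete, here is a small sketch. The `big_int` below is a hypothetical stand-in that just stores its decimal digits in a `std::string`; a real arbitrary-precision type would differ, and `all_digits` and `parse_value` are names invented for the example.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for a heap-backed arbitrary-precision integer.
struct big_int {
    std::string digits;
};

inline bool all_digits(std::string_view s) {
    if (s.empty()) return false;
    for (char c : s)
        if (c < '0' || c > '9') return false;
    return true;
}

// Out-parameter version: validates first, then assigns into the caller's
// object, so storage already owned by `out` can be reused.
inline bool parse(std::string_view s, big_int& out) {
    if (!all_digits(s)) return false;        // `out` is untouched on failure
    out.digits.assign(s.data(), s.size());
    return true;
}

// Value-returning version composed on top; it builds a fresh object
// (and therefore fresh storage) on every call.
inline big_int parse_value(std::string_view s) {
    big_int result;
    parse(s, result);
    return result;
}
```

Reusing one `big_int` across many calls to the first function lets the string keep its buffer, while the second necessarily starts from an empty object each time. Whether that difference matters in practice depends on the type and the workload.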

Matthew Woehlke

Feb 4, 2014, 8:41:13 PM2/4/14
to std-pr...@isocpp.org
On 2014-02-04 19:55, Paul Tessier wrote:
>> int parse<T, U>(range<U> r, T& value, locale loc = default_locale)
>> {
>> auto const result = parse<T>(r, loc);
>> if (result) value = *result;
>> return // um... int? from whence do I get an int?
>> }
>
> Except that with an out parameter no copies need be made, which depending
> on cost of copying said type, this may be a bottleneck. Your version of
> an out parameter composed of a value returning version forces a copy
> regardless of the need for one.

Where?

A "good" implementation would emplace in the return value. And it should
be possible to tweak the assignment to be a move-assignment (hmm, should
add a take() to std::optional). No copies there?

> The opposite cannot be said for the reverse composition.

Yours enforces a default construction and, at best, some less-expensive
form of changing 'value'. Conversely, if I want everything-via-return, a
good implementation will do either something similar or an initialized
construction, which will be at least as cheap.

If anything, it seems to me that the 'real' implementation being
everything-via-return is going to be least expensive.

> For const, initializing a const object from another object always requires
> a copy.
>
> big_int const val = parse_expected<big_int>( r ).value(); // copies
> regardless

Eh? Your return type is 'parse_result<T> const'? Why? (In fact, why
would you *ever* use 'const' on a return type?)

(Even if it is, a '&' will avoid the copy. Actually, 'const&' is
probably preferred anyway.)

--
Matthew

Paul Tessier

Feb 4, 2014, 9:16:53 PM2/4/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net

Assume that big_int requires the heap to allow for very big ints, say 10 to 2000 digits. A value returning version has no way to avoid allocating at each parse, regardless of move-assignment or RVO. A parameter out version can reuse the same big_int and therefore potentially avoid the cost of new allocations at each parse.

It is always possible to take any snippet of code and replace it with a function that takes in and out parameters; the reverse cannot be said for value returning functions. If your arguments are stylistic, I have no objections to that. I also find the readability of a version that returns some kind of expected<T> to be better. I only stand by my position that, to allow all points of contention to be solved, the out parameter version is required, and as such should be the base for the other versions. Whether all versions are supplied, I cannot say. I would prefer that, in addition to the out parameter version, at least parse_or_zero or something similar be included. Whether consensus can be reached for a version returning an expected<T> or optional<T> I find doubtful until such time as those things are already part of the standard.

I'm sorry if I wasn't clearer about const. It had been mentioned that out parameters could not initialize a const value, so I used the parse_expected version from earlier to show that that version solves the problem. I had not meant to imply that the return itself was const.

I would also propose that the range provided to be parsed should be exactly what should be parsed and no more, thereby eliminating the need to modify/return the range used. Providing a correct range should be the responsibility of a regex or similar facility that runs before parsing to a type occurs. This alleviates the need for parse to do more work than is necessary.

Nevin Liber

Feb 4, 2014, 9:27:44 PM2/4/14
to std-pr...@isocpp.org
On 4 February 2014 19:41, Matthew Woehlke <mw_t...@users.sourceforge.net> wrote:
> On 2014-02-04 19:55, Paul Tessier wrote:
>>> int parse<T, U>(range<U> r, T& value, locale loc = default_locale)
>>> {
>>>     auto const result = parse<T>(r, loc);
>>>     if (result) value = *result;
>>>     return // um... int? from whence do I get an int?
>>> }
>>
>> Except that with an out parameter no copies need be made, which depending
>> on cost of copying said type, this may be a bottle neck.  Your version of
>> an out parameter composed of a value returning version forces a copy
>> regardless of the need for one.
>
> Where?
>
> A "good" implementation would emplace in the return value.

To be fair, if you pass in the parameter and the type has an internal heap allocation, it can reuse the space that was allocated.  Of course, if it has an internal heap allocation, parse is unlikely to be the bottleneck, and assignment is probably expensive as well.

Plus, it is a horrible, horrible interface.

Let's say you wanted to use the result in a member initializer list.  If so, you end up having to write something which returns a value, as in:

struct Foo
{
    template<typename U>
    explicit Foo(Range<U> r)
    : theNumber{[&]{ BigInt bi; parse( r, bi ); return bi; }()}
    {}

    BigInt theNumber;
};

Ugh. This is C++, not C. You can do better.
-- 
 Nevin ":-)" Liber  <mailto:ne...@eviloverlord.com>  (847) 691-1404

Nevin Liber

Feb 4, 2014, 9:44:43 PM2/4/14
to std-pr...@isocpp.org
On 4 February 2014 20:16, Paul Tessier <pher...@gmail.com> wrote:
> Assume that big_int requires the heap to allow for very big int's, say 10 to 2000 digits, a value returning version has no way to avoid allocating at each parse, regardless of move-assignment or RVO.

If you are *that* concerned about performance, it is extremely unlikely you'd be willing to pay the runtime cost for locale support, either.

> A parameter out version can reuse the same big_int and therefore potentially avoid the cost of new allocations at each parse.
>
> It is always possible to take any snippet of code and replace it with a function that takes in and out parameters,

Not always.  Sometimes it is impossible to construct the object without knowing all the parameters first, and they can't necessarily be faked.

> the reverse cannot be said for value returning functions.

The problem is that people have to keep writing those functions to make up for bad interfaces.

-- 

Paul Tessier

Feb 4, 2014, 10:21:55 PM2/4/14
to std-pr...@isocpp.org

It is provable that any snippet of code can be replaced with a function that takes in and out parameters. The number of parameters is equal to the number of variables in use, otherwise [&]{//some code} would not compile. There is nothing in the language that is the equivalent for such code cutting. As such, since parse has such contentious views about its interface, the simplest solution is to cut out the part that does the actual work, a parameter out function, and let everyone else do as they see fit. Nothing is gained by ignoring opportunities, and a parameter out version allows all opportunities; the reverse cannot be said for the other versions in all cases. I would not propose that parameter out functions always be used but, in this case, it solves the current conflicting viewpoints by allowing all implementations with the absolute minimum cost.

I agree that in average coding, return by value is much easier to read and reason about, and should be preferred where its use is reasonable. I'm all for including a return by value version. It's just that it is a simple 3 line template that uses the parameter out version. The reverse is not possible, keeping in mind all side effects.

T parse_value<T,R>( R range, /* other stuff */ ) {
  T rvo; // default constructed
  parse<T,R>( range, rvo, /* other stuff */ ); // no throw
  return rvo;
}

I'm all for including a parameter out "parse" and a return value "parse_value" like the above, as this seems to solve the most basic and most complex use cases.  An optional<T> version is easily constructed in a similar fashion, as is a throwing version.

Matthew Fioravante

Feb 5, 2014, 1:02:15 AM2/5/14
to std-pr...@isocpp.org
Wow, a lot of activity for this thread. After reading through, here are some thoughts I had.

Whitespace handling:

I said it before but I will reiterate. I think it is a huge mistake to include whitespace handling, even as an optional flag.

Whitespace handling is orthogonal to number parsing and should be completely removed from this interface. Maybe the user only wants to parse spaces but not tabs; maybe they want locale dependent whitespace checking and maybe they don't. Why not let them use a modern whitespace handling API (different thread) to remove prefixes and suffixes while we focus just on parsing the numbers? All of these parsing routines should be simple and composable. Also, adding more flags makes the interface more complicated. Let's keep it simple, exposing only the bare minimum of options required.

If we add whitespace handling to this API it will be half baked. That is, it will have to make a lot of assumptions about how the user wants to handle whitespace (see the variants I just mentioned). Or if we account for all of the possibilities of whitespace handling with flags, now we have a whitespace API and a number parsing API mashed together, which is even worse.

Out parameters vs return:

After reading Matthew's posts (the other Matthew, not me!), I like his ideas better. The below is very elegant.

auto result = parse<int>("1234");
if(!result) {
  //handle error
  //Can do switch(result.error()) if we want
}
use(result.value());

We could make value() throw on error like std::expected, giving us a C style return code interface and a C++ style exception interface all in one. If the user explicitly checks the error status before calling value(), it's easy for the compiler to optimize out the conditional and the throwing logic entirely, making this an efficient exceptionless interface. As an added bonus, if the caller is a noexcept function, calling value() without checking for an error will result in a call to std::terminate() when the parse fails, giving us a runtime check that the error status was checked. We don't need extra overloads for exceptions. value_or() also provides for users who want the defaulting behavior, without yet another overload.
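
As a rough sketch of such a result type (an illustration only, not a proposed wording): the `parse_err` enumerators, the `parse_error` exception, the `ok`/`fail` factory functions and the `parse_int` helper are all hypothetical names, and the nine-digit limit is just a crude overflow guard.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical error codes.
enum class parse_err { success, empty, invalid_char, overflow };

// Hypothetical exception thrown by value() on a failed parse.
struct parse_error : std::runtime_error {
    explicit parse_error(parse_err e) : std::runtime_error("parse failed"), code(e) {}
    parse_err code;
};

template <typename T>
class parse_result {
public:
    static parse_result ok(T v) { return parse_result(v, parse_err::success); }
    static parse_result fail(parse_err e) { return parse_result(T{}, e); }

    explicit operator bool() const noexcept { return err_ == parse_err::success; }
    parse_err error() const noexcept { return err_; }

    // Throws if the parse failed, otherwise returns the value.
    const T& value() const {
        if (err_ != parse_err::success) throw parse_error(err_);
        return value_;
    }

    // Never throws for simple T: returns the fallback on failure.
    T value_or(T fallback) const {
        if (*this) return value_;
        return fallback;
    }

private:
    parse_result(T v, parse_err e) : value_(v), err_(e) {}
    T value_;
    parse_err err_;
};

// Strict parse of a non-negative decimal int.
inline parse_result<int> parse_int(std::string_view s) {
    if (s.empty()) return parse_result<int>::fail(parse_err::empty);
    int result = 0;
    std::size_t digits = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return parse_result<int>::fail(parse_err::invalid_char);
        if (++digits > 9) return parse_result<int>::fail(parse_err::overflow);
        result = result * 10 + (c - '0');
    }
    return parse_result<int>::ok(result);
}
```

A caller can test the result and switch on `error()`, call `value()` and let the exception propagate, or use `value_or()` for a default, all from the one function.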

So far the most convincing argument I've seen for the out parameter interface is the std::getline() example where you're parsing an expensive to copy type like big_int. Perhaps one compromise to this could be an overload:

//Ignoring iterator versions for brevity
template <typename T>
return_type parse(string_view s);
template <typename T>
return_ref_type parse(string_view s, T& t);

The second one here allows you to provide the object in which to store the result. The return type return_ref_type is the same as return_type, with value(), error(), next() etc., except that value() just returns a reference to the t passed into the function instead of containing the value. Alternatively the returned object could omit value() and just have next() and error().

While the output has been the biggest question (even spawning off the error code thread), I think most people prefer returning over out parameters. If we return, we must return all 3 bits of information (value, error, string tail) in an elegant way.

Ranges:

There was some talk about ranges. I think for now we can stick to string_view, iterator pairs, and maybe const char* for supporting null terminated strings. Once ranges come into the standard I don't think it's too difficult to just add a range overload.

Locales:

Locales absolutely must be optional. Parsing routines such as this can easily become performance hotspots in applications and must be as fast as possible. Locales mean indirect function calls (virtuals) which can quickly ruin performance.

Configuring the Parse (the next big question?):

Ignoring whitespace, there is still a possibility for a lot of configuration with how the user may want to do the parse:
  • What radix do I use?
  • Do I allow 0x and 0 (octal) prefixes?
  • Do I allow + prefix?
  • Locales?
  • Commas or just digits?
First, I think we would need to decide on a complete set of options. The next big question is what is the best interface which will allow the user to specify them easily and what are the most sensible defaults?


Olaf van der Spek

Feb 5, 2014, 3:32:19 AM2/5/14
to std-pr...@isocpp.org
On Wed, Feb 5, 2014 at 3:44 AM, Nevin Liber <ne...@eviloverlord.com> wrote:
> On 4 February 2014 20:16, Paul Tessier <pher...@gmail.com> wrote:
>>
>> Assume that big_int requires the heap to allow for very big int's, say 10
>> to 2000 digits, a value returning version has no way to avoid allocating at
>> each parse, regardless of move-assignment or RVO.
>
>
> If you are *that* concerned about performance, it is extremely unlikely
> you'd be willing to pay the runtime cost for locale support, either.

True, some use cases probably require a version without locale support.

>> the reverse cannot be said for value returning functions.
>
>
> The problem is that people have to keep writing those functions to make up
> for bad interfaces.

Only if one doesn't provide both interfaces in the library.
In some use cases it's exactly the right interface and it seems like a
proper and simple foundation for other interfaces.

--
Olaf

Miro Knejp

Feb 5, 2014, 7:34:04 AM2/5/14
to std-pr...@isocpp.org

> Do you mean you actually have a real example of an iterator that cannot be dereferenced more than once? (What on earth would create such a thing?)

That's not what I described. If I pass a single-pass InputIterator, for example istreambuf_iterator, to parse() and it chews away N whitespaces and then fails to recognize a number, then any information on the whitespaces is gone, as I cannot go back and re-iterate the range.


>> The actual whitespace
>> content is lost. What if I needed this information to increment a line
>> counter after parsing the number?
>
> ...then don't tell the parser to eat whitespace. (Note: the parser *must* have a strict mode... so I agree with you there. A mode that eats everything possible and then tells you how far it got may also be required. Anything else is probably in the 'nice to have' category.)

And I am on the same track as Matthew F. in that parse() should have one responsibility and one only: convert the textual representation of a value to a value, and nothing else. Skipping anything before the value is an entirely separate matter and has nothing to do with processing the actual value. Do it before invoking parse. That is separation of concerns. Let's focus on parsing the actual value, not the noise around it.


> Except that with an out parameter no copies need be made, which depending on cost of copying said type, this may be a bottle neck.  Your version of an out parameter composed of a value returning version forces a copy regardless of the need for one.

What happens with the out parameter when parsing fails? Is it in an undefined state? Or left unmodified? If the latter then parse() had to create a temporary and the entire allocation prevention and no-copy argument is down the drain.

I would prefer the value to be in a well defined state when parsing fails. Incrementing or accessing an iterator may throw an exception and if the value is then left in a partial state that's really bad. It also defeats the purpose of using it as a default fallback value. I'd rather have a strong exception guarantee for the value I'm passing into the function.

Since without optional<>, expected<> or return codes a failed parse might as well be indicated by throwing an exception (and that is the case in many languages and frameworks), a failed parse should have the same guarantees as if an exception were thrown, in my opinion.


> Ignoring whitespace, there is still a possibility for a lot of configuration with how the user may want to do the parse:
>   • What radix do I use?
>   • Do I allow 0x and 0 (octal) prefixes?
>   • Do I allow + prefix?
>   • Locales?
>   • Commas or just digits?
> First, I think we would need to decide on a complete set of options. The next big question is what is the best interface which will allow the user to specify them easily and what are the most sensible defaults?

I'm not a fan of specifying gazillions of options, hence I proposed separate overloads for each task and "utility" methods to detect each part separately (like parse_radix_prefix) to allow composition of the various parsers and make them available to users other than std implementers. Options should only exist where flags are mutually exclusive, like parsing radix 10 versus radix 11: both cannot be done at the same time and should therefore be specified as an argument. But anything that is optional (signs, prefixes, locales) can be provided as separate overloads or methods for composition.

Matthew Fioravante

Feb 5, 2014, 8:48:33 AM2/5/14
to std-pr...@isocpp.org


On Wednesday, February 5, 2014 7:34:04 AM UTC-5, Miro Knejp wrote:
>> Except that with an out parameter no copies need be made, which depending on cost of copying said type, this may be a bottle neck.  Your version of an out parameter composed of a value returning version forces a copy regardless of the need for one.
>
> What happens with the out parameter when parsing fails? Is it in an undefined state? Or left unmodified? If the latter then parse() had to create a temporary and the entire allocation prevention and no-copy argument is down the drain.
>
> I would prefer the value to be in a well defined state when parsing fails. Incrementing or accessing an iterator may throw an exception and if the value is then left in a partial state that's really bad. It also defeats the purpose of using it as a default fallback value. I'd rather have a strong exception guarantee for the value I'm passing into the function.


For simple types like int, I prefer the convention of leaving the out parameter unmodified on failure. That way you can initialize it yourself if you want the default behavior. Also, I've had situations where I'm using parse to update some configuration variable. If the parse fails, I want it to fall back on whatever the previous value was. The unmodified behavior supports this use case very nicely. I don't have to cache the old value in a temporary and reassign it on error.

That being said, for expensive types like big_int you're absolutely right that this requires the creation of temporaries and destroys any performance benefit of the out parameter design. To avoid the temporary, you have to parse twice, once to check for correctness and again to actually load the value.

For arbitrary types, now the question becomes what state do you leave them in? One option is an indeterminate but valid state. This might be the most efficient but a little more error prone. Another option is to create a default constructed temporary and swap with it (or copy from it). This is safer but may have performance implications if the default construction does any real work. Also it requires your type to be default constructible and (copy/move assignable or swappable).

Probably the indeterminate state is best as it offers the least complications. For ints and floats, we can optimize by giving the unmodified guarantee and all of the benefits it entails.

Or each type has its own convention (most will probably use the "0" representation).
 
Since without optional<>, expected<> or return codes a failed parse might as well be indicated by throwing an exception (and that is the case in many languages and frameworks) a failed parse should have the same guarantees as if an exception were thrown in my oppinion.

Ignoring whitespace, there is still a possibility for a lot configuration with how the user may want to do the parse:
  • What radix do I use?
  • Do I allow 0x and 0 (octal) prefixes?
  • Do I allow + prefix?
  • Locales?
  • Commas or just digits?
First, I think we would need to decide on a complete set of options. The next big question is what is the best interface which will allow the user to specify them easily and what are the most sensible defaults?
I'm not a fan of specifying gazillions of options, hence I proposed separate overloads for each task and "utility" methods to detect each part separately (like parse_radix_prefix). This allows composition of the various parsers and makes them available to users other than std implementers. Options should only exist where the choices are mutually exclusive: parsing radix 10 and radix 11 cannot both be done at the same time, so the radix should be specified as an argument. But anything that is optional (signs, prefixes, locales) can be provided as separate overloads or methods for composition.


I'm not sure I like the option flags idea either, but I am currently at a loss for the best way to handle the optionality. Maybe overloads as you say. Need to think about it more.

Olaf van der Spek

Feb 5, 2014, 9:08:04 AM2/5/14
to std-pr...@isocpp.org
On Wed, Feb 5, 2014 at 7:02 AM, Matthew Fioravante
<fmatth...@gmail.com> wrote:
> After reading Matthew's posts (the other matthew, not me!). I like his ideas
> better. The below is very elegant.
>
> auto result = parse<int>("1234");
> if(!result) {
> //handle error
> //Can do switch(result.error()) if we want
> }
> use(result.value());

int result;
if (auto err = parse(result, "1234")) {
//handle error
//Can do switch(err) if we want
}
use(result);

;)

> We could make value() throw on error like std::expected, giving us a C style
> return code interface and C++ style exception interface all in one. If the
> user explicitly checks the error status before calling value(), its easy for
> the compiler to optimize out the conditional and the throwing logic

Are you sure it's easy? Got a reference?

> What radix do I use?

Have a base/radix parameter defaulting to 10.

> Do I allow 0x and 0 (octal) prefixes?

I'd allow 0x (when radix = 0), I'd never allow 0.

> Do I allow + prefix?

Sure, why not?

> Locales?
> Commas or just digits?

Both handled by a locale-aware variant.

Matthew Fioravante

Feb 5, 2014, 9:45:58 AM2/5/14
to std-pr...@isocpp.org


On Wednesday, February 5, 2014 9:08:04 AM UTC-5, Olaf van der Spek wrote:
> We could make value() throw on error like std::expected, giving us a C style
> return code interface and C++ style exception interface all in one. If the
> user explicitly checks the error status before calling value(), its easy for
> the compiler to optimize out the conditional and the throwing logic

Are you sure it's easy? Got a reference?

http://gcc.godbolt.org/#%7B%22version%22%3A3%2C%22filterAsm%22%3A%7B%22labels%22%3Atrue%2C%22directives%22%3Atrue%2C%22commentOnly%22%3Atrue%7D%2C%22compilers%22%3A%5B%7B%22source%22%3A%22%23include%20%3Cstdexcept%3E%5Cn%5Cnclass%20expected%20%7B%5Cn%20%20public%3A%5Cn%20%20int%20value()%20const%20%7B%20%5Cn%20%20%20%20if(_err)%20%7B%5Cn%20%20%20%20%20%20return%20_v%3B%5Cn%20%20%20%20%7D%5Cn%20%20%20%20throw%20std%3A%3Aexception()%3B%5Cn%20%20%7D%5Cn%20%20explicit%20operator%20bool()%20const%20%7B%20return%20_err%3B%20%7D%5Cn%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%5Cn%20%20private%3A%5Cn%20%20int%20_v%3B%5Cn%20%20bool%20_err%3B%5Cn%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%5Cn%7D%3B%5Cn%5Cnexpected%20parse_int(const%20char*%20s)%3B%5Cn%5Cnint%20f1()%20%7B%5Cn%20%20auto%20x%20%3D%20parse_int(%5C%221234%5C%22)%3B%5Cn%20%20return%20x.value()%3B%5Cn%7D%5Cn%5Cnint%20f2()%20%7B%5Cn%20%20auto%20x%20%3D%20parse_int(%5C%22567%5C%22)%3B%5Cn%20%20if(!x)%20%7B%5Cn%20%20%20%20return%200%3B%5Cn%20%20%7D%5Cn%20%20return%20x.value()%3B%5Cn%7D%22%2C%22compiler%22%3A%22%2Fopt%2Fclang-3.3%2Fbin%2Fclang%2B%2B%22%2C%22options%22%3A%22-O3%20-march%3Dnative%20-std%3Dc%2B%2B11%22%7D%5D%7D

Here is the code in case that link does not work. Drop it into http://gcc.godbolt.org, turn on optimizations and -std=c++11, and look at the disassembly for f2(). No exception is created on any code path.

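#include <stdexcept>

class expected {
  public:
  // Returns the value on success, throws otherwise.
  int value() const { 
    if(_ok) {
      return _v;
    }
    throw std::exception();
  }
  // True when the parse succeeded (the original named this flag _err).
  explicit operator bool() const { return _ok; }

  private:
  int _v;
  bool _ok;
};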

expected parse_int(const char* s);

int f1() {
  auto x = parse_int("1234");
  return x.value();
}

int f2() {
  auto x = parse_int("567");
  if(!x) {
    return 0;
  }
  return x.value();
}

 

> What radix do I use?

Have a base/radix parameter defaulting to 10.

> Do I allow 0x and 0 (octal) prefixes?

I'd allow 0x (when radix = 0), I'd never allow 0.

Some people still use octal. I wouldn't remove support for it.


> Do I allow + prefix?

Sure, why not?

> Locales?
> Commas or just digits?

Both handled by a locale-aware variant.

That's a good point. If you specify a locale, it's probably because you want comma support along with language support. Maybe comma support should be bound to whether or not a locale was requested, with the default fast path just expecting digits.

Matthew Fioravante

Feb 5, 2014, 9:59:05 AM2/5/14
to std-pr...@isocpp.org


On Wednesday, February 5, 2014 9:45:58 AM UTC-5, Matthew Fioravante wrote:
> Do I allow 0x and 0 (octal) prefixes?

I'd allow 0x (when radix = 0), I'd never allow 0.

Some people still use octal. I wouldn't remove support for it.


Another way to handle the radix is with bit flags; then you can specify exactly which radixes you want to support. If you don't specify radix_8, then leading zeroes simply denote a decimal value with leading zeroes.

parse("1234", radix_8 | radix_16 | radix_10);

or for brevity we can have a shortcut

parse("1234", radix_all); 

If we go with the flags approach, all of the flag handling and branching should be inlined, even if the underlying function call to do the parsing is not. This will allow the compiler to optimize out the branches since 99% of the time, users will be specifying constants rather than collecting flags in a variable and dynamically choosing how to parse at runtime.
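A rough sketch of what the flag handling could look like. The flag names come from the post above, but their values, the detect_radix helper and its signature are assumptions made up for this example. The helper is inline so that a constant flag argument lets the compiler fold the branches away, as described.

```cpp
#include <cstddef>

// Flag names from the post above; the values are assumptions.
enum radix_flags : unsigned {
    radix_8   = 1u << 0,
    radix_10  = 1u << 1,
    radix_16  = 1u << 2,
    radix_all = radix_8 | radix_10 | radix_16
};

// Hypothetical helper: picks the radix from the prefix of [s, s + len),
// honouring only the radixes enabled in `flags`. `skip` receives the number of
// prefix characters to step over. Returns 0 if no enabled radix fits.
inline int detect_radix(const char* s, std::size_t len, unsigned flags,
                        std::size_t& skip) noexcept {
    skip = 0;
    const bool leading_zero = len >= 2 && s[0] == '0';
    if ((flags & radix_16) && leading_zero && (s[1] == 'x' || s[1] == 'X')) {
        skip = 2;
        return 16;
    }
    if ((flags & radix_8) && leading_zero) {
        skip = 1;
        return 8;
    }
    // Without radix_8, a leading zero is just part of a decimal number.
    if (flags & radix_10) return 10;
    return 0;
}
```

With this shape, detect_radix("0755", 4, radix_10 | radix_16, skip) reports decimal, while the same input with radix_all reports octal. What should happen for unprefixed input when radix_10 is not enabled is left open here; the sketch simply reports failure.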

Thiago Macieira

Feb 5, 2014, 11:30:12 AM2/5/14
to std-pr...@isocpp.org
Em qua 05 fev 2014, às 15:08:04, Olaf van der Spek escreveu:
> > Do I allow 0x and 0 (octal) prefixes?
>
> I'd allow 0x (when radix = 0), I'd never allow 0.

Don't deviate from strtoll.

radix = 0 implies prefixes 0 and 0x are recognised. If the library is updated
with 0b prefix for binaries, then that too.
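For reference, this is the existing strtoll behaviour being referred to, wrapped in a small helper whose name is made up for the example. With base 0, the radix is inferred from the prefix: 0x or 0X means hexadecimal, a leading 0 means octal, anything else is decimal.

```cpp
#include <cstdlib>

// Hypothetical wrapper around the real std::strtoll, for illustration only.
// Passing 0 as the base asks strtoll to infer the radix from the prefix.
inline long long parse_auto_radix(const char* s) {
    char* end = nullptr;  // receives the first unparsed character; unused here
    return std::strtoll(s, &end, 0);
}
```

So "0x1A" yields 26, "017" yields 15 and "42" yields 42. Error handling (errno and the end pointer) is deliberately left out of this sketch.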

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Matthew Fioravante

Feb 5, 2014, 12:00:53 PM2/5/14
to std-pr...@isocpp.org


On Wednesday, February 5, 2014 11:30:12 AM UTC-5, Thiago Macieira wrote:
Em qua 05 fev 2014, às 15:08:04, Olaf van der Spek escreveu:
> > Do I allow 0x and 0 (octal) prefixes?
>
> I'd allow 0x (when radix = 0), I'd never allow 0.

Don't deviate from strtoll.

radix = 0 implies prefixes 0 and 0x are recognised. If the library is updated
with 0b prefix for binaries, then that too.

Why wait for the library? The 0b prefix is useful and should be supported anyway.

Matthew Woehlke

Feb 5, 2014, 12:17:37 PM2/5/14
to std-pr...@isocpp.org
On 2014-02-05 07:34, Miro Knejp wrote:
>> Do you mean you actually have a real example of an iterator that
>> cannot be dereferenced more than once? (What on earth would create
>> such a thing?)
>
> That's not what I described. If I pass a single-pass InputIterator, for
> example istreambuf_iterator, to parse() it chews away N whitespaces and
> then fails to recognize a number then any information on the whitespaces
> is gone as I cannot go back and re-iterate the range.

This iterator is non-copyable? And/or incrementing it is destructive to
copies of the iterator? (I would hope not the latter, as that is
terrible API.)

If not, there should not be a problem. (Okay, given it is
istreambuf_iterator, I suppose I can imagine one or both of the above
being true. It's not obvious to me from either cplusplus.com or
cppreference.com if istreambuf_iterator is or is not copyable...)

> And I am on the same track as Matthew F. in that parse() should have one
> responsibility and one only: convert the textual representation of a
> value to a value, and nothing else.

I can live with that. (I'm not sure I ever felt handling whitespace was
*necessary*, just that I don't object to it as strongly as you.)

I do think we need at least one parsing option; whether or not to allow
trailing characters.

> What happens with the out parameter when parsing fails? Is it in an
> undefined state? Or left unmodified? If the latter then parse() had to
> create a temporary and the entire allocation prevention and no-copy
> argument is down the drain.

IMO it shall leave it unmodified. One of the arguments for an output
parameter was to implement defaults like:

int value = default_value;
parse(in, value);

That said, in defense of that argument, I could conceive of a
specialization that stores parts of the value in cheap-to-create values
and doesn't have to build the expensive type until it knows the parse is
okay. But I agree that that's tenuous, and more likely your point will
be true.

--
Matthew

Matthew Fioravante

Feb 5, 2014, 12:43:27 PM2/5/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net


On Wednesday, February 5, 2014 12:17:37 PM UTC-5, Matthew Woehlke wrote:
On 2014-02-05 07:34, Miro Knejp wrote:
>> Do you mean you actually have a real example of an iterator that
>> cannot be dereferenced more than once? (What on earth would create
>> such a thing?)
>
> That's not what I described. If I pass a single-pass InputIterator, for
> example istreambuf_iterator, to parse() it chews away N whitespaces and
> then fails to recognize a number then any information on the whitespaces
> is gone as I cannot go back and re-iterate the range.

This iterator is non-copyable? And/or incrementing it is destructive to
copies of the iterator? (I would hope not the latter, as that is
terrible API.)

It's the latter. Each copy of an input iterator may point to the same state (current character), but as soon as you increment one, it invalidates all of the others. It seems terrible, but that is the consequence of abstracting one-pass things like file IO with iterators. This interface, while dangerous, allows them to be efficient.

I've used input iterators for other IO abstractions such as asynchronous IO. The operator++() blocks until the background thread has data available and then keeps returning the next data point until it has to block again. There is no way to "rewind" in this situation or maintain a copy of the previous state without a lot of expensive reference counting or some other complexity.
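A small illustration of the single-pass behavior using std::istreambuf_iterator. The helper name is made up for this example. It skips leading spaces through one iterator and then looks at what a brand new iterator on the same stream sees.

```cpp
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper, for illustration only.
// Skips leading spaces through a single-pass iterator, then reports the
// character seen by a fresh iterator created on the same stream.
inline char first_char_after_skipping_spaces(const std::string& text) {
    std::istringstream in(text);
    std::istreambuf_iterator<char> it(in), end;
    while (it != end && *it == ' ') {
        ++it;                                  // each ++ consumes from the stream buffer
    }
    std::istreambuf_iterator<char> fresh(in);  // new iterator, same underlying stream
    return fresh == end ? '\0' : *fresh;       // the skipped spaces are gone for good
}
```

For the input "   abc" the fresh iterator starts at 'a', not at the first space. If a parse() skipped whitespace and then failed, the caller would have no way to recover what was skipped.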

It's a useful abstraction, but requiring operator++(int) for input iterators is ridiculous and dangerous.

If not, there should not be a problem. (Okay, given it is
istreambuf_iterator, I suppose I can imagine one or both of the above
being true. It's not obvious to me from either cplusplus.com or
cppreference.com if istreambuf_iterator is or is not copyable...)

Input iterators must be copyable because they implement operator++(int). Maybe the standard should be amended to relax these restrictions?


> And I am on the same track as Matthew F. in that parse() should have one
> responsibility and one only: convert the textual representation of a
> value to a value, and nothing else.

I can live with that. (I'm not sure I ever felt handling whitespace was
*necessary*, just that I don't object to it as strongly as you.)

I do think we need at least one parsing option; whether or not to allow
trailing characters.

Trailing characters must be supported for efficiency along with the .next() method returning iterator/string view in the return object. Many times you have a number in your string followed by other stuff. You must parse the number to know where the number characters end. The efficient way is to parse the number and move the character iterator at the same time.

 

> What happens with the out parameter when parsing fails? Is it in an
> undefined state? Or left unmodified? If the latter then parse() had to
> create a temporary and the entire allocation prevention and no-copy
> argument is down the drain.

IMO it shall leave it unmodified. One of the arguments for an output
parameter was to implement defaults like:

int value = default_value;
parse(in, value);

That said, in defense of that argument, I could conceive of a
specialization that stores parts of the value in cheap-to-create values
and doesn't have to build the expensive type until it knows the parse is
okay. But I agree that that's tenuous, and more likely your point will
be true.

Leaving an expensive type unmodified requires either creating a copy or parsing twice. I'm not convinced we want to pay either performance cost for the relatively minor convenience of the "unmodified on failure" behavior.

Thiago Macieira

Feb 5, 2014, 12:55:38 PM2/5/14
to std-pr...@isocpp.org
Em qua 05 fev 2014, às 09:00:53, Matthew Fioravante escreveu:
> > Don't deviate from strtoll.
> >
> > radix = 0 implies prefixes 0 and 0x are recognised. If the library is
> > updated
> > with 0b prefix for binaries, then that too.
>
> Why wait for the library, 0b prefix is useful and should be supported
> anyway.

Because you shouldn't deviate from strtoll. The support should be done first in
strtoll and then in whatever you're proposing. Yes, I know I'm asking you to
convince ISO C and the POSIX standard groups.

The reason being that most C++ standard library implementations will delegate
to strtoll or similar functions (like we do in Qt). Asking for functionality
different from strtoll means asking for more complexity from library
developers.

Alternatively, make sure that strtoll could be implemented on top of a plain C
library routine that is the backend for your new function. That would solve
the problem of complexity.

Olaf van der Spek

Feb 5, 2014, 12:58:11 PM2/5/14
to std-pr...@isocpp.org
On Wed, Feb 5, 2014 at 6:55 PM, Thiago Macieira <thi...@macieira.org> wrote:
> Em qua 05 fev 2014, às 09:00:53, Matthew Fioravante escreveu:
>> > Don't deviate from strtoll.
>> >
>> > radix = 0 implies prefixes 0 and 0x are recognised. If the library is
>> > updated
>> > with 0b prefix for binaries, then that too.
>>
>> Why wait for the library, 0b prefix is useful and should be supported
>> anyway.
>
> Because you shouldn't deviate from strtoll. The support should be done first in
> strtoll and then in whatever you're proposing. Yes, I know I'm asking you to
> convince ISO C and the POSIX standard groups.
>
> The reason being that most C++ standard library implementations will delegate
> to strtoll or similar functions (like we do in Qt). Asking for functionality
> different from strtoll means asking for more complexity from library
> developers.

strtoll doesn't support non-nul terminated input does it?



--
Olaf

Matthew Fioravante

Feb 5, 2014, 1:33:11 PM2/5/14
to std-pr...@isocpp.org
Input to strtoll and friends must be null terminated. Therefore, we cannot use them as an implementation base for this proposal without performance degradation (copying to a local null terminated buffer). Parsing ints is not too hard to do. For floats, we will need to dig in and see what strtof() uses under the hood, libmpfr or some such.
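To show where the cost comes from, here is a sketch of what delegating to strtoll looks like for a (pointer, length) range. The function name is hypothetical. The copy into a std::string exists only to obtain a null terminator.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper, for illustration only.
// Parses all of [s, s + len) as a decimal long long by delegating to strtoll.
// Note that it inherits strtoll's quirks, e.g. leading whitespace is accepted.
inline bool parse_ll_via_strtoll(const char* s, std::size_t len, long long& out) {
    if (len == 0) return false;
    const std::string copy(s, len);  // the extra copy (and possible allocation)
    errno = 0;
    char* end = nullptr;
    const long long v = std::strtoll(copy.c_str(), &end, 10);
    if (errno == ERANGE) return false;                    // out of range
    if (end != copy.c_str() + copy.size()) return false;  // not fully consumed
    out = v;
    return true;
}
```

Parsing the first three characters of the buffer "12345" this way yields 123, but only after copying those three characters somewhere else first.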

strtoll() also doesn't support locales with commas and all of those possible variants.

Thiago makes a valid point though about implementation burden. Any proposal would need to include advice for how to implement the library using already existing C components on the most popular platforms, or if none exist, make a strong case for why we must reimplement everything from scratch. We can also petition the authors of these components to support (const char*, size_t) variants of their methods.

For a third-party library like Qt, it seems like a huge amount of work with little payoff to reimplement the parsing functions. For the standard library, it is a possibility if there is good reason for it.

Miro Knejp

Feb 5, 2014, 3:20:52 PM2/5/14
to std-pr...@isocpp.org
@Matthew Fioravante:

For simple types like int, I prefer the convention of leaving the out parameter unmodified on failure. That way you can initialize it yourself if you want the default behavior. Also, I've had situations where I'm using parse to update some configuration variable. If the parse fails, I want it to fall back on whatever the previous value was. The unmodified behavior supports this use case very nicely: I don't have to cache the old value in a temporary and reassign it on error.

That being said, for expensive types like big_int you're absolutely right that this requires the creation of temporaries and destroys any performance benefit of the out parameter design.
From an implementation standpoint both are identical: create a temporary, populate it, and assign (or swap) when done.
To avoid the temporary, you have to parse twice, once to check for correctness and again to actually load the value.
Which is not possible with InputIterator.


For arbitrary types, the question now becomes: what state do you leave them in? One option is an indeterminate but valid state. This might be the most efficient but is a little more error prone. Another option is to create a default constructed temporary and swap with it (or copy from it). This is safer but may have performance implications if the default construction does any real work. It also requires your type to be default constructible and either copy/move assignable or swappable.

Probably the indeterminate state is best as it offers the fewest complications. For ints and floats, we can optimize by giving the unmodified guarantee and all of the benefits it entails.
Requiring the type to be default constructible when modifying the value in-place seems like a really puzzling prerequisite, doesn't it?

If the behavior is not consistent across all parsers for both builtin and composite std types, chaos will ensue and the committee will probably never let it through. What you do in implementations for user defined types is your own business, but the parsers for builtin and std types should all provide consistent guarantees and behavior. That kind of inconsistency makes it close to impossible to write reliable template<T> methods that make use of parse<T>(). How do you know whether the out parameter is in any usable state after a failure when T is arbitrary? This needs to be consistent across the board.

@Matthew Woehlke:
This iterator is non-copyable? And/or incrementing it is destructive to copies of the iterator? (I would hope not the latter, as that is terrible API.)

If not, there should not be a problem. (Okay, given it is istreambuf_iterator, I suppose I can imagine one or both of the above being true. It's not obvious to me from either cplusplus.com or cppreference.com if istreambuf_iterator is or is not copyable...)
Just imagine the iterator performs an fread() of 1 character every time it is incremented, thus advancing the file read position. Copying the iterator doesn't do anything and is perfectly fine, but as soon as you increment one of them it affects every other. It's a silly example but shows exactly how the iterator category works. Its design allows it to read from an unbuffered data source, which you can do only once without storing the previous value somewhere first.

[input.iterators]/3 "For input iterators, a == b does not imply ++a == ++b." and "Algorithms on input iterators should never attempt to pass through the same iterator twice. They should be single pass algorithms."


I do think we need at least one parsing option; whether or not to allow trailing characters.
I don't think that is required. The parser can just stop when it reaches an invalid character and signal success if the input to that point was sufficient to create a value. You can inspect the returned iterator to see whether the end of the input was reached and act accordingly. This way parse() is very flexible and can work at the core of more advanced interfaces. A match_integer method requiring the entire source to represent the value has already been mentioned; once you have parse() with its one well-defined responsibility, it is trivial to implement such a match_X method on top of it.

@Thiago Macieira:
Because you shouldn't deviate from strtoll. The support should be done first in
strtoll and then in whatever you're proposing. Yes, I know I'm asking you to
convince ISO C and the POSIX standard groups.

The reason being that most C++ standard library implementations will delegate
to strtoll or similar functions (like we do in Qt). Asking for functionality
different from strtoll means asking for more complexity from library
developers.

Alternatively, make sure that strtoll could be implemented on top of a plain C
library routine that is the backend for your new function. That would solve
the problem of complexity.
While it makes sense and sounds great, you can only implement it in terms of strtoll if
  1. The iterators point to contiguous memory, and
  2. The value type is char, and
  3. The range is null terminated.

There is currently no reliable way to detect 1 and 3. The contiguous iterator category proposal would solve 1, but 3 requires dereferencing the end iterator and is thus UB. That limits our options a lot. If implementability with strtoxx is a requirement you can just drop the iterators, templates, locales and mark this thread as closed, since the limitations of strtoxx started it in the first place.

Matthew Woehlke

Feb 5, 2014, 3:49:49 PM2/5/14
to std-pr...@isocpp.org
On 2014-02-05 15:20, Miro Knejp wrote:
>> I do think we need at least one parsing option; whether or not to
>> allow trailing characters.
>
> I don't think that is required. the parser can just stop when it reaches
> an invalid character and signal success if the input to that point was
> sufficient to create a value. You can inspect the returned iterator
> whether the end of the input was reached or not and act accordingly.

My impression is that this is a sufficiently common use case (and it is)
that users should not have to endlessly rewrite that code.

In fact, I expect it is more common to consider any unparsable
characters to be an error than otherwise. The latter case only happens
when you're implementing your own parsing of an input stream that is
expected to contain multiple logical values. The former is the case any
time your input has already been value delimited (or represents a single
value, as in e.g. an input widget).

> This way parse() is very flexible and can work at the core of more
> advanced interfaces. There was already mentioning of a match_integer
> method requiring the entire source to represent the value and once you
> have parse() with its one well-defined responsibility it is trivial to
> implement such a match_X method on top of it.

I suppose you could have the "real implementation" version always accept
extra characters and return the end position, and provide a wrapper that
implements the aforementioned check. But that wrapper is important, as
that's what is going to be used more often than not. I would consider
the proposal incomplete if it does not provide that API.

--
Matthew

Matthew Fioravante

Feb 5, 2014, 3:56:18 PM2/5/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net


On Wednesday, February 5, 2014 3:49:49 PM UTC-5, Matthew Woehlke wrote:
I suppose you could have the "real implementation" version always accept
extra characters and return the end position, and provide a wrapper that
implements the aforementioned check. But that wrapper is important, as
that's what is going to be used more often than not. I would consider
the proposal incomplete if it does not provide that API.


I agree completely, both versions must be provided. One which parses as much as it can and returns an end iterator, and another (simple wrapper) which requires that all characters are part of the value.

I would have a lot of use cases for both of these.
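A minimal sketch of the two layers, restricted to unsigned decimal digits over a pointer range. The names parse_prefix, parse_all and parse_result are hypothetical and only illustrate the layering: the first function consumes as much as it can and reports where it stopped, the second is a thin wrapper that also requires the whole range to be consumed.

```cpp
#include <limits>

// Hypothetical result type, for illustration only.
struct parse_result {
    bool ok;           // at least one digit consumed and no overflow
    const char* next;  // first character that was not consumed
};

// Parses as many leading decimal digits of [first, last) as possible.
// `out` is written only on success.
inline parse_result parse_prefix(const char* first, const char* last,
                                 unsigned& out) noexcept {
    const unsigned max = std::numeric_limits<unsigned>::max();
    unsigned acc = 0;
    const char* p = first;
    for (; p != last && *p >= '0' && *p <= '9'; ++p) {
        const unsigned digit = static_cast<unsigned>(*p - '0');
        if (acc > (max - digit) / 10) return {false, p};  // would overflow
        acc = acc * 10 + digit;
    }
    if (p == first) return {false, first};                // no digits at all
    out = acc;
    return {true, p};
}

// Wrapper: succeeds only if every character of [first, last) was consumed.
inline bool parse_all(const char* first, const char* last, unsigned& out) noexcept {
    unsigned tmp = 0;
    const parse_result r = parse_prefix(first, last, tmp);
    if (!r.ok || r.next != last) return false;
    out = tmp;
    return true;
}
```

On "123abc", parse_prefix succeeds with 123 and next pointing at 'a', while parse_all rejects the same input. The wrapper adds a single comparison, so all of the real work stays in one place.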

gmis...@gmail.com

Feb 5, 2014, 5:23:15 PM2/5/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
This sounds like what I was suggesting earlier with the two routines

conversion_result parse(int& value, range) noexcept;
and
conversion_result parse_checked(int& v,range r, check_options o);

parse_checked, which throws, calls parse(), which never throws and only reports a status. All the heavy work is done by parse(); parse_checked just looks at the parse status to decide what to throw.

That is also why parse() must check for empty ranges etc.: crashing isn't reasonable for such a condition, parse() has to test for it anyway, and parse_checked relies on that status to throw.

But I don't think they should have to call each other; that's just an implementation detail. parse_checked might be able to give better error messages etc. if it didn't.

I am now thinking that the interface should be something like

conversion_result parse(int& value, range, parse_options) noexcept;
and
conversion_result parse_checked(int& v,range r, parse_options o);

and parse_options contains things like the format(s)/radix expected and options such as accepting leading white space.

I'm still not convinced that leading and trailing white space options should be supported. I think they complicate the interface and hurt performance, but I'm still thinking about that.

If they should be, parse_options can handle that. There can be a default_parse_options() function or something similar that defaults to strict, or maybe accepts leading and trailing spaces. But I'm not fond of accepting leading and trailing whitespace, as that basically includes tabs, CRs, LFs, etc., which I think is not a good idea.
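Roughly what I mean by the two routines, sketched for unsigned decimal input over a (pointer, length) range. The enumerator names, the parameter order and the choice of exception types are assumptions for this example, and parse_options is left out to keep it short.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical status type; the enumerators are assumptions.
enum class conversion_result { ok, empty, invalid, out_of_range };

// Never throws and never crashes on an empty range; it only reports a status.
// `value` is written only on success.
inline conversion_result parse(unsigned& value, const char* s,
                               std::size_t len) noexcept {
    if (len == 0) return conversion_result::empty;
    const unsigned max = std::numeric_limits<unsigned>::max();
    unsigned acc = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (s[i] < '0' || s[i] > '9') return conversion_result::invalid;
        const unsigned digit = static_cast<unsigned>(s[i] - '0');
        if (acc > (max - digit) / 10) return conversion_result::out_of_range;
        acc = acc * 10 + digit;
    }
    value = acc;
    return conversion_result::ok;
}

// Does no parsing of its own: it maps the status from parse() to an exception.
inline void parse_checked(unsigned& value, const char* s, std::size_t len) {
    switch (parse(value, s, len)) {
        case conversion_result::ok:
            return;
        case conversion_result::empty:
            throw std::invalid_argument("parse: empty input");
        case conversion_result::invalid:
            throw std::invalid_argument("parse: not a number");
        case conversion_result::out_of_range:
            throw std::out_of_range("parse: value out of range");
    }
}
```

Here parse_checked returns void rather than conversion_result, since in this sketch it throws on every failure. Whether it should still return a status is one of the open questions.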

Thiago Macieira

Feb 5, 2014, 5:30:54 PM2/5/14
to std-pr...@isocpp.org
Em qua 05 fev 2014, às 21:20:52, Miro Knejp escreveu:
> > The reason being that most C++ standard library implementations will
> > delegate
> > to strtoll or similar functions (like we do in Qt). Asking for
> > functionality
> > different from strtoll means asking for more complexity from library
> > developers.
> >
> > Alternatively, make sure that strtoll could be implemented on top of a
> > plain C
> > library routine that is the backend for your new function. That would
> > solve
> > the problem of complexity.
>
> While it makes sense and sounds great, you can only implement it in
> terms of strtoll if
>
> 1. The iterators point to contiguous memory, and
> 2. The value type is char, and
> 3. The range is null terminated.
>
> There is currently no reliable way to detect 1 and 3. The contiguous
> iterator category proposal would solve 1, but 3 requires dereferencing
> the end iterator and thus UB. That limits our options a lot. If
> implementability with strtoxx is a requirement you can just drop the
> iterators, templates, locales and mark this thread as closed since the
> limitations of strtoxx started it in the first place.

If you go for my alternative proposal, then you skip the need for 3. That is,
implementing strtoll on top of a plain C function that operates on contiguous
memory and receives a begin and end pointer.

However, I do think 1 and 2 are *reasonable*. I think the whole discussion
about input iterators that has gone on in this thread for the past few days
is unnecessary. Simply force people to read into a contiguous-memory buffer of
one of the base char types.

In specific: I'd like this discussion to assume that the number parsing code is
implemented out-of-line. Inline parsing is, forgive me for saying so, nuts.
You could maybe do it for integers, but you'd never do it for floating-point
results.

Thiago Macieira

Feb 5, 2014, 5:33:48 PM2/5/14
to std-pr...@isocpp.org
Em qua 05 fev 2014, às 10:33:11, Matthew Fioravante escreveu:
> For a 3rd party library like QT, it seems like a huge amount of work with
> little payoff to reimplement the parsing functions. For the standard
> library, it is a possibility if there is good reason for it.

We actually have to do it. Qt carries a copy of strtoll, strtoull and strtod
from FreeBSD because the regular C functions are unusable -- they're locale-
dependent. The _l versions of those functions are present from POSIX.1-2008,
but we can't rely on them.

Thiago Macieira

Feb 5, 2014, 5:40:45 PM2/5/14
to std-pr...@isocpp.org
Em qua 05 fev 2014, às 18:58:11, Olaf van der Spek escreveu:
> strtoll doesn't support non-nul terminated input does it?

Then go for the alternate proposal: strtoll implemented on top of the backend
function:

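#include <string.h> /* strlen, size_t */

/* Length-bounded backend: parses at most len characters starting at nptr. */
long long strntoll(const char *nptr, size_t len, char **endptr, int base);

long long strtoll(const char *nptr, char **endptr, int base)
{
    /* Delegate to the backend; the original called strtoll here, which would recurse forever. */
    return strntoll(nptr, strlen(nptr), endptr, base);
}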

Let me emphasise again: you do not want to have too many copies of those
functions lying around, especially not strtod. It's a highly complex piece of
code. It's definitely not suitable for inlining.

So I'll repeat what I said in the other email: assume that the implementation
contains an out-of-line backend. If you can't do that in the proposal, I'd say
its chances of passing the committee, much less of being implemented properly,
are very slim.

Matthew Fioravante

Feb 5, 2014, 9:26:50 PM2/5/14
to std-pr...@isocpp.org

On Wednesday, February 5, 2014 5:30:54 PM UTC-5, Thiago Macieira wrote:
Em qua 05 fev 2014, às 21:20:52, Miro Knejp escreveu: 
> > The reason being that most C++ standard library implementations will 
> > delegate 
> > to strtoll or similar functions (like we do in Qt). Asking for 
> > functionality 
> > different from strtoll means asking for more complexity from library 
> > developers. 
> > 
> > Alternatively, make sure that strtoll could be implemented on top of a 
> > plain C 
> > library routine that is the backend for your new function. That would 
> > solve 
> > the problem of complexity. 

> While it makes sense and sounds great, you can only implement it in 
> terms of strtoll if 

>  1. The iterators point to contiguous memory, and 
>  2. The value type is char, and 
>  3. The range is null terminated. 

> There is currently no reliable way to detect 1 and 3. The contiguous 
> iterator category proposal would solve 1, but 3 requires dereferencing 
> the end iterator and thus UB. That limits our options a lot. If 
> implementability with strtoxx is a requirement you can just drop the 
> iterators, templates, locales and mark this thread as closed since the 
> limitations of strtoxx started it in the first place. 

If you go for my alternative proposal, then you skip the need for 3. That is, 
implementing strtoll on top of a plain C function that operates on contiguous 
memory and receives a begin and end pointer. 

However, I do think 1 and 2 are *reasonable*.

I'm fine with 1 and 2 if it turns out genericity is too hard. Possibly with overloads for the other character types.

 
I think the whole discussion 
about input iterators that has gone on in this thread for the past few days 
is unnecessary. Simply force people to read into a contiguous-memory buffer of 
one of the base char types. 

In all honesty I only care about string_view. Generic iterators are nice (maybe so we can support vector<char> if there's a compelling reason to use one for something??), but I'll be happily using this interface with string_view all the time and probably little else.
 

In specific: I'd like this discussion to assume that the number parsing code is 
implemented out-of-line. Inline parsing is, forgive me for saying so, nuts. 
You could maybe do it for integers, but you'd never do it for floating-point 
results. 

I think at this point we need to do a real study into implementations to answer these questions accurately, particularly with floating point as that's the real bear. How is strtof() implemented on different platforms? What do gcc and other compilers use for floating point literals? Also, you voted against the binary 0b prefix because it's not supported by strtol(), but something is going to be (or already is) written to support C++14 binary literals. That something can be leveraged here.
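For the integer side at least, recognizing a 0b prefix by hand is a small amount of code. A sketch, with a made-up function name and signature, and not a claim about how any compiler implements binary literals:

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper, for illustration only.
// Parses all of [s, s + len) as "0b" or "0B" followed by one or more binary
// digits. `out` is written only on success.
inline bool parse_binary(const char* s, std::size_t len, unsigned& out) noexcept {
    if (len < 3 || s[0] != '0' || (s[1] != 'b' && s[1] != 'B')) return false;
    const unsigned top_bit = ~(std::numeric_limits<unsigned>::max() >> 1);
    unsigned acc = 0;
    for (std::size_t i = 2; i < len; ++i) {
        if (s[i] != '0' && s[i] != '1') return false;  // not a binary digit
        if (acc & top_bit) return false;               // shifting would lose a set bit
        acc = (acc << 1) | static_cast<unsigned>(s[i] - '0');
    }
    out = acc;
    return true;
}
```

Floating point is a different story, and this says nothing about how hard that part is.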

Thiago Macieira

Feb 5, 2014, 9:50:38 PM2/5/14
to std-pr...@isocpp.org
Em qua 05 fev 2014, às 18:26:50, Matthew Fioravante escreveu:
> I think at this point we need to do a real study into implementations to
> answer these questions accurately, particularly with floating point as
> that's the real bear. How is strtof() implemented on different platforms?
> What do gcc and other compilers use for floating point literals? Also you
> voted against binary 0b prefix because it's not supported by strtol(), but
> something is going to be/already is written to support C++14 binary
> literals. That something can be leveraged here.

I would prefer that the functionality match strtol, which means convincing the
C guys that they should also parse 0b (maybe also convince them to add it to
their language).

But the important thing is that strtol and this proposal share a backend. It is
possible to disable just the 0b detection to still keep the required POSIX
functionality. Or some library developers may call it "an extension" and go
with it.
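
One way to picture a shared backend with switchable 0b detection, as a rough sketch only: the prefix check takes a flag, so the strtol entry point can pass false and keep its current behaviour. prefix_info, detect_prefix and allow_binary are hypothetical names, and a full implementation would also confirm that a valid digit follows the prefix, as strtol does.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result of prefix detection for "base 0" style parsing:
// the base to use and how many prefix characters to skip.
struct prefix_info {
    int base;
    std::size_t length;
};

// Shared by both entry points. The strtol path would pass
// allow_binary = false; the new interface could pass true.
prefix_info detect_prefix(std::string_view s, bool allow_binary) {
    if (s.size() >= 2 && s[0] == '0') {
        const char c = s[1];
        if (c == 'x' || c == 'X') return {16, 2};
        if (allow_binary && (c == 'b' || c == 'B')) return {2, 2};
    }
    // A leading zero selects octal; the zero itself is a valid octal digit,
    // so nothing is skipped.
    if (!s.empty() && s[0] == '0') return {8, 0};
    return {10, 0};
}
```

With the flag off, "0b101" is seen as an octal zero followed by unparsed text, which matches what strtol with base 0 does today.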

Magnus Fromreide

Feb 6, 2014, 2:36:10 AM2/6/14
to std-pr...@isocpp.org
I would consider it unlucky if

parse(istream_iterator(cin), istream_iterator())

didn't work.

My use case involves parsing data from part-wise contiguous containers (think
std::deque) where the ability to erase the head of the container is useful.


I further think that forcing the value type to be 'char' is unreasonable,
especially given that there are locales with wide decimal point and
thousands separator (ps_AF).


> In specific: I'd like this discussion to assume that the number parsing code is
> implemented out-of-line. Inline parsing is, forgive me for saying so, nuts.
> You could maybe do it for integers, but you'd never do it for floating-point
> results.

I agree. Maybe something along the lines of

struct parser {
    typedef /* implementation-defined */ code_point;
    status_type input(code_point symbol);
    pair<status_type, code_point*>
    input(code_point* begin, code_point* end);
    long long value() const;
};

Here, code_point is typically the widest character type, but I wouldn't
be opposed to further overloads for the input member, one could e.g. imagine
overloads for char.
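
To make the idea concrete, here is a rough sketch of such a push-style parser driven from a std::deque. The status values, the int_parser class and the parse_front helper are all hypothetical; only the input()/value() shape comes from the struct above, and the sketch handles unsigned decimal digits only.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <deque>

// Hypothetical status values for the push interface.
enum class status_type { need_more, done, error };

class int_parser {
public:
    using code_point = char32_t;

    // Feed one symbol. Returns need_more while digits keep arriving, done at
    // the first non-digit after at least one digit, error otherwise.
    status_type input(code_point c) {
        if (c < U'0' || c > U'9') {
            return seen_digit_ ? status_type::done : status_type::error;
        }
        const long long digit = static_cast<long long>(c - U'0');
        if (value_ > (LLONG_MAX - digit) / 10) return status_type::error;
        value_ = value_ * 10 + digit;
        seen_digit_ = true;
        return status_type::need_more;
    }

    long long value() const { return value_; }

private:
    long long value_ = 0;
    bool seen_digit_ = false;
};

// Parses a number at the head of the deque and erases the consumed symbols.
bool parse_front(std::deque<char>& input, long long& out) {
    int_parser parser;
    std::size_t consumed = 0;
    for (char ch : input) {
        const status_type st = parser.input(static_cast<unsigned char>(ch));
        if (st == status_type::error) return false;
        if (st == status_type::done) break;
        ++consumed;
    }
    if (consumed == 0) return false;
    out = parser.value();
    input.erase(input.begin(),
                input.begin() + static_cast<std::ptrdiff_t>(consumed));
    return true;
}
```

The parser itself never sees an iterator, so the same object could be fed from an istream_iterator or from a contiguous buffer.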

/MF

Thiago Macieira

Feb 6, 2014, 3:06:42 AM2/6/14
to std-pr...@isocpp.org
Em qui 06 fev 2014, às 08:36:10, Magnus Fromreide escreveu:
> I would consider it unlucky if
>
> parse(istream_iterator(cin), istream_iterator())
>
> didn't work.
>
> My use case involves parsing data from part-wise contiguous containers (think
> std::deque) where the ability to erase the head of the container is useful.

If we can support that, by all means.

However, I would consider it a showstopper if

parse("1234");

had to be done all inline.

Miro Knejp

Feb 7, 2014, 3:18:41 PM2/7/14
to std-pr...@isocpp.org
Am 06.02.2014 09:06, schrieb Thiago Macieira:
> Em qui 06 fev 2014, às 08:36:10, Magnus Fromreide escreveu:
>> I would consider it unlucky if
>>
>> parse(istream_iterator(cin), istream_iterator())
>>
>> didn't work.
>>
>> My use case involves parsing data from part-wise contiguous containers (think
>> std::deque) where the ability to erase the head of the container is useful.
> If we can support that, by all means.
>
> However, I would consider it a showstopper if
>
> parse("1234");
>
> had to be done all inline.
Speaking of floats, which part is the more complex/bloated one:
Extracting digits and symbols from the input or assembling them into a
floating point value with minimal rounding errors, etc? The latter can
easily be separated into a stateful object that is fed the numerical
values (i.e. digit values) and semantics (i.e. sign, comma, exponent
indicators) of the input at which point input encodings, character types
or locales are already translated to a neutral subset. Some part of the
numeric parsers certainly needs to be inline but some can be implemented
out-of-line.

Matthew Woehlke

Feb 7, 2014, 5:14:10 PM2/7/14
to std-pr...@isocpp.org
On 2014-02-07 15:18, Miro Knejp wrote:
> Speaking of floats, which part is the more complex/bloated one:
> Extracting digits and symbols from the input or assembling them into a
> floating point value with minimal rounding errors, etc? The latter can
> easily be separated into a stateful object that is fed the numerical
> values (i.e. digit values) and semantics (i.e. sign, comma, exponent
> indicators) of the input at which point input encodings, character types
> or locales are already translated to a neutral subset. Some part of the
> numeric parsers certainly needs to be inline but some can be implemented
> out-of-line.

While that may be true (and in fact, probably quite valuable in terms of
simplicity of implementation), how would you store the intermediate
state without said storage killing performance?

I suppose you could do something like:

parse<float>(/*params elided*/)
{
    fp_impl impl;
    ...
    impl.set_sign(fp_impl::sign_positive);
    ...
    while /*digits*/
    {
        impl.consume_digit(int_value_of_digit);
    }
    ...etc.
    return impl.value(); // grotesquely simplified
}

...where fp_impl is a class/struct that contains whatever internal state
it needs to operate.

Hmm... actually now I sort-of like that, although trying to turn that
into something the C library could also use is more "interesting".
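
Filled in a little further, the split might look like the sketch below. set_sign, consume_digit and value come from the outline above; begin_fraction and parse_double_ascii are hypothetical additions. The arithmetic is deliberately naive: it is not correctly rounded in general and it ignores exponents, so it only illustrates the division of labour, not a usable strtod replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Encoding-neutral accumulator: this part could be compiled out-of-line.
// NOTE: naive arithmetic, not correctly rounded for arbitrary input.
class fp_impl {
public:
    enum sign { sign_positive, sign_negative };

    void set_sign(sign s) { negative_ = (s == sign_negative); }
    void begin_fraction() { in_fraction_ = true; }  // hypothetical hook
    void consume_digit(int digit) {
        mantissa_ = mantissa_ * 10.0 + digit;
        if (in_fraction_) scale_ *= 10.0;
    }
    double value() const {
        const double v = mantissa_ / scale_;
        return negative_ ? -v : v;
    }

private:
    double mantissa_ = 0.0;
    double scale_ = 1.0;
    bool negative_ = false;
    bool in_fraction_ = false;
};

// Character-level front end: the only part that knows about the encoding.
bool parse_double_ascii(std::string_view s, double& out) {
    fp_impl impl;
    std::size_t i = 0;
    if (i < s.size() && (s[i] == '+' || s[i] == '-')) {
        impl.set_sign(s[i] == '-' ? fp_impl::sign_negative
                                  : fp_impl::sign_positive);
        ++i;
    }
    bool any_digit = false;
    bool seen_point = false;
    for (; i < s.size(); ++i) {
        const char c = s[i];
        if (c >= '0' && c <= '9') {
            impl.consume_digit(c - '0');
            any_digit = true;
        } else if (c == '.' && !seen_point) {
            impl.begin_fraction();
            seen_point = true;
        } else {
            return false;  // anything else is rejected, including whitespace
        }
    }
    if (!any_digit) return false;
    out = impl.value();
    return true;
}
```

The state here is two doubles and two flags, which suggests the intermediate storage does not have to be expensive; the per-digit call is the cost that matters.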

--
Matthew

Bengt Gustafsson

Feb 7, 2014, 6:22:19 PM2/7/14
to std-pr...@isocpp.org, mw_t...@users.sourceforge.net
Well, with such a helper object, with its implementation in a cpp file somewhere, we are not going to get extraordinary performance, but maybe good enough? A non-virtual call per character seems a little too much overhead to me.

Thus we need more than that visible in the header files. Or maybe we can rely on link-time optimization nowadays?

It should be noted, however, that not everything is inlined just because it is a template function. There will be a handful of instantiations of the parse<T> template per executable, for string, char*, string_view and maybe some more. For big template functions it is unlikely that the compiler will actually inline. Instead the compiler stores the implementation as a soft symbol in each obj-file (or better, has a backend which keeps track of what instantiations have already been code-generated for this exe). I know that an ancient DEC compiler did the latter, but it was dead slow anyway... Does gcc or clang remember instantiations? I know MS doesn't...

So the big problem is probably not code bloat but compile times. I don't know how this is affected by precompiled headers, probably not much, I guess code generation happens every time anyway...?

As for putting the skipspace handling outside I think it would be a real turnoff:

- The simplest code "works" until users happen to type a leading or trailing space on that important demo.

- When parsing something more complex than a number, say a point x, y you would have to explicitly call skipspace between each member, and testing is again a real problem.

- In what situation is it important to give an error message if there is whitespace? What can go wrong in a real application if the whitespace is ignored? I fail to see those cases other than very marginal. I mean, even if you have specified a file format to forbid spaces (for some reason) you can be quite certain that the other guy interfacing to you will send you spaces anyway. What good does it do to anyone to fail in this case?

And no one has said that we should not provide a "strict" mode/return flag/input flag or something to cater for these cases.

I mean, "getting it right" must mean that it is easy to use and works as expected. All other number converters in all languages I know of eat leading spaces. Most of them can't even tell you if there were any!

Miro Knejp

Feb 7, 2014, 7:47:11 PM2/7/14
to std-pr...@isocpp.org
Am 08.02.2014 00:22, schrieb Bengt Gustafsson:
> - The simplest code "works" until users happen to type a leading or
> trailing space on that important demo.
Well that's user input handling and could fill its own book. But that's
a question of sensible defaults or using the right flags/overloads.
>
> - In what situation is it important to give an error message if there
> is whitespace? What can go wrong in a real application if the
> whitespace is ignored? I fail to see those cases other than very
> marginal. I mean, even if you have specified a file format to forbid
> spaces (for some reason) you can be quite certain that the other guy
> interfacing to you will send you spaces anyway. What good does it do
> to anyone to fail in this case?
>
<Number>
12345
</Number>
Your schema tells your validator that the Number element is an integer,
so you use (a whitespace skipping) parse<int>() to get its value. Now
you skipped the newline after the opening tag and your line counters are
wrong which may cause other stuff to break. Same as your example above,
just inverted.

Silly example? Maybe. But specifications and text based communication
protocols do exist for which the certification process strictly requires
to only accept valid input and error otherwise. Leading whitespaces are
not valid input. Not even a leading + sign for positive numbers. "Oh
wait, your application accepts these? No certificate for you, go home."
The aviation industry is full of these but unfortunately I can't tell
you how they look because they are covered by NDAs. My point is that the
use case exists and therefore should be supported. Who has the
omniscience to judge whether it's common enough to be justified or not?
I certainly have uses for both variations in enough projects.
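
For what it's worth, the strict behaviour described here is what std::from_chars, which C++17 later added in <charconv>, does for integers: it skips no whitespace and does not accept a leading '+'. A small sketch follows; the wrapper name parse_strict is hypothetical.

```cpp
#include <cassert>
#include <charconv>
#include <string_view>
#include <system_error>

// Accepts only a complete integer: no leading or trailing whitespace, no '+'.
// Leaves `out` untouched on failure.
bool parse_strict(std::string_view s, int& out) {
    int tmp = 0;
    const char* first = s.data();
    const char* last = first + s.size();
    const auto [ptr, ec] = std::from_chars(first, last, tmp);
    if (ec != std::errc() || ptr != last) return false;
    out = tmp;
    return true;
}
```

A whitespace-skipping variant can then be a separate, clearly named function layered on top, rather than a flag on the same call.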

> And no one has said that we should not provide a "strict" mode/return
> flag/input flag or something to cater for these cases.
I'm aware of that. What I don't like is the attempt to squeeze
*everything* into a single method. My fear is it makes the call site
convoluted and hard to read and for some modes a dedicated overload with
a fitting and descriptive name would be more beneficial. What the
sensible defaults for whatever comes out of this discussion are is an
entirely different topic.
>
> I mean, "getting it right" must mean that it is easy to use and works
> as expected. All other number converters in all languages I know of
> eat leading spaces. Most of them can't even tell you if there were any!
That doesn't mean we should apply the same limitations. It
also doesn't mean there must be only one single method for everything.
Always RTFM when using a function you don't know.


Thiago Macieira

Feb 7, 2014, 9:59:56 PM2/7/14
to std-pr...@isocpp.org
Em sex 07 fev 2014, às 21:18:41, Miro Knejp escreveu:
> Speaking of floats, which part is the more complex/bloated one:
> Extracting digits and symbols from the input or assembling them into a
> floating point value with minimal rounding errors, etc? The latter can
> easily be separated into a stateful object that is fed the numerical
> values (i.e. digit values) and semantics (i.e. sign, comma, exponent
> indicators) of the input at which point input encodings, character types
> or locales are already translated to a neutral subset. Some part of the
> numeric parsers certainly needs to be inline but some can be implemented
> out-of-line.

That could be done. As I said, the requirement is that this code is not inline
and not templated. It must exist in a .cpp file not visible to the user.

If you want to pass a traits object that specifies how to recognise digits,
decimals, thousands separators, exponents, plus a function to get the next
digit, by all means.

Here's an implementation of strtod to get you started (freely licensed):

http://code.google.com/p/freebsd/source/browse/contrib/gdtoa/strtod.c
http://code.google.com/p/freebsd/source/browse/contrib/gdtoa/gdtoaimp.h

If you manage to do that, I'll be very interested in the code. Right now, to
parse a UTF-16 number in Qt, we must first convert it to Latin1, which means
allocating memory, which means I can't make those functions noexcept.

Csaba Csoma

Feb 8, 2014, 11:37:08 PM2/8/14
to std-pr...@isocpp.org
Related Stack Overflow discussion:

Csaba

On Sunday, January 26, 2014 8:25:02 AM UTC-8, Matthew Fioravante wrote:
string to T (int, float, etc..) conversions seem like a rather easy task (aside from floating point round trip issues), and yet for the life of C and C++ the standard library has consistently failed to provide a decent interface.

Let's review:

int atoi(const char* s); //and atol, atoll, atof etc..

What's wrong with this?
  • Returns 0 on parsing failure, making it impossible to parse 0 strings. This already renders this function effectively useless and we can skip the rest of the bullet points right here.
  • It discards leading whitespace, this has several problems of its own:
    • If we want to check whether the string is strictly a numeric string, we have to add our own check that the first character is a digit. This makes the interface clumsy to use and easy to screw up.
    • std::isspace() is locale dependent and requires an indirect function call (try it on gcc.godbolt.org). This makes what could be a very simple and inlinable conversion potentially expensive. It also prevents constexpr.
    • From a design standpoint, this whitespace handling is a very narrow use case. It does too many things and in my opinion is a bad design. I often do not have whitespace delimited input in my projects.
  • No atod() for doubles or atold() for long doubles.
  • No support for unsigned types, although this may not actually be a problem.
  • Uses horrible C interface (type suffixes in names) with no overloading or template arguments. What function do we use if we want to parse an int32_t?
long strtol(const char* str, char **str_end, int base);

What's wrong with this one?
  • Again it has this silly leading whitespace behavior (see above).
  • It's not obvious how to correctly determine whether or not parsing failed. Every time I use this function I have to look it up again to make sure I get it exactly right and have covered all of the corner cases.
  • Uses 0/T_MAX/T_MIN to denote errors, when these could be validly parsed from strings. Checking whether or not these values were parsed or are representing errors is clumsy.
  • Again C interface issues (see above).
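
To illustrate how many separate checks a strict strtol call needs, here is a sketch of a wrapper that covers the corner cases listed above. The name parse_long_strict is hypothetical, and rejecting leading whitespace is a choice made for this example, not something strtol itself offers.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdlib>

// Returns true only if the whole of `s` is a base-10 long that fits in range.
// Leaves `out` untouched on failure.
bool parse_long_strict(const char* s, long& out) {
    // strtol silently skips leading whitespace, so reject it up front.
    if (std::isspace(static_cast<unsigned char>(*s))) return false;

    errno = 0;  // must be cleared: strtol only sets it, never resets it
    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);

    if (end == s) return false;         // no digits were consumed
    if (*end != '\0') return false;     // trailing characters after the number
    if (errno == ERANGE) return false;  // v is LONG_MAX or LONG_MIN, clamped

    out = v;
    return true;
}
```

Note that "0" parses successfully here, because failure is detected through end and errno rather than through the returned value.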

At this point, I think we are ready to define a new set of int/float parsing routines.

Design goals:
  • Easy to use, usage is obvious.
  • No assumptions about use cases, we just want to parse strings. This means none of this automatic whitespace handling.
  • Efficient and inline
  • constexpr
Here is a first attempt for an integer parsing routine.

//Attempts to parse s as an integer. The valid integer string consists of the following:
//* '+' or '-' sign as the first character (- only acceptable for signed integral types)
//* prefix (0) indicating octal base (applies only when base is 0 or 8)
//* prefix (0x or 0X) indicating hexadecimal base (applies only when base is 16 or 0).
//* All of the rest of the characters MUST be digits.
//Returns true if an integral value was successfully parsed and stores the value in val,
//otherwise returns false and leaves val unmodified. 
//Sets errno to ERANGE if the string was an integer but would overflow type integral.
template <typename integral>
constexpr bool strto(string_view s, integral& val, int base);

//Same as the previous, except that instead of trying to parse the entire string, we only parse the integral part. 
//The beginning of the string must be an integer as specified above. Will set tail to point to the end of the string after the integral part.
template <typename integral>
constexpr bool strto(string_view s, integral& val, int base, string_view& tail);


First off, all of these return bool which makes it very easy to check whether or not parsing failed.

While the interface does not allow this idiom:

int x = atoi(s);

It works with this idiom which in all of my use cases is much more common:
int val;
if(!strto(s, val, 10)) {
  throw some_error();
}
printf("We parsed %d!\n", val);

Some examples:

int val;
string_view sv= "12345";
assert(strto(sv, val, 10));
assert(val == 12345);
sv = "123 456";
val = -2;
assert(!strto(sv, val, 10));
assert(val == -2);
assert(strto(sv, val, 10, sv));
assert(val == 123);
assert(sv == " 456");
sv.remove_prefix(1); //chop off the " ";
assert(sv == "456");
assert(strto(sv, val, 10));
assert(val == 456);
val = 0;
assert(strto(sv, val, 10, sv));
assert(val == 456);
assert(sv == "");


Similarly we can define this for floating point types. We may also want null terminated const char* versions as converting a const char* to string_view requires a call to strlen(). 

Miro Knejp

Feb 11, 2014, 10:08:21 PM2/11/14
to std-pr...@isocpp.org

Am 08.02.2014 03:59, schrieb Thiago Macieira:
> Em sex 07 fev 2014, às 21:18:41, Miro Knejp escreveu:
>> Speaking of floats, which part is the more complex/bloated one:
>> Extracting digits and symbols from the input or assembling them into a
>> floating point value with minimal rounding errors, etc? The latter can
>> easily be separated into a stateful object that is fed the numerical
>> values (i.e. digit values) and semantics (i.e. sign, comma, exponent
>> indicators) of the input at which point input encodings, character types
>> or locales are already translated to a neutral subset. Some part of the
>> numeric parsers certainly needs to be inline but some can be implemented
>> out-of-line.
> That could be done. As I said, the requirement is that this code is not inline
> and not templated. It must exist in a .cpp file not visible to the user.
>
> If you want to pass a traits object that specifies how to recognise digits,
> decimals, thousands separators, exponents, plus a function to get the next
> digit, by all means.
>
> Here's an implementation of strtod to get you started (freely licensed):
>
> http://code.google.com/p/freebsd/source/browse/contrib/gdtoa/strtod.c
> http://code.google.com/p/freebsd/source/browse/contrib/gdtoa/gdtoaimp.h
>
> If you manage to do that, I'll be very interested in the code. Right now, to
> parse a UTF-16 number in Qt, we must first convert it to Latin1, which means
> allocating memory, which means I can't make those functions noexcept.
Well that's a problem of codecvt and friends. There's no iterative
interface to avoid said allocations and if there was, it would most
likely require a virtual call per input character. It would certainly
land in the cache and branch prediction would help, too. But that's just
how cultures/encodings complicate things. A language-neutral fast path
ASCII overload would not suffer from these drawbacks. I see this working
in two steps. First, translate the next X input characters into a
Unicode codepoint and second, translate that character into a
digit/separator/decimal/etc. The former potentially needs allocations
and depends only on the encoding, the latter depends on
language/culture. If codecvt had an iterative interface one could at
least measure what dominates: the virtual calls, the allocation, or
assembling the actual float correctly.

But as long as codecvt does not have a method to consume characters
incrementally I see no way to go without some sort of temporary output
buffer.
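
For the language-neutral fast path, a sketch along these lines avoids both the allocation and the conversion step by working on the code units directly. It assumes only ASCII digits count as digits, handles unsigned decimal values only, and the name parse_ascii_uint is hypothetical. The second step described above (mapping other scripts' digits or locale separators) is deliberately left out.

```cpp
#include <cassert>
#include <climits>

// Works on any of the four character types without converting the input.
// Only U+0030..U+0039 are accepted, so no locale or codecvt is involved
// and nothing can throw.
template <typename CharT>
bool parse_ascii_uint(const CharT* begin, const CharT* end,
                      unsigned long long& out) noexcept {
    if (begin == end) return false;
    unsigned long long acc = 0;
    for (const CharT* p = begin; p != end; ++p) {
        if (*p < CharT('0') || *p > CharT('9')) return false;
        const unsigned digit = static_cast<unsigned>(*p - CharT('0'));
        if (acc > (ULLONG_MAX - digit) / 10) return false;  // would overflow
        acc = acc * 10 + digit;
    }
    out = acc;
    return true;
}
```

Called with a char16_t range this parses UTF-16 text in place, which is the case that currently forces a Latin1 conversion.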

David Krauss

Feb 11, 2014, 10:22:28 PM2/11/14
to std-pr...@isocpp.org
On Feb 12, 2014, at 11:08 AM, Miro Knejp <mi...@knejp.de> wrote:

Well that's a problem of codecvt and friends.
But as long as codecvt does not have a method to consume characters incrementally I see no way to go without some sort of temporary output buffer.

Codecvt uses a user-supplied buffer of type mbstate_t to allow incremental processing. However because mbstate_t is entirely implementation-dependent (aside from being POD), there's no way for the user to define their own stateful codecvt. It's likely that Qt uses something else, although I don't know much about it.

Thiago Macieira

Feb 11, 2014, 10:40:01 PM2/11/14
to std-pr...@isocpp.org
Em qua 12 fev 2014, às 11:22:28, David Krauss escreveu:
> It's likely that Qt uses something else, although I don't know much about
> it.

QStrings are always UTF-16 encoded and the conversion from UTF-16 to Latin1 is
highly optimised. char16_t strings and char32_t strings also have a very well-
defined encoding. There's no problem working with them, since conversion can be
done easily. wchar_t strings have an implementation-defined encoding, but it's
the same people who decide that encoding as the people who will write the
conversion function, so it's no problem either.

The problem is only for char strings, which can be multibyte and whose
encoding can vary at runtime. Though often enough such encoding is compatible
with ASCII and, therefore, the implementation can ignore the multibyte data.

Anyway, note that this discussion did not start about encodings. I don't think
that dealing with the four character types is a problem. It's just more work
for the implementation developers.

I was talking about parsing a non-contiguous block of data and depending on
the iterator. Parsing character by character (whichever character type) is the
problem here.

David Krauss

Feb 11, 2014, 11:28:19 PM2/11/14
to std-pr...@isocpp.org
On Feb 12, 2014, at 11:40 AM, Thiago Macieira <thi...@macieira.org> wrote:

I was talking about parsing a non-contiguous block of data and depending on
the iterator. Parsing character by character (whichever character type) is the
problem here.

I've not been following the discussion but just saw the short message making a false assertion about codecvt.

Now that it's been mentioned, codecvt isn't such a bad parsing model, when the input doesn't fit in memory and you want to make multiple calls. However that's not the situation for numeric conversion. I think what Miro intended to say is std::num_get, which is stateless.

Looking at num_get now, it's templated over a generic input iterator type so you could define a discontiguous range that way. But, that only gets you the C locale. You cannot go from a std::locale object to a num_get compatible template or arbitrary specialization.

Alternately, you could define your own streambuf class, like std::istringstream but non-owning and discontiguous. The overhead would be one virtual call per storage block, perhaps minus one. (If the statically accessible pointers in the streambuf cover the text to be parsed, no virtual call should be needed except the one handling locale indirection, which you can avoid by statically calling a locale object if you know which one you want.) This really seems like the way to go for a "rope" class.
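
A rough sketch of that streambuf idea, assuming the discontiguous storage can be described as a list of string_view blocks. block_streambuf and parse_blocks are hypothetical names; underflow() is called once each time a block is exhausted, not once per character.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <locale>
#include <streambuf>
#include <string_view>
#include <utility>
#include <vector>

// Non-owning, read-only streambuf over a sequence of storage blocks.
class block_streambuf : public std::streambuf {
public:
    explicit block_streambuf(std::vector<std::string_view> blocks)
        : blocks_(std::move(blocks)) {}

protected:
    // Invoked only when the current get area is empty: move to the next
    // non-empty block, or report end of input.
    int_type underflow() override {
        while (next_ < blocks_.size()) {
            const std::string_view block = blocks_[next_++];
            if (block.empty()) continue;
            // setg takes non-const pointers; no put area is ever set up,
            // so the underlying data is not written through them.
            char* p = const_cast<char*>(block.data());
            setg(p, p, p + block.size());
            return traits_type::to_int_type(*p);
        }
        return traits_type::eof();
    }

private:
    std::vector<std::string_view> blocks_;
    std::size_t next_ = 0;
};

// Parses one integer that may straddle block boundaries.
bool parse_blocks(std::vector<std::string_view> blocks, long long& out) {
    block_streambuf buf(std::move(blocks));
    std::istream in(&buf);
    in.imbue(std::locale::classic());  // pin the locale explicitly
    long long v = 0;
    if (!(in >> v)) return false;
    out = v;
    return true;
}
```

This goes through operator>>, so it inherits the stream's whitespace skipping and locale machinery; it shows the storage model, not the strict parsing discussed earlier in the thread.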

Anyway, I would think that discontiguous storage would be a disqualification to using whatever convenient interface from the present proposal.
