string to T (int, float, etc..) conversions seem like to rather easy task (aside from floating point round trip issues), and yet for the life of C and C++ the standard library has consistently failed to provide a decent interface.
Lets review:
int atoi(const char* s); //and atoll,atol,atoll, atof etc..
Whats wrong with this?
- Returns 0 on parsing failure, making it impossible to parse 0 strings. This already renders this function effectively useless and we can skip the rest of the bullet points right here.
- It discards leading whitespace, this has several problems of its own:
- If we want to check whether the string is strictly a numeric string, we have to add our own check that the first character is a digit. This makes the interface clumsy to use and easy to screw up.
- std::isspace() is locale dependent and requires an indirect function call (try it on gcc.godbolt.org). This makes what could be a very simple and inlinable conversion potentially expensive. It also prevents constexpr.
- From a design standpoint, this whitespace handling is a very narrow use case. It does too many things and in my opinion is a bad design. I often do not have whitespace delimited input in my projects.
- No atod() for doubles or atold() for long doubles.
- No support for unsigned types, although this may not actually be a problem.
- Uses horrible C interface (type suffixes in names) with no overloading or template arguments. What function do we use if we want to parse an int32_t?
long strtol(const char* str, char **str_end, int base);
Whats wrong with this one?
- Again it has this silly leading whitespace behavior (see above).
- Its not obvious how to correctly determine whether or not parsing failed. Every time I use this function I have to look it up again to make sure I get it exactly right and have covered all of the corner cases.
- Uses 0/T_MAX/T_MIN to denote errors, when these could be validly parsed from strings. Checking whether or not these values were parsed or are representing errors is clumsy.
- Again C interface issues (see above).
At this point, I think we are ready to define a new set of int/float parsing routines.
Design goals:
- Easy to use, usage is obvious.
- No assumptions about use cases, we just want to parse strings. This means none of this automatic whitespace handling.
- Efficient and inline
- constexpr
Here is a first attempt for an integer parsing routine.
//Attempts to parse s as an integer. The valid integer string consists of the following://* '+' or '-' sign as the first character (- only acceptable for signed integral types)//* prefix (0) indicating octal base (applies only when base is 0 or 8)//* prefix (0x or 0X) indicating hexadecimal base (applies only when base is 16 or 0).//* All of the rest of the characters MUST be digits.//Returns true if an integral value was successfully parsed and stores the value in val,//otherwise returns false and leaves val unmodified.//Sets errno to ERANGE if the string was an integer but would overflow type integral.template <typename integral>
constexpr bool strto(string_view s, integral& val, int base);
//Same as the previous, except that instead of trying to parse the entire string, we only parse the integral part.
//The beginning of the string must be an integer as specified above. Will set tail to point to the end of the string after the integral part.template <typename integral>constexpr bool strto(string_view s, integral& val, int base, string_view& tail);
Am 26.01.2014 17:25, schrieb Matt Fioravante:
string to T (int, float, etc..) conversions seem like to rather easy task (aside from floating point round trip issues), and yet for the life of C and C++ the standard library has consistently failed to provide a decent interface.
At this point, I think we are ready to define a new set of int/float parsing routines.
Design goals:
- Easy to use, usage is obvious.
- No assumptions about use cases, we just want to parse strings. This means none of this automatic whitespace handling.
- Efficient and inline
- constexpr
Here is a first attempt for an integer parsing routine.
//Attempts to parse s as an integer. The valid integer string consists of the following://* '+' or '-' sign as the first character (- only acceptable for signed integral types)//* prefix (0) indicating octal base (applies only when base is 0 or 8)//* prefix (0x or 0X) indicating hexadecimal base (applies only when base is 16 or 0).//* All of the rest of the characters MUST be digits.//Returns true if an integral value was successfully parsed and stores the value in val,//otherwise returns false and leaves val unmodified.//Sets errno to ERANGE if the string was an integer but would overflow type integral.template <typename integral>
constexpr bool strto(string_view s, integral& val, int base);
Please no ERxxx nonsense. optional, expected, exceptions, pairs, whatever but no ER codes, that's even more silly C.
Why didn't you include stringstream in your review? E.g. something
like https://code.google.com/p/mili/source/browse/mili/string_utils.h#267
I am currently facing the same problems while working on my format proposal: https://groups.google.com/a/isocpp.org/forum/?fromgroups#!topic/std-proposals/CIlWCTOe5kc
All the existing functions work on null-terminated strings only which is totally useless for my use cases
as I am parsing substrings in-place. I intend to design the string processing stuff that I'm using general enough so it can be used independently but for now I just want to make it work in the first place.
Please no ERxxx nonsense. optional, expected, exceptions, pairs, whatever but no ER codes, that's even more silly C. I currently base mine on iterators and provide string_view as convenience overloads.
With iterators it could return the iterator to the first element not part of the integer. a pair<optional<int>, Iter> or similar is a possibility.
Certainly not the best concept but I'd prefer it to checking errno anyday and depending on the combination of optional's engaged and the iterator position you can determine whether it failed and if so, why. Well, just giving some spontaneous food for thought. I only very recently started with the number parsing part of my proposal, so the interface will probbaly be very unstable for quite a while. And then there's locales, and together with them a whole new world of problems...
I'm sure that everyone intended that for their functions. That survived only
until the first round of feedback or encounter with reality...
> - No assumptions about use cases, we just want to parse strings. This
> means none of this automatic whitespace handling.
Fair enough. It's easier to compose with other space checkers if you need to
than to remove functionality.
> - Efficient and inline
> - constexpr
Efficient, definitely. Inline and constexpr? Forget it, it can't be done. Have
you ever looked at the source of a string-to-double function? They're huge!
This might be left as a suggestion to compilers to implement this as an
intrinsic.
> //Attempts to parse s as an integer. The valid integer string consists of
> the following:
> //* '+' or '-' sign as the first character (- only acceptable for signed
> integral types)
But no U+2212?
> //* prefix (0) indicating octal base (applies only when base is 0 or 8)
> //* prefix (0x or 0X) indicating hexadecimal base (applies only when base
> is 16 or 0).
> //* All of the rest of the characters MUST be digits.
Where, by "digits", we understand the regular ASCII digits 0 to 9 and the
letters that compose digits on this base, both in uppercase and lowercase.
> //Returns true if an integral value was successfully parsed and stores the
> value in val,
> //otherwise returns false and leaves val unmodified.
> //Sets errno to ERANGE if the string was an integer but would overflow type
> integral.
What if it failed to parse? What's the return condition?
As others have said, using errno is too C, but then again this kind of
function should be done in conjunction with the C people. Any improvements we
need, they probably need too.
> template <typename integral>
> constexpr bool strto(string_view s, integral& val, int base);
Replace string_view with a pair of InputIterators.
Do you know what this means? Parsing char16_t, char32_t and wchar_t too.
> //Same as the previous, except that instead of trying to parse the entire
> string, we only parse the integral part.
> //The beginning of the string must be an integer as specified above. Will
> set tail to point to the end of the string after the integral part.
> template <typename integral>
> constexpr bool strto(string_view s, integral& val, int base, string_view&
> tail);
Same as above.
> First off, all of these return bool which makes it very easy to check
> whether or not parsing failed.
That's the opposite of what most people want. Most people want to get the
parsed number, not whether it succeded or failed. Maybe invert the logic?
That's what we do for {QString,QByteArray,QLocale}::to{Int,Double,etc.}. And
one suggestion I received a few weeks ago was to add the overload that returns
the end pointer and does not fail if there's more stuff after it.
What we really need for the input string is "something that begin() and end() functions work for". This avoids the tedious sending of two iterators, while covering vectors, string literals (I hope), strings and string_views without overloading.In the fairly high percentage of cases when we want to do skipspace we need to have a easy way to do that. One may be:auto skipspace(const RNG& src)->string_view<decltype(*begin(src)> { ... }
An just to further emphasize the question of returning pass/fail vs returning the value.
My general philosophy with parsing is to emphasize error handling first and then the actual results second. The success or failure of the parse should be thrown right in your face, forcing you to deal with it. This helps remind us to write more correct code. I'd be happy to know if people agree or not.
--It really depends on the use case.
Seems rather complicated to stuff all of that into the return value no? Returning a std::optional<int> would be ok because you can directly check the return value with operator bool(). It still doesn't provide information on why the failure occurred though.
> //Attempts to parse s as an integer. The valid integer string consists of
> the following:
> //* '+' or '-' sign as the first character (- only acceptable for signed
> integral types)
But no U+2212?
We could consider unicode as well. That's a good question.
> //* prefix (0) indicating octal base (applies only when base is 0 or 8)
> //* prefix (0x or 0X) indicating hexadecimal base (applies only when base
> is 16 or 0).
> //* All of the rest of the characters MUST be digits.
Where, by "digits", we understand the regular ASCII digits 0 to 9 and the
letters that compose digits on this base, both in uppercase and lowercase.
Yes that's right.
Maybe we should add an extra boolean argument (defaulted to true) that be used to disable the hex and octal prefixes. Sometimes you really want to just parse a hex string without the 0x prefex. Adding an extra false to the parameter list is nicer than doing this check for yourself. Its similar to disabling the leading whitespace check of strtol().
On 2014-01-26 23:33, Matt Fioravante wrote:
An just to further emphasize the question of returning pass/fail vs returning the value.
My general philosophy with parsing is to emphasize error handling first and then the actual results second. The success or failure of the parse should be thrown right in your face, forcing you to deal with it. This helps remind us to write more correct code. I'd be happy to know if people agree or not.
--
It really depends on the use case.
- If 0 (or some other value) is an acceptable fall-back result, I don't want to litter my code with error handling.
- If a parse error has to be recorded or has to provoke some other action (e.g. ask the user to re-enter data), then I want to be forced to deal with the errors.
A good interface supports at least these two options, I'd say.
> > As others have said, using errno is too C, but then again this kind of
> > function should be done in conjunction with the C people. Any improvements
> > we
> > need, they probably need too.
>
> With overloading, templates, iterators, string_view, etc.. its not so C
> compatible. Do we really care so much anyway? I don't like the idea of
> handicapping C++ interfaces in the name of C compatibility.
C11 has generics, so that solves the problem of the templates and the
iterators.
But it might be that this C++ function get implemented with calls to strtol,
strtoul, strtoll, strtoull, strtod, etc. anyway. So maybe the C guys already
have what they need, except for the C Generic version.
And the char16_t, char32_t and wchar versions.
> > Do you know what this means? Parsing char16_t, char32_t and wchar_t too.
>
> Yes, but that's not so difficult.
That depends on whether locale parsing is performed. We need functions that
don't depend on the locale, in which case a conversion from char16_t and
char32_t to the execution charset can be done quite quickly (again something
for which the constexpr version would be slower than the runtime optimised
version). After the conversion is done, the char variant can be called.
That's how QString::to{Int,Double} is implemented: first, fast-convert from
UTF-16 to Latin 1, then call the internal strtoll / strtod.
> > That's the opposite of what most people want. Most people want to get the
> > parsed number, not whether it succeded or failed. Maybe invert the logic?
>
> I think that's a voting/bikeshed question (or use std::optional).
Agreed, which is why I'm not going to continue this part of the discussion :-)
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
On domingo, 26 de janeiro de 2014 14:33:47, Matt Fioravante wrote:
> My general philosophy with parsing is to emphasize error handling first and
> then the actual results second. The success or failure of the parse should
> be thrown right in your face, forcing you to deal with it. This helps
> remind us to write more correct code. I'd be happy to know if people agree
> or not.
You're asking for the bikeshed discussion.
I don't agree. Sometimes you already know that the data is well-formed and you
don't need the error status. Therefore, emphasizing the actual data is more
important.
Functions should return int and negative errors instead of NULL
- Return NULL in malloc() is fine, return NULL in fopen() is not!
- Pass allocated objects as parameter (yes, ctx_t** is OK!)
- Returning kernel style negative <errno.h> error codes is cool in
userspace too. Do it!
This is a C API, but the concept still translates to C++. Here they encourage returning error conditions, and stuffing the results into out parameters. While I might not do this for everything, I certainly would for error heavy routines such as parsing.
Given use of exceptions, the philosophy I described in the paragraph seems to
apply to the Standard Library.
Bools aren't very desriptive. I'd prefer a more explicit syntax for overloads. For example:
to_string<int>(s); // Use current locale
to_string<int>(s, no_locale); // Tag type: ASCII only, fast path with no facets, no virtuals
to_string<int>(s, myLocale);
Makes sense? It's something I'm playing around with currently.
A good interface supports at least these two options, I'd say.
We have 2 differing uses cases here. Can one interface cleanly support both?
I don't agree. Sometimes you already know that the data is well-formed and you
don't need the error status. Therefore, emphasizing the actual data is more
important.
I don't know about you or others, but the majority of the time when I need to do these conversions I am getting my string from the command line, a file, a network socket, and/or another user input of some kind etc.. All of those require strict error checking.
Even if you know the string is supposed to be parsed correctly, it doesn't hurt to throw in an assert() or debug mode error message check in there in case someone (or you yourself) made a mistake earlier and broke your invariant.
What has come out of this are 2 distinct use cases, error checking emphasis vs results emphasis. We have 3 options:
- Come up with an interface that somehow satisfies both
- Make 2 interfaces, one for each situation
- Prioritize one over the other
I'm not sure how to do (1) and (2) seems like it could be confusing to have 2 interfaces that do the exact same thing with slightly different calling conventions. So that leaves us with (3), and I obviously stand firm in the safety camp.
Given use of exceptions, the philosophy I described in the paragraph seems to
apply to the Standard Library.
Maybe so, but I'd rather come up with the safest, most expressive, most easy to use, and most efficient interface possible. Regardless of past precedents.
class error_return {
public:
error_return() m_handled(false) {} // Ok case: No exception
error_return(exception_ptr ex) : m_handled(false), m_exception(ex) {}
~error_return() {
if (m_handled)
return;
if (m_exception)
rethrow_exception(mException);
else
throw exception("return value not checked");
}
void ignore() { m_handled = true; } // Use to explicitly ignore errors
void rethrow() { m_handled = true; rethrow_exception(m_exception); }
operator bool() { m_handled = true; return !m_exception; } // for if-type check. true (good) if m_exeption is null.
private:
bool m_handled;
exception_ptr m_exception;
};
Usage:
error_return from_string(T& dest, string_view& src) {
if (... could convert ...)
return error_return();
else
return make_exception_ptr(exception("Could not convert"));
}
// No check required
from_string(dest, "123").ignore();
// Check using if
if (from_string(dest, "123"))
... handle error;
// Throw on error
from_string(dest, "123").rethrow();
// Programming error!
from_string(dest, "123"); // throws on first call even if conversion can be made!
Regarding default values produced when the conversion fails this is another argument for this style:<error return type> from_string(T& dest, string_view& src)Now the standard can specify that the function shall not touch dest if conversion fails.
Am 29.01.2014 17:49, schrieb Matthew Woehlke:
On 2014-01-29 11:18, Bengt Gustafsson wrote:I am now using the following interface in the format parser implementation:
Regarding default values produced when the conversion fails this is another
argument for this style:
<error return type> from_string(T& dest, string_view& src)
Now the standard can specify that the function shall not touch dest if
conversion fails. The default value is the previous value of the variable!
So... not only can I still not assign the result to a const local, now 'dest' potentially contains uninitialized memory? I don't see how that's an improvement.
If it is really necessary to have a description of the failure type (and errno is not suitable; personally I find nothing wrong with using errno), then maybe a return type that is similar to std::optional with an additional 'why it is disengaged' could be created. (Maybe even subclass std::optional and call it e.g. std::result?)
// No check required
from_string(dest, "123").ignore();
You omitted the declaration and initialization of 'dest'. IOW:
// your proposal
auto dest = int{12};
from_string(dest, "34").ignore();
foo(dest);
- vs. -
// std::optional as return type
foo(from_string<int>("34").value_or(12));
Using std::optional, I (in the above example, anyway) avoided even having a named variable to receive the value. And if I wanted one, I could make it const, which I couldn't do with your version.
pair<optional<T>, Iter> parse_integer<T>(Iter first, Iter last, int radix = 10)
and the convenience overload
optional<T> parse_integer<T>(string_view s, int radix = 10);
which could, using internal tag dispatching, be reduced to
parse<T>(...)
The signatures are very easy to use and give me all I need. Both greedily consume as many valid characters as possible (even on over/underflow) so parsing can continue past the (in)valid input. Sure, you don't get a detailed error report but how important is it really? All I care about is whether the number was valid or not. If the optional is disengaged I can *guess* what happened using the iterator overload. If the returned iterator equals first, then there was no number to begin with, otherwise the number format was wrong or over/underflowed occured. In any case the returned iterator points to the first character past the number. Not sure how important it really is to distinguish over/underflow from an invalid pattern though. I had no use for that information so I would be interested to hear about scenarios where it really does matter.
Given a function
In Boost.Expected we have an example with something like
pair< Iter, expected<T, std::ios_base::iostate>> parse_integer<T>(Iter first, Iter last);
or
expected< pair< Iter, T>, pair<Iter, std::ios_base::iostate>> parse_integer<T>(Iter first, Iter last);
A parse interger range could be implemented as
expected< pair< Iter, pair<T,T>>, pair<Iter, std::ios_base::iostate>> parse_integer_range<T>(Iter s, Iter e) {
auto f = parse_integer<T>(s, e); RETURN_IF_UNEXPECTED(f);
auto m = parse_string("..", f.first, e); RETURN_IF_UNEXPECTED(m);
auto l = parse_integer<T>(m, e); RETURN_IF_UNEXPECTED(l);
return make_expected(make_pair(l.first, make_pair(f.second, l.second))));
}
where
#define RETURN_IF_UNEXPECTED(f) if (! f) return f.get_exceptional();
Note that we can also see pair< Iter, expected<T, std::ios_base::iostate>> as equivalent to expected< pair< Iter, T>, pair<Iter, std::ios_base::iostate>> and so make it a monad also.
I would like to be able to write it just as
expected< pair< Iter, pair<T,T>>, pair<Iter, std::ios_base::iostate>> parse_integer_range<T>(Iter s, Iter e) {
auto f = await parse_integer<T>(s, e);
auto m = await parse_string("..", f.first, e);
auto l = await parse_integer<T>(m, e);
return make_pair(l.first, make_pair(f.second, l.second)));
}
The keyword await could be subject to discussion.
The advantage here is that we are writing the code as if the functions parse_integer thrown an exception in case of errors.
The await operator would make return the parse_integer_range if the expression on the right has an error stored.
The returned value would have the type of the parse_integer_range with the stored error.
and the convenience overloadHow do will use this overload? Could you define a parse_interger_range with?
optional<T> parse_integer<T>(string_view s, int radix = 10);
Or is the intent to match the whole string and so the name should be match_integer?
Maybe your application don't care of the detailed error, but when designing a library it is better to provide as much information as has been obtained so that the user can do whatever she needs.
which could, using internal tag dispatching, be reduced to
parse<T>(...)
The signatures are very easy to use and give me all I need. Both greedily consume as many valid characters as possible (even on over/underflow) so parsing can continue past the (in)valid input. Sure, you don't get a detailed error report but how important is it really? All I care about is whether the number was valid or not. If the optional is disengaged I can *guess* what happened using the iterator overload. If the returned iterator equals first, then there was no number to begin with, otherwise the number format was wrong or over/underflowed occured. In any case the returned iterator points to the first character past the number. Not sure how important it really is to distinguish over/underflow from an invalid pattern though. I had no use for that information so I would be interested to hear about scenarios where it really does matter.
The Boost::expected template suggested by Miro comes close to what I want, but does not make sure that the error code was checked in its dtor, which I think was the main feature of my proposed error_return class.
Here is a first attempt for an integer parsing routine.
//Attempts to parse s as an integer. The valid integer string consists of the following://* '+' or '-' sign as the first character (- only acceptable for signed integral types)
//* prefix (0) indicating octal base (applies only when base is 0 or 8)
Similarly we can define this for floating point types. We may also want null terminated const char* versions as converting a const char* to sting_view requires a call to strlen().
I am now using the following interface in the format parser implementation:
pair<optional<T>, Iter> parse_integer<T>(Iter first, Iter last, int
radix = 10)
and the convenience overload
optional<T> parse_integer<T>(string_view s, int radix = 10);
which could, using internal tag dispatching, be reduced to
parse<T>(...)
The signatures are very easy to use and give me all I need. Both
I see number 1 as the most fundamental. The interface is based on
iterators and can thus work with almost any type of input. 2 and 3
can be implemented in terms of 1 and a string_view overload would
probably be used more often there. As soon as locales are involved
all three should automatically recognize the correct grouping,
separating and decimal characters.
template<typename T> zero_terminate_iterator {
public:
zero_terminate_iterator() : ptr(nullptr) {}
zero_terminate_iterator(T* p) : ptr(p) {}
bool operator(const zero_terminate_iterator<T>& rhs) const {
if (ptr == rhs.ptr)
return true; // includes the case that both are nullptr
if (rhs.ptr == nullptr && *ptr == 0)
return true;
if (ptr == nullptr && *rhs.ptr == 0)
return true;
return false;
}
// etc...
private:
T* ptr;
}
auto char_ptr_range(char* p) { return range<zero_terminate_iterator<char>>(zero_terminate_iterator<char>(p), <zero_terminate_iterator<char>()); } // Uses C++14 function return type deduction.
//Now you can write:
char* p = "1234";
for(auto c : char_ptr_range(p))
process_char(c);
// and you can write
auto x = parse<int>(char_ptr_range(p)); // or whatever API we end up with.
@Matthew, regarding strlen avoidance: By RANGE I meant a template type which has the same api as required by a range based for (which is kind of a build in template function). This means that you don't have to create a string_view, any type of range will do. Here, for instance, is a char_ptr_range for this case:
When we are talking ranges is it the same stuff SG9 is working on? I'm not sure if it's really important whether the input to the parsing methods are iterators or ranges, since in the end the latter should always be somehow convertible to the former for interoperability with the remaining standard library and there seems to be more disagreement on how to provide the results of the conversion, not the inputs. If all goes south a pair of iterators will always do the job...
...
Note that as we are aiming for a extensible set of conversions including user defined types, say WGS84 geospactial coordintaes there is also an open set of error codes, so an enum or int value is not enough. (The from_string may be wrapped in a template function which can't be expected to know the interpretation of an int error code for any T it may be instantiated for!
Now I want to check some common use cases for the solution with a return triplet. To complete the use cases we can also add a fourth member skipped which is true if we had to skip spaces. I think that the smartest way to solve this may be to provide a set of value() functions, but no cast operators:
Tentatively I call the "states" of the return value:
strict - no spaces skipped. All of the string could be converted.complete - space skipping ok, but no trailing junk.ok - space skipping ok, and trailing junk.bad - no parsing was possible, even after skipping spaces.
// The actual parsing is always the same:const auto r = parse<int>(char_ptr_range("123"));
int x = r.strict_value(); // throws if r is not strict.int y = r.complete_value() // throws if r is not completeint z = r.value() // throws if r is not okint w = r.value_or(17); // never throws
// expect holds a value or an exception_ptr. I think this is basically the same as boost::expect, which in turn was inspired by Andrei Alexandrescu's idea.// All ways of getting at the value except value_or() throws any// pending exception inside.// This should be refined as optional to avoid relying on a default// ctor in the error case.template <typename T> class expect {public: expect(exception_ptr ex) : m_exception(ex) {} expect(const T& val) : m_value(val) {}
operator bool() { return !m_exception; } exception_ptr exception() { return m_exception; }
operator T() { return value(); } // specialization removes this for bool (or follow optional<bool>'s example) T value() { if (m_exception) rethrow_exception(m_exception);
return m_value; } T value_or(const T& defval) { if (m_exception) return defval;
return m_value; }
private: T m_value; exception_ptr m_exception;};
enum StrToFlags { noleading = 1, notrailing = 2, complete = 4, strict = 7};
// Maybe better to get flags as a template parameter? Or offer both// versions? Not having to test flags on each call saves time and they// are going to be fixed per call site 99% of the time.template<typename T, typename RANGE> expect<T> str_to(RANGE& range, StrToFlags flags){ if (flags & noleading) { if (isspace(*begin(range))) return make_exception_ptr("No leading space allowed"); } else skipspace(range);
T tmp; ... Do the conversion into tmp and return on errors ...; if (flags & notrailing) { if (isspace(*begin(range))) return make_exception_ptr("No trailing space allowed"); } else skipspace(range);
if (flags & complete && begin(range) != end(range)) return make_exception_ptr("Junk after value"); return tmp;}
// Use cases
// Throw on any error:int val = strto<int>(char_ptr_range("123"), strict);
// Accept any errorint v2 = strto<int>(char_ptr_range("123")).value_or(17);
// Parse comma separated using lazy split:for (auto str : char_ptr_range("123, 234 ,345,34")) numbers.push_back(strto<int>(str, complete)); // allow leading and trailing space, but nothing else trailing except the comma.
// Use templated flag version to save time:for (auto str : char_ptr_range("123, 234 ,345,34")) numbers.push_back(strto<int, complete>(str)); // allow leading and trailing space, but nothing else trailing except the comma.
On 2014-02-04 19:55, Paul Tessier wrote:
int parse<T, U>(range<U> r, T& value, locale loc = default_locale)
{
auto const result = parse<T>(r, loc);
if (result) value = *result;
return // um... int? from whence do I get an int?
}
Except that with an out parameter no copies need be made, which depending
on cost of copying said type, this may be a bottle neck. Your version of
an out parameter composed of a value returning version forces a copy
regardless of the need for one.
Where?
A "good" implementation would emplace in the return value.
Assume that big_int requires the heap to allow for very big int's, say 10 to 2000 digits, a value returning version has no way to avoid allocating at each parse, regardless of move-assignment or RVO.
A parameter out version can reuse the same big_int and therefore potentially avoid the cost of new allocations at each parse.
It is always possible to take any snippet of code and replace it with a function that takes in and out parameters,
the reverse cannot be said for value returning functions.
Do you mean you actually have a real example of an iterator that cannot be dereferenced more than once? (What on earth would create such a thing?)
The actual whitespace
content is lost. What if I needed this information to increment a line
counter after parsing the number?
...then don't tell the parser to eat whitespace. (Note: the parser *must* have a strict mode... so I agree with you there. A mode that eats everything possible and then tells you how far it got may also be required. Anything else is probably in the 'nice to have' category.)
Except that with an out parameter no copies need be made, which depending on cost of copying said type, this may be a bottle neck. Your version of an out parameter composed of a value returning version forces a copy regardless of the need for one.
Ignoring whitespace, there is still a possibility for a lot configuration with how the user may want to do the parse:
- What radix do I use?
- Do I allow 0x and 0 (octal) prefixes?
- Do I allow + prefix?
- Locales?
- Commas or just digits?
First, I think we would need to decide on a complete set of options. The next big question is what is the best interface which will allow the user to specify them easily and what are the most sensible defaults?
Except that with an out parameter no copies need be made, which depending on cost of copying said type, this may be a bottle neck. Your version of an out parameter composed of a value returning version forces a copy regardless of the need for one.What happens with the out parameter when parsing fails? Is it in an undefined state? Or left unmodified? If the latter then parse() had to create a temporary and the entire allocation prevention and no-copy argument is down the drain.
I would prefer the value to be in a well defined state when parsing fails. Incrementing or accessing an iterator may throw an exception and if the value is then left in a partial state that's really bad. It also defeats the purpose of using it as a default fallback value. I'd rather have a strong exception guarantee for the value I'm passing into the function.
Since without optional<>, expected<> or return codes a failed parse might as well be indicated by throwing an exception (and that is the case in many languages and frameworks) a failed parse should have the same guarantees as if an exception were thrown in my oppinion.
Ignoring whitespace, there is still a possibility for a lot configuration with how the user may want to do the parse:I'm not a fan of specifying gazillions of options hence I proposed separate overloads for each task and "utility" methods to detect each parts separately (like parse_radix_prefix) to allow composition of the various parsers and make them available to users other than std implementers. Options should only exist where flags are mutually exclusive, like parsing radix 10 versus radix 11, both cannot be done at the same time and should therefore be specified as an argument. But anything that is optional (signs, prefixes, locales) can be provided as separete overloads or methods for composition
- What radix do I use?
- Do I allow 0x and 0 (octal) prefixes?
- Do I allow + prefix?
- Locales?
- Commas or just digits?
First, I think we would need to decide on a complete set of options. The next big question is what is the best interface which will allow the user to specify them easily and what are the most sensible defaults?
> We could make value() throw on error like std::expected, giving us a C style
> return code interface and C++ style exception interface all in one. If the
> user explicitly checks the error status before calling value(), its easy for
> the compiler to optimize out the conditional and the throwing logic
Are you sure it's easy? Got a reference?
> What radix do I use?
Have a base/radix parameter defaulting to 10.
> Do I allow 0x and 0 (octal) prefixes?
I'd allow 0x (when radix = 0), I'd never allow 0.
> Do I allow + prefix?
Sure, why not?
> Locales?
> Commas or just digits?
Both handled by a locale-aware variant.
> Do I allow 0x and 0 (octal) prefixes?
I'd allow 0x (when radix = 0), I'd never allow 0.Some people still use octal. I wouldn't remove support for it.
Em qua 05 fev 2014, às 15:08:04, Olaf van der Spek escreveu:
> > Do I allow 0x and 0 (octal) prefixes?
>
> I'd allow 0x (when radix = 0), I'd never allow 0.
Don't deviate from strtoll.
radix = 0 implies prefixes 0 and 0x are recognised. If the library is updated
with 0b prefix for binaries, then that too.
On 2014-02-05 07:34, Miro Knejp wrote:
>> Do you mean you actually have a real example of an iterator that
>> cannot be dereferenced more than once? (What on earth would create
>> such a thing?)
>
> That's not what I described. If I pass a single-pass InputIterator, for
> example istreambuf_iterator, to parse() it chews away N whitespaces and
> then fails to recognize a number then any information on the whitespaces
> is gone as I cannot go back and re-iterate the range.
This iterator is non-copyable? And/or incrementing it is destructive to
copies of the iterator? (I would hope not the latter, as that is
terrible API.)
If not, there should not be a problem. (Okay, given it is
istreambuf_iterator, I suppose I can imagine one or both of the above
being true. It's not obvious to me from either cplusplus.com or
cppreference.com if istreambuf_iterator is or is not copyable...)
> And I am on the same track as Matthew F. in that parse() should have one
> responsibility and one only: convert the textual representation of a
> value to a value, and nothing else.
I can live with that. (I'm not sure I ever felt handling whitespace was
*necessary*, just that I don't object to it as strongly as you.)
I do think we need at least one parsing option; whether or not to allow
trailing characters.
> What happens with the out parameter when parsing fails? Is it in an
> undefined state? Or left unmodified? If the latter then parse() had to
> create a temporary and the entire allocation prevention and no-copy
> argument is down the drain.
IMO it shall leave it unmodified. One of the arguments for an output
parameter was to implement defaults like:
int value = default_value;
parse(in, value);
That said, in defense of that argument, I could conceive of a
specialization that stores parts of the value in cheap-to-create values
and doesn't have to build the expensive type until it knows the parse is
okay. But I agree that that's tenuous, and more likely your point will
be true.
For simple types like int, I prefer the convention of leaving the out parameter unmodified on failure. That way you can initialize it yourself if you want the default behavior. Also I've had situations where I'm using parse to update some configuration variable. If the parse fails, i want it to fall back on whatever the previous value was. The unmodified behavior supports this use case very nicely. I don't don't have to cache the old value in a temporary and reassign it on error.
That being said, for expensive types like big_int you're absolutely right that this requires the creation of temporaries and destroys any performance benefit the out parameter design.
To avoid the temporary, you have to parse twice, once to check for correctness and again to actually load the value.
For arbitrary types, now the question becomes what state do you leave them in? One option is an indetermined but valid state. This might be the most efficient but a little more error prone. Another option is to create a default constructed temporary and swap with it (or copy from it). This is safer but may have performance implications if the default construction does any real work. Also it requires your type to be default constructable and (copy/move assignable or swapable).
Probably the indetermined state is best as it offers the least complications. For ints and floats, we can optimize by giving the unmodified guarantee and all of the benefits it entails.
This iterator is non-copyable? And/or incrementing it is destructive to copies of the iterator? (I would hope not the latter, as that is terrible API.)
If not, there should not be a problem. (Okay, given it is istreambuf_iterator, I suppose I can imagine one or both of the above being true. It's not obvious to me from either cplusplus.com or cppreference.com if istreambuf_iterator is or is not copyable...)
I do think we need at least one parsing option; whether or not to allow trailing characters.
Because you shouldn't deviate from strtoll. The support should be done first in
strtoll and then in whatever you're proposing. Yes, I know I'm asking you to
convince ISO C and the POSIX standard groups.
The reason being that most C++ standard library implementations will delegate
to strtoll or similar functions (like we do in Qt). Asking for functionality
different from strtoll means asking for more complexity from library
developers.
Alternatively, make sure that strtoll could be implemented on top of a plain C
library routine that is the backend for your new function. That would solve
the problem of complexity.
There is currently no reliable way to detect 1 and 3. The
contiguous iterator category proposal would solve 1, but 3
requires dereferencing the end iterator and thus UB. That limits
our options alot. If implementability with strtoxx is a
requirement you can just drop the iterators, templates, locales
and mark this thread as closed since the limitations of strtoxx
started it in the first place.
I suppose you could have the "real implementation" version always accept
extra characters and return the end position, and provide a wrapper that
implements the aforementioned check. But that wrapper is important, as
that's what is going to be used more often than not. I would consider
the proposal incomplete if it does not provide that API.
Em qua 05 fev 2014, às 21:20:52, Miro Knejp escreveu:
> > The reason being that most C++ standard library implementations will
> > delegate
> > to strtoll or similar functions (like we do in Qt). Asking for
> > functionality
> > different from strtoll means asking for more complexity from library
> > developers.
> >
> > Alternatively, make sure that strtoll could be implemented on top of a
> > plain C
> > library routine that is the backend for your new function. That would
> > solve
> > the problem of complexity.
>
> While it makes sense and sounds great, you can only implement it in
> terms of strtoll if
>
> 1. The iterators point to contiguous memory, and
> 2. The value type is char, and
> 3. They range is null terminated.
>
> There is currently no reliable way to detect 1 and 3. The contiguous
> iterator category proposal would solve 1, but 3 requires dereferencing
> the end iterator and thus UB. That limits our options alot. If
> implementability with strtoxx is a requirement you can just drop the
> iterators, templates, locales and mark this thread as closed since the
> limitations of strtoxx started it in the first place.
If you go for my alternative proposal, then you skip the need for 3. That is,
implementing strtoll on top of a plain C function that operates on contiguous
memory and receives a begin and end pointer.
However, I do think 1 and 2 are *reasonable*.
I think the whole discussion
about input iterators that this discussion has gone on for the past few days
is unnecessary. Simply force people to read into a contiguous-memory buffer of
one of the base char types.
In specific: I'd like this discussion to assume that the number parsing code is
implemented out-of-line. Inline parsing is, forgive me for saying so, nuts.
You could maybe do it for integers, but you'd never do it for floating-point
results.
string to T (int, float, etc..) conversions seem like to rather easy task (aside from floating point round trip issues), and yet for the life of C and C++ the standard library has consistently failed to provide a decent interface.
Lets review:int atoi(const char* s); //and atoll,atol,atoll, atof etc..Whats wrong with this?
- Returns 0 on parsing failure, making it impossible to parse 0 strings. This already renders this function effectively useless and we can skip the rest of the bullet points right here.
- It discards leading whitespace, this has several problems of its own:
- If we want to check whether the string is strictly a numeric string, we have to add our own check that the first character is a digit. This makes the interface clumsy to use and easy to screw up.
- std::isspace() is locale dependent and requires an indirect function call (try it on gcc.godbolt.org). This makes what could be a very simple and inlinable conversion potentially expensive. It also prevents constexpr.
- From a design standpoint, this whitespace handling is a very narrow use case. It does too many things and in my opinion is a bad design. I often do not have whitespace delimited input in my projects.
- No atod() for doubles or atold() for long doubles.
- No support for unsigned types, although this may not actually be a problem.
- Uses horrible C interface (type suffixes in names) with no overloading or template arguments. What function do we use if we want to parse an int32_t?
long strtol(const char* str, char **str_end, int base);Whats wrong with this one?
- Again it has this silly leading whitespace behavior (see above).
- Its not obvious how to correctly determine whether or not parsing failed. Every time I use this function I have to look it up again to make sure I get it exactly right and have covered all of the corner cases.
- Uses 0/T_MAX/T_MIN to denote errors, when these could be validly parsed from strings. Checking whether or not these values were parsed or are representing errors is clumsy.
- Again C interface issues (see above).
At this point, I think we are ready to define a new set of int/float parsing routines.
Design goals:
- Easy to use, usage is obvious.
- No assumptions about use cases, we just want to parse strings. This means none of this automatic whitespace handling.
- Efficient and inline
- constexpr
Here is a first attempt for an integer parsing routine.//Attempts to parse s as an integer. The valid integer string consists of the following://* '+' or '-' sign as the first character (- only acceptable for signed integral types)//* prefix (0) indicating octal base (applies only when base is 0 or 8)//* prefix (0x or 0X) indicating hexadecimal base (applies only when base is 16 or 0).//* All of the rest of the characters MUST be digits.//Returns true if an integral value was successfully parsed and stores the value in val,//otherwise returns false and leaves val unmodified.//Sets errno to ERANGE if the string was an integer but would overflow type integral.template <typename integral>
constexpr bool strto(string_view s, integral& val, int base);//Same as the previous, except that instead of trying to parse the entire string, we only parse the integral part.
//The beginning of the string must be an integer as specified above. Will set tail to point to the end of the string after the integral part.template <typename integral>constexpr bool strto(string_view s, integral& val, int base, string_view& tail);First off, all of these return bool which makes it very easy to check whether or not parsing failed.While the interface does not allow this idom:int x = atoi(s);It works with this idiom which in all of my use cases is much more common:int val;if(!strto(s, val, 10)) {throw some_error();
}printf("We parsed %d!\n", val);Some examples:int val;string_view sv= "12345";assert(strto(sv, val, 10));assert(val == 12345);sv = "123 456";val = -2;assert(!strto(sv, val, 10));assert(val == -2);assert(strto(sv, val, 10, sv));assert(val == 123);assert(sv == " 456");sv.remove_prefix(1); //chop off the " ";assert(sv == "456");assert(strto(sv, val, 10));assert(val = 456);val = 0;assert(strto(sv, val, 10, sv));assert(val == 456);assert(sv == "");Similarly we can define this for floating point types. We may also want null terminated const char* versions as converting a const char* to sting_view requires a call to strlen().
Well that's a problem of codecvt and friends.
But as long as codecvt does not have a method to consume characters incrementally I see no way to go without some sort of temporary output buffer.
I was talking about parsing a non-contiguous block of data and depending on
the iterator. Parsing character by character (whichever character type) is the
problem here.