namespace std {
namespace ascii {
constexpr int isspace(char c) noexcept { return c == ' ' || c == '\f' || c == '\n' || c == '\r' || c == '\t' || c == '\v'; }
} //namespace ascii
} //namespace std
template <typename T>
bool decode_hex_string(const char *string, T &output)
{
    enum : std::ptrdiff_t { NUM_NIBBLES = std::numeric_limits<T>::digits / 4 };
    static_assert(std::is_unsigned<T>::value, "only unsigned types are allowed");
    static_assert(std::numeric_limits<T>::digits % 4 == 0,
                  "the type must have a number of bits that's a multiple of 4");
    T result = 0;
    for (const char *current = string; current - string <= NUM_NIBBLES; ++current)
    {
        char ch = *current;
        if (ch == '\0')
        {
            if (current == string)
                break;
            output = result;
            return true;
        }
        // The lovely extra parentheses and the static_cast.
        if (!(std::ascii::isxdigit)(static_cast<unsigned char>(ch)))
            return false;
        int nibble = (std::ascii::toxdigit)(static_cast<unsigned char>(ch));
        // Safe even if promotion results in a signed integer type--
        // the Standard allows left-shifting into the sign bit.
        result <<= 4;
        // This, however, is not safe, so protect against mishaps.
        result = static_cast<T>(static_cast<unsigned>(nibble) +
            static_cast<std::make_unsigned_t<decltype(+result)>>(+result));
    }
    return false;
}
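The function above leans on std::ascii::isxdigit and std::ascii::toxdigit, which exist only in the proposal being discussed. As a rough sketch of what such helpers might look like: the namespace hex_sketch and the "-1 for a non-digit" convention are placeholders chosen here for illustration, not taken from the draft, and the comparisons assume an ASCII-compatible execution character set.

```cpp
namespace hex_sketch {

// True for '0'-'9', 'a'-'f' and 'A'-'F'. Takes unsigned char to match the
// static_cast used at the call sites in decode_hex_string.
constexpr bool isxdigit(unsigned char c) noexcept
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

// Numeric value of a hex digit, or -1 if c is not one (assumed convention).
constexpr int toxdigit(unsigned char c) noexcept
{
    return (c >= '0' && c <= '9') ? c - '0'
         : (c >= 'a' && c <= 'f') ? c - 'a' + 10
         : (c >= 'A' && c <= 'F') ? c - 'A' + 10
         : -1;
}

} // namespace hex_sketch

static_assert(hex_sketch::isxdigit('c'), "lowercase hex digit");
static_assert(hex_sketch::toxdigit('C') == 12, "uppercase hex digit value");
```

Both helpers are single-expression constexpr functions, so they can be checked at compile time as shown.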
On Friday 26 September 2014 20:52:57 Myriachan wrote:
> How should wchar_t, char16_t and char32_t be handled? int is not
I'm not sure they have to. But we can add the char16_t and char32_t overloads
because ASCII is a strict subset of UTF-16 and UTF-32, so the code is exactly
the same.
> guaranteed to be large enough to hold them. One way is to have isascii()
> be a template function that actually has meaning, returning whether its
> parameter is within +00 to +7F (or +7E, if you don't count the +7F
> DEL control character).
I don't see how char16_t or char32_t wouldn't be wide enough to contain ASCII
if, by definition, they contain UTF-16 and UTF-32 codepoints that are a
superset of ASCII.
I don't think this needs to be a template either, but I would think that
char16_t and char32_t make sense.
Let's just drop wchar_t.
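To illustrate the point that the code would be "exactly the same" for each character type, here is a minimal sketch of one classifier with three overloads. The namespace overload_sketch is made up for this example, and the char overload assumes the narrow execution character set is ASCII-compatible.

```cpp
namespace overload_sketch {

// One body, three character types: ASCII code points have the same values
// as UTF-16 code units and as UTF-32 code points.
constexpr bool isdigit(char c) noexcept     { return c >= '0' && c <= '9'; }
constexpr bool isdigit(char16_t c) noexcept { return c >= u'0' && c <= u'9'; }
constexpr bool isdigit(char32_t c) noexcept { return c >= U'0' && c <= U'9'; }

} // namespace overload_sketch

static_assert(overload_sketch::isdigit('7'), "narrow digit");
static_assert(overload_sketch::isdigit(u'7'), "UTF-16 digit");
static_assert(!overload_sketch::isdigit(U'\u732B'), "a non-ASCII code point is not a digit");
```

Because each overload takes its own character type, nothing is ever funneled through int, so the width of int does not matter.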
> How should negative integers be handled? It has always been very annoying
> that calling C's isdigit, etc. requires static_cast<unsigned char>(c) all
> the time, otherwise you assert in debug builds when your input is UTF-8,
> or, worse, cause undefined behavior--reading a table in la-la land--in
> release builds.
I'm not sure how this is even a question. Negative numbers aren't ASCII,
therefore they fail any trait test.
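// Undefined behavior where char is signed: *string is negative here.
const char *string = u8"猫";
if (!isdigit(*string))
    return false;
// Well-defined: convert to unsigned char before calling isdigit.
const char *string = u8"猫";
if (!isdigit(static_cast<unsigned char>(*string)))
    return false;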
> What about the fact that many, if not most, C implementations currently
> implement these functions as macros? Calling these functions will be
> annoying like it is with numeric_limits: combined with the above, you'll
> have to call them like this:
That is an issue. We'd have to find an identifier that doesn't collide.
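The parenthesized calls mentioned in the quoted question rely on a preprocessor rule: a function-like macro is expanded only when its name is directly followed by an opening parenthesis. A small sketch of that effect, where is_digit_demo is a made-up macro standing in for a C library classifier implemented as a macro:

```cpp
// Hypothetical function-like macro, standing in for a C library that
// defines its <ctype.h> classifiers as macros.
#define is_digit_demo(c) (false)

namespace macro_sketch {

// Parenthesizing the name keeps the preprocessor from treating this as a
// macro invocation, so a real function with the same name can be declared.
constexpr bool (is_digit_demo)(char c) noexcept { return c >= '0' && c <= '9'; }

} // namespace macro_sketch

// Unparenthesized call: the macro wins and the function is never reached.
static_assert(!is_digit_demo('5'), "macro expansion yields false");
// Parenthesized call: the macro is suppressed and the function runs.
static_assert((macro_sketch::is_digit_demo)('5'), "function call yields true");
```

This is the same trick commonly used for (std::numeric_limits<T>::max)() on platforms whose headers define a max macro.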
On Friday, September 26, 2014 11:50:10 PM UTC-7, Thiago Macieira wrote:
> On Friday 26 September 2014 20:52:57 Myriachan wrote:
> > How should wchar_t, char16_t and char32_t be handled? int is not
> I'm not sure they have to. But we can add the char16_t and char32_t overloads
> because ASCII is a strict subset of UTF-16 and UTF-32, so the code is exactly
> the same.
I meant that, for example, on systems with a 16-bit "int", attempting to pass a char32_t to std::ascii::isdigit(int) would be bad. But yes, with overloads for them, this problem is avoided.
> > How should negative integers be handled? It has always been very annoying
> > that calling C's isdigit, etc. requires static_cast<unsigned char>(c) all
> > the time, otherwise you assert in debug builds when your input is UTF-8,
> > or, worse, cause undefined behavior--reading a table in la-la land--in
> > release builds.
> I'm not sure how this is even a question. Negative numbers aren't ASCII,
> therefore they fail any trait test.
This question is actually quite relevant. This has been a source of crash bugs with the C library functions for decades. On an implementation where "char" is signed, which is most, the following is undefined behavior:
const char *string = u8"猫";
if (!isdigit(*string))
    return false;
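One common way to stay clear of that undefined behavior with today's C library is to route every call through unsigned char. The sketch below shows the idea; is_digit_byte, all_digits and demo_all_digits are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical helper: converts to unsigned char first, so bytes of a
// multi-byte UTF-8 sequence never reach std::isdigit as negative ints.
inline bool is_digit_byte(char c)
{
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// True only if the NUL-terminated string is non-empty and all digits.
inline bool all_digits(const char *s)
{
    if (*s == '\0')
        return false;
    for (; *s != '\0'; ++s)
        if (!is_digit_byte(*s))
            return false;
    return true;
}

// Usage sketch.
inline void demo_all_digits()
{
    assert(all_digits("2014"));
    // The three UTF-8 bytes of U+732B: rejected without undefined behavior.
    assert(!all_digits("\xE7\x8C\xAB"));
}
```

The cast has to be repeated at every call site unless it is wrapped like this, which is the annoyance the thread is complaining about.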
I would like to suggest that the comparison is done in a traits class
so we can add other "constexpr-essional" character sets later on; an
obvious example is XML characters.
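A traits-based layout along those lines might look roughly like the following. Everything here (traits_sketch, ascii_traits, xml_traits, is_name_start) is invented for illustration, and the XML traits cover only a small, deliberately incomplete slice of the XML NameStartChar production.

```cpp
namespace traits_sketch {

// Hypothetical traits for plain ASCII identifiers.
struct ascii_traits {
    static constexpr bool is_name_start(char32_t c) noexcept
    {
        return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') || c == U'_';
    }
};

// Hypothetical traits for XML names. Only ':' and two Latin-1 ranges are
// added on top of ASCII; the real production has many more ranges.
struct xml_traits {
    static constexpr bool is_name_start(char32_t c) noexcept
    {
        return ascii_traits::is_name_start(c) || c == U':'
            || (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6);
    }
};

// The comparison itself is delegated to the traits class.
template <typename Traits>
constexpr bool is_name_start(char32_t c) noexcept
{
    return Traits::is_name_start(c);
}

} // namespace traits_sketch

static_assert(traits_sketch::is_name_start<traits_sketch::xml_traits>(U':'), "XML allows ':'");
static_assert(!traits_sketch::is_name_start<traits_sketch::ascii_traits>(U':'), "ASCII traits do not");
```

New character sets can then be added as further traits classes without touching the generic function.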
On Saturday 27 September 2014 00:25:30 Myriachan wrote:
> I meant that "int" can't necessarily hold char16_t and char32_t.
I don't see how "int" got into the discussion.
std::ascii::isxxx should take char, char16_t and char32_t. It stands to reason
that you already know that a given character is ASCII before using the
function. A wchar_t overload can be present only if the implementation knows
at compile time that wchar_t is always a superset of ASCII.
The use of int for ctype.h is an extreme legacy from C's obscure past.
> > I don't think this needs to be a template either, but I would think that
> > char16_t and char32_t make sense.
> >
> > Let's just drop wchar_t.
>
> Even when the rest of STL supports it, like std::iswdigit?
Yes. Let's just drop it because of its C legacy that says its encoding can be
anything. Let's stick to ASCII, UTF-16 and UTF-32.
> > I'm not sure how this is even a question. Negative numbers aren't ASCII,
> > therefore they fail any trait test.
>
> This question is actually quite relevant.
[snip the rest]
Looks like the question is relevant because you were thinking of these
functions taking int as parameters, not char.
If they take char, then the point is moot.
> I just looked and this may actually not be a problem. I looked at Linux
> GCC and Visual Studio 2010 and 2013, and the macro versions are #ifndef
> __cplusplus. This makes sense, with the std::isdigit versions taking two
> arguments existing.
Better.
On Saturday, September 27, 2014 6:35:15 PM UTC-7, Thiago Macieira wrote:
> On Saturday 27 September 2014 00:25:30 Myriachan wrote:
> > I meant that "int" can't necessarily hold char16_t and char32_t.
> I don't see how "int" got into the discussion.
> std::ascii::isxxx should take char, char16_t and char32_t. It stands to reason
> that you already know that a given character is ASCII before using the
> function. A wchar_t overload can be present only if the implementation knows
> at compile time that wchar_t is always a superset of ASCII.
> The use of int for ctype.h is an extreme legacy from C's obscure past.
His draft used "int"; that's why I mentioned it and pointed out why it's problematic.
https://github.com/fmatthew5876/stdcxx-ascii/blob/master/proposal/draft.html
I would say the implementation should provide those if it can be sure that the
wide execution charset at runtime is a superset of ASCII. That's the case on
Windows (wchar_t == char16_t).
If it can't be sure or if the wide char encoding is changeable at runtime,
then those functions should be omitted.
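As a sketch of how an implementation might make that decision at compile time: __STDC_ISO_10646__ is the standard macro promising that wchar_t holds Unicode code points, and _WIN32 stands in for the Windows case mentioned above. Treating these two macros as a sufficient signal is an assumption of this example, as is the wide_sketch namespace.

```cpp
namespace wide_sketch {

// Always available.
constexpr bool isdigit(char32_t c) noexcept { return c >= U'0' && c <= U'9'; }

// Assumption: either macro is an adequate compile-time signal that wchar_t
// values in the ASCII range mean ASCII. Otherwise the overload is omitted.
#if defined(__STDC_ISO_10646__) || defined(_WIN32)
constexpr bool isdigit(wchar_t c) noexcept { return c >= L'0' && c <= L'9'; }
#endif

} // namespace wide_sketch

static_assert(wide_sketch::isdigit(U'4'), "char32_t overload is unconditional");
```

On a platform where neither macro is defined, a call with a wchar_t argument is not guaranteed to do the right thing, since it falls back to an implicit conversion to char32_t; a stricter sketch would delete that overload instead.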
The ascii::isX() functions are designed for text that comes from platform-agnostic sources, such as data files, where the encoding is known a priori.
If you have the platform-specific knowledge to know your wchar_t is ASCII-compatible, then you also have the platform-specific knowledge to know whether it should be a char16_t or char32_t. Someone who, for example, wants to parse a Windows path using ASCII methods could convert the wchar_t's to char16_t.
Am I understanding your argument correctly?
On Wednesday, October 1, 2014 4:47:15 PM UTC-7, Matthew Fioravante wrote:
> The ascii::isX() functions are designed for text that comes from platform-agnostic sources, such as data files, where the encoding is known a priori.
> If you have the platform-specific knowledge to know your wchar_t is ASCII-compatible, then you also have the platform-specific knowledge to know whether it should be a char16_t or char32_t. Someone who, for example, wants to parse a Windows path using ASCII methods could convert the wchar_t's to char16_t.
> Am I understanding your argument correctly?
You could make the same argument to say that there is no reason to support anything but char32_t. Everything above applies equally to char as it does to wchar_t.
If you know beforehand that your char buffer comes from platform-agnostic sources and contains ASCII text, which is the whole point of this thread, you can pass each char converted to char32_t and get the same result. Every one of these functions returns false or has undefined behavior for the ranges of characters U+00D800 through U+00DFFF inclusive and U+010000 through U+10FFFF inclusive, so supporting char16_t serves no purpose either; just convert all char16_t values to char32_t before calling the char32_t versions.
The above is meant to just be reductio ad absurdum. I think it's better to just support them all, since sprinkling static_casts every time we call these functions is just annoying.
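A minimal sketch of "support them all": one real implementation on char32_t, plus thin overloads that do the conversion so callers need no static_cast. The forward_sketch namespace is made up, and the sketch assumes the narrow and wide execution character sets are ASCII-compatible.

```cpp
namespace forward_sketch {

// The single "real" implementation, working on code points.
constexpr bool isupper(char32_t c) noexcept { return c >= U'A' && c <= U'Z'; }

// Thin overloads. A negative char converts to a very large char32_t value,
// which is simply not ASCII, so the result is false rather than undefined.
constexpr bool isupper(char c) noexcept     { return isupper(static_cast<char32_t>(c)); }
constexpr bool isupper(char16_t c) noexcept { return isupper(static_cast<char32_t>(c)); }
constexpr bool isupper(wchar_t c) noexcept  { return isupper(static_cast<char32_t>(c)); }

} // namespace forward_sketch

static_assert(forward_sketch::isupper('Q'), "char");
static_assert(forward_sketch::isupper(u'Q'), "char16_t");
static_assert(forward_sketch::isupper(L'Q'), "wchar_t");
static_assert(!forward_sketch::isupper(static_cast<char>(-25)), "non-ASCII byte");
```

The casts still exist, but they live in one place inside the library instead of at every call site.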
By the way, no love for strtod/wcstod? I'd love to avoid having to call nonstandard functions like _strtod_l with a custom-made "C" locale object in order to make sure that '.' is always the decimal separator and not ','. There are times for locale handling and there are times for consistency.
Melissa
// This is how you have it now in cctype.hh
constexpr bool isascii(int c) noexcept
{
    return c & 0x7F;
}
// This is how I imagine it should work.
constexpr bool isascii(int c) noexcept
{
    return (c >= 0 && c <= 0x7F);
}
I might be misunderstanding its function, but the isascii(int) function might give invalid results as implemented. It returns false for a null character and true whenever any of the low 7 bits is set, even for values outside the 0 to 0x7F range. This does not match the description of the function, where it returns true if the value passed is >= 0 and <= 0x7F.
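To make the difference concrete, here are the two variants side by side with a few values checked at compile time. The names masked and ranged, and the namespace, are labels chosen for this comparison only.

```cpp
namespace isascii_sketch {

// As currently written: true whenever any of the low seven bits is set.
constexpr bool masked(int c) noexcept { return c & 0x7F; }

// As described: true exactly for 0 through 0x7F.
constexpr bool ranged(int c) noexcept { return c >= 0 && c <= 0x7F; }

} // namespace isascii_sketch

// NUL is ASCII, but the masked version rejects it.
static_assert(!isascii_sketch::masked(0) && isascii_sketch::ranged(0), "NUL");
// 0xC1 is not ASCII, but its low bits (0x41) make the masked version accept it.
static_assert(isascii_sketch::masked(0xC1) && !isascii_sketch::ranged(0xC1), "0xC1");
// Negative values are not ASCII either; -1 has all low bits set.
static_assert(isascii_sketch::masked(-1) && !isascii_sketch::ranged(-1), "-1");
```

The two versions agree on ordinary characters such as 'A', which may be why the problem is easy to miss in casual testing.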
constexpr bool std::ascii:isalnum(char16_t c) noexcept;
constexpr bool std::ascii:isalnum(char32_t c) noexcept;
Return: std::ascii::isalpha(c) || std::asci::isdigit(c)
Needs two colons before isalnum. This problem is systemic throughout the Character Checks part. isalnum specifically also has a typo where asci should be ascii.
constexpr int todigit(char c, int m) noexcept;
constexpr int todigit(wchar_t c, int m) noexcept;
constexpr int todigit(char16_t c, int m) noexcept;
constexpr int todigit(char32_t c, int m) noexcept;
Return: std::ascii::todigit(c) * m
I think that you intended these to say std::ascii:: before todigit.
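For readers trying to picture the two-argument form, a rough sketch follows. The precondition that c is already known to be a decimal digit is an assumption made here (the draft's behavior for other input is not quoted in this thread), and todigit_sketch is a made-up namespace.

```cpp
namespace todigit_sketch {

// Value of a decimal digit character.
// Precondition (assumed for this sketch): c is in '0'..'9'.
constexpr int todigit(char c) noexcept { return c - '0'; }

// Same, scaled by a multiplier, following the quoted
// "Return: std::ascii::todigit(c) * m".
constexpr int todigit(char c, int m) noexcept { return todigit_sketch::todigit(c) * m; }

} // namespace todigit_sketch

// "427" assembled digit by digit.
static_assert(todigit_sketch::todigit('4', 100)
            + todigit_sketch::todigit('2', 10)
            + todigit_sketch::todigit('7') == 427, "4*100 + 2*10 + 7");
```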
Return: (c >= 33 && c<= 126)
Space after second c.
Words that should be in monospace font:
"This proposal adds a set of //constexpr// free functions"
Phrases that need hyphens:
digit-to-int
high-performance
error-prone
ASCII-compatible
char-to-int
platform-agnostic
platform-specific
user-defined
std-proposals (because that's the name of this list)
Typos / misspellings:
Change: ~large data files which use ascii delimeters.
To: ~ASCII delimiters.
Change: ~github.
To: ~GitHub.
Change: We propose 2 additional useful functions todigit, and toxdigit.
To: ~functions: todigit and toxdigit. (Add colon, remove comma. Oxford comma is awkward with two items.)
Change: 2
To: two (this is a generic change to make things more readable)
Change: Alternatively, all of these defintions
To: Alternatively, all these definitions (delete "of", fix spelling of "definitions")
Change: Each function has overloads for type char, wchar_t, char16_t, and char32_t.
To: (Delete final comma. This is stylistic, though, and definitely up to you, of course.)
Change: [note-- All of these~
To: [Note: All these~ (If you want to follow the Standard's formatting, it's non-italic [ followed by italic Note: and non-italic note text. Delete "of".)
Change: locale settings. --end-note]
To: locale settings. —end note] (Similarly, this is an "em-dash" (U+002014), "end note" in italic (no hyphen between "end" and "note") and a non-italic ] )
Change: Return:
To: Returns:
Change: On systems such as windows
To: On systems such as Windows
Change: where wchar_t happens to be using an ASCII-compatible encoding
To: where wchar_t happens to be using an ASCII-compatible encoding (UTF-16)
Change: presense
To: presence
Change: standard committee
To: Standard Committee
Change: Csv file parsers
To: .csv file parsers or CSV file parsers
Change: decide that wchar_t is a bad idea
To: decide that wchar_t support in this proposal is a bad idea
Wasn't there previously a fromdigit? Also, what you have now as todigit seems more like a fromdigit than a todigit, because it just passed an isdigit check...
constexpr bool std::ascii::isascii(char c) noexcept;
constexpr bool std::ascii::isascii(wchar_t c) noexcept;
constexpr bool std::ascii::isascii(char16_t c) noexcept;
constexpr bool std::ascii::isascii(char32_t c) noexcept;
Returns: true if c >= 0 && c < 128
I love your proposal, though, since where I work we've had to implement an equivalent of these functions on who knows how many projects at this point...
Melissa
The first question--about fromdigit--was more about whether there was an inverse operation: the equivalent of "0123456789abcdef"[x].
This is less necessary, though, since it's obviously very simple to implement. =)
toint works as a name, I think, but sucks for readability; to_int would be better for readability, but doesn't fit the pattern of the rest of the functions' names. If the function were named toint or similar, the inverse operation could be named tochar or similar. I suppose that for tochar, lowercase would be what is returned for digits 10-15. A Boolean parameter to select whether to use uppercase could be possible, or a bool template parameter, but just wrapping toupper around tochar would work.
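The inverse operation described above can be sketched in a few lines. The names tochar and toupper follow the naming floated in this message, but the signatures, the precondition and the tochar_sketch namespace are assumptions of this example; the toupper here is a minimal ASCII-only stand-in, not the C library function.

```cpp
namespace tochar_sketch {

// Lowercase digit character for a value 0-15.
// Precondition (assumed for this sketch): 0 <= x <= 15.
constexpr char tochar(int x) noexcept { return "0123456789abcdef"[x]; }

// Minimal ASCII-only uppercase conversion; other characters pass through.
constexpr char toupper(char c) noexcept
{
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

} // namespace tochar_sketch

static_assert(tochar_sketch::tochar(11) == 'b', "lowercase by default");
static_assert(tochar_sketch::toupper(tochar_sketch::tochar(11)) == 'B', "wrap in toupper for uppercase");
static_assert(tochar_sketch::toupper(tochar_sketch::tochar(7)) == '7', "decimal digits are unaffected");
```

Wrapping toupper around tochar keeps the interface to a single parameter, at the cost of one extra comparison per digit compared with a dedicated uppercase table.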
I think that this should be changed to <= 127, in order to avoid issues related to 128 not being a valid char on platforms with an 8-bit signed char (i.e., most). I'm sure that readers would have understood what was meant, so it's really just me being pedantic again:
constexpr bool std::ascii::isascii(char c) noexcept;
constexpr bool std::ascii::isascii(wchar_t c) noexcept;
constexpr bool std::ascii::isascii(char16_t c) noexcept;
constexpr bool std::ascii::isascii(char32_t c) noexcept;
Returns: true if c >= 0 && c < 128