constexpr ascii character checks


Matthew Fioravante

Sep 23, 2014, 9:03:52 PM
to std-pr...@isocpp.org
A few times, I have written hand-rolled ASCII routines as optimized versions of the routines in <cctype>.

For example:

namespace std {
namespace ascii {
constexpr int isspace(char c) noexcept {
    return c == ' ' || c == '\f' || c == '\n' || c == '\r' || c == '\t' || c == '\v';
}
} //namespace ascii
} //namespace std

In some cases I found these to be much faster than the <cctype> equivalents because <cctype> uses runtime dispatch to handle the global locale setting. I used these when I had a precondition that the input string was ASCII. ASCII is still important these days because it's very easy to parse. Many large data sets use ASCII separators, such as large CSV files exported from Microsoft Excel. For very large data files, parsing can take a significant portion of the entire application run time.

Runtime dispatch can be very slow because it limits the compiler's and the hardware's opportunities to optimize branching. For ASCII, these checks are very simple. They can be inlined, which might also allow an optimizer to vectorize character checks performed in a loop over a string. Finally, this version is also constexpr, allowing use in template logic and constexpr string code.
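
To make the point concrete, here is a minimal sketch, assuming the input is already known to be ASCII. The namespace ascii_sketch and the helper count_spaces are illustrative names of mine, not part of the proposal.

```cpp
#include <cstddef>

namespace ascii_sketch {  // hypothetical namespace, not from the proposal

// Plain comparisons: no locale lookup, usable in constant expressions.
constexpr bool is_space(char c) noexcept {
    return c == ' ' || c == '\f' || c == '\n' ||
           c == '\r' || c == '\t' || c == '\v';
}

// A simple loop over a buffer. Because is_space can be inlined, the
// optimizer is free to unroll or vectorize it (this is not guaranteed).
inline std::size_t count_spaces(const char* s, std::size_t n) noexcept {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        count += is_space(s[i]) ? 1 : 0;
    return count;
}

}  // namespace ascii_sketch

// Evaluated entirely at compile time.
static_assert(ascii_sketch::is_space('\t'), "tab is whitespace");
static_assert(!ascii_sketch::is_space('x'), "'x' is not whitespace");
```

Whether the loop actually gets vectorized depends on the compiler and flags; the sketch only shows that nothing in the check prevents it.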

A further direction could be to also support Unicode character checks, but I'm not sure that even makes sense. Asking what a digit is in an encoding that covers all possible languages is a somewhat ambiguous question.



Douglas Boffey

Sep 25, 2014, 9:32:12 AM
to std-pr...@isocpp.org
I like it :)

Matthew Fioravante

Sep 25, 2014, 9:36:00 AM
to std-pr...@isocpp.org
I've put up a draft proposal and header here:

https://github.com/fmatthew5876/stdcxx-ascii

I'm also trying to research if including isascii() is worthwhile. Part of that research involves this SO question:

http://stackoverflow.com/questions/26030928/why-is-isascii-deprecated

Matthew Fioravante

Sep 26, 2014, 2:53:52 PM
to std-pr...@isocpp.org
A final version of the proposal is here:

https://github.com/fmatthew5876/stdcxx-ascii

Does anyone have any feedback or suggestions? Is the lack of response an indication that this is not a good idea?

Thanks!

Ville Voutilainen

Sep 26, 2014, 3:04:38 PM
to std-pr...@isocpp.org
Seems quite decent to me. To the LEWG!

Myriachan

Sep 26, 2014, 11:52:58 PM
to std-pr...@isocpp.org

How should wchar_t, char16_t and char32_t be handled?  int is not guaranteed to be large enough to hold them.  One way is to have isascii() be a template function that actually has meaning, returning whether its parameter is within +00 to +7F (or +7E, if you don't count the +7F DEL character).
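
A rough sketch of what such a template could look like. The name is_ascii and the namespace ascii_sketch are placeholders, and this is only one possible way to write the range test, not the wording of any proposal.

```cpp
#include <type_traits>

namespace ascii_sketch {  // hypothetical namespace

// True when the value lies in the 7-bit range [0x00, 0x7F].
// Intended for char, wchar_t, char16_t, char32_t and plain integer types.
template <typename CharT>
constexpr bool is_ascii(CharT c) noexcept {
    // Converting to the unsigned counterpart maps negative values to large
    // positive ones, so a single comparison covers both ends of the range.
    return static_cast<typename std::make_unsigned<CharT>::type>(c) <= 0x7F;
}

}  // namespace ascii_sketch
```

Because the parameter keeps its own type, nothing is narrowed to int, which sidesteps the 16-bit int concern.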

How should negative integers be handled?  It has always been very annoying that calling C's isdigit, etc. requires static_cast<unsigned char>(c) all the time, otherwise you assert in debug builds when your input is UTF-8, or, worse, cause undefined behavior--reading a table in la-la land--in release builds.

What about the fact that many, if not most, C implementations currently implement these functions as macros?  Calling these functions will be annoying like it is with numeric_limits: combined with the above, you'll have to call them like this:

template <typename T>
bool decode_hex_string(const char *string, T &output)
{
    enum : std::ptrdiff_t { NUM_NIBBLES = std::numeric_limits<T>::digits / 4 };
    static_assert(std::is_unsigned<T>::value, "only unsigned types are allowed");
    static_assert(std::numeric_limits<T>::digits % 4 == 0,
        "the type must have a number of bits that's a multiple of 4");

    T result = 0;

    for (const char *current = string; current - string <= NUM_NIBBLES; ++current)
    {
        char ch = *current;
        if (ch == '\0')
        {
            if (current == string)
                break;
            output = result;
            return true;
        }

        // The lovely extra parentheses and the static_cast.
        if (!(std::ascii::isxdigit)(static_cast<unsigned char>(ch)))
            return false;
        int nibble = (std::ascii::toxdigit)(static_cast<unsigned char>(ch));

        // Safe even if promotion results in a signed integer type--
        // the Standard allows left-shifting into the sign bit.
        result <<= 4;
        // This, however, is not safe, so protect against mishaps.
        result = static_cast<T>(static_cast<unsigned>(nibble) +
            static_cast<std::make_unsigned_t<decltype(+result)>>(+result));
    }
    return false;
}



Thiago Macieira

Sep 27, 2014, 2:50:10 AM
to std-pr...@isocpp.org
On Friday 26 September 2014 20:52:57 Myriachan wrote:
> How should wchar_t, char16_t and char32_t be handled? int is not

I'm not sure they have to. But we can add the char16_t and char32_t overloads
because ASCII is a strict subset of UTF-16 and UTF-32, so the code is exactly
the same.

> guaranteed to be large enough to hold them. One way is to have isascii()
> be a template function that actually has meaning, returning whether its
> parameter is within +00 to +7F (or +7E, if you don't count the +7F
> DEL character).

I don't see how char16_t or char32_t wouldn't be wide enough to contain ASCII
if, by definition, they contain UTF-16 and UTF-32 codepoints that are a
superset of ASCII.

I don't think this needs to be a template either, but I would think that
char16_t and char32_t make sense.

Let's just drop wchar_t.

> How should negative integers be handled? It has always been very annoying
> that calling C's isdigit, etc. requires static_cast<unsigned char>(c) all
> the time, otherwise you assert in debug builds when your input is UTF-8,
> or, worse, cause undefined behavior--reading a table in la-la land--in
> release builds.

I'm not sure how this is even a question. Negative numbers aren't ASCII,
therefore they fail any trait test.

> What about the fact that many, if not most, C implementations currently
> implement these functions as macros? Calling these functions will be
> annoying like it is with numeric_limits: combined with the above, you'll
> have to call them like this:

That is an issue. We'd have to find an identifier that doesn't collide.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358

Myriachan

Sep 27, 2014, 3:25:31 AM
to std-pr...@isocpp.org
On Friday, September 26, 2014 11:50:10 PM UTC-7, Thiago Macieira wrote:
On Friday 26 September 2014 20:52:57 Myriachan wrote:
> How should wchar_t, char16_t and char32_t be handled?  int is not

I'm not sure they have to. But we can add the char16_t and char32_t overloads
because ASCII is a strict subset of UTF-16 and UTF-32, so the code is exactly
the same.


I meant that, for example, on systems with a 16-bit "int", attempting to pass a char32_t to std::ascii::isdigit(int) would be bad.  But yes, with overloads for them, this problem is avoided.

> guaranteed to be large enough to hold them.  One way is to have isascii()
> be a template function that actually has meaning, returning whether its
> parameter is within +00 to +7F (or +7E, if you don't count the +7F
> DEL character).

I don't see how char16_t or char32_t wouldn't be wide enough to contain ASCII
if, by definition, they contain UTF-16 and UTF-32 codepoints that are a
superset of ASCII.


I meant that "int" can't necessarily hold char16_t and char32_t.
 
I don't think this needs to be a template either, but I would think that
char16_t and char32_t make sense.

Let's just drop wchar_t.


Even when the rest of STL supports it, like std::iswdigit?
 
> How should negative integers be handled?  It has always been very annoying
> that calling C's isdigit, etc. requires static_cast<unsigned char>(c) all
> the time, otherwise you assert in debug builds when your input is UTF-8,
> or, worse, cause undefined behavior--reading a table in la-la land--in
> release builds.

I'm not sure how this is even a question. Negative numbers aren't ASCII,
therefore they fail any trait test.


This question is actually quite relevant.  This has been a source of crash bugs with the C library functions for decades.  On an implementation where "char" is signed, which is most, the following is undefined behavior:

const char *string = u8"猫";
if (!isdigit(*string))
    return false;

This is because isdigit takes a parameter of type int, not char.  The UTF-8 encoding of the string ends up with negative chars, which get sign-extended to negative ints.  Passing negative ints (*) to isdigit, etc. results in undefined behavior.  The proper way to call the isdigit family of functions is as follows:

const char *string = u8"猫";
if (!isdigit(static_cast<unsigned char>(*string)))
    return false;

And that's just annoying.  This is probably why std::isdigit<CharType>(CharType, const std::locale &) doesn't take int as a parameter =)

(*) Often with the exception of -1, because that's usually the implementation's chosen value of EOF.  EOF is a valid value to pass to these functions, and they don't assert or read into bad memory if you call them with EOF.
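
For illustration, a small sketch contrasting the two approaches. The names isdigit_c and isdigit_ascii and the namespace ascii_sketch are made up for this example; only std::isdigit is a real library function.

```cpp
#include <cctype>

namespace ascii_sketch {  // hypothetical namespace

// Wrapper around the C function: the cast keeps the argument inside the
// range of unsigned char, which is what <cctype> requires.
inline bool isdigit_c(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// Comparison-based check taking char directly: defined for every char
// value, signed or not, and usable in constant expressions.
constexpr bool isdigit_ascii(char c) noexcept {
    return c >= '0' && c <= '9';
}

}  // namespace ascii_sketch
```

With the second form a UTF-8 lead byte such as 0xE7 (the first byte of the string above) simply yields false, with no cast needed at the call site.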
 
> What about the fact that many, if not most, C implementations currently
> implement these functions as macros?  Calling these functions will be
> annoying like it is with numeric_limits: combined with the above, you'll
> have to call them like this:

That is an issue. We'd have to find an identifier that doesn't collide.

I just looked and this may actually not be a problem.  I looked at Linux GCC and Visual Studio 2010 and 2013, and the macro versions are #ifndef __cplusplus.  This makes sense, with the std::isdigit versions taking two arguments existing.

Bjorn Reese

Sep 27, 2014, 6:08:58 AM
to std-pr...@isocpp.org
On 09/27/2014 05:52 AM, Myriachan wrote:

> What about the fact that many, if not most, C implementations currently
> implement these functions as macros? Calling these functions will be
> annoying like it is with numeric_limits: combined with the above, you'll
> have to call them like this:

If this is still a problem, then it would also apply to the character
classification functions in std::locale.

Bjorn Reese

Sep 27, 2014, 6:55:56 AM
to std-pr...@isocpp.org
On 09/26/2014 08:53 PM, Matthew Fioravante wrote:
> A final version of the proposal is here:
>
> https://github.com/fmatthew5876/stdcxx-ascii
>
> Does anyone have any feedback or suggestions? Is the lack of response an
> indication that this is not a good idea?

I would like to suggest that the comparison is done in a traits class
so we can add other "constexpr-essional" character sets later on; an
obvious example is XML characters.

The following is just to give you an idea. Notice that I use an
underscore in is_space() to distinguish it from the locale-based
counterparts.

// Generic part (all of this is assumed to be declared inside namespace std)
template <typename Traits>
struct ctype_traits;

// Traits comes first so it can be given explicitly while CharT is deduced.
template <typename Traits, typename CharT>
constexpr bool is_space(CharT c)
{
    return ctype_traits<Traits>::is_space(c);
}

// ASCII part
struct ctype_ascii_tag {};

template <>
struct ctype_traits<ctype_ascii_tag>
{
    template <typename CharT>
    static constexpr bool is_space(CharT c)
    {
        return (c == 9 || c == 32);
    }
};

namespace ascii
{
    // Convenience functions
    template <typename CharT>
    constexpr bool is_space(CharT c)
    {
        return std::is_space<ctype_ascii_tag>(c);
    }
}

Matthew Fioravante

Sep 27, 2014, 10:18:00 AM
to std-pr...@isocpp.org


On Saturday, September 27, 2014 3:25:31 AM UTC-4, Myriachan wrote:
On Friday, September 26, 2014 11:50:10 PM UTC-7, Thiago Macieira wrote:
On Friday 26 September 2014 20:52:57 Myriachan wrote:
> How should wchar_t, char16_t and char32_t be handled?  int is not

I'm not sure they have to. But we can add the char16_t and char32_t overloads
because ASCII is a strict subset of UTF-16 and UTF-32, so the code is exactly
the same.


I meant that, for example, on systems with a 16-bit "int", attempting to pass a char32_t to std::ascii::isdigit(int) would be bad.  But yes, with overloads for them, this problem is avoided.

I originally went with int to be compatible with C getchar() and EOF. But I see your point.

Maybe we should have overloads for char, wchar_t, char16_t, char32_t and int (int for C compatibility when all of the underlying char types are unsigned and EOF is -1).
 
> How should negative integers be handled?  It has always been very annoying
> that calling C's isdigit, etc. requires static_cast<unsigned char>(c) all
> the time, otherwise you assert in debug builds when your input is UTF-8,
> or, worse, cause undefined behavior--reading a table in la-la land--in
> release builds.

I'm not sure how this is even a question. Negative numbers aren't ASCII,
therefore they fail any trait test.


This question is actually quite relevant.  This has been a source of crash bugs with the C library functions for decades.  On an implementation where "char" is signed, which is most, the following is undefined behavior:

const char *string = u8"猫";
if (!isdigit(*string))
    return false;



The original cctype functions have this funky rule that the input has to either be in the range of unsigned char or be EOF; otherwise the behavior is undefined. I'm guessing this restriction was put in place so that the cctype functions could be implemented efficiently using table lookup without a bounds check.

As you've identified, this makes the interface rather painful to use so I've opted to lift this restriction.
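
As a sketch of what lifting the restriction means for an int-taking overload (the namespace ascii_sketch is a placeholder, and this is not copied from the draft): a pair of comparisons is defined for every int, so EOF and other negative values just return false.

```cpp
#include <cstdio>  // EOF

namespace ascii_sketch {  // hypothetical namespace

// No lookup table is indexed, so there is no out-of-range input:
// any int that is not in '0'..'9' yields false.
constexpr bool is_digit(int c) noexcept {
    return c >= '0' && c <= '9';
}

}  // namespace ascii_sketch

static_assert(ascii_sketch::is_digit('4'), "digit");
static_assert(!ascii_sketch::is_digit(EOF), "EOF is accepted and is not a digit");
static_assert(!ascii_sketch::is_digit(-42), "negative values are well-defined");
```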

Matthew Fioravante

Sep 27, 2014, 10:22:02 AM
to std-pr...@isocpp.org


On Saturday, September 27, 2014 6:55:56 AM UTC-4, Bjorn Reese wrote:

I would like to suggest that the comparison is done in a traits class
so we can add other "constexpr-essional" character sets later on; an
obvious example is XML characters.


I was originally thinking along these lines as well but wasn't sure what the best interface is for such a thing. Even if such a generic mechanism is not put into this proposal, it could be added later and the ascii library could be re-implemented on top of it.

Thiago Macieira

Sep 27, 2014, 9:35:15 PM
to std-pr...@isocpp.org
On Saturday 27 September 2014 00:25:30 Myriachan wrote:
> I meant that, for example, on systems with a 16-bit "int", attempting to
> pass a char32_t to std::ascii::isdigit(int) would be bad. But yes, with
> overloads for them, this problem is avoided.

> > I don't see how char16_t or char32_t wouldn't be wide enough to contain
> > ASCII
> > if, by definition, they contain UTF-16 and UTF-32 codepoints that are a
> > superset of ASCII.
>
> I meant that "int" can't necessarily hold char16_t and char32_t.

I don't see how "int" got into the discussion.

std::ascii::isxxx should take char, char16_t and char32_t. It stands to reason
that you already know that a given character is ASCII before using the
function. A wchar_t overload can be present only if the implementation knows
at compile time that wchar_t is always a superset of ASCII.
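
A minimal sketch of such an overload set, under the assumption that the caller's data is ASCII, UTF-16 or UTF-32. The names ascii_sketch, detail and is_digit_impl are illustrative only.

```cpp
namespace ascii_sketch {  // hypothetical namespace

namespace detail {
// One implementation shared by every code-unit type: the ASCII range has
// the same numeric values in ASCII, UTF-16 and UTF-32.
template <typename CharT>
constexpr bool is_digit_impl(CharT c) noexcept {
    return c >= CharT(0x30) && c <= CharT(0x39);  // '0'..'9' in ASCII
}
}  // namespace detail

constexpr bool is_digit(char c) noexcept     { return detail::is_digit_impl(c); }
constexpr bool is_digit(char16_t c) noexcept { return detail::is_digit_impl(c); }
constexpr bool is_digit(char32_t c) noexcept { return detail::is_digit_impl(c); }

}  // namespace ascii_sketch
```

With only these three overloads, a wchar_t argument would need an explicit cast (for example to char16_t or char32_t) to pick one.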

The use of int for ctype.h is an extreme legacy from C's obscure past.

> > I don't think this needs to be a template either, but I would think that
> > char16_t and char32_t make sense.
> >
> > Let's just drop wchar_t.
>
> Even when the rest of STL supports it, like std::iswdigit?

Yes. Let's just drop it because of its C legacy that says its encoding can be
anything. Let's stick to ASCII, UTF-16 and UTF-32.

> > I'm not sure how this is even a question. Negative numbers aren't ASCII,
> > therefore they fail any trait test.
>
> This question is actually quite relevant.
[snip the rest]

Looks like the question is relevant because you were thinking of these
functions taking int as parameters, not char.

If they take char, then the point is moot.

> And that's just annoying. This is probably why
> std::isdigit<CharType>(CharType, const std::locale &) doesn't take int as a
> parameter =)

q.e.d :-)

> > > What about the fact that many, if not most, C implementations currently
> > > implement these functions as macros? Calling these functions will be
> > > annoying like it is with numeric_limits: combined with the above, you'll
> >
> > > have to call them like this:
> > That is an issue. We'd have to find an identifier that doesn't collide.
>
> I just looked and this may actually not be a problem. I looked at Linux
> GCC and Visual Studio 2010 and 2013, and the macro versions are #ifndef
> __cplusplus. This makes sense, with the std::isdigit versions taking two
> arguments existing.

Better.

Thiago Macieira

Sep 27, 2014, 9:39:03 PM
to std-pr...@isocpp.org
Note that we will never have a full Unicode tag for this, so long as the
functions are constexpr. For full Unicode, you'll need to drop the constexpr.

So if you want to keep the constexpr, why bother with the tag?

Agustín K-ballo Bergé

Sep 27, 2014, 9:43:51 PM
to std-pr...@isocpp.org
On 27/09/2014 10:35 p.m., Thiago Macieira wrote:
>>> I don't think this needs to be a template either, but I would think that
>>> > >char16_t and char32_t make sense.
>>> > >
>>> > >Let's just drop wchar_t.
>> >
>> >Even when the rest of STL supports it, like std::iswdigit?
> Yes. Let's just drop it because of its C legacy that says its encoding can be
> anything. Let's stick to ASCII, UTF-16 and UTF-32.

I don't think I understand the rationale. The encoding of all of char,
char16_t, char32_t and wchar_t can be anything.

Regards,
--
Agustín K-ballo Bergé.-
http://talesofcpp.fusionfenix.com

Thiago Macieira

Sep 27, 2014, 9:58:00 PM
to std-pr...@isocpp.org
On Saturday 27 September 2014 22:42:09 Agustín K-ballo Bergé wrote:
> On 27/09/2014 10:35 p.m., Thiago Macieira wrote:
> >>> I don't think this needs to be a template either, but I would think that
> >>>
> >>> > >char16_t and char32_t make sense.
> >>> > >
> >>> > >Let's just drop wchar_t.
> >> >
> >> >Even when the rest of STL supports it, like std::iswdigit?
> >
> > Yes. Let's just drop it because of its C legacy that says its encoding can
> > be anything. Let's stick to ASCII, UTF-16 and UTF-32.
>
> I don't think I understand the rationale. The encoding of all of char,
> char16_t, char32_t and wchar_t can be anything.

The encoding of char16_t and char32_t are always UTF-16 and UTF-32,
respectively.

The encoding of char in a function with "ascii" in the name is ASCII.

I'm saying that now that we have char16_t and char32_t with a fixed encoding,
we should stop adding wchar_t support, since that type can change sizes and
whose encoding is not fixed.

Agustín K-ballo Bergé

Sep 27, 2014, 10:10:05 PM
to std-pr...@isocpp.org
On 27/09/2014 10:57 p.m., Thiago Macieira wrote:
> On Saturday 27 September 2014 22:42:09 Agustín K-ballo Bergé wrote:
>> On 27/09/2014 10:35 p.m., Thiago Macieira wrote:
>>>>> I don't think this needs to be a template either, but I would think that
>>>>>
>>>>>>> char16_t and char32_t make sense.
>>>>>>>
>>>>>>> Let's just drop wchar_t.
>>>>>
>>>>> Even when the rest of STL supports it, like std::iswdigit?
>>>
>>> Yes. Let's just drop it because of its C legacy that says its encoding can
>>> be anything. Let's stick to ASCII, UTF-16 and UTF-32.
>>
>> I don't think I understand the rationale. The encoding of all of char,
>> char16_t, char32_t and wchar_t can be anything.
>
> The encoding of char16_t and char32_t are always UTF-16 and UTF-32,
> respectively.

Character types have no associated encoding, you might be confusing them
with character literals.

> The encoding of char in a function with "ascii" in the name is ASCII.

Fair enough.

> I'm saying that now that we have char16_t and char32_t with a fixed encoding,
> we should stop adding wchar_t support, since that type can change sizes and
> whose encoding is not fixed.

The size of char16_t and char32_t is not fixed either.

If the implementation gives me wide character literals then it's the job
of the implementation to provide me enough support for it. Maybe the
implementation would be wise to define wide characters to be equivalent
to one of the universal characters. If you were to suggest that wchar_t,
wide character literals, basic and execution wide character sets, and so
on be deprecated, then that'd be a different story.

Thiago Macieira

Sep 27, 2014, 10:40:26 PM
to std-pr...@isocpp.org
On Saturday 27 September 2014 23:08:23 Agustín K-ballo Bergé wrote:
> >> I don't think I understand the rationale. The encoding of all of char,
> >> char16_t, char32_t and wchar_t can be anything.
> >
> > The encoding of char16_t and char32_t are always UTF-16 and UTF-32,
> > respectively.
>
> Character types have no associated encoding, you might be confusing them
> with character literals.

No, I'm just thinking of reasonable people using the character types for the
purposes which they were intended.

I know I can store ~0 in a char32_t and that it's not a valid UTF-32
codepoint. I would expect totally undefined behaviour when passing char32_t(~0)
to a function that clearly expects UTF-32.

> > I'm saying that now that we have char16_t and char32_t with a fixed
> > encoding, we should stop adding wchar_t support, since that type can
> > change sizes and whose encoding is not fixed.
>
> The size of char16_t and char32_t is not fixed either.

No, but they are defined to have the same size, signedness and alignment as
uint_least16_t and uint_least32_t, so that they can store at least 16- and
32-bit values. That's not so for wchar_t -- for all the standard says, it
could be as wide as char.

> If the implementation gives me wide character literals then it's the job
> of the implementation to provide me enough support for it. Maybe the
> implementation would be wise to define wide characters to be equivalent
> to one of the universal characters. If you were to suggest that wchar_t,
> wide character literals, basic and execution wide character sets, and so
> on be deprecated, then that'd be a different story.

There's nothing stopping us from deprecating wchar_t in the standard library
or at least shunning it.

Myriachan

Sep 30, 2014, 12:25:18 AM
to std-pr...@isocpp.org
On Saturday, September 27, 2014 6:35:15 PM UTC-7, Thiago Macieira wrote:
On Saturday 27 September 2014 00:25:30 Myriachan wrote:
> I meant that "int" can't necessarily hold char16_t and char32_t.

I don't see how "int" got into the discussion.

std::ascii::isxxx should take char, char16_t and char32_t. It stands to reason
that you already know that a given character is ASCII before using the
function. A wchar_t overload can be present only if the implementation knows
at compile time that wchar_t is always a superset of ASCII.

The use of int for ctype.h is an extreme legacy from C's obscure past.


His draft used "int"; that's why I mentioned it and pointed out why it's problematic.
https://github.com/fmatthew5876/stdcxx-ascii/blob/master/proposal/draft.html

> > I don't think this needs to be a template either, but I would think that
> > char16_t and char32_t make sense.
> >
> > Let's just drop wchar_t.
>
> Even when the rest of STL supports it, like std::iswdigit?

Yes. Let's just drop it because of its C legacy that says its encoding can be
anything. Let's stick to ASCII, UTF-16 and UTF-32.


On an implementation, the encoding of "char" can be EBCDIC while the encoding of "wchar_t" can be UTF-32; why would "char" be the one to get special treatment in that case, when it's wchar_t that would be the superset of ASCII?  I am not strongly opposed to not having wchar_t; it's just an irritation, since it adds static_casts to char32_t or char16_t everywhere.  (Most systems with wchar_t define it as UTF-32 or UTF-16 anyway.)

It would mainly affect Windows programmers, where wchar_t is deeply ingrained into the environment.  Windows programming suffers a bit from being the first major OS to jump onto the Unicode bandwagon - Windows NT came out only 6 months after UTF-8 was released, being in development long before that of course, and before Unicode realized that 64K wasn't enough for everyone.
 
> > I'm not sure how this is even a question. Negative numbers aren't ASCII,
> > therefore they fail any trait test.
>
> This question is actually quite relevant.  
[snip the rest]

Looks like the question is relevant because you were thinking of these
functions taking int as parameters, not char.

If they take char, then the point is moot.


Mhmm.
 
> I just looked and this may actually not be a problem.  I looked at Linux
> GCC and Visual Studio 2010 and 2013, and the macro versions are #ifndef
> __cplusplus.  This makes sense, with the std::isdigit versions taking two
> arguments existing.

Better.


^^^ Yes, sorry about my initial bad assessment about them being macros.  C++ has inline functions, yay.

Melissa

Matthew Fioravante

Sep 30, 2014, 8:34:45 PM
to std-pr...@isocpp.org


On Tuesday, September 30, 2014 12:25:18 AM UTC-4, Myriachan wrote:
On Saturday, September 27, 2014 6:35:15 PM UTC-7, Thiago Macieira wrote:
On Saturday 27 September 2014 00:25:30 Myriachan wrote:
> I meant that "int" can't necessarily hold char16_t and char32_t.

I don't see how "int" got into the discussion.

std::ascii::isxxx should take char, char16_t and char32_t. It stands to reason
that you already know that a given character is ASCII before using the
function. A wchar_t overload can be present only if the implementation knows
at compile time that wchar_t is always a superset of ASCII.

The use of int for ctype.h is an extreme legacy from C's obscure past.


His draft used "int"; that's why I mentioned it and pointed out why it's problematic.
https://github.com/fmatthew5876/stdcxx-ascii/blob/master/proposal/draft.html


I originally proposed int so that the functions would be compatible with C getchar() and EOF. Looking at it again, perhaps that is not so necessary. I never use either of those and even if I did I would be doing an EOF check before passing it on to an ascii function. Setting the type to char also resolves ambiguities and questions about values that are valid for int but invalid for char (such as negative values).

I've changed the proposal to accept char, char16_t, char32_t. A new version of the paper is available:


I'm still on the fence about wchar_t. While it looks like the direction is for wchar_t to eventually go away there are a lot of people using it. Windows developers being the prime example.

Is anyone here a windows developer who would find these functions at all useful with wchar_t? Even if wchar_t is planned to be deprecated, we can still add overloads now and deprecate them later if and when wchar_t gets deprecated.

Daniela Engert

Oct 1, 2014, 11:04:38 AM
to std-pr...@isocpp.org
Am 01.10.2014 um 02:34 schrieb Matthew Fioravante:
>
> I've changed the proposal to accept char, char16_t, char32_t. A new
> version of the paper is available:
>
> Is anyone here a windows developer who would find these functions at all
> useful with wchar_t? Even if wchar_t is planned to be deprecated, we can
> still add overloads now and deprecate them later if and when wchar_t
> gets deprecated.

I certainly do. As long as wchar_t isn't deprecated it's a first-class
member of the char party, imho.

Ciao
Dani



Thiago Macieira

Oct 1, 2014, 2:04:39 PM
to std-pr...@isocpp.org
I would say the implementation should provide those if it can be sure that the
wide execution charset at runtime is a superset of ASCII. That's the case on
Windows (wchar_t == char16_t).

If it can't be sure or if the wide char encoding is changeable at runtime,
then those functions should be omitted.

Matthew Fioravante

Oct 1, 2014, 2:20:40 PM
to std-pr...@isocpp.org


On Wednesday, October 1, 2014 2:04:39 PM UTC-4, Thiago Macieira wrote:

I would say the implementation should provide those if it can be sure that the
wide execution charset at runtime is a superset of ASCII. That's the case on
Windows (wchar_t == char16_t).

If it can't be sure or if the wide char encoding is changeable at runtime,
then those functions should be omitted.


Why omit? If you call ascii::isspace() on a wchar_t, you are saying "I don't care what the system encoding is, I am guaranteeing that the character I am giving you is ascii compatible".

Thiago Macieira

Oct 1, 2014, 6:47:07 PM
to std-pr...@isocpp.org
If you know that the character is ASCII, why are you calling the function?

If you know you have UTF-16 or UTF-32, why aren't you using char16_t or
char32_t? If you know your data is a superset of ASCII, you can cast it.

We could have a wchar_t function that asks "does this character from the wide
character execution charset, whichever it is, exist in ASCII". That would make
sense to me.

But once you've got that, the traits like isdigit only make sense in ASCII in
the first place. You should simply convert to ASCII and work on a char.

char16_t and char32_t only get the exemption because the code is exactly the
same.

Matthew Fioravante

Oct 1, 2014, 7:47:15 PM
to std-pr...@isocpp.org

If I understand what you're trying to say, I think your point is that assuming a wchar_t is ASCII compatible may not make much sense at all, because all uses of wchar_t come from platform-specific sources such as string literals, operating system paths, etc. In that case we would never use wchar_t to represent wide text loaded from a platform-agnostic source like a file or network socket, because we don't know its size. Instead we would use char16_t and char32_t, in the same way we would use int32_t instead of int or long for binary data. Therefore, on a system where wchar_t is defined to be non-ASCII compatible, these functions would be useless, if not dangerous, to use.


The ascii::isX() functions are designed for text which comes from platform agnostic sources such as data files where the encoding is known a priori.

If you have the platform specific knowledge to know your wchar_t is ascii compatible, then you also have the platform specific knowledge to know whether it should be a char16_t or char32_t. Someone who for example wants to parse a windows path using ascii methods could convert the wchar_t's to char16_t.

Am I understanding your argument correctly?

Thiago Macieira

Oct 2, 2014, 12:41:23 AM
to std-pr...@isocpp.org
On Wednesday 01 October 2014 16:47:15 Matthew Fioravante wrote:
> *The ascii::isX() functions are designed for text which comes from platform
> agnostic sources such as data files where the encoding is known a priori.*
>
> If you have the platform specific knowledge to know your wchar_t is ascii
> compatible, then you also have the platform specific knowledge to know
> whether it should be a char16_t or char32_t. Someone who for example wants
> to parse a windows path using ascii methods could convert the wchar_t's to
> char16_t.
>
> Am I understanding your argument correctly?

Yes.

With the exception that on Windows, wchar_t is always UTF-16, so it is
actually safe. The example is misleading.

Myriachan

Oct 2, 2014, 4:43:52 AM
to std-pr...@isocpp.org
On Wednesday, October 1, 2014 4:47:15 PM UTC-7, Matthew Fioravante wrote:
The ascii::isX() functions are designed for text which comes from platform agnostic sources such as data files where the encoding is known a priori.

If you have the platform specific knowledge to know your wchar_t is ascii compatible, then you also have the platform specific knowledge to know whether it should be a char16_t or char32_t. Someone who for example wants to parse a windows path using ascii methods could convert the wchar_t's to char16_t.

Am I understanding your argument correctly?

You could make the same argument to say that there is no reason to support anything but char32_t. Everything above applies equally to char as it does to wchar_t. If you know beforehand that your char buffer comes from platform-agnostic sources and contains ASCII text, which is the whole point of this thread, you can pass each char converted to char32_t and get the same result. Every one of these functions returns false or has undefined behavior for the ranges of characters U+00D800 through U+00DFFF inclusive and U+010000 through U+10FFFF inclusive, so supporting char16_t serves no purpose either; just convert all char16_t values to char32_t before calling the char32_t versions.

The above is meant to just be reductio ad absurdum.  I think it's better to just support them all, since sprinkling static_casts every time we call these functions is just annoying.
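
To illustrate the "support them all" option, here is a minimal sketch of a single check that accepts every character type without a cast at the call site. It uses a template rather than the four overloads in the proposal, the sketch namespace is made up for this example, and it assumes an ASCII-compatible execution character set.

```cpp
namespace sketch {

// One template covers char, wchar_t, char16_t and char32_t.
// The comparison is done in the character's own type, so callers
// do not need to sprinkle static_casts.
template <typename CharT>
constexpr bool isdigit(CharT c) noexcept
{
    return c >= CharT('0') && c <= CharT('9');
}

} // namespace sketch

// Every character type works at the call site without a cast.
static_assert(sketch::isdigit('7'), "narrow char");
static_assert(sketch::isdigit(L'7'), "wchar_t");
static_assert(sketch::isdigit(u'7'), "char16_t");
static_assert(sketch::isdigit(U'7'), "char32_t");
static_assert(!sketch::isdigit(U'\U0001F600'), "non-ASCII code point");
```

A template like this would also accept plain integers, which is one reason a real interface might prefer explicit overloads.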

By the way, no love for strtod/wcstod? I'd love to avoid having to call nonstandard functions like _strtod_l with a custom-made "C" locale object in order to make sure that '.' is always the decimal separator and not ','. There are times for locale handling and there are times for consistency.

Melissa

Matthew Fioravante

Oct 2, 2014, 7:53:59 AM10/2/14
to std-pr...@isocpp.org


On Thursday, October 2, 2014 4:43:52 AM UTC-4, Myriachan wrote:
> On Wednesday, October 1, 2014 4:47:15 PM UTC-7, Matthew Fioravante wrote:
>> The ascii::isX() functions are designed for text which comes from platform agnostic sources such as data files where the encoding is known a priori.
>>
>> If you have the platform specific knowledge to know your wchar_t is ascii compatible, then you also have the platform specific knowledge to know whether it should be a char16_t or char32_t. Someone who for example wants to parse a windows path using ascii methods could convert the wchar_t's to char16_t.
>>
>> Am I understanding your argument correctly?
>
> You could make the same argument to say that there is no reason to support anything but char32_t. Everything above applies equally to char as it does to wchar_t

The difference here is that char is often used for storing platform agnostic text as if it were a char8_t, while wchar_t is never used for this purpose.
 
> If you know beforehand that your char buffer comes from platform-agnostic sources and contains ASCII text, which is the whole point of this thread, you can pass each char converted to char32_t and get the same result. Every one of these functions returns false or has undefined behavior for the ranges of characters U+00D800 through U+00DFFF inclusive and U+010000 through U+10FFFF inclusive, so supporting char16_t serves no purpose either; just convert all char16_t values to char32_t before calling the char32_t versions.
>
> The above is meant to just be reductio ad absurdum. I think it's better to just support them all, since sprinkling static_casts every time we call these functions is just annoying.

I think I will add them and also add a paragraph explaining all of these discussions. I don't really care as I just really want my ascii functions for char! The committee can make the final decision on wchar_t. 

> By the way, no love for strtod/wcstod? I'd love to avoid having to call nonstandard functions like _strtod_l with a custom-made "C" locale object in order to make sure that '.' is always the decimal separator and not ','. There are times for locale handling and there are times for consistency.

That's more about specifying a static locale and unrelated to ascii. The only thing an ascii version would give you is the minor optimization that the calls to isspace() would become inlined calls to ascii::isspace().

Still I see your point and this may be an idea for a followup proposal (or for the new set of str to int/float functions in the other thread).

> Melissa

Matthew Fioravante

Oct 2, 2014, 9:18:38 PM10/2/14
to std-pr...@isocpp.org
Candidate for final draft is here:

https://github.com/fmatthew5876/stdcxx-ascii/

If there are no further objections or suggestions, I will submit this version.

Thank you everyone for your valuable feedback.

ron novy

Oct 2, 2014, 10:50:28 PM10/2/14
to std-pr...@isocpp.org
I might be misunderstanding its function, but the isascii(int) function might give invalid results as implemented.  It returns false for a null character and true if any of the first 7 bits are set.  This does not match the description of the function where it returns true if the value passed is >= 0 and <= 0x7F.

// This is how you have it now in cctype.hh
constexpr bool isascii(int c) noexcept
{
    return c & 0x7F;
}

// This is how I imagine it should work.
constexpr bool isascii(int c) noexcept
{
    return (c >= 0 && c <= 0x7F);
}
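
To make the difference concrete, here is a minimal sketch that checks both versions at compile time. The names isascii_masked and isascii_ranged and the sketch namespace are made up for this illustration; they are not taken from the proposal header.

```cpp
namespace sketch {

// Masking version: true whenever any of the low 7 bits is set.
constexpr bool isascii_masked(int c) noexcept
{
    return c & 0x7F;
}

// Range version: true exactly for the values 0 through 0x7F.
constexpr bool isascii_ranged(int c) noexcept
{
    return c >= 0 && c <= 0x7F;
}

} // namespace sketch

// The masking version gets both edge cases wrong.
static_assert(!sketch::isascii_masked(0), "NUL is rejected by the mask");
static_assert(sketch::isascii_masked(0x81), "0x81 is accepted by the mask");

// The range version matches the description.
static_assert(sketch::isascii_ranged(0), "NUL is ASCII");
static_assert(sketch::isascii_ranged(0x7F), "DEL is ASCII");
static_assert(!sketch::isascii_ranged(0x81), "0x81 is not ASCII");
static_assert(!sketch::isascii_ranged(-1), "negative values are not ASCII");
```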



Matthew Fioravante

Oct 2, 2014, 10:57:40 PM10/2/14
to std-pr...@isocpp.org


On Thursday, October 2, 2014 10:50:28 PM UTC-4, ron novy wrote:
> I might be misunderstanding its function, but the isascii(int) function might give invalid results as implemented. It returns false for a null character and true if any of the first 7 bits are set. This does not match the description of the function where it returns true if the value passed is >= 0 and <= 0x7F.

That's a bug in my example header. Fixed.

Myriachan

Oct 4, 2014, 4:41:23 PM10/4/14
to std-pr...@isocpp.org

Regarding the undefined nature of todigit with bad input characters, do you mean that it's acceptable for these functions to crash if you give them a bad character, or that they merely give some weird useless result instead?  The latter would be called "unspecified" instead.

Wasn't there previously a fromdigit?  Also, what you have now as todigit seems more like a fromdigit than a todigit, because it just passed an isdigit check...

General proposal note: "ASCII" is an acronym--American Standard Code for Information Interchange--and, as such, should be written in all caps in the proposal.  Note that I'm not saying that this applies to identifier names like std::ascii, std::isascii, etc; those should remain lowercase.


No need to mention Microsoft Excel; "any common spreadsheet application" would work, because that is true.  Any common spreadsheet can export .csv files.  Heck, my database query tool can export select results as .tsv or .csv.


Typo in proposal's prototypes:

constexpr bool std::ascii:isalnum(char16_t c) noexcept;
constexpr bool std::ascii:isalnum(char32_t c) noexcept;

Return: std::ascii::isalpha(c) || std::asci::isdigit(c)

Needs two colons before isalnum. This one is systemic throughout the Character Checks part. isalnum specifically also has a typo where asci should be ascii.

constexpr int todigit(char c, int m) noexcept;
constexpr int todigit(wchar_t c, int m) noexcept;
constexpr int todigit(char16_t c, int m) noexcept;
constexpr int todigit(char32_t c, int m) noexcept;

Return: std::ascii::todigit(c) * m

I think that you intended these to say std::ascii:: before todigit.

Return: (c >= 33 && c<= 126)

Space after second c.


Words that should be in monospace font:

"This proposal adds a set of //constexpr// free functions"


Phrases that need hyphens:

digit-to-int
high-performance
error-prone
ASCII-compatible
char-to-int
platform-agnostic
platform-specific
user-defined
std-proposals (because that's the name of this list)

Typos / misspellings:

Change: ~large data files which use ascii delimeters.
To: ~ASCII delimiters.

Change: ~github.
To: ~GitHub.

Change: We propose 2 additional useful functions todigit, and toxdigit.
To: ~functions: todigit and toxdigit.  (Add colon, remove comma.  Oxford comma is awkward with two items.)

Change: 2
To: two    (this is a generic change to make things more readable)

Change: Alternatively, all of these defintions
To: Alternatively, all these definitions   (delete "of", fix spelling of "definitions")

Change: Each function has overloads for type char, wchar_t, char16_t, and char32_t.
To: (Delete final comma.  This is stylistic, though, and definitely up to you, of course.)

Change: [note-- All of these~
To: [Note: All these~     (If you want to follow the Standard's formatting, it's non-italic [ followed by italic Note: and non-italic note text.  Delete "of".)

Change: locale settings. --end-note]
To: locale settings. —end note]  (Similarly, this is an "em-dash" (U+002014), "end note" in italic (no hyphen between "end" and "note") and a non-italic ] )

Change: Return:
To: Returns:

Change: On systems such as windows
To: On systems such as Windows

Change: where wchar_t happens to be using an ASCII-compatible encoding
To: where wchar_t happens to be using an ASCII-compatible encoding (UTF-16)

Change: presense
To: presence

Change: standard committee
To: Standard Committee

Change: Csv file parsers
To: .csv file parsers  or   CSV file parsers

Change: decide that wchar_t is a bad idea
To: decide that wchar_t support in this proposal is a bad idea

Matthew Fioravante

Oct 6, 2014, 9:50:07 PM10/6/14
to std-pr...@isocpp.org
Thank you for proofreading the paper. Your suggestions helped a lot.


On Saturday, October 4, 2014 4:41:23 PM UTC-4, Myriachan wrote:

> Wasn't there previously a fromdigit? Also, what you have now as todigit seems more like a fromdigit than a todigit, because it just passed an isdigit check...


Naming is hard. The idea of todigit(), especially something like todigit(c, 100), is saying convert this character to an integral digit in the 100's place. Maybe a better name is toint()? I prefer names like toX() because such a name combined with the character argument reads as char-to-X. fromdigit() doesn't really tell you what you're converting the source digit to. Maybe digit_to_int()?
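
As a rough sketch of how the place-value form reads in use (the sketch namespace is made up here, only the char overloads are shown, and the precondition that c is an ASCII decimal digit is assumed rather than checked):

```cpp
namespace sketch {

// Assumed precondition: c is one of '0'..'9'. Anything else is outside
// the contract and not checked in this sketch.
constexpr int todigit(char c) noexcept
{
    return c - '0';
}

// Place-value form: the digit's value multiplied by m.
constexpr int todigit(char c, int m) noexcept
{
    return todigit(c) * m;
}

// Reading the text "472" place by place.
constexpr int value = todigit('4', 100) + todigit('7', 10) + todigit('2', 1);
static_assert(value == 472, "digits combine by place value");

} // namespace sketch
```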

Myriachan

Oct 7, 2014, 4:51:12 AM10/7/14
to std-pr...@isocpp.org

The first question--about fromdigit--was more about whether there was an inverse operation: the equivalent of "0123456789abcdef"[x].  This is less necessary, though, since it's obviously very simple to implement. =)

toint works as a name, I think, but sucks for readability; to_int would be better for readability, but doesn't fit the pattern of the rest of the functions' names.  If the function were named toint or similar, the inverse operation could be named tochar or similar.  I suppose that for tochar, lowercase would be what is returned for digits 10-15.  A Boolean parameter to select whether to use uppercase could be possible, or a bool template parameter, but just wrapping toupper around tochar would work.
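
A minimal sketch of that inverse, using the hypothetical name tochar from above and a small ASCII-only toupper defined locally so the example is self-contained. The precondition 0 <= d <= 15 is assumed and not checked.

```cpp
namespace sketch {

// Hypothetical inverse of the digit-to-int conversion.
// Assumed precondition: 0 <= d && d <= 15. Returns lowercase for 10-15.
constexpr char tochar(int d) noexcept
{
    return "0123456789abcdef"[d];
}

// ASCII-only uppercase conversion; characters outside 'a'..'z' pass through.
constexpr char toupper(char c) noexcept
{
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

} // namespace sketch

static_assert(sketch::tochar(7) == '7', "decimal digits map directly");
static_assert(sketch::tochar(11) == 'b', "lowercase by default");
static_assert(sketch::toupper(sketch::tochar(11)) == 'B', "wrap for uppercase");
static_assert(sketch::toupper(sketch::tochar(7)) == '7', "digits are unchanged");
```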

I think that this should be changed to <= 127, in order to avoid issues related to 128 not being a valid char on platforms with an 8-bit signed char (i.e., most).  I'm sure that readers would have understood what was meant, so it's really just me being pedantic again:

constexpr bool std::ascii::isascii(char c) noexcept;
constexpr bool std::ascii::isascii(wchar_t c) noexcept;
constexpr bool std::ascii::isascii(char16_t c) noexcept;
constexpr bool std::ascii::isascii(char32_t c) noexcept;

Returns: true if c >= 0 && c < 128
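
For illustration only, one way to express the "0 through 127" check so that it behaves the same whether the character type is signed or unsigned is to compare after converting to the corresponding unsigned type. The name is_ascii and the sketch namespace are hypothetical, and this uses a template instead of the four overloads listed above.

```cpp
#include <type_traits>

namespace sketch {

// A negative value of a signed character type becomes a large unsigned
// value after the cast, so one comparison covers both c >= 0 and c <= 127.
template <typename CharT>
constexpr bool is_ascii(CharT c) noexcept
{
    using U = typename std::make_unsigned<CharT>::type;
    return static_cast<U>(c) <= 127;
}

} // namespace sketch

static_assert(sketch::is_ascii('A'), "plain ASCII letter");
static_assert(sketch::is_ascii(U'\x7F'), "127 is the last ASCII value");
static_assert(!sketch::is_ascii(static_cast<char>(0x80)), "high bit set in a char");
static_assert(!sketch::is_ascii(u'\u00E9'), "U+00E9 is outside ASCII");
```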

I love your proposal, though, since at where I work we've had to implement an equivalent of these functions on who knows how many projects at this point...

Melissa

Matthew Fioravante

Oct 7, 2014, 8:19:51 AM10/7/14
to std-pr...@isocpp.org


On Tuesday, October 7, 2014 4:51:12 AM UTC-4, Myriachan wrote:

> The first question--about fromdigit--was more about whether there was an inverse operation: the equivalent of "0123456789abcdef"[x]

Good idea, added.
 
> This is less necessary, though, since it's obviously very simple to implement. =)

All of these are simple but annoying to implement and test.
 

> toint works as a name, I think, but sucks for readability; to_int would be better for readability, but doesn't fit the pattern of the rest of the functions' names. If the function were named toint or similar, the inverse operation could be named tochar or similar. I suppose that for tochar, lowercase would be what is returned for digits 10-15. A Boolean parameter to select whether to use uppercase could be possible, or a bool template parameter, but just wrapping toupper around tochar would work.

I've added fromdigit(int d, int m) and fromxdigit(int d, int m, bool upper).

Other options are:
fromxdigit();
fromXdigit(); //This doesn't mirror the toxdigit() functions

Or:
fromxdigit();
toupper(fromxdigit()); //Fully qualified, this is a lot of typing

All of them should inline down to the same thing so it doesn't matter too much.
 

> I think that this should be changed to <= 127, in order to avoid issues related to 128 not being a valid char on platforms with an 8-bit signed char (i.e., most). I'm sure that readers would have understood what was meant, so it's really just me being pedantic again:

When it's time to encode it in standardese, being pedantic matters, so I fixed this one.
 

> constexpr bool std::ascii::isascii(char c) noexcept;
> constexpr bool std::ascii::isascii(wchar_t c) noexcept;
> constexpr bool std::ascii::isascii(char16_t c) noexcept;
> constexpr bool std::ascii::isascii(char32_t c) noexcept;
>
> Returns: true if c >= 0 && c < 128
>
> I love your proposal, though, since at where I work we've had to implement an equivalent of these functions on who knows how many projects at this point...

So have I and I'm tired of doing it. Glad to see there are others out there who could use this functionality. 

Matthew Fioravante

Oct 7, 2014, 8:20:42 AM10/7/14
to std-pr...@isocpp.org
Let me know if you see anything else. Otherwise I'm going to submit this paper, strong bitset, and the alignment paper tonight for Urbana.

Thanks!