How to convert UTF-8 literal to `char const*` at compile time in C++20?

Alf P. Steinbach

May 28, 2019, 9:47:23 PM

This is actually just a test of whether I've misconfigured Thunderbird
to killfile /all/ new articles in clc++, but, for the sake of asking a
question:

Let's say someone has the following valid C++17 code:

-----------------------------------------------------------------
using Byte = unsigned char;

constexpr auto is_utf8_tail_byte( const char ch )
    -> bool
{ return (Byte( ch ) >> 6) == 0b10; }

constexpr auto n_code_points( char const* s )
    -> int
{
    int n = 0;
    while( *s ) {
        ++s;
        while( is_utf8_tail_byte( *s ) ) { ++s; }
        ++n;
    }
    return n;
}

auto main() -> int
{
    constexpr int n = n_code_points( u8"æøå" ); // Should be 3.
    return n;
}
-----------------------------------------------------------------

Now someone upgrades the compiler to C++20, and suddenly the type
of the literal is the very incompatible `const char8_t[7]`.
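
To make the type change concrete, here is a minimal sketch (not part of the original post) that checks the literal's type at compile time. The alias names `U8_unit` and `Literal` are made up for illustration. The `__cpp_char8_t` feature-test macro selects the expected element type, so the same check should compile both before and after the switch to C++20. The literal is spelled with universal character names so that the result does not depend on the source file encoding.

```cpp
#include <type_traits>

// Hypothetical alias: the element type a u8 literal is expected to have.
#ifdef __cpp_char8_t
using U8_unit = char8_t;    // C++20 and later
#else
using U8_unit = char;       // C++17 and earlier
#endif

// The literal's type with the reference stripped off (decltype of a string
// literal yields an lvalue reference to a const array).
using Literal = std::remove_reference_t<decltype( u8"\u00E6\u00F8\u00E5" )>;

// Three two-byte characters plus the terminating zero: seven code units.
static_assert( std::is_same<Literal, const U8_unit[7]>::value,
    "unexpected type for the u8 literal" );

#ifdef __cpp_char8_t
// The decayed pointer does not convert implicitly to `char const*`,
// which is why the call in the original program stops compiling.
static_assert( !std::is_convertible<const char8_t*, const char*>::value,
    "char8_t pointers are not expected to convert to char pointers" );
#endif
```

Both assertions are evaluated entirely at compile time, so nothing here needs to run.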

Assume the `constexpr` `n` is used for something very compile-time-ish,
e.g. a buffer size.

What to do?

Cheers!,

- Alf

Tim Rentsch

May 29, 2019, 3:27:02 PM

"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> Let's say someone has the following valid C++17 code:
>
>
> -----------------------------------------------------------------
> using Byte = unsigned char;
>
> constexpr auto is_utf8_tail_byte( const char ch )
>     -> bool
> { return (Byte( ch ) >> 6) == 0b10; }
>
> constexpr auto n_code_points( char const* s )
>     -> int
> {
>     int n = 0;
>     while( *s ) {
>         ++s;
>         while( is_utf8_tail_byte( *s ) ) { ++s; }
>         ++n;
>     }
>     return n;
> }
>
> auto main() -> int
> {
>     constexpr int n = n_code_points( u8"[lost in newsreader]" ); // [3]
>     return n;
> }
> -----------------------------------------------------------------
>
>
> Now someone upgrades the compiler to C++20, and suddenly the type
> of the literal is the very incompatible `const char8_t[7]`.
>
> Assume the `constexpr` `n` is used for something very
> compile-time-ish, e.g. a buffer size.
>
> What to do?

This can be done using constexpr overloaded functions, but how
about just using a template?

template < typename T >
constexpr int n_code_points( const T *s ){
// suitable body goes here
}

Alf P. Steinbach

May 29, 2019, 5:52:26 PM

I don't see how.

> but how about just using a template?
>
> template < typename T >
> constexpr int n_code_points( const T *s ){
> // suitable body goes here
> }

Yes, if one can modify that checker code.

But suppose one can't?

Cheers!,

- Alf

Tim Rentsch

May 30, 2019, 10:34:03 AM

Let me first make sure I understand what you mean by the
next comment, then we can get back to this one.

>> but how about just using a template?
>>
>> template < typename T >
>> constexpr int n_code_points( const T *s ){
>> // suitable body goes here
>> }
>
> Yes, if one can modify that checker code.
>
> But suppose one can't?

When you say "that checker code", do you mean that it
must use the routine 'is_utf8_tail_byte', and use it
as written?

Normally I expect this wouldn't be a problem, because the
different types we want to address would very likely convert to
'char' in an appropriate way, and converting to 'char' (ie, from
the element type of the argument) would happen automatically when
calling is_utf8_tail_byte(). However, if it is necessary to do
some sort of special conversion for some argument types, that
can be done using another template function, as for example
(disclaimer: not compiled):

template < typename T >
constexpr char convert_to_char( T );

template < typename T >
constexpr int n_code_points( const T *s ){

...
... is_utf8_tail_byte( convert_to_char( *s ) ) ...
...
}

and use template specialization for 'convert_to_char' to get
the specific behavior desired in each particular case. Does
that make sense?
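
A compilable sketch of the outline above, under stated assumptions: the primary `convert_to_char` template is given a default body that simply casts, and the explicit specialization for `unsigned int` (which keeps only the low eight bits) is a hypothetical example of a "special conversion", not something the thread specifies. The array name `wide_units` is likewise made up.

```cpp
using Byte = unsigned char;

// The checker from the original post, used as written.
constexpr bool is_utf8_tail_byte( const char ch )
{
    return (Byte( ch ) >> 6) == 0b10;
}

// Primary template: by default a code unit is just cast to char.
template < typename T >
constexpr char convert_to_char( T unit )
{
    return static_cast<char>( unit );
}

// Hypothetical specialization: for `unsigned int` units, keep the low byte.
template <>
constexpr char convert_to_char<unsigned int>( unsigned int unit )
{
    return static_cast<char>( unit & 0xFFu );
}

// The counting loop, templated on the unit type and routed through
// convert_to_char before the unchanged checker is called.
template < typename T >
constexpr int n_code_points( const T *s )
{
    int n = 0;
    while( *s ){
        ++s;
        while( is_utf8_tail_byte( convert_to_char( *s ) ) ){ ++s; }
        ++n;
    }
    return n;
}

// The UTF-8 encoding of three two-byte characters, stored one unit per int.
constexpr unsigned int wide_units[] = { 0xC3, 0xA6, 0xC3, 0xB8, 0xC3, 0xA5, 0 };

static_assert( n_code_points( wide_units ) == 3, "three code points" );
static_assert( n_code_points( u8"\u00E6\u00F8\u00E5" ) == 3, "three code points" );
```

The second assertion goes through the primary template whether the literal's element type is `char` or `char8_t`, so it should hold both before and after C++20.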

Alf P. Steinbach

May 30, 2019, 12:07:47 PM

Yes.

> Normally I expect this wouldn't be a problem, because the
> different types we want to address would very likely convert to
> 'char' in an appropriate way, and converting to 'char' (ie, from
> the element type of the argument) would happen automatically when
> calling is_utf8_tail_byte(). However, if it is necessary to do
> some sort of special conversion for some argument types, that
> can be done using another template function, as for example
> (disclaimer: not compiled):
>
> template < typename T >
> constexpr char convert_to_char( T );
>
> template < typename T >
> constexpr int n_code_points( const T *s ){
> ...
> ... is_utf8_tail_byte( convert_to_char( *s ) ) ...
> ...
> }
>
> and use template specialization for 'convert_to_char' to get
> the specific behavior desired in each particular case. Does
> that make sense?

Yes, it narrows the problem down to writing a `convert_to_char_ptr`
function that is `constexpr`, which I don't think is possible.

And as such it narrows down what is, or certainly appears to be,
problematic about the C++20 change of type for a `u8` literal, and for
that matter, also for its change of result type for
`std::filesystem::path::u8string`...

It's like the committee effectively has adopted a goal of sabotaging use
of Unicode in C++, I would assume driven by a small clique of members.

Which, since that's critical, amounts to sabotaging general use of C++.

In my view, if this functionality is used, then they're breaking a lot
of code, and if it's generally not used, they should consider replacing
it instead of changing a detail and breaking some few enthusiasts' code
and having effectively judged ungood functionality in the standard.

Cheers!,

- Alf

Vir Campestris

May 30, 2019, 4:34:00 PM

On 30/05/2019 17:07, Alf P. Steinbach wrote:
> It's like the committee effectively has adopted a goal of sabotaging use
> of Unicode in C++, I would assume driven by a small clique of members.

Never assume malice when incompetence is enough.

The one I recall from some years ago was trying to open files on
Windows, where the native paths were 16-bit characters (in the days
before Unicode overflowed them).

*nixers were OK with narrow chars, as UTF8 is narrow. But there was no
STL method to refer to files with wide char names.

Andy
--
I haven't hit the filter yet on this group.
Yet.

Tim Rentsch

May 31, 2019, 2:33:36 AM

I'm not suggesting converting any pointers. What is being
converted is what the pointer argument points at, which is to say
an integer or character type. To illustrate here is a complete
program:

using Byte = unsigned char;

constexpr bool
is_utf8_tail_byte( const char ch ){
    return (Byte( ch ) >> 6) == 0b10;
}

template < typename T >
constexpr int
n_code_points( T const* s ){
    int n = 0;
    while( *s ){
        ++s;
        while( is_utf8_tail_byte( *s ) ){ ++s; }
        ++n;
    }
    return n;
}

constexpr char          t1[] = "\xE6\xF8\xE5";
constexpr unsigned char t2[] = "\xE6\xF8\xE5";
constexpr signed char   t3[] = "\xE6\xF8\xE5";
constexpr unsigned int  t4[] = { 0xE6, 0xF8, 0xE5, 0 };
constexpr signed int    t5[] = { 0xE6, 0xF8, 0xE5, 0 };

constexpr int n1 = n_code_points( t1 );
constexpr int n2 = n_code_points( t2 );
constexpr int n3 = n_code_points( t3 );
constexpr int n4 = n_code_points( t4 );
constexpr int n5 = n_code_points( t5 );

static char test_n1[ n1 ];
static char test_n2[ n2 ];
static char test_n3[ n3 ];
static char test_n4[ n4 ];
static char test_n5[ n5 ];

#include <cstdio>

int
main(){
    constexpr int n = n_code_points( u8"\xE6\xF8\xE5" ); // Should be 3
    printf( "n is %d\n", n );
    printf( "n1 is %d\n", n1 );
    printf( "n2 is %d\n", n2 );
    printf( "n3 is %d\n", n3 );
    printf( "n4 is %d\n", n4 );
    printf( "n5 is %d\n", n5 );
    return 0;
}

The program shown compiles and runs in C++14. Its output:

n is 3
n1 is 3
n2 is 3
n3 is 3
n4 is 3
n5 is 3

Does this approach do what you want? If not, in what way does it fall short?

Alf P. Steinbach

May 31, 2019, 1:06:30 PM

Well, thanks, but converting single bytes was not the issue, really.

You might consider n_code_points a non-template function offered by a
library.

Since it's `constexpr` the source code must be available, so I guess
that yes, one could duplicate that source code and make it a template.
But then what if a bug is fixed in the original library version? Then
every programmer who's made a templated copy must also fix that bug.

Cheers!,

- Alf

Manfred

Jun 5, 2019, 1:58:27 PM

On 5/30/19 6:07 PM, Alf P. Steinbach wrote:
> On 30.05.2019 16:33, Tim Rentsch wrote:
>>
>> and use template specialization for 'convert_to_char' to get
>> the specific behavior desired in each particular case. Does
>> that make sense?
>
> Yes, it narrows the problem down to writing a `convert_to_char_ptr`
> function that is `constexpr`, which I don't think is possible.

It doesn't look to be required:
---------------------
#include <iostream>

#if 1 // or 0, result is unaffected
template<typename Char>
constexpr auto is_utf8_tail_byte( const Char ch )
    -> bool
{ return (ch & 0b11000000) == 0b10000000; }
#else

using Byte = unsigned char;
constexpr auto is_utf8_tail_byte( const char ch )
    -> bool
{ return (Byte( ch ) >> 6) == 0b10; }

#endif // 0

template<typename Char>
constexpr auto n_code_points( Char const* s )
    -> int
{
    int n = 0;
    while( *s ) {
        do {
            ++s;
        } while( is_utf8_tail_byte( *s ) );
        ++n;
    }
    return n;
}

auto main() -> int
{
    static constexpr auto s = u8"æøå";
    constexpr int n = n_code_points( s ); // Should be 3.

    static_assert(3 == n_code_points( s ));

    std::cout << "length of \'" << s << "\' is " << n << std::endl;
}

---------------------

>
> And as such it narrows down what is or certainly appears to be
> problematic about the C++20 change of type for an `u8`-literal, and for
> that matter, also for its change of result type for
> `std::filesystem::path::u8string`...
>
> It's like the committee effectively has adopted a goal of sabotaging use
> of Unicode in C++, I would assume driven by a small clique of members.
>
> Which, since that's critical, amounts to sabotaging general use of C++.

What I found most disturbing, while toying with your code, is the output:
length of '0x402010' is 3
(this is from gcc 9.1 with -std=c++2a)
Really, inserting a u8 string does that?

Alf P. Steinbach

Jun 5, 2019, 5:14:53 PM

I believe this is about what Tim Rentsch suggested.

It relies on implicit value conversion of single characters, while the
problem is with strings.

Of course one may just copy the relevant code (e.g. `n_code_points`) and
make a template of it, but copying code is brittle, in particular wrt.
maintenance of the original code, and it could be a lot of code that
would need this templatization treatment.

>> And as such it narrows down what is or certainly appears to be
>> problematic about the C++20 change of type for an `u8`-literal, and
>> for that matter, also for its change of result type for
>> `std::filesystem::path::u8string`...
>>
>> It's like the committee effectively has adopted a goal of sabotaging
>> use of Unicode in C++, I would assume driven by a small clique of
>> members.
>>
>> Which, since that's critical, amounts to sabotaging general use of C++.
>
> What I found most disturbing, while toying with your code, is the output:
> length of '0x402010' is 3
> (this is from gcc 9.1 with -std=c++2a)
> Really, inserting a u8 string does that?

I don't know what iostreams support is planned for UTF-8 strings.

Presumably the effect above is just that the standard library
implementation you used is not yet C++20-compliant.

Tim Rentsch

Jun 19, 2019, 4:15:27 PM

Sorry, I was trying to be helpful. I don't know what it is
you hope to accomplish. If a given piece of code isn't
going to tolerate the transition to C++20 then either the
function being called or the sites that call it have to
change, if it must work in C++20. It seems like you want
everything to be the same and yet act differently. So
either I don't understand what you want or what you want
simply cannot be provided.
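
For readers who land on this thread with the same constraint (a non-template `constexpr` function taking `char const*` that cannot be edited), here is a sketch of one possible call-site adapter. Nobody in the thread proposed it, and the names `Char_buffer` and `as_chars` are hypothetical. The idea is to copy the literal's code units into a by-value buffer of plain `char` during constant evaluation and hand the library function a pointer into that temporary. It changes the call site only, and it assumes the function reads the string but does not keep the pointer.

```cpp
#include <cstddef>

// Stand-in for the unmodifiable library code from the original post.
using Byte = unsigned char;

constexpr auto is_utf8_tail_byte( const char ch )
    -> bool
{ return (Byte( ch ) >> 6) == 0b10; }

constexpr auto n_code_points( char const* s )
    -> int
{
    int n = 0;
    while( *s ) {
        ++s;
        while( is_utf8_tail_byte( *s ) ) { ++s; }
        ++n;
    }
    return n;
}

// Hypothetical helper type: a buffer of plain chars that can be returned by value.
template< std::size_t N >
struct Char_buffer { char data[N]; };

// Hypothetical adapter: copies each code unit of a character array,
// terminating zero included, into a Char_buffer of the same length.
template< typename Unit, std::size_t N >
constexpr auto as_chars( const Unit (&s)[N] )
    -> Char_buffer<N>
{
    Char_buffer<N> result{};
    for( std::size_t i = 0; i < N; ++i ) {
        result.data[i] = static_cast<char>( s[i] );
    }
    return result;
}

// The temporary buffer lives until the end of the full expression,
// which covers the whole call to n_code_points.
constexpr int n = n_code_points( as_chars( u8"\u00E6\u00F8\u00E5" ).data );
static_assert( n == 3, "three code points expected" );
```

Because `as_chars` is templated on the unit type, the same call site should compile whether the literal's elements are `char` (C++17) or `char8_t` (C++20). The cost is one copy of the string per call during constant evaluation.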