Comments on P0372R0, A type for utf-8 data

Tom Honermann

Jun 14, 2016, 10:55:38 PM

to std-pr...@isocpp.org, bigch...@gmail.com, dccit...@gmail.com

First, thank you for writing this paper! It has been on my todo list to
write such a proposal, but alas...

I spoke with Richard Smith about such a proposal in Jacksonville and he
mentioned a further justification for supporting a char8_t type -
optimization. Today, compilers are limited in optimizing code involving
char and unsigned char glvalues because these types are allowed to alias
objects of other types (C++14 3.10 [basic.lval] p10). If a char8_t type
were to be added that adhered to strict aliasing, then compilers could
more aggressively optimize code involving it. I think this may be a
benefit worth adding to the paper.
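
A minimal sketch of the aliasing concern described above. Since `char8_t` is only proposed, the sketch assumes a hypothetical stand-in, `utf8_unit`, declared as an enum over `unsigned char`; the names `Text`, `fill_narrow`, and `fill_utf8` are likewise made up for illustration. Whether a given compiler actually hoists the load is not guaranteed; the sketch only shows where the language rules permit it.

```cpp
#include <cstddef>

// Hypothetical stand-in for the proposed char8_t: a distinct type with the
// representation of unsigned char. Unlike char and unsigned char, it is not
// one of the types allowed to alias objects of other types.
enum class utf8_unit : unsigned char {};

struct Text {
    std::size_t length;
};

// Stores through char* may legally modify text.length, so the compiler has
// to assume the loop bound can change and reload it on every iteration.
void fill_narrow(char* dst, const Text& text) {
    for (std::size_t i = 0; i < text.length; ++i) {
        dst[i] = 'x';
    }
}

// Stores through utf8_unit* cannot legally modify a std::size_t object, so
// the compiler is free to load text.length once before the loop.
void fill_utf8(utf8_unit* dst, const Text& text) {
    for (std::size_t i = 0; i < text.length; ++i) {
        dst[i] = static_cast<utf8_unit>('x');
    }
}
```

Both functions produce the same result; the difference is only in what the optimizer is allowed to assume about `text.length` inside the loop.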

Tom.

Nicol Bolas

Jun 14, 2016, 11:32:23 PM

to ISO C++ Standard - Future Proposals, bigch...@gmail.com, dccit...@gmail.com

I'm quite certain that the proposal makes this illegal:

const char8_t *str = "Some String";

`char8_t` is meant for UTF-8 strings only. And most people's strings are narrow character strings; on specific platforms, this may work out to being UTF-8, but there is no guarantee of that. We need to differentiate between narrow character strings and UTF-8 encoded strings at the type level.

The last thing we want is to encourage people to do this:

auto str = (const char8_t *)"Some String";

If people start doing casts like that to take advantage of more aggressive optimizations, then we'll be right back where we were before: we won't know if a string really is UTF-8 or not.

Solving the "char as byte array and string" problem is important. But we shouldn't suggest that `char8_t` constitutes such a solution.
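
The type-level distinction between narrow and UTF-8 strings can be sketched with a type that already exists. The example below uses `char16_t` as an analogue of the proposed `char8_t`, on the assumption that `char8_t` would follow the same conversion rules; the alias and variable names are made up for illustration.

```cpp
#include <type_traits>

// decltype of a string literal is a reference to a const array of its
// code units, so these aliases capture the literal types themselves.
using narrow_literal = decltype("Some String");   // const char (&)[12]
using utf16_literal  = decltype(u"Some String");  // const char16_t (&)[12]

// A narrow literal decays to const char*, but there is no implicit
// conversion from it to a pointer to a different character type.
constexpr bool narrow_to_narrow =
    std::is_convertible<narrow_literal, const char*>::value;
constexpr bool narrow_to_utf16 =
    std::is_convertible<narrow_literal, const char16_t*>::value;
constexpr bool utf16_to_utf16 =
    std::is_convertible<utf16_literal, const char16_t*>::value;

static_assert(narrow_to_narrow, "narrow literal converts to const char*");
static_assert(!narrow_to_utf16, "narrow literal does not convert to const char16_t*");
static_assert(utf16_to_utf16, "UTF-16 literal converts to const char16_t*");
```

Only an explicit cast gets past the second check, which is exactly the pattern the proposal should avoid encouraging.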

Tom Honermann

Jun 14, 2016, 11:44:19 PM

to std-pr...@isocpp.org, bigch...@gmail.com, dccit...@gmail.com

On 06/14/2016 11:32 PM, Nicol Bolas wrote:

> On Tuesday, June 14, 2016 at 10:55:38 PM UTC-4, Tom Honermann wrote:
> > First, thank you for writing this paper! It has been on my todo list to
> > write such a proposal, but alas...
> >
> > I spoke with Richard Smith about such a proposal in Jacksonville and he
> > mentioned a further justification for supporting a char8_t type -
> > optimization. Today, compilers are limited in optimizing code involving
> > char and unsigned char glvalues because these types are allowed to alias
> > objects of other types (C++14 3.10 [basic.lval] p10). If a char8_t type
> > were to be added that adhered to strict aliasing, then compilers could
> > more aggressively optimize code involving it. I think this may be a
> > benefit worth adding to the paper.
>
> I'm quite certain that the proposal makes this illegal:
>
> const char8_t *str = "Some String";

I would hope so.

> `char8_t` is meant for UTF-8 strings only. And most people's strings are narrow character strings; on specific platforms, this may work out to being UTF-8, but there is no guarantee of that. We need to differentiate between narrow character strings and UTF-8 encoded strings at the type level.
>
> The last thing we want is to encourage people to do this:
>
> auto str = (const char8_t *)"Some String";

I agree.

> If people start doing casts like that to take advantage of more aggressive optimizations, then we'll be right back where we were before: we won't know if a string really is UTF-8 or not.
>
> Solving the "char as byte array and string" problem is important. But we shouldn't suggest that `char8_t` constitutes such a solution.

I don't think the ability to abuse a feature should be sufficient justification to not add it. I did not intend to suggest that char8_t be used to circumvent existing aliasing rules. Rather, that giving it strict aliasing behavior would enable optimizations for UTF-8 data. That could potentially provide some motivation towards using UTF-8 strings in preference to narrow strings.

Tom.

Tom Honermann

Jun 15, 2016, 11:07:27 AM

to std-pr...@isocpp.org, bigch...@gmail.com, dccit...@gmail.com

I'd also like to propose that the implicit conversion from u8"" to const
char[] and u8'x' to char be introduced as deprecated features that can
be removed in a future standard.

Is there any implementation experience? Any chance that patches to gcc
or Clang exist? If so, I would be interested in experimenting with them.

Tom.

Nicol Bolas

Jun 15, 2016, 11:56:32 AM

to ISO C++ Standard - Future Proposals, bigch...@gmail.com, dccit...@gmail.com

Right, but it already has that. `char8_t`, based on the "unique, unsigned type" statement in the proposal, is a different type from `char` and `unsigned char`. It has the same value representation as those two, but the way strict aliasing is defined already does not allow `char8_t*` to alias with other types. Just as it doesn't allow `char16_t*` or `char32_t*` to do so. The same goes for enums that use `char` as their underlying type; arrays of them are not `char*`s as far as the strict aliasing rules are concerned.

The strict aliasing rules do not care what the underlying type of something is.
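
A small sketch of the enum case, using a hypothetical `code_unit` enum whose underlying type is `char`. Type traits cannot observe the aliasing rules directly; the checks below only show that the enum is a distinct type despite sharing `char`'s size and underlying type, and the aliasing exemption is granted per type, not per underlying type.

```cpp
#include <type_traits>

// Hypothetical enum whose underlying type is char.
enum class code_unit : char {};

constexpr bool same_underlying =
    std::is_same<std::underlying_type<code_unit>::type, char>::value;
constexpr bool same_type = std::is_same<code_unit, char>::value;
constexpr bool same_size = sizeof(code_unit) == sizeof(char);
constexpr bool pointer_converts =
    std::is_convertible<code_unit*, char*>::value;

static_assert(same_underlying, "the underlying type is char");
static_assert(same_size, "the enum has the same size as char");
static_assert(!same_type, "but the enum is not the type char");
static_assert(!pointer_converts, "and code_unit* does not convert to char*");
```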

What I'm saying is that we shouldn't advertise this as a selling point of the feature. It shouldn't be listed in the motivation section, for example. Otherwise you will encourage people to abuse it.

Tom Honermann

Jun 15, 2016, 5:21:28 PM

to std-pr...@isocpp.org, bigch...@gmail.com, dccit...@gmail.com

On 6/15/2016 11:56 AM, Nicol Bolas wrote:

> > > If people start doing casts like that to take advantage of more aggressive optimizations, then we'll be right back where we were before: we won't know if a string really is UTF-8 or not.
> > >
> > > Solving the "char as byte array and string" problem is important. But we shouldn't suggest that `char8_t` constitutes such a solution.
> >
> > I don't think the ability to abuse a feature should be sufficient justification to not add it. I did not intend to suggest that char8_t be used to circumvent existing aliasing rules. Rather, that giving it strict aliasing behavior would enable optimizations for UTF-8 data. That could potentially provide some motivation towards using UTF-8 strings in preference to narrow strings.
>
> Right, but it already has that. `char8_t`, based on the "unique, unsigned type" statement in the proposal, is a different type from `char` and `unsigned char`. It has the same value representation as those two, but the way strict aliasing is defined already does not allow `char8_t*` to alias with other types. Just as it doesn't allow `char16_t*` or `char32_t*` to do so. The same goes for enums that use `char` as their underlying type; arrays of them are not `char*`s as far as the strict aliasing rules are concerned.

Until we have wording or the proposal states otherwise, we don't know what we have. I agree that changes particular to aliasing would have to be made to the standard if the type was intended not to follow strict aliasing rules.

> The strict aliasing rules do not care what the underlying type of something is.
>
> What I'm saying is that we shouldn't advertise this as a selling point of the feature. It shouldn't be listed in the motivation section, for example. Otherwise you will encourage people to abuse it.

Uh oh, I think the cat is out of the bag...

Name a feature that people haven't figured out how to abuse. This isn't any different. Listing the potential benefit and facilitating discussion on the potential for abuse strikes me as a better approach than "shhh".

Tom.

Michael Spencer

Jun 17, 2016, 4:18:43 PM

to Tom Honermann, std-pr...@isocpp.org, Davide Italiano

I also had a quite similar conversation with Richard :).

We did consider covering aliasing in the paper, but in the end we felt
that it detracted from the core message of C++ needing a type for
utf-8. The aliasing properties are indeed useful for optimization, but
just adding new distinct types is a bad solution to the general
aliasing problem.

- Michael Spencer
