On Tuesday, June 14, 2016 at 10:55:38 PM UTC-4, Tom Honermann wrote:
First, thank you for writing this paper! It has been on my todo list to
write such a proposal, but alas...
I spoke with Richard Smith about such a proposal in Jacksonville and he
mentioned a further justification for supporting a char8_t type -
optimization. Today, compilers are limited in optimizing code involving
char and unsigned char glvalues because these types are allowed to alias
objects of other types (C++14 3.10 [basic.lval] p10). If a char8_t type
were to be added that adhered to strict aliasing, then compilers could
more aggressively optimize code involving it. I think this may be a
benefit worth adding to the paper.
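To make the aliasing point concrete, here is a minimal sketch. `Buffer`, `fill_char`, and `fill_char16` are hypothetical names invented for illustration, and `char16_t` stands in for the proposed `char8_t` because it already exists and has no aliasing exemption. Whether a given compiler actually hoists the load depends on the compiler and the optimization level.

```cpp
#include <cstddef>

// Hypothetical type, used only to have a non-char object in play.
struct Buffer {
    std::size_t size;
};

// A char glvalue may alias any object, including buf->size. The compiler
// therefore may have to assume each store through `out` could change
// buf->size, and re-read it on every iteration.
void fill_char(Buffer* buf, char* out) {
    for (std::size_t i = 0; i < buf->size; ++i)
        out[i] = 0;
}

// char16_t has no such exemption. A store through `out` cannot validly
// modify a std::size_t object, so the compiler is allowed to load
// buf->size once before the loop.
void fill_char16(Buffer* buf, char16_t* out) {
    for (std::size_t i = 0; i < buf->size; ++i)
        out[i] = 0;
}
```

Both functions behave identically for well-formed callers; the difference is only in what the optimizer is permitted to assume.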
I'm quite certain that the proposal makes this illegal:
const char8_t *str = "Some String";
`char8_t` is meant for UTF-8 strings only. And most people's strings are narrow character strings; on specific platforms, this may work out to be UTF-8, but there is no guarantee of that. We need to differentiate between narrow character strings and UTF-8 encoded strings at the type level.
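As a rough sketch of what type-level differentiation buys, the following uses `char16_t` and `u"..."` literals as a stand-in, since they already behave the way the proposal presumably intends `char8_t` and `u8"..."` to behave. The `encoding_of` overload set is hypothetical and exists only for illustration.

```cpp
#include <string>

// Hypothetical overload set: the pointer type alone selects the encoding.
inline std::string encoding_of(const char*)     { return "narrow"; }
inline std::string encoding_of(const char16_t*) { return "utf-16"; }

// A narrow literal does not convert to a pointer to a distinct character type:
// const char16_t* bad = "Some String";   // ill-formed
const char16_t* ok = u"Some String";      // fine: the literal is an array of const char16_t
```

Calling `encoding_of("Some String")` picks the narrow overload and `encoding_of(u"Some String")` picks the other one, with no cast involved.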
The last thing we want is to encourage people to do this:
auto str = (const char8_t *)"Some String";
If people start doing casts like that to take advantage of more aggressive optimizations, then we'll be right back where we were before: we won't know if a string really is UTF-8 or not.
Solving the "char as byte array and string" problem is important. But we shouldn't suggest that `char8_t` constitutes such a solution.
I don't think the ability to abuse a feature should be sufficient justification to not add it. I did not intend to suggest that char8_t be used to circumvent existing aliasing rules. Rather, that giving it strict aliasing behavior would enable optimizations for UTF-8 data. That could potentially provide some motivation towards using UTF-8 strings in preference to narrow strings.
Right, but it already has that. `char8_t`, based on the "unique, unsigned type" statement in the proposal, is a different type from `char` and `unsigned char`. It has the same value representation as those two, but the way strict aliasing is defined already does not allow `char8_t*` to alias with other types. Just as it doesn't allow `char16_t*` or `char32_t*` to do so. The same goes for enums that use `char` as their underlying type; as far as the strict aliasing rules are concerned, arrays of them are not `char*`s.
The strict aliasing rules do not care what the underlying type of something is.
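The "distinct type" claim can be checked at compile time. This is a small sketch using only C++11 facilities; `Byte` is a hypothetical enum introduced for illustration, and `char16_t` again stands in for the proposed `char8_t`.

```cpp
#include <type_traits>

// Hypothetical enum whose underlying type is char.
enum class Byte : char {};

// The underlying type really is char, and the size matches...
static_assert(std::is_same<std::underlying_type<Byte>::type, char>::value,
              "Byte is represented as a char");
static_assert(sizeof(Byte) == sizeof(char), "same size as char");

// ...but Byte is still its own type, so the char aliasing exemption
// does not extend to it.
static_assert(!std::is_same<Byte, char>::value, "Byte is not char");

// char16_t is likewise distinct from every ordinary integer type,
// whatever its representation.
static_assert(!std::is_same<char16_t, unsigned short>::value,
              "char16_t is not unsigned short");
```

The aliasing exemption is tied to the types `char` and `unsigned char` themselves, not to anything that merely shares their representation.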
What I'm saying is that we shouldn't advertise this as a selling point of the feature. It shouldn't be listed in the motivation section, for example. Otherwise you will encourage people to abuse it.