References:
https://groups.google.com/a/isocpp.org/d/topic/std-discussion/jGr2bZXWntc/discussion
https://groups.google.com/d/topic/comp.lang.c++.moderated/4CBsrFuMFBc/discussion

Since the discussions referenced above didn't come to any useful conclusion (at least I don't see one) and have been inactive for some time now, I would like to revisit the pros and cons of adding a distinct code unit type for UTF-8 string literals.
Those discussions also went somewhat off-topic when they turned to the pros and cons of the various Unicode CEFs/CESes for APIs and for transport/storage of string data. Needless to say, an additional verification/encoding-detection step is required when, e.g., reading text from a file. This discussion is purely about string literals and their practical usage in code.
Since C++11 there are the following character/string literal types:
- char / char* for the narrow execution character set OR UTF-8
- wchar_t / wchar_t* for the wide execution character set
- char16_t / char16_t* for UTF-16
- char32_t / char32_t* for UTF-32
As it currently stands there exist some problems with UTF-8 string literals that are not present with UTF-16/UTF-32 string literals:
a) String literals encoded using the narrow execution charset and string literals encoded using UTF-8 can be mixed (by accident).
b) [Inherited from a)] Overload resolution dependent on the (implicit) encoding type is impossible.
Regarding a):
char16_t and char32_t can't be mixed with wchar_t/int16_t/uint16_t/int32_t/uint32_t by accident. Doing so would require an explicit cast via reinterpret_cast (static_cast is not sufficient). Sadly, the same can't be said for UTF-8 string literals (see the code example below).
I would argue that code that mixes strings with different encodings (for example passing a UTF-8 encoded string to an API that expects a string encoded using the execution charset) is inherently broken and needs to be fixed. Too bad this can't currently be checked at compile time.
Regarding b):
At the moment there is no way to differentiate between u8"" (UTF-8) and "" (narrow execution charset) at compile time (and even at runtime it is quite hard to do correctly, if not impossible).
See for example the following code:
#include <iostream>

void f(char const*) { std::cout << "narrow\n"; }
void f(wchar_t const*) { std::cout << "wide\n"; }
//void f(??? const*) { std::cout << "UTF-8\n"; } // No distinct type for UTF-8 string literals.
void f(char16_t const*) { std::cout << "UTF-16\n"; }
void f(char32_t const*) { std::cout << "UTF-32\n"; }

int main() {
    f("");
    f(L"");
    f(u8""); // How are we supposed to invoke the UTF-8 overload?
    f(u"");
    f(U"");
    return 0;
}
[online demonstration using Ideone]
To invoke a UTF-8 aware overload, the only possibility right now is to try to detect the encoding at runtime and dispatch manually, like so:
void f_narrow(char const*) { std::cout << "narrow\n"; }
void f_utf8(char const*) { std::cout << "UTF-8\n"; }

void f(char const* x) {
    // is_utf8: Imaginary function that returns true if the passed-in string is UTF-8 encoded or false otherwise.
    if (is_utf8(x))
        f_utf8(x);
    else
        f_narrow(x);
}
Obviously this approach is ugly and slow (and heuristic at best: a byte sequence that is valid UTF-8 may also be a valid narrow-charset string). char16_t and char32_t, on the other hand, work quite well, since they are distinct types which can be used for overload resolution.
A possible solution:
Adding a new type (e.g. char8_t, to be consistent with the other type names) and specifying that UTF-8 string literals use this type would solve these problems and make the behavior consistent with UTF-16/UTF-32 string literals.

In a perfect world this new type would not be implicitly convertible to char, to prevent a). On the other hand, such an incompatible change would break existing code that relies on std::is_same<decltype(""), decltype(u8"")>::value == true. To avoid breaking said code, an implicit conversion could be defined, leaving a) in its current state but at least fixing b).
Any thoughts?