Distinct type of array elements in UTF-8 string literals (char8_t).

Max Truxa

Jun 5, 2015, 4:20:36 AM
to std-dis...@isocpp.org
References:
https://groups.google.com/a/isocpp.org/d/topic/std-discussion/jGr2bZXWntc/discussion
https://groups.google.com/d/topic/comp.lang.c++.moderated/4CBsrFuMFBc/discussion

Since the discussions referenced above didn't come to any useful conclusion (at least I don't see one) and have been inactive for some time now, I would like to revisit the pros and cons of adding a distinct code unit type for UTF-8 string literals.
Those discussions also drifted off-topic once they turned to the pros and cons of the various Unicode CEFs/CESes for APIs and for transport/storage of string data. Needless to say, an additional verification/encoding-detection step is required when, for example, reading text from a file.
This discussion is purely about string literals and their practical usage in code.

Since C++11 there are the following character/string literal types:
- char / char* for narrow execution character set OR UTF-8
- wchar_t / wchar_t* for wide execution character set
- char16_t / char16_t* for UTF-16
- char32_t / char32_t* for UTF-32
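
For reference, the following compile-time checks pass on a conforming C++11/14 implementation; the last one is exactly the problem this post is about:

#include <type_traits>

static_assert(std::is_same<decltype(""),   char     const(&)[1]>::value, "");
static_assert(std::is_same<decltype(L""),  wchar_t  const(&)[1]>::value, "");
static_assert(std::is_same<decltype(u""),  char16_t const(&)[1]>::value, "");
static_assert(std::is_same<decltype(U""),  char32_t const(&)[1]>::value, "");
// u8"" is *not* distinct from "" -- both are arrays of const char.
static_assert(std::is_same<decltype(u8""), char     const(&)[1]>::value, "");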

As it currently stands there exist some problems with UTF-8 string literals that are not present with UTF-16/UTF-32 string literals.

a) String literals encoded using the narrow execution charset and string literals encoded using UTF-8 can be mixed (by accident).
b) [Inherited from a)] Overload resolution dependent on the (implicit) encoding type is impossible.


Regarding a):

char16_t* and char32_t* can't be mixed with wchar_t*/int16_t*/uint16_t*/int32_t*/uint32_t* by accident; doing so would require an explicit reinterpret_cast (not a static_cast).
Sadly, the same can't be said for UTF-8 string literals (see the code example below).

I would argue that code that mixes strings with different encodings (for example passing a UTF-8 encoded string to an API that expects a string encoded using the execution charset) is inherently broken and needs to be fixed.
Too bad this currently can't be checked at compile time.
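
To make the accident concrete, consider a hypothetical API (the function names below are made up for illustration) that takes a string in the narrow execution charset:

// Hypothetical function expecting the narrow execution charset.
void set_window_title(char const* title);

void g() {
    set_window_title("Grüße");   // intended usage
    set_window_title(u8"Grüße"); // encoding mix-up, yet it compiles without any diagnostic
}

// The equivalent mistake with the other Unicode literals is rejected:
void set_window_title16(char16_t const* title);
// set_window_title16(U"Grüße"); // error: cannot convert char32_t const* to char16_t const*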


Regarding b):

At the moment there is no way to differentiate between u8"" (UTF-8) and "" (narrow execution charset) at compile time (and even at runtime it is quite hard to do so correctly, if not outright impossible).

See for example the following code:

#include <iostream>
 
void f(char const*) { std::cout << "narrow\n"; }
void f(wchar_t const*) { std::cout << "wide\n"; }
//void f(??? const*) { std::cout << "UTF-8\n"; } // No distinct type for UTF-8 string literals.
void f(char16_t const*) { std::cout << "UTF-16\n"; }
void f(char32_t const*) { std::cout << "UTF-32\n"; }
 
int main() {
    f("");
    f(L"");
    f(u8""); // How are we supposed to invoke the UTF-8 overload?
    f(u"");
    f(U"");
    return 0;
}

[online demonstration using Ideone]
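
(On a C++11/14 implementation this prints "narrow", "wide", "narrow", "UTF-16", "UTF-32"; the u8"" literal silently selects the char const* overload.)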

To invoke a UTF-8-aware overload, the only possibility right now is to try to detect the encoding at runtime and dispatch manually, like so:

void f_narrow(char const*) { std::cout << "narrow\n"; }
void f_utf8(char const*) { std::cout << "UTF-8\n"; }

void f(char const* x) {
    // is_utf8: Imaginary function that returns true if the passed-in string is UTF-8 encoded or false otherwise.
    if (is_utf8(x))
        f_utf8(x);
    else
        f_narrow(x);
}

Obviously this approach is ugly and slow.
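
For completeness, here is one possible sketch of such an is_utf8 check, assuming a plain structural validity test (it does not reject overlong sequences, and every ASCII-only string passes it, which is part of why runtime detection cannot be done correctly):

bool is_utf8(char const* s) {
    auto p = reinterpret_cast<unsigned char const*>(s);
    while (*p) {
        int extra;
        if (*p < 0x80)                extra = 0;     // ASCII byte
        else if ((*p & 0xE0) == 0xC0) extra = 1;     // lead byte of a 2-byte sequence
        else if ((*p & 0xF0) == 0xE0) extra = 2;     // lead byte of a 3-byte sequence
        else if ((*p & 0xF8) == 0xF0) extra = 3;     // lead byte of a 4-byte sequence
        else                          return false;  // invalid lead byte
        ++p;
        for (int i = 0; i < extra; ++i, ++p)
            if ((*p & 0xC0) != 0x80)  return false;  // missing continuation byte
    }
    return true;
}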

char16_t and char32_t on the other hand work quite well, since they are distinct types which can be used for overload resolution.


A possible solution:

Adding a new type (e.g. char8_t, to be consistent with the other type names) and specifying that UTF-8 string literals use this type would solve these problems and make the behavior consistent with UTF-16/UTF-32 string literals.
In a perfect world this new type would not be implicitly convertible to char, to prevent a). On the other hand, such an incompatible change would break existing code that relies on std::is_same<decltype(""), decltype(u8"")>::value == true. To avoid breaking such code, an implicit conversion could be defined, leaving a) in its current state but at least fixing b).
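
To make the trade-off concrete, here is a rough sketch (char8_t is of course hypothetical at this point):

// Status quo: u8"..." has element type char, so existing code like this compiles:
void legacy(char const* s);
void call_legacy() { legacy(u8"text"); } // OK today

// With a distinct char8_t and no implicit conversion to char, the call above
// would stop compiling, but overload resolution would finally work:
//   void f(char8_t const*); // chosen for u8"..." literals
//   void f(char const*);    // chosen for "..." literals
// With an implicit conversion from char8_t const* to char const*, legacy()
// keeps compiling; a) stays unsolved, but b) is fixed.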

Any thoughts?

-- Max Truxa

Nicol Bolas

Jun 5, 2015, 3:42:26 PM
to std-dis...@isocpp.org
This is not a defect; it's a proposal. So it should go into the forum for that, not here.

David Krauss

Jun 6, 2015, 1:27:29 AM
to std-dis...@isocpp.org

On 2015-06-06, at 3:42 AM, Nicol Bolas <jmck...@gmail.com> wrote:

This is not a defect; it's a proposal. So it should go into the forum for that, not here.

Signed UTF-8 characters are just as defective as C++03 non-const access to string literals, in my book.

The solution could be the same: allow a special implicit conversion for u8"" literals to char const*, with the actual type being unsigned.
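
A sketch of what that suggestion would mean in practice (the conversion rule and the type name char8_t are hypothetical here):

void f(char const* s);    // existing narrow API keeps working
void g(char8_t const* s); // new UTF-8-aware overload becomes possible

void use() {
    f(u8"text"); // still OK via the special implicit conversion for u8 literals
    g(u8"text"); // now expressible, because the literal's element type is distinct
}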

IMHO it deserves a DR and a proposal. The proposal would count as “evolution,” though.

Max Truxa

Jun 6, 2015, 9:37:23 AM
to std-dis...@isocpp.org, pot...@mac.com

On Friday, June 5, 2015 at 9:42:26 PM UTC+2, Nicol Bolas wrote:
This is not a defect; it's a proposal. So it should go into the forum for that, not here.



On Saturday, June 6, 2015 at 7:27:29 AM UTC+2, David Krauss wrote:
Signed UTF-8 characters are just as defective as C++03 non-const access to string literals, in my book.

The solution could be the same: allow a special implicit conversion for u8"" literals to char const*, with the actual type being unsigned.

IMHO it deserves a DR and a proposal. The proposal would count as “evolution,” though.

Good point; I added this to my post at std-proposals.