Should C++ contain a distinct type for UTF-8?
C++11 specifies:
+ char16_t* for UTF-16
+ char32_t* for UTF-32
+ char* for execution narrow-character set
+ wchar_t* for execution wide-character set
+ unsigned char*, possibly for raw data buffers etc.
a) Wouldn't it make sense to have a char8_t where char8_t arrays would
hold UTF-8 character sequences exclusively?
b) What is the rationale for not including it?
cheers,
Martin
On Thu, Nov 22, 2012 at 8:53 AM, Martin Ba <0xcdc...@gmx.at> wrote:
> Reference:
> https://groups.google.com/d/topic/comp.lang.c++.moderated/4CBsrFuMFBc/discussion
>
> Should C++ contain a distinct type for UTF-8?
Not until someone writes a convincing proposal and submits it to the
committee. I can't recall even an unconvincing proposal being
submitted.
>
> C++11 specifies:
> + char16_t* for UTF-16
> + char32_t* for UTF-32
> + char* for execution narrow-character set
> + wchar_t* for execution wide-character set
> + unsigned char*, possibly for raw data buffers etc.
>
> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
> hold UTF-8 character sequences exclusively?
typedef unsigned char char8_t;
typedef std::basic_string<unsigned char> u8string; // relies on the implementation providing std::char_traits<unsigned char>
Works reasonably well if that's all you want to do. But a separate
type without a way to interoperate with all the interfaces that
traffic in [const] char* and std::string is just an illusion of a
solution, so it isn't worth doing, IMO.
> b) What is the rationale for not including it?
I can only speak for myself, but I'd rather figure out how to mandate
the encoding of char* and std::string be UTF-8 in translation units
that prefer UTF-8, but do so in a way that preserves the existing
C/C++ codebase that assumes encoding based on locale as currently
specified.
const char *str = u8"This is a UTF-8 string.";
The main difficulty is in segments of code which need to treat each encoding separately. As it is not possible to treat UTF-8 and narrow-encoded strings interchangeably in portable code, having one single type for them both is, in my eyes, quite defective.

After all, I hate to be the one to point this out, but there's hardly a bunch of legacy code using UTF-8 literals right now. Visual Studio doesn't even support them, and there's little reason to use them on GCC or Clang, since their narrow encoding is UTF-8 anyway if I recall correctly, and you can't use a UTF-8 literal to get different behaviour from a portable library, because u8 and narrow literals have the same type (not to mention that most major libraries, including Boost, have very limited C++11 support right now). Thus I have to suspect that there is simply no reason to use them right now on the major compilers which support them, and by inference, I'm not really swayed by the backwards-compatibility argument.
A DR for them would not be especially complex. It would simply involve changing the text relating to them to specify char8_t instead of char, and if you're desperate to support the code that was written in the interim, then you could also specify an additional conversion, but I think this would be a bad idea.

Fundamentally, the value of the literals is that when you take them, you know what encoding they're in. Any system which breaks this property renders them irrelevant, because then you can't deal with any non-basic character set values, because you can't know the encoding. Strings in different encodings are different types and must be treated as such, as they are not remotely interchangeable. You could not write any portable code which deals with const char*, because it could be UTF-8 but it could also be narrow encoding.

Anyway, I think that the simplest thing to do would be to hotfix them with a DR as soon as possible, because the longer we wait, the greater the probability of more legacy code being broken, and the greater the probability that it simply won't get fixed, which in my opinion is a larger problem than breaking the existing code which uses them.
It certainly couldn't be magically back-ported into C++11, since that standard has already shipped.
C++14 will hit in, well, 2014. If it has char8_t in it, we won't be seeing compilers that support it until maybe 2015. It won't be widespread until 2016.
Re ISO definition of “defect”: A standard has a defect if and only if something is underspecified (not enough detail to implement it correctly) or contains a contradiction (so that there is no way to implement the feature at all and satisfy all requirements; e.g., page N says X must do A, but page M says X must do B != A).
People colloquially talk about a “defect” as something they think shouldn’t have been designed that way, but that’s not the definition that applies here.
Herb
> The main difficulty is in segments of code which need to treat each encoding separately. As it is not possible to treat UTF-8 and narrow-encoded strings interchangeably in portable code, having one single type for them both is, in my eyes, quite defective.
>
> After all, I ... there's hardly a bunch of legacy code using UTF-8 literals right now. Visual Studio doesn't even support them, and there's little reason to use them on GCC or Clang since they are UTF-8 narrow encoding anyway if I recall correctly, and you can't use a UTF-8 literal to get different behaviour from a portable library, because they're the same type ... Thus I'd have to suspect that there is simply no reason to use them right now on the major compilers which support them, and by inference, I'm not really swayed by the backwards-compatibility argument.
>
> ... Fundamentally, the value of the literals is that when you take them, you know what encoding they're in. Any system which breaks this property renders them irrelevant, because now you can't deal with any non-basic character set values, because you can't know the encoding. Strings in different encodings are different types and must be treated as such, as they are not remotely interchangeable. You could not write any portable code which deals with const char*, because it could be UTF-8 but it could also be narrow encoding.
>
> I can only speak for myself, but I'd rather figure out how to mandate
> the encoding of char* and std::string be UTF-8 in translation units
> that prefer UTF-8, but do so in a way that preserves the existing
> C/C++ codebase that assumes encoding based on locale as currently
> specified.
struct utf8_t {char c;};