Should C++ contain a distinct type, possibly char8_t or somesuch, for UTF-8?


Martin Ba

Nov 22, 2012, 8:53:21 AM11/22/12
to std-dis...@isocpp.org
Reference: https://groups.google.com/d/topic/comp.lang.c++.moderated/4CBsrFuMFBc/discussion

Should C++ contain a distinct type for UTF-8?

C++11 specifies:
+ char16_t* for UTF-16
+ char32_t* for UTF-32
+ char* for execution narrow-character set
+ wchar_t* for execution wide-character set
+ unsigned char*, possibly for raw data buffers etc.

a) Wouldn't it make sense to have a char8_t where char8_t arrays would
hold UTF-8 character sequences exclusively?

b) What is the rationale for not including it?

cheers,
Martin


Beman Dawes

Nov 22, 2012, 1:59:49 PM11/22/12
to std-dis...@isocpp.org
On Thu, Nov 22, 2012 at 8:53 AM, Martin Ba <0xcdc...@gmx.at> wrote:
> Reference:
> https://groups.google.com/d/topic/comp.lang.c++.moderated/4CBsrFuMFBc/discussion
>
> Should C++ contain a distinct type for UTF-8?

Not until someone writes a convincing proposal and submits it to the
committee. I can't recall even an unconvincing proposal being
submitted.

>
> C++11 specifies:
> + char16_t* for UTF-16
> + char32_t* for UTF-32
> + char* for execution narrow-character set
> + wchar_t* for execution wide-character set
> + unsigned char*, possibly for raw data buffers etc.
>
> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
> hold UTF-8 character sequences exclusively?

typedef unsigned char char8_t;
typedef std::basic_string<unsigned char> u8string;

Works reasonably well if that's all you want to do. But a separate
type without a way to interoperate with all the interfaces that
traffic in [const] char* and std::string is just an illusion of a
solution, so it isn't worth doing, IMO.

> b) What is the rationale for not including it?

I can only speak for myself, but I'd rather figure out how to mandate
the encoding of char* and std::string be UTF-8 in translation units
that prefer UTF-8, but do so in a way that preserves the existing
C/C++ codebase that assumes encoding based on locale as currently
specified.

--Beman

Martin Ba

Nov 22, 2012, 2:34:30 PM11/22/12
to std-dis...@isocpp.org, bda...@acm.org
On Thursday, November 22, 2012 7:59:52 PM UTC+1, Beman Dawes wrote:
On Thu, Nov 22, 2012 at 8:53 AM, Martin Ba <0xcdc...@gmx.at> wrote:
> Reference:
> https://groups.google.com/d/topic/comp.lang.c++.moderated/4CBsrFuMFBc/discussion
>
> Should C++ contain a distinct type for UTF-8?

Not until someone writes a convincing proposal and submits it to the
committee. I can't recall even an unconvincing proposal being
submitted.


Which I kind of found/find weird, given that "someone" found it worthwhile to propose char16_t and char32_t *and* propose character literals that map to these types.
 
>
> C++11 specifies:
> + char16_t* for UTF-16
> + char32_t* for UTF-32
> + char* for execution narrow-character set
> + wchar_t* for execution wide-character set
> + unsigned char*, possibly for raw data buffers etc.
>
> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
> hold UTF-8 character sequences exclusively?

typedef unsigned char char8_t;
typedef std::basic_string<unsigned char> u8string;

Works reasonably well if that's all you want to do. But a separate
type without a way to interoperate with all the interfaces that
traffic in [const] char* and std::string is just an  illusion of a
solution, so it isn't worth doing, IMO.


Well, you have to start somewhere, don't you?

> b) What is the rationale for not including it?

I can only speak for myself, but I'd rather figure out how to mandate
the encoding of char* and std::string be UTF-8 in translation units
that prefer UTF-8, but do so in a way that preserves the existing
C/C++ codebase that assumes encoding based on locale as currently
specified.


Could you elaborate? I ask this because in your(?) proposal from two months ago --
http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3398.html#comp-UTF-8
-- you seem to suggest that a char8_t with UTF-8 would be desirable, even though "... It needs much further study and discussion before moving forward. ...".

cheers,
Martin

Nicol Bolas

Nov 22, 2012, 2:48:32 PM11/22/12
to std-dis...@isocpp.org, bda...@acm.org

To me, the main difficulty is that C++11 allows this to compile:

const char *str = u8"This is a UTF-8 string.";

This is pretty much how UTF-8 will be used in C++ for string literals. If we introduce a `char8_t` that acts like `char16_t` and `char32_t` (i.e., no implicit conversions), this will fail to compile.

By not introducing a proper `char8_t` type in C++11, we're kinda in a bad place. The only reasonable thing to do, from a backwards-compatibility perspective, is to introduce `char8_t`, but allow it to be implicitly converted to `char`. But not the other way around.

Tony V E

Nov 22, 2012, 3:30:16 PM11/22/12
to std-dis...@isocpp.org
On Thu, Nov 22, 2012 at 1:59 PM, Beman Dawes <bda...@acm.org> wrote:
>
> I can only speak for myself, but I'd rather figure out how to mandate
> the encoding of char* and std::string be UTF-8 in translation units
> that prefer UTF-8, but do so in a way that preserves the existing
> C/C++ codebase that assumes encoding based on locale as currently
> specified.
>
> --Beman
>

Mandate that the default locale is UTF-8?
(I assume this won't work, but I ask nonetheless)

Tony

wolfei...@gmail.com

Nov 22, 2012, 5:00:15 PM11/22/12
to std-dis...@isocpp.org
The main difficulty is in segments of code which need to treat each encoding separately. As it is not possible to treat UTF-8 and narrow-encoded strings interchangeably in portable code, having one single type for them both is, in my eyes, quite defective.

After all, I hate to be the one to point this out, but there's hardly any legacy code using UTF-8 literals right now. Visual Studio doesn't even support them, and there's little reason to use them on GCC or Clang since, if I recall correctly, their narrow encoding is UTF-8 anyway, and you can't use a UTF-8 literal to get different behaviour from a portable library, because it's the same type (not to mention that most major libraries, including Boost, have very limited C++11 support right now). Thus I'd have to suspect that there is simply no reason to use them right now on the major compilers which support them, and by inference, I'm not really swayed by the backwards-compatibility argument.

A DR for them would not be especially complex. It would simply involve changing the text relating to them to specify char8_t instead of char, and if you're desperate to support the code that was written in the interim, then you could also specify an additional conversion if you want, but I think this would be a bad idea. Fundamentally, the value of the literals is that when you take them, you know what encoding they're in. Any system which breaks this property renders them irrelevant, because now you can't deal with any non-basic character set values, because you can't know the encoding. Strings in different encodings are different types and must be treated as such, as they are not remotely interchangeable. You could not write any portable code which deals with const char*, because it could be UTF-8 but it could also be narrow encoding.

Anyway, I think that the simplest thing to do would be to hotfix them with a DR as soon as possible, because the longer we wait, the more legacy code there will be to break, and the greater the probability that it simply won't get fixed, which in my opinion is a larger problem than the existing code which uses them.

Nicol Bolas

Nov 22, 2012, 5:34:00 PM11/22/12
to std-dis...@isocpp.org, wolfei...@gmail.com


On Thursday, November 22, 2012 2:00:16 PM UTC-8, wolfei...@gmail.com wrote:
The main difficulty is in segments of code which need to treat each encoding separately. As it is not possible to treat UTF-8 and narrow-encoded strings interchangeably in portable code, having one single type for them both is, in my eyes, quite defective.

After all, I hate to be the one to point this out, but there's hardly any legacy code using UTF-8 literals right now. Visual Studio doesn't even support them, and there's little reason to use them on GCC or Clang since, if I recall correctly, their narrow encoding is UTF-8 anyway, and you can't use a UTF-8 literal to get different behaviour from a portable library, because it's the same type (not to mention that most major libraries, including Boost, have very limited C++11 support right now). Thus I'd have to suspect that there is simply no reason to use them right now on the major compilers which support them, and by inference, I'm not really swayed by the backwards-compatibility argument.

C++14 will hit in, well, 2014. If it has char8_t in it, we won't be seeing compilers that support it until maybe 2015. It won't be widespread until 2016.

That's 2+ years of code that hasn't been written yet, but will be written by then.

A DR for them would not be especially complex. It would simply involve changing the text relating to them to specify char8_t instead of char, and if you're desperate to support the code that was written in the interim, then you could also specify an additional conversion if you want, but I think this would be a bad idea. Fundamentally, the value of the literals is that when you take them, you know what encoding they're in. Any system which breaks this property renders them irrelevant, because now you can't deal with any non-basic character set values, because you can't know the encoding. Strings in different encodings are different types and must be treated as such, as they are not remotely interchangeable. You could not write any portable code which deals with const char*, because it could be UTF-8 but it could also be narrow encoding.

Anyway, I think that the simplest thing to do would be to hotfix them with a DR as soon as possible, because the longer we wait, the more legacy code there will be to break, and the greater the probability that it simply won't get fixed, which in my opinion is a larger problem than the existing code which uses them.

I don't think you can defect report an entire feature into the language, including adding a new type that has new behavior and so forth. It certainly couldn't be magically back-ported into C++11, since that standard has already shipped. This is not a wording change or an invisible implementation detail. This is changing the language.

wolfei...@gmail.com

Nov 22, 2012, 7:42:34 PM11/22/12
to std-dis...@isocpp.org, wolfei...@gmail.com
I guess that depends on what you consider to be suitable for a DR. I mean, I'm no expert on the ISO Standardisation process, and maybe there is a formal rule for how widely-scoped a DR can be, but if the current behaviour is defective and fixing it would remedy that defect, then intuitively a defect report is the way to go. Of course, you can have your own opinion about how defective it is, but assuming that you agree, then a DR seems correct.

It certainly couldn't be magically back-ported into C++11, since that standard has already shipped.

Again, I'm no expert, but I was informed that this is actually not the case- various DRs technically become part of the previous Standard. Specifically, I was told this already happened w.r.t. eternally lasting constexpr temporaries. In any case, it's fairly immaterial.

 C++14 will hit in, well, 2014. If it has char8_t in it, we won't be seeing compilers that support it until maybe 2015. It won't be wide-spread until 2016.

I disagree. Consider how widespread rvalue references were before C++11 hit. GCC 4.3 shipped them on March 5, 2008, and even Visual Studio was implementing them over a year before C++11 shipped, and they were vastly more complex. If you author an existing major compiler and the change is filed as a DR or voted into the Standard as a regular proposal, it would hardly be a large effort to conform early, and it's rather unlikely that the effort would be wasted; not to mention that an early-conforming implementation would save its users years of legacy code.

But secondly, as I mentioned, there's actually no reason for users of many of those compilers to write this code in the first place- and that would definitely become an unlikely prospect if the Committee votes in a fix in Bristol. Not only would they gain no observable benefit, but they'd know in advance that the Committee would change it.

Herb Sutter

Nov 22, 2012, 7:57:40 PM11/22/12
to std-dis...@isocpp.org

Re ISO definition of “defect”:  A standard has a defect if and only if something is underspecified (not enough detail to implement it correctly) or contains a contradiction (so that there is no way to implement the feature at all and satisfy all requirements; e.g., page N says X must do A, but page M says X must do B != A).

 

People colloquially talk about a “defect” as something they think shouldn’t have been designed that way, but that’s not the definition that applies here.

 

Herb

wolfei...@gmail.com

Nov 22, 2012, 8:28:12 PM11/22/12
to std-dis...@isocpp.org, hsu...@microsoft.com
Alright, so it would have to be a real proposal. That's fine by me.

Nicol Bolas

Nov 22, 2012, 11:52:10 PM11/22/12
to std-dis...@isocpp.org, hsu...@microsoft.com


On Thursday, November 22, 2012 4:58:42 PM UTC-8, Herb Sutter wrote:

Re ISO definition of “defect”:  A standard has a defect if and only if something is underspecified (not enough detail to implement it correctly) or contains a contradiction (so that there is no way to implement the feature at all and satisfy all requirements; e.g., page N says X must do A, but page M says X must do B != A).

 

People colloquially talk about a “defect” as something they think shouldn’t have been designed that way, but that’s not the definition that applies here.

 

Herb


Thank you for the clarification. It might be a good idea to update the isocpp site's page on submitting a defect report to explain exactly what is considered a defect.

Martin Ba

Nov 23, 2012, 4:18:08 AM11/23/12
to std-dis...@isocpp.org, wolfei...@gmail.com
On Thursday, November 22, 2012 11:00:16 PM UTC+1, wolfei...@gmail.com wrote:
The main difficulty is in segments of code which need to treat each encoding separately. As it is not possible to treat UTF-8 and narrow-encoded strings interchangeably in portable code, having one single type for them both is, in my eyes, quite defective.

After all, I ... there's hardly a bunch of legacy code using UTF-8 literals right now. Visual Studio doesn't even support them, and there's little reason to use them on GCC or Clang since they are UTF-8 narrow encoding anyway if I recall correctly, and you can't use a UTF-8 literal to get different behaviour from a portable library, because they're the same type ... Thus I'd have to suspect that there is simply no reason to use them right now on the major compilers which support them, and by inference, I'm not really swayed by the backwards-compatibility argument.

... Fundamentally, the value of the literals is that when you take them, you know what encoding they're in. Any system which breaks this property renders them irrelevant, because now you can't deal with any non-basic character set values, because you can't know the encoding. Strings in different encodings are different types and must be treated as such, as they are not remotely interchangeable. You could not write any portable code which deals with const char*, because it could be UTF-8 but it could also be narrow encoding.


I guess I cannot add much to this.

I might add one reply from back then:

On Wednesday, August 25, 2010 1:55:17 AM UTC+2, Seungbeom Kim wrote:
> On 2010-08-22 13:15, Martin B. wrote:
> >
> > Should C++0x contain a distinct type for UTF-8?
> >
> > Current draft N3092 specifies:

> > + char16_t* for UTF-16
> > + char32_t* for UTF-32
> > + char* for execution narrow-character set
> > + wchar_t* for execution wide-character set
> > + unsigned char*, possibly for raw data buffers etc.
> >
> > a) Wouldn't it make sense to have a char8_t where char8_t arrays would
> > hold UTF-8 character sequences exclusively?
>
> I guess so, just as char16_t and char32_t do for UTF-16 and UTF-32.
>
> At least, char8_t could be made an unsigned integer type! (That is,
> a distinct type with the same representation as uint_least8_t.)
> Having to cast to unsigned char for any serious byte handling remains
> one of my biggest pet peeves.

>
> > b) What is the rationale for not including it?
>
> Probably because that's what the C committee did[N1040], I guess.
> C has had a tendency to introduce new character types via typedefs,
> such as wchar_t, char16_t, and char32_t (hence the suffix "_t"),
> which works well for C because it doesn't have overloading anyway.
> And char16_t and char32_t were meant primarily to provide clearly
> defined widths for the types and to allow string literals thereof,
> none of which a separate char8_t was necessary for.
> [N1040] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf
>
> Things are different in C++: it introduces new character types as
> distinct types, and it supports overloading. So I believe C++ could
> benefit from a separate char8_t type. However, it doesn't seem to
> have been done and I do not know whether introduction of char8_t
> has ever been discussed in one of the technical papers, or WG14's
> N1040 was adopted with just as much "translation" as necessary.
>
> --
> Seungbeom Kim

It would be interesting to hear whether the missing char8_t really does trace back to the C standard.

ri...@longbowgames.com

Nov 24, 2012, 8:37:46 AM11/24/12
to std-dis...@isocpp.org, bda...@acm.org
On Thursday, November 22, 2012 1:59:52 PM UTC-5, Beman Dawes wrote:
I can only speak for myself, but I'd rather figure out how to mandate
the encoding of char* and std::string be UTF-8 in translation units
that prefer UTF-8, but do so in a way that preserves the existing
C/C++ codebase that assumes encoding based on locale as currently
specified.

You mentioned this in the other thread. Are you implying that you'll be presenting ideas on how to do this in a future paper? I ask because I'd be interested to hear you expand upon the idea. If it weren't for legacy code, C++ would certainly be a cleaner language if you could assume that char*s were UTF-8 encoded.

wolfei...@gmail.com

Nov 24, 2012, 8:48:58 AM11/24/12
to std-dis...@isocpp.org, bda...@acm.org, ri...@longbowgames.com
I don't see how that could possibly work, because you can't vary your code depending on what translation unit called it. If you have a public API then you can't know what translation units call it- if they even exist at the time. You would have to assume narrow encoding- exactly the same problem as we have now.

UTF-8 isn't interoperable with narrow encodings, so there's no way to make this work. They are separate things and need separate types.

ben....@gmail.com

Nov 24, 2012, 6:48:32 PM11/24/12
to std-dis...@isocpp.org, bda...@acm.org, ri...@longbowgames.com, wolfei...@gmail.com
I would love to see better typing amongst character encodings.  I have strongly considered hacking it myself in my own codebases by doing something like the following:
struct utf8_t {char c;};

Unfortunately, you need to do lots of reinterpret_cast's with this approach, anytime you send it through an API that isn't under your control.

If we go down this path, it would be useful to have a type for byte buffers that is distinct from chars. 

Beman Dawes

Nov 24, 2012, 9:52:09 PM11/24/12
to std-dis...@isocpp.org
I'm leery of solutions suggested when we don't have a good handle on
the problems to be solved. Any real design is going to have to start by
doing some homework. Identify all the places in the core language and
standard library with behavior dependent on the encoding of narrow
character strings, for example. Find out more about real-world use
cases of non-UTF-8 narrow character encodings, particularly in Asia.
Find out how real-world needs interact with current features. Basic
analysis.

--Beman

scott....@gmail.com

Feb 2, 2017, 9:04:15 PM2/2/17
to ISO C++ Standard - Discussion, hsu...@microsoft.com
Why not just scrap C and start over.  We all know that the language is garbage.

Look at you Morons arguing over the problems with creating a new variable type.

Loathsome.. Loathsome... Loathsome...