
C++0x and Unicode Literals


Scott Meyers
Oct 20, 2009, 1:58:02 PM
My understanding is that UTF-8, UTF-16, and UTF-32 are all specific
encodings. In C++0x, the u8 prefix indicates a string literal using
UTF-8. 2.14.5/6 is unusually clear:

> A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the
> given characters as encoded in UTF-8.
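
For concreteness, here is a minimal sketch of what that guarantee means
in practice (assuming a compiler that implements the C++0x u8 literals;
the byte values are simply the standard UTF-8 encoding of U+00E9):

#include <cassert>
#include <cstring>

int main() {
    // u8"\u00e9" names LATIN SMALL LETTER E WITH ACUTE (U+00E9); UTF-8
    // encodes it as the two bytes 0xC3 0xA9, regardless of the
    // execution character set the compiler otherwise uses.
    const char* s = u8"\u00e9";
    assert(std::strlen(s) == 2);
    assert(static_cast<unsigned char>(s[0]) == 0xC3);
    assert(static_cast<unsigned char>(s[1]) == 0xA9);
}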

The situation for literals prefixed with u and U (i.e., literals
consisting of characters of type char16_t and char32_t, respectively)
is less clear. Here's 2.14.5/8&9:

> A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal
> has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage
> duration and is initialized with the given characters. A single c-char may produce more than one char16_t
> character in the form of surrogate pairs.

> A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal
> has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage
> duration and is initialized with the given characters.
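
A quick compilable sketch of those types (assuming a conforming C++0x
compiler; decltype of a string literal yields a reference to the array):

#include <type_traits>

// u"asdf" has type "array of 5 const char16_t" (four characters plus
// the terminator); U"asdf" has type "array of 5 const char32_t".
static_assert(std::is_same<decltype(u"asdf"), const char16_t(&)[5]>::value,
              "u literal: array of const char16_t");
static_assert(std::is_same<decltype(U"asdf"), const char32_t(&)[5]>::value,
              "U literal: array of const char32_t");

int main() {}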

Notice how no mention is made of the encodings used. Presumably, there
is no guarantee that u and U string literals are encoded using UTF-16
and UTF-32.

But perhaps there is something about the encoding used for the
character types themselves. Character literals may also be prefixed
with u and U, and 2.14.3/2 of the C++0x draft has this to say:

> A character literal that begins with the letter u, such as u'y', is a character literal of type char16_t. The
> value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that
> the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual
> plane code point.) If the value is not representable within 16 bits, the program is ill-formed. A char16_t
> literal containing multiple c-chars is ill-formed. A character literal that begins with the letter U, such as
> U'z', is a character literal of type char32_t. The value of a char32_t literal containing a single c-char is
> equal to its ISO 10646 code point value. A char32_t literal containing multiple c-chars is ill-formed.
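
A sketch of what that wording pins down about the values (assuming a
conforming compiler; the numbers are just the Unicode code points of
the characters named in the comments):

// The value of a u character literal is its ISO 10646 code point,
// provided it fits in 16 bits; a U literal carries the full code point.
static_assert(u'y' == 0x0079, "LATIN SMALL LETTER Y");
static_assert(U'\U00010400' == 0x10400, "a code point outside the BMP");
// u'\U00010400' would be ill-formed: not representable in 16 bits.

int main() {}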

Again there is no mention of UTF-16 or UTF-32, so I assume that there
is no guarantee that these encodings are used. Interestingly, there is
no u8 prefix for character literals, so there does not seem to be a way
to force a character literal to be encoded as UTF-8, even though there
is a way to force a string literal to be UTF-8 encoded.

I'm assuming that things are specified the way they are (i.e., no mandatory
encoding for character or string literals except for u8-prefixed string
literals) for a reason. Can somebody please explain to me what that
reason is?

As a simple start, why is there no way to specify UTF-8 encoded character
literals, nor a way to specify UTF-16 and UTF-32 encoded string literals?

Thanks,

Scott

--
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std...@netlab.cs.rpi.edu]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]

Scott Meyers
Oct 20, 2009, 5:23:58 PM
Apologies for following up to my own post, but further thought and
reading suggest I may be able to answer part of my own question.

Scott Meyers wrote:

> themselves. Character literals may also be prefixed with u and U, and
> 2.14.3/2 of the C++0x draft has this to say:
>
> > A character literal that begins with the letter u, such as u'y', is
> > a character literal of type char16_t. The value of a char16_t
> > literal containing a single c-char is equal to its ISO 10646 code
> > point value, provided that the code point is representable with a
> > single 16-bit code unit. (That is, provided it is a basic
> > multi-lingual plane code point.) If the value is not representable
> > within 16 bits, the program is ill-formed. A char16_t literal
> > containing multiple c-chars is ill-formed. A character literal that
> > begins with the letter U, such as U'z', is a character literal of
> > type char32_t. The value of a char32_t literal containing a single
> > c-char is equal to its ISO 10646 code point value. A char32_t
> > literal containing multiple c-chars is ill-formed.


I now think that the above is trying to say that char16_t literals are
UCS-2 encoded and char32_t literals are UCS-4 encoded. Is that
correct?

If so, it still leaves a couple of open questions:
- Why no UTF-16 or UTF-32 string or character literals?
- Why no UTF-8 character literals?

Michael Karcher
Oct 21, 2009, 1:06:46 AM
Scott Meyers <use...@aristeia.com> wrote:
> My understanding is that UTF-8, UTF-16, and UTF-32 are all specific
> encodings.
Not really. UTF-8, UTF-16 and UTF-32 are three different schemes to
represent the same encoding, namely Unicode (that's ISO 10646).
UTF-8 represents one Unicode character (a codepoint number between
0 and 0x10FFFF, written as U+0000 to U+10FFFF by convention) as
a series of one to four values of eight bits each; UTF-16 represents
one Unicode character as a series of one or two 16-bit values.
UTF-32 is just an identity mapping between the Unicode codepoint number
and 32-bit numbers.
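
As a concrete illustration (the code unit values below are simply the
standard Unicode encoding forms of two example characters, not the
output of any particular compiler):

// U+20AC EURO SIGN (in the BMP):
//   UTF-8:  0xE2 0x82 0xAC       (three 8-bit code units)
//   UTF-16: 0x20AC               (one 16-bit code unit)
//   UTF-32: 0x000020AC           (identity mapping)
//
// U+10400 DESERET CAPITAL LETTER LONG I (outside the BMP):
//   UTF-8:  0xF0 0x90 0x90 0x80  (four 8-bit code units)
//   UTF-16: 0xD801 0xDC00        (a surrogate pair)
//   UTF-32: 0x00010400           (identity mapping)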

> > A string literal that begins with u, such as u"asdf", is a char16_t
> > string literal. A char16_t string literal has type “array of n const
> > char16_t”, where n is the size of the string as defined below; it has
> > static storage duration and is initialized with the given characters. A
> > single c-char may produce more than one char16_t character in the form
> > of surrogate pairs.

> Notice how no mention is made of the encodings used. Presumably, there is
> no guarantee that u and U string literals are encoded using UTF-16 and
> UTF-32.

It looks like UTF-16 is intended by the quoted paragraph, as the term
"surrogate pair" is Unicode standardese for a character that is
represented by two 16-bit code units in UTF-16.

> > A character literal that begins with the letter u, such as u'y', is a
> > character literal of type char16_t. The value of a char16_t literal
> > containing a single c-char is equal to its ISO 10646 code point value,
> > provided that the code point is representable with a single 16-bit code
> > unit. (That is, provided it is a basic multi-lingual plane code point.)
> > If the value is not representable within 16 bits, the program is
> > ill-formed. A char16_t literal containing multiple c-chars is
> > ill-formed.

> > A character literal that begins with the letter U, such as U'z', is a
> > character literal of type char32_t. The value of a char32_t literal
> > containing a single c-char is equal to its ISO 10646 code point value. A
> > char32_t literal containing multiple c-chars is ill-formed.
> Again there is no mention of UTF-16 or UTF-32, so I assume that there is no
> guarantee that these encodings are used.

As a char16_t can only store one 16-bit value, you cannot store every
UTF-16 representation of a Unicode-encoded character in a char16_t, but
only those Unicode characters whose representation in UTF-16 occupies just
one 16-bit value. The set of these characters is called the "basic
multilingual plane" (BMP), and the UTF-16 representation of a BMP character
is just the number of the Unicode codepoint. So the Standard paragraph you
quoted effectively describes that all characters whose UTF-16 encoding fits
into a char16_t are in fact stored in their UTF-16 encoding.
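
That conclusion can be written down as a compile-time check (a sketch,
assuming a conforming compiler; 0x20AC is the code point of the EURO
SIGN, which lies in the BMP):

// For a BMP character, the single UTF-16 code unit is numerically equal
// to the code point, and that is exactly the value the char16_t literal
// is required to have.
static_assert(u'\u20AC' == 0x20AC, "BMP code point stored directly");

int main() {}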

For char32_t the standard specifies that the value is the Unicode code point
value. That's exactly the definition of UTF-32. So the UTF-16 and UTF-32
encodings for character literals are required by the standard.

> Interestingly, there is no u8 prefix for character literals, so there does
> not seem to be a way to force a character literal to be encoded as UTF-8,
> even though there is a way to force a string literal to be UTF-8 encoded.

The only Unicode characters whose UTF-8 encoding fits in a single char
variable are those of the 7-bit ASCII charset. So a u8 prefix would in
fact not be a UTF-8 prefix but just an ASCII prefix, but that of course
doesn't make your request invalid. On an EBCDIC machine, the execution
character set is not ASCII, so '1' would not have the value 0x31, but 0xF1.
u8"1" would be {0x31, 0x00}, though, while "1" is {0xF1, 0x00}. The only way
to force ASCII encoding for a character is the cumbersome expression
u8"1"[0], and yet this expression does not do the same thing as a
hypothetical u8'1' or a'1' ("a" for ASCII) would do, as it lacks a check
that the character is indeed in the 7-bit ASCII charset.
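
A sketch of that workaround (the 0xF1 value is, of course, only what an
EBCDIC implementation would produce; on an ASCII-based system both
initializers yield 0x31):

// '1' has its execution-character-set value: 0xF1 on EBCDIC, 0x31 on
// ASCII-based systems. u8"1"[0] is always the UTF-8 (here: ASCII) value.
const char native_one = '1';
const char utf8_one   = u8"1"[0];
// Note that u8"1"[0] silently yields only the first byte of the UTF-8
// sequence; unlike a hypothetical u8'1', it performs no check that the
// character occupies a single UTF-8 code unit.

int main() {}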

> I'm assuming that things are specified the way they are (i.e., no mandatory
> encoding for character or string literals except for u8-prefixed string
> literals) for a reason.

For narrow characters it is for the same reason as in C++98: to allow
machines with other native character sets. The same applies to wchar_t. The
new types char16_t and char32_t are meant exclusively to store the 16-bit
code units of a UTF-16-represented Unicode string and the 32-bit code units
of a UTF-32-represented Unicode string, respectively. The definition of the
codecvt facets in 22.4.1.4/3 explicitly mentions the UTF-16 and UTF-32
encodings. It's interesting that the codecvt facets involving char16_t and
char32_t interpret char values as UTF-8, whereas the wchar_t codecvt facets
interpret char values as codepoints in the execution character set. There
seems to be no standard way to convert between them.
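
For what it's worth, here is a sketch of how those facets are meant to
be used, via the wstring_convert helper that accompanies them in the
draft (assuming the <codecvt> facets survive into the final standard
unchanged):

#include <codecvt>
#include <locale>
#include <string>

int main() {
    // codecvt_utf8_utf16<char16_t> converts between UTF-16 code units
    // (char16_t) and UTF-8 bytes (char).
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::string    utf8  = conv.to_bytes(u"\u20AC");  // "\xE2\x82\xAC"
    std::u16string utf16 = conv.from_bytes(utf8);     // back to one code unit
    return utf16.size() == 1 ? 0 : 1;
}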

As character arrays can be either encoded in the native charset or in
UTF-8, a syntax is needed to specify which semantic interpretation of a
narrow string literal should be applied. The values of the characters of a
u8 string must (of course) not be interpreted in the execution character
set! Note that 1.7 explicitly states that a "byte" is large enough to
contain characters of the basic character set *and* UTF-8 codes, as these
may really be two completely different semantic interpretations of the same
byte value.

> As a simple start, why is there no way to specify UTF-8 encoded character
> literals,

Good question, indeed. See above.

> nor a way to specify UTF-16 and UTF-32 encoded string literals?

There is. The u and U prefixes are meant for exactly that and, as argued
above, they effectively mandate UTF-16 and UTF-32.

Regards,
Michael Karcher

Michael Karcher
Oct 21, 2009, 1:07:40 AM
Scott Meyers <use...@aristeia.com> wrote:
> themselves. Character literals may also be prefixed with u and U,
> and 2.14.3/2 of the C++0x draft has this to say:
>
[...]

>
> I now think that the above is trying to say that char16_t literals are
> UCS-2 encoded and char32_t literals are UCS-4 encoded. Is that
> correct?
Yes.

> If so, it still leaves a couple of open questions:
> - Why no UTF-16 or UTF-32 string or character literals?

You can't put two UTF-16 units representing one character into a char16_t.
So a UTF-16 character literal would make no sense unless it were equivalent
to a UCS-2 character literal.

UCS-4 and UTF-32 are equivalent, so there are UTF-32 character literals.

The string literals are (although I can't find a stringent point in the
standard) also intended to be UTF-16 and UTF-32. As opposed to the character
literals, within strings it is possible to represent characters needing more
than one UTF-16 code unit (in that case, it's a so-called surrogate pair, and
that term is explicitly mentioned in the char16_t string literal description).
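
A compile-time sketch of that difference (assuming a conforming
compiler; U+10400 is an arbitrary example of a character outside the
BMP):

// U+10400 needs a surrogate pair in UTF-16: a u string literal can hold
// it (two code units plus the terminating NUL), while u'\U00010400'
// would be ill-formed because the value does not fit in 16 bits.
static_assert(sizeof(u"\U00010400") == 3 * sizeof(char16_t),
              "surrogate pair plus terminator");

int main() {}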

> - Why no UTF-8 character literals?

No answer to this, although they would be restricted to plain old ASCII.

Regards,
Michael Karcher

Florian Weimer
Oct 21, 2009, 9:59:57 PM
* Scott Meyers:

> If so, it still leaves a couple of open questions:
> - Why no UTF-16 or UTF-32 string or character literals?

What do you do if your input file is encoded in UCS-2 and contains
malformed UTF-16?

Presumably, some people care as strongly about UCS-2 input files as
about trigraphs. 8-)

James Kanze
Oct 23, 2009, 12:34:15 PM
On Oct 21, 6:06 am, use...@mkarcher.dialup.fu-berlin.de (Michael
Karcher) wrote:
> Scott Meyers <use...@aristeia.com> wrote:

[...]


> > Interestingly, there is no u8 prefix for character literals,
> > so there does not seem to be a way to force a character
> > literal to be encoded as UTF-8, even though there is a way
> > to force a string literal to be UTF-8 encoded.

> The only Unicode characters whose UTF-8 encoding fits in a single
> char variable are those of the 7-bit ASCII charset. So a u8 prefix
> would in fact not be a UTF-8 prefix but just an ASCII prefix, but
> that of course doesn't make your request invalid. On an EBCDIC
> machine, the execution character set is not ASCII, so '1' would not
> have the value 0x31, but 0xF1. u8"1" would be {0x31, 0x00}, though,
> while "1" is {0xF1, 0x00}. The only way to force ASCII encoding for
> a character is the cumbersome expression u8"1"[0], and yet this
> expression does not do the same thing as a hypothetical u8'1' or
> a'1' ("a" for ASCII) would do, as it lacks a check that the
> character is indeed in the 7-bit ASCII charset.

Even without considering EBCDIC, ASCII characters like @ and $
are not in the basic character set (source or execution); if
your compiler supports '@', then it does so as a (legal)
extension; ditto '\u0040' (which is, officially, the only
portable way to write a character constant which corresponds to
@). In practice, of course, all compilers do have this
"extension", but it would be nice to either require it (as an
extension), or define a type of character constant which
requires it.
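
A small sketch of the two spellings (both depend on the implementation
making the character available, as described above; the
universal-character-name form at least does not depend on how the
source file happens to be encoded):

// '@' is U+0040, but it is not in the basic source or execution
// character set, so neither line below is strictly guaranteed to work.
const char at_from_ucn     = '\u0040';
const char at_from_literal = '@';

int main() {}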

Not to mention orthogonality: I'd prefer it if string literals
and character literals were basically identical, with the
restriction that a character literal could only use values which
occupy a single element.

--
James Kanze

Scott Meyers
Oct 23, 2009, 12:33:49 PM
Florian Weimer wrote:
> What do you do if your input file is encoded in UCS-2 and contains
> malformed UTF-16?

I don't understand the question, but that's probably because my
experience with Unicode is essentially zero. (Okay, it's completely
zero, but "essentially zero" sounds so much better...)

My understanding is that UCS-2 is a subset of UTF-16, i.e., that all
valid UCS-2-encoded strings are also valid UTF-16-encoded strings. If
that's true, then there is no such thing as a valid UCS-2-encoded file
that contains malformed UTF-16 data.

Can you explain how it would be possible to have the situation you describe?

Thanks,

Scott

Florian Weimer
Oct 24, 2009, 5:59:42 PM
* Scott Meyers:

> Florian Weimer wrote:
>> What do you do if your input file is encoded in UCS-2 and contains
>> malformed UTF-16?

> My understanding is that UCS-2 is a subset of UTF-16, i.e., that all valid
> UCS-2-encoded strings are also valid UTF-16-encoded strings.

A program which doesn't know anything about UTF-16 may produce output
which isn't valid UTF-16. This can happen quite easily on systems
which gained Unicode capabilities when there was only UCS-2, e.g.
Windows and Java. Such systems tend to treat the 16-bit entities that
make up Unicode strings as atomic entities, and may split strings
between surrogate pairs (which is what makes UTF-16 special, it's
essentially a variable-length 16-bit encoding). Whether the result is
still valid UCS-2 is debatable (my material on Unicode encodings was
revised after UTF-16 was introduced, so I lack a reliable
source---given the relatively recent redefinition of UTF-8 and its
slightly revisionist aspect, this isn't an idle concern).

But my question probably doesn't matter. The standard can just say a
program is ill-formed.

Greg Herlihy
Oct 24, 2009, 10:16:48 PM
On Oct 20, 5:23 pm, Scott Meyers <use...@aristeia.com> wrote:
> If so, it still leaves a couple of open questions:
> - Why no UTF-16 or UTF-32 string or character literals?
> - Why no UTF-8 character literals?

UTF-16 and UTF-32 string literals effectively already exist (in the
form "\uNNNN" and "\UNNNNNNNN" respectively), see §2.2/2.

The short answer to the question about UTF character literals is that
such character literals (if they were to exist) would not have a fixed
size. Instead a UTF-8 character literal would come in four different
sizes (ranging from one through four bytes) while a UTF-16 literal
would come in two different sizes (either two or four bytes). In
short, the distinction between a (fixed-sized) character literal and a
(variable-sized) string literal would no longer exist.
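
To put numbers on that, here are the array sizes (character plus
terminating NUL) of u8 string literals holding single characters of
each width; a sketch assuming a conforming compiler:

// A hypothetical u8 character literal would have to accommodate all
// four of these widths.
static_assert(sizeof(u8"y")          == 1 + 1, "U+0079: one UTF-8 byte");
static_assert(sizeof(u8"\u00E9")     == 2 + 1, "U+00E9: two UTF-8 bytes");
static_assert(sizeof(u8"\u20AC")     == 3 + 1, "U+20AC: three UTF-8 bytes");
static_assert(sizeof(u8"\U00010400") == 4 + 1, "U+10400: four UTF-8 bytes");

int main() {}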

Greg

James Kanze
Oct 25, 2009, 12:10:43 PM
On Oct 23, 5:33 pm, Scott Meyers <use...@aristeia.com> wrote:
> Florian Weimer wrote:
> > What do you do if your input file is encoded in UCS-2 and contains
> > malformed UTF-16?

> I don't understand the question, but that's probably because
> my experience with Unicode is essentially zero. (Okay, it's
> completely zero, but "essentially zero" sounds so much
> better...)

> My understanding is that UCS-2 is a subset of UTF-16, i.e.,
> that all valid UCS-2-encoded strings are also valid
> UTF-16-encoded strings. If that's true, then there is no such
> thing as a valid UCS-2-encoded file that contains malformed
> UTF-16 data.

Your understanding is correct, up to a point. Generally
speaking, UCS is fairly tolerant with regard to undefined code
points; ISO 10646 (the defining body for UCS), for example,
allows five- and six-byte forms in its version of UTF-8, whereas
the Unicode definition doesn't (and in fact, no allocated code
point requires more than four bytes). If the UCS-2 contains
undefined code points, this could result in illegal UTF-16.
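
For instance (a sketch; the value is written out numerically because a
universal-character-name in the surrogate range is not allowed in a
literal):

// A lone code unit in the surrogate range is unremarkable as a 16-bit
// value, and UCS-2-era data may well contain it, but by itself it is
// not valid UTF-16.
const char16_t lone_surrogate = 0xD800;

int main() {}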

> Can you explain how it would be possible to have the situation
> you describe?

In a perfect world where incorrect input couldn't exist, the
situation wouldn't be possible. The problems would occur when
treating incorrect input: undefined code points in UCS-2.

--
James Kanze
