"C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs"


Lynn McGuire

Sep 7, 2016, 3:08:01 PM
"C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs"
https://msdn.microsoft.com/magazine/mt763237

"Unicode is the de facto standard for representing international text in modern software. According to the official Unicode
consortium’s Web site (bit.ly/1Rtdulx), “Unicode provides a unique number for every character, no matter what the platform, no matter
what the program, no matter what the language.” Each of these unique numbers is called a code point, and is typically represented
using the “U+” prefix, followed by the unique number written in hexadecimal form. For example, the code point associated to the
character “C” is U+0043. Note that Unicode is an industry standard that covers most of the world’s writing systems, including
ideographs. So, for example, the Japanese kanji ideograph 学, which has “learning” and “knowledge” among its meanings, is associated
to the code point U+5B66. Currently, the Unicode standard defines more than 1,114,000 code points."
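
As a small aside, a sketch (not from the article) of those two code points
as C++11 char32_t values:

#include <cstdint>
#include <iostream>

int main()
{
    // The code points mentioned above, written as char32_t literals.
    char32_t c    = U'\u0043';   // LATIN CAPITAL LETTER C, U+0043
    char32_t gaku = U'\u5B66';   // the kanji U+5B66
    std::cout << std::hex
              << static_cast<std::uint32_t>(c) << " "
              << static_cast<std::uint32_t>(gaku) << "\n";   // prints: 43 5b66
}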

Yup.

Lynn

Alf P. Steinbach

Sep 7, 2016, 8:04:02 PM
Well, there are some gross inaccuracies and misconceptions in that article.

"It’s worth noting that the C++ standard doesn’t specify the size of the
wchar_t type"

That's at best an uninformed hope. The standard does narrow it down
sufficiently that all Windows C and C++ compilers are non-conforming.
For conformance a `wchar_t` must be able to represent all code points of
Unicode, i.e., in practice it must be 32 bits, since it can't be 21.

The idea of defining a custom exception class to carry an error code is
ungood. There is `std::system_error` for that purpose. Only if one
desires a non-byte-encoded string, or other additional information, is a
custom exception class indicated, but there's none of that.
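
A minimal sketch of that approach, assuming a Win32 build where
GetLastError() supplies the error code (how std::system_category() maps
Win32 codes is left to the implementation):

#include <string>
#include <system_error>
#include <windows.h>

// Throw the last Win32 error as std::system_error, no custom class needed.
[[noreturn]] void throw_last_error(const std::string& what)
{
    throw std::system_error(static_cast<int>(::GetLastError()),
                            std::system_category(), what);
}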

And using the Windows API for conversion between UTF-8 and UTF-16 is
unnecessary after C++11. It gets more ridiculous when one considers
where the C++ standard library doesn't suffice, i.e. where the API would
be a reasonable choice, namely conversion to/from Windows ANSI. That is,
for some unfathomable reason the author chose, as his example of using
the API functions, about the only conversion where the API functions are
not needed.
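
For reference, a sketch of the C++11 standard-library conversion being
alluded to (these <codecvt> facilities were later deprecated in C++17,
but they are what C++11 offers):

#include <codecvt>
#include <locale>
#include <string>

// UTF-8 <-> UTF-16 without touching the Win32 API.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

std::string utf16_to_utf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}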


Cheers!,

- Alf
(who used to read MSDN Magazine at one time, it was all so shiny! and
who is only a one-time Visual C++ MVP, as opposed to the four-time and
nine-times Visual C++ MVP the author used as experts, but hey)

Lynn McGuire

Sep 8, 2016, 10:03:07 PM
Doesn't Microsoft define the wchar_t type as 2 bytes? And isn't that embedded deep into the Win32 API?

My understanding is that UTF16 is just like UTF8. When you need those extra byte(s), UTF8 just adds more to the mix as needed. Doesn't
UTF16 do the same?

Lynn

Öö Tiib

Sep 9, 2016, 2:44:08 AM
Yes, but it is non-conforming. Also, as its name suggests, Win32 is
legacy. Very few people have 32-bit hardware on their desktop.

>
> My understanding is that UTF16 is just like UTF8. When you need those
> extra byte(s), UTF8 just adds more to the mix as needed. Doesn't
> UTF16 do the same?

UTF-8 does not have endianness issues, and 7-bit ASCII is already valid UTF-8. So
UTF-8 has two benefits over UTF-16.
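
A tiny sketch of the second point, that a pure 7-bit ASCII string is
already valid UTF-8, byte for byte:

#include <cassert>
#include <cstring>

int main()
{
    // Byte-oriented code keeps working, since the bytes are identical
    // (u8 literals have type const char[] in C++11/14/17).
    const char plain[] = "hello, world";
    const auto& utf8   = u8"hello, world";
    static_assert(sizeof(plain) == sizeof(utf8), "same length");
    assert(std::memcmp(plain, utf8, sizeof(plain)) == 0);
}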

David Brown

Sep 9, 2016, 3:20:39 AM
Yes. When MS started using unicode, they did not have proper unicode -
they used UCS-2, which is basically UTF-16 except that there are no
multi-unit encodings. It covers all the unicode characters that fit in
a single 16-bit UTF-16 code unit. And since UCS-2 was the execution
character set for C and C++ on Windows at that time, 16-bit wchar_t was
fine.

But as they have moved more and more of the API and system towards full
UTF-16, it is no longer conforming - you need 32-bit wchar_t in order to
support all unicode code points in a single code unit.
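
A sketch of the mechanics involved: a code point above U+FFFF has to be
split into a surrogate pair of 16-bit units, which is exactly what a
single 16-bit wchar_t cannot hold:

#include <utility>

// Encode a code point above U+FFFF as a UTF-16 surrogate pair (high, low).
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp)
{
    const char32_t v = cp - 0x10000;
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));
    const char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};    // e.g. U+1F600 -> { 0xD83D, 0xDE00 }
}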

>
> My understanding is that UTF16 is just like UTF8. When you need those
> extra byte(s), UTF8 just adds more to the mix as needed. Doesn't UTF16
> do the same?
>

UTF-8 is efficient and convenient, because it can store ASCII in single
bytes, it has no endian issues, and UTF-8 strings can be manipulated
with common string handling routines. UTF-32 is as close as you can get
to "one code unit is one character", though there are still combining
accent marks to consider. UTF-16 is the worst of both worlds - it is
inefficient, requires multiple code units per character (though many
implementations fail to handle this properly, treating it more like UCS-2),
and has endian problems.
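
A quick way to see the size trade-offs, as a sketch using C++11 literals
(counts exclude the terminating null):

#include <iostream>

int main()
{
    // Code units needed for the BMP character U+5B66:
    std::cout << sizeof(u8"\u5B66") - 1 << " UTF-8 bytes, "                // 3
              << sizeof(u"\u5B66") / sizeof(char16_t) - 1 << " UTF-16, "   // 1
              << sizeof(U"\u5B66") / sizeof(char32_t) - 1 << " UTF-32\n";  // 1
    // A character outside the BMP, U+1F600, needs surrogates in UTF-16:
    std::cout << sizeof(u8"\U0001F600") - 1 << " UTF-8 bytes, "                 // 4
              << sizeof(u"\U0001F600") / sizeof(char16_t) - 1 << " UTF-16\n";   // 2
}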

The world has settled firmly on UTF-8 as the encoding for storing and
transferring text - rightly so, because it is the best choice in many ways.

And most of the programming world has settled on UTF-32 for internal
encodings in cases where character counting might be convenient and
therefore UTF-8 is not ideal. It is not used much, but because it is an
internal format, at least you don't have endian issues.

UTF-16 is used mainly in two legacy situations - Windows API, and Java.
(QT is doing what it can to move over to UTF-8, restricted by
compatibility with older code.)



David Brown

Sep 9, 2016, 3:24:55 AM
While Win32 is legacy in that it is an old API that suffers from
accumulated cruft and poor design decisions (though they might have made
sense 20 years ago when they were made), the great majority of /new/
Windows programs are still Win32. Only a few types of program are
written to Win64 - those that need lots of memory, or with access deep
into the bowels of the system, or that can benefit from the extra
registers, wider integers or extra instructions of x86-64. For most
Windows development work, it is easier to stick to Win32 (or a toolkit
that uses Win32 underneath), and on Windows 32-bit programs run faster
than 64-bit programs in most cases.


Alf P. Steinbach

Sep 9, 2016, 4:52:50 AM
On 09.09.2016 09:20, David Brown wrote:
>
> UTF-16 is used mainly in two legacy situations - Windows API, and Java.

You forgot the Unicode APIs.

ICU FAQ: "How is a Unicode string represented in ICU4C? A Unicode string
is currently represented as UTF-16."
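
For concreteness, a sketch of round-tripping through ICU4C's
UnicodeString, which stores UTF-16 code units internally (header layout
can vary with ICU version):

#include <unicode/unistr.h>
#include <string>

// UTF-8 in, through ICU's UTF-16 representation, and back out to UTF-8.
std::string icu_roundtrip(const std::string& utf8)
{
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(utf8);
    std::string back;
    ustr.toUTF8String(back);
    return back;
}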

Also, what you write about UTF-16 being inefficient is a very local
view. In Japanese, UTF-16 is generally more efficient than UTF-8. Or so
I've heard.

Certainly, on a platform using UTF-16 natively, as in Windows, that's
most efficient.

But there is much religious belief committed to this. One crowd starts
chanting about holy war and kill the non-believers if one should happen
to mention that a byte order mark lets Windows tools work correctly with
UTF-8. That's because they want to not seem incompetent when their tools
fail to handle it, as in particular gcc once failed. I think all that
religion and zealot behavior started with how gcc botched it. This
religion's priests are powerful enough to /change standards/ to make
their botched tools seem technically correct, or at least not botched.

As with most everything else, one should just use a suitable tool for
the job. UTF-8 is very good for external text representation, and is
good enough for Asian languages, even if UTF-16 would probably be more
efficient. UTF-16 is very good for internal text representation for use
of ICU (Unicode processing) and in Windows, and it works also in
Unix-land, and so is IMHO the generally most reasonable choice for that
/if/ one has to decide on a single representation on all platforms.


Cheers & hth.,

- Alf

David Brown

Sep 9, 2016, 6:46:29 AM
On 09/09/16 10:51, Alf P. Steinbach wrote:
> On 09.09.2016 09:20, David Brown wrote:
>>
>> UTF-16 is used mainly in two legacy situations - Windows API, and Java.
>
> You forgot the Unicode APIs.
>
> ICU FAQ: "How is a Unicode string represented in ICU4C? A Unicode string
> is currently represented as UTF-16."
>
> Also, what you write about UTF-16 being inefficient is a very local
> view. In Japanese, UTF-16 is generally more efficient than UTF-8. Or so
> I've heard.

That is true for a few types of document. But most documents don't
consist of pure text. They are html, xml, or some other format that
mixes the characters with structural or formatting commands. (Most
files that really are pure text are in ASCII.) When these are taken
into account, it turns out (from statistical sampling of web pages and
documents around the internet) that there is very little to be gained by
using UTF-16 even for languages where many of their characters take 2
bytes in UTF-16 and 3 bytes in UTF-8. And if there is some sort of
compression involved, as there often is on big web pages or when the
text is within a pdf, docx or odt file, the difference is negligible.

Or so I have heard :-) I haven't made such studies myself.

>
> Certainly, on a platform using UTF-16 natively, as in Windows, that's
> most efficient.
>
> But there is much religious belief committed to this. One crowd starts
> chanting about holy war and kill the non-believers if one should happen
> to mention that a byte order mark lets Windows tools work correctly with
> UTF-8. That's because they want to not seem incompetent when their tools
> fail to handle it, as in particular gcc once failed. I think all that
> religion and zealot behavior started with how gcc botched it. This
> religion's priests are powerful enough to /change standards/ to make
> their botched tools seem technically correct, or at least not botched.

I haven't heard stories about gcc botching unicode - but I have heard
many stories about how Windows has botched it, and how the C++ (and C)
standards botched it to make it easier to work with Windows' botched API.

(gcc has so far failed to make proper unicode identifiers, and the C and
C++ standards have botched the choices of allowed unicode characters in
identifiers, but that's another matter.)


But for full disclosure here, most of my programming with gcc is on
small embedded systems and I have not had to deal with unicode there at
all. And in my Windows programming, wxPython handles it all without
bothering me with the details.

And I understand that when MS chose UCS-2 they were in early, and were
trying to make efficient handling of international characters that was a
big step up from multiple 8-bit code pages - it's just a shame they
could not change from 16-bit to 32-bit encoding when unicode changed
(with Unicode 2.0 in 1996, several years after Windows NT).


>
> As with most everything else, one should just use a suitable tool for
> the job. UTF-8 is very good for external text representation, and is
> good enough for Asian languages, even if UTF-16 would probably be more
> efficient. UTF-16 is very good for internal text representation for use
> of ICU (Unicode processing) and in Windows, and it works also in
> Unix-land, and so is IMHO the generally most reasonable choice for that
> /if/ one has to decide on a single representation on all platforms.
>

Well, these things are partly a matter of opinion, and partly a matter
of fitting with the APIs, libraries, toolkits, translation tools, etc.,
that you are using. If you are using Windows API and Windows tools, you
might find UTF-16 more convenient. But as the worst compromise between
UTF-8 and UTF-32, it only makes sense for code close to the Windows API.
Otherwise you'll find you've got something that /looks/ like it is one
code unit per code point, but fails for CJK unified ideographs or
emoticons. My understanding is that Windows /still/ suffers from bugs
with system code that assumes UCS-2 rather than UTF-16, though the bugs
have been greatly reduced in later versions of Windows.



Alf P. Steinbach

Sep 9, 2016, 8:55:59 AM
On 09.09.2016 12:46, David Brown wrote:
>
> I haven't heard stories about gcc botching unicode

It used to be that conforming UTF-8 source code with non-ASCII string
literals could be compiled with either g++, or with MSVC, but not both.
g++ would choke on a BOM at the start, which it should have accepted.
And MSVC lacked an option to tell it the source encoding, and would
interpret it as Windows ANSI if it was UTF-8 without BOM.

At the time the zealots recommended restricting oneself to a subset of
the language that could be compiled with both g++ and MSVC. This was
viewed as a problem with MSVC, and/or with Windows conventions, rather
than botched encoding support in the g++ compiler. I.e., one should use
UTF-8-encoded source without BOM (which didn't challenge g++) and only
pure ASCII narrow literals (ditto), and no wide literals (ditto),
because that could also, as they saw it, be compiled with MSVC.

They were lying to the compiler, and in the blame assignment, and as is
usual that gave ungood results.


> - but I have heard
> many stories about how Windows has botched it

Well, you can tell that it's propaganda by the fact that it's about
assigning blame. IMO assigning blame elsewhere for ported software that
is, or was, of very low quality.

But it's not just ported Unix-land software that's ungood: the
propaganda could probably not have worked if Windows itself wasn't full
of the weirdest stupidity and bugs, including, up front in modern
Windows, that Windows Explorer, the GUI shell, scrolls away what you're
clicking on so that double-clicks generally have unpredictable effects,
or just don't work. It's so stupid that one suspects the internal
sabotage between different units, that Microsoft is infamous for. Not to
mention the lack of UTF-8 support in the Windows console subsystem.
Which has an API that effectively restricts it to UCS-2. /However/,
while there is indeed plenty wrong in general with Windows, Windows did
get Unicode support a good ten years before Linux, roughly.

UTF-8 was the old AT&T geniuses' (Ken Thompson and Rob Pike) idea for
possibly getting the Unix world on a workable path towards Unicode, by
supporting pure ASCII streams without code change. And that worked. But
it was not established in general until well past the year 2000.


>, and how the C++ (and C)
> standards botched it to make it easier to work with Windows' botched API.

That doesn't make sense to me, sorry.


>[snip]
> Well, these things are partly a matter of opinion, and partly a matter
> of fitting with the APIs, libraries, toolkits, translation tools, etc.,
> that you are using. If you are using Windows API and Windows tools, you
> might find UTF-16 more convenient. But as the worst compromise between
> UTF-8 and UTF-32,

Think about e.g. ICU.

Do you believe that the Unicode guys themselves would choose the worst
compromise?

That does not make technical sense.

So, this claim is technically nonsense. It only makes sense socially, as
an in-group (Unix-land) versus out-group (Windows) valuation: "they're
bad, they even smell bad; we're good". I.e. it's propaganda.

In order to filter out propaganda, think about whether it makes value
evaluations like "worst", or "better", and where it does that, is it
with respect to defined goals and full technical considerations?

David Brown

Sep 9, 2016, 11:13:42 AM
On 09/09/16 14:54, Alf P. Steinbach wrote:
> On 09.09.2016 12:46, David Brown wrote:
>>
>> I haven't heard stories about gcc botching unicode
>
> It used to be that conforming UTF-8 source code with non-ASCII string
> literals could be compiled with either g++, or with MSVC, but not both.
> g++ would choke on a BOM at the start, which it should have accepted.
> And MSVC lacked an option to tell it the source encoding, and would
> interpret it as Windows ANSI if it was UTF-8 without BOM.
>
> At the time the zealots recommended restricting oneself to a subset of
> the language that could be compiled with both g++ and MSVC. This was
> viewed as a problem with MSVC, and/or with Windows conventions, rather
> than botched encoding support in the g++ compiler. I.e., one should use
> UTF-8-encoded source without BOM (which didn't challenge g++) and only
> pure ASCII narrow literals (ditto), and no wide literals (ditto),
> because that could also, as they saw it, be compiled with MSVC.
>
> They were lying to the compiler, and in the blame assignment, and as is
> usual that gave ungood results.

UTF-8 files rarely use a BOM - its only use is to avoid
misinterpretation of the file as Latin-1 or some other encoding. So it
is understandable, but incorrect, for gcc to fail when given utf-8
source code with a BOM. It is less understandable, and at least as bad
for MSVC to require a BOM. gcc fixed their code in version 4.4 - has
MSVC fixed their compiler yet?
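
For reference, the BOM in question is just the byte sequence 0xEF 0xBB
0xBF at the start of the file; a small sketch of detecting it:

#include <fstream>
#include <string>

// Does the file begin with a UTF-8 byte order mark?
bool has_utf8_bom(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char bom[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(bom), 3);
    return in.gcount() == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}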

But that seems like a small issue to call "botching unicode".

gcc still doesn't allow extended identifiers, however - but I think MSVC
does? Clang certainly does.

>
>
>> - but I have heard
>> many stories about how Windows has botched it
>
> Well, you can tell that it's propaganda by the fact that it's about
> assigning blame. IMO assigning blame elsewhere for ported software that
> is, or was, of very low quality.

Are you referring to gcc as "very low quality ported software"? That's
an unusual viewpoint.

>
> But it's not just ported Unix-land software that's ungood: the
> propaganda could probably not have worked if Windows itself wasn't full
> of the weirdest stupidity and bugs, including, up front in modern
> Windows, that Windows Explorer, the GUI shell, scrolls away what you're
> clicking on so that double-clicks generally have unpredictable effects,
> or just don't work. It's so stupid that one suspects the internal
> sabotage between different units, that Microsoft is infamous for. Not to
> mention the lack of UTF-8 support in the Windows console subsystem.
> Which has an API that effectively restricts it to UCS-2. /However/,
> while there is indeed plenty wrong in general with Windows, Windows did
> get Unicode support a good ten years before Linux, roughly.

From a very quick google search, I see references to Unicode in Linux
from the late 1990s - so ten years is perhaps an exaggeration. It
depends on what you mean by "support", of course - for the most part,
Linux really doesn't care about character sets or encodings. Strings
are a bunch of bytes terminated by a null character, and the OS will
pass them around, use them for file system names, etc., with an almost
total disregard for the contents - barring a few special characters such
as "/". As far as I understand it, the Linux kernel itself isn't much
bothered with unicode at all, and it's only the locale-related stuff,
libraries like iconv, font libraries, and of course things like X that
need to concern themselves with unicode. This was one of the key
reasons for liking UTF-8 - it meant very little had to change.

>
> UTF-8 was the old AT&T geniuses' (Ken Thompson and Rob Pike) idea for
> possibly getting the Unix world on a workable path towards Unicode, by
> supporting pure ASCII streams without code change. And that worked. But
> it was not established in general until well past the year 2000.
>

Maybe MS and Windows simply suffer from being the brave pioneer here,
and the *nix world watched them then saw how to do it better.

>
>> , and how the C++ (and C)
>> standards botched it to make it easier to work with Windows' botched API.
>
> That doesn't make sense to me, sorry.

The C++ standard currently has char, wchar_t, char16_t and char32_t
(plus signed and unsigned versions of char - I don't know if the other
types have signed and unsigned versions). wchar_t is of platform
dependent size, making it (and wide strings) pretty much useless for
platform independent code. But it is the way it is because MS wanted to
have 16-bit wchar_t to suit their UCS-2 / UTF-16 APIs. The whole thing
would have been much simpler if the language simply stuck to 8-bit code
units throughout, letting the compiler and/or locale settings figure out
the input encoding and the run-time encoding. If you really need a type
for holding a single unicode character, then the only sane choice is a
32-bit type.
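
A trivial sketch that makes the platform dependence visible:

#include <cstdio>

int main()
{
    // Typically prints 2/2/4 with MSVC on Windows and 4/2/4 with gcc on Linux.
    std::printf("wchar_t: %zu, char16_t: %zu, char32_t: %zu\n",
                sizeof(wchar_t), sizeof(char16_t), sizeof(char32_t));
}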

>
>
>> [snip]
>> Well, these things are partly a matter of opinion, and partly a matter
>> of fitting with the APIs, libraries, toolkits, translation tools, etc.,
>> that you are using. If you are using Windows API and Windows tools, you
>> might find UTF-16 more convenient. But as the worst compromise between
>> UTF-8 and UTF-32,
>
> Think about e.g. ICU.
>
> Do you believe that the Unicode guys themselves would choose the worst
> compromise?
>

When unicode was started, everyone thought 16 bits would be enough. So
decisions that stretch back to before Unicode 2.0 would use 16-bit types
for encodings that covered all code points. When the ICU was started,
there was no UTF-32 (I don't know if there was a UTF-8 at the time).
The best choice at that time was UCS-2 - it is only as unicode outgrew
16 bits that this became a poor choice.

> That does not make technical sense.
>
> So, this claim is technically nonsense. It only makes sense socially, as
> an in-group (Unix-land) versus out-group (Windows) valuation: "they're
> bad, they even smell bad; we're good". I.e. it's propaganda.

No, the claim is technically correct - UTF-16 has all the disadvantages
of both UTF-8 and UTF-32, and misses several of the important
(independent) advantages of those two encodings. Except for
compatibility with existing UTF-16 APIs and code, there are few or no
use-cases where UTF-16 is a better choice than UTF-8 and also a better
choice than UTF-32. The fact that the world has pretty much
standardised on UTF-8 for transmission of unicode is a good indication
of this - UTF-16 is /only/ used where compatibility with existing UTF-16
APIs and code is of overriding concern.

But historically, ICU and Windows (and Java, and QT) made the sensible
decision of using UCS-2 when they started, but apart from QT they have
failed to make a transition to UTF-8 when it was clear that this was the
way forward.

>
> In order to filter out propaganda, think about whether it makes value
> evaluations like "worst", or "better", and where it does that, is it
> with respect to defined goals and full technical considerations?
>

The advantages of UTF-8 over UTF-16 are clear and technical, not just
"feelings" or propaganda.


Tim Rentsch

Sep 10, 2016, 12:31:18 PM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> [...] For conformance a `wchar_t` must be able to represent
> all code points of Unicode, i.e., in practice it must be 32 bits,
> since it can't be 21.

Can you give citations to passages that establish this statement?
AFAICT the various standards do not require wchar_t to support
all of Unicode, even in later versions that mandate <uchar.h>.
Of course I may have missed something, especially in the C++
documents.

Chris Vine

Sep 10, 2016, 2:45:00 PM
I was wondering about that.

§3.9.1/5 says "Type wchar_t is a distinct type whose values can
represent distinct codes for all members of the largest extended
character set specified among the supported locales (22.3.1). Type
wchar_t shall have the same size, signedness, and alignment
requirements (3.11) as one of the other integral types, called its
underlying type. Types char16_t and char32_t denote distinct types with
the same size, signedness, and alignment as uint_least16_t and
uint_least32_t, respectively, in <stdint.h>, called the underlying
types."

§3.9.1/7 says: "Types bool, char, char16_t, char32_t, wchar_t, and the
signed and unsigned integer types are collectively called integral
types."

References are to the C++11 standard.

This doesn't, in my view, preclude the use of 16-bit code units for
wchar_t with surrogate pairs for unicode coverage (nor for that matter
8-bit wchar_t with UTF-8), but there may be something else in the
standard about that, either in C++11 or C++14.

Chris

Vir Campestris

Sep 10, 2016, 5:22:56 PM
On 09/09/2016 16:13, David Brown wrote:
<snip>
>
> UTF-8 files rarely use a BOM - its only use is to avoid
> misinterpretation of the file as Latin-1 or some other encoding. So it
> is understandable, but incorrect, for gcc to fail when given utf-8
> source code with a BOM. It is less understandable, and at least as bad
> for MSVC to require a BOM. gcc fixed their code in version 4.4 - has
> MSVC fixed their compiler yet?
>

You've obviously come from a Linux world.
My experience from years back is that a file won't have a BOM, because
we know what character set it is. It's US Ascii, or some other 7 bit
national variant - or it might even be EBCDIC. Only the truly obscure
have a 6-bit byte, and are limited to UPPER CASE ONLY.

Linux assumes you are going to run UTF-8, and that's just as invalid as
assuming Windows 1251 - which used to be a perfectly sane assumption in
some parts of the world.

The BOM gets you around a few of these problems.

If you want to compile your code on the world's most popular operating
system then you have to follow its rules. Inserting a BOM is far less
painful than swapping all your slashes around - or even turning them
into Yen symbols.

<snip>
>
> Maybe MS and Windows simply suffer from being the brave pioneer here,
> and the *nix world watched them then saw how to do it better.
>
I think you're right. The 16-bit Unicode charset made perfect sense at
the time.

<snip>
>
> The advantages of UTF-8 over UTF-16 are clear and technical, not just
> "feelings" or propaganda.
>
The advantages are a lot less clear in countries that habitually use
more than 128 characters. Little unimportant countries like Japan,
China, India, Russia, Korea...

Andy

David Brown

Sep 11, 2016, 7:51:44 AM
On 10/09/16 23:22, Vir Campestris wrote:
> On 09/09/2016 16:13, David Brown wrote:
> <snip>
>>
>> UTF-8 files rarely use a BOM - its only use is to avoid
>> misinterpretation of the file as Latin-1 or some other encoding. So it
>> is understandable, but incorrect, for gcc to fail when given utf-8
>> source code with a BOM. It is less understandable, and at least as bad
>> for MSVC to require a BOM. gcc fixed their code in version 4.4 - has
>> MSVC fixed their compiler yet?
>>
>
> You've obviously come from a Linux world.

Not really - my background is very mixed. But I haven't had much use
for unicode on Windows. On Windows, Latin-1 has always been sufficient
for the non-ASCII characters I needed (mainly the Norwegian letters ÅØÆ).

> My experience from years back is that a file won't have a BOM, because
> we know what character set it is. It's US Ascii, or some other 7 bit
> national variant - or it might even be EBCDIC. Only the truly obscure
> have a 6-bit byte, and are limited to UPPER CASE ONLY.
>
> Linux assumes you are going to run UTF-8, and that's just as invalid as
> assuming Windows 1251 - which used to be a perfectly sane assumption in
> some parts of the world.
>
> The BOM gets you around a few of these problems.
>
> If you want to compile your code on the world's most popular operating
> system then you have to follow its rules. Inserting a BOM is far less
> painful than swapping all your slashes around - or even turning them
> into Yen symbols.

Just using / seems to work fine for slashes. Windows APIs, AFAIK,
accept them without question.

But I don't do much PC programming - most of my work is small embedded
systems, where neither unicode nor filenames are relevant.

>
> <snip>
>>
>> Maybe MS and Windows simply suffer from being the brave pioneer here,
>> and the *nix world watched them then saw how to do it better.
>>
> I think you're right. The 16-bit Unicode charset made perfect sense at
> the time.
>
> <snip>
>>
>> The advantages of UTF-8 over UTF-16 are clear and technical, not just
>> "feelings" or propaganda.
>>
> The advantages are a lot less clear in countries that habitually use
> more than 128 characters. Little unimportant countries like Japan,
> China, India, Russia, Korea...
>

Actually, it is still clear in those countries. CJK countries often use
characters outside the BMP, and those are four bytes in UTF-8 and four
bytes in UTF-16. So UTF-8 is at least as efficient there. And since
most text documents of any size will be full of markup (html, xml, word
processor stuff, etc.) the percentage of 16-bit code points is not
nearly as high as you might guess.

There are good reasons for UTF-8 being the most common encoding on the
net, even in those countries.



Alf P. Steinbach

Sep 11, 2016, 10:54:11 AM
"Distinct codes" is clear (otherwise it would be meaningless, and the
standards committees are not into that).

But this requirement comes from earlier in C, as I recall.

Tim Rentsch

Sep 11, 2016, 1:55:54 PM
Chris Vine <chris@cvine--nospam--.freeserve.co.uk> writes:

> On Sat, 10 Sep 2016 09:31:02 -0700
> Tim Rentsch <t...@alumni.caltech.edu> wrote:
>> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>>
>>> [...] For conformance a `wchar_t` must be able to represent
>>> all code points of Unicode, i.e., in practice it must be 32 bits,
>>> since it can't be 21.
>>
>> Can you give citations to passages that establish this statement?
>> AFAICT the various standards do not require wchar_t to support
>> all of Unicode, even in later versions that mandate <uchar.h>.
>> Of course I may have missed something, especially in the C++
>> documents.
>
> I was wondering about that.
>
> Section 3.9.1/5 says "Type wchar_t is a distinct type whose values
> can represent distinct codes for all members of the largest extended
> character set specified among the supported locales (22.3.1). Type
> wchar_t shall have the same size, signedness, and alignment
> requirements (3.11) as one of the other integral types, called its
> underlying type. Types char16_t and char32_t denote distinct types
> with the same size, signedness, and alignment as uint_least16_t and
> uint_least32_t, respectively, in <stdint.h>, called the underlying
> types."
>
> Section 3.9.1/7 says: "Types bool, char, char16_t, char32_t,
> wchar_t, and the signed and unsigned integer types are collectively
> called integral types."
>
> References are to the C++11 standard.

In the C standard the type wchar_t is defined in 7.19p2, and
char{16,32}_t in 7.28p2, but other than that I think the two
are pretty much the same.

> This doesn't, in my view, preclude the use of 16-bit code units for
> wchar_t with surrogate pairs for unicode coverage (nor for that matter
> 8-bit wchar_t with UTF-8), but there may be something else in the
> standard about that, either in C++11 or C++14.

I think this is sort of right and sort of wrong. The question is
what characters must be in "the largest extended character set
specified among the supported locales". AFAICT that set does not
have to be any larger than what is in the "C" locale, which is
"the minimal environment for C translation" (which for the sake
of discussion let's say is the same as 7-bit ASCII). If that set
is small enough, then wchar_t could be 16 bits or 8 bits, as you
say. If however "the largest extended character set specified
among the supported locales" has 100,000 characters, then I don't
see how wchar_t can be 16 bits or smaller, because there is no
way for each of those 100,000 characters to have a distinct code.

Now, if the largest extended character set has only the 127 ASCII
characters (plus 0 for null), then wchar_t can be 8 bits and use
a UTF-8 encoding. Or, if the largest extended character set has
only the 16-bit Unicode characters that do not need surrogate
pairs for their encoding, then wchar_t can be 16 bits and use a
UTF-16 encoding. AFAICT both of these scenarios are allowed by
the C and C++ standards. But it depends on what characters are
in the largest extended character set specified among the
supported locales.

Let me emphasize that the above represents my best understanding,
and only that, and which may in fact be wrong if there is something
I have missed.

Tim Rentsch

Sep 11, 2016, 2:24:39 PM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> On 10.09.2016 20:44, Chris Vine wrote:
>> On Sat, 10 Sep 2016 09:31:02 -0700
>> Tim Rentsch <t...@alumni.caltech.edu> wrote:
>>> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>>>
>>>> [...] For conformance a `wchar_t` must be able to represent
>>>> all code points of Unicode, i.e., in practice it must be 32 bits,
>>>> since it can't be 21.
>>>
>>> Can you give citations to passages that establish this statement?
>>> AFAICT the various standards do not require wchar_t to support
>>> all of Unicode, even in later versions that mandate <uchar.h>.
>>> Of course I may have missed something, especially in the C++
>>> documents.
>>
>> I was wondering about that.
>>
>> Section 3.9.1/5 says "Type wchar_t is a distinct type whose values
>> can represent distinct codes for all members of the largest
>> extended character set specified among the supported locales
>> (22.3.1). Type wchar_t shall have the same size, signedness, and
>> alignment requirements (3.11) as one of the other integral types,
>> called its underlying type. Types char16_t and char32_t denote
>> distinct types with the same size, signedness, and alignment as
>> uint_least16_t and uint_least32_t, respectively, in <stdint.h>,
>> called the underlying types."
>>
>> Section 3.9.1/7 says: "Types bool, char, char16_t, char32_t,
>> wchar_t, and the signed and unsigned integer types are collectively
>> called integral types."
>>
>> References are to the C++11 standard.
>>
>> This doesn't, in my view, preclude the use of 16-bit code units for
>> wchar_t with surrogate pairs for unicode coverage (nor for that matter
>> 8-bit wchar_t with UTF-8), but there may be something else in the
>> standard about that, either in C++11 or C++14.
>
> "Distinct codes" is clear (otherwise it would be meaningless, and
> the standards committees are not into that).
>
> But this requirement comes from earlier in C, as I recall.

More or less this same text (for wchar_t) appears in pre-C99 drafts.
The type wchar_t was introduced in Amendment 1, aka "C95", but I
don't have a copy of that handy so I don't know what wording it
uses.

I agree that the "distinct codes" provision means there must be a
distinct numeric value in wchar_t for each character in the
largest extended character set, but how big does that set have
to be? AFAICT it can be pretty small and still be part of a
conforming implementation. Let me draw your attention to two
passages in the C++14 standard (which I believe are the same
as the corresponding passages in the relevant C standards).

2.13.3 paragraph 1:

A character literal that begins with the letter L, such as
L'z', is a wide-character literal. A wide-character literal
has type wchar_t. The value of a wide-character literal
containing a single c-char has value equal to the numerical
value of the encoding of the c-char in the execution
wide-character set, unless the c-char has no representation
in the execution wide-character set, in which case the value
is implementation-defined.

16.8 paragraph 2, entry for the symbol __STDC_ISO_10646__

An integer literal of the form yyyymmL (for example,
199712L). If this symbol is defined, then every character
in the Unicode required set, when stored in an object of
type wchar_t, has the same value as the short identifier of
that character. The Unicode required set consists of all
the characters that are defined by ISO/IEC 10646, along with
all amendments and technical corrigenda as of the specified
year and month.

Both of these passages appear to acknowledge the possibility that
some "characters" might not be representable in the execution
wide-character set. Even if __STDC_ISO_10646__ is defined, it
might be defined with a date value that necessitates only 16 bit
characters for wchar_t. (I don't have the old Unicode documents
available, but I'm pretty sure there was such a time in the last
20 or 25 years.) Is there something in the newer standards that
requires __STDC_ISO_10646__ be defined at all, or that constrains
the range of dates that may be used? AFAICT there isn't, but of
course I may have missed something.
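
A small check one can run against a given implementation, as a sketch:

#include <cstdio>

int main()
{
    // Report whether ISO 10646 semantics are claimed for wchar_t, and for
    // which version, plus the width wchar_t actually has.
#ifdef __STDC_ISO_10646__
    std::printf("__STDC_ISO_10646__ = %ldL\n", static_cast<long>(__STDC_ISO_10646__));
#else
    std::printf("__STDC_ISO_10646__ is not defined\n");
#endif
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}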

So, am I on track here, or is there something you think I may
have missed?

Alf P. Steinbach

Sep 12, 2016, 2:54:19 AM
Only the exceedingly obvious: the context, given in the title of this
thread.

Cheers & hth!, :-)

- Alf



Tim Rentsch

Sep 12, 2016, 9:40:49 AM
Forgive me for being dense, but I don't see what the Subject:
line has to do with the question I'm asking. My concern is
only whether a conforming implementation may define the wchar_t
type to be smaller (eg, 16 bits) than what would be needed if
wchar_t had to be able to represent all Unicode code points in
a single wchar_t object.

David Brown

Sep 12, 2016, 10:17:05 AM
I think the point is that wchar_t has to be wide enough to support all
the characters of the implementation's execution set. Windows supports
UTF-16, including characters that require more than one code unit with
UTF-16 encoding. Therefore, 16-bit wchar_t is not sufficient to support
all the characters in a Win32 C++ implementation, so having 16-bit
wchar_t is non-conforming on Win32. (It /was/ conforming in the early
days of Win32, with UCS-2 character encoding - but not now, with UTF-16.)


Vir Campestris

Sep 12, 2016, 4:20:04 PM
On 11/09/2016 12:51, David Brown wrote:
> There are good reasons for UTF-8 being the most common encoding on the
> net, even in those countries.

The servers are probably running Linux?

Mind, what is embedded has changed. Once I worked on mainframes with a
megabyte. These days our "embedded" platform (no keyboard, no display -
but it does have a camera) has 256Mb. And runs Android.

Andy

David Brown

Sep 13, 2016, 6:43:39 AM
On 12/09/16 22:19, Vir Campestris wrote:
> On 11/09/2016 12:51, David Brown wrote:
>> There are good reasons for UTF-8 being the most common encoding on the
>> net, even in those countries.
>
> The servers are probably running Linux?

The content on the servers is independent from the server OS. And most
content is produced on desktops (which are usually not Linux). Choice
of server-side languages and frameworks might have some influence, however.

Tim Rentsch

Sep 13, 2016, 6:32:23 PM
More specifically, all members of the largest extended character
set specified among the supported locales. I knew that part
already - it was quoted in my posting.

> Windows supports UTF-16, including characters that require more
> than one code unit with UTF-16 encoding.

I'm not sure how that statement connects to the Subject: line
in a way that is "exceedingly obvious" (and I realize it was
Alf, not you, that said that), but anyway I will let that slide.

> Therefore, 16-bit
> wchar_t is not sufficient to support all the characters in a
> Win32 C++ implementation, so having 16-bit wchar_t is
> non-conforming on Win32.

You missed a step. This conclusion assumes that because
Windows "supports" UTF-16, there must be a supported locale
that includes all of those characters. I believe neither the C
standard nor the C++ standard requires that. At the very
least, I don't know what portions of the standards force that
conclusion. Implementations have to document what decisions
they make in this regard, but nothing forces them to hew to
this or that API of the underlying OS. Similarly, an
implementation that represents 'char' using ASCII could be
conforming running on an OS that uses EBCDIC. It might be a
silly decision, but it isn't necessarily a non-conforming one.
As it stands the above statement begs the question.

Can anyone quote chapter and verse to provide a compelling
answer to this question?

Paavo Helde

Sep 14, 2016, 6:03:43 AM
A C++ implementation can be hosted or free-standing. I believe the
question here is if a hosted C++ implementation must support all the
locales and character sets supported by its host.

In the C++ standard the difference between hosted and free-standing is
that a free-standing implementation may have fewer standard headers and
may not support multithreading. I do not find any mention of
supporting "host locales" or something like that. So I guess the MSVC++
implementation might be standard-conforming in this area if they declared
that they are just supporting half of UCS-2 as the maximum character set.

However, this is not what they claim. In all documentation they give an
impression that they support all Unicode. E.g. from
https://msdn.microsoft.com/en-us/library/2dax2h36.aspx:

[quote]
A wide character is a 2-byte multilingual character code. Most
characters used in modern computing worldwide, including technical
symbols and special publishing characters, can be represented according
to the Unicode specification as a wide character. Characters that cannot
be represented in just one wide character can be represented in a
Unicode pair by using the Unicode surrogate feature. Because every wide
character is represented in a fixed size of 16 bits, using wide
characters simplifies programming with international character sets.
[/quote]

It appears they have redefined UTF-16 as "surrogate pairs", and the last
sentence does not make any sense whatsoever (probably it is a left-over
from UCS-2 times where "fixed size" == 1).

From https://msdn.microsoft.com/en-us/library/69ze775t.aspx

[quote]
Universal character names cannot encode values in the surrogate code
point range D800-DFFF. For Unicode surrogate pairs, specify the
universal character name by using \UNNNNNNNN, where NNNNNNNN is the
eight-digit code point for the character. The compiler generates a
surrogate pair if required.
[/quote]

So it appears they have more or less transparent support for
Unicode in string literals, but not in single wchar_t characters.

Indeed, an experiment with old Coptic zero:

#include <iostream>
#include <stdint.h>

int main() {
    const wchar_t a[] = L"\U000102E0";           // U+102E0, outside the BMP
    std::cout << sizeof(a)/sizeof(a[0]) << "\n"; // array length incl. terminator
    std::cout << std::hex << uint32_t(a[0]) << " " << uint32_t(a[1]) << "\n";

    wchar_t x = L'\U000102E0';                   // does it fit in one wchar_t?
    std::cout << std::hex << (int) x << "\n";
}

Produces:
main.cpp(8): warning C4066: characters beyond first in wide-character
constant ignored

3
d800 dee0
d800

The string literal seems to be proper UTF-16, but the value of wchar_t
is obviously wrong.

For comparison, gcc output for the same program:

2
102e0 0
102e0

David Brown

Sep 15, 2016, 4:26:50 AM
(I knew you knew that - I was just going through the steps of the
reasoning as I understood it.)

>
>> Windows supports UTF-16, including characters that require more
>> than one code unit with UTF-16 encoding.
>
> I'm not sure how that statement connects to the Subject: line
> in a way that is "exceedingly obvious" (and I realize it was
> Alf, not you, that said that), but anyway I will let that slide.

My guess (only a guess, as it was Alf who said it) is that the subject
line makes it obvious that we are talking specifically about Windows
32-bit API here - rather than Unicode on Linux, or Unicode on Windows
using a different set of choices (perhaps some compilers on Windows have
32-bit wchar_t).

>
>> Therefore, 16-bit
>> wchar_t is not sufficient to support all the characters in a
>> Win32 C++ implementation, so having 16-bit wchar_t is
>> non-conforming on Win32.
>
> You missed a step. This conclusion assumes that because
> Windows "supports" UTF-16, there must be a supported locale
> that includes all of those characters.

Yes, I see I made that assumption - although the locale would not have
to support /all/ of those characters, merely at least one of the
characters that requires more than one UTF-16 code unit.

The locales I am familiar with on Windows are UK English and Norwegian,
both of which work fine with Latin-1, and don't need characters outside
the UCS-2 subset of UTF-16. So to be honest, I don't know whether or
not any locales in Windows require multi-unit characters in UTF-16.

So I can't tell you if my assumption, and therefore my conclusion, was
correct or not.

> I believe neither the C
> standard nor the C++ standard requires that. At the very
> least, I don't know what portions of the standards force that
> conclusion. Implementations have to document what decisions
> they make in this regard, but nothing forces them to hew to
> this or that API of the underlying OS. Similarly, an
> implementation that represents 'char' using ASCII could be
> conforming running on an OS that uses EBCDIC. It might be a
> silly decision, but it isn't necessarily a non-conforming one.
> As it stands the above statement begs the question.
>
> Can anyone quote chapter and verse to provide a compelling
> answer to this question?
>

Not me. I'm going back to "listen and learn" mode in this thread. My
knowledge of unicode and the C++ standard here is enough to question
some people's statements, or to make some questionable statements of my
own - but not nearly enough to say something sensible at this level of
detail. But I am curious about the answer here.

Tim Rentsch

Sep 15, 2016, 1:07:26 PM
Right, the question applies only for hosted implementations.

I should clarify a point of terminology. The word "locale" may
be used in several different contexts. In my comments above I
mean the word "locale" only in the sense of the context described
in the C or C++ standards, ie, those sets of characters and
characteristics identified by a particular implementation of C or
C++. So for your sentence there I agree with what I think you're
trying to say, but I say it differently, viz., the question is
whether a hosted C++ (or C) implementation must supply a locale
that incorporates all characters and characteristics "supported" by
the execution environment (which might be called a "host locale").

> [...] So I guess the MSVC++ implementation might be standard-conforming
> in this area if they declared that they are just supporting half
> of UCS-2 as the maximum character set.

That is my belief, yes, and the question before the group.

One minor correction. My understanding is that UCS-2 means only
those characters that are representable without using surrogate
pairs, so that would apply if MSVC++ declared they were supporting
all of UCS-2, not just half of it. (Let me be explicit that I am
not sure my understanding of UCS-2 is right, and may be something
slightly different. In any case this is a side issue; I think we
both understand what is meant.)


> However, this is not what they claim. In all documentation they
> give an impression that they support all Unicode.
> [snip elaboration]

Granted, if the documentation is wrong in any material way then
the implementation has not met its burden and is non-conforming.
The question of interest is where the supplied documentation
describes what is done accurately but defines the largest
supported locale as being just those characters whose Unicode
code points may be represented in 16 bits without overlapping the
surrogate pair codes. If an implementation does that then I
believe it would be conforming to limit wchar_t to 16 bits.

A fine point: AFAICT it would be conforming to define the largest
supported locale to be just that limited set, and have wchar_t be
just 16 bits, and /also/ say that certain routines (or character
and/or string literals) encode other Unicode code points using two
wchar_t objects that have values in the surrogate pair ranges.
The constraints on wchar_t are based on what characters make up
the largest supported locale, not on what might happen for any
"characters" outside that set. And implementations certainly are
allowed to define what happens outside the set of circumstances
where the standards mandate some particular behavior.

Alf P. Steinbach

Sep 17, 2016, 11:30:22 AM
On 15.09.2016 19:07, Tim Rentsch wrote:
> Paavo Helde <myfir...@osa.pri.ee> writes:
[snip]
>
>> [...] So I guess the MSVC++ implementation might be standard-conforming
>> in this area if they declared that they are just supporting half
>> of UCS-2 as the maximum character set.
>
> That is my belief, yes, and the question before the group.
>
> One minor correction. My understanding is that UCS-2 means only
> those characters that are representable without using surrogate
> pairs, so that would apply if MSVC++ declared they were supporting
> all of UCS-2, not just half of it. (Let me be explicit that I am
> not sure my understanding of UCS-2 is right, and may be something
> slightly different. In any case this is a side issue; I think we
> both understand what is meant.)

Well, after checking around it seems to me that it's likely I was wrong
about 16-bit `wchar_t` being invalid for Unicode.

So, punch me.

But I don't know that, and it's certainly not my private opinion: I got
it from others, that I thought were experts in the area: a “fact” that I
was pretty sure of.

And since we don't seem to have the requisite experts gathered in this
group, and since Usenet cross-posting etc. is mostly a thing of the
past, I've asked about this on Stack Overflow, at

<url:
http://stackoverflow.com/questions/39548465/is-16-bit-wchar-t-formally-valid-for-representing-full-unicode>

Hopefully some convincing answers will be forthcoming there.

Tim Rentsch

Sep 19, 2016, 9:23:48 PM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> On 15.09.2016 19:07, Tim Rentsch wrote:
>> Paavo Helde <myfir...@osa.pri.ee> writes:
>
> [snip]
>
>>> [...] So I guess the MSVC++ implementation might be standard-conforming
>>> in this area if they declared that they are just supporting half
>>> of UCS-2 as the maximum character set.
>>
>> That is my belief, yes, and the question before the group.
>>
>> One minor correction. My understanding is that UCS-2 means only
>> those characters that are representable without using surrogate
>> pairs, so that would apply if MSVC++ declared they were supporting
>> all of UCS-2, not just half of it. (Let me be explicit that I am
>> not sure my understanding of UCS-2 is right, and may be something
>> slightly different. In any case this is a side issue; I think we
>> both understand what is meant.)
>
> Well, after checking around it seems to me that it's likely I was
> wrong about 16-bit `wchar_t` being invalid for Unicode.
>
> So, punch me.

I have no complaints with any of your comments. You were
conveying a result that was true to the best of your
understanding. Indeed it may very well have proven to be true.
I certainly am not an expert in the C++ standard, so I asked
about it, that's all. No one should be faulted for not knowing
all the myriad implications of the C++ standard, which is a
daunting document, and I didn't mean to imply anything along
those lines.

> But I don't know that, and it's certainly not my private opinion: I
> got it from others, that I thought were experts in the area: a “fact”
> that I was pretty sure of.

Yes, that was my impression, and I appreciate that you did. My
question was meant only as a question.

> And since we don't seem to have the requisite experts gathered in this
> group, and since Usenet cross-posting etc. is mostly a thing of the
> past, I've asked about this on Stack Overflow, at
>
> <url:
> http://stackoverflow.com/questions/39548465/is-16-bit-wchar-t-formally-valid-for-representing-full-unicode>
>
> Hopefully some convincing answers will be forthcoming there.

Excellent! I will take a look.