
How to get standard paths?!?


Szyk Cech

Aug 17, 2019, 7:59:00 AM
Hello!

I am looking for the best way to get system paths on Linux and Windows.
Of course, the best would be a portable way, like this:
https://doc.qt.io/qt-5/qstandardpaths.html
But I don't want to use Qt. I want to write most of my app in pure C++.

I am looking for the following paths:
std::wstring gSettingsDir();
std::wstring gLocalDataDir();
std::wstring gAppDataDir();
std::wstring gLogDir();
std::wstring gTempDir();

Can you give me some hints on how to get them on Linux and Windows?!?


Thanks in advance and best regards!
Szyk Cech
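
A minimal sketch of the Windows side of such functions, using
SHGetKnownFolderPath from the Win32 shell API; the mapping of FOLDERID
constants onto the names above is an assumption, not a definitive
layout (link against Shell32 and Ole32):

#include <windows.h>
#include <shlobj.h>   // SHGetKnownFolderPath, FOLDERID_*
#include <string>

// Fetch a known folder as a std::wstring; empty on failure.
static std::wstring knownFolder(REFKNOWNFOLDERID id) {
    PWSTR raw = nullptr;
    std::wstring result;
    if (SUCCEEDED(SHGetKnownFolderPath(id, 0, nullptr, &raw)))
        result = raw;
    CoTaskMemFree(raw);   // safe to call with nullptr
    return result;
}

std::wstring gSettingsDir()  { return knownFolder(FOLDERID_RoamingAppData); }
std::wstring gLocalDataDir() { return knownFolder(FOLDERID_LocalAppData); }

std::wstring gTempDir() {
    wchar_t buf[MAX_PATH + 1];
    DWORD n = GetTempPathW(MAX_PATH + 1, buf);   // includes trailing backslash
    return n ? std::wstring(buf, n) : std::wstring();
}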

Alf P. Steinbach

Aug 17, 2019, 8:02:23 AM
In the previous posting you were enquiring about UTF-32 encoded wide
strings.

Be advised that in Windows wide strings are UTF-16 encoded.


Cheers!,

- Alf


Bonita Montero

Aug 17, 2019, 8:05:49 AM
> Be advised that in Windows wide strings are UTF-16 encoded.

And I think it would be best to use std::u16string,
because wchar_t is not absolutely guaranteed to be 16 bits
wide.

David Brown

Aug 17, 2019, 9:11:12 AM
It is guaranteed /not/ to be 16-bit on most systems.

David Brown

Aug 17, 2019, 9:13:18 AM
If you can use C++17, consider the support in <filesystem>:
<https://en.cppreference.com/w/cpp/header/filesystem>

I haven't tried it myself, but maybe it has what you need.
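
For what it's worth, the only one of the requested paths that the C++17
library covers directly is the temporary directory; a minimal sketch,
assuming a C++17 compiler:

#include <filesystem>
#include <iostream>

int main() {
    // Consults TMP/TEMP on Windows, TMPDIR (else /tmp) on POSIX.
    std::cout << std::filesystem::temp_directory_path() << '\n';
}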

Sam

Aug 17, 2019, 9:21:19 AM
There's no such thing, whatsoever, in Linux. This is all MS-Windows flotsam.
On Linux, there are some well-known directories, such as /var/tmp for
temporary files. There's also /tmp, but on most Linux distributions
applications should use /var/tmp. And that's about it. None of these labels
resemble anything else on Linux. Now, you do have well-known directories
like /var/log, where various log files may be found, but applications
normally can't write to it unless some special preparations are made in
advance.

And, of course, all directories and paths on Linux are plain std::string-s.
Linux code rarely uses std::wstring, that's mostly an MS-Windows plague. In
the age of internationalization, Linux appears to have converged on UTF-8
and plain std::strings, with some occasional usage of std::u32string where
it's convenient to handle text in UTF-32.
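
A minimal sketch of how an application might follow those conventions,
reading the usual environment variables; the "myapp" name is a
placeholder, and the XDG_CONFIG_HOME lookup (from the freedesktop.org
XDG Base Directory spec) is an assumption about what desktop
applications typically do:

#include <cstdlib>
#include <string>

// $HOME, or empty if unset (e.g. in some daemon contexts).
static std::string homeDir() {
    const char* h = std::getenv("HOME");
    return h ? h : "";
}

// Per-user settings: $XDG_CONFIG_HOME/myapp if set, else ~/.config/myapp.
std::string gSettingsDir() {
    if (const char* xdg = std::getenv("XDG_CONFIG_HOME"))
        return std::string(xdg) + "/myapp";
    return homeDir() + "/.config/myapp";
}

// Temporary files: honour $TMPDIR, else the /var/tmp suggested above.
std::string gTempDir() {
    const char* t = std::getenv("TMPDIR");
    return t ? t : "/var/tmp";
}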

Bonita Montero

Aug 17, 2019, 9:59:28 AM
>>> Be advised that in Windows wide strings are UTF-16 encoded.

>> And I think that it would be the best to use std::u16string
>> because wchar_t is not absolutely guaranteed to be 16 bit
>> wide.

> It is guaranteed /not/ to be 16-bit on most systems.

You can be pretty sure that by far most implementations specify
wchar_t as 16 bits, since it makes no sense to implement it differently.
But it's easier to conform to any hypothetical implementation by using
char16_t.

David Brown

Aug 17, 2019, 10:20:23 AM
Again, you are assuming Windows is everything.

On most systems it is 32-bit. The only exceptions I know are Windows,
and a few 8-bit embedded targets. (There may be others that I haven't
tested, of course.) On all *nix systems it is 32-bit, and on most
embedded systems.

MS was early in the "Unicode" game by using UCS-2 in Windows NT. They
get credit for trying, IMHO. But it quickly became apparent that 16
bits was not sufficient, and anyone who was not tied to Windows used
32-bit utf-32 when they needed "one object is one character", typically
for internal use only, and utf-8 when they wanted strings. If only MS
had taken the hit and made the changes while they still could, it would
have avoided the hideous mess that resulted with the painful 16-bit
types. 16-bit is too short to hold all the Unicode characters, but long
enough to have all the disadvantages of wasted space and endianness
issues. And now we are left with the jumble that is five different
character types in C++ (not including signed and unsigned versions), MS'
influence putting utf-16 nonsense into the C++ standards, and of course
other platform-independent languages (Java, Python) and libraries (QT)
using utf-16 or ucs-2 to be compatible with MS's error.

Bonita Montero

Aug 17, 2019, 10:28:52 AM
> MS was early in the "Unicode" game by using UCS-2 in Windows NT.  They
> get credit for trying, IMHO.  But it quickly became apparent that 16
> bits was not sufficient, and anyone who was not tied to Windows used
> 32-bit utf-32 when they needed "one object is one character", typically
> for internal use only, and utf-8 when they wanted strings.  If only MS
> had taken the hit and made the changes while they still could, it would
> have avoided the hideous mess that resulted with the painful 16-bit
> types.  16-bit is too short to hold all the Unicode characters, but long
> enough to have all the disadvantages of wasted space and endianness
> issues.  And now we are left with the jumble that is five different
> character types in C++ (not including signed and unsigned versions), MS'
> influence putting utf-16 nonsense into the C++ standards, and of course
> other platform-independent languages (Java, Python) and libraries (QT)
> using utf-16 or ucs-2 to be compatible with MS's error.

We were not talking about UTF-whatever. We were talking about wchar_t,
or more precisely std::u16string. You can store 16-bit characters in the
latter, as well as UTF-16 sequences.
We were saying that you can almost rely on the fact that wchar_t
is 16 bits wide. There might be ancient CPU architectures where the
registers aren't 8, 16, ... bits wide, but you can be certain that no one
would implement a conforming C++ compiler for these systems.

Ralf Goertz

Aug 17, 2019, 10:37:17 AM
On Sat, 17 Aug 2019 16:28:41 +0200,
Bonita Montero <Bonita....@gmail.com> wrote:

> We were saying that you can almost rely on the fact that
> wchar_t is 16 bits wide. There might be ancient CPU architectures
> where the registers aren't 8, 16, ... bits wide, but you can be
> certain that no one would implement a conforming C++ compiler for
> these systems.


#include <iostream>

int main() {
    std::cout << sizeof(wchar_t) << std::endl;
}

> g++ -o wchar wchar.cc
> ./wchar
4
> g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=hsa:nvptx-none
Target: x86_64-suse-linux



Bonita Montero

Aug 17, 2019, 10:54:37 AM
> #include <iostream>
>
> int main() {
> std::cout<<sizeof(wchar_t)<<std::endl;
> }
>
>> g++ -o wchar wchar.cc
>> ./wchar
> 4
>> g++ -v
> Using built-in specs.
> COLLECT_GCC=g++
> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/9/lto-wrapper
> OFFLOAD_TARGET_NAMES=hsa:nvptx-none
> Target: x86_64-suse-linux
> …

Hmm, I wouldn't have expected this. I think this isn't a clever
decision. For internal processing the representation doesn't matter,
but for compatibility reasons it should match the most common Unicode
representation.

Paavo Helde

Aug 17, 2019, 11:37:31 AM
For once, I agree with you. Alas, Microsoft does not listen and still
stubbornly clings to its unfortunate 16-bit Unicode representation,
which is neither the most common (UTF-8 is) nor useful for easier text
manipulation (UTF-32 is).



David Brown

Aug 17, 2019, 11:40:31 AM
How about reading what others post, and perhaps /thinking/ a little? In
particular, get out of the "the world is Windows" mindset.

wchar_t has /never/ been specifically about Unicode. It means "wide
character"; it is supposed to be big enough to hold a character of any
character set supported by the implementation. For example, a Chinese
system might support ASCII and Big5, which is a 16-bit encoding - it
could therefore have a 16-bit wchar_t.

For systems that support Unicode, wchar_t is required to be at least
32-bit. That is why it /is/ 32-bit on any modern system - except broken
Windows where wchar_t is not as big as the C++ standards require. The
common case of 16-bit wchar_t in Windows compilers is in fact not
standard C++.

As for the "most common Unicode representation", for files or data
interchange, that is UTF-8 by many orders of magnitude. UTF-16 and
UTF-32 are found in a few niche situations. Internally, within
programs, UTF-32 is sometimes used (mostly by people who think it is
important to be able to index characters or count characters really
quickly). It is the standard wchar_t type, used by most systems.
Within Windows, you can't avoid UTF-16 internally and for API calls -
and it is also used by people who mistakenly think one wchar_t
represents one code point. And UTF-16 is also used by some libraries,
such as QT, that are stuck with it for historical reasons.


Bonita Montero

Aug 17, 2019, 11:52:57 AM
> wchar_t has /never/ been specifically about Unicode.  It means "wide
> character", it is supposed to big enough to hold a character of any
> character set supported by the implementation.  For example, a Chinese
> system might support ASCII and Big5, which is a 16-bit encoding - it
> could therefore have a 16-bit wchar_t.

Unrealistic that it will be used for anything other than Unicode.

> For systems that support Unicode, wchar_t is required to be at least
> 32-bit.

No, it's implementation-defined.

Bonita Montero

Aug 17, 2019, 11:55:06 AM
> For once, I agree with you. Alas, Microsoft does not listen and still
> stubbornly clings to its unfortunate 16-bit Unicode representation,
> which is neither the most common (UTF-8 is) nor useful for easier text
> manipulation (UTF-32 is).

The APIs accepting UTF-16 strings aren't for persistence.
And there aren't any functions for string manipulation in
the Win32 API.

David Brown

Aug 17, 2019, 11:56:45 AM
On 17/08/2019 16:28, Bonita Montero wrote:
>> MS was early in the "Unicode" game by using UCS-2 in Windows NT.  They
>> get credit for trying, IMHO.  But it quickly became apparent that 16
>> bits was not sufficient, and anyone who was not tied to Windows used
>> 32-bit utf-32 when they needed "one object is one character",
>> typically for internal use only, and utf-8 when they wanted strings.
>> If only MS had taken the hit and made the changes while they still
>> could, it would have avoided the hideous mess that resulted with the
>> painful 16-bit types.  16-bit is too short to hold all the Unicode
>> characters, but long enough to have all the disadvantages of wasted
>> space and endianness issues.  And now we are left with the jumble that
>> is five different character types in C++ (not including signed and
>> unsigned versions), MS' influence putting utf-16 nonsense into the C++
>> standards, and of course other platform-independent languages (Java,
> Python) and libraries (QT) using utf-16 or ucs-2 to be compatible
>> with MS's error.
>
> We were not talking about UTF-whatever. We were talking about wchar_t,
> or more precisely std::u16string. You can store 16-bit characters in the
> latter, as well as UTF-16 sequences.

You can store any character encoding supported by the compiler in a
wchar_t, unless you are using Windows, which is broken and can't store
Unicode characters in a wchar_t because it is too small.

You can store any UTF-16 code unit in a char16_t. That does not mean
you can store any UTF-16 character in a char16_t - you can only store
those that fit in one unit. And in a std::u16string, you can store any
UTF-16 string of characters.

You can store any UTF-32 code unit in a char32_t. That covers all
Unicode code points, but there are Unicode characters that are made of
combinations of code points. And in a std::u32string, you can store any
UTF-32 string of characters.

And you can store any UTF-8 code unit in a char8_t.
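
To make those code-unit counts concrete, a small sketch using an
arbitrarily chosen non-BMP character (U+13000, an Egyptian hieroglyph):

#include <iostream>
#include <string>

int main() {
    std::u16string s16 = u"\U00013000";   // stored as a surrogate pair
    std::u32string s32 = U"\U00013000";   // stored as one code unit
    std::cout << s16.size() << '\n';      // prints 2
    std::cout << s32.size() << '\n';      // prints 1
}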


In a wchar_t, you can (except on Windows) store any character from the
"execution wide-character set" for the compiler. That does not need to
be Unicode.

On most modern systems, wchar_t matches char32_t. But it is not a
requirement.


> We were saying that you can almost rely on the fact that wchar_t
> is 16 bits wide.

You are wrong. You can rely on wchar_t being 32-bit, except on Windows
and a few small 8-bit systems that often have very limited support for
wchar_t at all.

> There might be ancient CPU architectures where the registers
> aren't 8, 16, ... bits wide, but you can be certain that no one
> would implement a conforming C++ compiler for these systems.

People use C++ all the time on 8-bit and 16-bit processors - /new/
designs, not ancient ones. They often don't have a full C++ library,
but they are freestanding systems and don't need the full library to be
compliant. (They might be non-compliant in other ways.)

Windows compilers are invariably non-compliant regarding wchar_t because
it is only 16-bit on that platform, when it is required to be 32-bit.


Bonita Montero

Aug 17, 2019, 12:00:08 PM
>> For systems that support Unicode, wchar_t is required to be at least
>> 32-bit.

> No, it's implementation-defined.

That is what the standard says about that:

"Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.3.1). Type wchar_t shall have the same
size, signedness, and alignment requirements (3.11) as one of the other
integral types, called its underlying type."

David Brown

Aug 17, 2019, 12:00:18 PM
On 17/08/2019 17:52, Bonita Montero wrote:
>> wchar_t has /never/ been specifically about Unicode.  It means "wide
>> character", it is supposed to big enough to hold a character of any
>> character set supported by the implementation.  For example, a Chinese
>> system might support ASCII and Big5, which is a 16-bit encoding - it
>> could therefore have a 16-bit wchar_t.
>
> Unrealistic that it will be used for anything other than Unicode.

wchar_t existed long before Unicode became the standard choice.
But I agree that for future systems, it is unrealistic for wchar_t to be
used with anything other than UTF-32 encodings.

>
>> For systems that support Unicode, wchar_t is required to be at least
>> 32-bit.
>
> No, it's implementation-defined.

It is required to be big enough for any of the compiler's wide-character
execution set. Since Windows supports Unicode (regardless of the
encodings used), a Windows compiler must be able to hold any Unicode
code point in a wchar_t - i.e., wchar_t must be a minimum of 32 bit.

David Brown

Aug 17, 2019, 12:01:38 PM
Exactly, yes. 16-bit wchar_t can't do that on a system that supports
Unicode (regardless of the encoding).

Bonita Montero

Aug 17, 2019, 12:05:43 PM
> You can store any character encoding supported by the compiler in a
> wchar_t, unless you are using Windows, which is broken and can't store
> Unicode characters in a wchar_t because it is too small.

The size of wchar_t is implementation-dependent, so Windows isn't broken
here. And you have a limited number of codepoints in UTF-16 over UTF-32,
but even UTF-16 has such a huge number of codepoints that this will
never be exhausted.
And as there will never be more codepoints in Unicode than the "limited"
range of UTF-16 can cover, Windows isn't broken here.

> On most modern systems, wchar_t matches char32_t.
> But it is not a requirement.

That was where David is wrong.

> People use C++ all the time on 8-bit and 16-bit processors - /new/
> designs, not ancient ones.  They often don't have a full C++ library,
> but they are freestanding systems and don't need the full library to
> be compliant.  (They might be non-compliant in other ways.)

I wasn't talking about the library but basic datatypes.

> Windows compilers are invariably non-compliant regarding wchar_t because
> it is only 16-bit on that platform, when it is required to be 32-bit.

UTF-16 is suitable for all codepoints that will ever be populated.

Bonita Montero

Aug 17, 2019, 12:07:03 PM
> It is required to be big enough for any of the compiler's wide-character
> execution set.  Since Windows supports Unicode (regardless of the
> encodings used), a Windows compiler must be able to hold any Unicode
> code point in a wchar_t - i.e., wchar_t must be a minimum of 32 bit.

Win32 uses UTF-16 and there will never be more populated codepoints
than UTF-16 can cover; so where's the problem?

Bonita Montero

Aug 17, 2019, 12:26:53 PM
> Exactly, yes.  16-bit wchar_t can't do that on a system that
> supports Unicode (regardless of the encoding).

There will be no more Unicode codepoints populated than could be
addressed by UTF-16. At least not until we start supporting alien languages.

Paavo Helde

Aug 17, 2019, 12:55:36 PM
On 17.08.2019 18:54, Bonita Montero wrote:
>
> The APIs accepting UTF-16 strings aren't for persistence.

You mean, like CreateFileW() or RegSetKeyValueW()?

> And there aren't any functions for string manipulation in
> the Win32 API?

Like ExpandEnvironmentStringsW() or PathUnExpandEnvStringsW()?

Not that it would matter in the slightest. Microsoft's use of wchar_t
still does not match the C++ standard, and UTF-16 still remains the most
useless Unicode representation.

Bonita Montero

Aug 17, 2019, 1:15:41 PM
>> The APIs accepting UTF-16 strings aren't for persistence.

> You mean, like CreateFileW() or RegSetKeyValueW()?

More precisely: for persistent content.
And for both functions UTF-16 is sufficient.

> Not that it would matter in the slightest. Microsoft's use of wchar_t
> still does not match the C++ standard, and UTF-16 still remains the
> most useless Unicode representation.

wchar_t is not required to be 32 bit wide.

Paavo Helde

Aug 17, 2019, 1:40:04 PM
On 17.08.2019 20:15, Bonita Montero wrote:
>>> The APIs accepting UTF-16 strings aren't for persistence.
>
>> You mean, like CreateFileW() or RegSetKeyValueW()?
>
> More precisely: for persistent content.
> And for both functions UTF-16 is sufficient.

Sure, and UTF-8 would be as well.

>
>> Not that it would matter in the slightest. Microsoft's use of wchar_t
>> still does not match the C++ standard, and UTF-16 still remains the
>> most useless Unicode representation.
>
> wchar_t is not required to be 32 bit wide.

Sure, wchar_t can be 64 bit or whatever as long as it can hold any
character supported by the implementation [1]. Windows happens to
support Unicode, so a 16-bit type does not cut it.

[1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales."

Bonita Montero

Aug 17, 2019, 1:44:45 PM
>> More precisely: for persistent content.
>> And for both functions UTF-16 is sufficient.

> Sure, and UTF-8 would be as well.

UTF-8 and UTF-16 can address 21 bits -
all the codepoints that Unicode addresses.

> [1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales."
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hergen Lehmann

Aug 17, 2019, 2:00:15 PM
On 17.08.19 at 18:05, Bonita Montero wrote:
>> You can store any character encoding supported by the compiler in a
>> wchar_t, unless you are using Windows, which is broken and can't store
>> Unicode characters in a wchar_t because it is too small.
>
> The size of wchar_t is implementation-dependent, so Windows isn't broken
> here.

Yes, the size of wchar_t is implementation-dependent, so it's not
formally broken.

But the Windows implementation is technically broken, as the programmer
can neither rely on the assumption that any given character can be
stored into a wchar_t, nor that iterating over a string will correctly
return the stored characters, nor that insert/erase/substring operations
will cut the string properly - which completely defeats the purpose of
wchar_t.

If I have to code multi-byte awareness into each and every string
operation anyway, there is no point in using wstring instead of string
and the much more common UTF-8 encoding.

And it is getting even worse, as many Windows libraries and Windows
applications are not aware of the fact that a Windows wstring is
supposed to be UTF-16. The code assumes one codepoint per string
position and fails in an unpredictable way if the input data actually
contains codepoints from the upper planes.

>> On most modern systems, wchar_t matches char32_t.
>> But it is not a  requirement.
>
> That was where David is wrong.

He's not.

David Brown

Aug 17, 2019, 2:11:06 PM
On 17/08/2019 19:44, Bonita Montero wrote:
>>> More precisely: for persistent content.
>>> And for both functions UTF-16 is sufficient.
>
>> Sure, and UTF-8 would be as well.
>
> UTF-8 and UTF-16 can address 21 bits -
> all the codepoints that Unicode addresses.
>

Get back to us when you figure out how to fit 21 bits of Unicode code
points into a 16-bit wchar_t.

You can happily store any Unicode characters in a UTF-16 string, but not
in a single 16-bit character object.

David Brown

Aug 17, 2019, 2:21:05 PM
Does your ignorance know no bounds?

Aren't you even capable of the simplest of web searches or references
before talking rubbish in public?

<https://en.wikipedia.org/wiki/Unicode>

<https://home.unicode.org/>

Currently, Unicode has 137,929 characters. These are organised in
17 code planes of 64K characters each, of which only 3 code planes
are significantly used. But 3 planes is a great deal more than 1 plane
- 16-bit has been insufficient for Unicode since 1996.

Tim Rentsch

Aug 17, 2019, 2:23:03 PM
(and in another posting quotes the C++ standard)

> "Type wchar_t is a distinct type whose values can represent distinct
> codes for all members of the largest extended character set specified
> among the supported locales (22.3.1)."

Like you say, the C and C++ standards require wchar_t to be large
enough so one wchar_t object can hold distinct values for every
character in the largest supported character set.

Most widely used operating systems today (eg, Microsoft Windows,
Linux) support character sets with (at least) hundreds of
thousands of characters.

Can you explain how hundreds of thousands of distinct values can be
represented in a single 16-bit wchar_t object?

David Brown

Aug 17, 2019, 2:24:47 PM
There is no problem with using UTF-16 for strings - except that it is an
encoding that combines all the disadvantages of UTF-8 with all the
disadvantages of UTF-32, with none of their advantages. But you cannot
store an arbitrary Unicode /character/ in a single 16-bit object.
Critically, you cannot store a Unicode CJK (Chinese, Japanese, Korean)
ideograph in a single 16-bit wchar_t, despite these characters being
supported by Windows. You /can/ store them in a char32_t, or a 32-bit
wchar_t.

David Brown

Aug 17, 2019, 2:41:16 PM
On 17/08/2019 18:05, Bonita Montero wrote:
>> You can store any character encoding supported by the compiler in a
>> wchar_t, unless you are using Windows, which is broken and can't store
>> Unicode characters in a wchar_t because it is too small.
>
> The size of wchar_t is implementation-dependent, so Windows isn't broken
> here.

Yes, Windows is broken here. The standards give minimum requirements
for many implementation-dependent features. Windows could have a 21-bit
wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is
just like saying the size of "int" is implementation dependent, but the
standards require a minimum of 16-bit int.

> And you have a limited number of codepoints in UTF-16 over UTF-32,
> but even UTF-16 has such a huge number of codepoints that this will
> never be exhausted.

Unicode started with 16-bit code points, but within five years or so
they figured out that 16-bit was not enough, and it was extended.

Perhaps you don't understand how multi-unit encoding works, and are
totally confused by the concept that UTF-16 encoding is fine for
handling some billion+ code points, just like UTF-8 and UTF-32 can,
while not understanding that you can't fit all Unicode characters into a
single 16-bit wchar_t.


> And as there will never be more codepoints in Unicode than the "limited"
> range of UTF-16 can cover, Windows isn't broken here.

Look it up. UTF-16 can cover everything, though it is a silly choice.
A 16-bit wchar_t cannot store all UTF-16 characters.

>
>> On most modern systems, wchar_t matches char32_t.
>> But it is not a  requirement.
>
> That was where David is wrong.

The minimum requirement would be 21-bit for a system supporting Unicode.
But wchar_t must also match an existing integral type. It would be
possible for a compiler to have an extended integer type that is 24-bit,
but assuming a "normal" compiler, 32-bit is the only sane minimum
requirement for wchar_t.  And while the standards allow bigger sizes
- a 64-bit wchar_t would be perfectly compliant - sanity dictates 32-bit.

>
>> People use C++ all the time on 8-bit and 16-bit processors - /new/
>> designs, not ancient ones.  They often don't have a full C++ library,
>> but they are freestanding systems and don't need the full library to
>> be compliant.  (They might be non-compliant in other ways.)
>
> I wasn't talking about the library but basic datatypes.

I don't think /you/ know what you are talking about - it is hard for
other people to guess.

>
>> Windows compilers are invariably non-compliant regarding wchar_t
>> because it is only 16-bit on that platform, when it is required to be
>> 32-bit.
>
> UTF-16 is suitable for all codepoints that will ever be populated.
>

When you are in a hole, stop digging.

David Brown

Aug 17, 2019, 2:46:32 PM
On 17/08/2019 19:47, Hergen Lehmann wrote:
> On 17.08.19 at 18:05, Bonita Montero wrote:
>>> You can store any character encoding supported by the compiler in a
>>> wchar_t, unless you are using Windows, which is broken and can't
>>> store Unicode characters in a wchar_t because it is too small.
>>
>> The size of wchar_t is implementation-dependent, so Windows isn't broken
>> here.
>
> Yes, the size of wchar_t is implementation-dependent, so it's not
> formally broken.

It is for Windows, because Windows supports Unicode as an execution
character set (the encoding is irrelevant), and 16-bit wchar_t is not
big enough. Anything 21 bits or higher, that fulfils the other
requirements in the standard, would do.

>
> But the Windows implementation is technically broken, as the programmer
> can neither rely on the assumption that any given character can be
> stored into a wchar_t, nor that iterating over a string will correctly
> return the stored characters, nor that insert/erase/substring operations
> will cut the string properly - which completely defeats the purpose of
> wchar_t.
>

Yes.

16-bit wchar_t made sense when Windows supported UCS-2, which is a
16-bit subset of Unicode, and is where Unicode started. But Unicode
moved on in 1996, and Windows did not.

> If I have to code multi-byte awareness into each and every string
> operation anyway, there is no point in using wstring instead of string
> and the much more common UTF-8 encoding.
>

Yes.

Bo Persson

Aug 17, 2019, 3:08:02 PM
And, of course, MS follows the letter of the law here by limiting the
number of supported locales. The language standard says nothing about that.


Bo Persson

Bo Persson

Aug 17, 2019, 3:21:56 PM
No, they would naturally not fit.

But there is a loophole here, as the language standard says "the
supported locales" and not "every existing locale".

So if you only support the locales where 16 bits is enough, you haven't
broken any rules. And you cannot help it if some people use parts of
"unsupported" locales as well...

David Brown

Aug 17, 2019, 3:41:49 PM
I can understand if MS Windows does not support Cuneiform or Old Persian
locales. But are you telling me they don't support Chinese, Japanese or
Korean using plane 2 characters?

My understanding is that the standard Windows APIs for things like file
names now support full UTF-16 encodings, and that includes characters
that won't fit in a 16-bit wchar_t. When the decision to have 16-bit
wchar_t was made, these APIs were UCS-2, so 16-bit wchar_t was a
suitable and compliant choice at that time. But it is not suitable any
more (and hasn't been for a long time).



Keith Thompson

Aug 17, 2019, 7:34:12 PM
David Brown <david...@hesbynett.no> writes:
> On 17/08/2019 18:26, Bonita Montero wrote:
>>> Exactly, yes. 16-bit wchar_t can't do that on a system that
>>> supports Unicode (regardless of the encoding).
>>
>> There will be no more Unicode codepoints populated than could be
>> addressed by UTF-16. At least not until we start supporting alien languages.
>
> Does your ignorance know no bounds?
>
> Aren't you even capable of the simplest of web searches or references
> before talking rubbish in public?

I think you have incorrectly assumed that Bonita is asserting that there
are no more than 65536 Unicode codepoints.

UTF-16 can represent all Unicode codepoints. It cannot represent each
Unicode codepoint in 16 bits; some of them require two 16-bit values.

I presume Bonita meant that UTF-16 can represent all of Unicode (which
is true). But a 16-bit wchar_t cannot, because the standard requires
wchar_t to be "able to represent all members of the execution
wide-character set" (that's the point Bonita missed or ignored).

To use wchar_t to represent all Unicode code points, you either have to
make wchar_t at least 21 bits (more likely 32 bits) *or* you have to use
it in a way that doesn't satisfy the requirements of the standard, such
as using UTF-16 to encode some characters in more than one wchar_t.

[...]

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Keith Thompson

Aug 17, 2019, 7:46:23 PM
David Brown <david...@hesbynett.no> writes:
> On 17/08/2019 18:05, Bonita Montero wrote:
>>> You can store any character encoding supported by the compiler in a
>>> wchar_t, unless you are using Windows, which is broken and can't store
>>> Unicode characters in a wchar_t because it is too small.
>>
>> The size of wchar_t is implementation-dependent, so Windows isn't broken
>> here.
>
> Yes, Windows is broken here. The standards give minimum requirements
> for many implementation-dependent features. Windows could have a 21-bit
> wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is
> just like saying the size of "int" is implementation dependent, but the
> standards require a minimum of 16-bit int.

**OR** it could have a 16-bit wchar_t and not claim to support Unicode
as its wide character set. An implementation with 16-bit wchar_t that
supports only the BMP as its wide character set could be conforming
(though not as useful as an implementation that supports full Unicode).

But Microsoft's own documentation says:
The wchar_t type is an implementation-defined wide character
type. In the Microsoft compiler, it represents a 16-bit wide
character used to store Unicode encoded as UTF-16LE, the native
character type on Windows operating systems. The wide character
versions of the Universal C Runtime (UCRT) library functions
use wchar_t and its pointer and array types as parameters and
return values, as do the wide character versions of the native
Windows API.
https://docs.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2019

which I believe is non-conforming.

(But I'm not sure how Microsoft could have fixed this without breaking
too much existing code.)

Robert Wessel

Aug 17, 2019, 11:57:34 PM
On Sat, 17 Aug 2019 20:41:06 +0200, David Brown
<david...@hesbynett.no> wrote:

>On 17/08/2019 18:05, Bonita Montero wrote:
>
>> And you have a limited number of codepoints in UTF-16 over UTF-32,
>> but even UTF-16 has such a huge number of codepoints that this will
>> never be exhausted.
>
>Unicode started with 16-bit code points, but within five years or so
>they figured out that 16-bit was not enough, and it was extended.
>
>Perhaps you don't understand how multi-unit encoding works, and are
>totally confused by the concept that UTF-16 encoding is fine for
>handling some billion+ code points, just like UTF-8 and UTF-32 can,
>while not understanding that you can't fit all Unicode characters into a
>single 16-bit wchar_t.


Actually UTF-16 can't encode billions of characters, as it
specifically uses a pair of extension/surrogate characters, which
encode about 10 bits each, leading to the extra 16 planes.

The format for UTF-8 was designed (and could trivially be extended) to
support 31 bit code points (with a six byte sequence). It was limited
to the current 17-plane scheme by the adoption of UTF-16, which could
only address 17 planes (the BMP plus the 16 extra planes implied by
the 20-bit number encoded in the surrogate characters).
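
For reference, the surrogate arithmetic described here fits in a few
lines; a sketch of encoding one supplementary-plane code point (the
function name is ad hoc):

#include <cstdint>
#include <cstdio>

// Encode a code point above U+FFFF as a UTF-16 surrogate pair:
// subtract 0x10000, then split the remaining 20 bits 10/10.
void encodeSurrogates(uint32_t cp, uint16_t &high, uint16_t &low) {
    cp -= 0x10000;
    high = 0xD800 + (cp >> 10);     // leading (high) surrogate
    low  = 0xDC00 + (cp & 0x3FF);   // trailing (low) surrogate
}

int main() {
    uint16_t hi, lo;
    encodeSurrogates(0x13000, hi, lo);
    std::printf("%04X %04X\n", (unsigned)hi, (unsigned)lo);   // D80C DC00
}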

Tim Rentsch

Aug 18, 2019, 1:25:10 AM
Perhaps not as much of a loophole as you think. The setlocale()
function is defined, even for C++, by the C standard. A call to
setlocale() must not accept a locale that has more characters
than wchar_t can accommodate, because of how 'wide character' is
defined. So the "unsupported" locales you are talking about
cannot be used in a conforming implementation. Of course, if the
implementation is not conforming, it can do whatever it wants.

Bonita Montero

Aug 18, 2019, 2:48:45 AM
> Does your ignorance know no bounds?

Unicode is defined to have a maximum of 21 bit codepoints.

Bonita Montero

Aug 18, 2019, 2:49:37 AM
> There is no problem with using UTF-16 for strings - except that it is
> an encoding that combines all the disadvantages of UTF-8 with all the
> disadvantages of UTF-32, with none of their advantages.  But you cannot
> store an arbitrary Unicode /character/ in a single 16-bit object.

Therefore you have UTF-16.

Bonita Montero

Aug 18, 2019, 2:54:14 AM
> But the Windows implementation is technically broken, as the programmer
> can neither rely on the assumption that any given character can be
> stored into a wchar_t, nor that iterating over a string will correctly
> return the stored characters, nor that insert/erase/substring operations
> will cut the string properly - which completely defeats the purpose of
> wchar_t.

That's how UTF-16 works, and the Unicode standard recommends UTF-16 in
certain circumstances.

Bonita Montero

Aug 18, 2019, 2:57:12 AM
> Yes, Windows is broken here.  The standards give minimum requirements
> for many implementation-dependent features.  Windows could have a 21-bit
> wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one.  It is
> just like saying the size of "int" is implementation dependent, but the
> standards require a minimum of 16-bit int.

There's no mandatory relationship between Unicode and wchar_t.

>> And as there will never be more codepoints in Unicode than the "limited"
>> range of UTF-16 can cover, Windows isn't broken here.

> Look it up.  UTF-16 can cover everything, though it is a silly choice. A
> 16-bit wchar_t cannot store all UTF-16 characters.

But UTF-16 can. And Windows works with UTF-16.

Bonita Montero

Aug 18, 2019, 2:58:36 AM
> Get back to us when you figure out how to fit 21 bits of Unicode code
> points into a 16-bit wchar_t.

wchar_t has no mandatory relationship to Unicode.
And Windows uses UTF-16. And UTF-16 works with 16-bit characters.

Bart

Aug 18, 2019, 5:17:01 AM
This is what you might often do with ASCII:

unsigned char ascii[128];
for (int i=0; i<128; ++i) ascii[i]=i;

So that all ASCII code points are represented in sequence by each
element of ascii.

What people are saying is that you can't do the equivalent for Unicode
using Windows' wchar_t when the latter is 16 bits:

wchar_t unicode[1114112];
for (int i=0; i<1114112; ++i) unicode[i]=i;

because the stored values will wrap back to 0 as soon as i gets to 65536
(and again at 131072 and so on).

If this unicode[] array were to represent a UTF-16 /string/, then its
size would be greater than 1114112 elements, as many values require
escape sequences with multiple elements, you couldn't write the loop in
such a simple way, and you wouldn't have the Nth code point stored as
the single value in unicode[N].

This is why a 32-bit wchar_t would have been a far better choice.
Obviously, Windows can display and work with any Unicode characters, but
it makes it more complicated than it would have been.

(There's nothing to stop a Windows programmer choosing to write the
program like this:

uint32_t unicode[1114112];
for (int i=0; i<1114112; ++i) unicode[i]=i;

but then such 32-bit characters, or strings using such arrays, are not
directly supported by the OS.)

Bonita Montero

Aug 18, 2019, 5:27:57 AM
> If this unicode[] array were to represent a UTF16 /string/, then its
> size would be greater than 1114112 elements, as many values require
> escape sequences with multiple elements, you couldn't write the loop
> in such a simple way, and you wouldn't have the Nth code point stored
> as the single value in unicode[N].

Of course a wchar_t with 32 bits would be more convenient.
But with 16 bits it fits with Win32.
And the Unicode standard recommends UTF-16 for certain circumstances.

>    uint32_t unicode[1114112];
>    for (int i=0; i<1114112; ++i) unicode[i]=i;

No one uses char arrays of that length in C++, but rather u16string
or u32string.

David Brown

Aug 18, 2019, 5:34:28 AM
On 18/08/2019 01:33, Keith Thompson wrote:
> David Brown <david...@hesbynett.no> writes:
>> On 17/08/2019 18:26, Bonita Montero wrote:
>>>> Exactly, yes. 16-bit wchar_t can't do that on a system that
>>>> supports Unicode (regardless of the encoding).
>>>
>>> There will be no more Unicode codepoints populated than could be
>>> addressed by UTF-16. At least not until we start supporting alien languages.
>>
>> Does your ignorance know no bounds?
>>
>> Aren't you even capable of the simplest of web searches or references
>> before talking rubbish in public?
>
> I think you have incorrectly assumed that Bonita is asserting that there
> are no more than 65536 Unicode codepoints.

Possibly. If I have misinterpreted her, then I will be glad to be
corrected.

>
> UTF-16 can represent all Unicode codepoints. It cannot represent each
> Unicode codepoint in 16 bits; some of them require two 16-bit values.

Correct - and that is something I have said several times.

>
> I presume Bonita meant that UTF-16 can represent all of Unicode (which
> is true). But a 16-bit wchar_t cannot, because the standard requires
> wchar_t to be "able to represent all members of the execution
> wide-character set" (that's the point Bonita missed or ignored).

Agreed - and again, it is something I have said several times.

It is fine (both in the sense of working practically and in being
compliant with the standards) to use char16_t and u16string to handle
all Unicode strings and characters. You can't store all Unicode
characters in a /single/ char16_t object - but a char16_t is for storing
Unicode code /units/, not code /points/, so it is fine for the job.

But a wchar_t has to be able to store any /character/ - for Unicode,
that means 21 bits of code point.

>
> To use wchar_t to represent all Unicode code points, you either have to
> make wchar_t at least 21 bits (more likely 32 bits) *or* you have to use
> it in a way that doesn't satisfy the requirements of the standard, such
> as using UTF-16 to encode some characters in more than one wchar_t.
>

In practice, I think people on Windows use wchar_t strings (or arrays)
for holding UTF-16 encoded strings.  That will work fine.  But it
encourages mistaken assumptions - such as that ws[9] holds the tenth
Unicode character in the string, or that wcslen returns the number of
characters in the string. These assumptions hold for a proper wchar_t,
such as the 32-bit wchar_t on Unix systems (or a 16-bit wchar_t on
Windows while it used UCS-2 rather than UTF-16).
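
A small sketch of the kind of mistake this invites, assuming a
Windows-style 16-bit wchar_t:

#include <iostream>
#include <string>

int main() {
    // One character (U+13000, outside the BMP) - on a 16-bit wchar_t
    // it occupies two units, so size() reports 2, not 1.
    std::wstring ws = L"\U00013000";
    std::cout << ws.size() << '\n';   // 2 on Windows, 1 where wchar_t is 32-bit
}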

I think the sensible practice would be to deprecate the use of wchar_t
as much as possible, using instead char8_t for UTF-8 strings when
dealing with string data (and especially for data interchange), and
char32_t for UTF-32 encoding internally if you need
character-by-character access. These are unambiguous and function
identically across platforms (except perhaps for the endianness of
char32_t). For interaction with legacy code and API's on Windows,
char16_t is a better choice than wchar_t.

Going forward, C++ could drop wchar_t and support for wide character
execution sets other than Unicode, just as it is dropping support for
signed integer representations other than two's complement. This kind
of thing limits flexibility in theory, but not in practice, and it would
simplify things a bit.

David Brown

Aug 18, 2019, 5:34:52 AM
On 18/08/2019 08:48, Bonita Montero wrote:
>> Does your ignorance know no bounds?
>
> Unicode is defined to have a maximum of 21 bit codepoints.
>

Yes.  And how do you fit those 21 bits into a 16-bit wchar_t?

David Brown

Aug 18, 2019, 5:46:28 AM
wchar_t is not for UTF-16 - each single wchar_t object should store a
complete character. That is what the type means - look it up.

Strings or arrays of wchar_t can hold UTF-16 encoded data, which will
work in practice but is an abuse of wchar_t that goes against the
standards. ("char16_t" is the type you want here, which is distinct
from wchar_t despite being the same size on Windows.)

Until you can find a way to store the /single/ character "𓃀" (the
hieroglyph for "B") in a /single/ 16-bit wchar_t, not a string or array,
then 16-bit wchar_t is too small.

David Brown

Aug 18, 2019, 6:07:27 AM
I know how UTF-16 works. And UTF-16 is only recommended for documents
written primarily in the BMP code plane characters between 0x0800 and
0xffff (i.e., non-European scripts, not including CJK) as it is more
efficient than UTF-8 or UTF-32. This recommendation is roundly ignored,
for good reasons. While UTF-16 is still used internally on Windows, and
with some languages (like Java) and libraries (like QT), it is almost
completely negligible as an encoding in documents or data interchange.
If you see any reference recommending its usage, check the date - the
page is probably from last century.

David Brown

Aug 18, 2019, 6:22:34 AM
On 18/08/2019 01:46, Keith Thompson wrote:
> David Brown <david...@hesbynett.no> writes:
>> On 17/08/2019 18:05, Bonita Montero wrote:
>>>> You can store any character encoding supported by the compiler in a
>>>> wchar_t, unless you are using Windows, which is broken and can't store
>>>> Unicode characters in a wchar_t because it is too small.
>>>
>>> The size of wchar_t is implementation-dependent, so Windows isn't broken
>>> here.
>>
>> Yes, Windows is broken here. The standards give minimum requirements
>> for many implementation-dependent features. Windows could have a 21-bit
>> wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is
>> just like saying the size of "int" is implementation dependent, but the
>> standards require a minimum of 16-bit int.
>
> **OR** it could have a 16-bit wchar_t and not claim to support Unicode
> as its wide character set. An implementation with 16-bit wchar_t that
> supports only the BMP as its wide character set could be conforming
> (though not as usesful as an implementation that supports full Unicode).

Absolutely true. And that was the case originally with Windows NT,
which used UCS-2 (essentially the subset of UTF-16 that can be encoded
in a single unit). But Windows has moved steadily more of its APIs,
libraries, gui widgets, and software to full UTF-16. It has done so in
uncoordinated jumps, with versions of Windows that let you use
multi-unit characters in some programs and APIs but not others, but I
believe it is fairly complete now. (And with Windows 10, I have heard
that UTF-8 support is officially in place.)

So 16-bit wchar_t was appropriate when Windows NT started with UCS-2.
But it should have been changed to 32-bit for later Windows.

>
> But Microsoft's own documentation says:
> The wchar_t type is an implementation-defined wide character
> type. In the Microsoft compiler, it represents a 16-bit wide
> character used to store Unicode encoded as UTF-16LE, the native
> character type on Windows operating systems. The wide character
> versions of the Universal C Runtime (UCRT) library functions
> use wchar_t and its pointer and array types as parameters and
> return values, as do the wide character versions of the native
> Windows API.
> https://docs.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2019
>
> which I believe is non-conforming.
>
> (But I'm not sure how Microsoft could have fixed this without breaking
> too much existing code.)
>

Aye, there's the rub.

People programming for Windows have a long history of making unwarranted
assumptions about types and sizes. They have been used to programming
on a single platform, and thinking it will remain the same forever.
They have not been helped by Microsoft's unforgivable reticence over C99
and headers like <stdint.h>. Windows programmers regularly assume that
wchar_t is 16-bit - changing it would break code written by these
programmers. The same thing happened with the move to 64-bit Windows -
because Windows programmers had assumed that "long" is exactly 32-bit
(having had no "int32_t" available), changing the size of "long" to
match every other 64-bit platform would have broken lots of Windows code.

It is easy to say /now/ that Microsoft should have changed to 32-bit
wchar_t and UTF-32 and/or UTF-8 as soon as it was clear that Unicode
would not fit in 16 bits. But it would have been hard to do at the time.

The resulting situation today, however, is clear. Windows is
non-compliant and nearly unique with its 16-bit wchar_t, and it is the
only major OS to make heavy use of UTF-16 instead of UTF-8.

David Brown

Aug 18, 2019, 6:24:39 AM
On 18/08/2019 08:57, Bonita Montero wrote:
>> Yes, Windows is broken here.  The standards give minimum requirements
>> for many implementation-dependent features.  Windows could have a
>> 21-bit wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit
>> one.  It is just like saying the size of "int" is implementation
>> dependent, but the standards require a minimum of 16-bit int.
>
> There's no mandatory relationship between Unicode and wchar_t.

Correct.

But there is a mandatory relationship between the character set
supported by the target system, and wchar_t. Windows supports Unicode
(with UTF-16 encoding, and possibly UTF-8 with Windows 10). Therefore,
wchar_t must support Unicode characters.

Bonita Montero

Aug 18, 2019, 6:57:46 AM
>> Unicode is defined to have a maximum of 21 bit codepoints.

> Yes.  And how do you fit those 21 bits into a 16-bit wchar_t?

That's not necessary if you use UTF-16.

Bonita Montero

Aug 18, 2019, 6:59:17 AM
> Strings or arrays of wchar_t can hold UTF-16 encoded data, which will
> work in practice but is an abuse of wchar_t that goes against the
> standards.  ("char16_t" is the type you want here, which is distinct
> from wchar_t despite being the same size on Windows.)

There's no violation of the standard. wchar_t is implementation-defined.

Bonita Montero

Aug 18, 2019, 7:06:07 AM
> I know how UTF-16 works.  And UTF-16 is only recommended for documents
> written primarily in the BMP code plane characters between 0x0800 and
> 0xffff (i.e., non-European scripts, not including CJK) as it is more
> efficient than UTF-8 or UTF-32.

There's nothing about that in the Unicode standard. The Unicode
standard simply says that UTF-16 can cover every Unicode codepoint via
surrogate pairs, and that UTF-16 is _optimized_ for the BMP, but not
only suitable for it. It says that it is preferred if you need a
balance of storage size and access speed.

> While UTF-16 is still used internally on Windows, and
> with some languages (like Java) and libraries (like QT), it is almost
> completely negligible as an encoding in documents or data interchange.

Those might be practical reasons about which one can argue at length,
but that doesn't change the fact that UTF-16 is simply Unicode-compliant.

Bonita Montero

Aug 18, 2019, 7:07:34 AM
> But there is a mandatory relationship between the character set
> supported by the target system, and wchar_t.

No, the implementor might implement whatever he likes, since it
doesn't have to conform to the target system, only to the locales
supported by the implementation.

Bonita Montero

Aug 18, 2019, 7:13:43 AM
The standard says:
"Type wchar_t is a distinct type whose values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales."
^^^^^^^^^^*********^^^^^^^^
So where's the non-conformance?

Bart

Aug 18, 2019, 8:07:42 AM
Show me how to make use of UTF-16 here, because it doesn't seem to work
on Windows:

#include <cstdio>

wchar_t c;
c = 1000000; // character code 1000000

printf("c = %d 0x%X", c, c);

Displays:

c = 16960 0x4240

Not this:

c = 1000000 0xF4240

David Brown

Aug 18, 2019, 8:48:57 AM
On 18/08/2019 13:05, Bonita Montero wrote:
>> I know how UTF-16 works.  And UTF-16 is only recommended for documents
>> written primarily in the BMP code plane characters between 0x0800 and
>> 0xffff (i.e., non-European scripts, not including CJK) as it is more
>> efficient than UTF-8 or UTF-32.
>
> There's nothing about that in the Unicode standard. The Unicode
> standard simply says that UTF-16 can cover every Unicode codepoint via
> surrogate pairs, and that UTF-16 is _optimized_ for the BMP, but not
> only suitable for it. It says that it is preferred if you need a
> balance of storage size and access speed.

UTF-16 works fine, but is a poor solution in almost every case. As I
said previously, it combines the disadvantages of UTF-8 with the
disadvantages of UTF-32, giving the benefits of neither. People know
this - that is why it is not used in any significance except in places
where it is hard to avoid (Windows programming), or for dealing with
legacy code.

There was a time, long ago, when Unicode was young, when 16-bit was
enough for almost everything, when saving a few bytes was important,
when HTML and XML were not common, and when transparent compression was
rare. In those days, UTF-16 made sense for some uses.

Given a free choice without worrying about compatibility with old code
or interfaces, no one would recommend or choose UTF-16 now.

>
>> While UTF-16 is still used internally on Windows, and with some
>> languages (like Java) and libraries (like QT), it is almost completely
>> negligible as an encoding in documents or data interchange.
>
> That might be practical reasons about which one can argue excellently
> but that doesn't affect that UTF-16 is simply Unicode-compliant.

No one has doubted that UTF-16 is a valid Unicode encoding. That has
never been in question.

As seems to happen so often, you have totally misunderstood the issue
when you have been shown wrong. I wonder if it is intentional.

Bonita Montero

Aug 18, 2019, 9:49:53 AM
> Show me how to make use of UTF-16 here, because it doesn't seem to work
> on Windows:
>
>     #include <cstdio>
>
>     wchar_t c;
>     c = 1000000;         // character code 1000000
>
>     printf("c = %d 0x%X", c, c);
>
> Displays:
>
>     c = 16960 0x4240
>
> Not this:
>
>     c = 1000000 0xF4240

wchar_t need not be Unicode-compliant.
But you can store UTF-16 strings in a std::wstring.

Bonita Montero

Aug 18, 2019, 9:51:26 AM
> UTF-16 works fine, but is a poor solution in almost every case.
> As I said previously, it combines the disadvantages of UTF-8
> with the disadvantages of UTF-32, giving the benefits of neither.

That's just your taste. But someone else may argue, like the Unicode
standard, that UTF-16 is a good balance between size and performance.

David Brown

Aug 18, 2019, 10:43:30 AM
Sorry, you /do/ know that UTF-8 and UTF-32 are also encodings for
Unicode, just like UTF-16? It is not clear, but it looks a little like
you think UTF-16 /is/ Unicode and that I am suggesting something other
than Unicode. I hope you can clarify if you are misunderstanding this
or not.

As for taste, something like 94% of documents on the internet are
encoded with UTF-8. The rest is mostly 8-bit encodings like ISO-8859-1.
UTF-16 documents are almost invariably bigger than UTF-8 documents
(since most text-based formats involve keys, tags, etc., that are
single-byte characters in UTF-8). They have the same multi-unit
encoding issues as you get with UTF-8. If you want "one unit is one
code", you need UTF-32, which is fine for internal program usage.
UTF-16 has endianness issues.  There really is no benefit in UTF-16 over
UTF-8.  The world at large knows this - that is why UTF-8 is
overwhelmingly dominant and UTF-16 is used mostly for legacy reasons.

But you can call it "taste" if you like.

Bonita Montero

Aug 18, 2019, 11:02:25 AM
>> That's just your taste. But someone else may argue, like the Unicode
>> standard, that UTF-16 is a good balance between size and performance.

> Sorry, you /do/ know that UTF-8 and UTF-32 are also encodings for
> Unicode, just like UTF-16?  It is not clear, but it looks a little like
> you think UTF-16 /is/ Unicode and that I am suggesting something other
> than Unicode.  I hope you can clarify if you are misunderstanding this
> or not.

You can't derive that from what I said above.
And it isn't even related to that.
You seem a bit confused.

Scott Lurndal

Aug 18, 2019, 11:39:09 AM
Bonita Montero <Bonita....@gmail.com> writes:
>> #include <iostream>
>>
>> int main() {
>> std::cout<<sizeof(wchar_t)<<std::endl;
>> }
>>
>>> g++ -o wchar wchar.cc
>>> ./wchar
>> 4
>>> g++ -v
>> Using built-in specs.
>> COLLECT_GCC=g++
>> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/9/lto-wrapper
>> OFFLOAD_TARGET_NAMES=hsa:nvptx-none
>> Target: x86_64-suse-linux
>> …
>
>Hmm, I wouldn't have expected this. I think this isn't a clever
>decision. For internal processing the representation doesn't matter,
>but for compatibility reasons it should match the most common Unicode
>representation.

wchar_t in Unix (198x) predates Windows 3.1 (1992) by several years.

Windows is the outlier here, not unix.

Scott Lurndal

Aug 18, 2019, 11:40:37 AM
Bonita Montero <Bonita....@gmail.com> writes:
>> wchar_t has /never/ been specifically about Unicode.  It means "wide
>> character", it is supposed to big enough to hold a character of any
>> character set supported by the implementation.  For example, a Chinese
>> system might support ASCII and Big5, which is a 16-bit encoding - it
>> could therefore have a 16-bit wchar_t.
>
>Unrealistic that it will be used for anything other than Unicode.

It (wchar_t) predates Unicode by a few years.

Bonita Montero

Aug 18, 2019, 11:44:22 AM
>> Unrealistic that it will be used for anything other than Unicode.

> It (wchar_t) predates unicode by a few years.

That doesn't change the above.

Bonita Montero

Aug 18, 2019, 11:47:02 AM
>> Hmm, I wouldn't have expected this. I think this isn't a clever
>> decision. For internal processing the representation doesn't matter,
>> but for compatibility reasons it should match the most common Unicode
>> representation.

> wchar_t in Unix (198x) predates Windows 3.1 (1992) by several years.

That doesn't affect the above because in the beginning, UCS-2 was the
standard.

Bo Persson

Aug 18, 2019, 2:33:54 PM
So we can only guess that this character code is not part of the
"supported locale" used in your program.

Nowhere does the language standard require C++ to support all available
OS locales. It could be a very small subset.



Bo Persson

Bart

Aug 18, 2019, 3:52:02 PM
A small subset where none of the locale's alphabets has character codes
above 65535.

Anyway, the locale need have nothing to do with the ability to read,
write or process arbitrary Unicode characters and strings.

Otherwise countries such as the UK and US could get by with just ASCII.

But if they are going beyond a 7-bit character, then why not beyond
16-bit too? Actually, the first few alphabets beyond codepoint 65535
appear to be ancient scripts with no meaningful locale anyway.

Bo Persson

Aug 18, 2019, 6:04:31 PM
On 2019-08-18 at 21:51, Bart wrote:
> On 18/08/2019 19:33, Bo Persson wrote:
>> On 2019-08-18 at 14:07, Bart wrote:
>>> On 18/08/2019 11:57, Bonita Montero wrote:
>>>>>> Unicode is defined to have a maximum of 21 bit codepoints.
>>>>
>>>>> Yes.  And how do you fit those 21 bits into a 16-bit wchar_t ?
>>>>
>>>> That's not necessary if you use UTF-16.
>>>>
>>>
>>>
>>> Show me how to make use of UTF-16 here, because it doesn't seem to
>>> work on Windows:
>>>
>>>      #include <cstdio>
>>>
>>>      int main() {
>>>          wchar_t c;
>>>          c = 1000000;         // character code 1000000
>>>          printf("c = %d 0x%X\n", c, c);
>>>      }
>>>
>>> Displays:
>>>
>>>      c = 16960 0x4240
>>>
>>> Not this:
>>>
>>>      c = 1000000 0xF4240
>>
>> So we can only guess that this character code is not part of the
>> "supported locale" used in your program.
>>
>> Nowhere does the language standard require C++ to support all
>> available OS locales. It could be a very small subset.
>
> A small subset where none of the locale's alphabets has character codes
> above 65535.

Yes, that would conform to the C++ standard.

>
> Anyway, the locale need have nothing to do with the ability to read,
> write, or process arbitrary Unicode characters and strings.

Neither does wchar_t. The C++ standard only says:

"The values of type wchar_t can represent distinct codes for all members
of the largest extended character set specified among the supported
locales."

And then it is up to the implementation to define which locales those
could be.

>
> Otherwise countries such as the UK and US could get by with just ASCII.
>
> But if they are going beyond a 7-bit character, then why not beyond
> 16-bit too? Actually, the first few alphabets beyond codepoint 65535
> appear to be ancient scripts with no meaningful locale anyway.

VC++ also offers non-standard use of UTF-16, should you need to use
those characters anyway. But that is an extension, and not standard C++.

This is kind of similar to some compilers allowing the use of integer
types larger than intmax_t, by calling __int128 an extension (and not
"an extended integer type").


Bo Persson


Robert Wessel

Aug 19, 2019, 2:19:55 AM
On Sat, 17 Aug 2019 21:21:46 +0200, Bo Persson <b...@bo-persson.se>
wrote:

>On 2019-08-17 at 20:22, Tim Rentsch wrote:
>> Bonita Montero <Bonita....@gmail.com> writes:
>>
>>>>>> Be advised that in Windows wide strings are UTF-16 encoded.
>>>>>
>>>>> And I think that it would be the best to use std::u16string
>>>>> because wchar_t is not absolutely guaranteed to be 16 bit
>>>>> wide.
>>>>
>>>> It is guaranteed /not/ to be 16-bit on most systems.
>>>
>>> You can be pretty sure that by far most implementations specify
>>> wchar_t as 16 bit since it makes no sense to implement it differently.
>>> But it's easier to conform to any hypothetical implementation by using
>>> char16_t.
>>
>> (and in another posting quotes the C++ standard)
>>
>>> "Type wchar_t is a distinct type whose values can represent distinct
>>> codes for all members of the largest extended character set specified
>>> among the supported locales (22.3.1)."
>>
>> Like you say, the C and C++ standards require wchar_t to be large
>> enough so one wchar_t object can hold distinct values for every
>> character in the largest supported character set.
>>
>> Most widely used operating systems today (eg, Microsoft Windows,
>> Linux) support character sets with (at least) hundreds of
>> thousands of characters.
>>
>> Can you explain how hundreds of thousands distinct values can be
>> represented in a single 16-bit wide wchar_t object?
>>
>
>No, they would naturally not fit.
>
>But there is a loop-hole here, as the language standard says "the
>supported locales" and not "every existing locale".
>
>So if you only support the locales where 16 bits is enough, you haven't
>broken any rules. And you cannot help it if some people use parts of
>"unsupported" locales as well...


In the case of Windows, locales such as Chinese *are* defined.

https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c

Unless you want to argue that the Windows C and C++ compilers don't
actually intend to support CJK characters in the zh-CN locale.

Bonita Montero

Aug 19, 2019, 2:25:23 AM
> In the case of Windows, locales such as Chinese *are* defined.
> https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/a9eac961-e77d-41a6-90a5-ce1a8b0cdb9c
> Unless you want to argue that the Windows C and C++ compilers don't
> actually intend to support CJK characters in the zh-CN locale.

wchar_t's locale support only needs to relate to the language,
not the platform.

David Brown

Aug 19, 2019, 2:52:42 AM
On 18/08/2019 21:51, Bart wrote:
> On 18/08/2019 19:33, Bo Persson wrote:
>> On 2019-08-18 at 14:07, Bart wrote:
>>> On 18/08/2019 11:57, Bonita Montero wrote:
>>>>>> Unicode is defined to have a maximum of 21 bit codepoints.
>>>>
>>>>> Yes.  And how do you fit those 21 bits into a 16-bit wchar_t ?
>>>>
>>>> That's not necessary if you use UTF-16.
>>>>
>>>
>>>
>>> Show me how to make use of UTF-16 here, because it doesn't seem to
>>> work on Windows:
>>>
>>>      #include <cstdio>
>>>
>>>      int main() {
>>>          wchar_t c;
>>>          c = 1000000;         // character code 1000000
>>>          printf("c = %d 0x%X\n", c, c);
>>>      }
>>>
>>> Displays:
>>>
>>>      c = 16960 0x4240
>>>
>>> Not this:
>>>
>>>      c = 1000000 0xF4240
>>
>> So we can only guess that this character code is not part of the
>> "supported locale" used in your program.
>>
>> Nowhere does the language standard require C++ to support all
>> available OS locales. It could be a very small subset.

It is not the language that supports locales, or the compiler - it is
the /implementation/, which includes libraries and any other bits and
pieces needed for creating runnable programs.

Certainly a C++ (or C) compiler for Windows could choose only to support
some locales, such as those for which a 16-bit wchar_t is sufficient.
That might make sense for a small or custom compiler. It would make no
sense at all for the main platform compiler for the system, and the
compiler that is (presumably) used to compile Windows itself. And -
unsurprisingly - MSVC /does/ support locales with multi-unit characters
when encoded with UTF-16.
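
A minimal sketch of that behaviour (the counts assume Windows's 16-bit
wchar_t with UTF-16 versus a typical Linux 32-bit wchar_t with UTF-32):

    // One non-BMP code point in a wide string literal.
    #include <cstdio>
    #include <cwchar>

    int main() {
        const wchar_t* s = L"\U0001F600";                  // a single code point
        std::printf("code units: %zu\n", std::wcslen(s));  // 2 on Windows, 1 on Linux
        std::printf("first unit: %04X\n", (unsigned)s[0]); // D83D on Windows
    }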

>
> A small subset where none of the locale's alphabets has character codes
> above 65535.
>
> Anyway, the locale need have nothing to do with the ability to be able
> to read, write or process arbitrary Unicode characters and strings.

Correct.

>
> Otherwise countries such as the UK and US could get by with just ASCII.

How naïve of you to think so :-)

Some /programs/ can get away with just ASCII. No countries can, not
even the USA.

Many programs could get away with UCS-2, however - it covers most
current scripts other than CJK.

>
> But if they are going beyond a 7-bit character, then why not beyond
> 16-bit too? Actually, the first few alphabets beyond codepoint 65535
> appear to be ancient scripts with no meaningful locale anyway.

Many of the ancient scripts are below Unicode code point 0xFFFF - i.e.,
in the BMP. The major use of the higher planes is for supplementary
CJK (Chinese, Japanese, Korean) characters.

Alf P. Steinbach

Aug 19, 2019, 3:33:06 AM
The above discussion is IMHO not meaningful.

The issue of what the standard means when it requires support for all
characters in the execution character set has been discussed at length
before, multiple times, both here and e.g. on SO.

There is no good complete resolution, but there is the /observation/
that since the standard is out of tune with reality in this respect, it
can't very well be viewed as authoritative. Viewing it as authoritative
on an issue where it's demonstrably just wrong, where all Windows
compilers are in breach of its silly requirement, would be religious.
Since it's ignored by the compiler writers, others should do as they do.


> Certainly a C++ (or C) compiler for Windows could choose only to support
> some locales, such as those for which a 16-bit wchar_t is sufficient.
> That might make sense for a small or custom compiler. It would make no
> sense at all for the main platform compiler for the system, and the
> compiler that is (presumably) used to compile Windows itself. And -
> unsurprisingly - MSVC /does/ support locales with multi-unit characters
> when encoded with UTF-16.

Again.

Plus the observation that locales are mostly irrelevant for Unicode.

There are some slight dependencies on locales, e.g. for Turkish I's, but
it can reasonably be argued that those are /defects in Unicode/. One can
simply choose not to support those marginal locale dependencies. Encoding
text for the current locale is anyway super braindead in modern software.


>> A small subset where none of the locale's alphabets has character codes
>> above 65535.
>>
>> Anyway, the locale need have nothing to do with the ability to be able
>> to read, write or process arbitrary Unicode characters and strings.
>
> Correct.
>
>>
>> Otherwise countries such as the UK and US could get by with just ASCII.
>
> How naïve of you to think so :-)
>
> Some /programs/ can get away with just ASCII. No countries can, not
> even the USA.
>
> Many programs could get away with UCS-2, however - it covers most
> current scripts other than CJK.
>
>>
>> But if they are going beyond a 7-bit character, then why not beyond
>> 16-bit too? Actually, the first few alphabets beyond codepoint 65535
>> appear to be ancient scripts with no meaningful locale anyway.
>
> Many of the ancient scripts are below Unicode code point 0xFFFF - i.e.,
> in the BMP. The major use of the higher planes is for supplementary
> CJK (Chinese, Japanese, Korean) characters.

Emojis.

Cheers!,

- Alf

Bonita Montero

Aug 19, 2019, 4:27:06 AM
> Certainly a C++ (or C) compiler for Windows could choose only to support
> some locales, such as those for which a 16-bit wchar_t is sufficient.
> That might make sense for a small or custom compiler. It would make no
> sense at all for the main platform compiler for the system, and the
> compiler that is (presumably) used to compile Windows itself. And -
> unsurprisingly - MSVC /does/ support locales with multi-unit characters
> when encoded with UTF-16.

MSVC++'s wchar_t size matches Windows.h's WCHAR type; both are 16 bits.
So it is the right decision to have a wchar_t with this width.
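
That compatibility is easy to check; a minimal Windows-only sketch
(assumes <windows.h> is available; GetFileAttributesW is a real Win32
call taking LPCWSTR):

    // Sketch: wide literals pass straight to the W-suffixed Win32 API,
    // because WCHAR is defined as wchar_t in the Windows headers.
    #include <windows.h>

    static_assert(sizeof(wchar_t) == sizeof(WCHAR),
                  "wchar_t matches WCHAR on Windows");

    int main() {
        const wchar_t* path = L"C:\\Windows";
        DWORD attrs = GetFileAttributesW(path);   // takes LPCWSTR directly
        return attrs == INVALID_FILE_ATTRIBUTES;
    }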

Jorgen Grahn

Aug 19, 2019, 4:57:40 PM
On Sat, 2019-08-17, Szyk Cech wrote:
> Hello!
>
> I am looking the best way to get system paths in Linux and Windows.
> Of course the best will be a portable way. Like this:
> https://doc.qt.io/qt-5/qstandardpaths.html
> But I don't want to use Qt. I want to write most of my app in pure C++.
>
> I am looking for following paths:
> std::wstring gSettingsDir();
> std::wstring gLocalDataDir();
> std::wstring gAppDataDir();

> std::wstring gLogDir();

In Unix you log using syslog(3), not to files.
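
For example, a minimal sketch ("myapp" is just a placeholder tag):

    // Sketch: log via syslog(3); the log daemon decides where it goes.
    #include <syslog.h>

    int main() {
        openlog("myapp", LOG_PID, LOG_USER);   // placeholder identity
        syslog(LOG_INFO, "application started");
        closelog();
    }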

> std::wstring gTempDir();

For Unix, see tmpfile(3) and similar functions. Some of them are in
the standard C++ library too.
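
For instance, std::tmpfile() from <cstdio> avoids naming a temp
directory at all; a minimal sketch:

    // Sketch: an anonymous temporary file, removed automatically on close.
    #include <cstdio>

    int main() {
        std::FILE* f = std::tmpfile();
        if (f) {
            std::fputs("scratch data\n", f);
            std::fclose(f);                    // the file disappears here
        }
    }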

> Can you give me some hints how to get them in Linux and Windows?!?

This is offtopic, but they don't all exist, not on Linux or Unix in
general, anyway. Learn the user interface conventions of your
platforms, and write code which follows them -- no API can replace
good taste.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Tim Rentsch

Aug 22, 2019, 3:38:10 PM
> it can't very well be viewed as authoritative. [...]

It's Microsoft that is out of tune. And getting more so as time
goes on.

Tim Rentsch

Aug 25, 2019, 8:35:45 AM
I see you are giving the same wrong arguments that you gave three
years ago.

C and C++ implementations must support the locale-specific native
environment (assuming a hosted implementation in C). They don't
get to choose what that is; they are required to document it,
but the standards do not AFAICT give any freedom to choose it. If
the native environment in MS Windows has more than 65K characters
then the implementation is non-conforming.

Besides that, regardless of what locale names are documented as
being "supported", any locale whose name is accepted by the
setlocale() function (i.e., the return value is not null) must have the
property that all of its characters have distinct values that fit
in a single wchar_t object. Any violation of this property means
the implementation is not conforming. Note that the setlocale()
calls all fall within the bounds of Standard-specified behavior.
There is no wiggle room available to try to sweep these under the
rug as "extensions".

Bonita Montero

Aug 25, 2019, 9:03:00 AM
> C and C++ implementations must support the locale-specific native
> environment (assuming a hosted implementation in C). They don't
> get to choose what that is; they are required to document it,
> but the standards do not AFAICT give any freedom to choose it.
> If the native environment in MS Windows has more than 65K characters
> then the implementation is non-conforming.

Quote the part of the C++-standard that says this.

Tim Rentsch

Aug 25, 2019, 9:38:51 AM
In N4659:

Section 25.5.1 p1: "The contents and meaning of the header <clocale>
are the same as the C standard library header <locale.h>."

Section 2 (Normative references) p2: "The library described in Clause 7
of ISO/IEC 9899:2011 is hereinafter called the /C standard library/."

Bonita Montero

Aug 25, 2019, 9:46:29 AM
> In N4659:
> Section 25.5.1 p1: "The contents and meaning of the header <clocale>
> are the same as the C standard library header <locale.h>."
> Section 2 (Normative references) p2: "The library described in Clause 7
> of ISO/IEC 9899:2011 is hereinafter called the /C standard library/."

And where does the C-standard say that the width of wchar_t has to match
the platform?

Bonita Montero

Aug 25, 2019, 9:55:05 AM
Here, the current 2019 C-draft:
"3.7.3: wide character: value representable by an object of type
wchar_t, capable of representing any character in the current locale"