"Supporting Unicode" means that the extended character set is Unicode.
At the moment Unicode contains more than 128,000 characters, so a 16-bit
integer cannot encode all distinct values of Unicode.
And yes, Microsoft is well aware of this and thus MSVC supports 32-bit
\UNNNNNNNN Unicode character literals, for example. Here is a demo program:
#include <iostream>

int main() {
    // elephant-camel-ant
    wchar_t message[] = L"\U0001F418\U0001F42B\U0001F41C";
    std::cout << "Size in wchar_t elements: "
              << sizeof(message)/sizeof(message[0]) << "\n";
}
The wide string literal is specified as 3 Unicode characters (elephant,
camel, and ant) plus the terminating zero. On Windows/MSVC the program
output is:
Size in wchar_t elements: 7
This is because each character has been encoded as a UTF-16 surrogate
pair: 3 characters * 2 code units each, plus the single terminating zero
wchar_t, gives 7.
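The encoding rule itself is simple enough to show. The following sketch
(my own illustration, not anything from the MSVC sources) applies the
standard UTF-16 algorithm to the three code points and prints the
resulting surrogate pairs:

#include <cstdint>
#include <cstdio>

int main() {
    // The three code points from the literal above.
    const char32_t points[] = { 0x1F418, 0x1F42B, 0x1F41C };
    for (char32_t cp : points) {
        // Code points above U+FFFF are split into two code units.
        std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        unsigned high = 0xD800 + (v >> 10);   // lead surrogate
        unsigned low  = 0xDC00 + (v & 0x3FF); // trail surrogate
        std::printf("U+%05X -> %04X %04X\n", (unsigned)cp, high, low);
    }
}

This prints D83D DC18, D83D DC2B and D83D DC1C, i.e. two code units per
character, which is where the element count of 7 comes from.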
If MSVC did not support full Unicode, then either:
a) it would not recognize the 32-bit \UNNNNNNNN universal-character-names
(the C++ standard also has 16-bit \uNNNN universal-character-names), or
b) it would somehow truncate the values into 16-bit wchar_t instead of
translating them into valid UTF-16.
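Neither happens. One can check (b) directly by dumping the individual
wchar_t code units of the literal; a minimal sketch:

#include <cstdio>

int main() {
    wchar_t message[] = L"\U0001F418\U0001F42B\U0001F41C";
    // Print each code unit in hex (wchar_t is 16-bit on Windows).
    for (wchar_t wc : message)
        std::printf("%04X ", (unsigned)wc);
    std::printf("\n");
}

On Windows/MSVC this should print D83D DC18 D83D DC2B D83D DC1C 0000:
well-formed surrogate pairs, not truncated 16-bit values. (On platforms
with a 32-bit wchar_t it would print the full code points instead.)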
As it stands now, MSVC apparently contains special support for full
Unicode, translating it into its native UTF-16. This is a reasonable
implementation for Windows. However, it plainly contradicts the C++
standard:
2.13.5/12: "A string-literal that begins with L, such as L"asdf", is a
wide string literal. A wide string literal has type
“array of n const wchar_t”."
2.13.5/15: "The size of a char32_t or wide string literal is the total
number of escape sequences, universal-character-names, and other
characters, plus one for the terminating U’\0’ or L’\0’."
It appears MSVC has implemented what the C++ standard calls char16_t
string literals ( u"asdf" ) (2.13.5/15: "a universal-character-name in a
char16_t string literal may yield a surrogate pair. [...] The size of a
char16_t string literal is the total number of escape sequences,
universal-character-names, and other characters, plus one for each
character requiring a surrogate pair, plus one for the terminating
u’\0’. [ Note: The size of a char16_t string literal is the number of
code units, not the number of characters. —end note ]")
If MSVC renamed its wchar_t to char16_t and L"abc" to u"abc", then it
would become conforming, as far as I can see. Of course they are not
going to do that, as it would break a mountain range of existing code.
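This correspondence is easy to demonstrate: with a C++11 compiler the
u"" and L"" forms of the same string can be compared side by side (a
minimal sketch; on Windows/MSVC both counts should come out as 7):

#include <iostream>

int main() {
    char16_t u_msg[] = u"\U0001F418\U0001F42B\U0001F41C";
    wchar_t  w_msg[] = L"\U0001F418\U0001F42B\U0001F41C";
    std::cout << "char16_t code units: "
              << sizeof(u_msg)/sizeof(u_msg[0]) << "\n";
    std::cout << "wchar_t  code units: "
              << sizeof(w_msg)/sizeof(w_msg[0]) << "\n";
}

Identical counts would be expected exactly because MSVC's wide literals
already behave like the standard's char16_t literals.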
Cheers
Paavo