On 08/01/2022 20:12, Mateusz Viste wrote:
> 2022-01-08 at 10:53 -0800, Öö Tiib wrote:
>> That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
>> varying length. That makes UTF-8 the sole sane one.
> You are right, the implementations of UTF-16 I worked on were limited
> to the BMP (ie. always 2 bytes), hence my simplified view.
When Unicode was young, the intention was that every glyph was one
character, and it would all fit in 16 bits - that was UCS2, as used
originally by Windows NT, Java, Python, Qt, and other systems, languages
and libraries. But it was quickly discovered that 16 bits were far from
enough.
> Still, UTF-32 is always 4 bytes for any possible glyph, isn't it?
The terminology of Unicode can be a little confusing. (And I'm sure
someone will correct me if I get it wrong.)
A "code point" is an entry in the Unicode tables. Each code point is
uniquely identified by a 32-bit number. The code points are organised
in "planes" for convenience, and designed so that the first 128 code
points match ASCII and that a wide range of languages can be covered by
the code units in the range 0x0000 .. 0xffff (excluding 0xd800 ..
0xdfff) so that 16 bits would often be enough.
A "code unit" is the container for the bits of the encoding. In UTF-8,
a code unit is an 8-bit unit. In UTF-16, it is 16-bit, in UTF-32 it is
UTF-8 takes up to four code units (32 bits total) per code point, UTF-16
takes up to two code units, and UTF-32 takes exactly one code unit per
code point. UTF-8 is always at least as compact as UTF-32, and will be
more or less compact than UTF-16 depending on the content. These are
just different encodings - different ways to write the code points.
There are others, such as GB18030, which is a variable-width encoding
(one, two or four bytes per code point) popular in China because it is
backwards compatible with their traditional GB encodings in the same
way UTF-8 is backwards compatible with ASCII.
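To make the code unit counts concrete, here is a small C11 example
(using the standard u8/u/U string literal prefixes) that writes the
U+1F44C code point quoted above in each of the three UTF encodings and
prints how many code units each one needs:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    static const char     s8[]  = u8"\U0001F44C";  /* UTF-8  */
    static const char16_t s16[] = u"\U0001F44C";   /* UTF-16 */
    static const char32_t s32[] = U"\U0001F44C";   /* UTF-32 */

    /* Array sizes minus the terminating zero code unit. */
    printf("UTF-8 : %zu code units of 8 bits\n",
           sizeof s8 / sizeof s8[0] - 1);
    printf("UTF-16: %zu code units of 16 bits\n",
           sizeof s16 / sizeof s16[0] - 1);
    printf("UTF-32: %zu code units of 32 bits\n",
           sizeof s32 / sizeof s32[0] - 1);
    return 0;
}

That prints 4, 2 and 1 respectively.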
A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
that conveys meaning. Sometimes it is useful to break them down into
their parts, sometimes it is useful to treat them as a single unit.
For example, "é" can be considered as a single grapheme, or as a
grapheme "e" followed by a combining acute accent. The same grapheme
can also correspond to more than one code point - a Latin alphabet
capital A looks identical to a Greek alphabet capital Alpha, yet they
are distinct code points.
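To see the grapheme/code point distinction in bytes, here is a small C
example (just a sketch - the byte values are the standard UTF-8
encodings of U+00E9, and of U+0065 followed by U+0301):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xC3\xA9";   /* U+00E9, one code point */
    const char *decomposed  = "e\xCC\x81";  /* U+0065 + U+0301        */

    /* Both render as "é", but the byte sequences differ. */
    printf("precomposed: %zu bytes, decomposed: %zu bytes\n",
           strlen(precomposed), strlen(decomposed));
    printf("bytewise equal: %s\n",
           strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
    return 0;
}

Treating those two as equal needs Unicode normalisation, not just a
byte comparison.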
A "glyph" is a rendering of a grapheme - the letter "A" in different
fonts are different glyphs of the same grapheme.
What the reader perceives as a "character" is often a single grapheme,
but might be several graphemes together.
So, with that in mind, all three UTF formats require multiple code units
to cover all graphemes. But UTF-32 always gets one code point per code
unit, making it simpler and more consistent for processing Unicode text.
As a file or transfer encoding, it has the big inconvenience of being
endian-specific as well as being bulkier than UTF-8. UTF-16 combines
the worst features of UTF-8 with the worst features of UTF-32, with none
of the benefits - it exists solely because early Unicode adopters
committed too strongly to UCS2.
People are often concerned that UTF-8 is difficult or complex to decode
or split up. It is not, in practice. It is actually quite rare that
you need to divide up a string based on characters or even find its
length in code points - for most uses of strings, you just pass them
around without bothering about the details of the contents. You need to
know how much memory the string takes, not how many code points it has.
And simply treating it as an abstract stream of bytes terminated by a
zero character can be enough to give you a usable ordering and
uniqueness comparison for many uses. The point where you need to
decode the code units and know what they mean is when you are doing
rendering, locale-aware sorting, or other human interaction - and then
you have such a vastly bigger task that turning UTF-8 code units into
UTF-32 code points is negligible effort in comparison.
(And UTF-8 is not much harder to encode or decode than UTF-16.)
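To back up the claim that UTF-8 decoding is not hard, here is a minimal
decoder sketch in C. It checks sequence lengths and continuation bytes
but, to stay short, does not reject overlong forms or the surrogate
range, so treat it as an illustration rather than a validating decoder:

#include <stddef.h>

/* Decode one UTF-8 sequence from s (at most len bytes).  Returns the
   number of bytes consumed, or 0 on malformed input; the decoded code
   point is written to *cp. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                        /* 0xxxxxxx: ASCII       */
        *cp = s[0];
        return 1;
    }

    size_t n;
    if ((s[0] & 0xE0) == 0xC0) {              /* 110xxxxx: 2 bytes     */
        n = 2; *cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {       /* 1110xxxx: 3 bytes     */
        n = 3; *cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {       /* 11110xxx: 4 bytes     */
        n = 4; *cp = s[0] & 0x07;
    } else {
        return 0;                             /* stray continuation    */
    }

    if (len < n)
        return 0;

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)            /* must be 10xxxxxx      */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}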