"Some sanity for C and C++ development on Windows" by Chris Wellons

Lynn McGuire

4 Jan 2022, 1:36:13
"Some sanity for C and C++ development on Windows" by Chris Wellons
https://nullprogram.com/blog/2021/12/30/

Lynn

Öö Tiib

4 Jan 2022, 13:39:23
On Tuesday, 4 January 2022 at 08:36:13 UTC+2, Lynn McGuire wrote:
> "Some sanity for C and C++ development on Windows" by Chris Wellons
> https://nullprogram.com/blog/2021/12/30/

The whole difference is that std::string on other platforms is UTF-8.
It is something that the C and C++ standards do not support in any way.
On the contrary, the standards add obfuscation bullshit like:

const char crap[] = u8"Öö Tiib 😀";

And when you ask why, it's oh, but maybe there is the EBCDIC character
set. Shove that ebc-dick where the sun doesn't shine, morons.
Let MS add its own w1252 prefix if it likes that character set so much.
But no language lawyer on the committee has the balls, and there it
ends.

Vir Campestris

5 Jan 2022, 16:50:28
My first job was writing assembler on a mainframe with an EBCDIC
character set. The operating system I worked on has been dead for 40
years now, but I dare say IBM's mainframes still use it.

Andy

Scott Lurndal

5 Jan 2022, 17:57:09
The Unisys mainframes (from the Burroughs side) still use EBCDIC;
I believe the sperry side systems support EBCDIC, if not use it natively.

Kaz Kylheku

5 Jan 2022, 18:34:24
On 2022-01-05, Scott Lurndal <sc...@slp53.sl.home> wrote:
> Vir Campestris <vir.cam...@invalid.invalid> writes:
>>My first job was writing assembler on a mainframe with an EBCDIC
>>character set. The operating system I worked on has been dead for 40
>>years now, but I dare say IBM's mainframes still use it.
>
> The Unisys mainframes (from the Burroughs side) still use EBCDIC;
> I believe the sperry side systems support EBCDIC, if not use it natively.

My EBCDIC experience is that over the years, I had at least one
manager who was one.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Öö Tiib

5 Jan 2022, 23:35:38
Even programmer of such system might want to upgrade to compiler
whose standard library supports UTF-8. But it is not possible as
standard library is defined not to.

Scott Lurndal

6 Jan 2022, 11:14:42
Öö Tiib <oot...@hot.ee> writes:
>On Thursday, 6 January 2022 at 00:57:09 UTC+2, Scott Lurndal wrote:
>> Vir Campestris <vir.cam...@invalid.invalid> writes:
>> >On 04/01/2022 18:39, Öö Tiib wrote:
>> >> On Tuesday, 4 January 2022 at 08:36:13 UTC+2, Lynn McGuire wrote:
>> >>> "Some sanity for C and C++ development on Windows" by Chris Wellons
>> >>> https://nullprogram.com/blog/2021/12/30/
>> >>
>> >> The whole difference is that std::string on other platforms is UTF-8.
>> >> It is something that the C and C++ standards do not support in any way.
>> >> On the contrary, the standards add obfuscation bullshit like:
>> >>
>> >> const char crap[] = u8"Öö Tiib 😀";
>> >>
>> >> And when you ask why, it's oh, but maybe there is the EBCDIC character
>> >> set. Shove that ebc-dick where the sun doesn't shine, morons.
>> >> Let MS add its own w1252 prefix if it likes that character set so much.
>> >> But no language lawyer on the committee has the balls, and there it
>> >> ends.
>> >>
>> >My first job was writing assembler on a mainframe with an EBCDIC
>> >character set. The operating system I worked on has been dead for 40
>> >years now, but I dare say IBM's mainframes still use it.
>>
>> The Unisys mainframes (from the Burroughs side) still use EBCDIC;
>> I believe the sperry side systems support EBCDIC, if not use it natively.
>
>Even programmer of such system might want to upgrade to compiler
>whose standard library supports UTF-8. But it is not possible as
>standard library is defined not to.

Well, the Burroughs programmers use COBOL and ALGOL, not C, which
have supported I18N and L10N since the 1980s.

james...@alumni.caltech.edu

6 Jan 2022, 12:28:46
On Wednesday, January 5, 2022 at 11:35:38 PM UTC-5, Öö Tiib wrote:
> On Thursday, 6 January 2022 at 00:57:09 UTC+2, Scott Lurndal wrote:
...
> > The Unisys mainframes (from the Burroughs side) still use EBCDIC;
> > I believe the sperry side systems support EBCDIC, if not use it natively.
> Even programmer of such system might want to upgrade to compiler
> whose standard library supports UTF-8. But it is not possible as
> standard library is defined not to.

Could you cite the text from the C and C++ standards that prohibits the
standard library from supporting UTF-8?
Are you saying that it's prohibited specifically on platforms that normally
use EBCDIC? I'm not aware of any requirement that an implementation of
C follow the conventions for the target platform: an implementation that
emulates working on a completely different platform (such as one where
UTF-8 is the norm) is always allowed.

To the best of my understanding:
* The encoding of source files and the narrow execution character set
are both implementation-defined multibyte encodings, and
there's nothing that prohibits either encoding from being UTF-8.
* Implementations are explicitly permitted to allow extended characters in
source files for identifiers, string literals, character constants, comments
and header names.
* The current versions of both standards allow string literals and character
constants prefixed with u8, which are required to have UTF-8 encoding.
* Even the latest version of C doesn't mandate any conversion routines for
UTF-8, but on a platform which makes UTF-8 the encoding for its
execution character set, the conversion routines that contain "mb" in their
name will interpret plain char as having UTF-8 encoding.
* C++ mandates mbrtoc8() and c8rtomb(), which convert between the
native encoding and UTF-8. It also mandates mbrtowc() and wcrtomb(),
as well as the C standard library routines for converting between the
native encodings and char16_t or char32_t in <uchar.h>; the C standard does
not require that those types have UTF-16 and UTF-32 encoding
respectively, but the C++ standard does. It also mandates codecvt facets
for converting between UTF-8, UTF-16, and UTF-32, so conversion between
UTF-8 and any of the other four character encodings is mandated,
though conversion to and from wchar_t is a two-step process unless the
native narrow character set uses UTF-8 encoding.

While C, in particular, doesn't mandate quite as much support for UTF-8 as
I'd like, both standards allow the fullest possible support for UTF-8 that I
could imagine. Why do you think otherwise?

Guillaume

6 Jan 2022, 13:42:40
On 06/01/2022 at 18:28, james...@alumni.caltech.edu wrote:
> While C, in particular, doesn't mandate quite as much support for UTF-8 as
> I'd like, both standards allow the fullest possible support for UTF-8 that I
> could imagine. Why do you think otherwise?

Agreed.

And yes, as support is not that complete, we usually have to use
third-party (or our own) libraries just for that. And to be fair, UTF-8
support, overall, is still a bit shaky.

Sure, on Windows, where MS had focused on using Unicode UCS-2 instead of
UTF-8, things are no better, even if you use the Windows API instead of
the C standard library.

Öö Tiib

7 Jan 2022, 7:06:26
On Thursday, 6 January 2022 at 19:28:46 UTC+2, james...@alumni.caltech.edu wrote:
> On Wednesday, January 5, 2022 at 11:35:38 PM UTC-5, Öö Tiib wrote:
> > On Thursday, 6 January 2022 at 00:57:09 UTC+2, Scott Lurndal wrote:
> ...
> > > The Unisys mainframes (from the Burroughs side) still use EBCDIC;
> > > I believe the sperry side systems support EBCDIC, if not use it natively.
> > Even programmer of such system might want to upgrade to compiler
> > whose standard library supports UTF-8. But it is not possible as
> > standard library is defined not to.
> Could you cite the text from the C and C++ standards that prohibits the
> standard library from supporting UTF-8?

Are you pretending that you did not understand what I meant?
AFAIK you have rather decent knowledge of standards. The
standards allow implementations to have wide array of whatever
obscure extensions.

However some essential things, like say adding 128 bit integers or
UTF-8 support or even to stop that nonsense with newline characters
is made tricky. Plus it is obscured with random half-baked extensions,
prefixes and types that promise not much, tend to be deprecated and
confuse people. And that in a world where 98% of the plain text on the internet is UTF-8.

> Are you saying that it's prohibited specifically on platforms that normally
> use EBCDIC? I'm not aware of any requirement that an implementation of
> C follow the conventions for the target platform: an implementation that
> emulates working on a completely different platform (such as one where
> UTF-8 is the norm) is always allowed.

I am saying that when to ask why default string can't be UTF-8 then that
EBCDIC is usually mentioned. Despite there probably are no much C, let
alone C++ used on few alive EBCDIC platforms.

>
> To the best of my understanding:
> * The encoding of source files and the execution character narrow
> character set are both implementation-defined multibyte encodings, and
> there's nothing that prohibits either encoding from being UTF-8.
> * Implementations are explicitly permitted to allow extended characters in
> source files for identifiers, string literals, character constants, comments
> and header names.
> * The current versions of both standards allow string literals and character
> constants prefixed with u8, which are required to have UTF-8 encoding.
> * Even the latest version of C doesn't mandate any conversion routines for
> UTF-8, but on a platform which makes UTF-8 the encoding for it's
> execution character set, the conversion routines that contain "mb" in their
> name will interpret plain char as having UTF-8 encoding.
> * C++ mandates mbrtoc8() and c8rtomb(), which convert between the
> native encoding and UTF-8. It also mandates mbrtowc() and wcrtomb(),
> as well as the C standard library routines for convertiong betwen the
> native encodings and char16_t or char32_t in <uchar>; the C standard does
> not require that those types have UTF-16 and UTF-32 encoding
> respectively, but the C++ standard does.

The UTF-16 and UTF-32 are also among those 2% of obscure text encodings.
OK, UTF-16 can be useful to communicate with mis-designed programming
languages like Java or C# or operating systems like Windows. But UTF-32
is rather exotic.

> It also mandates codecvt facets
> for converting between UTF-8, UTF-16, and UTF-32, so conversion between
> UTF-8 and any of the other four character encodings are mandated,
> though conversion to and from wchar_t is a two-step process unless the
> native narrow character set uses UTF-8 encoding.

The most useful of those facets, the ones that converted to and from UTF-8 in a char
array, were deprecated in C++17.

> While C, in particular, doesn't mandate quite as much support for UTF-8 as
> I'd like, both standards allow the fullest possible support for UTF-8 that I
> could imagine. Why do you think otherwise?

I think that having all text streams and plain non-prefixed "strings" as
UTF-8 is both possible and most logical. UTF-8 code unit is guaranteed
to fit into char by both standards, so it is possible. UTF-8 is the most
widespread text format, so it is logical. Other, obscure encodings like
Windows-1252 or EBCDIC (and functions using or filling those) should
have weirdo prefixes and special character types and what not.

Malcolm McLean

7 Jan 2022, 8:08:44
On Friday, 7 January 2022 at 12:06:26 UTC, Öö Tiib wrote:
>
> OK, UTF-16 can be useful to communicate with mis-designed programming
> languages like Java or C# or operating systems like Windows. But UTF-32
> is rather exotic.
>
Yes and no. You can pass strings about as utf-8. But it's hard to manipulate them.
Often it's easier to convert to utf-32 and back to actually access the content of
a string and use it.
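A minimal sketch of that round trip in C11, assuming a UTF-8 locale; the array
sizes and the sample string are only for illustration:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                 /* pick up the UTF-8 locale */
    const char *in = "Öö Tiib";
    const char *p = in;
    const char *end = in + strlen(in) + 1; /* include the terminator */
    char32_t cps[32];
    size_t n = 0;

    /* Decode the UTF-8 string into code points. */
    mbstate_t st = {0};
    size_t r;
    while (n < sizeof cps / sizeof *cps &&
           (r = mbrtoc32(&cps[n], p, end - p, &st)) != 0) {
        if (r == (size_t)-1 || r == (size_t)-2)
            return 1;                      /* malformed or truncated input */
        if (r != (size_t)-3)
            p += r;                        /* -3 would mean a queued value */
        n++;
    }

    /* Re-encode each code point back into UTF-8. */
    char out[64];
    size_t len = 0;
    mbstate_t st2 = {0};
    for (size_t i = 0; i < n; i++) {
        size_t w = c32rtomb(out + len, cps[i], &st2);
        if (w == (size_t)-1)
            return 1;
        len += w;
    }
    out[len] = '\0';

    printf("%zu code points, round trip: %s\n", n, out);
    return 0;
}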

james...@alumni.caltech.edu

7 Jan 2022, 12:34:16
On Friday, January 7, 2022 at 7:06:26 AM UTC-5, Öö Tiib wrote:
> On Thursday, 6 January 2022 at 19:28:46 UTC+2, james...@alumni.caltech.edu wrote:
> > On Wednesday, January 5, 2022 at 11:35:38 PM UTC-5, Öö Tiib wrote:
...
> > > Even programmer of such system might want to upgrade to compiler
> > > whose standard library supports UTF-8. But it is not possible as
> > > standard library is defined not to.
> > Could you cite the text from the C and C++ standards that prohibits the
> > standard library from supporting UTF-8?
> Are you pretending that you did not understand what I meant?

No, I am quite accurately and honestly expressing my confusion. You object to
something being prohibited by the standards that is, to the best of my understanding,
allowed. It would make more sense if you were objecting the fact that it isn't
mandatory, and if you were making such claims, I would disagree with you about
whether it would be a good idea to make it mandatory - but as far as I can tell, you're
claiming it isn't allowed.
If you could, as requested, cite the relevant text that prohibits such compilers, you
might convince me that I'm wrong. If not, the citation would at least enable me to try
to convince you that you're misinterpreting that text. Neither possibility can happen
until you actually honor that request.

...
> However some essential things, like say adding 128 bit integers or
> UTF-8 support or even to stop that nonsense with newline characters
> is made tricky.

Neither 128 bits integer nor UTF-8 support are essential. Lots of people have no
need of either (I've never needed either one, though I have used UTF-8 since it was
available). If you need such things on a platform where no existing implementation
of C provides them, your complaint is with the implementors, not the standard,
because the standard says nothing to prohibit such things.

...
> I am saying that when to ask why default string can't be UTF-8 then that
> EBCDIC is usually mentioned. Despite there probably are no much C, let
> alone C++ used on few alive EBCDIC platforms.

That seems reasonable to me. The only places where that logic applies are
implementations of C targeting EBCDIC platforms, and regardless of how rare such
implementations, they would become substantially rarer because users would abandon
them if they switched to UTF-8.

...
> The UTF-16 and UTF-32 are also among those 2% of obscure text encodings.
> OK, UTF-16 can be useful to communicate with mis-designed programming
> languages like Java or C# or operating systems like Windows. But UTF-32
> is rather exotic.

UTF-16 is, as I understand it, the default in the Windows world. I'm not that familiar
with it, I've only done a few years of programming targeting that platform, and all of
the text that came up in the work I was doing there was simple English, with no need
to make any use of extended characters, so I can't vouch for any of the details about
how text was encoded. However, WIndows is a very common platform, whether or not
you approve of it's design (I share your disapproval for it), so calling UTF-16 obscure
seems very odd.

UTF-32 shares an important property with the implementation-defined encoding used
for wchar_t: every character takes up one and only one element in the array. I have
written a lot of code over several decades for parsing strings that assumes that every
character takes up one char, a valid assumption in the contexts in which I wrote it. When
I think about how I would have to re-write such code to work with multi-byte encodings
such as UTF-8, then the simplicity of replacing char with wchar_t or char32_t, leaving my
logic unchanged, starts looking pretty attractive. However, since I have little experience
writing code to work with extended characters using any encoding, my preferences don't
carry much weight.

> > While C, in particular, doesn't mandate quite as much support for UTF-8 as
> > I'd like, both standards allow the fullest possible support for UTF-8 that I
> > could imagine. Why do you think otherwise?
> I think that having all text streams and plain non-prefixed "strings" as
> UTF-8 is both possible and most logical.

Yes, and that is allowed by both standards, and is the norm, not the exception, on most
Unix-like platforms that I'm familiar with. That's why your claim that it's not allowed
confuses me. If you don't work on Unix systems where the encoding of non-prefixed
strings is UTF-8, and you don't work on Windows systems where UTF-16 is the norm, and
you don't work on systems where EBCDIC is the norm, what kinds of systems do you
work on? I'm not saying that there aren't any other types of systems, there's a great
many, but most of the others are substantially less common, so I am just curious which
one(s) you use.

Malcolm McLean

7 Jan 2022, 13:44:39
On Friday, 7 January 2022 at 17:34:16 UTC, james...@alumni.caltech.edu wrote:
>
> UTF-16 is, as I understand it, the default in the Windows world. I'm not that familiar
> with it, I've only done a few years of programming targeting that platform, and all of
> the text that came up in the work I was doing there was simple English, with no need
> to make any use of extended characters, so I can't vouch for any of the details about
> how text was encoded. However, WIndows is a very common platform, whether or not
> you approve of it's design (I share your disapproval for it), so calling UTF-16 obscure
> seems very odd.
>
Every Windows API call that takes text comes in an A-suffix or a W-suffix variant.
The A-suffix takes strings in the current ANSI code page, the W-suffix takes near UTF-16,
actually Microsoft's slightly incompatible version. If you don't provide a suffix at all,
the headers select a version which depends on how you have set things up. I'm not sure
what the default is or exactly how you play with the settings.
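Roughly, the unsuffixed name is a macro chosen by whether UNICODE is defined when
the headers are included; as a sketch, assuming <windows.h> on a Windows build:

#include <windows.h>

int main(void)
{
    /* Explicit A and W variants: ANSI code-page text vs UTF-16 text. */
    MessageBoxA(NULL, "ANSI code-page text", "A variant", MB_OK);
    MessageBoxW(NULL, L"UTF-16 text \u00d6\u00f6", L"W variant", MB_OK);

    /* The unsuffixed MessageBox(...) is a macro that expands to one of
       the two calls above, depending on whether UNICODE was defined when
       <windows.h> was included. */
    return 0;
}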
>
> > I think that having all text streams and plain non-prefixed "strings" as
> > UTF-8 is both possible and most logical.
> Yes, and that is allowed by both standards, and is the norm, not the exception, on most
> Unix-like platforms that I'm familiar with. That's why your claim that it's not allowed
> confuses me.

My experience is that passing utf-8 to printf() or fopen() doesn't work. But I rarely
need to do so, and the situation might have changed recently.

Bart

7 Jan 2022, 14:18:57
This program, which contains UTF8 sequences:

#include <stdio.h>

int main(void) {
    printf("ø°PÇ€\n");
}

works OK if compiled with bcc or tcc and run with codepage 65001 active.

However it doesn't work if compiled with gcc; I don't know why.

Calling a Windows -A function (eg. MessageBoxA) with UTF8 strings
doesn't work either.

There are also wider aspects than sending output via stdout as mentioned
in the article, such as command line input.

So it's still a mess on Windows from what I can see.

Malcolm McLean

7 Jan 2022, 15:27:51
Are you sure that's compiling to utf-8?
A better test would be to build the utf-8 sequence explicitly, and see if the output
is as specified.
>
> So it's still a mess on Windows from what I can see.
>
It's not just Windows. On my platform, there are several different types and C++
classes that are supposed to hold Unicode. We need a suite of little functions
to do ad hoc conversions between them. Most of these are binary no-ops.

Bart

7 Jan 2022, 16:41:00
Notepad was told to save as UTF8. Codepage 65001 is UTF8 (and it didn't
work with 1252). And I just checked the source file to confirm the
sequences are the correct UTF8.

Malcolm McLean

7 Jan 2022, 17:03:14
So almost certainly the compiler is compiling it to UTF-8.
But it would have been easier to just specify one non-ascii character as raw bytes,
forcing it to use UTF-8 whatever the source or execution character set.
If it displays correctly, then you know that UTF-8 is supported by the printf / terminal
combination.
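Something like this, as a sketch; E2 82 AC is the UTF-8 encoding of the euro sign:

#include <stdio.h>

int main(void)
{
    /* Emit the euro sign as explicit UTF-8 bytes, independent of whatever
       source or execution character set the compiler assumes. */
    printf("\xE2\x82\xAC\n");
    return 0;
}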

Bart

7 Jan 2022, 17:16:33
Actually, my compiler at least is doing nothing at all. It knows nothing
about UTF8; it's just a sequence of bytes forming a string literal. The
E2 82 AC sequence representing € is output to the binary as a E2 82 AC
sequence, just as 41 42 43 is passed through as 41 42 43 ("ABC").

That's the advantage of UTF8.

It is the editor that needs to be aware of it, needing to deal with
input of it, display, and writing to a file with the correct encoding.

And the runtime or OS must also show the correct display for UTF8
sequences. It's that bit that is going wrong with gcc.

Malcolm McLean

7 Jan 2022, 17:31:26
Yes, the majority of a C source file is going to be pure ascii, with only
a few extended string literals embedded. So UTF-8 is a good choice. The
compiler needs no modification, and file size remains about the same.
>
> And the runtime or OS must also show the correct display for UTF8
> sequences. It's that bit that is going wrong with gcc.
>
The terminal is the same. So gcc must be linking a printf() that doesn't
treat UTF-8 correctly, though it's hard to see what could be going wrong
if the terminal takes UTF-8 in 8 bit bytes.
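One thing worth ruling out, as a Windows-specific sketch (assumes <windows.h>): have
the program select code page 65001 itself rather than relying on it being set from
outside:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    SetConsoleOutputCP(CP_UTF8);   /* code page 65001 */

    /* "ø°PÇ€" written out as explicit UTF-8 bytes. */
    printf("\xC3\xB8\xC2\xB0P\xC3\x87\xE2\x82\xAC\n");
    return 0;
}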

Öö Tiib

7 Jan 2022, 22:32:13
On Friday, 7 January 2022 at 19:34:16 UTC+2, james...@alumni.caltech.edu wrote:
> On Friday, January 7, 2022 at 7:06:26 AM UTC-5, Öö Tiib wrote:
> > On Thursday, 6 January 2022 at 19:28:46 UTC+2, james...@alumni.caltech.edu wrote:
> > > On Wednesday, January 5, 2022 at 11:35:38 PM UTC-5, Öö Tiib wrote:
> ...
> > > > Even programmer of such system might want to upgrade to compiler
> > > > whose standard library supports UTF-8. But it is not possible as
> > > > standard library is defined not to.
> > > Could you cite the text from the C and C++ standards that prohibits the
> > > standard library from supporting UTF-8?
> > Are you pretending that you did not understand what I meant?
> No, I am quite accurately and honestly expressing my confusion. You object to
> something being prohibited by the standards that is, to the best of my understanding,
> allowed. It would make more sense if you were objecting the fact that it isn't
> mandatory, and if you were making such claims, I would disagree with you about
> whether it would be a good idea to make it mandatory - but as far as I can tell, you're
> claiming it isn't allowed.

It is allowed. Almost whatever is allowed. But you yourself listed all that distracting
and confusing half-support, all those char8_t-s, added and deprecated codecvt-s
and u8 prefixes. Every possible thing designed to avoid adding actual support
to standard.

> If you could, as requested, cite the relevant text that prohibits such compilers, you
> might convince me that I'm wrong. If not, the citation would at least enable me to try
> to convince you that you're misinterpreting that text. Neither possibility can happen
> until you actually honor that request.

I can not possibly cite that. And now I'm confused how can be you snipped that
"The standards allow implementations to have wide array of whatever obscure
extensions." That already told it? So if I worded it unclear, then it is my fault.
Why must UTF-8 be extension?

>
> ...
> > However some essential things, like say adding 128 bit integers or
> > UTF-8 support or even to stop that nonsense with newline characters
> > is made tricky.
>
> Neither 128 bits integer nor UTF-8 support are essential. Lots of people have no
> need of either (I've never needed either one, though I have used UTF-8 since it was
> available). If you need such things on a platform where no existing implementation
> of C provides them, your complaint is with the implementors, not the standard,
> because the standard says nothing to prohibit such things.
>

In world where 98% of text communication goes with UTF-8 there of course
is some shrinking 2% of market left.

> ...
> > I am saying that when to ask why default string can't be UTF-8 then that
> > EBCDIC is usually mentioned. Despite there probably are no much C, let
> > alone C++ used on few alive EBCDIC platforms.
> That seems reasonable to me. The only places where that logic applies are
> implementations of C targeting EBCDIC platforms, and regardless of how rare such
> implementations, they would become substantially rarer because users would abandon
> them if they switched to UTF-8.

Hypothetical C programmer on EBCDIC platform (never heard of one) most likely
wants to exchange information with rest of the world. So why he would not want
to upgrade to compiler that supports UTF-8? I am purely speculating as I got no
experience with those devices.

>
> ...
> > The UTF-16 and UTF-32 are also among those 2% of obscure text encodings.
> > OK, UTF-16 can be useful to communicate with mis-designed programming
> > languages like Java or C# or operating systems like Windows. But UTF-32
> > is rather exotic.
> UTF-16 is, as I understand it, the default in the Windows world. I'm not that familiar
> with it, I've only done a few years of programming targeting that platform, and all of
> the text that came up in the work I was doing there was simple English, with no need
> to make any use of extended characters, so I can't vouch for any of the details about
> how text was encoded. However, WIndows is a very common platform, whether or not
> you approve of it's design (I share your disapproval for it), so calling UTF-16 obscure
> seems very odd.

It has usages like I confirmed already. Obscure I said because it has merged the
overhead and need for BOMs of UTF-32 with inconveniences of UTF-8 without
any benefits.

>
> UTF-32 shares an important property with the implementation-defined encoding used
> for wchar_t: every character takes up one and only one element in the array. I have
> written a lot of code over several decades for parsing strings that assumes that every
> character takes up one char, a valid assumption in the contexts in which I wrote it. When
> I think about how I would have to re-write such code to work with multi-byte encodings
> such as UTF-8, then the simplicity of replacing char with wchar_t or char32_t, leaving my
> logic unchanged, starts looking pretty attractive. However, since I have little experience
> writing code to work with extended characters using any encoding, my preferences don't
> carry much weight.

That is a good point. UTF-8 converts trivially to UTF-32 and back. So where such
a precisely-one guarantee helps to make some algorithm more robust and simple,
we can easily convert, of course. But I see no reason to keep (or to
transfer) UTF-32 outside of such a context.

> > > While C, in particular, doesn't mandate quite as much support for UTF-8 as
> > > I'd like, both standards allow the fullest possible support for UTF-8 that I
> > > could imagine. Why do you think otherwise?
> > I think that having all text streams and plain non-prefixed "strings" as
> > UTF-8 is both possible and most logical.
> Yes, and that is allowed by both standards, and is the norm, not the exception, on most
> Unix-like platforms that I'm familiar with. That's why your claim that it's not allowed
> confuses me.

If I appeared to make that claim then it is probably my fault. I meant it is made difficult
by adding things that look like trying to support it one day, but there really is no UTF-8
support in the standards despite our using it everywhere for decades.

> If you don't work on Unix systems where the encoding of non-prefixed
> strings is UTF-8, and you don't work on Windows systems where UTF-16 is the norm, and
> you don't work on systems where EBCDIC is the norm, what kinds of systems do you
> work on? I'm not saying that there aren't any other types of systems, there's a great
> many, but most of the others are substantially less common, so I am just curious which
> one(s) you use.

I have written C and C++ for systems and peripherals of things like cash dispensers,
point of sale terminals, taximeters, spectrometers, thermostats, frequency converters
and mobile phones. Also I have participated in projects of writing utility software and
services on Unixes and Windowses. Of the systems that I've programmed, if the
pointless obfuscation garbage were removed from the standards and UTF-8 were required,
then perhaps only MS would have to do anything at all.

Richard Damon

7 Jan 2022, 23:01:29
On 1/7/22 7:06 AM, Öö Tiib wrote:
> The UTF-16 and UTF-32 are also among those 2% of obscure text encodings.
> OK, UTF-16 can be useful to communicate with mis-designed programming
> languages like Java or C# or operating systems like Windows. But UTF-32
> is rather exotic.
>

One key thing to remember is that most UTF-16 (and especially Windows'
use of it) goes back to a too-early adoption of Unicode and UCS-2 as
'The Standard' for Text, when 16-bit 'Unicode' was claimed to be the
answer to the problem of all those code-pages.

Then, after it got adopted and baked into ABIs/APIs it was realized
that Unicode was going to need to be bigger so UCS-2 switched to UTF-16
and UCS-4 became the real 'wide character' type (except, in some ways it
still wasn't due to combining codepoints).

For C, this basically means that the 'wide character' system is
practically broken, and is really broken on Windows machines as the
standard says it needs to be the widest type of character, and that it
expresses all the characters in one unit, but on Windows by ABI
requirements it must be 16 bits, and 32 bits isn't really correct even
on systems which do use that for wide characters due to the combining
codes issue.
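A tiny sketch of that point; the numbers are platform-dependent, with a 2-byte
wchar_t typical on Windows and a 4-byte one on most Unix-likes:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

    /* U+1F600 is outside the BMP: with a 16-bit wchar_t the literal
       needs two units (a surrogate pair), with a 32-bit wchar_t one. */
    const wchar_t *s = L"\U0001F600";
    printf("wchar_t units in L\"\\U0001F600\" = %zu\n", wcslen(s));
    return 0;
}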

Basically, for a system that wants to really conform to both Unicode and
the C standard, the implementation is put in a spot where it is actually
impossible to do so, at least if you want to keep the full intent of the
standard (that wide strings can be arbitrarily split and joined without
issue).

Then you have that Unicode is actually not stateless when you add things
like the various Left-to-Right codes or the emoji modifier characters.

james...@alumni.caltech.edu

7 Jan 2022, 23:52:33
On Friday, January 7, 2022 at 10:32:13 PM UTC-5, Öö Tiib wrote:
> On Friday, 7 January 2022 at 19:34:16 UTC+2, james...@alumni.caltech.edu wrote:
> > On Friday, January 7, 2022 at 7:06:26 AM UTC-5, Öö Tiib wrote:
...
> > No, I am quite accurately and honestly expressing my confusion. You object to
> > something being prohibited by the standards that is, to the best of my understanding,
> > allowed. It would make more sense if you were objecting the fact that it isn't
> > mandatory, and if you were making such claims, I would disagree with you about
> > whether it would be a good idea to make it mandatory - but as far as I can tell, you're
> > claiming it isn't allowed.
> It is allowed. Almost whatever is allowed. But you yourself listed all that distracting
> and confusing half-support, all those char8_t-s, added and deprecated codecvt-s
> and u8 prefixes. Every possible thing designed to avoid adding actual support
> to standard.

So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
unprefixed string literals, even for implementations targeting platforms where that's
contrary to the conventions for that platform?

> > If you could, as requested, cite the relevant text that prohibits such compilers, you
> > might convince me that I'm wrong. If not, the citation would at least enable me to try
> > to convince you that you're misinterpreting that text. Neither possibility can happen
> > until you actually honor that request.
> I can not possibly cite that. And now I'm confused how can be you snipped that
> "The standards allow implementations to have wide array of whatever obscure
> extensions." That already told it? So if I worded it unclear, then it is my fault.
> Why must UTF-8 be extension?

It didn't occur to me that you meant "obscure extensions" to refer to UTF-8 support. It
isn't an extension. "The values of the members of the execution character set are
implementation-defined." (5.2.1p1). That puts choosing UTF-8 for that encoding in the
same category as choosing 8 as the value for CHAR_BIT or setting the values for the
macros that are #defined in <limits.h>.

The term "extension" is not normally used for implementation-defined behavior. Note that
4p9 requires that "An implementation shall be accompanied by a document that defines
all implementation-defined and locale-specific characteristics and all extensions." If
implementation-defined behavior were considered to qualify as an extension, that
specification would be redundant, something the committee generally tries to avoid.

....
> In world where 98% of text communication goes with UTF-8 there of course
> is some shrinking 2% of market left.

Do you have sources for those numbers, or are you just pulling them out of your hat?
I'm not saying you're wrong, just that I don't know of any easy way to determine what
those numbers are.

> > ...
> > > I am saying that when to ask why default string can't be UTF-8 then that
> > > EBCDIC is usually mentioned. Despite there probably are no much C, let
> > > alone C++ used on few alive EBCDIC platforms.
> > That seems reasonable to me. The only places where that logic applies are
> > implementations of C targeting EBCDIC platforms, and regardless of how rare such
> > implementations, they would become substantially rarer because users would abandon
> > them if they switched to UTF-8.
> Hypothetical C programmer on EBCDIC platform (never heard of one) most likely
> wants to exchange information with rest of the world. So why he would not want
> to upgrade to compiler that supports UTF-8? I am purely speculating as I got no
> experience with those devices.

Unlike your hypothetical C programmer on EBCDIC platform, real programmers of that type
have access to conversion routines for use when the need to communicate outside the
EBCDIC world comes up. If UTF-8 were mandatory for unprefixed string literals, an
implementation targeting such platforms that conformed to such a mandate could add an
extension to create EBCDIC-encoded string literals. If so, developers for such platforms
would have to routinely use that extension for most of their string literals. Such developers
would find that very inconvenient, and would therefore make sure that any implementation
targeting that platform had an option that would make it fail to conform to such a mandate.

Imposing that mandate would fail to make UTF-8 any more widely used. The reason that
there do exist platforms where UTF-8 is not the encoding used for unprefixed string literals
is because their users want some other encoding to be used for that purpose. If that weren't
the case, someone would have already created a UTF-8 implementation for that platform.

...
> > how text was encoded. However, WIndows is a very common platform, whether or not
> > you approve of it's design (I share your disapproval for it), so calling UTF-16 obscure
> > seems very odd.
> It has usages like I confirmed already. Obscure I said because it has merged the
> overhead and need for BOMs of UTF-32 with inconveniences of UTF-8 without
> any benefits.

It doesn't matter how strongly you disapprove of it - what matters is how many people want
to use it despite your disapproval.

james...@alumni.caltech.edu

8 Jan 2022, 0:13:46
On Friday, January 7, 2022 at 1:44:39 PM UTC-5, Malcolm McLean wrote:
> On Friday, 7 January 2022 at 17:34:16 UTC, james...@alumni.caltech.edu wrote:
...
> > > I think that having all text streams and plain non-prefixed "strings" as
> > > UTF-8 is both possible and most logical.
> > Yes, and that is allowed by both standards, and is the norm, not the exception, on most
> > Unix-like platforms that I'm familiar with. That's why your claim that it's not allowed
> > confuses me.
> My experience is that passing utf-8 to printf() or fopen() doesn't work. But I rarely
> need to do so, and the situation might have changed recently.

My wife was born in Taiwan, and our kids are bilingual, so every computer in our house has
been set up to handle Chinese text properly. If yours isn't, you might not see the right
characters in the string literals below. You could try using

"\u5929\u5B89\u95E8\u5E7F\u573A"

instead - the encoding of the character arrays should be unchanged.

#include <inttypes.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <uchar.h>

int main(void)
{
    setlocale(LC_ALL, "");
    const char location[] = "天安门广场";
    const char location8[] = u8"天安门广场";
    const char *p;

    printf("Location :");
    for(p = location; *p; p++)
        printf("%#X ", *(unsigned char*)p);

    printf("\nLocation8:");
    for(p = location8; *p; p++)
        printf("%#X ", *(unsigned char*)p);

    printf("\n\"%s\"\n", location);

    p = location;
    const char * const end = location + sizeof location;
    mbstate_t state = {0};
    while(*p)
    {
        char32_t c32;
        size_t bytes = mbrtoc32(&c32, p, end - p, &state);
        switch(bytes)
        {
        case (size_t)(-3):  /* character produced from state; no bytes consumed */
            printf("%#" PRIXLEAST32 " ", c32);
            break;
        case (size_t)(-2):  /* truncated multibyte character */
            fprintf(stderr, "incomplete character\n");
            return EXIT_FAILURE;
        case (size_t)(-1):  /* invalid sequence */
            fprintf(stderr, "%td:Encoding error\n", p-location);
            return EXIT_FAILURE;
        default:
            printf("%#" PRIXLEAST32 " ", c32);
            p += bytes;
            break;
        }
    }

    printf("\n");
    return EXIT_SUCCESS;
}

That program produces the following output on my system:
Location :0XE5 0XA4 0XA9 0XE5 0XAE 0X89 0XE9 0X97 0XA8 0XE5 0XB9 0XBF 0XE5 0X9C 0XBA
Location8:0XE5 0XA4 0XA9 0XE5 0XAE 0X89 0XE9 0X97 0XA8 0XE5 0XB9 0XBF 0XE5 0X9C 0XBA
"天安门广场"
0X5929 0X5B89 0X95E8 0X5E7F 0X573A

Note that the u8 string is encoded exactly the same way as the unprefixed string literal,
confirming that UTF-8 is the encoding for unprefixed string literals.
The setlocale() call is not needed to correctly display the string, but it is needed for
mbrtoc32() to work. In the default "C" locale, mbrtoc32() reports an encoding error. The
"" locale is the implementation-defined default locale - I'm not sure what gcc defines that
default to be, but I suspect that it's the value of my LANG environment variable, which is
"en_US.UTF-8". Virtually every locale supported on my system has UTF-8 or utf8 in it's name.
The "C" locale is one of the few exceptions, but a "C.UTF-8" locale is also supported.

Öö Tiib

8 Jan 2022, 11:17:53
On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> On Friday, January 7, 2022 at 10:32:13 PM UTC-5, Öö Tiib wrote:
> > On Friday, 7 January 2022 at 19:34:16 UTC+2, james...@alumni.caltech.edu wrote:
> > > On Friday, January 7, 2022 at 7:06:26 AM UTC-5, Öö Tiib wrote:
> ...
> > > No, I am quite accurately and honestly expressing my confusion. You object to
> > > something being prohibited by the standards that is, to the best of my understanding,
> > > allowed. It would make more sense if you were objecting the fact that it isn't
> > > mandatory, and if you were making such claims, I would disagree with you about
> > > whether it would be a good idea to make it mandatory - but as far as I can tell, you're
> > > claiming it isn't allowed.
> > It is allowed. Almost whatever is allowed. But you yourself listed all that distracting
> > and confusing half-support, all those char8_t-s, added and deprecated codecvt-s
> > and u8 prefixes. Every possible thing designed to avoid adding actual support
> > to standard.
>
> So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> unprefixed string literals, even for implementations targeting platforms where that's
> contrary to the conventions for that platform?

Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
Why? On what? For what? Must be odd corner case. Each trashcan, smoke sensor or
microwave oven out there wants to communicate with whatever siris, alexas,
google homes and skynets they serve. All of those use UTF-8 texts. If it has some
LCD or LED panel then it wants to show text understandable to local desperate
housewife, low salary technician or taxi driver. If it is char* then it is UTF-8 there.

> > > If you could, as requested, cite the relevant text that prohibits such compilers, you
> > > might convince me that I'm wrong. If not, the citation would at least enable me to try
> > > to convince you that you're misinterpreting that text. Neither possibility can happen
> > > until you actually honor that request.
> > I can not possibly cite that. And now I'm confused how can be you snipped that
> > "The standards allow implementations to have wide array of whatever obscure
> > extensions." That already told it? So if I worded it unclear, then it is my fault.
> > Why must UTF-8 be extension?
>
> It didn't occur to me that you meant "obscure extensions" to refer to UTF-8 support. It
> isn't an extension. "The values of the members of the execution character set are
> implementation-defined." (5.2.1p1). That puts choosing UTF-8 for that encoding in the
> same category as choosing 8 as the value for CHAR_BIT or setting the values for the
> macros that are #defined in <limits.h>.

CHAR_BIT can't be less than 8, so a UTF-8 code unit is guaranteed to fit. The flexibility
to have CHAR_BIT bigger than 8 can be left there, since char also has to serve as the byte type.

> The term "extension" is not normally used for implementation-defined behavior. Note that
> 4p9 requires that "An implementation shall be accompanied by a document that defines
> all implementation-defined and locale-specific characteristics and all extensions." If
> implementation-defined behavior were considered to qualify as an extension, that
> specification would be redundant, something the committee generally tries to avoid.

Implementation-defined behavior or not, in my experience if text is passed
with char* then it points at UTF-8, and the programmer has to fight with that
implementation-defined garbage because he needs it to be UTF-8. And I'm complaining
about attempts to lie to novices that UTF-8 should be char8_t* or something else like
that. Practical example:
FILE *f = fopen( "Foo😀Bar.txt", "w");
That should work unless the underlying file system does not support files
named "Foo😀Bar.txt". If it supports them but the code does not work, then that indicates
a bad standard that allows implementations to weasel away. No garbage like
u8fopen( u8"Foo😀Bar.txt", "w") arriving maybe in C35 or so is
needed, as it already works like in my example on the vast majority of things.
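As an illustration of the kind of shim this currently forces on Windows; a sketch
only, fopen_utf8 is a made-up helper name and error handling is trimmed:

#include <stdio.h>
#include <windows.h>

/* Convert a UTF-8 file name to UTF-16 and open it with _wfopen. */
static FILE *fopen_utf8(const char *name, const char *mode)
{
    wchar_t wname[MAX_PATH], wmode[16];
    if (!MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, MAX_PATH))
        return NULL;
    if (!MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16))
        return NULL;
    return _wfopen(wname, wmode);
}

int main(void)
{
    FILE *f = fopen_utf8("Foo\xF0\x9F\x98\x80" "Bar.txt", "w");  /* Foo😀Bar.txt */
    if (f)
        fclose(f);
    return 0;
}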

> ....
> > In world where 98% of text communication goes with UTF-8 there of course
> > is some shrinking 2% of market left.
> Do you have sources for those numbers, or are you just pulling them out of your hat?
> I'm not saying you're wrong, just that I don't know of any easy way to determine what
> those numbers are.

There is no easy way, but some organizations do diligently gather statistics on what is
possible to monitor. Like this:
<https://w3techs.com/technologies/history_overview/character_encoding>
The legacy encodings are there but shrinking. If whatever new C position opens where text has
to be accessed with char*, then the chance is close to 0 that it is anything else
but UTF-8.

> > > ...
> > > > I am saying that when to ask why default string can't be UTF-8 then that
> > > > EBCDIC is usually mentioned. Despite there probably are no much C, let
> > > > alone C++ used on few alive EBCDIC platforms.
> > > That seems reasonable to me. The only places where that logic applies are
> > > implementations of C targeting EBCDIC platforms, and regardless of how rare such
> > > implementations, they would become substantially rarer because users would abandon
> > > them if they switched to UTF-8.
> > Hypothetical C programmer on EBCDIC platform (never heard of one) most likely
> > wants to exchange information with rest of the world. So why he would not want
> > to upgrade to compiler that supports UTF-8? I am purely speculating as I got no
> > experience with those devices.
>
> Unlike your hypothetical C programmer on EBCDIC platform, real programmers of that type
> have access to conversion routines for use when the need to communicate outside the
> EBCDIC world comes up. If UTF-8 were mandatory for unprefixed string literals, an
> implementations targeting such platforms that conformed to such a mandate could add an
> extension to create EBCDIC-encoded string literals. If so, developers for such platforms
> would have to routinely use that extension for most of their string literals. Such developers
> would find that very inconvenient, and would therefore make sure that any implementation
> targeting that platform had an option that would make it fail to conform to such a mandate.

You never answered why should they use obscure extensions for what they need on
majority of cases. Why UTF-8 must be obscure extension?

> Imposing that mandate would fail to make UTF-8 any more widely used. The reason that
> there do exist platforms where UTF-8 is not the encoding used for unprefixed string literals
> is because their users want some other encoding to be used for that purpose. If that weren't
> the case, someone would have already created a UTF-8 implementation for that platform.

It is used on close to 100% of cases anyway. I am objecting that it is deliberately
standardized (or more like pseudo-standardized/non-standardized) to be
inconvenient to use.

> ...
> > > how text was encoded. However, WIndows is a very common platform, whether or not
> > > you approve of it's design (I share your disapproval for it), so calling UTF-16 obscure
> > > seems very odd.
> > It has usages like I confirmed already. Obscure I said because it has merged the
> > overhead and need for BOMs of UTF-32 with inconveniences of UTF-8 without
> > any benefits.
> It doesn't matter how strongly you disapprove of it - what matters is how many people want
> to use it despite your disapproval.

Agreed. So do you have numbers how many C programmers *want* to use UTF-16? I
think that it is little, but I do not have any sources. They may *need* to for legacy
reasons I already mentioned but even there it is most likely small number. Their
pain with support to their u"string", L"string" and \x \u \U character references
might need relieving too but is bit different topic.

Malcolm McLean

8 Jan 2022, 12:17:19
On Saturday, 8 January 2022 at 16:17:53 UTC, Öö Tiib wrote:
> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
>
> > So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> > unprefixed string literals, even for implementations targeting platforms where that's
> > contrary to the conventions for that platform?
> Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
> Why? On what? For what? Must be odd corner case.
>
Where you've got an 8 bit character-mapped display that supports ascii plus some
extended characters. That used to be almost every microcomputer, and it still lives
on a bit in modern PCs.

Öö Tiib

8 Jan 2022, 13:20:52
When? How long ago? In the eighties?
I had the fun of participating in programming such a panel more than a decade ago,
to support showing whatever text, including 8,105 "simplified" Chinese characters if
needed. It wasn't that big a project and the panel was dirt cheap. I disliked that it
was pointlessly required to use UTF-16, as UTF-8 could have made it even simpler. What
is the reason, in our current world, to use anything but UTF-8 for character mapping?
In current PCs it lives on as deliberate sabotage by the platform vendor, as there is
no reason other than the desire to make their proprietary programming language
look better than C.

Mateusz Viste

8 Jan 2022, 13:37:38
2022-01-08 at 10:20 -0800, Öö Tiib wrote:
> I disliked that it was pointlessly required to use UTF-16 as UTF-8
> could make it even simpler. What is the reason to use for
> character-mapping in our current world anything but UTF-8?

While UTF-8 is neat, it is also complex to decode. Even a simple
strlen() can be challenging. That's where UTF-16 or UTF-32 are handy
since there is no decoding required and every glyph has a fixed
byte length.

Mateusz

james...@alumni.caltech.edu

8 Jan 2022, 13:49:01
On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
...
> > So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> > unprefixed string literals, even for implementations targeting platforms where that's
> > contrary to the conventions for that platform?
> Yes, and vast majority would be happy. What other char* text is needed than UTF-8?

Well, let me ask you - does the implementation you use most often use UTF-8
encoding for unprefixed string literals? Since you're complaining about the difficulty of
using UTF-8, I presume that it doesn't. If not, why not? The standard doesn't say
anything to prevent that implementation from doing so. If they don't, it can only be
because they don't want to. So why don't you ask the implementors why they made
that decision? They've got a reason that seemed sufficiently good for them, find out
what it is.

...
> FILE *f = fopen( "Foo😀Bar.txt", "w");
> That should work unless underlying file system does not support files
> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> bad standard that allows implementations to weasel away. No garbage like
> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> needed as it already works like in my example on vast majority of things.

Nothing in the standard prevents an implementation from doing that. If one doesn't
already do so, that's a choice made by the implementors, and you should ask them
about it. Your real beef is with the implementors, not the standard.

...
> You never answered why should they use obscure extensions for what they need on
> majority of cases. Why UTF-8 must be obscure extension?

They shouldn't. It isn't. It's an implementation-defined choice, and if an
implementation you want to use forces you to use an obscure extension, in order to
work with UTF-8, you should ask them why - it's nothing the standard forced them to
do. I don't need to use any extensions to work with UTF-8 on my desktop. I also don't
need to use UTF-8, but that's a separate matter.

...
> It is used on close to 100% of cases anyway. I am objecting that it is deliberately
> standardized (or more like pseudo-standardized/non-standardized) to be
> inconvenient to use.

How would you like it to be more convenient? printf("Foo😀Bar.txt") works fine on my
system. If it doesn't work on yours, talk with your implementor.

...
> Agreed. So do you have numbers how many C programmers *want* to use UTF-16? I
> think that it is little, but I do not have any sources. They may *need* to for legacy
> reasons I already mentioned but even there it is most likely small number. Their
> pain with support to their u"string", L"string" and \x \u \U character references
> might need relieving too but is bit different topic.

If they don't want it, and in particular, don't need it, then it should be pretty easy to
convince implementors to use UTF-8 instead. Have you tried? If they refuse, their
response is likely to give you far more relevant information than I could give you, since
I don't work on that platform any more, and didn't work on it very long.

The fundamental problem is - why should people who want to use some other
encoding for unprefixed string literals be forced to use UTF-8 instead, just because
you disagree? How does the existence of implementations catering to their needs hurt
you? Those implementations aren't the reason why using UTF-8 is complicated on the
implementations you use - that's entirely due to decisions made by your implementor
- so talk to the implementor and try to convince them to change.

If it seems unreasonable to you that you should have to convince one implementor to
adopt UTF-8, keep this in mind: however hard it is to convince a single implementor to
change, it would be much harder to convince the C and C++ committees to make such
a change. If you do want to convince those committees, a good way to start is by
convincing a single implementation to change.

Öö Tiib

8 Jan 2022, 13:53:26
That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
varying length. That makes UTF-8 the sole sane one.
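A small sketch that makes it concrete, assuming a C17 compiler where u8 literals
still have type char; the element counts exclude the terminator:

#include <stdio.h>
#include <uchar.h>

int main(void)
{
    /* One user-perceived glyph, two code points: U+1F44C U+1F3FD. */
    const char     s8[]  = u8"\U0001F44C\U0001F3FD";
    const char16_t s16[] = u"\U0001F44C\U0001F3FD";
    const char32_t s32[] = U"\U0001F44C\U0001F3FD";

    printf("UTF-8 bytes: %zu, UTF-16 units: %zu, UTF-32 units: %zu\n",
           sizeof s8 - 1,
           sizeof s16 / sizeof *s16 - 1,
           sizeof s32 / sizeof *s32 - 1);
    return 0;
}

That prints 8, 4 and 2, so none of the three encodings is one unit per glyph here.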

Mateusz Viste

8 Jan 2022, 14:12:14
2022-01-08 at 10:53 -0800, Öö Tiib wrote:
> That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
> varying length. That makes UTF-8 the sole sane one.

You are right, the implementations of UTF-16 I worked on were limited
to the BMP (ie. always 2 bytes), hence my simplified view.

Still, UTF-32 is always 4 bytes for any possible glyph, isn't it?

Mateusz

Manfred

8 Jan 2022, 14:14:10
As Öö Tiib wrote, UTF-16 is variable length - its predecessor (under
Windows) UCS-2 is fixed length, but it failed to keep the promise to
accommodate the entire universe of language glyphs.
Furthermore, the recent fashion of adding emojis to Unicode has made
UTF-32 no longer fixed length as well.

However, the problem with strlen() is most often a false problem: most
often you need to know the size of the string in memory, and that's
bytes, rather than the count of glyphs in the string. Which you still
can do, by the way, but it doesn't have to have the performance
requirements of strlen, for example.

>
> Mateusz
>
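For what it's worth, a sketch of that count; utf8_codepoints is a made-up name and
malformed input isn't validated:

#include <stdio.h>
#include <string.h>

/* Count code points by counting bytes that are not 10xxxxxx continuations. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const char *s = "\xC3\x96\xC3\xB6 Tiib";   /* "Öö Tiib" in UTF-8 */
    printf("bytes: %zu, code points: %zu\n", strlen(s), utf8_codepoints(s));
    return 0;
}

With that string it prints bytes: 9, code points: 7.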

Malcolm McLean

8 Jan 2022, 14:22:40
The problem is that not all languages fit into the Latin mould, where
you have one letter taking up one physical rectangle of space in the
writing area.
In some languages, there are combining forms. We see this a bit
in European scripts, where you have accents. But in other languages
it can go much deeper, and you can't really provide one code per glyph.

Bart

8 Jan 2022, 14:30:04
This is Unicode crossing the line into typography, markup and clip-art.

The first is an actual character, or rather, a symbol (strictly, by
'glyph' is meant only the shape or design). The second is a modifier, I
believe of the colour, which IMO doesn't belong (along with font, height,
aspect and weight, among other attributes).

I'm sure there were plenty of such schemes in ASCII too, although more
recently they take the form of explicit tag strings. But it's still the case
that a linebreak in ASCII can be CR,LF or just LF; so is that one
character or two?

At the heart, though, everyone knows that a plain ASCII string of
printable characters - one containing no control codes, attributes
or other meta-data - can be represented by an array of bytes, one per
character.

Similarly, most such strings of full 21-bit (ie. 32 bits in practice)
Unicode codes can be represented by an indexable array of 32-bit values.

If you really, really need those multi-Unicode sequences, then you can
choose to represent a string as an array of variable-length short
strings, most of which will be one 32-bit character long.

Although there will doubtless be other special requirements that would
make that impractical too. But then, the very definition of what is a
character or word will be blurred as well.


Guillaume

8 Jan 2022, 14:30:17
Yeah. While fixed-width characters are certainly easier (and, most of
all, faster) to handle, UTF-8 is not rocket science. The encoding is
pretty simple.

The downside is more a matter of speed than of complexity.
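A minimal sketch of the decoding step for one code point; validation of overlong
forms and continuation bytes is omitted:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one code point from a UTF-8 sequence; returns the number of
   bytes consumed, or -1 for an invalid lead byte. */
static int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = (uint32_t)(s[0] & 0x1F) << 6 | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = (uint32_t)(s[0] & 0x0F) << 12 | (uint32_t)(s[1] & 0x3F) << 6
            | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {
        *cp = (uint32_t)(s[0] & 0x07) << 18 | (uint32_t)(s[1] & 0x3F) << 12
            | (uint32_t)(s[2] & 0x3F) << 6 | (s[3] & 0x3F);
        return 4;
    }
    return -1;
}

int main(void)
{
    const unsigned char euro[] = "\xE2\x82\xAC";
    uint32_t cp;
    int n = utf8_decode(euro, &cp);
    printf("U+%04" PRIX32 " in %d bytes\n", cp, n);   /* U+20AC in 3 bytes */
    return 0;
}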

Öö Tiib

8 Jan 2022, 17:50:07
On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> > On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
> ...
> > > So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
> > > unprefixed string literals, even for implementations targeting platforms where that's
> > > contrary to the conventions for that platform?
> > Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
>
> Well, let me ask you - does the implementation you use most often use UTF-8
> encoding for unprefixed string literals? Since you're complaining about the difficulty of
> using UTF-8, I presume that it doesn't. If not, why not?

All compilers that I have used have done it for some time, or at least could be configured to.
Sometimes the configuration had to be done in an inconvenient manner, but those
are the easy parts of our work, so I ignored that. I also ignored the unneeded u8
prefixes: when some confused novice added one, it did not matter. After all,
these two worked the same in C++17:

const char crap[] = "Öö Tiib 😀";
const char crap8[] = u8"Öö Tiib 😀";

But C++20 gives an error for the second line, and a cast does not compile there
either. That makes me an angry opponent of the whole u8 prefix. It should be gone
from the language. They may add their char_iso8859_1_t and iso8859_1"strings" if
they want to, but they should raise the status of UTF-8 so that it is always
supported as a char array and not fucked with.
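
To make the change concrete, a minimal sketch of my own (assuming a compiler
whose ordinary literal encoding is UTF-8): in C++17 a u8"..." literal has type
const char[], while in C++20 it has type const char8_t[] and no longer converts
implicitly to char.

#include <cstdio>

int main() {
    const char plain[] = "Öö Tiib 😀";             // fine in both; encoding is implementation-defined
#if defined(__cpp_char8_t)                          // C++20 and later
    const char8_t u8s[] = u8"Öö Tiib 😀";           // the literal's type changed to const char8_t[]
    // const char broken[] = u8"Öö Tiib 😀";        // error in C++20: no char8_t -> char conversion
    std::printf("%s\n", reinterpret_cast<const char *>(u8s));  // an explicit cast is now required
#else                                               // C++17: u8"..." is const char[]
    const char u8s[] = u8"Öö Tiib 😀";
    std::printf("%s\n", u8s);
#endif
    std::printf("%s\n", plain);
}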

> The standard doesn't say
> anything to prevent that implementation from doing so. If they don't, it can only be
> because they don't want to. So why don't you ask the implementors why they made
> that decision? They've got a reason that seemed sufficiently good for them, find out
> what it is.

No, that fish rots from the head, IOW from the standards. MS abuses it more than
others, but only because they are slightly bigger assholes.

> ...
> > FILE *f = fopen( "Foo😀Bar.txt", "w");
> > That should work unless underlying file system does not support files
> > named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> > bad standard that allows implementations to weasel away. No garbage like
> > u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> > needed as it already works like in my example on vast majority of things.
>
> Nothing in the standard prevents an implementation from doing that. If one doesn't
> already do so, that's a choice made by the implementors, and you should ask them
> about it. Your real beef is with the implementors, not the standard.

My beef is with the standards. Adding garbage that does not work to the standard is
wrong, and not adding what everybody at least half sane already uses is also wrong.

>
> They shouldn't. It isn't. It's an implementation-defined choice, and if an
> implementation you want to use forces you to use an obscure extension, in order to
> work with UTF-8, you should ask them why - it's nothing the standard forced them to
> do. I don't need to use any extensions to work with UTF-8 on my desktop. I also don't
> need to use UTF-8, but that's a separate matter.

Oh, if I can't convince even an experienced person like you that the obfuscation
around UTF-8 in the standards is evil, then there is no point in discussing that
position with any implementer.

Manfred

8 Jan 2022, 20:17:10
On 1/8/2022 11:50 PM, Öö Tiib wrote:
> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
>> ...
>>>> So, you are arguing that it should be mandatory to have UTF-8 as the encoding for
>>>> unprefixed string literals, even for implementations targeting platforms where that's
>>>> contrary to the conventions for that platform?
>>> Yes, and vast majority would be happy. What other char* text is needed than UTF-8?
>>
>> Well, let me ask you - does the implementation you use most often use UTF-8
>> encoding for unprefixed string literals? Since you're complaining about the difficulty of
>> using UTF-8, I presume that it doesn't. If not, why not?
>
> All compilers that I have used did it for some time. Or at least could be configured to.
> Sometimes the configuration had to be done in inconvenient manner but these
> are the easy parts of our work. I ignored that. Also I did ignore the unneeded u8
> prefixes. When some confused novice added it then it did not matter. After all
> these two were working same in C++17:
>
> const char crap[] = "Öö Tiib 😀";
> const char crap8[] = u8"Öö Tiib 😀";
>
> But C++20 gives error about second line. Also cast does not compile there.
> So that makes me angry opponent of the whole u8 prefix. It should be gone from
> language. They may add their char_iso8859_1_t and iso8859_1"strings" if they
> want to but should raise the privileges of the UTF-8 to be always supported
> as char array and not fucked with.

The argument you are making here is more than convincing to me, but let
me play devil's advocate for a moment.

Granted, the Venerable Luminaries of the Holy Committee screwed up big
time, but they did it in C++17 (how surprising) rather than in C++20.

In principle, I could imagine a use for u8"" strings that are compatible
with some family of printf8() functions only, a sort of tight type
constraint for character types. This would probably have ended up like
Annex K, but still, it could have made sense to some Unicode purists, and
more importantly it would have done no harm to the sane world.

BUT the fact that C++17 allowed your second string, and thus people
started naively using it, and THEN C++20 prohibited it, thereby breaking said
naïve but so far legal code, denotes some serious dickheadedness, yes.

This is to say that the solution might be to consider C++17 a sad
parenthesis (once again), and only use char, char16_t (wchar_t?) and
char32_t where needed.
Applications where a distinct separation between UTF-8 and generic char
is important can pay the price of using u8, but the majority of
applications would most probably ignore it.

>
>> The standard doesn't say
>> anything to prevent that implementation from doing so. If they don't, it can only be
>> because they don't want to. So why don't you ask the implementors why they made
>> that decision? They've got a reason that seemed sufficiently good for them, find out
>> what it is.
>
> No, that fish rots from the head, IOW from standards. MS abuses it more than others
> but only because they are bit bigger assholes.
>

Agreed.

>> ...
>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>> That should work unless underlying file system does not support files
>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>> bad standard that allows implementations to weasel away. No garbage like
>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>> needed as it already works like in my example on vast majority of things.
>>
>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>> already do so, that's a choice made by the implementors, and you should ask them
>> about it. Your real beef is with the implementors, not the standard.
>
> My beef is with standards. Adding garbage that does not work to standard is wrong
> and not adding what everybody at least half sane does use to standard is also wrong.
>

Also agreed, but since UTF-8 is transparent to ASCII-oriented functions, what
should have been added?
I mean, if printf can't print UTF-8, it is a problem of the console
rather than of printf itself, right? So some way to put the console into
UTF-8 mode? But that is outside the scope of the standard, isn't it?

Po Lu

9 Jan 2022, 5:41:06
"james...@alumni.caltech.edu" <james...@alumni.caltech.edu> writes:

> No, I am quite accurately and honestly expressing my confusion. You
> object to something being prohibited by the standards that is, to the
> best of my understanding, allowed. It would make more sense if you
> were objecting the fact that it isn't mandatory, and if you were
> making such claims, I would disagree with you about whether it would
> be a good idea to make it mandatory - but as far as I can tell, you're
> claiming it isn't allowed.

AFAICT, he's complaining about Microsoft's specific implementations of
some standards.

I'm not an MS-Windows programmer, but from a Unix point-of-view their
way of doing things is indeed confusing -- at least when I looked into
porting some programs.

It's probably a matter of habit: I'm sure the distinction between wide
and ASCII system calls, code pages, and text and binary streams comes
naturally to MS-Windows programmers, who in turn find the lack of
explicit text streams in Unix confusing.

David Brown

9 Jan 2022, 7:33:35
On 08/01/2022 20:12, Mateusz Viste wrote:
> 2022-01-08 at 10:53 -0800, Öö Tiib wrote:
>> That is incorrect as glyph 👌🏽 is U+1F44C U+1F3FD so all three are of
>> varying length. That makes UTF-8 the sole sane one.
>
> You are right, the implementations of UTF-16 I worked on were limited
> to the BMP (ie. always 2 bytes), hence my simplified view.
>

When Unicode was young, the intention was that every glyph was one
character, and it would all fit in 16 bits - that was UCS-2, as used
originally by Windows NT, Java, Python, Qt, and other systems, languages
and libraries. But it was quickly discovered that this was far from
sufficient.

> Still, UTF-32 is always 4 bytes for any possible glyph, isn't it?
>

The terminology of Unicode can be a little confusing. (And I'm sure
someone will correct me if I get it wrong.)

A "code point" is an entry in the Unicode tables. Each code point is
uniquely identified by a 32-bit number. The code points are organised
in "planes" for convenience, and designed so that the first 128 code
points match ASCII and that a wide range of languages can be covered by
the code units in the range 0x0000 .. 0xffff (excluding 0xd800 ..
0xdfff) so that 16 bits would often be enough.

A "code unit" is the container for the bits of the encoding. In UTF-8,
a code unit is an 8-bit unit. In UTF-16, it is 16-bit, in UTF-32 it is
32-bit.

UTF-8 takes up to four code units (32 bits total) per code point, UTF-16
takes up to two code units, and UTF-32 takes exactly one code unit per
code point. UTF-8 is always at least as compact as UTF-32, and will be
more or less compact than UTF-16 depending on the content. These are
just different encodings - different ways to write the code points.
There are others, such as GB18030, a variable-length encoding popular in
China because it matches their traditional GB encodings in the same way
UTF-8 matches ASCII.
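
As a small illustration of my own (not from the original post), the
per-code-point cost of each encoding follows directly from the value of the
code point:

#include <cstdint>
#include <cstdio>

int utf8_units(std::uint32_t cp)  { return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4; }
int utf16_units(std::uint32_t cp) { return cp < 0x10000 ? 1 : 2; }   // 2 = surrogate pair
int utf32_units(std::uint32_t)    { return 1; }                      // always one code unit

int main() {
    const std::uint32_t samples[] = { 0x41 /* A */, 0xD6 /* Ö */, 0x20AC /* € */, 0x1F600 /* 😀 */ };
    for (std::uint32_t cp : samples)
        std::printf("U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d code unit(s)\n",
                    (unsigned)cp, utf8_units(cp), utf16_units(cp), utf32_units(cp));
}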

A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
that conveys meaning. Sometimes it is useful to break them down,
sometimes it is useful to treat them separately. For example, "é" can
considered as a single grapheme, or as a grapheme "e" followed by a
combining graphene "'" acute accent. The same grapheme can match
multiple code points - a Latin alphabet capital A is the same as a Greek
alphabet capital Alpha.

A "glyph" is a rendering of a grapheme - the letter "A" in different
fonts are different glyphs of the same grapheme.

What the reader perceives as a "character" is often a single grapheme,
but might be several graphemes together.


So, with that in mind, all three UTF formats require multiple code units
to cover all graphemes. But UTF-32 always gets one code point per code
unit, making it simpler and more consistent for processing Unicode text.
As a file or transfer encoding, it has the big inconvenience of being
endian-specific as well as being bulkier than UTF-8. UTF-16 combines
the worst features of UTF-8 with the worst features of UTF-32, with none
of the benefits - it exists solely because early Unicode adopters
committed too strongly to UCS2.


People are often concerned that UTF-8 is difficult or complex to decode
or split up. It is not, in practice. It is actually quite rare that
you need to divide up a string based on characters or even find its
length in code points - for most uses of strings, you just pass them
around without bothering about the details of the contents. You need to
know how much memory the string takes, not how many code points it has.
And simply treating it as an abstract stream of data terminated by a
zero character can be enough to give you a useable sorting and
uniqueness comparison for many uses. The point where you need to decode
the code units and know what they mean is when you are doing rendering,
sorting, or other human interaction - and then you have such a vastly
bigger task that turning UTF-8 coding into UTF-32 code points is
negligible effort in comparison.

(And UTF-8 is not much harder to encode or decode than UTF-16.)

Öö Tiib

9 Jan 2022, 7:44:10
Hmm. It is great that at least you are convinced. For me, C++17 and
C++20 both added too many behaviors that silently change or silently
turn into undefined behavior, so a noisy break is actually better
than the rest of it. It is just that the importance of UTF-8 in the
software development industry is hard to overestimate.

...

> >>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>> That should work unless underlying file system does not support files
> >>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>> bad standard that allows implementations to weasel away. No garbage like
> >>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>> needed as it already works like in my example on vast majority of things.
> >>
> >> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >> already do so, that's a choice made by the implementors, and you should ask them
> >> about it. Your real beef is with the implementors, not the standard.
> >
> > My beef is with standards. Adding garbage that does not work to standard is wrong
> > and not adding what everybody at least half sane does use to standard is also wrong.
> >
> Also agreed, but since utf-8 is transparent to ascii functions, what
> should have been added?

Something that makes it clear that it is a defect when a file named "Foo≡ƒÿÇBar.txt" is
silently opened instead, on a file system that fully supports files named "Foo😀Bar.txt", I suppose.

> I mean, if printf can't print utf-8, it is a problem of the console
> rather than printf itself, right? So some way to set the console in
> utf-8 mode? But that is outside the scope of the standard, isn't it?

The console output can be set to UTF-8 mode with a few lines of platform-specific
code ... its keyboard input can't, but that is all down to the vendor ... I agree with James there.
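
For the record, a minimal Windows-specific sketch of my own of those "few
lines": switch the console output code page to UTF-8, and plain printf of
UTF-8 bytes then displays correctly (given a font that has the glyphs).

#ifdef _WIN32
#include <windows.h>
#endif
#include <cstdio>

int main() {
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);   // console now interprets output bytes as UTF-8
#endif
    std::printf("Temperature: 74.3\xC2\xB0" "F\n");   // U+00B0 DEGREE SIGN encoded as C2 B0
}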

Manfred

9 Jan 2022, 11:34:25
Yes, and the silent one is in C++17. From your example, in C++20 the
compiler doesn't allow you to pass a u8"string" to printf, does it?
If u8 had worked this way from the beginning, then the problem you
mention above wouldn't exist.

> Just that the importance of UTF-8 in software
> development industry is hard to overestimate.
>
> ...
>
>>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
>>>>> That should work unless underlying file system does not support files
>>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
>>>>> bad standard that allows implementations to weasel away. No garbage like
>>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
>>>>> needed as it already works like in my example on vast majority of things.
>>>>
>>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
>>>> already do so, that's a choice made by the implementors, and you should ask them
>>>> about it. Your real beef is with the implementors, not the standard.
>>>
>>> My beef is with standards. Adding garbage that does not work to standard is wrong
>>> and not adding what everybody at least half sane does use to standard is also wrong.
>>>
>> Also agreed, but since utf-8 is transparent to ascii functions, what
>> should have been added?
>
> Something that makes it clear that it is defect when "Foo😀Bar.txt" is silently opened
> on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
>

Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
representation, what's the difference? One form or the other shows up
only when it is displayed in some UI - the filesystem isn't one, which
leads to the implementation's runtime behavior.

If they are actually different in their binary sequence, and this is the
result of the utf-8 string being wrongly converted multiple times, this
looks like a bad implementation, rather than a problem with the standard.
IIUC you are advocating for some statement in the standard that prevents
implementations from messing up with "character sets" in null terminated
char strings?

Richard Damon

9 Jan 2022, 12:52:50

On 1/9/22 7:33 AM, David Brown wrote:

> A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
> that conveys meaning. Sometimes it is useful to break them down,
> sometimes it is useful to treat them separately. For example, "é" can
> considered as a single grapheme, or as a grapheme "e" followed by a
> combining graphene "'" acute accent. The same grapheme can match
> multiple code points - a Latin alphabet capital A is the same as a Greek
> alphabet capital Alpha.
>
> A "glyph" is a rendering of a grapheme - the letter "A" in different
> fonts are different glyphs of the same grapheme.
>
> What the reader perceives as a "character" is often a single grapheme,
> but might be several graphemes together.
>

No, a grapheme, from my understanding, is the character as perceived by
the reader. Thus adding accents to a base character builds a
single grapheme from several code points.

The grapheme doesn't include 'font' information like which font to use,
the size, or additions like bold or italics, which add on to make
the final glyph, but it does include all the jots and tildes that are part
of the character.

On the other hand, some languages add things like 'vowel points' to
characters, and those are separate graphemes even though they are added
in a similar manner. This comes down to what the original language
thought of as a 'character', which just makes things even more complicated.

Then as you said, there are the 'look-alike' characters which are
considered (generally) to be separate, but some canonicalizations will
convert to a common character.

Malcolm McLean

9 Jan 2022, 13:13:14
On Sunday, 9 January 2022 at 17:52:50 UTC, Richard Damon wrote:
>
> On the other hand, some languages add things like 'vowel points' to
> characters, and those are seperate graphemes even though they are added
> by a similar manner. This comes down to what the original language
> though of as a 'character', which just makes things even more complicated.
>
In Hebrew the "vowel points" are optional. They are used in beginners' and
religious texts, but not in general use. So if we take a text from scripture,
and represent it with and without vowels, is that the same text or a different
text? Almost all Hebrew speakers would say "It's the same text". So
strcmp() doesn't necessarily work in a Hebrew context.

Richard Damon

9 Jan 2022, 13:24:47
Hebrew isn't the only language to use 'vowel points'; they also
occur in a number of other languages.

The key point I was making is that in many of these
languages, the points are NOT considered a part of the letter they are
'attached' to, but a separate letter, even if typographically connected.

David Brown

9 Jan 2022, 17:16:09
On 09/01/2022 18:52, Richard Damon wrote:
>
> On 1/9/22 7:33 AM, David Brown wrote:
>
>> A "grapheme" is a written mark - a letter, punctuation, accent, etc.,
>> that conveys meaning.  Sometimes it is useful to break them down,
>> sometimes it is useful to treat them separately.  For example, "é" can
>> considered as a single grapheme, or as a grapheme "e" followed by a
>> combining graphene "'" acute accent.  The same grapheme can match
>> multiple code points - a Latin alphabet capital A is the same as a Greek
>> alphabet capital Alpha.
>>
>> A "glyph" is a rendering of a grapheme - the letter "A" in different
>> fonts are different glyphs of the same grapheme.
>>
>> What the reader perceives as a "character" is often a single grapheme,
>> but might be several graphemes together.
>>
>
> No, a grapheme, from my understanding, is the character as perceived by
> the readed. Thus the adding of accents to a base character builds a
> single grapheme from several codepoints.

The letter "o" is a grapheme, and an umlaut accent " is a grapheme. The
combination ö may be considered a single grapheme, or a combination of
graphemes. A German reader might consider it two graphemes - an
accented letter "o". A Swedish reader would consider it to be one
grapheme, as "ö" is a distinct letter in Swedish.

>
> The grapheme dosn't include 'font' information like which font to use,
> the size, additions like bold or italics and such, which add on to make
> the final glyph, but does include all the jots and tildes that are part
> of the character.

Correct. These are included in the glyph - the actual ink pattern on
the page.

>
> On the other hand, some languages add things like 'vowel points' to
> characters, and those are seperate graphemes even though they are added
> by a similar manner. This comes down to what the original language
> though of as a 'character', which just makes things even more complicated.
>

Yes, I think that is correct. (And "it's complicated" is /certainly/
correct!)

David Brown

9 Jan 2022, 17:23:21
That's true. The opposite is true also - a ligature like fi might be
typographically connected (depending on the font and typesetting),
despite being two independent letters and two characters. Unicode has a
number of code points for such ligatures, but there are many more that
are sometimes used in typography, especially historical documents.
(Unicode is missing an "fj" ligature, for example.)

And while "æ" is a ligature of two letters forming a diphthong (used in
Latin and related languages), it is an independent letter in Norwegian.


Öö Tiib

9 Jan 2022, 17:35:42
On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> On 1/9/2022 1:44 PM, Öö Tiib wrote:
> > On Sunday, 9 January 2022 at 03:17:10 UTC+2, Manfred wrote:
> >> On 1/8/2022 11:50 PM, Öö Tiib wrote:
> >>> On Saturday, 8 January 2022 at 20:49:01 UTC+2, james...@alumni.caltech.edu wrote:
> >>>> On Saturday, January 8, 2022 at 11:17:53 AM UTC-5, Öö Tiib wrote:
> >>>>> On Saturday, 8 January 2022 at 06:52:33 UTC+2, james...@alumni.caltech.edu wrote:
...
> >
> >>>>> FILE *f = fopen( "Foo😀Bar.txt", "w");
> >>>>> That should work unless underlying file system does not support files
> >>>>> named "Foo😀Bar.txt" If it supports but the code does not work then it indicates
> >>>>> bad standard that allows implementations to weasel away. No garbage like
> >>>>> u8fopen( u8"Foo😀Bar.txt", "w") coming somewhere maybe in C35 or so is
> >>>>> needed as it already works like in my example on vast majority of things.
> >>>>
> >>>> Nothing in the standard prevents an implementation from doing that. If one doesn't
> >>>> already do so, that's a choice made by the implementors, and you should ask them
> >>>> about it. Your real beef is with the implementors, not the standard.
> >>>
> >>> My beef is with standards. Adding garbage that does not work to standard is wrong
> >>> and not adding what everybody at least half sane does use to standard is also wrong.
> >>>
> >> Also agreed, but since utf-8 is transparent to ascii functions, what
> >> should have been added?
> >
> > Something that makes it clear that it is defect when "Foo😀Bar.txt" is silently opened
> > on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> >
> Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
> representation, what's the difference? One form or the other shows up
> only when it is displayed in some UI - the filesystem isn't one, which
> leads to the implementation's runtime behavior.

How do you mean, the same binary representation? Both "Foo😀Bar.txt" and
"Foo≡ƒÿÇBar.txt" can be files in the same directory. Both have Unicode
names in the underlying file system precisely as posted.

> If they are actually different in their binary sequence, and this is the
> result of the utf-8 string being wrongly converted multiple times, this
> looks like a bad implementation, rather than a problem with the standard.
> IIUC you are advocating for some statement in the standard that prevents
> implementations from messing up with "character sets" in null terminated
> char strings?

I mean that the standard should require that all char* text is treated as
UTF-8 by the standard library unless said otherwise. If an implementation needs
some other encoding for such byte sequences then it can provide
platform-specific functions or compiler switches and/or extend the language
with implementation-defined char_iso8859_1_t character types and
prefixes. If that turns out to be a noteworthy, handy type then add it to the
standards too, I don't care.

If the standard can define that overflow in signed atomics is well defined
and mandate two's complement there, then it can also define that all
char* text is UTF-8. The only question is whether what I suggest is reasonable
or not. From the viewpoint of standard library implementers and users it
is likely a blessing ... so I think it is a question of business/politics/religion.

Richard Damon

9 Jan 2022, 18:49:07
The difference is that these days, computers that can't support two's
complement but still want to support modern 'C' are effectively
non-existent.

The existence of machines that might still want to be able to support
non-UTF-8 strings is not.

Perhaps the biggest is the embedded market, where support beyond plain
ASCII isn't needed, and DEFINING that strings will follow UTF-8 rules
adds a LOT of complications for some operations that just aren't needed
on many of those systems.

The Standard does ALLOW a system to define char to be UTF-8 (at least
until you get into issues of what it requires for wide characters).

Öö Tiib

9 Jan 2022, 19:51:24
But do there exist machines that do want to support texts as char* but
do not want to support UTF-8? Describe those machines, give examples.

> Perhaps the biggest is the embedded market where needing to support

You mean tiny things like SD cards or flash sticks? I can store
"Foo😀Bar.txt" there just fine. Either an embedded system does not need to
communicate in char* text at all, can fully ignore its encoding, or has
to deal with UTF-8 anyway. I know of no other examples, despite having
participated in programming a whole pile of embedded systems over
the decades.

> beyond plain ASCII isn't needed, and DEFINING that strings will follow
> UTF-8 rules adds a LOT of complications for some operations that just
> aren't needed on many of the systems.

WHAT complications? Give examples. Both ASCII and UTF-8 are rows of
bytes that end with zero, and ASCII is a proper subset of UTF-8. Name a
use case where UTF-8 hurts. Human languages and typography
are horribly complicated, but UTF-8 itself is trivial. Either an embedded
system does not do linguistic analysis of poems, or if it does then
it needs to use Unicode anyway. But commonly, if it can't display
something then it shows � and is done.

> The Standard does ALLOW a system to define char to be UTF-8 (at least
> until you get into issues of what it requires for wide characters).

Allowing is apparently not enough, as the support rots in the standards.
Wide characters are wchar_t, char16_t and char32_t. Those are in a
horrible state too, but I ignore that for now; it is not related to the
issues with char* and is far less important in industry.


Manfred

9 Jan 2022, 20:20:06
I mean the same byte sequence in their name, but a different UI
representation, e.g. when decoded as UTF-8 or Windows-1252 or whatever.

What you are saying assumes a Unicode-aware filesystem, which is not free
from the point of view of the standard.
But, in order to support UTF-8, it would be enough to have a char-based
filesystem that treats names as plain 0-terminated char[]. That's
easier, probably free on most platforms, but it's different from
Unicode-aware (which could be UTF-16, as on Windows, and there you have
your problems).

>
>> If they are actually different in their binary sequence, and this is the
>> result of the utf-8 string being wrongly converted multiple times, this
>> looks like a bad implementation, rather than a problem with the standard.
>> IIUC you are advocating for some statement in the standard that prevents
>> implementations from messing up with "character sets" in null terminated
>> char strings?
>
> I mean that standard should require that all char* texts are treated as
> UTF-8 by standard library unless said otherwise. If implementation needs
> some other encoding of such byte sequence then it provides
> platform-specific functions or compiler switches and/or extends language
> with implementation-defined char_iso8859_1_t character types and
> prefixes. If it is noteworthy handy type then add it to standards too, I
> don't care.

I see this as hard to win, and probably not ideal - suppose in 10 years
some better encoding than UTF-8 shows up; then you are screwed again.

I'd rather stick to the fact that UTF-8 is compatible with 0-terminated
char[], and so a plausible wish would be that such strings are not
screwed up by the implementation; for example, when you store a file name in
a filesystem with fopen() and the name is given as char[], the
standard could mandate that reading back that same name as char[] gives
back the same byte sequence.

Currently, I guess, one could use a UTF-8 string as a name to fopen() on
Windows; the OS then assumes it is Windows-1252 and converts it into UTF-16,
at which point it is screwed, and when you read it back into char[] it
is garbage.

>
> If standard can define that overflow in signed atomics is well defined
> and two's complement is mandated there then it also can define that all
> char* texts are UTF-8. The only question is if what I suggest is reasonable
> or not. From viewpoint of implementer of standard library or users it
> is likely blessing ... so I think it is question of business/politics/religions.

I agree with Richard here. Two's complement is not like UTF-8.
I still think it's technical rather than business/politics/religion in
this case - as I said above, I'm not sure it would even be ideal.

Richard Damon

9 Jan 2022, 20:59:00
Small embedded micros with no need for large character sets.

>
>> Perhaps the biggest is the embedded market where needing to support
>
> You mean tiny things like SD cards or flash sticks? I can store there
> "Foo😀Bar.txt" just fine. Either embedded system does not need to
> communicate in char* text at all, can fully ignore its encoding or has
> to deal with UTF-8 anyway. I know of no other examples, despite I've
> participated in programming whole pile of embedded systems over
> the decades.
>

Many such systems communicate in command strings, maybe even with a
minimal TCP/IP stack, but have no need to process data beyond pure ASCII.

>> beyond plain ASCII isn't needed, and DEFINING that strings will follow
>> UTF-8 rules adds a LOT of complications for some operations that just
>> aren't needed on many of the systems.
>
> WHAT complications? Give examples? Both ASCII and UTF-8 are row of
> bytes that end with zero. ASCII is proper subset of UTF-8. Tell about
> use-case where UTF-8 hurts? Human languages and typography
> are horribly complicated but UTF-8 is genially trivial. Either embedded
> system does not do linguistic analyses of poems or if it does then
> it needs to use Unicode anyway. But commonly if it can't display
> something then it shows � and done.

Once char is defined to hold a multi-byte character set, wchar_t must be
big enough to hold any of those characters. If you just support
ASCII, then wchar_t can be just 8 bits if you want, and (almost?) all
the wchar_t stuff can just be an alias for the char stuff.

Thus forcing char to be UTF-8 adds a lot of complexity to the system.

>
>> The Standard does ALLOW a system to define char to be UTF-8 (at least
>> until you get into issues of what it requires for wide characters).
>
> Allowing is apparently not enough as the support rots in standards.
> Wide characters are wchar_t, char16_t and char32_t. These are
> in horrible state too but I ignore it for now. Not related to issues
> with char* and far less important in industry.
>

But that is part of the problem with supporting UTF-8, as that by
definition brings all the wide-character issues into play.

If you define that your character set is ASCII, then wchar_t becomes
trivial.

A big part of the issue with char16_t is that it is fundamentally broken
with respect to Unicode, but it lives on because of backwards-compatibility
bandaids that basically can't be removed without admitting that a large
segment of code will just live on as openly non-compliant.

Too much legacy code assumes that 16-bit characters are 'big enough' for
most people, and it pretty much does work if you aren't being a stickler for
full conformance to the rules, which no one is because you can't be.

Richard Damon

9 Jan 2022, 21:00:58
But that seems to imply that the file system keeps track of file-name
encoding at the entry level, and I don't know of any that do that.

Öö Tiib

9 Jan 2022, 21:08:23
Nope. NTFS, for example, stores file names as UTF-16, plus Windows uses hard links
to also give legacy Radix-50-style short (8.3) filenames to all files. Trivia
question: why is it named "Radix-50" when there are only 40 characters in it?

> What you are saying assumes a Unicode-aware filesystem, that's not free
> from the point of view of the standard.
> But, in order to support utf-8, it would be enough to have a char based
> filesystem that treats names as plain 0-terminated char[]. That's
> easier, probably free on most platforms, but it's different from
> Unicode-aware (which could be UTF-16 like Windows, and there you have
> your problems).

The problems were with Japanese Shift JIS or EUC encodings in file
systems... it was expensive to guess which one it was, and so they switched
to Unicode. With UTF-16 one needs to know or detect its endianness
... otherwise converting to UTF-8 and back is absurdly trivial.
Certainly less code than converting between Windows-1252 and UTF-16.
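
To back up the "absurdly trivial" part, a rough sketch of my own (assuming
valid input): combine each surrogate pair into a code point, then emit 1-4
UTF-8 bytes per code point.

#include <cstdint>
#include <string>

std::string utf16_to_utf8(const char16_t *s, std::size_t n) {
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n)                 // high surrogate...
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[++i] - 0xDC00);  // ...combine with low surrogate
        if (cp < 0x80) {
            out += char(cp);
        } else if (cp < 0x800) {
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }
    return out;
}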
Yes, something like that happens. Microsoft was amazingly innovative
and wanted to push all kinds of good things up until 1995. But then some
kind of browser and compiler and other incompatibility wars within its
own operating system started ... and its position started to shrink and
take damage. But it is their own business, so they have every right to burn
it however they please.

Ben Bacarisse

9 Jan 2022, 21:18:12
Mateusz Viste <mat...@xyz.invalid> writes:

> While UTF-8 is neat, it is also complex to decode. Even a simple
> strlen() can be challenging.

Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points, I
would interpret a "UTF-8 strlen" as being a function that counts the
number of encoded code points, and that's simple enough. Every byte,
before the null, that does not have 10 in its top two bits is the
start of a code point:

size_t ustrlen(char *s)
{
     size_t len = 0;
     while (*s) len += (*s++ & 0xc0) != 0x80;
     return len;
}

Obviously, for some uses, this is too simple as it does not detect
incorrect encodings.

--
Ben.

Ben Bacarisse

9 Jan 2022, 21:33:45
strcmp fails much closer to home (at least closer to my geographic home)
because, in Spanish, ch and ll are, traditionally, considered separate
letters. All c* words collate before any ch* words, and all l* words
before any ll* ones.

This has proved so inconvenient that I believe that the Real Academia
Española has ruled that, now, only ñ must be considered separately.
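
A small sketch of my own (it assumes an "es_ES.UTF-8" locale is installed) of
the distinction: strcmp() compares raw bytes, while strcoll() consults the
current locale's collation rules, which is where such language-specific
ordering lives.

#include <clocale>
#include <cstdio>
#include <cstring>

int main() {
    const char *a = "chico";
    const char *b = "curso";
    std::printf("strcmp : %d\n", std::strcmp(a, b));      // byte order: 'h' < 'u', so a sorts first
    if (std::setlocale(LC_COLLATE, "es_ES.UTF-8"))
        std::printf("strcoll: %d\n", std::strcoll(a, b)); // locale-defined; traditional Spanish rules
                                                          // would put "ch" after "cu"
    else
        std::puts("es_ES.UTF-8 locale not available");
}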

--
Ben.

Öö Tiib

9 Jan 2022, 21:40:38
You diligently avoid giving examples.
You mean that if it displays only Arabic digits then it needs only 10
characters, and if minus signs and dots too then 12. ASCII and UTF-8 are
identical for that processing, so UTF-8 adds no extra bytes to such a system.

> >
> >> Perhaps the biggest is the embedded market where needing to support
> >
> > You mean tiny things like SD cards or flash sticks? I can store there
> > "Foo😀Bar.txt" just fine. Either embedded system does not need to
> > communicate in char* text at all, can fully ignore its encoding or has
> > to deal with UTF-8 anyway. I know of no other examples, despite I've
> > participated in programming whole pile of embedded systems over
> > the decades.
>
> Many such system communicate in command strings, maybe even with a
> minimal TCP/IP but have no need for processing data beyond pure ASCII.

Same as with numbers: if there is no need to show the degree sign in 74.3°F,
then there is no need to process anything beyond pure ASCII. Otherwise the
software needs to recognize that the bytes C2 B0 mean ° and show it - also no biggie.

> >> beyond plain ASCII isn't needed, and DEFINING that strings will follow
> >> UTF-8 rules adds a LOT of complications for some operations that just
> >> aren't needed on many of the systems.
> >
> > WHAT complications? Give examples? Both ASCII and UTF-8 are row of
> > bytes that end with zero. ASCII is proper subset of UTF-8. Tell about
> > use-case where UTF-8 hurts? Human languages and typography
> > are horribly complicated but UTF-8 is genially trivial. Either embedded
> > system does not do linguistic analyses of poems or if it does then
> > it needs to use Unicode anyway. But commonly if it can't display
> > something then it shows � and done.
>
> Once you have your char as being defined as a Multi-Byte Character Set,
> then wchar_t must be big enough to hold any of them. If you just support
> ASCII, then wchar_t can be just 8 Bits if you want, and (almost?) all
> the wchar_t stuff can just be alias for the char stuff.
>
> Thus forcing char to be UTF-8 adds a lot of complexity to the system.

Most embedded systems that I have programmed used wchar_t for nothing,
so the compiler generated precisely 0 bytes of wchar_t processing
in the image that was flashed into them.

> >
> >> The Standard does ALLOW a system to define char to be UTF-8 (at least
> >> until you get into issues of what it requires for wide characters).
> >
> > Allowing is apparently not enough as the support rots in standards.
> > Wide characters are wchar_t, char16_t and char32_t. These are
> > in horrible state too but I ignore it for now. Not related to issues
> > with char* and far less important in industry.
> >
> But that is part of the problem with supporting UTF-8, as that by
> definiition brings in all the wide character issues into play.
>
> If you define that your character set is ASCII, then wchar_t becomes
> trivial.
>
> A big part of the issue with char16_t is that it is fundamentally broken
> with Unicode, but lives on due to trying to maintain the backwards
> bandaids that basically can't be removed without admitting that a large
> segment of code just will live as being openly non-complient.
>
> Too much legacy code assumes that 16 bit characters are 'big enough' for
> most people, and pretty much do work if you aren't being a stickler for
> full conformance to the rules, which no one is because you can't be.

But all of that is far from true. The wchar_t on Windows is 16 bits, yet Windows
supports UTF-16 fully, so a lot of characters take multiple wchar_t units
to represent. Microsoft just violates the standard with a straight face. And
I do not care about that; I care about UTF-8.

Richard Damon

9 Jan 2022, 22:14:04
My understanding is that for an MBCS the function strlen returns the number
of BYTES in the string, not the number of multi-byte characters in the
string.

This means that strlen can be used to determine how much space is needed
to store the string.

James Kuyper

9 Jan 2022, 22:26:33
On 1/9/22 5:41 AM, Po Lu wrote:
> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> writes:
>
>> No, I am quite accurately and honestly expressing my confusion. You
>> object to something being prohibited by the standards that is, to the
>> best of my understanding, allowed. It would make more sense if you
>> were objecting the fact that it isn't mandatory, and if you were
>> making such claims, I would disagree with you about whether it would
>> be a good idea to make it mandatory - but as far as I can tell, you're
>> claiming it isn't allowed.
>
> AFAICT, he's complaining about Microsoft's specific implementations of
> some standards.

He's repeatedly asserted that it's the standards themselves that he's
complaining about. His actual complaints, however, seem to be about
Microsoft-specific behavior. That's not a contradiction - he's
complaining about the fact that the standards allow that behavior.

Richard Damon

9 Jan 2022, 22:35:08
The problem is that once you define that your 'character set' is UTF-8,
and thus wide characters are wider than 8 bits, then a number of
mechanisms need to be provided by the library, and it can be hard to
keep that down to zero cost when it isn't used.

A big issue is locales. An ASCII-only system can easily just define very
crude locale support that is very cheap. Once you introduce UTF-8, it
becomes a very slippery slope that can make it hard to keep the size of
the library code brought in under control.

Remember, even something as simple as printf pulls in some locale code, even
if you don't actually ever set a locale.

>
>>>
>>>> Perhaps the biggest is the embedded market where needing to support
>>>
>>> You mean tiny things like SD cards or flash sticks? I can store there
>>> "Foo😀Bar.txt" just fine. Either embedded system does not need to
>>> communicate in char* text at all, can fully ignore its encoding or has
>>> to deal with UTF-8 anyway. I know of no other examples, despite I've
>>> participated in programming whole pile of embedded systems over
>>> the decades.
>>
>> Many such system communicate in command strings, maybe even with a
>> minimal TCP/IP but have no need for processing data beyond pure ASCII.
>
> Same as with numbers, if no need to show the degree in 74.3°F so no
> need for to process anything beyond pure ASCII. Otherwise the software
> needs to detect that there are bytes C2 B0 for to show ° also no biggie.

Again, the problem is that once you have defined that multi-byte
characters exist, things like printf will use locale support that might
pull in classification routines that may need to classify which
characters are 'digits' or 'letters' in the full Unicode range.

For a PC, with a large OS, that support is fairly cheap, and might even
be just built in, but in a small embedded system that can be costly.

I HAVE had systems that defined that characters were UTF-8 and the
result was I couldn't use a lot of the library because it pulled in too
much locale code to fit into my machine.


>
>>>> beyond plain ASCII isn't needed, and DEFINING that strings will follow
>>>> UTF-8 rules adds a LOT of complications for some operations that just
>>>> aren't needed on many of the systems.
>>>
>>> WHAT complications? Give examples? Both ASCII and UTF-8 are row of
>>> bytes that end with zero. ASCII is proper subset of UTF-8. Tell about
>>> use-case where UTF-8 hurts? Human languages and typography
>>> are horribly complicated but UTF-8 is genially trivial. Either embedded
>>> system does not do linguistic analyses of poems or if it does then
>>> it needs to use Unicode anyway. But commonly if it can't display
>>> something then it shows � and done.
>>
>> Once you have your char as being defined as a Multi-Byte Character Set,
>> then wchar_t must be big enough to hold any of them. If you just support
>> ASCII, then wchar_t can be just 8 Bits if you want, and (almost?) all
>> the wchar_t stuff can just be alias for the char stuff.
>>
>> Thus forcing char to be UTF-8 adds a lot of complexity to the system.
>
> Most embedded systems that I programmed used wchar_t for nothing.
> So the compiler generated precisely 0 bytes of wchar_t processing
> into image that was flashed into those.
>

The problem is that the mere fact that wchar_t is bigger than 8 bits
implies behavior for some character operations that needs to be taken
into account.

>>>
>>>> The Standard does ALLOW a system to define char to be UTF-8 (at least
>>>> until you get into issues of what it requires for wide characters).
>>>
>>> Allowing is apparently not enough as the support rots in standards.
>>> Wide characters are wchar_t, char16_t and char32_t. These are
>>> in horrible state too but I ignore it for now. Not related to issues
>>> with char* and far less important in industry.
>>>
>> But that is part of the problem with supporting UTF-8, as that by
>> definiition brings in all the wide character issues into play.
>>
>> If you define that your character set is ASCII, then wchar_t becomes
>> trivial.
>>
>> A big part of the issue with char16_t is that it is fundamentally broken
>> with Unicode, but lives on due to trying to maintain the backwards
>> bandaids that basically can't be removed without admitting that a large
>> segment of code just will live as being openly non-complient.
>>
>> Too much legacy code assumes that 16 bit characters are 'big enough' for
>> most people, and pretty much do work if you aren't being a stickler for
>> full conformance to the rules, which no one is because you can't be.
>
> But that all is far from true. The wchar_t on Windows is 16 bits but Windows
> supports UTF-16 fully so lot of characters take multiple wchar_t's
> to represent. Microsoft just violates standard with straight face. And
> I do not care about it. I care about UTF-8.
>

Right, the issue is that it is IMPOSSIBLE for Windows to properly
support wchar_t in a compatible manner, as the OS definitions go back to
the days when 16 bits were enough.

Windows basically plays the game that it only 'officially' supports the
BMP, using UCS-2, but as an extension supports UTF-16 and
characters beyond the BMP.

This is basically the same reason that __int128_t isn't officially an
extended integer type, as that would mean that intmax_t would need to be
128 bits, which it can't be due to the ABI.

These are practically holes in the Standard where the Committee didn't
quite think things through far enough. I suspect they figured that the
old OS interfaces would have died and been replaced with new, but there
is just too much legacy code still using them, and new code continues to
use the legacy interfaces, so it will never go away.

Öö Tiib

9 Jan 2022, 23:24:20
You must be specific.

> A big issue is locales. An ASCII only system can easily just define very
> crude locale support that is very cheap. Once you introduce UTF-8, it
> becomes a very slippery slope that can make it hard to keep the size of
> the library code brought in under control.
>
> Remember, even simple things like printf pulls in some locale code, even
> if you don't actaully ever set a locale.

No, that is a different topic, and fully implementation-defined. A conformant
implementation may implement only the locale named "C" and be done with it.
setlocale(), localeconv() and struct lconv can be trivial stubs that behave by
the letter of the standard and are never worth calling. It would be nice of an
implementer to support localization, but it is not really required and not
something I complain about.

> >
> >>>
> >>>> Perhaps the biggest is the embedded market where needing to support
> >>>
> >>> You mean tiny things like SD cards or flash sticks? I can store there
> >>> "Foo😀Bar.txt" just fine. Either embedded system does not need to
> >>> communicate in char* text at all, can fully ignore its encoding or has
> >>> to deal with UTF-8 anyway. I know of no other examples, despite I've
> >>> participated in programming whole pile of embedded systems over
> >>> the decades.
> >>
> >> Many such system communicate in command strings, maybe even with a
> >> minimal TCP/IP but have no need for processing data beyond pure ASCII.
> >
> > Same as with numbers, if no need to show the degree in 74.3°F so no
> > need for to process anything beyond pure ASCII. Otherwise the software
> > needs to detect that there are bytes C2 B0 for to show ° also no biggie.
>
> Again, the problem is that once you have defined that Multi-byte
> characters exist, things like printf will use locale support that might
> pull in classifaction routines that might needs to classify what
> characters are 'digits' or 'letters' in the full Unicode range.

Stay with UTF-8? It can keep the locale "en_US". It has to show □ for each
symbol missing from the font and � for each illegal UTF-8 byte sequence (which is
trivial to detect). There are likely zero fonts in existence with all Unicode
symbols, so a font with 20 symbols is fully conformant for an embedded
device that does not need to analyze Hebrew manuscripts but only to
show the temperature to a desperate housewife.
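
As a sketch of the "trivial to detect" claim (my own, not from the original
post): a UTF-8 validator only has to check lead bytes, continuation bytes,
overlong forms, surrogates, and the U+10FFFF upper bound.

#include <cstdint>

bool valid_utf8(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        std::uint32_t cp;
        int extra;
        if      (*p < 0x80)           { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return false;                                // bad lead byte
        ++p;
        for (int i = 0; i < extra; ++i, ++p) {
            if ((*p & 0xC0) != 0x80) return false;        // bad (or missing) continuation byte
            cp = (cp << 6) | (*p & 0x3F);
        }
        static const std::uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[extra]) return false;             // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate range
        if (cp > 0x10FFFF) return false;                  // beyond the Unicode range
    }
    return true;
}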

> For a PC, with a large OS, that support is fairly cheap, and might even
> be just built in, but in a small embedded system that can be costly.
>
> I HAVE had systems that defined that characters were UTF-8 and the
> result was I couldn't use a lot of the library because it pulled in too
> much locale code to fit into my machine.

It is easy to overburden an embedded system with unneeded code
for locale-specific processing for hundreds of countries and
dialects, but that is not the fault of the UTF-8 encoding of characters.

> >
> >>>> beyond plain ASCII isn't needed, and DEFINING that strings will follow
> >>>> UTF-8 rules adds a LOT of complications for some operations that just
> >>>> aren't needed on many of the systems.
> >>>
> >>> WHAT complications? Give examples? Both ASCII and UTF-8 are row of
> >>> bytes that end with zero. ASCII is proper subset of UTF-8. Tell about
> >>> use-case where UTF-8 hurts? Human languages and typography
> >>> are horribly complicated but UTF-8 is genially trivial. Either embedded
> >>> system does not do linguistic analyses of poems or if it does then
> >>> it needs to use Unicode anyway. But commonly if it can't display
> >>> something then it shows � and done.
> >>
> >> Once you have your char as being defined as a Multi-Byte Character Set,
> >> then wchar_t must be big enough to hold any of them. If you just support
> >> ASCII, then wchar_t can be just 8 Bits if you want, and (almost?) all
> >> the wchar_t stuff can just be alias for the char stuff.
> >>
> >> Thus forcing char to be UTF-8 adds a lot of complexity to the system.
> >
> > Most embedded systems that I programmed used wchar_t for nothing.
> > So the compiler generated precisely 0 bytes of wchar_t processing
> > into image that was flashed into those.
> >
> The problem is that the mere fact that wcahr_t is bigger than 8 bits,
> implies behavior for some character operations that need to take that
> into account.

Be specific? Paste source code that uses only char yet is affected by the
properties of wchar_t.
Yes, but char being a UTF-8 code unit cannot break any ABI. The char has
to have at least 8 bits by the standard, therefore a code unit fits. And a char*
string has to be a sequence of non-zero char values terminated
by a zero char value - precisely what UTF-8 is. So the change that I
propose does not cause ABI issues like an intmax_t or wchar_t change
would.

james...@alumni.caltech.edu

10 Jan 2022, 0:24:01
On Sunday, January 9, 2022 at 5:35:42 PM UTC-5, Öö Tiib wrote:
> On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> > On 1/9/2022 1:44 PM, Öö Tiib wrote:
...
> > > Something that makes it clear that it is defect when "Foo😀Bar.txt" is silently opened
> > > on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> > >
> > Assuming that "Foo≡ƒÿÇBar.txt" "Foo😀Bar.txt" have the same binary
> > representation, what's the difference? One form or the other shows up
> > only when it is displayed in some UI - the filesystem isn't one, which
> > leads to the implementation's runtime behavior.
> How you mean same binary representation? Both "Foo😀Bar.txt" and
> "Foo😀Bar.txt" files can be in same directory. Both have Unicode
> names in underlying file system precisely as posted.

Have you checked to make sure? Any system where passing the UTF-8 string
"Foo😀Bar.txt" to fopen() opens a file whose name displays as Foo≡ƒÿÇBar.txt
is likely to be a system where the file names are displayed using some single-
byte encoding. The UTF-8 encoding of "Foo😀Bar.txt" is

0X46 0X6F 0X6F 0XF0 0X9F 0X98 0X80 0X42 0X61 0X72 0X2E 0X74 0X78 0X74

After quite a bit of searching, I found Code page 865 (MS-DOS Nordic), which
has 0XF0 = '≡', 0x9F = 'ƒ' and 0X98 = 'ÿ'. If the utilities that you use to display file
names used that encoding to interpret the file name, that would explain your results.

Öö Tiib

10 Jan 2022, 1:11:08
Do you need a screenshot with files named "Foo😀Bar.txt" and "Foo≡ƒÿÇBar.txt"
side by side? It is just that for "Foo😀Bar.txt" one needs to use the non-standard
_wfopen( L"Foo😀Bar.txt", L"w"). That is the default behavior with both MSVC and
MinGW gcc.

People have apparently lamented about it, so the behavior of fopen can be
repaired with some butt-ugly XML file linked into the program in a specific way,
or by providing one next to your program named yourprogramname.xml.
That trick works from the Windows 10 May 2019 Update onwards.

But reading UTF-8 (for example a password) from the console is still impossible.
One has to write platform-specific code along these lines:
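// (Context for the snippet, my notes rather than part of the original post:
//  hi is the console input handle, e.g. from GetStdHandle(STD_INPUT_HANDLE),
//  and buf/len are the caller's UTF-8 output buffer and its size in bytes.)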

SIZE_T wbuf_len = (len - 1 + 2)*sizeof(*wbuf);
WCHAR *wbuf = HeapAlloc(GetProcessHeap(), 0, wbuf_len);
DWORD nread;
ReadConsoleW(hi, wbuf, len - 1 + 2, &nread, 0);
wbuf[nread-2] = 0; // truncate "\r\n"
int r = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, buf, len, 0, 0);
SecureZeroMemory(wbuf, wbuf_len);
HeapFree(GetProcessHeap(), 0, wbuf);

That way we have UTF-8 read into buf.

David Brown

10 Jan 2022, 3:02:46
In a more extreme case, in Norwegian "aa" is sometimes sorted very early
alphabetically, and sometimes very late as it is a transliteration of
the Norwegian letter "å", which is the last letter in our alphabet.

Sorting for human use (as distinct from, say, making a binary tree for
lookups, in which case pure data-based sorting is fine) is complicated
business!

Mateusz Viste

10 Jan 2022, 4:02:54
2022-01-10 at 02:18 +0000, Ben Bacarisse wrote:
> Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points,
> I would interpret a "UTF-8 strlen" as being a function the counts the
> number of encoded code points, and that's simple enough.
> (...)
> Obviously, for some uses, this is too simple as it does not detect
> incorrect encodings.

I'm not sure what your point was supposed to be. "It's simple to write
non-practical, prototype-grade code"? Yes, it is.

Now, I am not saying that writing a utf-8 strlen() is incredibly
difficult of course. I am only saying it is an extra layer of
complexity compared to UCS-2 or UTF-32. And that is why I understand
why people often choose to internally store strings in one of these
encodings instead of utf-8 (esp. if dealing with fixed-width character
outputs). It's simply easier to deal with an array of values that maps
directly to codepoints rather than parse a utf-8 string taking care not
to explode on encoding errors or edge cases.

Mateusz

Öö Tiib

10 Jan 2022, 4:36:56
My complaints are because several things happened to fall
together.

On one hand, the issues with electronic components caused a
shortage of prototype devices to run tests on, so some
cooperation partners decided to run unit tests on Windows
boxes. Since the C standard library on Windows is crappy, these
unit tests now mostly test ad-hoc hacks that simulate a proper
standard library, and pointless man-months have been wasted on those.

On the other hand, C++20 broke the u8 prefix, indicating a
dedication to push UTF-8 into that char8_t* garbage that no
standard library function uses. char8_t also breaks constexpr
processing of such text, as the casts to char are illegal in a
constexpr context.

It smells like some new family of non-standard functions coming soon,
in the style of _u8fopen(u8"Foo😀Bar.txt", u8"w"). The story of Annex
K repeated. Oh, I hope I am just being paranoid and the UCRT coming
with Visual Studio 2022 will be plain great, but what is the chance of
that?

Ben Bacarisse

10 Jan 2022, 6:45:19
Mateusz Viste <mat...@xyz.invalid> writes:

> 2022-01-10 at 02:18 +0000, Ben Bacarisse wrote:
>> Hmmm... Since UTF-8 is a multi-byte encoding of Unicode code points,
>> I would interpret a "UTF-8 strlen" as being a function the counts the
>> number of encoded code points, and that's simple enough.
>> (...)
>> Obviously, for some uses, this is too simple as it does not detect
>> incorrect encodings.
>
> I'm not sure what your point was supposed to be. "It's simple to write
> non-practical, prototype-grade code"? Yes, it is.

Yes, that was my point. A lot of people think UTF-8 is more complex
than it is so I think it helps to demystify it a bit.

--
Ben.

Ben Bacarisse

10 Jan 2022, 6:47:27
Yes. But that's not what "a simple strlen()" for UTF-8 appeared to be
referring to. After all, as you say, strlen /is/ strlen for UTF-8
strings.

--
Ben.

Malcolm McLean

10 Jan 2022, 7:34:38
Yes. It's not obvious from the name "strlen" what it should do when fed
UTF-8.

Öö Tiib

10 Jan 2022, 8:16:06
On Monday, 10 January 2022 at 11:02:54 UTC+2, Mateusz Viste wrote:
>
> Now, I am not saying that writing a utf-8 strlen() is incredibly
> difficult of course. I am only saying it is an extra layer of
> complexity compared to UCS-2 or UTF-32. And that is why I understand
> why people often choose to internally store strings in one of these
> encodings instead of utf-8 (esp. if dealing with fixed-width character
> outputs). It's simply easier to deal with an array of values that maps
> directly to codepoints rather than parse a utf-8 string taking care not
> to explode on encoding errors or edge cases.

That argument feels like the result of misinterpreting how UCS-2 and
UTF-32 glyphs are formed. Both encodings contain a variety of combining
characters, modifiers, accents and tabulators. So even with a monospaced
font (in a world where proportional fonts are more frequently used) one
can't decide the width of the result on screen without examining all
characters in the sequence. But if all characters of the sequence have to
be examined anyway, then UTF-8 is often just a bit less memory to examine.
Somehow it does not look like the other options are "simply easier".

Mateusz Viste

10 Jan 2022, 8:25:33
2022-01-10 at 05:15 -0800, Öö Tiib wrote:
> That argument feels like result of misinterpretation of formation of
> UCS-2 and UTF-32 glyphs. Both encodings contain variety of combining
> characters, modifiers, accents and tabulators. So even with
> monospaced font (in world where proportional fonts are more
> frequently used) one can't decide the width of result on screen
> without examining all characters in sequence.

That is true in a perfect world, yes. In practice, terminal-based
implementations often naively assume that 1 codepoint = 1 character on
screen. And this works fairly well, as far as I can tell, even if it's
not a 100% correct implementation.

I agree that a full-blown Unicode implementation can be quite complex
(handling text direction, combining characters, separators, control
characters, etc.). In such a context the extra complexity of parsing utf-8
strings may seem irrelevant. It all depends on what the implementation's
goal is, I guess.

Mateusz

Malcolm McLean

10 Jan 2022, 8:33:47
You can't really separate Unicode handling from font handling. And that
is notoriously difficult, even if you restrict yourself to English.

Richard Damon

10 Jan 2022, 8:38:21
And this is the reason that Unicode doesn't really meet the requirements
of a C 'Wide Character Type'. Wide characters are supposed to be 1
character = 1 storage unit. Because of combining characters Unicode
doesn't meet this requirement.

Ultimately, we have to live with it and accept that programming in the
face of full compliance with the rules of the character set is going to
add complexity.
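
A small illustration of that point: "é" written as 'e' plus a combining accent is
one user-perceived character but two code points, so it needs two storage units
even in UTF-32.

#include <cstdio>
#include <string>

int main()
{
    std::u32string s = U"e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT
    std::printf("%zu\n", s.size()); // prints 2, though it renders as one character
}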

james...@alumni.caltech.edu

10 Jan 2022, 12:01:31
On Monday, January 10, 2022 at 1:11:08 AM UTC-5, Öö Tiib wrote:
> On Monday, 10 January 2022 at 07:24:01 UTC+2, james...@alumni.caltech.edu wrote:
> > On Sunday, January 9, 2022 at 5:35:42 PM UTC-5, Öö Tiib wrote:
> > > On Sunday, 9 January 2022 at 18:34:25 UTC+2, Manfred wrote:
> > > > On 1/9/2022 1:44 PM, Öö Tiib wrote:
> > ...
> > > > > Something that makes it clear that it is defect when "Foo≡ƒÿÇBar.txt" is silently opened
> > > > > on file-system that fully supports files named "Foo😀Bar.txt" I suppose.
> > > > >
> > > > Assuming that "Foo≡ƒÿÇBar.txt" and "Foo😀Bar.txt" have the same binary
> > > > representation, what's the difference? One form or the other shows up
> > > > only when it is displayed in some UI - the filesystem isn't one, which
> > > > leads to the implementation's runtime behavior.
> > > How do you mean same binary representation? Both "Foo😀Bar.txt" and
> > > "Foo≡ƒÿÇBar.txt" files can be in the same directory. Both have Unicode
> > > names in the underlying file system precisely as posted.
> > Have you checked to make sure? Any system where passing the UTF-8 string
> > "Foo😀Bar.txt" to fopen() opens a file whose name displays as Foo≡ƒÿÇBar.txt
> > is likely to be a system where the file names are displayed using some single-
> > byte encoding. The UTF-8 encoding of "Foo😀Bar.txt" is
> >
> > 0X46 0X6F 0X6F 0XF0 0X9F 0X98 0X80 0X42 0X61 0X72 0X2E 0X74 0X78 0X74
> >
> > After quite a bit of searching, I found Code page 865 (MS-DOS Nordic), which
> > has 0XF0 = '≡', 0x9F = 'ƒ' and 0X98 = 'ÿ'. If the utilities that you use to display file
> > names used that encoding to interpret the file name, that would explain your results.
> Need a screenshot with files named "Foo😀Bar.txt" and "Foo≡ƒÿÇBar.txt"
> side-by-side? ...

No, that wouldn't be relevant. I realized shortly after I posted my message that you
might not understand what I meant by "Have you checked to make sure?" I started
composing a message in my head explaining in more detail. However, it was very late,
I had to get to bed, and when I checked this morning you'd already confirmed that you
didn't realize what I meant.

As I understand it, you've opened a file using "Foo😀Bar.txt" as the file name, and
somehow determined that the "actual" name of the file that got opened was
"Foo😀Bar.txt". You didn't specify, but I presume you reached that conclusion by
doing something like get a directory listing at the command line or using a GUI file
browser to look at the directory.

The UTF-8 encoding of "Foo😀Bar.txt" is

0X46 0X6F 0X6F 0XF0 0X9F 0X98 0X80 0X42 0X61 0X72 0X2E 0X74 0X78 0X74

Those same bytes, if interpreted using Code page 865, represent the string
"Foo≡ƒÿÇBar.txt". I know very little about Windows internals, so I'm not sure why that
might be relevant. It could be just a coincidence, but that seems excessively
unlikely. What I was suggesting, and what I think Manfred was hinting at, is that the
string you provide as the file name is stored by the file system using UTF-8 encoding.
Whatever method you used to determine the "actual" file name interpreted those
bytes using a single-byte encoding, which could be Code Page 865, or possibly some
other encoding that encodes those particular characters the same way as Code Page
865. There's a lot of different code pages out there, so I couldn't check them all, but of
the dozen or so I checked, that is the only one where 0xF0 represents '≡'. The "MS-
DOS Nordic" code page was one of the first ones I checked, based upon your e-mail
address oot...@hot.ee, where I presume "ee" refers to Estonia.

If that is indeed the case, consider what should happen if you try to open a file using
the name "Foo😀Bar.txt". If I understand you correctly, I believe that you would
expect it to open the same file that got opened when you specified "Foo😀Bar.txt".
However, the UTF-8 encoding of "Foo😀Bar.txt" is

0X46 0X6F 0X6F 0XE2 0X89 0XA1 0XC6 0X92 0XC3 0XBF 0XC3 0X87 0X42 0X61 0X72 0X2E 0X74 0X78 0X74

I expect that you will end up opening a different file. If the file name is being displayed
using a single-byte encoding, it should have 19 characters. If that encoding is in fact
Code Page 865, then that name should be "FooΓëí╞Æ├┐├çBar.txt". So, what result do
you get?

If the problem is in fact that the file name is being interpreted using a single-byte
encoding by whatever utility you're using to determine what the actual name is, then
there's absolutely nothing the standards can do about that - the behavior of any such
utility is completely outside the scope of either standard.

> ... Just that for "Foo😀Bar.txt" one needs to use non-standard
> _wfopen( L"Foo😀Bar.txt", L"w"). That is default both with MSVC and MinGW
> gcc.

As you say, it's non-standard. Therefore, nothing the C standard says could do
anything to constrain its behavior. If your complaint is indeed about the behavior of
_wfopen(), it's not relevant to either the C or C++ standards, and should be posted to a
Windows-specific forum.

Vir Campestris

10 Jan 2022, 16:35:55
On 10/01/2022 00:51, Öö Tiib wrote:
> But do there exist machines that do want to support texts as char* but
> do not want to support UTF-8? Describe those machines, give examples.

All the mainframes that run EBCDIC. There are a lot of them still.

Andy

Scott Lurndal

10 Jan 2022, 17:09:48
Here's the implementation guide for one:

https://public.support.unisys.com/aseries/docs/ClearPath-MCP-18.0/86002268-207.pdf

See appendix E for I18N.

Öö Tiib

10 Jan 2022, 21:03:56
Seems out of context, as these guys do not look like they want to upgrade even to
C99 or anything newer. But maybe around 2060 or so they will start to think about
the usefulness of UTF-8 too. EBCDIC is the usual, but irrelevant, red herring in
discussions like this.

Öö Tiib

10 Jan 2022, 21:42:25
Hmm. I think your analysis is correct. To clarify: wherever I refer to a file
named "Foo😀Bar.txt" or a file named "Foo≡ƒÿÇBar.txt", that is how Windows
(or another operating system, if I make the files reachable) displays the names of
the files. The string "Foo😀Bar.txt" is what I see in the text editor of the UTF-8
source code. Your bytes are correct. The Windows C runtime translates all char*
strings to UTF-16, so it all comes down to the algorithm it uses for that.
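
As a rough sketch of where the two interpretations of the same char* bytes diverge
(buffer sizes and variable names here are illustrative only):

const char name[] = "Foo\xF0\x9F\x98\x80" "Bar.txt"; // UTF-8 bytes of "Foo😀Bar.txt"
WCHAR w_ansi[64], w_utf8[64];
// Legacy interpretation in the active ANSI code page -- roughly what the
// narrow fopen()/CreateFileA() path does by default -- yields a mojibake name.
MultiByteToWideChar(CP_ACP, 0, name, -1, w_ansi, 64);
// Interpreting the same bytes as UTF-8 yields the intended UTF-16 name.
MultiByteToWideChar(CP_UTF8, 0, name, -1, w_utf8, 64);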

Öö Tiib

11 Jan 2022, 4:52:19
Unicode is quite successful in supporting the nuances of text that
people expect to see. So it is the most popular text format in the world.
If it contradicts the requirements of C then ... C has to change,
as the world is a lot harder to change.

Malcolm McLean

11 Jan 2022, 6:20:34
UTF-8 is designed so that programs written in C, as well as many other
programming languages, have a good chance of working correctly
even if not UTF-8 aware.
However, if you are displaying text and start by developing for an exclusively
English-speaking end user, then moving to non-English texts isn't always as
simple as changing the raster patterns of the glyphs. That's inherent in the
complexities of human languages and writing systems. Representing
strings in UTF-8 is just the start.
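
The classic illustration: no byte of a multi-byte UTF-8 sequence ever equals an
ASCII code, so byte-oriented routines keyed on ASCII delimiters keep working
unchanged (a sketch, assuming UTF-8 source and execution character sets):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[] = "Öö Tiib;😀;Bar"; /* UTF-8, ';'-delimited */
    /* strtok only looks for the ASCII byte ';', which cannot occur inside
       a multi-byte sequence, so the UTF-8 fields come through intact. */
    for (char *p = strtok(line, ";"); p; p = strtok(NULL, ";"))
        printf("[%s]\n", p);
    return 0;
}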

Vir Campestris

12 Jan 2022, 16:46:03
I'm not going to read the whole of that spec - but it does say the
compiler by default reads EBCDIC encoded source files (with a switch for
ASCII) and it suggests text data files are EBCDIC too.

I haven't programmed for an EBCDIC machine in 30 years - in fact I
didn't learn C until I stopped using them - but that doesn't mean they
don't exist.

I don't see why they are a red herring.

It would be a perfectly valid decision to target Windows only, or
Android only, but if you want to be truly portable EBCDIC should always
be in the back of your mind.

Andy

Öö Tiib

12 Jan 2022, 20:40:12
On Wednesday, 12 January 2022 at 23:46:03 UTC+2, Vir Campestris wrote:
> On 11/01/2022 02:03, Öö Tiib wrote:
> > On Tuesday, 11 January 2022 at 00:09:48 UTC+2, Scott Lurndal wrote:
> >> Vir Campestris <vir.cam...@invalid.invalid> writes:
> >>> On 10/01/2022 00:51, Öö Tiib wrote:
> >>>> But do there exist machines that do want to support texts as char* but
> >>>> do not want to support UTF-8? Describe those machines, give examples.
> >>>
> >>> All the mainframes that run EBCDIC. There are a lot of them still.
> >> Here's the implementation guide for one:
> >>
> >> https://public.support.unisys.com/aseries/docs/ClearPath-MCP-18.0/86002268-207.pdf
> >>
> >> See appendix E for I18N.
> >
> > Seems out of context as these guys do not look like wanting to upgrade to C99 or
> > something. But maybe at 2060 or so they start to think about usefulness of
> > UTF-8 too. The EBCDIC is usual, but irrelevant red herring in discussions like this.
> >
> I'm not going to read the whole of that spec - but it does say the
> compiler by default reads EBCDIC encoded source files (with a switch for
> ASCII) and it suggests text data files are EBCDIC too.
>
> I haven't programmed for an EBCDIC machine in 30 years - in fact I
> didn't learn C until I stopped using them - but that doesn't mean they
> don't exist.
>
> I don't see why they are a red herring.

I tried to explain why: the vendors of those platforms do not want to migrate even to
the more than 20-year-old C99 standard.

> It would be a perfectly valid decision to target Windows only, or
> Android only, but if you want to be truly portable EBCDIC should always
> be in the back of your mind.

"Truly portable" is another red herring. Everything is portable only between
certain platforms & versions. Windows program that one writes now most likely
does not run on Windows 2000. Android programs released today have always
list of Android version that these run on. Portability is achieved with certain
amount of work per platform supported. Some old platforms are not worth to
invest that work in.

Kaz Kylheku

13 Jan 2022, 18:13:13
On 2022-01-13, Öö Tiib <oot...@hot.ee> wrote:
> On Wednesday, 12 January 2022 at 23:46:03 UTC+2, Vir Campestris wrote:
>> It would be a perfectly valid decision to target Windows only, or
>> Android only, but if you want to be truly portable EBCDIC should always
>> be in the back of your mind.
>
> "Truly portable" is another red herring. Everything is portable only between

Indeed, "actually ported" beats "truly portable".

If you don't have a single test case in your system which covers EBCDIC,
or have written a few, but never had the opportunity to run them, then
the portability of the system to EBCDIC is only theoretical.

There could be snags in an actual porting effort that you have no clue
about.

When we write code, we have certain platforms in mind (and their
toolchains). Practical portable coding means having a few more kinds of
platforms in mind, without going overboard (such as ones that actually
exist and that some of the code might plausibly go to).

Where it is practical and easy, you make the coding decisions to be a
little more portable than actually required.

In the C language, a lot of this thinking is spent not on platform
concerns but on compiler concerns. We have to worry a lot less about EBCDIC
or 36 bit pointers (which will never happen to our code) than, say,
about how deeply some future compiler makes inferences based on the
assumption of well-defined behavior. A C programmer's portability brain
cycles are much more profitably (or less unprofitably) spent on these
language semantic issues: as a strategist, you have to fortify yourself
where you expect to be attacked!

Strategically, worrying about EBCDIC is like building a fortress against
rocks and arrows, when your enemy has long abandoned those and moved on
to rockets and grenades. Well kind of. Imagine if there was a way of
fortifying that works against rockets and grenades, but succumbs to
arrows and rocks ... but arrows and rocks won't happen because almost
nobody remembers them. :)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal