
wstring_convert


Bonita Montero

Dec 21, 2020, 2:14:55 PM
What are the means to convert UTF-8 strings to u16strings
in C++20? wstring_convert is deprecated.

Öö Tiib

Dec 22, 2020, 8:19:47 AM
On Monday, 21 December 2020 at 21:14:55 UTC+2, Bonita Montero wrote:
> What are the means to convert UTF-8 strings to u16strings
> in C++20? wstring_convert is deprecated.

There are only platform- or library-specific means. The Unicode standard is
supported differently (strictly speaking, incorrectly in various ways) by
platforms and libraries. In such a world it is impossible to support it
"correctly", so the C++ standard does not want to pretend that C++ somehow
does. See unicode.org for information about Unicode.

Richard Damon

Dec 22, 2020, 9:06:08 AM
On 12/21/20 2:14 PM, Bonita Montero wrote:
> What are the means to convert UTF-8 strings to u16strings
> in C++20? wstring_convert is deprecated.

Part of the issue is that wstring is not necessarily UTF-16 encoded, so
it might not be the right choice; its character type, wchar_t, might not
even be 16 bits wide.

Actually converting UTF-8 into UTF-16 isn't that hard to do, as it is a
simple matter to extract the next UCS-4 code-point out of a UTF-8 string
(a bit more complicated if you want to do all the suggested error checks
for malformed UTF-8, but still not that hard), and converting the UCS-4
code-point into UTF-16 is even simpler (just check if it is BMP or not
and write the value(s) out).
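
A minimal sketch of that approach (throwing on malformed input; a real
converter might emit U+FFFD or report the offset instead, and the function
name here is made up):

#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// Decode one code point from UTF-8, then emit it as a single UTF-16 code
// unit (BMP) or as a surrogate pair.
std::u16string utf8_to_utf16(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());

    for (std::size_t i = 0; i < in.size(); )
    {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;

        if      (b0 < 0x80)           { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k < len; ++k)
        {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates and out-of-range values.
        static constexpr char32_t min_for_len[] = { 0, 0x00, 0x80, 0x800, 0x10000 };
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        if (cp < 0x10000)             // BMP: one code unit
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else                          // supplementary plane: surrogate pair
        {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}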


Note that, technically, you may want to use a char16_t-based string instead
of wstring: wstring is based on wchar_t, which arguably ought to be a 32-bit
type, but for historical reasons is still 16 bits wide on Windows.

Bonita Montero

Dec 22, 2020, 9:12:31 AM
>> What are the means to convert UTF-8 strings to u16strings
>> in C++20? wstring_convert is deprecated.

> There are only platform- or library-specific means. The Unicode standard is
> supported differently (strictly speaking, incorrectly in various ways) by
> platforms and libraries. In such a world it is impossible to support it
> "correctly", so the C++ standard does not want to pretend that C++ somehow
> does. See unicode.org for information about Unicode.

I'm not talking about Unicode but about UTF-8.
UTF-8 isn't Unicode itself, just an encoding.

Bonita Montero

Dec 22, 2020, 9:14:51 AM
> Part of the issue is that wstring is not necessarily UTF-16 encoded, so
> it might not be the right choice; its character type, wchar_t, might not
> even be 16 bits wide.

There are u16string and u32string, which have UTF-16 and UTF-32 encoding.
And I'm not talking about a charset but about an encoding, which is
independent of a charset.

> Actually converting UTF-8 into UTF-16 isn't that hard to do, as it is a
> simple matter to extract the next UCS-4 code-point out of a UTF-8 string
> (a bit more complicated if you want to do all the suggested error checks
> for malformed UTF-8, but still not that hard), and converting the UCS-4
> code-point into UTF-16 is even simpler (just check if it is BMP or not
> and write the value(s) out).

Nevertheless, it would be nice to have this in the standard library.

Richard Damon

Dec 22, 2020, 9:28:27 AM
I think part of the issue is that, despite the name, u16string is NOT
required to be UTF-16 encoded, as it is just basic_string<char16_t>, and
char16_t is not required to use UTF-16. (In C there is a macro,
__STDC_UTF_16__, to indicate that it is, which I don't see in my C++
standard, but C++ still uses wording like "if the native encoding is UTF-16".)

It looks like codecvt can be used to make the conversion IF the
implementation uses UTF-16/UTF-32 for char16_t and char32_t.
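
For what it's worth, a sketch of that route using the
std::codecvt<char16_t, char, std::mbstate_t> facet that every locale has to
provide (that specialization is itself deprecated in C++20 in favour of a
char8_t-based one, but it is still required to exist; the helper name is
made up):

#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Converts UTF-8 (as char) to char16_t using the required locale facet.
// Error handling here is minimal.
std::u16string utf8_to_u16_via_facet(const std::string& utf8)
{
    using facet_t = std::codecvt<char16_t, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(std::locale::classic());

    // UTF-16 never needs more code units than UTF-8 has bytes.
    std::u16string out(utf8.size(), u'\0');
    std::mbstate_t state{};
    const char* from_next = nullptr;
    char16_t*   to_next   = nullptr;

    auto res = cvt.in(state,
                      utf8.data(), utf8.data() + utf8.size(), from_next,
                      out.data(),  out.data()  + out.size(),  to_next);
    if (res != facet_t::ok)
        throw std::runtime_error("conversion failed (invalid or incomplete UTF-8)");

    out.resize(static_cast<std::size_t>(to_next - out.data()));
    return out;
}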

James Kuyper

Dec 22, 2020, 2:19:38 PM
On 12/22/20 9:28 AM, Richard Damon wrote:
...
> I think part of the issue is that, despite the name, u16string is NOT
> required to be UTF-16 encoded, as it is just basic_string<char16_t>, and
> char16_t is not required to use UTF-16.

Citation, please? A search of every occurrence of "UTF-16" in the C++
standard leaves me with the impression that every function in the C++
standard library that has a specialization for char16_t that interprets
objects of that type is required to interpret them as parts of a UTF-16
string. What did I miss?

> (In C there is a macro, __STDC_UTF_16__, to indicate that it is, which I
> don't see in my C++ standard, but C++ still uses wording like "if the native
> encoding is UTF-16".)

A search for "native encoding is" doesn't get any hits. The phrase
"native encoding" occurs in only three places in the standard:

29.11.7.2.2p1: refers to the native encoding of "ordinary character
strings" and "wide character strings", or char and wchar_t respectively.
It says nothing about char16_t.
D.23p4 talks about u8path(), which converts from utf8 encodings to the
native encoding for filenames.

I'm using n4860.pdf, 2020-03-31 as my reference.

Richard Damon

Dec 22, 2020, 2:41:35 PM
Looking at the change log for C++20, one of the changes is:

> guarantee that char16_t and char32_t literals are encoded as UTF-16 and UTF-32 respectively

so this is a new requirement of the Standard.

daniel...@gmail.com

Dec 22, 2020, 4:04:29 PM
On Tuesday, December 22, 2020 at 9:28:27 AM UTC-5, Richard Damon wrote:
>
> It looks like codecvt can be used to make the conversion IF the
> implementation uses UTF-16/UTF-32 for char16_t and char32_t.

The entire header <codecvt> has been deprecated as of C++17. The
std::codecvt template from <locale> hasn't been deprecated, but
all the standard conversion facets have been.

Daniel

Öö Tiib

Dec 22, 2020, 4:56:49 PM
On Tuesday, 22 December 2020 at 16:06:08 UTC+2, Richard Damon wrote:
> Actually converting UTF-8 into UTF-16 isn't that hard to do, as it is a
> simple matter to extract the next UCS-4 code-point out of a UTF-8 string
> (a bit more complicated if you want to do all the suggested error checks
> for malformed UTF-8, but still not that hard), and converting the UCS-4
> code-point into UTF-16 is even simpler (just check if it is BMP or not
> and write the value(s) out).

Converting UTF-8 into UTF-16 is simple only if it is correct (in some
manner of "correct") UTF-8. What to do when it is incorrect (in some sense
of "incorrect")? Close the application? But it was "only" text, shame on you.

Bonita Montero

Dec 22, 2020, 4:58:30 PM
> Converting UTF-8 into UTF-16 is simple only if it is correct (in some
> manner of "correct") UTF-8. What to do when it is incorrect (in some sense
> of "incorrect")? Close the application? But it was "only" text, shame on you.

What do you mean by "incorrect"?

Richard Damon

Dec 22, 2020, 5:30:50 PM
A couple of quick things that can make a byte sequence not valid UTF-8:

1) The number of bytes the first byte says the code-point will have
doesn't match the number of bytes it does have.

2) The first byte of the string isn't a valid first byte of a UTF-8
sequence (e.g., it has a value between 0x80 and 0xBF, the range reserved
for subsequent bytes).

3) The byte sequence is NOT the minimal length for that value. Some
variants of UTF-8 allow NUL to be encoded as the overlong form 0xC0 0x80,
but allowing overlong encodings in general opens the door to exploits, and
the standard says they should be rejected.

4) A UTF-8 sequence that decodes to a value greater than 0x0010FFFF
should be treated as invalid (and can't be converted to UTF-16).


There is a code point U+FFFD (Replacement Character) reserved for this
sort of error.

Öö Tiib

Dec 22, 2020, 5:33:25 PM
Only a subset of byte sequences is valid UTF-8 or valid UTF-16; the rest
are invalid. By "incorrect" I meant invalid Unicode that is treated as valid
by one library or platform or another. The details are not hard to find;
even Wikipedia mentions a couple of such cases.

Öö Tiib

Dec 22, 2020, 5:43:19 PM
On Wednesday, 23 December 2020 at 00:30:50 UTC+2, Richard Damon wrote:
>
> There is a code point U+FFFD (Replacement Character) reserved for this
> sort of error.

That character is used to make "incorrect unicode" produced in your
product to look ugly in competitor's product that technically validates
it correctly.

daniel...@gmail.com

Dec 22, 2020, 10:32:43 PM
It's very difficult to understand your point. Are you suggesting that the
fstream family should only be used to read bytes, with the decoding of
those bytes always left to a higher level? And perhaps that all the fstream
variants that perform non-binary reads, including wfstream, be deprecated
as well? Of course that would break existing code, as has the deprecation
of the codecvt header and the standard conversion facets.

Daniel

Alf P. Steinbach

Dec 23, 2020, 12:33:55 AM
Very true.

I remember running into issues with g++ versus Visual C++ when using the
standard library's functionality:

* MinGW g++ produced big endian 16-bit values while Visual C++ produced
little endian.
* One of them consumed the next byte on error while the other didn't.

As I recall, for the first point g++ was wrong, while for the second
Visual C++ was wrong, so there was no way to write portable code without
at least some checking of results and adaptation to the compiler.

- Alf

Öö Tiib

Dec 23, 2020, 4:28:04 AM
On Wednesday, 23 December 2020 at 05:32:43 UTC+2, daniel...@gmail.com wrote:
> On Tuesday, December 22, 2020 at 5:43:19 PM UTC-5, Öö Tiib wrote:
> > On Wednesday, 23 December 2020 at 00:30:50 UTC+2, Richard Damon wrote:
> > >
> > > There is a code point U+FFFD (Replacement Character) reserved for this
> > > sort of error.
> > That character is used to make "incorrect unicode" produced in your
> > product to look ugly in competitor's product that technically validates
> > it correctly.
> It's very difficult to understand your point.

That can be, as the problem is tricky. I'll try to elaborate, but you may not
like my point. It is a question of how we can layer our software.

For example addition of two signed integers is undefined behavior for
quarter of input values. Not implementation-defined. Not even unspecified
result. Full whopping bear trap in one of most elementary operations.
So to make software that behaves in some sane manner in hands of
distressed housewives we are expected to make some kind of layer
before signed addition that ensures that the input is not in that range
of undefined behavior. That is doable, no problems.
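
A minimal sketch of such a layer, checking the operands before the addition
ever happens (helper name made up):

#include <limits>
#include <optional>

// A small "layer before signed addition": refuse inputs that would overflow
// instead of letting the addition itself be undefined behavior.
std::optional<int> checked_add(int a, int b)
{
    if ((b > 0 && a > std::numeric_limits<int>::max() - b) ||
        (b < 0 && a < std::numeric_limits<int>::min() - b))
        return std::nullopt;   // would overflow: report it rather than compute
    return a + b;
}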

With functions that are meant to take their input from outside the
program, the committee cannot use the same pattern and put such
undefined behavior on the shoulders of programmers. It is impossible
to write input-checking layers before the input itself. They can standardize
whatever nonsense they like, but a clear DoS attack or worse feels like too
much even for them.

> Are you suggesting that the
> fstream family should only be used to read bytes, with the decoding of
> those bytes always left to a higher level?

That is what I'm doing in practice. Anything that comes from outside
is dirty data consisting of the full range of possible byte values.

> And perhaps that all the fstream
> variants that perform non binary reads, including wfstream, be deprecated
> as well?

Either define the behavior fully for all possible input bytes, or admit the
incapability and deprecate the tools that are useless.
Implementation vendors *want* the result to be non-portable.

> Of course that would break existing code, as has the deprecation
> of the codecvt header and the the standard conversion facets.

AFAIK the header is named <locale>. I don't know how to use it
in a sane manner. It slows everything down but is not useful for
i18n or portability. It exists to feed a pile of newbie questions in the
style of "Why does my garbage parse CSV files wrongly in the hands of my
German customers?" Quite a pointless feature, in my experience.



Bonita Montero

Dec 23, 2020, 4:38:39 AM
> Only a subset of byte sequences is valid UTF-8 or valid UTF-16; the rest
> are invalid. ...

Examples ?

Öö Tiib

Dec 23, 2020, 5:03:16 AM
Can you read only 15 first words from posts?

I already said that Wikipedia even paints the examples *red*
<https://en.wikipedia.org/wiki/UTF-8#Codepage_layout>

Full docs are at unicode.org.
<https://www.unicode.org/versions/Unicode6.0.0/>

Richard Damon

Dec 23, 2020, 7:41:06 AM
The key point is that UTF-8 has a distinct syntax for how byte sequences
form a valid code-point.

Some simple examples:

1) String starts with 0x80 (This is a following byte, without a leading
byte)

2) String: 0xc0 0x40 (First byte says 2 byte code-point, next byte isn't
a following byte but a single byte code)

3) String: 0xC0 0x8F 0x8F (the first byte says a 2-byte code-point, but 2
following bytes come after it, as for a 3-byte code-point)

There are also some more semantic errors, like the sequence 0xC0 0x81,
which would encode the code-point U+0001, but that should be encoded as
just 0x01; the Unicode standard says this should be treated as an error.

You could also encode a value greater than U+10FFFF, which is an error;
UTF-8 was originally defined to allow encodings longer than 4 bytes, until
UTF-16 compatibility led to redefining it to limit code points to at most
U+10FFFF.

UTF-8 was designed to allow simple operations on the data. No code-point's
encoding is a sub-sequence of another's, and it is easy to start in the
middle of a string and find the beginning of the next or previous code-point.
It is NOT optimized for minimum data storage (but it is not that bad). There
are enough rules about what is valid that testing whether a string/file is
UTF-8 encoded by just checking its validity works pretty well (pure ASCII
passes, and the odds that 8-bit code-page data using the upper characters
will pass are very small).
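
That self-synchronization is essentially a two-line operation; an
illustrative helper (name made up, not from any library):

#include <cstddef>
#include <string_view>

// From an arbitrary byte offset, back up over continuation bytes (10xxxxxx)
// to find the start of the code point that contains that offset.
std::size_t codepoint_start(std::string_view s, std::size_t pos)
{
    while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
        --pos;
    return pos;
}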

Richard Damon

Dec 23, 2020, 7:44:27 AM
Neither big-endian nor little-endian UTF-16 is 'wrong'; it may not be the
encoding you want, but it isn't flat-out wrong. (One is actually called
UTF-16BE and the other UTF-16LE.)

Now, if it claimed that it was generating native-endian UTF-16, then
producing the other endianness would be wrong.

Bonita Montero

Dec 23, 2020, 8:03:23 AM
>> Examples ?

> Can you read only 15 first words from posts?

UTF-8 is independent of invalid Unicode code-points.

daniel...@gmail.com

Dec 23, 2020, 8:31:34 AM
On Tuesday, December 22, 2020 at 4:56:49 PM UTC-5, Öö Tiib wrote:

> Converting UTF-8 into UTF-16 is simple only if it is correct (in some
> manner of "correct") UTF-8. What to do when it is incorrect (in some sense
> of "incorrect")? Close the application? But it was "only" text, shame on you.

If the question is, what about if the input you thought was UTF-8 is not in
fact UTF-8, well, what about it? What about if the input you thought was a
JPEG file is not in fact a JPEG file?

With errors, you can attempt to recover, or fail. You can be as lenient or
strict as the utilities you're using allow you to be.

Daniel

Juha Nieminen

Dec 23, 2020, 8:56:14 AM
Öö Tiib <oot...@hot.ee> wrote:
> For example addition of two signed integers is undefined behavior for
> quarter of input values. Not implementation-defined. Not even unspecified
> result. Full whopping bear trap in one of most elementary operations.

I believe this is a combination of "you don't pay for what you don't need"
and wanting to support a wide variety of possible architectures at the
same time.

There may be some more exotic CPU architecture where an integer overflow
or underflow causes a CPU interrupt, for example. The standard doesn't
want to force the compiler to add extraneous validity checks to every
single arithmetic operation by demanding that eg. addition always gives
some result (rather than, for example, crashing the program). Such
checks might be nice in the very rare cases where you really need them,
but in the vast majority of cases they would only make the program
slower for literally zero benefit.

Even if there would be a check, what exactly should the compiler do?
If it detects an overflow it simulates the result of unsigned 2's
complement arithmetic (cast back to signed afterwards)? Return the
maximum value? Something else? What exactly should the standard
mandate it to do?

It's just easier for the standard to say to the compiler developers
"do whatever you want in this case".

Of course then it's the responsibility of the programmer to be aware that
relying on a particular behavior for signed integer overflow is
technically speaking non-portable. Not that it matters much in practice.

Richard Damon

Dec 23, 2020, 12:31:27 PM
Yes, there have been processors where the result of a signed arithmetic
overflow was a processor trap, and others where the result might be
something like a clamped value. So defining what should happen was
going to be expensive on some platform. They also hadn't gone with the
phrase "an unspecified value or an implementation-defined trap", which
would have been significantly more restrictive than just general undefined
behavior.

It also turns out that for most reasonable applications it isn't too
hard to avoid the problem; you generally know the expected magnitudes of
the values and can use a type that can handle that range.

It also turned out that some useful optimizations were possible if the
compiler could assume that signed overflow did not happen, and this
could give a noticeable speed advantage in some cases.

There is also the interesting fact that where there is enough demand for
a specified behavior for signed overflow, an implementation can provide
that behavior in a way that it documents, and in fact many
implementations DO provide an option for signed overflow to just
generate the wrap-around behavior that a typical 2's complement
processor produces.

Öö Tiib

Dec 23, 2020, 1:45:24 PM
On Wednesday, 23 December 2020 at 15:56:14 UTC+2, Juha Nieminen wrote:
> Öö Tiib <oot...@hot.ee> wrote:
> > For example addition of two signed integers is undefined behavior for
> > quarter of input values. Not implementation-defined. Not even unspecified
> > result. Full whopping bear trap in one of most elementary operations.
> I believe this is a combination of "you don't pay for what you don't need"
> and wanting to support a wide variety of possible architectures at the
> same time.

Yes. That was an example of a solvable problem. I ended the paragraph
with "That is doable, no problems."

> There may be some more exotic CPU architecture where an integer overflow
> or underflow causes a CPU interrupt, for example. The standard doesn't
> want to force the compiler to add extraneous validity checks to every
> single arithmetic operation by demanding that eg. addition always gives
> some result (rather than, for example, crashing the program). Such
> checks might be nice in the very rare cases where you really need them,
> but in the vast majority of cases they would only make the program
> slower for literally zero benefit.

Yes. If there would be such architecture then programmers would
pay more attention to "So to make software that behaves in some sane
manner in hands of distressed housewives we are expected to make
some kind of layer before signed addition that ensures that the input
is not in that range of undefined behavior. " that I wrote.

> Even if there would be a check, what exactly should the compiler do?
> If it detects an overflow it simulates the result of unsigned 2's
> complement arithmetic (cast back to signed afterwards)? Return the
> maximum value? Something else? What exactly should the standard
> mandate it to do?

Oh, easy. In the hypothetical case that I could wish something from
Joulupukki about that matter, I would perhaps ask for something like a
guaranteed SIGFPE there.
But I am still of the position that "That is doable, no problems" as
it is, too.

> It's just easier for the standard to say to the compiler developers
> "do whatever you want in this case".
>
> Of course then it's the responsibility of the programmer to be aware that
> relying on a particular behavior for signed integer overflow is
> technically speaking non-portable. Not that it matters much in practice.

No! It is incorrect to rely on any behavior in case of signed
overflow! Major compilers optimize assuming that there is no
signed overflow. Whole branches that depend on a particular behavior
of overflow can be erased by the optimizer. "So to make software that
behaves in some sane manner in hands of distressed housewives
we are expected to make some kind of layer before signed addition
that ensures that the input is not in that range of undefined behavior."
That is the sole way to handle it. All other ways of handling it by
the programmer can result in it outright blowing up in the hands of said
distressed housewives.

Öö Tiib

Dec 23, 2020, 1:53:14 PM
On Wednesday, 23 December 2020 at 15:03:23 UTC+2, Bonita Montero wrote:
> >> Examples ?
>
> > Can you read only 15 first words from posts?
> UTF-8 is independent of invalid Unicode code-points.

Can't you read even those 15? You replied to:
"Only a subset of byte sequences is valid UTF-8 or valid UTF-16; the rest
are invalid." Bytes can fail to be valid long before any code points
come into play.

Richard Damon

Dec 23, 2020, 2:04:54 PM
No, you do NOT need a 'layer' before signed addition; you need to design
your program not to overflow and to use right-sized numbers.

For instance, if you are working on a game with screen coordinates, you
can design it with a limited range of coordinates (say 0 - 8191); then
the difference of any two positions will fit into a 16-bit signed
number, and the sum of two to get a midpoint will also fit. If you need
a 2D distance, and will do it by a sum of squares, you know that the
squaring needs to be done in 32 bits.

With this sort of basic analysis, you can make sure you will never
generate a signed overflow; you just need to make sure early in the code
that your points stay in their allowed range, which is a natural part of
the problem.
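
A small sketch of that analysis, using the 8191 limit assumed above:

#include <cstdint>

// Coordinates are kept in [0, 8191], so any difference fits in 16 bits and
// the squared distance fits comfortably in a 32-bit signed integer.
std::int32_t dist2(std::int16_t x0, std::int16_t y0, std::int16_t x1, std::int16_t y1)
{
    std::int32_t dx = x1 - x0;    // at most +/-8191, cannot overflow
    std::int32_t dy = y1 - y0;
    return dx * dx + dy * dy;     // at most 2 * 8191^2, well below INT32_MAX
}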

Yes, sometimes you are working with an application that needs to deal
with more arbitrarily scaled numbers, but then error handling would likely
require doing those tests anyway to avoid giving wrong answers; you just
can't depend on known behavior of overflow to detect it, you need to
bounds-check first (or do the math in a larger type and check whether it
will fit before downcasting).

Öö Tiib

Dec 23, 2020, 2:19:42 PM
On Wednesday, 23 December 2020 at 15:31:34 UTC+2, daniel...@gmail.com wrote:
> On Tuesday, December 22, 2020 at 4:56:49 PM UTC-5, Öö Tiib wrote:
>
> > Converting UTF-8 into UTF-16 is simple only if it is correct (in some
> > manner of "correct") UTF-8. What to do when it is incorrect (in some sense
> > of "incorrect")? Close the application? But it was "only" text, shame on you.
> If the question is, what about if the input you thought was UTF-8 is not in
> fact UTF-8, well, what about it?

That depends what requirements say.

> What about if the input you thought was a
> JPEG file is not in fact a JPEG file?

That also depends what requirements say.

> With errors, you can attempt to recover, or fail. You can be as lenient or
> strict as the utilities you're using allow you to be.

So utilities in <locale> that do not let me implement what the requirements
say are useless and should be deprecated.


Öö Tiib

Dec 23, 2020, 2:40:36 PM
That is a logical layer, if the numbers cannot logically overflow.

> For instance, if you are working on a game with screen coordinates, you
> can design it with a limited range of coordinates (say 0 - 8191), then
> the difference of any two positions will fit into a 16 bit signed
> number, and the sum of two to get a midpoint will also fit. IF you need
> a 2d distance, and will do it by sum of squares, you know that squaring
> needs to be done in 32 bits.

A good example of logic in the real meta-programming layer (IOW, in the
programmer's brain)!

> With this sort of basic anaylsis, you can make sure you never will
> generate a signed overflow, you just need to early in the code make sure
> that your points stay in their allowed range, which is a natural part of
> the problem.
>
> Yes, sometimes you are working with an application that needs to deal
> with more arbitrary scaled numbers, but then error handling would likely
> require doing those test anyway to avoid giving wrong answers, you just
> can't depend on known behavior of overflow to detect it, you need to
> bounds check before (or do math in a larger type and check if it will
> fit before downcasting).

Yes. We happened to be discussing the handling of dirty data, like something
that we read that is supposed to be UTF-8 but might be invalid. We
cannot solve issues with numbers in some JSON file at that logic
layer that you gave such a good example of. We have to write code that
checks those ranges and does what is needed in case the check fails;
there is no way to feed those to the addition operator immediately.


Richard Damon

Dec 23, 2020, 2:41:31 PM
<locale> and such is reasonably good for OUTPUT. C does not provide a
great input method if you need to process defensively (which, if you
don't control the input, you should). It isn't that hard to write a small
package of input routines to handle the bad cases the way you want; the
problem is that what counts as a bad case, and what you want to do about
it, vary so much from place to place that making a standard library method
to do it is hard.


Öö Tiib

Dec 23, 2020, 3:36:03 PM
My impression is that <locale> is utterly insufficient for i18n in whatever
direction, or even as a building block of it, but let's just disagree there
as tastes vary.

Unicode is formally well-defined; the only issue with it is that it has gone
from 6.0.0 to 13.0.0 during the last 10 years. With such a moving target,
an indication of the supported version is sometimes desirable.

My take on any converter/filter is that based on data in input
sometimes it can be fully round-trip, sometimes it loses and/or
adds something and sometimes fails to convert.
Best is when it has some fully defined default behavior that
can be configured and also that it indicates whatever it did to
caller. How to react to each of those cases is then all up to caller.
We have C++ so I would love compile-time configurable but
dynamic is fine as bottle-neck is usually speed of channel or
media. I do not understand what is so tricky about it as I do it
all the time.
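
As an illustration only (all names here are invented, not from any
library), an interface along those lines might look like:

#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical converter with a defined, configurable default behavior that
// also tells the caller what it actually did.
enum class OnError { replace, stop, fail };

struct ConvertReport {
    std::u16string text;
    std::size_t    errors     = 0;     // number of invalid sequences encountered
    bool           round_trip = true;  // false if anything was lost or substituted
};

ConvertReport convert_utf8_to_utf16(std::string_view in,
                                    OnError policy = OnError::replace);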

Filter being "for output" you meant in sense that quality of its
input can be blamed on programmer? Also on case of "for
input" it can be blamed to some programmer ... just that
chances are that the programmer is more anonymous. In
both directions it is bad excuse for weak work.

Manfred

Dec 23, 2020, 4:13:32 PM
On 12/23/2020 7:45 PM, Öö Tiib wrote:
> On Wednesday, 23 December 2020 at 15:56:14 UTC+2, Juha Nieminen wrote:
>> Öö Tiib <oot...@hot.ee> wrote:
[...]
>
> Yes. If there would be such architecture then programmers would
> pay more attention to "So to make software that behaves in some sane
> manner in hands of distressed housewives we are expected to make
> some kind of layer before signed addition that ensures that the input
> is not in that range of undefined behavior. " that I wrote.
>
>> Even if there would be a check, what exactly should the compiler do?
>> If it detects an overflow it simulates the result of unsigned 2's
>> complement arithmetic (cast back to signed afterwards)? Return the
>> maximum value? Something else? What exactly should the standard
>> mandate it to do?
>
> Oh, easy. On hypothetical case that I can wish something from
> Joulupukki about that matter I would perhaps ask something like
> guaranteed SIGFPE there.
> But I am still in position that "That is doable, no problems." like
> it is too.
>
[...]
>
> No! That is incorrect to rely on any behavior on case of signed
> overflow! Major compilers optimize assuming that there are no
> signed overflow. Whole branches that depend on particular behavior
> of overflow can be erased by optimizer. "So to make software that
> behaves in some sane manner in hands of distressed housewives
> we are expected to make some kind of layer before signed addition
> that ensures that the input is not in that range of undefined behavior."
> That is the sole way to handle it. All other ways of handling it by
> programmer can result with it outright blowing up in hands of said
> distressed housewives.
>

Now, I must /really/ ask, Öö:
Is the wife of Joulupukki one distressed housewife?

That is really bugging me.

Richard Damon

Dec 23, 2020, 5:02:44 PM
locale was NEVER the complete i18n solution. It provides a few key
features, like the representation of numbers and currency, and does a
fairly decent job at that IF you feed it the information that is needed
(which is sometimes tougher to know). Yes, it doesn't handle things like
"I want numbers printed in European format, except for things intended to
be cut and pasted into Excel, which I have set up for a different format."
The programmer needs to figure out which locale set to use for what.

>
> Unicode is formally well-defined; the only issue with it is that it has gone
> from 6.0.0 to 13.0.0 during the last 10 years. With such a moving target,
> an indication of the supported version is sometimes desirable.

The thing to note is that Unicode has been incredibly backwards
compatible, and most of the 'changes' have been defining what new
code-points represent, which requires updating character classification
tables if (and only if) you want to process those new characters
'correctly'. These changes do NOT invalidate any older processing.
>
> My take on any converter/filter is that based on data in input
> sometimes it can be fully round-trip, sometimes it loses and/or
> adds something and sometimes fails to convert.

The 'illegal' Unicode that I have mentioned hasn't changed since the
VERY early days (once Unicode became a 21 bit character set). Valid (in
that sense) UTF-8 should fully round trip to UTF-16 or UCS-4 without any
changes.

Yes, an arbitrary string of bytes is very likely to be marked invalid,
and not round-trip. There are also some not-uncommon errors that
people make when encoding data which won't round-trip if handled strictly
(an application being strict is supposed to mark these cases with the
replacement character, but many will just silently 'fix' them).


> Best is when it has some fully defined default behavior that
> can be configured and also that it indicates whatever it did to
> caller. How to react to each of those cases is then all up to caller.
> We have C++ so I would love compile-time configurable but
> dynamic is fine as bottle-neck is usually speed of channel or
> media. I do not understand what is so tricky about it as I do it
> all the time.

The problem here is that range of possible desired error recovery is so
broad that it becomes unwieldy to implement.


>
> Filter being "for output" you meant in sense that quality of its
> input can be blamed on programmer? Also on case of "for
> input" it can be blamed to some programmer ... just that
> chances are that the programmer is more anonymous. In
> both directions it is bad excuse for weak work.
>

Output routines can specify their calling conditions, as the programmer
using them essentially has control over the data going into them. Yes, if
he doesn't meet the published requirements for the routine, he can be
'blamed' for them not working.

Input routines take some of their input from something at least
potentially outside of the control of the programmer using them. This
input potentially even comes from a source that is adversarial to the
program.

Specification for processing inputs can sometimes get quite involved,
especially if the input is possibly coming from an untrained user,
specifying not only the primary expected inputs, but possible variations
that users might try, as well as safeguards for dealing with hostile
input (sometimes you want to do more than just ignore it).

Yes, if the input is securely from a trusted source and known to be free
from errors, you can be a bit more lax with parsing, and perhaps some of
the simple input processing routines from the standard library can be used.

Öö Tiib

Dec 23, 2020, 8:14:07 PM
I NEVER said it was an i18n solution. I said it is utterly insufficient as
a building block of one, for my taste. I won't use <locale> even if paid
well, as life is short and there are more fun and fruitful things to do.
Frameworks that have implemented i18n support have also usually
side-stepped <locale>, so I'm not alone in that assessment. I already
indicated that we can disagree there, as it is not really my business
what you use for i18n (if anything). I just disagree that it is a good
idea.

> > Unicode is formally well-defined; the only issue with it is that it has gone
> > from 6.0.0 to 13.0.0 during the last 10 years. With such a moving target,
> > an indication of the supported version is sometimes desirable.
> The thing to note is that Unicode has been incredibly backwards
> compatible, and most of the 'changes' have been defining what new
> code-points represent, which requires updating character classification
> tables if (and only if) you want to process 'correctly' those new
> characters. These changes do NOT invalidate any older processing

I lost the point of your reasoning. So who does not want to process
things correctly? Or does not want to avoid passing things that a
cooperating module is incapable of processing correctly? That makes the
supported version sometimes desirable to ask and to tell ... and that is
all I said.

> > My take on any converter/filter is that based on data in input
> > sometimes it can be fully round-trip, sometimes it loses and/or
> > adds something and sometimes fails to convert.
> The 'illegal' Unicode that I have mentioned hasn't changed since the
> VERY early days (once Unicode became a 21 bit character set). Valid (in
> that sense) UTF-8 should fully round trip to UTF-16 or UCS-4 without any
> changes.

That is unimportant as it does not affect what you say after "but".
The rule that everything before "but" is bullshit applies.

> Yes, an arbitrary string of bytes is very likely to be marked invalid,
> and not round-trip. There are also some not-uncommon errors that
> people make when encoding data which won't round-trip if handled strictly
> (an application being strict is supposed to mark these cases with the
> replacement character, but many will just silently 'fix' them).

And here was the "but". Either such "fixes" should be illegal or we
have to run with the pack, shouldn't we? Otherwise the distressed housewives
discard our product as inferior to those "many". Or what do you suggest?

> > Best is when it has some fully defined default behavior that
> > can be configured and also that it indicates whatever it did to
> > caller. How to react to each of those cases is then all up to caller.
> > We have C++ so I would love compile-time configurable but
> > dynamic is fine as bottle-neck is usually speed of channel or
> > media. I do not understand what is so tricky about it as I do it
> > all the time.
> The problem here is that range of possible desired error recovery is so
> broad that it becomes unwieldy to implement.

That sounds like what those "many" implement is unwieldy
to implement. It isn't easy, but it is far easier than, for example,
supporting a reasonable subset of the forest of "Markdown format"
incarnations. Do you have some positive advice on what to do
instead?

> > Filter being "for output" you meant in sense that quality of its
> > input can be blamed on programmer? Also on case of "for
> > input" it can be blamed to some programmer ... just that
> > chances are that the programmer is more anonymous. In
> > both directions it is bad excuse for weak work.
> >
> Output routines can specify their calling conditions, as the programmer
> using them essentially has control over the data going into them. Yes, if
> he doesn't meet the published requirements for the routine, he can be
> 'blamed' for them not working.
>
> Input routines take some of their input from something at least
> potentially outside of the control of the programmer using them. This
> input potentially even comes from a source that is adversarial to the
> program.

Yes and yes. So what? If we have to make a filter that is helpful
with those adversarial sources, then why should we not use it in
both directions?

> Specification for processing inputs can sometimes get quite involved,
> especially if the input is possibly coming from an untrained user,
> specifying not only the primary expected inputs, but possible variations
> that users might try, as well as safeguards for dealing with hostile
> input (sometimes you want to do more than just ignore it).
>
> Yes, if the input is securely from a trusted source and known to be free
> from errors, you can be a bit more lax with parsing, and perhaps some of
> the simple input processing routines from the standard library can be used.

I do not trust even myself. I make typos constantly. This post probably has
several. So a filter that safeguards regardless of which direction the
communication goes helps to find and resolve any problems, and
<locale>, which can't carry its own weight, is useless.
