Is this safe?

Mut...@dastardlyhq.com

Feb 20, 2023, 11:45:52 AM

It occurred to me that I don't actually know the answer. It works, but will
it always work? Could it crash certain string implementations?

std::string s = "hello";
for(auto &c: s) c = toupper(c);
std::cout << s << std::endl;

Bonita Montero

Feb 20, 2023, 12:33:38 PM

Why should this not work ?

Chris Vine

Feb 20, 2023, 2:16:53 PM

The problem is that UTF-8 is now more or less ubiquitous for narrow
strings and std::toupper only works for UTF-8 strings guaranteed to
be in the ASCII subset.

Apart from the fact that in UTF-8, any one unicode code point can
require between 1 and 5 bytes (code units) to encode it, upper and
lower case representations can occupy different numbers of code
points. Since you are in Germany, one example is the eszett (ß),
which is one code point in lower case but two code points in upper
case (SS or SZ) by old/traditional orthography.
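
To make the point above concrete, here is a minimal sketch. The helper name `upper_bytewise` and the constant `strasse` are made up for illustration, and the program is assumed to still be in the default "C" locale. Applying `std::toupper` byte by byte changes the ASCII letters but leaves the two bytes that encode ß untouched, so the result is not "STRASSE".

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: upper-cases each byte independently.
// The cast to unsigned char keeps the argument in the range <cctype> requires.
std::string upper_bytewise(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

// "straße" in UTF-8, with ß (U+00DF) spelled out as the bytes 0xC3 0x9F.
// The literal is split so that 'e' is not read as part of the hex escape.
const std::string strasse = "stra\xC3\x9F" "e";
```

Calling `upper_bytewise(strasse)` in the "C" locale gives the bytes for "STRAßE": seven bytes in, seven bytes out, with the ß unchanged.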

Bonita Montero

Feb 20, 2023, 3:38:09 PM

On 20.02.2023 at 20:16, Chris Vine wrote:

> The problem is that UTF-8 is now more or less ubiquitous for narrow
> strings and std::toupper only works for UTF-8 strings guaranteed to
> be in the ASCII subset.

Muttley didn't talk about UTF-8. But toupper() also works with UTF-8.

Chris Vine

Feb 20, 2023, 3:57:05 PM

On Monday, 20 February 2023 at 20:38:09 UTC, Bonita Montero wrote:
> On 20.02.2023 at 20:16, Chris Vine wrote:
>
> > The problem is that UTF-8 is now more or less ubiquitous for narrow
> > strings and std::toupper only works for UTF-8 strings guaranteed to
> > be in the ASCII subset.

> ... But toupper() also works with UTF-8.

You are ill informed. It only works for the ASCII subset of UTF-8.

Tony Oliver

Feb 20, 2023, 7:14:43 PM

The OP specifically used an ASCII string; no mention was made of UTF-8.

Why did you need to bring it up?

Tony Oliver

Feb 20, 2023, 7:17:23 PM

And the example given in the OP is, indeed, ASCII.

Again, why are you (irrelevantly) bringing UTF-8 into this?

Manu Raju

Feb 20, 2023, 9:00:06 PM

On 21/02/2023 00:17, Tony Oliver wrote:
>
>
> Again, why are you (irrelevantly) bringing UTF-8 into this?

The question was "is this safe", and Chris told him indirectly that
it is not safe because it doesn't work for the full set of UTF-8. The OP
might not even have thought of UTF-8, but an answer has to be given to
the question asked.

Richard Damon

Feb 20, 2023, 9:19:06 PM

One possible issue is the type of c will be char& but if char is signed,
then if the string contains any characters with the sign bit set,
toupper will have undefined behavior.

You need to cast the value c to unsigned char before passing it to toupper
(which will then convert it to int).

This is a case where auto gives you problems, as if you intend to be
able to handle other versions of string built on types other than char,
you need to know the unsigned equivalent for them. The localized toupper
can handle that sort of operation (as it doesn't handle the -1 case, so
it takes a charT parameter, not an int).

As others have mentioned, it also assumes that your string is using a
single byte encoding for characters (like plain ASCII, or the old
"code-page" style text strings), as multi-byte encoding can't be handled
by this form of toupper().
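
As a sketch of the "localized toupper" route mentioned above: the two-argument `std::toupper` from `<locale>` takes the character type itself, so one template can serve `std::string`, `std::wstring` and similar types without a cast. The name `upper_in_locale` and the choice of the classic "C" locale as the default are assumptions made for this example only.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: upper-cases any std::basic_string using the
// two-argument std::toupper from <locale>, which takes the character type
// directly, so no cast to unsigned char is involved.
template <class CharT>
std::basic_string<CharT> upper_in_locale(
    std::basic_string<CharT> s,
    const std::locale& loc = std::locale::classic()) {
    for (CharT& c : s) {
        c = std::toupper(c, loc);
    }
    return s;
}
```

For instance, `upper_in_locale(std::string("hello"))` and `upper_in_locale(std::wstring(L"hello"))` both go through the same code path. It still converts one code unit at a time, so the multi-byte caveat applies here as well.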

Bonita Montero

Feb 20, 2023, 11:35:06 PM

On 20.02.2023 at 21:56, Chris Vine wrote:

>
> You are ill informed. It only works for the ASCII subset of UTF-8.

toupper only applies to a - z and thereby works with UTF-8.

Andrey Tarasevich

Feb 20, 2023, 11:38:40 PM

No, it is not safe in the general case. Functions from the <cctype> group
generally require either a non-negative argument or `EOF`. Otherwise,
the behavior is undefined.

This means that when you pass `char` values to these functions you
better be sure that your `char` is unsigned or, at least, that all
`char` values you are passing are non-negative.

If you are not sure of that, you'll be better off explicitly casting the
argument to `unsigned char`:

for(auto &c: s) c = toupper((unsigned char) c);

--
Best regards,
Andrey

James Kuyper

Feb 20, 2023, 11:49:04 PM

I think that the answer to the original question is that it is
definitely safe, because it contained only characters for which
toupper() is guaranteed to work (regardless of encoding). Chris was
incorrect to suggest otherwise.
However, Bonita asked "why should this not work?" - in other words, how
could there be any room for doubt? And the answer to that is indeed to
point out that, with a different string, it might not work, so it is a
legitimate question to ask if it will work with this particular string.
Note that it's unnecessary to invoke UTF-8 specifically; any encoding
that has MB_LEN_MAX > 1 can run into the same problem.

Öö Tiib

Feb 21, 2023, 3:05:44 AM

On Monday, 20 February 2023 at 21:16:53 UTC+2, Chris Vine wrote:
>
> Apart from the fact that in UTF-8, any one unicode code point can
> require between 1 and 5 bytes (code units) to encode it, upper and
> lower case representations can occupy different numbers of code
> points.

From November 2003 <https://datatracker.ietf.org/doc/html/rfc3629>
all 5- and 6-byte sequences were removed, so a UTF-8 code point is
now at most 4 bytes long. But as a "character" can be made of several code
points there are seemingly no limits. Flag of Scotland (🏴󠁧󠁢󠁳󠁣󠁴󠁿) takes 28
bytes.
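
The 28-byte figure can be checked with a small sketch. The encoder below follows the RFC 3629 layout (at most 4 bytes per code point). The function names are made up for illustration, and the sketch does not reject surrogates or values above U+10FFFF. The flag is assumed to be the usual seven-code-point sequence: WAVING BLACK FLAG, the tag letters "gbsct", and CANCEL TAG.

```cpp
#include <string>

// Appends the UTF-8 encoding of one code point (RFC 3629: at most 4 bytes).
// Sketch only: surrogates and values above U+10FFFF are not rejected.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Flag of Scotland: WAVING BLACK FLAG, tag letters "gbsct", CANCEL TAG.
std::string scotland_flag() {
    const char32_t cps[] = {0x1F3F4, 0xE0067, 0xE0062, 0xE0073,
                            0xE0063, 0xE0074, 0xE007F};
    std::string out;
    for (char32_t cp : cps) append_utf8(out, cp);
    return out;
}
```

All seven code points lie above U+FFFF, so each takes 4 bytes and `scotland_flag().size()` comes to 7 × 4 = 28.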

Mut...@dastardlyhq.com

Feb 21, 2023, 4:14:40 AM

Because I don't know where the reference is pointing to. It's been standard
with strings not to alter the contents returned by c_str(), so is this a
similar case or is it an analogue of just doing s[<index>], which is guaranteed
safe?

Mut...@dastardlyhq.com

Feb 21, 2023, 4:15:35 AM

I was concerned whether the code would corrupt the string or crash the
program. I couldn't give a f**k about UTF8.

Mut...@dastardlyhq.com

Feb 21, 2023, 4:18:19 AM

I can't believe I needed to explain what I meant by safe. I was not talking
about character encoding FFS, I simply used toupper as an example.

Ok, is this safe:

for(auto &c: s) c = 'x';

Or will it screw up the string object or have no effect in some cases just
as altering the contents returned by the c_str() pointer can sometimes have?

Chris Vine

Feb 21, 2023, 5:19:28 AM

On Tuesday, 21 February 2023 at 04:35:06 UTC, Bonita Montero wrote:
> toupper only applies to a - z ...

That's wrong. Its behaviour is locale specific, and for example with a
locale which uses the ISO-8859-1 codeset, will work for lower case
characters outside the a-z range which have an upper case
ISO-8859-1 representation.

Paavo Helde

Feb 21, 2023, 5:31:34 AM

We understood that. Alas, there is no need for such concerns, a
std::string owns its own memory and does not wrap any external memory.
So we are having fun by nitpicking unrelated aspects.

FYI: for wrapping there is std::string_view, but your example would not
compile with s/string/string_view/.
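
A short sketch of both halves of that statement, assuming C++17 (the function name `fill_with_x` is invented for this example). Writing through the loop reference changes the string's own storage, while the element type seen when iterating a `std::string_view` is `const char&`, which is why the assignment would be rejected at compile time.

```cpp
#include <string>
#include <string_view>
#include <type_traits>

// Writing through the loop reference modifies the string's own storage.
std::string fill_with_x(std::string s) {
    for (char& c : s) c = 'x';
    return s;
}

// Iterating a std::string yields char&; iterating a std::string_view yields
// const char&, so `c = 'x'` through it would not compile.
static_assert(std::is_same_v<decltype(*std::string{}.begin()), char&>);
static_assert(std::is_same_v<decltype(*std::string_view{}.begin()), const char&>);
```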

Bonita Montero

Feb 21, 2023, 8:15:22 AM

On 21.02.2023 at 11:19, Chris Vine wrote:

> On Tuesday, 21 February 2023 at 04:35:06 UTC, Bonita Montero wrote:

>> toupper only applies to a - z ...

> That's wrong. Its behaviour is locale specific, ...

This ...

template< class CharT >
CharT toupper( CharT ch, const locale& loc );

... is locale-specific, not C's toupper.

Ralf Goertz

Feb 21, 2023, 8:40:29 AM

On Tue, 21 Feb 2023 14:16:23 +0100
Bonita Montero <Bonita....@gmail.com> wrote:

“man toupper” disagrees:

NAME
toupper, toupper_l — transliterate lowercase characters
to uppercase

SYNOPSIS
#include <ctype.h>

int toupper(int c);
int toupper_l(int c, locale_t locale);

DESCRIPTION
For toupper(): The functionality described on this ref-
erence page is aligned with the ISO C standard. Any con-
flict between the requirements described here and the
ISO C standard is unintentional. This volume of
POSIX.1‐2017 defers to the ISO C standard.

The toupper() and toupper_l() functions have as a domain
a type int, the value of which is representable as an
unsigned char or the value of EOF. If the argument has
any other value, the behavior is undefined.

If the argument of toupper() or toupper_l() represents a
lowercase letter, and there exists a corresponding up-
percase letter as defined by character type information
in the current locale or in the locale represented by
locale, respectively (category LC_CTYPE), the result
shall be the corresponding uppercase letter.
…

Chris Vine

Feb 21, 2023, 8:51:25 AM

No. Here is what the standard says about toupper in <cctype>:

[cctype.syn]/1: "The contents and meaning of the header <cctype> are
the same as the C standard library header <ctype.h>."

C11, 7.4.2.2/3, the toupper function: "If the argument is a character
for which islower is true and there are one or more corresponding
characters, as specified by the current locale, for which isupper is
true, the toupper function returns one of the corresponding
characters (always the same one for any given locale) ..."

I don't have a more recent version of the C standard but I doubt
it has changed.

Paavo Helde

Feb 21, 2023, 8:58:36 AM

Seriously? A major headache with C is that a lot of functions like
toupper() or sprintf() are locale-specific, but the locale is
unpredictable and uncontrollable for library code loaded in a
multithreaded process.

Alf P. Steinbach

Feb 21, 2023, 9:12:56 AM

Demonstrates the need for making examples short and to the point, to not
introduce extraneous issues.

The assignment is safe.

`c` is a reference to an item in the string, within the string bounds,
i.e. it can't be a reference to the null-item that since C++11 is
effectively required past the end of the string.

Not what you're asking, but the common convention in C++ is to separate
the type specification from the variable, i.e. `auto& c` not `auto &c`.

But on the third and gripping hand, in this case I'd write `char& c`,
since there's no advantage in `auto` here and it obscures things.

- Alf

Mut...@dastardlyhq.com

Feb 21, 2023, 9:53:23 AM

On Tue, 21 Feb 2023 15:12:39 +0100
"Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:
>On 2023-02-21 10:18 AM, Mut...@dastardlyhq.com wrote:
>> Ok, is this safe:
>>
>> for(auto &c: s) c = 'x';
>>
>> Or will it screw up the string object or have no effect in some cases just
>> as altering the contents returned by the c_str() pointer can sometimes have?
>
>Demonstrates the need for making examples short and to the point, to not
>introduce extraneous issues.

3 lines is short and I didn't introduce any extraneous issues. Other posters
created them.

>`c` is a reference to an item in the string, within the string bounds,
>i.e. it can't be a reference to the null-item that since C++11 is
>effectively required past the end of the string.

Ok. However in the past with some implementations of std::string, modifying
the contents of the buffer returned by c_str() can have undefined consequences.
I simply wondered if this was a similar situation.

>Not what you're asking, but the common convention in C++ is to separate
>the type specification from the variable, i.e. `auto& c` not `auto &c`.

Whose common convention?

>But on the third and gripping hand, in this case I'd write `char& c`,
>since there's no advantage in `auto` here and it obscures things.

That's ironic coming from you. :)

Mut...@dastardlyhq.com

Feb 21, 2023, 9:55:31 AM

Why would multithreading make any difference? Environment variables such as
LC_CTYPE are process wide.

Andrey Tarasevich

Feb 21, 2023, 11:16:02 AM

On 02/21/23 1:18 AM, Mut...@dastardlyhq.com wrote:
> I can't believe I needed to explain what I meant by safe. I was not talking
> about character encoding FFS, I simply used toupper as an example.
>
> Ok, is this safe:
>
> for(auto &c: s) c = 'x';
>
> Or will it screw up the string object or have no effect in some cases just
> as altering the contents returned by the c_str() pointer can sometimes have?

What you do need to explain is the underlying roots/origins of your
question. For anyone familiar with the language the above question will
seem trivial, meaning that people will normally assume that you meant
something more elaborate than the above.

Yes, the above is fine.

And, BTW, the original `c_str()` issue is no longer relevant. Starting
from C++11 `c_str()` points to the same location as `data()`, i.e. it is
required to point to the same controlled sequence that you access by all
other means.

--
Best regards,
Andrey
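
The C++11 guarantee described above can be sketched as two small checks. Both function names are hypothetical and chosen only for this example. The first compares the pointers returned by `c_str()`, `data()` and `&s[0]`; the second writes through a range-for reference and reads the result back through `c_str()`.

```cpp
#include <string>

// True if c_str(), data() and &s[0] all refer to the same buffer,
// which C++11 and later require.
bool shares_one_buffer(const std::string& s) {
    return s.c_str() == s.data() && s.data() == &s[0];
}

// Writes through a range-for reference, then reads back through c_str().
std::string overwrite_then_read(std::string s) {
    for (auto& c : s) c = 'x';
    return std::string(s.c_str());
}
```

On a conforming C++11 or later implementation, `shares_one_buffer` should hold for any string, including an empty one, where `s[0]` refers to the terminating null.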

Paavo Helde

Feb 21, 2023, 12:02:50 PM

Exactly. The locale is process wide, so whenever I want to do something
which might require a different locale than the current process-wide
locale (e.g. producing CSV/XML/JSON files, formatting program code,
etc), I cannot use any functions like toupper() or sprintf() which use
the process-wide locale. I also cannot use setlocale() because this
might disturb other threads.

There are alternative functions where one can specify the locale
explicitly, typically ending with the "_l" suffix, but these are
relative latecomers and not so universally supported. On Windows the
names tend to be different, which complicates things further.

The locale-dependent functions tend to be quite slow anyway. In C++ we
finally have some fast formatting support like std::to_chars(), but this
was a long wait.

The locale concept originates from 40 years back, when it was assumed
that all information in text format is meant for visual consumption by a
human sitting near the same machine. This has not been the case for a
long time.

Not to mention that the locale mechanisms are nowhere near up to the
task of supporting real i18n.
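
As an illustration of the locale-independent formatting mentioned in this post, here is a minimal sketch using C++17 `std::to_chars` for an integer. The wrapper name `int_to_string` and the buffer size are choices made for this example, not anything prescribed by the thread.

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Formats an int without consulting any locale (C++17 std::to_chars).
std::string int_to_string(int value) {
    char buf[24];  // comfortably larger than any 32- or 64-bit int in decimal
    const auto res = std::to_chars(buf, buf + sizeof buf, value);
    if (res.ec != std::errc{}) return {};  // buffer too small; not expected here
    return std::string(buf, res.ptr);
}
```

Because `std::to_chars` never looks at the global locale, a call such as `int_to_string(-1234567)` gives the same digits whatever `setlocale()` another thread may have called, which suits machine-readable output like CSV or JSON.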

Mut...@dastardlyhq.com

Feb 21, 2023, 12:14:03 PM

On Tue, 21 Feb 2023 19:02:33 +0200

Make your program multi process then. Problem solved.

Paavo Helde

Feb 21, 2023, 12:58:31 PM

21.02.2023 19:13 Mut...@dastardlyhq.com wrote:
>
> Make your program multi process then. Problem solved.

Too late. Nowadays I'm happy if there are less than 100 threads in the
process, otherwise debugging them might become really tedious.

james...@alumni.caltech.edu

Feb 21, 2023, 1:35:42 PM

On Tuesday, February 21, 2023 at 9:53:23 AM UTC-5, Mut...@dastardlyhq.com wrote:
> On Tue, 21 Feb 2023 15:12:39 +0100
> "Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:
> >On 2023-02-21 10:18 AM, Mut...@dastardlyhq.com wrote:
> >> Ok, is this safe:
> >>
> >> for(auto &c: s) c = 'x';
> >>
> >> Or will it screw up the string object or have no effect in some cases just
> >> as altering the contents returned by the c_str() pointer can sometimes have?
> >
> >Demonstrates the need for making examples short and to the point, to not
> >introduce extraneous issues.
> 3 lines is short and I didn't introduce any extraneous issues. Other posters
> created them.

No, they didn't. When there's multiple possible reasons why you might be worried
about whether a given piece of code is safe, you need to identify which of the
possibilities is the one you are actually worried about. You considered it so
obvious that you don't care about UTF-8, that you didn't consider the possibility
that other people would consider that to be the most obvious safety concern.
The people who responded considered the resolution of the issue that you were
concerned with to be so obvious that they didn't even consider the possibility
that you were uncertain about it.

james...@alumni.caltech.edu

Feb 21, 2023, 1:58:40 PM

On Tuesday, February 21, 2023 at 9:53:23 AM UTC-5, Mut...@dastardlyhq.com wrote:

> On Tue, 21 Feb 2023 15:12:39 +0100
> "Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:
> >On 2023-02-21 10:18 AM, Mut...@dastardlyhq.com wrote:
> >> Ok, is this safe:
> >>
> >> for(auto &c: s) c = 'x';
> >>
> >> Or will it screw up the string object or have no effect in some cases just
> >> as altering the contents returned by the c_str() pointer can sometimes have?
> >
> >Demonstrates the need for making examples short and to the point, to not
> >introduce extraneous issues.
> 3 lines is short and I didn't introduce any extraneous issues. Other posters
> created them.
> >`c` is a reference to an item in the string, within the string bounds,
> >i.e. it can't be a reference to the null-item that since C++11 is
> >effectively required past the end of the string.
> Ok. However in the past with some implementations of std::string, modifying
> the contents of the buffer returned by c_str() can have undefined consequences.
> I simply wondered if this was a similar situation.
> >Not what you're asking, but the common convention in C++ is to separate
> >the type specification from the variable, i.e. `auto& c` not `auto &c`.
> Whose common convention?

The convention you're using is popular for simple declarations, because it uses
the space to separate the sequence of declaration specifiers from the list of
declarators. That's a helpful distinction to be aware of, because if there is more
than one declarator, they all share the declaration specifiers.
However, this is a different context, the range declaration for a range-based for
statement. If followed by a comma-delimited list of identifiers, it would be
parsed as a structured binding declaration, and in that context & would be shared
by all of those identifiers, so it makes sense to separate it from them, and
combine it with the preceding specifiers, which they would also share.

Chris M. Thomasson

Feb 21, 2023, 3:16:03 PM

100 threads? I am wondering what they are all doing. ;^)

Paavo Helde

Feb 21, 2023, 4:41:47 PM

Mostly waiting for each other, or for external events. The general goal
is to have the number of active calculation threads in the order of
physical CPU cores, or somewhat higher if there is hyperthreading. So
the number of actively running threads is usually less than 20 or 30,
the rest are waiting.

I'm sure it ought to be possible to reduce the number of needed threads
several times by code reorganization, but the cost/benefit ratio does
not seem to encourage this work.

Andrey Tarasevich

Feb 22, 2023, 1:32:28 AM

On 02/21/23 6:12 AM, Alf P. Steinbach wrote:
>
> Not what you're asking, but the common convention in C++ is to separate
> the type specification from the variable, i.e. `auto& c` not `auto &c`.
>
> But on the third and gripping hand, in this case I'd write `char& c`,
> since there's no advantage in `auto` here and it obscures things.

Not really. This is a misleading "convention" that has no reason to
exist. In this case the `&` is a part of an individual declarator, not a
part of decl-specifier-seq. There's no reason to group the syntactic
elements differently. It achieves nothing.

--
Best regards,
Andrey

Mut...@dastardlyhq.com

Feb 22, 2023, 4:25:59 AM

On Tue, 21 Feb 2023 10:35:34 -0800 (PST)
"james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
>On Tuesday, February 21, 2023 at 9:53:23 AM UTC-5, Mut...@dastardlyhq.com wrote:
>> On Tue, 21 Feb 2023 15:12:39 +0100
>> "Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:
>> >On 2023-02-21 10:18 AM, Mut...@dastardlyhq.com wrote:
>> >> Ok, is this safe:
>> >>
>> >> for(auto &c: s) c = 'x';
>> >>
>> >> Or will it screw up the string object or have no effect in some cases just
>> >> as altering the contents returned by the c_str() pointer can sometimes have?
>> >
>> >Demonstrates the need for making examples short and to the point, to not
>> >introduce extraneous issues.
>> 3 lines is short and I didn't introduce any extraneous issues. Other posters
>> created them.
>
>No, they didn't. When there's multiple possible reasons why you might be worried
>about whether a given piece of code is safe, you need to identify which of the
>possibilities is the one you are actually worried about. You considered it so
>obvious that you don't care about UTF-8, that you didn't consider the possibility
>that other people would consider that to be the most obvious safety concern.

Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
means will the program crash or have hidden bugs, not whether a string gets
translated into uppercase properly or not which would be immediately obvious.

>The people who responded considered the resolution of the issue that you were
>concerned with to be so obvious that they didn't even consider the possibility
>that you were uncertain about it.

Don't worry, next time I'll ask in crayon.

Mut...@dastardlyhq.com

Feb 22, 2023, 4:27:12 AM

On Tue, 21 Feb 2023 10:58:31 -0800 (PST)
"james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
>On Tuesday, February 21, 2023 at 9:53:23 AM UTC-5, Mut...@dastardlyhq.com wrote:

>> created them.
>> >`c` is a reference to an item in the string, within the string bounds,
>> >i.e. it can't be a reference to the null-item that since C++11 is
>> >effectively required past the end of the string.
>> Ok. However in the past with some implementations of std::string, modifying
>> the contents of the buffer returned by c_str() can have undefined consequences.
>> I simply wondered if this was a similar situation.
>> >Not what you're asking, but the common convention in C++ is to separate
>> >the type specification from the variable, i.e. `auto& c` not `auto &c`.
>> Whose common convention?
>
>The convention you're using is popular for simple declarations, because it uses
>the space to separate the sequence of declaration specifiers from the list of
>declarators. That's a helpful distinction to be aware of, because if there is more
>than one declarator, they all share the declaration specifiers.
>However, this is a different context, the range declaration for a range-based for
>statement. If followed by a comma-delimited list of identifiers, it would be
>parsed as a structured binding declaration, and in that context & would be

No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > >
vs >> template syntax hack up until 2011).

Andrey Tarasevich

Feb 22, 2023, 11:09:22 AM

On 02/21/23 10:58 AM, james...@alumni.caltech.edu wrote:
> However, this is a different context, the range declaration for a range-based for
> statement. If followed by a comma-delimited list of identifiers, it would be
> parsed as a structured binding declaration, and in that context & would be shared
> by all of those identifiers, so it makes sense to separate it from them, and
> combine it with the preceding specifiers, which they would also share.

In a structured binding declaration it is not just a "comma-delimited
list of identifiers". It is actually a []-enclosed comma-delimited list
of identifiers (!). The presence of that `[]` is already perfectly
sufficient to clearly convey the fact that the `&` applies to all
identifiers in the list.

Since a structured binding declaration always contains exactly one
[]-enclosed "declarator", the placement of spacing around `&` is moot.
No reason to change the formatting rules from an ordinary declaration.
Just keep aligning it to the right: `auto &[a, b]`.

--
Best regards,
Andrey
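
For readers following the structured-binding argument, a minimal C++17 sketch (the function name `bump_both` is made up for illustration). In `auto& [first, second]` the `&` applies to the single hidden object that both names refer to, so both names alias members of `p`, whichever side of the space the `&` is written on.

```cpp
#include <utility>

// `auto& [first, second]` binds both names to members of the same object,
// so the & is not per-identifier the way it is in `int &x, y;`.
std::pair<int, int> bump_both(std::pair<int, int> p) {
    auto& [first, second] = p;
    first += 1;
    second += 10;
    return p;
}
```

Here `bump_both({1, 2})` returns the pair (2, 12), showing that both bindings wrote through to `p`.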

Tim Rentsch

Feb 22, 2023, 11:33:14 AM

I would say it achieves less than nothing, because it's misleading.
It's like writing a+b * c ; the grouping suggested by the spacing
doesn't match the actual precedence. In almost all cases it's a
mistake to use "contra-syntactic" spacing, and attaching '&' or '*'
to a type specifier is definitely one of those mistakes.

Keith Thompson

Feb 22, 2023, 12:17:30 PM

I prefer the "int *foo;" spacing (conventional in C) because it follows
the syntax, but the "int* foo;" spacing has become a convention in C++
(probably because Stroustrup uses it in his books), and I use it in C++
for that reason. I also avoid declarations where it would make a
difference to a human reader, like "int* foo, bar;".

Most development is on existing code. It would be poor practice to use
the "better" spacing on new code while leaving the existing code (which
most likely uses the C++ conventional spacing) as it is. It would also
be poor practice to change all the spacing in a large body of existing
code.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for XCOM Labs
void Void(void) { Void(); } /* The recursive call of the void */
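
The `int* foo, bar;` pitfall mentioned above can be verified at compile time. This is a small sketch assuming C++17; the variable names are the ones from the post.

```cpp
#include <type_traits>

int* foo, bar;  // the * belongs to the declarator `foo` only

static_assert(std::is_same_v<decltype(foo), int*>);
static_assert(std::is_same_v<decltype(bar), int>);
```

The spacing suggests two pointers, but the grammar makes `bar` a plain `int`, which is why the post recommends avoiding such declarations altogether.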

Alf P. Steinbach

Feb 22, 2023, 4:33:26 PM

The UB includes that the program can crash.

>> The people who responded considered the resolution of the issue that you were
>> concerned with to be so obvious that they didn't even consider the possibility
>> that you were uncertain about it.
>
> Don't worry, next time I'll ask in crayon.

The sarcasm is well anchored in ignorance.

- Alf

Alf P. Steinbach

Feb 22, 2023, 4:34:52 PM

Either you missed the point, or you understood and deliberately snipped
what you quoted to create a misleading impression.

- Alf

james...@alumni.caltech.edu

Feb 23, 2023, 12:21:43 AM

On Wednesday, February 22, 2023 at 4:25:59 AM UTC-5, Mut...@dastardlyhq.com wrote:
> On Tue, 21 Feb 2023 10:35:34 -0800 (PST)
> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
> >On Tuesday, February 21, 2023 at 9:53:23 AM UTC-5, Mut...@dastardlyhq.com wrote:

...

> >> 3 lines is short and I didn't introduce any extraneous issues. Other posters
> >> created them.
> >
> >No, they didn't. When there's multiple possible reasons why you might be worried
> >about whether a given piece of code is safe, you need to identify which of the
> >possibilities is the one you are actually worried about. You considered it so
> >obvious that you don't care about UTF-8, that you didn't consider the possibility
> >that other people would consider that to be the most obvious safety concern.
> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
> means will the program crash or have hidden bugs, not whether a string gets
> translated into uppercase properly or not which would be immediately obvious.

Not getting the expected results can make other parts of the program malfunction.
There are certainly more dangerous problems a program can have, but there's more
reasons to worry about that issue than about the one that was your actual concern.

james...@alumni.caltech.edu

Feb 23, 2023, 12:30:15 AM

On Wednesday, February 22, 2023 at 4:27:12 AM UTC-5, Mut...@dastardlyhq.com wrote:
> On Tue, 21 Feb 2023 10:58:31 -0800 (PST)
> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:

...

> >The convention you're using is popular for simple declarations, because it uses
> >the space to separate the sequence of declaration specifiers from the list of
> >declarators. That's a helpful distinction to be aware of, because if there is more
> >than one declarator, they all share the declaration specifiers.
> >However, this is a different context, the range declaration for a range-based for
> >statement. If followed by a comma-delimited list of identifiers, it would be
> >parsed as a structured binding declaration, and in that context & would be
> No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > >
> vs >>) template syntax hack up until 2011.

If white-space between tokens were significant after translation phase 5, the way in
which people used it wouldn't qualify as a "convention", but as a necessity for
correct code. Such conventions are chosen to make code easier for humans to
understand, not because they are needed to ensure that compilers handle the code
correctly.

Note: white-space within tokens is always significant, and white-space between
tokens can be significant in translation phase 4 and earlier.
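
One well-known case of whitespace between tokens mattering during preprocessing (translation phase 4) is the space before the opening parenthesis in a macro definition. The macro and function names below are invented for this sketch.

```cpp
// No space before '(' : a function-like macro taking one parameter.
#define TWICE(x) ((x) * 2)

// A space before '(' : an object-like macro whose replacement is "(x) * 2".
#define NOT_A_FUNCTION (x) * 2

int demo() {
    int x = 5;
    // TWICE(4) expands to ((4) * 2); NOT_A_FUNCTION expands to (x) * 2
    // and picks up the local variable x.
    return TWICE(4) + NOT_A_FUNCTION;
}
```

With `x` equal to 5, `demo()` evaluates `8 + 10`. Once preprocessing is done, whitespace between tokens no longer changes the meaning, which is the sense in which spacing around `&` is only a convention.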

Paavo Helde

Feb 23, 2023, 2:39:20 AM

22.02.2023 11:25 Mut...@dastardlyhq.com wrote:

> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
> means will the program crash or have hidden bugs, not whether a string gets
> translated into uppercase properly or not which would be immediately obvious.

Because you asked: I have seen isdigit() crashing hard on negative
values, would not be surprised if toupper() would behave the same in
some implementation.

In case of a multi-byte UTF-8 character encoded in a std::string all its
bytes would have a negative value if 'char' is signed on that platform.
So there.

Mut...@dastardlyhq.com

Feb 23, 2023, 4:25:27 AM

On Wed, 22 Feb 2023 22:33:09 +0100

Feel free to explain how toupper could crash.

Mut...@dastardlyhq.com

Feb 23, 2023, 4:30:14 AM

On Wed, 22 Feb 2023 22:34:37 +0100
"Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:
>On 2023-02-22 10:26 AM, Mut...@dastardlyhq.com wrote:
>> On Tue, 21 Feb 2023 10:58:31 -0800 (PST)

>>> than one declarator, they all share the declaration specifiers.
>>> However, this is a different context, the range declaration for a range-based for
>>> statement. If followed by a comma-delimited list of identifiers, it would be
>>> parsed as a structured binding declaration, and in that context & would be
>>
>> No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > >
>> vs >>) template syntax hack up until 2011.
>>
>
>Either you missed the point, or you understood and deliberately snipped
>what you quoted to create a misleading impression.

I didn't miss the point at all. You implied that the positioning of the
whitespace in a declaration makes a difference. It doesn't.

Mut...@dastardlyhq.com

Feb 23, 2023, 4:32:13 AM

On Wed, 22 Feb 2023 21:30:04 -0800 (PST)
"james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
>On Wednesday, February 22, 2023 at 4:27:12 AM UTC-5, Mut...@dastardlyhq.com wrote:

>> No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > >
>> vs >>) template syntax hack up until 2011.
>
>If white-space between tokens were significant after translation phase 5, the way in
>which people used it wouldn't qualify as a "convention", but as a necessity for
>correct code. Such conventions are chosen to make code easier for humans to
>understand, not because they are needed to ensure that compilers handle the code
>correctly.
>
>Note: white-space within tokens is always significant, and white-space between
>tokens can be significant in translation phase 4 and earlier.

Quite obviously a program with no whitespace won't compile. That doesn't mean
the whitespace is significant in the programming sense.

Mut...@dastardlyhq.com

Feb 23, 2023, 4:35:59 AM

On Thu, 23 Feb 2023 09:39:04 +0200
Paavo Helde <ees...@osa.pri.ee> wrote:
>22.02.2023 11:25 Mut...@dastardlyhq.com wrote:
>
>> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
>> means will the program crash or have hidden bugs, not whether a string gets
>> translated into uppercase properly or not which would be immediately obvious.
>
>Because you asked: I have seen isdigit() crashing hard on negative

That's clearly a library bug. All bets are off when they exist.

>In case of a multi-byte UTF-8 character encoded in a std::string all its
>bytes would have a negative value if 'char' is signed on that platform.
>So there.

One would assume unsigned would be used internally.

Paavo Helde

Feb 23, 2023, 5:30:54 AM

23.02.2023 11:35 Mut...@dastardlyhq.com wrote:
> On Thu, 23 Feb 2023 09:39:04 +0200
> Paavo Helde <ees...@osa.pri.ee> wrote:
>> 22.02.2023 11:25 Mut...@dastardlyhq.com wrote:
>>
>>> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
>>> means will the program crash or have hidden bugs, not whether a string gets
>>> translated into uppercase properly or not which would be immediately obvious.
>>
>> Because you asked: I have seen isdigit() crashing hard on negative
>
> That's clearly a library bug. All bets are off when they exist.

What makes you think so? The C standard clearly says in 7.4 (Character
handling <ctype.h>):

"In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro
EOF. If the argument has any other value, the behavior is undefined."

Undefined behavior may or may not involve a program crash.

>
>> In case of a multi-byte UTF-8 character encoded in a std::string all its
>> bytes would have a negative value if 'char' is signed on that platform.
>> So there.
>
> One would assume unsigned would be used internally.

Alas, std::string is standardized to use plain char, whose signedness is
implementation dependent.
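
For what it's worth, here is a minimal sketch of the original loop rewritten so that it stays inside the domain quoted from 7.4 above. Each char is converted to unsigned char before it reaches toupper, so a negative plain-char value is never passed. The name upper_in_place is made up for illustration. Note that this only removes the undefined behaviour; bytes outside ASCII are still mapped one at a time according to the current C locale, so it does nothing for UTF-8 case conversion.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): uppercases s in place.
// The cast to unsigned char keeps the argument within the range that
// the <cctype> functions accept, whether plain char is signed or not.
inline void upper_in_place(std::string& s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
}
```

Calling upper_in_place on a std::string holding "hello" leaves it holding "HELLO"; bytes with the high bit set are passed through toupper as values in 128..255 rather than as negative ints.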

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 5:34:28 AM2/23/23

to

On Thu, 23 Feb 2023 12:30:39 +0200

Paavo Helde <ees...@osa.pri.ee> wrote:
>23.02.2023 11:35 Mut...@dastardlyhq.com kirjutas:
>> On Thu, 23 Feb 2023 09:39:04 +0200
>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>> 22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>>>
>>>> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
>>>> means will the program crash or have hidden bugs, not whether a string gets
>
>>>> translated into uppercase properly or not which would be immediately
>obvious.
>>>
>>>
>>>
>>> Because you asked: I have seen isdigit() crashing hard on negative
>>
>> Thats clearly a library bug. All bets are off when they exist.
>
>What makes you think so? The C standard clearly says in 7.4 (Character
>handling <ctype.h>):
>
>"In all cases the argument is an int, the value of which shall be
>representable as an unsigned char or shall equal the value of the macro
>EOF. If the argument has any other value, the behavior is undefined."
>
>Undefined behavior may or may not involve a program crash.

I would still consider a crash to be a bug. Undefined would just be returning
rubbish. I can't even figure out HOW it would crash since all it's doing is

return (c >= '0' && c <= '9');

unless there's some obscure way of doing that test even faster.

Öö Tiib

unread,

Feb 23, 2023, 5:47:42 AM2/23/23

to

On Thursday, 23 February 2023 at 11:35:59 UTC+2, Mut...@dastardlyhq.com wrote:
> On Thu, 23 Feb 2023 09:39:04 +0200
> Paavo Helde <ees...@osa.pri.ee> wrote:
> >22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
> >
> >> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
> >> means will the program crash or have hidden bugs, not whether a string gets
> >> translated into uppercase properly or not which would be immediately obvious.
> >
> >
> >
> >Because you asked: I have seen isdigit() crashing hard on negative
>
> Thats clearly a library bug. All bets are off when they exist.
>

Nope, the standard matters only as a specification. Read the licence
agreements of the compilers you use or something. Those are all the
warranties that you actually get, and that is where the interesting part
of our work only starts. No bets are off ... in practice we may need to
use clearly and provably defective implementations for developing
financial applications that people use daily and trust blindly without
thinking. We get paid well for that.

About 15 years ago one of my teams helped program a particular
point-of-sale credit card terminal using gcc, which produced a binary
that rebooted the terminal in the situation that Paavo described.
The POS could talk the native language of the card owner (which might
contain no Latin characters), got certified by EMV
(Europay-Mastercard-Visa), and I saw it in use only a few years ago.

Paavo Helde

unread,

Feb 23, 2023, 6:21:06 AM2/23/23

to

23.02.2023 12:34 Mut...@dastardlyhq.com kirjutas:
> On Thu, 23 Feb 2023 12:30:39 +0200
> Paavo Helde <ees...@osa.pri.ee> wrote:
>> 23.02.2023 11:35 Mut...@dastardlyhq.com kirjutas:
>>> On Thu, 23 Feb 2023 09:39:04 +0200
>>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>>> 22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>>>>
>>>>> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
>>>>> means will the program crash or have hidden bugs, not whether a string gets
>>
>>>>> translated into uppercase properly or not which would be immediately
>> obvious.
>>>>
>>>>
>>>>
>>>> Because you asked: I have seen isdigit() crashing hard on negative
>>>
>>> Thats clearly a library bug. All bets are off when they exist.
>>
>> What makes you think so? The C standard clearly says in 7.4 (Character
>> handling <ctype.h>):
>>
>> "In all cases the argument is an int, the value of which shall be
>> representable as an unsigned char or shall equal the value of the macro
>> EOF. If the argument has any other value, the behavior is undefined."
>>
>> Undefined behavior may or may not involve a program crash.
>
> I would still consider a crash to be a bug.

Sure, but the bug would be in your code.

> Undefined would just be returning
> rubbish. I can't even figure out HOW it would crash since all its doing is
>
> return (c >= '0' && c <= '9')

Nope, because it's not known at the compile time which characters should
be considered digits. It might do something like

if (c==(EOF)) {
    return (EOF);
} else {
    lock_current_locale();
    int result = get_current_locale()->isdigit_map[c];
    unlock_current_locale();
    return result;
}

where isdigit_map is a 256-element array provided by the locale.

Richard Damon

unread,

Feb 23, 2023, 7:18:20 AM2/23/23

to

The issue is that to support locales, things like isdigit might be
implemented as

int isdigit(int c) {
    return _prop_table[c+1] & DIGIT_PROPERTY;
}

where _prop_table gets set to a table based on the current locale, which
might define additional characters that are digits.
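
To make the failure mode concrete, here is a small sketch of the index arithmetic behind a table lookup like the one above. It does not perform the out-of-range access, it only computes whether the index would be valid. The names kTableSize, index_in_table and as_signed_char_arg are invented for this illustration, and the layout (slot 0 for EOF, EOF equal to -1, 8-bit two's complement char) is an assumption, not something any particular library promises.

```cpp
#include <climits>
#include <cstdio>

// Assumed layout, modelled on the _prop_table[c+1] snippet above:
// slot 0 is for EOF (assumed to be -1), slots 1..UCHAR_MAX+1 are for
// the unsigned char values.
constexpr int kTableSize = UCHAR_MAX + 2;

// True when c + 1 would be a valid index into such a table, i.e. when
// c is EOF or representable as unsigned char.
constexpr bool index_in_table(int c) {
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}

// The int value a <ctype.h> function receives when a byte is stored in
// a signed 8-bit two's complement char and passed without a cast.
constexpr int as_signed_char_arg(unsigned char byte) {
    return byte < 128 ? byte : static_cast<int>(byte) - 256;
}
```

Under those assumptions the UTF-8 lead byte 0xC3 arrives as -61, so the lookup would read slot -60, well before the start of the table. The same byte cast to unsigned char arrives as 195 and lands inside it.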

David Brown

unread,

Feb 23, 2023, 9:03:13 AM2/23/23

to

On 23/02/2023 11:34, Mut...@dastardlyhq.com wrote:
> On Thu, 23 Feb 2023 12:30:39 +0200
> Paavo Helde <ees...@osa.pri.ee> wrote:
>> 23.02.2023 11:35 Mut...@dastardlyhq.com kirjutas:
>>> On Thu, 23 Feb 2023 09:39:04 +0200
>>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>>> 22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>>>>
>>>>> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
>>>>> means will the program crash or have hidden bugs, not whether a string gets
>>
>>>>> translated into uppercase properly or not which would be immediately
>> obvious.
>>>>
>>>>
>>>>
>>>> Because you asked: I have seen isdigit() crashing hard on negative
>>>
>>> Thats clearly a library bug. All bets are off when they exist.
>>
>> What makes you think so? The C standard clearly says in 7.4 (Character
>> handling <ctype.h>):
>>
>> "In all cases the argument is an int, the value of which shall be
>> representable as an unsigned char or shall equal the value of the macro
>> EOF. If the argument has any other value, the behavior is undefined."
>>
>> Undefined behavior may or may not involve a program crash.
>
> I would still consider a crash to be a bug. Undefined would just be returning
> rubbish.

Undefined behaviour means there is no defined behaviour - crashing is
entirely plausible. It doesn't matter what /you/ think about it. The C
standard is quite clear about this - if you pass a valid argument to
isdigit(), as specified in the standard, you'll get a valid result. If
you pass something invalid, all bets are off and whatever happens is
/your/ problem.

This is so fundamental to the whole concept of programming that I am
regularly surprised by people who call themselves programmers, yet fail
to comprehend it. A function has a specified input domain, and a
specified result or behaviour for inputs in that domain. Move outside
that input domain, and you are in the realm of nonsense. You don't
expect particular behaviour from 1/0 - maybe you'll get a random value,
maybe you'll get a crash. Why you think calling isdigit() with an
invalid input should have some guarantees is beyond me. "Garbage in,
garbage out" applies to behaviour, not just values, and has been
understood since Babbage designed the first programmable mechanical
computer.

So yes, there is a bug - it's in /your/ code if you pass an invalid
value to the function.

> I can't even figure out HOW it would crash since all its doing is
>
> return (c >= '0' && c <= '9')
>
> unless there's some obscure way of doing that test even faster.
>

There are other ways that can be faster (depending on details of
processor, cache uses, and other aspects). The traditional
implementation of the <ctype.h> classification functions involves lookup
tables, and it is quite reasonable for a negative value to lead to
things going horribly wrong.

Ben Bacarisse

unread,

Feb 23, 2023, 9:49:34 AM2/23/23

to

This is formally true, but I think we can also legitimately ask to what
extent a function's domain (and the corresponding returned values) are
reasonable and helpful.

> You don't expect particular
> behaviour from 1/0 - maybe you'll get a random value, maybe you'll get a
> crash. Why you think calling isdigit() with an invalid input should have
> some guarantees is beyond me. "Garbage in, garbage out" applies to
> behaviour, not just values, and has been understood since Babbage designed
> the first programmable mechanical computer.

Those of us with a background in languages like C are not going to be
confused, but I bet almost everyone who comes to C from a more modern
language will be astonished by what you have to do to get isdigit to
work safely. To have a character testing function that does not work
for all the values of the language's char type is, well, bonkers.

In Haskell, a program won't even compile unless isDigit is called with
an argument of type Char, and the result is defined to be exactly one
of True or False for all values of that type.

And that brings up another trap for the unwary: C's isdigit returns
something that is only vaguely Boolean. For example, you can't test if
char c1, c2; are both digits or neither are digits with

isdigit(c1) == isdigit(c2)

because the value indicating "yes" is not guaranteed to be anything
other than "not zero". Instead you'd write

!isdigit((unsigned char)c1) == !isdigit((unsigned char)c2)

I think an occasional nod to how we have got used to such nonsense is
merited!

--
Ben.
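
Both traps mentioned above (the unsigned char cast and the "only vaguely Boolean" result) can be hidden behind a thin wrapper. This is just a sketch, and the names is_digit and same_digit_class are hypothetical, not part of any library:

```cpp
#include <cctype>

// Hypothetical wrapper: takes a plain char and returns a real bool.
// The cast keeps the argument in isdigit's valid domain, and the
// comparison with 0 collapses "some non-zero value" to true.
inline bool is_digit(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// True when c1 and c2 are both digits or both non-digits.
inline bool same_digit_class(char c1, char c2) {
    return is_digit(c1) == is_digit(c2);
}
```

With that in place, same_digit_class('1', '9') and same_digit_class('a', 'b') are both true, and same_digit_class('1', 'a') is false, without relying on which non-zero value isdigit happens to return.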

Andrey Tarasevich

unread,

Feb 23, 2023, 10:05:33 AM2/23/23

to

On 02/23/23 1:25 AM, Mut...@dastardlyhq.com wrote:
>
> Feel free to explain how toupper could crash.
>

There's no such concept as "explaining" undefined behavior.

Yet, it could be very simple. The implementation treats `toupper` as an
intrinsic, and the compiler explicitly generates a "CRASH NOW!!11"
instruction for invalid arguments. Let's say that in their
implementation of `toupper` it would result in a negligible performance
penalty or no penalty at all.

GCC is well-known to do such things, for one example.

--
Best regards,
Andrey

Andrey Tarasevich

unread,

Feb 23, 2023, 10:16:28 AM2/23/23

to

On 02/23/23 2:34 AM, Mut...@dastardlyhq.com wrote:
>>
>> Undefined behavior may or may not involve a program crash.
>
> I would still consider a crash to be a bug. Undefined would just be returning
> rubbish.

Nope. "Returning rubbish" would be an example of _unspecified behavior_.
Undefined is a wholly different thing.

--
Best regards,
Andrey

David Brown

unread,

Feb 23, 2023, 10:26:04 AM2/23/23

to

Sure. You could, for example, argue that "isdigit" would be better
designed if it were to return "false" on any int value outside of the
current valid range. But it might be less efficient if it had such an
extended domain - do you optimise for maximum efficiency for programmers
who are able to read and follow specifications and write correct code,
or do you optimise for minimal surprise for programmers who can't or
won't follow the specifications? I'd say that for C, it's the former -
let those who don't understand the importance of following
specifications use a different language more suited to their needs,
wants and skills. There is a time and a place for making functions with
maximal input domains and controlled handling of nonsensical inputs -
low-level functions like "isdigit" are not such cases.

>> You don't expect particular
>> behaviour from 1/0 - maybe you'll get a random value, maybe you'll get a
>> crash. Why you think calling isdigit() with an invalid input should have
>> some guarantees is beyond me. "Garbage in, garbage out" applies to
>> behaviour, not just values, and has been understood since Babbage designed
>> the first programmable mechanical computer.
>
> Those of us with a background in languages like C are not going to be
> confused, but I bet almost everyone who comes to C from a more modern
> language will be astonished by what you have to do to get isdigit to
> work safely. To have a character testing function that does not work
> for all the values of the language's char type is, well, bonkers.
>

IMHO the concept of "character" in C and C++ is a mess these days. It
was perhaps inevitable, given the history of the languages, the
development of characters, and the overriding requirement for backwards
compatibility. The notion of "signed characters" and "unsigned
characters" is insane. The jumble of "wide characters" and various
varieties of UTF formats is confusing at best. There are various
character sets - source, execution, basic, extended, whatever (the terms
seem to change regularly, especially in C++). Sometimes these are the
same, sometimes different. Some of the different character types are
the same size as others, but have different interpretations. Sometimes
they have the same interpretations, but are still distinct. Some of the
standard C library functions work only on 7-bit ASCII, some will work
with UTF-8 as well. Some of them work with "int" parameters instead of
more logical "char" types, and support non-character values (like EOF)
in functions that appear to take character parameters. And some
functions treat EOF as a normal character.

And if you are coming to C from pretty much any other language, you'll
be shocked at the rudimentary "string" support.

So while I agree that it might be surprising to find that "isdigit()" is
not defined for all possible values of all character types, I think it
would be /way/ down the list.

This is not a criticism of C - different languages are better and worse
for different things. But if you are working in C, you have to learn C
- you can't just assume it is like whatever other languages you have
used. And if C and C++ were to try to be like other languages, such as
by specifying values for every int value in "isdigit" calls, or having
"isdigit" throw a C++ exception on bad values, you'd lose some of the
aspects that make C and C++ important and useful languages.

> In Haskell, a program won't even compile unless isDigit is called with
> an argument of type Char, and the result is defined to be exactly one
> of True or False for all values of that type.
>
> And that brings up another trap for the unwary: C's isdigit returns
> something that is only vaguely Boolean. For example, you can't test if
> char c1, c2; are both digits or neither are digits with
>
> isdigit(c1) == isdigit(c2)
>
> because the value indicating "yes" is not guaranteed to be anything
> other than "not zero". Instead you'd write
>
> !isdigit((unsigned char)c1) == !isdigit((unsigned char)c2)
>
> I think an occasional nod to how we have got used to such nonsense is
> merited!
>

Yes, absolutely.

Alf P. Steinbach

unread,

Feb 23, 2023, 10:35:12 AM2/23/23

to

An MS runtime library example, but apparently this is old code, not for
the current version:

<url:
https://github.com/ojdkbuild/tools_toolchain_sdk10_1607/blob/master/Source/10.0.14393.0/ucrt/convert/isctype.cpp#L34-L39>

// The _chvalidator function is called by the character classification functions
// in the debug CRT. This function tests the character argument to ensure that
// it is not out of range. For performance reasons, this function is not used
// in the retail CRT.
#if defined _DEBUG

extern "C" int __cdecl _chvalidator(int const c, int const mask)
{
    _ASSERTE(c >= -1 && c <= 255);
    return _chvalidator_l(nullptr, c, mask);
}

extern "C" int __cdecl _chvalidator_l(_locale_t const locale, int const c, int const mask)
{
    _ASSERTE(c >= -1 && c <= 255);

    _LocaleUpdate locale_update(locale);

    int const index = (c >= -1 && c <= 255) ? c : -1;

    return locale_update.GetLocaleT()->locinfo->_public._locale_pctype[index] & mask;
}

#endif // _DEBUG

In addition to explicit assertions, array indexing out of bounds can
cause a crash, in the Linux world a "segfault".

Implementations do access arrays with the character code as index
because functions like `toupper` are locale dependent and in practice
support a number of encodings such as Latin-1, in addition to ASCII.

- Alf

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 10:59:29 AM2/23/23

to

On Thu, 23 Feb 2023 02:47:33 -0800 (PST)
=?UTF-8?B?w5bDtiBUaWli?= <oot...@hot.ee> wrote:
>On Thursday, 23 February 2023 at 11:35:59 UTC+2, Mut...@dastardlyhq.com wrote:
>> On Thu, 23 Feb 2023 09:39:04 +0200
>> Paavo Helde <ees...@osa.pri.ee> wrote:
>> >22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>> >
>> >> Absolute rubbish. Why would UTF8 have anything to do with "safety"?
>Safety
>> >> means will the program crash or have hidden bugs, not whether a string
>gets
>> >> translated into uppercase properly or not which would be immediately
>obvious.
>> >
>> >
>> >
>> >Because you asked: I have seen isdigit() crashing hard on negative
>>
>> Thats clearly a library bug. All bets are off when they exist.
>>
>Nope, standard does matter only as specification. Read the licence

Yes.

A standard use of toupper and tolower is to iterate through a string and
apply them to whatever is there without testing each character first. If
either crashes IT IS a bug.

>agreements of compilers you use or something. There are all the
>warranties that you actually get and there the interesting part of our

Warranties are legal protection against bugs. It doesn't mean they don't exist.

>About 15 years ago one of my teams helped programming particular
>point-of-sale credit card terminal using gcc that produced a binary
>that rebooted that terminal on case of situation that Paavo described.
>POS could talk native language of card owner (that might contain none
>of Latin characters) got certified by EMV (eurocard-mastercard-visa)
>and I saw it in use only few years ago.

It's still a library bug.

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 11:00:47 AM2/23/23

to

On Thu, 23 Feb 2023 13:20:50 +0200

Paavo Helde <ees...@osa.pri.ee> wrote:
>23.02.2023 12:34 Mut...@dastardlyhq.com kirjutas:
>> On Thu, 23 Feb 2023 12:30:39 +0200
>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>> 23.02.2023 11:35 Mut...@dastardlyhq.com kirjutas:
>>>> On Thu, 23 Feb 2023 09:39:04 +0200
>>>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>>>> 22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>>>>>
>>>>>> Absolute rubbish. Why would UTF8 have anything to do with "safety"?
>Safety
>>>>>> means will the program crash or have hidden bugs, not whether a string
>gets
>>>
>>>>>> translated into uppercase properly or not which would be immediately
>>> obvious.
>>>>>
>>>>>
>>>>>
>>>>> Because you asked: I have seen isdigit() crashing hard on negative
>>>>
>>>> Thats clearly a library bug. All bets are off when they exist.
>>>
>>> What makes you think so? The C standard clearly says in 7.4 (Character
>>> handling <ctype.h>):
>>>
>>> "In all cases the argument is an int, the value of which shall be
>>> representable as an unsigned char or shall equal the value of the macro
>>> EOF. If the argument has any other value, the behavior is undefined."
>>>
>>> Undefined behavior may or may not involve a program crash.
>>
>> I would still consider a crash to be a bug.
>
>Sure, but the bug would be in your code.

Wrong.

>> Undefined would just be returning
>> rubbish. I can't even figure out HOW it would crash since all its doing is
>>
>> return (c >= '0' && c <= '9')
>
>Nope, because it's not known at the compile time which characters should
>be considered digits. It might do something like

Wow, talk about clutching at straws.

Unlike letters the characters for digits don't change unless you're talking
about something like ancient egyptian.

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 11:01:15 AM2/23/23

to

Numbers don't have a locale FFS.

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 11:11:13 AM2/23/23

to

On Thu, 23 Feb 2023 15:02:56 +0100
David Brown <david...@hesbynett.no> wrote:
>On 23/02/2023 11:34, Mut...@dastardlyhq.com wrote:
>> I would still consider a crash to be a bug. Undefined would just be returning
>
>> rubbish.
>
>Undefined behaviour means there is no define behaviour - crashing is
>entirely plausible. It doesn't matter what /you/ think about it. The C
>standard is quite clear about this - if you pass a valid argument to
>isdigit(), as specified in the standard, you'll get a valid result. If
>you pass something invalid, all bets are off and whatever happens is
>/your/ problem.
>
>This is so fundamental to the whole concept of programming that I am
>regularly surprised by people who call themselves programmers, yet fail
>to comprehend it. A function has a specified input domain, and a

Sorry, I get tired of people resorting to "undefined" as some kind of get-out
clause for some badly written API crashing the whole program. isdigit() is a
yes/no function. If it is a valid digit it returns 1 else it returns 0
regardless. Crashing is a bug. End of.

>maybe you'll get a crash. Why you think calling isdigit() with an

Oh ok. How about this then?

int i;
++i;

The result is undefined. Presumably you think ++ crashing the program is
an acceptable outcome?

>invalid input should have some guarantees is beyond me. "Garbage in,
>garbage out" applies to behaviour, not just values, and has been
>understood since Babbage designed the first programmable mechanical
>computer.

Garbage out != program crashes.

>So yes, there is a bug - it's in /your/ code if you pass an invalid
>value to the function.

BS.

>implementation of the <ctype.h> classification functions involves lookup
>tables, and it is quite reasonable for a negative value to lead to
>things going horribly wrong.

So when did out of bounds array accessing become "undefined" in your world
then rather than a bug?

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 11:15:28 AM2/23/23

to

On Thu, 23 Feb 2023 16:34:57 +0100

"Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:

The digit characters have the same values regardless of which 8 bit code page
is being used. Ditto UTF16 and UTF8, so I have no idea in which locale the
numeric digits would have different values.

Paavo Helde

unread,

Feb 23, 2023, 11:24:30 AM2/23/23

to

23.02.2023 18:00 Mut...@dastardlyhq.com kirjutas:

>
> Unlike letters the characters for digits don't change unless you're talking
> about something like ancient egyptian.

You are funny!

I was bored, so I coded a counter-example:

#include <locale.h>
#include <ctype.h>
#include <stdlib.h>   // EXIT_FAILURE
#include <string>
#include <iostream>

int main() {
    // Windows locale name; s[1] is only '²' if the string literal is
    // encoded in a single-byte code page such as Windows-1252.
    if (!setlocale(LC_ALL, "French_Canada.1252")) {
        std::cerr << "Failed to set locale\n";
        return EXIT_FAILURE;
    }
    std::string s = "a² + b² = c²";
    bool result = isdigit(static_cast<unsigned char>(s[1]));
    std::cout << "'" << s[1] << "' is " <<
        (result? "a digit": "not a digit") << "\n";
}

Output:

'²' is a digit

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 11:28:25 AM2/23/23

to

No it isn't. I guess that's part of the undefined output then.

james...@alumni.caltech.edu

unread,

Feb 23, 2023, 11:29:07 AM2/23/23

to

On Thursday, February 23, 2023 at 4:30:14 AM UTC-5, Mut...@dastardlyhq.com wrote:
> On Wed, 22 Feb 2023 22:34:37 +0100
> "Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:
> >On 2023-02-22 10:26 AM, Mut...@dastardlyhq.com wrote:

...

> >> No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > >
> >> vs >>) template syntax hack up until 2011.
> >>
> >
> >Either you missed the point, or you understood and deliberately snipped
> >what you quoted to create a misleading impression.
> I didn't miss the point at all. You implied that the positioning of the
> whitespace in a declaration makes a difference. It doesn't.

The fact that it doesn't matter to the implementation doesn't mean that it makes no
difference. Conventions like this one are for humans, not compilers, and the
convention you use can make a difference in how easy it is for a human to
understand the code.

james...@alumni.caltech.edu

unread,

Feb 23, 2023, 11:37:53 AM2/23/23

to

On Thursday, February 23, 2023 at 4:35:59 AM UTC-5, Mut...@dastardlyhq.com wrote:
> On Thu, 23 Feb 2023 09:39:04 +0200
> Paavo Helde <ees...@osa.pri.ee> wrote:

...

> >Because you asked: I have seen isdigit() crashing hard on negative
> Thats clearly a library bug. All bets are off when they exist.

The standard requires that the value be either EOF or representable as an unsigned
char. That permits the very common implementation of isdigit() as

static unsigned _ctype_table[UCHAR_MAX-EOF+1]; // one slot for EOF, one per unsigned char value
int isdigit(int c) { return _ctype_table[c-EOF] & _DIGIT_BIT_; }
// Each of the other <ctype.h> functions uses the same table,
// masking off a different bit pattern.

Anything you do to modify that code to make it safe to pass numbers less than EOF
or greater than UCHAR_MAX would make it less efficient, and would only benefit
code that has undefined behavior. Many implementations provide such safety only in
a special debugging mode.

Richard Damon

unread,

Feb 23, 2023, 11:50:21 AM2/23/23

to

Nope, isdigit refers to 5.2.1 which states:

Each set is further divided into a basic character set, whose contents
are given by this subclause, and a set of zero or more locale-specific
members (which are not members of the basic character set) called
extended characters.

Thus EACH category later defined (including digits) may have
locale-specific members added.

A number of languages do add their own characters for the digits,
besides the basic Arabic numerals included in the standard character set.

Richard Damon

unread,

Feb 23, 2023, 11:56:07 AM2/23/23

to

On 2/23/23 10:59 AM, Mut...@dastardlyhq.com wrote:
> On Thu, 23 Feb 2023 02:47:33 -0800 (PST)
> =?UTF-8?B?w5bDtiBUaWli?= <oot...@hot.ee> wrote:
>> On Thursday, 23 February 2023 at 11:35:59 UTC+2, Mut...@dastardlyhq.com wrote:
>>> On Thu, 23 Feb 2023 09:39:04 +0200
>>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>>> 22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>>>>
>>>>> Absolute rubbish. Why would UTF8 have anything to do with "safety"?
>> Safety
>>>>> means will the program crash or have hidden bugs, not whether a string
>> gets
>>>>> translated into uppercase properly or not which would be immediately
>> obvious.
>>>>
>>>>
>>>>
>>>> Because you asked: I have seen isdigit() crashing hard on negative
>>>
>>> Thats clearly a library bug. All bets are off when they exist.
>>>
>> Nope, standard does matter only as specification. Read the licence
>
> Yes.
>
> A standard use of toupper and tolower is to iterate through a string and
> apply them to whatever is there without testing each character first. If
> either crashes IT IS a bug.

And the loop, if reading from a "char" string, needs to cast the
character to "unsigned char" before giving it to toupper/tolower.

The original definitions of the functions were based on input loops that
read using a function that returned an int (not a char) that returned -1
for EOF or a non-negative value that represented the character.

>
>> agreements of compilers you use or something. There are all the
>> warranties that you actually get and there the interesting part of our
>
> Warranties are legal protection against bugs. It doesn't mean they don't exist.

And the C standard explicitly provides no "warranties" for code that
performs "Undefined Behavior"

>
>> About 15 years ago one of my teams helped programming particular
>> point-of-sale credit card terminal using gcc that produced a binary
>> that rebooted that terminal on case of situation that Paavo described.
>> POS could talk native language of card owner (that might contain none
>> of Latin characters) got certified by EMV (eurocard-mastercard-visa)
>> and I saw it in use only few years ago.
>
> Its still a library bug.
>

Nope, calling toupper with a value outside its defined input range (which
has only 1 possibly negative value, that of EOF) is explicitly defined
to be "Undefined Behavior", so the PROGRAM that does that has the bug,
not the library, which met its requirements.

James Kuyper

unread,

Feb 23, 2023, 11:57:26 AM2/23/23

to

On 2/23/23 06:20, Paavo Helde wrote:
> 23.02.2023 12:34 Mut...@dastardlyhq.com kirjutas:

...

>> Undefined would just be returning
>> rubbish. I can't even figure out HOW it would crash since all its
>> doing is
>>
>> return (c >= '0' && c <= '9')
>
> Nope, because it's not known at the compile time which characters
> should be considered digits. It might do something like

No, isdigit() and isxdigit() are special cases. They are not
locale-dependent. Regardless of locale, which characters qualify is
defined by cross-referencing section 5.2.1.

James Kuyper

unread,

Feb 23, 2023, 11:57:50 AM2/23/23

to

That's non-conforming.
"The isdigit function tests for any decimal-digit character (as defined
in 5.2.1)." (7.4.1.5).

"...
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9" (5.2.1p3)

"digits" is italicized, an ISO convention indicating that this
constitutes the official definition of that term. '2' is on that list,
'²' is not, and that definition allows for only 10 digits.

Paavo Helde

unread,

Feb 23, 2023, 12:01:53 PM2/23/23

to

Unfortunately it is classified as a digit, and there is no UB here. This
example also shows why the functions like isdigit() are next to useless,
in addition to being slow.

I cannot even imagine a use case which would benefit from superscripts
classified as digits, especially if this happens only in some locales,
but not in others.

Scott Lurndal

unread,

Feb 23, 2023, 12:10:05 PM2/23/23

to

Paavo Helde <ees...@osa.pri.ee> writes:
>22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>
>> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
>> means will the program crash or have hidden bugs, not whether a string gets
>> translated into uppercase properly or not which would be immediately obvious.
>
>
>Because you asked: I have seen isdigit() crashing hard on negative

>values, would not be surprised if toupper() would behave the same in
>some implementation.

It was not uncommon in early implementations for isdigit et al.
to simply index into an array of bytes. Any value not between
zero and 255 would result in UB.

Paavo Helde

unread,

Feb 23, 2023, 12:12:23 PM2/23/23

to

I stand corrected. Now I also found a footnote: "The only functions in
7.4 whose behavior is not affected by the current locale are isdigit and
isxdigit".

It's a pity Microsoft has not got the memo and isdigit() is thus still
unusable.
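
If the goal is just "one of the ten decimal digits, regardless of locale or library quirks", the plain range test proposed earlier in the thread is enough. Both C and C++ guarantee that '0' through '9' have consecutive, increasing values. A minimal sketch, with the helper name is_decimal_digit made up for illustration:

```cpp
// Hypothetical helper: true only for the ten decimal digits '0'..'9'.
// There is no table lookup and no locale involved, so every char value,
// including negative ones, is acceptable input.
constexpr bool is_decimal_digit(char c) {
    return c >= '0' && c <= '9';
}
```

This gives up whatever speed a table lookup might offer, but the superscript-two byte from the example above (0xB2 in Windows-1252) is rejected on any platform, whether plain char is signed or unsigned.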

Scott Lurndal

unread,

Feb 23, 2023, 12:12:36 PM2/23/23

to

Mut...@dastardlyhq.com writes:
>On Thu, 23 Feb 2023 12:30:39 +0200
>Paavo Helde <ees...@osa.pri.ee> wrote:
>>23.02.2023 11:35 Mut...@dastardlyhq.com kirjutas:

>>> On Thu, 23 Feb 2023 09:39:04 +0200
>>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>>> 22.02.2023 11:25 Mut...@dastardlyhq.com kirjutas:
>>>>
>>>>> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
>>>>> means will the program crash or have hidden bugs, not whether a string gets
>>
>>>>> translated into uppercase properly or not which would be immediately
>>obvious.
>>>>
>>>>
>>>>
>>>> Because you asked: I have seen isdigit() crashing hard on negative
>>>

>>> Thats clearly a library bug. All bets are off when they exist.
>>

>>What makes you think so? The C standard clearly says in 7.4 (Character
>>handling <ctype.h>):
>>
>>"In all cases the argument is an int, the value of which shall be
>>representable as an unsigned char or shall equal the value of the macro
>>EOF. If the argument has any other value, the behavior is undefined."
>>
>>Undefined behavior may or may not involve a program crash.
>

>I would still consider a crash to be a bug. Undefined would just be returning

>rubbish. I can't even figure out HOW it would crash since all its doing is
>
>return (c >= '0' && c <= '9')
>

>unless there's some obscure way of doing that test even faster.

Early implementations used an array of bytes indexed by the
char value. Each byte had a set of flags indicating character
class. isdigit() was a macro (DEFINE) which simply tested the byte for the
DIGIT flag. Likewise isalpha, etc.

#ifndef _U
# define _U 01 /* Upper case */
# define _L 02 /* Lower case */
# define _N 04 /* Numeral (digit) */
# define _S 010 /* Spacing character */
# define _P 020 /* Punctuation */
# define _C 040 /* Control character */
# define _B 0100 /* Blank */
# define _X 0200 /* heXadecimal digit */
#endif

#ifdef __STDC__

extern int isalnum(int);
extern int isalpha(int);
extern int iscntrl(int);
extern int isdigit(int);
extern int isgraph(int);
extern int islower(int);
extern int isprint(int);
extern int ispunct(int);
extern int isspace(int);
extern int isupper(int);
extern int isxdigit(int);
extern int tolower(int);
extern int toupper(int);

extern int isascii(int);
extern int toascii(int);
extern int _tolower(int);
extern int _toupper(int);

extern unsigned char __ctype[];

#if !#lint(on)

#define isalpha(c) ((__ctype + 1)[c] & (_U | _L))
#define isupper(c) ((__ctype + 1)[c] & _U)
#define islower(c) ((__ctype + 1)[c] & _L)
#define isdigit(c) ((__ctype + 1)[c] & _N)
#define isxdigit(c) ((__ctype + 1)[c] & _X)
#define isalnum(c) ((__ctype + 1)[c] & (_U | _L | _N))
#define isspace(c) ((__ctype + 1)[c] & _S)
#define ispunct(c) ((__ctype + 1)[c] & _P)
#define isprint(c) ((__ctype + 1)[c] & (_P | _U | _L | _N | _B))
#define isgraph(c) ((__ctype + 1)[c] & (_P | _U | _L | _N))
#define iscntrl(c) ((__ctype + 1)[c] & _C)

#define isascii(c) (!((c) & ~0177))

>

Mut...@dastardlyhq.com

unread,

Feb 23, 2023, 12:16:33 PM2/23/23

to

On Thu, 23 Feb 2023 08:28:58 -0800 (PST)
"james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
>On Thursday, February 23, 2023 at 4:30:14 AM UTC-5, Mut...@dastardlyhq.com
>wrote:
>> On Wed, 22 Feb 2023 22:34:37 +0100
>> "Alf P. Steinbach" <alf.p.s...@gmail.com> wrote:
>> >On 2023-02-22 10:26 AM, Mut...@dastardlyhq.com wrote:

>....

>> >> No it wouldn't. Whitespace is not significant in C++ (ok, apart from the
>> >
>> >> vs >>) template syntax hack up until 2011.
>> >>
>> >
>> >Either you missed the point, or you understood and deliberately snipped
>> >what you quoted to create a misleading impression.
>> I didn't miss the point at all. You implied that the positioning of the
>> whitespace in a declaration makes a difference. It doesn't.
>
>The fact that it doesn't matter to the implementation doesn't mean that it
>makes no
>difference. Conventions like this one are for humans, not compilers, and the
>convention you use can make a difference in how easy it is for a human to
>understand the code.

Ok, you're just redefining the meaning of significant whitespace then. So how
would you describe how python treats whitespace?

Richard Damon

unread,

Feb 23, 2023, 12:20:36 PM2/23/23

to

Which still allows for "extended characters" to be added. It just says
that such an addition needs to hold for ALL locales.

(Just makes it less likely)

Richard Damon

unread,

Feb 23, 2023, 12:21:58 PM2/23/23

to

-1 and 255, since EOF was explicitly an allowed value.

Scott Lurndal

unread,

Feb 23, 2023, 12:55:51 PM2/23/23

to

Not in SVR4.2, see msg <W7NJL.30343$Kqu2...@fx01.iad>

David Brown

unread,

Feb 23, 2023, 1:44:35 PM2/23/23

to

That doesn't matter - it could still be implemented using such tables.
It's just that the "DIGIT_PROPERTY" bit would be the same for all locale
_prop_table tables, while bits such as for "ALPHA_PROPERTY" and
"ISUPPER_PROPERTY" might vary between locales.

On modern processors, the comparison to '0' and '9' is probably the most
efficient implementation. On older ones, table lookup could be
more efficient. But a table-based implementation is certainly common:

<https://elixir.bootlin.com/glibc/latest/source/ctype/ctype.h>
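
A minimal sketch of the comparison-based approach, for illustration only. The name is_decimal_digit is hypothetical and not a library function. The standard guarantees that '0' through '9' have consecutive values, so a range check suffices, and because nothing is indexed, every int argument (negative ones included) gives a defined result.

```cpp
// Hypothetical helper, not part of <cctype>.
// Relies only on the guarantee that the ten decimal digits are contiguous.
constexpr bool is_decimal_digit(int c) noexcept {
    return c >= '0' && c <= '9';
}
```

Unlike isdigit(), this can be called directly on a plain char taken from a string, for example is_decimal_digit(s[0]).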

David Brown

unread,

Feb 23, 2023, 2:46:16 PM2/23/23

to

On 23/02/2023 17:10, Mut...@dastardlyhq.com wrote:
> On Thu, 23 Feb 2023 15:02:56 +0100
> David Brown <david...@hesbynett.no> wrote:
>> On 23/02/2023 11:34, Mut...@dastardlyhq.com wrote:
>>> I would still consider a crash to be a bug. Undefined would just be returning
>>
>>> rubbish.
>>
>> Undefined behaviour means there is no define behaviour - crashing is
>> entirely plausible. It doesn't matter what /you/ think about it. The C
>> standard is quite clear about this - if you pass a valid argument to
>> isdigit(), as specified in the standard, you'll get a valid result. If
>> you pass something invalid, all bets are off and whatever happens is
>> /your/ problem.
>>
>> This is so fundamental to the whole concept of programming that I am
>> regularly surprised by people who call themselves programmers, yet fail
>> to comprehend it. A function has a specified input domain, and a
>
> Sorry, I get tired of people resorting to "undefined" as some kind of get
> out clause for some badly written API crashing whole program. isdigit() is a
> yes/no function. If it is a valid digit it returns 1 else it returns 0
> regardless. Crashing is a bug. End of.

Sorry, I get tired of people who think /their/ ideas of how languages
work is somehow correct, despite how it contradicts the actual
definition and implementations of the language.

You can argue that this is how the function /should/ work until you are
blue in the face. It doesn't change the simple facts about how it
/does/ work.

As Ben said, it's fine to question if the design choices of the C
standard library functions were the "best" choices (for whatever value
of "best" you want to pick). But there is no point in arguing with reality.

>
>> maybe you'll get a crash. Why you think calling isdigit() with an
>
> Oh ok. How about this then?
>
> int i;
> ++i;
>
> The result is undefined. Presumably you think ++ crashing the program is
> an acceptable outcome?
>

Sure.

I greatly prefer a warning from the compiler that there is a bug in my
code. And I don't expect the compiler to go out of its way to cause a
crash - I expect it to generate efficient code as best it can. But with
a bug in the source code I can't expect the compiler to somehow generate
"correct" code.

And I /like/ the fact that a C implementation is allowed to crash the
program when you try to execute undefined behaviour. That's what allows
you to have tools like sanitizers - compile this with gcc or clang with
the undefined behaviour sanitizer enabled, and the program /will/ crash
at that point (along with an error message). That makes it easier to
find the bug.

>> invalid input should have some guarantees is beyond me. "Garbage in,
>> garbage out" applies to behaviour, not just values, and has been
>> understood since Babbage designed the first programmable mechanical
>> computer.
>
> Garbage out != program crashes.
>

Garbage out includes behaviour.

You seem to have fallen for the myth that nonsense data values are
somehow not as bad as nonsense control flow - that undefined behaviour
should be allowed to give incorrect results (not that there are any
/correct/ results possible) but not give unexpected control flow
changes. This is a very seductive concept - even the C standards
committee fell for it, making "Annex L - Analyzability" that
distinguishes between "bounded undefined behaviour" and "critical
undefined behaviour".

The problem is, it is bogus. That is why no one has ever tried to
implement anything in Annex L (or at least, no one has succeeded).

Bugs propagate. Invalid data spreads - causing more trouble, including
control flow faults.

int arr[100];

int i = foo(x);
arr[i] = 123;

If "foo()" is specified to return a value between 0 and 100 for an
argument between 0 and 1000, and you pass x equal to -1, does it matter
if "foo(-1)" crashes, or returns a random value outside of 0 to 100 ?
You are screwed either way. You have a bug in your code. You failed to
follow the specifications of the function, and things will go wrong.
Maybe the write will cause an immediate crash. Maybe it will overwrite
the return address on the stack, causing a leap into the unknown.
There's no way of telling.
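
One way to replace the "no way of telling" outcome with a defined one is to check the contract at the boundary. The sketch below is only an illustration: foo() is a made-up stand-in documented for 0 <= x <= 1000 and returning an index in [0, 99], and store() is a hypothetical caller. It trades some speed for a predictable failure.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

namespace contract_sketch {

// Hypothetical stand-in for foo(): rejects arguments outside its documented domain.
inline int foo(int x) {
    if (x < 0 || x > 1000) throw std::domain_error("foo: argument out of range");
    return x / 11;   // 0..90, always a valid index into a 100-element array
}

// at() throws std::out_of_range instead of writing outside the array.
inline void store(std::array<int, 100>& arr, int x) {
    const int i = foo(x);
    arr.at(static_cast<std::size_t>(i)) = 123;
}

}  // namespace contract_sketch
```

With these checks a call such as store(arr, -1) fails with an exception at the point of the mistake rather than corrupting memory somewhere later.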

>> So yes, there is a bug - it's in /your/ code if you pass an invalid
>> value to the function.
>
> BS.

So tell us, whose fault is it that you have not followed the
specification for the function? Your cat's ?

>
>> implementation of the <ctype.h> classification functions involves lookup
>> tables, and it is quite reasonable for a negative value to lead to
>> things going horribly wrong.
>
> So when did out of bounds array accessing become "undefined" in your world
> then rather than a bug?
>

It became "undefined behaviour" when the C standard said it was. You
don't even have to access the object - simply calculating the out of
bounds address is undefined behaviour (except that you can calculate -
but not dereference - one past the end of the array).

If you try to execute undefined behaviour, you have a bug in the code -
since the behaviour is not defined, it cannot possibly match the
behaviour you want, and is therefore a bug.

If you have a bug in your code, the code will be doing something that
you did not intend - you could argue that since you have not defined the
behaviour the code actually exhibits, then it is executing undefined
behaviour.

The concepts of "bug" and "undefined behaviour" are strongly related.

Scott Lurndal

unread,

Feb 23, 2023, 3:05:40 PM2/23/23

to

David Brown <david...@hesbynett.no> writes:
>On 23/02/2023 17:10, Mut...@dastardlyhq.com wrote:
>> On Thu, 23 Feb 2023 15:02:56 +0100
>> David Brown <david...@hesbynett.no> wrote:
>>> On 23/02/2023 11:34, Mut...@dastardlyhq.com wrote:
>>>> I would still consider a crash to be a bug. Undefined would just be returning
>>>
>>>> rubbish.
>>>
>>> Undefined behaviour means there is no define behaviour - crashing is
>>> entirely plausible. It doesn't matter what /you/ think about it. The C
>>> standard is quite clear about this - if you pass a valid argument to
>>> isdigit(), as specified in the standard, you'll get a valid result. If
>>> you pass something invalid, all bets are off and whatever happens is
>>> /your/ problem.
>>>
>>> This is so fundamental to the whole concept of programming that I am
>>> regularly surprised by people who call themselves programmers, yet fail
>>> to comprehend it. A function has a specified input domain, and a
>>
>> Sorry, I get tired of people resorting to "undefined" as some kind of get
>> out clause for some badly written API crashing whole program. isdigit() is a
>> yes/no function. If it is a valid digit it returns 1 else it returns 0
>> regardless. Crashing is a bug. End of.
>
>Sorry, I get tired of people who think /their/ ideas of how languages
>work is somehow correct, despite how it contradicts the actual
>definition and implementations of the language.

And, of course, when 'isdigit' et alia were designed, 7-bit ASCII was
limited to the positive subset of a signed 8-bit character, and
there was no need to ever pass the value assigned to the EOF macro.

if (isascii(c) && isdigit(c))

is using the API in the manner in which it was designed.

Keith Thompson

unread,

Feb 23, 2023, 4:58:15 PM2/23/23

to

Yes in SVR4.2. The code in the cited article allows for a -1 argument.

extern unsigned char __ctype[];
[...]

#define isalpha(c) ((__ctype + 1)[c] & (_U | _L))

Adding 1 to the array address allows for an index of -1.
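
The offset trick can be sketched as follows. This is a simplified imitation, not the SVR4.2 source: the namespace, table() and is_digit() are hypothetical names, and the sketch assumes 8-bit chars and EOF == -1.

```cpp
#include <array>
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdio>   // EOF

namespace ctype_sketch {

static_assert(EOF == -1 && UCHAR_MAX == 255, "this sketch assumes EOF == -1 and 8-bit chars");

constexpr unsigned char kDigit = 04;   // same value as the _N flag in the header quoted earlier

// Slot 0 is for index -1 (EOF); slots 1..256 are for values 0..255.
inline const std::array<unsigned char, 257>& table() {
    static const std::array<unsigned char, 257> t = [] {
        std::array<unsigned char, 257> a{};   // every flag cleared, including the EOF slot
        for (int c = '0'; c <= '9'; ++c) a[static_cast<std::size_t>(c + 1)] = kDigit;
        return a;
    }();
    return t;
}

// Valid only for c in [-1, 255]; the macro version has no such check.
inline bool is_digit(int c) {
    assert(c >= -1 && c <= 255);
    return (table()[static_cast<std::size_t>(c + 1)] & kDigit) != 0;
}

}  // namespace ctype_sketch
```

Any argument below -1 would index before the start of the table, which is where the undefined behaviour discussed in this thread comes from.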

(The standard requires EOF to have a negative value. This is a good
reason for it to be exactly -1, and I've never heard of an
implementation where EOF != -1.)

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for XCOM Labs
void Void(void) { Void(); } /* The recursive call of the void */

Keith Thompson

unread,

Feb 23, 2023, 5:20:35 PM2/23/23

to

An implementation could make _ctype_table cover values from SCHAR_MIN to
UCHAR_MAX and use an SCHAR_MIN offset when indexing it. That would make
it well defined for any value within the range of signed char, char, or
unsigned char. No implementations are *required* to do this, but any
that do will avoid crashing when passing arbitrary char values to the
is*() and to*() functions.

GNU's glibc appears to do something like this.

I'd like to see a future standard require well defined behavior for all
values from SCHAR_MIN to UCHAR_MAX.
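
A rough sketch of such a wider table, under the assumption that it is indexed by c - SCHAR_MIN. All names here are hypothetical, and this is not a description of how glibc actually lays out its tables.

```cpp
#include <array>
#include <cassert>
#include <climits>
#include <cstddef>

namespace wide_table {

constexpr int kLow = SCHAR_MIN;                              // typically -128
constexpr std::size_t kSize = UCHAR_MAX - SCHAR_MIN + 1;     // typically 384

inline const std::array<bool, kSize>& digit_table() {
    static const std::array<bool, kSize> t = [] {
        std::array<bool, kSize> a{};   // all false
        for (int c = '0'; c <= '9'; ++c) a[static_cast<std::size_t>(c - kLow)] = true;
        return a;
    }();
    return t;
}

// Defined for every value that a signed char, char, or unsigned char can hold.
inline bool is_digit(int c) {
    assert(c >= SCHAR_MIN && c <= UCHAR_MAX);
    return digit_table()[static_cast<std::size_t>(c - kLow)];
}

}  // namespace wide_table
```

With this layout a negative plain char read from a string lands in the lower part of the table instead of before it.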

(There could be a problem treating -1 as EOF and 255 as the letter 'ÿ'.
I'm tempted to argue that the special treatment of EOF has outlived its
usefulness, but I'm not suggesting a breaking change.)

Keith Thompson

unread,

Feb 23, 2023, 5:32:03 PM2/23/23

to

Andrey Tarasevich <andreyta...@hotmail.com> writes:

> On 02/23/23 2:34 AM, Mut...@dastardlyhq.com wrote:
>>>
>>> Undefined behavior may or may not involve a program crash.

>> I would still consider a crash to be a bug. Undefined would just be
>> returning
>> rubbish.
>

> Nope. "Returning rubbish" would be an example _unspecified
> behavior_. Undefined is a wholly different thing.

Crashing, returning rubbish, returning a sensible result, and
making demons fly out of your nose are *all* permitted consequences
of undefined behavior.

Unspecified behavior is limited to two or more possibilities
that are always (C) or usually (C++) specified by the standard.
Implementation-defined behavior is unspecified behavior where
the implementation must document its choice. The standard never
includes nasal demons as one of the possibilities.
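
The last two categories can be illustrated with a short sketch (all names are hypothetical). The order in which the operands of + are evaluated is unspecified, so two outcomes are possible and no others. Whether plain char is signed is implementation-defined, so the implementation documents it and the program can query it.

```cpp
#include <cassert>
#include <climits>
#include <limits>
#include <string>

namespace behaviour_sketch {

inline int note(std::string& log, char tag) { log += tag; return 1; }

// Unspecified behaviour: the result is "fg" or "gf", never anything else.
inline std::string evaluation_order() {
    std::string log;
    const int sum = note(log, 'f') + note(log, 'g');
    assert(sum == 2);
    (void)sum;   // keeps builds with NDEBUG free of unused-variable warnings
    return log;
}

// Implementation-defined behaviour: the signedness of plain char.
constexpr bool plain_char_is_signed = std::numeric_limits<char>::is_signed;

}  // namespace behaviour_sketch
```

Undefined behaviour has no such sketch, since by definition no particular outcome can be relied on.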

Keith Thompson

unread,

Feb 23, 2023, 5:41:59 PM2/23/23

to

5.2.1 (I'm using the n1570 C standard draft) does not say that
characters outside the basic character set can be digits. In
enumerating the characters that are included in the basic source and
execution character sets, it says:

the 10 decimal *digits*

0 1 2 3 4 5 6 7 8 9

The word "digits" is in italics, so this is the definition of the word.
If I'm reading it correctly, a character like '²' (superscript two)
might be in the extended character set, but it cannot be a "digit" in
the meaning used in the standard.

Similarly:

A *letter* is an uppercase letter or a lowercase letter as defined
above; in this International Standard the term does not include
other characters that are letters in other alphabets.

where the "above" includes a list of the 26 uppercase and 26 lowercase
Latin letters.

The isupper and islower functions can return a true result either for a
*letter* or for other locale-specific characters. isdigit() is not
locale-specific; it tests only for "any decimal-digit character (as
defined in 5.2.1)".

Keith Thompson

unread,

Feb 23, 2023, 5:49:11 PM2/23/23

to

sc...@slp53.sl.home (Scott Lurndal) writes:
[...]

> And, of course, when 'isdigit' et alia were designed, 7-bit ASCII was
> limited to the positive subset of a signed 8-bit character, and
> there was no need to ever pass the value assigned to the EOF macro.
>
> if (isascii(c) && isdigit(c))
>
> is using the API in the manner in which it was designed.

Perhaps, but isascii() was never included in the C or C++ standard
(neither of which excludes EBCDIC or other character sets).

The is*() and to*() functions can safely handle the value returned
by getchar(), which is an int either in the range of unsigned char
or equal to EOF. They cannot safely handle arbitrary values in
a string.

The undefined behavior for negative values other than EOF is clearly
stated in the standard, so any program that fails because of it
is a buggy program -- but I suggest that it's also a misfeature,
and arguably a bug, in the standard itself.

I wouldn't mind seeing a future standard require plain char to be
unsigned. I wonder if there are any strong arguments against that.
(Yes, it would require some work for compiler and library implementers.)
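
For the loop in the original question, the usual way to stay within the documented domain is to convert each char to unsigned char before the call. A minimal sketch, with to_upper_in_place as a hypothetical helper name:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Upper-cases in place according to the current C locale.
// Going through unsigned char means std::toupper never receives a
// negative value, whatever the signedness of plain char.
inline void to_upper_in_place(std::string& s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
}
```

This only removes the undefined behaviour. Bytes outside the ASCII range are still handled one at a time, so multi-byte UTF-8 sequences are not case-converted.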

Andrey Tarasevich

unread,

Feb 23, 2023, 8:29:04 PM2/23/23

to

On 02/23/23 1:57 PM, Keith Thompson wrote:
> (The standard requires EOF to have a negative value. This is a good
> reason for it to be exactly -1, and I've never heard of an
> implementation where EOF != -1.)

https://github.com/xinu-os/xinu/blob/master/include/stddef.h

--
Best regards,
Andrey

Ben Bacarisse

unread,

Feb 23, 2023, 9:15:47 PM2/23/23

to

But 7.4 (<ctype.h>) p2 says:

2. The behavior of these functions is affected by the current
locale. Those functions that have locale-specific aspects only when
not in the "C" locale are noted below.

and such aspects are documented for isupper (and others) but /not/ for
isdigit. That distinction is pointless unless isdigit is supposed to
be an exception.

--
Ben.

Richard Damon

unread,

Feb 23, 2023, 10:13:07 PM2/23/23

to

On 2/23/23 5:48 PM, Keith Thompson wrote:
> sc...@slp53.sl.home (Scott Lurndal) writes:
> [...]
>> And, of course, when 'isdigit' et alia were designed, 7-bit ASCII was
>> limited to the positive subset of a signed 8-bit character, and
>> there was no need to ever pass the value assigned to the EOF macro.
>>
>> if (isascii(c) && isdigit(c))
>>
>> is using the API in the manner in which it was designed.
>
> Perhaps, but isascii() was never included in the C or C++ standard
> (neither of which excludes EBCDIC or other character sets).
>
> The is*() and to*() functions can safely handle the value returned
> by getchar(), which is an int either in the range of unsigned char
> or equal to EOF. They cannot safely handle arbitrary values in
> a string.
>
> The undefined behavior for negative values other than EOF is clearly
> stated in the standard, so any program that fails because of it
> is a buggy program -- but I suggest that it's also a misfeature,
> and arguably a bug, in the standard itself.
>
> I wouldn't mind seeing a future standard require plain char to be
> unsigned. I wonder if there are any strong arguments against that.
> (Yes, it would require some work for compiler and library implementers.)
>

The one strong argument against it is that many existing implementations
have documented that their char is signed, and many (non-portable)
applications have been written based on that assumption.

FORCING a quietly breaking change to an implementation that breaks
existing applications goes against the general philosophy the Standards
Committee has been following.

Keith Thompson

unread,

Feb 24, 2023, 12:44:49 AM2/24/23

to

Well, now I have.

From the link:

#define EOF (-2) /**< End-of-file (usually from read) */

james...@alumni.caltech.edu

unread,

Feb 24, 2023, 1:00:39 AM2/24/23

to

On Thursday, February 23, 2023 at 4:32:13 AM UTC-5, Mut...@dastardlyhq.com wrote:
> On Wed, 22 Feb 2023 21:30:04 -0800 (PST)
> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
> >On Wednesday, February 22, 2023 at 4:27:12 AM UTC-5, Mut...@dastardlyhq.com
> >wrote:

> >> No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > >
> >
> >> vs >>) template syntax hack up until 2011.
> >

> >If white-space between tokens were significant after translation phase 5, the
> >way in
> >which people used it wouldn't qualify as a "convention", but as a necessity for
> >
> >correct code. Such conventions are chosen to make code easier for humans to
> >understand, not because they are needed to ensure that compilers handle the
> >code
> >correctly.
> >
> >Note: white-space within tokens is always significant, and white-space between
> >tokens can be significant in translation phase 4 and earlier.
> Quite obviously a program with no whitespace won't compile. ...

It is in fact quite commonly the case that white space in C++ code serves mainly
to separate tokens, and it doesn't matter which kind of white-space it is, nor how
much of it there is. In many cases it doesn't even matter whether or not it is present,
but you're quite right - there's also many cases where white-space that serves solely
to separate tokens cannot be removed entirely. That is NOT what I'm talking about.

> ... That doesn't mean
> the whitespace is significant in the programming sense.

As I said above, the key fact that makes your claim basically correct is that, starting
with translation phase 5, white space separating tokens is no longer significant. But
a careful examination of that statement will reveal that there are several contexts in
which white space does more than merely separate tokens.

First of all, there's three kinds of tokens which are allowed to contain white-space:
header-names, character literals, and string literals. None of those can contain a
new-line character, but they can contain any other kind of white-space. Which
white-space characters they contain, and how many they contain, can directly
affect the behavior of the program.

Secondly, white-space characters that do separate tokens remain significant during
translation phase 4, in four different ways:
1. It matters very much that the only white-space allowed between pre-processing
tokens in a pre-processing directive are spaces and horizontal tabs (15.1p5), and
that the white-space separating such a directive from the surrounding code must
be an unescaped new-line.
2. Duplicate #defines for the same identifier are allowed only if the white-space
separations in the replacement lists are equivalent (15.6p2).
3. The lparen immediately following a function-like macro's name, either in the
definition or invocation of the macro, must NOT be preceded by white-space. It is
quite common for people to write code which relies upon this fact to suppress
invocation of a function-like macro, requiring that an actual function be called
instead.
4. When a function-like macro's parameter is the operand of a stringizing
operator, each sequence of white-space characters separating the pre-processing
tokens in that argument is replaced by a single space character, which survives in
the resulting string literal, and as a result can affect the behavior of the program.
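
Point 4 can be checked with a small sketch. STRINGIZE, spaced() and compact() are names made up for the example.

```cpp
#include <cassert>
#include <cstring>

#define STRINGIZE(x) #x

// Each run of white-space between the argument's tokens becomes one space;
// white-space before the first and after the last token is dropped.
inline const char* spaced()  { return STRINGIZE(  a    +    b  ); }

// With no white-space between the tokens, none appears in the result.
inline const char* compact() { return STRINGIZE(a+b); }
```

Here spaced() yields "a + b" and compact() yields "a+b", so the presence of white-space between tokens changes the program's data even though the amount does not.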

Keith Thompson

unread,

Feb 24, 2023, 1:20:24 AM2/24/23

to

"james...@alumni.caltech.edu" <james...@alumni.caltech.edu> writes:
[...]

> 3. The lparen immediately following a function-like macro's name, either in the
> definition or invocation of the macro, must NOT be preceded by white-space. It is
> quite common for people to write code which relies upon this fact to suppress
> invocation of a function-like macro, requiring that an actual function be called
> instead.

[...]

The '(' must immediately follow the macro name only in the definition,
not in an invocation.

#include <iostream>
#define PLUS1(x) ((x) + 1)
int main() {
std::cout << PLUS1 ( 42 ) << "\n";
}

(Output is 43.)

The whole point is that a function-like macro can be used as if it were
a function. The '(' has to immediately follow the name in the
definition to distinguish a function-like macro from a macro with no
parameters whose expansion happens to start with a '(' token.

If PLUS1 were defined both as a macro and as a function, the way to
bypass the macro definition would be `(PLUS1)(42)` (with arbitrary
whitespace).
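
The bypass can be sketched by giving the same name to a function and to a macro. In this hypothetical example the macro deliberately adds 100 rather than 1, purely so the two can be told apart.

```cpp
// The function is defined first; the macro then hides its name.
constexpr int PLUS1(int x) { return x + 1; }
#define PLUS1(x) ((x) + 100)

// Macro invocation: white-space before '(' is allowed here.
constexpr int via_macro = PLUS1 ( 1 );

// The name is followed by ')', not '(', so it is not treated as a macro
// invocation and the real function is called.
constexpr int via_function = (PLUS1)(1);
```

Here via_macro is 101 and via_function is 2. The standard library relies on the same rule: any library function that is also provided as a macro can be reached by parenthesising its name.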

james...@alumni.caltech.edu

unread,

Feb 24, 2023, 1:25:01 AM2/24/23

to

On Thursday, February 23, 2023 at 12:16:33 PM UTC-5, Mut...@dastardlyhq.com wrote:
> On Thu, 23 Feb 2023 08:28:58 -0800 (PST)
> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:

...

> >The fact that it doesn't matter to the implementation doesn't mean that it
> >makes no
> >difference. Conventions like this one are for humans, not compilers, and the
> >convention you use can make a difference in how easy it is for a human to
> >understand the code.
> Ok, you're just redefining the meaning of significant whitespace then. So how
> would you describe how python treats whitespace?

No, I'm not re-defining the meaning of "significant white-space", I'm
distinguishing between what is significant to the implementation, and what is
significant to the human reader. We call things "coding conventions" precisely
because they are not significant to the implementation, but are significant to
the human reader. If they were significant to the implementation, we would
refer to them as "the correct syntax", or some similar phrase.

I'm not very familiar with python. My seven-year-old son has shown an interest
in the Scratch programming language, and my wife is insisting that I introduce
him to a "real" programming language as well. I chose Python, because it would
give me an opportunity to learn the language while teaching it to him, and
because I'd heard that it's substantially easier to learn than C or C++. My wife
also insisted that I buy a book for him to study Python, rather than using
online sources, which is what I recommended.
I chose "Learn Python" by Eric Wall, which came well-recommended on Amazon.
I have no idea why. It is filled with typos, misprints, poor wording, misleading
wording, and downright incorrect information. It also has no index, which I've
usually found indispensable for books of this type. Most relevant to your
comment, it contains example code that needs to be correctly indented in order
to work properly, and isn't. It makes no mention of the fact that indentation is
needed. Luckily, the Python interpreter that I installed on his computer provided
helpful error messages.

I would greatly appreciate recommendations for an alternative source of
instruction in Python, suitable for a 7-year-old (and his 64-year old dad).

James Kuyper

unread,

Feb 24, 2023, 1:34:44 AM2/24/23

to

On 2/24/23 01:20, Keith Thompson wrote:
> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> writes:
> [...]
>> 3. The lparen immediately following a function-like macro's name, either in the
>> definition or invocation of the macro, must NOT be preceded by white-space. It is
>> quite common for people to write code which relies upon this fact to suppress
>> invocation of a function-like macro, requiring that an actual function be called
>> instead.
> [...]
>
> The '(' must immediately follow the macro name only in the definition,
> not in an invocation.
>
> #include <iostream>
> #define PLUS1(x) ((x) + 1)
> int main() {
> std::cout << PLUS1 ( 42 ) << "\n";
> }
>
> (Output is 43.)
>
> The whole point is that a function-like macro can be used as if it were
> a function. The '(' has to immediately follow the name in the
> definition to distinguish a function-like macro from a macro with no
> parameters whose expansion happens to start with a '(' token.
>
> If PLUS1 were defined both as a macro and as a function, the way to
> bypass the macro definition would be `(PLUS)(42)` (with arbitrary
> whitespace).

You're correct, I got that mixed up.

James Kuyper

unread,

Feb 24, 2023, 2:14:23 AM2/24/23

to

On 2/23/23 12:20, Richard Damon wrote:
> On 2/23/23 11:57 AM, James Kuyper wrote:
>> On 2/23/23 06:20, Paavo Helde wrote:
>>> 23.02.2023 12:34 Mut...@dastardlyhq.com kirjutas:
>> ...
>>>> Undefined would just be returning
>>>> rubbish. I can't even figure out HOW it would crash since all its
>>>> doing is
>>>>
>>>> return (c >= '0' && c <= '9')
>>>
>>> Nope, because it's not known at the compile time which characters
>>> should be considered digits. It might do something like
>>
>> No, isdigit() and isxdigit() are special cases. They are not
>> locale-dependent. Regardless of locale, which characters qualify is
>> defined by cross-referencing section 5.2.1.

Note: I forgot that I was posting this to comp.lang.c++. That's a
citation from the C standard. It's the correct citation to make, because
the C++ standard defines the behavior of isdigit() and isxdigit() solely
by cross-referencing the C standard. However, I should have pointed out
the fact that I was referring to the C standard.

> Which still allows for "extended characters" to be added. It just says
> that such an addition needs to hold for ALL locales.

I'm not sure what you mean by "adding" extended characters. The
definition of "digit" I referenced from 5.2.1 defines the term by
listing exactly 10 characters, and identifying 10 as the number of
things that qualify as digits. They are all members of the basic
character set.

Note that all of the following functions are described explicitly as
returning true for a locale-specific set of characters:
isalpha(), isblank(), islower(), ispunct(), isspace(), isupper()

isalnum() is defined in terms of isalpha(), and is therefore
locale-specific, but only insofar as isalpha() is locale-specific.

iscntrl(), isgraph(), and isprint() are not directly described as
locale-specific, but they cross-reference the definitions of control
character and printing character in 7.4p3, which explicitly describe
those as locale-specific sets.

isdigit() and isxdigit() are the only ones to not be described, either
indirectly or directly, as returning true for a locale-specific set of
characters. The other functions establish that such needs to be
explicitly stated in order for it to be true.
Finally, footnote 241, while not normative, clarifies that "The only

Paavo Helde

unread,

Feb 24, 2023, 3:50:02 AM2/24/23

to

24.02.2023 00:48 Keith Thompson kirjutas:
> I wouldn't mind seeing a future standard require plain char to be
> unsigned. I wonder if there are any strong arguments against that.
> (Yes, it would require some work for compiler and library implementers.)

I encountered this scenario last year when porting our codebase to
aarch64. It was broken all over the place and it took me several days to
cope with the char being unsigned.

Admittedly, this was because we had used char as a synonym for std::int8_t
(codebase originates from before std::int8_t) and there was a ton of code
for working with numeric arrays of that type. Thankfully most of that
code was templated, so could be adjusted relatively easily.

For strings there were only a couple of fixes, like UTF-8 handling
functions assuming the multibyte character values are negative, etc.
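
A sketch of how such a check can be written so that it does not depend on the signedness of char. The function names are hypothetical, and 8-bit chars are assumed.

```cpp
// True for any byte that belongs to a multi-byte UTF-8 sequence
// (lead byte or continuation byte): the top bit is set.
constexpr bool is_utf8_multibyte(char c) noexcept {
    return (static_cast<unsigned char>(c) & 0x80u) != 0;
}

// True only for continuation bytes, which have the form 10xxxxxx.
constexpr bool is_utf8_continuation(char c) noexcept {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}
```

Testing the bit pattern after a cast to unsigned char gives the same answer on x86-64 (signed char) and aarch64 (unsigned char), whereas a test such as c < 0 only works where char is signed.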

Mut...@dastardlyhq.com

unread,

Feb 24, 2023, 5:29:25 AM2/24/23

to

On Thu, 23 Feb 2023 20:46:00 +0100
David Brown <david...@hesbynett.no> wrote:

>On 23/02/2023 17:10, Mut...@dastardlyhq.com wrote:
>> Sorry, I get tired of people resorting to "undefined" as some kind of get
>> out clause for some badly written API crashing whole program. isdigit() is a
>> yes/no function. If it is a valid digit it returns 1 else it returns 0
>> regardless. Crashing is a bug. End of.
>
>Sorry, I get tired of people who think /their/ ideas of how languages
>work is somehow correct, despite how it contradicts the actual
>definition and implementations of the language.

We're not talking about the language, we're talking about API functions.
The is*() functions, unlike printf, are not built in to the compiler.

>You can argue that this is how the function /should/ work until you are
>blue in the face. It doesn't change the simple facts about how it
>/does/ work.

I'm aware how it does work. I'm also saying that crashing is a bug.

You and a lot of other people seem to have trouble differentiating the
following:

logic bug
crash bug

A non variadic C function being passed non pointer primitive types should be
able to cope with ALL possible inputs without going tits up. If it doesn't it's
badly written. Indexing an array without checking the index value is beginners
error 101 and there is NO EXCUSE WHATSOEVER for that sort of code being in a
common library function.

>> The result is undefined. Presumably you think ++ crashing the program is
>> an acceptable outcome?
>>
>
>Sure.
>
>I greatly prefer a warning from the compiler that there is a bug in my
>code. And I don't expect the compiler to go out of its way to cause a
>crash - I expect it to generate efficient code as best it can. But with
>a bug in the source code I can't expect the compiler to somehow generate
>"correct" code.

Seems to me you're just arguing the toss for the sake of it now.

Mut...@dastardlyhq.com

unread,

Feb 24, 2023, 5:32:50 AM2/24/23

to

On Thu, 23 Feb 2023 22:00:17 -0800 (PST)
"james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
>First of all, there's three kinds of tokens which are allowed to contain
>white-space:
>header-names, character literals, and string literals. None of those can
>contain a
>new-line character, but they can contain any other kind of white-space. Which

char *str = "hello\
world\n";

2 newlines in there.

>white-space characters they contain, and how many they contain, can directly
>affect the behavior of the program.

Significant whitespace means significant in and of itself, not simply as
a kind of separator. However you're not going to admit you were wrong so
let's call it a day.
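
For reference, a small sketch of what that literal contains after translation. As the standard describes it, a backslash immediately followed by a new-line is deleted in translation phase 2 (line splicing), so the only new-line character left in the string is the one produced by the \n escape. The array name spliced is made up for the example.

```cpp
// Same literal as above, stored in an array so its size can be inspected.
constexpr char spliced[] = "hello\
world\n";

// 10 letters + 1 new-line + 1 terminating null character.
static_assert(sizeof(spliced) == 12, "the spliced line break is not part of the string");
```

The string value is therefore "helloworld\n": the line break in the source text separates nothing and contributes no character.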