What's the deal with the "toupper" family?

Frederick Gotham

Jul 5, 2006, 10:24:44 AM

The "toupper" function takes an int as an argument. That's not too
irrational given that a character literal is of type "int" in C.
(Although why it isn't of type "char" escapes me... )

The "toupper" function imposes a further constraint in that the value
passed to it must be representable as an unsigned char. (If C does not
require all character values to be positive, then again, this constraint
too escapes me... )

Let's say we have the following hypothetical system:

char is signed.

UCHAR_MAX == 255
SCHAR_MAX == 127
CHAR_MAX == 127

INT_MAX == 65535

We are able to represent all the characters of ASCII using positive
numbers, but anything beyond that would require negative numbers on this
system.

So what's the deal with using toupper on these extraneous characters
whose numeric value is negative?

Let's say we have a German sharp S, or a Spanish N with a curly thing on
top of it, and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );

One more thing. If you have a signed integer value, and you cast it to
its corresponding unsigned integer type, and then back to the signed
type, are you guaranteed to have the same value? i.e.:

signed char s = -5;

unsigned char us = s;

s = us;

assert( -5 == s ); /* Is this guaranteed? */

--

Frederick Gotham

Eric Sosman

Jul 5, 2006, 11:06:39 AM

Frederick Gotham wrote:
> The "toupper" function takes an int as an argument. That's not too
> irrational given that a character literal is of type "int" in C.
> (Although why it isn't of type "char" escapes me... )
>
> The "toupper" function imposes a further constraint in that the value
> passed to it must be representable as an unsigned char. (If C does not
> require all character values to be positive, then again, this constraint
> too escapes me... )

Back in the Dawn of C (well, the Early Morning), the
<ctype.h> functions were defined to operate on all the values
returned by getchar(), getc(), and fgetc(). These functions
need to be able to return any legitimate character code plus
a code unlike all characters to indicate an input failure.
The scheme adopted for the input functions was that they would
return a non-negative int to represent an actual character code
or a negative int to represent input failure. The <ctype.h>
functions thus inherited their oddities from the I/O functions'
practice of returning "special values" in place of "real data."

If one were designing the C library today, I doubt these
decisions would be made in the same way. getchar() et al. are
already in trouble on systems where sizeof(int)==1, because there
is no "space" for a distinguished non-character EOF value. If
getchar() returns EOF, it could actually be "real data:" you
cannot tell from the returned value alone, but must consult the
feof() and ferror() functions.
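
A minimal sketch of that check, assuming a hypothetical helper named count_chars (not part of any library): when fgetc() returns EOF, the loop asks feof() and ferror() whether the value really marks the end of input. On ordinary hosted systems where int is wider than char the extra test is redundant but harmless.

```c
#include <stdio.h>

/* Hypothetical helper: counts the characters readable from fp.
   Returns the count, or -1 if a read error occurred. */
long count_chars(FILE *fp)
{
    long n = 0;
    for (;;) {
        int c = fgetc(fp);   /* int, not char, so EOF stays distinguishable */
        if (c == EOF && (feof(fp) || ferror(fp)))
            break;           /* a real end-of-file or error */
        n++;                 /* otherwise it was data, even if it equals EOF */
    }
    return ferror(fp) ? -1 : n;
}
```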

Even if the "in-band" signalling by the I/O functions were
retained, I doubt that newly-designed <ctype.h> functions would
be defined on the entire range of values getchar() can return.
Rather, they would be defined for all possible char values and
would make no special provision for EOF. Then we'd need none
of this silly casting when applying the <ctype.h> functions to
characters taken from a string.

However, that particular horse left the barn long ago.

> Let's say we have the following hypothetical system:
>
> char is signed.
>
> UCHAR_MAX == 255
> SCHAR_MAX == 127
> CHAR_MAX == 127
>
> INT_MAX == 65535
>
> We are able to represent all the characters of ASCII using positive
> numbers, but anything beyond that would require negative numbers on this
> system.

Character codes 128 through 255 would not be representable
as char, but they would be representable as unsigned char or as
int.

> So what's the deal with using toupper on these extraneous characters
> whose numeric value is negative?

As above: The argument to a <ctype.h> function must be either
the negative value EOF or else a character code represented as
an unsigned char value. A <ctype.h> function should never see a
negative character code; if it does, the caller is at fault.

> Let's say we have a German sharp S, or a Spanish N with a curly thing on
> top of it, and that its numeric value is negative. How do we go about
> passing their value to toupper? Should we do the following?
>
> toupper( (unsigned char)c );

Yes.
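
As a sketch of that idiom applied to a whole string, assuming a hypothetical helper named upcase: each plain char is converted to unsigned char before it reaches toupper(), so the function only ever sees values in the range 0..UCHAR_MAX.

```c
#include <ctype.h>

/* Hypothetical helper: uppercases a string in place. */
void upcase(char *s)
{
    for (; *s != '\0'; s++) {
        /* Convert to unsigned char first so toupper() never sees a
           negative value. */
        *s = (char)toupper((unsigned char)*s);
    }
}
```

Storing the result back into a plain char is itself an implementation-defined conversion where char is signed and the result exceeds CHAR_MAX; mainstream implementations give the expected character.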

> (One more thing. If you have a signed integer value, and you cast it to
> its corresponding unsigned integer type, and then back to the signed
> type, are you guaranteed to have the same value? i.e.:
>
> signed char s = -5;
> unsigned char us = s;

No problem yet: us has the value UCHAR_MAX-4 (251, for
an eight-bit character).

> s = us;

Trouble in River City. The value of us is out of range
for a signed char, so you get either (1) an implementation-
defined result stored in s, or (2) an implementation-defined
signal is raised. (This is not undefined behavior, technically
speaking, but it might as well be. If a signal is raised, there
is no way to handle that signal and continue without invoking
undefined behavior. The distinction is somewhat like observing
that you will not be harmed by a fall from a hundred-story tower
but only by the sudden stop at the end.)

On most implementations nowadays, alternative (1) is taken
and the implementation-defined result happens to be equal to the
value s had before conversion to unsigned char. This is not an
outcome guaranteed by the language itself, though.
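
One way to get the original value back without relying on the out-of-range conversion is to undo the wrap-around arithmetically. This is only a sketch: signed_value_of is a hypothetical name, and it assumes UCHAR_MAX < INT_MAX.

```c
#include <limits.h>

/* Hypothetical helper: recovers the signed value whose conversion to
   unsigned char produced u, without any out-of-range conversion.
   Assumes UCHAR_MAX < INT_MAX. */
int signed_value_of(unsigned char u)
{
    if (u <= SCHAR_MAX)
        return u;                    /* value fits as is */
    return (int)u - UCHAR_MAX - 1;   /* undo the modular wrap-around */
}
```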

--
Eric Sosman
eso...@acm-dot-org.invalid

SM Ryan

Jul 5, 2006, 11:20:31 AM

Frederick Gotham <fgot...@SPAM.com> wrote:

# We are able to represent all the characters of ASCII using positive
# numbers, but anything beyond that would require negative numbers on this
# system.

Beyond ASCII, there are many different encodings.

# Let's say we have a German sharp S, or a Spanish N with a curly thing on
# top of it, and that its numeric value is negative. How do we go about
# passing their value to toupper? Should we do the following?

Don't depend on the encoding of non-ASCII characters. Instead you can
use wide characters (wchar_t) and functions like towupper.
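
A rough sketch of the wide-character route, assuming a hypothetical helper named upcase_wide. Which non-ASCII letters towupper() maps depends on the current locale; in the default "C" locale it may only map the basic Latin letters.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercases a wide string in place. */
void upcase_wide(wchar_t *s)
{
    for (; *s != L'\0'; s++)
        *s = (wchar_t)towupper((wint_t)*s);
}
```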

--
SM Ryan http://www.rawbw.com/~wyrmwif/
We found a loophole; they can't keep us out anymore.

Ben Pfaff

Jul 5, 2006, 11:46:31 AM

Frederick Gotham <fgot...@SPAM.com> writes:

> Let's say we have a German sharp S, or a Spanish N with a curly thing on
> top of it, and that its numeric value is negative. How do we go about
> passing their value to toupper? Should we do the following?
>
> toupper( (unsigned char)c );

Yes. That's the usual thing to do.

> (One more thing. If you have a signed integer value, and you cast it to
> its corresponding unsigned integer type, and then back to the signed
> type, are you guaranteed to have the same value?

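No. The result is implementation-defined, or an implementation-defined
signal is raised, which for portable code is little better than undefined: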

6.3.1.3 Signed and unsigned integers

1 When a value with integer type is converted to another integer
type other than _Bool, if the value can be represented by
the new type, it is unchanged.

2 Otherwise, if the new type is unsigned, the value is converted
by repeatedly adding or subtracting one more than the
maximum value that can be represented in the new type until
the value is in the range of the new type.

3 Otherwise, the new type is signed and the value cannot be
represented in it; either the result is
implementation-defined or an implementation-defined signal
is raised.
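
To make paragraph 2 concrete, here is a sketch that applies the rule by hand and should agree with a plain cast. The name to_uchar_by_rule is hypothetical, and the code assumes UCHAR_MAX < LONG_MAX.

```c
#include <limits.h>

/* Hypothetical helper: applies the rule of paragraph 2 by hand.
   Assumes UCHAR_MAX < LONG_MAX. */
unsigned char to_uchar_by_rule(int v)
{
    long x = v;
    const long modulus = (long)UCHAR_MAX + 1;
    while (x < 0)
        x += modulus;        /* repeatedly add one more than the maximum */
    while (x > UCHAR_MAX)
        x -= modulus;        /* or repeatedly subtract it */
    return (unsigned char)x; /* x is now in range, so the value is unchanged */
}
```

With an eight-bit char, -5 becomes -5 + 256 = 251, which is UCHAR_MAX - 4.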

--
"This is a wonderful answer.
It's off-topic, it's incorrect, and it doesn't answer the question."
--Richard Heathfield

Jack Klein

Jul 5, 2006, 1:02:20 PM

On Wed, 05 Jul 2006 14:24:44 GMT, Frederick Gotham
<fgot...@SPAM.com> wrote in comp.lang.c:

>
> The "toupper" function takes an int as an argument. That's not too
> irrational given that a character literal is of type "int" in C.
> (Although why it isn't of type "char" escapes me... )

Obviously you lack an understanding of K&R C, not to mention BCPL and
B.

> The "toupper" function imposes a further constraint in that the value
> passed to it must be representable as an unsigned char. (If C does not
> require all character values to be positive, then again, this constraint
> too escapes me... )

What does not escape you? All of the to... and is... functions
defined in <ctype.h> work perfectly with the int value returned by
getchar(), which returns valid characters in the range of
0...UCHAR_MAX, plus EOF which is guaranteed not to be in that range.

> Let's say we have the following hypothetical system:
>
> char is signed.
>
> UCHAR_MAX == 255
> SCHAR_MAX == 127
> CHAR_MAX == 127
>
> INT_MAX == 65535
>
>
> We are able to represent all the characters of ASCII using positive
> numbers, but anything beyond that would require negative numbers on this
> system.
>
> So what's the deal with using toupper on these extraneous characters
> whose numeric value is negative?

"The deal" is undefined behavior.

> Let's say we have a German sharp S, or a Spanish N with a curly thing on
> top of it, and that its numeric value is negative. How do we go about
> passing their value to toupper? Should we do the following?
>
> toupper( (unsigned char)c );
>
>
>
> (One more thing. If you have a signed integer value, and you cast it to
> its corresponding unsigned integer type, and then back to the signed
> type, are you guaranteed to have the same value? i.e.:

No.

> signed char s = -5;
>
> unsigned char us = s;
>
> s = us;
>
> assert( -5 == s ); /* Is this guaranteed? */

Again, not. Given your assumption that the implementation has
UCHAR_MAX 255 and CHAR_MAX 127, assigning a value of -5 to an unsigned
char is well defined, and results in an unsigned char with the
value 251. Assigning the value 251 to a signed char, a value outside
its range, either produces an implementation-defined result or raises
an implementation-defined signal.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html

Walter Roberson

Jul 5, 2006, 1:52:24 PM

In article <mprna2dt1ds799dbk...@4ax.com>,

Jack Klein <jack...@spamcop.net> wrote:
>On Wed, 05 Jul 2006 14:24:44 GMT, Frederick Gotham
><fgot...@SPAM.com> wrote in comp.lang.c:

>> The "toupper" function imposes a further constraint in that the value
>> passed to it must be representable as an unsigned char. (If C does not
>> require all character values to be positive, then again, this constraint
>> too escapes me... )

>What does not escape you? All of the to... and is... functions
>defined in <ctype.h> work perfectly with the int value returned by
>getchar(), which returns valid characters in the range of
>0...UCHAR_MAX, plus EOF which is guaranteed not to be in that range.

Not according to C89. According to C89, getchar() is equivalent
[but possibly a macro] to fgetc(stdin), and fgetc() is defined as
returning "an unsigned char converted to an int". In implementations
in which UCHAR_MAX exceeds INT_MAX [e.g., sizeof(char) == sizeof(int),
in which case UCHAR_MAX may be UINT_MAX > INT_MAX]
then the conversion of values in the range INT_MAX+1 to UCHAR_MAX
has implementation-defined results that are -not- guaranteed
to be in the range of 0..UCHAR_MAX.

C89 does NOT define fgetc() [and transitively, getchar()] such that
returning a negative value indicates EOF or an error. C89 defines
fgetc() as returning the specific value EOF upon EOF or error,
and defines EOF only as "a negative integral constant". As long as
the value EOF is not one of the values that can be returned for valid
characters, getchar() is free to return negative values.

For example, an implementation might choose to include keycode
modifiers such as LEFT_ALT LEFT_CONTROL RIGHT_ALT RIGHT_CONTROL
CAPS_LOCK NUM_LOCK KEY_DOWN KEY_UP for characters from some sources.
In this example, on a system with 16 bit ints, all 8 of these
flag bits might be set, and keys such as F12 could generate basic
values in the 128..255 range. The composite result could be
something greater than INT_MAX, and the implementation behaviour
in converting the value to an int might be to just copy the bits
and let the value be reinterpreted as 2's complement, leading to
negative values. The implementation could know, however, that
there is no key whose basic value is 255, and so could set EOF as
LEFT_ALT|LEFT_CONTROL|RIGHT_ALT|RIGHT_CONTROL|CAPS_LOCK|NUM_LOCK|
KEY_DOWN|KEY_UP|255
which in this hypothetical arrangement would happen to come out,
after interpretation as a signed 2s complement integer, as -1 .
EOF would be negative, would not represent any possible character
in the hypothetical system, but there would be valid negative values.
--
All is vanity. -- Ecclesiastes

Ben Pfaff

Jul 5, 2006, 2:04:30 PM

robe...@ibd.nrc-cnrc.gc.ca (Walter Roberson) writes:

> In article <mprna2dt1ds799dbk...@4ax.com>,
> Jack Klein <jack...@spamcop.net> wrote:
>>On Wed, 05 Jul 2006 14:24:44 GMT, Frederick Gotham
>><fgot...@SPAM.com> wrote in comp.lang.c:
>
>>> The "toupper" function imposes a further constraint in that the value
>>> passed to it must be representable as an unsigned char. (If C does not
>>> require all character values to be positive, then again, this constraint
>>> too escapes me... )
>
>>What does not escape you? All of the to... and is... functions
>>defined in <ctype.h> work perfectly with the int value returned by
>>getchar(), which returns valid characters in the range of
>>0...UCHAR_MAX, plus EOF which is guaranteed not to be in that range.
>
> Not according to C89. According to C89, getchar() is equivalent
> [but possibly a macro] to fgetc(stdin), and fgetc() is defined as
> returning "an unsigned char converted to an int". In implementations
> in which UCHAR_MAX exceeds INT_MAX [e.g., sizeof(char) == sizeof(int),
> in which case UCHAR_MAX may be UINT_MAX > INT_MAX]
> then the conversion of values in the range INT_MAX+1 to UCHAR_MAX
> has implementation defined results that are -not- guaranteed
> to be in the range of 0..UCHAR_MAX.

Jack and many of the other posters here are well aware of this.
However, in previous discussions, we've been unable to locate a
hosted implementation that meets these criteria. Some
freestanding ones are known to exist, if I recall correctly, but
freestanding implementations do not include the standard I/O
library.
--
Peter Seebach on C99:
"[F]or the most part, features were added, not removed. This sounds
great until you try to carry a full-sized printout of the standard
around for a day."

Keith Thompson

Jul 5, 2006, 4:13:07 PM

Frederick Gotham <fgot...@SPAM.com> writes:
> The "toupper" function takes an int as an argument. That's not too
> irrational given that a character literal is of type "int" in C.
> (Although why it isn't of type "char" escapes me... )

In K&R C, it wasn't possible for a function to have an argument of
type char. Even in modern C, expressions of type char and short are
promoted to int.
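
Both points can be observed with sizeof. The helper names below are hypothetical and the snippet is only an illustration; it must be compiled as C, since C++ gives character literals type char.

```c
#include <stddef.h>

/* Hypothetical helpers illustrating the two points above. */
size_t size_of_char_constant(void)
{
    return sizeof 'a';        /* in C, 'a' has type int */
}

size_t size_of_promoted_char(void)
{
    char c = 'a';
    return sizeof (c + 0);    /* arithmetic promotes c to int
                                 (unsigned int on exotic systems) */
}
```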

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Peter Nilsson

Jul 5, 2006, 6:56:04 PM

Frederick Gotham wrote:
> The "toupper" function takes an int as an argument. That's not too
> irrational given that a character literal is of type "int" in C.

Not necessarily. Even if é is a member of the execution character set,
the character constant 'é' needn't be a positive value (in the range of
unsigned char.)

> (Although why it isn't of type "char" escapes me... )

Covered elsethread by others.

> The "toupper" function imposes a further constraint in that the value
> passed to it must be representable as an unsigned char. (If C does not
> require all character values to be positive,

It requires the execution character set character codings have
non-negative values. Whether those codings are represented as
non-negative values in (plain) char is another matter.

> then again, this constraint too escapes me... )

Technically, it's not a constraint. It's a prerequisite for the
standard implementation of toupper.

> Let's say we have the following hypothetical system:
>
> char is signed.
>
> UCHAR_MAX == 255
> SCHAR_MAX == 127
> CHAR_MAX == 127
>
> INT_MAX == 65535
>
> We are able to represent all the characters of ASCII using positive
> numbers, but anything beyond that would require negative numbers on this
> system.

As a plain char value yes, however most programs receive input as
though fgetc is storing an unsigned char into char storage.

> So what's the deal with using toupper on these extraneous characters
> whose numeric value is negative?

It's up to the programmer to supply the correct character code value.

> Let's say we have a German sharp S, or a Spanish N with a curly thing
> on top of it,

[Tilde.]

> and that its numeric value is negative. How do we go about
> passing their value to toupper? Should we do the following?
>
> toupper( (unsigned char)c );

That's the clc regular's method. To me, it generally makes more
sense to do...

toupper( * (unsigned char) &c )

...when c is a plain char.

Even on a two's complement system, there is no guarantee that
the cast conversion of a plain char value will yield the original
unsigned char value of the character code.

The following is unlikely (due to QoI), but nonetheless allowed...

UCHAR_MAX: 65535
SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127

--
Peter

Jack Klein

Jul 5, 2006, 7:29:29 PM

On Wed, 5 Jul 2006 17:52:24 +0000 (UTC), robe...@ibd.nrc-cnrc.gc.ca
(Walter Roberson) wrote in comp.lang.c:

As Ben mentioned, it is literally impossible to have a conforming
hosted implementation where INT_MAX < UCHAR_MAX. There can be, and
are, more-or-less conforming implementations where UINT_MAX ==
UCHAR_MAX and therefore UCHAR_MAX > INT_MAX, and I have worked on some
of them.

In fact it is impossible for a conforming getchar() (and related
functions) to exist on a platform where INT_MAX is not at least equal
to UCHAR_MAX. getchar() and its ilk must be able to return UCHAR_MAX
+ 1 distinct values, since each and every value in the range
0...UCHAR_MAX can be read from a stream, and EOF must be
distinguishable from all.

You may ask why I say EOF must be distinguishable from all values in the
range 0 to UCHAR_MAX, and therefore cannot have the same
representation in an int as any of these values.

C99: paragraph 9 of 7.19.1 requires that the macro EOF "expands to an
integer constant expression, with type int and a negative value, that
is returned by several functions to indicate end-of-file, that is, no
more input from a stream".

C90: no paragraph numbers, but the corresponding section of 7.9.1
has identical wording.

No function defined to return EOF on end-of-file (or error) may return
this value unless it detects end-of-file or an error.

Any implementation where UCHAR_MAX > INT_MAX must be a free-standing
implementation. Free-standing implementations are not required to
provide either <stdio.h> or <ctype.h>, so there is no point in arguing
about how such features interact on such a platform.
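
The requirement can be spot-checked at compile time. This is a sketch with a hypothetical function name, eof_is_distinct; it only tests the two conditions stated above and does not prove conformance.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical check: returns 1 if int can hold every unsigned char value
   and EOF is negative, so EOF cannot collide with a character code. */
int eof_is_distinct(void)
{
#if UCHAR_MAX <= INT_MAX
    return EOF < 0;
#else
    return 0;   /* some character codes would not fit in a non-negative int */
#endif
}
```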

Walter Roberson

Jul 5, 2006, 8:10:33 PM

In article <4mgoa29m508h6125b...@4ax.com>,
Jack Klein <jack...@spamcop.net> wrote:

>In fact it is impossible for a conforming getchar() (and related
>functions) to exist on a platform where INT_MAX is not at least equal
>to UCHAR_MAX. getchar() and its ilk must be able to return UCHAR_MAX
>+ 1 distinct values, since each and every value in the range
>0...UCHAR_MAX can be read from a stream, and EOF must be
>distinguishable from all.

Why must every value in the range 0...UCHAR_MAX be readable from
a stream?

>You may ask why I say EOF be distinguishable from all values in the
>range 0 to U_CHAR max,

No, I don't ask that: in my posting I specifically proposed an EOF
distinct from any value that could be reached in the hypothetical system.

What is a stream, that every value 0...UCHAR_MAX must be readable
from it? For example, an implementation could be such that data
read from a file or pipe or socket is returned 8 bits at a time, but that
data read from a console might be augmented with keycode modifiers.

I don't have my standard at home with me: does the standard promise
that all possible values 0 to UCHAR_MAX must be writable to a binary
stream? (If it does so guarantee, then the standard does indicate
that it must be possible to read them back unchanged, except perhaps
trailing nulls.) Does the standard promise that all values
0 to UCHAR_MAX must be ungetc()-able?
--
Prototypes are supertypes of their clones. -- maplesoft

Frederick Gotham

Jul 5, 2006, 8:15:46 PM

Peter Nilsson posted:

> The following is unlikely (due to QoI), but nonetheless allowed...
>
> UCHAR_MAX: 65535

This suggests that an unsigned char has 16 value representation bits, and an
unknown quantity of padding bits.

> SCHAR_MAX: 127
> SCHAR_MIN: -128
> CHAR_MAX: 127

This suggests that a signed char has 8 value representation bits (inclusive
of the sign bit), and at least 8 padding bits, in order to satisfy:

assert( sizeof(signed char) == sizeof(unsigned char) );

--

Frederick Gotham

Jul 5, 2006, 8:17:39 PM

Peter Nilsson posted:

<slightly altered>
> toupper( *(unsigned char const *)&c )

Does anyone else agree with this?

It's safe because an unsigned char cannot have any trap representations,
but nonetheless, does it do what we want it to do, and is it preferable
over the following?

toupper( (unsigned char)c );

--

Frederick Gotham

Ben Pfaff

Jul 5, 2006, 9:11:51 PM

robe...@ibd.nrc-cnrc.gc.ca (Walter Roberson) writes:

> In article <4mgoa29m508h6125b...@4ax.com>,
> Jack Klein <jack...@spamcop.net> wrote:
>
>>In fact it is impossible for a conforming getchar() (and related
>>functions) to exist on a platform where INT_MAX is not at least equal
>>to UCHAR_MAX. getchar() and its ilk must be able to return UCHAR_MAX
>>+ 1 distinct values, since each and every value in the range
>>0...UCHAR_MAX can be read from a stream, and EOF must be
>>distinguishable from all.
>
> Why must every value in the range 0...UCHAR_MAX be readable from
> a stream?

For binary streams there is a guarantee (C99 7.19.2):

3 A binary stream is an ordered sequence of characters that can
transparently record internal data. Data read in from a
binary stream shall compare equal to the data that were
earlier written out to that stream, under the same
implementation. Such a stream may, however, have an
implementation-defined number of null characters appended to
the end of the stream.

For text streams there is no such guarantee.
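
The binary-stream guarantee can be exercised with a temporary file. A minimal sketch, assuming a hypothetical function named roundtrip_all_bytes; tmpfile() opens its stream in binary update mode.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical check: writes every value 0..UCHAR_MAX to a temporary binary
   stream and reads it back. Returns 1 on success, 0 on a mismatch,
   -1 if the temporary file could not be created or written. */
int roundtrip_all_bytes(void)
{
    FILE *fp = tmpfile();            /* opened in "wb+" mode */
    unsigned long v;
    int ok = 1;

    if (fp == NULL)
        return -1;
    for (v = 0; v <= UCHAR_MAX; v++) {
        if (fputc((int)v, fp) == EOF) {
            fclose(fp);
            return -1;
        }
    }
    rewind(fp);
    for (v = 0; v <= UCHAR_MAX && ok; v++) {
        int c = fgetc(fp);
        if (c == EOF || (unsigned long)c != v)
            ok = 0;                  /* short read or changed value */
    }
    fclose(fp);
    return ok;
}
```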
--
"Given that computing power increases exponentially with time,
algorithms with exponential or better O-notations
are actually linear with a large constant."
--Mike Lee

Andrew Poelstra

Jul 5, 2006, 9:54:13 PM

On 2006-07-06, Frederick Gotham <fgot...@SPAM.com> wrote:
> Peter Nilsson posted:
>
><slightly altered>
>> toupper( *(unsigned char const *)&c )
>
>
> Does anyone else agree with this?
>

It looks overly complicated to me.

> It's safe because an unsigned char cannot have any trap representations,
> but nonetheless, does it do what we want it to do, and is it preferable
> over the following?
>
> toupper( (unsigned char)c );
>

No; the latter is much clearer and just as functional, IMHO.

--
Andrew Poelstra <http://www.wpsoftware.net/projects/>
To email me, use "apoelstra" at the above address.
"You people hate mathematics." -- James Harris

Mike S

Jul 5, 2006, 10:28:28 PM

Peter Nilsson wrote:

> Frederick Gotham wrote:
> > Let's say we have a German sharp S, or a Spanish N with a curly thing
> > on top of it,
>
> [Tilde.]
>
> > and that its numeric value is negative. How do we go about
> > passing their value to toupper? Should we do the following?
> >
> > toupper( (unsigned char)c );
>
> That's the clc regular's method. To me, it generally makes more
> sense to do...
>
> toupper( * (unsigned char) &c )
>
> ...when c is a plain char.

ITYM:

toupper( *(unsigned char *) &c)

OK, it's late and I might be missing something here, but aren't the
expressions

(unsigned char) c

and

*(unsigned char*) &c

semantically equivalent? Or is there a chance that they might evaluate
to a different result or produce different side effects along the way
to the result which somehow makes the second expression even more
reliable as a parameter to the to*() and is*() functions than the first
(seemingly more popular) expression? At the moment, the two seem
perfectly interchangeable to me, so I don't see much reason for
choosing the second over the first, especially since the first is
clearer.

--
Mike S

Richard Heathfield

Jul 5, 2006, 11:21:17 PM

Mike S said:

<snip>

>
> OK, it's late and I might be missing something here, but aren't the
> expressions
>
> (unsigned char) c
>
> and
>
> *(unsigned char*) &c
>
> semantically equivalent?

No.

> Or is there a chance that they might evaluate to a different result

Very much so.

int c = getchar(); /* let's say we get an 'A' from getchar(), and let's
assume we're using some completely arbitrary and whacko character set such
as, say, ASCII. */

c now has the value 65, right? (Remember, we're assuming ASCII for the sake
of this exercise.) Okay, so (unsigned char)c gets you 65, which is fine.

But let's take a closer look at this int. If ints are 16 bits, we have two
choices for in-memory representation: 0x0041, or 0x4100. If ints are 32
bits, we have rather more choices, but the two most likely are 0x00000041
and 0x41000000. If ints are 64 bits, we are probably going to have either
0x0000000000000041 or 0x4100000000000000. Other endianisms are possible,
but we don't need to go there to demonstrate that *(unsigned char *)&c is
wrong. I hope you can see the problem straight away. On any big-endian
system where sizeof(int) > 1, this code is going to produce the wrong
result. Specifically, it will normally produce 0 instead of the required
result.
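
The difference can be seen in a small sketch (the helper names are hypothetical). The value conversion always yields 65 for an int holding 65, while the pointer version yields whatever byte sits at the lowest address of the int object.

```c
/* Hypothetical helper: reads only the first byte of an int object. */
int first_byte_of(int c)
{
    return *(unsigned char *)&c;   /* byte at the lowest address */
}

/* Converts the value itself, independent of byte order. */
int value_as_uchar(int c)
{
    return (unsigned char)c;
}
```

On a typical little-endian machine first_byte_of(65) gives 65; on a typical big-endian machine with sizeof(int) > 1 it gives 0.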

So Peter's idea is fatally flawed. And yet it probably works fine for him,
because he's probably using it on a little-endian system. So it's just
sitting there waiting to bite him (or his maintainers) at porting time.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)

Peter Nilsson

Jul 5, 2006, 11:52:56 PM

Frederick Gotham wrote:
> Peter Nilsson posted:
> > The following is unlikely (due to QoI), but nonetheless allowed...
> >
> > UCHAR_MAX: 65535
>
> This suggests that an unsigned char has 16 value representation bits, and an
> unknown quantity of padding bits.
>
> > SCHAR_MAX: 127
> > SCHAR_MIN: -128
> > CHAR_MAX: 127
>
> This suggests that a signed char has 8 value representation bits (inclusive
> of the sign bit), and at least 8 padding bits,

Yes.

> in order to satisfy:
>
> assert( sizeof(signed char) == sizeof(unsigned char) );

That is always satisfied on a conforming implementation, but yes.

--
Peter

Peter Nilsson

Jul 6, 2006, 12:08:38 AM

Richard Heathfield wrote:

> Mike S said:
> > OK, it's late and I might be missing something here, but aren't the
> > expressions
> >
> > (unsigned char) c
> >
> > and
> >
> > *(unsigned char*) &c
> >
> > semantically equivalent?
>
> No.
>
> > Or is there a chance that they might evaluate to a different result
>
> Very much so.
>

> int c = getchar(); ...
<snip>
> ... Peter's idea is fatally flawed.

<sigh>

Consider...

char line[256];
size_t i;
if (fgets(line, sizeof line, stdin))
{
    for (i = 0; line[i] != 0; i++)
    {
        line[i] = toupper((unsigned char) line[i]); /* v1 */
        line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
    }
    ...
}

On an implementation satisfying...

UCHAR_MAX: 65535
SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127

...v1 can fail, v2 succeeds.

--
Peter

Richard Heathfield

Jul 6, 2006, 12:17:11 AM

Peter Nilsson said:

> Richard Heathfield wrote:
<snip>
>> ... Peter's idea is fatally flawed.
>
> <sigh>
>
> Consider...
>
> char line[256];
> size_t i;
> if (fgets(line, sizeof line, stdin))
> {
>     for (i = 0; line[i] != 0; i++)
>     {
>         line[i] = toupper((unsigned char) line[i]); /* v1 */
>         line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
>     }
>     ...
> }
>
> On an implementation satisfying...
>
> UCHAR_MAX: 65535
> SCHAR_MAX: 127
> SCHAR_MIN: -128
> CHAR_MAX: 127
>
> ...v1 can fail, v2 succeeds.

Even if such an implementation is conforming (about which I have serious
doubts, but I'm not going to press the point right now), it would be
extraordinarily rare. I have already posted code which shows how your
technique fails on big-endian systems with perfectly ordinary char ranges,
and such systems are far more common (eg IBM 370, 68000, most RISCs) than
an architecture that has 8 padding bits in every char!

Therefore, your technique is not safe for general use, and I cannot
recommend it.

Peter Nilsson

Jul 6, 2006, 12:16:09 AM

Andrew Poelstra wrote:
> On 2006-07-06, Frederick Gotham <fgot...@SPAM.com> wrote:
> > Peter Nilsson posted:
> ><slightly altered>
> >> toupper( *(unsigned char const *)&c )
> >
> > Does anyone else agree with this?
>
> It looks overly complicated to me.

In normal form, I use things like...

const unsigned char *us = (const unsigned char *) s;
for (; *us; us++) *us = toupper(*us);

If that's too complicated for some people, so be it.

> > It's safe because an unsigned char cannot have any trap
> > representations, but nonetheless, does it do what we want
> > it to do, and is it preferable over the following?

As I said, it's up to the programmer to pass the right value.
Different circumstances may well require different forms.
Where and how you source and store the character is a
factor in deciding which method you use.

> > toupper( (unsigned char)c );
>
> No; the latter is much clearer and just as functional, IMHO.

But fails for potentially conforming implementations. To many people,
that's acceptable.

--
Peter

Andrew Poelstra

Jul 6, 2006, 12:56:18 AM

On 2006-07-06, Peter Nilsson <ai...@acay.com.au> wrote:
> Andrew Poelstra wrote:
>> On 2006-07-06, Frederick Gotham <fgot...@SPAM.com> wrote:
>> > Peter Nilsson posted:
>> ><slightly altered>
>> >> toupper( *(unsigned char const *)&c )
>> >
>> > Does anyone else agree with this?
>>
>> It looks overly complicated to me.
>
> In normal form, I use things like...
>
> const unsigned char *us = (const unsigned char *) s;
> for (; *us; us++) *us = toupper(*us);
>

No matter what you think `const' means in this context, it's wrong. You
change both `us' /and/ `*us' in the second line.

> If that's too complicated for some people, so be it.
>

Most simple-minded people believe that the const keyword will create a
constant. It's true that we find it `too complicated' to violate that.

>> > It's safe because an unsigned char cannot have any trap
>> > representations, but nonetheless, does it do what we want
>> > it to do, and is it preferable over the following?
>
> As I said, it's up to the programmer to pass the right value.
> Different circumstances may well require different forms.
> Where and how you source and store the character is a
> factor in deciding which method you use.
>

The point of the cast is to work correctly, even if the programmer passes
the wrong value. Perhaps the programmer is passing input from a file
stream or something, and doesn't want to validate the string for such
a simple function. (And perhaps the string being uppercase is required
for future validations.)

>> > toupper( (unsigned char)c );
>>
>> No; the latter is much clearer and just as functional, IMHO.
>
> But fails for potentially conforming implementations. To many people,
> that's acceptable.
>

Under what circumstances will casting to unsigned char fail, and how
will it fail?

Peter Nilsson

Jul 6, 2006, 2:55:53 AM

Andrew Poelstra wrote:
> On 2006-07-06, Peter Nilsson <ai...@acay.com.au> wrote:
> > Andrew Poelstra wrote:
> >> On 2006-07-06, Frederick Gotham <fgot...@SPAM.com> wrote:
> >> > Peter Nilsson posted:
> >> ><slightly altered>
> >> >> toupper( *(unsigned char const *)&c )
> >> >
> >> > Does anyone else agree with this?
> >>
> >> It looks overly complicated to me.
> >
> > In normal form, I use things like...
> >
> > const unsigned char *us = (const unsigned char *) s;
> > for (; *us; us++) *us = toupper(*us);
>

> No matter what you think `const' means in this context, it's wrong. ...

Yup, braino. I was thinking about reading from a source and writing to
a different string. Please remove the const and reparse.

> You change both `us' /and/ `*us' in the second line.

That wasn't a typo, just saving whitespace.

> >> > toupper( (unsigned char)c );
> >>
> >> No; the latter is much clearer and just as functional, IMHO.
> >
> > But fails for potentially conforming implementations. To many people,
> > that's acceptable.
>
> Under what circumstances will casting to unsigned char fail, and how
> will it fail?

On hypothetical but conforming implementations where char is signed
and the count of integers in the range of char is smaller than the
count of integers in the range of unsigned char. Pigeon hole principles
come into play.

--
Peter
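
A minimal sketch of the counting argument, assuming nothing beyond <limits.h>; the function names are hypothetical. If plain char can hold fewer distinct values than unsigned char, then (unsigned char)c applied to a char cannot produce every unsigned char value, which is the pigeonhole point being made.

```c
#include <limits.h>

/* Width of the plain char range, CHAR_MAX - CHAR_MIN, computed in
 * unsigned long so that a negative CHAR_MIN wraps back correctly. */
unsigned long char_span(void)
{
    return (unsigned long)CHAR_MAX - (unsigned long)CHAR_MIN;
}

/* Returns 1 when plain char has as many values as unsigned char,
 * 0 when some unsigned char values have no char counterpart. */
int char_covers_uchar(void)
{
    return char_span() == (unsigned long)UCHAR_MAX;
}
```

On the hypothetical implementation with UCHAR_MAX 65535 and a char range of -128..127, char_covers_uchar() would return 0; on common 8-bit two's complement systems it should return 1.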

Peter Nilsson

Jul 6, 2006, 2:57:17 AM

Richard Heathfield wrote:
> Peter Nilsson said:
> > Richard Heathfield wrote:
> <snip>
> >> ... Peter's idea is fatally flawed.
> >
> > <sigh>
> >
> > Consider...
> >
> > char line[256];
> > size_t i;
> > if (fgets(line, sizeof line, stdin))
> > {
> > for (i = 0; line[i] != 0; i++)
> > {
> > line[i] = toupper((unsigned char) line[i]); /* v1 */
> > line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
> > }
> > ...
> > }
> >
> > On an implementation satisfying...
> >
> > UCHAR_MAX: 65535
> > SCHAR_MAX: 127
> > SCHAR_MIN: -128
> > CHAR_MAX: 127
> >
> > ...v1 can fail, v2 succeeds.
>
> Even if

[Nothing semantically wrong with the v2 version of the above code
then?]

> such an implementation is conforming (about which I have
> serious doubts, but I'm not going to press the point right now),

Since you clearly don't have serious c&v, I won't either.

--
Peter

Richard Heathfield

Jul 6, 2006, 3:28:25 AM

Peter Nilsson said:

> Richard Heathfield wrote:
>> Peter Nilsson said:
>> > Richard Heathfield wrote:
>> <snip>
>> >> ... Peter's idea is fatally flawed.
>> >
>> > <sigh>
>> >
>> > Consider...
>> >
>> > char line[256];
>> > size_t i;
>> > if (fgets(line, sizeof line, stdin))
>> > {
>> > for (i = 0; line[i] != 0; i++)
>> > {
>> > line[i] = toupper((unsigned char) line[i]); /* v1 */
>> > line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
>> > }
>> > ...
>> > }
>> >
>> > On an implementation satisfying...
>> >
>> > UCHAR_MAX: 65535
>> > SCHAR_MAX: 127
>> > SCHAR_MIN: -128
>> > CHAR_MAX: 127
>> >
>> > ...v1 can fail, v2 succeeds.
>>
>> Even if
>
> [Nothing semantically wrong with the v2 version of the above code
> then?]

I didn't look that closely, since you're only describing a theoretical
problem, which you are trying to solve by replacing it with a technique
that is flawed not just in theory but also in practice.

>
>> such an implementation is conforming (about which I have
>> serious doubts, but I'm not going to press the point right now),
>
> Since you clearly don't have serious c&v, I won't either.

I think you've completely and utterly missed my point.

I will accept for the purposes of this discussion that the implementation
you describe is conforming, and might conceivably exist. Nevertheless, you
would presumably agree that no such implementation is in widespread use. So
your "fix" doesn't actually fix anything in real life. (If you disagree,
let's hear it. Which widely-used platform has the characteristics you
describe?)

On the other hand, conforming implementations for big-endian platforms
certainly exist, and are in widespread use, and your technique breaks on
such platforms, in a manner I have described upthread.

So we have two choices: a technique that can only be shown to break on a
hypothetical platform, and a technique that can be shown to break on very
real and widely-used platforms.

If those are the only choices, then, for me at least, it's no contest.

Dik T. Winter

Jul 6, 2006, 6:08:38 AM

In article <yKidnfOaQIvDJjHZ...@bt.com> inv...@invalid.invalid writes:

Richard, I think you are missing something:

> Peter Nilsson said:
...
> >> > char line[256];
...

> >> > line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
...

> On the other hand, conforming implementations for big-endian platforms
> certainly exist, and are in widespread use, and your technique breaks on
> such platforms, in a manner I have described upthread.

Care to explain why the above would break on such a platform? The only
thing is that a pointer to char is cast to a pointer to unsigned char,
and the latter is dereferenced.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

Richard Heathfield

Jul 6, 2006, 7:04:40 AM

Dik T. Winter said:

> In article <yKidnfOaQIvDJjHZ...@bt.com>
> inv...@invalid.invalid writes:
>
> Richard, I think you are missing something:

...and I think Peter is. :-)

> > Peter Nilsson said:
> ...
> > >> > char line[256];
> ...
> > >> > line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
> ...
> > On the other hand, conforming implementations for big-endian platforms
> > certainly exist, and are in widespread use, and your technique breaks
> > on such platforms, in a manner I have described upthread.
>
> Care to explain why the above would break on such a platform?

I'm not saying it will. Peter introduced that code to illustrate how a
simple cast to unsigned char could conceivably break on a hypothetical
platform with UCHAR_MAX = 65535 and SCHAR_MAX = 127. Let us ascribe the
generic name "PeterPlatform" to such platforms, and let us give big-endian
platforms with sizeof(int) > 1 the generic name of "PracticalPlatform".

The problem I have with his suggested technique:

object = toupper(*(unsigned char *)&object);

is not in relation to the above code, but in contexts where a character
value is stored in an int, and it is not known whether the character is
representable as an unsigned char. This is far from rare. Consider, for
example, the following function:

#include <ctype.h>

int toggle_case(int ch)
{
#ifdef PETER
    if(islower(*(unsigned char *)&ch))
    {
        ch = toupper(*(unsigned char *)&ch);
    }
    else
    {
        ch = tolower(*(unsigned char *)&ch);
    }
#else
    if(islower((unsigned char)ch))
    {
        ch = toupper((unsigned char)ch);
    }
    else
    {
        ch = tolower((unsigned char)ch);
    }
#endif
    return ch;
}

If PETER is defined, the code breaks on PracticalPlatform, but works on
PeterPlatform.

If PETER is not defined, the code breaks on PeterPlatform, but works on
PracticalPlatform. This more conventional technique also works for the code
Peter wrote, except on PeterPlatform.

So we have two techniques, one of which works just about everywhere in the
real world, and one which breaks on a very important subset of the real
world, in certain reasonably common situations. Given the choice between
the two, I favour the technique that fails on fewest real world platforms.
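
To make the big-endian objection concrete, here is a rough sketch contrasting the two techniques when the character is held in an int. The helper names are made up for illustration, and the "C" locale is assumed.

```c
#include <ctype.h>

/* Value conversion: defined for any int, reduces modulo UCHAR_MAX + 1. */
int upper_by_value(int ch)
{
    return toupper((unsigned char)ch);
}

/* Reinterpretation: reads only the first byte of the int object.
 * Which byte that is depends on the platform's byte order. */
unsigned char first_byte_of(int ch)
{
    return *(unsigned char *)&ch;
}
```

With ch holding 'a', upper_by_value gives 'A' everywhere, while first_byte_of gives 'a' on a little-endian machine and 0 on a big-endian machine with sizeof(int) > 1.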

Frederick Gotham

Jul 6, 2006, 7:16:29 AM

Richard Heathfield posted:

> So we have two techniques, one of which works just about everywhere in
> the real world, and one which breaks on a very important subset of the
> real world, in certain reasonably common situations. Given the choice
> between the two, I favour the technique that fails on fewest real
> world platforms.

Just as a hypothetical:
If there were a guarantee in C that a signed char had no padding (and
thus the exact same quantity of value representation bits as an unsigned
char), then would you consider using:

toupper( *(unsigned char const *)c );

It would seem preferable to me over:

toupper( (unsigned char)c );

Richard Heathfield

Jul 6, 2006, 8:14:25 AM

Frederick Gotham said:

> Richard Heathfield posted:
>
>
>> So we have two techniques, one of which works just about everywhere in
>> the real world, and one which breaks on a very important subset of the
>> real world, in certain reasonably common situations. Given the choice
>> between the two, I favour the technique that fails on fewest real
>> world platforms.
>
>
> Just as a hypothetical:
> If there were a guarantee in C that a signed char had no padding (and
> thus the exact same quantity of value representation bits as an unsigned
> char), then would you consider using:
>
> toupper( *(unsigned char const *)c );

(Presumably you mean c to be a pointer.)

No, I wouldn't, because this is broken in exactly the same way as it was
before - i.e. it gives wrong results in some circumstances, on a very
important bunch of platforms.

> It would seem preferable to me over:
>
> toupper( (unsigned char)c );

Not to me.

Hallvard B Furuseth

Jul 6, 2006, 9:24:18 AM

Frederick Gotham writes:

> This suggests that a unsigned char has 16 value representation
> bits, and an unknown quantity of padding bits.

unsigned char has no padding bits.

--
Hallvard

Frederick Gotham

Jul 6, 2006, 9:28:05 AM

Hallvard B Furuseth posted:

> unsigned char has no padding bits.

Wups, slipped my mind.

So in the given example:

unsigned char: 16 value bits, no padding.
char: 8 value bits, 8 padding bits.

--

Frederick Gotham

Mike S

Jul 6, 2006, 9:34:36 AM

Richard Heathfield wrote:
> Mike S said:
>
> <snip>
> >
> > OK, it's late and I might be missing something here, but aren't the
> > expressions
> >
> > (unsigned char) c
> >
> > and
> >
> > *(unsigned char*) &c
> >
> > semantically equivalent?
>
> No.
>
> > Or is there a chance that they might evaluate to a different result
>
> Very much so.
>
> int c = getchar(); /* let's say we get an 'A' from getchar(), and let's
> assume we're using some completely arbitrary and whacko character set such
> as, say, ASCII. */

[...]

> On any big-endian
> system where sizeof(int) > 1, this code is going to produce the wrong
> result. Specifically, it will normally produce 0 instead of the required
> result.

Peter had mentioned in a previous post that c was a plain char, so I
assumed that in my "semantically equivalent" statement. Even if it were
an int, I probably would have forgotten to consider "other-endian"
machines anyway -- I'm a bit *too* comfortable with x86 and I doubt I
would have thought twice about it ;-)

--
Mike S

Hallvard B Furuseth

Jul 6, 2006, 9:54:58 AM

Frederick Gotham writes:
> Hallvard B Furuseth posted:

>> unsigned char has no padding bits.
>
> Wups, slipped my mind.
>
> So in the given example:
>
> unsigned char: 16 value bits, no padding.
> char: 8 value bits, 8 padding bits.

Yup.

OTOH, getting back to (unsigned char)c vs. *(unsigned char *)&c where c
is a char: These expressions produce different values if c has the sign
bit set and is represented as one's complement or sign/magnitude. Just
like with the different-width example above I have no idea if that is
possible in a conforming implementation, but I doubt it.

However if both are possible the pointer cast hack is still just
replacing one possible bug with another one. It'll give you a value,
but not necessarily the _right_ value. Or the other way around: The one
with pointers gives the right value and the other gives the wrong value.
Depends on how the character value was stored. One thing I feel certain
about is that even if I by some miracle managed to keep that straight,
some other component of the program would be getting it wrong. So I
just don't worry about it, and use both expressions interchangeably.

--
Hallvard

Richard Heathfield

Jul 6, 2006, 10:15:59 AM

Peter Nilsson said:

> Frederick Gotham wrote:
<snip>

>>
>> toupper( (unsigned char)c );
>
> That's the clc regular's method. To me, it generally makes more
> sense to do...
>
> toupper( * (unsigned char) &c )

[(unsigned char *) was intended]

>
> ...when c is a plain char.

Peter, I owe you an apology. I missed this caveat when I first read your
article. My "big-endian" objection does not apply in such a case.

<snip>

Andrew Poelstra

Jul 6, 2006, 10:51:22 AM

On 2006-07-06, Peter Nilsson <ai...@acay.com.au> wrote:
> Andrew Poelstra wrote:
>> On 2006-07-06, Peter Nilsson <ai...@acay.com.au> wrote:
>> > Andrew Poelstra wrote:
>> >> On 2006-07-06, Frederick Gotham <fgot...@SPAM.com> wrote:
>> >> > Peter Nilsson posted:
>> >> ><slightly altered>
>> >> >> toupper( *(unsigned char const *)&c )
>> >> >
>> >> > Does anyone else agree with this?
>> >>
>> >> It looks overly complicated to me.
>> >
>> > In normal form, I use things like...
>> >
>> > const unsigned char *us = (const unsigned char *) s;
>> > for (; *us; us++) *us = toupper(*us);
>>
>> No matter what you think `const' means in this context, it's wrong. ...
>
> Yup, braino. I was thinking about reading from a source and writing to
> a different string. Please remove the const and reparse.
>
>> You change both `us' /and/ `*us' in the second line.
>
> That wasn't a typo, just saving whitespace.
>

It was an error when you had the `const' in there. If you remove them,
the code works. (Although some people like to put `us' in the first
part of the for statement instead of leaving it empty).

>> >> > toupper( (unsigned char)c );
>> >>
>> >> No; the latter is much clearer and just as functional, IMHO.
>> >
>> > But fails for potentially conforming implementations. To many people,
>> > that's acceptable.
>>
>> Under what circumstances will casting to unsigned char fail, and how
>> will it fail?
>
> On hypothetical but conforming implementations where char is signed
> and the count of integers in the range of char is smaller than the
> count of integers in the range of unsigned char. Pigeon hole principles
> come into play.
>

I believe all of these are guaranteed:

sizeof (char) == sizeof (unsigned char)
char has no padding bits
char has no trap representations
Therefore all chars must have 2^CHAR_BIT values.

If you have a problem because you are on some mysterious platform
without these attributes, you'll have other problems elsewhere in the
code. That, and any code that relies on your platform will almost
certainly be nonportable.

Eric Sosman

Jul 6, 2006, 11:12:20 AM

Andrew Poelstra wrote:
> [...]

>
> I believe all of these are guaranteed:
>
> sizeof (char) == sizeof (unsigned char)

Yes, because both are guaranteed to equal 1.

> char has no padding bits
> char has no trap representations

Would you mind revealing where you find these guarantees?
If they are in the Standard, I have overlooked them.

> Therefore all chars must have 2^CHAR_BIT values.

The Standard's language about "negative zero" casts some
doubt on this. If there are two different forms of the value
zero, there must be strictly fewer than 2^CHAR_BIT possible
values -- even without padding bits.

> In the case that you have a problem because on some mysterious platform
> without these attributes, you'll have other problems elsewhere in the
> code. That, and any code that relies on your platform will be almost
> certainly nonportable.

It seems to me that this is a backwards definition of
"portability." The point isn't about relying on peculiarities
of exotic platforms, but about writing code that works whether
those peculiarities are present or not. A program that works
correctly with all conforming representations of char is more
portable, not less, than a program that insists on trapless
eight-bit two's complement.

--
Eric Sosman
eso...@acm-dot-org.invalid

Andrew Poelstra

Jul 6, 2006, 3:37:51 PM

On 2006-07-06, Eric Sosman <eso...@acm-dot-org.invalid> wrote:
> Andrew Poelstra wrote:
>> [...]
>>
>> I believe all of these are guaranteed:

<snip>

>> char has no padding bits
>> char has no trap representations
>
> Would you mind revealing where you find these guarantees?
> If they are in the Standard, I have overlooked them.
>

The first has been mentioned in this group many times (although it
may pertain only to unsigned char), and the second seemed to me a
logical extension.

>> Therefore all chars must have 2^CHAR_BIT values.
>
> The Standard's language about "negative zero" casts some
> doubt on this. If there are two different forms of the value
> zero, there must be strictly fewer than 2^CHAR_BIT possible
> values -- even without padding bits.
>

I consider 0 and -0 separate values for the purposes of my post.

>> In the case that you have a problem because on some mysterious platform
>> without these attributes, you'll have other problems elsewhere in the
>> code. That, and any code that relies on your platform will be almost
>> certainly nonportable.
>
> It seems to me that this is a backwards definition of
> "portability." The point isn't about relying on peculiarities
> of exotic platforms, but about writing code that works whether
> those peculiarities are present or not. A program that works
> correctly with all conforming representations of char is more
> portable, not less, than a program that insists on trapless
> eight-bit two's complement.
>

All I insisted on was trapless. Please don't misinterpret me.

--
Andrew Poelstra <http://www.wpsoftware.net/projects/>

To email me, use "apoelstra" at the above domain.

Eric Sosman

Jul 6, 2006, 3:52:47 PM

Andrew Poelstra wrote:
> On 2006-07-06, Eric Sosman <eso...@acm-dot-org.invalid> wrote:
>
>>Andrew Poelstra wrote:
>>
>>>[...]
>>>
>>>I believe all of these are guaranteed:
>
> <snip>
>
>>>char has no padding bits
>>>char has no trap representations
>>
>> Would you mind revealing where you find these guarantees?
>>If they are in the Standard, I have overlooked them.
>
> The first has been mentioned in this group many times (although it
> may pertain only to unsigned char), and the second seemed to me a
> logical extension.

There are special guarantees for unsigned char, so that
it is possible to treat the representation of any object as
an array of unsigned char. This would not work if unsigned
char had trap representations or contained indeterminately-
valued padding bits.

However, I am unaware of any similar guarantees for char,
either signed or plain. On an implementation where plain char
is unsigned one can deduce that it has no padding bits or traps
(argument: On such an implementation, plain char can represent
all the values unsigned char can, and since the latter "fills
the code space" the former must, too). But the argument doesn't
hold for signed char, or for plain char on an implementation
where CHAR_MIN<0.

--
Eric Sosman
eso...@acm-dot-org.invalid

Keith Thompson

Jul 6, 2006, 4:18:48 PM

Andrew Poelstra <apoe...@wpsoftware.net> writes:
[...]

> I believe all of these are guaranteed:
>
> sizeof (char) == sizeof (unsigned char)
> char has no padding bits
> char has no trap representations
> Therefore all chars must have 2^CHAR_BIT values.

I believe the last three are guaranteed only for unsigned char, not
for plain or signed char.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Keith Thompson

Jul 6, 2006, 4:22:15 PM

Andrew Poelstra <apoe...@wpsoftware.net> writes:
> On 2006-07-06, Eric Sosman <eso...@acm-dot-org.invalid> wrote:
>> Andrew Poelstra wrote:
>>> [...]
>>>
>>> I believe all of these are guaranteed:
> <snip>
>
>>> char has no padding bits
>>> char has no trap representations
>>
>> Would you mind revealing where you find these guarantees?
>> If they are in the Standard, I have overlooked them.
>
> The first has been mentioned in this group many times (although it
> may pertain only to unsigned char), and the second seemed to me a
> logical extension.
>
>>> Therefore all chars must have 2^CHAR_BIT values.
>>
>> The Standard's language about "negative zero" casts some
>> doubt on this. If there are two different forms of the value
>> zero, there must be strictly fewer than 2^CHAR_BIT possible
>> values -- even without padding bits.
>
> I consider 0 and -0 separate values for the purposes of my post.

But they're not separate values in any reasonable sense. In
particular (0 == -0) is guaranteed to be true. They may be different
*representations* of the same value.

[...]

> All I insisted on was trapless. Please don't misinterpret me.

Ok, but I don't see a guarantee in the standard that signed or plain
char has no trap representations.

If you want a byte-sized type with no padding bits or trap
representations, use unsigned char; that's what it's for.
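
A small sketch of what "that's what it's for" can look like in practice: viewing any object's representation as an array of unsigned char. The helper names are hypothetical; only memcpy from <string.h> is assumed.

```c
#include <string.h>

/* Copy the object representation of an int into a byte buffer.
 * out must have room for sizeof(int) bytes. */
void int_to_bytes(int value, unsigned char *out)
{
    memcpy(out, &value, sizeof value);
}

/* Rebuild the int from the bytes produced by int_to_bytes. */
int bytes_to_int(const unsigned char *in)
{
    int value;
    memcpy(&value, in, sizeof value);
    return value;
}
```

Because unsigned char has no padding bits and no trap representations, every byte of the int survives the trip unchanged, whatever the byte order or sign representation.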

Flash Gordon

Jul 6, 2006, 4:46:09 PM

Eric Sosman wrote:
> Andrew Poelstra wrote:
>> On 2006-07-06, Eric Sosman <eso...@acm-dot-org.invalid> wrote:
>>
>>> Andrew Poelstra wrote:
>>>
>>>> [...]
>>>>
>>>> I believe all of these are guaranteed:
>>
>> <snip>
>>
>>>> char has no padding bits
>>>> char has no trap representations
>>>
>>> Would you mind revealing where you find these guarantees?
>>> If they are in the Standard, I have overlooked them.
>>
>> The first has been mentioned in this group many times (although it
>> may pertain only to unsigned char), and the second seemed to me a
>> logical extension.
>
> There are special guarantees for unsigned char, so that
> it is possible to treat the representation of any object as
> an array of unsigned char. This would not work if unsigned
> char had trap representation or contained indeterminately-
> valued padding bits.

This is covered in 6.2.6.2 para 1 of N1124 which describes padding bits
and explicitly states that unsigned char cannot have them.

With the range requirements for signed char, this means that signed char
can only have padding bits if CHAR_BIT is greater than 8, and char can
only have padding bits if it is signed and CHAR_BIT is greater than 8.

> However, I am unaware of any similar guarantees for char,
> either signed or plain. On an implementation where plain char
> is unsigned one can deduce that it has no padding bits or traps
> (argument: On such an implementation, plain char can represent
> all the values unsigned char can, and since the latter "fills
> the code space" the former must, too). But the argument doesn't
> hold for signed char, or for plain char on an implementation
> where CHAR_MIN<0.

Specifically, CHAR_MIN is allowed to be -127 on a 2s-complement system
with -128 being a trap. In addition, -0 is allowed to be a trap on
1s-complement and sign-magnitude implementations. Specifically, section
6.2.6.2 para 2 of N1124 describes this for all signed integer types with
no exception mentioned for char or signed char.

So you can have a trap representation for char even on a system with
CHAR_BIT==8 although I am not aware of any such system.

Since fgetc "obtains that character as an unsigned char converted to
int" it is obviously possible for it to read a representation that, for
char, could be a trap. Since fgets and friends are defined in terms
of fgetc (section 7.19.3 para 11), the representation they store must
IMHO be that of the unsigned char, especially as there is one bit
pattern that could be a trap for signed char.

So, going back to the original question, which has fallen off this
quote: if you have some form of byte array that has been read from a
file by fgetc, then I believe the technically correct method would be to
use an unsigned char pointer to read the values. With a char pointer
you could read a trap representation, and in any case, on 1s-complement
or sign-magnitude, reading with a char pointer and then casting to
unsigned char would change the bit pattern, which would IMHO be wrong.

If, on the other hand, you are passing a string literal a byte at a time
to isupper, toupper etc, then using a char pointer and casting to
unsigned char would IMHO be the correct thing.

All in all, I think it is a bit of a mess if char is signed when it
comes to the library functions. However, the standard committee probably
inherited a mess from the existing practice.
--
Flash Gordon, living in interesting times.
Web site - http://home.flash-gordon.me.uk/
comp.lang.c posting guidelines and intro:
http://clc-wiki.net/wiki/Intro_to_clc
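
The byte-array case might be sketched like this, assuming a modifiable NUL-terminated buffer (for example one filled by fgets) and the "C" locale. The name upcase_in_place is made up for illustration.

```c
#include <ctype.h>

/* Uppercase a buffer in place, reading and writing each byte through
 * an unsigned char pointer so no plain char value is ever formed. */
void upcase_in_place(char *s)
{
    unsigned char *us = (unsigned char *)s;

    for (; *us != 0; us++) {
        /* *us is already in the range toupper requires. */
        *us = (unsigned char)toupper(*us);
    }
}
```

For a string literal or other data that was stored as char values, reading through a char pointer and converting with (unsigned char) would be the matching form instead.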

Peter Nilsson

Jul 6, 2006, 8:35:35 PM

Richard Heathfield wrote:
> Dik T. Winter said:
> > In article <yKidnfOaQIvDJjHZ...@bt.com>
> > inv...@invalid.invalid writes:
> >
> > Richard, I think you are missing something:
>
> ...and I think Peter is. :-)

I originally wrote:

> It's up to the programmer to supply the correct character code
> value.
> ...To me, it generally makes more sense to do...

>
> toupper( * (unsigned char) &c )

[Later corrected to: toupper( * (unsigned char *) &c ) ]

>
> ...when c is a plain char.

^^^^^^^^^^^^^^^^^^^^^^^

I have _never_ said the technique should be applied to an int.

> > > Peter Nilsson said:
> > ...
> > > >> > char line[256];
> > ...
> > > >> > line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
> > ...
> > > On the other hand, conforming implementations for big-endian platforms
> > > certainly exist, and are in widespread use, and your technique breaks
> > > on such platforms, in a manner I have described upthread.
> >
> > Care to explain why the above would break on such a platform?
>
> I'm not saying it will.

Then please stop calling it a...:

"technique that is flawed not just in theory but also in practice"

"technique that can be shown to break on very real and widely-
used platforms."

Especially when you have yet to demonstrate that the above code fails
on _any_ C implementation, let alone real world ones.

--
Peter

Old Wolf

Jul 6, 2006, 11:23:58 PM

Frederick Gotham wrote:
> toupper( *(unsigned char const *)&c )
>
> Does anyone else agree with this?

No. If we take an extension of your hypothetical system:
char: 1 sign bit, then 8 padding bits, then 7 value bits
uchar: 16 value bits
Padding bits must be 0 if sign bit is 0; otherwise, can be anything.

This satisfies the C standard (AFAIK) because the
representation of a non-negative plain char has the same
representation as the unsigned char of the same value.

But for negative-valued chars, the pointer cast version
returns different results depending on what the padding
bits are, which is stupid.

The only reason you would use the above expression, is
if you knew the char had been created by stuffing the
representation for a plain char into the unsigned char.

This is not the case for the result of getchar(), which is
a conversion of the value of the plain char.

Another example, on a system with 8-bit chars and
sign-magnitude:

If the byte in question has bit pattern 10000010
then your method ends up calling toupper(130).

But the proper method calls toupper(254).

The character with bit pattern 10000010 would cause
getchar() to return 254.

So it comes down to: does the char contain a value
that came from getchar, or does it contain the
representation of an unsigned char?

Again, I think the latter shows poor design, as the
representation of an unsigned char could correspond to
a trap for signed char. In particular, what is the result
of toupper(128) ?
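
The sign-magnitude arithmetic can be checked with a small simulation. This is only a sketch: it models an 8-bit sign-magnitude char with ordinary integer arithmetic rather than running on such a machine, and the function names are hypothetical.

```c
/* Decode an 8-bit sign-magnitude pattern:
 * bit 7 is the sign, bits 0..6 are the magnitude. */
int sm8_decode(unsigned bits)
{
    int magnitude = (int)(bits & 0x7Fu);
    return (bits & 0x80u) ? -magnitude : magnitude;
}

/* Model conversion to an 8-bit unsigned char: reduce modulo 256. */
int to_uchar8(int value)
{
    return ((value % 256) + 256) % 256;
}
```

Pattern 10000010 (0x82) decodes to -2; converting that value gives 254, while reading the same bits as an unsigned char gives 130.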

Richard Heathfield

Jul 6, 2006, 11:38:37 PM

Peter Nilsson said:

<snip>

> I originally wrote:
>
> > It's up to the programmer to supply the correct character code
> > value.
> > ...To me, it generally makes more sense to do...
> >
> > toupper( * (unsigned char) &c )
> [Later corrected to: toupper( * (unsigned char *) &c ) ]
> >
> > ...when c is a plain char.
> ^^^^^^^^^^^^^^^^^^^^^^^

Yes, you did, and I missed that on my first reading. Hence the confusion. I
have already apologised for this error elsethread, but in case you missed
it I am happy to do so again.

<snip>

Keith Thompson

Jul 7, 2006, 1:52:51 AM

Richard Heathfield <inv...@invalid.invalid> writes:
> Peter Nilsson said:
> <snip>
>
>> I originally wrote:
>>
>> > It's up to the programmer to supply the correct character code
>> > value.
>> > ...To me, it generally makes more sense to do...
>> >
>> > toupper( * (unsigned char) &c )
>> [Later corrected to: toupper( * (unsigned char *) &c ) ]
>> >
>> > ...when c is a plain char.
>> ^^^^^^^^^^^^^^^^^^^^^^^
>
> Yes, you did, and I missed that on my first reading. Hence the confusion. I
> have already apologised for this error elsethread, but in case you missed
> it I am happy to do so again.
>
> <snip>

Nevertheless, using the proposed expression

toupper( * (unsigned char *) &c )

when c is of type int would be an easy mistake to make (and likely to
be missed on a machine of whichever endianness it is that would hide
the error).

Peter Nilsson

Jul 7, 2006, 10:44:11 PM

> Richard Heathfield <inv...@invalid.invalid> writes:
> > Peter Nilsson said:
> >> I originally wrote:
> >>
> >> > It's up to the programmer to supply the correct character code
> >> > value.
> >> > ...To me, it generally makes more sense to do...
> >> >
> >> > toupper( * (unsigned char) &c )
> >> [Later corrected to: toupper( * (unsigned char *) &c ) ]
> >> >
> >> > ...when c is a plain char.
> >> ^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Yes, you did, and I missed that on my first reading. Hence the confusion. I
> > have already apologised for this error elsethread, but in case you missed
> > it I am happy to do so again.

I'm grateful for the apology and I'm sorry for having hounded you on
the issue. :-)

Keith Thompson wrote:
> Nevertheless, using the proposed expression
> toupper( * (unsigned char *) &c )
> when c is of type int would be an easy mistake to make

You say 'would be', but AFAIK, the number of people actively using the
method I posted is 1. I can tell you that I honestly can't recall ever
making that mistake. ;-)

I can't recall ever writing... int line[256]; ...instead of... char
line[256]; and with... int c = getchar(); ...I don't cast c in either
form since there's no need to do so.

Note that with an int c, I'm also careful to store the character as an
unsigned char byte, rather than using simple assignment to plain
char.

The fact that it's 'not the done thing', doesn't make it wrong. :-)

--
Peter

pete

Jul 8, 2006, 5:18:18 PM

Mike S wrote:

> OK, it's late and I might be missing something here, but aren't the
> expressions
>
> (unsigned char) c
>
> and
>
> *(unsigned char*) &c
>
> semantically equivalent?

No.
For starters, (*(unsigned char*) &c) is an lvalue.
*(unsigned char*) &c = 0;
is valid C code.
(unsigned char) c = 0;
isn't valid C code.

> Or is there a chance that they might evaluate
> to a different result

Yes, in so many ways.
If the value of c is equal to -1,
then ((unsigned char) c) is equal to UCHAR_MAX,
regardless of the sizeof c,
or whether negative integers
are represented as two's complement,
one's complement or signed magnitude.

If the value of c is -1 and (sizeof c) equals 1, then
*(unsigned char*) &c
could equal either
(UCHAR_MAX) or (UCHAR_MAX - 1) or (UCHAR_MAX / 2 + 2)

If (sizeof c) doesn't equal 1,
then the allowed possible values of (*(unsigned char*) &c) are many.

--
pete
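
A short sketch of the difference for a value of -1, with hypothetical helper names. The value conversion is fixed by the language; the result of the reinterpretation depends on how negative numbers are represented.

```c
#include <limits.h>

/* Value conversion: -1 always becomes UCHAR_MAX, whatever the
 * source type's size or sign representation. */
unsigned char by_value_schar(signed char c) { return (unsigned char)c; }
unsigned char by_value_int(int c)           { return (unsigned char)c; }

/* Reinterpretation of a one-byte object: the result depends on the
 * bit pattern the implementation uses for negative values. */
unsigned char by_representation(signed char c)
{
    return *(unsigned char *)&c;
}
```

On a two's complement machine by_representation(-1) equals UCHAR_MAX too; one's complement would give UCHAR_MAX - 1 and sign-magnitude UCHAR_MAX / 2 + 2.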

pete

Jul 8, 2006, 9:40:04 PM

Peter Nilsson wrote:

> > Let's say we have a German sharp S,
> > or a Spanish N with a curly thing on top of it,
>
> [Tilde.]
>
> > and that its numeric value is negative. How do we go about
> > passing their value to toupper? Should we do the following?
> >
> > toupper( (unsigned char)c );
>
> That's the clc regular's method. To me, it generally makes more

> sense to do...
>
> toupper( * (unsigned char) &c )
>

> ...when c is a plain char.

fputc(c, stream) can only return either EOF or ((int)(unsigned char)c).

That's why the cast to (unsigned char) is appropriate
for the ctype functions.

--
pete

Peter Nilsson

Jul 12, 2006, 7:21:52 PM

Old Wolf wrote:
> ...

> This is not the case for the result of getchar(), which is
> a conversion of the value of the plain char.

Huh? The getchar() function returns either EOF or a value in the
range of unsigned char. The unsigned char being the byte value
read from input.

--
Peter