Manipulation of strings: upper/lower case


Pierre

Jan 15, 2005, 10:32:21 AM
Hello!

I've been looking for a portable means of changing the case of a
string but I've found nothing so far. Does it exist? I guess (and
hope) it does.

Thanks
Pierre

BMarsh

Jan 15, 2005, 10:48:38 AM
Hi there,

#include <ctype.h>
int toupper(int c);

cheers

B.

Lew Pitcher

Jan 15, 2005, 10:57:49 AM
Pierre wrote:
> Hello!
>
> I've been looking for a portable means of changing the case of a
> string but i've found nothing so far.

Such a function can easily be built from the existing standard C functions.

> Does it exists? I guess (and hope) it does..

If you can't find one, try this...

#include <ctype.h>

void UppercaseString(char *string)
{
    for(;*string;++string)
        if (islower(*string)) *string = toupper(*string);
}

void LowercaseString(char *string)
{
    for(;*string;++string)
        if (isupper(*string)) *string = tolower(*string);
}

--
Lew Pitcher

Master Codewright and JOAT-in-training
Registered Linux User #112576 (http://counter.li.org/)
Slackware - Because I know what I'm doing.

infobahn

Jan 15, 2005, 12:16:02 PM
Lew Pitcher wrote:
>

<snip>

> #include <ctype.h>
>
> void UppercaseString(char *string)
> {
> for(;*string;++string)
> if (islower(*string)) *string = toupper(*string);
> }

Caution is necessary here. The behaviours of islower and toupper
are undefined if they are passed a value that is neither EOF nor
representable as an unsigned char. It is good practice, therefore,
to cast *string to unsigned char. (No need to cast it back to
int afterwards, since the normal promotion rules handle that.)

The islower() call smacks of premature optimisation. :-)
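
A minimal sketch of the two points above: the character is cast to unsigned char before it reaches toupper(), and the islower() test is dropped. The name upper_in_place is made up for this illustration, and what happens to bytes outside the basic character set depends on the current locale.

```c
#include <ctype.h>

/* Uppercase a NUL-terminated string in place and return it.
   The cast keeps the argument to toupper() within 0..UCHAR_MAX,
   even on implementations where plain char is signed. */
char *upper_in_place(char *s)
{
    char *p;

    for (p = s; *p != '\0'; ++p)
        *p = (char)toupper((unsigned char)*p);
    return s;
}
```

Called on a writable array holding "Hello, World 42", this leaves "HELLO, WORLD 42" in the default "C" locale.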

<snip>

CBFalconer

Jan 15, 2005, 2:00:59 PM
Pierre wrote:
>
> I've been looking for a portable means of changing the case of a
> string but i've found nothing so far. Does it exists? I guess (and
> hope) it does..

Unusual to want to simply change the case, but try something like:

#include <ctype.h>

void flipcase(char *s)
{
unsigned char ch;

if (s) /* assuming you want to protect against NULL */
while (ch = *s) {
if (isupper(ch) *s = tolower(ch);
else if (islower(ch) *s = toupper(ch);
s++;
}
} /* flipcase, untested */

which allows for the fact that some chars do not have an upper or
lower case to be flipped.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson


Joe Wright

Jan 15, 2005, 3:05:28 PM

The islower() call is unnecessary.

char *upper(char *st) {
    char *s = st;
    while ((*s = toupper(*s))) ++s;
    return st;
}

There is no need to cast the argument to toupper() to unsigned char.
We assume that st points to a valid string. All characters of such a
string are within the range 0..CHAR_MAX by definition. CHAR_MAX is
within UCHAR_MAX by definition.

If st points to something not a valid string, and toupper() is
presented with something out of range, (-20 for example) it may
SEGFAULT. And why not? It might tell you where your error is.

--
Joe Wright mailto:joeww...@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Mathew Hendry

Jan 15, 2005, 3:37:25 PM
On Sat, 15 Jan 2005 15:05:28 -0500, Joe Wright <joeww...@comcast.net>
wrote:

>infobahn wrote:
>
>>Lew Pitcher wrote:
>>
>>>#include <ctype.h>
>>>
>>>void UppercaseString(char *string)
>>>{
>>> for(;*string;++string)
>>> if (islower(*string)) *string = toupper(*string);
>>>}

>>...


>> The islower() call smacks of premature optimisation. :-)
>>
>> <snip>
>
>The islower() call is unnecessary.
>
>char *upper(char *st) {
> char *s = st;
> while ((*s = toupper(*s))) ++s;
> return st;
>}
>
>There is no need to cast the argument to toupper() to unsigned char.
>We assume that st points to a valid string. All characters of such a
>string are within the range 0..CHAR_MAX by definition.

Only if char happens to be unsigned, surely?

-- Mat.

Chris Torek

Jan 15, 2005, 3:49:28 PM
In article <aZadnbNe4MT...@comcast.com>

Joe Wright <joeww...@comcast.net> wrote:
>The islower() call is unnecessary.

Indeed.

>char *upper(char *st) {
> char *s = st;
> while ((*s = toupper(*s))) ++s;
> return st;
>}
>
>There is no need to cast the argument to toupper() to unsigned char.
>We assume that st points to a valid string.

And someone whose name is "Pól" has a name that is an "invalid
string"? :-)

>All characters of such a string are within the range 0..CHAR_MAX
>by definition. CHAR_MAX is within UCHAR_MAX by definition.

If you use ISO-Latin-1, and have signed characters -- and both of
these are quite commonly true today -- you *will* have characters
whose value is outside the [0..CHAR_MAX] range. For instance, the
o-with-accent-acute above is 0xf3 or -13.

>If st points to something not a valid string, and toupper() is
>presented with something out of range, (-20 for example) it may
>SEGFAULT. And why not? It might tell you where your error is.

Or it may change the guy's name from Pól (the Celtic form of
the name "Paul") to PzL, which might just annoy him. If he happens
to have a large sword, this could be a bad strategy. :-)
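
To make the 0xf3 / -13 example concrete, here is a small sketch, assuming 8-bit bytes. The helper name ctype_arg is hypothetical. It shows the value that is safe to hand to the <ctype.h> functions for a given char.

```c
/* Value that is safe to pass to the <ctype.h> functions for the
   character c: always in the range 0..UCHAR_MAX, whether plain
   char is signed or unsigned. */
int ctype_arg(char c)
{
    return (unsigned char)c;
}
```

With CHAR_BIT equal to 8, a char holding -13 comes out as 243 (0xf3), while ordinary letters such as 'a' are unchanged.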
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Eric Sosman

Jan 15, 2005, 4:09:47 PM
Joe Wright wrote:
> [...]

>> Lew Pitcher wrote:
>>
>> Caution is necessary here. The behaviours of islower and toupper
>> are undefined if they are passed a value that is neither EOF nor
>> representable as an unsigned char. It is good practice, therefore,
>> to cast *string to unsigned char. (No need to cast it back to
>> int afterwards, since the normal promotion rules handle that.)
>> [...]

>
> There is no need to cast the argument to toupper() to unsigned char.

Didn't we just do this a week or so ago? Perhaps it's
a candidate for the FAQ; it seems at any rate to be FA.

> We
> assume that st points to a valid string. All characters of such a string
> are within the range 0..CHAR_MAX by definition.

No, they are in the range CHAR_MIN through CHAR_MAX.
Since `char' may be a signed type (it's the implementation's
choice), CHAR_MIN can be negative. It's true that all the
characters mandated by the Standard are required to be non-
negative, but the Standard allows the implementation to define
additional characters, too -- and some of these may have
negative codes.

> CHAR_MAX is within
> UCHAR_MAX by definition.

True, but CHAR_MIN can be negative, hence outside the
range of `unsigned char'.

> If st points to something not a valid string, and toupper() is presented
> with something out of range, (-20 for example) it may SEGFAULT. And why
> not? It might tell you where your error is.

Except that the "error" isn't the presence of a -20 in
the string (in one widely-used scheme, -20 is "Latin small
i with grave accent"). The real error is the failure to
use the cast that Lew recommends.
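
The ranges under discussion can be inspected directly with <limits.h>. This is only an illustrative sketch; the function names are invented for the example.

```c
#include <limits.h>
#include <stdio.h>

/* Non-zero if plain char can hold negative values on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Print the ranges of plain char and unsigned char. */
void print_char_ranges(void)
{
    printf("char:          %d .. %d\n", CHAR_MIN, CHAR_MAX);
    printf("unsigned char: 0 .. %u\n", (unsigned)UCHAR_MAX);
}
```

Where plain_char_is_signed() returns non-zero, any char below zero lies outside the range that toupper() and friends accept.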

--
Eric Sosman
eso...@acm-dot-org.invalid

Jack Klein

Jan 15, 2005, 7:13:11 PM
On Sat, 15 Jan 2005 19:00:59 GMT, CBFalconer <cbfal...@yahoo.com>
wrote in comp.lang.c:

> Pierre wrote:
> >
> > I've been looking for a portable means of changing the case of a
> > string but i've found nothing so far. Does it exists? I guess (and
> > hope) it does..
>
> Unusual to want to simply change the case, but try something like:
>
> #include <ctype.h>
>
> void flipcase(char *s)
> {
> unsigned char ch;
>
> if (s) /* assuming you want to protect against NULL */
> while (ch = *s) {
> if (isupper(ch) *s = tolower(ch);

Completely unnecessary conditional test.

> else if (islower(ch) *s = toupper(ch);

Completely unnecessary conditional test.

> s++;
> }
> } /* flipcase, untested */
>
> which allows for the fact that some chars do not have an upper or
> lower case to be flipped.

(sigh)

7.4.2.1 The tolower function
Synopsis
1 #include <ctype.h>
int tolower(int c);
Description
2 The tolower function converts an uppercase letter to a corresponding
lowercase letter.
Returns
3 If the argument is a character for which isupper is true and there
are one or more corresponding characters, as specified by the current
locale, for which islower is true, the tolower function returns one of
the corresponding characters (always the same one for any given
locale); otherwise, the argument is returned unchanged.

7.4.2.2 The toupper function
Synopsis
1 #include <ctype.h>
int toupper(int c);
Description
2 The toupper function converts a lowercase letter to a corresponding
uppercase letter.
Returns
3 If the argument is a character for which islower is true and there
are one or more corresponding characters, as specified by the current
locale, for which isupper is true, the toupper function returns one of
the corresponding characters (always the same one for any given
locale); otherwise, the argument is returned unchanged.

So the tests are totally unnecessary.
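
For one-directional conversion, the quoted wording can be checked with a short sketch. The helper name below is made up, and the results assume the default "C" locale: toupper() changes only lowercase letters and hands everything else back unchanged.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the characters of s that toupper() would actually change. */
size_t count_changed_by_toupper(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        unsigned char ch = (unsigned char)*s;
        if (toupper(ch) != ch)
            ++n;
    }
    return n;
}
```

In the "C" locale this gives 3 for "abc" and 0 for "ABC 123 !?".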

But suppose:

char test [] = "Hello" "\xf0" "World";

...then your function causes undefined behavior on an implementation
with CHAR_BIT 8 and signed char, because you will pass an invalid
value to tolower() or toupper().

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html

Joe Wright

Jan 15, 2005, 8:21:08 PM
Eric Sosman wrote:
> Joe Wright wrote:
>
>> [...]
>>
>>> Lew Pitcher wrote:
>>>
>>> Caution is necessary here. The behaviours of islower and toupper
>>> are undefined if they are passed a value that is neither EOF nor
>>> representable as an unsigned char. It is good practice, therefore,
>>> to cast *string to unsigned char. (No need to cast it back to
>>> int afterwards, since the normal promotion rules handle that.)
>>> [...]
>>
>>
>> There is no need to cast the argument to toupper() to unsigned char.
>
>
> Didn't we just do this a week or so ago? Perhaps it's
> a candidate for the FAQ; it seems at any rate to be FA.
>
Yes we did. It remains to be seen whether I can learn enough from
one beating to avoid the next one. :)

>> We assume that st points to a valid string. All characters of such a
>> string are within the range 0..CHAR_MAX by definition.
>
>
> No, they are in the range CHAR_MIN through CHAR_MAX.
> Since `char' may be a signed type (it's the implementation's
> choice), CHAR_MIN can be negative. It's true that all the
> characters mandated by the Standard are required to be non-
> negative, but the Standard allows the implementation to define
> additional characters, too -- and some of these may have
> negative codes.
>

Yes, and I truly missed that until just now. Thank you.

>> CHAR_MAX is within UCHAR_MAX by definition.
>
>
> True, but CHAR_MIN can be negative, hence outside the
> range of `unsigned char'.
>

Yes, but I never mentioned CHAR_MIN.

>> If st points to something not a valid string, and toupper() is
>> presented with something out of range, (-20 for example) it may
>> SEGFAULT. And why not? It might tell you where your error is.
>
>
> Except that the "error" isn't the presence of a -20 in
> the string (in one widely-used scheme, -20 is "Latin small
> i with grave accent"). The real error is the failure to
> use the cast that Lew recommends.
>

It didn't occur to me that the value of é (130) was negative as a
signed char (10000010) and when promoted to int would be -126.

I apologize to you and the group for my noise. I'll get it right
next time, I promise. :=)

Mysidia

Jan 15, 2005, 8:25:09 PM
> char test [] = "Hello" "\xf0" "World";
>
> ...then your function causes undefined behavior on an implementation
> with CHAR_BIT 8 and signed char, because you will pass an invalid
> value to tolower() or toupper().


But checking islower() or isupper() does not protect from this,
because islower() and isupper() have the same fundamental requirement.

From ISO/IEC 9899:1999 (E):
"The header <ctype.h> declares several functions useful for classifying
and mapping characters. In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the argument has any other value,
the behavior is undefined."
isupper(0xf0) is just as undefined as toupper(0xf0) is.

Joe Wright

Jan 15, 2005, 8:42:40 PM
Chris Torek wrote:
> In article <aZadnbNe4MT...@comcast.com>
> Joe Wright <joeww...@comcast.net> wrote:
>
>>The islower() call is unnecessary.
>
>
> Indeed.
>
>
>>char *upper(char *st) {
>> char *s = st;
>> while ((*s = toupper(*s))) ++s;
>> return st;
>>}
>>
>>There is no need to cast the argument to toupper() to unsigned char.
>>We assume that st points to a valid string.
>
>
> And someone whose name is "Pól" has a name that is an "invalid
> string"? :-)
>
>
>>All characters of such a string are within the range 0..CHAR_MAX
>>by definition. CHAR_MAX is within UCHAR_MAX by definition.
>
>
> If you use ISO-Latin-1, and have signed characters -- and both of
> these are quite commonly true today -- you *will* have characters
> whose value is outside the [0..CHAR_MAX] range. For instance, the
> o-with-accent-acute above is 0xf3 or -13.
>
It looks something like ó (162) at my house. 10100010 is -94 but
your point is taken. I didn't consider negative char as valid.

>
>>If st points to something not a valid string, and toupper() is
>>presented with something out of range, (-20 for example) it may
>>SEGFAULT. And why not? It might tell you where your error is.
>
>
> Or it may change the guy's name from Pól (the Celtic form of
> the name "Paul") to PzL, which might just annoy him. If he happens
> to have a large sword, this could be a bad strategy. :-)

I'll try to stay away from that sword. I'm sorry to have muddied the
water. I'll get it wright next time, I promise. :-)

S.Tobias

Jan 15, 2005, 8:46:22 PM
Jack Klein <jack...@spamcop.net> wrote:
> On Sat, 15 Jan 2005 19:00:59 GMT, CBFalconer <cbfal...@yahoo.com>
> wrote in comp.lang.c:

> > #include <ctype.h>


> >
> > void flipcase(char *s)
> > {
> > unsigned char ch;
> >
> > if (s) /* assuming you want to protect against NULL */
> > while (ch = *s) {
> > if (isupper(ch) *s = tolower(ch);

> Completely unnecessary conditional test.

> > else if (islower(ch) *s = toupper(ch);

> Completely unnecessary conditional test.

Why completely unnecessary? This is a case *toggling* function, so at
least one test must remain (note "else").

> > s++;
> > }
> > } /* flipcase, untested */
> >

--
Stan Tobias
mailx `echo si...@FamOuS.BedBuG.pAlS.INVALID | sed s/[[:upper:]]//g`

infobahn

Jan 16, 2005, 12:02:01 AM
Eric Sosman wrote:
>
> Except that the "error" isn't the presence of a -20 in
> the string (in one widely-used scheme, -20 is "Latin small
> i with grave accent"). The real error is the failure to
> use the cast that Lew recommends.

Ahem. That /Lew/ recommends? Am I invisible all of a sudden?

CBFalconer

Jan 16, 2005, 12:16:37 AM
Jack Klein wrote:
> CBFalconer <cbfal...@yahoo.com>

>> Pierre wrote:
>>>
>>> I've been looking for a portable means of changing the case of a
>>> string but i've found nothing so far. Does it exists? I guess (and
>>> hope) it does..
>>
>> Unusual to want to simply change the case, but try something like:
>>
>> #include <ctype.h>
>>
>> void flipcase(char *s)
>> {
>> unsigned char ch;
>>
>> if (s) /* assuming you want to protect against NULL */
>> while (ch = *s) {
>> if (isupper(ch) *s = tolower(ch);
>
> Completely unnecessary conditional test.
>
>> else if (islower(ch) *s = toupper(ch);
>
> Completely unnecessary conditional test.
>
>> s++;
>> }
>> } /* flipcase, untested */
>>
>> which allows for the fact that some chars do not have an upper or
>> lower case to be flipped.
>
... snip ...

>
> So the tests are totally unnecessary.
>
> But suppose:
>
> char test [] = "Hello" "\xf0" "World";
>
> ...then your function causes undefined behavior on an implementation
> with CHAR_BIT 8 and signed char, because you will pass an invalid
> value to tolower() or toupper().

If you examine my function you will find that isupper/lower and
toupper/lower are always operating on an unsigned char. The tests
are necessary, to decide whether to upshift or downshift, although
the second can probably be eliminated. However that would leave
the action somewhat unclear, as it is no longer obvious that some
characters are never transformed.

While busily charging off in all directions you failed to even read
the verbiage I attached, and missed the fact that the conditional
expressions lacked a closing parenthesis, and thus were syntax
errors.

The function will convert test[] to "hELLO" "\xf0" "wORLD".
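
For reference, a sketch of the same function with the two closing parentheses restored. The explicit casts and the NULL comparison are additions for this illustration; otherwise the logic is unchanged. The expected result assumes the default "C" locale, where 0xf0 is neither an uppercase nor a lowercase letter.

```c
#include <ctype.h>
#include <stddef.h>

/* Toggle the case of every letter in s; other characters are left alone.
   A NULL pointer is accepted and ignored. */
void flipcase(char *s)
{
    unsigned char ch;

    if (s == NULL)
        return;
    while ((ch = (unsigned char)*s) != '\0') {
        if (isupper(ch))
            *s = (char)tolower(ch);
        else if (islower(ch))
            *s = (char)toupper(ch);
        s++;
    }
}
```

Because ch is an unsigned char, every call into <ctype.h> receives a value in 0..UCHAR_MAX, so the "\xf0" byte in the test string is examined safely and left as it is.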

Eric Sosman

Jan 16, 2005, 12:17:24 AM

My apologies; I mistook >>> for >> (or maybe the
other way around) in the attrisnipbutions.

--
Eric Sosman
eso...@acm-dot-org.invalid

Eric Sosman

Jan 16, 2005, 12:25:32 AM
Jack Klein wrote:

> On Sat, 15 Jan 2005 19:00:59 GMT, CBFalconer <cbfal...@yahoo.com>
> wrote in comp.lang.c:
>

>>Unusual to want to simply change the case, but try something like:
>>
>>#include <ctype.h>
>>
>>void flipcase(char *s)
>>{
>> unsigned char ch;
>>
>> if (s) /* assuming you want to protect against NULL */
>> while (ch = *s) {
>> if (isupper(ch) *s = tolower(ch);

> [...]


> But suppose:
>
> char test [] = "Hello" "\xf0" "World";
>
> ...then your function causes undefined behavior on an implementation
> with CHAR_BIT 8 and signed char, because you will pass an invalid
> value to tolower() or toupper().

No: The argument is always in the range of `unsigned char'
as required by the Standard. You'll see why this must be so
if you examine the type of the variable `ch' ...

--
Eric Sosman
eso...@adm-dot-org.invalid

Giorgos Keramidas

Jan 16, 2005, 2:17:12 PM
On 2005-01-15 19:00, CBFalconer wrote:
> Pierre wrote:
>> I've been looking for a portable means of changing the case of a
>> string but i've found nothing so far. Does it exists? I guess (and
>> hope) it does..
>
> Unusual to want to simply change the case, but try something like:
>
> #include <ctype.h>
>
> void flipcase(char *s)
> {
> unsigned char ch;
>
> if (s) /* assuming you want to protect against NULL */
> while (ch = *s) {
> if (isupper(ch) *s = tolower(ch);
> else if (islower(ch) *s = toupper(ch);
> s++;
> }
> } /* flipcase, untested */

Missing parentheses in both conditionals :-(

Jack Klein

Jan 16, 2005, 5:09:35 PM
On Sun, 16 Jan 2005 05:16:37 GMT, CBFalconer <cbfal...@yahoo.com>
wrote in comp.lang.c:

> > CBFalconer <cbfal...@yahoo.com>


>
> If you examine my function you will find that isupper/lower and
> toupper/lower are always operating on an unsigned char. The tests
> are necessary, to decide whether to upshift or downshift, although
> the second can probably be eliminated. However that would leave
> the action somewhat unclear, as it is no longer obvious that some
> characters are never transformed.
>
> While busily charging off in all directions you failed to even read
> the verbiage I attached, and missed the fact that the conditional
> expressions lacked a closing parenthesis, and thus were syntax
> errors.
>
> The function will convert test[] to "hELLO" "\xf0" "wORLD".

Sorry, need to have my meds adjusted again, I guess. Please disregard
my previous post.

Peter Nilsson

Jan 17, 2005, 5:07:28 AM
infobahn wrote:
> Lew Pitcher wrote:
> >
> > #include <ctype.h>
> >
> > void UppercaseString(char *string)
> > {
> > for(;*string;++string)
> > if (islower(*string)) *string = toupper(*string);
> > }
>
> Caution is necessary here. The behaviours of islower and toupper
> are undefined if they are passed a value that is neither EOF nor
> representable as an unsigned char. It is good practice, therefore,
> to cast *string to unsigned char.

I believe the cast (conversion) of individual characters is
incorrect. Instead, the byte characters should be interpreted as
unsigned char...

    char *make_upper(char *s)
    {
        unsigned char *us = (unsigned char *) s;
        for (; *us; us++) *us = toupper(*us);
        return s;
    }

The reason being that reinterpretation is more likely to be
correct.

I did once post a query about this...
http://groups.google.com/groups?threadm=4044...@news.rivernet.com.au
--
Peter

Old Wolf

Jan 17, 2005, 3:52:05 PM
Peter Nilsson wrote:
> infobahn wrote:
> > Lew Pitcher wrote:
> > > if (islower(*string)) *string = toupper(*string);
> >
> > Caution is necessary here. The behaviours of islower and toupper
> > are undefined if they are passed a value that is neither EOF nor
> > representable as an unsigned char. It is good practice, therefore,
> > to cast *string to unsigned char.
>
> I believe the cast (conversion) of individual characters is
> incorrect. Instead, the byte characters should be interpreted as
> unsigned char...
>
> unsigned char *us = (unsigned char *) s;
>
> The reason being that reinterpretation is more likely to be
> correct.

Casting a signed char to unsigned is always correct.
So everything else is equally or less likely to be
correct :)
AFAIK the standard does not explicitly say that you
can cast a (char *) to an (unsigned char *) , for example
many compilers warn about parameter type mismatches if you
pass one to a function expecting the other.
However it does say that they must have the same size,
alignment etc. etc. etc. so I don't see how an implementation
could conform but not allow the cast. (Unless it was the DS9k).

Peter Nilsson

Jan 17, 2005, 5:46:31 PM
Old Wolf wrote:
> Peter Nilsson wrote:
> > infobahn wrote:
> > > Lew Pitcher wrote:
> > > > if (islower(*string)) *string = toupper(*string);
> > >
> > > Caution is necessary here. The behaviours of islower and toupper
> > > are undefined if they are passed a value that is neither EOF nor
> > > > representable as an unsigned char. It is good practice,
> > > > therefore, to cast *string to unsigned char.
> >
> > I believe the cast (conversion) of individual characters is
> > incorrect. Instead, the byte characters should be interpreted as
> > unsigned char...
> >
> > unsigned char *us = (unsigned char *) s;
> >
> > The reason being that reinterpretation is more likely to be
> > correct.
>
> Casting a signed char to unsigned is always correct.
> So everything else is equally or less likely to be
> correct :)

Chapter and verse, please.

Consider that I/O functions write to buffers (and strings)
using unsigned char, not char. The string and mem functions
use unsigned char, not char.

My main point is that a cast from char to unsigned char may
NOT yield the original value that was written to the char.

> AFAIK the standard does not explicitly say that you
> can cast a (char *) to an (unsigned char *) ,

6.3.2.3p7 "... When a pointer to an object is converted to a
pointer to a character type, the result points to the lowest
addressed byte of the object. ..."

> for example
> many compilers warn about parameter type mismatches if you
> pass one to a function expecting the other.

Because many implicit conversions _require_ a diagnostic.
> <snip>

--
Peter

infobahn

Jan 18, 2005, 12:35:13 AM
Old Wolf wrote:
> Peter Nilsson wrote:
> > infobahn wrote:
> > > It is good practice, therefore,
> > > to cast *string to unsigned char.
> >
> > I believe the cast (conversion) of individual characters is
> > incorrect. Instead, the byte characters should be interpreted as
> > unsigned char...
> >
> > unsigned char *us = (unsigned char *) s;
> >
> > The reason being that reinterpretation is more likely to be
> > correct.
>
> Casting a signed char to unsigned is always correct.

Yes. His complaint is most strange, since there's nothing at all
wrong with the cast I suggested.

> So everything else is equally or less likely to be
> correct :)
> AFAIK the standard does not explicitly say that you
> can cast a (char *) to an (unsigned char *) ,

You can point an unsigned char * anywhere you can point (within
reason - for example, you wouldn't want to point it at a function).

The closest the Standard comes to formalising this, as far as I can
tell, is:

"Values stored in non-bit-field objects of any other object type
consist of n x CHAR_BIT bits, where n is the size of an object of
that type, in bytes. The value may be copied into an object of type
unsigned char [n] (e.g., by memcpy); the resulting set of bytes is
called the object representation of the value."

This doesn't actually say anything about casting, but it does say
we can represent any object using an array of unsigned char.
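
The memcpy route mentioned in that passage can be sketched as follows. The function name is invented for the example; it copies an int's object representation into an unsigned char array and rebuilds the value from those bytes.

```c
#include <string.h>

/* Copy the object representation of an int into bytes[] and back again.
   Returns the value rebuilt from the bytes, which equals the input. */
int round_trip_through_bytes(int value)
{
    unsigned char bytes[sizeof value];
    int copy;

    memcpy(bytes, &value, sizeof value);   /* object -> unsigned char[n] */
    memcpy(&copy, bytes, sizeof copy);     /* unsigned char[n] -> object */
    return copy;
}
```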

> for example
> many compilers warn about parameter type mismatches if you
> pass one to a function expecting the other.

And rightly so, but not because objects can't be pointed to by
unsigned char *.

> However it does say that they must have the same size,
> alignment etc. etc. etc. so I don't see how an implementation
> could conform but not allow the cast. (Unless it was the DS9k).

I do not believe the DS9K could refuse the cast either.

Peter Nilsson

Jan 18, 2005, 3:49:54 AM
infobahn wrote:
> Old Wolf wrote:
> > Peter Nilsson wrote:
> > > infobahn wrote:
> > > > It is good practice, therefore,
> > > > to cast *string to unsigned char.
> > >
> > > I believe the cast (conversion) of individual characters is
> > > incorrect. Instead, the byte characters should be interpreted as
> > > unsigned char...
> > >
> > > unsigned char *us = (unsigned char *) s;
> > >
> > > The reason being that reinterpretation is more likely to be
> > > correct.
> >
> > Casting a signed char to unsigned is always correct.
>
> Yes. His complaint is most strange, since there's nothing at all
> wrong with the cast I suggested.

6.2.5p3 says:

" An object declared as type char is large enough to store any
" member of the basic execution character set. If a member of the
" basic execution character set is stored in a char object, its
" value is guaranteed to be positive. If any other character is
" stored in a char object, the resulting value is implementation-
" defined but shall be within the range of values that can be
" represented in that type.

This makes it quite clear that plain char may not be sufficient
to represent the values of all (extended) characters in the
execution character set. This is the first clue that a conversion
of a plain char value might not be appropriate.

But let's look at an example...

Suppose we have an implementation with an extended character set
that includes an accented e. For the sake of argument, let's
suppose the coding for that character is 233 (0xE9). This is
representable within a byte on any system, and is therefore a
valid single-byte character.

Let's go on to suppose we read input into a character array, and
that input includes one accented e. Note that ordinary input is
made through "byte input/output functions", so the value stored
in the corresponding byte is 233. Assuming an 8-bit byte, this
has the representation...

11101001

Consider the possible signed plain char value of this
representation on various allowed 8-bit implementations...

2c: -23
1c: -22
sm: -105

Using your cast to convert char to unsigned char, we get...

2c: 233
1c: 234
sm: 151

...only _one_ of which is correct.
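
The figures above can be reproduced with plain integer arithmetic. This is a sketch assuming 8-bit bytes, with helper names made up for the illustration; it decodes a bit pattern under each signed representation and then applies the conversion to unsigned char.

```c
/* Value of an 8-bit pattern read as a two's complement signed char. */
int twos_complement(unsigned bits)
{
    return (bits & 0x80) ? (int)bits - 256 : (int)bits;
}

/* Value of an 8-bit pattern read as a ones' complement signed char. */
int ones_complement(unsigned bits)
{
    return (bits & 0x80) ? -(int)(~bits & 0x7F) : (int)bits;
}

/* Value of an 8-bit pattern read as a sign-and-magnitude signed char. */
int sign_magnitude(unsigned bits)
{
    return (bits & 0x80) ? -(int)(bits & 0x7F) : (int)bits;
}

/* Result of converting a (possibly negative) value to unsigned char:
   the value is reduced modulo UCHAR_MAX + 1, i.e. 256 for 8-bit bytes. */
int to_uchar(int value)
{
    return (int)(unsigned char)value;
}
```

For the pattern 0xE9 the three decoders give -23, -22 and -105, and converting those values gives 233, 234 and 151 respectively.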

If instead we interpret the byte through an unsigned char
pointer, then we get 233, irrespective of the signed plain
char value. Had I considered the character coding of 128,
then the last sentence of 6.2.5p3 says you have _NO_ guarantee
that your cast to unsigned char will produce 128.

That is why the 'interpreted' way is better than 'conversion'.
Note that the string/memory functions interpret, rather than
cast, for similar reasons.
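
The two approaches being compared look like this side by side. It is a sketch with hypothetical function names. On a two's complement implementation both return the same number; they can differ only on the other representations discussed above.

```c
/* Reinterpretation: read the stored byte through an unsigned char pointer. */
int read_as_unsigned(const char *p)
{
    return *(const unsigned char *)p;
}

/* Conversion: read the plain char value, then convert it to unsigned char. */
int convert_to_unsigned(const char *p)
{
    return (unsigned char)*p;
}
```

If the byte 0xE9 is placed into a char object with memcpy, read_as_unsigned() yields 233 on any implementation with 8-bit bytes, whereas convert_to_unsigned() is only certain to yield 233 where plain char is unsigned or uses two's complement.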

--
Peter

Lawrence Kirby

Jan 18, 2005, 6:31:31 AM
On Tue, 18 Jan 2005 00:49:54 -0800, Peter Nilsson wrote:

...

> If instead we interpret the byte through an unsigned char
> pointer, then we get 233, irrespective of the signed plain
> char value. Had I considered the character coding of 128,
> then the last sentence of 6.2.5p3 says you have _NO_ guarantee
> that your cast to unsigned char will produce 128.
>
> That is why the 'interpreted' way is better than 'conversion'.
> Note that the string/memory functions interpret, rather than
> cast, for similar reasons.

The real issue is that neither approach is correct until we know how the
value in the char has been derived in the first place. Maybe the character
value was obtained by converting the return value of getc() to char,
maybe it was written directly by fgets() or fread().

In practice implementations that create inconsistent results for the
various approaches discussed are going to cause problems. In such
environments it would probably be wise for the implementation to define
char as an unsigned type. It is one of those things where the best thing
to do is ignore it until you come across it. You would have to be
AMAZINGLY unlucky for that to happen. IMO you are more likely to encounter
problems due to compiler bugs than this, and you might as well treat this
as such.

Lawrence

Peter Nilsson

Jan 18, 2005, 8:46:49 AM
Lawrence Kirby wrote:
> On Tue, 18 Jan 2005 00:49:54 -0800, Peter Nilsson wrote:
>
> ...
>
> > If instead we interpret the byte through an unsigned char
> > pointer, then we get 233, irrespective of the signed plain
> > char value. Had I considered the character coding of 128,
> > then the last sentence of 6.2.5p3 says you have _NO_ guarantee
> > that your cast to unsigned char will produce 128.
> >
> > That is why the 'interpreted' way is better than 'conversion'.
> > Note that the string/memory functions interpret, rather than
> > cast, for similar reasons.
>
> The real issue is that neither approach is correct until we know
> how the value in the char has been derived in the first place.
> Maybe the character value was obtained by converting the return
> value of getc() to char, maybe it was written directly by fgets()
> or fread().

This is generally within the control of the programmer. Reading
input into char arrays by assigning values returned by fgetc is
wrong... in the theoretical sense. That a lot of programs do it
(K&R2 does it) doesn't make it any less 'wrong'.
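
One way to sidestep the char conversion entirely is to read into an unsigned char buffer. The following is only a sketch under that assumption; read_bytes is a made-up name. fgetc() reports each byte as an unsigned char value converted to int, so storing it in an unsigned char loses nothing.

```c
#include <stdio.h>

/* Read up to cap bytes from fp into buf, storing each byte exactly as
   fgetc() reports it (an unsigned char value). Returns the count read. */
size_t read_bytes(FILE *fp, unsigned char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    while (n < cap && (c = fgetc(fp)) != EOF)
        buf[n++] = (unsigned char)c;
    return n;
}
```

Keeping c as an int until after the EOF comparison matters: EOF is negative, so it cannot be confused with any byte value.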

> In practice implementations that create inconsistent results for
> the various approaches discussed are going to cause problems. In
> such environments it would probably be wise for the implementation
> to define char as an unsigned type.

It would be even better if the standard actually _required_ this
for qualified implementations.

Personally, I think the standard is defective, not merely because
of the above issues, but also in the way it treats character
constants.

Consider an 8-bit implementation where plain char is signed, uses
non two's complement, but supports a subset of iso646. C99, by
my reading, _requires_ that such implementations generate a value
_other than_ 233 for the character constants '\xe9' and '\u00e9'!

That said, I don't honestly claim to be able to rectify the standard
in a way that a significant majority of C diehards would approve of.

> It is one of those things where the best thing to do is ignore it
> until you come across it.
>
> You would have to be AMAZINGLY unlucky for that to happen. IMO you
> are more likely to encounter problems due to compiler bugs than
> this, and you might as well treat this as such.

I agree, but I note that a modern C programmer would have to be
'amazingly unlucky' to ever program a hosted implementation that
didn't use two's complement, or had 9-bit chars, or uses different
sized pointers for different (object or incomplete) pointer types,
has integer padding bits, ... and all the other things which are
regularly cited in clc as being supposedly relevant considerations.

Such things are so esoteric as to be worth ignoring. Nonetheless, I
still believe clc would be doing a disservice to its readers if it
did not mention them.

--
Peter

Richard Bos

Jan 18, 2005, 11:15:29 AM
"Peter Nilsson" <ai...@acay.com.au> wrote:

> Old Wolf wrote:
> > Peter Nilsson wrote:
> > > infobahn wrote:
> > > > Lew Pitcher wrote:
> > > > > if (islower(*string)) *string = toupper(*string);
> > > >
> > > > Caution is necessary here. The behaviours of islower and toupper
> > > > are undefined if they are passed a value that is neither EOF nor
> > > > > representable as an unsigned char. It is good practice,
> > > > > therefore, to cast *string to unsigned char.
> > >
> > > I believe the cast (conversion) of individual characters is
> > > incorrect. Instead, the byte characters should be interpreted as
> > > unsigned char...
> > >
> > > unsigned char *us = (unsigned char *) s;
> > >
> > > The reason being that reinterpretation is more likely to be
> > > correct.
> >
> > Casting a signed char to unsigned is always correct.
> > So everything else is equally or less likely to be
> > correct :)
>
> Chapter and verse, please.

6.2.5#9, 6.2.6.1#3 and 4, 6.3.1.3#1 and #2, 6.5.3.4#1 (which clauses
together guarantee that for each value of signed char there is a unique
corresponding value for unsigned char, and the conversion cannot fail:
unsigned char has no trap bits, both signed and unsigned char are one
byte large, conversion to unsigned types cannot overflow and has
well-defined results).
7.4#1 (which specifies that the <ctype.h> functions take an int, with
the value of an unsigned char or EOF).

> My main point is that a cast from char to unsigned char may
> NOT yield the original value that was written to the char.

I'm afraid they must.

Richard

Old Wolf

Jan 18, 2005, 3:15:13 PM
infobahn wrote:
> Old Wolf wrote:
>
> > AFAIK the standard does not explicitly say that you
> > can cast a (char *) to an (unsigned char *) ,
>
> You can point an unsigned char * anywhere you can point (within
> reason - for example, you wouldn't want to point it at a function).

Right. I meant to also say "...and get the expected result".

After reading Peter Nilsson's last post, I think his point
was that if you want to access the representation of a byte,
then you must point to it with (unsigned char *) and then read
it. This is of course different to reading the C value of a
signed char, and then converting to unsigned (because of
non-2s-complement systems).
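
To make the two readings concrete, here is a small sketch; the function names by_conversion and by_reinterpretation are made up for illustration. On a two's complement implementation both return the same number; the distinction only matters on ones' complement or sign-magnitude systems.

```c
#include <assert.h>
#include <stddef.h>

/* Value conversion: read the char's C value, then convert that value
   to unsigned char (reduced modulo UCHAR_MAX + 1). */
unsigned char by_conversion(const char *s, size_t i)
{
    assert(s != NULL);
    return (unsigned char)s[i];
}

/* Reinterpretation: read the same byte through an unsigned char
   pointer, which yields the object representation as stored. */
unsigned char by_reinterpretation(const char *s, size_t i)
{
    assert(s != NULL);
    return ((const unsigned char *)s)[i];
}
```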

Keith Thompson

Jan 18, 2005, 4:06:14 PM
"Peter Nilsson" <ai...@acay.com.au> writes:
[...]

> Personally, I think the standard is defective, not merely because
> of the above issues, but also in the way it treats character
> constants.
>
> Consider an 8-bit implementation where plain char is signed, uses
> non two's complement, but supports a subset of iso646. C99, by
> my reading, _requires_ that such implementations generate a value
> _other than_ 233 for the character constants '\xe9' and '\u00e9'!
>
> That said, I don't honestly claim to be able to rectify the standard
> in a way that a significant majority of C diehards would approve of.

Is there any real advantage (other than not breaking existing
implementations) in allowing plain char to be signed? I know there
are historical reasons, but what would break if the standard required
char to have the same characteristics as unsigned char?

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Eric Sosman

Jan 18, 2005, 9:17:19 PM
Richard Bos wrote:

> "Peter Nilsson" <ai...@acay.com.au> wrote:
>
>>My main point is that a cast from char to unsigned char may
>>NOT yield the original value that was written to the char.
>
> I'm afraid they must.

A counterexample comes to mind. Consider a signed `char'
on a system that uses either ones' complement or signed
magnitude to represent negative integers. On such a system
there are two distinct `char' representations that have the
value zero (unless "minus zero" is a trap value), and both
of them produce the same value (zero) upon conversion to
`unsigned char'. Conversion obliterates the distinction.

Whether all this makes much difference is open to question,
though. A conforming C implementation can use signed magnitude,
can choose signed `char', can even choose CHAR_MAX==ULLONG_MAX,
but if it is a hosted implementation it must still make the I/O
functions work "properly." A successful getc() delivers an `int'
in the range 0..UCHAR_MAX, and if CHAR_MAX<UCHAR_MAX we might
think it unsafe to assign such a value to a plain `char' -- the
attempted conversion, according to the Standard, produces an
implementation-defined result or raises an implementation-defined
signal, and thus cannot be performed in a strictly-conforming
program. However, an implementation capable of reading a valid
character from an input stream but incapable of storing it into
a `char' would be laughed out of the marketplace. It might be
too ambitious to claim that such an implementation violated the
Standard, but "quality of implementation" concerns would, I think,
rule it out. As a practical matter, any system with signed `char'
must do "something reasonable" when it converts an out-of-range
`unsigned char' to plain (signed) `char'; the implementation-
defined aspect will turn out to be "what you wanted."
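
A rough way to check the "what you wanted" behaviour on a given implementation is sketched below. The function name survives_char_roundtrip is hypothetical, and the conversion to char it performs is exactly the implementation-defined step described above, so this is a probe of one implementation rather than a portable guarantee.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if a getc()-style value c (0..UCHAR_MAX) can be stored in a
   plain char and recovered by converting back to unsigned char. */
int survives_char_roundtrip(int c)
{
    char stored;

    assert(c >= 0 && c <= UCHAR_MAX);
    stored = (char)c;       /* implementation-defined if c > CHAR_MAX */
    return (unsigned char)stored == c;
}
```

Calling it for every value from 0 to UCHAR_MAX shows whether the implementation round-trips all input bytes.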

--
Eric Sosman
eso...@acm-dot-org.invalid

pete

Jan 18, 2005, 10:54:16 PM
Eric Sosman wrote:
>
> Richard Bos wrote:
>
> > "Peter Nilsson" <ai...@acay.com.au> wrote:
> >
> >>My main point is that a cast from char to unsigned char may
> >>NOT yield the original value that was written to the char.
> >
> > I'm afraid they must.
>
> A counterexample comes to mind. Consider a signed `char'
> on a system that uses either ones' complement or signed
> magnitude to represent negative integers. On such a system
> there are two distinct `char' representations that have the
> value zero (unless "minus zero" is a trap value), and both
> of them produce the same value (zero) upon conversion to
> `unsigned char'. Conversion obliterates the distinction.

That's not an example of a char value being changed by a cast.
The distinction is not between values.

If you have a char value of -1
and cast that to unsigned char, you get UCHAR_MAX.
UCHAR_MAX won't compare equal to a char value of -1
when UCHAR_MAX is less than or equal to INT_MAX.
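
In code, that might be sketched as follows. The sketch uses signed char so that -1 is representable regardless of whether plain char is signed; the function names are illustrative.

```c
#include <assert.h>
#include <limits.h>

/* Convert a negative signed char value to unsigned char.
   The result is the value plus UCHAR_MAX + 1, so -1 becomes UCHAR_MAX. */
unsigned char to_uchar(signed char c)
{
    assert(c < 0);          /* this sketch only looks at negative values */
    return (unsigned char)c;
}

/* Returns 1 if the converted value still compares equal to the original.
   When UCHAR_MAX <= INT_MAX both operands are promoted to int, so for -1
   the comparison is UCHAR_MAX == -1, which is false. */
int still_compares_equal(signed char c)
{
    return to_uchar(c) == c;
}
```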

--
pete

Richard Bos

Jan 19, 2005, 10:48:26 AM
Eric Sosman <eso...@acm-dot-org.invalid> wrote:

> Richard Bos wrote:
>
> > "Peter Nilsson" <ai...@acay.com.au> wrote:
> >
> >>My main point is that a cast from char to unsigned char may
> >>NOT yield the original value that was written to the char.
> >
> > I'm afraid they must.
>
> A counterexample comes to mind. Consider a signed `char'
> on a system that uses either ones' complement or signed
> magnitude to represent negative integers. On such a system
> there are two distinct `char' representations that have the
> value zero (unless "minus zero" is a trap value), and both
> of them produce the same value (zero) upon conversion to
> `unsigned char'. Conversion obliterates the distinction.

It's dubious whether this can be called a difference in _value_, though.
They're both zero.

> Whether all this makes much difference is open to question,
> though. A conforming C implementation can use signed magnitude,
> can choose signed `char', can even choose CHAR_MAX==ULLONG_MAX,
> but if it is a hosted implementation it must still make the I/O
> functions work "properly."

True. Which means that it's probably only possible to input one of the
two zeroes anyway.

Richard

Lawrence Kirby

Jan 20, 2005, 11:19:05 AM
On Tue, 18 Jan 2005 21:06:14 +0000, Keith Thompson wrote:

> "Peter Nilsson" <ai...@acay.com.au> writes:
> [...]
>> Personally, I think the standard is defective, not merely because
>> of the above issues, but also in the way it treats character
>> constants.
>>
>> Consider an 8-bit implementation where plain char is signed, uses
>> non two's complement, but supports a subset of iso646. C99, by
>> my reading, _requires_ that such implementations generate a value
>> _other than_ 233 for the character constants '\xe9' and '\u00e9'!
>>
>> That said, I don't honestly claim to be able to rectify the standard
>> in a way that a significant majority of C diehards would approve of.
>
> Is there any real advantage (other than not breaking existing
> implementations) in allowing plain char to be signed? I know there
> are historical reasons, but what would break if the standard required
> char to have the same characteristics as unsigned char?

There is of course a huge body of platform-specific code that assumes the
existing conventions for that platform such as the signedness of char.
Implementations themselves should be able to make the transition
fairly easily, although implementation code can quite legitimately
assume properties of the implementation, so if those properties are
changed some fixing and testing work would be needed. There is also the
issue of whether the change could produce a performance hit on some
implementations.
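
As a sketch of the kind of code that would be affected, compare a test that quietly relies on plain char being signed with one that does not. All three function names are hypothetical, and the sketch assumes 8-bit bytes.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if plain char is a signed type on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Platform-specific: detects a "high bit" character only where plain
   char is signed; always returns 0 where plain char is unsigned. */
int high_bit_set_assuming_signed(char c)
{
    return c < 0;
}

/* Independent of the signedness of plain char: convert the value to
   unsigned char first, then test the top bit. */
int high_bit_set_portable(char c)
{
    assert(CHAR_BIT == 8);  /* sketch assumes 8-bit bytes */
    return ((unsigned char)c & 0x80) != 0;
}
```

Code written in the first style would change behaviour if the standard made plain char unsigned; code in the second style would not.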

Lawrence

Lawrence Kirby

Jan 20, 2005, 11:37:24 AM
On Tue, 18 Jan 2005 21:17:19 -0500, Eric Sosman wrote:

> Richard Bos wrote:
>
>> "Peter Nilsson" <ai...@acay.com.au> wrote:
>>
>>>My main point is that a cast from char to unsigned char may
>>>NOT yield the original value that was written to the char.
>>
>> I'm afraid they must.
>
> A counterexample comes to mind. Consider a signed `char'
> on a system that uses either ones' complement or signed
> magnitude to represent negative integers. On such a system
> there are two distinct `char' representations that have the
> value zero (unless "minus zero" is a trap value), and both
> of them produce the same value (zero) upon conversion to
> `unsigned char'. Conversion obliterates the distinction.

Since they both represent the same value there wasn't a distinction to
start with. Characters are represented by value, you cannot have two
different characters represented by the same value. It isn't the
conversion to unsigned char that causes the problem, that exists
whatever you do while the character is being represented and manipulated
as a char. Having multiple representations for a value will cause problems
for I/O handling so, as you say...

> Whether all this makes much difference is open to question,
> though. A conforming C implementation can use signed magnitude, can
> choose signed `char', can even choose CHAR_MAX==ULLONG_MAX, but if it is
> a hosted implementation it must still make the I/O functions work
> "properly." A successful getc() delivers an `int' in the range
> 0..UCHAR_MAX, and if CHAR_MAX<UCHAR_MAX we might think it unsafe to
> assign such a value to a plain `char' -- the attempted conversion,
> according to the Standard, produces an implementation-defined result or
> raises an implementation-defined signal, and thus cannot be performed in
> a strictly-conforming program. However, an implementation capable of
> reading a valid character from an input stream but incapable of storing
> it into a `char' would be laughed out of the marketplace. It might be
> too ambitious to claim that such an implementation violated the
> Standard, but "quality of implementation" concerns would, I think, rule
> it out. As a practical matter, any system with signed `char' must do
> "something reasonable" when it converts an out-of-range `unsigned char'
> to plain (signed) `char'; the implementation-defined aspect will turn
> out to be "what you wanted."

... a realistic implementation will avoid the possibility.

Lawrence

