I always use unsigned chars for my text data, since there are no negative characters, just character
codes. Problem is, strcmp expects chars, and gcc thinks that chars are signed by default, which is
all correct of course, but annoying.
Is there any way to convince it to avoid putting this warning?
Thanks
> ./dictionary.c:156: warning: pointer targets in passing argument 1 to strcmp differ in signedness
[...]
> Is there any way to convince it to avoid putting this warning?
-Wno-pointer-sign
--
"When I have to rely on inadequacy, I prefer it to be my own."
--Richard Heathfield
OK, that worked, thanks
How do you cope with string literals?
--
Ian Collins
Use char.
"If you lie to the compiler, it will get its revenge."
-- Henry Spencer
--
Eric Sosman
eso...@ieee-dot-org.invalid
I do not want to have negative characters!
They are NOT integers, that is why they are UNSIGNED.
If you limit yourself to ASCII, it could be OK, but I do not want just
ASCII.
<snip>
> I do not want to have negative characters!
> They are NOT integers, that is why they are UNSIGNED.
Um, signed or unsigned, characters are integers whether you like it or
not. Presumably you mean that you are thinking of them as text glyphs
rather than as integers (which is fine, by the way - lots of us do
that some or all of the time). I don't see why you think making them
unsigned stops them from being integers.
<snip>
--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
"Usenet is a strange place" - dmr 29 July 1999
Sig line vacant - apply within
So how do you handle string literals?
How do you manage warnings from other compilers? You are producing
non-portable code.
--
Ian Collins
They are integers, from zero to 2^CHAR_BIT.
I.e. they are non negative integer codes.
Since I need to avoid sign extension when manipulating bits, I used unsigned
integers throughout the bit string package. Converting signed chars into
unsigned integers can produce all kinds of nonsense.
I standardized on unsigned char throughout the container library. There
is NO system that I know of that would have a different pointer size
or characteristics for signed or unsigned chars!
Can you give an example of something that doesn't work with plain char
specifically because some characters are negative? I think that can only
happen if you are making bad assumptions.
--
Alan Curry
I assume characters are codes from one to 255. This is a bad assumption
maybe, in some kind of weird logic when you assign a sign to a character
code.
There is a well established confusion in C between characters (that are
encoded as integers) and integer VALUES.
One of the reasons is that we have "signed" and "unsigned" characters.
I prefer not to use any sign in the characters, and treat 152 as character
code 152 and not as -104. Stupid me, I know.
Besides, when I convert it into a bigger type, I would like to get
152, and not 4294967192.
Of course, when YOU see 4294967192 you think immediately:
Ahhh of course, that is character code 152 that got converted into an int, then cast
to unsigned and got that weird value...
Since size_t is unsigned, converting to unsigned is a fairly common operation.
Or when comparing, I get
warning: "comparison between signed and unsigned".
And MANY other bugs and stuff I do not want to get involved with. Writing software
is difficult enough without having to bother with the sign of characters or the
sex of angels, or the number of demons you can fit in a pin's head.
The most annoying is using the character class tests isxxxx.
Technically, a cast is needed to be portable:
    char *cp = ...;
    ...
    if (isdigit((unsigned char)*cp)) ...
So far, I have not found any implementation that does not handle this
correctly. So much code exists without the cast that C libraries
that run on machines with signed char make sure that nothing bad
happens. Still, I think it is an example that matches what you asked
for, though not a strong one since the solution is simple.
I don't like using unsigned char for plain strings since it suggests
other uses. As a result, I end up putting the cast in where it's
needed.
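A minimal sketch of where that cast ends up in practice. The helper name count_digits is hypothetical (it is not from the thread); the point is only that each plain char is converted to unsigned char before it reaches isdigit, so a negative value is never passed to a ctype function.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the decimal digits in a plain-char string. */
size_t count_digits(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        /* The cast keeps the argument in the range 0..UCHAR_MAX,
           which is what the <ctype.h> functions require. */
        if (isdigit((unsigned char)*s))
            ++n;
    }
    return n;
}
```

The string itself stays plain char, so it can still be passed to strcmp, puts and friends without any cast.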
--
Ben.
<snip>
> [Characters] are integers, from zero to 2^CHAR_BIT.
2^CHAR_BIT - 1, but yes. But you said they weren't integers. You seem
to be accepting that that was a mistake. Fair enough.
> I.e. they are non negative integer codes.
Right.
<snip>
> There is NO system that I know of that would have a different
> pointer size or characteristics for signed or unsigned chars!
Likewise.
So don't do that. If the values are relevant at all, you should be using
unsigned char explicitly, not plain char.
>
>There is a well established confusion in C between characters (that are
>encoded as integers) and integer VALUES.
Indeed, you can get confused if you rely too much on the fact that char is an
integer type.
>
>Besides, when I convert it into a bigger type, I would like to get
>152, and not 4294967192.
There's an easy answer for that: never convert plain char to a bigger type.
My rule on plain chars is that they should only be used for real characters,
which are things that are read from and/or written to a text stream. If your
char variable is not really a character (i.e. it didn't come from a text
stream and it will never be printed to a text stream) it should be declared
explicitly as signed or unsigned.
The standard library does add some confusion with the ctype.h functions that
work on characters as characters but require them to be unsigned. Don't look
in ctype.h for examples of good design.
>
>Since size_t is unsigned, converting to unsigned is a fairly common operation.
How does a character value (which is charset-dependent anyway) become a size?
I can't see how that makes sense.
>
>warning: "comparison between signed and unsigned".
I see a lot of those when compiling other people's code, and sometimes my own
too, and usually I fix it by changing whichever thing was signed to unsigned,
and this is usually an improvement.
I've done that so many times, it makes me think that perhaps C got the
default integer signedness wrong. If plain int, short, and long had all been
unsigned, with the "signed" keyword being required to declare signed
variables, there might be fewer problems.
--
Alan Curry
Yes, they do.
> I always use unsigned chars for my text data, since there are no negative characters, just character
> codes. Problem is, strcmp expects chars, and gcc thinks that chars are signed by default, which is
> all correct of course, but annoying.
Yup.
> Is there any way to convince it to avoid putting this warning?
Cast arguments to the type strcmp expects. Or use 'char' for text data,
since it is the native type for text data, and use 'unsigned char' when
you want to manipulate raw bits.
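If the text really is kept in unsigned char, the casts can be confined to one place. This is only a sketch: the wrapper name ustrcmp is made up for illustration, and it relies on the assumption discussed in this thread that a pointer to unsigned char can be reinterpreted as a pointer to plain char.

```c
#include <string.h>

/* Hypothetical wrapper: compares two unsigned-char strings with strcmp.
   The casts are what silence gcc's pointer-signedness warning. */
int ustrcmp(const unsigned char *a, const unsigned char *b)
{
    return strcmp((const char *)a, (const char *)b);
}
```

Callers then write ustrcmp(key, entry) and the rest of the code never mentions plain char.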
-s
--
Copyright 2009, all wrongs reversed. Peter Seebach / usenet...@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
You won't on a machine where character values are never negative.
> They are NOT integers, that is why they are UNSIGNED.
Apparently, they are not.
> If you limit yourself to ASCII, it could be OK, but I do not want just
> ASCII.
That's fine, plain char should be able to hold those values fine on any
system.
To be picky, BTW, even on a system where plain char is unsigned, unsigned
char and plain char are two different types.
No problem, cast to unsigned char before converting. :)
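A small sketch of the difference, assuming CHAR_BIT is 8. Both function names are hypothetical. With a 32-bit unsigned int, the direct conversion of the value -104 yields 4294967192, the number complained about earlier in the thread, while going through unsigned char first yields 152.

```c
/* Converts straight to unsigned int: a negative value wraps around
   modulo UINT_MAX + 1, which looks like sign extension. */
unsigned int widen_direct(signed char c)
{
    return (unsigned int)c;
}

/* Converts to unsigned char first, so the result stays in 0..UCHAR_MAX. */
unsigned int widen_via_uchar(signed char c)
{
    return (unsigned int)(unsigned char)c;
}
```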
Not necessarily. I've used plenty of systems that support, among
other character sets, Latin-1 (ISO 8859-1), which uses the full 8-bit
range from 0 to 255, but on which plain char is 8 bits and signed.
On such a system, with the right locale settings, this program:
#include <stdio.h>
int main(void)
{
    const char *s = "This is a Yen sign: '\xa5'";
    puts(s);
    return 0;
}
will produce the expected output, even though puts takes an argument
of type "const char*", not "const unsigned char*".
The thing is, we tend to depend on this kind of thing to Just Work,
but I'd have to go through several sections of the standard to figure
out just what's guaranteed (I'll do that later).
> To be picky, BTW, even on a system where plain char is unsigned, unsigned
> char and plain char are two different types.
Yup.
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
> Richard Heathfield a écrit :
>> In <hdkpl5$8aq$1...@aioe.org>, jacob navia wrote:
>>
>> <snip>
>>
>>> I do not want to have negative characters!
>>> They are NOT integers, that is why they are UNSIGNED.
>>
>> Um, signed or unsigned, characters are integers whether you like it
>> or not. Presumably you mean that you are thinking of them as text
>> glyphs rather than as integers (which is fine, by the way - lots of
>> us do that some or all of the time). I don't see why you think
>> making them unsigned stops them from being integers.
>>
>> <snip>
>>
>
> They are integers, from zero to 2^CHAR_BIT.
>
> I.e. they are non negative integer codes.
If you are viewing them as a collection of bytes, rather than as
strings, you really should be using memcmp instead of strcmp. Of
course, you need to know the length.
But C doesn't have a native "collection of positive integers from zero
to 2^CHAR_BIT terminated by a zero" function, so won't have native
functions to operate on them either.
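For the byte-collection view, a comparison built on memcmp might look like the sketch below. The name bytes_compare and the rule that a shorter buffer sorts before a longer one sharing its prefix are assumptions made for the example, not something the thread specifies.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: orders two byte buffers of known length.
   Unlike strcmp, embedded zero bytes are compared like any other byte. */
int bytes_compare(const unsigned char *a, size_t alen,
                  const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int r = memcmp(a, b, n);

    if (r != 0)
        return r;
    /* Common prefix is equal: the shorter buffer sorts first. */
    return (alen > blen) - (alen < blen);
}
```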
--
Online waterways route planner: http://canalplan.org.uk
development version: http://canalplan.eu
That first "function" there is wrong, and really the paragraph would be
better written as:
But C doesn't have the concept of "collection of positive integers from
zero to 2^CHAR_BIT terminated by a zero", so won't have functions to
operate on them either.
Yes, there should have been signed and unsigned byte. And a separate char
type equivalent to (or a synonym for) unsigned byte.
It really is exasperating when most people in this group insist that signed
character codes are perfectly normal and sensible!
Apparently chars are signed because on the PDP11 or some such machine,
sign-extending byte values was faster than zero-extending them. A bit
shortsighted. (If it had been the other way around, they would of course
have been singing the praises of unsigned char codes; except they would have
been justified this time..)
> I prefer not to use any sign in the characters, and treat 152 as character
> code 152 and not as -104. Stupid me, I know.
As I understand it, you can easily choose to use unsigned char type for such
codes. The problem being when passing these to library functions where char
is signed and this triggers a warning?
> Besides, when I convert it into a bigger type, I would like to get
> 152, and not 4294967192.
Why doesn't widening a signed value into an unsigned one itself trigger a
warning?
--
bartc
Text manipulation is much better done on a higher level.
>> In article <hdkpl5$8aq$1...@aioe.org>, jacob navia <j...@nospam.org> wrote:
>>>
>>>I do not want to have negative characters!
>>
>> Can you give an example of something that doesn't work with plain char
>> specifically because some characters are negative? I think that can only
>> happen if you are making bad assumptions.
>
>The most annoying is using the character class tests isxxxx.
>Technically, a cast is needed to be portable:
>
> char *cp = ...;
> ...
> if (isdigit((unsigned char)*cp)) ...
And if testing in a loop, you may want to cast separately from the test.
Like in this trim function:
#include <ctype.h>

/* Trim leading and trailing whitespace in place; *ts is advanced
   past any leading whitespace. */
static void
trim (char **ts)
{
    unsigned char *exam;
    unsigned char *keep;

    exam = (unsigned char *) *ts;
    while (*exam && isspace (*exam)) {
        ++exam;
    }
    *ts = (char *) exam;
    if (!*exam) {
        return;
    }
    keep = exam;
    while (*++exam) {
        if (!isspace (*exam)) {
            keep = exam;
        }
    }
    if (*++keep) {
        *keep = '\0';
    }
}
--
Webmail for Dialup Users
http://www.isp2dial.com/freeaccounts.html
It's certainly a bad assumption on machines where `char'
runs from -128 to 127 ...
> There is a well established confusion in C between characters (that are
> encoded as integers) and integer VALUES.
A character -- loosely, a glyph like 'A' -- is not something
computers nowadays can represent directly in their memories.
Unable to store an actual 'A', they instead store a number like
65 or 193, and say "When thought of as a character, the value
refers to the 65th/193d entry in a list of glyphs." The members
of that list and the order in which they appear are a matter of
convention, nothing more.
It's not really different from the convention that "zero is
false, anything else is true." Some other languages use other
conventions, like "even values are false, odds are true." Neither
scheme is inherently more "right" or "wrong" than the other; it's
just a matter of convention, of a correspondence between the
notions one wants to represent and the numbers that are all the
computer can store internally.
What I'm getting at is that there is (or need be) no confusion
between storing a character and storing a number: The computer always
does the latter and never does the former. When we talk about
"storing a character," it's just a convenient verbal shorthand for
"storing the number that represents a character." And the data type
C uses for this purpose is `char'. Some awkwardnesses stem from this
choice, mostly having to do with the library, and getting the library
to work nicely sometimes involves converting the numbers to and
from other types -- see getchar() or isalpha(), for instance. But
when you want to store character codes, use `char'. Use `unsigned
char' or `signed char' when you want to store small numbers that
are *not* to be thought of as characters.
> I prefer not to use any sign in the characters, and treat 152 as character
> code 152 and not as -104. Stupid me, I know.
152 is not a character; it is a number. In one popular
encoding scheme it corresponds to the character 'q', by virtue
of one of those conventional correspondences. If you want a 'q',
use a `char' and store 'q' in it. If you want the number 152
in a small space, use an `unsigned char' -- but don't think of
it as a character, because it isn't one.
> Besides, when I convert it into a bigger type, I would like to get
> 152, and not 4294967192.
Much depends on the type to which you are converting, and
on why you are performing the conversion.
> Since size_t is unsigned, converting to unsigned is a fairly common
> operation.
It sounds very much as if you are dealing with "raw" numbers,
not with numbers that correspond to characters. If so, it's
quite strange that you are using strcmp() on assemblages of these
numbers, because strcmp() isn't well-suited to the task.
> Writing software
> is difficult enough without having to bother with the sign of characters
> or the
> sex of angels, or the number of demons you can fit in a pin's head.
A little thought about the artificiality of number-to-glyph
correspondences will remove much of the difficulty.
--
Eric Sosman
eso...@ieee-dot-org.invalid
I disagree. Ideally char should be a separate type which is *nothing* to
do with integer types. So to assign a char to an integer type you have
to cast it to that type (just as with pointers).
> It really is exasperating when most people in this group insist that signed
> character codes are perfectly normal and sensible!
Insisting that they are perfectly normal is *not* the same as saying
that it is sensible.
> Apparently chars are signed because on the PDP11 or some such machine,
> sign-extending byte values was faster than zero-extending them. A bit
> shortsighted. (If it had been the other way around, they would of course
> have been singing the praises of unsigned char codes; except they would
> have
> been justified this time..)
Ah, but the people you are complaining about would probably accept that
char being unsigned is *also* perfectly normal.
>> I prefer not to use any sign in the characters, and treat 152 as
>> character
>> code 152 and not as -104. Stupid me, I know.
>
> As I understand it, you can easily choose to use unsigned char type for
> such
> codes. The problem being when passing these to library functions where char
> is signed and this triggers a warning?
More to the point, why does he actually care whether a given character
value happens to be positive or negative? The only time it matters that
I can see is when using certain specific functions in the C library, and
unfortunately then you need a cast.
Of course, with gcc you can (on many architectures) select whether char
is signed or unsigned; it is of course still a distinct type.
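The gcc options in question are -fsigned-char and -funsigned-char. As a rough sketch, code that wants to know which choice is in force can check CHAR_MIN from <limits.h>; the function name below is hypothetical.

```c
#include <limits.h>

/* Returns 1 if plain char is a signed type in this build, 0 otherwise.
   CHAR_MIN is 0 when plain char is unsigned and negative when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Whatever the answer, char, signed char and unsigned char remain three distinct types, so the pointer-signedness warning can still appear.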
>> Besides, when I convert it into a bigger type, I would like to get
>> 152, and not 4294967192.
>
> Why doesn't widening a signed value into an unsigned one itself trigger a
> warning?
Why should it? In any case, as others mentioned, a cast will fix this.
Although I have to wonder why the char is being assigned to a larger
unsigned integer type in the first place, it seems an odd thing to do to me.
--
Flash Gordon
>> Writing software
>> is difficult enough without having to bother with the sign of characters
>> or the
>> sex of angels, or the number of demons you can fit in a pin's head.
>
> A little thought about the artificiality of number-to-glyph
> correspondences will remove much of the difficulty.
Making char types always positive would remove all the difficulties.
And there are difficulties because this issue keeps coming up.
--
Bartc
> On Sat, 14 Nov 2009 01:04:35 +0000, Ben Bacarisse <ben.u...@bsb.me.uk>
> wrote:
<snip>
>>Technically, a cast is needed to be portable:
>>
>> char *cp = ...;
>> ...
>> if (isdigit((unsigned char)*cp)) ...
>
> And if testing in a loop, you may want to cast separately from the test.
> Like in this trim function:
>
>
> static void
> trim (char **ts)
> {
> unsigned char *exam;
> unsigned char *keep;
>
> exam = (unsigned char *) *ts;
> while (*exam && isspace (*exam)) {
You can remove the *exam test.
> ++exam;
> }
> *ts = (char *) exam;
> if (!*exam) {
> return;
> }
> keep = exam;
> while (*++exam) {
> if (!isspace (*exam)) {
> keep = exam;
> }
> }
> if (*++keep) {
> *keep = '\0';
> }
And here you could replace the whole 'if' with 'keep[1] = 0;'.
Neither of them is wrong, of course, but every test makes the reader
wonder why it is there.
> }
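Applying both suggestions to the quoted function gives something like the following sketch. The name trim_simplified is hypothetical, and dropping the *exam test assumes that isspace(0) is false, which holds in the "C" locale.

```c
#include <ctype.h>

/* Variant of the quoted trim() with both suggested simplifications.
   The string must be modifiable; *ts is advanced past leading whitespace. */
void trim_simplified(char **ts)
{
    unsigned char *exam = (unsigned char *)*ts;
    unsigned char *keep;

    /* isspace(0) is false, so this loop stops at the terminator by itself. */
    while (isspace(*exam))
        ++exam;
    *ts = (char *)exam;
    if (*exam == '\0')
        return;

    /* Remember the last non-space character seen. */
    keep = exam;
    while (*++exam) {
        if (!isspace(*exam))
            keep = exam;
    }
    /* Unconditional store: it rewrites the terminator when there is
       nothing to cut off. */
    keep[1] = '\0';
}
```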
--
Ben.
>> static void
>> trim (char **ts)
>> {
>> unsigned char *exam;
>> unsigned char *keep;
>>
>> exam = (unsigned char *) *ts;
>> while (*exam && isspace (*exam)) {
>
>You can remove the *exam test.
But then you're testing whether '\0' is a space or not. Perhaps it
improves performance, but is it good programming?
>> ++exam;
>> }
>> *ts = (char *) exam;
>> if (!*exam) {
>> return;
>> }
>> keep = exam;
>> while (*++exam) {
>> if (!isspace (*exam)) {
>> keep = exam;
>> }
>> }
>> if (*++keep) {
>> *keep = '\0';
>> }
>
>And here you could replace the whole 'if' with 'keep[1] = 0;'.
>Neither of them is wrong, of course, but every test makes the reader
>wonder why it is there.
But then you replace '\0' with '\0'. Which is worse, one extra test, or
a redundant action?
The test yields "false," so what's wrong with it?
Or, to turn it around, what would your response be to
while (*exam && *exam != '#' && *exam != 'X' && isspace(*exam))
?
--
Eric Sosman
eso...@ieee-dot-org.invalid
>John Kelly wrote:
>> On Sat, 14 Nov 2009 15:27:03 +0000, Ben Bacarisse <ben.u...@bsb.me.uk>
>> wrote:
>>
>>>> static void
>>>> trim (char **ts)
>>>> {
>>>> unsigned char *exam;
>>>> unsigned char *keep;
>>>>
>>>> exam = (unsigned char *) *ts;
>>>> while (*exam && isspace (*exam)) {
>>> You can remove the *exam test.
>>
>> But then you're testing whether '\0' is a space or not. Perhaps it
>> improves performance, but is it good programming?
>
> The test yields "false," so what's wrong with it?
'\0' is not part of the string, it's a pseudo length specifier, and
conceptually, should not be treated as part of the string. You can get
away with it in this case, but it's a bad programming habit to rely on
environmental assumptions.
With real length specifiers, you wouldn't test one position beyond the
end of the string, so why do it with NUL terminated strings? It's just
a stupid C trick for some dubious performance gain. For my use of that
code, the performance gain doesn't amount to a drop in a bucket.
I would rather think portably, as in from one language to another. I
may use tricks when performance really matters, but then I would include
some remark about my choice and why.
The nul terminator is part of the string in C. It's not an environmental
assumption, it's a definition.
> I would rather think portably, as in from one language to another. I
> may use tricks when performance really matters, but then I would include
> some remark about my choice and why.
You can't meaningfully "think portably" about C strings, because they're
not really analogous to things in other languages.
The letter 'é' is 130. Why I should have it as -126 ???
The problem is that you ignore foreign languages and all their special
characters like é or è or à or £ or...
>> Besides, when I convert it into a bigger type, I would like to get
>> 152, and not 4294967192.
>
> Much depends on the type to which you are converting, and
> on why you are performing the conversion.
>
Most the conversions are indirect, or because some operation with characters
is done by promoting, etc etc.
>> Since size_t is unsigned, converting to unsigned is a fairly common
>> operation.
>
> It sounds very much as if you are dealing with "raw" numbers,
> not with numbers that correspond to characters. If so, it's
> quite strange that you are using strcmp() on assemblages of these
> numbers, because strcmp() isn't well-suited to the task.
>
Sure, if we accept that 'é' is not a character THEN obviously
"strcmp is not well suited to the task.
What function should I use then?
>> Writing software
>> is difficult enough without having to bother with the sign of
>> characters or the
>> sex of angels, or the number of demons you can fit in a pin's head.
>
> A little thought about the artificiality of number-to-glyph
> correspondences will remove much of the difficulty.
>
No. A little thought will make you use unsigned chars everywhere.
UNLESS you want signed small integers!
The '\0' *is* a part of the string. 7.1.1p1:
"A /string/ is a contiguous sequence of characters
terminated by and including the first null character. [...]"
The "environmental assumption" is thus on the same level as the
assumption that stdout designates a FILE*.
> With real length specifiers, you wouldn't test one position beyond the
> end of the string, so why do it with NUL terminated strings? It's just
> a stupid C trick for some dubious performance gain. For my use of that
> code, the performance gain doesn't amount to a drop in a bucket.
Okay: If your complaint is "C strings shouldn't be That Way,"
fine. We had a huge and unenlightening wrangle over this issue
just a month or so ago. But if the presence of the '\0' bothers
you, it's hard to see how `while (*exam && ...)' assuages your
worries, given its explicit '\0' test.
> I would rather think portably, as in from one language to another. I
> may use tricks when performance really matters, but then I would include
> some remark about my choice and why.
Ah, but how do you set off the remarks, without the non-portable
assumption that comments are surrounded by /*...*/ or by //...'\n'?
At some point you simply *must* assume that the language you use is
as described by the relevant documentation, or you cannot use the
language.
--
Eric Sosman
eso...@ieee-dot-org.invalid
Which has the potential to misbehave on ones' complement machines if
*cp is -0 (you might get 0 rather than UCHAR_MAX), so it's better to
cast the pointer:
if (isdigit(*(unsigned char *)cp)) ...
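Wrapped in a function, the pointer-cast form could look like this sketch. The name is_digit_at is hypothetical; the byte is read through a pointer to unsigned char, so no value conversion from a possibly negative char takes place.

```c
#include <ctype.h>

/* Hypothetical helper: tests the character that cp points at.
   The object is read as an unsigned char rather than converted
   after being read as a plain char. */
int is_digit_at(const char *cp)
{
    return isdigit(*(const unsigned char *)cp) != 0;
}
```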
--
Larry Jones
It's like SOMEthing... I just can't think of it. -- Calvin
>> '\0' is not part of the string, it's a pseudo length specifier, and
>> conceptually, should not be treated as part of the string. You can get
>> away with it in this case, but it's a bad programming habit to rely on
>> environmental assumptions.
>
> The '\0' *is* a part of the string. 7.1.1p1:
>
> "A /string/ is a contiguous sequence of characters
> terminated by and including the first null character. [...]"
So they say. But conceptually, I think otherwise.
> If your complaint is "C strings shouldn't be That Way,"
No, when I use C, I work around its limitations.
>fine. We had a huge and unenlightening wrangle over this issue
>just a month or so ago. But if the presence of the '\0' bothers
>you, it's hard to see how `while (*exam && ...)' assuages your
>worries, given its explicit '\0' test.
The string is data and the '\0' is metadata. The standard say it's all
data, but that's what someone else said. I think the '\0' is metadata,
serving as a pseudo length specifier.
I'm not worried the standard will change and break my program. I could
remove the *exam test to reduce the loop test to a single condition, if
performance really mattered. But it doesn't in this case. And leaving
it in reminds me how to think. I don't want to forget how to think.
> Ah, but how do you set off the remarks, without the non-portable
>assumption that comments are surrounded by /*...*/ or by //...'\n'?
>At some point you simply *must* assume that the language you use is
>as described by the relevant documentation, or you cannot use the
>language.
I think I can use C effectively without being a slave to the standard.
>> "A /string/ is a contiguous sequence of characters
>> terminated by and including the first null character. [...]"
> So they say. But conceptually, I think otherwise.
This explains a fair bit.
I think you're mistaken, though. Conceptually, the object includes all its
storage. The terminating null byte is part of the storage of the object;
that's why you have to allocate space for it when allocating a string, for
instance.
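A minimal sketch of that allocation point, using a hypothetical copy_string helper: strlen does not count the terminator, so the buffer needs one extra byte, and the terminator is copied along with the rest.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of s, or NULL if malloc fails.
   The caller is responsible for calling free() on the result. */
char *copy_string(const char *s)
{
    size_t len = strlen(s);        /* length without the terminator */
    char *copy = malloc(len + 1);  /* one extra byte for the '\0' */

    if (copy != NULL)
        memcpy(copy, s, len + 1);  /* copies the terminator as well */
    return copy;
}
```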
>> If your complaint is "C strings shouldn't be That Way,"
> No, when I use C, I work around its limitations.
You might find it more rewarding to adapt to the model of a language
when using it.
> The string is data and the '\0' is metadata. The standard say it's all
> data, but that's what someone else said.
But since the someone else defines the language, they win.
> I think the '\0' is metadata, serving as a pseudo length specifier.
It is, yes. Sometimes metadata is mixed in with data for various reasons.
Tables may contain sentinel values. Those values are metadata, but it
doesn't make them not part of the table.
> I'm not worried the standard will change and break my program. I could
> remove the *exam test to reduce the loop test to a single condition, if
> performance really mattered. But it doesn't in this case. And leaving
> it in reminds me how to think. I don't want to forget how to think.
I think you would do better to adopt idioms which remind you to think
like C, not idioms which remind you to think that you're programming
something else but using C to express it for unknown reasons.
> I think I can use C effectively without being a slave to the standard.
Oh, certainly. But you can't use C effectively without making good use
of the standard. Going beyond what the standard allows can make sense
in some contexts. Pretending it doesn't offer the guarantees that it does,
however, is crippling.
> Eric Sosman a écrit :
>>> I prefer not to use any sign in the characters, and treat 152 as
>>> character
>>> code 152 and not as -104. Stupid me, I know.
>>
>> 152 is not a character; it is a number. In one popular
>> encoding scheme it corresponds to the character 'q', by virtue
>> of one of those conventional correspondences. If you want a 'q',
>> use a `char' and store 'q' in it. If you want the number 152
>> in a small space, use an `unsigned char' -- but don't think of
>> it as a character, because it isn't one.
>>
>
> The letter 'é' is 130. Why I should have it as -126 ???
> The problem is that you ignore foreign languages and all their special
> characters like é or è or à or £ or...
No it's not. It's 195 169.
The problem is that you assume everything is the same.
>
>>> Besides, when I convert it into a bigger type, I would like to get
>>> 152, and not 4294967192.
So in a bigger type it should be 43459.
The numeric value corresponding to 'é' is 130 in some
encodings, -126 in others, and for all I know 250 in still
others. Why should you care what the number is, as long
as you get an 'é' when you want one?
> [...]
>> It sounds very much as if you are dealing with "raw" numbers,
>> not with numbers that correspond to characters. If so, it's
>> quite strange that you are using strcmp() on assemblages of these
>> numbers, because strcmp() isn't well-suited to the task.
I wrote this because you kept on about numbers, numbers,
numbers, and not characters. But perhaps I misguessed, and
you've confused numbers and characters. You're now talking
about the character 'é', which you insist "is" the number 130.
That's a needless confusion, and seems to be the source of
your grief.
Would you say that the ancient physician Galen was born
around AD 'é', or that Edison's first successful light bulb
test took place 'é' years ago, or that Smarty Jones won the
'é'th running of the Kentucky Derby? If not, why do you say
that 'é' "is" 130?
> Sure, if we accept that 'é' is not a character THEN obviously
> "strcmp is not well suited to the task.
>
> What function should I use then?
If you want to store the code for the character 'é', store it
in a char. If that char is part of a string, you can use strcmp()
on it.
If you want to store the number 130 in a small space, store
it in an unsigned char. Don't use strcmp() on it.
--
Eric Sosman
eso...@ieee-dot-org.invalid
Okay, so the C you're talking about is not the C that
"they" talk about, where "they" are several international
and national standards organizations.
>> If your complaint is "C strings shouldn't be That Way,"
>
> No, when I use C, I work around its limitations.
Which C do you mean here? Kelly C, or internationally
agreed-upon C?
> The string is data and the '\0' is metadata. The standard say it's all
> data, but that's what someone else said. I think the '\0' is metadata,
> serving as a pseudo length specifier.
If "the standard say" [sic] isn't good enough for you, what
is there to discuss?
> I think I can use C effectively without being a slave to the standard.
Knowledge of is not enslavement to; it's Haddocks' Eyes.
--
Eric Sosman
eso...@ieee-dot-org.invalid
Why should you care if they are negative? They are not 0 and they
represent the appropriate character.
>>> Besides, when I convert it into a bigger type, I would like to get
>>> 152, and not 4294967192.
>>
>> Much depends on the type to which you are converting, and
>> on why you are performing the conversion.
>
> Most the conversions are indirect, or because some operation with
> characters
> is done by promoting, etc etc.
I still can't see why you hit this. I can't think of any cases where
I've needed to compare a character specifically with an unsigned number.
>>> Since size_t is unsigned, converting to unsigned is a fairly common
>>> operation.
I can't think of any time I've needed to compare a character to a size_t.
>> It sounds very much as if you are dealing with "raw" numbers,
>> not with numbers that correspond to characters. If so, it's
>> quite strange that you are using strcmp() on assemblages of these
>> numbers, because strcmp() isn't well-suited to the task.
>
> Sure, if we accept that 'é' is not a character THEN obviously
> "strcmp is not well suited to the task.
>
> What function should I use then?
If it's a string then strcmp. This is not a problem because strcmp will
handle this case perfectly.
>>> Writing software
>>> is difficult enough without having to bother with the sign of
>>> characters or the
>>> sex of angels, or the number of demons you can fit in a pin's head.
>>
>> A little thought about the artificiality of number-to-glyph
>> correspondences will remove much of the difficulty.
>
> No. A little thought will make you use unsigned chars everywhere.
> UNLESS you want signed small integers!
Whilst I agree that the definition of a char is less helpful than it
could be, I don't think using unsigned char throughout solves the problem.
--
Flash Gordon
Because plain char is signed on the implementation you're using.
(Just curious: What is it on lcc-win?)
[...]
> No. A little thought will make you use unsigned chars everywhere.
> UNLESS you want signed small integers!
Except that the standard library functions that deal with character
strings use plain char, not unsigned char -- though the plain chars
are often *interpreted* as unsigned chars.
For example, consider strcmp(). Its declaration is:
int strcmp(const char *s1, const char *s2);
and C99 7.21.4 says:
The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters
(both interpreted as unsigned char) that differ in the objects
being compared.
Note that it's not *converted* to unsigned char, it's *interpreted*
as unsigned char. That might have some odd effects on
sign-and-magnitude systems.
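To make the "interpreted, not converted" distinction concrete, here is a rough sketch of a comparison written along the lines of 7.21.4. The name my_strcmp is made up for illustration; this is not how any particular library implements strcmp, only one way the wording could be read.

```c
/* Hypothetical helper (not a standard function): compares two strings
 * by reading each byte through an unsigned char pointer, so the stored
 * bytes are interpreted as unsigned char rather than converted. */
int my_strcmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Advance while the bytes match and the terminator is not reached. */
    while (*p1 != '\0' && *p1 == *p2) {
        p1++;
        p2++;
    }

    /* Yields -1, 0 or 1; only the sign is meaningful, as with strcmp. */
    return (*p1 > *p2) - (*p1 < *p2);
}
```

With this reading, a byte such as 0x82 sorts after every ASCII letter whether plain char is signed or not.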
Also, an octal or hexadecimal escape sequence in a non-wide character
constant or string literal must be within the range of unsigned char.
In effect, the language and library use type char (which may be
signed or unsigned) to hold values in the range 0..UCHAR_MAX,
typically 0..255. This can theoretically cause problems in some
cases, but in practice everything works out. You've seen a case
where the inconsistency produces a compiler warning, but the code
works as expected anyway (and you can inhibit the warning if you
choose).
There's an implicit assumption that an array of plain char can
safely be interpreted as an array of unsigned char, and vice versa.
I'm not convinced that this assumption is entirely justified by the
normative wording of the standard; on the other hand, it's likely
that the assumption is valid on all existing systems.
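If you do keep text in unsigned char buffers, one way to confine the pointer-signedness issue to a single place is a thin wrapper that does the cast once. This is only a sketch; ustrcmp is a hypothetical name, and it leans on the same assumption described above, namely that the bytes can be viewed through either pointer type.

```c
#include <string.h>

/* Hypothetical wrapper: lets code that stores text in unsigned char
 * buffers call strcmp without a cast at every call site. */
int ustrcmp(const unsigned char *a, const unsigned char *b)
{
    return strcmp((const char *)a, (const char *)b);
}
```

The casts are still there, but they live in one function instead of being scattered through the code.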
Personally, I think the language (including the library) would be
cleaner if plain char were required to be unsigned. But there are
historical reasons for leaving it up to the implementation. (I think
making plain char signed made for significantly more efficient code
on the PDP-11; it's likely the same issue occurred on other systems.)
> John Kelly wrote:
<snip>
>> I think I can use C effectively without being a slave to the
>> standard.
>
> Knowledge of is not enslavement to; it's Haddocks' Eyes.
No, it isn't - that's just what its name is called.
--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
"Usenet is a strange place" - dmr 29 July 1999
Sig line vacant - apply within
Isn't there a way to persuade GCC to use the unsigned version of
chars? I know that on at least one vendor's gcc for POWER, it defaults
to unsigned, and there's a switch to make it use signed instead.
Answering my own question with a grep:
-fsigned-char -funsigned-bitfields -funsigned-char
Whether char being unsigned is enough to make them silently
equivalent to unsigned char in all contexts to gcc, I don't know.
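Whatever switch is in effect, the signedness the compiler actually chose can be checked from <limits.h>. A small sketch (the function name is made up for illustration):

```c
#include <limits.h>

/* Returns 1 if plain char is signed on this implementation, 0 otherwise.
 * CHAR_MIN is 0 exactly when plain char has the range of unsigned char. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Note that the standard keeps char, signed char and unsigned char as three distinct types even when char behaves like one of the others, so a pointer conversion between them may still be diagnosed.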
Phil
--
Any true emperor never needs to wear clothes. -- Devany on r.a.s.f1
It might do (where the range is taken as half-open). The implementation
is free to have CHAR_MIN=0 and CHAR_MAX=UCHAR_MAX.
Note that if you've got two char variables cpz and cmz
holding the plus-zero and minus-zero representations, both
fputc(cpz,stream) and fputc(cmz,stream) output exactly the
same character: There's no way to tell from examining the
output which variable was written. That being the case, it's
not troubling that isdigit(cpz) == isdigit(cmz).
--
Eric Sosman
eso...@ieee-dot-org.invalid
7.21.4 Comparison functions
[#1] The sign of a nonzero value returned by the comparison
functions memcmp, strcmp, and strncmp is determined by the
sign of the difference between the values of the first pair
of characters (both interpreted as unsigned char) that
differ in the objects being compared.
So strcmp thinks that, for the purposes of comparison of strings,
the characters in strings should be treated as unsigned chars.
So in some contexts, strings are more reasonably considered to be
a sequence of unsigned chars.
Why not all contexts?
Please take that 'kick me' sign off your back. We're weak-willed
here and may do as you command.
What is the type that strcmp uses to compare such text?
I don't know about efficiency, but consider:
char x;
short x;
int x;
long x;
long long x;
Why should one of these five types default to unsigned, when the others
all default to signed?
I would sort of have preferred that 'signed' not be a keyword, and plain char
be always-signed. However, here we run into the essential clash between
'char' as thing to hold characters and 'char' as shortest basic integer
type.
Perhaps the correct solution would have been to use 'byte' and 'unsigned
byte', then have 'typedef <...> char_t' as the basic type used for strings,
etc.
With the benefit of 20-20 hindsight, I'm sure we would use unsigned or
maybe Unicode. But we are stuck with the baggage of 7 bit ASCII.
As others have pointed out, when dealing with text, the value is what
matters, the representation is irrelevant. For non-textual data where
the value is relevant, use unsigned char.
--
Ian Collins
> [...] consider:
>
> char x;
> short x;
> int x;
> long x;
> long long x;
>
> Why should one of these five values default to unsigned, when the
> others all default to signed?
Because you're on an EBCDIC system with 8-bit chars. The code point of
'0' is 11110000. Since '0' is part of the basic character set, its
value is required to be positive. So you have no choice - char must
default to unsigned because the Standard requires it on such a
system.
<snip>
> [...] we are stuck with the baggage of 7 bit ASCII.
7-bit ASCII? 7-BIT ASCII? I used to *dream* of being stuck with 7-bit
ASCII!
Because one of them is used primarily to hold character codes, which
are normally thought of as unsigned values.
There is a conflict between the idea that char is a type used to hold
character codes, and that char is a narrow integer type. The way C
has (mostly) resolved this conflict is steeped in historical accident,
and I think it would have been done quite differently if the language
were being designed from scratch today.
> I would sort of have preferred that 'signed' not be a keyword, and plain char
> be always-signed. However, here we run into the essential clash between
> 'char' as thing to hold characters and 'char' as shortest basic integer
> type.
>
> Perhaps the correct solution would have been to use 'byte' and 'unsigned
> byte', then have 'typedef <...> char_t' as the basic type used for strings,
> etc.
Or make char a fundamental type that isn't part of the family of
integer types.
One example: In Ada, "Character" is an enumerated type, and character
literals like 'x' are permitted as enumerators.
A solution for a hypothetical C-like language might be something like
this:
The integer types are byte, short, int, long, and long long
(or choose better names if you like). There are signed and
unsigned versions of each of these; each name by itself refers
to the signed version. For example, "byte" and "signed byte"
are different names for the same type; "unsigned byte" is another
type with the same size but a different range. (Or maybe "byte"
is an exception, with "byte" being an alias for "unsigned byte",
since unsigned bytes are more useful. Either way, the choice
is made by the language, *not* by the implementation.)
Type char is distinct from any of these types, and can hold
a single character value. char acts like an unsigned type,
in the sense that converting a char value to a sufficiently
wide type always yields a nonnegative value.
Deciding whether char is actually an integer type, and whether
conversions between char and (other) integer types may be done
implicitly, is left as an exercise.
Of course it's way too late to change C in this way.
Try telling that to the Europeans!
--
Ian Collins
int. :P
What it actually compares, we're told, is the values "interpreted as
unsigned char" (which does not imply a conversion), but imagine that
you were to write this:
if (*(unsigned char *)s != *(unsigned char *)t)
The type used for this comparison is, of course, int. Unless int and
unsigned char are the same size, in which case, it's unsigned int.
>>> Besides, when I convert it into a bigger type, I would like to get
>>> 152, and not 4294967192.
>>
>> Why doesn't widening a signed value into an unsigned one itself trigger a
>> warning?
>
> Why should it? In any case, as others mentioned, a cast will fix this.
> Although I have to wonder why the char is being assigned to a larger
> unsigned integer type in the first place, it seems an odd thing to do to
> me.
Try this:
int offset;
char c;
unsigned char uc;
c=uc=130;
offset=10;
printf("Defchar <%u>\n",c+offset);
printf("Defchar <%d>\n",c+offset);
printf("Unsigned <%u>\n",uc+offset);
printf("Unsigned <%d>\n",uc+offset);
You expect 140 to be printed. But using a default char type, you will get
apparent nonsense displayed when this happens to be signed. Using an
explicit unsigned char, you get the 140 you expect, with both %d and %u
formats.
This can be fixed by workarounds, but really people have other matters to
worry about than fixing problems caused by C's idiosyncrasies.
--
Bartc
On the contrary, it *avoids* the odd effect of -0 being interpreted as 0
rather than as UCHAR_MAX, which it might well do if it were converted.
--
Larry Jones
I don't want to be THIS good! -- Calvin
>> No, when I use C, I work around its limitations.
>
> Which C do you mean here? Kelly C, or internationally
>agreed-upon C?
>
>> The string is data and the '\0' is metadata. The standard say it's all
>> data, but that's what someone else said. I think the '\0' is metadata,
>> serving as a pseudo length specifier.
>
> If "the standard say" [sic] isn't good enough for you, what
>is there to discuss?
It's so easy to make people angry here. Without even trying. This must
be Trolls' Paradise.
Exactly!
If you can avoid bugs, why not avoid them?
Why? I can see no good reason to do what you are doing.
> int offset;
> char c;
> unsigned char uc;
>
> c=uc=130;
> offset=10;
>
> printf("Defchar <%u>\n",c+offset);
> printf("Defchar <%d>\n",c+offset);
> printf("Unsigned <%u>\n",uc+offset);
> printf("Unsigned <%d>\n",uc+offset);
>
> You expect 140 to be printed. But using a default char type, you will
> get apparent nonsense displayed when this happens to be signed. Using an
> explicit unsigned char, you get the 140 you expect, with both %d and %u
> formats.
Well, with the number of bits of implementation defined (or maybe
undefined) behaviour you could get nonsense, but I get numbers which I
expect on my implementation.
> This can be fixed by workarounds, but really people have other matters to
> worry about than fixing problems caused by C's idiosyncrasies.
I've yet to see a good argument *why* you care about the numeric value
when you are using it as a character. In your example above you are
clearly using it as a number, so that is not relevant to the discussion.
--
Flash Gordon
I don't understand. You have the situation where this code:
char c=130;
if (c+10==140)
puts("It worked as expected.");
else
puts("It didn't work!");
does not do what you expect, and you're perfectly happy with this?
char values containing character representations which are negative, get
unexpectedly sign-extended when used in mixed arithmetic. Usually this is
undesirable, and unexpected if you are unaware of the signedness of your
char type.
You can fix this in *your* code, by using unsigned char types, but then you
get type mismatches with other code. Or you can stick (unsigned char)
everywhere, which is really going to help unclutter your code and make it
readable...
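One way to keep the casts from spreading, sketched here under the assumption of an ordinary two's-complement machine, is to do the conversion in a single helper and use that wherever the numeric code is wanted. The name char_code is hypothetical.

```c
/* Hypothetical helper: returns the code of c as a nonnegative int in the
 * range of unsigned char, whether plain char is signed or unsigned. */
int char_code(char c)
{
    /* Conversion to unsigned char is modulo UCHAR_MAX + 1, so a
     * negative char value maps back to its code. */
    return (unsigned char)c;
}
```

With that in place, char_code(c) + 10 == 140 holds for a c that stores code 130, and the rest of the code can go on using plain char for strings.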
As to using character code as numbers, well I've been doing that for two or
three decades, and with codes up to 255 too. It does happen from time
to time that you do arithmetic with numbers representing character codes...
--
Bartc
The question still remains: why are you doing mixed arithmetic on
character values?
--
Ian Collins
>> char values containing character representations which are negative, get
>> unexpectedly sign-extended when used in mixed arithmetic. Usually this is
>> undesirable, and unexpected if you are unaware of the signedness of your
>> char type.
>
> The question still remains: why are you doing mixed arithmetic on
> character values?
Why not?
I didn't even need to make my code so elaborate:
char c=130;
if (c==130)
will fail, and it's not immediately obvious that it *is* mixed arithmetic.
And on my machine:
char c=255;
if (c==EOF)
will be true, but not true for unsigned char. I thought char signedness
wasn't supposed to matter...
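For what it's worth, the EOF case is usually sidestepped by keeping the result of fgetc or getchar in an int and testing it against EOF before it is ever narrowed to char. A minimal sketch, where count_bytes is a made-up name and int is assumed to be wider than char, as it is on the usual hosted systems:

```c
#include <stdio.h>

/* Counts the bytes remaining in a stream. fgetc returns each byte as an
 * unsigned char converted to int, so a byte with value 255 stays distinct
 * from EOF as long as the result is kept in an int. */
long count_bytes(FILE *in)
{
    long n = 0;
    int ch;    /* int, not char: must be able to hold EOF as well */

    while ((ch = fgetc(in)) != EOF)
        n++;
    return n;
}
```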
--
bartc
Already sloppy. 130 isn't a character, it's a number. The rest is just
a demonstration of "garbage in, garbage out"
--
Alan Curry
But why would you write such code? char represents a character, not a
numeric value. I could just as easily write
short n = 0x8000;
if( n == 0x8000 )
and in either case, my compiler would give me a handy warning.
> if (c==130)
>
> will fail, and it's not immediately obvious that it *is* mixed arithmetic.
>
> And on my machine:
>
> char c=255;
> if (c==EOF)
>
> will be true, but not true for unsigned char. I thought char signedness
> wasn't supposed to matter...
It doesn't, if you use char to store characters.
--
Ian Collins
> Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>>
>> The most annoying is using the character class tests isxxxx.
>> Technically, a cast is needed to be portable:
>>
>> char *cp = ...;
>> ...
>> if (isdigit((unsigned char)*cp)) ...
>
> Which has the potential to misbehave on ones' complement machines if
> *cp is -0 (you might get 0 rather than UCHAR_MAX), so it's better to
> cast the pointer:
>
> if (isdigit(*(unsigned char *)cp)) ...
Nasty. I'd really, really, hope that char would be unsigned on such a
machine! On a system like that -- with signed char -- even
while (!*cp) ...
breaks, does it not?
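For reference, the value-cast form of the idiom looks like this in a complete function. It is a sketch only (count_digits is a hypothetical name), and it uses the cast of the value rather than of the pointer, so the ones' complement caveat quoted above still applies to it.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts the decimal digits in s. Each char is converted to unsigned char
 * before it reaches isdigit, so a negative value is never passed in. */
size_t count_digits(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s))
            n++;
    }
    return n;
}
```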
--
Ben.
I think that's undefined behavior (you assigned a value to c that didn't
fit in the type, possibly).
Because he can? As I understand it, signed char and unsigned
char are integer types. One expects to be able to do mixed
arithmetic on integer types. However char is an exotic integer
type with the peculiar property that its signedness is undefined
by the language.
The problem is simple: Char is an ill-defined integer type;
despite its name it is not a character type.
Richard Harter, c...@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Infinity is one of those things that keep philosophers busy when they
could be more profitably spending their time weeding their garden.
> On Sat, 14 Nov 2009 15:27:03 +0000, Ben Bacarisse <ben.u...@bsb.me.uk>
> wrote:
<snip>
>>> while (*exam && isspace (*exam)) {
>>
>>You can remove the *exam test.
>
> But then you're testing whether '\0' is a space or not. Perhaps it
> improves performance, but is it good programming?
<snip>
>>> if (*++keep) {
>>> *keep = '\0';
>>> }
>>
>>And here you could replace the whole 'if' with 'keep[1] = 0;'.
>>Neither of them is wrong, of course, but every test makes the reader
>>wonder why it is there.
>
> But then you replace '\0' with '\0'. Which is worse, one extra test, or
> a redundant action?
I can't imagine you think I'd suggest something I thought was bad
programming or a change that made the code worse, so your questions
are, I presume, rhetorical -- intended to prompt readers to decide for
themselves.
That was the purpose of my post. I think
while (isspace(*exam)) { ... }
and
keep[1] = 0;
are both slightly better whereas you don't. At least now, anyone who
has not yet decided such matters for themselves can see the options.
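To show both forms in context, here is a sketch of a trim routine written in that style. It is not the code under discussion, which isn't reproduced here; trim is a hypothetical name, and the (unsigned char) casts are added for the reason given elsewhere in the thread.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: strips leading and trailing white space in place
 * and returns a pointer to the first character that is not white space. */
char *trim(char *s)
{
    char *end;

    /* No separate test for '\0' is needed: isspace returns 0 for it in
     * the "C" locale, so the loop stops at the terminator by itself. */
    while (isspace((unsigned char)*s))
        s++;

    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;

    *end = '\0';    /* unconditional store; harmless if already '\0' */
    return s;
}
```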
--
Ben.
>I can't imagine you think I'd suggest something I thought was bad
>programming or a change that made the code worse, so your questions
>are, I presume, rhetorical -- intended to prompt readers to decide for
>themselves.
Treating '\0' as data in a NUL terminated string seems unnatural to me,
despite what the standard says. I know it's data in the sense of taking
up storage, but I think of it as metadata, a pseudo length specifier.
>At least now, anyone who has not yet decided such matters for
>themselves can see the options.
I'm not saying your C idiom is bad. For C, it's good. But how good can
C be, is the question.
He can also go to work naked, but that isn't a particularly good idea
either.
> As I understand it, signed char and unsigned
> char are integer types. One expects to be able to do mixed
> arithmetic on integer types. However char is an exotic integer
> type with the peculiar property that its signedness is undefined
> by the language.
Which is why it shouldn't be used for mixed arithmetic.
> The problem is simple: Char is an ill-defined integer type;
> despite its name it is not a character type.
It is used to hold the representation of a single character.
--
Ian Collins
> He can also go to work naked, but that isn't a particularly good idea
> either.
I think that rather depends on his line of work. And also on whether
he telecommutes.
>> The problem is simple: Char is an ill-defined integer type;
>> despite its name it is not a character type.
> It is used to hold the representation of a single character.
Except when that has to be done in unsigned char, such as when using
the ctype functions...
Surely if you are programming in C, you should use C idiom - as it's
what anyone reading your code is going to expect. If you find C's idiom
too repulsive to use (and I don't think anyone would claim that there
aren't things in C that would never get in there now but that we're
stuck with), then don't use C.
But to try and force it into some sort of "like all other programs" mode
just means that you're writing unfriendly C.
--
Online waterways route planner: http://canalplan.org.uk
development version: http://canalplan.eu
Ok, so how do I assign a character code to c that happens to be the code
130, and that happens to have a different encoding from the one C
understands?
It's quite common to want to use character codes from 0 to 255, and it's
understandable that someone may want to use single chars and char arrays to
store them in.
But that's apparently full of gotchas. But what's more annoying is the
experts smugly explaining every time the distinction between numbers,
characters, character codes, glyphs, and whatnot.
Instead of explaining why you can't do this or that, why couldn't the C99
people have fixed the problem instead?
--
Bartc
Doesn't it?
char c=130,d=140;
if (c+10==d) seems to work as you'd expect.
--
Bartc
Probably the usual problem: it would have broken stuff. I think I'd
have been a bit more ruthless than they were in several areas, but I'm
not a major organisation with a huge installed codebase - if I was, I
think I'd agree with what they did.
char is really only any good for holding characters, and only those
characters in a particular subset (I think this is what they call the
"execution character set") - crucially /not/ all the characters that can
be displayed on the host machine.
If you do treat them as characters, and never try to look at the numeric
values it works surprisingly well, even when they are outside the normal
range. For example, I find that I can read and write UTF8 into normal C
strings (but don't expect strlen to give you what you'd expect!).
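On the strlen point: it counts bytes, not characters. If you want the number of code points in a UTF-8 string, one common approach is to skip the continuation bytes. A rough sketch, assuming the input is valid UTF-8 (utf8_length is a made-up name):

```c
#include <stddef.h>

/* Counts code points in a UTF-8 string by skipping continuation bytes,
 * which have the bit pattern 10xxxxxx. Assumes s is valid UTF-8. */
size_t utf8_length(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Look at the byte as unsigned char before masking. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a five-byte string such as "caf\xC3\xA9", strlen gives 5 while this gives 4.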
You'd write
char c = '\x82';
> It's quite common to want to use character codes from 0 to 255, and it's
> understandable that someone may want to use single chars and char arrays to
> store them in.
>
> But that's apparently full of gotchas. But what's more annoying is the
> experts smugly explaining every time the distinction between numbers,
> characters, character codes, glyphs, and whatnot.
I suspect that there is a communication problem here, to some extent.
You've encountered a problem in the past which was fiddly to solve,
but in describing it you've simplified to the extent that the problem
disappears.
char data usually just comes from some input and goes to some output;
it is rare to calculate with the values. If you need to calculate
with the codes, you are doing arithmetic so need to be sure of range
of the integer type you are using and your code should probably use
explicitly signed or unsigned char types rather than plain char. (A
few calculations are covered by the various guarantees in the standard
(c - '0' for example) but these are the exception.)
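As an illustration of the guaranteed case, here is a sketch of digit parsing that relies only on '0' through '9' being contiguous and increasing, which the standard requires of every execution character set. The name parse_digits is hypothetical, and there is no overflow checking: the unsigned result simply wraps.

```c
/* Parses a run of leading decimal digits and returns their value.
 * Relies on '0'..'9' being contiguous, so *s - '0' is the digit value. */
unsigned parse_digits(const char *s)
{
    unsigned value = 0;

    while (*s >= '0' && *s <= '9') {
        value = value * 10 + (unsigned)(*s - '0');
        s++;
    }
    return value;
}
```

The signedness of plain char doesn't matter here, because the digit characters are members of the basic character set and are therefore nonnegative.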
> Instead of explaining why you can't do this or that, why couldn't the C99
> people have fixed the problem instead?
I think you are overstating the degree to which there is a problem.
How complex was it, eventually, to get round the problem that you
encountered? I can't say why "the problem" was not fixed because I am
not sure exactly what it is. Changing anything as basic as the way
C's char is defined or the interface to the C library would require
clear evidence of a major problem.
--
Ben.
OK. But then you have this little anomaly:
int C = '\x82';
int D = 0x82;
You might expect C==D, but that isn't the case. Just something else to
explain that probably wouldn't need explaining if chars were not signed.
>
>> It's quite common to want to use character codes from 0 to 255, and it's
>> understandable that someone may want to use single chars and char arrays
>> to
>> store them in.
>>
>> But that's apparently full of gotchas.
> I suspect that there is a communication problem here, to some extent.
> You've encountered a problem in the past which was fiddly to solve,
> but in describing it you've simplified to the extent that the problem
> disappears.
>
> char data usually just comes from some input and goes to some output;
> it is rare to calculate with the values.
If I was writing Cobol, that might be the case. But I do just as much
messing about with char codes as with integers.
>> Instead of explaining why you can't do this or that, why couldn't the C99
>> people have fixed the problem instead?
>
> I think you are overstating the degree to which there is a problem.
Up to now it was just one of those things. Every so often, something that
didn't work, would suddenly start working as soon as a few 'unsigned char's
were sprinkled about. I don't use C seriously enough to worry about it. (And
if I did use it seriously there are plenty of other things to take issue
with.)
But when someone as experienced as jacob navia says there is a problem, then
you listen.
> How complex was it, eventually, to get round the problem that you
> encountered? I can't say why "the problem" was not fixed because I am
> not sure exactly what it is. Changing anything as basic as the way
> C's char is defined or the interface to the C library would require
> clear evidence of a major problem.
Insisting on char being unsigned by default I think would be more useful
than otherwise. Why do all the C compilers on my Windows machine use signed
char? What is the advantage?
Someone who wants small negative integers can explicitly write signed char.
For all other purposes, there is no reason for char to be signed.
--
bartc
> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
> news:0.56a34609daac81aaf5a8.2009...@bsb.me.uk...
>> "bartc" <ba...@freeuk.com> writes:
>> <snip>
>>> Ok, so how do I assign a character code to c that happens to be the code
>>> 130, and that happens to have a different encoding from the one C
>>> understands?
>>
>> You'd write
>>
>> char c = '\x82';
>
> OK. But then you have this little anomaly:
>
> int C = '\x82';
> int D = 0x82;
>
> You might expect C==D, but that isn't the case.
Presumably you meant char C?
> Just something else to
> explain that probably wouldn't need explaining if chars were not
> signed.
I don't think it is possible to "rescue" C from its ancient history.
You get into trouble if you assume that '\x82' necessarily equals
0x82 but I can't see why anyone has to make that assumption, though I
agree that many probably do.
<snip>
>> I think you are overstating the degree to which there is a problem.
>
> Up to now it was just one of those things. Every so often, something
> that didn't work, would suddenly start working as soon as a few
> unsigned char's were sprinkled about. I don't use C seriously enough
> to worry about it. (And if I did use it seriously there are plenty of
> other things to take issue with.)
>
> But when someone as experienced as jacob navia says there is a
> problem, then you listen.
I prefer to take note when the problem is explained. Currently, I
just can't see it, which I agree may be my fault, but I can't see a
problem just because someone is experienced. It needs to be
explained.
>> How complex was it, eventually, to get round the problem that you
>> encountered? I can't say why "the problem" was not fixed because I am
>> not sure exactly what it is. Changing anything as basic as the way
>> C's char is defined or the interface to the C library would require
>> clear evidence of a major problem.
>
> Insisting on char being unsigned by default I think would be more
> useful than otherwise. Why do all the C compilers on my Windows
> machine use signed char? What is the advantage?
You'd have to ask a compiler author. Jacob made char signed on his
Windows implementation so you could ask him.
> Someone who wants small negative integers can explicitly write signed
> char. For all other purposes, there is no reason for char to be
> signed.
Changing C needs a powerful motivation, and I have not seen one
presented that merits a change to one of C's basic types.
--
Ben.
Not angry, no, but impatient. There's a serious point here,
one that is perhaps not appreciated by those who didn't use C in
the Bad Old Days. Before the ANSI Standard, "C" was whatever an
implementor felt like implementing. All one had to do was read
the White Book and start ringing one's favorite changes on it.
The result was that while it was often possible to write portable
"C," it was a difficult and clumsy business. The line about C
combining the power of assembly language with the portability of
assembly language dates from this era, when portability could be
achieved only by larding the source with #ifdef's to an extent
that would startle today's pampered practitioners.
When the ANSI Standard came along, imaginative implementors
were reined in somewhat and "C" became something that could be
agreed upon. You no longer needed an #ifdef to decide whether
to include <string.h> or <strings.h>, or to find out whether
sprintf() returned a count or a pointer, or to figure out whether
integer promotions preserved value or preserved sign, or ... It
became enormously easier (although still not trivial) to write
portable C code, portable to an extent that was economically
infeasible in the Bad Old Days.
And all this easing of difficulty and reduction of expense
came from -- what? From a single document that everyone could
point to and say "That is the official definition of C." Well,
not quite: The benefits flowed not from the document itself --
it's just words -- but from the agreement to accept the document
as the definition, the agreement that it was a bug, not a feature,
if one's C implementation failed to adhere to the Standard.
So when you decide to reject the Standard's definition of C
and substitute your own, it seems to me you are rejecting all the
benefits an agreed-upon definition generates. You appear to want
to return to the lawless Wild West, a place and time where life
was difficult and brief. (Yet even in the Wild West standards
had benefits: Just try finding ammunition for a .42 caliber
seven-shooter ...)
> This must
> be Trolls' Paradise.
Herein, alas, you are right.
--
Eric Sosman
eso...@ieee-dot-org.invalid
>"C," it was a difficult and clumsy business. The line about C
>combining the power of assembly language with the portability of
>assembly language dates from this era, when portability could be
>achieved only by larding the source with #ifdef's to an extent
>that would startle today's pampered practitioners.
C portability is still hard due to environmental differences. I can
port dh from Linux to BSD without too much work, but beyond that, say to
Solaris, the going gets tough.
>So when you decide to reject the Standard's definition of C
>and substitute your own, it seems to me you are rejecting all the
>benefits an agreed-upon definition generates. You appear to want
>to return to the lawless Wild West, a place and time where life
>was difficult and brief.
My point about C string data vs. '\0' metadata was not a broad rejection
of standards. To me it was a trivial thing, a possible basis for an
interesting discussion of C programming vs. programming in general. My
apologies for arousing such ire.
>Richard Harter wrote:
>> On Sun, 15 Nov 2009 14:40:57 +1300, Ian Collins
>> <ian-...@hotmail.com> wrote:
>>
>>> The question still remains: why are you doing mixed arithmetic on
>>> character values?
>>
>> Because he can?
>
>He can also go to work naked, but that isn't a particularly good idea
>either.
Whether or not it is a good idea (actually I go to work naked
sometimes) is not the point; it can be done, the language is
structured to permit this peculiar operation (with no other good
reason than history), and there is a lot of code that does use
that bad idea.
>
>> As I understand it, signed char and unsigned
>> char are integer types. One expects to be able to do mixed
>> arithmetic on integer types. However char is an exotic integer
>> type with the peculiar property that its signedness is undefined
>> by the language.
>
>Which is why it shouldn't be used for mixed arithmetic.
Exactly. More than that it shouldn't be used for arithmetic at
all.
>
>> The problem is simple: Char is an ill-defined integer type;
>> despite its name it is not a character type.
>
>It is used to hold the representation of a single character.
Not exactly. It is used to hold the representation of certain classes
of single characters. Thus the need for unicode. What is more,
it is not a representation, it is a handle. I admit that this is
a fine point, but a representation or proxy is a standin that has
the superficial appearance of the thing being represented. A
handle, on the other hand, is an arbitrary token or reference.
Excellent question!
In this case, I think existing practice really was a killer -- quite simply,
way too many systems existed that had made each decision about what 'char'
was, and there were compelling advantages to having plain char represent
the system's native preferences.
Basically, I can't think of a fix that wouldn't break millions of lines of
code. Maybe the right solution would have been to introduce a new type,
but that seemed a bit drastic.
It certainly is a foundational flaw that 'char' is supposed to be both the
smallest addressable unit of at least 8 bits, and also the execution
character set. MHO.
#include <stdio.h>
int C = '\x100';
int D = 0x100;
int main(void) {
printf("%d %d\n", C, D);
return 0;
}
And there's something we wouldn't have to explain if we hadn't guaranteed
that characters were at least 9 bits.
Basically, I expect '' to give me the local character set, which may or may not
be the same thing as a particular numeric value.
> But when someone as experienced as jacob navia says there is a problem, then
> you listen.
Listen, yes. Agree, not always.
> Insisting on char being unsigned by default I think would be more useful
> than otherwise. Why do all the C compilers on my Windows machine use signed
> char? What is the advantage?
I think one of the reasons people used that in the past was that it was
consistent with the other types; no qualifier means signed. There may be
systems on which the ability to represent "-1" in the same object you'd use
to hold members of your tiny character set is also useful.
> Presumably you meant char C?
Nope! '' interprets the value as a character then promotes to int. So
on my system, '\400' == '\0'.
Two issues here:
1. You're writing something intrinsically system-specific.
2. If you were more familiar with POSIX, you would probably not have this
problem.
Understanding the difference between "what this compiler accepts" and "what
is guaranteed for the set of environments I'm looking at" is important and
useful.
> My point about C string data vs. '\0' metadata was not a broad rejection
> of standards. To me it was a trivial thing, a possible basis for an
> interesting discussion of C programming vs. programming in general. My
> apologies for arousing such ire.
Again, you're showing classic NPD traits here.
No one's angry at you. They disagree with you and they're telling you why.
Disagreeing with you, or claiming you are incorrect, is not attacking. If
your brain perceives a disagreement as an attack, you are showing a diagnostic
criterion for NPD and should probably get that checked out, because that is
one of the more destructive cognitive problems you could possibly have.
>Again, you're showing classic NPD traits here.
>No one's angry at you. They disagree with you and they're telling you why.
>Disagreeing with you, or claiming you are incorrect, is not attacking. If
>your brain perceives a disagreement as an attack, you are showing a diagnostic
>criterion for NPD and should probably get that checked out, because that is
>one of the more destructive cognitive problems you could possibly have.
Are you a licensed psychiatrist too?
You should be more careful about slandering people. Some might take it
seriously.
> "Ian Collins" <ian-...@hotmail.com> wrote in message
> > bartc wrote:
>
> >> char values containing character representations which are negative, get
> >> unexpectedly sign-extended when used in mixed arithmetic. Usually this is
> >> undesirable, and unexpected if you are unaware of the signedness of your
> >> char type.
> >
> > The question still remains: why are you doing mixed arithmetic on
> > character values?
>
> Why not?
Because, as you know damned well, it's the Wrong Thing to do. You use
plain char for characters. If you want a small integer, you use
explicitly signed or unsigned char.
When I read this whole whine about how it's _soooo_ broken that char
does not behave the same on all systems, I am reminded of 1950s writing
manuals which assumed that all pupils were right-handed. Teachers have
grown out of that prejudice; it's time that C programmers do the same.
Richard
> On 2009-11-15, Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
>> "bartc" <ba...@freeuk.com> writes:
>>> OK. But then you have this little anomaly:
>>>
>>> int C = '\x82';
>>> int D = 0x82;
>>>
>>> You might expect C==D, but that isn't the case.
>
>> Presumably you meant char C?
>
> Nope! '' interprets the value as a character then promotes to int. So
> on my system, '\400' == '\0'.
Not exactly, but only because "promote" is a technical term. '\x82'
is of type int so no promotion happens, but the effect is the same, I
agree. The exact wording in the standard is a little cumbersome.
Ironically, I think both bartc's point and mine are better made
without any intervening variables. He thinks many people will assume
that '\x82' == 0x82 must be 1 and I don't think anyone should --
that's what the escapes are for.
--
Ben.
No, and I have not offered a clinical diagnosis.
> You should be more careful about slandering people. Some might take it
> seriously.
No slander. I have a psych degree, I keep up on the literature, and I
offer suggestions as to things people might want to look into, which are
not to be mistaken for a clinical diagnosis. No licensing is required
to argue that someone appears to have traits consistent with a given
disorder.
You have a near perfect sweep of the NPD traits, with your insistence on
believing that people who politely express disagreement with you are
"attacking" you being one of the more outstanding. You view your personal
convenience as absolutely more important than millions of existing users,
too; that's another classic example.
>On 2009-11-15, John Kelly <j...@isp2dial.com> wrote:
>> Are you a licensed psychiatrist too?
>
>No, and I have not offered a clinical diagnosis.
>
>> You should be more careful about slandering people. Some might take it
>> seriously.
>
>No slander. I have a psych degree
I've met psychologists, but not as a client. I've met psychiatrists,
but not as a patient. I understand the significant difference between
the two.
>You have a near perfect sweep of the NPD traits, with your insistence on
>believing that people who politely express disagreement with you are
>"attacking" you being one of the more outstanding.
To "politely express disagreement" does not include slanderously
mutilating my subject lines as you have done many times. Sparring with
you is mildly amusing, but my time is limited. So please don't feel
lonely if I stop.
> char is really only any good for holding characters, and only those
> characters in a particular subset (I think this is what they call the
> "execution character set") - crucially /not/ all the characters that can
> be displayed on the host machine.
>
That's the sort of distinction that used to be meaningful but no longer is.
Nowadays PCs can be expected to run programs which may display characters in
any language, although any individual PC will probably only have two or
three languages in use. Mainframes still typically spit out output to stdout
in some sort of encoding, but often their output files are not read from the
terminal - they are downloaded and displayed on PCs.
So the idea of an "execution character set" is getting wooly.
If your system has CHAR_BIT==8, then '\400' violates a constraint.
See C99 6.4.4.4p9:
Constraints
The value of an octal or hexadecimal escape sequence shall
be in the range of representable values for the type unsigned
char for an integer character constant, or the unsigned type
corresponding to wchar_t for a wide character constant.
Of course the compiler is free to issue a warning and then go on to
treat it as '\0' (which is what gcc does).
Hmm. Since a character constant is of type int, I would have expected
'\x82' to have type int and value +130. But gcc and Sun's C compiler
agree that its value is -126.
C99 6.4.4.4p6 specifies the meaning of a hexadecimal escape sequence:
The hexadecimal digits that follow the backslash and the letter
x in a hexadecimal escape sequence are taken to be part of the
construction of a single character for an integer character
constant or of a single wide character for a wide character
constant. The numerical value of the hexadecimal integer so
formed specifies the value of the desired character or wide
character.
Note that it says "character"; it doesn't refer to the type (plain)
char.
And, of course, the constraint I already quoted says that the value of
the hexadecimal escape sequence must be in the range of type unsigned
char. If '\x82' has the value -126, then it violates the constraint,
which I don't think is the intent.
My tentative conclusion is that the value of '\x82' is supposed to be
+130, not -126, and that both gcc and Sun's compiler get this wrong.
I'd be interested in any counterarguments.
This issue isn't likely to cause problems in real-world code, since
character constants are usually used with objects of type char, signed
char, or unsigned char. There's no good reason to use '\x82' rather
than 0x82 if you want to store it in an int.
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
The day anyone with any intelligence at all takes anything Seebs says
the least bit seriously, is the day we all better get really and
seriously worried about the future of the planet.
I think the answer lies further on, in p10 under semantics:
10 An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined. If an integer character constant contains a
single character or escape sequence, its value is the one that
results when an object with type char whose value is that of the
single character or escape sequence is converted to type int.
The actual int value of '\x82' is the value you'd get from a char with
that "character value" after being converted to int. Since char is
probably signed in the implementation you are using, gcc can give
-126.
I think all the gyrations are to avoid the possibility of an
implementation-defined conversion of an out-of-range value. I think
that is why the standard talks about the value of the character rather
than using more concrete C terms. '\x82' can't just be 0x82 because
then, with a signed char type,
char c = '\x82';
would be governed by 6.3.1.3 p3:
"Otherwise, the new type is signed and the value cannot be
represented in it; either the result is implementation-defined or
an implementation-defined signal is raised."
Instead, '\x82' denotes some character (not char) value that is put in
a char object and that char is converted to int. I think -126 is
correct on signed char machines.
> This issue isn't likely to cause problems in real-world code, since
> character constants are usually used with objects of type char, signed
> char, or unsigned char. There's no good reason to use '\x82' rather
> than 0x82 if you want to store it in an int.
Agreed.
You do want to be able to assign '\x82' to a char, though, and that
requires that '\x82' (an int) be in the range of char. The hex part
must be in the range of unsigned char, and the value you finally get
is the result of putting a not entirely well-specified "character
value" into a char object and converting that to int.
--
Ben.
Do you, though? Both are qualified to diagnose mental disorders -- what
differs is what they're qualified to do about it.
> To "politely express disagreement" does not include slanderously
> mutilating my subject lines as you have done many times.
Welcome to Usenet, where topic lines drift appropriately. There was no
defamation. Your program is incompetently written, poorly duplicative of
existing and widely available software which has done it better for ten
years, and maintained by someone who dismissed millions of users with
comments about how funny he thinks it is when they squeal.
(I would point out, by the way, if you want to sound cool and hip, you
must remember that "libel" is written, and "slander" is spoken.)
> Sparring with
> you is mildly amusing, but my time is limited. So please don't feel
> lonely if I stop.
Oh, dear. I may faint. Where oh where are the smelling salts.
Just type that character directly into a character literal. Here's an
experiment:
I want to use a particular character that is not part of the minimal
character set required by the C standard.
I know the numerical value of that character in a particular character set,
so I could use your
char c=130;
statement (with a different number) to get the character I want. Or I could
use the equivalent of
char c='\x82';
I also know that it's unlikely for my program ever to reach a system where
this character doesn't have the same value as my development system.
Or I could just type the character itself.
Which of the 3 alternatives do you prefer, when I show you the actual
character I'm talking about:
Spoiler ahead (this is not the end of the article. keep going.)
char c=36;
char c='\x24';
char c='$';
I like the last one. It's not less portable than the first two, and it's more
human-readable.
If you use the third form for your 130 character, by just typing it into the
source code as a literal, you'll get exactly the character you want. No
hardcoded character numbers necessary. Signedness irrelevant.
--
Alan Curry
Ah, yes, I missed that. So gcc and Sun's compiler are right, and I
was wrong. (I'm actually a bit relieved.)
Here's some more convoluted reasoning:
Assume that CHAR_BIT == 8 and that plain char is signed with a
two's-complement representation. C99 6.4.4.4p9 says:
Constraints
The value of an octal or hexadecimal escape sequence shall be in
the range of representable values for the type unsigned char for
an integer character constant, or the unsigned type corresponding
to wchar_t for a wide character constant.
At first glance, this implies that, since '\x82' has the value -126,
and -126 is not "in the range of representable values for the type
unsigned char", '\x82' must be a constraint violation. This
clearly was not the intent. Looking back at 6.4.4.4p6:
The numerical value of the hexadecimal integer so formed
specifies the value of the desired character or wide character.
It says that it *specifies* the value, not that it *is* the value.
The numerical value of the hexadecimal integer 82 is +130, which is
within the range of unsigned char. This *specifies* the value -126
for "the desired character".
Remind me precisely how signed 7-bit ASCII is?
Phil
--
Any true emperor never needs to wear clothes. -- Devany on r.a.s.f1