
Signed Chars - What Foolishness Revisited!


Jeffrey William Gillette

Nov 1, 1986, 10:52:29 AM

A few weeks ago I vented my hostilities on MSC's support (or lack
thereof) for extended ASCII characters - specifically for their
decision to make type 'char' default to a signed quantity. I asked
if other compilers defaulted to signed, and what justification existed
for such a policy. I would like to thank those who were kind enough
to respond to my questions, summarize the arguments as I understand
them, and come back for a rebuttal.

1) Microsoft C

MSC does, in fact, claim quite explicitly in the library manual that
'isupper', 'islower', etc. are defined only when 'isascii' is true.
Thus, with regard to my original complaint about 'isupper', the
compiler is not broken, it is simply wrong!

The MSC "Language Reference" distinguishes two types of character
sets. The "representable" character set includes all symbols which
are meaningful to the host system. The "C" character set, a subset
of the former, includes all characters which have meaning to the compiler.
I assume this distinction allows, e.g. the compiler to process strings
containing non-ASCII characters, or to handle quoted non-ASCII
characters in 'if' or 'case' statements.

It seems to me that any 'isbar' macro *ought* to apply to the full set
of characters which can be represented in the system, not only to those
used by the compiler. For the PCDOS environment this includes characters
with umlauts, acute and grave accents, etc. Thus I argue that Microsoft
has made the wrong decision in failing to support the full character
environment of their target system.

2) Signed char default

It appears that an accident of history - the architecture of the PDP-11 -
brought about the implementation of 'signed' chars. Since then there
appears to be a split between compilers that default to signed chars
and those that default to unsigned.

The only argument for signed char default appears to be that some old
PDP and VAX code will break without signed char defaults. I could say
that this seems to me a better argument for rewriting the faulty code,
but I understand why many implementors do not want to rewrite large
amounts of established utilities.

I would suggest that the proper way to handle portability problems is
that of (believe it or not) the Microsoft 4.0 compiler. Several of you
called attention to the new command line switch that will default chars
to unsigned. This seems a relatively painless way to support code that
depends on a particular signedness of char. My bone of contention,
however, is that this
scheme is exactly backwards. Code that uses signed chars will not handle
half of the system's character set, and thus I must deliberately and
consciously choose to set a command line switch every time I compile
a program, or my program will not work acceptably on my system!

3) What is a 'char' anyway?

Some of you called attention to K&R's discussions of the char type.
K&R definitely present 'char' as system specific.

a single byte, capable of holding one character in
the local character set. (p. 34)

Following this statement is a table which presents the 'char' type
as 8-bit ASCII on the PDP-11, 9-bit ASCII on the Honeywell 6000,
8-bit EBCDIC on the IBM 370, and 8-bit ASCII on the Interdata 8/32.
On the following page is an explanation of character constants and
the differing numerical values associated with '0' in ASCII and EBCDIC.

My point is that K&R clearly set forth the 'char' type as a logical
quantity which is implementation specific. They are willing to
include ASCII and EBCDIC in the definition, and, I assume, any other
arbitrary representation scheme that will fit into "a single byte".
By this definition, any code that depends on the mathematical properties
of characters (e.g. that, in ASCII, A-Z and a-z are contiguous) is
inherently non-portable!

4) What difference does it make?

None - if we want to continue to insist that English is the official
language of C and UNIX! There is, however, a market of people who
want to sed with ninyas or awk with cedillas. There may, in fact,
be a system just around the corner for users who want to diff in
Kanji! Unfortunately all of these are out of luck, since the
aforementioned code only works with 7-bit characters. At this point in
time I am still trying to explain to my colleagues in the Humanities
Computing Lab why their new $10,000 Apollo supermicro can't display
a simple umlaut!

I guess the point of this rave should be summarized. Now that hardware
no longer restricts us to 7-bit character sets, isn't it time we see
*forward* compatible compilers that default to the native character
set of their host system, and isn't it time we start writing (or
rewriting) portable UNIX code that will work on systems whether
characters display in ASCII, EBCDIC, Swedish, or Amharic!


Jeffrey William Gillette uucp: mcnc!ethos!ducall!jeff
Humanities Computing Facility bitnet: DYBBUK @ TUCCVM
Duke University

Metro T. Sauper

Nov 5, 1986, 10:11:30 AM
I would just like to point out that there are actually two different
issues being argued here.

1. Should characters be signed or unsigned by default?

2. Should the character type macros/subroutines support all possible
values of type char?

The first question is compiler related, the second is library related.

My own preferences follow:

1. Since the "c" language has an "unsigned" modifier, and not a "signed"
modifier, I would much rather have a signed character by default and
be able to define it to be "unsigned char" if needs be.

2. The ctype routines are trivial at best, and with all the effort put
to arguing which way they should work, you could have rewritten them
to do whatever you would like them to do.

Metro T. Sauper, Jr.
..!ihnp4!ll1!bpa!asi!metro

Henry Spencer

Nov 5, 1986, 4:27:14 PM
> It appears that an accident of history - the architecture of the PDP-11 -
> brought about the implementation of 'signed' chars...

This is correct.

> The only argument for signed char default appears to be that some old
> PDP and VAX code will break without signed char defaults...

No, sorry, this is wrong. There are many other machines on which char
is substantially more efficient when it is considered signed than when
it is considered unsigned. Consigning the PDP11 and the VAX to history
(a dubious decision in itself) does not remove the problem.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry

Henry Spencer

Nov 7, 1986, 3:03:19 PM
> 1. Since the "c" language has an "unsigned" modifier, and not a "signed"
> modifier, I would much rather have a signed character by default and
> be able to define it to be "unsigned char" if needs be.

Would you still feel this way if all manipulations of signed char took
three times as long as those of unsigned char? It can happen.

All members of this debate please attend to the following.

- There exist machines (e.g. pdp11) on which unsigned chars are a lot less
efficient than signed chars.

- There exist machines (e.g. ibm370) on which signed chars are a lot less
efficient than unsigned chars.

- Many applications do not care whether the chars are signed or unsigned,
so long as they can be twiddled efficiently.

- For this reason, char is intended to be the more efficient of the two.

- Many old programs assume that char is signed; this does not make it so.
Those programs are wrong, and have been all along. Alas, this is
not a comfort if you have to run them.

- The Father, the Son, and the Holy Ghost (K&R, H&S, and X3J11 resp.) all
agree that a character of the machine's normal character set MUST
appear positive. Given that the IBM PC has, I understand, a full
8-bit character set, this means that a PC compiler which treats
char as signed is wrong, period. This should be documented as, at
the very least, a deviation from K&R.

- The "unsigned char" type exists (in most newer compilers) because there
are a number of situations where sign extension is very awkward.
For example, getchar() wants to do a non-sign-extended conversion
from char to int.

- X3J11, in its semi-infinite wisdom, has decided that it would be nice to
have a signed counterpart to "unsigned char", to wit "signed char".
Therefore it is reasonable to expect that most new compilers, and
old ones brought into conformance with the yet-to-be-issued standard,
will give you the full choice: signed char if you need signs,
unsigned char if you need everything positive, and char if you don't
care but want it to run fast.

- Given that many compilers have not yet been upgraded to match even the
current X3J11 drafts, much less the final endproduct (which doesn't
exist yet), any application which cares about signedness should use
typedefs or macros for its char types, so that the definitions can
be revised later.

- The only things you can safely put into a char variable, and depend on
having them come out unchanged, are characters from the native
character set and small *positive* integers.

Henry Spencer

Nov 10, 1986, 2:11:41 PM
> - The Father, the Son, and the Holy Ghost (K&R, H&S, and X3J11 resp.) all
> agree that a character of the machine's normal character set MUST
> appear positive...

It turns out that I have to amend this slightly. The Father and the Son
are indeed in agreement on this. The Holy Ghost has chickened out and
watered down this restriction, however: it only says that the characters
in the "source character set" (roughly, those one uses to write C) must
look positive. Thus an 8088 C which makes normal ASCII look positive but
lets the "upper-bit" characters look negative is technically legitimate.
Grr. ("Grr" not just because I goofed, but because I don't like the change.)
