
Why 8 bit exit status codes?


Andreas Kempe

Feb 2, 2024, 11:05:20 AM

Hello everyone,

I'm wondering why, at least on Linux and FreeBSD, a process exit
status was chosen to be only the lower 8 bits in the C interface, i.e.
exit() and wait().

This did bite some colleagues at work at one point who were porting a
modem manager from a real-time OS to Linux because they were returning
negative status codes for errors. We fixed it by changing the status
codes and I never really thought about why this is the state of
things... until now!

Having a look at man 3 exit on my FreeBSD system, it states

>Both functions make the low-order eight bits of the status argument
>available to a parent process which has called a wait(2)-family
>function.

and that it is conforming to the C99 standard

> The exit() and _Exit() functions conform to ISO/IEC 9899:1999
> (“ISO C99”).

C99 7.20.4.3 § 5 states

> Finally, control is returned to the host environment. If the value of
> status is zero or EXIT_SUCCESS, an implementation-defined form of the
> status successful termination is returned. If the value of status is
> EXIT_FAILURE, an implementation-defined form of the status
> unsuccessful termination is returned. Otherwise the status returned
> is implementation-defined.

which I read as the C standard leaving it to the implementation to
decide how to handle the int type argument.

Having a look at man 2 _exit, the system call man page, it says
nothing about the lower 8 bits, but claims conformance with
IEEE Std 1003.1-1990 ("POSIX.1") which says
in Part 1: System Application Program Interface (API) [C Language], 3.2.2.2 § 2

> If the parent process of the calling process is executing a wait() or
> waitpid(), it is notified of the termination of the calling process
> and the low order 8 bits of status are made available to it; see
> 3.2.1.

that only puts a requirement on making the lower 8 bits available.
Looking at a more modern POSIX, IEEE Std 1003.1-2017, that has
waitid() defined, it has the following for _exit()

> The value of status may be 0, EXIT_SUCCESS, EXIT_FAILURE, or any
> other value, though only the least significant 8 bits (that is,
> status & 0377) shall be available from wait() and waitpid(); the
> full value shall be available from waitid() and in the siginfo_t
> passed to a signal handler for SIGCHLD.

so the mystery of why the implementation is the way it is was
dispelled.

The question that remains is what the rationale behind using the lower
8 bits was from the start? Is it historical legacy that no one wanted
to change for backwards compatibility? Is there no need for exit codes
larger than 8 bits?

I don't know if I have ever come into contact with software that deals
with status codes that actually looks at the full value. My daily
driver shell, fish, certainly does not.

--
Best regards,
Andreas Kempe

Scott Lurndal

Feb 2, 2024, 11:33:45 AM

Andreas Kempe <ke...@lysator.liu.se> writes:
>Hello everyone,
>
>I'm wondering why, at least on Linux and FreeBSD, a process exit
>status was chosen to be only the lower 8 bits in the C interface, i.e.
>exit() and wait().
>
<snip>

>The question that remains is what the rationale behind using the lower
>8 bits was from the start? Is it historical legacy that no one wanted
>to change for backwards compatibility? Is there no need for exit codes
>larger than 8 bits?

The definition of the wait system call. Recall that the
PDP-11 was a 16-bit computer and wait needed to be able
to include metadata along with the exit status.

Andreas Kempe

Feb 2, 2024, 3:02:21 PM

Den 2024-02-02 skrev Scott Lurndal <sc...@slp53.sl.home>:
> Andreas Kempe <ke...@lysator.liu.se> writes:
>>Hello everyone,
>>
>>I'm wondering why, at least on Linux and FreeBSD, a process exit
>>status was chosen to be only the lower 8 bits in the C interface, i.e.
>>exit() and wait().
>>
> <snip>
>
>>The question that remains is what the rationale behind using the lower
>>8 bits was from the start? Is it historical legacy that no one wanted
>>to change for backwards compatibility? Is there no need for exit codes
>>larger than 8 bits?
>
> The definition of the wait system call. Recall that the
> PDP-11 was a 16-bit computer

I'm afraid that's a tall order. I had yet to learn how to read when
they went out of production. :) Please excuse my ignorance.

> and wait needed to be able to include metadata along with the exit
> status.

I'm a bit unclear on the order of things coming into being. Did their
C implementation already use exit() with an int argument of size 16
bits and they also masked? Or was an int 8 bits on the PDP-11, with POSIX
opting to keep only the lower 8 bits on platforms with wider ints to
maintain backwards compatibility?

Scott Lurndal

Feb 2, 2024, 3:15:30 PM

The status argument to the wait system call returned
a two part value; 8 bits of exit status and 8 bits
that describe the termination conditions (e.g. the
signal number that stopped or terminated the
process).


Here's the modern 32-bit layout (in little endian form):

unsigned int __w_termsig:7; /* Terminating signal. */
unsigned int __w_coredump:1; /* Set if dumped core. */
unsigned int __w_retcode:8; /* Return code if exited normally. */
unsigned int:16;

It's just the PDP-11 unix 16-bit version with 16 unused padding bits.

SVR4 added the waitid(2) system call which via the siginfo argument has
access to the full 32-bit program exit status.

Lawrence D'Oliveiro

Feb 2, 2024, 4:13:46 PM

On Fri, 2 Feb 2024 16:05:14 -0000 (UTC), Andreas Kempe wrote:

> I'm wondering why, at least on Linux and FreeBSD, a process exit status
> was chosen to be only the lower 8 bits in the C interface, i.e.
> exit() and wait().

I’ve never used that many different values. E.g. 0 for some test condition
passed, 1 for failed, 2 for unexpected error.

> This did bite some colleagues at work at one point who were porting a
> modem manager from a real-time OS to Linux because they were returning
> negative status codes for errors.

True enough:

ldo@theon:~> python3 -c "import sys; sys.exit(1)"; echo $?
1
ldo@theon:~> python3 -c "import sys; sys.exit(-1)"; echo $?
255

But you could always sign-extend it.
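
If the parent knows the child uses negative codes, it can undo the
truncation itself. A small sketch (my own, assuming the 8-bit value has
already been extracted with WEXITSTATUS()):

#include <stdio.h>

/* Map the 8-bit exit status back to a signed value: 255 becomes -1 again. */
static int signed_exit_code(int wexitstatus)
{
    return wexitstatus >= 128 ? wexitstatus - 256 : wexitstatus;
}

int main(void)
{
    printf("%d\n", signed_exit_code(255));   /* prints -1 */
    return 0;
}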

Andreas Kempe

Feb 2, 2024, 4:20:27 PM

Thank you for the clarification, but I don't think I have any problem
grasping how the implementation works. My thoughts are about why they
did what they did.

Why not use a char in exit() instead of int, with wait() returning the
full 16 bits? If the program itself fills in the upper 8 bits, it
makes sense, but otherwise I don't understand from an API perspective
why one would use a data type with the caveat that only half is used.

If we already have exit() and wait() using ints and want to stuff our
extra information in there without changing the API, it also makes
sense.

Keith Thompson

Feb 2, 2024, 4:23:57 PM

Lawrence D'Oliveiro <l...@nz.invalid> writes:
> On Fri, 2 Feb 2024 16:05:14 -0000 (UTC), Andreas Kempe wrote:
>
>> I'm wondering why, at least on Linux and FreeBSD, a process exit status
>> was chosen to be only the lower 8 bits in the C interface, i.e.
>> exit() and wait().
>
> I’ve never used that many different values. E.g. 0 for some test condition
> passed, 1 for failed, 2 for unexpected error.

The curl command defines nearly 100 error codes ("man curl" for
details). That's the most I've seen. 8 bits is almost certainly
plenty if the goal is to enumerate specific error conditions.
It's not enough if you want to pass more information through the
error code, which is why most programs don't try to do that.

Since int is typically 32 bits (but only guaranteed by C to be at
least 16), the exit() function could theoretically be used to pass
32 bits of information, but that's not really much more useful than 8
bits. If a program needs to return more than 8 bits of information,
it will typically print a string to stdout or something similar.

(On Plan 9, a program's exit status is (was?) a string, empty for
success, a description of the error condition on error. It's a cool
idea, but I can imagine it introducing some interesting problems.)

[...]

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Medtronic
void Void(void) { Void(); } /* The recursive call of the void */

Lawrence D'Oliveiro

Feb 2, 2024, 4:38:58 PM

On Fri, 02 Feb 2024 13:23:52 -0800, Keith Thompson wrote:

> The curl command defines nearly 100 error codes ("man curl" for
> details). That's the most I've seen.

Another reason for staying away from curl, I would say. It needlessly
replicates the functionality of a whole lot of different protocol clients,
when all you need is HTTP/HTTPS (maybe FTP/FTPS as well). That’s why I
stick to wget.

> (On Plan 9, a program's exit status is (was?) a string, empty for
> success, a description of the error condition on error. It's a cool
> idea, but I can imagine it introducing some interesting problems.)

What, not a JSON object?

Lawrence D'Oliveiro

Feb 2, 2024, 4:40:36 PM

On Fri, 2 Feb 2024 21:20:22 -0000 (UTC), Andreas Kempe wrote:

> Why not use a char in exit() instead of int, with wait() returning the
> full 16 bits? If the program itself fills in the upper 8 bits, it makes
> sense, but otherwise I don't understand from an API perspective why one
> would use a data type with the caveat that only half is used.

The other half contains information like whether the low half is actually
an explicit exit code, or something else like a signal that killed the
process. Or an indication that the process has not actually terminated,
but is just stopped.

Keith Thompson

Feb 2, 2024, 9:17:57 PM

C tends to use int values even for character data (when not an element
of a string). See for example the return types of getchar(), fgetc(),
et al, and even the type of character constants ('x' is of type int, not
char).

In early C, int was in many ways a kind of default type. Functions with
no visible declaration were assumed to return int. The signedness of
plain char is implementation-defined. Supporting exit values from 0 to
255 is fairly reasonable. Using an int to store that value is also
fairly reasonable -- especially since main() returns int, and exit(n) is
very nearly equivalent to return n in main().

Ignoring all but the low-order 8 bits is not specified by C. Non-POSIX
systems can use all 32 (or 16, or ...) bits of the return value.
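
A trivial sketch of that truncation as seen on a POSIX system (inspect
$? in the shell afterwards):

#include <stdlib.h>

int main(void)
{
    exit(259);   /* a POSIX shell reports $? as 3: only 259 & 0xff survives wait() */
}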

Andreas Kempe

Feb 3, 2024, 8:21:34 AM

Den 2024-02-03 skrev Keith Thompson <Keith.S.T...@gmail.com>:
> Andreas Kempe <ke...@lysator.liu.se> writes:
>>
>> Why not use a char in exit() instead of int, with wait() returning the
>> full 16 bits? If the program itself fills in the upper 8 bits, it
>> makes sense, but otherwise I don't understand from an API perspective
>> why one would use a data type with the caveat that only half is used.
>
> C tends to use int values even for character data (when not an element
> of a string). See for example the return types of getchar(), fgetc(),
> et al, and even the type of character constants ('x' is of type int, not
> char).
>

I thought the reason for the int return type was to have an error code
outside the range of the valid data, with EOF being defined as a
negative integer. That reason isn't applicable to the argument a
program passes to exit().

> In early C, int was in many ways a kind of default type. Functions with
> no visible declaration were assumed to return int. The signedness of
> plain char is implementation-defined.

I realised that char was a bad example just as I posted. I should have
chosen unsigned char instead.

> Supporting exit values from 0 to 255 is fairly reasonable. Using an
> int to store that value is also fairly reasonable -- especially
> since main() returns int, and exit(n) is very nearly equivalent to
> return n in main(). Ignoring all but the low-order 8 bits is not
> specified by C. Non-POSIX systems can use all 32 (or 16, or ...)
> bits of the return value.
>

Yes, in my original post, I detailed that the restriction does not
come from the C standard, but from POSIX. I'm not sure which came
first.

If C was first with having an exit() function and an int return for
main, I can imagine that it went something like this

- C chooses int for main
- C uses int in exit() to match main
- OS folks want to store extra data in the exit status, but they
want to match the C API
- let's just stuff it in the upper bits and keep the API the same with
an imposed restriction on the value in POSIX

or POSIX exit() was constructed with the int from main in mind, or it
could just be, as you point out, that int is a nice default integer
type and there wasn't much thought put into it beyond that.

I can speculate about a bunch of different reasons, but I'm curious if anyone
knows what the actual reasoning was.

Janis Papanagnou

Feb 3, 2024, 10:38:45 AM

AFAICT: "historical reasons". You have some bits to carry the exit
status, some bits to carry other termination information (signals),
and optionally some more bits to carry supplementary information.
If you want all that information carried across a single primitive
data type, you have to draw a line somewhere. Given that one cannot
assume more than 16 bits are guaranteed in the default 'int' type
these days, it seems quite obvious to split at 8 bits. (For practical
reasons, differentiating 255 error codes seems more than enough, if
we consider what evaluating and individually acting on all of them
at the calling/environment level would mean.)

Janis

Scott Lurndal

Feb 3, 2024, 4:34:34 PM

Andreas Kempe <ke...@lysator.liu.se> writes:
>Den 2024-02-03 skrev Keith Thompson <Keith.S.T...@gmail.com>:
>> Andreas Kempe <ke...@lysator.liu.se> writes:

>Yes, in my original post, I detailed that the restriction does not
>come from the C standard, but from POSIX. I'm not sure which came
>first.

The restriction predates both. It was how unix v6 worked; every
version of unix thereafter continued that so that existing applications
would not need to be rewritten.

It was documented in the SVID (System V Interface Definition) which
was part of the source materials used by X/Open when developing
the X Portability Guides (xpg) (which became the SuS).

Ken and Dennis chose to implement the wait system call (which
the shell uses to collect the exit status) with an 8-bit value
so they could use the other 8 bits of the 16-bit int for metadata.

This could never be changed without breaking applications, so
we still have it today in unix, linux and other POSIX-compliant
operating environments.

Lawrence D'Oliveiro

Feb 3, 2024, 4:38:00 PM

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

> The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.

Signed characters make no sense.

Keith Thompson

Feb 3, 2024, 6:29:15 PM

Andreas Kempe <ke...@lysator.liu.se> writes:
> Den 2024-02-03 skrev Keith Thompson <Keith.S.T...@gmail.com>:
>> Andreas Kempe <ke...@lysator.liu.se> writes:
>>>
>>> Why not use a char in exit() instead of int, with wait() returning the
>>> full 16 bits? If the program itself fills in the upper 8 bits, it
>>> makes sense, but otherwise I don't understand from an API perspective
>>> why one would use a data type with the caveat that only half is used.
>>
>> C tends to use int values even for character data (when not an element
>> of a string). See for example the return types of getchar(), fgetc(),
>> et al, and even the type of character constants ('x' is of type int, not
>> char).
>
> I thought the reason for the int return type was to have an error code
> outside of the range of the valid data, with EOF being defined as
> being a negative integer. A reason that isn't applicable for the
> argument passing to exit by a program.

I don't think there's one definitive reason for either decision.

>> In early C, int was in many ways a kind of default type. Functions with
>> no visible declaration were assumed to return int. The signedness of
>> plain char is implementation-defined.
>
> I realised that char was a bad example just as I posted. I should have
> chosen unsigned char instead.

The exit() function predates unsigned char (see K&R1). It probably even
predates char. (C's predecessor B was untyped, with characters being
stored in words which were effectively of type int. There was an exit()
function, but it apparently took no arguments.)

Changing exit()'s parameter type to reflect the range of valid values
undoubtedly wasn't considered worth doing -- especially since a wider
range of values might be valid on some systems.

Joe Pfeiffer

Feb 3, 2024, 10:33:25 PM

Lawrence D'Oliveiro <l...@nz.invalid> writes:

Except in architectures where they do. If you're doing something where
it matters (or even if you want your code to be more readable), use
signed char or unsigned char as appropriate.

Lawrence D'Oliveiro

Feb 4, 2024, 1:41:29 AM

On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer wrote:

> Lawrence D'Oliveiro <l...@nz.invalid> writes:
>
>> On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:
>>
>>> The signedness of plain char is implementation-defined.
>>
>> Why? Because the PDP-11 on which C and Unix were originally developed
>> did sign extension when loading a byte quantity into a (word-length)
>> register.
>>
>> Signed characters make no sense.
>
> Except in architectures where they do.

There are no character encodings which assign meanings to negative codes.

Richard Kettlewell

Feb 4, 2024, 3:49:17 AM

Joe Pfeiffer <pfei...@cs.nmsu.edu> writes:
> Lawrence D'Oliveiro <l...@nz.invalid> writes:
>> On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:
>>> The signedness of plain char is implementation-defined.
>>
>> Why? Because the PDP-11 on which C and Unix were originally developed did
>> sign extension when loading a byte quantity into a (word-length) register.
>>
>> Signed characters make no sense.
>
> Except in architectures where they do.

Such as?

> If you're doing something where it matters (or even if you want your
> code to be more readable) used signed char or unsigned char as
> appropriate.

Signed 8-bit integers are perfectly sensible, signed characters not so
much.

--
https://www.greenend.org.uk/rjk/

Scott Lurndal

Feb 4, 2024, 11:25:07 AM

But then 'signed char' doesn't necessarily need to be used
for character encoding (consider int8_t, for example, which
defines a signed arithmetic type from -128..+127).

On the 16-bit PDP-11, signed 8-bit values would not have been uncommon,
if only because of the limited address space.

Rainer Weikusat

Feb 5, 2024, 11:11:15 AM

Andreas Kempe <ke...@lysator.liu.se> writes:
> Den 2024-02-03 skrev Keith Thompson <Keith.S.T...@gmail.com>:
>> Andreas Kempe <ke...@lysator.liu.se> writes:

[...]

> If C was first with having an exit() function and an int return for
> main, I can imagine that it went something like this
>
> - C chooses int for main
> - C uses int in exit() to match main
> - OS folks want to store extra data in the exit status, but they
> want to match the C API
> - let's just stuff it in the upper bits and keep the API the same with
> an imposed restriction on the value in POSIX
>
> or POSIX exit() was constructed with the int from main in mind, or it
> could just be, as you point out, that int is a nice default integer
> type and there wasn't much thought put into it beyond that.
>
> I can speculate a bunch different reasons, but I'm curious if anyone
> knows what the actual reasoning was.

This should be pretty obvious: A C int is really a machine data type in
disguise, namely, whatever fits into a common general purpose register
of a certain machine. C was created for porting UNIX to
the PDP-11 (or rather, rewriting UNIX for the PDP-11 with the goal of
not having to rewrite it again for the next type of machine which would need
to be supported by it). Putting a value into a certain register is a
common convention for returning values from functions (or rather, Dennis
Ritchie probably thought it would be a sensible convention at that
time). Hence, having main return an int was the 'natural' idea and
allocating the lower half of this int to applications wishing to return
status codes and the upper half to the system for returning
system-specific metadata was also the 'natural' idea.

Surely, eight whole bits must be enough for everyone! :-)

Rainer Weikusat

Feb 5, 2024, 11:12:58 AM

Keith Thompson <Keith.S.T...@gmail.com> writes:

[...]

> (On Plan 9, a program's exit status is (was?) a string, empty for
> success, a description of the error condition on error. It's a cool
> idea, but I can imagine it introducing some interesting problems.)

That's interesting to know as I have been using the same convention for
validation functions in Perl for some years: These return nothing when
everything was ok or a textual error message otherwise.

Kees Nuyt

Feb 5, 2024, 12:23:08 PM

On Sat, 03 Feb 2024 20:33:19 -0700, Joe Pfeiffer
<pfei...@cs.nmsu.edu> wrote:

> Signed characters make no sense.

Nor did 6 bit characters, but in the 1980s we had them:
3 characters in a 24 bit word.
Welcome to what was then called mini or midrange computers.

(Yes, looking at you, Harris, with its Vulcan Operating System)

--
Regards,
Kees Nuyt

Andreas Kempe

Feb 5, 2024, 2:02:29 PM

Thank you everyone for the different informative replies and
historical insight! I think I have gotten what I can out of this
thread.

Lawrence D'Oliveiro

Feb 5, 2024, 5:41:43 PM

On Mon, 05 Feb 2024 18:22:59 +0100, Kees Nuyt wrote:

> On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:
>
> Signed characters make no sense.
>
> Nor did 6 bit characters, but in the 1980s we had them:
> 3 characters in a 24 bit word.

I see your sixbit and raise you Radix-50, which packed 3 characters into a
16-bit word.

None of these used signed character codes, by the way. So my point still
stands.

Keith Thompson

Feb 5, 2024, 6:51:42 PM

Lawrence D'Oliveiro <l...@nz.invalid> writes:
My understanding is that on the PDP-11, making plain char signed made
code that stored character values in int objects more efficient.
Sign-extension was more efficient than zero-filling or something like
that. I don't remember the details, but I'm sure it wouldn't be
difficult to find out.

At the time, making such code a little more efficient was worth the
effort -- and character data with the high-order bit set to 1 was rare,
so it didn't make much difference in practice.

I don't know whether there are efficiency issues on modern platforms. If
modern CPUs have similar characteristics to the PDP-11, that could
impose some pressure to keep signed characters. And the representation
requirements for the character types (especially with C23 requiring
2's-complement) mean that signed characters don't cause many practical
problems.

Since C code has always had to work correctly if plain char is signed,
there wasn't much pressure to make it unsigned (though some platforms do
so).

I'd be happy if some future C standard mandated that plain char is
unsigned, just because I think it would make more sense, but I don't
think that's likely to happen. But the historical reasons for allowing
plain char to be signed are valid.

Scott Lurndal

Feb 5, 2024, 7:17:01 PM

Keith Thompson <Keith.S.T...@gmail.com> writes:
>Lawrence D'Oliveiro <l...@nz.invalid> writes:
>> On Mon, 05 Feb 2024 18:22:59 +0100, Kees Nuyt wrote:
>>> On Sat, 3 Feb 2024 21:37:55 -0000 (UTC), Lawrence D'Oliveiro wrote:
>>> Signed characters make no sense.
>>>
>>> Nor did 6 bit characters, but in the 1980s we had them:
>>> 3 characters in a 24 bit word.
>>
>> I see your sixbit and raise you Radix-50, which packed 3 characters into a
>> 16-bit word.
>>
>> None of these used signed character codes, by the way. So my point still
>> stands.
>
>My understanding is that on the PDP-11, making plain char signed made
>code that stored character values in int objects more efficient.
>Sign-extension was more efficient than zero-filling or something like
>that. I don't remember the details, but I'm sure it wouldn't be
>difficult to find out.

The PDP-11 had two move instructions:

MOV (r1)+,r2
MOVB (r2)+,r3

MOV moved source to destination. MOVB always sign-extended the byte
to the destination register size (16-bit).

Lawrence D'Oliveiro

Feb 5, 2024, 7:58:35 PM

On Mon, 05 Feb 2024 15:51:37 -0800, Keith Thompson wrote:

> My understanding is that on the PDP-11, making plain char signed made
> code that stored character values in int objects more efficient.
> Sign-extension was more efficient than zero-filling or something like
> that.

The move-byte instruction did sign-extension when loading into a register,
not storing into memory.

There was no convert-byte-to-word instruction as such.

Keith Thompson

Feb 5, 2024, 9:31:44 PM

Lawrence D'Oliveiro <l...@nz.invalid> writes:
Right, so if you wanted to copy an 8-bit value into a 16-bit register
with sign-extension, you do it in one instruction, whereas zeroing the
top 8 bits would require at least one additional instruction, probably a
BIC (bit-clear) following the MOVB. You'd probably need more
instruction space to store the mask value of 0xff00 -- pardon me,
0177400. And I expect that copying a character into a register would
have been a common operation.

Given those constraints, I'd say it made sense *at the time* for char to
be signed on the PDP-11, especially since it was pretty much assumed
that text would be plain ASCII that would never have the high bit set.

If the PDP-11 had had an alternative MOVB instruction that did
zero-extension, we might not be having this discussion.

Question: Do any more modern CPUs have similar characteristics that make
either signed or unsigned char more efficient?

Lawrence D'Oliveiro

Feb 5, 2024, 10:10:56 PM

On Mon, 05 Feb 2024 18:31:36 -0800, Keith Thompson wrote:

> If the PDP-11 had had an alternative MOVB instruction that did
> zero-extension, we might not be having this discussion.

Which is effectively what I said:

On Fri, 02 Feb 2024 18:17:52 -0800, Keith Thompson wrote:

> The signedness of plain char is implementation-defined.

Why? Because the PDP-11 on which C and Unix were originally developed did
sign extension when loading a byte quantity into a (word-length) register.

Keith Thompson

Feb 5, 2024, 11:00:39 PM

Lawrence D'Oliveiro <l...@nz.invalid> writes:
You wrote that "Signed characters make no sense". I was talking about a
context in which they did make sense. How is that effectively what you
said? (I was agreeing with and expanding on your statement about the
PDP-11.)

Richard Kettlewell

Feb 6, 2024, 12:00:29 PM

Keith Thompson <Keith.S.T...@gmail.com> writes:
> Lawrence D'Oliveiro <l...@nz.invalid> writes:

>> Signed characters make no sense.
>
> You wrote that "Signed characters make no sense". I was talking about a
> context in which they did make sense. How is that effectively what you
> said? (I was agreeing with and expanding on your statement about the
> PDP-11.)

I still don’t see any explanation for signed characters as such making
sense.

I think the situation is more accurately interpreted as letting a
PDP-11-specific optimization influence the language design, and
(temporarily) getting away with it because the character values they
cared about at the time happened to lie within a small enough range that
negative values didn’t arise.

--
https://www.greenend.org.uk/rjk/

Rainer Weikusat

Feb 6, 2024, 12:35:07 PM

I think that's just a (probably traditional) misnomer. A C char isn't a
character, it's an integer type and it's a signed integer type because
all other original C integer types (int and short) were signed as
well. Unsigned integer types, as something that's different from
pointers, were a later addition.

Kaz Kylheku

Feb 6, 2024, 1:04:21 PM

Sure, except for the part where "abcd" denotes an object that is a
null-terminated array of these *char* integers, that entity being formally
called a "string" in ISO C, and used for representing text. (Or else "abcd" is
initializer syntax for a four element (or larger) array of *char*).

If *char* is signed (and CHAR_BIT is 8), then '\xff' produces a negative value,
even though the constant has type *int*, and "\xff"[0] does likewise.

This has been connected to needless bugs in C programs. An expression like
table[str[i]] may result in table[] being negatively indexed.

The <ctype.h> functions require an argument that is either EOF
or a value in the range 0 to UCHAR_MAX, and so are incompatible
with string elements.
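
The usual defensive idiom, for what it's worth (my own sketch, not
taken from any particular codebase):

#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Count alphabetic bytes without ever handing isalpha() a negative value. */
static size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))   /* cast first; plain char may be signed */
            n++;
    return n;
}

int main(void)
{
    printf("%zu\n", count_alpha("caf\xe9!"));   /* the 0xe9 byte is handled safely */
    return 0;
}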

All this crap could have been avoided if *char* had been unsigned.
*unsigned char* never needed to exist except as a synonym for plain
*char*.

Speaking of synonyms, *char* is a distinct type, and not a synonym for either
*signed char* or *unsigned char*. It has to be that way, given the way it is
defined, but it's just another complication that need not have existed:

#include <stdio.h>

int main(void)
{
    char *cp = 0;
    unsigned char *ucp = 0;
    signed char *scp = 0;
    printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
    printf("%d\n", '\xff');
}

char.c: In function ‘main’:
char.c:8:27: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
^~
char.c:8:38: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);
^~
char.c:8:50: warning: comparison of distinct pointer types lacks a cast
printf("%d %d %d\n", cp == ucp, cp == scp, ucp == scp);

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazi...@mstdn.ca

Rainer Weikusat

Feb 6, 2024, 1:30:53 PM

Kaz Kylheku <433-92...@kylheku.com> writes:
> On 2024-02-06, Rainer Weikusat <rwei...@talktalk.net> wrote:
>> Richard Kettlewell <inv...@invalid.invalid> writes:

[...]
All of this may be true¹ but it's all beside the point. The original C
language had three integer types, char, short and int, which were all
signed types. It further supported declaring pointers to some type, and
pointers were basically unsigned integer indices into a linear memory
array. Char couldn't have been an unsigned integer type, regardless of
whether this would have made more sense², because unsigned integer
types didn't exist in the language.

¹ My personal theory of human fallibility is that humans tend to fuck up
everything they possibly can. Hence, so-called C pitfalls expose human
traits (fallibility) and not language traits. Had they been avoided,
human ingenuity would have found something else to fuck up.

² Being wise in hindsight is always easy. But that's not an option for
people who need to create something which doesn't yet exist and not be
wisely critical of something that does.

Kaz Kylheku

Feb 6, 2024, 1:38:10 PM

On 2024-02-06, Rainer Weikusat <rwei...@talktalk.net> wrote:
> ¹ My personal theory of human fallibility is that humans tend to fuck up
> everything they possibly can. Hence, so-called C pitfalls expose human
> traits (fallibility) and not language traits.

Does that work for all safety devices? Isolation transformers, steel
toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

A fractured skull reveals a human trait (accident proneness, weak bone)
rather than the workplace trait of not enforcing helmet use.

Rainer Weikusat

Feb 6, 2024, 2:02:06 PM

Kaz Kylheku <433-92...@kylheku.com> writes:
> On 2024-02-06, Rainer Weikusat <rwei...@talktalk.net> wrote:
>> ¹ My personal theory of human fallibility is that humans tend to fuck up
>> everything they possibly can. Hence, so-called C pitfalls expose human
>> traits (fallibility) and not language traits.
>
> Does that work for all safety devices? Isolation transformers, steel
> toed boots, helmets, seat belts, roll bars, third outlet prongs, ...

I wrote about C types and somewhat more generally, programming language
features, and not "safety devices" supposed to protect human bodies from
physical injury.

Keith Thompson

Feb 6, 2024, 2:08:54 PM

I think we're mostly in agreement, perhaps with different understandings
of "making sense". What I'm saying is that the decision to make char a
signed type made sense for PDP-11 implementation, purely because of
performance issues.

I just did a quick test on x86_64, x86, and ARM. It appears that
assigning either an unsigned char or a signed char to an int object
takes a single instruction. (My test didn't distinguish between
register or memory target.) I suspect there's no longer any performance
justification on most modern platforms for making plain char signed.
But there's likely to be (bad or at least non-portable) code that depends
on plain char being signed. As it happens, plain char is unsigned in
gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
to override the default.
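
If anyone wants to check which way a given compiler went, a small
sketch using <limits.h>:

#include <limits.h>
#include <stdio.h>

int main(void)
{
#if CHAR_MIN < 0
    puts("plain char is signed here");
#else
    puts("plain char is unsigned here");
#endif
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}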

Keith Thompson

Feb 6, 2024, 2:15:18 PM

Here's a quote from the 1974 and 1975 C reference manuals:

A char object may be used anywhere an int may be. In all cases the
char is converted to an int by propagating its sign through the
upper 8 bits of the resultant integer. This is consistent with the
two’s complement representation used for both characters and
integers. (However, the sign-propagation feature disappears in other
implementations.)

In more modern terms, that last sentence suggests that plain char was
unsigned in some implementations.

K&R1, 1978, is more explicit:

There is one subtle point about the conversion of characters
to integers. The language does not specify whether variables
of type char are signed or unsigned quantities. When a
char is converted to an int, can it ever produce a negative
integer? Unfortunately, this varies from machine to machine,
reflecting differences in architecture. On some machines
(PDP-11, for instance), a char whose leftmost bit is 1 will be
converted to a negative integer ("sign extension"). On others,
a char is promoted to an int by adding zeros at the left end,
and thus is always positive.

Lew Pitcher

Feb 6, 2024, 2:25:31 PM

This view ignores the early implementation of (K&R) C on IBM 370 systems,
where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric
characters have their high bit set (alphabetics range from 0x80 through
0xe9, while numerics range from 0xf0 through 0xf9). A char in this
implementation, by necessity, was unsigned, as C "guarantees that any
character in the machine's standard character set will never be negative"
(K&R "The C Programming Language", p40)


> It further supported declaring pointers to some type and
> pointers were basically unsigned integer indices into a linear memory
> array. Char couldn't have been an unsigned integer type, regardless if
> this would have made more sense², because unsigned integer types didn't
> exist in the language.
>
> ¹ My personal theory of human fallibility is that humans tend to fuck up
> everything they possibly can. Hence, so-called C pitfalls expose human
> traits (fallibility) and not language traits. Had they been avoided,
> human ingenuity would have found something else to fuck up.
>
> ² Being wise in hindsight is always easy. But that's not an option for
> people who need to create something which doesn't yet exist and not be
> wisely critical of something that does.




--
Lew Pitcher
"In Skills We Trust"

Rainer Weikusat

Feb 6, 2024, 3:01:49 PM

Lew Pitcher <lew.p...@digitalfreehold.ca> writes:
> On Tue, 06 Feb 2024 18:30:46 +0000, Rainer Weikusat wrote:

[Why-oh-why is char not unsigned?!?]


>> All of this may be true¹ but it's all besides the point. The original C
>> language had three integer types, char, short and int, which were all
>> signed types.
>
> This view ignores the early implementation of (K&R) C on IBM 370 systems,
> where a char was 8 bits of EBCDIC. In EBCDIC, all alphabetic and numeric
> characters have their high bit set (alphabetics range from 0x80 through
> 0xe9, while numerics range from 0xf0 through 0xf9).

Indeed. It refers to the C language as it existed/was created when UNIX
was brought over to the PDP-11. This language didn't have any unsigned
integer types, as the concept didn't yet exist.

Kaz Kylheku

Feb 6, 2024, 4:23:01 PM

Type systems are safety devices. That's why we have terms like "type
safe" and "unsafe code".

Type safety helps prevent misbehavior, which results in problems like
incorrect results and data loss, which can have real economic harm.

In a safety-critical embedded system, a connection between type safety
and physical safety is readily identifiable.

"Type safety" isn't just some fanciful metaphor like "debugging";
there is a literal interpretation which is true.

Rainer Weikusat

Feb 6, 2024, 4:37:56 PM

Kaz Kylheku <433-92...@kylheku.com> writes:
> On 2024-02-06, Rainer Weikusat <rwei...@talktalk.net> wrote:
>> Kaz Kylheku <433-92...@kylheku.com> writes:
>>> On 2024-02-06, Rainer Weikusat <rwei...@talktalk.net> wrote:
>>>> ¹ My personal theory of human fallibility is that humans tend to fuck up
>>>> everything they possibly can. Hence, so-called C pitfalls expose human
>>>> traits (fallibility) and not language traits.
>>>
>>> Does that work for all safety devices? Isolation transformers, steel
>>> toed boots, helmets, seat belts, roll bars, third outlet prongs, ...
>>
>> I wrote about C types and somewhat more generally, programming language
>> features, and not "safety devices" supposed to protect human bodies from
>> physical injury.
>
> Type systems are safety devices. That's why we have terms like "type
> safe" and "unsafe code".

They're not, at least not when "safety device" is supposed to mean
something like hard hats. That's just an inappropriate analogy some
people like to employ. This is, however, completely beside the point of
my original text, which was about providing an explanation of why char
is signed in C, even though all kinds of smart alecs, with fifty years
of hindsight Ritchie didn't have in 1972, are extremely convinced that
this was an extremely bad idea.

Andreas Kempe

Feb 6, 2024, 6:13:25 PM

I wouldn't expect any difference on a modern CPU. I did a microbench
on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
and movsbl to move char to int so that's what I benched.

The bench was done by moving a byte from the stack to eax using a loop
of 10 movzbl/movsbl running 10M times. Both instructions gave on
average about 0.7 cycles per instruction measured using rdtsc. The
highest bit in the byte being set or unset made no difference.

Scott Lurndal

Feb 6, 2024, 6:27:27 PM

Andreas Kempe <ke...@lysator.liu.se> writes:
>Den 2024-02-06 skrev Keith Thompson <Keith.S.T...@gmail.com>:
>> Richard Kettlewell <inv...@invalid.invalid> writes:

>>
>> I think we're mostly in agreement, perhaps with different understandings
>> of "making sense". What I'm saying is that the decision to make char a
>> signed type made sense for PDP-11 implementation, purely because of
>> performance issues.
>>
>> I just did a quick test on x86_64, x86, and ARM. It appears that
>> assigning either an unsigned char or a signed char to an int object
>> takes a single instruction. (My test didn't distinguish between
>> register or memory target.) I suspect there's no longer any performance
>> justification on most modern platforms for making plain char signed.
>> But there's like to be (bad or at least non-portable) code that depends
>> on plain char being signed. As it happens, plain char is unsigned in
>> gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
>> to override the default.
>>
>
>I wouldn't expect any difference on a modern CPU. I did a microbench
>on my laptop with an Intel i5-8350U. clang on my FreeBSD uses movzbl
>and movsbl to move char to int so that's what I benched.

A move from register to register isn't even executed on most modern
processor designs. It is detected at fetch and the register is
just renamed in the pipeline.

Andreas Kempe

Feb 6, 2024, 7:26:12 PM

Yeah. I tried some different variations and by adding some data
dependencies by incrementing the value and moving it around, I could
get some difference between the two, approx 10 to 30 %, but I'm not
sure how much is due to the instruction itself or other effects of
manipulating the data.

Funnily enough, the zero extend was the more performant in these tests
making unsigned char possibly more performant.

My intention wasn't really to claim they're exactly the same, but
that I don't think there is any real performance benefit to be had by
switching char to unsigned. Even if the 10-30 % are a real thing, I
wonder how much software is actually using char types in a way where
it would make a difference?

Scott Lurndal

Feb 6, 2024, 7:46:22 PM

The logic for sign extension (MOVSX) isn't complex; the added gate delay
wouldn't affect the instruction timing. Fan the sign bit out
to the higher bits through a couple of gates to either select the
sign bit or the high order bits when storing into the new register.

Sign extension on load (MOV from memory) will happen in the load unit before
it hits the register file, most likely.

The x86 MOVBE instruction is a slightly more complex example.


>Funnily enough, the zero extend was the more performant in these tests
>making unsigned char possibly more performant.

Within what margin of measurement error?

>
>My intention wasn't really to claim they're exactly the same, but that
>that I don't think there is any real performance benefit to be had by
>switching char to unsigned. Even if the 10-30 % are a real thing, I
>wonder how much software is actually using char types in a way where
>it would make a difference?

We use uint8_t extensively because the data is unsigned in the range 0-255.

And generally want wrapping behavior modulo 2^8.
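
For illustration, a tiny sketch of that wrap-around (guaranteed by the
standard for unsigned types):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t x = 250;
    x += 10;                        /* wraps modulo 2^8 on conversion back: x is now 4 */
    printf("%u\n", (unsigned)x);    /* prints 4 */
    return 0;
}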

Andreas Kempe

Feb 6, 2024, 9:11:31 PM

Here's an example of a test I played around with. The body of my loop
does this 10M times for this test. movzbl is switched for movsbl when
testing the other configuration.

movzbl -24(%rsp), %eax
movb %al, -25(%rsp)
movzbl -25(%rsp), %eax
movb %al, -26(%rsp)
movzbl -26(%rsp), %eax
movb %al, -27(%rsp)
movzbl -27(%rsp), %eax
movb %al, -28(%rsp)
movzbl -28(%rsp), %eax
incl %eax
movb %al, -24(%rsp)

This is the data, unit is total cycles for a run, from 2000 runs of
10M each for the two different instructions:

movzbl:
mean = 1.24E+08
variance = 3.95E+12

movsbl:
mean = 1.38E+08
variance = 3.44E+12

ratio movsbl/movzbl = 1.11

Performing a two-tail student t-test gives

p-value: 0.00E+00

Something is causing these two test runs to give different performance
results. I will not pretend I know enough about the inner workings of
Intel's magic box to explain why.

>>
>>My intention wasn't really to claim they're exactly the same, but that
>>that I don't think there is any real performance benefit to be had by
>>switching char to unsigned. Even if the 10-30 % are a real thing, I
>>wonder how much software is actually using char types in a way where
>>it would make a difference?
>
> We use uint8_t extensively because the data is unsigned in the range 0-255.
>
> And generally want wrapping behavior modulo 2^8.

Sure, but if you are using uint8_t, you have sidestepped the whole
issue of char being signed or unsigned, so a change wouldn't really
affect you.

Richard Kettlewell

Feb 7, 2024, 5:30:02 AM

Keith Thompson <Keith.S.T...@gmail.com> writes:

> Richard Kettlewell <inv...@invalid.invalid> writes:
>> Keith Thompson <Keith.S.T...@gmail.com> writes:
>>> Lawrence D'Oliveiro <l...@nz.invalid> writes:
>>>> Signed characters make no sense.
>>>
>>> You wrote that "Signed characters make no sense". I was talking about a
>>> context in which they did make sense. How is that effectively what you
>>> said? (I was agreeing with and expanding on your statement about the
>>> PDP-11.)
>>
>> I still don’t see any explanation for signed characters as such making
>> sense.
>>
>> I think the situation is more accurately interpreted as letting a
>> PDP-11-specific optimization influence the language design, and
>> (temporarily) getting away with it because the character values they
>> cared about at the time happened to lie within a small enough range that
>> negative values didn’t arise.
>
> I think we're mostly in agreement, perhaps with different understandings
> of "making sense". What I'm saying is that the decision to make char a
> signed type made sense for PDP-11 implementation, purely because of
> performance issues.

Having a basic 8-bit integer type be a signed type makes sense (in
context) for performance reasons and perhaps for usability reasons too.

But that’s really not the same as “signed characters make sense”. For
signed characters to make sense there has to be an encoding where some
signs (or control codes, etc.) are encoded to negative values. I’ve never
heard of one.

“char” isn’t just a random string of symbols. It’s obvious both from the
name and the way it’s used in the language that it’s intended to
represent characters, not just small integer values. If the purpose was
purely the latter it would have been called ‘short short int’ or
something like that.

> I just did a quick test on x86_64, x86, and ARM. It appears that
> assigning either an unsigned char or a signed char to an int object
> takes a single instruction. (My test didn't distinguish between
> register or memory target.) I suspect there's no longer any performance
> justification on most modern platforms for making plain char signed.
> But there's like to be (bad or at least non-portable) code that depends
> on plain char being signed. As it happens, plain char is unsigned in
> gcc for ARM. And gcc has "-fsigned-char" and "-funsigned-char" options
> to override the default.

i.e. we’re still suffering the locked-in side-effects of an ancient
decision even though the original justification has become irrelevant.
It might or might not have been a reasonable trade-off at the time,
disregarding what were then hypotheticals about the future, but (indeed
with hindsight) I think from today’s point of view it was clearly the
wrong decision.

--
https://www.greenend.org.uk/rjk/

Scott Lurndal

Feb 7, 2024, 10:22:09 AM

Very interesting. I don't know why that is.


>> We use uint8_t extensively because the data is unsigned in the range 0-255.
>>
>> And generally want wrapping behavior modulo 2^8.
>
>Sure, but if you are using uint8_t, you have sidestepped the whole
>issues of char being signed or unsigned so a change wouldn't really
>affect you.

While most C compilers have a compile-time option to select the signed-ness of
char, using uint8_t sidesteps the issue completely.

Rainer Weikusat

Feb 7, 2024, 10:30:30 AM

Richard Kettlewell <inv...@invalid.invalid> writes:
> Keith Thompson <Keith.S.T...@gmail.com> writes:

[...]

>> I think we're mostly in agreement, perhaps with different understandings
>> of "making sense". What I'm saying is that the decision to make char a
>> signed type made sense for PDP-11 implementation, purely because of
>> performance issues.
>
> Having a basic 8-bit integer type be signed type makes sense (in
> context) for performance reasons and perhaps for usability reasons too.
>
> But that’s really not the same as “signed characters make sense”. For
> signed characters to make sense there has to be encoding where some
> signs (or control codes, etc) are encoded to negative values. I’ve never
> heard of one.
>
> “char” isn’t just a random string of symbols. It’s obvious both from the
> name and the way it’s used in the language that it’s intended to
> represent characters, not just small integer values.

Computers have absolutely no idea of "characters". They handle numbers,
integer numbers in this case, and humans then interpret them as
characters based on some convention for encoding characters as
integers. Hence, a data type suitable for holding an encoded character
(ie, an integer value from 0 - 127 for the case in question) is not the
same as a character.

Richard Kettlewell

Feb 7, 2024, 3:20:17 PM

Rainer Weikusat <rwei...@talktalk.net> writes:
> Richard Kettlewell <inv...@invalid.invalid> writes:
>> “char” isn’t just a random string of symbols. It’s obvious both from the
>> name and the way it’s used in the language that it’s intended to
>> represent characters, not just small integer values.
>
> Computers have absolutely no idea of "characters". They handle numbers,
> integer numbers in this case, and humans then interpret them as
> characters based on some convention for encoding characters as
> integers. Hence, a data type suitable for holding an encoded character
> (ie, an integer value from 0 - 127 for the case in question) is not the
> same as a character.

Language designers do, however, have an idea of “characters”.

--
https://www.greenend.org.uk/rjk/

Lawrence D'Oliveiro

Feb 7, 2024, 3:58:06 PM

On Wed, 07 Feb 2024 20:20:12 +0000, Richard Kettlewell wrote:

> Language designers do, however, have an idea of “characters”.

Unicode uses the terms “grapheme” and “text element”. Actually it also
uses “character”, but it seems less clear on what that means. It is not
the same as a “code point” or “glyph”.

<https://www.unicode.org/faq/char_combmark.html>

Richard Kettlewell

Feb 8, 2024, 6:22:01 AM

Lawrence D'Oliveiro <l...@nz.invalid> writes:

Sure, but this was all happening in the 1970s, long before Unicode
existed.

K&R1 explicitly says char is “capable of holding one character in the
local character set” (and mentions EBCDIC as a concrete example on the
same page - the problem must have been obvious already).

--
https://www.greenend.org.uk/rjk/

Rainer Weikusat

Feb 8, 2024, 11:34:16 AM

I don't quite understand what that's supposed to communicate. Insofar
as the machine is concerned, a character is nothing but an integer, and
a data type sufficient to hold a character is thus necessarily an
integer type of some size. In a language without unsigned integer
types, it'll necessarily also be a signed integer type.

Keith Thompson

Feb 8, 2024, 11:53:42 AM

Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
char was effectively unsigned in some implementations, in that
converting a char value to int would zero-fill the result rather than
doing sign-extension.

Rainer Weikusat

Feb 8, 2024, 12:46:26 PM

Keith Thompson <Keith.S.T...@gmail.com> writes:
> Rainer Weikusat <rwei...@talktalk.net> writes:
>> Richard Kettlewell <inv...@invalid.invalid> writes:
>>> Rainer Weikusat <rwei...@talktalk.net> writes:
>>>> Richard Kettlewell <inv...@invalid.invalid> writes:
>>>>> “char” isn’t just a random string of symbols. It’s obvious both from the
>>>>> name and the way it’s used in the language that it’s intended to
>>>>> represent characters, not just small integer values.
>>>>
>>>> Computers have absolutely no idea of "characters". They handle numbers,
>>>> integer numbers in this case, and humans then interpret them as
>>>> characters based on some convention for encoding characters as
>>>> integers. Hence, a data type suitable for holding an encoded character
>>>> (ie, an integer value from 0 - 127 for the case in question) is not the
>>>> same as a character.
>>>
>>> Language designers do, however, have an idea of “characters”.
>>
>> I don't quite understand what that's supposed to communicate. Insofar
>> the machine is concerned, a character is nothig but an integer and a
>> data type sufficient to hold a characters is thus necessarily an integer
>> type of some size. In a language without unsigned integer types, it'll
>> necessarily also be an signed integer type.
>
> Early C (pre-K&R1) didn't explicitly have unsigned integer types, but
> char was effectively unsigned in some implementations, in that
> converting a char value to int would zero-fill the result rather than
> doing sign-extension.

According to Ritchie's "The Development of the C Language"

,----
| During 1973-1980, the language grew a bit: the type structure gained
| unsigned
|
| [...]
|
| the similarity of the arithmetic properties of character pointers and
| unsigned integers made it hard to resist the temptation to identify
| them. The unsigned types were added to make unsigned arithmetic
| available without confusing it with pointer manipulation. Similarly, the
| early language condoned assignments between integers and pointers
`----

Keith Thompson

Feb 8, 2024, 1:23:34 PM

Right. K&R1 (1978) had "unsigned", but only for unsigned int. Still,
the signedness of char was effectively implementation-defined, though it
wasn't stated in those terms.

From K&R1:

A character or a short integer may be used wherever an
integer may be used. In all cases the value is converted to
an integer. Conversion of a shorter integer to a longer always
involves sign extension; integers are signed quantities. Whether
or not sign-extension occurs for characters is machine dependent,
but it is guaranteed that a member of the standard character
set is non-negative. Of the machines treated by this manual,
only the PDP-11 sign-extends. On the PDP-11, character variables
range in value from -128 to 127; the characters of the ASCII
alphabet are all positive. A character constant specified with
an octal escape suffers sign extension and may appear negative;
for example, '\377' has the value -1.

The sentence "Whether or not sign-extension occurs for characters is
machine dependent" might be written in more modern terms as "The
signedness of char is implementation-defined".

signed char and unsigned char (and unsigned short and unsigned long)
were added in ANSI C 1989, possibly earlier.
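
The '\377' example from K&R1 is easy to reproduce today (a sketch; the
output depends on the implementation-defined signedness of plain char):

#include <stdio.h>

int main(void)
{
    char c = '\377';
    printf("%d\n", c);   /* -1 where plain char is signed, 255 where it is unsigned */
    return 0;
}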

Kaz Kylheku

Feb 8, 2024, 2:54:17 PM

On 2024-02-08, Rainer Weikusat <rwei...@talktalk.net> wrote:
> According to Ritchie's "The Development of the C Language"
>
> ,----
>| During 1973-1980, the language grew a bit: the type structure gained
>| unsigned
>|
>| [...]
>|
>| the similarity of the arithmetic properties of character pointers and
>| unsigned integers made it hard to resist the temptation to identify
>| them. The unsigned types were added to make unsigned arithmetic
>| available without confusing it with pointer manipulation. Similarly, the
>| early language condoned assignments between integers and pointers
> `----

It seems like a very odd rationale, given how things played out.

The difference between two pointers ended up signed (ptrdiff_t).
So pointer arithmetic doesn't work exactly like unsigned. That's mostly
a good thing, except that pointers farther from each other than half the
address space cannot be subtracted. (ISO C mostly takes that away anyway
since pointers to different objects may only be compared for exact
equality, and canno tbe subtracted. If no object is half the address
space or larger, subtraction overflow will never occur.)

Moreover, unsigned ended up necessary for representing a simple byte
in a nice way.

Not only that, but unsigned types are useful for bit manipulation,
without running into nonportable behaviors around shifting into and out
of the sign bit.

If you have a 32 bit int and want a 32 bit field, you want unsigned int.

Very odd to see the existence of unsigned math justified in terms of
some story about pointers.

It seems Ritchie really didn't think much about portability; he
probably thought it was fine to do 1 << 15 with a 16-bit signed int
to calculate a mask for the highest bit, since that happened to work in
the systems he designed. If someone wanted C on their weird machine
where that misbehaves, or produces an alternative zero that compares
equal to regular zero, that was their problem.
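
For the record, the portable way to build that particular mask is to do
the shift in an unsigned type (a sketch):

#include <stdio.h>

int main(void)
{
    unsigned int high_bit = 1u << 15;   /* well-defined even where int is only 16 bits */
    /* int bad = 1 << 15; */            /* undefined behaviour with a 16-bit signed int */
    printf("0x%x\n", high_bit);
    return 0;
}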

Lawrence D'Oliveiro

Feb 8, 2024, 4:58:02 PM

On Thu, 08 Feb 2024 10:23:29 -0800, Keith Thompson wrote:

> The sentence "Whether or not sign-extension occurs for characters is
> machine dependent" might be written in more modern terms as "The
> signedness of char is implementation-defined".
>
> signed char and unsigned char (and unsigned short and unsigned long)
> were added in ANSI C 1989, possibly earlier.

Here’s an odd thing: what happens when you shift a signed int? K&R allows
left-shift with the obvious meaning, and says that, for right-shift,
whether the top bits are zero-filled or sign-extended is implementation-
defined; newer C specs say that left-shifting a negative value is simply
“undefined”, and right-shifting a negative value is “implementation-
defined”.

Scott Lurndal

Feb 8, 2024, 5:30:57 PM

Lawrence D'Oliveiro <l...@nz.invalid> writes:
There were extant hardware implementations exhibiting both behaviors. So they
made the behavior implementation-defined in the compiler.

Keith Thompson

Feb 8, 2024, 5:31:10 PM

Lawrence D'Oliveiro <l...@nz.invalid> writes:
True. On the other hand, there are no shift operations on character
types (or short or unsigned short). The integer promotions are
performed on both operands of "<<" or ">>", so the value that's shifted
is at least as wide as int or unsigned int.
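
For example (a sketch; the promotion happens before the shift is
evaluated):

#include <stdio.h>

int main(void)
{
    unsigned char b = 0x80;
    int r = b << 1;     /* b is promoted to int first, so r is 256, not 0 */
    printf("%d\n", r);  /* prints 256 */
    return 0;
}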

Lawrence D'Oliveiro

Feb 8, 2024, 6:26:59 PM

Except the current spec doesn’t mention a choice between two behaviours.