
Endian wars


David Eppstein

unread,
Jan 21, 1989, 5:48:12 PM1/21/89
to
I don't know why I'm contributing to this recurring flamefest, but here goes...

Big endian lets you use integer comparison instructions to do string
compares a word at a time. Little endian means you are stuck with a
byte at a time.

The arguments about how people expect to read things seem pretty bogus
to me. One of the things computers are very good at doing is format
conversion.
--
David Eppstein epps...@garfield.cs.columbia.edu Columbia U. Computer Science

Steve Vegdahl

unread,
Jan 23, 1989, 3:11:20 PM1/23/89
to
In article <61...@columbia.edu>, eppstein@garfield (David Eppstein) writes:
> Big endian lets you use integer comparison instructions to do string
> compares a word at a time. Little endian means you are stuck with a
> byte at a time.

This depends on how strings are represented. If you take the view that
"C does things the right way, the only way", then I would be inclined
to agree that making the "wrong" endian choice would slow down string
comparison *if* your compiler is smart enough to figure out the optimization,
*or* if the string-comparison algorithm is coded in a non-portable way
(namely, an endian-ness assumption).

But consider a representation of strings where the characters are laid out
"backwards" in memory; a pointer to a string would contain the address of
the string's highest-addressed byte, which is the first character of the
string. Now, big endian and little endian find their roles reversed WRT the
above optimization.

> The arguments about how people expect to read things seem pretty bogus
> to me. One of the things computers are very good at doing is format
> conversion.

I agree. I also believe that people (other than language implementors)
should not be concerned with the details of how a string is represented
in memory. They should only be reading high-level-language code.

Steve Vegdahl
Computer Research Lab
Tektronix Labs
Beaverton, Oregon

Richard Okeefe

unread,
Jan 25, 1989, 12:00:42 AM1/25/89
to
Before arguing about whether big-endian order or little-endian
order is "more natural" for people, it's enlightening to consider
the historical origin of the way we write numbers. We write from
left to right, and put the most significant digit on the left.
But we copied that method of writing numbers from >>Arabic<<
mathematics, where the direction of writing was otherwise right
to left. So in Arabic, you encountered the low digit first in
your normal reading scan. This had the pleasant psychological
advantage that when you added two numbers, you wrote the answer
down in the order that you always wrote numbers, instead of
starting from the opposite end. There was one famous mathematician
this century who habitually wrote numbers least-significant-digit-
first, apparently for this reason.

So _both_ conventions are "natural" in human writing systems.

Jim Patterson

unread,
Jan 25, 1989, 12:24:36 PM1/25/89
to
In article <61...@columbia.edu> eppstein@garfield (David Eppstein) writes:
>Big endian lets you use integer comparison instructions to do string
>compares a word at a time. Little endian means you are stuck with a
>byte at a time.

I'd like to see an algorithm that actually benefits from this.
Consider...

If you know how long the string is ahead of time, you can optimize the
first (n / 4) int's (assuming 4-byte ints) whether or not it's
big-endian. If it's big-endian, then the first non-equal match
indicates the result. If it's little-endian, you have to switch to a
byte-wise loop for the non-equal word which means up to four bytes are
checked twice. In the C library, this approach only applies to memcmp.

You have to treat the last word specially if it's not an even multiple
of the word size with the big-endian approach. Otherwise, you will get
the wrong answer if the portion that shouldn't be matched is the only
part that is different. The little-endian approach already checks any
non-matching word, so if designed right this would not be a special
case.

If you want to do signed-byte comparisons, the big-endian
word-oriented approach won't work (you will have to do it similar to
the little-endian approach). The reason is that the sign of the result
will indicate the relative ordering of the int's, and not of the bytes
that mismatch.

If you don't know how long the string is (as for strcmp and strncmp),
then you have to scan the string to find how long it is. For a
word-oriented algorithm to be effective here, you need an algorithm
which detects which byte of a word (if any) contains a NUL.

I contend that there's no simple way to do this with integer
instructions; it's more effective to use byte-oriented instructions.
The only word-oriented approaches I can think of are along these
lines.

has_NUL = ! (i & 0xff && i & 0xff00 && i & 0xff0000 && i & 0xff000000);

If you have to scan the string as bytes anyways, I think that it would
be more efficient to compare them with the other string at the same
time.

Maybe someone has a better algorithm. This one is sure to be worse
than using byte-oriented instructions, unless your machine just
doesn't have any or they are woefully inadequate. A straight
table-lookup is obviously out of the question.

In summary, a big-endian machine might gain a slight advantage if it
knows ahead of time how long the string is. Really, the only
difference is that the big-endian machine can run along some fixed
number of words, and return the result from the first non-equal match,
whereas the little-endian algorithm has to double-check the unmatching
word with byte instructions before it can return. For varying-length
strings, I see no advantage to the word-oriented approach.

If anyone has any good word-oriented implementations of memcmp or
especially strcmp, I'd be interested in seeing them.
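
[A minimal illustrative sketch, not from this posting: the kind of
word-oriented memcmp being asked for, in C, assuming an aligned, big-endian
machine with 4-byte unsigned ints. The tail is handled byte-wise, and the
misalignment and signed-byte issues raised above are deliberately ignored.]

    #include <stddef.h>

    int memcmp_be(const void *a, const void *b, size_t n)
    {
        const unsigned int *wa = a, *wb = b;    /* assumes both are word-aligned */

        while (n >= sizeof(unsigned int)) {
            if (*wa != *wb)                     /* big-endian: unsigned word order */
                return (*wa < *wb) ? -1 : 1;    /* equals byte-wise lexical order  */
            wa++, wb++;
            n -= sizeof(unsigned int);
        }
        /* tail: fewer than one word left, compare byte by byte */
        {
            const unsigned char *ca = (const unsigned char *)wa;
            const unsigned char *cb = (const unsigned char *)wb;
            while (n--) {
                if (*ca != *cb)
                    return (*ca < *cb) ? -1 : 1;
                ca++, cb++;
            }
        }
        return 0;
    }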
--
Jim Patterson Cognos Incorporated
UUCP:decvax!utzoo!dciem!nrcaer!cognos!jimp P.O. BOX 9707
PHONE:(613)738-1440 3755 Riverside Drive
Ottawa, Ont K1G 3Z4

Jan Gray

unread,
Jan 27, 1989, 7:04:24 PM1/27/89
to
In article <51...@aldebaran.UUCP> ji...@cognos.UUCP (Jim Patterson) writes:
>If you don't know how long the string is (as for strcmp and strncmp),
>then you have to scan the string to find how long it is. For a
>word-oriented algorithm to be effective here, you need an algorithm
>which detects which byte of a word (if any) contains a NUL.
>
>I contend that there's no simple way to do this with integer
>instructions; it's more effective to use byte-oriented instructions.
>
> has_NUL = ! (i & 0xff && i & 0xff00 && i & 0xff0000 && i & 0xff000000);

This went around comp.arch a while ago.
has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
e.g. "test if there were any borrows as a result of the bytewise subtracts"

Using this trick on the '386, strlen on long strings can be made about 30%
faster than using the dedicated string instruction "rep scasb"!
(Except this will cause many instruction fetches that will keep your bus busy.)

The 80960 has the SCANBYTE instruction, and the 29000 has CPBYTE, for just
this sort of thing.
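
[A hedged sketch, not from this posting: the borrow trick above applied to
strlen in C, assuming 32-bit unsigned longs and a word-aligned string
pointer. Alignment of the head and the fact that the load may read a few
bytes beyond the NUL (within the same aligned word) are glossed over.]

    #include <stddef.h>

    size_t strlen_word(const char *s)
    {
        const unsigned long *w = (const unsigned long *)s;  /* assumes alignment */

        for (;;) {
            unsigned long i = *w;
            /* nonzero iff some byte of i is zero (the bytewise-borrow test) */
            if (((i - 0x01010101UL) & ~i) & 0x80808080UL) {
                const char *p = (const char *)w;
                while (*p)                      /* locate the exact NUL byte */
                    p++;
                return (size_t)(p - s);
            }
            w++;
        }
    }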

Hmm. "0x80808080". I knew the 8080 was good for something... :-)

Jan Gray uunet!microsoft!jangr Microsoft Corp., Redmond Wash. 206-882-8080

Wayne A. Throop

unread,
Jan 27, 1989, 10:30:13 AM1/27/89
to
> ste...@tekchips.CRL.TEK.COM (Steve Vegdahl)
>> eppstein@garfield (David Eppstein)

>> Big endian lets you use integer comparison instructions to do string
>> compares a word at a time.
> This depends on how strings are represented. [...]

> consider a representation of strings where the characters are laid out
> "backwards" in memory; [...]

> Now, big endian and little endian find their roles reversed WRT the
> above optimization.

True, true... BUT this role reversal is by no means "free". Now the
little-endian-but-backwards-strings machine cannot use the same routines
to allocate, read, write, and otherwise treat strings and non-strings as
an uninterpreted bucket of bits. All manipulators of pointers will
have to "know" what they point at. There will be much "duplicated" code
on this machine relative to a machine where all memory-chunks are
addressed by their "least" (or greatest) component-address. And of course
there will be all the attendant bugs that occur when a pointer of one
kind is fed to a routine expecting the other kind.

Not that these problems are insolvable. It may (possibly) even be
worthwhile to do things this way. But the big-endian way is still
superior in this respect, and making this change is only trading one
difficulty for another. (Of course, the little-endian way is superior
in other respects.)

--
There are two ways to write error-free programs;
only the third one works.
--- Alan J. Perlis
--
Wayne Throop <the-known-world>!mcnc!rti!xyzzy!throopw

Tim Olson

unread,
Jan 27, 1989, 11:26:52 AM1/27/89
to
In article <51...@aldebaran.UUCP> ji...@cognos.UUCP (Jim Patterson) writes:
| In article <61...@columbia.edu> eppstein@garfield (David Eppstein) writes:
| >Big endian lets you use integer comparison instructions to do string
| >compares a word at a time. Little endian means you are stuck with a
| >byte at a time.
|
| I'd like to see an algorithm that actually benefits from this.
| Consider...
|
| If you don't know how long the string is (as for strcmp and strncmp),
| then you have to scan the string to find how long it is. For a
| word-oriented algorithm to be effective here, you need an algorithm
| which detects which byte of a word (if any) contains a NUL.
|
| If anyone has any good word-oriented implementations of memcmp or
| especially strcmp, I'd be interested in seeing them.

A couple of years ago I posted essentially the string routines we use on
the Am29000. These routines make use of:

1) comparisons a word at a time

2) the "cpbyte" instruction to detect a null byte in a word

3) the "extract" instruction (extract a 32-bit word from
a 64-bit word pair at any bit boundary) to take care of
misaligned operands

These routines broke even with a standard byte-at-a-time hand-coded
routine with 5-byte strings (including null) and were always better with
longer strings.


-- Tim Olson
Advanced Micro Devices
(t...@crackle.amd.com)

Dik T. Winter

unread,
Jan 27, 1989, 6:28:44 PM1/27/89
to
In article <51...@aldebaran.UUCP> ji...@cognos.UUCP (Jim Patterson) writes:
> word-oriented algorithm to be effective here, you need an algorithm
> which detects which byte of a word (if any) contains a NUL.
>
> I contend that there's no simple way to do this with integer
> instructions; it's more effective to use byte-oriented instructions.
> The only word-oriented approaches I can think of are along these
> lines.
>
> has_NUL = ! (i & 0xff && i & 0xff00 && i & 0xff0000 && i & 0xff000000);
>
On a ones complement machine (because of end-around carry):
has_NUL = (~i ^ (i - 0x01010101)) & 0x01010101
On a twos complement machine you have to look at the carry bit too.
However you might also try:
j = i - 0x01010101;
has_NUL = ((~i ^ j) & 0x01010101) | (~j & i & 0x80000000)
modulo some typos of course.
--
dik t. winter, cwi, amsterdam, nederland
INTERNET : d...@cwi.nl
BITNET/EARN: dik@mcvax

Paul A Vixie

unread,
Jan 29, 1989, 3:52:05 AM1/29/89
to
[Jan Gray]
# This went around comp.arch a while ago.
# has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
# e.g. "test if there were any borrows as a result of the bytewise subtracts"
#
# Using this trick on the '386, strlen on long strings can be made about 30%
# faster than using the dedicated string instruction "rep scasb"! (Except this
# will cause many instruction fetches that will keep your bus busy.)

Wait a minute... This is probably a silly question, more so since it comes
from the moderator of <info-...@vixie.sf.ca.us>, but... doesn't the '386
have a cache for recently used instructions?
--
Paul Vixie
Work: vi...@decwrl.dec.com decwrl!vixie +1 415 853 6600
Play: pa...@vixie.sf.ca.us vixie!paul +1 415 864 7013

bill vermillion

unread,
Jan 28, 1989, 12:32:52 PM1/28/89
to
In article <1...@aucsv.UUCP> o...@aucsv.UUCP (Richard Okeefe) writes:
>Before arguing about whether big-endian order or little-endian
>order is "more natural" for people,
>
>So _both_ conventions are "natural" in human writing systems.

And mixed conventions are considered normal in spoken English.

Consider that, for example, twenty-five or thirty-six would fit the
"big-endian" definition, while the numbers thir-teen and four-teen would
be considered "little endian".

To be consistent with the numbering scheme of 1 to 100, the numbers
after nine should probably be teen-zero or ten-zero, followed by
teen-one, teen-three, teen-four. Using the "y" ending would be
confusing if we called them teenty-four or tenty-four. Too many
sound-alikes for the twenties.

It appears the natural order is dis-order.

--
Bill Vermillion - UUCP: {uiucuxc,hoptoad,petsd}!peora!rtmvax!bilver!bill
: bi...@bilver.UUCP

Mark Armbrust

unread,
Jan 30, 1989, 11:43:02 AM1/30/89
to
In article <7...@microsoft.UUCP> ja...@microsoft.UUCP (Jan Gray) writes:
>In article <51...@aldebaran.UUCP> ji...@cognos.UUCP (Jim Patterson) writes:
>
>This went around comp.arch a while ago.
> has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
>e.g. "test if there were any borrows as a result of the bytewise subtracts"
>
>Using this trick on the '386, strlen on long strings can be made about 30%
>faster than using the dedicated string instruction "rep scasb"!

@:      lodsd                           5 clocks to execute
        mov     ebx, eax                2
        sub     eax, 01010101h          3
        not     ebx                     2
        and     eax, ebx                3
        and     eax, 80808080h          3
        loopz   @                       12      (30 clocks/19 bytes)

Seems to me that the following would be a bit faster:

@:      lodsd                           5 clocks to execute
        mov     ebx, eax                2
        shr     ebx, 16                 3
        and     ax, bx                  3
        and     al, ah                  3
        loopnz  @                       12      (28 clocks/13 bytes)

In any case, I prefer strings stored with a leading count instead of a
trailing zero--I've been thinking of writing an alternate library for C to
be able to handle them. Some of the programming I've done could have
benefited from this type of string. (So what if they're longer--memory is
cheaper than execution speed.)

Mark

(I should know better by now than to post things when I have a nasty
cold--there's prob'ly something wrong with the above and I'll be buried in
mail :-( )

A.H.L. Wassenberg

unread,
Jan 31, 1989, 3:34:56 AM1/31/89
to
In article <3...@bilver.UUCP> bi...@bilver.UUCP (bill vermillion) writes:

> To be consistant with the numbering scheme of 1 to 100 the numbers
> after nine should probably be teen-zero or ten-zero, followed by

Do you think that is consistent? Is 20 in your language "two-zero"?
Or "twen-zero"? I think "onety" would be more consistent (considering
twenty, thirty, etc.).

> teen-one, teen-three, teen-four. Using the "y" ending would be
> confusing if we called then teenty-four of tenty-four. Too much
> sound-alikes for the twentys.

These would become onety-one, onety-two [ you forgot that one :-) ],
onety-three, etc. All very consistent, and not more sound-alike than
the other ....-ty's.

__
/ ) Lex Wassenberg
/ Philips Telecommunication & Data Systems B.V.
/ _ Apeldoorn, The Netherlands
__/ /_\ \/ Internet: le...@idca.tds.philips.nl
(_/\___\___/\ UUCP: ..!mcvax!philapd!lexw

Jan Gray

unread,
Jan 31, 1989, 12:13:07 AM1/31/89
to
[Paul A Vixie]
# [Jan Gray]
# # This went around comp.arch a while ago.
# # has_NUL = (((i-0x01010101)&~i)&0x80808080) != 0,
# # e.g. "test if there were any borrows as a result of the bytewise subtracts"

# #
# # Using this trick on the '386, strlen on long strings can be made about 30%
# # faster than using the dedicated string instruction "rep scasb"! (Except this
# # will cause many instruction fetches that will keep your bus busy.)
#
# Wait a minute... This is probably a silly question, more so since it comes
# from the moderator of <info-...@vixie.sf.ca.us>, but... doesn't the '386
# have a cache for recently used instructions?

The parenthetical comment was speculation on my part, as I haven't actually
taken a data analyzer to any of my 386 boxes.

The 386 itself has no instruction cache, although there are many high
performance 386 systems with 32K or 64K external caches. The 386 does have
an instruction prefetch buffer of some size, but I don't think it is of any
benefit to loops a la 68010 loop mode.

I would not advocate changing your 386 str* library, because the 30%
figure is really the asymptotic speedup. For typical strings of less
than ten bytes you wouldn't get much improvement, after allowing for
longword alignment (we don't want to page fault off the last data page)
and for finding which byte within the longword holds the null.

The payoff is more dramatic on the 68020. The "by the book" timings for
traditional byte-at-a-time strlen (unrolled 4 times) vs. one iteration
of the longword scan:
                               Cycles: best  cache  worst   Bus reads
   Traditional (scan 4 bytes)          22    42     50          4
   Longword scan                        7    22     31          1

No, MS software engineers *don't* spend their days counting clock cycles!

Fritz Nordby

unread,
Feb 1, 1989, 10:47:24 AM2/1/89
to
In article <3...@microsoft.UUCP> w-co...@microsoft.uucp (Colin Plumb) writes:
>(Quick: who's run into Unix's 10K command-line limit?)

Me. Often. And probably other folks who've worked with large numbers of
source files. Consider:

$ pr `find . -type f -print|egrep '^([Mm]akefile|.*\.[ch])$'` | lpr

Another example: have you looked at the way the "rcbook.t"
program (from the alt.gourmand recipes software) works? It has to work
around this restriction.

BTW, the "10K command-line limit" is not exactly that; it's really a
"10 block command-line limit". Yes, that's right, it used to be a 5K
limit in the days of 512 byte disk blocks, and on systems with 4K disk
blocks I rather suspect that it's a 40K limit.

Moral: A restriction is a restriction, and no matter how lax or trivial the
restriction may seem today, eventually somebody will run up hard against it.
(Anybody else remember when 64k was a lot of memory? And now we're running
out of space with 32 bits?)

Fritz Nordby. fr...@vlsi.caltech.edu cit-vax!cit-vlsi!fritz

Mark Armbrust

unread,
Feb 2, 1989, 11:01:23 AM2/2/89
to
In article <3...@microsoft.UUCP> w-co...@microsoft.uucp (Colin Plumb) writes:
>m...@nbires.UUCP (Mark Armbrust) wrote:
>> Seems to me that the following would be a bit faster:
>>
>> [some blatantly WRONG code deleted.]
>
>But won't work. Consider "\1\2\4\8".

Like I said in the original posting; I should have learned by now not to post
when the brain is misfiring due to 'flu.

As for string length limits and count-preceded strings, I prefer using two
one-word values: one that is fixed at allocation time and gives the maximum
size of string that the buffer can hold, and one that gives the current
size of the string.
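
[Not from the posting: one possible C layout for the two-count string
described above; the names cstring, cap and len are made up for
illustration.]

    #include <stdlib.h>

    struct cstring {
        unsigned short cap;     /* maximum size, fixed at allocation time */
        unsigned short len;     /* current size of the string             */
        char data[1];           /* characters follow; no trailing zero    */
    };

    struct cstring *cstr_new(unsigned short cap)
    {
        struct cstring *s = malloc(sizeof(struct cstring) + cap - 1);
        if (s != NULL) {
            s->cap = cap;
            s->len = 0;
        }
        return s;
    }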

Mark

Peter da Silva

unread,
Feb 2, 1989, 11:29:35 AM2/2/89
to
In article <3...@microsoft.UUCP>, w-co...@microsoft.UUCP (Colin Plumb) writes:
> (Quick: who's run into Unix's 10K command-line limit?)

I don't know, but I run into UNIX's command-line and environment limit
all the time. Last I checked this limit was 1K, but it's probably bigger
these days. On our 286 boxes it's certainly nowhere near 10K.
--
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Work: uunet.uu.net!ficc!peter, pe...@ficc.uu.net, +1 713 274 5180. `-_-'
Home: bigtex!texbell!sugar!peter, pe...@sugar.uu.net. 'U`
Opinions may not represent the policies of FICC or the Xenix Support group.

ag...@mcdurb.urbana.gould.com

unread,
Feb 2, 1989, 12:18:00 PM2/2/89
to

>(Quick: who's run into Unix's 10K command-line limit?)

I `have` - i.e., I have produced overlong commands using backquote
expansion, although I seem to remember that the limit was considerably
smaller at the time (on a V7 hybrid running csh).
I ran into so many of csh's built-in limits that I almost
completely abandoned it for serious programming - although I still
used it interactively.

Limits make liars of the people who say that UNIX tools, pipes
and backquotes, provide composability of modules. Hurrah for
Richard Stallman and his campaign to make it possible to let
the limit on filename size be infinite! Death to System V's
14 character filenames!


Andy "Krazy" Glew ag...@urbana.mcd.mot.com uunet!uiucdcs!mcdurb!aglew
Motorola Microcomputer Division, Champaign-Urbana Design Center
1101 E. University, Urbana, Illinois 61801, USA.

My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

Rob Warnock

unread,
Feb 2, 1989, 10:59:46 PM2/2/89
to
[(*Sigh*) I was replying to individuals directly when I saw lots of posted
replies, so I should probably say it here, even though its implication for
"architecture" is mainly the architecture of command line interfaces.]

+---------------


| (Quick: who's run into Unix's 10K command-line limit?)

+---------------

Yes, all of us, until we [most of us?] started using the standard utility
program "xargs" (used to be System-V only, until a p-d source for a useful
subset for 4.x BSD was posted several years ago). With "xargs", that 10k limit
becomes a non-issue. Instead of, for example:

$ some_cmd -options `find <selection> -print`
try:
$ find <selection> -print | xargs some_cmd -options

"Xargs" gathers lines from standard input until it gets "close" to the
system-established limits on command-line length, then runs "some_cmd"
with a "bunch" of args (repeating "-options" each time), then repeats
until EOF on input. It's also got options to limit the number of input
items gathered per bunch, so you can do things in pairs, triples, etc.

Simple idea. You could code one up quickly just given the idea. (But I
admit *I* never thought of it until the System-V "xargs" came out...)


Rob Warnock
Systems Architecture Consultant

UUCP: {amdcad,fortune,sun}!redwood!rpw3
ATTmail: !rpw3
DDD: (415)572-2607
USPS: 627 26th Ave, San Mateo, CA 94403

Miles Bader

unread,
Feb 3, 1989, 12:52:14 AM2/3/89
to
ag...@mcdurb.Urbana.Gould.COM writes:
> Limits make liars about the people who say that UNIX tools, pipes
> and backquotes, provide composability of modules. Hurrah for
> Richard Stallman and his campaign to make it possible to let
> the limit on filename size be infinite! Death to System V's
> 14 character filenames!

Sigh... We have thousands of source files that were created by people
using BSD. Now it's been decided that they have to be compatible (to
facilitate future porting) with... OS/2! 8+3 chars, no case
distinction, no leading periods. Microsoft: just say Bleah.

-Miles

Jim Shankland

unread,
Feb 3, 1989, 12:13:20 PM2/3/89
to
In article <28200266@mcdurb> ag...@mcdurb.Urbana.Gould.COM writes:
>
>>(Quick: who's run into Unix's 10K command-line limit?)
>
>I `have` - ie. I have produced overlong commands using backquote
>expansion. Although I seem to remember that the limit was considerably
>smaller at the time (on a V7 hybrid running csh).
> I ran into so many of csh's built-in limits that I almost
>completely abandoned it for serious programming - although I still
>used in interactively.

Even BSD's 10K limit (N.B.: it's a kernel limit, not a csh limit) is
nowhere near enough for me. I know it's a pain to implement, but
for my money, the limit on environment size and total argv length
ought to be on the order of the limit on a process's virtual address
space size. (Hint to implementors: *don't* shrink the max. address
space size to make this true!)

Mr. Glew also mentions the 14-character filename length limit as an
annoyance. True enough; but I find other limits worse. Countless
utilities have line length limits, and many silently truncate lines
that are too long. And recently, I was unable to run vi on a ca. 390K
file on an AT&T 3B2/600 running SVR3: "sorry, file too big". AT&T
is positively quaint in its notion of what constitutes a "big" file.

I know it's a pain to find and root out all these limits. But they've
got to go. They don't belong in 1989.

(This is no longer an architecture topic. Followups, if any, to
comp.unix.wizards.)

Jim Shankland
j...@ernie.berkeley.edu

"I've been walking in a river all my life, and now my feet are wet"

ag...@mcdurb.urbana.gould.com

unread,
Feb 4, 1989, 12:46:00 PM2/4/89
to

..> Xargs

Close, although a pain to use. Makes me wish that xargs was a builtin,
and backquote substitution didn't exist.

[To give this a connection with architecture]: isn't this like the RISC
philosophy of providing only one, optimal, way to do things?

---

Soon I will be using news instead of @$%#@#@@! notes! Then I'll
be able to redirect and cross-post like the rest of you!

Let's get this out of comp.arch

Dik T. Winter

unread,
Feb 4, 1989, 9:48:35 PM2/4/89
to
In article <24...@amdcad.AMD.COM> rp...@amdcad.UUCP (Rob Warnock) writes:
> +---------------
> | (Quick: who's run into Unix's 10K command-line limit?)
> +---------------
>
> Yes, all of us, until we [most of us?] started using the standard utility
> program "xargs" (used to be System-V only, until a p-d source for a useful
> subset for 4.x BSD was posted several years ago). With "xargs", that 10k limit
> becomes a non-issue. Instead of, for example:
>
> $ some_cmd -options `find <selection> -print`
> try:
> $ find <selection> -print | xargs some_cmd -options
>
But of course the two are equivalent only if "some_cmd -options" uses only
one input file at a time. Try:
cc -o aap `find <selection> -print`
And if they are equivalent you could already do:
find <selection> -exec some_cmd -options {} \;
(although you have a lot more processes).

Paul L Schauble

unread,
Feb 5, 1989, 1:25:30 AM2/5/89
to
This deserves a new subject.

Since it was mentioned in the Endian Wars, does anyone know why C uses the
null terminated string rather than an explicit length? It seems like such
an odd choice considering that
- It removes a character from the character set, a source of many C
bugs, and
- All machines I know of that have character string instructions want
the length of the string. This forces the string primitives to first
scan for null, a time wasting operation.

There must have been a reason. What is it?

++PLS

John Hascall

unread,
Feb 5, 1989, 12:47:17 PM2/5/89
to
In article <14...@cup.portal.com> P...@cup.portal.com (Paul L Schauble) writes:

>Since it was mentioned in the Endian Wars, does anyone know why C uses the
>null terminated string rather than an explicit length? It seems like such
>an odd choice considering that

> - It removes a character from the character set, a source of many C
> bugs, and

Agreed.

> - All machines I know of that have character string instructions want
> the length of the string. This forces the string primitives to first
> scan for null, a time wasting operation.

I assume you mean something like:

+------+---+---+---+---+---+---+---+---+---+---+---+---+---+
|length| H | E | L | L | O | , | | W | O | R | L | D | \n|
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+

but, what size would you use for "length", a byte? a word? a longword?

I suspect that some of these machines' instructions expect different
sized operands for the length.

Also, to quote K&R: "C was originally designed ... on the DEC PDP-11",
a machine with no string instructions.

John Hascall
ISU Comp Center

Mark Nagel

unread,
Feb 5, 1989, 6:19:06 PM2/5/89
to

Hmm. There are two things going on here. One is that you want to
have truly variable-length strings. You can do it the C way, or you
can adopt some more complicated method like having different string
types or a variable length string length indicator. I think the
implementors chose the simplest approach, hoping that in the average
case, the overhead from scanning a string would be small (and
hopefully the value would be cached in whatever data structure needed
it). The other thing (once the sentinel method was chosen) was to
select the proper terminating character. I don't think NUL is used
much anywhere for anything and thus is a good candidate. In addition,
I've heard that NUL was chosen as a way to help prevent overrunning
the ends of strings by too much in the case of a missing end-of-string
character. What single byte value is more prevalent in machine code
than zero?

Mark Nagel @ UC Irvine, Dept of Info and Comp Sci
ARPA: na...@ics.uci.edu | Charisma doesn't have jelly in the
UUCP: {sdcsvax,ucbvax}!ucivax!nagel | middle. -- Jim Ignatowski

ag...@mcdurb.urbana.gould.com

unread,
Feb 5, 1989, 10:25:00 PM2/5/89
to

>>Since it was mentioned in the Endian Wars, does anyone know why C uses the
>>null terminated string rather than an explicit length?
>> - All machines I know of that have character string instructions want
>> the length of the string. This forces the string primitives to first
>> scan for null, a time wasting operation.
>
> I assume you mean something like:
>
> +------+---+---+---+---+---+---+---+---+---+---+---+---+---+
> |length| H | E | L | L | O | , | | W | O | R | L | D | \n|
> +------+---+---+---+---+---+---+---+---+---+---+---+---+---+
>
> Also, to quote K&R: "C was originally designed ... on the DEC PDP-11",
> a machine with no string instructions.


May I encourage people implementing string libraries to use an extra
level of indirection? Instead of length immediately preceding the string,
let length be associated with a pointer to the string. Makes
substringing operations much easier, and has the ability to reduce
unnecessary copies (at the risk of increased aliasing).

+------+---+
|length|ptr|
+------+---+
         |
         V
         +---+---+---+---+---+---+---+---+---+---+---+---+---+
         | H | E | L | L | O | , |   | W | O | R | L | D | \n|
         +---+---+---+---+---+---+---+---+---+---+---+---+---+

Machine instructions should not mandate the layout of strings in memory.
They should, instead, require length and start to be preloaded in registers
(where the machine'll have to put them anyway).
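
[Not from the posting: a minimal C sketch of the length-plus-pointer
descriptor described above; the names strdesc and substring are made up.
Substringing just builds a new descriptor that aliases the original
buffer, with no copy.]

    struct strdesc {
        unsigned    len;    /* number of characters           */
        const char *ptr;    /* first character, no terminator */
    };

    struct strdesc substring(struct strdesc s, unsigned start, unsigned n)
    {
        struct strdesc r;

        if (start > s.len)
            start = s.len;          /* clamp to the end of the string */
        if (n > s.len - start)
            n = s.len - start;
        r.len = n;
        r.ptr = s.ptr + start;      /* alias, don't copy */
        return r;
    }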

That is, of course, if you *have* string instructions. As I am fond of
pointing out, large integer operations can remove the need for many
string operations (ie. give me a 64 or 128 bit wide bus, and a
"STORE BYTES UNDER MASK" operation, and I don't *want* string operations).

d...@alice.uucp

unread,
Feb 6, 1989, 1:56:52 AM2/6/89
to
The question arose: why does C use a terminating character for
strings instead of a count?

Discussion of the representation of strings in C is not fruitful
unless it is realized that there are no strings in C. There
are character arrays, which serve a similar purpose, but no
strings.

Things very deep in the design of the language, and in the customs
of its use, make strings a mess to add. The intent was that
the behavior of character arrays should be exactly like that
of other arrays, and the hope was that stringish operations
on these character arrays should be convenient enough.

The interplay of pointers and arrays, and the possible representations
for them, place strong constraints on what one can do if one wants
real strings (counted sequences of characters) in the context
of the existing language, in particular if types char* or char[]
are going to be counted strings. In general it is hard to account for
the space in which to put the count, and also to make sure that
it can be updated properly under all operations. For example,
'sizeof' is used for allocation and it is hard to make this
use compatible with a count. Similarly, in practice,
most implementations make 'struct { char s1[3]; char s2[5]; }' say
something about the storage layout that doesn't mix well with
a count.

Given the explicit use of character arrays, and explicit pointers to
sequences of characters, the conventional use of a terminating
marker is hard to avoid. The history of this convention and
of the general array scheme had little to do with the PDP-11; it
was inherited from BCPL and B.

Of course, it is possible to imagine adding a primitive string
type to C, and to put in some useful operations like concatenation,
search, and substring. This would not really be a good idea,
because this new primitive type would continually be at war with
the existing character pointers and arrays. In the context
of C (even with ANSI function prototypes) it would be quite
difficult to make a string type usable in all the places it should
be.

In extensible languages like C++ and of course in languages
in which the notion is designed in from the start, strings are
fine. (However, even in C++, where it is readily possible to define
your own string class, it would take quite a lot of work to
make this class work smoothly with the entire existing library).

In my opinion, C's array/pointer scheme for representation
of character strings has worked out reasonably well, although
it is somewhat clumsy when there are lots of string operations.
I don't think it has been demonstrated that the usual run of
C programs pays an extremely high cost in performance for their
string operations, though doubtless there are counterexamples
for particular machine architectures or particular programs.

Dennis Ritchie
att!research!dmr
d...@research.att.com

Robert Firth

unread,
Feb 6, 1989, 8:33:15 AM2/6/89
to
In article <88...@alice.UUCP> d...@alice.UUCP writes:

["strings" in C]

>Given the explicit use of character arrays, and explicit pointers to
>sequences of characters, the conventional use of a terminating
>marker is hard to avoid. The history of this convention and
>of the general array scheme had little to do with the PDP-11; it
>was inherited from BCPL and B.

A correction here: the C scheme was NOT inherited from BCPL.
BCPL strings are not confused with character arrays; their
implementation is not normally visible to the programmer, and
their semantics are respectably robust.

Much the most common implementation is the one proposed earlier -
have an initial length count followed by exactly that number
of characters. Naturally, all characters are legal, including
NUL.

There are several reasons for the C 'design', but that its
perpetrators didn't know any better isn't one of them.

John Hascall

unread,
Feb 6, 1989, 12:46:00 PM2/6/89
to
In article <88...@alice.UUCP> d...@alice.UUCP writes:
>The question arose: why does C use a terminating character for
>strings instead of a count?

[...discussion of the history of "strings" in C omitted...]


>In my opinion, C's array/pointer scheme for representation
>of character strings has worked out reasonably well, although
>it is somewhat clumsy when there are lots of string operations.

>I don't think it has been demonstrated that the usual run of
>C programs pays an extremely high cost in performance for their
>string operations, though doubtless there are counterexamples
>for particular machine architectures or particular programs.

This is a rather circular argument. This is rather like
saying "I don't think it has been demonstrated that the usual
automobile pays an extremely high cost in performance for its
amphibious operations,..."

Just like most people don't drive their cars across lakes,
most C programs are not string operation intensive.

If I had to cross a lake,
I'd probably use a different tool than a car.
If I had to write a string operation intensive program,
I'd probably use a different tool than C.


John Hascall
ISU Comp Center

Rx: Apply :-) above as needed for pain.

gil...@m.cs.uiuc.edu

unread,
Feb 6, 1989, 10:19:00 PM2/6/89
to

/* Written 12:25 am Feb 5, 1989 by P...@cup.portal.com in m.cs.uiuc.edu:comp.arch */

This deserves a new subject.

Since it was mentioned in the Endian Wars, does anyone know why C uses the
null terminated string rather than an explicit length?

...

- It removes a character from the character set, a source of many C bugs

- All machines I know of that have character string instructions want
the length of the string. This forces the string primitives to first
scan for null, a time wasting operation.

/* End of text from m.cs.uiuc.edu:comp.arch */

First, let me say that string type is a religious issue. I once
worked for a workstation vendor whose main workstation had THREE
different types of strings. Each development group claimed THEIR
strings ran the fastest on the hardware platform. Every package had
about 25 string subroutines, including 5-10 for "converting"
"inferior" formats into "ours".

Second, I was once told that the following C code compiles into 1
instruction (or something amazingly short) on the PDP-11, C's mother
machine:

while (*p++ = *q++);

This is perhaps part of the reason why strings were designed with
null-termination.
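
[Not from the posting: the idiom above is the inner loop of strcpy; spelled
out as a complete function it is just]

    char *my_strcpy(char *p, const char *q)
    {
        char *save = p;
        while ((*p++ = *q++) != '\0')
            ;                       /* copy up to and including the NUL */
        return save;
    }

[Robert Firth shows the two-instruction PDP-11 code for this loop further
down the thread.]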


Don Gillies {uiucdcs!gillies} U of Illinois

wsm...@m.cs.uiuc.edu

unread,
Feb 6, 1989, 11:23:00 PM2/6/89
to

On using machine string instructions for performing things such
as index() that need the length of the string ahead of time...

On the 80x86 family of chips, I believe it is possible for a routine to
beat the fancy string instructions with clever coding that does not need
the length of the string, so that the index() subroutine does not average
O(length) but instead much less than that when the pattern is near the
beginning. Certainly this was also true on some of the low end VAXes,
because the special string instructions were executed by software.
Just because they give you a special instruction doesn't mean you have to
use it.

Bill Smith
wsm...@cs.uiuc.edu
uiucdcs!wsmith

Jim Haynes

unread,
Feb 6, 1989, 11:48:12 PM2/6/89
to
In article <88...@alice.UUCP> d...@alice.UUCP writes:
>The question arose: why does C use a terminating character for
>strings instead of a count?
>
Since dmr has spoken I probably shouldn't even think of trying to
add anything, but ...

If we think of machine architecture there are historically several
different ways that have been used to represent strings.
1) Put the characters in adjacent storage locations, and use a special
character to delimit the end of the string. This goes back to at least
IBM 702 or thereabouts. It is basically what is used in C; IBM just
happened to use a character other than null.
2) Put the characters in adjacent storage locations, and reserve a bit
per storage location to delimit string boundaries. This is found in
IBM 1401-1410-7010 family and 1620.
3) Put the characters in adjacent storage locations, don't delimit the
boundaries at all, and absorb the information about string length into
the instruction stream. This is what's done in IBM 360 and a lot of
other machines. In contrast to 1) and 2) it requires something like a
simulation of 4) below to deal with varying length strings and dynamic
allocation of storage.
4) Put the characters in adjacent storage locations and store information
about starting address and length in a data structure for the purpose,
a string descriptor. This is used in Burroughs B6500 and later
machines of that family. The string is virtually delimited, in that
you can't access it, or aren't supposed to, without going through
the descriptor and observing the boundary.

What's the difference between a string and an array of characters?
Is it anything other than the set of operations that are provided
to operate on it?
hay...@ucscc.ucsc.edu
hay...@ucscc.bitnet
..ucbvax!ucscc!haynes

"Any clod can have the facts, but having opinions is an Art."
Charles McCabe, San Francisco Chronicle

d...@alice.uucp

unread,
Feb 7, 1989, 2:57:35 AM2/7/89
to
Robert Firth justifiably corrects my misstatement about
BCPL strings; they were indeed counted. I evidently edited
my memory.

Perhaps he or someone else can discuss authoritatively how they
fitted into the language and (to recall the original topic) how string
instructions might help in processing. More nearly correct
memory--which applies only to 20-year-old implementations
on the IBM 7090 and GE 635/645--says that strings existed
in two forms: a one-byte count followed by the characters
of the string, and an expanded form in which the count
and the characters were blown out into words.

In these implementations, bytes were 9 bits, and so the compact
representation limited strings to length 511. On the
other hand, the blown-out form occupied lots of space and
wasn't suitable for transferring directly to text files.
Originally, I believe, there were no operators for accessing
individual characters in a compact string, though they were
added later.

Dennis Ritchie
att!research!dmr
d...@research.att.com

Robert Claeson

unread,
Feb 7, 1989, 3:30:07 AM2/7/89
to

> > - All machines I know of that have character string instructions want
> > the length of the string. This forces the string primitives to first
> > scan for null, a time wasting operation.

> I assume you mean something like:

> +------+---+---+---+---+---+---+---+---+---+---+---+---+---+
> |length| H | E | L | L | O | , | | W | O | R | L | D | \n|
> +------+---+---+---+---+---+---+---+---+---+---+---+---+---+

> but, what size would you use for "length", a byte? a word? a longword?

A 16-bit word, just to remain compatible with all the 16-bit machines
out there.

> I suspect that some of these machines' instructions expect different
> sized operands for the length.

Some expect 16 bits, some expect 32 bits, and a few (such as the old
Z-80) expect 8 bits.

> Also, to quote K&R: "C was originally designed ... on the DEC PDP-11",
> a machine with no string instructions.

Well, what can we learn from this?
--
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
"No problems." -- Alf
Tel: +46 758-202 50 EUnet: rcla...@ERBE.SE uucp: uunet!erbe.se!rclaeson
Fax: +46 758-197 20 Internet: rcla...@ERBE.SE BITNET: rcla...@ERBE.SE

Robert Colwell

unread,
Feb 7, 1989, 8:45:02 AM2/7/89
to
In article <7...@atanasoff.cs.iastate.edu> has...@atanasoff.cs.iastate.edu (John Hascall) writes:
=In article <88...@alice.UUCP= d...@alice.UUCP writes:
==The question arose: why does C use a terminating character for
==strings instead of a count?
==In my opinion, C's array/pointer scheme for representation
==of character strings has worked out reasonably well, although
==it is somewhat clumsy when there are lots of string operations.
=
==I don't think it has been demonstrated that the usual run of
==C programs pays an extremely high cost in performance for their
==string operations, though doubtless there are counterexamples
==for particular machine architectures or particular programs.
=
= This is a rather circular argument. This is rather like
= saying "I don't think it has been demonstrated that the usual
= automobile pays an extremely high cost in performance for their
= amphibious operations,..."
=
= Just like most people don't drive their cars across lakes,
= most C programs are not string operation intensive.

Dennis's comment *could* have been circular, but I don't think it
was. After all, the Unix OS has lots of places where exceptionally
poor string handling would be obvious very quickly, and there are
several known installations of this OS nowadays...On the other hand,
there's Dhrystone -- if your string handling is poor, your
Dhrystone number may well be pathetic. And Dhrystone is supposed
to be a systems code benchmark (at least to this level of fidelity).

Bob Colwell ..!uunet!mfci!colwell
Multiflow Computer or col...@multiflow.com
175 N. Main St.
Branford, CT 06405 203-488-6090

Robert Sansom

unread,
Feb 7, 1989, 6:08:55 PM2/7/89
to d...@research.att.com
In article <88...@alice.UUCP>, d...@alice.UUCP writes:
> Robert Firth justifiably corrects my misstatement about
> BCPL strings; they were indeed counted. I evidently edited
> my memory.
>
> Perhaps he or someone else can discuss authoritatively how they
> fitted into the language and (to recall the original topic) how string
> instructions might help in processing.

To quote 'BCPL - the language and its compiler' by Martin Richards and Colin
Whitby-Strevens:

'... you can use UNPACKSTRING to lay out a string in a vector one
character to a word, and PACKSTRING to pack it up again. After
unpacking your string, you will discover that the first word contains a
count of the number of characters in the string proper, which starts at
the second word.'

and:

'Exactly how BCPL strings are stored depends, amongst other things, upon
the implementation word size. This dependency is concealed within the
string access procedures GETBYTE and PUTBYTE. The call "GETBYTE(S,I)"
obtains the Ith byte of the string S. By convention, byte 0 contains the
number of characters in the string, which are stored consecutively from
byte. The call "PUTBYTE(S,I,C)" sets the Ith byte of the string S to
contain the character C.'


Robert Sansom (r...@cs.cmu.edu)
School of Computer Science
Carnegie Mellon University
--

Eric S. Raymond

unread,
Feb 7, 1989, 8:37:00 PM2/7/89
to
In article <84...@aw.sei.cmu.edu>, fi...@bd.sei.cmu.edu (Robert Firth) writes:
> In article <88...@alice.UUCP> d...@alice.UUCP writes:
> >The history of this convention and of the general array scheme had little
> >to do with the PDP-11; it was inherited from BCPL and B.
>
> A correction here: the C scheme was NOT inherited from BCPL.

I've seen bonehead idiocy on the net before, but this tops it all -- this takes
the cut-glass flyswatter. Mr. Firth, do you *read* what you're replying to
before you pontificate? Didn't the name `Dennis Ritchie' register in whatever
soggy lump of excrement you're using as a central nervous system? Do you
realize that the person you just incorrectly `corrected' on a point of C's
intellectual antecedents is the *inventor of C himself*!?!

Sheesh. No *wonder* Dennis doesn't post more often.

Next time dmr posts something, I suggest you shut up and listen. Respectfully.
--
Eric S. Raymond (the mad mastermind of TMN-Netnews)
Email: er...@snark.uu.net CompuServe: [72037,2306]
Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718

Larry McVoy

unread,
Feb 8, 1989, 4:07:31 AM2/8/89
to
In article <enj91#24gKdg=er...@snark.uu.net> er...@snark.uu.net (Eric S. Raymond) writes:
$In article <84...@aw.sei.cmu.edu>, fi...@bd.sei.cmu.edu (Robert Firth) writes:
$> In article <88...@alice.UUCP> d...@alice.UUCP writes:
$> >The history of this convention and of the general array scheme had little
$> >to do with the PDP-11; it was inherited from BCPL and B.
$>
$> A correction here: the C scheme was NOT inherited from BCPL.
$
$I've seen bonehead idiocy on the net before, but this tops it all -- this takes
$the cut-glass flyswatter. Mr. Firth, do you *read* what you're replying to
$before you pontificate? Didn't the name `Dennis Ritchie' register in whatever
$soggy lump of excrement you're using as a central nervous system? Do you
$realize that the person you just incorrectly `corrected' on a point of C's
$intellectual antecedents is the *inventor of C himself*!?!
$
$Sheesh. No *wonder* Dennis doesn't post more often.
$
$Next time dmr posts something, I suggest you shut up and listen. Respectfully.
$--
$ Eric S. Raymond (the mad mastermind of TMN-Netnews)
$ Email: er...@snark.uu.net CompuServe: [72037,2306]
$ Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718

Is everyone else laughing as hard as I am?

Eric,

Even the Gods make mistakes, OK? And, although I don't pretend to speak
for Dennis Ritchie or anyone else besides myself, I'd suspect that he'd be the
last person that would want you to "shut up and listen". The whole point of
this newsgroup, and research in general, is to question the obvious, point out
the incorrect. It's called learning. Blind faith is called religion and has
no place in science.

"Sheesh", indeed.

Larry McVoy, Lachman Associates. ...!sun!lm or l...@sun.com

Keith Bierman - Sun Tactical Engineering

unread,
Feb 8, 1989, 4:40:40 AM2/8/89
to
In article <88...@sun.uucp> l...@sun.UUCP (Larry McVoy) writes:
>In article <enj91#24gKdg=er...@snark.uu.net> er...@snark.uu.net (Eric S. Raymond) writes:
>$In article <84...@aw.sei.cmu.edu>, fi...@bd.sei.cmu.edu (Robert Firth) writes:
>$> In article <88...@alice.UUCP> d...@alice.UUCP writes:
>$> >The history of this convention and of the general array scheme had little
>$> >to do with the PDP-11; it was inherited from BCPL and B.
>$>
>$> A correction here: the C scheme was NOT inherited from BCPL.
>$
>$I've seen bonehead idiocy on the net before, ...
>$before you pontificate? Didn't the name `Dennis Ritchie' ...

>$
>$Next time dmr posts something, I suggest you shut up and listen. Respectfully.

>Is everyone else laughing as hard as I am?

Hopefully. It is quite amusing.
>
>...


>for Dennis Ritchie or anyone else besides myself, I'd suspect that he'd be the
>last person that would want you to "shut up and listen". The whole point of
>this newsgroup, and research in general, is to question the obvious, point out
>the incorrect. It's called learning. Blind faith is called religion and has
>no place in science.

In general, I agree with Mr. McV. This is, however, a special case.
The question is (to paraphrase) "What did the inventors of C think
about?" The Principal Inventor sez "bliff". Mr Poster sez "no it was boff".

I do not think it fair to characterize "boff" as a valid hypothesis,
unless the PI had died, and left no notes, or ambiguous ones. Since the
PI is very much alive, and has spoken, contradicting him is a bit out
of _my_ definition of scientific inquiry.


Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

Jeffrey Putnam

unread,
Feb 8, 1989, 6:17:33 AM2/8/89
to
In reference to the C representation of strings.

Note followups to comp.lang.c.

I like the C model for strings. I like it mostly for its simplicity
and ease of use. It may well be that a representation for strings
that includes string length as a part of a structure is better for
efficiency, or more modular or whatever. But! the model used is
simple and introduces no magic into the language.

Magic? Yup. Magic is what happens when the language (or operating system
or hardware) does something odd that is not reachable by the user. This
includes magic strings, magic arrays (arrays stored in the same way - that
is with extra information hidden from the user), magic library calls (like
some VMS calls) and so on.

If language (hardware, os) designers want to do something, they should make
it evident and available to the user - because if they want to do it, the
user probably will as well.

In the string question, adding the string length means that what is passed
may be a magic cookie that the language knows how to use, but that the user
is often denied access to. I have used languages that did lots of magic
(the worst was PL/I) and it was often quite difficult to decide what was
actually happening (in a function call, for example).

The C choice was the simpler one, one with no magic, and the best for the
kind of programming that C encourages. Further, if you want to add
counted strings, it can be done in C easily. I believe that I have seen
a counted string library posted to the net - it might be interesting to
see if string handling programs actually run a lot faster with such a
library instead of the standard string functions.
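
[Not from the posting: a sketch of what comparison looks like with such a
counted representation; the struct and function names are made up. With
the lengths known up front there is no NUL scan, and the bulk of the work
goes through memcmp, which is free to compare a word at a time.]

    #include <string.h>

    struct cstr {
        unsigned    len;
        const char *ptr;
    };

    int cstr_cmp(struct cstr a, struct cstr b)
    {
        unsigned n = (a.len < b.len) ? a.len : b.len;
        int r = memcmp(a.ptr, b.ptr, n);    /* compare the common prefix */

        if (r != 0)
            return r;
        return (a.len > b.len) - (a.len < b.len);   /* shorter sorts first */
    }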

jeff putnam -- "Sometimes one must attempt the impossible if only to
je...@pawl.rpi.edu -- show it is merely inadvisable."

John Hascall

unread,
Feb 8, 1989, 9:36:49 AM2/8/89
to

>/* Written 12:25 am Feb 5, 1989 by P...@cup.portal.com in m.cs.uiuc.edu:comp.arch */
>This deserves a new subject.

>Second, I was once told that the following C code compiles into 1
>instruction (or something amazingly short) on the PDP-11, C's mother
>machine:

>while (*p++ = *q++);

>This is perhaps part of the reason why strings were designed with
>null-termination

First, I assume you mean:

char *p,*q;

while (*p++ == *q++);

I can see no way to code this in 1 PDP instruction, the best I can see
is (assume R1 is p, R2 is q):

L: MOVB (R1)+,R0 ; temp <- *p++
CMPB (R2)+,R0 ; compare *q++ to temp
BEQ L ; again if equal

now on a machine with more flexibility in addressing modes (i.e., a
VAX):

L: CMPB (R1)+,(R2)+ ; compare *p++ to *q++
BEQL L ; again if equal

or if we make the assumption that "strings" p and q are less than 64K
long:

CMPC3 #65535,(R1)+,(R2)+ ; while (*p++ == *q++);

now on a MC680x0 (A1,A2):

L: CMPB (A1)+,(A2)+ ; compare *p++ to *q++
DBNE D0,L ; "loop mode", assume D0=LARGE_NUMBER
or
BEQ L ; non-"loop mode"

Are there any machines which have a combined "compare-and-branch"
instruction? Are there any (other) machines which can do this in
1 instruction (with or without assumptions)?

Robert Firth

unread,
Feb 8, 1989, 10:36:16 AM2/8/89
to

>Second, I was once told that the following C code compiles into 1
>instruction (or something amazingly short) on the PDP-11, C's mother
>machine:
>
>while (*p++ = *q++);

It compiles into two instructions. If p and q are in registers R1 and
R2 respectively, the code is

1$: MOVB (R2)+,(R1)+
BNE 1$

The "=" maps onto the MOVB, the "++" maps onto the autoincrement
address mode, the move sets the condition codes for the branch
to test, and the move of the trailing NUL makes the test fail.

This is a neat and beautiful idiom in PDP-11 Assembler. There is,
however, one problem with the equivalent C code: it is incorrect.
After termination of the loop, the variables p and q, though declared
of type 'pointer-to-char', will hold values that do not point to
declared or allocated objects of type 'char'. Should you ever have
the misfortune to port this code to a machine with hardware segmentation,
and automatic segment bounds checking as part of the address arithmetic,
(or be a consultant involved in such a port), you will face this
problem.

Could someone inform me whether the current C standard has fixed this?
The simplest answer I guess is to rule that the address of array[upb+1]
must always be legal; in practice this means the implementation has to
leave dead space at the end of each memory segment.

Hugh LaMaster

unread,
Feb 8, 1989, 12:48:19 PM2/8/89
to
In article <1...@aucsv.UUCP> o...@aucsv.UUCP (Richard Okeefe) writes:

>So _both_ conventions are "natural" in human writing systems.

That is absolutely true.

Nevertheless, it is interesting to note that when we "Westerners" try to
produce a consistent little endian machine, we
always seem to fail. I thought that the ns32000 series had finally
done it, but someone recently pointed out that in one small way it isn't
quite. The fact is, after a typical "Western" education, it seems to be quite
difficult to work with little endian numbers. Just look at the mess
DEC made of the extensions to the PDP-11 formats when they produced the VAX.
So, I still claim that it is easier for almost all Anglo/American/European
folks to use big-endian numbers. For whatever it is worth. But since it only
comes up when reading dumps, my real interest in the subject is VERY
limited. I only want to point out that
a) there is no big efficiency advantage to using little-endian formats, as
some little endians have claimed (as far as I can see, all such claims
made in this newsgroup have been refuted), and
b) there ARE MANY advantages to having all machines the SAME

Now, DEC just turned down the chance to start evolving in the direction
of common formats with their new RISC machine, so, the best we can hope
for now is a common interchange file format that all machines would create
when data interchange is required. I suggest that the time is ripe for
the development of such a standard. One small request - people who are
working on creating such a standard please include 64 bit integer
(and floating point, but that is usual) formats - there are quite a few
uses for 64 bit integers, and some machines that really need access to
integers at least 48 bits long.

--
Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lama...@ames.arc.nasa.gov
Moffett Field, CA 94035
Phone: (415)694-6117

David Chase

unread,
Feb 8, 1989, 2:26:03 PM2/8/89
to
In article <88...@sun.uucp> k...@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:

In article <88...@alice.UUCP> d...@alice.UUCP ["PI" below] writes:
>>$> >The history of this convention and of the general array scheme had little
>>$> >to do with the PDP-11; it was inherited from BCPL and B.

["bliff" below]

(Robert Firth) ["Mr Poster" below] writes:
>>$> A correction here: the C scheme was NOT inherited from BCPL.

["boff" below]

>The question is (to paraphrase) "What did the inventors of C think
>about?" The Principal Inventor sez "bliff". Mr Poster sez "no it was boff".
>
>I do not think it fair to characterize "boff" as a valid hypothesis,
>unless the PI had died, and left no notes, or ambigous ones. Since the
>PI is very much alive, and has spoken, contradicting him is a bit out
>of _my_ definition of scientific inquiry.

Sigh. Nonetheless, he (Dennis Ritchie) probably made a mistake in his
posting. Other Principal Inventors are still alive and also left
unambiguous notes. Also, I'd suggest that Robert Firth knows BCPL
better than Dennis Ritchie. (I'd suggest that *I* know BCPL better
than Dennis Ritchie, too -- I've used it within the last 4 years.)
I'll give references.

From _BCPL -- The language and its compiler_ by Martin Richards and
Colin Whitby-Strevens, 1979
----------------
[PACKSTRING and UNPACKSTRING] "After unpacking your string, you will
discover that the first word contains a count of the number of
characters in the string proper, which starts at the second word.
As an example, we give the library routines WRITES, UNPACKSTRING,
and PACKSTRING:

LET PACKSTRING(V,S) = VALOF
$(
   LET N = V!0 & #XFF              // extract least significant 8 bits
   LET SIZE = N / BYTESPERWORD
   S!SIZE := 0                     // pack out last word with zeroes
   FOR I = 0 TO N DO PUTBYTE(S, I, V!I)
   RESULTIS SIZE
$)
----------------

Note, too, that the zeros in the last word will only appear in those
cases where the bytes packed do not fill out the words in the string
(that is, consider packing a string containing 3 characters).
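For readers who don't speak BCPL, here is a rough C analogue of what
PACKSTRING above does -- a sketch only, assuming 8-bit bytes and 4-byte
words, with v[0] holding the length and v[1]..v[n] one character per word,
as in the BCPL convention:

#define BYTESPERWORD 4   /* assumption: four 8-bit bytes per word */

/* Pack the unpacked string v (length in v[0], one character per word
   after that) into the byte array s, whose byte 0 becomes the count. */
int packstring(const unsigned long v[], unsigned char s[])
{
    unsigned long n = v[0] & 0xFF;          /* length, at most 255      */
    unsigned long size = n / BYTESPERWORD;  /* index of last word used  */
    unsigned long i;

    /* S!SIZE := 0 -- zero the last word so unused trailing bytes are 0 */
    for (i = 0; i < BYTESPERWORD; i++)
        s[size * BYTESPERWORD + i] = 0;

    /* PUTBYTE(S, I, V!I) for I = 0..N; byte 0 is the count itself */
    for (i = 0; i <= n; i++)
        s[i] = (unsigned char)v[i];

    return (int)size;
}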

From "The Portability of the BCPL Compiler" by Martin Richards in
_Software -- Practice and Experience_, volume 1, pp 135-146, 1971.
----------------
Strings are packed in BCPL and the packing is necessarily machine
dependent since it depends strongly on the word and byte sizes of the
object machine. The usual internal representation of a string value
is as a pointer to the first of a set of words holding the length and
packed characters of the string. The zeroth byte is usually justified
to the start of a word and holds the length of the string with
successive bytes holding the characters and padded with zeros (or
possibly spaces) at the end to fill the last word. In order to handle
strings in as machine independent way [sic] as possible packing,
unpacking and writing of strings is done using library routines which
are defined in the machine dependent interface with the operating
system.
----------------

I think it is fair to say that C did NOT inherit its string
representation from BCPL. I wish that some of you people would check
your facts before posting.

Linguistic comparisons belong elsewhere, so I won't make them. As far
as implementation goes, I think it is a mixed bag. Many operations
are "faster" on strings with counts, but if your maximum count is only
255 then everything is pretty fast whether it is counted or
terminated.

You should also check out the string operations on the 360/370 sort of
machines; BCPL was running there (rather well) a very long time ago.
I think that those operations worked on at most 256 characters (and, I
should add, NOT on 0-length strings). It may well be another case of
architecture influencing language design (note that a zero-length BCPL
string actually contains one byte -- the zero count).

David

Tim Olson

unread,
Feb 8, 1989, 2:44:14 PM2/8/89
to
In article <84...@aw.sei.cmu.edu> fi...@bd.sei.cmu.edu (Robert Firth) writes:
| >while (*p++ = *q++);

|
| This is a neat and beautiful idiom in PDP-11 Assembler. There is,
| however, one problem with the equivalent C code: it is incorrect.
| After termination of the loop, the variables p and q, though declared
| of type 'pointer-to-char', will hold values that do not point to
| declared or allocated objects of type 'char'. Should you ever have
| the misfortune to port this code to a machine with hardware segmentation,
| and automatic segment bounds checking as part of the address arithmetic,
| (or be a consultant involved in such a port), you will face this
| problem.
|
| Could someone inform me whether the current C standard has fixed this?
| The simplest answer I guess is to rule that the address of array[upb+1]
| must always be legal; in practice this means the implementation has to
| leave dead space at the end of each memory segment.

That is exactly what is done in the current proposed ANSI C standard;
the address is legal to compute, although dereferencing the address is
undefined. Only a single byte of "dead space" is required for this.
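In other words (a sketch of the rule as described above, not text from the
draft itself):

void one_past_example(void)
{
    char buf[16];
    char *p;

    p = &buf[16];       /* legal to form: one past the last element      */
    p = buf + 16;       /* the same address, also legal                  */
    /* *p = 'x'; */     /* but dereferencing that address is undefined   */
    p = &buf[15];       /* the last real element, of course, is fine     */
    (void)p;            /* keep a picky compiler quiet                   */
}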


-- Tim Olson
Advanced Micro Devices
(t...@crackle.amd.com)

News system

unread,
Feb 8, 1989, 4:03:41 PM2/8/89
to
In article <88...@alice.UUCP> d...@alice.UUCP writes:
> More nearly correct
>memory--which applies only to 20-year-old implementations
>on the IBM 7090 and GE 635/645--says that strings existed
>in two forms: a one-byte count followed by the characters
>of the string, and an expanded form in which the count
>and the characters were blown out into words.
>
>In these implementations, bytes were 9 bits, and so the compact
>representation limited strings to length 511. On the
>other hand, the blown-out form occupied lots of space and
>wasn't suitable for transferring directly to text files.
>Originally, I believe, there were no operators for accessing
>individual characters in a compact string, though they were
>added later.

I cannot speak about the GE 635, but I can speak with some authority about
the IBM 7090. I have before me both the 7090 reference manual and the
microcode that I wrote to emulate the 7090 on the Standard IC4000.

The 7090 provided no instructions for accessing individual characters. The
notion of characters existed only at the I/O interface where 6 characters
(in 6 bit BCD) could be read into a single 36 bit word. The minimum transfer
was two words. I am not aware of any languages made available by IBM that
provided a string data type. In fact the limit of 6 character identifiers in
FORTRAN was due to the fact that 6 characters could be manipulated as a word
on the 7090 (and on the 704, the original "FORTRAN" machine).

IBM did produce a 7040 machine (similar to but not an extension of the 7090).
This machine provided character load and store. But the use of the facility
was limited by the fact that a character address could not be put into an
index register.

I implemented an extended version of the 7090, called the EX02, by Standard
Computer, that did support strings. Because the words had room in them for
two pointers and a character, I represented a string as a linked list with
one character per word. The ends of the string were marked by null pointers.
There were instructions to traverse the linked list and to convert from a
packed array of characters to a string and back again. The language that
took advantage of these strings was called IMPLAN (implementation language).
An interesting feature of the language was that an expression like i*s, where
i was an integer and s was a string, meant the first i characters of s. And
s*i meant the last i characters of the string.

Marv Rubinstein

Brian Thomson

unread,
Feb 8, 1989, 4:29:29 PM2/8/89
to
In article <37...@oliveb.olivetti.com> ch...@Ozona.UUCP (David Chase) writes:
>
>You should also check out the string operations on the 360/370 sort of
>machines; BCPL was running there (rather well) a very long time ago.
>I think that those operations worked on at most 256 characters (and, I
>should add, NOT on 0-length strings). It may well be another case of
>architecture influencing language design (note that a zero-length BCPL
>string actually contains one byte -- the zero count).
>

But the 360 implementation was not the first implementation. It
is probably more correct to say that the maximum string length was
implementation-dependent, but tended to be 255 because many machines
have 8-bit bytes.

Also, I don't remember the 360 code generator producing any string
instructions, although it certainly could be persuaded to produce
byte (character) instructions. By that time, the language had been
extended with the packed string selector operator "%", so

GETBYTE(S, I) was equivalent to the (inline) S%I
and PUTBYTE(S, I, C) to S%I := C

--
Brian Thomson, CSRI Univ. of Toronto
utcsri!uthub!thomson, tho...@hub.toronto.edu

Charles Bryant

unread,
Feb 8, 1989, 4:51:13 PM2/8/89
to
In article <3...@bilver.UUCP> bi...@bilver.UUCP (bill vermillion) writes:
>In article <1...@aucsv.UUCP> o...@aucsv.UUCP (Richard Okeefe) writes:
>>Before arguing about whether big-endian order or little-endian
>>order is "more natural" for people,

>>
>>So _both_ conventions are "natural" in human writing systems.
>
>And mixed conventions are considered normal in spoken English.
>
>Consider that, for example twenty-five or thirty-six would fit the
>"big-endian" defintion, the numbers thir-teen, four-teen, would be
>considered "little endian"

"Twenty-five" etc do not fit into _either_ big-or little- endian categories.
Nor do most numbers in English because it isn't a positional system. It is
possible to speak it backwards without losing meaning: five [and] twenty.
(five twenty is a time!). It is spoken with the most significant part first
to allow an estimate of the size to be made easily, I suppose, and obviously
if the number is spoken without qualifiers like "hundred" this is impossible.
Numbers for computers are always (as far as I know) given as a fixed-size
object (in programs) or as a string of digits (most I/O), where it is either
unnecessary or impossible to estimate the magnitude of a number without
having it all.
--

Charles Bryant.
Working at Datacode Electronics Ltd.

Eric Koldinger

unread,
Feb 8, 1989, 6:16:25 PM2/8/89
to
>In article <330...@m.cs.uiuc.edu> gil...@m.cs.uiuc.edu writes:
>
>>instruction (or something amazingly short) on the PDP-11, C's mother
>>machine:
>
>>while (*p++ = *q++);
>

>>This is perhaps part of the reason why strings were designed with
>>null-termination
>
> First, I assume you mean:
>
> char *p,*q;
> while (*p++ == *q++);
>
> I can see no way to code this in 1 PDP instruction, the best I can see
> is (assume R1 is p, R2 is q):

Piece of cake:
For the loop as he put it, with your assumptions:
loop:   movb    (r2)+, (r1)+
        bne     loop

Two instructions for the loop. The loop as you wrote it (which doesn't really
do very much, but here goes), again with your assumptions:
loop:   cmpb    (r2)+, (r1)+
        beq     loop

God it's good to write some PDP-11 code again. Now there was a great machine
even if it did only have 64K of memory.

------
_ /| Eric Koldinger
\`o_O' University of Washington
( ) "Gag Ack Barf" Department of Computer Science
U kol...@cs.washington.edu

Barry Margolin

unread,
Feb 8, 1989, 7:03:47 PM2/8/89
to
In article <88...@sun.uucp> k...@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
]In article <88...@sun.uucp> l...@sun.UUCP (Larry McVoy) writes:
]>In article <enj91#24gKdg=er...@snark.uu.net> er...@snark.uu.net (Eric S. Raymond) writes:
]>$Next time dmr posts something, I suggest you shut up and listen. Respectfully.

]
]>Is everyone else laughing as hard as I am?
]Since the PI is very much alive, and has spoken, contradicting him is
]a bit out of _my_ definition of scientific inquiry.

BUT -- DMR later replied and said that the person contradicting him
was RIGHT!

Even the great god :-) Ritchie is permitted to have a memory lapse.

Barry Margolin
Thinking Machines Corp.

bar...@think.com
{uunet,harvard}!think!barmar

Keith Bierman - Sun Tactical Engineering

unread,
Feb 8, 1989, 9:47:12 PM2/8/89
to
In article <37...@oliveb.olivetti.com> ch...@Ozona.UUCP (David Chase) writes:
>....

>Sigh. Nonetheless, he (Dennis Ritchie) probably made a mistake in his
>posting. Other Principal Inventors are still alive and also left
>unambiguous notes. Also, I'd suggest that Robert Firth knows BCPL
>better than Dennis Ritchie. (I'd suggest that *I* know BCPL better
>than Dennis Ritchie, too -- I've used it within the last 4 years.)
>I'll give references.
>


It was pointed out to me (private mail) that what made the situation
so funny was that dmr recanted, and that then someone came along and
castigated Firth.

I had failed to keep all the threads in mind. I do not claim to know
BCPL, and I don't claim to be expert on dmr's thoughts.

As I saw it, the argument was what was going on in dmr's
thoughts....and since dmr is very much alive speculating is not very
fruitful.

As it happens, the speculator was right......

Just goes to show.

I hereby repudiate my position. Debating what went on in somebody's
mind a decade+ ago is valid...because the thinker probably doesn't
remember it correctly! :>

khb

Gregory N. Bond

unread,
Feb 8, 1989, 9:59:15 PM2/8/89
to
In article <84...@aw.sei.cmu.edu> fi...@bd.sei.cmu.edu (Robert Firth) writes:
[ Re: while (*p++ = *q++); ]
.This is a neat and beautiful idiom in PDP-11 Assembler. There is,
.however, one problem with the equivalent C code: it is incorrect.
.After termination of the loop, the variables p and q, though declared
.of type 'pointer-to-char', will hold values that do not point to
.declared or allocated objects of type 'char'. Should you ever have
.the misfortune to port this code to a machine with hardware segmentation,
.and automatic segment bounds checking as part of the address arithmetic,
.(or be a consultant involved in such a port), you will face this
.problem.
.
.Could someone inform me whether the current C standard has fixed this?
.The simplest answer I guess is to rule that the address of array[upb+1]
.must always be legal; in practice this means the implementation has to
.leave dead space at the end of each memory segment.

This is only a problem if p or q are dereferenced after the loop. They
are (at least potentially) invalid addresses, but so is NULL. And if
it is legal for p to be NULL, it is legal for it to point nowhere. And
if it is dereferenced, SIGSEGV it, just as with NULL pointers. No need
to fix the ANSI doc, nor to allocate dead space. It's not incorrect
(IMHO!)
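To make that concrete, the idiom wrapped in a function (an illustrative
sketch, not anyone's library source); after the loop both pointers sit
exactly one element past the NUL they just handled:

#include <assert.h>
#include <string.h>

static void copy(char *p, const char *q)
{
    char *p0 = p;
    const char *q0 = q;

    while ((*p++ = *q++) != '\0')
        ;

    /* Each pointer advanced once per character copied, NUL included. */
    assert(p == p0 + strlen(q0) + 1);
    assert(q == q0 + strlen(q0) + 1);
}

If the destination buffer is exactly strlen(source)+1 bytes long, that
final value of p is precisely the one-past-the-end address Robert Firth
was worried about -- fine to hold, undefined to dereference.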

[ No, we don't have comp.lang.c in Australia. Sorry. ]
--
Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia
Internet: g...@melba.bby.oz.au non-MX: gnb%melba....@uunet.uu.net
Uucp: {uunet,mnetor,pyramid,ubc-vision,ukc,mcvax,...}!munnari!melba.bby.oz!gnb

ag...@mcdurb.urbana.gould.com

unread,
Feb 8, 1989, 10:09:00 PM2/8/89
to

>/* Written 3:15 pm Feb 6, 1989 by GQ....@forsythe.stanford.edu in mcdurb:comp.arch */
>->[Me] ag...@mcdurb.Urbana.Gould.COM writes:
>->May I encourage people implementing string libraries to use an extra
>->level of indirection? Instead of length immediately preceding the string,
>->let length be associated with a pointer to the string. Makes
>->substringing operations much easier, and has the ability to reduce
>->unnecessary copies (at the risk of increased aliasing).
>->
>-> +------+---+
>-> |length|ptr|
>-> +------+---+
>-> |
>-> +------+
>-> |
>-> V
>-> +---+---+---+---+---+---+---+---+---+---+---+---+---+
>-> | H | E | L | L | O | , | | W | O | R | L | D | \n|
>-> +---+---+---+---+---+---+---+---+---+---+---+---+---+
>
>Such an implementation has adverse effects when the string is sent
>to/from an external device, such as a file. The 'length' must be
>with the string, or the string needs a terminator character.

If you are sending directly to an output device, I doubt that
your output device accepts your internal format. If you have
to reformat anyway...

Oh, you mean storing data in a file. What's a file? You mean this
memory-mapped object...
Sorry, I don't live in that environment, unfortunately.
Yep, you have to decide either way. For text strings, ASCII files
or binary files are fine by me. Leading counts are fine.
Nothing says that ptr could not point to the very next location.

>what happens to the 'length' information for the old string?

I sure would hope it got changed appropriately!
And I sure would hope that the use was wrapped in a library routine
or macro or C++ type object interface so that nobody ever accessed
the length and ptr explicitly!

Look, null terminated is fine by me, I use it every day. It just has
the embedded null drawback, and the fact that it encourages dumb code,
several examples of which (code that scans the string twice) are on my
list of things to fix real soon now - one is taking up 10% of a loaded
system. And, yes, good coding practices can avoid double scanning, so
all that you're left with is the embedded null problem.
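As a concrete illustration of the substringing point in the quoted scheme --
a sketch with made-up names, not code from any existing library:

struct strdesc {
    unsigned len;   /* number of characters                        */
    char *ptr;      /* first character; need not be NUL terminated */
};

/* Take characters [start, start+len) of s without copying anything.
   The result aliases s -- the "increased aliasing" risk noted above. */
static struct strdesc substr(struct strdesc s, unsigned start, unsigned len)
{
    struct strdesc r;

    if (start > s.len)
        start = s.len;
    if (len > s.len - start)
        len = s.len - start;
    r.len = len;
    r.ptr = s.ptr + start;
    return r;
}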

(Talking about dumb code - has anyone else seen things like

#define TERM_ESCAPE_CODE "\e[foo\0bar"
puts(TERM_ESCAPE_CODE); /* Do escape code magic with terminal? */

particularly in things where the escape code is computed?)
And I sure would never let any oiu

King Su

unread,
Feb 8, 1989, 11:27:19 PM2/8/89
to
In article <21...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
>Now, DEC just turned down the chance to start evolving in the direction
<of common formats with their new RISC machine, so, the best we can hope
>for now is a common interchange file format that all machines would create
<when data interchange is required. I suggest that the time is ripe for
>the development of such a standard. One small request - people who are
<working on creating such a standard please include 64 bit integer
>(and floating point, but that is usual) formats - there are quite a few
<uses for 64 bit integers, and some machines that really need access to
>integers at least 48 bits long.

You mean DEC has finally decided to go big-endian? That is news to me.
The little-endian format is the current dominant format - remember all
the IBM PC's and their clones. To evolve in the direction of a common
format would mean to take the little-endian route. I would say that
the day we have a common format will come a day after the US adopts the
metric system.

The SUN's XDR library has already provided us with a common interchange
file format. It probably does not address 64 bit integers.

--
/*------------------------------------------------------------------------*\
| Wen-King Su wen-...@vlsi.caltech.edu Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/

Robert Firth

unread,
Feb 9, 1989, 8:13:35 AM2/9/89
to
In article <84...@aw.sei.cmu.edu> fi...@bd.sei.cmu.edu (Robert Firth) writes:

| Could someone inform me whether the current C standard has fixed this?
| The simplest answer I guess is to rule that the address of array[upb+1]
| must always be legal; in practice this means the implementation has to
| leave dead space at the end of each memory segment.

In article <24...@amdcad.AMD.COM> t...@amd.com (Tim Olson) writes:

>That is exactly what is done in the current proposed ANSI C standard;
>the address is legal to compute, although dereferencing the address is
>undefined. Only a single byte of "dead space" is required for this.

Thanks, Tim, and others who mailed me. There is a copy of the latest
ANSI C in this building, but it seems to have wandered off, so I can't
look this up for myself readily.

However, is only a single byte required? Suppose you have an array of
a struct; is it legal to compute the address of array[upb+1].component?
If so, then you really do need to allocate a complete dead array element.

Hugh LaMaster

unread,
Feb 9, 1989, 2:57:26 PM2/9/89
to
In article <94...@cit-vax.Caltech.Edu> wen-...@cit-vax.UUCP (Wen-King Su) writes:
>In article <21...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:

>The little-endian format is the current dominate format - remember all
>the IBM PC's and their clones. To evolve in the direction of a common

I did not mean to imply that DEC has decided to go Big OR Little Endian.
DEC's new machine is touted (I haven't seen an architecture
manual, so I can't vouch for the accuracy of it) to have duplicated the
stubbornly Middle Endian formats of the VAX.

Now, why SHOULD DEC build a consistent (big or little endian) data format
machine? I don't know, consistency is the bugaboo of small minds, after
all.

(

Quiz: Figure out, in your head, which byte (offset from the address in
memory) of a DEC F format fp number contains the least significant
bits of the fraction (multiple choice):
a) byte 0
b) byte 1
c) byte 2
d) byte 3
e) I dunno, I never could figure it out. But real programmers don't use f.p.

)


>The SUN's XDR library has already provided us with a common interchange
>file format. It probably does not address 64 bit integers.

I remember seeing a definition of XDR a couple of years ago - but in the
context of RPC. Is there an XDR FILE format definition? NFS certainly
doesn't translate to/from it.

J.BEYER

unread,
Feb 9, 1989, 5:42:52 PM2/9/89
to
In article <22...@ism780c.isc.com>, ne...@ism780c.isc.com (News system) writes:
> The 7090 provided no instructions for accessing individual characters. The
> notion of characters existed only at the I/O interface where 6 characters
> (in 6 bit BCD) could be read into a single 36 bit word. The minimum transfer
> was two words. I am not aware of any languages made available by IBM that
> provided a string data type. In fact the limit of 6 character identifiers in
> FORTRAN was due to the fact that 6 characters could be manipulated as a word
> on the 7090 (and on the 704 the original "FORTRAN" machine).
This may be quibbling, but did not the Convert By Replacement From MQ
and similar instructions deal with 6-bit characters on the 7090? I know
they were not in the 704. Perhaps they did not get there until the 7094.
But they were not full-fledged character manipulation primitives, I agree.
It has been at least 20 years since I programmed one of those.

As I recall, the GE635 and 645 did have ways to address either 6 or 9 bit
characters (programmer's choice), using clever addressing and tally modes.
I am even more hazy about that machine, though.


--
Jean-David Beyer
A.T.&T., Holmdel, New Jersey, 07733
houxs!beyer

Tim Olson

unread,
Feb 9, 1989, 5:59:11 PM2/9/89
to

It is not legal to compute the address of array[upb+1].component (at
least if component has a non-zero offset from the beginning of the
structure).
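A sketch of the distinction, with a made-up struct:

struct rec { int key; char name[8]; };

void example(void)
{
    struct rec table[10];
    struct rec *p;
    int *ip;
    char *cp;

    p  = &table[10];            /* fine: one past the end of the array      */
    ip = &table[9].key;         /* fine: a member of the last real element  */
    /* ip = &table[10].key; */  /* offset zero -- the hedged case above     */
    /* cp = table[10].name; */  /* non-zero offset -- not legal             */
    (void)p; (void)ip; (void)cp;
}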

Peter da Silva

unread,
Feb 9, 1989, 6:32:47 PM2/9/89
to
In article <330...@m.cs.uiuc.edu>, gil...@m.cs.uiuc.edu writes:
> while (*p++ = *q++);

p, q in registers:

tst (rq)
beq pool
loop: movb (rq)+,(rp)+
bne loop
pool:

> len = *p++ = *q++;
> while(len-->0)
> *p++ = *q++;

movb (rq)+,rtemp
movb rtemp,(rp)+
beq pool
loop: movb (rq)+,(rp)+
sob rtemp,loop ; not at all sure of the syntax here
pool:

Two instructions for the loop, either way. But the former is more likely
to be implemented by a dumb compiler... what did Ritchie's compiler do
for it with p and q in registers?
--
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Work: uunet.uu.net!ficc!peter, pe...@ficc.uu.net, +1 713 274 5180. `-_-'
Home: bigtex!texbell!sugar!peter, pe...@sugar.uu.net. 'U`
Opinions may not represent the policies of FICC or the Xenix Support group.

Sean Fagan

unread,
Feb 10, 1989, 2:19:16 AM2/10/89
to
In article <4...@maxim.ERBE.SE> p...@maxim.ERBE.SE (Robert Claeson) writes:
>> but, what size would you use for "length", a byte? a word? a longword?
>A 16-bit word, just to remain compatible with all the 16-bit machines
>out there.

You would limit all those who are on VAXen, 68k's, 32k's (both WE and NS),
Sparc's, 88k's, CDC Cybers (180 state), Cray's (1, 2, 3, X and Y MP), ARM's,
29k's, Elxsi's, ad nauseam, just to retain backwards compatibility?

If you *must* use a <length, pointer> combination, make the <length>
attribute the same size as a normal char *. However, I've gotten along
quite well without them, as have thousands of other people.

--
Sean Eric Fagan | "What the caterpillar calls the end of the world,
se...@sco.UUCP | the master calls a butterfly." -- Richard Bach
(408) 458-1422 | Any opinions expressed are my own, not my employers'.

Guy Harris

unread,
Feb 10, 1989, 4:00:49 AM2/10/89
to

>The SUN's XDR library has already provided us with a common interchange
>file format. It probably does not address 64 bit integers.

Nope, it does. The datatype names are "hyper" and "unsigned hyper".

XDR is big-endian, BTW. (So is the 68K; this may or may not be a
coincidence :-).)

Guy Harris

unread,
Feb 10, 1989, 4:20:51 AM2/10/89
to
>I remember seeing a definition of XDR a couple of years ago - but in the
>context of RPC.

XDR's spec tends to be bundled with RPC's spec, but XDR is a data
representation format that can be used on a spinning oxide coated
platter, or a strip of oxide-coated plastic, just as it can be used on a
wire. The UNIX XDR implementation from Sun can stuff XDR'ed data into
memory or pick it up from memory, or write it to a standard I/O stream
or read it from a standard I/O stream. You can, in fact, define your
own XDR stream implementation types, if the canned ones that come with
the user-mode RPC library (standard I/O, memory, and "record stream" -
can be used over a TCP connection, or into and out of a file, or....)
won't do what you want.
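For instance, a minimal sketch of XDR-ing a couple of values onto a
standard I/O stream (this assumes the usual Sun RPC headers and trims all
error handling; roughly the same code run with XDR_DECODE reads the values
back on any machine):

#include <stdio.h>
#include <rpc/rpc.h>            /* Sun's XDR/RPC library */

/* Encode a long and a NUL-terminated string onto fp in XDR form. */
int save_sample(FILE *fp, long value, char *name)
{
    XDR xdrs;
    int ok;

    xdrstdio_create(&xdrs, fp, XDR_ENCODE);
    ok = xdr_long(&xdrs, &value) && xdr_string(&xdrs, &name, 255);
    xdr_destroy(&xdrs);
    return ok;
}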

>Is there an XDR FILE format definition?

Well, there are the formats generated and read by the standard I/O and
"record stream" XDR stream types. I suspect the standard I/O format
consists of a sequence of the encoded objects, shoved to the file as
bytes in the format specified by the XDR spec. The "record stream"
format is probably similar, but with some form of "record marks" in the
stream (the documentation claims the record marking mechanism is
described in "Advanced Topics", but it's not described there in the
ONC/NFS Protocol Specifications and Services Manual).

>NFS certainly doesn't translate to/from it.

No, and I doubt it could do so, given that many file formats are not
self-describing. If you want to maintain a file over NFS that's
readable and writable by clients with different "native" data
representations, you're probably best off using XDR in some form to
write the data out (either XDR into memory and write/read and XDR from
memory, or use XDR over standard I/O, or XDR over "record stream", or...).

Darren New

unread,
Feb 10, 1989, 9:59:12 AM2/10/89
to
In article <21...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <94...@cit-vax.Caltech.Edu> wen-...@cit-vax.UUCP (Wen-King Su) writes:
>>In article <21...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>The SUN's XDR library has already provided us with a common interchange
>>file format. It probably does not address 64 bit integers.
>
>I remember seeing a definition of XDR a couple of years ago - but in the
>context of RPC. Is there an XDR FILE format definition? NFS certainly
>doesn't translate to/from it.
>
If you are really looking for a COMMON interchange format, ASN.1 is the way to
go. XDR is nice when you are working with C, but defining something like a font
in a way that is machine independent requires much extra information in terms
of order of bits in a byte and so on. XDR is also not self-delimiting, does
not handle time or alternate character sets well (last I looked), and cannot
be parsed without knowledge of the XDR functions used to encode the higher-level
constructs. ASN.1, having been standardized by ISO (I can find the number if
anyone wants it), is available world-wide. ASN.1 also addresses integers, strings,
etc as big as you want. Unfortunately, it does not at this time have a single,
standard floating-point format, but this is easy to add on an application
basis. This topic seems to be wandering from comp.arch somewhat...
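Before it wanders entirely, a toy sketch of the self-delimiting flavour:
under ASN.1's basic encoding rules every value carries a tag octet and a
length field, so a small non-negative INTEGER comes out as three bytes.
(Illustrative only -- a real encoder handles longer and negative values too.)

/* Toy BER encoding of an INTEGER in the range 0..127:
   tag 0x02 (INTEGER), length 0x01, then one content octet. */
static int ber_encode_small_int(unsigned char *buf, int value)
{
    buf[0] = 0x02;                  /* universal tag: INTEGER      */
    buf[1] = 0x01;                  /* definite length: one octet  */
    buf[2] = (unsigned char)value;  /* content, two's complement   */
    return 3;                       /* bytes written               */
}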
- Darren

Hugh LaMaster

unread,
Feb 10, 1989, 4:01:11 PM2/10/89
to
In article <84...@louie.udel.EDU> n...@udel.EDU (Darren New) writes:

>If you are really looking for a COMMON interchange format, ASN.1 is the way to
>go. XDR is nice when you are working with C, but defining something like a font

>etc as big as you want. Unfortunately, it does not at this time have a single,


>standard floating-point format, but this is easy to add on an application

But floating point IS my problem. Yes, it IS easy to add on an application
basis, just not on a hundreds-of-applications basis. (Text data, likewise, is
easy to convert between different character sets on an application basis.
But when you want to create text files that can be read by a large number of
applications you are going to write in the future, you use ASCII. In the
future, perhaps you will be able to use ASN.1 data streams.) It also
can be expensive, even though it is "easy", when you need to translate large
quantities of data between different data types. (End of sermon :-) )

Anyway, does ASN.1 define some kind of file structure? (Since this is USENET,
we won't use the R-word (a logical r*c*rd for you old-timers over thirty)).
Are the data types defined in the structure somewhere, so that a conversion
program can figure out what it is converting from/to? Is there a well
defined library with C and Fortran bindings that an applications programmer
can use to read and write ASN.1 files with? Will the cost of using ASN.1
structured data approach zero as the structures become large (arrays of 10000
floating point numbers, for example)? If the answer is yes to all these
questions, I would like to know more.

News system

unread,
Feb 10, 1989, 9:32:16 PM2/10/89
to
[M Rubinstein]

>> The 7090 provided no instructions for accessing individual characters.

[J BEYER]


>This may be quibbling, but did not the Convert By Replacement From MQ
>and similar instructions deal with 6-bit characters on the 7090? I know
>they were not in the 704. Perhaps they did not get there until the 7094.
>But they were not full-fledged character manipulation primitives, I agree.
>It has been at least 20 years since I programmed one of those.

[M Rubinstein]
J BEYER is right about convert instructions on the 709/7090/7094 operating on
6 bit fields. However the fields accessed were fields in a register. There
were no instructions for accessing 6 bit fields from memory. There were
instructions for accessing two 15 bit fields from memory. The fields were
called the address and decrement. It is my understanding that the Lisp names
CAR and CDR come from the 15 bit field names. CAR==contents of address
register, and CDR==contents of decrement register.

Marv Rubinstein

Chris Torek

unread,
Feb 10, 1989, 10:58:42 PM2/10/89
to
More or less incidentally: it is easy to build counted strings out
of C strings, if you prefer counts. For instance:

typedef struct string {
    int len;
    char *str;
} string;

#define STRING(s) { sizeof(s) - 1, s }

string foo = STRING("foo");

In pANS C this works for automatic variables as well as statics.
These strings are not normal expressions, though, and cannot be
anonymous---that is,

result = some_string_function(str, STRING("lima beans"))

is illegal (though gcc has an extension that will do it).
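A possible consumer of such a struct, just to show the shape of the calling
code (a sketch; Chris's declarations are repeated so the fragment stands
alone):

#include <stdio.h>

typedef struct string {
    int len;
    char *str;
} string;

#define STRING(s) { sizeof(s) - 1, s }

/* Print a counted string; embedded NULs are no problem here. */
static void put_string(string s)
{
    fwrite(s.str, 1, (size_t)s.len, stdout);
}

int main(void)
{
    static string greeting = STRING("hello, world\n");

    put_string(greeting);
    return 0;
}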
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: ch...@mimsy.umd.edu Path: uunet!mimsy!chris

Alex Colvin

unread,
Feb 11, 1989, 9:53:54 AM2/11/89
to
In article <62...@saturn.ucsc.edu>, hay...@ucscc.UCSC.EDU (Jim Haynes) writes:
...<history of string representations>...

> What's the difference between a string and an array of characters?
> Is it anything other than the set of operations that are provided
> to operate on it?

In the indirect representation (via length & pointer) any substring is also
a string.

In the end-delimited representation (as in C) only a tail substring is a
string. This still allows you to consume a string from start to end, but
makes it difficult to pull a string off the start.

In both representations, catenation is messy. That's when we turn to
buffer chains.
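To put the tail-substring point in C terms, a trivial sketch:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char s[] = "hello, world";
    char *tail;
    char head[6];

    /* A tail comes for free: a pointer into the middle of s is still a
       perfectly good NUL-terminated string. */
    tail = s + 7;               /* "world" */
    printf("%s\n", tail);

    /* A leading substring is not a string until you copy it out (or
       temporarily overwrite one of s's characters with a NUL). */
    memcpy(head, s, 5);
    head[5] = '\0';             /* "hello" */
    printf("%s\n", head);
    return 0;
}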

William E. Davidsen Jr

unread,
Feb 13, 1989, 9:47:29 AM2/13/89
to
In article <88...@alice.UUCP> d...@alice.UUCP writes:

>Originally, I believe, there were no operators for accessing
>individual characters in a compact string, though they were
>added later.

The GE line had support for true string descriptors as early as the
625 (~1962?). There was a "tally word" mechanism which consisted of a
starting address (word), a starting address (char in word), a count for
string size (12 bits), and a bit for 6 or 9 bit bytes.

This could be used by instructions with modifiers, as below:
    LDA  STR1,CI   ; Character indirect
    STA  STR2,SC   ; store, move the starting address
                   ;   forward one byte, and decrement the
                   ;   size by one.
    LDQ  STR3,SCR  ; Move the address back one, etc.

While the 600 line didn't have byte addressing, they did have the
tally mechanism, which could emulate a stack in 6 or 9 bit bytes, or any
number of words up to (I believe) 64 (might be 63).

Perhaps direct byte addressing was added when the 68/xxx series replaced
the 645; I got away from Multics about the end of the 645 era, when GE
decided to sell the computer operation and put the money into nuclear
development, since computers would never be a mass-produced item.

--
bill davidsen (we...@ge-crd.arpa)
{uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

Richard Childers

unread,
Feb 13, 1989, 3:27:49 PM2/13/89
to
has...@atanasoff.cs.iastate.edu (John Hascall) writes:

>In article <88...@alice.UUCP> d...@alice.UUCP writes:

>>I don't think it has been demonstrated that the usual run of
>>C programs pays an extremely high cost in performance for their
>>string operations, though doubtless there are counterexamples
>>for particular machine architectures or particular programs.

> This is a rather circular argument. This is rather like
> saying "I don't think it has been demonstrated that the usual
> automobile pays an extremely high cost in performance for their
> amphibious operations,..."

It's a lot easier to criticize two decades after the fact, to say, 'You
oughta do this ... why didn't you do THAT ?' instead of merely accepting
it as a foible of whichever language is under discussion.

I think what Mssr. Ritchie was trying to say was something along the lines
of, "OK, we messed up ... we tried for an elegant solution, treating text
as an array of characters. It was abstract, it was clean, we thought it'd
fly. It did. Not everybody likes it ... but it's not going to change, as
there are some subterranean assumptions that it would be wise to take into
account before this conversation goes much further."

>John Hascall
>ISU Comp Center
>Rx: Apply :-) above as needed for pain.

-- richard


--
* "Do not look at my outward shape, but take what is in my hand." *
* -- Jalaludin Rumi, 1107-1173 *
* ..{amdahl|decwrl|octopus|pyramid|ucbvax}!avsd.UUCP!childers@tycho *
* AMPEX Corporation - Audio-Visual Systems Division, R & D *

Richard Childers

unread,
Feb 13, 1989, 3:35:41 PM2/13/89
to
er...@snark.uu.net (Eric S. Raymond) writes:

>I've seen bonehead idiocy on the net before, but this tops it all --"

Indeed, it does.

>Next time dmr posts something, I suggest you shut up and listen. Respectfully.

Gag me with mindless worshipping twits. Ride on someone else's coattails, eh?

> Eric S. Raymond (the mad mastermind of TMN-Netnews)
> Email: er...@snark.uu.net CompuServe: [72037,2306]
> Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718

John Mashey

unread,
Feb 14, 1989, 10:52:07 AM2/14/89
to
In article <6...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
....
>Dennis's comment *could* have been circular, but I don't think it
>was. After all, the Unix OS has lots of places where exceptionally
>poor string handling would be obvious very quickly, and there are
>several known installations of this OS nowadays...On the other hand,
>there's Dhrystone -- if your string handling is poor, your
>Dhrystone number may well be pathetic. And Dhrystone is supposed
>to be a systems code benchmark (at least to this level of fidelity).

As has been discussed before in this group, Dhrystone's str* behavior
seems to differ a bit from "more typical" C programs. Everything I've
ever seen from analyzing C programs backs up what Dennis says.
When we did the first MIPS UNIX port, I wrote all of the str* routines
in assembler; we threw most of them out in favor of the portable C programs,
because the testing time to find the (must have been at least) one bug in the
lot wouldn't have been worth it, according to the program statistics we saw.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR ma...@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Hugh LaMaster

unread,
Feb 14, 1989, 2:40:12 PM2/14/89
to
The Cyber 205 provides an address/length form of descriptor in hardware
which works with all vector data types. Unfortunately, the length field
is only 16 bits long, so the longest string you could have would be
65535 elements. This really hurts on floating point arrays, though, because now
some arrays are that big. (In the late 60's, when the ISA was defined,
512K of Core Memory was a BIG NUMBER.) The hardware descriptor vehicle
is excellent, in my opinion. The lesson: pick a big number and then
at least double the number of bits.

Hugh LaMaster

unread,
Feb 14, 1989, 2:57:32 PM2/14/89
to
In article <21...@ames.arc.nasa.gov> lama...@ames.arc.nasa.gov (Hugh LaMaster) writes:
>512K of Core Memory was a BIG NUMBER.) The hardware descriptor vehicle

That is 512 KWords (4MB) of Core Memory.

John McCalpin

unread,
Feb 15, 1989, 9:28:02 AM2/15/89
to
In article <21...@ames.arc.nasa.gov> lamaster@ames (Hugh LaMaster) writes:
>The Cyber 205 provides an address/length form of descriptor in hardware
>which works with all vector data types. Unfortunately, the length field
>is only 16 bits long, so the longest string you could have would be of
>65535. This really hurts on floating point arrays, though, because now
>some arrays are that big. (In the late 60's, when the ISA was defined,
>512K of Core Memory was a BIG NUMBER.)
> Hugh LaMaster, m/s 233-9, UUCP ames!lamaster

The decision to use a 16-bit length field might be repeated if the
instruction set were redesigned today. There are a number of tradeoffs
between vector length and paging time on a virtual memory machine which
make longer vector lengths potentially expensive. At the very least,
very long vector lengths might require the addition of a third (larger)
page size for the virtual memory system to deal with.

Current software automatically segments arithmetic loops into
65535-element chunks, similar to the 64-element strip-mining on the
Cray's. The overhead for this on the 205/ETA-10 is negligible. I do
not think that the current compiler knows about the vector instruction
for byte-oriented string-searching, but the user can get at it with an
explicit hardware call.
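Strip-mining is easy to picture at the source level; a hedged sketch of what
the transformation amounts to (MAXVL and the loop shape here are
illustrative, not what the 205 code generator actually emits):

#define MAXVL 65535L    /* illustrative: longest vector a descriptor can express */

/* y[i] += a * x[i] for n elements, processed in hardware-sized chunks. */
void saxpy_stripmined(long n, float a, const float *x, float *y)
{
    long start, chunk, i;

    for (start = 0; start < n; start += chunk) {
        chunk = n - start;
        if (chunk > MAXVL)
            chunk = MAXVL;
        /* on the 205 the compiler would issue one vector instruction
           per chunk; in scalar C it is just an inner loop */
        for (i = 0; i < chunk; i++)
            y[start + i] += a * x[start + i];
    }
}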
---------------------- John D. McCalpin ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mcca...@masig1.ocean.fsu.edu mcca...@nu.cs.fsu.edu
--------------------------------------------------------------------

Alex Colvin

unread,
Feb 15, 1989, 11:46:48 AM2/15/89
to
In article <11...@houxs.ATT.COM>, be...@houxs.ATT.COM (J.BEYER) writes:
> In article <22...@ism780c.isc.com>, ne...@ism780c.isc.com (News system) writes:
> > The 7090 provided no instructions for accessing individual characters. The

> As I recall, the GE635 and 645 did have ways to address either 6 or 9 bit

It did. Unfortunately, there were several incompatible ways,
the new (EIS string, all memory-memory),
the old (tally words, all memory to register)
and the incredibly old (fixed character/byte, memory-register).

I've heard of a cute way to implement character addressing on a 7090 using
one half register for the word address, another half for character offset,
and XECs to choose a load/store, with an XED chain to wrap the character
count into the word address. Almost microcode (cf Clipper macrocode?)

Alex Colvin
old DTSS programmer
