reinventing ASCII?

wildhalcyon

Mar 3, 2008, 8:56:01 AM

Hypothetically, if you could reinvent ASCII in a modern context, what
would you remove or add to the specification?

Unicode looks like it's set to take ASCII's place somewhere down the
line, so this is mostly a thought problem, but I'm curious about what
changes others might implement. I think the first logical change would
be implementing it as 8-bit, rather than 7-bit, to take advantage of a
full byte. After that, maybe remove most of the non-printable
characters. I don't know any modern programs that take advantage of
most of them. What to fill the additional spaces with though, I don't
know.

cr88192

Mar 3, 2008, 2:35:03 PM

"wildhalcyon" <wild.h...@gmail.com> wrote in message
news:0a003284-3ae4-4215...@s13g2000prd.googlegroups.com...

actually, that it only uses 7 bits is nice for UTF-8, since UTF-8 text, for
the ASCII range, is still ASCII (and vice versa...).
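
A minimal sketch in C of why the unused eighth bit matters here: a buffer whose bytes all have the top bit clear is valid ASCII and, byte for byte, valid UTF-8 as well. The function name is_ascii is made up for this example.

```c
#include <stddef.h>

/* Returns 1 if every byte is in the 7-bit range 0x00..0x7F.
   Such a buffer is valid ASCII and also valid UTF-8, unchanged. */
int is_ascii(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)      /* top bit set: not ASCII */
            return 0;
    }
    return 1;
}
```

In UTF-8 every byte of a multi-byte sequence has the top bit set, so this one test is enough to tell the two cases apart.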

there is also a full 8-bit alternative to ASCII, and it is EBCDIC...

I think I have also used some of the control-character space before, not as
control characters, but for embedding special commands in a "pseudo-text"
format, if I remember correctly. it was one or another of my odd varieties
of self-compressed XML, I think.

another possibility, would be a purely-textual variety, which could well
make more sense (no need for special handling and would be directly
human-decodable). of course, the much more popular/common option, is to take
XML and compress it with deflate or similar (pro: commonly agreeable, con:
does not allow optimizing the decoding process...).

or such...

Rod Pemberton

Mar 3, 2008, 3:13:13 PM

"wildhalcyon" <wild.h...@gmail.com> wrote in message
news:0a003284-3ae4-4215...@s13g2000prd.googlegroups.com...

> Hypothetically, if you could reinvent ASCII in a modern context, what
> would you remove or add to the specification?
>

I'd add any characters from the *keyboards* used by other Romance and
Germanic languages that aren't in ASCII.

What about keep? There's much to keep from ASCII, IMO.

C requires the digit characters to be adjacent and sequential, which both
EBCDIC and ASCII satisfy. If EBCDIC hadn't existed, C probably would've
required alphabetic characters to be adjacent and sequential also. It's clear
to me that sequences of upper and lower alpha characters should be
convertible by adding or subtracting an offset, i.e., mathematically related
positions. This is true for EBCDIC and ASCII. I'm not sure whether EBCDIC,
where the alphabetic characters are disjointed, makes it slightly more
difficult to write functions or not. The breaks in the EBCDIC alpha
sequence prevent using simple arithmetic for certain character operations,
thereby requiring use of an array or lookup table, but it's likely one would
have to use an array in such situations anyway. The only situation I
can come up with at the moment which is inconvenient for disjointed EBCDIC
is initializing a character translation table. I don't think anyone wants
to use CBM's PETSCII although it was ASCII derived.

It's clear to me that character sets of 7 or 8 bits are sufficient for
Romance and Germanic languages, but insufficient for pictograph or
hieroglyph based languages. It's also clear to me that using character sets
of more than 7 or 8 bits makes implementing parsers, lexers, etc. much more
complicated - especially if those additional characters are used by the
language being parsed... For 7 or 8-bit characters, you can use simple
arrays to parse, but with a huge character set, say Unicode, you can only
parse easily if using a smaller subset, say the overlapped ASCII area of
Unicode. Can you imagine trying to parse code, perhaps in Chinese, using a
large number of Unicode pictographs?
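
A rough sketch in C of the two techniques mentioned above: case conversion by a fixed offset, and classification through a simple array indexed by character code. All names here (ascii_toupper, init_ident_table, is_ident_char) are invented for the example, and the first function assumes an ASCII execution character set.

```c
#include <limits.h>

/* Assumes ASCII, where the letters are contiguous and 'a' - 'A' == 32.
   On EBCDIC the offset itself still works for a letter, but the range
   test would also match the non-letter codes that sit in the gaps of
   the alphabet. */
int ascii_toupper(int c)
{
    if (c >= 'a' && c <= 'z')
        return c - ('a' - 'A');
    return c;
}

/* Table-driven classification: one array lookup per character, which is
   practical for an 8-bit character set but not for all of Unicode. */
static unsigned char is_ident[UCHAR_MAX + 1];

/* Filling the table by listing the characters avoids any assumption
   about where they sit in the code, so it also works on EBCDIC. */
void init_ident_table(void)
{
    const char *p;
    for (p = "abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             "0123456789_"; *p != '\0'; p++)
        is_ident[(unsigned char)*p] = 1;
}

int is_ident_char(int c)
{
    return c >= 0 && c <= UCHAR_MAX && is_ident[c];
}
```

The table costs 256 bytes for an 8-bit set; the same approach over the full Unicode range would need over a million entries, which is the difficulty described above.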

> Unicode looks like its set to take ASCII's place somewhere down the
> line, so this is mostly a thought problem, but I'm curious about what
> changes others might implement. I think the first logical change would
> be implementing it as 8-bit, rather than 7-bit, to take advantage of a
> full byte.

Is ASCII really 7 bits? Or is it 8 bits? Everyone says it's 7 bits...
and there isn't even a mention on Wikipedia about ASCII as 8 bits. Imagine
that... IIRC, ANSI X3.4-1986 defines two character sets: standard ASCII as
values from 0 to 127 and extended ASCII as values from 128 to 255 - the
latter are platform specific. ANSI X3.32-1973 also defines graphic
representations for the standard ASCII control characters. I'm sure the
standardized, non-platform specific, part of ASCII only uses 7 bits, but one
would need to pull out the spec. to see if ASCII is actually _defined_ as 7
or 8 bits...

> After that, maybe remove most of the non-printable
> characters.

Although many of those were for terminal support, many are still required
today for text alignment, C language, control-flow of data, etc. If you are
wanting an application only character set, delete all non-printables except
space, add in the a,e,i,o,u, and other characters from Germanic and Romance
languages that have accents, breves, etc. However, this introduces the same
problem as detecting an upper and lower case letter - you can't easily match
letters because you have numerous versions of each letter, e.g., with
accent, breve, etc on top: هنولaâaa (perhaps a few of those came
through...).

Rod Pemberton

Jacko

Mar 3, 2008, 3:58:14 PM

well the control codes should be thought about. how many are used on
an IO device of today? even cr is html br tag for many people. maybe
they should be given as 1 char representations of common tags.

accents should be control tags.

i understood ascii to be 7 bit

Torben Ægidius Mogensen

Mar 4, 2008, 4:19:06 AM

wildhalcyon <wild.h...@gmail.com> writes:

I agree that most of the non-printable characters should go, but even
more important would be to standardise how newlines work. Different
systems use different combinations of CR and LF characters to denote
newlines, and this is a mess.

I would add diacritical marks (accents, umlauts, cedilla, ...) as
prefix characters that would combine with the following character, so,
for example, å would be represented as °a. If the following character
is a space, the diacritical mark would appear as itself. Underline
could be a diacritical mark, so you can have underlined text.

This does play havoc with lengths of strings, as the number of shown
characters is not the same as the number of bytes in the string, but
you do get a lot of characters without having to have separate ASCII
codes for all possible combinations of diacritical marks and letters.
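
To make the length mismatch concrete, here is a small sketch in C. It assumes, purely for illustration, that the three characters '`', '^' and '~' act as non-advancing prefix marks; neither that choice nor the function names come from any real standard.

```c
#include <stddef.h>

/* Hypothetical: these three characters act as non-advancing prefix marks. */
static int is_prefix_mark(char c)
{
    return c == '`' || c == '^' || c == '~';
}

/* Number of character cells shown. A mark merges with the character
   that follows it (a following space yields the bare mark), so only
   non-mark bytes advance the position. A dangling mark at the very end
   of the string is ignored in this sketch. */
size_t display_length(const char *s)
{
    size_t cells = 0;
    for (; *s != '\0'; s++) {
        if (!is_prefix_mark(*s))
            cells++;
    }
    return cells;
}
```

With this scheme "caf`e" occupies five bytes but only four cells, so any code that pads or truncates text has to call something like display_length rather than rely on the byte count.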

On my old mechanical typewriter, diacritical marks did not advance the
carriage, and bitmapped screens and printers can easily combine two
bitmapped characters, so it would not be a problem for such systems.
More advanced systems could have special characters for common
combinations, so accents could be placed differently on different
letters.

Torben

Eliot Miranda

Mar 4, 2008, 4:18:31 PM

wildhalcyon wrote:
> Hypothetically, if you could reinvent ASCII in a modern context, what
> would you remove or add to the specification?

I'd make '0' - '9' and 'A' - 'Z' contiguous ;)

> Unicode looks like its set to take ASCII's place somewhere down the
> line, so this is mostly a thought problem, but I'm curious about what
> changes others might implement. I think the first logical change would
> be implementing it as 8-bit, rather than 7-bit, to take advantage of a
> full byte. After that, maybe remove most of the non-printable
> characters. I don't know any modern programs that take advantage of
> most of them. What to fill the additional spaces with though, I don't
> know.

--
The surest sign that intelligent life exists elsewhere in
the universe is that none of it has tried to contact us.
                                        Calvin & Hobbes.
--
Eliot ,,,^..^,,, Smalltalk - scene not herd

Charlie Gordon

Mar 4, 2008, 9:58:24 PM

"Eliot Miranda" <eli...@pacbell.net> a écrit dans le message de news:
H0jzj.4477$fX7...@nlpi061.nbdc.sbc.com...

> wildhalcyon wrote:
>> Hypothetically, if you could reinvent ASCII in a modern context, what
>> would you remove or add to the specification?
>
> I'd make '0' - '9' and 'A' - 'Z' contiguous ;)

Do you mean '9' + 1 == 'A' ?

--
Chqrlie.

Message has been deleted

Torben Ægidius Mogensen

Mar 5, 2008, 4:31:40 AM

r...@zedat.fu-berlin.de (Stefan Ram) writes:

> "Charlie Gordon" <ne...@chqrlie.org> writes:
>>>I'd make '0' - '9' and 'A' - 'Z' contiguous ;)
>>Do you mean '9' + 1 == 'A' ?

Probably. And while it would make hexadecimal conversion a bit
simpler, I don't find it really important.

> Some aspects of the original design where:
>
> - Digits and symbols they are usually paired with
> on a keyboard should differ only by one
> bit in order to simplify keyboard design
> (i.e., »1!«, »3#«, »4$«, and so on).

This is certainly not relevant anymore. Besides, the placement of the
symbols on the shifted numeric keys is not the same in all countries.
For example, the standard Danish keyboard layout has '(' and ')' over
8 and 9 instead of over 9 and 0. Some older typewriters don't even
have separate 0 and 1 keys, requiring you to use O and l instead.

> - All (uppercase) letters should reside within
> a single 5-bit block.

Less relevant now than then, but not completely irrelevant.

> - The lowest four bits of a digit should represent
> its value.

This is on par with having A-Z directly after 0-9: It makes text to
number conversion slightly easier, but not by much. I write c-'0'
rather than c&0xf to find the value of a digit, as I find the
intention much clearer. And it doesn't take any longer to do on
modern hardware.
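
For comparison, a short sketch in C of the two digit conversions discussed here, plus the hexadecimal case that a contiguous 0-9/A-Z layout would simplify. The function names are invented for the example; the masked version and the comments about code values assume ASCII.

```c
/* Value of a decimal digit character. The C standard guarantees that
   '0'..'9' are contiguous and increasing, so this is portable. */
int digit_value(char c)
{
    return c - '0';
}

/* Same result by masking, but only where the low four bits of the digit
   encode its value, as in ASCII ('0' is 0x30). */
int digit_value_masked(char c)
{
    return c & 0xf;
}

/* Hex digit value. In ASCII '9' (0x39) and 'A' (0x41) are not adjacent,
   so the letters need their own branch. Returns -1 for non-hex input. */
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}
```

If 'A' followed '9' directly, the second branch of hex_value would fold into the first, which is the small saving being weighed above.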

> - Due to technical reasons, »0000000« and »1111111«
> had to be control codes (NUL and DEL, respectively).
> Therefore, the outer regions of the code were used
> for control codes, while printable characters were
> placed in the middle.

This is almost irrelevant now, except I shouldn't wonder if some
hardware would break if it was changed.

Torben

Edward Feustel

Mar 5, 2008, 6:32:23 AM

On 5 Mar 2008 03:26:51 GMT, r...@zedat.fu-berlin.de (Stefan Ram) wrote:

>"Charlie Gordon" <ne...@chqrlie.org> writes:
>>>I'd make '0' - '9' and 'A' - 'Z' contiguous ;)
>>Do you mean '9' + 1 == 'A' ?
>

> Some aspects of the original design where:
>
> - Digits and symbols they are usually paired with
> on a keyboard should differ only by one
> bit in order to simplify keyboard design
> (i.e., »1!«, »3#«, »4$«, and so on).
>

> - All (uppercase) letters should reside within
> a single 5-bit block.
>

> - The lowest four bits of a digit should represent
> its value.
>

> - Due to technical reasons, »0000000« and »1111111«
> had to be control codes (NUL and DEL, respectively).
> Therefore, the outer regions of the code were used
> for control codes, while printable characters were
> placed in the middle.

I'd like to have Math Symbols i.e., the APL set.

On the other hand, if the character set is a revolutionary change, it
will probably not succeed. For example, the Prime Computer ASCII set
had the parity bit turned on for ALL characters. Management was never
willing to change and adopt standard ASCII characters. There was just
too much retained data that "would have to be converted". Ditto the
Prime Computer Number format.

Ed

James Harris

Mar 5, 2008, 6:05:37 PM

In the context of this newsgroup - i.e. language design - my belief is
that a language would benefit from independence from the character
representation. In other words I'd say ASCII is not pragmatically
better or worse than any other representation EXCEPT I would say that
the non-printable characters are a pain.

Which non-printable characters are still needed? How about

0 NUL
4 EOT (end of transmission)
9 TAB (or field separator)
10 LF (or end of line)

Any more? Of course, only one or two control chars are really needed,
perhaps one to say that the next byte is to be interpreted as a non-
printable (perhaps a control code) and one to introduce a composite
character.

In short, to reply to your post, the thing I'd like to change most in
ASCII is to get rid of most of the non-printables. Pity we can't get
rid of the lot of 'em!

--
James

Torben Ægidius Mogensen

Mar 6, 2008, 5:34:33 AM

James Harris <james.h...@googlemail.com> writes:

> In the context of this newsgroup - i.e. language design - my belief is
> that a language would benefit from independence from the character
> representation. In other words I'd say ASCII is not pragmatically
> better or worse than any other representation EXCEPT I would say that
> the non-printable characters are a pain.
>
> Which non-printable characters are still needed? How about
>
> 0 NUL
> 4 EOT (end of transmission)
> 9 TAB (or field separator)
> 10 LF (or end of line)

Why would we need a null character? I know that in C it is used as
an end-of-string marker, but that is a horrible design decision. If any
single character should mark the end of a string, it should be EOT.

As for TAB, I would standardise the tab positions to be 8 characters
wide. And I would, as you imply, keep a single newline character
instead of the mess of incompatible combinations of CR and LF.
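
With tab stops fixed every 8 columns, the column reached after a TAB is simple arithmetic. A one-function sketch in C (the name next_tab_stop and the zero-based column numbering are choices made for this example):

```c
/* Column reached after a TAB when tab stops are fixed every 8 columns.
   Columns are counted from 0. */
unsigned next_tab_stop(unsigned col)
{
    return (col / 8 + 1) * 8;
}
```

A TAB at column 0 through 7 lands on column 8, one at column 8 through 15 lands on column 16, and so on, with no per-document settings needed.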

> Any more?

Page feed might be useful. At the moment, it also doubles as clear
screen, which is a bit dubious. ASCII was clearly designed for paper
terminals, so it is not really prepared for erasable displays. Hence,
some control codes have been modified to mean different things on
printers and screens. This is a bit of a mess, so this should be
cleaned up.

> Of course, only one or two control chars are really needed,
> perhaps one to say that the next byte is to be interpreted as a non-
> printable (perhaps a control code) and one to introduce a composite
> character.

Yes, that would work. Backspace used to be used for creating
composite characters by moving the carriage back and striking on top
of the previous character, but these days it is used to delete the
previous character. So a control code that explicitly merges the
following two characters would be better.

Torben

James Harris

Mar 8, 2008, 10:16:34 AM

On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)
wrote:
> James Harris <james.harri...@googlemail.com> writes:
...

> Why would we need a null character? I know that it in C is used as
> end-of-string marker, but that is a horrible design decision. If any
> single character should mark the end of a string, it should be EOT.

My vague inference (which I've never had tested) is that the null
string terminator is a consequence of C's pointer model. C is designed
to allow a pointer to any memory object to be taken. This means C can
depend only on knowing

1. the address
2. the type

of any given object when it is operating on that object. The address
is passed between modules. The type is derived lexically. This has
consequences:

1. strings need a terminating character
2. arrays have no knowledge of bounds when operated upon
3. header files are needed to share other info between compilation
units
4. a number of functions require a count/length to be passed

The array issue is why C /cannot/ detect array bounds infractions. It
cannot depend on knowing more than the address of element zero and the
type of the array (which includes the type of the elements).
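
A small sketch in C of the point about bounds: once an array is passed to a function, only the address of its first element arrives, so the element count has to travel separately. The names sum and COUNT_OF are invented for the example.

```c
#include <stddef.h>

/* Inside the function, `a` is only a pointer to the first element; the
   bound of the caller's array is not available, so it is passed in. */
long sum(const int *a, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* The element count is known only where the array itself is in scope.
   Applying this to a pointer compiles but gives a meaningless result. */
#define COUNT_OF(arr) (sizeof (arr) / sizeof (arr)[0])
```

A caller would write sum(v, COUNT_OF(v)) for a local array v; passing a count that is too large reads past the end, and nothing in the language requires that to be detected.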

I expect the C folks will be able to correct me. Hence I'm copying
them this reply.

Of course, there are other ways strings can be defined, even in C,
with the right library support.

--
James

Morris Dovey

Mar 8, 2008, 9:45:05 AM

James Harris wrote:
>
> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)

> wrote:
> > James Harris <james.harri...@googlemail.com> writes:
> ...
> > Why would we need a null character? I know that it in C is used as
> > end-of-string marker, but that is a horrible design decision. If any
> > single character should mark the end of a string, it should be EOT.

> I expect the C folks will be able to correct me. Hence I'm copying
> them this reply.

The NUL terminator was one of the methods used to terminate
strings before C came along (other methods included setting the
high-order bit of the final character and the '$' as a
terminating character). The selection of NUL was probably an
attempt to choose a "do no harm" option, since some hardware
recognized the EOT and might react in unintended/undesirable
ways, while NUL was generally harmless.
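
The two older conventions mentioned above can be sketched in C as follows. The function names are made up, and the high-bit version assumes 7-bit text so that the top bit is free to act as a flag.

```c
#include <stddef.h>

/* Length of a string that ends at the first '$' (terminator not counted).
   The text itself can never contain a '$'. */
size_t len_dollar(const char *s)
{
    size_t n = 0;
    while (s[n] != '$')
        n++;
    return n;
}

/* Length of a string whose final character has its high-order bit set.
   Only works for 7-bit text, and an empty string cannot be represented. */
size_t len_highbit(const unsigned char *s)
{
    size_t n = 0;
    while ((s[n] & 0x80) == 0)
        n++;
    return n + 1;           /* include the flagged final character */
}
```

Both save nothing over a zero terminator in scanning cost; the differences are in what the text is allowed to contain and in how much storage the end marker takes.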

> Of course, there are other ways strings can be defined, even in C,
> with the right library support.

Stratus VOS was originally written in PL/I and their C compiler
allowed both PL/I strings (where the start-of-string included a
length) and the standard-compliant NUL terminator. (The amount of
confusion produced by using both methods in a single system has
to be experienced to be believed.)

--
Morris Dovey
DeSoto Solar
DeSoto, Iowa USA
http://www.iedu.com/DeSoto

Eric Sosman

Mar 8, 2008, 10:48:58 AM

James Harris wrote:
> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)

> wrote:
>> James Harris <james.harri...@googlemail.com> writes:
> ...
>> Why would we need a null character? I know that it in C is used as
>> end-of-string marker, but that is a horrible design decision. If any
>> single character should mark the end of a string, it should be EOT.

(This reply is really to Mogensen, whose message I saw only
in Harris' quotation.)

Choosing EOT would have raised an interesting problem: What
is the numeric value of this character? (Hint: There are at
least two different answers.) How should strings be terminated
on a system whose character set does not include an EOT? What
about systems that support multiple character encodings, switchable
at run-time, with different values for EOT in each?

It is often said that C's strings are NUL-terminated, but this
is not correct. They are zero-terminated, and the meaning (if any)
of zero in whatever encoding is in use at the moment is unspecified
and irrelevant. The internal representation of a string, like the
internal representation of a double, has nothing to do with its
use in external protocols.

--
Eric Sosman
eso...@ieee-dot-org.invalid

Ben Bacarisse

Mar 8, 2008, 12:22:36 PM

James Harris <james.h...@googlemail.com> writes:

> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)
> wrote:
>> James Harris <james.harri...@googlemail.com> writes:
> ...
>> Why would we need a null character?

<snip>

I am confused... When you comment on this question you start with
some observations:

> My vague inference (which I've never had tested) is that the null
> string terminator is a consequence of C's pointer model. C is designed
> to allow a pointer to any memory object to be taken. This means C can
> depend only on knowing
>
> 1. the address
> 2. the type
>
> of any given object when it is operating on that object. The address
> is passed between modules. The type is derived lexically.

And conclude that:

> This has consequences:
>
> 1. strings need a terminating character

<snip>
but then:

> Of course, there are other ways strings can be defined, even in C,
> with the right library support.

So they need a terminating character unless they are defined some
other way? Surely the question was why were they not defined in one
of these other ways?

I don't think there will be a better answer than "it seemed more
convenient than using a length" to paraphrase Ritchie in
http://cm.bell-labs.com/cm/cs/who/dmr/chist.html

--
Ben.

Bill Gunshannon

Mar 8, 2008, 1:27:46 PM

In article <da33b5f9-4d2f-4c30...@h25g2000hsf.googlegroups.com>,

James Harris <james.h...@googlemail.com> writes:
> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)
> wrote:
>> James Harris <james.harri...@googlemail.com> writes:
> ...
>> Why would we need a null character? I know that it in C is used as
>> end-of-string marker, but that is a horrible design decision. If any
>> single character should mark the end of a string, it should be EOT.
> My vague inference (which I've never had tested) is that the null
> string terminator is a consequence of C's pointer model.

Null-terminated strings pre-date C. Macro-11 had them, and with Unix and
the first C compiler being developed on a PDP-11, that may be where C got
the idea.

> C is designed
> to allow a pointer to any memory object to be taken. This means C can
> depend only on knowing
> 1. the address
> 2. the type
> of any given object when it is operating on that object. The address
> is passed between modules. The type is derived lexically. This has
> consequences:
> 1. strings need a terminating character
> 2. arrays have no knowledge of bounds when operated upon
> 3. header files are needed to share other info between compilation
> units
> 4. a number of functions require a count/length to be passed
> The array issue is why C /cannot/ detect array bounds infractions. It
> cannot depend on knowing more than the addres of element zero and the
> type of the array (which includes the type of the elements).
> I expect the C folks will be able to correct me. Hence I'm copying
> them this reply.
> Of course, there are other ways strings can be defined, even in C,
> with the right library support.

I'm glad you included that last part. Saved me the trouble of pointing
out that the only reason C still uses null terminated strings is momentum.
There are other ways, some of them in use on non-Unix systems. And no
reason why a safer system could not be added to Unix, beyond momentum!

bill

--
Bill Gunshannon | de-moc-ra-cy (di mok' ra see) n. Three wolves
bill...@cs.scranton.edu | and a sheep voting on what's for dinner.
University of Scranton |
Scranton, Pennsylvania | #include <std.disclaimer.h>

Rod Pemberton

Mar 8, 2008, 7:29:00 PM

"Eric Sosman" <eso...@ieee-dot-org.invalid> wrote in message
news:O7WdnQ5Xbd9fKE_a...@comcast.com...

> It is often said that C's strings are NUL-terminated, but this
> is not correct. They are zero-terminated, and the meaning (if any)
> of zero in whatever encoding is in use at the moment is unspecified
> and irrelevant. The internal representation of a string, like the
> internal representation of a double, has nothing to do with its
> use in external protocols.
>

Eric,

zero-terminated?

No. That's incorrect too. C's strings are null-terminated. There is a
difference. They aren't terminated by 32-bit or 36-bit zeros...

7.1.1 sub 1 n1256
"A string is a contiguous sequence of characters terminated by and including
the first null character. ..."

5.2.1 sub 2
"...A byte with all bits set to 0, called the null character, shall exist in
the basic execution character set; it is used to terminate a character
string."

6.4.4.4 sub 12
"The construction '\0' is commonly used to represent the null character."

See, they are terminated by a C byte with all bits 0, whatever value the C
system interprets that as...
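
The quoted wording can be checked with a few lines of C. This is only an illustration; the names word, word_array_size and visible_length are invented for it.

```c
#include <string.h>

/* A string literal's array includes the terminating null character. */
static const char word[] = "abc";

/* Number of bytes in the array: 'a', 'b', 'c' and the null character. */
size_t word_array_size(void)
{
    return sizeof word;
}

/* strlen stops at the first byte with all bits zero, so an embedded
   '\0' cuts the string short whatever follows it in memory. */
size_t visible_length(const char *s)
{
    return strlen(s);
}
```

Here word_array_size() is one more than strlen(word), and the extra byte compares equal to both '\0' and the integer 0.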

D. Ritchie said this:

"The other characteristic feature of C, its treatment of arrays, is more
suspect on practical grounds, though it also has real virtues. Although the
relationship between pointers and arrays is unusual, it can be learned.
Moreover, the language shows considerable power to describe important
concepts, for example, vectors whose length varies at run time, with only a
few basic rules and conventions. In particular, character strings are
handled by the same mechanisms as any other array, plus the convention that
a null character terminates a string."

"The Development of the C Language", D. Ritchie

BTW, for ASCII, EBCDIC, PETSCII, ATASCII, UTF-8, Unicode, etc. the all bits
zero character is named NUL...

Rod Pemberton

Ulrich Eckhardt

Mar 9, 2008, 4:18:15 AM

James Harris wrote:
> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)

> wrote:
>> James Harris <james.harri...@googlemail.com> writes:
> ...
>> Why would we need a null character? I know that it in C is used as
>> end-of-string marker, but that is a horrible design decision. If any
>> single character should mark the end of a string, it should be EOT.

It's not so bad:
* Comparing with a certain value is slower and requires more instructions
(or at least used to on ancient machines) than comparing with zero. Simply
loading a value was enough to detect that it's zero sometimes.
* C did not want to require the system in question to use ASCII as
encoding, possibly because EBCDIC wasn't dead yet at that time. Requiring a
certain value for EOT or just requiring a definition in a header would have
worked, too, but it might have restricted the possible encodings.
* It has some similarity with C's pointers, so that you can e.g. use both
of them in a conditional expression.
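
The first and last points can be seen in a short C sketch (function names invented for the example): the loop test is just the character itself, and the same truth test applies to a pointer.

```c
#include <stddef.h>

/* The loop condition is the character itself: zero is false, so no
   explicit comparison against a terminator value is written. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

/* The same truth test works for a pointer and for the character it
   points at, so both fit in one condition. */
int is_nonempty(const char *s)
{
    return s && *s;
}
```

With EOT as the terminator the loop would read while (*p != EOT), which needs both a defined value for EOT and an explicit comparison.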

> My vague inference (which I've never had tested) is that the null
> string terminator is a consequence of C's pointer model. C is designed
> to allow a pointer to any memory object to be taken. This means C can
> depend only on knowing
>
> 1. the address
> 2. the type
>
> of any given object when it is operating on that object.

Further, and that is important, from the type you can infer the size and
layout of the object.

However: strings are not passed with their complete type (like array of 15
characters) but instead as pointer to the first character, so the knowledge
of the size is lost. Instead, the size must be passed along, also in order
to allow functions to handle strings of a different type. The chosen way to
achieve that was to terminate the string, another alternative (which I
today consider better because it is safer) would have been to pass the size
along explicitly.

> This has consequences:
>
> 1. strings need a terminating character

In order to operate on an object, you need to know its size. Using a signal
value to terminate the array is just one way.

> 2. arrays have no knowledge of bounds when operated upon

Yes, that's why they are typically passed along as pointer/size pair. Note
that a string is nothing else but an array of char.

> 3. header files are needed to share other info between compilation
> units

Hmmm, you mean share information about other types?

> 4. a number of functions require a count/length to be passed

Well... I wouldn't say this immediately follows from what you said but it's
true nonetheless.

> The array issue is why C /cannot/ detect array bounds infractions. It
> cannot depend on knowing more than the addres of element zero and the
> type of the array (which includes the type of the elements).
>
> I expect the C folks will be able to correct me. Hence I'm copying
> them this reply.
>
> Of course, there are other ways strings can be defined, even in C,
> with the right library support.

Yes, you could define them like arrays, i.e. always pass the size along.
Something like

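#include <stddef.h>   /* for size_t */

struct text {
    char   *p;   /* pointer to the first character */
    size_t  l;   /* length of the text in characters */
};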

..which actually makes quite readable code, I once wrote something like this
for fun.
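
As a sketch of how such a pointer/length pair might be used, here is a small set of helpers in C. The function names are invented for this example, and the pointer is declared const char * (unlike the struct above) so that string literals can be wrapped without a cast.

```c
#include <stddef.h>
#include <string.h>

struct text {
    const char *p;   /* first character; need not be zero-terminated */
    size_t      l;   /* number of characters */
};

/* Wrap an ordinary C string; the length is computed once, here. */
struct text text_from_cstr(const char *s)
{
    struct text t;
    t.p = s;
    t.l = strlen(s);
    return t;
}

/* A substring is just another pointer/length pair: no copy and no
   terminator. Out-of-range requests are clamped to the source text. */
struct text text_slice(struct text t, size_t start, size_t len)
{
    struct text r;
    if (start > t.l)
        start = t.l;
    if (len > t.l - start)
        len = t.l - start;
    r.p = t.p + start;
    r.l = len;
    return r;
}

/* Equality compares lengths first, then the bytes. */
int text_equal(struct text a, struct text b)
{
    return a.l == b.l && memcmp(a.p, b.p, a.l) == 0;
}
```

Such a value can be printed with printf("%.*s", (int)t.l, t.p). Structs are passed and returned by value here, so the pair travels as a single argument.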

Uli

James Harris

Mar 9, 2008, 8:16:03 PM

On 8 Mar, 17:22, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
...

> So they need a terminating character unless they are defined some
> other way? Surely the question was why were they not defined in one
> of these other ways?

Yes, I mean that to allow C strings to be defined as they are by a
single pointer requires there to be a terminating character of some
sort. (I know there are other ways to keep the pointer but which
require additional data.) This refers to strings the way they were
implemented in C. The other comment I made was that there are other
ways to define strings and these could be used in C with appropriate
library support so the limitation (of having to use one value as a
terminator) is not with C itself.

> I don't think there will be a better answer than "it seemed more
> convenient than using a length" to paraphrase Richie in
> http://cm.bell-labs.com/cm/cs/who/dmr/chist.html

Interesting link.

--
James

Richard Tobin

Mar 9, 2008, 9:08:23 PM

In article <5d3d80ce-48fc-4327...@x41g2000hsb.googlegroups.com>,

James Harris <james.h...@googlemail.com> wrote:
>The other comment I made was that there are other
>ways to define strings and these could be used in C with appropriate
>library support so the limitation (of having to use one value as a
>terminator) is not with C itself.

Certainly you could have libraries for other kinds of strings. But
null-terminated strings are also privileged in having a simple syntax
for string constants.
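
A library can get part of the way toward literal syntax with a macro. The sketch below is one possible approach, not an established idiom from any particular library; struct counted and COUNTED are names invented here, and the macro relies on a C99 compound literal.

```c
#include <stddef.h>

struct counted {
    const char *p;   /* first character */
    size_t      l;   /* number of characters, terminator excluded */
};

/* Build a counted string from a literal. sizeof includes the terminating
   null character, hence the -1. Only valid for string literals or char
   arrays, not for pointers. */
#define COUNTED(lit) ((struct counted){ (lit), sizeof (lit) - 1 })
```

The length is computed at compile time, but the wrapper still has to be spelled out at every use, whereas "abc" alone already yields a usable null-terminated string.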

-- Richard

--
:wq

Ben Bacarisse

Mar 10, 2008, 8:13:19 AM

James Harris <james.h...@googlemail.com> writes:

> On 8 Mar, 17:22, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
> ...
>> So they need a terminating character unless they are defined some
>> other way? Surely the question was why were they not defined in one
>> of these other ways?
>
> Yes, I mean that to allow C strings to be defined as they are by a
> single pointer requires there to be a terminating character of some
> sort.

But this is not true -- at least not literally true. In BCPL strings
were represented by a single pointer, but that pointer was to a
length. I am sure you are not unaware of this kind of representation,
I just think you are over-cooking the argument to come up with
"requires there to be a terminating character". I see no such
requirement -- it could have been different.
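
A sketch in C of a length-prefixed string reached through a single pointer, in the spirit of the BCPL representation mentioned above. The one-byte length field and the function names are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Byte 0 holds the length and the characters follow. One pointer is
   enough and no terminator is needed, but the length is limited to 255. */
size_t lp_length(const unsigned char *s)
{
    return s[0];
}

/* Store `src` into `dst` in length-prefixed form. `dst` must have room
   for strlen(src) + 1 bytes. Returns 0 on success, -1 if too long. */
int lp_from_cstr(unsigned char *dst, const char *src)
{
    size_t n = strlen(src);
    if (n > 255)
        return -1;
    dst[0] = (unsigned char)n;
    memcpy(dst + 1, src, n);
    return 0;
}
```

Finding the length becomes a single load instead of a scan, at the price of a fixed upper bound and a first element that is not really a character.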

I suspect I am reading more into your statement than you intended.

--
Ben.

James Harris

Mar 10, 2008, 11:45:25 AM

On 10 Mar, 12:13, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:

I see your point and appreciate being made to think this through more
clearly. I suppose the bit missing from what I said above is that the
C concept of a string being an array (of char) allows nowhere for a
length prefix. If it is an array of char then all elements should be
of type char. I suppose the libraries could just ignore this
requirement (could they?) but then would have to agree on how many
bytes made up the length and in what order they existed. Hmm....

I was comparing this with the definition of a string as a tuple of
(address, length) where BOTH elements are passed as parameters - i.e.
a single reference to the string embeds both parameters or expects
both as arguments. As far as I can see tuples of this sort cannot be
used in C.

The remaining option (I think there are just these three primary
options for strings but would welcome correction) is to store
(address, length) separately and point to it. It's not as flexible as
the tuple option above and not as simple as C's string handling. I
guess it could have been used but maybe the following comment from the
paper you referred to above helps explain why C went the way it did:

"
Problems became evident when I tried to extend the type notation,
especially to add structured (record) types. Structures, it seemed,
should map in an intuitive way onto memory in the machine, but in a
structure containing an array, there was no good place to stash the
pointer containing the base of the array, nor any convenient way to
arrange that it be initialized. For example, the directory entries of
early Unix systems might be described in C as

    struct {
        int     inumber;
        char    name[14];
    };

I wanted the structure not merely to characterize an abstract object
but also to describe a collection of bits that might be read from a
directory. Where could the compiler hide the pointer to name that the
semantics demanded? Even if structures were thought of more
abstractly, and the space for pointers could be hidden somehow, how
could I handle the technical problem of properly initializing these
pointers when allocating a complicated object, perhaps one that
specified structures containing arrays containing structures to
arbitrary depth?

The solution constituted the crucial jump in the evolutionary chain
between typeless BCPL and typed C. It eliminated the materialization
of the pointer in storage, and instead caused the creation of the
pointer when the array name is mentioned in an expression. The rule,
which survives in today's C, is that values of array type are
converted, when they appear in expressions, into pointers to the first
of the objects making up the array.
"
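
The rule in the quoted passage can be checked directly. The sketch below reuses the directory entry from the quote; the tag `old_dirent` and the two helper functions are names invented here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The directory entry from the quoted passage, given a tag so it can be named. */
struct old_dirent {
    int  inumber;
    char name[14];
};

/* Returns 1 if the array name, used in an expression, yields the
   address of its first element. */
int name_decays_to_first_element(const struct old_dirent *d)
{
    const char *p = d->name; /* the array is converted to a pointer here */
    return p == &d->name[0];
}

/* The 14 characters live inside the structure itself; no pointer is stored. */
size_t name_storage(void)
{
    struct old_dirent d;
    return sizeof d.name;
}
```

Because the pointer only exists while the expression is evaluated, the structure maps onto exactly the bytes read from the directory, which is the property the quote is after.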

Morris Dovey

Mar 10, 2008, 11:35:21 AM

James Harris wrote:

> I suppose the bit missing from what I said above is that the
> C concept of a string being an array (of char) allows nowhere for a
> length prefix. If it is an array of char then all elements should be
> of type char. I suppose the libraries could just ignore this
> requirement (could they?) but then would have to agree on how many
> bytes made up the length and in what order they existed. Hmm....

The Stratus C compiler provides both BCPL and C strings. IIRC,
the length component of the BCPL-PL/I style string is a 16-bit
unsigned value.

It does require additional library functions (as well as the
expected conversions back and forth) - and is comparable to using
both imperial and metric systems in a machine design (IMHO, _not_
a great idea).

I felt that the existing C string implementation was easier to
work with, but that may just have been a matter of being more
used to it.

--
Morris Dovey
DeSoto Solar
DeSoto, Iowa USA

http://www.iedu.com/DeSoto/

Rod Pemberton

Mar 10, 2008, 7:09:27 PM

"Morris Dovey" <mrd...@iedu.com> wrote in message
news:47D55539...@iedu.com...

> James Harris wrote:
>
> > I suppose the bit missing from what I said above is that the
> > C concept of a string being an array (of char) allows nowhere for a
> > length prefix. If it is an array of char then all elements should be
> > of type char. I suppose the libraries could just ignore this
> > requirement (could they?) but then would have to agree on how many
> > bytes made up the length and in what order they existed. Hmm....
>
> The Stratus C compiler provides both BCPL and C strings. IIRC,
> the length component of the BCPL-PL/I style string is a 16-bit
> unsigned value.
>

It's been a few years, so I no longer recall the size. But the interesting
aspect of Stratus VOS PL/1 strings was that the length at the front of the
string (and whatever else was there... size?) was a header of fixed size.
IIRC, that was true of PL/1's structs too. I.e., for C, the header with
the length could be 1) at a negative offset from the start of the string or
2) an offset for the size of the header could be added to C's char pointer
"to the first of the objects making up the array", i.e., the "decayed"
pointer. One could implicitly assume that array[length]=='\0'... although
no null character is actually there. I'm not sure about C compliance
complications...

Rod Pemberton

Mar 10, 2008, 7:26:16 PM

"James Harris" <james.h...@googlemail.com> wrote in message
news:44bf8f7a-c1f5-4dbb...@q78g2000hsh.googlegroups.com...

> On 10 Mar, 12:13, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
> > James Harris <james.harri...@googlemail.com> writes:
> > > On 8 Mar, 17:22, Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
> > > ...
> > >> So they need a terminating character unless they are defined some
> > >> other way? Surely the question was why were they not defined in one
> > >> of these other ways?
> >
> > > Yes, I mean that to allow C strings to be defined as they are by a
> > > single pointer requires there to be a terminating character of some
> > > sort.
> >
> > But this is not true -- at least not literally true. In BCPL strings
> > were represented by a single pointer, but that pointer was to a
> > length. I am sure you are not unaware of this kind of representation,
> > I just think you are over-cooking the argument to come up with
> > "requires there to be a terminating character". I see no such
> > requirement -- it could have been different.
> >
> > I suspect I am reading more into your statement than you intended.
>
> I see your point and appreciate being made to think this through more
> clearly.
> I suppose the bit missing from what I said above is that the
> C concept of a string being an array (of char) allows nowhere for a
> length prefix. If it is an array of char then all elements should be
> of type char.

A string can be a contiguous sequence of char - even with a length prefix.
Whether it's C compliant is a separate issue. The header containing the
length just needs to be placed prior to the address representing the start
of the string.

C normally:
----
item[0] <-- i.e., C's "decayed" pointer points here, start of string data
item[1]
item[2]
item[3]
...
item[length]='\0' <-- end of string data
...
item[size] <-- end of allocation for string

In the style of PL/1:
----
length <-- start of string data, first two items fixed size header
size
item[0] <-- i.e., C's "decayed" pointer points here
item[1]
item[2]
item[3]
...
item[length-1] <-- end of string data
item[length] <-- implicit null character '\0' - not actually set to null character
...
item[size] <-- end of allocation for string
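
A minimal sketch of the second layout, assuming a header made of two `size_t` fields placed immediately before the character data. The names `hstr_header`, `hstr_new`, `hstr_len` and `hstr_free` are hypothetical, and unlike the diagram this version does store a real '\0' at item[length] so the result can still be handed to ordinary C string functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size header that sits just before the characters. */
struct hstr_header {
    size_t length; /* characters in use */
    size_t size;   /* characters allocated, not counting the header */
    char   data[]; /* item[0] starts here (C99 flexible array member) */
};

/* Step back from the data pointer to the header at a negative offset. */
static struct hstr_header *hstr_head(const char *s)
{
    return (struct hstr_header *)(void *)(s - offsetof(struct hstr_header, data));
}

/* Allocate a headered string with room for size characters; returns item[0]. */
char *hstr_new(const char *src, size_t size)
{
    size_t len = strlen(src);
    if (len > size)
        return NULL;
    struct hstr_header *h = malloc(sizeof *h + size + 1);
    if (h == NULL)
        return NULL;
    h->length = len;
    h->size = size;
    memcpy(h->data, src, len);
    h->data[len] = '\0'; /* stored explicitly here, unlike the diagram */
    return h->data;
}

/* Length comes from the header, not from scanning for a terminator. */
size_t hstr_len(const char *s)
{
    return hstr_head(s)->length;
}

void hstr_free(char *s)
{
    if (s != NULL)
        free(hstr_head(s));
}
```

`hstr_len` is only meaningful for pointers that `hstr_new` returned; any other `char *` has no header in front of it.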

Rod Pemberton

Eric Sosman

Mar 10, 2008, 9:00:14 PM

Severe, I think.

char bleat[] = "Hello, world!";
assert (strlen(bleat) == 13);
bleat[5] = '\0';
assert (strlen(bleat) == 5);

That is, an implementation with counted strings would need an
unmodifiable array of char (with consequences to strcat, strtok),
or would need to treat '\0' as "just another character" (ruining
interoperability with the existing C string functions). This is
not to say that such things couldn't be worked out -- other
languages have already taken different approaches -- but that
trying to graft them onto C-as-it-stands would be a daunting task.
I don't even want to *think* about fgets() ...
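
The conflict can be made concrete with a small sketch. It assumes a hypothetical counted string, `struct counted`, that keeps an explicit length next to a fixed buffer; none of these names come from the posts above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: an explicit length plus a fixed buffer. */
struct counted {
    size_t length;
    char   data[32];
};

/* Copy src (including its terminator) and record its length. */
struct counted counted_from(const char *src)
{
    struct counted c = { 0 };
    size_t n = strlen(src);
    assert(n < sizeof c.data);
    c.length = n;
    memcpy(c.data, src, n + 1);
    return c;
}

/* Returns 1 while the stored count and the sentinel agree. */
int counted_consistent(const struct counted *c)
{
    return c->length == strlen(c->data);
}
```

After `c.data[5] = '\0'` on a string built from "Hello, world!", `strlen(c.data)` reports 5 while `c.length` still says 13, so `counted_consistent` returns 0. Nothing in plain C notices the store and updates the count.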

--
Eric Sosman
eso...@ieee-dot-org.invalid

Eric Sosman

Mar 10, 2008, 9:03:06 PM

Rod Pemberton wrote:
>
> A string can be a contiguous sequence of char - even with a length prefix.
> Whether it's C compliant is a separate issue. The header containing the
> length just needs to be placed prior to the address representing the start
> of the string.

"Prior to," but at what distance?

char *string = "Vogon poetry";
assert (strlen(string) == 12);
assert (strlen(string+6) == ???);

--
Eric Sosman
eso...@ieee-dot-org.invalid

Torben Ægidius Mogensen

Mar 11, 2008, 6:04:42 AM

James Harris <james.h...@googlemail.com> writes:

> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)

> wrote:
>> James Harris <james.harri...@googlemail.com> writes:
> ...
>> Why would we need a null character? I know that it in C is used as
>> end-of-string marker, but that is a horrible design decision. If any
>> single character should mark the end of a string, it should be EOT.
>
> My vague inference (which I've never had tested) is that the null

> string terminator is a consequence of C's pointer model. [...]

>
> 1. strings need a terminating character

True, but why does the terminating character need to be null? There
is in the ASCII alphabet a control character EOT representing
end-of-text. That seems like a much more appropriate terminating
character.

> Of course, there are other ways strings can be defined, even in C,
> with the right library support.

Quite. For manipulating long texts, a linear array of characters is
not a very good structure anyway -- regardless of whether it is
terminated by a specific character or the length is specified
separately.

Torben

santosh

Mar 11, 2008, 6:56:16 AM

Torben Ægidius Mogensen wrote:

> James Harris <james.h...@googlemail.com> writes:
>
>> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)
>> wrote:
>>> James Harris <james.harri...@googlemail.com> writes:
>> ...
>>> Why would we need a null character? I know that it in C is used as
>>> end-of-string marker, but that is a horrible design decision. If
>>> any single character should mark the end of a string, it should be
>>> EOT.
>>
>> My vague inference (which I've never had tested) is that the null
>> string terminator is a consequence of C's pointer model. [...]
>>
>> 1. strings need a terminating character
>
> True, but why does the terminating character need to be null? There
> is in the ASCII alphabet a control character EOT representing
> end-of-text. That seems like a much more appropriate terminating
> character.

EOT is not present in C's basic source or execution character set, i.e.,
it is not available on all systems where C is implemented.

<snip>

Marco van de Voort

Mar 11, 2008, 7:04:46 AM

On 2008-03-10, Rod Pemberton <do_no...@nohavenot.cmm> wrote:
>> The Stratus C compiler provides both BCPL and C strings. IIRC,
>> the length component of the BCPL-PL/I style string is a 16-bit
>> unsigned value.
>
> It's been a few years, so I no longer recall the size. But, the interesting
> aspect Stratus VOS PL/1 strings was that the length at the front of the
> string (and whatever else was there... size?) was a header of fixed size.

Delphi has this too for strings and dynamic arrays. Size, allocated size
(there can be unused entries to avoid too many repeated allocs) and ref
count.

> IIRC, that was true of PL/1's struct's too. I.e., for C, the header with
> the length could be 1) at a negative offset from the start of the string or

And indeed it is that way in Delphi.

> One could implicitly assume that array[length]=='\0'... although
> no null character is actually there. I'm not sure about C compliance
> complications...

It is this way in Delphi, but null chars in the rest of the string are
legal (IOW the #0 is not what determines the length). So Delphi strings can
contain zeroes, but are typecast-compatible with C's char * for reading purposes.

Writing is slightly more complicated:

setlength(s, buflen); // allocate room for the Delphi string; zero termination is guaranteed
if c_routine(pchar(s), buflen) = success then // pass the buffer to C
  setlength(s, strlen(pchar(s))) // search for the #0 char and update the Delphi length
else
  s := '';

Richard Tobin

Mar 11, 2008, 7:10:37 AM

In article <7zlk4p4...@app-2.diku.dk>,

Torben Ægidius Mogensen <tor...@app-2.diku.dk> wrote:

>> 1. strings need a terminating character

>True, but why does the terminating character need to be null? There
>is in the ASCII alphabet a control character EOT representing
>end-of-text. That seems like a much more appropriate terminating
>character.

... if you want to tie that language to a particular character set.
C doesn't define a terminating character, it defines a terminating
integer value.

[Actually EOT is "end of transmission". ETX is end of text.]

I suppose it could not specify the value, but have a #defined constant
or a character escape which varied between implementations, and ASCII
implementations could use EOT.

Of course the choice of zero lends itself to various idioms, such as
"while(*p++)".

-- Richard
--
:wq

Marco van de Voort

Mar 11, 2008, 7:06:14 AM

This is indeed a problem for C. A substring of another string wouldn't be
typecast-compatible with a string.

Bill Gunshannon

Mar 11, 2008, 8:04:26 AM

In article <7zlk4p4...@app-2.diku.dk>,

tor...@app-2.diku.dk (Torben Ægidius Mogensen) writes:
> James Harris <james.h...@googlemail.com> writes:
>

>> On 6 Mar, 10:34, torb...@app-1.diku.dk (Torben Ægidius Mogensen)

>> wrote:
>>> James Harris <james.harri...@googlemail.com> writes:
>> ...
>>> Why would we need a null character? I know that it in C is used as
>>> end-of-string marker, but that is a horrible design decision. If any
>>> single character should mark the end of a string, it should be EOT.
>>
>> My vague inference (which I've never had tested) is that the null
>> string terminator is a consequence of C's pointer model. [...]
>>
>> 1. strings need a terminating character
>
> True, but why does the terminating character need to be null? There
> is in the ASCII alphabet a control character EOT representing
> end-of-text.

Actually, EOT = end of transmission
ETX = end of text

But the problem with this approach is it misses the point of ASCII.
American Standard Code for Information Interchange
While ASCII has been used for local storage of characters I believe its
intended purpose was for moving them between locations over what were the
common transmission methods of its day. Thus I think while there is an
ETX it would be meaningless without a preceding STX somewhere in the string.

> That seems like a much more appropriate terminating
> character.
>
>> Of course, there are other ways strings can be defined, even in C,
>> with the right library support.
>
> Quite. For manipulating long texts, a linear array of characters is
> not a very good structure anyway -- regardless of whether it is
> terminated by a specific character or the length is specified
> separately.

There used to be a product called SafeC (a long, long time ago). I wonder
how it handled strings?

Bill Gunshannon

Mar 11, 2008, 8:07:36 AM

In article <fr5ogg$9p7$1...@registered.motzarella.org>,

If you want to get pedantic, neither is NUL (as people here understand it).
But those systems seemed to be able to function with C anyway.

Default User

Mar 11, 2008, 5:26:05 PM

Torben Ægidius Mogensen wrote:

> James Harris <james.h...@googlemail.com> writes:

> > 1. strings need a terminating character
>
> True, but why does the terminating character need to be null? There
> is in the ASCII alphabet a control character EOT representing
> end-of-text. That seems like a much more appropriate terminating
> character.

For one thing, that is a legal character in strings. You'd prevent
people using the ASCII char set from creating strings with EOT (or ETX
or whatever).

Brian

Neilist

Mar 11, 2008, 5:28:23 PM

On Mar 11, 7:10 am, rich...@cogsci.ed.ac.uk (Richard Tobin) wrote:
> In article <7zlk4p4lvp....@app-2.diku.dk>,

No it doesn't.

Micah Cowan

Mar 11, 2008, 6:14:10 PM

"Default User" <defaul...@yahoo.com> writes:

Yeah. AIUI, ETX (and EOT, ST and others) were usually intended for
embedding "out of band" control strings of various kinds, intended for
interpretation by one of the communication devices to which it was
passed. An inability to output these to such a device (except via
putchar()) would be pretty cumbersome, I think.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/

Richard Tobin

Mar 11, 2008, 6:22:36 PM

In article <1b270b41-fa90-457e...@s12g2000prg.googlegroups.com>,
Neilist <latto...@gmail.com> wrote:

>> Of course the choice of zero lends itself to various idioms, such as
>> "while(*p++)".

>No it doesn't.

Er, what?

-- Richard

--
:wq

Richard

Mar 11, 2008, 6:23:21 PM

Neilist <latto...@gmail.com> writes:

Oh yes it does ...

Richard Tobin

Mar 11, 2008, 6:27:14 PM

In article <jg7ig8c...@micah.cowan.name>,
Micah Cowan <mi...@cowan.name> wrote:

>> For one thing, that is a legal character in strings. You'd prevent
>> people using the ASCII char set from creating strings with EOT (or ETX
>> or whatever).

>Yeah. AIUI, ETX (and EOT, ST and others) were usually intended for
>embedding "out of band" control strings of various kinds, intended for
>interpretation by one of the communication devices to which it was
>passed. An inability to output these to such a device (except via
>putchar()) would be pretty cumbersome, I think.

I doubt it would make much difference in practice. How many strings
have you seen with such characters in them? putchar() is quite likely
to be the natural function to use. And programs that controlled such
devices (I use the past tense deliberately) were probably equally
likely to have to send nuls for timing purposes.

(It was common to send nulls after carriage-returns to allow time for
the carriage to return. See the unix stty command for evidence.)

-- Richard
--
:wq

Torben Ægidius Mogensen

Mar 12, 2008, 4:43:18 AM

ric...@cogsci.ed.ac.uk (Richard Tobin) writes:

> Of course the choice of zero lends itself to various idioms, such as
> "while(*p++)".

Is that a good thing?

Torben

santosh

Mar 12, 2008, 4:55:04 AM

Torben Ægidius Mogensen wrote:

What's bad about it? It might not look pretty to folks used to other
languages, but it's a common construct in C.

Wilhelm B. Kloke

Mar 12, 2008, 5:18:23 AM

["Followup-To:" set to comp.lang.misc.]
Torben Ægidius Mogensen <tor...@app-2.diku.dk> wrote:

Yes. It translates to optimal assembler code for most architectures.
On others, while(p[i++]) would do, or while(p[i++]!=0), with the
obvious compiler optimisation. In contrast, the comparison to anything
else than zero needs more bytes and processor cycles.
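
For reference, the idioms under discussion look roughly like this when written out. The function names are invented for this sketch; the standard library versions are usually far more heavily optimized.

```c
#include <assert.h>
#include <stddef.h>

/* Length via the while(*p++) idiom: the loop stops on the zero value. */
size_t idiom_strlen(const char *s)
{
    const char *p = s;
    while (*p++)
        ;
    /* p has stepped one past the terminator, hence the -1. */
    return (size_t)(p - s) - 1;
}

/* Copy via the same idiom: the assigned value doubles as the loop test. */
char *idiom_strcpy(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++))
        ;
    return dst;
}
```

Both loops rely on zero being the one value that C treats as false; with any other terminator the test would need an explicit comparison such as `*p++ != TERM`.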
--
Dipl.-Math. Wilhelm Bernhard Kloke
Institut fuer Arbeitsphysiologie an der Universitaet Dortmund
Ardeystrasse 67, D-44139 Dortmund, Tel. 0231-1084-257
PGP: http://vestein.arb-phys.uni-dortmund.de/~wb/mypublic.key

Marco van de Voort

Mar 12, 2008, 7:08:21 AM

On 2008-03-12, Wilhelm B. Kloke <w...@arb-phys.uni-dortmund.de> wrote:
>>> Of course the choice of zero lends itself to various idioms, such as
>>> "while(*p++)".
>>
>> Is that a good thing?
>
> Yes. It translates to optimal assembler code for most architectures.
> On others, while(p[i++]) would do, or while(p[i++]!=0), with the
> obvious compiler optimisation

> In contrast, the comparison to anything

> else than zero needs more bytes and processor cycles.

Not true; e.g. on x86, scasb is bounded by a number. You need a bound
anyway, since relying on unbounded strings is a security risk (and in fact
most string routines changed in later C revisions to add a limit to scan)
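
A bounded scan might be sketched as follows. `scan_until` is a hypothetical helper, not a standard function; it takes the terminator as a parameter to show that the bound, rather than the choice of terminating value, is what limits the walk.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters before term, but never look at more than max of them. */
size_t scan_until(const char *s, char term, size_t max)
{
    size_t i = 0;
    while (i < max && s[i] != term)
        i++;
    return i;
}
```

A return value equal to `max` means no terminator was found within the bound, which the caller can treat as an error instead of reading on into whatever follows the buffer.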

Wilhelm B. Kloke

Mar 12, 2008, 8:33:14 AM

Marco van de Voort <mar...@stack.nl> wrote:

>
>> In contrast, the comparison to anything
>> else than zero needs more bytes and processor cycles.
>
> Not through, e.g. on x86, scasb is bounded by a number. You need a bound
> anyway, since relying on unbounded strings is a security risk (and in fact
> most string routines changed in later C revisions to add a limit to scan)

Yes, sometimes there may be a security risk, esp. in the case of library
functions. But there are also contexts in which it is perfectly safe
and worth doing for performance. There are other contexts in which
the need to touch every single byte in the string by byte instructions
may be detrimental to performance, and a length prefix is preferable.

In any case, replacing NUL by EOT or ETX doesn't help.

Chris Dollin

Mar 12, 2008, 9:33:42 AM

Wilhelm B. Kloke wrote:

> ["Followup-To:" set to comp.lang.misc.]
> Torben Ægidius Mogensen <tor...@app-2.diku.dk> wrote:
>> ric...@cogsci.ed.ac.uk (Richard Tobin) writes:
>>
>>
>>> Of course the choice of zero lends itself to various idioms, such as
>>> "while(*p++)".
>>
>> Is that a good thing?
>
> Yes. It translates to optimal assembler code for most architectures.
> On others, while(p[i++]) would do, or while(p[i++]!=0), with the
> obvious compiler optimisation. In contrast, the comparison to anything
> else than zero needs more bytes and processor cycles.

IIRC, not (necessarily) true on an ARM.

--
"Well begun is half done." - Proverb

Hewlett-Packard Limited, registered office: Cain Road, Bracknell, Berks RG12 1HN
registered no: 690597 England

Marco van de Voort

Mar 12, 2008, 10:11:25 AM

On 2008-03-12, Wilhelm B. Kloke <w...@arb-phys.uni-dortmund.de> wrote:

>>> In contrast, the comparison to anything
>>> else than zero needs more bytes and processor cycles.
>>
>> Not through, e.g. on x86, scasb is bounded by a number. You need a bound
>> anyway, since relying on unbounded strings is a security risk (and in fact
>> most string routines changed in later C revisions to add a limit to scan)
>
> Yes, sometimes there may be a security risk, esp. in the case of library
> functions. But there are also contexts, in which it is perfectly safe
> and worth to do for performance

Note that even then it is unlikely to gain anything, since the register
comparison will generally be negligible (since pipelined) compared to the cache
effects of walking the string.

> There are other contexts, in which the need to touch every single byte in
> the string by byte instructions may be detrimental to performance, and a
> length prefix is preferable.

One doesn't exclude the other. (like Delphi, that has both length and #0
termination. Though the latter is never used by Delphi itself, only for
communicating with C)

The main reason one can't introduce such a scheme in C, as noted
earlier, is the ability to pass a pointer to a char inside a string to something
that accepts a string. Any metadata is lost that way. And the existing C
codebase is the main reason C strings still exist today.

> In any case, replacing NUL by EOT or ETX doesn't help.

In general, killing terminated strings is better. But that would kill C's
only string support, which is why it survived so long in the first place.

Wilhelm B. Kloke

Mar 12, 2008, 11:34:40 AM

Chris Dollin <chris....@hp.com> wrote:

> Wilhelm B. Kloke wrote:
>
>> Yes. It translates to optimal assembler code for most architectures.
>> On others, while(p[i++]) would do, or while(p[i++]!=0), with the
>> obvious compiler optimisation. In contrast, the comparison to anything
>> else than zero needs more bytes and processor cycles.
>
> IIRC, not (necessarily) true on an ARM.

I know that it is not true on a MIPS, but it costs a register. Comparison
to zero is comparison to the zero-hardwired register 0 on this machine.
MIPS was the architecture I had in mind when I restricted my claim.

Default User

Mar 12, 2008, 2:22:45 PM

Wilhelm B. Kloke wrote:

> ["Followup-To:" set to comp.lang.misc.]

That was probably the least appropriate group to set for follow-ups. I
know slrn does that automatically for cross-posted messages, but you
need to pay better attention to what your newsreader is doing.

Brian

Jacko

Mar 12, 2008, 2:49:33 PM

<table> <td> and <tr> etc could be the record separators, etc and the
SI, SO could be < and </ for other tags etc. lf <hr> and cr <br>.

could do whole of html 2 in ascii control. tag parameters well what
options exist?

Micah Cowan

Mar 12, 2008, 3:01:00 PM

Jacko <jacko...@gmail.com> writes:

SI and SO would not be appropriate for that use, as they already have
the very clearly-defined purpose of shifting character codesets.

CBFalconer

Mar 11, 2008, 4:22:58 PM

I don't see any point. What is wrong with:

assert (strlen(string + 6) == 6);

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.

--
Posted via a free Usenet account from http://www.teranews.com

CBFalconer

Mar 10, 2008, 6:35:16 PM

Morris Dovey wrote:
> James Harris wrote:
>
>> I suppose the bit missing from what I said above is that the C
>> concept of a string being an array (of char) allows nowhere for
>> a length prefix. If it is an array of char then all elements
>> should be of type char. I suppose the libraries could just
>> ignore this requirement (could they?) but then would have to
>> agree on how many bytes made up the length and in what order
>> they existed. Hmm....

>
> The Stratus C compiler provides both BCPL and C strings. IIRC,
> the length component of the BCPL-PL/I style string is a 16-bit
> unsigned value.
>

> It does require additional library functions (as well as the
> expected conversions back and forth) - and is comparable to
> using both imperial and metric systems in a machine design
> (IMHO, _not_ a great idea).
>
> I felt that the existing C string implementation was easier to
> work with, but that may just have been a matter of being more
> used to it.

In general, the objections to C string format are due to:

1. The lack of an immediately available length.
2. Vulnerability to overwriting past storage end.

1. is greatly mitigated by the fact that strings are usually short,
and thus it is a trivial effort to extract the length, with usually
highly optimized code in strlen().

2. is the real bug-a-boo. Use of carefully crafted routines, such
as strlcpy and strlcat will virtually eliminate those problems.
Unfortunately, those routines are not present in standard C
libraries.
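
A minimal sketch of what such a routine does, written from the usual description of the BSD semantics: copy at most size-1 characters, always terminate when size is non-zero, and return the length of the source so that truncation can be detected. The name `sketch_strlcpy` is made up here to avoid clashing with a real `strlcpy` where one exists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst (capacity size), truncating if needed.
   Returns strlen(src); a result >= size means truncation occurred. */
size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);
    if (size > 0) {
        size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0'; /* always terminated when there is any room at all */
    }
    return srclen;
}
```

A caller would typically write `if (sketch_strlcpy(buf, src, sizeof buf) >= sizeof buf)` to handle the truncated case.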

Eric Sosman

Mar 13, 2008, 5:37:20 PM

CBFalconer wrote:
> Eric Sosman wrote:
>> Rod Pemberton wrote:
>>
>>> A string can be a contiguous sequence of char - even with a
>>> length prefix. Whether it's C compliant is a separate issue.
>>> The header containing the length just needs to be placed prior
>>> to the address representing the start of the string.
>> "Prior to," but at what distance?
>>
>> char *string = "Vogon poetry";
>> assert (strlen(string) == 12);
>> assert (strlen(string+6) == ???);
>
> I don't see any point. What is wrong with:
>
> assert (strlen(string + 6) == 6);

Pemberton's proposal (as I understood it, anyhow) was to
store a string's length just before its first character, in
much the way some malloc() implementations store metadata
just before an allocated block:

    [12] V o g o n   p o e t r y
         ^
         The decayed `string' points here

strlen() would swizzle its argument to locate and fetch the 12
without hunting for a (non-existent) sentinel character. And
my point was that

    [12] V o g o n   p o e t r y
                     ^
                     `string+6' points here

... would cause strlen() to mis-swizzle its argument and
pick up some kind of garbage length.
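
The mis-swizzle can be reproduced with a small sketch. It assumes a one-byte length prefix purely for simplicity (the width of the header in the proposal was never specified), and `vogon_string` and `prefixed_len` are invented names.

```c
#include <assert.h>
#include <stddef.h>

/* One-byte length prefix followed by the characters; no terminator. */
static const unsigned char vogon[] = {
    12, 'V', 'o', 'g', 'o', 'n', ' ', 'p', 'o', 'e', 't', 'r', 'y'
};

/* The "decayed" pointer: it points at 'V', just past the prefix. */
const char *vogon_string(void)
{
    return (const char *)vogon + 1;
}

/* A strlen() replacement that trusts whatever byte precedes its argument. */
size_t prefixed_len(const char *s)
{
    return (unsigned char)s[-1];
}
```

`prefixed_len(vogon_string())` yields 12, but `prefixed_len(vogon_string() + 6)` reads the space character in front of "poetry" and reports its character code (32 in ASCII) instead of 6.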

--
Eric....@sun.com

Richard Bos

Mar 14, 2008, 5:28:38 AM

CBFalconer <cbfal...@yahoo.com> wrote:

> Morris Dovey wrote:
> > I felt that the existing C string implementation was easier to
> > work with, but that may just have been a matter of being more
> > used to it.
>
> In general, the objections to C string format are due to:
>
> 1. The lack of an immediately available length.
> 2. Vulnerability to overwriting past storage end.
>
> 1. is greatly mitigated by the fact that strings are usually short,
> and thus it is a trivial effort to extract the length, with usually
> highly optimized code in strlen().

And further optimisations possible if the string doesn't change in a
loop.

> 2. is the real bug-a-boo. Use of carefully crafted routines, such
> as strlcpy and strlcat will virtually eliminate those problems.

Why would anyone use a third-party add-on, when strncat() can solve
those problems just as well?
Anyway, this is not a problem of the C string format. It is a problem of
a lack of buffer overflow checks. If you add a length field, but fail to
check for overflow of the underlying memory, you'll have the same
problem. (And when you do, strncat() would solve it as well.)
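
A sketch of the strncat() approach, assuming the caller knows the capacity of the destination. `append_bounded` is a hypothetical wrapper; the point is only that the third argument to strncat() must be the room left after the current contents and the terminator, not the total size of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to the string in dst, which has capacity cap.
   Returns 1 if everything fitted, 0 if the result was truncated. */
int append_bounded(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    assert(used < cap);
    size_t room = cap - used - 1; /* leave space for the terminator */
    strncat(dst, src, room);      /* appends at most room chars, then '\0' */
    return strlen(src) <= room;
}
```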

Richard

Richard Bos

Mar 14, 2008, 5:32:54 AM

Eric Sosman <Eric....@sun.com> wrote:

> Pemberton's proposal (as I understood it, anyhow) was to
> store a string's length just before its first character, in
> much the way some malloc() implementations store metadata
> just before an allocated block:
>
>     [12] V o g o n   p o e t r y
>          ^
>          The decayed `string' points here
>
> strlen() would swizzle its argument to locate and fetch the 12
> without hunting for a (non-existent) sentinel character. And
> my point was that
>
>     [12] V o g o n   p o e t r y
>                      ^
>                      `string+6' points here
>
> ... would cause strlen() to mis-swizzle its argument and
> pick up some kind of garbage length.

Not necessarily. It would require pointers (to char and void, at least)
to consist of a base pointer and an offset, and strlen() to get its
length from the base pointer, not from base+offset. This would be
needlessly cumbersome on most systems, but not impossible.
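
Such a base-plus-offset pointer could be sketched like this. Everything here is an assumption made for illustration: a one-byte length prefix at the base, and the names `struct fatptr`, `fat_add`, `fat_len` and `fat_deref`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "fat" character pointer: base never moves, offset does. */
struct fatptr {
    const unsigned char *base; /* points at the one-byte length prefix */
    size_t offset;             /* characters skipped from the start */
};

/* Pointer arithmetic only changes the offset. */
struct fatptr fat_add(struct fatptr p, size_t n)
{
    assert(p.offset + n <= p.base[0]);
    p.offset += n;
    return p;
}

/* Remaining length is always computed from the prefix at the base. */
size_t fat_len(struct fatptr p)
{
    return (size_t)p.base[0] - p.offset;
}

/* Read the character the pointer currently designates. */
char fat_deref(struct fatptr p)
{
    assert(p.offset < p.base[0]);
    return (char)p.base[1 + p.offset];
}
```

With this representation the equivalent of `strlen(string + 6)` gives 6 for "Vogon poetry", at the price of every character pointer being two words wide.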

Richard

Eric Sosman

Mar 14, 2008, 9:21:41 AM

It was not my impression that Pemberton was proposing
"fat pointers." Even if he were, there would be problems
(perhaps not insuperable, but problems nonetheless) meshing
such a scheme with C's view of the world:

char buff[100];
strcpy(buff, "Vogon poetry");
strcpy(buff+strlen(buff), "Slartibartfast");

From C's point of view these are two independent strings (I'm
assuming the length governs a string's extent and that no
sentinel character is used). Where are the two lengths (or
information to compute them) stored?

(Is this an "artificial" construct? When handling large
numbers of short-ish strings, I have more than once tried to
minimize malloc() overhead by allocating a big pool and cramming
multiple strings into it, cheek by jowl, until the pool fills and
I allocate a new one. The idea of storing many strings back to
back in one big char[] is not so far-fetched.)
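
The pooling technique described in the last paragraph might look like the sketch below, here with ordinary terminated strings. `struct strpool` and `pool_add` are invented names, the pool is a small fixed array, and a full pool simply returns NULL where real code would allocate a new one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pool: strings are stored back to back in one char array. */
struct strpool {
    char   buf[64];
    size_t used;
};

/* Copy s (with its terminator) into the pool; NULL if it does not fit. */
const char *pool_add(struct strpool *pool, const char *s)
{
    size_t need = strlen(s) + 1;
    if (need > sizeof pool->buf - pool->used)
        return NULL;
    char *dst = pool->buf + pool->used;
    memcpy(dst, s, need);
    pool->used += need;
    return dst;
}
```

Each returned pointer addresses the interior of `buf`, so a scheme that keeps a length in front of the array as a whole has nowhere to record the lengths of the individual strings.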

--
Eric Sosman
eso...@ieee-dot-org.invalid

Rod Pemberton

Mar 14, 2008, 5:48:27 PM

"Eric Sosman" <eso...@ieee-dot-org.invalid> wrote in message
news:csqdnYXM0v3Y4Ufa...@comcast.com...

> It was not my impression that Pemberton was proposing
> "fat pointers." Even if he were, there would be problems
> (perhaps not insuperable, but problems nonetheless) meshing
> such a scheme with C's view of the world:
>

...
>insuperable

C has a defined method for strings. So, one use for "headered" strings in
the C context is to provide _temporary_ C compatibility with some other
existing implementation, e.g., converting a database and porting database
application code to C. So, for those in such a situation, I think the
question is: "How near to C compliance can headered strings become?" The
fundamental principle that C maps everything onto contiguous sequences of
characters or bytes will most likely be broken in the process... since a
header must be there for it to be recognized as a string. But, with limitations,
e.g., only passing 'decayed' pointers to the start of declared string
objects to string functions, I think C could work for the most part with
some C incompatible implementation.

Rod Pemberton

BruceMcF

Apr 12, 2008, 9:27:54 PM

On Mar 11, 8:04 am, billg...@cs.uofs.edu (Bill Gunshannon) wrote:
> Actually, EOT = end of transmission
> ETX = end of text

> But the problem with this approach is it misses the point of ASCII.
> American Standard Code for Information Interchange
> While ASCII has been used for local storage of characters I believe its
> intended purpose was for moving them between locations over what were the
> common transmission methods of its day. Thus I think while there is an
> ETX it would be meaningless without a preceding STX somewhere in the string.

Precisely. What makes ASCII NUL an appropriate terminator for
terminated strings is the fact that it is defined *in* ASCII as a ...
uh ... NUL character ... a character that the transmitter is free to
add as much as it wishes, say to keep a connection alive or to provide
a timing delay ... and which the receiver is free to discard on
receipt. A no-op.

It is, therefore, one character in ASCII that is not meaningful as a
character.

Now, we very rarely do text facsimile transmission anymore (!), but
there are lots of analogues that could be found.

For text in storage rather than text in flight, the main useful ones
are FS GS RS and US, though if the resources are not available to
support UTF-8, and Latin-1 is not appropriate, SI and SO are also
useful.

If the process of a facsimile sender and facsimile receiver is mapped
as an analogy to the process of a selected SPI device or the
microcontroller acting as bus master talking to the other as sender
and receiver, a reasonable use can be found for many of the others. And in
that context, the last thing you'd want would be to have them as
printable characters, or in use as part of the messages being passed
back and forth, since the whole point is that if the character AND $E0
is 0, it's an ASCII7 control, and then you can do an indexed jump to
act on it, while if it's not, you repeat the loop that you are in.
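
In C the mask-and-dispatch loop could be sketched as below, with a table of function pointers standing in for the indexed jump. The handlers, the `scan_stats` structure and the choice to act only on RS (0x1E) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Tallies kept by this sketch. */
struct scan_stats {
    int printable; /* characters that were not C0 controls */
    int records;   /* RS (0x1E) characters seen */
};

typedef void (*ctl_handler)(struct scan_stats *);

/* A character is a C0 control when its top three bits are clear. */
int is_c0_control(unsigned char c)
{
    return (c & 0xE0) == 0;
}

static void on_ignore(struct scan_stats *st) { (void)st; }
static void on_rs(struct scan_stats *st) { st->records++; }

/* Walk a buffer, dispatching controls through a 32-entry table. */
struct scan_stats scan_ascii7(const unsigned char *buf, size_t n)
{
    ctl_handler table[32];
    struct scan_stats st = { 0, 0 };

    for (size_t i = 0; i < 32; i++)
        table[i] = on_ignore;
    table[0x1E] = on_rs; /* record separator */

    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (is_c0_control(c))
            table[c](&st);  /* the "indexed jump" */
        else
            st.printable++; /* stay in the ordinary loop */
    }
    return st;
}
```

Note that the mask test covers codes 0x00 to 0x1F only; DEL (0x7F) falls through as an ordinary character in this sketch.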