ANN: An open source C based Forth for learning purposes ...

Sp...@controlq.com

May 10, 2011, 12:58:00 AM

A one file Forth, written in C, for the community (BSD licensed)...

http://www.ControlQ.com/blog/wordpress/?p=352


Cheers,
Rob Sciuk


ron

May 10, 2011, 7:29:23 AM
On May 10, 7:58 am, S...@ControlQ.com wrote:
> A one file Forth, written in C, for the community (BSD licensed)...

Very cool, thanks!

Sp...@controlq.com

May 10, 2011, 5:03:08 PM

On Tue, 10 May 2011, Sp...@ControlQ.com wrote:

> Date: Tue, 10 May 2011 00:58:00 -0400
> From: Sp...@ControlQ.com
> Newsgroups: comp.lang.forth
> Subject: ANN: An open source C based Forth for learning purposes ...

Problems in building the mingw32 version have resulted in a revision to
the MiniForth.c source. If you had no joy compiling on Windows, you might
like to try again now, as the updates are on the web site.

Just a couple of quick notes:

- does not follow a standard.
- cell size tracks the compiler's native int size.
- quoted strings need some explanation (TBD).
- stdin can be re-directed: " filename" infile
- stdout can be re-directed: " filename" outfile
- either stdin or stdout can be returned to the tty
  by pushing a null and again invoking either infile or outfile.
- base is adjustable up to 36.
- some errors are handled.
- literal constants follow C rules:
0xff - hex
0177 - octal
$abc - hex
123 - decimal
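
The literal rules listed above could be recognised with a routine along these lines. This is only a sketch, not MiniForth's actual code: `parse_literal` and `digit_value` are hypothetical names, ASCII is assumed, and signs, overflow and the run-time base setting are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one digit character in bases up to 36, or -1 if not a digit. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'z') return c - 'a' + 10;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: 0x.. and $.. are hex, a leading 0 means octal,
   anything else is decimal. Returns 1 and stores the value on success,
   0 on a malformed literal. */
static int parse_literal(const char *s, long *out)
{
    int base = 10;
    long value = 0;

    assert(s != NULL && out != NULL);
    if (s[0] == '$') { base = 16; s += 1; }
    else if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) { base = 16; s += 2; }
    else if (s[0] == '0' && s[1] != '\0') { base = 8; s += 1; }

    if (*s == '\0') return 0;            /* a prefix with no digits */
    for (; *s != '\0'; s++) {
        int d = digit_value(*s);
        if (d < 0 || d >= base) return 0;
        value = value * base + d;
    }
    *out = value;
    return 1;
}
```

With these rules a token such as `089` is rejected, because the leading zero selects octal and 8 is not an octal digit.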

strings:

" this is a string" -> goes to pad
" this is a string" save constant str
str type cr

You can unsave a string, provided you have not made any new string saves,
or created a colon definition or variable (create) since the save.

Strings are null terminated, and *NOT* counted. type therefore takes
only a pointer, and count leaves the TOS in place, and pushes the length
of the string on top.
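
The behaviour described for type and count can be sketched in C roughly as follows. The toy stack and the names `do_count` and `do_type` are made up for illustration, and the cells here are pointer-sized, which is an assumption of the sketch and not necessarily how MiniForth does it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define STACK_DEPTH 16

/* A toy data stack; cells are assumed wide enough to hold a pointer. */
static intptr_t stack[STACK_DEPTH];
static int sp = 0;                       /* number of cells in use */

static void push(intptr_t v) { assert(sp < STACK_DEPTH); stack[sp++] = v; }
static intptr_t pop(void)    { assert(sp > 0); return stack[--sp]; }

/* count ( addr -- addr len ): leave the pointer, push the scanned length. */
static void do_count(void)
{
    const char *s;
    intptr_t len = 0;

    assert(sp > 0);
    s = (const char *)stack[sp - 1];
    while (s[len] != '\0') len++;
    push(len);
}

/* type ( addr -- ): print characters until the terminating NUL. */
static void do_type(FILE *out)
{
    const char *s = (const char *)pop();
    while (*s != '\0') fputc(*s++, out);
}
```

Note that `do_count` has to walk the whole string to find its length, which is the cost debated later in the thread.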

There are other weirdnesses as well.

Thanks to those who took the time to download, test and moreover, to
report problems.

Cheers,
Rob.

Paul Rubin

May 10, 2011, 6:42:57 PM
Sp...@ControlQ.com writes:
> Strings are null terminated, and *NOT* counted. type therefore takes
> only a pointer,

Please fix that--it was a horrible mistake in C and is a mistake in
Forth. Type can still take a pointer; just embed the length in the
string.
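
One way to read this suggestion, sketched in C: the length is stored in front of the characters, so a single pointer still identifies the whole string. The `lstring` layout and function names are illustrative assumptions, not something proposed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* A string whose length travels with it (hypothetical layout). */
struct lstring {
    size_t len;
    char   data[];                       /* C99 flexible array member */
};

/* Allocate a new lstring holding a copy of len bytes; NULL on failure. */
static struct lstring *lstring_new(const char *bytes, size_t len)
{
    struct lstring *s;

    assert(bytes != NULL);
    s = malloc(sizeof *s + len);
    if (s == NULL) return NULL;
    s->len = len;
    memcpy(s->data, bytes, len);
    return s;
}

/* "type" still takes one pointer; the length is read from the object. */
static size_t lstring_type(const struct lstring *s, FILE *out)
{
    assert(s != NULL && out != NULL);
    return fwrite(s->data, 1, s->len, out);
}
```

Because nothing depends on a terminator, the data may contain NUL bytes.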

John Passaniti

May 10, 2011, 7:03:45 PM
On May 10, 6:42 pm, Paul Rubin <no.em...@nospam.invalid> wrote:

Given that part of the intent of this interpreter appears to be an
interpreter for a scripting language based on Forth, using null-
terminated strings seems reasonable if what you're scripting is C
code. If it's a mistake depends on the application's programmer.
It's a mistake if you use a function like sprintf; it's not a mistake
if you use snprintf. It's a mistake if a lot of your processing is
concatenating strings with strcat; it's not a mistake if you keep a
pointer to the end of the string and maintain the length separately.
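
The "keep a pointer to the end and maintain the length separately" approach might look like this in C. The `strbuf` type and `strbuf_append` are hypothetical names for the sketch; the buffer stays NUL-terminated so it can still be handed to ordinary C functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A buffer that remembers where its text ends, so appends need no rescan. */
struct strbuf {
    char  *base;    /* start of storage */
    size_t cap;     /* total bytes available, including room for the NUL */
    size_t len;     /* bytes in use, not counting the NUL (always < cap) */
};

/* Append len bytes of src; returns 0 on success, -1 if it would not fit. */
static int strbuf_append(struct strbuf *b, const char *src, size_t len)
{
    assert(b != NULL && src != NULL && b->len < b->cap);
    if (len >= b->cap - b->len) return -1;   /* keep one byte for the NUL */
    memcpy(b->base + b->len, src, len);
    b->len += len;
    b->base[b->len] = '\0';                  /* still usable as a C string */
    return 0;
}
```

Repeated `strcat` calls rescan the destination from the start each time; here each append costs only the bytes being added, and overflow is reported instead of happening silently.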

On the other hand, it should be pointed out that the only place the C
language defines a null-terminated string is in string literals. I
and many other embedded systems folk use a variety of string
representations, depending on the application's needs. The existence
of string.h doesn't mean you have to use it, or that it even makes
sense for some kinds of work.

The Beez'

May 11, 2011, 2:21:48 AM
On 11 May, 00:42, Paul Rubin <no.em...@nospam.invalid> wrote:
This issue has been discussed at length and never resolved. Both have
advantages and disadvantages.

Note that since the introduction of ANS strings (addr/count) the need
for either type has been diminished greatly. In practice I only need
it when fetching string variables or low level string operations. And
even then, with the appropriate COUNT/PLACE words portable programs
can be created with either an internal counted string or NULL-
terminated string format.

I really don't know why Forthers are so in love with counted strings -
NULL-terminated strings are just fine. And the argument that it is "so
C-like" isn't valid anymore: so are locals, S\", etc.

Hans Bezemer

Paul Rubin

May 11, 2011, 3:00:57 AM
"The Beez'" <han...@bigfoot.com> writes:
> I really don't know why Forthers are so in love with counted strings -
> NULL-terminated strings are just fine. And the argument that it is "so
> C-like" isn't valid anymore: so are locals, S\", etc.

The strings might contain NUL bytes that are part of the string.
Suppose they are a UTF-16 representation of Unicode, for example.
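
A small C illustration of the UTF-16 case, assuming little-endian encoding: every ASCII character carries a zero byte, so a byte-oriented, NUL-terminated routine sees the text end early. The array and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE: each ASCII character has a zero high byte. */
static const unsigned char hi_utf16le[] = { 'H', 0x00, 'i', 0x00 };

/* Length as a byte-oriented, NUL-terminated routine would see it. */
static size_t apparent_length(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}
```

Here `apparent_length(hi_utf16le)` reports 1 although the array holds 4 bytes of text.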

Elizabeth D Rather

May 11, 2011, 3:09:43 AM
On 5/10/11 8:21 PM, The Beez' wrote:
...

> I really don't know why Forthers are so in love with counted strings -
> NULL-terminated strings are just fine.

Null-terminated strings have to be counted at run-time. Counted strings
don't. Cycles are precious.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

ron

May 11, 2011, 3:52:32 AM
On May 11, 9:21 am, "The Beez'" <hans...@bigfoot.com> wrote:
> I really don't know why Forthers are so in love with counted strings -
> NULL-terminated strings are just fine. And the argument that it is "so
> C-like" isn't valid anymore: so are locals, S\", etc.

No need to waste time counting the string length before operating on
it.

In Reva the strings are counted and *also* NUL terminated; so that it
is possible to pass them to OS (or other) library routines without
doing anything extra. The added memory space tends to be less of an
issue than extra cycles, in my applications.

forth...@forthfiles.net

May 11, 2011, 4:22:21 AM

Who wants to count every character in a string just so you can jump
over it (for example) at run time?

NULL terminated strings are Satanic and should be driven from the
earth!!

Alex McDonald

May 11, 2011, 5:56:31 AM

Ditto Win32Forth.

Mat

May 11, 2011, 6:46:31 AM
Nice example, but I wonder why the label-as-value feature of gcc, icc,
clang (llvm), pcc etc. isn't used, because this would simplify the
source code and offer far better performance ?!? Platform
independence can't be the reason, because there exists a wide range of
supported platforms for both gcc and clang. For an ANSI C version,
replicated-switch threading can be used instead, with nearly the same
advantages if one concentrates on a minimal set of primitives. Just
wondering.

Alex McDonald

May 11, 2011, 8:12:18 AM

I think your newsreader is borked; there's no original message, and
you're replying to what appears to be the last poster in the thread
rather than the post you should be replying to. It's quite confusing.

Andreas

May 11, 2011, 8:59:56 AM
Mat:

There is one rule for this, that there is no rule! ;-)

IOW VM efficiency is so CPU dependent that one either has to decide and
optimize individually for every target CPU (like Intel C compilers do)
or just neglect it. Otherwise CPU caching and branch misprediction
can wipe out all "clever" jump optimizations. In many cases even
simple subroutine threading is the way to go - the "superfluous and
cycle-wasting" RET just doesn't "show" as a slowdown in execution speed.

Andreas

Anton Ertl

May 11, 2011, 9:26:43 AM
"The Beez'" <han...@bigfoot.com> writes:
>I really don't know why Forthers are so in love with counted strings -
>NULL-terminated strings are just fine.

Both suck. Zero-terminated strings suck for a variety of reasons;
some have already been mentioned, but the most important one is that
they cannot contain NULs. As a result, C has two sets of functions
for dealing with blocks of characters, the str... set for
zero-terminated blocks and the mem... set for blocks given by start
and length (similar to modern Forth). If that redundancy does not
indicate a design mistake, what does?
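
The difference between the two families is easy to demonstrate on a block that contains a NUL. The wrapper names below are hypothetical; only `strcpy`, `strlen` and `memcpy` are the real library calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy with the str... family: stops at the first NUL in src.
   Returns how many characters arrived in dst. */
static size_t copy_with_str(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    strcpy(dst, src);
    return strlen(dst);
}

/* Copy with the mem... family: copies exactly len bytes, NULs included. */
static size_t copy_with_mem(char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    return len;
}
```

Given the five bytes `a b NUL c d`, the first function delivers two characters and the second delivers all five.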

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2010: http://www.euroforth.org/ef10/

Pablo Hugo Reda

May 11, 2011, 9:38:07 AM
On 11 May, 04:09, Elizabeth D Rather <erat...@forth.com> wrote:
> On 5/10/11 8:21 PM, The Beez' wrote:
> ...
>
> > I really don't know why Forthers are so in love with counted strings -
> > NULL-terminated strings are just fine.
>
> Null-terminated strings have to be counted at run-time.  Counted strings
> don't.  Cycles are precious.
>
Cycles are precious.
Generally the most common operation is traversing a string, and a
null-terminated string is easier to traverse; a counted string
needs one extra number for the count.

And as a bonus, a null-terminated string can fill the RAM!! A counted
string has a limit.

BruceMcF

May 11, 2011, 9:51:13 AM
On May 11, 9:38 am, Pablo Hugo Reda <pablor...@gmail.com> wrote:
> and for bonus, null terminated can fill the ram !!, counter has a
> limit

With cell-counted strings, the limit is typically the limits of the
address space.

Sp...@controlq.com

May 11, 2011, 9:54:58 AM
On Wed, 11 May 2011, Mat wrote:

> Date: Wed, 11 May 2011 03:46:31 -0700 (PDT)
> From: Mat <dam...@web.de>
> Newsgroups: comp.lang.forth
> Subject: Re: ANN: An open source C based Forth for learning purposes ...

Actually for a token threaded model, the 'label-as-value' model is fine. I
just find large switch statements to be ugly. MiniForth will never touch
Gforth's performance, but it will provide an accessible and (hopefully)
understandable model upon which to build.

For example, if you prefer to implement a label based token threaded forth
based upon MiniForth -- feel free (BSD licensed), but be prepared to
address alignment issues as well ... or lose portability.

If you want to write an assembler in a subset dialect of MiniForth to use
as an umbilicus to support your favourite SoC -- feel free (BSD licensed).

If you want to change null terminated string to counted strings ... feel
free, but that would break a planned future compatibility with C
libraries.

Rob.

Sp...@controlq.com

May 11, 2011, 10:05:03 AM
On Tue, 10 May 2011, Elizabeth D Rather wrote:

> On 5/10/11 8:21 PM, The Beez' wrote:
> ...
>
>> I really don't know why Forthers are so in love with counted strings -
>> NULL-terminated strings are just fine.
>
> Null-terminated strings have to be counted at run-time. Counted strings
> don't. Cycles are precious.
>
> Cheers,
> Elizabeth

Elizabeth,

I too believe that cycles are important, and far too many are wasted upon
the windows OS. I see from your headers that you are on a Mac :-) well
done!

As for the cycles used to count strings, I gladly expend them in the
larger goal of being able to link C based libraries, and support dynamic
linking in a future revision of MiniForth.

Look at Tim T. Russell's use of GForth and SDL to re-code some very
informative tutorials for use by forth scripters. I find Mr. Russell's
efforts not only commendable, but a model for similar contributions to
the community ...

Rob.

Sp...@controlq.com

May 11, 2011, 10:14:11 AM
On Wed, 11 May 2011, forth...@forthfiles.net wrote:
[snip]

> Who wants to count every character in a string just so you can jump
> over it (for example) at run time?
>
> NULL terminated strings are Satanic and should be driven from the
> earth!!
>

LOL ... you may be right, BUT ... you would be forced to ignore a
considerable body of work, much of it quite good if you couldn't link to
code which uses null terminated strings ... de facto ...

Rob.

Sp...@controlq.com

May 11, 2011, 10:18:33 AM
On Tue, 10 May 2011, John Passaniti wrote:

[snip]

> if you use snprintf. It's a mistake if a lot of your processing is
> concatenating strings with strcat; it's not a mistake if you keep a
> pointer to the end of the string and maintain the length separately.

A closer examination of MiniForth will show that I did not link to any of
the str*()'s, but rather implemented all string and numeric conversion
stuff internally, to support native (non-hosted) applications without
relying upon great big libraries.

> On the other hand, it should be pointed out that the only place the C
> language defines a null-terminated string is in string literals. I
> and many other embedded systems folk use a variety of string
> representations, depending on the application's needs. The existence
> of string.h doesn't mean you have to use it, or that it even makes
> sense for some kinds of work.

True.

Sp...@controlq.com

May 11, 2011, 10:19:56 AM
On Tue, 10 May 2011, Paul Rubin wrote:

> Date: Tue, 10 May 2011 15:42:57 -0700
> From: Paul Rubin <no.e...@nospam.invalid>
> Newsgroups: comp.lang.forth
> Subject: Re: ANN: An open source C based Forth for learning purposes ...

Sorry, Paul. "fixing" the feature would relegate MiniForth to too small a
sandbox for its original intent.

Rob.

Sp...@controlq.com

May 11, 2011, 10:21:59 AM
On Wed, 11 May 2011, Anton Ertl wrote:

> Date: Wed, 11 May 2011 13:26:43 GMT
> From: Anton Ertl <an...@mips.complang.tuwien.ac.at>
> Newsgroups: comp.lang.forth
> Subject: Re: ANN: An open source C based Forth for learning purposes ...


>
> "The Beez'" <han...@bigfoot.com> writes:
>> I really don't know why Forthers are so in love with counted strings -
>> NULL-terminated strings are just fine.
>
> Both suck. Zero-terminated strings suck for a variety of reasons;
> some have already been mentioned, but the most important one is that
> they cannot contain NULs. As a result, C has two sets of functions
> for dealing with blocks of characters, the str... set for
> zero-terminated blocks and the mem... set for blocks given by start
> and length (similar to modern Forth). If that redundancy does not
> indicate a design mistake, what does?
>
> - anton
>

Interesting. I'd never considered the mem*()'s to be an alternative to
the str*()'s, but rather a more efficient implementation which requires
the user to provide the bufs and lens. I'd still proffer an argument that
mem stuff and str stuff are different things, though.

Andrew Haley

May 11, 2011, 10:39:50 AM
Mat <dam...@web.de> wrote:
> Nice example but I wonder why the label-as-value feature of gcc, icc,
> clang (llvm), pcc etc. isn't used because this would simplify the
> source code and offering far better performance ?!? Platform
> independence can't be the reason because there exist a wide range of
> supported platform for both gcc and clang.

Because if he had done that, someone else would have popped up and
said "Why not write this in Standard C?"

Andrew.

Anton Ertl

May 11, 2011, 11:25:32 AM

I.e., whichever way it is decided, this is a design decision that
people are interested in.

Concerning the performance issues that have been mentioned:

Nowadays gcc typically compiles goto * to code that eliminates most of
the potential performance advantage of using labels-as-values (a
factor of 2 on modern Intel and AMD processors due to improved branch
prediction). It's actually possible to regain that branch prediction
performance advantage on gcc by using repicated switches instead of
labels-as-values, at the cost of some additional overhead elsewhere; I
have not tested how robust that gain is across gcc versions, though.

When I tested compiling gforth on LLVM, I found out that they
implement labels-as-values and goto * apparently through a switch
statement: each label is represented by a small integer. I am not
sure if the branch prediction advantage is present in code generated
by LLVM, but in any case it was extremely slow; IIRC Gforth compiled
with LLVM (llvm-gcc though) was a factor of 6 slower than Gforth
compiled with gcc. That's too bad, some competition would help.

ICC crashed when I last tried to compile Gforth with it (many years
ago), and it's proprietary anyway.

I have not tried pcc for Gforth. I did try tcc, though, but it did
not like some of the code; I would not expect high code quality from
tcc, anyway (and Gforth is written in a way that requires copy
propagation for good performance).

Andrew Haley

May 11, 2011, 12:31:08 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

>
> ICC crashed when I last tried to compile Gforth with it (many years
> ago), and it's proprietary anyway.

Yeah. I hear it's very good on benchmarks, though. :-)

Andrew.

The Beez'

May 11, 2011, 12:32:44 PM
On 11 May, 15:51, BruceMcF <agil...@netscape.net> wrote:
> With cell-counted strings, the limit is typically the limits of the
> address space.
How many do implement that one? Like I said, I don't want an old
discussion here again. Last time I checked there were as many
advantages to counted strings as to null-terminated strings, and as
many string operations that were easier with null-terminated strings
as with counted strings. And when parsing strings, there is no
difference at all, since both use a terminator. And yes, we have UTF-8
and we don't handle UTF-16 in a similar way, because it's simply not
compatible with ASCII. So all in all, I think these precious cycles
average each other out in a non-trivial program. Don't exaggerate.

Point is, when you do your job properly, you load a string ONCE in
memory, convert it to addr/count and leave it that way until you use
PLACE.

Hans Bezemer

Rod Pemberton

May 11, 2011, 12:38:40 PM

"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
news:2011May1...@mips.complang.tuwien.ac.at...

> "The Beez'" <han...@bigfoot.com> writes:
> >I really don't know why Forthers are so in love with counted strings -
> >NULL-terminated strings are just fine.
>
> Both suck. Zero-terminated strings suck for a variety of reasons;
> some have already been mentioned, but the most important one is that
> they cannot contain NULs. As a result, C has two sets of functions
> for dealing with blocks of characters, the str... set for
> zero-terminated blocks and the mem... set for blocks given by start
> and length (similar to modern Forth). If that redundancy does not
> indicate a design mistake, what does?
>

That's a deceptive explanation. The C mem... functions are not just for
characters. They are for non-character data also. C code handles all sorts
of raw data. So, what redundancy? If anything, one should claim that
str... functions are a design mistake since they *do not* handle
non-character data...


Rod Pemberton


Rod Pemberton

May 11, 2011, 12:39:41 PM
"Elizabeth D Rather" <era...@forth.com> wrote in message
news:4vGdnZOxu5iqq1fQ...@supernews.com...

> On 5/10/11 8:21 PM, The Beez' wrote:
> ...
>
> > I really don't know why Forthers are so in love with counted strings -
> > NULL-terminated strings are just fine.
>
> Null-terminated strings have to be counted at run-time.
>

Why? How often does one really need the length of a string?

Counted strings can have the problem of a mismatch between counted string
length vs. actual string length. Something must keep updating the counter
if the string is of variable size, which is no better. Counted strings
"saving" time is only valid for fixed-length strings, and that assumes that
one needs the length in the first place...


Rod Pemberton


Andrew Haley

May 11, 2011, 12:52:32 PM
Rod Pemberton <do_no...@noavailemail.cmm> wrote:
> "Elizabeth D Rather" <era...@forth.com> wrote in message
> news:4vGdnZOxu5iqq1fQ...@supernews.com...
>> On 5/10/11 8:21 PM, The Beez' wrote:
>> ...
>>
>> > I really don't know why Forthers are so in love with counted strings -
>> > NULL-terminated strings are just fine.
>>
>> Null-terminated strings have to be counted at run-time.
>>
>
> Why? How often does one really need the length of a string?

A heck of a lot, IME. Even when you're copying strings, it's a lot
easier if you know how much you have to copy. Sure, you can use some
tricky code to discover if the cell you're copying has a zero in it
somewhere, but it's better not to need to.
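
The "tricky code" alluded to is presumably something like the well-known bit trick below, shown here for a 32-bit cell. The function names are made up for the sketch; `load4` uses `memcpy` so that alignment is not an issue.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in v is zero. */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - UINT32_C(0x01010101)) & ~v & UINT32_C(0x80808080);
}

/* Load four bytes from p without alignment problems. */
static uint32_t load4(const char *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

A cell-at-a-time copy of a terminated string has to run this test on every cell and then fall back to byte handling for the last one; with a known length none of that is needed.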

> Counted strings can have the problem of a mismatch between counted
> string length vs. actual string length.

That's not possible.

> Something must keep updating the counter if the string is of
> variable size, which is no better. Counted strings "saving" time is
> only valid for fixed-length strings, and that assumes that one needs
> the length in the first place...

I'm trying and failing to think of a string that doesn't have a
length.

Andrew.

Rod Pemberton

May 11, 2011, 2:05:03 PM
"Andrew Haley" <andr...@littlepinkcloud.invalid> wrote in message
news:hbadndhSJfBNI1fQ...@supernews.com...

> Rod Pemberton <do_no...@noavailemail.cmm> wrote:
> > "Elizabeth D Rather" <era...@forth.com> wrote in message
> > news:4vGdnZOxu5iqq1fQ...@supernews.com...
> >> On 5/10/11 8:21 PM, The Beez' wrote:
> >> ...
> >>
> >> > I really don't know why Forthers are so in love with counted strings -
> >> > NULL-terminated strings are just fine.
> >>
> >> Null-terminated strings have to be counted at run-time.
> >>
> >
> > Why? How often does one really need the length of a string?
>
> A heck of a lot, IME.
>

For what? It's not needed for printing. Print until there's a terminator.
It's not needed for copying. Copy until there's a terminator. Etc. As
long as the string is valid, i.e., has a terminator, there isn't an issue
with length.

> Even when you're copying strings, it's a lot
> easier if you know how much you have to copy.
>

How often does one really need to copy strings?

In assembly, copying is going to require a loop, even if you know the length
in advance. Why not loop until a terminator is seen?

> Sure, you can use some
> tricky code to discover if the cell you're copying has
> a zero in it somewhere, but it's better not to need to.
>

Something must count the length. If it's a string of varying length, it
must be counted at run-time anyway, not compile time.

> > Counted strings can have the problem of a mismatch between counted
> > string length vs. actual string length.
>
> That's not possible.
>

It is if the count isn't kept synced with the string's length. It's a
problem with strings of variable length.

> > Something must keep updating the counter if the string is of
> > variable size, which is no better. Counted strings "saving" time is
> > only valid for fixed-length strings, and that assumes that one needs
> > the length in the first place...
>
> I'm trying and failing to think of a string that doesn't have a
> length.
>

Why?

I'm trying and failing to think of a string of which I'd need to know the
length. Too much C?


RP


Rod Pemberton

May 11, 2011, 2:19:00 PM
"Paul Rubin" <no.e...@nospam.invalid> wrote in message
news:7xaaetu...@ruckus.brouhaha.com...

So? Wouldn't the NUL-termination be the same size or larger than the
character size, i.e., 16-bits all zero'd? I.e., the UTF-16 terminator
should be *two* NUL bytes, shouldn't it?

Do the 8-bit NUL bytes have any effect on 16-bit characters? Remember, even
C defines the termination of strings as by a C "byte with all bits zero" and
*not* by a C character. A C "byte" is not defined as 8-bits per ASCII or
EBCDIC. It's defined as an addressable unit of storage large enough to hold
a character, or similar. The result is that the C byte must be as large or
larger in size as the C character. Today, they are typically the same size.
That's because most modern platforms are 8-bit byte addressable and use
8-bit bytes for characters. But, they weren't always. I.e., if char's are
9-bits, addressing is 8-bits, the NUL "byte" terminator for C would be
16-bits, all cleared. IIRC, they ran into this problem with late B or very
early C on 16-bit word addressable machines, i.e., 9-bit characters,
addressing is 16-bits. The point is that a terminator does not have to be a
character.
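
The point about the terminator being a full 16-bit unit can be sketched in C as follows, assuming strings are held as arrays of `uint16_t` code units. `utf16_length` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length, in 16-bit code units, of a string terminated by a 16-bit zero. */
static size_t utf16_length(const uint16_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0) n++;      /* only an all-zero unit ends the string */
    return n;
}
```

A unit such as 0x0100 contains a zero byte but does not terminate the string, since the comparison is made on whole units. Byte-oriented routines like `strlen` still cannot be used on such data.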


Rod Pemberton


ron

May 11, 2011, 2:52:22 PM
On May 11, 9:05 pm, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:


> I'm trying and failing to think of a string of which I'd need to know the
> length.  Too much C?

Unless you only ever deal with statically allocated buffers, it's a
good idea to know how much memory you will need to allocate to hold a
result string.

Concatenating two strings into a third, allocated, buffer is a very
common thing to want to do. Not knowing the strings' sizes means one
has to traverse the source strings twice -- first to get the length
and allocate the buffer, and a second time to perform the copy itself.

There are numerous cases where knowing the length of a string saves a
lot of time.
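
The concatenation case can be sketched in C like this. The function names are illustrative; the second function shows the extra `strlen` passes that the NUL-terminated form needs before the buffer can even be allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate two addr/len strings into a freshly allocated,
   NUL-terminated buffer. Both lengths are known, so each source is
   read only once. Returns NULL if allocation fails. */
static char *concat_counted(const char *a, size_t alen,
                            const char *b, size_t blen)
{
    char *out;

    assert(a != NULL && b != NULL);
    out = malloc(alen + blen + 1);
    if (out == NULL) return NULL;
    memcpy(out, a, alen);
    memcpy(out + alen, b, blen);
    out[alen + blen] = '\0';
    return out;
}

/* The NUL-terminated equivalent must scan both inputs for their length
   first, then copy: two traversals of each source. */
static char *concat_terminated(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return concat_counted(a, strlen(a), b, strlen(b));
}
```

The caller owns the result and releases it with `free`.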

Stephen Pelc

May 11, 2011, 3:31:40 PM
On Wed, 11 May 2011 14:05:03 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:

>I'm trying and failing to think of a string of which I'd need to know the
>length. Too much C?

Function names in shared libraries, especially mangled C++ names, have
been observed in the wild at over 2000 characters. The vast majority
of functions differ only in the last few letters. Hence, ELF shared
objects (as used by most Unices) are very slow to find. When there
are hundreds of thousands of functions, e.g. Open/LibreOffice, initial
start up is dog slow.

Search for Ulrich Drepper on the topic.

Stephen


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

BruceMcF

May 11, 2011, 4:04:03 PM
On May 11, 12:32 pm, "The Beez'" <hans...@bigfoot.com> wrote:
> On 11 May, 15:51, BruceMcF <agil...@netscape.net> wrote:
> > With cell-counted strings, the limit is typically the limits of the
> > address space.

> How many do implement that one?

Cell counted in place, or cell counted by reference? I'd expect cell
counted is the normal way to handle strings embedded in the dictionary
that are not of known-byte-count size, but there's no need for any
separate handling words that take ( ca ), since the {S"} factor or
embedded string constant would be delivered to the stack in:
( ca u )
... form.

If the system expects nul-terminated strings, appending a nul to
stored embedded strings seems like a fine way to avoid a double move
to append the nul in a working buffer, but there's no avoiding the
move if the system expects a nul-terminated string and the string is
actually a substring.
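
The substring case can be illustrated in C. With an addr/len pair, a substring is just a different pair pointing into the same buffer; passing it to code that expects a terminator forces a copy. The `view` type and helper names are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* An addr/len view into someone else's buffer. */
struct view { const char *addr; size_t len; };

/* Taking a substring copies nothing: only the pair changes. */
static struct view subview(struct view s, size_t start, size_t len)
{
    struct view v;

    assert(start <= s.len && len <= s.len - start);
    v.addr = s.addr + start;
    v.len = len;
    return v;
}

/* Handing the view to NUL-terminated code requires a copy with a NUL
   appended. Returns NULL if allocation fails. */
static char *view_to_cstring(struct view s)
{
    char *out = malloc(s.len + 1);

    if (out == NULL) return NULL;
    memcpy(out, s.addr, s.len);
    out[s.len] = '\0';
    return out;
}
```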

> Point is, when you do your job properly, you load a string ONCE in
> memory, convert it to addr/count and leave it that way until you use
> PLACE.

I'd think that what you do when you do your job properly is more
context sensitive than that. The optimal "what you do when you do your
job properly" when skimming through a page source looking for embedded
links to external streaming hosts and what you do when using strings
flashed into the message string pool for a remote data logger do not
on the face of it look likely to be the same thing.

BruceMcF

May 11, 2011, 4:12:52 PM
On May 11, 12:39 pm, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:

> Counted strings can have the problem of a mismatch between counted
> string length vs. actual string length.  Something must keep
> updating the counter if the string is of variable size, ...

So to make this concrete, you are saying that concatenating two
strings by:

( s1 s2 s3 -- s4 )
\ s1 is head of new string, s2 is tail of new string
\ s3 is buffer that holds the new string
\ s4 is new string
\ error thrown if s1+s2>s3

... is prone to len(s4)<>len(s1)+len(s2) if the length is
explicitly on the stack rather than implied by a nul terminator?

Rod Pemberton

May 11, 2011, 7:01:24 PM
"BruceMcF" <agi...@netscape.net> wrote in message
news:6ab7abf1-9e5f-4e7e...@m40g2000vbt.googlegroups.com...

> On May 11, 12:39 pm, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
> wrote:
> > Counted strings can have the problem of a mismatch between counted
> > string length vs. actual string length. Something must keep
> > updating the counter if the string is of variable size, ...
>
> So to make this concrete, you are saying that concatenating
> two strings by:
>

Counted strings have their length stored somewhere, yes? So, what happens
when the string is changed? The actual length of the string is now
different than the stored length. Some code, somewhere, must recount, at
run-time, and restore, at run-time, the length of the string.

Let's say you place that count at the start of a string:

[length][s0][s1][s2][s3][s4][s5][s6]...

Then, at compile time, the string is set to "Hello" (without a NUL) and
count stored:

5 "Hello"

But, the string is allowed to be changed, i.e., variable not fixed. So, now
you change the string, at run-time, so it says "Goodbye" instead of "Hello":

X "Goodbye"

What value is X? Is X still 5 because there is no run-time code to count
the length? Or, is there a function that resets X to 7? Or, is there
behind-the-scenes code, say in the string copying function, that sets the
new length? The actual string length is 7, not 5. This is the issue with
counted strings, if those strings are not of fixed length, i.e., variable.
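
The "behind-the-scenes code in the string copying function" option can be sketched in C, in the manner of the PLACE word mentioned earlier in the thread: the routine that stores the characters also stores the count, so the two cannot drift apart as long as all updates go through it. The one-byte count and the name `place` are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COUNTED 255     /* a one-byte count limits the length */

/* Store len bytes of src at dst as [count][characters...].
   dst needs room for len + 1 bytes. Returns 0 on success, -1 if the
   string is too long for a one-byte count. */
static int place(unsigned char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len > MAX_COUNTED) return -1;
    dst[0] = (unsigned char)len;         /* count rewritten on every store */
    memcpy(dst + 1, src, len);
    return 0;
}
```

Storing "Hello" and then "Goodbye" through this routine leaves the count at 5 and then at 7, with no separate recount.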

IIRC, PL/1 dealt with this issue...


Rod Pemberton


Elizabeth D Rather

May 11, 2011, 8:00:38 PM
On 5/11/11 8:05 AM, Rod Pemberton wrote:
> "Andrew Haley"<andr...@littlepinkcloud.invalid> wrote in message
...

>>> Why? How often does one really need the length of a string?
>>
>> A heck of a lot, IME.
>>
>
> For what? It's not needed for printing. Print until there's a terminator.
> It's not needed for copying. Copy until there's a terminator. Etc. As
> long as the string is valid, i.e., has a terminator, there isn't an issue
> with length.

In Forth, TYPE and most other string operators take <addr-len> stack
parameters. It's both faster and easier to type a specified number of
characters than to have to examine each one to see if you can quit yet.

>> Even when you're copying strings, it's a lot
>> easier if you know how much you have to copy.
>>
>
> How often does one really need to copy strings?
>
> In assembly, copying is going to require a loop, even if you know the length
> in advance. Why not loop until a terminator is seen?

It obviously depends on the application, but in my experience strings
get moved fairly often, e.g. from the terminal input buffer to WORD's
space. And, again, it's faster to simply run a loop with a count than
to examine each character.

>> Sure, you can use some
>> tricky code to discover if the cell you're copying has
>> a zero in it somewhere, but it's better not to need to.
>
> Something must count the length. If it's a string of varying length, it
> must be counted at run-time anyway, not compile time.

The parsing words in Forth count lengths. Forth does not generally
handle variable-length strings (they may occur in applications, though
rarely in my experience, but not in Standard Forth).

>>> Counted strings can have the problem of a mismatch between counted
>>> string length vs. actual string length.
>>
>> That's not possible.
>>
>
> It is if the count isn't kept synced with the string's length. It's a
> problem with strings of variable length.

As noted above, Forth doesn't deal with variable-length strings.

>>> Something must keep updating the counter if the string is of
>>> variable size, which is no better. Counted strings "saving" time is
>>> only valid for fixed-length strings, and that assumes that one needs
>>> the length in the first place...
>>
>> I'm trying and failing to think of a string that doesn't have a
>> length.
>>
>
> Why?
>
> I'm trying and failing to think of a string of which I'd need to know the
> length. Too much C?

Apparently!

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

BruceMcF

unread,
May 11, 2011, 8:13:26 PM5/11/11
to
On May 11, 7:01 pm, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:

> But, the string is allowed to be changed, i.e., variable not fixed.
> So, now you change the string, at run-time, so it says "Goodbye"
> instead of "Hello"

Yeah? And if that's done at runtime, you know the count of the new
string. If that is transitory, store the new count where you store the
new address:
{string} 2!

... and if it is permanent, store the new count with the new string in
its permanent location ...
HERE S, <string> !


> What value is X?
The same value it was when you were originally handed the new value of
the string.

Paul Rubin

unread,
May 11, 2011, 8:29:41 PM5/11/11
to
"Rod Pemberton" <do_no...@noavailemail.cmm> writes:
> 5 "Hello"
> But, the string is allowed to be changed, i.e., variable not fixed. So, now
> you change the string, at run-time, so it says "Goodbye" instead of "Hello":
> X "Goodbye"
>
> What value is X? Is X still 5 because there is no run-time code to count
> the length? Or, is there a function that resets X to 7?

"Hello" is a 5-char string and "Goodbye" is a 7-char string. You may be
confusing the string length with the amount of space allocated for the
string (which might exceed the length, especially in situations where
the string can grow). Obviously in an application where the two sizes
can be unequal, you have to keep track of both.
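
As a minimal sketch of keeping both sizes, assuming a fixed 16-byte buffer (the struct `vstr` and the function `vstr_set` are hypothetical names, not from any library): the length is written in the same step that writes the bytes, so it cannot drift away from the contents.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical growable string: length in use and space allocated are separate. */
struct vstr {
    size_t len;        /* characters currently in use */
    size_t cap;        /* bytes available in data[]   */
    char   data[16];
};

/* Replace the contents; returns 0 on success, -1 if the new text will not fit. */
int vstr_set(struct vstr *s, const char *src, size_t n)
{
    if (n > s->cap)
        return -1;
    memcpy(s->data, src, n);
    s->len = n;        /* the count is stored together with the bytes */
    return 0;
}
```

Setting "Hello" and then "Goodbye" leaves `len` at 5 and then 7, while `cap` stays at 16 throughout.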

Paul Rubin

unread,
May 11, 2011, 8:43:33 PM5/11/11
to
"Rod Pemberton" <do_no...@noavailemail.cmm> writes:
> So? Wouldn't the NUL-termination be the same size or larger than the
> character size, i.e., 16-bits all zero'd? I.e., the UTF-16 terminator
> should be *two* NUL bytes, shouldn't it?

But now you have to care about whether some given byte string in memory
semantically represents a UTF-16 encoding or something else.

The basic problem here is that NUL is a perfectly good character in both
ascii and unicode, and in general it can occur in the middle of strings.
C strings are not a sequence c1,c2,c3... where c1,c2... can be any
character. They are restricted against containing a certain character
that doesn't show up in most text messages, but ends up needing special
handling in enough cases to cause problems.

I do agree with the suggestion of allocating an extra byte after the end
of the string and setting it to NUL, to prevent any C library strxxx
functions from going off into the weeds. I've seen several Lisp
implementations that do that.
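
A small C sketch of that arrangement, with hypothetical names (`struct cstr`, `cstr_make`): the stored length is the real one and may cover embedded NULs, and one extra zero byte sits past the end purely as a guard for C library calls.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string that also carries a guard NUL for C callers. */
struct cstr {
    size_t len;    /* true length; embedded NULs are counted */
    char  *data;   /* len bytes followed by one extra '\0'   */
};

/* Copy n bytes and append the guard byte; data is NULL on allocation failure. */
struct cstr cstr_make(const char *src, size_t n)
{
    struct cstr s = { 0, NULL };
    s.data = malloc(n + 1);
    if (s.data != NULL) {
        memcpy(s.data, src, n);
        s.data[n] = '\0';   /* not part of the string, only a safety net */
        s.len = n;
    }
    return s;
}
```

For the seven bytes `"abc\0def"`, `len` is 7 while `strlen()` on the same data stops at 3, which is exactly the restriction on C strings described above.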

Paul Rubin

unread,
May 11, 2011, 8:44:01 PM5/11/11
to
Sp...@ControlQ.com writes:
> Sorry, Paul. "fixing" the feature would relegate MiniForth to too
> small a sandbox for its original intent.

I guess I don't understand what you mean by that.

Albert van der Horst

unread,
May 11, 2011, 10:14:49 PM5/11/11
to
In article <4vGdnZOxu5iqq1fQ...@supernews.com>,

Elizabeth D Rather <era...@forth.com> wrote:
>On 5/10/11 8:21 PM, The Beez' wrote:
>...
>
>> I really don't know why Forthers are so in love with counted strings -
>> NULL-terminated strings are just fine.
>
>Null-terminated strings have to be counted at run-time. Counted strings
>don't. Cycles are precious.

Add to that the need to copy them, just to put zero behind to make them
a proper string. It is not just the copying, it is also the need to find
a place to store the copy.
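
To illustrate, a sketch in C with made-up helper names (`next_word`, `to_cstring`): a word parsed out of an input line can be described by a pointer and a length with no copy at all, but handing it to anything that expects a terminated string means allocating and copying first.

```c
#include <stdlib.h>
#include <string.h>

/* Find the next blank-delimited word in buf[0..len); hypothetical helper.
   Returns a pointer into buf (no copy) and stores the word length in *wlen. */
const char *next_word(const char *buf, size_t len, size_t *pos, size_t *wlen)
{
    size_t i = *pos;
    while (i < len && buf[i] == ' ')
        i++;                          /* skip leading blanks */
    size_t start = i;
    while (i < len && buf[i] != ' ')
        i++;                          /* scan to the end of the word */
    *pos = i;
    *wlen = i - start;
    return buf + start;
}

/* A terminated-string API forces a copy, plus somewhere to keep it. */
char *to_cstring(const char *addr, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, addr, len);
        copy[len] = '\0';
    }
    return copy;                      /* caller must free() */
}
```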

>
>Cheers,
>Elizabeth

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

BruceMcF

unread,
May 11, 2011, 9:27:51 PM5/11/11
to
On May 11, 8:29 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Obviously in an application where the two sizes
> can be unequal, you have to keep track of both.

And just as obviously, it would be silly to change the string but not
store the new length, when you knew the length when you first moved it
into that location.

Andrew Haley

unread,
May 12, 2011, 5:30:57 AM5/12/11
to
Rod Pemberton <do_no...@noavailemail.cmm> wrote:
> "Andrew Haley" <andr...@littlepinkcloud.invalid> wrote in message
> news:hbadndhSJfBNI1fQ...@supernews.com...
>> Rod Pemberton <do_no...@noavailemail.cmm> wrote:
>> > "Elizabeth D Rather" <era...@forth.com> wrote in message
>> > news:4vGdnZOxu5iqq1fQ...@supernews.com...
>> >> On 5/10/11 8:21 PM, The Beez' wrote:
>> >> ...
>> >>
>> >> > I really don't know why Forthers are so in love with counted
>> >> > strings -
>> >> > NULL-terminated strings are just fine.
>> >>
>> >> Null-terminated strings have to be counted at run-time.
>> >>
>> >
>> > Why? How often does one really need the length of a string?
>>
>> A heck of a lot, IME.
>
> For what? It's not needed for printing. Print until there's a terminator.
> It's not needed for copying. Copy until there's a terminator.

But you need to know the length of a string before you copy it,
because you have to make sure there is room.
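
A sketch of the room check in C (function names are hypothetical): with a counted source the check is one comparison, while a terminated source has to be scanned once just to learn whether it fits.

```c
#include <stddef.h>
#include <string.h>

/* Counted source: the room check is a single comparison. */
int copy_counted(char *dst, size_t cap, const char *src, size_t len)
{
    if (len > cap)
        return -1;              /* would overflow: refuse */
    memcpy(dst, src, len);
    return 0;
}

/* Terminated source: the string is scanned once just to learn its size. */
int copy_terminated(char *dst, size_t cap, const char *src)
{
    size_t len = strlen(src);   /* full pass over the source */
    if (len + 1 > cap)
        return -1;
    memcpy(dst, src, len + 1);  /* include the terminator */
    return 0;
}
```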

And you need two routines that copy memory: one that scans until a
terminator, and one that can copy binary data.

> Etc. As long as the string is valid, i.e., has a terminator, there
> isn't an issue with length.
>
>> Even when you're copying strings, it's a lot
>> easier if you know how much you have to copy.
>
> How often does one really need to copy strings?

I think this is becoming something of an argument about programming
style. IME, copying strings isn't so very unusual. A string stack
isn't unusual in Forth, and I wouldn't want to do that with strings
with terminators. Sure, to find the third item you could scan the
strings or maintain a separate array of pointers to each string, but
it's not nice.

> In assembly, copying is going to require a loop, even if you know
> the length in advance. Why not loop until a terminator is seen?

Like I said, looking for a zero in a cell is hard. Obviously, you
don't want to copy a string one byte at a time.
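
For reference, this is the kind of trick meant here, sketched in C for a 32-bit cell. The bit expression is the widely known "has a zero byte" test; the function names are made up, and the scan works on a bounded array of cells so that it never reads past the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if any of the four bytes in v is 0x00 (well-known bit trick). */
uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Index of the first cell that contains a zero byte, or ncells if none does. */
size_t first_cell_with_zero(const uint32_t *cells, size_t ncells)
{
    size_t i = 0;
    while (i < ncells && !has_zero_byte(cells[i]))
        i++;
    return i;
}
```

It only says that a terminator is somewhere in the cell; finding which byte it is, and handling unaligned starts, still needs extra code that a counted copy never has to run.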

>> > Counted strings can have the problem of a mismatch between counted
>> > string length vs. actual string length.
>>
>> That's not possible.
>
> It is if the count isn't kept synced with the string's length. It's a
> problem with strings of variable length.

The count is the string's length. There cannot be a mismatch because
there's nothing for it to be mismatched with.

>> > Something must keep updating the counter if the string is of
>> > variable size, which is no better. Counted strings "saving" time is
>> > only valid for fixed-length strings, and that assumes that one needs
>> > the length in the first place...
>>
>> I'm trying and failing to think of a string that doesn't have a
>> length.
>
> Why?

Please give me an example of a string that has no length.

The real kicker here is the need to maintain two interfaces to every
interface that handles data: one that reads/writes terminated strings,
and one that reads/writes binary data. The C library is plagued with
this: fputs() and fwrite(), memcpy() and strcpy() and strncpy(), etc,
etc.
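
A small sketch of why both halves of each pair have to exist (the wrapper names are hypothetical): the str... call stops at the first zero byte, so binary data needs the mem... call, and the library ends up carrying both.

```c
#include <string.h>

/* Copy a record of n bytes that may contain zero bytes: needs the mem... call. */
void copy_binary(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Copy text up to and including its terminator: the str... call. */
void copy_text(char *dst, const char *src)
{
    strcpy(dst, src);   /* caller must guarantee dst is large enough */
}
```

Given the five bytes `"ab\0cd"`, `copy_binary` with `n = 5` transfers all of them, while `copy_text` stops after the third.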

Andrew.

Stephen Pelc

unread,
May 12, 2011, 5:42:53 AM5/12/11
to
On Wed, 11 May 2011 14:05:03 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:

>In assembly, copying is going to require a loop, even if you know the length
>in advance. Why not loop until a terminator is seen?

Because, on many CPUs, a tuned copy routine that knows the length of
the string is 4 to 5 times faster than the dumb byte by byte copy.
We measured it on ARM32 and Cortex-M.

Mat

unread,
May 12, 2011, 5:45:05 AM5/12/11
to
On 11 Mai, 14:59, Andreas <a....@nospam.org> wrote:
> Mat:

>
> > Nice example but I wonder why the label-as-value feature of gcc, icc,
> > clang (llvm), pcc etc. isn't used because this would simplify the
> > source code and offering far better performance ?!? Platform
> > independence can't be the reason because there exist a wide range of
> > supported platform for both gcc and clang. For an ANSI C version
> > replicated-switch threading can be used instead with near the same
> > advantages if one concentrate on a minimal set of primitives. Just
> > wondering.
>
> There is one rule for this, that there is no rule!  ;-)
>
> IOW VM efficiency is so CPU dependent, that one either has to decide and
> optimize individually for every target CPU (like Intel C compilers do)
> or you just neglect it. Otherwise CPU caching and branch misprediction
> can eliminate all "clever" jump optimizations away. In many cases even
> simple subroutine threading is the way to go - the "superfluous and
> cycle-wasting" RET just doesn't "show" to slow down execution speed.
>
> Andreas

That's all true, but with one exception: you will see that "call"
threading is the least efficient of all threading variants, simply
because each call involves some kind of C frame initialisation and
teardown. So whatever other threading variant is chosen, chances are
high for somewhat better performance, regardless of other optimisation
possibilities. As frame setup is inherent to the language, I don't
know any ANSI-compatible way of bypassing it. If someone knows a way
to implement subroutine threading in an ANSI C compatible way, I would
be glad to know about it.

By the way: newer versions of clang compile indirect-threaded code
with quite good results (tested on ARM, MIPS64, PowerPC G3, x86-32).

Mat.

Anton Ertl

unread,
May 12, 2011, 6:29:29 AM5/12/11
to
"Rod Pemberton" <do_no...@noavailemail.cmm> writes:
>
>"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
>news:2011May1...@mips.complang.tuwien.ac.at...
>> As a result, C has two sets of functions
>> for dealing with blocks of characters, the str... set for
>> zero-terminated blocks and the mem... set for blocks given by start
>> and length (similar to modern Forth). If that redundancy does not
>> indicate a design mistake, what does?
>>
>
>That's a deceptive explanation. The C mem... functions are not just for
>characters. They are for non-character data also. C code handles all sorts
>of raw data. So, what redundancy? If anything, one should claim that
>str... functions are a design mistake since they *do not* handle
>non-character data...

Correct. It's the zero-terminated strings and their str... functions
that are the design mistake.

Andrew Haley

unread,
May 12, 2011, 8:33:44 AM5/12/11
to
Mat <dam...@web.de> wrote:
> On 11 Mai, 14:59, Andreas <a....@nospam.org> wrote:
>> Mat:
>>
>> > Nice example but I wonder why the label-as-value feature of gcc, icc,
>> > clang (llvm), pcc etc. isn't used because this would simplify the
>> > source code and offering far better performance ?!? Platform
>> > independence can't be the reason because there exist a wide range of
>> > supported platform for both gcc and clang. For an ANSI C version
>> > replicated-switch threading can be used instead with near the same
>> > advantages if one concentrate on a minimal set of primitives. Just
>> > wondering.
>>
>> There is one rule for this, that there is no rule! ;-)
>>
>> IOW VM efficiency is so CPU dependent, that one either has to decide and
>> optimize individually for every target CPU (like Intel C compilers do)
>> or you just neglect it. Otherwise CPU caching and branch misprediction
>> can eliminate all "clever" jump optimizations away. In many cases even
>> simple subroutine threading is the way to go - the "superfluous and
>> cycle-wasting" RET just doesn't "show" to slow down execution speed.
>
> That's all true but with one exception: You will see that "Call"
> threading is the most ineffective of all threading variants simply
> because each call involving some kind of c-frame initialisation and
> deconstruction so what other threading variant is chosen?

This isn't necessarily true. There's nothing in the C language spec
that requires a frame to be constructed, and modern C ABIs tend not to do
it. Consider this function

int add (int a, int b)
{
    return a+b;
}

for which gcc on x86-64 generates

add:
leal (%rsi,%rdi), %eax
ret

Andrew.

Mark Wills

unread,
May 12, 2011, 9:26:11 AM5/12/11
to
On May 12, 12:01 am, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:
> "BruceMcF" <agil...@netscape.net> wrote in message

In Forth, strings are constants. Their size doesn't change. Mostly, I
suppose, because Forth doesn't have a heap, so it's too darned
difficult.

Mat

unread,
May 12, 2011, 9:54:41 AM5/12/11
to
On 12 Mai, 14:33, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote:

That's not surprising, because the compiler can assume a direct call
position. What I mean are the methods to call C functions indirectly
through a pointer (which would be the common situation for an interpreter).
Compile this with gcc for example:

typedef int (*inst)(int a, int b);

int add (int a, int b)
{
    return a+b;
}

int main (void)
{
    inst test = &add;
    int res = (test) (2, 3);
}

Code:
.file "test.c"
.text
.globl add
.type add, @function
add:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -8(%rbp), %eax
movl -4(%rbp), %edx
leal (%rdx,%rax), %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size add, .-add
.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
movq $add, -16(%rbp)
movq -16(%rbp), %rax
movl $3, %esi
movl $2, %edi
call *%rax
movl %eax, -4(%rbp)
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits

You can optimise the call frame out in this example with some basic
flag settings, but that doesn't apply to arrays of function pointers,
because the compiler can't assume static call positions for threaded
code this way [pointer arithmetic is valid in C, and typically the
interpreter will dispatch like this: (*vmPC++) (args..)]. So
implementing some kind of VM memory like:

inst vmMem[cMemSize];

will ruin all hope for optimisation in most cases (there are exceptions
and it's possible this is handled in future versions of gcc, who
knows).
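
For anyone following along, a minimal sketch of that kind of call-threaded inner loop in portable C. Everything here (`struct vm`, the `op_...` primitives, `vm_run`) is invented for illustration and is not taken from MiniForth; there is no stack bounds checking.

```c
#include <stddef.h>

struct vm;
typedef void (*inst)(struct vm *);   /* one primitive per cell of threaded code */

struct vm {
    const inst *pc;      /* next cell to execute   */
    int  stack[16];      /* data stack             */
    int  sp;             /* number of items in use */
    int  running;
};

void op_two(struct vm *vm)   { vm->stack[vm->sp++] = 2; }
void op_three(struct vm *vm) { vm->stack[vm->sp++] = 3; }
void op_add(struct vm *vm)   { vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; }
void op_halt(struct vm *vm)  { vm->running = 0; }

/* Inner interpreter: every step is an indirect call through the code array. */
int vm_run(struct vm *vm, const inst *code)
{
    vm->pc = code;
    vm->sp = 0;
    vm->running = 1;
    while (vm->running)
        (*vm->pc++)(vm);
    return vm->sp > 0 ? vm->stack[vm->sp - 1] : 0;
}
```

Running the program `{ op_two, op_three, op_add, op_halt }` leaves 5 on top of the stack. Each primitive is reached only through the pointer fetched at run time, which is the situation described above where the compiler cannot see a static call target.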

-Mat.

ken...@cix.compulink.co.uk

unread,
May 12, 2011, 11:32:43 AM5/12/11
to
In article <iqf46e$l1c$1...@speranza.aioe.org>,
do_no...@noavailemail.cmm (Rod Pemberton) wrote:

> Some code, somewhere, must recount, at
> run-time, and restore, at run-time, the length of the string.

8-bit BASICs used variable-length counted strings. The count was stored
in the first byte of the string. String operations recalculated the
length of the resulting string and, if required, allocated new space.
With Microsoft BASICs the actual string variable was a pointer to the
string in a reserved string space. This did mean that garbage collection
was required at intervals. Now if this can be done with a 16K
interpreter, I see no reason why Forth cannot use counted strings.

Counted strings were used because the alternative was to compare every
byte in all operations instead of just string comparisons.
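
A sketch in C of that byte-counted layout (function names are hypothetical, and this is not how any particular BASIC was coded): byte 0 holds the count, so appending recomputes the length by simple addition rather than by scanning.

```c
#include <stddef.h>
#include <string.h>

/* Byte-counted string: s[0] is the count, s[1..count] the text (max 255). */

/* Build a counted string from n bytes (n is clipped to 255). */
void bstr_set(unsigned char *dst, const char *src, size_t n)
{
    if (n > 255)
        n = 255;
    dst[0] = (unsigned char)n;
    memcpy(dst + 1, src, n);
}

/* Append b to a, recomputing the count; returns -1 if the result exceeds 255. */
int bstr_append(unsigned char *a, const unsigned char *b)
{
    size_t total = (size_t)a[0] + b[0];
    if (total > 255)
        return -1;
    memcpy(a + 1 + a[0], b + 1, b[0]);
    a[0] = (unsigned char)total;   /* new length is simple arithmetic */
    return 0;
}
```

Both buffers are assumed to be 256 bytes, the most a one-byte count can describe.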

Ken Young

Tarkin

unread,
May 12, 2011, 5:21:22 PM5/12/11
to
On May 12, 5:30 am, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote:
> Rod Pemberton <do_not_h...@noavailemail.cmm> wrote:
> > "Andrew Haley" <andre...@littlepinkcloud.invalid> wrote in message
> >news:hbadndhSJfBNI1fQ...@supernews.com...
> >> Rod Pemberton <do_not_h...@noavailemail.cmm> wrote:
> >> > "Elizabeth D Rather" <erat...@forth.com> wrote in message

It shouldn't be, on x86-32:
repne scasb

<snip>
> Andrew.

Alex McDonald

unread,
May 12, 2011, 5:46:46 PM5/12/11
to

That's not copying the data.

>
> <snip>
>
>
>
>
>
>
>
> > Andrew.

Chris Hinsley

unread,
May 12, 2011, 6:36:11 PM5/12/11
to
On 2011-05-10 05:58:00 +0100, Sp...@ControlQ.com said:

> A one file Forth, written in C, for the community (BSD licensed)...
>
> http://www.ControlQ.com/blog/wordpress/?p=352
>
>
> Cheers,
> Rob Sciuk

Oh dear, you've awoken the beards. Good luck with that. :) <--- Forth
powers that be, note the smiley

Chris

Rod Pemberton

unread,
May 12, 2011, 7:20:16 PM5/12/11
to
"Mat" <dam...@web.de> wrote in message
news:697bfffc-ec58-4ef2...@u26g2000vby.googlegroups.com...

>
> that's not surprising because the compiler can assume a direct call
> position. What I mean are the methods to call C functions indirectly
> through a pointer (which would be the common situation for an interpreter).
> Compile this with gcc for example:
>
> typedef int (*inst)(int a, int b);
>

You probably want:

typedef void (*inst)(void);

Parameters need a stackframe (on the C implementations that use a stack,
i.e., virtually all...). With GCC, IIRC, "void (*)(void)" is about the only
way to almost eliminate the stackframe. Other compilers, e.g., OpenWatcom,
have directives for generating "naked" functions, i.e., no stackframe. The
last time I checked I couldn't find anything similar for GCC. However,
apparently with GCC, you can just throw in GAS assembly and C code, wherever
needed... E.g., assembly label in middle of C code, etc. tricks. Don't
quote me.


Rod Pemberton


Rod Pemberton

unread,
May 12, 2011, 7:21:54 PM5/12/11
to

"Paul Rubin" <no.e...@nospam.invalid> wrote in message
news:7x62pgv...@ruckus.brouhaha.com...

> "Rod Pemberton" <do_no...@noavailemail.cmm> writes:
> > So? Wouldn't the NUL-termination be the same size or larger than the
> > character size, i.e., 16-bits all zero'd? I.e., the UTF-16 terminator
> > should be *two* NUL bytes, shouldn't it?
>

The reason I brought that up was, IIRC, unicode chars map to ASCII for the
lower values. However, I don't know whether UTF-16 extends those ASCII
values, including NUL, from 8-bits to 16-bits. I vaguely recall some sort
of RLL-like compression... It could very well be that 8-bits zeroed is NUL
in UTF-16 too. Someone else will have to expound on this.

> But now you have to care about whether some given byte string in memory
> semantically represents a UTF-16 encoding or something else.
>
> The basic problem here is that NUL is a perfectly good character in both
> ascii and unicode,

Perfectly good character for what? It's not a text character. It's not a
graphics character. It's not a formatting character. It's not a host
specific graphics character. It's nothing of use in ASCII, EBCDIC, or even
PETSCII.

> C strings are not a sequence c1,c2,c3... where c1,c2... can be any
> character.

Yes, they are. Of course, a NUL character could be confused by code for C's
"null byte" terminator, e.g., if byte size and character size are the same.

> They are restricted against containing a certain character
> that doesn't show up in most text messages,
>

I doubt a C specification pedant would agree here. They'd probably say
something similar to what I said above. I.e., a C system could support
both, but isn't required to do so, e.g., if byte size and character size are
different.


Rod Pemberton


Rod Pemberton

unread,
May 12, 2011, 7:23:46 PM5/12/11
to

"Andrew Haley" <andr...@littlepinkcloud.invalid> wrote in message
news:J4SdnR5U5tdMNVbQ...@supernews.com...

> Rod Pemberton <do_no...@noavailemail.cmm> wrote:
> > "Andrew Haley" <andr...@littlepinkcloud.invalid> wrote in message
> > news:hbadndhSJfBNI1fQ...@supernews.com...
> >> Rod Pemberton <do_no...@noavailemail.cmm> wrote:
...

> > In assembly, copying is going to require a loop, even if you know
> > the length in advance. Why not loop until a terminator is seen?
>
> Like I said, looking for a zero in a cell is hard. Obviously, you
> don't want to copy a string one byte at a time.
>

Why are you looking for zero? E.g., x86 assembly has very fast,
specialized instructions that do that exact task.

> >> > Something must keep updating the counter if the string is of
> >> > variable size, which is no better. Counted strings "saving" time is
> >> > only valid for fixed-length strings, and that assumes that one needs
> >> > the length in the first place...
> >>
> >> I'm trying and failing to think of a string that doesn't have a
> >> length.
> >
> > Why?
>
> Please give me an example of a string that has no length.
>

No length strings exist in many languages, including C. Strings in most
other languages are not count based. Forth and PL/1 are the only two I'm
aware of that do use counts. So, I'm still not sure why you are stuck on
the no length string thing... But, OK, in a counted string environment, if
the count=0, then the string has no length.

In C this string: "\0" has no length. None. Zip. Zero. Nada. \0
represents the "null byte" terminator. It's not a character in C. It's an
addressable unit as large as or larger than a C character with all bits
zeroed. Every sequence of characters must have one to be a valid C string.
But, it's not part of the string, i.e., allowing a zero length string. The
compiler places them on certain strings implicitly. Programmers do so
explicitly in other situations.

> The real kicker here is the need to maintain two interfaces to every
> interface that handles data: one that reads/writes terminated strings,
> and one that reads/writes binary data.

That's not true. C's strxxx() are sometimes written in terms of C's
memxxx() functions. It's just simpler not to have to write the code using
memxxx() and strlen(). The strxxx() functions were implemented for you
already as part of the default library, whether they use memxxx() and
strlen() or not.

> The C library is plagued with
> this: fputs() and fwrite(),

fputs() is frequently just a macro to printf().

> memcpy() and strcpy() and strncpy(), etc,
>

strcpy() is for strings, i.e., characters followed by a "null byte".
memcpy() is for any data. Sometimes strcpy() is implemented using memcpy()
and strlen(). strncpy() allows you to copy substrings. These all provide
different or slight variations of the same thing. I don't see why you claim
the multiple interfaces are a "plague" or a "need" just because all of them
ended up as part of the default library. They are redundant though.


Rod Pemberton


Rod Pemberton

unread,
May 12, 2011, 7:25:43 PM5/12/11
to
"Mat" <dam...@web.de> wrote in message
news:6d88c763-6b50-4d65...@hd10g2000vbb.googlegroups.com...

>
> You will see that "Call"
> threading is the most ineffective of all threading variants simply
> because each call involving some kind of c-frame initialisation and
> deconstruction so what other threading variant is chosen?
>

...?

In assembly, you don't need to use stackframes. Forth doesn't use
stackframes either. C, and other HLLs, use stacks, and therefore
stackframes, to pass parameters. IIRC, they found that recursion required a
stack and memory usage was more efficient with a stack, which is why most C
implementations use a stack. So, most likely, STC code would just be a
"call" and a "ret" (or "jsr" "rts") or whatever is used in the assembly
language of your host platform.


Rod Pemberton


Rod Pemberton

unread,
May 12, 2011, 7:27:29 PM5/12/11
to
"Stephen Pelc" <steph...@mpeforth.com> wrote in message
news:4dcbaad9....@192.168.0.50...

> On Wed, 11 May 2011 14:05:03 -0400, "Rod Pemberton"
> <do_no...@noavailemail.cmm> wrote:
>
> >In assembly, copying is going to require a loop, even if you know the length
> >in advance. Why not loop until a terminator is seen?
>
> Because, on many CPUs, a tuned copy routine that knows the length of
> the string is 4 to 5 times faster than the dumb byte by byte copy.
> We measured it on ARM32 and Cortex-M.
>

Does ARM have specialized instructions for this like x86 does? It has quite
a large instruction list.

Does ARM have conditional branches, branch on zero? etc.?

If not, I'd say it's an instruction set deficiency issue...

What about word-size, or dword-size copies?

Even in copying code poorly written for x86, e.g., C code, no one uses
byte-by-byte copies.


Rod Pemberton


Tarkin

unread,
May 12, 2011, 7:28:41 PM5/12/11
to

Sorry, I didn't trim enough, and was contextually unclear;
'Like I said, looking for a zero in a cell is hard.'

No, it's not. (zeroing EAX omitted)
repne scasb

Neither is copying, on x86-32:
rep movsb/w/d

The origin and destination cells would,
hypothetically, already be 'at hand'.

These sorts of things are exactly what x86-32
and/or CISC are designed to do, at very low level.

TTFN,
Tarkin

Rod Pemberton

unread,
May 12, 2011, 7:34:52 PM5/12/11
to
"BruceMcF" <agi...@netscape.net> wrote in message
news:d0321a1a-b815-435c...@c26g2000vbq.googlegroups.com...

On May 11, 7:01 pm, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:
> > But, the string is allowed to be changed, i.e., variable not fixed.
> > So, now you change the string, at run-time, so it says "Goodbye"
> > instead of "Hello"
>
> Yeah? And if that's done at runtime, you know the count of
> the new string.

Only if you, your code, or the Forth environment code counted it...

> The same value it was when you were originally handed the
> new value of the string.

You're saying it's been counted... Who or what counted it?


I.e., your "mental model" of strings implicitly requires that all strings
are counted. Does a string exist without a count in your model? It does in
many languages, including C.


RP


Rod Pemberton

unread,
May 12, 2011, 7:45:57 PM5/12/11
to
"Stephen Pelc" <steph...@mpeforth.com> wrote in message
news:4dcae339....@192.168.0.50...

> On Wed, 11 May 2011 14:05:03 -0400, "Rod Pemberton"
> <do_no...@noavailemail.cmm> wrote:
>
> >I'm trying and failing to think of a string of which I'd need to know the
> >length. Too much C?
>
> Function names in shared libraries, especially mangled C++ names, have
> been observed in the wild at over 2000 characters. The vast majority
> of functions differ only in the last few letters. Hence, ELF shared
> objects (as used by most Unices) are very slow to find. When there
> are hundreds of thousands of functions, e.g. Open/LibreOffice, initial
> start up is dog slow.
>

Why aren't they using a hash function with low-collisions on those strings?

Perhaps,

Austin Appleby's MurmurHash2
Bob Jenkins' hashlittle()
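
As a rough sketch of the idea in C: the hash below is 32-bit FNV-1a, picked only because it fits in a few lines, and is neither of the two functions named above. It takes an address and a length, so it reads exactly `len` bytes and needs no terminator.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a counted byte string. */
uint32_t fnv1a32(const void *addr, size_t len)
{
    const unsigned char *p = addr;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}
```

Comparing two stored hashes first means the long byte-by-byte compare is only needed when the hashes match.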


Rod Pemberton


Rod Pemberton

unread,
May 12, 2011, 7:46:43 PM5/12/11
to

"Paul Rubin" <no.e...@nospam.invalid> wrote in message
news:7xaaesv...@ruckus.brouhaha.com...

> "Rod Pemberton" <do_no...@noavailemail.cmm> writes:
> > 5 "Hello"
> > But, the string is allowed to be changed, i.e., variable not fixed. So, now
> > you change the string, at run-time, so it says "Goodbye" instead of "Hello":
> > X "Goodbye"
> >
> > What value is X? Is X still 5 because there is no run-time code to count
> > the length? Or, is there a function that resets X to 7?
>
> "Hello" is a 5-char string and "Goodbye" is a 7-char string. You may be
> confusing the string length with the amount of space allocated for the
> string

No.

> (which might exceed the length, especially in situations where
> the string can grow).

True.

> Obviously in an application where the two sizes
> can be unequal, you have to keep track of both.
>

Or, you can just use terminated strings instead of counted strings. Much
easier. More effective.


Rod Pemberton


Rod Pemberton

unread,
May 12, 2011, 7:47:19 PM5/12/11
to
"BruceMcF" <agi...@netscape.net> wrote in message
news:02ef6672-d50d-4a65...@h36g2000pro.googlegroups.com...

Who says it was moved to that location instead of being directly written
there from the input device? How did you know the length of what was
written without something computing it for you?


Rod Pemberton


Paul Rubin

unread,
May 12, 2011, 10:54:10 PM5/12/11
to
Tarkin <tark...@gmail.com> writes:
> 'Like I said, looking for a zero in a cell is hard.'
>
> No, it's not. (zeroing EAX omitted)
> repne scasb
>
> Neither is copying, on x86-32:
> rep movsb/w/d

But those instructions are pretty slow compared with copying a word at a
time, or in even wider units with the XMM or AVX instructions.

Andrew Haley

unread,
May 13, 2011, 3:26:58 AM5/13/11
to
Rod Pemberton <do_no...@noavailemail.cmm> wrote:
>
> "Andrew Haley" <andr...@littlepinkcloud.invalid> wrote in message
> news:J4SdnR5U5tdMNVbQ...@supernews.com...
>> Rod Pemberton <do_no...@noavailemail.cmm> wrote:
>> > "Andrew Haley" <andr...@littlepinkcloud.invalid> wrote in message
>> > news:hbadndhSJfBNI1fQ...@supernews.com...
>> >> Rod Pemberton <do_no...@noavailemail.cmm> wrote:
> ...
>
>> > In assembly, copying is going to require a loop, even if you know
>> > the length in advance. Why not loop until a terminator is seen?
>>
>> Like I said, looking for a zero in a cell is hard. Obviously, you
>> don't want to copy a string one byte at a time.
>
> Why are you looking for zero?

Because it's the terminator, and you want to know where it is.

> E.g., x86 assembly has very fast,
> specialized instructions that do that exact task.

You still want to copy a word at a time, or in bigger chunks if that
helps.

>> >> > Something must keep updating the counter if the string is of
>> >> > variable size, which is no better. Counted strings "saving" time is
>> >> > only valid for fixed-length strings, and that assumes that one needs
>> >> > the length in the first place...
>> >>
>> >> I'm trying and failing to think of a string that doesn't have a
>> >> length.
>> >
>> > Why?
>>
>> Please give me an example of a string that has no length.
>
> No length strings exist in many languages, including C. Strings in
> most other languages are not count based. Forth and PL/1 are the
> only two I'm aware of that do use counts. So, I'm still not sure
> why you are stuck on the no length string thing... But, OK, in a
> counted string environment, if the count=0, then the string has no
> length.

No it does not. It has a length, and that length is zero. Every
string has a length; the only issue is whether that length is implicit
or explicit.
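
In C terms, a sketch of the two cases (the struct and function names are made up for illustration): the implicit length is recovered by scanning, the explicit one is carried alongside the address.

```c
#include <stddef.h>
#include <string.h>

/* Explicit length: an (address, count) pair, as Forth passes strings. */
struct counted { const char *addr; size_t len; };

/* Implicit length: derived by scanning for the terminator. */
size_t implicit_length(const char *s)
{
    return strlen(s);
}

/* Build the explicit form from the implicit one. */
struct counted to_counted(const char *s)
{
    struct counted c = { s, strlen(s) };
    return c;
}
```

For the empty string both forms agree that the length is 0, even though the C literal `""` still occupies one byte for its terminator.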

>> The real kicker here is the need to maintain two interfaces to every
>> interface that handles data: one that reads/writes terminated strings,
>> and one that reads/writes binary data.
>
> That's not true. C's strxxx() are sometimes written in terms of C's
> memxxx() functions.

The way they're implemented makes no difference to that fact.

> It's just simpler not to have to write the code using memxxx() and
> strlen().

Exactly.

> The strxxx() functions were implemented for you already as part of
> the default library, whether they use memxxx() and strlen() or not.

Yes, they are, and the API grows substantially as a result. fputs()
first has to scan the string to find its length, etc., etc.



>> The C library is plagued with this: fputs() and fwrite(),
>
> fputs() is frequently just a macro to printf().

Or not.

>> memcpy() and strcpy() and strncpy(), etc,
>>
>
> strcpy() is for strings, i.e., characters followed by a "null byte".
> memcpy() is for any data.

Indeed.

> Sometimes strcpy() is implemented using memcpy() and strlen().

Thus traversing the string twice, as a boon to efficiency.

> strncpy() allows you to copy substrings.

... without tragic buffer overrun behaviour.
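
One caveat worth a sketch: strncpy() does not write a terminator when the source has n or more characters, so the usual pattern adds one by hand. The wrapper name `copy_bounded` is hypothetical.

```c
#include <string.h>

/* Copy at most cap-1 characters and always terminate; returns 1 if truncated. */
int copy_bounded(char *dst, size_t cap, const char *src)
{
    if (cap == 0)
        return 1;
    strncpy(dst, src, cap - 1);   /* may leave dst without a terminator ... */
    dst[cap - 1] = '\0';          /* ... so add one unconditionally */
    return strlen(src) >= cap;
}
```

With a 5-byte destination, copying "Goodbye" yields "Good" and reports truncation; copying "Hi" fits and reports none.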

> These all provide different or slight variations of the same thing.
> I don't see why you claim the multiple interfaces are a "plague" or
> a "need" just because all of them ended up as part of the default
> library. They are redundant though.

QED, I think.

Andrew.

Andrew Haley

unread,
May 13, 2011, 3:40:42 AM5/13/11
to
Mat <dam...@web.de> wrote:
> On 12 Mai, 14:33, Andrew Haley <andre...@littlepinkcloud.invalid>
> wrote:
>> Mat <damb...@web.de> wrote:
>> > On 11 Mai, 14:59, Andreas <a....@nospam.org> wrote:
>> >> Mat:
>>
>> >> > Nice example but I wonder why the label-as-value feature of gcc, icc,
>> >> > clang (llvm), pcc etc. isn't used because this would simplify the
>> >> > source code and offering far better performance !? Platform

>> >> > independence can't be the reason because there exist a wide range of
>> >> > supported platform for both gcc and clang. For an ANSI C version
>> >> > replicated-switch threading can be used instead with near the same
>> >> > advantages if one concentrate on a minimal set of primitives. Just
>> >> > wondering.
>>
>> >> There is one rule for this, that there is no rule! ;-)

>>
>> >> IOW VM efficiency is so CPU dependent, that one either has to decide and
>> >> optimize individually for every target CPU (like Intel C compilers do)
>> >> or you just neglect it. Otherwise CPU caching and branch misprediction
>> >> can eliminate all "clever" jump optimizations away. In many cases even
>> >> simple subroutine threading is the way to go - the "superfluous and
>> >> cycle-wasting" RET just doesn't "show" to slow down execution speed.
>>
>> > That's all true but with one exception: You will see that "Call"
>> > threading is the most ineffective of all threading variants simply
>> > because each call involving some kind of c-frame initialisation and
>> > deconstruction so what other threading variant is chosen?
>>
>> This isn't necessarily true. There's nothing in the C language spec
>> that requires a frame to be constructed, and modern C ABIs tend not to
>> do it. Consider this function
>>
>> int add (int a, int b)
>> {
>>     return a+b;
>> }
>>
>> for which gcc on x86-64 generates
>>
>> add:
>> leal (%rsi,%rdi), %eax
>> ret
>

Well, yes, you can. I had to change the code to prevent it from being
optimized away completely, but for the indirect jump through a
function pointer case I get:

typedef int (*inst)(int a, int b);   /* typedef from the earlier post */

extern int add (int a, int b);

inst foo = add;

int main (void)
{
    inst test = foo;
    int res = (test) (2, 3);
    return res;
}

main:
        movl    $3, %esi
        movl    $2, %edi
        movq    foo(%rip), %rax
        jmp     *%rax

There's no call frame there. There's nothing about function pointers
that requires a stack frame.

> but that don't apply to arrays of function pointers because the
> compiler can't assume static call positions for threaded code this
> way [pointer arithmetic is valid in C and typical the interpreter
> will dispatching like this: (*vmPC++) (args..)]. So implementing
> some kind of vm-memory like:
>
> inst vmMem[cMemSize];
>
> will ruin all hope for optimisation in most cases (there are exceptions
> and it's possible this is handled in future versions of gcc, who
> knows).

I still don't see why you think a call frame will always be needed. I
don't believe it: sometimes it will, sometimes not. You're making a
sweeping generalization.
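
For reference, a minimal sketch of the kind of call-threaded inner interpreter under discussion. The primitives, the stack layout and the halt flag are all hypothetical and are not taken from MiniForth.c. The program is an array of function pointers, and dispatch is the `(*vmPC++)()` idiom quoted above.

```c
#include <assert.h>

typedef void (*inst)(void);     /* one VM primitive, no arguments */

static int stack[16];           /* data stack (hypothetical layout) */
static int sp;                  /* number of items on the stack */
static int running;             /* cleared by op_halt */

void op_two(void)   { stack[sp++] = 2; }
void op_three(void) { stack[sp++] = 3; }
void op_add(void)   { sp--; stack[sp - 1] += stack[sp]; }
void op_halt(void)  { running = 0; }

/* Call threading: the inner interpreter is a single indirect call
   in a loop. Returns the top of the data stack when halted. */
int run(const inst *vmPC)
{
    sp = 0;
    running = 1;
    while (running)
        (*vmPC++)();
    assert(sp > 0);
    return stack[sp - 1];
}

/* The Forth phrase  2 3 +  as threaded code. */
const inst program[] = { op_two, op_three, op_add, op_halt };
```

Whether each primitive gets a frame depends on the compiler and flags, which is the point being argued; the sketch only shows the dispatch shape.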

Andrew.

Mat

May 13, 2011, 3:44:53 AM
On 13 Mai, 01:20, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:
> "Mat" <damb...@web.de> wrote in message

that was also originally my assumption but sadly code like this:

typedef void (*inst) ();

int r, a, b;

void add (void)
{
    r = a+b;
}

int main (void)
{
    a = 2; b = 3;
    inst test = &add;
    (test) ();
}

compile to this:

.file "test.c"
.comm r,4,4
.comm a,4,4
.comm b,4,4


.text
.globl add
.type add, @function
add:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6

movl a(%rip), %edx
movl b(%rip), %eax
leal (%rdx,%rax), %eax
movl %eax, r(%rip)


leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size add, .-add
.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp

movl $2, a(%rip)
movl $3, b(%rip)
movq $add, -8(%rbp)
movq -8(%rbp), %rdx
movl $0, %eax
call *%rdx


leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits

so no real difference here !

Again, on closer inspection this is actually not surprising, because
the problem isn't an optimisation that eliminates stack frames. As
written before, representing the code as an array of addresses makes it
impossible, within the SSA representation, to detect static call
positions, because the compiler simply can't exclude the possibility of
address calculations in all cases (which is an inherent language
feature). It may be that, depending on other code optimisations
(e.g. -O3), stack-frame generation can be omitted, but in my experience
in most cases it isn't. That is also the reason why flags like
-fomit-frame-pointer seem to be ignored, sadly.

Mat.

Andrew Haley

May 13, 2011, 3:57:33 AM
Mat <dam...@web.de> wrote:
> On 13 Mai, 01:20, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
> wrote:
>> "Mat" <damb...@web.de> wrote in message
>>
>> news:697bfffc-ec58-4ef2...@u26g2000vby.googlegroups.com...
>>
>>
>>
>> > that's not surprising because the compiler can assume a direct call
>> > position. What I mean are the methods to call C functions indirect
>> > though a pointer (which would be common situation for an interpreter).
>> > Compile this with gcc for example:
>>
>> > typedef int (*inst)(int a, int b);
>>
>> You probably want:
>>
>> typedef void (*inst)(void);

>>
>> Parameters need a stackframe (on the C implementations that use a stack,
>> i.e., virtually all...). With GCC, IIRC, "void (*)(void)" is about the only
>> way to almost eliminate the stackframe. Other compilers, e.g., OpenWatcom,
>> have directives for generating "naked" functions, i.e., no stackframe. The
>> last time I checked I couldn't find anything similar for GCC. However,
>> apparently with GCC, you can just throw in GAS assembly and C code, wherever
>> needed... E.g., assembly label in middle of C code, etc. tricks. Don't
>> quote me.
>
> that was also originally my assumption but sadly code like this:
>
> typedef void (*inst) ();
>
> int r, a, b;
>
> void add (void)
> {
> r = a+b;
> }
>
> int main (void)
> {
> a = 2; b = 3;
> inst test = &add;
> (test) ();
> }
>
> compile to this:

add:
movl b(%rip), %eax
addl a(%rip), %eax
movl %eax, r(%rip)
ret

main:


movl $2, a(%rip)
movl $3, b(%rip)

call add
rep
ret


You really do have to use the optimizer.

Andrew.

Mat

May 13, 2011, 4:39:48 AM
On 13 Mai, 09:57, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote:

Well, for this example it works of course. The source code doesn't
define an array of function pointers !!!!

typedef void (*inst) ();

int r, a, b;

void add (void)
{
    r = a+b;
}

inst *vmPC;
inst vmMem = {add};

int main (void)
{
    a = 2; b = 3;
    vmPC = &vmMem;
    (*vmPC) ();
}

compiled with: gcc -O3 -fomit-frame-pointer :

.file "test.c"
.text
.p2align 4,,15


.globl add
.type add, @function
add:
.LFB0:
.cfi_startproc

movl b(%rip), %eax
addl a(%rip), %eax
movl %eax, r(%rip)
ret

.cfi_endproc
.LFE0:
.size add, .-add

.p2align 4,,15


.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc

subq $8, %rsp
.cfi_def_cfa_offset 16
xorl %eax, %eax


movl $2, a(%rip)
movl $3, b(%rip)

movq $vmMem, vmPC(%rip)
call *vmMem(%rip)
addq $8, %rsp
.cfi_def_cfa_offset 8


ret
.cfi_endproc
.LFE1:
.size main, .-main

.globl vmMem
.data
.align 8
.type vmMem, @object
.size vmMem, 8
vmMem:
.quad add
.comm r,4,16
.comm a,4,16
.comm b,4,16
.comm vmPC,8,8


.ident "GCC: (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2"
.section .note.GNU-stack,"",@progbits

Mat.

Andrew Haley

May 13, 2011, 5:52:39 AM
Mat <dam...@web.de> wrote:
> On 13 Mai, 09:57, Andrew Haley <andre...@littlepinkcloud.invalid>
> wrote:
>>
>> You really do have to use the optimizer.
>
> Well, for this example it works of course. The source code doesn't
> define an array of function pointers !!!!
>
> typedef void (*inst) ();
>
> int r, a, b;
>
> void add (void)
> {
> r = a+b;
> }
>
> inst *vmPC;
> inst vmMem = {add};
>
> int main (void)
> {
> a = 2; b = 3;
> vmPC = &vmMem;
> (*vmPC) ();
> }
>
> compiled with: gcc -O3 -fomit-frame-pointer :

I can't see any stack frame being created. I removed the noise so
that everyone can see the instructions:

add:
movl b(%rip), %eax
addl a(%rip), %eax
movl %eax, r(%rip)
ret

main:
subq $8, %rsp


xorl %eax, %eax
movl $2, a(%rip)
movl $3, b(%rip)
movq $vmMem, vmPC(%rip)
call *vmMem(%rip)
addq $8, %rsp

ret

The only thing here that could possibly be described as frame-related
is the add/subtract of 8 to keep the stack pointer 16-aligned.

Sometimes you really do need a stack frame, such as when you have
variable-sized local arrays, setjmp(), and sometimes because it's the
fastest way to do it. But the claim that C generally needs stack
frames just isn't true.

Andrew.

Rod Pemberton

May 13, 2011, 5:59:02 AM
"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
news:2011May1...@mips.complang.tuwien.ac.at...
> "Rod Pemberton" <do_no...@noavailemail.cmm> writes:
> >
> >"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
> >news:2011May1...@mips.complang.tuwien.ac.at...
> >> As a result, C has two sets of functions
> >> for dealing with blocks of characters, the str... set for
> >> zero-terminated blocks and the mem... set for blocks given by start
> >> and length (similar to modern Forth). If that redundancy does not
> >> indicate a design mistake, what does?
> >>
> >
> >That's a deceptive explanation. The C mem... functions are not just for
> >characters. They are for non-character data also. C code handles all
> >sorts of raw data. So, what redundancy? If anything, one should claim that
> >str... functions are a design mistake since they *do not* handle
> >non-character data...
>
> Correct. It's the zero-terminated strings and their str...
> functions that are the design mistake.
>

Are you also implying fig-Forth and F83 had design mistakes because they had
terminated strings?

Yes, fig-Forth and F83 both had *counted* strings, but both had *terminated*
strings too. The MSB was set on the last character of the Forth name. This
was called ASCIN for ASCII-Negated. It was commonly used in 8-bit assembly
programming to mark the end of the string. It's no different functionally
than NUL-terminated. It does save a byte. You seem to forget the history
of Forth...
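
As a rough illustration of that convention, here is a minimal sketch of reading a name whose last character has bit 7 set. The function names are hypothetical, and this is not code from fig-Forth or F83.

```c
#include <assert.h>
#include <stddef.h>

/* Length of a name whose last character is marked by bit 7.
   Like a NUL-terminated string, the length is found by scanning. */
size_t hibit_len(const unsigned char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while ((s[n] & 0x80) == 0)
        n++;
    return n + 1;                /* count the flagged character too */
}

/* Copy to a NUL-terminated buffer, clearing the flag bit.
   dst must have room for hibit_len(s) + 1 bytes. */
void hibit_to_cstr(char *dst, const unsigned char *s)
{
    size_t n = hibit_len(s);
    for (size_t i = 0; i < n; i++)
        dst[i] = (char)(s[i] & 0x7F);
    dst[n] = '\0';
}
```

Unlike NUL termination, this scheme cannot represent an empty name, and it limits characters to 7 bits.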


Rod Pemberton


Mat

May 13, 2011, 6:09:18 AM
On 13 Mai, 11:52, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote:

If you look at main again you see the sequence:

subq $8, %rsp
xorl %eax, %eax

...
addq $8, %rsp

isn't needed at all.

> Sometimes you really do need a stack frame, such as when you have
> variable-sized local arrays, setjmp(), and sometimes because it's the
> fastest way to do it.  But the claim that C generally needs stack
> frames just isn't true.

Have I written that? If so, you're right.

-Mat.

Andrew Haley

May 13, 2011, 6:15:00 AM

Well, it is actually. The top bit of the last character was set
because the name might have been truncated (via WIDTH) to save space.
The names had two lengths, one that was the real length of the name
and one the stored length.

Is anyone still doing this?

Andrew.

Albert van der Horst

May 13, 2011, 7:42:10 AM
In article <iqeiqm$7ha$1...@speranza.aioe.org>,

Rod Pemberton <do_no...@noavailemail.cmm> wrote:
>"Andrew Haley" <andr...@littlepinkcloud.invalid> wrote in message
>news:hbadndhSJfBNI1fQ...@supernews.com...
>> Rod Pemberton <do_no...@noavailemail.cmm> wrote:
<SNIP>

>
>> Even when you're copying strings, it's a lot
>> easier if you know how much you have to copy.
>>
>
>How often does one really need to copy strings?

Not very often, unless one is stupid enough to demand that
all strings are prepended with a count ( as in WORD ) or
ended with a zero ( as in c ) or are copied to user-supplied buffers
( ACCEPT ).

In my Forth strings are copied from the TIB to the memory whenever
a definition is in order. And that is it. That is essentially
*all* the string copying that is going on within my Forth.
So copying "AAP" directly from ": AAP WE GAAN NAAR ROME ;" without an
intermediate step where there is AAP<NULL>.
(What about ACCEPT ? It is demanded by the standard, but could be a
loadable extension in ciforth, nothing essential.)

The same applies to my Pentium assembler. Now count the strxxx()
calls in your favorite assembler written in c. Better yet, count
them dynamically.

Unix - c is a brilliant design. Few things were wrong, such as the
name creat (instead of create). The null-ending of strings is a
compromise reasonable at the time, but now past its date and it has
gone bad.

Greater minds than you have designed languages like C++, C#, Java.
They all contain strings as objects, not as a convention tacked
on a character pointer.

>
>In assembly, copying is going to require a loop, even if you know the length
>in advance. Why not loop until a terminator is seen?

Even in my 6809 Forth, copying is done 4 bytes at a time.
On modern machines maybe 128 ?

>
>I'm trying and failing to think of a string of which I'd need to know the
>length. Too much C?

Definitely. Surely. You are what you eat. You think what you read.

>RP

Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Andrew Haley

May 13, 2011, 6:56:25 AM

The "xorl %eax, %eax" is there because your prototype for "inst" is
incorrect.

Correct it:

typedef void (*inst) (void);

and that instruction goes away.

The add/sub in main is keeping the stack 16-aligned. It is nothing to
do with "some kind of c-frame initialisation and deconstruction". And
it is not done at all when your "add" routine is called, which will
happen in the Forth interpreter.

>> Sometimes you really do need a stack frame, such as when you have
>> variable-sized local arrays, setjmp(), and sometimes because it's the
>> fastest way to do it. But the claim that C generally needs stack
>> frames just isn't true.
>
> have I written that? If so youre right.

You wrote

"You will see that "Call" threading is the most ineffective of all
threading variants simply because each call involving some kind of
c-frame initialisation and deconstruction"

That isn't happening with your example, which just fetches a pointer
and jumps to a routine that doesn't do any "c-frame initialisation and
deconstruction".

Andrew.

Mat

May 13, 2011, 7:39:42 AM
On 13 Mai, 12:56, Andrew Haley <andre...@littlepinkcloud.invalid>

right.

> >> Sometimes you really do need a stack frame, such as when you have
> >> variable-sized local arrays, setjmp(), and sometimes because it's the
> >> fastest way to do it.  But the claim that C generally needs stack
> >> frames just isn't true.
>
> > have I written that?   If so youre right.
>
> You wrote
>
> "You will see that "Call" threading is the most ineffective of all
> threading variants simply because each call involving some kind of
> c-frame initialisation and deconstruction"

Ok, that's wrong for the example without parameters. I'm sorry, but I
seem to have lost focus in the change of examples.

> That isn't happening with your example, which just fetches a pointer
> and jumps to a routine that doesn't do any "c-frame initialisation and
> deconstruction".

You are right.

-Mat.

Anton Ertl

May 13, 2011, 8:50:32 AM
"Rod Pemberton" <do_no...@noavailemail.cmm> writes:
>Are you also implying fig-Forth and F83 had design mistakes because they had
>terminated strings?
>
>Yes, fig-Forth and F83 both had *counted* strings, but both had *terminated*
>strings too. The MSB was set on the last character of the Forth name.

Yes, that's a design mistake, and one that Gforth did not follow,
despite its goal of being similar to fig-Forth and Forth-83. As
Andrew Haley notes, this design mistake was a consequence of WIDTH,
which was a mistake in itself. If you want to save space (and time),
use len+3 chars, like the original Forth; no need for marking the end.
Or if you want full names, just do that, like Gforth; no need for
marking the end, either.
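
A minimal sketch of the len+3 idea as I read it: the name field keeps the full length but only the first three characters. The struct layout, the space padding for short names and the function names are all assumptions made for illustration, not the layout of any particular Forth.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name field: full length plus the first 3 characters. */
struct name3 {
    unsigned char len;
    char first[3];
};

/* Build a name field from an addr/len pair. Names shorter than
   3 characters are padded with spaces (an assumption). */
struct name3 make_name3(const char *addr, size_t len)
{
    assert(addr != NULL && len <= 255);
    struct name3 n = { (unsigned char)len, { ' ', ' ', ' ' } };
    memcpy(n.first, addr, len < 3 ? len : 3);
    return n;
}

/* Fixed-size compare: no scan for an end marker is needed. */
int name3_equal(struct name3 a, struct name3 b)
{
    return a.len == b.len && memcmp(a.first, b.first, 3) == 0;
}
```

The cost is collisions: under this scheme SWAP and SWAB compare equal, since they agree in length and in their first three characters.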

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2010: http://www.euroforth.org/ef10/

Albert van der Horst

May 13, 2011, 12:42:16 PM
In article <zv-dncjV68TweVHQ...@supernews.com>,

This is what Bill makes of it, it is not as good as what Richard does:

The optimizer is on by default as
/GB (optimize for blend, I take it that is best if the processor is
unknown.) I can see no obvious ways to beef up the optimization.

"
; Listing generated by Microsoft (R) Optimizing Compiler Version 13.00.9466

<SNIPPED PREAMBLE>

_TEXT SEGMENT
_add PROC NEAR
; File c:\progra~1\micros~1.net\framew~1\bin\kp.c
; Line 8
push ebp
mov ebp, esp
; Line 9
mov eax, DWORD PTR _a
add eax, DWORD PTR _b
mov DWORD PTR _r, eax
; Line 10
pop ebp
ret 0
_add ENDP
_TEXT ENDS
PUBLIC _main
; Function compile flags: /Ods
_TEXT SEGMENT
_test$ = -4
_main PROC NEAR
; Line 12
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
; Line 18
mov DWORD PTR _a, 2
; Line 19
mov DWORD PTR _b, 3
; Line 20
mov DWORD PTR _test$[ebp], OFFSET FLAT:_add
; Line 21
call DWORD PTR _test$[ebp]
; Line 22
mov eax, DWORD PTR _r
; Line 23
leave
ret 0
_main ENDP
_TEXT ENDS
END
"

Disclaimer: This is not the latest and greatest Microsoft environment.

Correct me if I'm wrong, don't sue me.

>
>Andrew.

Andreas

May 13, 2011, 1:47:40 PM
Albert van der Horst:

How does it look if you

typedef void (*inst) (void);
... ?

Andreas

Tarkin

May 13, 2011, 2:21:55 PM
On May 12, 10:54 pm, Paul Rubin <no.em...@nospam.invalid> wrote:

> Tarkin <tarkin...@gmail.com> writes:
> > 'Like I said, looking for a zero in a cell is hard.'
>
> > No, it's not. (zeroing EAX omitted)
> > repne scasb
>
> > Neither is copying, on x86-32:
> > rep movsb/w/d
>
> But those instructions are pretty slow compared with copying a word at a
> time,

Where does that 'word' count come from?
It's a separate value, and may have to be masked from the string
itself.
So, you gain the benefit of 1/2 to 1/4 (less on some processors)
copy operations. But you have to factor in the loading, and perhaps
masking, as a one-time setup cost.

Whereas, with byte-copy-with-terminator, you're counting and copying
concurrently, with perhaps some math at the end to adjust the count.

> or in even wider units with the XMM or AVX instructions.

I thought we were talking strings, i.e. byte arrays of data
meant to be interpreted as human language?

But, while we are talking numbers, what's faster, a single multiply,
or multiple shifts? The answer: it depends....

In any event, handling either counted or terminated strings,
is, IMHO, context-dependent. To say one or the other is
inherently 'evil', is, to my mind, tantamount to premature
optimization.

TTFN,
Tarkin

Ed

May 14, 2011, 2:11:28 AM
The Beez' wrote:
> On 11 mei, 00:42, Paul Rubin <no.em...@nospam.invalid> wrote:
> > S...@ControlQ.com writes:
> > > Strings are null terminated, and *NOT* counted. type therefore takes
> > > only a pointer,
> >
> > Please fix that--it was a horrible mistake in C and is a mistake in
> > Forth. Type can still take a pointer; just embed the length in the
> > string.
> This issue has been discussed at length and never resolved. Both have
> advantages and disadvantages.
>
> Note that since the introduction of ANS strings (addr/count) the need
> for either type has been diminished greatly. In practice I only need
> it when fetching string variables or low level string operations. And
> even then, with the appropriate COUNT/PLACE words portable programs
> can be created with either an internal counted string or NULL-
> terminated string format.

ANS didn't "introduce" addr/count strings. They existed in Forth from
the beginning e.g. CMOVE. ANS merely underlined the fact that
common string functions S" COMPARE etc were not going to be
dependent on a particular string storage method.

That parsed string literals longer than 255 shall not be used in an ANS
Program (without a dependency disclaimer) was there to reflect
the reality that most forths used counted strings internally (and
still do today).

> I really don't know why Forthers are so in love with counted strings -
> NULL-terminated strings are just fine.

Perhaps because counted strings too, have proven to be fine
for the majority of applications, including forth implementations.

> And the argument that it is "so
> C-like" isn't valid anymore: so are locals, S\", etc.

If what's being proposed contradicts norms and principles
on which the language was founded, then it does matter.

Are S\" , locals, 'a' etc. "Forth-like"? I can't imagine it.

Can I imagine using null-terminated strings inside a Forth or
application? Yes. Would it be my first choice? Probably not.

The Beez'

May 14, 2011, 6:03:25 AM
On May 14, 8:11 am, "Ed" <nos...@invalid.com> wrote:
> If what's being proposed contradicts norms and principles
> on which the language was founded, then it does matter.
>
> Are  S\" , locals, 'a' etc. "Forth-like"?  I can't imagine it.
I agree with you, that's why I keep complaining at each and every such
proposal that it's not Forth-like and don't really support it - unless
I can emulate it in 4tH. That way it doesn't really bother me.

> Can I imagine using null-terminated strings inside a Forth or
> application?  Yes.  Would it be my first choice?  Probably not.

I have used ASCIIZ strings in 4tH since the beginning, simply because
it was intended to work seamlessly with C. In version 3.5a I
standardized on the addr/count representation. PLACE/COUNT does the
translation for most part between ASCIIZ and addr/count, just like
they do it in a standard Forth between counted strings and addr/count.
My experience was that the use of COUNT was greatly reduced. I would
applaud the introduction of both in the standard as a way to hide the
string implementation details from a standard Forth program. That way
we could put this issue to rest once and for all. And don't tell me
low level string handling tools would have to have carnal knowledge of
the implementation, because so is required by a CELL based counted
string implementation.

Still, funny enough (little anecdote) moving to this representation
meant that 4tH couldn't guarantee anymore (as it could before) that a
string was NULL-terminated, so any string would have to be placed in
the temporary string buffer before calling a C-function. ;-) So the
original objective (using C-based strings for compatibility) was
effectively nullified!
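
The extra step described in that anecdote can be sketched roughly as follows. This is a hypothetical helper with an assumed buffer size, not 4tH's actual code: an addr/count string is copied into a scratch buffer and terminated before being handed to a C function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TMP_SIZE 256            /* hypothetical scratch buffer size */
static char tmp[TMP_SIZE];

/* Copy an addr/count string into the scratch buffer and terminate
   it, so a C function expecting a NUL-terminated string can use it.
   Returns NULL if the string does not fit. */
const char *to_cstr(const char *addr, size_t count)
{
    assert(addr != NULL);
    if (count >= TMP_SIZE)
        return NULL;
    memcpy(tmp, addr, count);
    tmp[count] = '\0';
    return tmp;
}
```

Note that the result is only valid until the next call, and that a NUL byte embedded in the addr/count data would still cut the C view of the string short.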

Hans Bezemer

David Thompson

May 19, 2011, 1:50:39 AM
On Wed, 11 May 2011 14:19:00 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:

> "Paul Rubin" <no.e...@nospam.invalid> wrote in message
> news:7xaaetu...@ruckus.brouhaha.com...
> > "The Beez'" <han...@bigfoot.com> writes:
> > > I really don't know why Forthers are so in love with counted strings -
> > > NULL-terminated strings are just fine. And the argument that it is "so
> > > C-like" isn't valid anymore: so are locals, S\", etc.
> >
> > The strings might contain NUL bytes that are part of the string.
> > Suppose they are a UTF-16 representation of Unicode, for example.
> >
>
> So? Wouldn't the NUL-termination be the same size or larger than the
> character size, i.e., 16-bits all zero'd? I.e., the UTF-16 terminator
> should be *two* NUL bytes, shouldn't it?
>
Presumably and sort of. UTF-16 would be terminated by a 16-bit zero;
that's 2 bytes *if* byte is 8-bits.

> Do the 8-bit NUL bytes have any effect on 16-bit characters? Remember, even
> C defines the termination of strings as by a C "byte with all bits zero" and
> *not* by a C character. A C "byte" is not defined as 8-bits per ASCII or
> EBCDIC. It's defined as an addressable unit of storage large enough to hold
> a character, or similar. The result is that the C byte must be as large or
> larger in size as the C character. Today, they are typically the same size.

Right, C _defines_ byte as "addressable unit of data storage large
enough to hold" a (narrow) character; relatedly, it requires the char
*types* (all three of /*plain*/ char, signed char, and unsigned char)
to occupy exactly 1 byte which must be at least 8 bits, and at least
unsigned char must use all the bits of that byte.
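
These guarantees can be checked on a given implementation with a small sketch like the one below. The function name is hypothetical; CHAR_BIT and UCHAR_MAX come from the standard <limits.h>.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if the guarantees described above hold here:
   a byte has at least 8 bits, all three char types occupy exactly
   one byte, and unsigned char can represent at least 0..255. */
int char_guarantees_hold(void)
{
    return CHAR_BIT >= 8
        && sizeof(char) == 1
        && sizeof(signed char) == 1
        && sizeof(unsigned char) == 1
        && UCHAR_MAX >= 255;
}
```

On a conforming implementation this should return 1 regardless of how wide a byte actually is.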

C does not specify the character code, except two constraints: code 0
is reserved for the null terminator, and codes for digits '0' to '9'
must be consecutive. It does specify a required character *set* which,
probably not entirely by chance, contains the graphics common to ASCII
and slightly tweaked EBCDIC, plus a few controls also common.

It says "A byte with all bits set to 0, called the _null character_,
shall exist [and...] be used to terminate a character string"
where italicization means that is the definition of the term, so
it's a byte and a character at the same time, kind of quantum.

If stdio.h is implemented (which isn't required in "freestanding"
implementations) on a "binary" file it must be able to write and read
back all unsigned-char values, even if they aren't valid characters.
Other than that, characters may indeed not fully use C's byte.

> That's because most modern platforms are 8-bit byte addressable and use
> 8-bit bytes for characters. But, they weren't always. I.e., if char's are
> 9-bits, addressing is 8-bits, the NUL "byte" terminator for C would be
> 16-bits, all cleared. IIRC, they ran into this problem with late B or very
> early C on 16-bit word addressable machines, i.e., 9-bit characters,
> addressing is 16-bits. The point is that a terminator does not have to be a
> character.
>
According to Ritchie (in HOPL2, reprint available on his page at
BellLabs) the transition from B to "new B" and C was exactly the
transition to byte addressing. B was developed on the 18-bit
word-addressed PDP-7. He doesn't say exactly how characters were
packed into words but in those days I think 3x6 would have been more
obvious than 2x9 -- many mainframes of the day used 6-bit (addressed)
chars or Nx6-bit words (and I believe B, like BCPL, didn't require
lowercase, so 6-bit is practical). In porting B to the PDP-11, DEC's
first 8-bit-byte-addressed machine, it evolved into C.

So, yes, if a.u. is 8-bit and you need (for whatever reason) 9-bit
character in C you must use 16-bit byte.

Note all above is for "narrow" characters and strings in C since its
beginning. Several "wide" character and string features were added in
the first standard (89/90) and more in an addendum (95). Those are
clearly designed to work well as Unicode, either slightly restricted
16-bit or full 20ish-to-32-bit, although not formally required.

David Thompson

May 19, 2011, 1:50:39 AM
On Thu, 12 May 2011 19:23:46 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:
<snip>

> No length strings exist in many languages, including C. Strings in most
> other languages are not count based. Forth and PL/1 are the only two I'm
> aware of that do use counts. So, I'm still not sure why you are stuck on
> the no length string thing... But, OK, in a counted string environment, if
> the count=0, then the string has no length.
>
I think people use "length" somewhat variously; to me, and in my
understanding in the related math, it's an inherent property
and always exists, even if it isn't *stored*.
"No count" has a concrete meaning, and is clearer.

Quite a few languages have fixed-length strings, usually padded with
spaces, so the length is neither stored as a count *nor* determined by
looking for a terminator. Leaving those aside:

- Fortran has count+address optional (part 2 of the standard, not too
often implemented) since 90, and mandatory (in a slightly different
more convenient form) since IIRC 03. Some compilers also have C-style
terminated as an extension.

- PL/I always had both fixed and prefix-count, and added C-style
terminated.

- Pascal standardly had only fixed but prefix-count was a very common
extension, enough so that C programmers writing to e.g. the MacOS API
called them "Pascal" strings.

- Ada has fixed, prefix-count, *and* count+address.

- C++ has both C-style terminated and std::string more-or-less
count+address (definitely allowed to include NUL).

- Java has (only) count+address.

- Most BASICs I've seen are counted, but aren't well standardized.

> In C this string: "\0" has no length. None. Zip. Zero. Nada. \0

I would say zero length, but not "no" length. Terminology again.
If something is "priceless" that doesn't mean it costs $0.

> represents the "null byte" terminator. It's not a character in C. It's an
> addressable unit as large as or larger than a C character with all bits
> zeroed. Every sequence of characters must have one to be a valid C string.
> But, it's not part of the string, i.e., allowing a zero length string. The

C calls it a character, although it has the special property of
terminating a string. Terminology. It is definitely the same size as
any other (narrow) character, and as you say must be there.

> compiler places them on certain strings implicitly. Programmers do so
> explicitly in other situations.
>
The compiler always adds it to a string literal (or pedantically the
object compiled from a string literal; if the string literal is used
only to initialize a char array it isn't an object, although if you
use implicit sizing it *is* included in the implicit size).
For values constructed by code you need to add it, which may be
explicit or may be in operations you call like strcpy(), unless you
are writing into a buffer which *already* contains null(s).
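
A small sketch of both cases, with hypothetical names: the terminator the compiler adds shows up in the implicit array size, while code that builds a string by hand has to store it itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

char implicit[] = "abc";   /* implicit size 4: 'a' 'b' 'c' '\0' */
char empty[]    = "";      /* implicit size 1: just the terminator */

/* Sizes as seen by sizeof, which counts the terminator. */
size_t implicit_size(void) { return sizeof implicit; }
size_t empty_size(void)    { return sizeof empty; }

/* Building "ab" by hand: the terminator must be stored explicitly,
   otherwise buf is a char array but not a string. */
void build_by_hand(char buf[3])
{
    assert(buf != NULL);
    buf[0] = 'a';
    buf[1] = 'b';
    buf[2] = '\0';
}
```

strlen() excludes the terminator, so it reports 3 and 0 for the two arrays, one less than their sizes.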

<snip rest>

David Thompson

May 19, 2011, 1:50:39 AM
On Thu, 12 May 2011 19:21:54 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:

>
> "Paul Rubin" <no.e...@nospam.invalid> wrote in message
> news:7x62pgv...@ruckus.brouhaha.com...
> > "Rod Pemberton" <do_no...@noavailemail.cmm> writes:
> > > So? Wouldn't the NUL-termination be the same size or larger than the
> > > character size, i.e., 16-bits all zero'd? I.e., the UTF-16 terminator
> > > should be *two* NUL bytes, shouldn't it?
> >
>
> The reason I brought that up was, IIRC, unicode chars map to ASCII for the
> lower values. However, I don't know whether UTF-16 extends those ASCII
> values, including NUL, from 8-bits to 16-bits. I vaguely recall some sort
> of RLL-like compression... It could very well be that 8-bits zeroed is NUL
> in UTF-16 too. Someone else will have to expound on this.
>
More exactly, codepoints 0-255 in Unicode are exactly ISO 8859-1
(aka Latin-1) an 8-bit code whose first half is 7-bit ASCII.
UTF-16 represents *everything* in 16-bit elements.
Codepoints 0-65535 in Unicode are 1 16-bit value, including 0 NUL;
codepoints >= 65536 are represented as 2 16-bit 'surrogate' values.

The only thing that's even vaguely RLL-like is UTF-8, where except for
the ASCII-equivalent codes the number of leading 1 bits in the first
octet indicates the number of octets:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
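
The table translates directly into a lead-byte test. A minimal sketch, with a hypothetical function name; it classifies the first octet only and does not validate the continuation octets:

```c
#include <assert.h>

/* Number of octets in a UTF-8 sequence, judged from its first octet.
   Returns 0 for a continuation octet (10xxxxxx) or an invalid lead. */
int utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}
```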

> > But now you have to care about whether some given byte string in memory
> > semantically represents a UTF-16 encoding or something else.
> >
> > The basic problem here is that NUL is a perfectly good character in both
> > ascii and unicode,
>
> Perfectly good character for what? It's not a text character. It's not a
> graphics character. It's not a formatting character. It's not a host
> specific graphics character. It's nothing of use in ASCII, EBCDIC, or even
> PETSCII.
>
NUL is defined in both ASCII et seq and EBCDIC as a control. In
both/all those codes there are numerous other controls like STX ETX
ENQ ACK SO SI ESC DLE FS GS that are neither graphics nor formatting.
(Well, SO SI *can be* formatting and *some* ESC *sequences* are.)
They may or may not be redefined in modified codes. But in addition to
being a terminator in C, NUL/zero has been used as filler or padding
in many other formats, protocols, and operations, so people designing
modifications are wise not to reuse it for anything important.

> > C strings are not a sequence c1,c2,c3... where c1,c2... can be any
> > character.
>
> Yes, they are. Of course, a NUL character could be confused by code for C's
> "null byte" terminator, e.g., if byte size and character size are the same.
>
As elsethread, C requires "narrow" char *types* to be exactly a byte,
although not all values within that size need to be valid characters.
There is also a "wide" char that can (usually should) be larger.

> > They are restricted against containing a certain character
> > that doesn't show up in most text messages,
> >
>
> I doubt a C specification pedant would agree here. They'd probably say
> something similar to what I said above. I.e., a C system could support
> both, but isn't required to do so, e.g., if byte size and character size are
> different.
>
(Narrow) C strings definitely cannot contain a null character aka null
byte. To be totally pedantic 7.1.1 says a string is "terminated by and
includes" the null, but essentially all operations on the string are
defined to exclude the null. (Wide string similarly has a null *wide*
character only at the end, but usually contains null bytes.)

You can of course write operations in C on counted char arrays (prefix
or separate) with no terminator, and the standard library provides a
few (memcpy, memcmp, memchr and friends). But in C terminology those
are not strings.
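
To make the distinction concrete, here is a minimal C sketch. The
struct counted and the function count_byte are hypothetical names used
only for this illustration; memchr and strlen are the real standard
library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted buffer: the length travels separately,
 * so the data may legitimately contain zero bytes. */
struct counted {
    size_t len;
    const char *data;
};

/* Count occurrences of byte c in a counted buffer using memchr,
 * one of the standard functions that takes an explicit length. */
size_t count_byte(struct counted s, int c)
{
    size_t n = 0;
    size_t left = s.len;
    const char *p = s.data;

    assert(s.data != NULL || s.len == 0);
    while (left > 0) {
        const char *hit = memchr(p, c, left);
        if (hit == NULL)
            break;
        n++;
        left -= (size_t)(hit - p) + 1;   /* skip past the match */
        p = hit + 1;
    }
    return n;
}
```

For the five bytes 'a', 0, 'b', 0, 'c', strlen sees a one-character
string, while count_byte over the counted buffer finds both zero bytes
and still reaches the 'c' at the end.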

Rod Pemberton

unread,
May 19, 2011, 7:01:28 AM5/19/11
to
"David Thompson" <dave.th...@verizon.net> wrote in message
news:8ma9t6dhgq27gdc0r...@4ax.com...

> On Wed, 11 May 2011 14:19:00 -0400, "Rod Pemberton"
> <do_no...@noavailemail.cmm> wrote:
>
> > "Paul Rubin" <no.e...@nospam.invalid> wrote in message
> > news:7xaaetu...@ruckus.brouhaha.com...
> > > "The Beez'" <han...@bigfoot.com> writes:
> > > > I really don't know why Forthers are so in love with counted strings -
> > > > NULL-terminated strings are just fine. And the argument that it is "so
> > > > C-like" isn't valid anymore: so are locals, S\", etc.
> > >
> > > The strings might contain NUL bytes that are part of the string.
> > > Suppose they are a UTF-16 representation of Unicode, for example.
> >
> > So? Wouldn't the NUL-termination be the same size or larger than the
> > character size, i.e., 16-bits all zero'd? I.e., the UTF-16 terminator
> > should be *two* NUL bytes, shouldn't it?
> >
> Presumably and sortof. UTF-16 would be terminated by a 16-bit zero;
> that's 2 bytes *if* byte is 8-bits.
>

In a C context, a byte >= char in size, a minimum of 7-bits is required due
to early character sets. If the machine natively addresses 8-bits and
characters are implemented as 12-bits, then a byte must be 16-bits. So, in
this case, a 16-bit quantity should be called a byte.

The point was that Forth code should use a "correctly" sized unit for the
strings. I.e., 8-bits is too small to manipulate 16-bit strings. Use
16-bits.

> > Do the 8-bit NUL bytes have any effect on 16-bit characters? Remember, even
> > C defines the termination of strings as by a C "byte with all bits zero" and
> > *not* by a C character. A C "byte" is not defined as 8-bits per ASCII or
> > EBCDIC. It's defined as an addressable unit of storage large enough to hold
> > a character, or similar. The result is that the C byte must be as large or
> > larger in size as the C character. Today, they are typically the same size.
>
> Right, C _defines_ byte as "addressable unit of data storage large
> enough to hold" a (narrow) character; relatedly, it requires the char
> *types* (all three of /*plain*/ char, signed char, and unsigned char)
> to occupy exactly 1 byte which must be at least 8 bits, and at least
> unsigned char must use all the bits of that byte.
>

No. 7 bits is the minimum character size for C. This was due to early
character sets on teletypes. This requirement is in one of the includes.

No again. The character must be storable in a byte. It doesn't have to
"occupy", i.e., use all bits. (You actually state this correctly later,
snipped.) E.g., if the minimum "'addressable unit of data storage large
enough to hold' a (narrow) character" is 16-bits, i.e., a C byte, and
(unsigned) characters are 8-bits, two characters can be stored in that C
byte. The cryptic words are "to hold" ...

> C does not specify the character code, except two constraints: code 0
> is reserved for the null terminator,

Null byte: \0 is not a character. It's a byte.

> and codes for digits '0' to '9'
> must be consecutive.

True. This is because they are consecutive in ASCII and EBCDIC. The
alphabetic characters are not. They couldn't require that they are
consecutive.

> It does specify a required character *set* which,
> probably not entirely by chance, contains the graphics common to ASCII
> and slightly tweaked EBCDIC, plus a few controls also common.
>

It's not by chance at all. Those are the two primary character sets upon
which C (narrow) characters are based.

> It says "A byte with all bits set to 0, called the _null character_,
> shall exist [and...] be used to terminate a character string"
> where italicization means that is the definition of the term, so
> it's a byte and a character at the same time, kind of quantum.

No. That's irrational. It's *CALLED* is different from *IT IS*. You
already know that the "null character" *IS NOT* a character from the
definitions of "byte" and "character". The beginning of the definition
explicitly clarifies it for you: "A byte with all bits set to 0 ... shall
exist [and ...] be used to terminate a character string". The string
terminator in C is a null byte - which is "_called_ the null character"...


Rod Pemberton


Rod Pemberton

unread,
May 19, 2011, 7:02:26 AM5/19/11
to
"David Thompson" <dave.th...@verizon.net> wrote in message
news:64b9t6h3al5ak2ode...@4ax.com...

> On Thu, 12 May 2011 19:21:54 -0400, "Rod Pemberton"
> <do_no...@noavailemail.cmm> wrote:
> > "Paul Rubin" <no.e...@nospam.invalid> wrote in message
> > news:7x62pgv...@ruckus.brouhaha.com...
> > > "Rod Pemberton" <do_no...@noavailemail.cmm> writes:
>
> > > The basic problem here is that NUL is a perfectly good character in both
> > > ascii and unicode,
> >
> > Perfectly good character for what? It's not a text character. It's not a
> > graphics character. It's not a formatting character. It's not a host
> > specific graphics character. It's nothing of use in ASCII, EBCDIC, or even
> > PETSCII.
> >
> NUL is defined in both ASCII et seq and EBCDIC as a control. In
> both/all those codes there are numerous other controls like STX ETX
> ENQ ACK SO SI ESC DLE FS GS that are neither graphics nor formatting.
> (Well, SO SI *can be* formatting and *some* ESC *sequences* are.)
>

IIRC, all those control characters control various aspects of teletypes:
electrical handshakes, character set switch, form separators, etc. Yes?
No? You say NUL is a control character. What does NUL control?

> > > C strings are not a sequence c1,c2,c3... where c1,c2... can be any
> > > character.
> >
> > Yes, they are. Of course, a NUL character could be confused by code for C's
> > "null byte" terminator, e.g., if byte size and character size are the same.
> >
> As elsethread, C requires "narrow" char *types* to be exactly a byte,
>

No. The original articles by Ritchie describe how they implemented strings
on a 16-bit word-sized architecture. The null byte string terminator was
16-bits all cleared. Characters were 9-bits, IIRC.

> > I doubt a C specification pedant would agree here. They'd probably say
> > something similar to what I said above. I.e., a C system could support
> > both, but isn't required to do so, e.g., if byte size and character size are
> > different.
> >
> (Narrow) C strings definitely cannot contain a null character aka null
> byte.

They are not the same. They are only the same if a byte is the same size as
a character. That is not a requirement of C. So, yes, in theory, a C
implementation could support both, as long as the sizes are different.
E.g., on a word-addressable system with 16-bit bytes and 9-bit chars, any
word with all 16-bits clear could be the null byte string terminator
required by C. It's larger in size than a character. Any word with the
lower 9-bits cleared but at least one upper bit set could be a NUL
character. Alternately, one specific upper bit combination with the lower
9-bits cleared could be chosen for the NUL character.


Rod Pemberton


Rod Pemberton

unread,
May 19, 2011, 7:17:41 AM5/19/11
to
"David Thompson" <dave.th...@verizon.net> wrote in message
news:abb9t6l5o0j3oafl1...@4ax.com...

> On Thu, 12 May 2011 19:23:46 -0400, "Rod Pemberton"
> <do_no...@noavailemail.cmm> wrote:
> <snip>
> > No length strings exist in many languages, including C. Strings in most
> > other languages are not count based. Forth and PL/1 are the only two
> > I'm aware of that do use counts. So, I'm still not sure why you are
> > stuck on the no length string thing... But, OK, in a counted string
> > environment, if the count=0, then the string has no length.
> >
> I think people use "length" somewhat variously; to me, and in my
> understanding in the related math, it's an inherent property
> and always exists, even if it isn't *stored*.
>

"In my understanding", a length is a unit of measure. Positive value means
it has length. Zero means it has no length.

> "No count" has a concrete meaning, and is clearer.
>

Ok...

> > In C this string: "\0" has no length. None. Zip. Zero. Nada. \0
>
> I would say zero length,

Zero is not a length. A length is a unit of measure, i.e., positive
non-zero if it has length.

> ... but not "no" length. Terminology again.
...

> > represents the "null byte" terminator. It's not a character in C. It's
> > an addressable unit as large as or larger than a C character with all
> > bits zeroed. Every sequence of characters must have one to be a
> > valid C string. But, it's not part of the string, i.e., allowing a zero
> > length string.
>

> C calls it a character, although it has the special property of
> terminating a string. Terminology.

...

> It is definitely the same size as
> any other (narrow) character,
>

No. That's wrong! It *CAN BE* "the same size as any other (narrow)
character", depending on C implementation. Some implementations have C
bytes that are the same size as C characters. Some don't. They are *NOT
REQUIRED* to be the same size. And, it *HAS BEEN* implemented with a
different size, by the original creators of C, no less. They wrote about
it.


Rod Pemberton


Alex McDonald

unread,
May 19, 2011, 9:32:06 AM5/19/11
to
On May 19, 12:17 pm, "Rod Pemberton" <do_not_h...@noavailemail.cmm>
wrote:
> "David Thompson" <dave.thomps...@verizon.net> wrote in message
> news:abb9t6l5o0j3oafl1...@4ax.com...
>
> > On Thu, 12 May 2011 19:23:46 -0400, "Rod Pemberton"
> > <do_not_h...@noavailemail.cmm> wrote:
> > <snip>
> > > No length strings exist in many languages, including C.  Strings in most
> > > other languages are not count based.  Forth and PL/1 are the only two
> > > I'm aware of that do use counts.  So, I'm still not sure why you are
> > > stuck on the no length string thing...  But, OK, in a counted string
> > > environment, if the count=0, then the string has no length.
>
> > I think people use "length" somewhat variously; to me, and in my
> > understanding in the related math, it's an inherent property
> > and always exists, even if it isn't *stored*.
>
> "In my understanding", a length is a unit of measure.  Positive value means
> it has length.  Zero means it has no length.
>
> > "No count" has a concrete meaning, and is clearer.
>
> Ok...
>
> > > In C this string: "\0" has no length.  None.  Zip.  Zero.  Nada.  \0
>
> > I would say zero length,
>
> Zero is not a length.  A length is a unit of measure, i.e., positive
> non-zero if it has length.

I think you're mistaking 0 in this case for dimensionless, which it
isn't. Some programming languages have the concept of signed 0; some
differentiate between 0 and a missing value (not to be confused with
an uninitialized value, since you can do arithmetic on missings).
Dimension is not lost when something is nothing, since 0kg indicates
quite clearly a zero mass; and its other properties may well be non-
zero.

[snip]

David Thompson

unread,
Jun 1, 2011, 4:33:15 PM6/1/11
to
On Thu, 19 May 2011 07:01:28 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:

> In a C context, a byte >= char in size, a minimum of 7-bits is required due
> to early character sets. If the machine natively addresses 8-bits and
> characters are implemented as 12-bits, then a byte must be 16-bits. So, in
> this case, a 16-bit quantity should be called a byte.
>

C has always required the storage unit 'byte' and the 'char' *types*
to be at least 8 bits. The characters stored in those units can indeed
be smaller. In your example, yes, the C byte must be 16 bits.

> > Right, C _defines_ byte as "addressable unit of data storage large
> > enough to hold" a (narrow) character; relatedly, it requires the char
> > *types* (all three of /*plain*/ char, signed char, and unsigned char)
> > to occupy exactly 1 byte which must be at least 8 bits, and at least
> > unsigned char must use all the bits of that byte.
> >
>
> No. 7 bits is the minimum character size for C. This was due to early
> character sets on teletypes. This requirement is in one of the includes.
>

8 bits. The size for a given implementation is CHAR_BIT in limits.h;
5.2.4.2.1 says CHAR_BIT must be *at least* 8. Remember that in C,
especially early C on machines where memory was fairly precious,
programmers commonly used 'char' variables for small integers not
actual characters. A char/byte size less than 8 could ruin that.
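
These guarantees can be checked on any given compiler. A minimal
sketch in C follows; the function name uchar_value_bits is
hypothetical, while CHAR_BIT and UCHAR_MAX are the real macros from
limits.h.

```c
#include <assert.h>
#include <limits.h>

/* A conforming implementation can never trip this. */
#if CHAR_BIT < 8
#error "CHAR_BIT must be at least 8 in conforming C"
#endif

/* Count the value bits of unsigned char by shifting UCHAR_MAX to zero.
 * The standard requires unsigned char to use every bit of its byte,
 * so the result should equal CHAR_BIT. */
int uchar_value_bits(void)
{
    unsigned char m = UCHAR_MAX;
    int n = 0;

    while (m != 0) {
        m >>= 1;
        n++;
    }
    assert(n == CHAR_BIT);
    return n;
}
```

On the common 8-bit-byte machines this returns 8. On a DSP with 16-bit
or 32-bit chars it would return 16 or 32, and sizeof(char) would still
be 1.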

Actually many 'early' devices required and computers stored only 5 or
6 bits for a character. When C was created in ~1971, terminals with
lower case, and thus really needing 7 bits, were rare, and making C
case sensitive with lowercase keywords was considered actually one of
its more daring leaps forward; today we don't even notice this.

> No again. The character must be storable in a byte. It doesn't have to
> "occupy", i.e., use all bits. (You actually state this correctly later,

I distinguished (perhaps too subtly) the char *types* and actual
character values. The char *types* must fill a byte; the character
*values* need not fill a char=byte storage unit.

> snipped.) E.g., if the minimum "'addressable unit of data storage large
> enough to hold' a (narrow) character" is 16-bits, i.e., a C byte, and
> (unsigned) characters are 8-bits, two characters can be stored in that C
> byte. The cryptic words are "to hold" ...
>

You can manually pack (or compress) characters into bytes, or longs or
floats, as you wish. But what C considers and operates on as a
character, namely the char *types*, must be exactly one (C) byte.

> > C does not specify the character code, except two constraints: code 0
> > is reserved for the null terminator,
>
> Null byte: \0 is not a character. It's a byte.
>

According to C, it's both. See quote elsethread.

> > and codes for digits '0' to '9'
> > must be consecutive.
>
> True. This is because they are consecutive in ASCII and EBCDIC. The
> alphabetic characters are not. They couldn't require that they are
> consecutive.
>
> > It does specify a required character *set* which,
> > probably not entirely by chance, contains the graphics common to ASCII
> > and slightly tweaked EBCDIC, plus a few controls also common.
> >
>
> It's not by chance at all. Those are the two primary character sets upon
> which C (narrow) characters are based.
>

I was being a little sarcastic. My implicit point is that C specifies
characters in a way that allows ASCII or EBCDIC, *or* some different
charcode if someone comes up with a popular new one -- which
at this stage seems extremely unlikely. The same for Fortran and COBOL
and PL/I -- for the same reason. OTOH Forth specifies exactly ASCII
(which can easily be mapped to EBCDIC and back, where necessary),
and Ada specifies exactly 8859-1 and BMP (Unicode) (which can't).
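
A short C sketch of what that charcode independence means in practice.
The function names are hypothetical. Only the digit rule is a
guarantee of the standard; the letter lookup is one possible way to
stay portable, not the only one.

```c
#include <assert.h>
#include <string.h>

/* Portable in any conforming C: '0'..'9' are guaranteed consecutive. */
int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Letters carry no such guarantee (EBCDIC has gaps after i and r),
 * so find the position in a table instead of computing c - 'a'. */
int letter_index(char c)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    if (c == '\0')
        return -1;          /* strchr would match the terminator */
    p = strchr(alphabet, c);
    return p ? (int)(p - alphabet) : -1;
}

/* Parse a run of decimal digits, stopping at the first non-digit.
 * No overflow check in this sketch. */
unsigned parse_decimal(const char *s)
{
    unsigned v = 0;

    assert(s != NULL);
    while (digit_value(*s) >= 0)
        v = v * 10 + (unsigned)digit_value(*s++);
    return v;
}
```

The explicit test for '\0' in letter_index is needed because strchr
treats the terminating null character as part of the string it
searches.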

> > It says "A byte with all bits set to 0, called the _null character_,
> > shall exist [and...] be used to terminate a character string"
> > where italicization means that is the definition of the term, so
> > it's a byte and a character at the same time, kind of quantum.
>
> No. That's irrational. It's *CALLED* is different from *IT IS*. You
> already know that the "null character" *IS NOT* a character from the
> definitions of "byte" and "character". The beginning of the definition
> explicitly clarifies it for you: "A byte with all bits set to 0 ... shall
> exist [and ...] be used to terminate a character string". The string
> terminator in C is a null byte - which is "_called_ the null character"...
>

This is (at most) semantics. If you think of characters as glyphs (or
vice versa), lots of things stored in computers aren't characters. But
the people who defined character codes, and programming languages,
treated space as a character, and linefeed, and NUL, and many others.
If you want to disagree with the whole scheme, that's coherent if
inconvenient, but there's no sound reason to exclude *only* NUL.

David Thompson

unread,
Jun 1, 2011, 4:33:15 PM6/1/11
to
On Thu, 19 May 2011 07:17:41 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:

> "David Thompson" <dave.th...@verizon.net> wrote in message
> news:abb9t6l5o0j3oafl1...@4ax.com...

> > I think people use "length" somewhat variously; <snip>


> "In my understanding", a length is a unit of measure. Positive value means
> it has length. Zero means it has no length.
>

This is approaching metaphysics. I would call it a measurement, or a
measure, but not a unit. But I think everyone understood what each of
us meant and continuing discussion of terminology is unproductive.

> > [C null terminator] is definitely the same size as
> > any other (narrow) character,
> >
>
To be clear I probably should have said 'in storage'.

> No. That's wrong! It *CAN BE* "the same size as any other (narrow)
> character", depending on C implementation. Some implementations have C
> bytes that are the same size as C characters. Some don't. They are *NOT
> REQUIRED* to be the same size. And, it *HAS BEEN* implemented with a
> different size, by the original creators of C, no less. They wrote about
> it.
>

char *types* and the null terminator are 1 byte; see other reply.

David Thompson

unread,
Jun 1, 2011, 4:34:03 PM6/1/11
to
On Thu, 19 May 2011 07:02:26 -0400, "Rod Pemberton"
<do_no...@noavailemail.cmm> wrote:

> "David Thompson" <dave.th...@verizon.net> wrote in message

<snip>


> > NUL is defined in both ASCII et seq and EBCDIC as a control. In
> > both/all those codes there are numerous other controls like STX ETX
> > ENQ ACK SO SI ESC DLE FS GS that are neither graphics nor formatting.
> > (Well, SO SI *can be* formatting and *some* ESC *sequences* are.)
> >
>
> IIRC, all those control characters control various aspects of teletypes:
> electrical handshakes, character set switch, form separators, etc. Yes?

No. Assuming you use 'teletype' to include at least all printing
terminals (as was common usage) and maybe all character terminals (as
was somewhat more controversial), relatively few devices implemented
comms controls (SOH etc.) and most implemented only a subset of format
effectors, and I never saw *any* that implemented SUB EM or the
separators FS GS RS US as anything but no-ops -- just like NUL. And
similarly DEL; it was *defined* to do nothing, just like NUL, and was
often used for padding like NUL but in situations where NUL wasn't
convenient (like a C string) or might get stripped prematurely.

PS: control chars like SOH..EOT etc and DC* were used for *protocol*
handshakes. *Electrical* handshakes e.g. RTS/CTS and DTR/DSR for
RS-232, did not use any character codes. Although Hayes modems, and
their many successors, did use NON-control characters to command
physical-level functions like offhook/onhook and dialing.

> No? You say NUL is a control character. What does NUL control?
>

It occupies a character space, and for serial transmission a character
time. Many CPUs have a NOP instruction that does nothing except occupy
space and take time, and sometimes not even that; do you consider that
not an instruction? (Aside: one of the many things I liked about the
PDP-10 is that it has many no-op instructions, with mostly different
performance, all documented!)

> > As elsethread, C requires "narrow" char *types* to be exactly a byte,
> >
>
> No. The original articles by Ritchie describe how they implemented strings
> on a 16-bit word-sized architecture. The null byte string terminator was
> 16-bits all cleared. Characters were 9-bits, IIRC.
>

What original articles? My K&R1 is long gone, but my recollection is
it clearly specified that a char is a byte and vice versa in C, and my
(vaguer) recollection is so did the Labs Unix v6 manuals. Ritchie's
HOPL2 paper ('reprinted' on his webpage) treats 'char(acter)' and
'byte' as equivalent, and the C standard explicitly requires it. The
paper does say that packed characters in PDP-7 *B* were 9-bit,
contrary to my previous guess of 6. The terminator in B was a single
character, but the paper doesn't say it was 0.

> > > I doubt a C specification pedant would agree here. They'd probably say
> > > something similar to what I said above. I.e., a C system could support
> > > both, but isn't required to do so, e.g., if byte size and character size are
> > > different.
> > >
> > (Narrow) C strings definitely cannot contain a null character aka null
> > byte.
>
> They are not the same. They are only the same if a byte is the same size as
> a character. That is not a requirement of C. So, yes, in theory, a C
> implementation could support both, as long as the sizes are different.

C char *types* must occupy exactly one byte, and thus all characters
must be stored as a byte, but some byte/char values can be skipped and
not used as characters. This could be bitwise e.g. 7bit ASCII in an
8bit byte (with space or even parity, but not odd or mark, see below)
or values e.g. EBCDIC uses IIRC 40,4A-F, 50,5A-F, 60,61,6A-F, 70-F,
C1-9, D1-9, E2-9, F0-9 and a few others out of 00-FF.

> E.g., on a word-addressable system with 16-bit bytes and 9-bit chars, any
> word with all 16-bits clear could be the null byte string terminator
> required by C. It's larger in size than a character. Any word with the
> lower 9-bits cleared but at least one upper bit set could be a NUL
> character. Alternately, one specific upper bit combination with the lower
> 9-bits cleared could be chosen for the NUL character.
>

No. C99 5.2.1p2, unchanged from C89: "A byte with all bits set to 0,
called the _null character_, shall exist in the basic execution
character set; it is used to terminate a character string." (The
_italics_ mean this is the definition of the term.) And 7.1.1p1: "A
string is a contiguous sequence of characters terminated by and
including the first null character."
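
What those two clauses mean for ordinary code can be seen in a few
lines of C. The names demo and string_storage are hypothetical; strlen
and sizeof behave as the standard describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The array behind this literal holds 6 chars: a b \0 c d \0.
 * As a *string* it ends at the first null character. */
static const char demo[] = "ab\0cd";

/* Storage occupied by the string starting at s, counting the
 * terminating null character that 7.1.1p1 says it includes. */
size_t string_storage(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

Here sizeof demo is 6 but strlen(demo) is 2, and the string occupies 3
chars of storage. Likewise the literal "\0" is a 2-char array holding
a string of length zero.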

Rod Pemberton

unread,
Jun 2, 2011, 3:20:17 AM6/2/11
to
"David Thompson" <dave.th...@verizon.net> wrote in message
news:f27du69tmec3nm4mv...@4ax.com...
> [snip]

Sorry, but there is no point in me providing further response to this. ISTM
that you've just repeated arguments I sincerely believe to be incorrect and
I've explained as best I can. Maybe with a deeper understanding of assembly
and how C is implemented using it, things will become clearer. We're also
very OT.


Rod Pemberton


Rod Pemberton

unread,
Jun 2, 2011, 3:20:33 AM6/2/11
to
"David Thompson" <dave.th...@verizon.net> wrote in message
news:mg8du6d2ja4m3pc8t...@4ax.com...