RfD: c-addr/len


Peter Knaggs

Sep 11, 2009, 3:08:40 PM
to fort...@yahoogroups.com
c-add/len
=========

2009-09-09 Rendered into RfD form, added Forth200x words
1999-06-22 Original Text by John Rible


Problem
=======
A large number of words use "c-add u" to indicate the address of a
string (c-addr) and its length (u) on the stack. With the
introduction of variable width characters, it is not clear if "u" is
referring to the number of characters or address units.


Solution
========
Introduce a new pseudo-type ("len") into the document of these words
to clarify the intent. Replacing the "u" with a "len" should improve
the documentation of these words. The words effected are:

6.1.0040 #>
6.1.0570 >NUMBER
6.1.0980 COUNT
6.1.1345 ENVIRONMENT?
6.1.1360 EVALUATE
6.1.1540 FILL
6.1.2165 S"
6.1.2216 SOURCE
6.1.2310 TYPE
6.2.2008 PARSE
6.2.xxxx PARSE-NAME
11.6.1.1010 CREATE-FILE
11.6.1.1190 DELETE-FILE
11.6.1.1718 INCLUDED
11.6.1.1970 OPEN-FILE
11.6.1.2080 READ-FILE
11.6.1.2090 READ-LINE
11.6.1.2165 S"
11.6.1.2480 WRITE-FILE
11.6.1.2485 WRITE-LINE
11.6.2.1524 FILE-STATUS
11.6.2.2130 RENAME-FILE
12.6.1.0558 >FLOAT
11.6.2.xxxx REQUIRED
12.6.1.2143 REPRESENT
13.6.1.0086 (LOCAL)
16.6.1.2192 SEARCH-WORDLIST
17.6.1.0170 -TRAILING
17.6.1.0245 /STRING
17.6.1.0780 BLANK
17.6.1.0910 CMOVE
17.6.1.0920 CMOVE>
17.6.1.0935 COMPARE
17.6.1.2191 SEARCH
17.6.1.2212 SLITERAL


Proposal
========

1. Add the following to table 3.1 - Data Types

len    character-string length    1 cell

2. Add the following to 3.1.1 Data-type relationships

len => u => x

3. Replace "u" with "len" in 3.1.4.2 Character strings:

A string is specified by a cell pair (c-addr len) representing
its starting address and length in characters.

4. Add the following to table 3.5 - Environmental Query Strings:

/CHARACTER-STRING    n    yes    maximum size of len in characters

5. Change "u" to "len" in the stack description, definition and
rationale of the words listed under the Solution.

6. Replace "u" with "len" in section A.3.1.3.4 Counted Strings.

7. Change "u" to "len" in the rationale for A.6.2.0855 C".


Author
======
Peter Knaggs, P.J.K...@exeter.ac.uk

Josh Grams

Sep 12, 2009, 8:29:07 AM
Peter Knaggs wrote: <4AAAA038...@bcs.org.uk>
>
>
> c-add/len

c-addr

>=========
>
> 2009-09-09 Rendered into RfD form, added Forth200x words
> 1999-06-22 Original Text by John Rible
>
>
> Problem
>=======
> A large number of words use "c-add u" to indicate the address of a

"c-addr u"

> string (c-addr) and its length (u) on the stack. With the
> introduction of variable width characters, it is not clear if "u" is
> referring to the number of characters or address units.

Er...unless I missed a decision to do away with the distinction between
"1 CHARS" and "address units", isn't the ambiguity between "variable
width characters" and "characters"? I don't see that this proposal
actually clarifies that.

At any rate, I think the definition at 3.1.4.2 Character strings makes
it clear that "c-addr u" as a unit means something special, so I don't
see any reason to replace the "u" with "len".

> Solution
>========
> Introduce a new pseudo-type ("len") into the document of these words
> to clarify the intent. Replacing the "u" with a "len" should improve
> the documentation of these words. The words effected are:

affected

> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
>
> A string is specified by a cell pair (c-addr len) representing
> its starting address and length in characters.

In 2.1 Definitions of Terms, we have:

character:
Depending on context, either 1) a storage unit capable of holding a
character; or 2) a member of a character set.

I think that the presence of an address (i.e. the location of some
storage) makes it pretty clear that sense 1 is meant here, but if people
are confused by that, you might want to clarify.

----

Instead of adopting this (and that "pchar" rename proposal), I think it
would make much more sense to clarify things by leaving the existing
"char" and "character" alone, and instead adopting new terminology for
variable width characters.

As I see it, there's no reason to go changing terminology on people when
you could instead just adopt new terminology for the new concept. Much
less potential for confusion that way.

--Josh

Bernd Paysan

Sep 12, 2009, 3:10:23 PM
Josh Grams wrote:
> Instead of adopting this (and that "pchar" rename proposal), I think
> it would make much more sense to clarify things by leaving the
> existing "char" and "character" alone, and instead adopting new
> terminology for variable width characters.

We are pretty much there - the extended characters are called "extended
characters" or, for short, "xchars". An xchar in memory may consist of
several characters (primitive characters, that is). I think it's easier
to deal with the name "pchar" when the "storage unit" is meant than name
it "character", but outside the xchar proposal, the terminology is not
needed.

The c-addr/len makes life easier as it definitely states that the length
is meant to be in characters (pchars), i.e. the storage unit as meant in
2.1. character 1).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

m_l_g3

Sep 13, 2009, 2:59:51 PM
Peter Knaggs wrote:

>
> Problem
> =======
> A large number of words use "c-add u" to indicate the address of a
> string (c-addr) and its length (u) on the stack. With the
> introduction of variable width characters, it is not clear if "u" is
> referring to the number of characters or address units.
>
>
> Solution
> ========
> Introduce a new pseudo-type ("len") into the document of these words
> to clarify the intent.

Sorry, but I do not see from your proposal what sort of length "len"
denotes: is it "length in characters", "length in logical (multi-byte)
characters", or "length in address units"?

> Replacing the "u" with a "len" should improve
> the documentation of these words.

In fact, in many cases words are commented as taking and/or
leaving ( addr len ) rather than ( c-addr u ), so there is existing
practice.

IMO replacing "u" with "len" does improve readability, but does
not resolve the "which length" puzzle.

au-length, log-length, c-length ?

CHARS ( c-length -- au-length )
and so on...

Bernd Paysan

Sep 13, 2009, 4:43:03 PM
m_l_g3 wrote:
> IMO replacing "u" with "len" does improve readability, but does
> not resolve the "which length" puzzle.
>
> au-length, log-length, c-length ?
>
> CHARS ( c-length -- au-length )
> and so on...

CHARS, as it is now, is:

6.1.0898 CHARS
( n1 -- n2 )
n2 is the size in address units of n1 characters.

IMHO, the stack effect is at least misleading. I find it difficult to
get a correct stack effect - we want -1 CHARS to be used to step through
strings backwards, so we want the sign. I.e. "len" is not the right
left side of this stack effect (len is a subtype of u, no sign). But we
basically use CHARS to convert +-len into a +-c-addr offset. Works fine
on two's complement, might cause problems on one's complement ;-).
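
To make the "-1 CHARS" use concrete, here is a minimal sketch of
stepping backwards through a string. The word name type-reversed is
made up for illustration and is not part of any proposal; the code
assumes only Core and Core Ext words and treats len as a count of chars.

```forth
\ Hypothetical word, for illustration only.
: type-reversed ( c-addr len -- )
  DUP 0= IF 2DROP EXIT THEN
  1- CHARS OVER +            \ c-addr last : address of the last char
  BEGIN
    DUP C@ EMIT
    2DUP <>                  \ is there a char before this one?
  WHILE
    -1 CHARS +               \ signed offset: step one char backwards
  REPEAT
  2DROP ;
```

The forward offset "1- CHARS" and the backward step "-1 CHARS" both
pass a signed count to CHARS, which is why an unsigned "len" does not
quite fit its left-hand side.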

Peter Knaggs

Sep 13, 2009, 7:01:11 PM
m_l_g3 wrote:
>
> Sorry, but I do not see from your proposal what sort of length "len"
> denotes: is it "length in characters", "length in logical (multi-byte)
> characters", or "length in address units"?

length in primitive characters (bytes).

> In fact, in many cases words are commented as taking and/or
> leaving ( addr len ) rather than ( c-addr u ), so there is existing
> practice.

Not in the standards document, hence the change.

--
Peter Knaggs

Aleksej Saushev

Sep 14, 2009, 2:27:01 AM
Peter Knaggs <p...@bcs.org.uk> writes:

> m_l_g3 wrote:
>>
>> Sorry, but I do not see from your proposal what sort of length "len"
>> denotes: is it "length in characters", "length in logical (multi-byte)
>> characters", or "length in address units"?
>
> length in primitive characters (bytes).

No, length in address units. Byte length is what is returned by "1 chars",
consider 4-bit address unit.


--
CE3OH...

Peter Knaggs

Sep 16, 2009, 8:02:23 AM
m_l_g3 wrote:
>
> Sorry, but I do not see from your proposal what sort of length "len"
> denotes: is it "length in characters", "length in logical (multi-byte)
> characters", or "length in address units"?

Would it help if we replace item 1, the definition of "len" with:

len    length of a character-string in address units    1 cell

--
Peter Knaggs

David N. Williams

Sep 16, 2009, 8:41:13 AM

Shouldn't that be in characters? (3.1.4.2)

-- David

Peter Knaggs

Sep 16, 2009, 8:49:33 AM

Which type of character? Primitive characters (3.1.3) possibly but you
could also interpret characters to be extended characters (XChar) which
include variable width characters, which is precisely what we are trying
to get away from.

David N. Williams

Sep 16, 2009, 9:31:35 AM

I guess whatever character you meant in this:

3. Replace "u" with "len" in 3.1.4.2 Character strings:

A string is specified by a cell pair (c-addr len) representing
its starting address and length in characters.

It would be a substantial change if it were to be address units,
since 1 CHARS is not necessarily one address unit.

I'm unclear what you intend. Is the meaning of "character
string" in the above being changed to allow for extended
characters?

-- David

Peter Knaggs

Sep 16, 2009, 9:54:04 AM
David N. Williams wrote:
>
> I guess whatever character you meant in this:
>
> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
>
> A string is specified by a cell pair (c-addr len) representing
> its starting address and length in characters.

The part of the X:key-ekey proposal which was accepted at the Exeter
meeting included the following:

3.1.2 Character types
Characters shall have the following properties:
– at least one address unit wide;
– contain at least eight bits;
– be of fixed width;
– have a size less than or equal to cell size;
– be unsigned.

3.1.2.3 Primitive Character
A primitive character (pchar) is a character with no restrictions
on its contents. Unless otherwise stated, a “character” refers to
a primitive character.

Thus item 3 should be changed to refer to the "length in primitive
characters". In this case I feel it probably is worth spelling out.

> It would be a substantial change if it were to be address units,
> since 1 CHARS is not necessarily one address unit.

This is part of the problem, what does u mean in CMOVE? According to
its definition "copy u consecutive characters", while most people
believe it refers to address units.

> I'm unclear what you intend. Is the meaning of "character
> string" in the above being changed to allow for extended
> characters?

No, but once extended characters are introduced there is the potential
for confusion, hence the introduction of a primitive character. Extended
characters will always be referenced as "extended character" or xchar,
while a "character" is a primitive character or pchar.

David N. Williams

Sep 16, 2009, 10:06:50 AM
Peter Knaggs wrote:
> David N. Williams wrote:
>>
>> I guess whatever character you meant in this:
>>
>> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
>>
>> A string is specified by a cell pair (c-addr len) representing
>> its starting address and length in characters.
>
> The part of the X:key-ekey proposal which was accepted at the Exeter
> meeting included the following:
>
> 3.1.2 Character types
> Characters shall have the following properties:
> – at least one address unit wide;
> – contain at least eight bits;
> – be of fixed width;
> – have a size less than or equal to cell size;
> – be unsigned.
>
> 3.1.2.3 Primitive Character
> A primitive character (pchar) is a character with no restrictions
> on its contents. Unless otherwise stated, a “character” refers to
> a primitive character.
>
> Thus item 3 should be changed to refer to the "length in primitive
> characters". In this case I feel it probably is worth spelling out.

Me, too!

>> It would be a substantial change if it were to be address units,
>> since 1 CHARS is not necessarily one address unit.
>
> This is part of the problem, what does u mean in CMOVE? According to
> its definition "copy u consecutive characters", while most people
> believe it refers to address units.

Not me! :-) MOVE is for that.

>> I'm unclear what you intend. Is the meaning of "character
>> string" in the above being changed to allow for extended
>> characters?
>
> No, but once extended characters are introduced there is the potential
> for confusion, hence the introduction of a primitive character. Extended
> characters will always be referenced as "extended character" or xchar,
> while a "character" is a primitive character or pchar.

Good! I can probably wrap my head around that.

-- David

Anton Ertl

Sep 14, 2009, 10:51:25 AM
Aleksej Saushev <as...@inbox.ru> writes:
>Peter Knaggs <p...@bcs.org.uk> writes:
>
>> m_l_g3 wrote:
>>>
>>> Sorry, but I do not see from your proposal what sort of length "len"
>>> denotes: is it "length in characters", "length in logical (multi-byte)
>>> characters", or "length in address units"?
>>
>> length in primitive characters (bytes).
>
>No, length in address units.

This proposal replaces "u" with "len" in words where "u" denotes the
number of characters.

A change to let this parameter specify a number of address units would
break existing standard programs. Granted, there are only few
standard programs that don't have an environmental dependency on
1 CHARS = 1, and all maintained systems support these programs, so
there would be little problem with such a change, but I see little
point in having such a change. Better propose standardizing
1 CHARS = 1.

>Byte length is what is returned by "1 chars",
>consider 4-bit address unit.

Yes, nibble-addressed hardware was the original rationale for
differentiating between aus and chars, but in 15 years there have been
no Forth-94 systems for nibble-addressed hardware, so I consider CHARS
a good solution for a problem that does not exist in practice.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2009: http://www.euroforth.org/ef09/

Anton Ertl

Sep 12, 2009, 8:59:41 AM
Josh Grams <jo...@qualdan.com> writes:
>Instead of adopting this (and that "pchar" rename proposal), I think it
>would make much more sense to clarify things by leaving the existing
>"char" and "character" alone, and instead adopting new terminology for
>variable width characters.

Yes. And the variable-width characters have a new name: xchars.

>As I see it, there's no reason to go changing terminology on people when
>you could instead just adopt new terminology for the new concept. Much
>less potential for confusion that way.

Apparently some people are confused because a member of the (extended)
character set need not fit into a char, and they think that renaming
chars into pchars will help avoid that confusion. I am not convinced
of that, but I can live with pchars (although I fear that we will make
mistakes in the renaming, which will increase the confusion rather
than reducing it).

Anton Ertl

Sep 12, 2009, 8:33:55 AM
Peter Knaggs <p...@bcs.org.uk> writes:
>c-add/len
>=========
>
>2009-09-09 Rendered into RfD form, added Forth200x words
>1999-06-22 Original Text by John Rible
>
>
>Problem
>=======
>A large number of words use "c-add u" to indicate the address of a
>string (c-addr) and its length (u) on the stack. With the
>introduction of variable width characters, it is not clear if "u" is
>referring to the number of characters or address units.

Variable-width characters are introduced in the xchars proposal, they
are called xchars there (and can consist of one or more fixed-width
chars in memory). Variable-width characters don't exist in the
current standard document, and chars don't become variable-width in
xchars. It's clear in all words that deal with chars that u refers to
the number of chars.

It definitely does not refer to address units in these words (only in
MOVE and ERASE, which don't deal with chars), although given that 1
chars = 1 au in all maintained systems, that distinction is of no
consequence. Every word that refers to chars says so explicitly, and
every word that refers to aus says so explicitly, and if any word in
the xchars proposal refers to a number of xchars, it will say so
explicitly, too (but I don't think there is such a word).

Examples:

From 17.6.1.0910 CMOVE:
|[...] copy u consecutive characters [...]

From 6.1.1900 MOVE:
|[...]
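
In code, the difference between the two counts can be sketched as
follows. The buffer names src, dst1 and dst2 are invented for this
example; the point is only that CMOVE and FILL take a count of chars,
while MOVE takes a count of address units, so a char count has to go
through CHARS first.

```forth
\ Buffer names are hypothetical, chosen for this illustration.
CREATE src   5 CHARS ALLOT
CREATE dst1  5 CHARS ALLOT
CREATE dst2  5 CHARS ALLOT

src 5 CHAR x FILL          \ FILL:  u is a number of chars
src dst1 5 CMOVE           \ CMOVE: u is a number of chars
src dst2 5 CHARS MOVE      \ MOVE:  u is a number of address units
```

On a system where 1 CHARS = 1 the last two lines behave identically,
which is why the distinction rarely shows up in practice.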


>4. Add the following to table 3.5 - Environmental Query Strings:
>
> /CHARACTER-STRING    n    yes    maximum size of len in characters

What's the point of that?

Any system that cannot deal with strings of the length of the longest
data memory region that can be had from the system is broken. And
that's not just IMO, but also in Forth-94.

So if the point of that query is to allow systems to not process some
of the strings that can be created, then existing standard programs
would become non-standard. Such a restriction requires a two-step
process of first declaring the feature obsolescent, and eventually
removing it. Moreover, I see no point in introducing such a
restriction.

If that's not the point of the query, then I see no point in it. If
we can process all strings we can create, there is no point in
querying for the maximum size.

Otherwise the proposal looks fine.

Ed

Sep 18, 2009, 7:43:39 AM
Peter Knaggs wrote:
> c-add/len
> =========
>
> 2009-09-09 Rendered into RfD form, added Forth200x words
> 1999-06-22 Original Text by John Rible
>
>
> Problem
> =======
> A large number of words use "c-add u" to indicate the address of a
> string (c-addr) and its length (u) on the stack. With the
> introduction of variable width characters, it is not clear if "u" is
> referring to the number of characters or address units.
>
>
> Solution
> ========
> Introduce a new pseudo-type ("len") into the document of these words
> to clarify the intent. Replacing the "u" with a "len" should improve
> the documentation of these words. The words effected are:
>
> ...
> 12.6.1.2143 REPRESENT

I must have missed it. When did "u most significant digits of
the significand" [of a number] become the length of a string?

Josh Grams

Sep 18, 2009, 7:48:31 AM
Anton Ertl wrote:
> Josh Grams <jo...@qualdan.com> writes:
>>Instead of adopting this (and that "pchar" rename proposal), I think it
>>would make much more sense to clarify things by leaving the existing
>>"char" and "character" alone, and instead adopting new terminology for
>>variable width characters.
>
> Yes. And the variable-width characters have a new name: xchars.
>
>>As I see it, there's no reason to go changing terminology on people when
>>you could instead just adopt new terminology for the new concept. Much
>>less potential for confusion that way.
>
> Apparently some people are confused because a member of the (extended)
> character set need not fit into a char, and they think that renaming
> chars into pchars will help avoid that confusion. I am not convinced
> of that, but I can live with pchars (although I fear that we will make
> mistakes in the renaming, which will increase the confusion rather
> than reducing it).

That's pretty much how I feel about it...

--Josh

Anton Ertl

Sep 18, 2009, 9:46:26 AM
"Ed" <nos...@invalid.com> writes:
>> Solution
>> ========
>> Introduce a new pseudo-type ("len") into the document of these words
>> to clarify the intent. Replacing the "u" with a "len" should improve
>> the documentation of these words. The words effected are:
>>
>> ...
>> 12.6.1.2143 REPRESENT
>
>I must have missed it. When did "u most significant digits of
>the significand" [of a number] become the length of a string?

u has always been the length of the buffer in characters in REPRESENT.
That's the only interpretation of the specification that makes any
sense.

Albert van der Horst

Sep 18, 2009, 2:03:03 PM
In article <4AAAA038...@bcs.org.uk>, Peter Knaggs <p...@bcs.org.uk> wrote:
>c-add/len
>=========
>
>2009-09-09 Rendered into RfD form, added Forth200x words
>1999-06-22 Original Text by John Rible
>
>
>Problem
>=======
>A large number of words use "c-add u" to indicate the address of a
>string (c-addr) and its length (u) on the stack. With the
>introduction of variable width characters, it is not clear if "u" is
>referring to the number of characters or address units.
>
>
>Solution
>========
>Introduce a new pseudo-type ("len") into the document of these words
>to clarify the intent. Replacing the "u" with a "len" should improve
>the documentation of these words. The words effected are:

I use the word "sc" in my documentation for the pair.
It means string-constant. It implies that the word
using it must not reach through to the "c-add" and change
characters there. (So e.g. /STRING is okay.)

Anyway, I'm in favour of using a single indication of the pair
whenever they cannot be logically separated.
This allows for a full explanation of "sc" at one place, instead of
limited explanations regarding address units/ character units at
several places. Maybe a distinction between "sc" and "xsc" is in
order.

>Peter Knaggs, P.J.K...@exeter.ac.uk

Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Aleksej Saushev

Sep 19, 2009, 5:16:21 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>Peter Knaggs <p...@bcs.org.uk> writes:
>>
>>> m_l_g3 wrote:
>>>>
>>>> Sorry, but I do not see from your proposal what sort of length "len"
>>>> denotes: is it "length in characters", "length in logical (multi-byte)
>>>> characters", or "length in address units"?
>>>
>>> length in primitive characters (bytes).
>>
>>No, length in address units.
>
> This proposal replaces "u" with "len" in words where "u" denotes the
> number of characters.
>
> A change to let this parameter specify a number of address units would
> break existing standard programs. Granted, there are only few
> standard programs that don't have an environmental dependency on
> 1 CHARS = 1, and all maintained systems support these programs, so
> there would be little problem with such a change, but I see little
> point in having such a change. Better propose standardizing
> 1 CHARS = 1.

You're not consistent in your opinion that we should use UNICODE:
either 1 CHARS = 1, and you use one-octet encodings on octet-addressing
platforms, or 1 CHARS may be any other value, and you return to address
units, which are octets in many cases. The third way is decoupling Forth
from hardware in full, so that you don't deal with real CPU address units
at all.


--
CE3OH...

Bernd Paysan

Sep 19, 2009, 5:41:10 PM
Aleksej Saushev wrote:
> You're not consistent in your opinion that we should use UNICODE:
> either 1 CHARS = 1, and you use one-octet encodings on
> octet-addressing platforms, or 1 CHARS may be any other value, and you
> return to address units, which are octets in many cases. The third way
> is decoupling Forth from hardware in full, so that you don't deal with
> real CPU address units at all.

"Unicode" is not just one encoding. You can have an ASCII-compatible
byte-encoding like UTF-8 (which is what I recommend for Forth with
Unicode), or UTF-16, which is still a variable length encoding (one or
two 16-bit words make a character, i.e. you still need the XCHAR wordset
to work with UTF-16), or UCS4, which will be fixed-size, but is quite
wasteful.

Except for a few experiments, all Forth systems have 1 CHARS = 1. Most
programs rely on that, as well (i.e. they don't use CHARS where they
should, often, they also don't use CHAR+ but 1+ or so).
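
For reference, a small sketch of what the strictly portable form looks
like. The word count-spaces is a made-up example, not taken from any
existing program; it advances with CHAR+ where much real code would
simply write 1+.

```forth
\ Hypothetical word, for illustration only.
: count-spaces ( c-addr len -- n )
  0 >R                           \ running count lives on the return stack
  BEGIN DUP WHILE                \ while len is non-zero
    OVER C@ BL = IF R> 1+ >R THEN
    SWAP CHAR+ SWAP 1-           \ portable step; "SWAP 1+ SWAP" assumes 1 CHARS = 1
  REPEAT
  2DROP R> ;
```

With this definition, S" a b c" count-spaces leaves 2 on the stack.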

Aleksej Saushev

Sep 20, 2009, 7:45:16 AM
Bernd Paysan <bernd....@gmx.de> writes:

Again internal inconsistency. If you want 1 CHARS = 1 always, then you
should get rid of it and assume that you address bytes/characters or
octets, whatever you decide. You return to the way C took.
Then you won't need any conversion of code to use wide characters &c.

So, what is the point in dragging this "CHARS" stuff?

This brings another problem of Standard Forth: lack of internal consistency.
You either have overengineered parts, impractical parts, or lack of standard
tools to solve every day practical tasks (like reading non-textual streams).

Could you and Anton decide for yourself what you really want and stick to it?
Because as for now you easily jump from 1 CHARS being able to hold a byte,
i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being
addressable unit like it is in C.

Each variant has right to exist and has its own consequences.
If you decide 1 CHARS = 1, then how I access address units? Octets?
If you decide 1 CHARS to be byte width, how do I read non-textual file?


P.S. Most of UNIX text processing programs use "char" and don't care of
locales still, but there's some kind of general consensus that they
should be converted. So what's your argument about? I don't understand it.

Again, you overengineer standard in domain nobody has much experience with,
and skip fixing defects affecting practical everyday tasks.


--
CE3OH...

Anton Ertl

Sep 20, 2009, 2:51:11 PM
Aleksej Saushev <as...@inbox.ru> writes:

>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Granted, there are only few
>> standard programs that don't have an environmental dependency on
>> 1 CHARS = 1, and all maintained systems support these programs, so
>> there would be little problem with such a change, but I see little
>> point in having such a change. Better propose standardizing
>> 1 CHARS = 1.
>
>You're not consistent in your opinion that we should use UNICODE:
>either 1 CHARS = 1, and you use one-octet encodings on octet-addressing
>platforms,

Yes, that's the way things work without xchars. With xchars, you can
use variable-width encodings like UTF-8, and UTF-8 is compatible with
8-bit chars.

>or 1 CHARS may be any other value, and you return to address
>units, which are octets in many cases.

And? The words where u refers to the number of characters still deal
with u chars, not u address units.

Anton Ertl

Sep 20, 2009, 3:15:27 PM
Aleksej Saushev <as...@inbox.ru> writes:
>If you want 1 CHARS = 1 always, then you
>should get rid of it

rid of what?

> and assume that you address bytes/characters or
>octets, whatever you decide.

On word-addressed machines 1 CHARS = 1, but a character is not a byte
or octet, but a word.

>So, what is the point in dragging this "CHARS" stuff?

It's in the current standard and nobody (not even you) has submitted
an RfD for making it obsolescent.

>This brings another problem of Standard Forth: lack of internal consistency.
>You either have overengineered parts, impractical parts, or lack of standard
>tools to solve every day practical tasks (like reading non-textual streams).
>
>Could you and Anton decide for yourself what you really want and stick to it?
>Because as for now you easily jump from 1 CHARS being able to hold a byte,
>i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being
>addressable unit like it is in C.

I can only guess what you mean here, but maybe the following can clear
things up: A char is a fixed-width memory unit, and on byte-addressed
machines it is a byte in all maintained systems. There are also
xchars (in the xchars proposal); they have a variable-width
representation in memory, i.e., each xchar is stored in one or more
chars. The "len" in this proposal always refers to the number of
chars, not to the number of xchars.
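
A small sketch of that distinction, assuming 8-bit chars and UTF-8 as
the xchar encoding. The names sample and utf8-xchars are invented for
this example and are not words from the xchars proposal; the word just
counts the chars that are not UTF-8 continuation bytes (10xxxxxx).

```forth
\ Hypothetical names; assumes 8-bit chars holding UTF-8.
CREATE sample  65 C,  195 C,  169 C,   \ "A" followed by a two-byte xchar

: utf8-xchars ( c-addr len -- n )
  0 ROT ROT                \ n c-addr len
  CHARS OVER + SWAP        \ n end start
  ?DO
    I C@ 192 AND 128 <> IF 1+ THEN     \ skip continuation bytes
  1 CHARS +LOOP ;

sample 3 utf8-xchars .     \ len is 3 chars, but this prints 2
```

The string is passed around as (c-addr len) with len = 3 throughout;
the xchar count of 2 is something a program computes separately when
it needs it.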

>Each variant has right to exist and has its own consequences.
>If you decide 1 CHARS = 1, then how I access address units?

Easy in that case: c@ and c!

> Octets?

No octets in the standard yet. If you have a Forth system on a
word-addressed machine, you have to use system-specific code to deal
with octets.

>If you decide 1 CHARS to be byte width, how do I read non-textual file?

Use BIN.

A more interesting case is word-addressed machines: How should they
deal with BIN? But I guess if the people implementing and programming
on such systems feel the need for standardization in this regard, they
will come forward and start discussing it.

>P.S. Most of UNIX text processing programs use "char" and don't care of
>locales still, but there's some kind of general consensus that they
>should be converted.

Converted to what? Consensus among whom?

> So what's your argument about? I don't understand it.

Which argument?

Aleksej Saushev

Sep 21, 2009, 4:52:28 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>If you want 1 CHARS = 1 always, then you
>>should get rid of it
>
> rid of what?

CHARS, if CHARS is always no-op.

>> and assume that you address bytes/characters or
>>octets, whatever you decide.
>
> On word-addressed machines 1 CHARS = 1, but a character is not a byte
> or octet, but a word.

Then byte is word rather than octet.

>>So, what is the point in dragging this "CHARS" stuff?
>
> It's in the current standard and nobody (not even you) has submitted
> an RfD for making it obsolescent.

That's what I tell you: you overengineer some parts nobody has enough
experience with and omit everyday practice.

>>This brings another problem of Standard Forth: lack of internal consistency.
>>You either have overengineered parts, impractical parts, or lack of standard
>>tools to solve every day practical tasks (like reading non-textual streams).
>>
>>Could you and Anton decide for yourself what you really want and stick to it?
>>Because as for now you easily jump from 1 CHARS being able to hold a byte,
>>i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being
>>addressable unit like it is in C.
>
> I can only guess what you mean here, but maybe the following can clear
> things up: A char is a fixed-width memory unit, and on byte-addressed
> machines it is a byte in all maintained systems. There are also
> xchars (in the xchars proposal); they have a variable-width
> representation in memory, i.e., each xchar is stored in one or more
> chars. The "len" in this proposal always refers to the number of
> chars, not to the number of xchars.

A char is a byte; it is a fixed-width memory unit. On an octet-addressing
machine it may take more than one address unit; on a word-addressing
machine it may take more than one address unit as well, even though
that is less probable.

>>Each variant has right to exist and has its own consequences.
>>If you decide 1 CHARS = 1, then how I access address units?
>
> Easy in that case: c@ and c!
>
>> Octets?
>
> No octets in the standard yet. If you have a Forth system on a
> word-addressed machine, you have to use system-specific code to deal
> with octets.

I have Forth system on octet-addressing machine and want to understand
how you're going to deal with "wide" characters you're so fond of.

>>If you decide 1 CHARS to be byte width, how do I read non-textual file?
>
> Use BIN.

11.6.1.2080 READ-FILE ... ( c-addr u1 fileid -- u2 ior )
"Read u1 consecutive characters to c-addr from the current
position of the file identified by fileid."

Where're address units here? How do I read 5 octets on octet-addressing
machine, if I follow your advice to go for wide characters and define
1 CHARS = 2?

Do you want to tell me that I don't ever need it?
Is this scenario impossible? Highly improbable? Anything else?

> A more interesting case is word-addressed machines: How should they
> deal with BIN? But I guess if the people implementing and programming
> on such systems feel the need for standardization in this regard, they
> will come forward and start discussing it.

You don't have a clear understanding of the octet-addressed case yet.

>>P.S. Most of UNIX text processing programs use "char" and don't care of
>>locales still, but there's some kind of general consensus that they
>>should be converted.
>
> Converted to what? Consensus among whom?

To wide characters. Among system developers.

>> So what's your argument about? I don't understand it.
>
> Which argument?

This one about 1 CHARS = 1.


--
CE3OH...

Aleksej Saushev

Sep 21, 2009, 5:00:00 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> Granted, there are only few
>>> standard programs that don't have an environmental dependency on
>>> 1 CHARS = 1, and all maintained systems support these programs, so
>>> there would be little problem with such a change, but I see little
>>> point in having such a change. Better propose standardizing
>>> 1 CHARS = 1.
>>
>>You're not consistent in your opinion that we should use UNICODE:
>>either 1 CHARS = 1, and you use one-octet encodings on octet-addressing
>>platforms,
>
> Yes, that's the way things work without xchars. With xchars, you can
> use variable-width encodings like UTF-8, and UTF-8 is compatible with
> 8-bit chars.

Compatible in what sense? Do they have fixed width like chars? No.
How are they compatible then?

>>or 1 CHARS may be any other value, and you return to address
>>units, which are octets in many cases.
>
> And? The words where u refers to the number of characters still deal
> with u chars, not u address units.

How do I read one address unit from a file?

Again, you lack a clear vision and consistency in what you're doing.


--
CE3OH...

Bernd Paysan

Sep 21, 2009, 5:52:05 AM
Aleksej Saushev wrote:

> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Yes, that's the way things work without xchars. With xchars, you can
>> use variable-width encodings like UTF-8, and UTF-8 is compatible with
>> 8-bit chars.
>
> Compatible in what sense? Do they have fixed width like chars? No.
> How are they compatible then?

UTF-8 is encoded in bytes, and ASCII-compatible. That's how it is
compatible with 8-bit chars: As long as you stay with ASCII, you don't see
any difference. If you go beyond ASCII, you can use C@ as a building block to
implement the xchar wordset. Unlike UTF-16, where even simple ASCII
characters change encoding, UTF-8 works quite seamlessly. You can even use
gforth-0.6.2 (without the xchar wordset) in a UTF-8 environment, and as long
as you don't do overly heavy extended-character edits on the command line, it
will just work. That's what I call "compatible".
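
To illustrate the "C@ as a building block" point, here is a minimal sketch,
assuming 8-bit chars and 1 CHARS = 1 address unit. The word name UTF8-SIZE
is hypothetical and is not taken from the xchars proposal. It only inspects
the lead byte and does not validate the continuation bytes.

```forth
\ Number of chars occupied by the UTF-8 sequence starting at c-addr.
\ Decided from the lead byte only; a stray continuation byte
\ (128..191) is not detected and is reported as 2.
: utf8-size ( c-addr -- u )
  c@
  dup 128 u< if drop 1 exit then   \ 0xxxxxxx: plain ASCII
  dup 224 u< if drop 2 exit then   \ 110xxxxx: two-byte sequence
  dup 240 u< if drop 3 exit then   \ 1110xxxx: three-byte sequence
  drop 4 ;                         \ 11110xxx: four-byte sequence
```

For pure ASCII text this always yields 1, which is why existing code that
steps with CHAR+ keeps working.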

>>>or 1 CHARS may be any other value, and you return to address
>>>units, which are octets in many cases.
>>
>> And? The words where u refers to the number of characters still deal
>> with u chars, not u address units.
>
> How do I read one address unit from a file?
>
> Again, you lack clear vision and consistency in what you're doing.

The confusion comes from the Forth94 standard. The reason for CHARS was to
allow implementation on nibble-addressed machines. Later, some experimental
Forths like jaxforth used it to implement a UCS-2 charset (the predecessor of
UTF-16), where the system is still byte-oriented, but 1 CHARS = 2 and
c@/c! access 16-bit words. This didn't catch on.

What we could do is mark CHARS obsolescent, since CHARS = NOP in all known
and maintained Forth systems. Word-addressed machines will still have an
address unit that can store more than one byte, but that's not the issue
here. And then, address units would be equal to characters, but only at the
next review, when the obsolescent CHARS would become obsolete (how does this
match with reality? CHARS never caught on, so it was obsolete from the very
beginning).

Needs to go through RfD/CfV, since this is a significant change to the
standard. I agree that this would make the standard more consistent, and
more in line with common practice.

Stephen Pelc

Sep 21, 2009, 7:08:01 AM
On Mon, 21 Sep 2009 12:52:28 +0400, Aleksej Saushev <as...@inbox.ru>
wrote:

>I have Forth system on octet-addressing machine and want to understand
>how you're going to deal with "wide" characters you're so fond of.

This is the guts of the pchars idea. Virtually all comms/transfer
protocols use 8 bit units. Yes, I know about vending machines.
The traditional ASCII or code-page Forth uses 8-bit characters.

People who want to internationalise their applications have used
a variety of character encodings, including 8-bit, 16-bit, 32-bit and
variable-width characters. All of these use 8-bit units, which we propose
to call a pchar.

To avoid the confusion in the file words and elsewhere, we ended
up defining the use of "character" in Forth200x to mean a pchar.
This simply avoided a vast amount of work for the editor. If the
standard just says "character" with no qualification, it means a
pchar.

Once we define string lengths in terms of pchars, internationalisation
becomes easier. The display/comms words don't need to know what they
are sending, they just send multiple pchars. Where multiple pchars
form a single character, we refer to that character as an xchar.

The simple rule is
memory size in pchars,
number of characters in xchars.

KEY and EMIT and friends work in terms of pchars, XKEY and XEMIT
and friends in terms of xchars. That KEY and EMIT work in terms
of pchars is essential in embedded work and to allow Telnet or
HTTP to handle set up before transferring UTF-8 text.
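
A minimal sketch of why the display words need not know the encoding,
assuming EMIT passes each pchar through unchanged. TYPE-PCHARS is a
hypothetical name for illustration, not a proposed word. The length is
the memory size in pchars, so the loop works the same for ASCII and for
UTF-8 text.

```forth
\ Send a string one pchar at a time. len is the size in pchars;
\ multi-pchar xchars are never decoded here.
: type-pchars ( c-addr len -- )
  chars over + swap        \ convert len to address units, form loop bounds
  ?do  i c@ emit  1 chars +loop ;
```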

Stephen


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Aleksej Saushev

Sep 21, 2009, 7:24:23 AM
Bernd Paysan <bernd....@gmx.de> writes:

> Aleksej Saushev wrote:
>
>> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> Yes, that's the way things work without xchars. With xchars, you can
>>> use variable-width encodings like UTF-8, and UTF-8 is compatible with
>>> 8-bit chars.
>>
>> Compatible in what sense? Do they have fixed width like chars? No.
>> How are they compatible then?
>
> UTF-8 is encoded in bytes, and ASCII-compatible. That's how it is
> compatible with 8-bit chars: As long as you stay with ASCII, you don't see
> any difference. If you go beyond ASCII, you can use C@ as building block to
> implement the xchar wordset. Unlike UTF-16, where even simple ASCII
> characters change encoding, UTF-8 works quite seamlessly. You can even use
> gforth-0.6.2 (without xchar wordset) in a UTF-8 environment, and as long as
> you don't do overly heavy extended-character edits on the command line, it
> will just work. That's what I call "compatible".

Here we have another thing then: you've silently introduced octet bytes.
You say that UTF-8 works on top of octet bytes, octet characters.
What if implementation bytes (characters) take 2 octets each?

I don't see any benefit in dealing with UTF-8 internally, other than
converting data for input/output. Data of non-uniform size are harder to
process.

>>>>or 1 CHARS may be any other value, and you return to address
>>>>units, which are octets in many cases.
>>>
>>> And? The words where u refers to the number of characters still deal
>>> with u chars, not u address units.
>>
>> How do I read one address unit from a file?
>>
>> Again, you lack clear vision and consistency in what you're doing.
>
> The confusion comes from the Forth94 standard. The reason for CHARS was to
> allow implementation on nibble-addressed machines. Later, some experimental
> Forths like jaxforth used it to implement a UCS2 charset (predecessor of
> UTF-16), where the system is still a byte-oriented system, but 1 CHARS = 2
> and c@/c! access words. This didn't catch on.
>
> What we could do is mark CHARS obsolescent, since CHARS = NOP in all known
> and maintained Forth systems. Word-addressed machines will still have an
> address unit that can store more than one byte, but that's not the issue
> here. And then, address units would be equal to characters, but only at the
> next review, when the obsolescent CHARS would become obsolete (how does this
> match with reality? CHARS never caught on, so it was obsolete from the very
> beginning).
>
> Needs to go through RfD/CfV, since this is a significant change to the
> standard. I agree that this would make the standard more consistent, and
> more in alignment to common practice.

I think that this is the wrong approach.

It is better to cure things this way:
1) make it clear that there are address units (already in the standard)
and uniform bytes, aka characters (already in the standard);
2) use CHARS to convert from a number of characters to address units;
3) make READ-FILE/WRITE-FILE use address units instead of characters
in BIN mode (not in the standard; still has to be introduced in some way);
4) optionally, provide address unit access words.
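
Points 1 and 2 can be sketched with Forth-94 words alone. The names /LINE,
LINE-BUF and READ-CHUNK below are hypothetical. The buffer is sized in
characters and converted to address units with CHARS, while READ-FILE
still takes its count in characters. That count is what point 3 would
redefine for BIN mode.

```forth
80 constant /line                    \ buffer capacity in characters
create line-buf  /line chars allot   \ CHARS turns that count into address units

\ Read up to /LINE characters from an open file into LINE-BUF.
\ u2 is the number of characters actually read; ior is 0 on success.
: read-chunk ( fileid -- u2 ior )
  line-buf /line rot read-file ;
```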


--
CE3OH...

Anton Ertl

Sep 21, 2009, 7:33:36 AM
Aleksej Saushev <as...@inbox.ru> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>
>> Aleksej Saushev <as...@inbox.ru> writes:
[...]

>> On word-addressed machines 1 CHARS = 1, but a character is not a byte
>> or octet, but a word.
>
>Then byte is word rather than octet.

Word-addressed machines usually don't have bytes; and even if they
have bytes (e.g., in some sub-word-extracting instructions), they are
smaller than words. So your statement seems to be a Humpty-Dumpty-like
case of trying to redefine words to have uncommon meanings. And it's
completely pointless (unless you just want to confuse), because the
standard does not talk about bytes, either.

>> I can only guess what you mean here, but maybe the following can clear
>> things up: A char is a fixed-width memory unit, and on byte-addressed
>> machines it is a byte in all maintained systems. There are also
>> xchars (in the xchars proposal); they have a variable-width
>> representation in memory, i.e., each xchar is stored in one or more
>> chars. The "len" in this proposal always refers to the number of
>> chars, not to the number of xchars.
>
>A char is a byte, it is fixed-width memory unit, on octet-addressing
>machine, it may take more than one address unit, on word-addressing
>machine it may take more than one address unit as well, even though
>it is less probable.

That's what's theoretically possible according to Forth-94. In
practice, 1 CHARS = 1 address unit.

>> No octets in the standard yet. If you have a Forth system on a
>> word-addressed machine, you have to use system-specific code to deal
>> with octets.
>
>I have Forth system on octet-addressing machine and want to understand
>how you're going to deal with "wide" characters you're so fond of.

Maybe you are confusing me with someone else (but whom? I don't
remember anyone asking for wide characters recently). I am not fond
of wide characters. Or maybe you are trying to pull another
Humpty-Dumpty here and use "wide characters" with an unusual meaning.

>11.6.1.2080 READ-FILE ... ( c-addr u1 fileid -- u2 ior )
>"Read u1 consecutive characters to c-addr from the current
>position of the file identified by fileid."
>
>Where're address units here? How do I read 5 octets on octet-addressing
>machine, if I follow your advice to go for wide characters and define
>1 CHARS = 2?

You are definitely confusing me with someone else. I have not given
any such advice, certainly not in the last five years, and IIRC nobody
else has given such advice recently, either.

>>>P.S. Most of UNIX text processing programs use "char" and don't care of
>>>locales still, but there's some kind of general consensus that they
>>>should be converted.
>>
>> Converted to what? Consensus among whom?
>
>To wide characters. Among system developers.

No, wide characters are pretty dead in the Unix world, especially
among system developers: I cannot name a system call that takes or
produces wide characters; can you?

>>> So what's your argument about? I don't understand it.
>>
>> Which argument?
>
>This one about 1 CHARS = 1.

That's common practice and hopefully someone will work out a proposal
to standardize that.

Bernd Paysan

Sep 21, 2009, 8:06:41 AM
Aleksej Saushev wrote:
> Here we have another thing then: you've silently introduced octet bytes.
> You say that UTF-8 works on the top of octet bytes, octet characters,
> what if implementation bytes (characters) take 2 octets each?

Apart from Jaxforth, nobody did that. Therefore, the question is quite
hypothetical. Yes, the Forth94 standard allows this sort of implementation.

> I don't see any benefit in dealing with UTF-8 inside otherwise than
> converting data for input/output. Non-uniform size data are harder to
> process.

UTF-16 has a non-uniform character size, too (characters may be either
one or two 16-bit words). Only UTF-32 is uniform. Since most strings pass
through a Forth program as a whole, you don't gain or lose much by not being
able to randomly address each character. Stepping through sequentially is
easy (just use XCHAR+ instead of CHAR+). The main point however is that text
processing uses strings as a whole much more often than the individual
characters inside strings. And there, it doesn't matter if it's ASCII or
UTF-8.
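
A minimal sketch of such sequential stepping, assuming the XCHAR+ of the
xchars proposal with stack effect ( xc-addr1 -- xc-addr2 ) and a
well-formed string. XCHARS-IN is a hypothetical name and not a word from
the proposal.

```forth
\ Count the xchars in a string whose length is given in chars.
: xchars-in ( c-addr len -- n )
  chars over + >r          \ R: end address; stack: c-addr
  0 swap                   \ stack: n addr
  begin dup r@ u< while
    xchar+                 \ skip one xchar, however many chars it occupies
    swap 1+ swap           \ count it
  repeat
  drop r> drop ;
```

Replacing XCHAR+ with CHAR+ gives the same loop for fixed-width chars, in
which case the result simply equals len.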

The Unix world and the Internet have pretty much consolidated on UTF-8, and
through files shared on the Internet, even the Windows world needs quite
good UTF-8 support (though internally, at the system call level, it is still
UTF-16, as are Java and C#). Having a different internal and external text
representation is IMHO a bad idea. UTF-16 gives you the worst of both
sides: variable-length characters *and* incompatibility.

> I think that this is wrong approach.
>
> It is better to cure things this way:
> 1) make it clear that there are address units (already in standard)
> and uniform bytes aka characters (already in standard);
> 2) CHARS is used to convert from number of characters to address units.

That's already in the standard. That's what nobody uses. CHARS is trying
to solve a problem nobody has.

> 3) make READ-FILE/WRITE-FILE to use address units instead of characters
> in BIN mode (not in the standard and still has to enter in some way);

That won't work on nibble-addressed machines, which are just about as
hypothetical as UTF-16 Forths with 1 CHARS = 2.

> 4) optionally, provide address unit access words.

We have a wordset for a similar purpose in preparation, but it probably
still deals with something else than you want.

Anton Ertl

Sep 21, 2009, 8:06:12 AM
steph...@mpeforth.com (Stephen Pelc) writes:
>The simple rule is
> memory size in pchars,
> number of characters in xchars.

Just a clarification, so this is not misunderstood: "number of
characters" always refers to chars (aka pchars), never to xchars.
Moreover, there is not a single occurrence of a "number of xchars" or
anything like it in the xchars proposal.