RfD: c-addr/len

Peter Knaggs

Sep 11, 2009, 3:08:40 PM

to fort...@yahoogroups.com

c-add/len
=========

2009-09-09 Rendered into RfD form, added Forth200x words
1999-06-22 Original Text by John Rible

Problem
=======
A large number of words use "c-add u" to indicate the address of a
string (c-addr) and its length (u) on the stack. With the
introduction of variable width characters, it is not clear if "u" is
referring to the number of characters or address units.
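
For readers less familiar with the convention, here is a minimal sketch of the pair in question. The word names are made up for illustration; only S", TYPE and NIP are standard words.

```forth
\ Sketch only: greeting, .greeting and greeting-length are illustrative names.
: greeting ( -- c-addr u )  s" hello" ;      \ S" leaves the address and the length
: .greeting ( -- )  greeting type ;          \ TYPE consumes the same pair
: greeting-length ( -- u )  greeting nip ;   \ keep only the length
```

On a system with 8-bit chars and plain ASCII text the length is 5 whichever unit is meant; the question only matters once chars, address units and encoded characters can differ.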

Solution
========
Introduce a new pseudo-type ("len") into the document of these words
to clarify the intent. Replacing the "u" with a "len" should improve
the documentation of these words. The words effected are:

6.1.0040 #>
6.1.0570 >NUMBER
6.1.0980 COUNT
6.1.1345 ENVIRONMENT?
6.1.1360 EVALUATE
6.1.1540 FILL
6.1.2165 S"
6.1.2216 SOURCE
6.1.2310 TYPE
6.2.2008 PARSE
6.2.xxxx PARSE-NAME
11.6.1.1010 CREATE-FILE
11.6.1.1190 DELETE-FILE
11.6.1.1718 INCLUDED
11.6.1.1970 OPEN-FILE
11.6.1.2080 READ-FILE
11.6.1.2090 READ-LINE
11.6.1.2165 S"
11.6.1.2480 WRITE-FILE
11.6.1.2485 WRITE-LINE
11.6.2.1524 FILE-STATUS
11.6.2.2130 RENAME-FILE
11.6.2.xxxx REQUIRED
12.6.1.0558 >FLOAT
12.6.1.2143 REPRESENT
13.6.1.0086 (LOCAL)
16.6.1.2192 SEARCH-WORDLIST
17.6.1.0170 -TRAILING
17.6.1.0245 /STRING
17.6.1.0780 BLANK
17.6.1.0910 CMOVE
17.6.1.0920 CMOVE>
17.6.1.0935 COMPARE
17.6.1.2191 SEARCH
17.6.1.2212 SLITERAL

Proposal
========

1. Add the following to table 3.1 - Data Types

len         character-string length         1 cell

2. Add the following to 3.1.1 Data-type relationships

len => u => x

3. Replace "u" with "len" in 3.1.4.2 Character strings:

A string is specified by a cell pair (c-addr len) representing
its starting address and length in characters.

4. Add the following to table 3.5 - Environmental Query Strings:

/CHARACTER-STRING   n   yes   maximum size of len in characters

5. Change "u" to "len" in the stack description, definition and
rationale of the words listed under the Solution.

6. Replace "u" with "len" in section A.3.1.3.4 Counted Strings.

7. Change "u" to "len" in the rationale for A.6.2.0855 C".
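
A rough sketch of how a program might use the query from item 4. /CHARACTER-STRING is only the string proposed here, so no existing system is assumed to recognise it; /COUNTED-STRING is the comparable Forth-94 query. The word names .limit and .string-limits are illustrative.

```forth
\ /CHARACTER-STRING is the proposed query string, not an existing one.
: .limit ( c-addr u -- )        \ print the value of a single-cell query
  2dup type ." : "
  environment? if . else ." unknown" then cr ;

: .string-limits ( -- )
  s" /COUNTED-STRING" .limit
  s" /CHARACTER-STRING" .limit ;
```

On a system that does not know the proposed string, the second line simply reports "unknown".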

Author
======
Peter Knaggs, P.J.K...@exeter.ac.uk

Josh Grams

Sep 12, 2009, 8:29:07 AM

Peter Knaggs wrote: <4AAAA038...@bcs.org.uk>
>
>
> c-add/len

c-addr

>=========
>
> 2009-09-09 Rendered into RfD form, added Forth200x words
> 1999-06-22 Original Text by John Rible
>
>
> Problem
>=======
> A large number of words use "c-add u" to indicate the address of a

"c-addr u"

> string (c-addr) and its length (u) on the stack. With the
> introduction of variable width characters, it is not clear if "u" is
> referring to the number of characters or address units.

Er...unless I missed a decision to do away with the distinction between
"1 CHARS" and "address units", isn't the ambiguity between "variable
width characters" and "characters"? I don't see that this proposal
actually clarifies that.

At any rate, I think the definition at 3.1.4.2 Character strings makes
it clear that "c-addr u" as a unit means something special, so I don't
see any reason to replace the "u" with "len".

> Solution
>========
> Introduce a new pseudo-type ("len") into the document of these words
> to clarify the intent. Replacing the "u" with a "len" should improve
> the documentation of these words. The words effected are:

affected

> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
>
> A string is specified by a cell pair (c-addr len) representing
> its starting address and length in characters.

In 2.1 Definitions of Terms, we have:

character:
Depending on context, either 1) a storage unit capable of holding a
character; or 2) a member of a character set.

I think that the presence of an address (i.e. the location of some
storage) makes it pretty clear that sense 1 is meant here, but if people
are confused by that, you might want to clarify.

----

Instead of adopting this (and that "pchar" rename proposal), I think it
would make much more sense to clarify things by leaving the existing
"char" and "character" alone, and instead adopting new terminology for
variable width characters.

As I see it, there's no reason to go changing terminology on people when
you could instead just adopt new terminology for the new concept. Much
less potential for confusion that way.

--Josh

Bernd Paysan

Sep 12, 2009, 3:10:23 PM

Josh Grams wrote:
> Instead of adopting this (and that "pchar" rename proposal), I think
> it would make much more sense to clarify things by leaving the
> existing "char" and "character" alone, and instead adopting new
> terminology for variable width characters.

We are pretty much there - the extended characters are called "extended
characters" or short "xchars". An xchar in memory may consist of
several characters (primitive characters, that is). I think it's easier
to deal with the name "pchar" when the "storage unit" is meant than name
it "character", but outside the xchar proposal, the terminology is not
needed.

The c-addr/len makes life easier as it definitely states that the length
is meant to be in characters (pchars), i.e. the storage unit as meant in
2.1. character 1).
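
To make the distinction concrete, here is a small sketch that counts xchars in a string whose len is given in chars. It assumes 8-bit chars and UTF-8 encoded text; utf8-lead?, xchar-count and euro are illustrative names, not part of any proposal.

```forth
\ Assumptions: 8-bit chars, UTF-8 text. All names here are illustrative.
: utf8-lead? ( char -- flag )  192 and 128 <> ;   \ false for 10xxxxxx continuation bytes

: xchar-count ( c-addr len -- n )   \ len is in chars, n is in xchars
  0 rot rot                         \ keep a running count under the string
  chars over + swap ?do
    i c@ utf8-lead? if 1+ then
  1 chars +loop ;

create euro  226 c, 130 c, 172 c,   \ U+20AC encoded as the bytes E2 82 AC
```

With these definitions, euro 3 xchar-count . prints 1: the string is three chars long but holds a single xchar, and the "len" of the proposal is the 3.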

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

m_l_g3

Sep 13, 2009, 2:59:51 PM

Peter Knaggs wrote:

>
> Problem
> =======
> A large number of words use "c-add u" to indicate the address of a
> string (c-addr) and its length (u) on the stack. With the
> introduction of variable width characters, it is not clear if "u" is
> referring to the number of characters or address units.
>
>
> Solution
> ========
> Introduce a new pseudo-type ("len") into the document of these words
> to clarify the intent.

Sorry, but I do not see from your proposal what sort of length "len"
denotes: is it "length in characters", "length in logical (multi-byte)
characters", or "length in address units"?

> Replacing the "u" with a "len" should improve
> the documentation of these words.

In fact, in many cases words are commented as taking and/or
leaving ( addr len ) rather than ( c-addr u ), so there is existing
practice.

IMO replacing "u" with "len" does improve readability, but does
not resolve the "which length" puzzle.

au-length, log-length, c-length ?

CHARS ( c-length -- au-length )
and so on...

Bernd Paysan

Sep 13, 2009, 4:43:03 PM

m_l_g3 wrote:
> IMO replacing "u" with "len" does improve readability, but does
> not resolve the "which length" puzzle.
>
> au-length, log-length, c-length ?
>
> CHARS ( c-length -- au-length )
> and so on...

CHARS, as it is now, is:

6.1.0898 CHARS
( n1 -- n2 )
n2 is the size in address units of n1 characters.

IMHO, the stack effect is at least misleading. I find it difficult to
get a correct stack effect - we want -1 CHARS to be used to step through
strings backwards, so we want the sign. I.e. "len" is not the right
left side of this stack effect (len is a subtype of u, no sign). But we
basically use CHARS to convert +-len into a +-c-addr offset. Works fine
on two's complement, might cause problems on one's complement ;-).
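
A short sketch of the two uses of CHARS mentioned here: scaling a length into an address offset, and stepping backwards through a string. The word names are illustrative; only CHARS, C@, + and - are standard.

```forth
\ Illustrative names. last-char expects len > 0.
: last-char ( c-addr len -- char )  1- chars + c@ ;   \ len in chars, offset in address units
: step-back ( c-addr1 -- c-addr2 )  -1 chars + ;      \ relies on CHARS accepting a negative count
: step-back-u ( c-addr1 -- c-addr2 )  1 chars - ;     \ same step without a negative argument
```

The second and third words give the same address wherever -1 CHARS behaves as expected; the third avoids the question of the sign altogether.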

Peter Knaggs

Sep 13, 2009, 7:01:11 PM

m_l_g3 wrote:
>
> Sorry, but I do not see from your proposal what sort of length "len"
> denotes: is it "length in characters", "length in logical (multi-byte)
> characters", or "length in address units"?

length in primitive characters (bytes).

> In fact, in many cases words are commented as taking and/or
> leaving ( addr len ) rather than ( c-addr u ), so there is existing
> practice.

Not in the standards document, hence the change.

--
Peter Knaggs

Aleksej Saushev

Sep 14, 2009, 2:27:01 AM

Peter Knaggs <p...@bcs.org.uk> writes:

> m_l_g3 wrote:
>>
>> Sorry, but I do not see from your proposal what sort of length "len"
>> denotes: is it "length in characters", "length in logical (multi-byte)
>> characters", or "length in address units"?
>
> length in primitive characters (bytes).

No, length in address units. Byte length is what is returned by "1 chars",
consider 4-bit address unit.

--
CE3OH...

Peter Knaggs

Sep 16, 2009, 8:02:23 AM

m_l_g3 wrote:
>
> Sorry, but I do not see from your proposal what sort of length "len"
> denotes: is it "length in characters", "length in logical (multi-byte)
> characters", or "length in address units"?

Would it help if we replace item 1, the definition of "len" with:

len         length of a character-string in address units         1 cell

--
Peter Knaggs

David N. Williams

Sep 16, 2009, 8:41:13 AM

Shouldn't that be in characters? (3.1.4.2)

-- David

Peter Knaggs

Sep 16, 2009, 8:49:33 AM

Which type of character? Primitive characters (3.1.3) possibly but you
could also interpret characters to be extended characters (XChar) which
include variable width characters, which is precisely what we are trying
to get away from.

David N. Williams

Sep 16, 2009, 9:31:35 AM

I guess whatever character you meant in this:

3. Replace "u" with "len" in 3.1.4.2 Character strings:

A string is specified by a cell pair (c-addr len) representing
its starting address and length in characters.

It would be a substantial change if it were to be address units,
since 1 CHARS is not necessarily one address unit.

I'm unclear what you intend. Is the meaning of "character
string" in the above being changed to allow for extended
characters?

-- David

Peter Knaggs

Sep 16, 2009, 9:54:04 AM

David N. Williams wrote:
>
> I guess whatever character you meant in this:
>
> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
>
> A string is specified by a cell pair (c-addr len) representing
> its starting address and length in characters.

The part of the X:key-ekey proposal which was accepted at the Exeter
meeting included the following:

3.1.2 Character types
Characters shall have the following properties:
– at least one address unit wide;
– contain at least eight bits;
– be of fixed width;
– have a size less than or equal to cell size;
– be unsigned.

3.1.2.3 Primitive Character
A primitive character (pchar) is a character with no restrictions
on its contents. Unless otherwise stated, a “character” refers to
a primitive character.

Thus item 3 should be changed to refer to the "length in primitive
characters". In this case I feel it probably is worth spelling out.

> It would be a substantial change if it were to be address units,
> since 1 CHARS is not necessarily one address unit.

This is part of the problem, what does u mean in CMOVE? According to
its definition "copy u consecutive characters", while most people
believe it refers to address units.

> I'm unclear what you intend. Is the meaning of "character
> string" in the above being changed to allow for extended
> characters?

No, but once extended characters are introduced there is the potential
for confusion, hence the introduction of a primitive character. Extended
characters will always be referenced as "extended character" or xchar,
while a "character" is a primitive character or pchar.

David N. Williams

Sep 16, 2009, 10:06:50 AM

Peter Knaggs wrote:
> David N. Williams wrote:
>>
>> I guess whatever character you meant in this:
>>
>> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
>>
>> A string is specified by a cell pair (c-addr len) representing
>> its starting address and length in characters.
>
> The part of the X:key-ekey proposal which was accepted at the Exeter
> meeting included the following:
>
> 3.1.2 Character types
> Characters shall have the following properties:
> – at least one address unit wide;
> – contain at least eight bits;
> – be of fixed width;
> – have a size less than or equal to cell size;
> – be unsigned.
>
> 3.1.2.3 Primitive Character
> A primitive character (pchar) is a character with no restrictions
> on its contents. Unless otherwise stated, a “character” refers to
> a primitive character.
>
> Thus item 3 should be changed to refer to the "length in primitive
> characters". In this case I feel it probably is worth spelling out.

Me, too!

>> It would be a substantial change if it were to be address units,
>> since 1 CHARS is not necessarily one address unit.
>
> This is part of the problem, what does u mean in CMOVE? According to
> its definition "copy u consecutive characters", while most people
> believe it refers to address units.

Not me! :-) MOVE is for that.
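
For what it's worth, the difference can be sketched like this. The buffers and their size are arbitrary; the stack comments follow the current standard text.

```forth
\ Illustrative buffers; the size 10 is arbitrary.
create src 10 chars allot
create dst 10 chars allot

src 10 char A fill        \ FILL  ( c-addr u char -- )      u counts characters
src dst 10 cmove          \ CMOVE ( c-addr1 c-addr2 u -- )  u counts characters
src dst 10 chars move     \ MOVE  ( addr1 addr2 u -- )      u counts address units
```

On a system where 1 CHARS = 1 the last two lines copy the same memory, which is probably why the two readings get mixed up.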

>> I'm unclear what you intend. Is the meaning of "character
>> string" in the above being changed to allow for extended
>> characters?
>
> No, but once extended characters are introduced there is the potential
> for confusion, hence the introduction of a primitive character. Extended
> characters will always be referenced as "extended character" or xchar,
> while a "character" is a primitive character or pchar.

Good! I can probably wrap my head around that.

-- David

Anton Ertl

Sep 14, 2009, 10:51:25 AM

Aleksej Saushev <as...@inbox.ru> writes:
>Peter Knaggs <p...@bcs.org.uk> writes:
>
>> m_l_g3 wrote:
>>>
>>> Sorry, but I do not see from your proposal what sort of length "len"
>>> denotes: is it "length in characters", "length in logical (multi-byte)
>>> characters", or "length in address units"?
>>
>> length in primitive characters (bytes).
>
>No, length in address units.

This proposal replaces "u" with "len" in words where "u" denotes the
number of characters.

A change to let this parameter specify a number of address units would
break existing standard programs. Granted, there are only few
standard programs that don't have an environmental dependency on
1 CHARS = 1, and all maintained systems support these programs, so
there would be little problem with such a change, but I see little
point in having such a change. Better propose standardizing
1 CHARS = 1.

>Byte length is what is returned by "1 chars",
>consider 4-bit address unit.

Yes, nibble-addressed hardware was the original rationale for
differentiating between aus and chars, but in 15 years there have been
no Forth-94 systems for nibble-addressed hardware, so I consider CHARS
a good solution for a problem that does not exist in practice.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2009: http://www.euroforth.org/ef09/

Anton Ertl

Sep 12, 2009, 8:59:41 AM

Josh Grams <jo...@qualdan.com> writes:
>Instead of adopting this (and that "pchar" rename proposal), I think it
>would make much more sense to clarify things by leaving the existing
>"char" and "character" alone, and instead adopting new terminology for
>variable width characters.

Yes. And the variable-width characters have a new name: xchars.

>As I see it, there's no reason to go changing terminology on people when
>you could instead just adopt new terminology for the new concept. Much
>less potential for confusion that way.

Apparently some people are confused because a member of the (extended)
character set need not fit into a char, and they think that renaming
chars into pchars will help avoid that confusion. I am not convinced
of that, but I can live with pchars (although I fear that we will make
mistakes in the renaming, which will increase the confusion rather
than reducing it).

Anton Ertl

Sep 12, 2009, 8:33:55 AM

Peter Knaggs <p...@bcs.org.uk> writes:
>c-add/len
>=========
>
>2009-09-09 Rendered into RfD form, added Forth200x words
>1999-06-22 Original Text by John Rible
>
>
>Problem
>=======
>A large number of words use "c-add u" to indicate the address of a
>string (c-addr) and its length (u) on the stack. With the
>introduction of variable width characters, it is not clear if "u" is
>referring to the number of characters or address units.

Variable-width characters are introduced in the xchars proposal, they
are called xchars there (and can consist of one or more fixed-width
chars in memory). Variable-width characters don't exist in the
current standard document, and chars don't become variable-width in
xchars. It's clear in all words that deal with chars that u refers to
the number of chars.

It definitely does not refer to address units in these words (only in
MOVE and ERASE, which don't deal with chars), although given that 1
chars = 1 au in all maintained systems, that distinction is of no
consequence. Every word that refers to chars says so explicitly, and
every word that refers to aus says so explicitly, and if any word in
the xchars proposal refers to a number of xchars, it will say so
explicitly, too (but I don't think there is such a word).

Examples:

From 17.6.1.0910 CMOVE:
|[...] copy u consecutive characters [...]

From 6.1.1900 MOVE:
|[...]

>4. Add the following to table 3.5 - Environmental Query Strings:
>
> /CHARACTER-STRING n yes maximum size of len in characters

What's the point of that?

Any system that cannot deal with strings of the length of the longest
data memory region that can be had from the system is broken. And
that's not just IMO, but also in Forth-94.

So if the point of that query is to allow systems to not process some
of the strings that can be created, then existing standard programs
would become non-standard. Such a restriction requires a two-step
process of first declaring the feature obsolescent, and eventually
removing it. Moreover, I see no point in introducing such a
restriction.

If that's not the point of the query, then I see no point in it. If
we can process all strings we can create, there is no point in
querying for the maximum size.

Otherwise the proposal looks fine.

Ed

Sep 18, 2009, 7:43:39 AM

Peter Knaggs wrote:
> c-add/len
> =========
>
> 2009-09-09 Rendered into RfD form, added Forth200x words
> 1999-06-22 Original Text by John Rible
>
>
> Problem
> =======
> A large number of words use "c-add u" to indicate the address of a
> string (c-addr) and its length (u) on the stack. With the
> introduction of variable width characters, it is not clear if "u" is
> referring to the number of characters or address units.
>
>
> Solution
> ========
> Introduce a new pseudo-type ("len") into the document of these words
> to clarify the intent. Replacing the "u" with a "len" should improve
> the documentation of these words. The words effected are:
>

> ...
> 12.6.1.2143 REPRESENT

I must have missed it. When did "u most significant digits of
the significand" [of a number] become the length of a string?

Josh Grams

Sep 18, 2009, 7:48:31 AM

Anton Ertl wrote:
> Josh Grams <jo...@qualdan.com> writes:
>>Instead of adopting this (and that "pchar" rename proposal), I think it
>>would make much more sense to clarify things by leaving the existing
>>"char" and "character" alone, and instead adopting new terminology for
>>variable width characters.
>
> Yes. And the variable-width characters have a new name: xchars.
>
>>As I see it, there's no reason to go changing terminology on people when
>>you could instead just adopt new terminology for the new concept. Much
>>less potential for confusion that way.
>
> Apparently some people are confused because a member of the (extended)
> character set need not fit into a char, and they think that renaming
> chars into pchars will help avoid that confusion. I am not convinced
> of that, but I can live with pchars (although I fear that we will make
> mistakes in the renaming, which will increase the confusion rather
> than reducing it).

That's pretty much how I feel about it...

--Josh

Anton Ertl

Sep 18, 2009, 9:46:26 AM

"Ed" <nos...@invalid.com> writes:
>> Solution
>> ========
>> Introduce a new pseudo-type ("len") into the document of these words
>> to clarify the intent. Replacing the "u" with a "len" should improve
>> the documentation of these words. The words effected are:
>>
>> ...
>> 12.6.1.2143 REPRESENT
>
>I must have missed it. When did "u most significant digits of
>the significand" [of a number] become the length of a string?

u has always been the length of the buffer in characters in REPRESENT.
That's the only interpretation of the specification that makes any
sense.
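
A small sketch of that reading, assuming a system with the Floating-Point word set. The names #digits, fbuf and .digits are illustrative. The u passed to REPRESENT is the size of the buffer in characters, and the same pair is then handed to TYPE.

```forth
\ Requires the Floating-Point word set. All names are illustrative.
8 constant #digits
create fbuf  #digits chars allot

: .digits ( F: r -- )
  fbuf #digits represent     \ ( -- n flag1 flag2 ) fills #digits chars of fbuf
  2drop drop                 \ discard both flags, then the exponent n
  fbuf #digits type ;
```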

Albert van der Horst

Sep 18, 2009, 2:03:03 PM

In article <4AAAA038...@bcs.org.uk>, Peter Knaggs <p...@bcs.org.uk> wrote:
>c-add/len
>=========
>
>2009-09-09 Rendered into RfD form, added Forth200x words
>1999-06-22 Original Text by John Rible
>
>
>Problem
>=======
>A large number of words use "c-add u" to indicate the address of a
>string (c-addr) and its length (u) on the stack. With the
>introduction of variable width characters, it is not clear if "u" is
>referring to the number of characters or address units.
>
>
>Solution
>========
>Introduce a new pseudo-type ("len") into the document of these words
>to clarify the intent. Replacing the "u" with a "len" should improve
>the documentation of these words. The words effected are:

I use the word "sc" in my documentation for the pair.
It means string-constant. It implies that the word
using it must not reach through to the "c-add" and change
characters there. (So e.g. /STRING is okay.)

Anyway, I'm in favour of using a single indication of the pair
whenever they cannot be logically separated.
This allows for a full explanation of "sc" at one place, instead of
limited explanations regarding address units/ character units at
several places. Maybe a distinction between "sc" and "xsc" is in
order.

>Peter Knaggs, P.J.K...@exeter.ac.uk

Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Aleksej Saushev

Sep 19, 2009, 5:16:21 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>Peter Knaggs <p...@bcs.org.uk> writes:
>>
>>> m_l_g3 wrote:
>>>>
>>>> Sorry, but I do not see from your proposal what sort of length "len"
>>>> denotes: is it "length in characters", "length in logical (multi-byte)
>>>> characters", or "length in address units"?
>>>
>>> length in primitive characters (bytes).
>>
>>No, length in address units.
>
> This proposal replaces "u" with "len" in words where "u" denotes the
> number of characters.
>
> A change to let this parameter specify a number of address units would
> break existing standard programs. Granted, there are only few
> standard programs that don't have an environmental dependency on
> 1 CHARS = 1, and all maintained systems support these programs, so
> there would be little problem with such a change, but I see little
> point in having such a change. Better propose standardizing
> 1 CHARS = 1.

You're not consistent in your opinion that we should use UNICODE:
either 1 CHARS = 1, and you use one-octet encodings on octet-addressing
platforms, or 1 CHARS may be any other value, and you return to address
units, which are octets in many cases. The third way is decoupling Forth
from hardware in full, so that you don't deal with real CPU address units
at all.

--
CE3OH...

Bernd Paysan

Sep 19, 2009, 5:41:10 PM

Aleksej Saushev wrote:
> You're not consistent in your opinion that we should use UNICODE:
> either 1 CHARS = 1, and you use one-octet encodings on
> octet-addressing platforms, or 1 CHARS may be any other value, and you
> return to address units, which are octets in many cases. The third way
> is decoupling Forth from hardware in full, so that you don't deal with
> real CPU address units at all.

"Unicode" is not just one encoding. You can have an ASCII-compatible
byte-encoding like UTF-8 (which is what I recommend for Forth with
Unicode), or UTF-16, which is still a variable length encoding (one or
two 16-bit words make a character, i.e. you still need the XCHAR wordset
to work with UTF-16), or UCS4, which will be fixed-size, but is quite
wasteful.
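
As a rough illustration of the size differences between these encodings, the following sketch computes how many storage units one code point needs. It assumes cells of at least 32 bits and valid code points; utf8-size and utf16-units are illustrative names. UCS4 always needs exactly one 32-bit unit, so it gets no word of its own.

```forth
\ Illustrative helpers; assumes cells of at least 32 bits.
: utf8-size ( xc -- n )      \ bytes needed in UTF-8
  dup 128 u< if drop 1 exit then
  dup 2048 u< if drop 2 exit then
  65536 u< if 3 else 4 then ;

: utf16-units ( xc -- n )    \ 16-bit code units needed in UTF-16
  65536 u< if 1 else 2 then ;
```

For example, 8364 utf8-size (the euro sign, U+20AC) gives 3, while 8364 utf16-units gives 1.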

Except for a few experiments, all Forth systems have 1 CHARS = 1. Most
programs rely on that, as well (i.e. they don't use CHARS where they
should, often, they also don't use CHAR+ but 1+ or so).
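
A sketch of the shortcut in question, next to the portable form. The word names are illustrative; [IF] and [THEN] are from the Programming-Tools extension words, so the guard at the end assumes they are available.

```forth
\ Illustrative names.
: second-char ( c-addr -- char )  char+ c@ ;     \ portable: steps by one char
: second-char-au ( c-addr -- char )  1+ c@ ;     \ same result only when 1 CHARS = 1

\ A program that relies on the shortcut can at least declare the dependency:
1 chars 1 <> [if]
  .( This program assumes 1 CHARS = 1 ) cr abort
[then]
```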

Aleksej Saushev

Sep 20, 2009, 7:45:16 AM

Bernd Paysan <bernd....@gmx.de> writes:

Again internal inconsistency. If you want 1 CHARS = 1 always, then you
should get rid of it and assume that you address bytes/characters or
octets, whatever you decide. You return to the way C took.
Then you won't need any conversion of code to use wide characters &c.

So, what is the point in dragging this "CHARS" stuff?

This brings another problem of Standard Forth: lack of internal consistency.
You either have overengineered parts, impractical parts, or lack of standard
tools to solve every day practical tasks (like reading non-textual streams).

Could you and Anton decide for yourself what you really want and stick to it?
Because as for now you easily jump from 1 CHARS being able to hold a byte,
i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being
addressable unit like it is in C.

Each variant has right to exist and has its own consequences.
If you decide 1 CHARS = 1, then how I access address units? Octets?
If you decide 1 CHARS to be byte width, how do I read non-textual file?

P.S. Most of UNIX text processing programs use "char" and don't care of
locales still, but there's some kind of general consensus that they
should be converted. So what's your argument about? I don't understand it.

Again, you overengineer standard in domain nobody has much experience with,
and skip fixing defects affecting practical everyday tasks.

--
CE3OH...

Anton Ertl

Sep 20, 2009, 2:51:11 PM

Aleksej Saushev <as...@inbox.ru> writes:

>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Granted, there are only few
>> standard programs that don't have an environmental dependency on
>> 1 CHARS = 1, and all maintained systems support these programs, so
>> there would be little problem with such a change, but I see little
>> point in having such a change. Better propose standardizing
>> 1 CHARS = 1.
>
>You're not consistent in your opinion that we should use UNICODE:
>either 1 CHARS = 1, and you use one-octet encodings on octet-addressing
>platforms,

Yes, that's the way things work without xchars. With xchars, you can
use variable-width encodings like UTF-8, and UTF-8 is compatible with
8-bit chars.

>or 1 CHARS may be any other value, and you return to address
>units, which are octets in many cases.

And? The words where u refers to the number of characters still deal
with u chars, not u address units.

Anton Ertl

Sep 20, 2009, 3:15:27 PM

Aleksej Saushev <as...@inbox.ru> writes:
>If you want 1 CHARS = 1 always, then you
>should get rid of it

rid of what?

> and assume that you address bytes/characters or
>octets, whatever you decide.

On word-addressed machines 1 CHARS = 1, but a character is not a byte
or octet, but a word.

>So, what is the point in dragging this "CHARS" stuff?

It's in the current standard and nobody (not even you) has submitted
an RfD for making it obsolescent.

>This brings another problem of Standard Forth: lack of internal consistency.
>You either have overengineered parts, impractical parts, or lack of standard
>tools to solve every day practical tasks (like reading non-textual streams).
>
>Could you and Anton decide for yourself what you really want and stick to it?
>Because as for now you easily jump from 1 CHARS being able to hold a byte,
>i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being
>addressable unit like it is in C.

I can only guess what you mean here, but maybe the following can clear
things up: A char is a fixed-width memory unit, and on byte-addressed
machines it is a byte in all maintained systems. There are also
xchars (in the xchars proposal); they have a variable-width
representation in memory, i.e., each xchar is stored in one or more
chars. The "len" in this proposal always refers to the number of
chars, not to the number of xchars.
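
A minimal sketch of what that means in practice, assuming 8-bit chars and UTF-8 text. The word next-xchar is an illustrative name, not one of the words in the xchars proposal. It steps over one xchar, yet the string it takes and the string it leaves are both described by a len in chars.

```forth
\ Assumptions: 8-bit chars, UTF-8 text. next-xchar is an illustrative name.
: next-xchar ( c-addr1 len1 -- c-addr2 len2 )
  dup 0= if exit then
  begin
    1 /string                      \ advance by one char
    dup 0= if exit then
    over c@ 192 and 128 <>         \ stop at a byte that is not 10xxxxxx
  until ;
```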

>Each variant has right to exist and has its own consequences.
>If you decide 1 CHARS = 1, then how I access address units?

Easy in that case: c@ and c!

> Octets?

No octets in the standard yet. If you have a Forth system on a
word-addressed machine, you have to use system-specific code to deal
with octets.

>If you decide 1 CHARS to be byte width, how do I read non-textual file?

Use BIN.
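
Roughly like this, as a sketch that assumes the File-Access word set is present. The names buf and read5 are illustrative, and error handling is reduced to THROW, so the file stays open if READ-FILE fails.

```forth
\ Illustrative names; requires the File-Access word set.
create buf  5 chars allot

: read5 ( c-addr u -- u2 )       \ file name in, number of chars actually read out
  r/o bin open-file throw >r     \ BIN selects binary access
  buf 5 r@ read-file throw       \ ( -- u2 ) the 5 counts characters
  r> close-file throw ;
```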

A more interesting case is word-addressed machines: How should they
deal with BIN? But I guess if the people implementing and programming
on such systems feel the need for standardization in this regard, they
will come forward and start discussing it.

>P.S. Most of UNIX text processing programs use "char" and don't care of
>locales still, but there's some kind of general consensus that they
>should be converted.

Converted to what? Consensus among whom?

> So what's your argument about? I don't understand it.

Which argument?

Aleksej Saushev

Sep 21, 2009, 4:52:28 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>If you want 1 CHARS = 1 always, then you
>>should get rid of it
>
> rid of what?

CHARS, if CHARS is always no-op.

>> and assume that you address bytes/characters or
>>octets, whatever you decide.
>
> On word-addressed machines 1 CHARS = 1, but a character is not a byte
> or octet, but a word.

Then byte is word rather than octet.

>>So, what is the point in dragging this "CHARS" stuff?
>
> It's in the current standard and nobody (not even you) has submitted
> an RfD for making it obsolescent.

That's what I tell you: you overengineer some parts nobody has enough
experience with and omit everyday practice.

>>This brings another problem of Standard Forth: lack of internal consistency.
>>You either have overengineered parts, impractical parts, or lack of standard
>>tools to solve every day practical tasks (like reading non-textual streams).
>>
>>Could you and Anton decide for yourself what you really want and stick to it?
>>Because as for now you easily jump from 1 CHARS being able to hold a byte,
>>i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being
>>addressable unit like it is in C.
>
> I can only guess what you mean here, but maybe the following can clear
> things up: A char is a fixed-width memory unit, and on byte-addressed
> machines it is a byte in all maintained systems. There are also
> xchars (in the xchars proposal); they have a variable-width
> representation in memory, i.e., each xchar is stored in one or more
> chars. The "len" in this proposal always refers to the number of
> chars, not to the number of xchars.

A char is a byte, it is fixed-width memory unit, on octet-addressing
machine, it may take more than one address unit, on word-addressing
machine it may take more than one address unit as well, even though
it is less probable.

>>Each variant has right to exist and has its own consequences.
>>If you decide 1 CHARS = 1, then how I access address units?
>
> Easy in that case: c@ and c!
>
>> Octets?
>
> No octets in the standard yet. If you have a Forth system on a
> word-addressed machine, you have to use system-specific code to deal
> with octets.

I have Forth system on octet-addressing machine and want to understand
how you're going to deal with "wide" characters you're so fond of.

>>If you decide 1 CHARS to be byte width, how do I read non-textual file?
>
> Use BIN.

11.6.1.2080 READ-FILE ... ( c-addr u1 fileid -- u2 ior )
"Read u1 consecutive characters to c-addr from the current
position of the file identified by fileid."

Where're address units here? How do I read 5 octets on octet-addressing
machine, if I follow your advice to go for wide characters and define
1 CHARS = 2?

Do you want to tell me that I don't ever need it?
Is this scenario impossible? Highly improbable? Anything else?
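
For the common case the thread keeps coming back to (8-bit chars and 1 CHARS = 1), the u1 passed to READ-FILE is a count of chars, which is then also a count of octets. A minimal sketch of reading 5 octets from a file opened with BIN follows. The file name demo.bin and the words buf5, write-demo and read-demo are made up for illustration, THROW assumes the Exception wordset is present, and the sketch says nothing about how a 1 CHARS = 2 system would behave.

```forth
create buf5  5 chars allot              \ room for 5 chars

: write-demo ( -- )                     \ create a 5-octet test file
   s" demo.bin" w/o bin create-file throw >r
   s" 12345" r@ write-file throw
   r> close-file throw ;

: read-demo ( -- u2 )                   \ u2 = number of chars actually read
   s" demo.bin" r/o bin open-file throw >r
   buf5 5 r@ read-file throw
   r> close-file throw ;
```

After write-demo, read-demo leaves 5 on the stack and buf5 holds the characters 1 to 5.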

> A more interesting case is word-addressed machines: How should they
> deal with BIN? But I guess if the people implementing and programming
> on such systems feel the need for standardization in this regard, they
> will come forward and start discussing it.

You don't have a clear understanding of the octet-addressing case yet.

>>P.S. Most of UNIX text processing programs use "char" and don't care of
>>locales still, but there's some kind of general consensus that they
>>should be converted.
>
> Converted to what? Consensus among whom?

To wide characters. Among system developers.

>> So what's your argument about? I don't understand it.
>
> Which argument?

This one about 1 CHARS = 1.

--
CE3OH...

Aleksej Saushev

Sep 21, 2009, 5:00:00 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> Granted, there are only few
>>> standard programs that don't have an environmental dependency on
>>> 1 CHARS = 1, and all maintained systems support these programs, so
>>> there would be little problem with such a change, but I see little
>>> point in having such a change. Better propose standardizing
>>> 1 CHARS = 1.
>>
>>You're not consistent in your opinion that we should use UNICODE:
>>either 1 CHARS = 1, and you use one-octet encodings on octet-addressing
>>platforms,
>
> Yes, that's that way things work without xchars. With xchars, you can
> use variable-width encodings like UTF-8, and UTF-8 is compatible with
> 8-bit chars.

Compatible in what sense? Do they have fixed width like chars? No.
How are they compatible then?

>>or 1 CHARS may be any other value, and you return to address
>>units, which are octets in many cases.
>
> And? The words where u refers to the number of characters still deal
> with u chars, not u address units.

How do I read one address unit from a file?

Again, you lack clear vision and consistency in what you're doing.

--
CE3OH...

Bernd Paysan

Sep 21, 2009, 5:52:05 AM

Aleksej Saushev wrote:

> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> Yes, that's that way things work without xchars. With xchars, you can
>> use variable-width encodings like UTF-8, and UTF-8 is compatible with
>> 8-bit chars.
>
> Compatible in what sense? Do they have fixed width like chars? No.
> How are they compatible then?

UTF-8 is encoded in bytes, and ASCII-compatible. That's how it is
compatible with 8-bit chars: As long as you stay with ASCII, you don't see
any difference. If you go beyond ASCII, you can use C@ as building block to
implement the xchar wordset. Unlike UTF-16, where even simple ASCII
characters change encoding, UTF-8 works quite seamlessly. You can even use
gforth-0.6.2 (without xchar wordset) in a UTF-8 environment, and as long as
you don't do too heavy extended-character edits in the command line, it will
just work. That's what I call "compatible".

>>>or 1 CHARS may be any other value, and you return to address
>>>units, which are octets in many cases.
>>
>> And? The words where u refers to the number of characters still deal
>> with u chars, not u address units.
>
> How do I read one address unit from a file?
>
> Again, you lack clear vision and consistency in what you're doing.

The confusion comes from the Forth94 standard. The reason for CHARS was to
allow implementation on nibble-addressed machines. Later, some experimental
Forths like jaxforth used it to implement a UCS-2 charset (predecessor of
UTF-16), where the system is still a byte-oriented system, but 1 CHARS = 2
and c@/c! access 16-bit units. This didn't catch on.

What we could do is mark CHARS obsolescent, since CHARS = NOP in all known
and maintained Forth systems. Word-addressed machines will still have an
address unit that can store more than one byte, but that's not the issue
here. And then, address units would be equal to characters, but only at the
next review, when the obsolescent CHARS would become obsolete (how does this
match with reality? CHARS never caught on, so it was obsolete from the very
beginning).

This needs to go through RfD/CfV, since this is a significant change to the
standard. I agree that this would make the standard more consistent, and
more in alignment with common practice.

Stephen Pelc

Sep 21, 2009, 7:08:01 AM

On Mon, 21 Sep 2009 12:52:28 +0400, Aleksej Saushev <as...@inbox.ru>
wrote:

>I have Forth system on octet-addressing machine and want to understand
>how you're going to deal with "wide" characters you're so fond of.

This is the guts of the pchars idea. Virtually all comms/transfer
protocols use 8 bit units. Yes, I know about vending machines.
The traditional ASCII or code-page Forth uses 8-bit characters.

People who want to internationalise their applications have used
a variety of character encodings, including 8-bit, 16-bit, 32-bit and
variable-width characters. All of these use 8 bit units, which we propose to
call a pchar.

To avoid the confusion in the file words and elsewhere, we ended
up defining the use of "character" in Forth200x to mean a pchar.
This simply avoided a vast amount of work for the editor. If the
standard just says "character" with no qualification, it means a
pchar.

Once we define string lengths in terms of pchars, internationalisation
becomes easier. The display/comms words don't need to know what they
are sending, they just send multiple pchars. Where multiple pchars
form a single character, we refer to that character as an xchar.

The simple rule is
memory size in pchars,
number of characters in xchars.

KEY and EMIT and friends work in terms of pchars, XKEY and XEMIT
and friends in terms of xchars. That KEY and EMIT work in terms
of pchars is essential in embedded work and to allow Telnet or
HTTP to handle set up before transferring UTF-8 text.

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Aleksej Saushev

Sep 21, 2009, 7:24:23 AM

Bernd Paysan <bernd....@gmx.de> writes:

> Aleksej Saushev wrote:
>
>> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> Yes, that's that way things work without xchars. With xchars, you can
>>> use variable-width encodings like UTF-8, and UTF-8 is compatible with
>>> 8-bit chars.
>>
>> Compatible in what sense? Do they have fixed width like chars? No.
>> How are they compatible then?
>
> UTF-8 is encoded in bytes, and ASCII-compatible. That's how it is
> compatible with 8-bit chars: As long as you stay with ASCII, you don't see
> any difference. If you go beyond ASCII, you can use C@ as building block to
> implement the xchar wordset. Unlike UTF-16, where even simple ASCII
> characters change encoding, UTF-8 works quite seamlessly. You can even use
> gforth-0.6.2 (without xchar wordset) in a UTF-8 environment, and as long as
> you don't do too heavy extended-character edits in the command line, it will
> just work. That's what I call "compatible".

Here we have another thing then: you've silently introduced octet bytes.
You say that UTF-8 works on the top of octet bytes, octet characters,
what if implementation bytes (characters) take 2 octets each?

I don't see any benefit in dealing with UTF-8 internally, other than
converting data for input/output. Non-uniform size data are harder to
process.

>>>>or 1 CHARS may be any other value, and you return to address
>>>>units, which are octets in many cases.
>>>
>>> And? The words where u refers to the number of characters still deal
>>> with u chars, not u address units.
>>
>> How do I read one address unit from a file?
>>
>> Again, you lack clear vision and consistency in what you're doing.
>
> The confusion comes from the Forth94 standard. The reason for CHARS was to
> allow implementation on nibble-addressed machines. Later, some experimental
> Forths like jaxforth used it to implement a UCS-2 charset (predecessor of
> UTF-16), where the system is still a byte-oriented system, but 1 CHARS = 2
> and c@/c! access 16-bit units. This didn't catch on.
>
> What we could do is mark CHARS obsolescent, since CHARS = NOP in all known
> and maintained Forth systems. Word-addressed machines will still have an
> address unit that can store more than one byte, but that's not the issue
> here. And then, address units would be equal to characters, but only at the
> next review, when the obsolescent CHARS would become obsolete (how does this
> match with reality? CHARS never caught on, so it was obsolete from the very
> beginning).
>
> This needs to go through RfD/CfV, since this is a significant change to the
> standard. I agree that this would make the standard more consistent, and
> more in alignment with common practice.

I think that this is the wrong approach.

It is better to cure things this way:
1) make it clear that there are address units (already in standard)
and uniform bytes aka characters (already in standard);
2) CHARS is used to convert from number of characters to address units.
3) make READ-FILE/WRITE-FILE to use address units instead of characters
in BIN mode (not in the standard and still has to enter in some way);
4) optionally, provide address unit access words.
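
Points 1) and 2) are what a Forth-94 program already does when it sizes and indexes a character buffer. A small sketch, with the names line-size, line-buf, nth-char and clear-line invented for the example; on a system where 1 CHARS = 1, CHARS compiles to nothing, which is the case the rest of the thread argues about.

```forth
80 constant line-size                     \ hypothetical size, counted in chars
create line-buf  line-size chars allot    \ CHARS converts chars to address units

: nth-char ( n -- c-addr )                \ address of the n-th char, n from 0
   chars line-buf + ;

: clear-line ( -- )                       \ BLANK takes its length in chars
   line-buf line-size blank ;
```

Written this way the code does not depend on 1 CHARS = 1, at the cost of a CHARS that does nothing on every maintained system.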

--
CE3OH...

Anton Ertl

Sep 21, 2009, 7:33:36 AM

Aleksej Saushev <as...@inbox.ru> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>
>> Aleksej Saushev <as...@inbox.ru> writes:

[...]

>> On word-addressed machines 1 CHARS = 1, but a character is not a byte
>> or octet, but a word.
>
>Then byte is word rather than octet.

Word-addressed machines usually don't have bytes; and even if they
have bytes (e.g., in some sub-word-extracting instructions), they are
smaller than words. So your statement seems to be a Humpty-Dumpty like
case of trying to redefine words to have uncommon meanings. And it's
completely pointless (unless you just want to confuse), because the
standard does not talk about bytes, either.

>> I can only guess what you mean here, but maybe the following can clear
>> things up: A char is a fixed-width memory unit, and on byte-addressed
>> machines it is a byte in all maintained systems. There are also
>> xchars (in the xchars proposal); they have a variable-width
>> representation in memory, i.e., each xchar is stored in one or more
>> chars. The "len" in this proposal always refers to the number of
>> chars, not to the number of xchars.
>
>A char is a byte: a fixed-width memory unit. On an octet-addressing
>machine it may take more than one address unit; on a word-addressing
>machine it may take more than one address unit as well, even though
>that is less probable.

That's what's theoretically possible according to Forth-94. In
practice, 1 CHARS = 1 address unit.

>> No octets in the standard yet. If you have a Forth system on a
>> word-addressed machine, you have to use system-specific code to deal
>> with octets.
>
>I have Forth system on octet-addressing machine and want to understand
>how you're going to deal with "wide" characters you're so fond of.

Maybe you are confusing me with someone else (but whom? I don't
remember anyone asking for wide characters recently). I am not fond
of wide characters. Or maybe you are trying to pull another
Humpty-Dumpty here and use "wide characters" with an unusual meaning.

>11.6.1.2080 READ-FILE ... ( c-addr u1 fileid -- u2 ior )
>"Read u1 consecutive characters to c-addr from the current
>position of the file identified by fileid."
>
>Where're address units here? How do I read 5 octets on octet-addressing
>machine, if I follow your advice to go for wide characters and define
>1 CHARS = 2?

You are definitely confusing me with someone else. I have not given
any such advice, certainly not in the last five years, and IIRC nobody
else has given such advice recently, either.

>>>P.S. Most of UNIX text processing programs use "char" and don't care of
>>>locales still, but there's some kind of general consensus that they
>>>should be converted.
>>
>> Converted to what? Consensus among whom?
>
>To wide characters. Among system developers.

No, wide characters are pretty dead in the Unix world, especially
among system developers: I cannot name a system call that takes or
produces wide characters; can you?

>>> So what's your argument about? I don't understand it.
>>
>> Which argument?
>
>This one about 1 CHARS = 1.

That's common practice and hopefully someone will work out a proposal
to standardize that.

Bernd Paysan

Sep 21, 2009, 8:06:41 AM

Aleksej Saushev wrote:
> Here we have another thing then: you've silently introduced octet bytes.
> You say that UTF-8 works on the top of octet bytes, octet characters,
> what if implementation bytes (characters) take 2 octets each?

Apart from Jaxforth, nobody did that. Therefore, the question is quite
hypothetical. Yes, the Forth94 standard allows this sort of implementation.

> I don't see any benefit in dealing with UTF-8 internally, other than
> converting data for input/output. Non-uniform size data are harder to
> process.

UTF-16 has a non-uniform character size, too (characters may be either
one or two 16-bit words). Only UTF-32 is uniform. Since most strings pass
through a Forth program as a whole, you don't gain or lose much by not being able to
randomly address each character. Stepping through sequentially is easy
(just use XCHAR+ instead of CHAR+). The main point however is that text
processing uses strings as a whole much more often than the individual
characters inside strings. And there, it doesn't matter if it's ASCII or
UTF-8.
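
To make the difference between "len in chars" and "number of encoded characters" concrete, here is a rough sketch that steps through a UTF-8 string with nothing but C@. It assumes 8-bit chars, 1 CHARS = 1 and well-formed UTF-8; the words utf8-lead? and utf8-count and the buffer e-acute-a are invented for the example and are not part of the xchars proposal.

```forth
: utf8-lead? ( char -- flag )            \ false only for continuation bytes 10xxxxxx
   192 and 128 <> ;

: utf8-count ( c-addr len -- n )         \ n = number of encoded characters
   0 rot rot                             \ running count below the string
   over + swap                           \ ( n end start )
   ?do  i c@ utf8-lead? if 1+ then  loop ;

create e-acute-a  195 c, 169 c, 97 c,    \ UTF-8 bytes for e-acute followed by "a"
e-acute-a 3 utf8-count .                 \ len is 3 chars, but this prints 2
```

For pure ASCII the two numbers agree, which is the sense of "compatible" used above; words such as TYPE or WRITE-FILE only ever need the len.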

The Unix world and the Internet have pretty much consolidated on UTF-8, and
through files shared on the Internet, even the Windows world needs quite
good UTF-8 support (though it is internally on the system call level still
UTF-16, as is Java or C#). Having a different internal and external text
representation is IMHO a bad idea. UTF-16 gives you the worst of both
sides: variable-length characters *and* incompatibility.

> I think that this is the wrong approach.
>
> It is better to cure things this way:
> 1) make it clear that there are address units (already in standard)
> and uniform bytes aka characters (already in standard);
> 2) CHARS is used to convert from number of characters to address units.

That's already in the standard. That's what nobody uses. CHARS is trying
to solve a problem nobody has.

> 3) make READ-FILE/WRITE-FILE to use address units instead of characters
> in BIN mode (not in the standard and still has to enter in some way);

That won't work on nibble-addressed machines, which are just about as
hypothetical as 1 CHARS = 2 machines or UTF-16 Forths.

> 4) optionally, provide address unit access words.

We have a wordset for a similar purpose in preparation, but it probably
still deals with something else than you want.

Anton Ertl

Sep 21, 2009, 8:06:12 AM

steph...@mpeforth.com (Stephen Pelc) writes:
>The simple rule is
> memory size in pchars,
> number of characters in xchars.

Just a clarification, so this is not misunderstood: "number of
characters" always refers to chars (aka pchars), never to xchars.
Moreover, there is not a single occurrence of a "number of xchars" or
anything like it in the xchars proposal.

Bernd Paysan

Sep 21, 2009, 8:29:22 AM

Anton Ertl wrote:

>>> Converted to what? Consensus among whom?
>>
>>To wide characters. Among system developers.
>
> No, wide characters are pretty dead in the Unix world, especially
> among system developers: I cannot name a system call that takes or
> produces wide characters; can you?

He is probably talking about Windows. However, there are functions like
XDrawString16 in the X Window System, though of course today people can
use Xutf8DrawString instead, or the Xft variants for anti-aliased vector
fonts.

In the win32 port of MINOS, I use the following two words for conversion:

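\ UTF-16 string -> UTF-8 string in the scratch buffer
: utf16> ( addr u -- addr' u' )
    0 0 2swap
    swap scratch# scratch 2swap 0 CP_UTF8
    WideCharToMultiByte scratch swap ;
\ UTF-8 string -> zero-terminated UTF-16 string in the scratch buffer
: >utf16 ( addr u -- addr' u' )
    swap scratch# 2/ scratch 2swap 0 CP_UTF8
    MultiByteToWideChar scratch swap 2dup 2* + 0 swap w! ;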

Since Windows also wants wide-character strings to be zero-terminated,
in-place use of UTF-16 strings from Forth won't be possible anyway.

Ed

Sep 21, 2009, 8:32:40 AM

Anton Ertl wrote:
> "Ed" <nos...@invalid.com> writes:
> >> Solution
> >> ========
> >> Introduce a new pseudo-type ("len") into the document of these words
> >> to clarify the intent. Replacing the "u" with a "len" should improve
> >> the documentation of these words. The words effected are:
> >>
> >> ...
> >> 12.6.1.2143 REPRESENT
> >
> >I must have missed it. When did "u most significant digits of
> >the significand" [of a number] become the length of a string?
>
> u has always been the length of the buffer in characters in REPRESENT.

Not according to the Standard.

> That's the only interpretation of the specification that makes any
> sense.

On the contrary there are many interpretations - some of which
produce good outcomes, and others that produce plainly poor
ones as has been demonstrated.

Aleksej Saushev

Sep 21, 2009, 9:34:17 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>
>>> Aleksej Saushev <as...@inbox.ru> writes:
> [...]
>>> On word-addressed machines 1 CHARS = 1, but a character is not a byte
>>> or octet, but a word.
>>
>>Then byte is word rather than octet.
>
> Word-addressed machines usually don't have bytes;

Definition of byte is "the number of bits required to represent any single
character". Byte is byte, octet is octet, word is word. Don't confuse things.
If you use 16-bit encoding on 64-bit octet addressing machine, byte is
16 bits, octet is 8 bits (always), word is 64 bits.

> So your statement seems to be a Humpty-Dumpty like
> case of trying to redefine words to have uncommon meanings. And it's
> completely pointless (unless you just want to confuse), because the
> standard does not talk about bytes, either.

Maybe it's you who should consult a dictionary. I've checked several places,
and each one says the same: a byte is the memory unit that holds any single
character. If this goes against your expectations, so much the worse for them.
If you want to base your arguments on established practice, then C and C++
each have much more practice than Forth.

>>> I can only guess what you mean here, but maybe the following can clear
>>> things up: A char is a fixed-width memory unit, and on byte-addressed
>>> machines it is a byte in all maintained systems. There are also
>>> xchars (in the xchars proposal); they have a variable-width
>>> representation in memory, i.e., each xchar is stored in one or more
>>> chars. The "len" in this proposal always refers to the number of
>>> chars, not to the number of xchars.
>>
>>A char is a byte: a fixed-width memory unit. On an octet-addressing
>>machine it may take more than one address unit; on a word-addressing
>>machine it may take more than one address unit as well, even though
>>that is less probable.
>
> That's what's theoretically possible according to Forth-94. In
> practice, 1 CHARS = 1 address unit.

It is possible in practice as well. And in fact, here it is:
1. http://home.att.net/~jackklein/c/inttypes.html
2. http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.4

>>> No octets in the standard yet. If you have a Forth system on a
>>> word-addressed machine, you have to use system-specific code to deal
>>> with octets.
>>
>>I have Forth system on octet-addressing machine and want to understand
>>how you're going to deal with "wide" characters you're so fond of.
>
> Maybe you are confusing me with someone else (but whom? I don't
> remember anyone asking for wide characters recently). I am not fond
> of wide characters. Or maybe you are trying to pull another
> Humpty-Dumpty here and use "wide characters" with an unusual meaning.

We all know pretty well who advocates UNICODE here.

>>>>P.S. Most of UNIX text processing programs use "char" and don't care of
>>>>locales still, but there's some kind of general consensus that they
>>>>should be converted.
>>>
>>> Converted to what? Consensus among whom?
>>
>>To wide characters. Among system developers.
>
> No, wide characters are pretty dead in the Unix world, especially
> among system developers: I cannot name a system call that takes or
> produces wide characters; can you?

That's easy:

STANDARDS
These functions conform to ISO/IEC 9899:1999 (``ISO C99'') and were first
introduced in ISO/IEC 9899/AMD1:1995 (``ISO C90, Amendment 1''), with the
exception of wcslcat() and wcslcpy(), which are extensions. The wcswcs()
function conforms to X/Open Portability Guide Issue 4, Version 2
(``XPG4.2'').

Or do you think that standard library isn't part of the system?

>>>> So what's your argument about? I don't understand it.
>>>
>>> Which argument?
>>
>>This one about 1 CHARS = 1.
>
> That's common practice and hopefully someone will work out a proposal
> to standardize that.

So what? FORGET is common practice, except for you and some other people,
but you're trying very hard to remove it from the standard.

--
CE3OH...

Aleksej Saushev

Sep 21, 2009, 9:52:45 AM

Bernd Paysan <bernd....@gmx.de> writes:

> Aleksej Saushev wrote:
>> Here we have another thing then: you've silently introduced octet bytes.
>> You say that UTF-8 works on the top of octet bytes, octet characters,
>> what if implementation bytes (characters) take 2 octets each?
>
> Apart from Jaxforth, nobody did that. Therefore, the question is quite
> hypothetical. Yes, the Forth94 standard allows this sort of implementation.
>
>> I don't see any benefit in dealing with UTF-8 internally, other than
>> converting data for input/output. Non-uniform size data are harder to
>> process.
>
> UTF-16 has a non-uniform character size, too (characters may be either
> one or two 16-bit words).

So what? Do you argue that I'm to use UTF-16 as internal encoding?

> The Unix world and the Internet have pretty much consolidated on UTF-8, and
> through files shared on the Internet, even the Windows world needs quite
> good UTF-8 support (though it is internally on the system call level still
> UTF-16, as is Java or C#). Having a different internal and external text
> representation is IMHO a bad idea.

That's your opinion, isn't it?

>> I think that this is the wrong approach.
>>
>> It is better to cure things this way:
>> 1) make it clear that there are address units (already in standard)
>> and uniform bytes aka characters (already in standard);
>> 2) CHARS is used to convert from number of characters to address units.
>
> That's already in the standard. That's what nobody uses.

Sorry? Who is this "nobody"? You may not use it, others do.

Recent practice shows that you don't use markers, otherwise you'd find
that bug before me. By your own account that means that marker should be
removed from the standard.

>> 3) make READ-FILE/WRITE-FILE to use address units instead of characters
>> in BIN mode (not in the standard and still has to enter in some way);
>
> That won't work on nibble-addressed machines, which are just about as
> hypothetical as 1 CHARS = 2 machines or UTF-16 Forths.

Will we hear anything more resembling proof than just assertion?
What exactly prevents receiving 4 bits via stream on nibble addressing
machine? Is bit banging banned?

--
CE3OH...

Anton Ertl

Sep 21, 2009, 8:16:59 AM

Bernd Paysan <bernd....@gmx.de> writes:
>What we could do is mark CHARS obsolescent, since CHARS = NOP in all known
>and maintained Forth systems. Word-addressed machines will still have an
>address unit that can store more than one byte, but that's not the issue
>here. And then, address units would be equal to characters, but only at the
>next review, when the obsolescent CHARS would become obsolete (how does this
>match with reality? CHARS never caught on, so it was obsolete from the very
>beginning).

There are a number of programs around that try to conform to the
standard and use CHARS; just removing CHARS from one standard to the
next would suddenly de-standardize these programs. That would be
especially bad because the programmers went an extra mile in order to
increase standards compliance and compatibility.

Bernd Paysan

Sep 21, 2009, 10:27:15 AM

Aleksej Saushev wrote:
>> UTF-16 has a non-uniform character size, too (characters may be either
>> one or two 16-bit words).
>
> So what? Do you argue that I'm to use UTF-16 as internal encoding?

I argue that using UTF-16 has no benefit over using UTF-8, while it has the
same disadvantages.

>> The Unix world and the Internet have pretty much consolidated on UTF-8,
>> and through files shared on the Internet, even the Windows world needs
>> quite good UTF-8 support (though it is internally on the system call
>> level still
>> UTF-16, as is Java or C#). Having a different internal and external text
>> representation is IMHO a bad idea.
>
> That's your opinion, isn't it?

Of course any statement with IMHO ("in my humble opinion") is an opinion
;-).

>> That's already in the standard. That's what nobody uses.
>
> Sorry? Who is this "nobody"? You may not use it, others do.

I have no idea who this "nobody" is, I just don't know anybody, and they
fail to speak up.

> Recent practice shows that you don't use markers, otherwise you'd find
> that bug before me. By your own account that means that marker should be
> removed from the standard.

MARKER is a pretty infrequently used word, yes. If anything, it should go to
TOOLS EXT, the bag of tools people may or may not implement. As far as I
understood Anton, the bug was a feature, i.e. you had a file starting with a
marker, and if you called that marker, the entire file - including its file
name - vanished. Like

require foo.fs

----foo.fs----
marker forget-foo

... definitions to implement foo ...
--------------

After forget-foo, require foo.fs will reload the file (since the marker will
also remove the file). The code wasn't water-tight, i.e. what should have
been there:

* a detection if the marker really was the first definition in the file
* a detection, if the file was still active (then don't forget it) or not
(then forget it).

>>> 3) make READ-FILE/WRITE-FILE to use address units instead of characters
>>> in BIN mode (not in the standard and still has to enter in some way);
>>
>> That won't work on nibble-addressed machines, which are just about as
>> hypothetical as 1 CHARS = 2 machines or UTF-16 Forths.
>
> Will we hear anything more resembling proof than just assertion?
> What exactly prevents receiving 4 bits via stream on nibble addressing
> machine? Is bit banging banned?

Well, in theory, you could create a nibble-addressed machine with a file
system that has nibble granularity for file sizes. As long as this is an
isolated system, everything is fine. You just can't exchange these files
with anybody else (at least not without special meta-information). I'd
suggest using a special NIBBLE mode if you really need that feature on such
a system - as a nonstandard extension. A standard doesn't have to solve
hypothetical problems.

Anton Ertl

Sep 21, 2009, 11:57:58 AM

"Ed" <nos...@invalid.com> writes:
>Anton Ertl wrote:
>> "Ed" <nos...@invalid.com> writes:
>> >> Solution
>> >> ========
>> >> Introduce a new pseudo-type ("len") into the document of these words
>> >> to clarify the intent. Replacing the "u" with a "len" should improve
>> >> the documentation of these words. The words effected are:
>> >>
>> >> ...
>> >> 12.6.1.2143 REPRESENT
>> >
>> >I must have missed it. When did "u most significant digits of
>> >the significand" [of a number] become the length of a string?
>>
>> u has always been the length of the buffer in characters in REPRESENT.

>> That's the only interpretation of the specification that makes any
>> sense.
>
>On the contrary there are many interpretations

But only ones that don't make sense.

Ok, for the moment let us assume that u does not specify the buffer
length. There is nothing else in the standard which specifies the
length of that buffer, so how would a standard system know how far it
might write? And how would a standard program create a buffer whose
address it could pass to REPRESENT? This gives us two sorts of
interpretations:

1) The system-centric one: A standard system can write arbitrarily
far. Then standard programs cannot use REPRESENT, because they cannot
create a buffer that cannot be overflowed by some system.

2) The program-centric one: A standard program may pass an arbitrarily
small buffer. Then a standard system cannot write anything there,
because it might overflow the buffer. Yet it has to store the u most
significant digits there. With this interpretation, a standard system
cannot implement REPRESENT, because there is a conflict between the
buffer size interpretation and the standard requirements.

So, if u does not specify the buffer length, either programs cannot
use REPRESENT, or systems cannot implement it. Neither of these
options makes sense. The only variant that makes sense is that u does
specify the buffer length.
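
Under that reading, a program passes REPRESENT a buffer of exactly u chars and gets back u digits. A minimal sketch, assuming the Floating-Point wordset is loaded; the names #sig, sig-buf and sig-digits are made up for illustration, and the sign flag is simply ignored here.

```forth
8 constant #sig                          \ hypothetical buffer length, in chars
create sig-buf  #sig chars allot

: sig-digits ( -- c-addr len n ) ( F: r -- )
   sig-buf #sig represent                \ ( -- n flag1 flag2 )
   0= abort" r is not a valid number"    \ flag2 false: result not usable
   drop                                  \ flag1 (sign) ignored in this sketch
   sig-buf #sig rot ;                    \ digit string, decimal exponent n on top

3.14e0 sig-digits . type                 \ prints 1 31400000
```

The system never writes past sig-buf + 8 chars, and the program never needs to guess how far it might.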

Aleksej Saushev

Sep 21, 2009, 2:25:25 PM

Bernd Paysan <bernd....@gmx.de> writes:

> Aleksej Saushev wrote:
>>> UTF-16 has a non-uniform character size, too (characters may be either
>>> one or two 16-bit words).
>>
>> So what? Do you argue that I'm to use UTF-16 as internal encoding?
>
> I argue that using UTF-16 has no benefit over using UTF-8, while it has the
> same disadvantages.

How does it apply? I'm not obliged to use UTF-16 or any other UNICODE
for internal use.

>>> That's already in the standard. That's what nobody uses.
>>
>> Sorry? Who is this "nobody"? You may not use it, others do.
>
> I have no idea who this "nobody" is, I just don't know anybody, and they
> fail to speak up.

I use exactly this feature and don't plan to stop using it in near future.

>> Recent practice shows that you don't use markers, otherwise you'd find
>> that bug before me. By your own account that means that marker should be
>> removed from the standard.
>
> MARKER is a pretty infrequently used word, yes. If anything, it should go to
> TOOLS EXT, the bag of tools people may or may not implement. As far as I
> understood Anton, the bug was a feature, i.e. you had a file starting with a
> marker, and if you called that marker, the entire file - including its file
> name - vanished.

A pretty broken feature it was.

And this is what is called a "bad idea": overly complex and still not working.

I don't like your habit of juggling word sets; this isn't a circus.

>>>> 3) make READ-FILE/WRITE-FILE to use address units instead of characters
>>>> in BIN mode (not in the standard and still has to enter in some way);
>>>
>>> That won't work on nibble-addressed machines, which are just about as
>>> hypothetical as 1 CHARS = 2 machines or UTF-16 Forths.
>>
>> Will we hear anything more resembling proof than just assertion?
>> What exactly prevents receiving 4 bits via stream on nibble addressing
>> machine? Is bit banging banned?
>
> Well, in theory, you could create a nibble-addressed machine with a file
> system that has nibble granularity for file sizes. As long as this is an
> isolated system, everything is fine. You just can't exchange these files
> with anybody else (at least not without special meta-information). I'd
> suggest using a special NIBBLE mode if you really need that feature on such
> a system - as a nonstandard extension. A standard doesn't have to solve
> hypothetical problems.

What is the problem you see there? If the file is padded to full octets,
it is fine to exchange with the rest of the world. Again, you see some
quite imaginary problem, while there is none in reality. What you
refer to as an isolated system with nibble granularity maps ideally
to the real system with octet address units and 16-bit bytes.

There's no problem with handling octets and wider-than-octet characters
if you have a properly constructed underlying level. The latter is what
the current standard doesn't provide, and you refuse to admit it.

--
CE3OH...

Elizabeth D Rather

Sep 21, 2009, 2:28:16 PM

Aleksej Saushev wrote:
> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>
>> Aleksej Saushev <as...@inbox.ru> writes:
>>> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>>
>>>> Aleksej Saushev <as...@inbox.ru> writes:
>> [...]
>>>> On word-addressed machines 1 CHARS = 1, but a character is not a byte
>>>> or octet, but a word.
>>> Then byte is word rather than octet.
>> Word-addressed machines usually don't have bytes;
>
> Definition of byte is "the number of bits required to represent any single
> character". Byte is byte, octet is octet, word is word. Don't confuse things.
> If you use 16-bit encoding on 64-bit octet addressing machine, byte is
> 16 bits, octet is 8 bits (always), word is 64 bits.

...

I distinctly remember during the discussions in 1999 regarding
internationalization, Greg Bailey strongly advocated introducing octets
in order to have a firmly-defined 8-bit entity with which to specify
communications. Since it appears there is still some concern about
this, maybe octet language could be helpful. I personally think "octet"
is a more useful term (clearly defined and widely accepted) than "pchar".

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Frank Buss

Sep 21, 2009, 2:45:46 PM

Elizabeth D Rather wrote:

> I distinctly remember during the discussions in 1999 regarding
> internationalization, Greg Bailey strongly advocated introducing octets
> in order to have a firmly-defined 8-bit entity with which to specify
> communications. Since it appears there is still some concern about
> this, maybe octet language could be helpful. I personally think "octet"
> is a more useful term (clearly defined and widely accepted) than "pchar".

For me, "octet" sounds a bit old-fashioned. E.g. take a look at some random
datasheets (or even the product pages) for microcontrollers: they all
use "byte", which sounds more natural for an 8-bit value to me, too.

http://www.microchip.com/wwwproducts/Devices.aspx?dDocName=en010280
http://www.atmel.com/products/AT91/overview.asp

And even Wikipedia says: "The use of a byte to mean 8 bits has become
ubiquitous.":

http://en.wikipedia.org/wiki/Byte

And I don't know anyone who says "my memory stick can store 1 giga octets".
But for a standard document a definition of the words used in the document
is required anyway, so maybe at the beginning "byte=octet=8 bits" should be
mentioned.

--
Frank Buss, f...@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

Anton Ertl

Sep 21, 2009, 2:38:01 PM

Elizabeth D Rather <era...@forth.com> writes:
>I distinctly remember during the discussions in 1999 regarding
>internationalization, Greg Bailey strongly advocated introducing octets
>in order to have a firmly-defined 8-bit entity with which to specify
>communications. Since it appears there is still some concern about
>this, maybe octet language could be helpful. I personally think "octet"
>is a more useful term (clearly defined and widely accepted) than "pchar".

That may be the case, but these are different concepts. The main case
where this difference plays a role is word-addressed machines, i.e.,
where an au and consequently a char (aka pchar) is bigger than 8 bits.
There a char can hold more data than an octet. Whether and how the
system implementor chooses to make use of this possibility is up to the
system implementor; but in any case, octets are an additional concern
beyond chars (and IMO dragging them into this discussion is not
helpful).
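
The sizes at issue can be inspected on a given system. The following is a minimal sketch, assuming a Forth-94 system that answers the ADDRESS-UNIT-BITS environment query; the word name .SIZES is made up for illustration and is not part of any standard or proposal.

```forth
\ Print how large an address unit is and how many address units a
\ char occupies on this system.  .SIZES is a hypothetical name.
: .SIZES ( -- )
   S" ADDRESS-UNIT-BITS" ENVIRONMENT? IF
      ." address unit: " . ." bits" CR
   ELSE
      ." ADDRESS-UNIT-BITS is unknown to this system" CR
   THEN
   ." 1 CHARS = " 1 CHARS . ." address unit(s)" CR ;
```

On a byte-addressed system one would expect 8 bits and 1 CHARS = 1; on a word-addressed machine the first number is larger, which is the case where a char can hold more than an octet.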

Anton Ertl

Sep 21, 2009, 2:48:43 PM

Aleksej Saushev <as...@inbox.ru> writes:
>Bernd Paysan <bernd....@gmx.de> writes:
>
>> Aleksej Saushev wrote:

[Bernd Paysan:]
[on CHARS for converting between number of chars and number of aus]

>>>> That's already in the standard. That's what nobody uses.
>>>
>>> Sorry? Who is this "nobody"? You may not use it, others do.
>>
>> I have no idea who this "nobody" is, I just don't know anybody, and they
>> fail to speak up.
>
>I use exactly this feature and don't plan to stop using it in the near future.

So, you use CHARS to convert from number of chars to number of aus.
Then I wonder why you are asking us to get rid of CHARS
<87zl8oo...@inbox.ru>.

ken...@cix.compulink.co.uk

Sep 22, 2009, 6:13:14 AM

In article <4ab75b1d....@192.168.0.50>, steph...@mpeforth.com
(Stephen Pelc) wrote:

> The traditional ASCII or code-page Forth uses 8-bit characters.

Standard ASCII was actually 7-bit. The 8-bit extended character set was
never standardised as far as I know with the nearest to a standard being
IBM code pages. If your Forth implementation uses a terminal or terminal
emulator, characters below (IIRC) 32 are reserved for control codes.

Ken Young

Aleksej Saushev

Sep 22, 2009, 6:53:52 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Elizabeth D Rather <era...@forth.com> writes:
>>I distinctly remember during the discussions in 1999 regarding
>>internationalization, Greg Bailey strongly advocated introducing octets
>>in order to have a firmly-defined 8-bit entity with which to specify
>>communications. Since it appears there is still some concern about
>>this, maybe octet language could be helpful. I personally think "octet"
>>is a more useful term (clearly defined and widely accepted) than "pchar".
>
> That may be the case, but these are different concepts. The main case
> where this difference plays a role is word-addressed machines, i.e.,
> where an au and consequently a char (aka pchar) is bigger than 8 bits.
> There a char can hold more data than an octet. Whether and how the
> system implementor chooses to make use of this possibility is up to the
> system implementor; but in any case, octets are an additional concern
> beyond chars (and IMO dragging them into this discussion is not
> helpful).

It is easy to see in this group that the name "pchar" is at the very least
misleading, while "address unit" is not. The latter is better established
in practice as well: you can trace it back to the Ada standard of 1983 or
whenever it was. "Address unit" is already in the Forth standard too, and
now you're trying to push an extra entity called "pchar", as if "character"
were not enough.

You're adding more and more confusion to an already controversial
standard. You can't tell what Forth is now: if it is supposed to be a
hardware-oriented language, it lacks clear concepts from hardware
standards, like octets, memory/address units, address spaces and such;
if Forth is supposed to be a higher-level language, it lacks even more
concepts. You're going to make it a bad choice in both worlds.

I think that all of you should stop and reassess the goals of your standard,
because you're in such haste to get it ready before 2010 that you add
more and more inconsistency instead of fixing the mistakes of the past.

--
CE3OH...

Aleksej Saushev

Sep 22, 2009, 6:58:44 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>Bernd Paysan <bernd....@gmx.de> writes:
>>
>>> Aleksej Saushev wrote:
> [Bernd Paysan:]
> [on CHARS for converting between number of chars and number of aus]
>>>>> That's already in the standard. That's what nobody uses.
>>>>
>>>> Sorry? Who is this "nobody"? You may not use it, others do.
>>>
>>> I have no idea who this "nobody" is, I just don't know anybody, and they
>>> fail to speak up.
>>
>>I use exactly this feature and don't plan to stop using it in the near future.
>
> So, you use CHARS to convert from number of chars to number of aus.
> Then I wonder why you are asking us to get rid of CHARS
> <87zl8oo...@inbox.ru>.

I'm asking you to state your goals clearly: you should take either one
way or the other, but not both. One way is to support CHARS and let it
have an arbitrary value (introducing communication-level concepts like "octet");
the other way is to remove it. As of now, you don't appear to have a firm
position on what the practical value of CHAR is.

--
CE3OH...

Aleksej Saushev

Sep 22, 2009, 7:49:49 AM

ken...@cix.compulink.co.uk writes:

> In article <4ab75b1d....@192.168.0.50>, steph...@mpeforth.com
> (Stephen Pelc) wrote:
>
>> The traditional ASCII or code-page Forth uses 8-bit characters.
>
> Standard ASCII was actually 7-bit. The 8-bit extended character set was
> never standardised

Ever heard of ISO/IEC 8859?

--
CE3OH...

Anton Ertl

Sep 23, 2009, 4:53:56 AM

Aleksej Saushev <as...@inbox.ru> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

>> So, you use CHARS to convert from number of chars to number of aus.
>> Then I wonder why you are asking us to get rid of CHARS
>> <87zl8oo...@inbox.ru>.
>
>I'm asking you to state your goals clearly: you should take either one
>way or the other, but not both. One way is to support CHARS and let it
>have an arbitrary value (introducing communication-level concepts like "octet");
>the other way is to remove it. As of now, you don't appear to have a firm
>position on what the practical value of CHAR is.

I guess you mean CHARS, right? I have already given my position on
that <2009Sep2...@mips.complang.tuwien.ac.at>:

|[1 CHARS = 1 is] common practice, and hopefully someone will work
|out a proposal to standardize that.

Using CHARS in a program has no practical value. Implementing CHARS
in a system has the practical value of supporting those few programs
that actually use CHARS.

Aleksej Saushev

Sep 23, 2009, 1:26:31 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> So, you use CHARS to convert from number of chars to number of aus.
>>> Then I wonder why you are asking us to get rid of CHARS
>>> <87zl8oo...@inbox.ru>.
>>
>>I'm asking you to state your goals clearly: you should take either one
>>way or the other, but not both. One way is to support CHARS and let it
>>have an arbitrary value (introducing communication-level concepts like "octet");
>>the other way is to remove it. As of now, you don't appear to have a firm
>>position on what the practical value of CHAR is.
>
> I guess you mean CHARS, right? I have already given my position on
> that <2009Sep2...@mips.complang.tuwien.ac.at>:
>
> |[1 CHARS = 1 is] common practice, and hopefully someone will work
> |out a proposal to standardize that.
>
> Using CHARS in a program has no practical value. Implementing CHARS
> in a system has the practical value of supporting those few programs
> that actually use CHARS.

Now you tell me that having one program to support text rather than
several ones, for regular text and for UTF-8, is impractical.

JFYI, of several major sites I've just probed, only two use UTF-8,
others use different kinds of unioctet Russian Cyrillic.
Thus you still have to recode text you receive during communication.
I don't understand where you take your phantasies from.

Show me how you derive need of your beloved UTF-8 from unioctet
encodings and MIME, when common practice is using MIME.

--
CE3OH...

Anton Ertl

Sep 23, 2009, 4:15:41 PM

Aleksej Saushev <as...@inbox.ru> writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>
>> Aleksej Saushev <as...@inbox.ru> writes:
>>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>>> So, you use CHARS to convert from number of chars to number of aus.
>>>> Then I wonder why you are asking us to get rid of CHARS
>>>> <87zl8oo...@inbox.ru>.
>>>
>>>I'm asking you to state your goals clearly: you should take either one
>>>way or the other, but not both. One way is to support CHARS and let it
>>>have an arbitrary value (introducing communication-level concepts like "octet");
>>>the other way is to remove it. As of now, you don't appear to have a firm
>>>position on what the practical value of CHAR is.
>>
>> I guess you mean CHARS, right? I have already given my position on
>> that <2009Sep2...@mips.complang.tuwien.ac.at>:
>>
>> |[1 CHARS = 1 is] common practice, and hopefully someone will work
>> |out a proposal to standardize that.
>>
>> Using CHARS in a program has no practical value. Implementing CHARS
>> in a system has the practical value of supporting those few programs
>> that actually use CHARS.
>
>Now you tell me that having one program to support text rather than
>several ones, for regular text and for UTF-8, is impractical.

No, I don't tell you that. What I tell you is written above.

>JFYI, of several major sites I've just probed, only two use UTF-8,
>others use different kinds of unioctet Russian Cyrillic.
>Thus you still have to recode text you receive during communication.

And this has what to do with this discussion?

>I don't understand where you take your phantasies from.

I don't know what phantasies you are referring to.

>Show me how you derive need of your beloved UTF-8 from unioctet
>encodings and MIME, when common practice is using MIME.

Common practice where? Certainly not in Forth.

ASCII-compatible 8-bit encodings are compatible with Forth-94 and with
the xchars proposal; the reference implementation of xchars contains
an implementation for 8-bit encodings.

Aleksej Saushev

Sep 23, 2009, 11:08:15 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Aleksej Saushev <as...@inbox.ru> writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>
>>> Aleksej Saushev <as...@inbox.ru> writes:
>>>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>>>> So, you use CHARS to convert from number of chars to number of aus.
>>>>> Then I wonder why you are asking us to get rid of CHARS
>>>>> <87zl8oo...@inbox.ru>.
>>>>
>>>>I'm asking you to state your goals clearly: you should take either one
>>>>way or the other, but not both. One way is to support CHARS and let it
>>>>have an arbitrary value (introducing communication-level concepts like "octet");
>>>>the other way is to remove it. As of now, you don't appear to have a firm
>>>>position on what the practical value of CHAR is.
>>>
>>> I guess you mean CHARS, right? I have already given my position on
>>> that <2009Sep2...@mips.complang.tuwien.ac.at>:
>>>
>>> |[1 CHARS = 1 is] common practice, and hopefully someone will work
>>> |out a proposal to standardize that.
>>>
>>> Using CHARS in a program has no practical value. Implementing CHARS
>>> in a system has the practical value of supporting those few programs
>>> that actually use CHARS.
>>
>>Now you tell me that having one program to support text rather than
>>several ones, for regular text and for UTF-8, is impractical.
>
> No, I don't tell you that. What I tell you is written above.

This is the same: you tell me that there's no practical value.
What do you count as practical value at all, if not this?

>>JFYI, of several major sites I've just probed, only two use UTF-8,
>>others use different kinds of unioctet Russian Cyrillic.
>>Thus you still have to recode text you receive during communication.
>
> And this has what to do with this discussion?

This voids your argument about what is impractical.

>>I don't understand where you take your phantasies from.
>
> I don't know what phantasies you are referring to.

Your phantasies about practical side of CHARS.

>>Show me how you derive need of your beloved UTF-8 from unioctet
>>encodings and MIME, when common practice is using MIME.
>
> Common practice where? Certainly not in Forth.
>
> ASCII-compatible 8-bit encodings are compatible with Forth-94 and with
> the xchars proposal; the reference implementation of xchars contains
> an implementation for 8-bit encodings.

Again. Show me how the need for _non-uniform_ length UTF-8 arises from
_uniform_ unioctet encoding, which was common practice before UTF-8
acceptance (leaving the numbers for acceptance outside the scope of
discussion).

Contrary to what you tell, your xchars are _incompatible_ with Forth-94:
they cannot be copied with CMOVE ("copy u consecutive characters"),
they cannot be read with READ-FILE ("read u1 consecutive characters"),
you cannot do anything with them, unless you rely on low level
representation. How does that make xchars compatible?

Thus what we see for now is this: you design a standard, and you make
another series of changes that make the new standard largely incompatible
with the previous one. What makes it horrible is that you are rationalizing
your phantasies about what is practical and what is not. You fail to
demonstrate the necessity of reverting a standard feature, and thus assert
that it is impractical without any support. You demonstrated your perfect
knowledge of common practice in other programming languages pretty well.
So, would you (and Bernd too) be so kind as to retract your character set
proposals or stay away from the standards process altogether until you
(either of you) publish a survey to support your point? I question your
expertise in this domain.

This is to keep you from making delusional points in future: you both
originate from Latin-writing countries, you hardly have enough
experience with localization issues, and, even more so, Germans are
well known for their attempts to strip their umlauts and convert to bare
Latin. And we're not going to convert to another encoding just because
it is proclaimed a standard; we had this experience with ISO/IEC 8859-5.

Another thing is more procedural.

From my side I see that you're so in haste to make new standard out,
that you forget about standardizing goals. What do you aim for?

The previous standard was controversial, and the only reason it gained
acceptance was the lack of any other standard for too long. Now you seem
to be trying to get a modern one out as soon as possible, at any price.
If you start a controversy yourself, it doesn't matter, "make haste." If
you revert previously standardized practice, it doesn't matter, "I don't
know anyone using it, no practical value, make haste."

I _am_ purely practical in regard to Forth. I don't have much time to
participate in the standards process; I do use and want to continue using
Forth in practice, but this is becoming tiresome. You started the FORGET
controversy by removing it unilaterally from gforth without paying any
attention to whether anyone uses it; I had to adapt a good deal of code and
my work process to that. Now you want to strip it from the standard despite
the voices against. Sure, of course, "make haste," 2009 is coming to an end.

I want to remind you that you misdesigned MARKER in gforth, and this
made me waste about a week finding where the actual bug lies. Thus
you can't argue on the MARKER vs. FORGET issue, because you don't use
either of them.

Now you're going to do the same with CHARS, another feature I actively
use. What the hell? Can you play somewhere else? I use it consistently
with other languages I use, "text is text, raw data is data, data may
be text but not necessarily so."

I've just realized one more point against your character set proposals.
You assert that your "xchars" are compatible with regular CHARS or, to
be more precise, octets. Then dump this shit altogether; it doesn't
belong in the standard. Provide it as a public domain library. If we find
many users after 5 years of usage, we may return to standardizing it.
Standardize those features that may lead to incompatibilities. Oh, you
will need octet access for your UTF-8 library. This is the very part to
standardize, because it has more uses than just for UTF-8 lovers and may
lead to incompatibilities, since 1 CHARS may already be more than 1, and
it's been so for 15 years.

--
HE CE3OH...

Ed

Sep 24, 2009, 7:04:51 AM

Anton Ertl wrote:
> "Ed" <nos...@invalid.com> writes:
> >Anton Ertl wrote:
> >> "Ed" <nos...@invalid.com> writes:
> >> >> Solution
> >> >> ========
> >> >> Introduce a new pseudo-type ("len") into the document of these words
> >> >> to clarify the intent. Replacing the "u" with a "len" should improve
> >> >> the documentation of these words. The words effected are:
> >> >>
> >> >> ...
> >> >> 12.6.1.2143 REPRESENT
> >> >
> >> >I must have missed it. When did "u most significant digits of
> >> >the significand" [of a number] become the length of a string?
> >>
> >> u has always been the length of the buffer in characters in REPRESENT.
> >> That's the only interpretation of the specification that makes any
> >> sense.
> >
> >On the contrary there are many interpretations
>
> But only ones that don't make sense.
>
> Ok, for the moment let us assume that u does not specify the buffer
> length. There is nothing else in the standard which specifies the
> length of that buffer, so how would a standard system know how far it
> might write? And how would a standard program create a buffer whose
> address it could pass to REPRESENT? This gives us two sorts of
> interpretations:

> ...

From my last reply to you on this subject only 23 days ago:

--
What's deplorable is that someone should wish to foist upon all Forth
users an interpretation of REPRESENT they know to be deficient:

2 SET-PRECISION

467.8E CR FE.
470.E0

-INF CR FE.
-I

Here, finite numbers are printed with as many characters as are
needed. But for no apparent reason, non-numbers are truncated
to 2 characters.

That's a very stupid and unusable REPRESENT. It's laughable
that anyone should support it.
--

Bernd Paysan

Sep 24, 2009, 4:32:13 PM

Aleksej Saushev wrote:
> Again. Show me how the need for _non-uniform_ length UTF-8 arises from
> _uniform_ unioctet encoding, which was common practice before UTF-8
> acceptance (leaving the numbers for acceptance outside the scope of
> discussion).

Well, uniform byte-encoding with an encoding jungle was common practice,
plus another non-uniform one/two byte mixed-size encoding jungle for
CJK. I don't want to get into the encoding jungle hell, that's why I
stripped the suggestions from the xchar proposal. The encoding jungle
is legacy, not good practice; the Unix/Linux world has largely abandoned
it. I follow the IETF recommendation, that you should use Unicode
instead of encoding jungles when possible, and especially encode the
Unicode as UTF-8. Therefore, my XCHAR proposal says that every Forth
system "should" (no strong enforcement, it is still standard if it
doesn't) provide UTF-8 as encoding, and all other encodings (apart from
ASCII, which is the required subset) are left as exercise to the
implementer. I don't care, it's not recommended good practice. It's a
legacy. Deal with it as you like. The XCHAR wordset isn't there to
cure cancer.

> Contrary to what you tell, your xchars are _incompatible_ with
> Forth-94: they cannot be copied with CMOVE ("copy u consecutive
> characters"), they cannot be read with READ-FILE ("read u1 consecutive
> characters"), you cannot do anything with them, unless you rely on low
> level representation. How does that make xchars compatible?

xchars are always, and by definition, composed out of characters.
Characters you can fetch and store with C@ and C!, copy with CMOVE, read
from files with READ-FILE, and so on. That's the definition of an
XCHAR: It is composed of one or more characters. If you fail to
understand that, no further discussion is possible.
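
To make the distinction concrete, here is a minimal sketch, assuming 8-bit chars, UTF-8 as the encoding, and BASE set to decimal. UTF8-CODEPOINTS, SRC and DST are hypothetical names chosen for illustration; they are not words from the xchar proposal. The string is copied with CMOVE using its length in chars, and the number of extended characters is derived afterwards by skipping UTF-8 continuation octets (bit pattern 10xxxxxx).

```forth
\ Count the extended characters in a UTF-8 string by counting every
\ char that is not a continuation octet (10xxxxxx).
: UTF8-CODEPOINTS ( c-addr len -- n )
   0 ROT ROT                  \ n c-addr len
   0 ?DO                      \ n c-addr
      DUP I CHARS + C@        \ n c-addr char
      192 AND 128 <>          \ true unless the char is 10xxxxxx
      IF SWAP 1+ SWAP THEN
   LOOP DROP ;

CREATE SRC  195 C, 188 C, 98 C,    \ u-umlaut (octets C3 BC) followed by "b"
CREATE DST  3 CHARS ALLOT

SRC DST 3 CMOVE                    \ copies 3 chars, which hold 2 xchars
DST 3 UTF8-CODEPOINTS .            \ prints 2
```

The copy needs no knowledge of the encoding; only the counting word does.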

[random rambling from a baboon deleted]

MINOS has an example of what the real problem is: there's a "hello world"
MINOS application demo (hello-world.m). It is currently localized into
those seven languages in which I'm good enough to be sufficiently
confident that I'm not writing complete rubbish (Sergey Plis corrected
my slightly misspelled Russian version; Russian and Japanese are
my two worst languages on the list - even my Arabic is better, which
didn't make it onto the list because MINOS currently only supports
left-to-right writing). You can switch between these languages by selecting
their name (localized, of course). You see something similar on
Wikipedia, on the lower left side of each article - the cross-references
to the same article in another language. This is the sort of problem
Unicode solves, and all that encoding-jungle stuff doesn't.

The xchar wordset, however, is general enough that you can
implement your encoding jungle, if you want. Well, unless you want to
mix 8-bit ASCII and some wide-char approach. If you do this, you are
on your own; this won't get standardized. You can use the incompletely
thought-through Forth94 approach to wide characters with CHARS=2* and
CHAR+=2+.

> Another thing is more procedural.
>
> From my side I see that you're so in haste to make new standard out,
> that you forget about standardizing goals. What do you aim for?

Standards are controversial, but as long as you resort to name-calling
instead, your rejection of it can't be taken seriously.

Robert Epprecht

Sep 25, 2009, 7:33:07 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Using CHARS in a program has no practical value. Implementing CHARS
> in a system has the practical value of supporting those few programs
> that actually use CHARS.

As the author of some of the 'few programs' that do use CHARS I would
not have any problem if CHARS gets removed. I used it because I have
tried very hard to write my programs as portably as I could. I did not
know that there is no practical value in using CHARS.

Same applies for CHAR+

Robert Epprecht

Anton Ertl

Sep 25, 2009, 8:16:42 AM

Robert Epprecht <eppr...@solnet.ch> writes:
>I did not
>know that there is no practical value in using CHARS.

The lack of practical value comes from the fact that maintained Forth
systems with "1 CHARS > 1" don't exist in practice.

There was a time when we expected that such systems will exist some
time in the future (and at that time the value of using CHARS was in
being more future-proof), but I no longer expect that.

>Same applies for CHAR+

Yes, of course.

Marcel Hendrix

Sep 25, 2009, 2:27:56 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: RfD: c-addr/len

> Robert Epprecht <eppr...@solnet.ch> writes:
>>I did not
>>know that there is no practical value in using CHARS.

> The lack of practical value comes from the fact that maintained Forth
> systems with "1 CHARS > 1" don't exist in practice.

If unicode ever becomes necessary, I intend to go for 32-bit characters
(4 bytes) to keep things simple. Would there be problems (apart from
a tiny bit enlarged dataspace)?

-marcel

Anton Ertl

Sep 26, 2009, 10:19:54 AM

Most programs have an environmental dependency on 1 CHARS = 1 and
would not work on your system unless you also make the address unit
32-bits (which would cause problems when calling foreign functions).
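
The dependency in question typically looks like the first definition in the sketch below; the second is the form that stays correct for any value of 1 CHARS. This is only an illustration, and the names BUF, NTH-CHAR-DEP and NTH-CHAR are invented for it.

```forth
CREATE BUF  80 CHARS ALLOT          \ room for 80 chars, whatever their size

\ Environmental dependency: a char index is added directly to an
\ address, which is only right when 1 CHARS = 1.
: NTH-CHAR-DEP ( n -- c-addr )  BUF + ;

\ Portable Forth-94 form: scale the index by the char size first.
: NTH-CHAR ( n -- c-addr )  CHARS BUF + ;
```

On the common systems where 1 CHARS = 1 both words return the same address. On a system with 32-bit chars and 8-bit address units, NTH-CHAR-DEP would point into the middle of a char for most indices.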

Also, it's unnecessary to do that. Gforth and others support Unicode
while keeping 8-bit characters by supporting the UTF-8 encoding of
Unicode (which was designed for exactly this purpose). Actually,
other Forth systems already work with UTF-8 in many situations, too.

E.g., I just tried the following on Gforth 0.6.2 (which has no explicit
support for UTF-8 (or xchars)), iForth 2.1.2541, vfxlin 4.30 and
SP-Forth 4.20:

I started an uxterm (alternatively, a recent xterm with an UTF-8 font
and locale should also work), and then executed the following
commands:

#download the UTF-8 example program
wget http://www.complang.tuwien.ac.at/forth/utf8/example.fs
#now run it on the various Forth systems
gforth -e "include example.fs cr bye"
iforth "cr include example.fs cr bye"
vfxlin "cr include example.fs cr bye"
spf4 example.fs CR BYE

It worked on all these Forth systems that AFAIK have no particular
support for UTF-8. Note that the program uses a word name that is not
in ASCII.

The next test was cutting and pasting the program on the command line
rather than including it.

It worked on all systems, but there were some shortcomings:

On Gforth 0.6.2, command-line editing does not work properly for the
non-ASCII characters, but just pasting is fine.

Vfxlin has a similar problem; moreover, it works when the code is
copied after vfxlin was started, but not if it was copied before
(it shows "#" for the non-ASCII characters then; strange).

On iForth, the code looks funny in the command-line editor, but it
works.

On SP-Forth, there is no command-line editing, only backspace. That
worked perfectly, however, which was unexpected. Maybe SP-Forth has
special support for UTF-8. What does not work properly is the error
position indicator if there are non-ASCII characters before or in the
erroneous word.

The point of all that is to show that in most places one deals with
strings without having to know how they break into display characters;
even stuff deep inside a Forth system like the dictionary and file
inclusion (with parsing etc.) just deals with UTF-8 strings like it
deals with plain ASCII strings, and therefore it just works.
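
One reason this works is a property of UTF-8 itself: every octet of a multi-octet sequence has its top bit set, so a scan for an ASCII blank (32) can never stop inside a non-ASCII character. A minimal sketch, assuming 8-bit chars and decimal BASE; FIRST-WORD and LINE are hypothetical names used only for this illustration.

```forth
\ Return the part of the string up to (not including) the first blank.
: FIRST-WORD ( c-addr len -- c-addr len' )
   2DUP 0 ?DO                      \ c-addr len c-addr
      DUP I CHARS + C@ BL = IF     \ blank found at index I
         2DROP I UNLOOP EXIT       \ -- c-addr I
      THEN
   LOOP DROP ;                     \ no blank: the whole string

CREATE LINE  195 C, 188 C, 98 C, 32 C, 120 C,   \ u-umlaut, "b", blank, "x"

LINE 5 FIRST-WORD NIP .            \ prints 3
```

The word never looks at how the octets group into display characters, yet it returns the complete first name, including its two-octet character.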

The only parts in Gforth that had to be changed to support UTF-8 and
other variable-width encodings were command-line editing and error
indication (because there display characters matter). And of course
we also added the xchars wordset so that applications can deal with
extended display characters, too.

Bernd Paysan

Sep 26, 2009, 5:57:56 PM

Marcel Hendrix wrote:
> If unicode ever becomes necessary, I intend to go for 32-bit
> characters (4 bytes) to keep things simple. Would there be problems
> (apart from a tiny bit enlarged dataspace)?

Yes - you will have to recode it when reading and writing files. It
might appear to be "simple", but in reality, you aren't dealing with
Unicode characters, you are much more likely dealing with UTF-8 strings.
Strings, where you rarely care about the individual characters, so the
"complexity" is a straw man.

Aleksej Saushev

Sep 26, 2009, 6:41:00 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> m...@iae.nl (Marcel Hendrix) writes:
>>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: RfD: c-addr/len
>>
>>> Robert Epprecht <eppr...@solnet.ch> writes:
>>>>I did not
>>>>know that there is no practical value in using CHARS.
>>
>>> The lack of practical value comes from the fact that maintained Forth
>>> systems with "1 CHARS > 1" don't exist in practice.
>>
>>If unicode ever becomes necessary, I intend to go for 32-bit characters
>>(4 bytes) to keep things simple. Would there be problems (apart from
>>a tiny bit enlarged dataspace)?
>
> Most programs have an environmental dependency on 1 CHARS = 1 and
> would not work on your system unless you also make the address unit
> 32-bits (which would cause problems when calling foreign functions).

This isn't a dependency on 1 CHARS = 1; they depend on a unioctet character
encoding, i.e. the dependency is stricter. That it works with UTF-8
is mere coincidence.

> #download the UTF-8 example program
> wget http://www.complang.tuwien.ac.at/forth/utf8/example.fs

This test checks quite another thing: it checks that the system is
8-bit clean and accepts codes from the upper half of the code set as letters.
This invalidates your conclusion.

> #now run it on the various Forth systems
> gforth -e "include example.fs cr bye"
> iforth "cr include example.fs cr bye"
> vfxlin "cr include example.fs cr bye"
> spf4 example.fs CR BYE
>
> It worked on all these Forth systems that AFAIK have no particular
> support for UTF-8. Note that the program uses a word name that is not
> in ASCII.
>
> The next test was cutting and pasting the program on the command line
> rather than including it.
>
> It worked on all systems, but there were some shortcomings:
>
> On Gforth 0.6.2, command-line editing does not work properly for the
> non-ASCII characters, but just pasting is fine.
>
> Vfxlin has a similar problem; moreover, it works when the code is
> copied after vfxlin was started, but not if it was copied before
> (it shows "#" for the non-ASCII characters then; strange).
>
> On iForth, the code looks funny in the command-line editor, but it
> works.
>
> On SP-Forth, there is no command-line editing, only backspace. That
> worked perfectly, however, which was unexpected. Maybe SP-Forth has
> special support for UTF-8. What does not work properly is the error
> position indicator if there are non-ASCII characters before or in the
> erroneous word.

So it worked only to the extent that the string was used as a whole;
wherever the text had to be processed, your test failed.

You didn't test anything where the character count matters, for instance
text formatting. Even simple line folding would show that.
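
To make the difference concrete, here is a minimal sketch in standard
Forth (not taken from example.fs). The names utf8-chars and demo are
made up for illustration, and the sketch assumes well-formed UTF-8 and a
system where 1 CHARS = 1. It counts code points by skipping UTF-8
continuation bytes, which is the least a line-folding routine would
need; combining and double-width characters would need more than this.

```forth
\ Hypothetical helper, not part of any standard wordset or of example.fs.
\ Counts code points in a UTF-8 string by skipping continuation bytes
\ (bytes of the form 10xxxxxx). Assumes well-formed UTF-8, 1 CHARS = 1.
decimal
: utf8-chars ( c-addr u -- n )
  0 rot rot              \ n c-addr u
  over + swap            \ n end start
  ?do
    \ 192 = 11000000b, 128 = 10000000b: count every non-continuation byte
    i c@ 192 and 128 <> if 1+ then
  loop ;

\ "naive" with a diaeresis on the i, encoded as UTF-8:
\ the i-diaeresis takes two bytes (195 175), so the string is 6 bytes long.
create demo  110 c, 97 c, 195 c, 175 c, 118 c, 101 c,

demo 6 utf8-chars .      \ prints 5
```

The string is 6 bytes long but has 5 code points, so a folding routine
that takes the byte length as the column count breaks lines too early.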

> The point of all that is to show that in most places one deals with
> strings without having to know how they break into display characters;
> even stuff deep inside a Forth system like the dictionary and file
> inclusion (with parsing etc.) just deals with UTF-8 strings like it
> deals with plain ASCII strings, and therefore it just works.

Sure, because this stuff depends on a single-octet encoding, splitting
text into words at ASCII blanks. And all the failures you noticed you
attributed to a non-working error position indicator and such.
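
A small sketch of that effect, in standard Forth with made-up names
(first-blank, demo2) and assuming 1 CHARS = 1: scanning byte by byte for
an ASCII blank finds the right split point in a UTF-8 string, since
every byte of a multi-byte sequence is 128 or above, but the offset it
returns counts bytes, not characters.

```forth
\ Hypothetical helper for illustration only; assumes 1 CHARS = 1.
\ Returns the byte offset of the first ASCII blank (32), or u if none.
decimal
: first-blank ( c-addr u -- n )
  dup 0 ?do
    over i + c@ 32 = if 2drop i unloop exit then
  loop nip ;

\ "uber x" with a diaeresis on the u, encoded as UTF-8:
\ the u-diaeresis takes two bytes (195 188), so the string is 7 bytes long.
create demo2  195 c, 188 c, 98 c, 101 c, 114 c, 32 c, 120 c,

demo2 7 first-blank .    \ prints 5
```

The blank is the fifth character (index 4), yet the byte offset is 5, so
an error position indicator that uses the byte offset as a column lands
one place too far to the right.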

> The only parts in Gforth that had to be changed to support UTF-8 and
> other variable-width encodings were command-line editing and error
> indication (because there display characters matter). And of course
> we also added the xchars wordset so that applications can deal with
> extended display characters, too.

Thus, you care only about Gforth and nothing else.

--
HE CE3OH...