2009-09-09 Rendered into RfD form, added Forth200x words
1999-06-22 Original Text by John Rible

Problem
=======
A large number of words use "c-add u" to indicate the address of a string (c-addr) and its length (u) on the stack. With the introduction of variable width characters, it is not clear if "u" is referring to the number of characters or address units.

Solution
========
Introduce a new pseudo-type ("len") into the document of these words to clarify the intent. Replacing the "u" with a "len" should improve the documentation of these words. The words effected are:
> 2009-09-09 Rendered into RfD form, added Forth200x words > 1999-06-22 Original Text by John Rible
> Problem >======= > A large number of words use "c-add u" to indicate the address of a
"c-addr u"
> string (c-addr) and its length (u) on the stack. With the > introduction of variable width characters, it is not clear if "u" is > referring to the number of characters or address units.
Er...unless I missed a decision to do away with the distinction between "1 CHARS" and "address units", isn't the ambiguity between "variable width characters" and "characters"? I don't see that this proposal actually clarifies that.
At any rate, I think the definition at 3.1.4.2 Character strings makes it clear that "c-addr u" as a unit means something special, so I don't see any reason to replace the "u" with "len".
> Solution >======== > Introduce a new pseudo-type ("len") into the document of these words > to clarify the intent. Replacing the "u" with a "len" should improve > the documentation of these words. The words effected are:
affected
> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
> A string is specified by a cell pair (c-addr len) representing > its starting address and length in characters.
In 2.1 Definitions of Terms, we have:
character: Depending on context, either 1) a storage unit capable of holding a character; or 2) a member of a character set.
I think that the presence of an address (i.e. the location of some storage) makes it pretty clear that sense 1 is meant here, but if people are confused by that, you might want to clarify.
----
Instead of adopting this (and that "pchar" rename proposal), I think it would make much more sense to clarify things by leaving the existing "char" and "character" alone, and instead adopting new terminology for variable width characters.
As I see it, there's no reason to go changing terminology on people when you could instead just adopt new terminology for the new concept. Much less potential for confusion that way.
Josh Grams wrote: > Instead of adopting this (and that "pchar" rename proposal), I think > it would make much more sense to clarify things by leaving the > existing "char" and "character" alone, and instead adopting new > terminology for variable width characters.
We are pretty much there - the extended characters are called "extended characters", or "xchars" for short. An xchar in memory may consist of several characters (primitive characters, that is). I think it's easier to deal with the name "pchar" when the "storage unit" is meant than to name it "character", but outside the xchar proposal, the terminology is not needed.
The c-addr/len makes life easier as it definitely states that the length is meant to be in characters (pchars), i.e. the storage unit as meant in 2.1. character 1).
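To make the pchar/xchar distinction concrete, here is a minimal sketch in Python rather than Forth. It assumes UTF-8 as the xchar encoding and 8-bit pchars, neither of which this RfD mandates, and the helper name `lengths` is made up for illustration. Under those assumptions the "len" of (c-addr len) corresponds to the second number returned, not the first.

```python
def lengths(text):
    """Return (xchar_count, pchar_count) for text, assuming UTF-8 storage.

    Hypothetical helper: one xchar is modeled as one Unicode code point,
    one pchar as one 8-bit byte of the UTF-8 encoding.
    """
    encoded = text.encode("utf-8")
    return len(text), len(encoded)


# ASCII only, two 2-byte characters, one 3-byte character
for sample in ("Forth", "Gr\u00fc\u00dfe", "\u20ac"):
    xchars, pchars = lengths(sample)
    print(f"{sample!r}: {xchars} xchars, {pchars} pchars")
```

For pure ASCII text the two counts coincide, which is probably why the ambiguity went unnoticed for so long.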
> Problem > ======= > A large number of words use "c-add u" to indicate the address of a > string (c-addr) and its length (u) on the stack. With the > introduction of variable width characters, it is not clear if "u" is > referring to the number of characters or address units.
> Solution > ======== > Introduce a new pseudo-type ("len") into the document of these words > to clarify the intent.
Sorry, but I do not see from your proposal what sort of length "len" denotes: is it "length in characters", "length in logical (multi-byte) characters", or "length in address units"?
> Replacing the "u" with a "len" should improve > the documentation of these words.
In fact, in many cases words are commented as taking and/or leaving ( addr len ) rather than ( c-addr u ), so there is existing practice.
IMO replacing "u" with "len" does improve readability, but does not resolve the "which length" puzzle.
m_l_g3 wrote: > IMO replacing "u" with "len" does improve readability, but does > not resolve the "which length" puzzle.
> au-length, log-length, c-length ?
> CHARS ( c-length -- au-length ) > and so on...
CHARS, as it is now, is:

6.1.0898 CHARS
( n1 -- n2 )
n2 is the size in address units of n1 characters.
IMHO, the stack effect is at least misleading. I find it difficult to get a correct stack effect - we want -1 CHARS to be used to step through strings backwards, so we want the sign. I.e. "len" is not the right left side of this stack effect (len is a subtype of u, no sign). But we basically use CHARS to convert +-len into a +-c-addr offset. Works fine on two's complement, might cause problems on one's complement ;-).
> Sorry, but I do not see from your proposal what sort of length "len" > denotes: is it "length in characters", "length in logical (multi- > byte) > characters", or "length in address units"?
length in primitive characters (bytes).
> In fact, in many cases words are commented as taking and/or > leaving ( addr len ) rather than ( c-addr u ), so there is existing > practice.
Peter Knaggs <p...@bcs.org.uk> writes: > m_l_g3 wrote:
>> Sorry, but I do not see from your proposal what sort of length "len" >> denotes: is it "length in characters", "length in logical (multi- >> byte) >> characters", or "length in address units"?
> length in primitive characters (bytes).
No, length in address units. Byte length is what is returned by "1 chars", consider 4-bit address unit.
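The difference between a count of characters and a count of address units only shows up when the two sizes differ. The following is a rough Python model of CHARS, not Forth and not taken from any system; the parameter names `char_bits` and `au_bits` are illustrative assumptions, and it assumes a character always occupies a whole number of address units.

```python
def chars(n, char_bits=8, au_bits=8):
    """Model of CHARS ( n1 -- n2 ): size in address units of n characters.

    char_bits and au_bits are illustrative parameters, not standard names.
    """
    if char_bits % au_bits != 0:
        raise ValueError("a character must be a whole number of address units")
    return n * (char_bits // au_bits)


print(chars(5))                            # byte-addressed, 8-bit chars
print(chars(5, au_bits=4))                 # nibble-addressed, 8-bit chars
print(chars(5, char_bits=16, au_bits=16))  # word-addressed, 16-bit chars
print(chars(-1, au_bits=4))                # stepping backwards by one char
```

Only the nibble-addressed case gives a result different from its input, so on the other two machines a length in characters and a length in address units are the same number.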
> Sorry, but I do not see from your proposal what sort of length "len" > denotes: is it "length in characters", "length in logical (multi- > byte) > characters", or "length in address units"?
Would it help if we replace item 1, the definition of "len" with:
len length of a character-string in address units 1 cell
>> Sorry, but I do not see from your proposal what sort of length "len" >> denotes: is it "length in characters", "length in logical (multi- >> byte) >> characters", or "length in address units"?
> Would it help if we replace item 1, the definition of "len" with:
> len length of a character-string in address units 1 cell
David N. Williams wrote: > Peter Knaggs wrote: >> m_l_g3 wrote:
>>> Sorry, but I do not see from your proposal what sort of length "len" >>> denotes: is it "length in characters", "length in logical (multi- >>> byte) >>> characters", or "length in address units"?
>> Would it help if we replace item 1, the definition of "len" with:
>> len length of a character-string in address units 1 cell
> Shouldn't that be in characters? (3.1.4.2)
Which type of character? Primitive characters (3.1.3) possibly but you could also interpret characters to be extended characters (XChar) which include variable width characters, which is precisely what we are trying to get away from.
>>>> Sorry, but I do not see from your proposal what sort of length "len"
>>>> denotes: is it "length in characters", "length in logical (multi-byte)
>>>> characters", or "length in address units"?
>>>
>>> Would it help if we replace item 1, the definition of "len" with:
>>>
>>> len length of a character-string in address units 1 cell
>>
>> Shouldn't that be in characters? (3.1.4.2)
>
> Which type of character? Primitive characters (3.1.3) possibly but you
> could also interpret characters to be extended characters (XChar) which
> include variable width characters, which is precisely what we are trying
> to get away from.
I guess whatever character you meant in this:
3. Replace "u" with "len" in 3.1.4.2 Character strings:
A string is specified by a cell pair (c-addr len) representing its starting address and length in characters.
It would be a substantial change if it were to be address units, since 1 CHARS is not necessarily one address unit.
I'm unclear what you intend. Is the meaning of "character string" in the above being changed to allow for extended characters?
> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
> A string is specified by a cell pair (c-addr len) representing > its starting address and length in characters.
The part of the X:key-ekey proposal which was accepted at the Exeter meeting included the following:
3.1.2 Character types
Characters shall have the following properties:
– at least one address unit wide;
– contain at least eight bits;
– be of fixed width;
– have a size less than or equal to cell size;
– be unsigned.

3.1.2.3 Primitive Character
A primitive character (pchar) is a character with no restrictions on its contents. Unless otherwise stated, a “character” refers to a primitive character.
Thus item 3 should be changed to refer to the "length in primitive characters". In this case I feel it probably is worth spelling out.
> It would be a substantial change if it were to be address units, > since 1 CHARS is not necessarily one address unit.
This is part of the problem, what does u mean in CMOVE? According to its definition "copy u consecutive characters", while most people believe it refers to address units.
> I'm unclear what you intend. Is the meaning of "character > string" in the above being changed to allow for extended > characters?
No, but once extended characters are introduced there is the potential for confusion, hence the introduction of a primitive character. Extended characters will always be referenced as "extended character" or xchar, while a "character" is a primitive character or pchar.
> David N. Williams wrote:
>>
>> I guess whatever character you meant in this:
>>
>> 3. Replace "u" with "len" in 3.1.4.2 Character strings:
>>
>> A string is specified by a cell pair (c-addr len) representing
>> its starting address and length in characters.
>
> The part of the X:key-ekey proposal which was accepted at the Exeter
> meeting included the following:
>
> 3.1.2 Character types
> Characters shall have the following properties:
> – at least one address unit wide;
> – contain at least eight bits;
> – be of fixed width;
> – have a size less than or equal to cell size;
> – be unsigned.
>
> 3.1.2.3 Primitive Character
> A primitive character (pchar) is a character with no restrictions
> on its contents. Unless otherwise stated, a “character” refers to
> a primitive character.
>
> Thus item 3 should be changed to refer to the "length in primitive
> characters". In this case I feel it probably is worth spelling out.
Me, too!
>> It would be a substantial change if it were to be address units, >> since 1 CHARS is not necessarily one address unit. > > This is part of the problem, what does u mean in CMOVE? According to > its definition "copy u consecutive characters", while most people > believe it refers to address units.
Not me! :-) MOVE is for that.
>> I'm unclear what you intend. Is the meaning of "character >> string" in the above being changed to allow for extended >> characters? > > No, but once extended characters are introduced there is the potential > for confusion, hence the introduction of a primitive character. Extended > characters will always be referenced as "extended character" or xchar, > while a "character" is a primitive characters or pchar.
>>> Sorry, but I do not see from your proposal what sort of length "len" >>> denotes: is it "length in characters", "length in logical (multi- >>> byte) >>> characters", or "length in address units"?
>> length in primitive characters (bytes).
>No, length in address units.
This proposal replaces "u" with "len" in words where "u" denotes the number of characters.
A change to let this parameter specify a number of address units would break existing standard programs. Granted, there are only a few standard programs that don't have an environmental dependency on 1 CHARS = 1, and all maintained systems support these programs, so there would be little problem with such a change, but I see little point in making it. Better to propose standardizing 1 CHARS = 1.
>Byte length is what is returned by "1 chars", >consider 4-bit address unit.
Yes, nibble-addressed hardware was the original rationale for differentiating between aus and chars, but in 15 years there have been no Forth-94 systems for nibble-addressed hardware, so I consider CHARS a good solution for a problem that does not exist in practice.
Josh Grams <j...@qualdan.com> writes: >Instead of adopting this (and that "pchar" rename proposal), I think it >would make much more sense to clarify things by leaving the existing >"char" and "character" alone, and instead adopting new terminology for >variable width characters.
Yes. And the variable-width characters have a new name: xchars.
>As I see it, there's no reason to go changing terminology on people when >you could instead just adopt new terminology for the new concept. Much >less potential for confusion that way.
Apparently some people are confused because a member of the (extended) character set need not fit into a char, and they think that renaming chars into pchars will help avoid that confusion. I am not convinced of that, but I can live with pchars (although I fear that we will make mistakes in the renaming, which will increase the confusion rather than reducing it).
Peter Knaggs <p...@bcs.org.uk> writes: >c-add/len >=========
>2009-09-09 Rendered into RfD form, added Forth200x words >1999-06-22 Original Text by John Rible
>Problem >======= >A large number of words use "c-add u" to indicate the address of a >string (c-addr) and its length (u) on the stack. With the >introduction of variable width characters, it is not clear if "u" is >referring to the number of characters or address units.
Variable-width characters are introduced in the xchars proposal, they are called xchars there (and can consist of one or more fixed-width chars in memory). Variable-width characters don't exist in the current standard document, and chars don't become variable-width in xchars. It's clear in all words that deal with chars that u refers to the number of chars.
It definitely does not refer to address units in these words (only in MOVE and ERASE, which don't deal with chars), although given that 1 chars = 1 au in all maintained systems, that distinction is of no consequence. Every word that refers to chars says so explicitly, and every word that refers to aus says so explicitly, and if any word in the xchars proposal refers to a number of xchars, it will say so explicitly, too (but I don't think there is such a word).
Examples:
From 17.6.1.0910 CMOVE: |[...] copy u consecutive characters [...]
From 6.1.1900 MOVE: |[...]
>4. Add the following to table 3.5 - Environmental Query Strings:
> /CHARACTER-STRING n yes maximum size of len in characters
What's the point of that?
Any system that cannot deal with strings of the length of the longest data memory region that can be had from the system is broken. And that's not just IMO, but also in Forth-94.
So if the point of that query is to allow systems to not process some of the strings that can be created, then existing standard programs would become non-standard. Such a restriction requires a two-step process of first declaring the feature obsolescent, and eventually removing it. Moreover, I see no point in introducing such a restriction.
If that's not the point of the query, then I see no point in it. If we can process all strings we can create, there is no point in querying for the maximum size.
> 2009-09-09 Rendered into RfD form, added Forth200x words > 1999-06-22 Original Text by John Rible
> Problem > ======= > A large number of words use "c-add u" to indicate the address of a > string (c-addr) and its length (u) on the stack. With the > introduction of variable width characters, it is not clear if "u" is > referring to the number of characters or address units.
> Solution > ======== > Introduce a new pseudo-type ("len") into the document of these words > to clarify the intent. Replacing the "u" with a "len" should improve > the documentation of these words. The words effected are:
> ... > 12.6.1.2143 REPRESENT
I must have missed it. When did "u most significant digits of the significand" [of a number] become the length of a string?
Anton Ertl wrote: > Josh Grams <j...@qualdan.com> writes: >>Instead of adopting this (and that "pchar" rename proposal), I think it >>would make much more sense to clarify things by leaving the existing >>"char" and "character" alone, and instead adopting new terminology for >>variable width characters.
> Yes. And the variable-width characters have a new name: xchars.
>>As I see it, there's no reason to go changing terminology on people when >>you could instead just adopt new terminology for the new concept. Much >>less potential for confusion that way.
> Apparently some people are confused because a member of the (extended) > character set need not fit into a char, and they think that renaming > chars into pchars will help avoid that confusion. I am not convinced > of that, but I can live with pchars (although I fear that we will make > mistakes in the renaming, which will increase the confusion rather > than reducing it).
"Ed" <nos...@invalid.com> writes: >> Solution >> ======== >> Introduce a new pseudo-type ("len") into the document of these words >> to clarify the intent. Replacing the "u" with a "len" should improve >> the documentation of these words. The words effected are:
>> ... >> 12.6.1.2143 REPRESENT
>I must have missed it. When did "u most significant digits of >the significand" [of a number] become the length of a string?
u has always been the length of the buffer in characters in REPRESENT. That's the only interpretation of the specification that makes any sense.
>2009-09-09 Rendered into RfD form, added Forth200x words >1999-06-22 Original Text by John Rible
>Problem >======= >A large number of words use "c-add u" to indicate the address of a >string (c-addr) and its length (u) on the stack. With the >introduction of variable width characters, it is not clear if "u" is >referring to the number of characters or address units.
>Solution >======== >Introduce a new pseudo-type ("len") into the document of these words >to clarify the intent. Replacing the "u" with a "len" should improve >the documentation of these words. The words effected are:
I use the word "sc" in my documentation for the pair. It means string-constant. It implies that the word using it must not reach through to the "c-addr" and change characters there. (So e.g. /STRING is okay.)
Anyway, I'm in favour of using a single indication of the pair whenever they cannot be logically separated. This allows for a full explanation of "sc" at one place, instead of limited explanations regarding address units/ character units at several places. Maybe a distinction between "sc" and "xsc" is in order.
-- -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- being exponential -- ultimately falters. albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
>>>> Sorry, but I do not see from your proposal what sort of length "len" >>>> denotes: is it "length in characters", "length in logical (multi- >>>> byte) >>>> characters", or "length in address units"?
>>> length in primitive characters (bytes).
>>No, length in address units.
> This proposal replaces "u" with "len" in words where "u" denotes the > number of characters.
> A change to let this parameter specify a number of address units would > break existing standard programs. Granted, there are only few > standard programs that don't have an environmental dependency on > 1 CHARS = 1, and all maintained systems support these programs, so > there would be little problem with such a change, but I see little > point in having such a change. Better propose standardizing > 1 CHARS = 1.
You're not consistent in your opinion that we should use UNICODE: either 1 CHARS = 1, and you use one-octet encodings on octet-addressing platforms, or 1 CHARS may be any other value, and you return to address units, which are octets in many cases. The third way is decoupling Forth from hardware in full, so that you don't deal with real CPU address units at all.
Aleksej Saushev wrote: > You're not consistent in your opinion that we should use UNICODE: > either 1 CHARS = 1, and you use one-octet encodings on > octet-addressing platforms, or 1 CHARS may be any other value, and you > return to address units, which are octets in many cases. The third way > is decoupling Forth from hardware in full, so that you don't deal with > real CPU address units at all.
"Unicode" is not just one encoding. You can have an ASCII-compatible byte-encoding like UTF-8 (which is what I recommend for Forth with Unicode), or UTF-16, which is still a variable length encoding (one or two 16-bit words make a character, i.e. you still need the XCHAR wordset to work with UTF-16), or UCS4, which will be fixed-size, but is quite wasteful.
Except for a few experiments, all Forth systems have 1 CHARS = 1. Most programs rely on that as well (i.e. they don't use CHARS where they should, and often they also use 1+ or so instead of CHAR+).
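The trade-off between the three encodings mentioned in this post can be checked with a short Python sketch (Python rather than Forth; the function name `encoded_units` is made up for illustration, and UTF-32 is used as a stand-in for UCS4). It counts how many fixed-width code units each encoding needs for the same text.

```python
def encoded_units(text):
    """Return the number of code units text needs in three Unicode encodings.

    Hypothetical helper: units are 8-bit for UTF-8, 16-bit for UTF-16
    and 32-bit for UTF-32 (standing in for UCS4).
    """
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")) // 2,
        "utf32": len(text.encode("utf-32-le")) // 4,
    }


# ASCII letter, euro sign, and a character outside the 16-bit range
for sample in ("A", "\u20ac", "\U0001F600"):
    print(repr(sample), encoded_units(sample))
```

Only the 32-bit form uses exactly one unit per character for all three samples; the other two are variable-width, which is the case the XCHAR wordset is meant to handle.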
Bernd Paysan <bernd.pay...@gmx.de> writes: > Aleksej Saushev wrote: >> You're not consistent in your opinion that we should use UNICODE: >> either 1 CHARS = 1, and you use one-octet encodings on >> octet-addressing platforms, or 1 CHARS may be any other value, and you >> return to address units, which are octets in many cases. The third way >> is decoupling Forth from hardware in full, so that you don't deal with >> real CPU address units at all.
> "Unicode" is not just one encoding. You can have an ASCII-compatible > byte-encoding like UTF-8 (which is what I recommend for Forth with > Unicode), or UTF-16, which is still a variable length encoding (one or > two 16-bit words make a character, i.e. you still need the XCHAR wordset > to work with UTF-16), or UCS4, which will be fixed-size, but is quite > wasteful.
> Except a few experiments, all Forth systems have 1 CHARS = 1. Most > programs rely on that, as well (i.e. they don't use CHARS where they > should, often, they also don't use CHAR+ but 1+ or so).
Again internal inconsistency. If you want 1 CHARS = 1 always, then you should get rid of it and assume that you address bytes/characters or octets, whatever you decide. You return to the way C took. Then you won't need any conversion of code to use wide characters &c.
So, what is the point in dragging this "CHARS" stuff?
This brings another problem of Standard Forth: lack of internal consistency. You either have overengineered parts, impractical parts, or lack of standard tools to solve every day practical tasks (like reading non-textual streams).
Could you and Anton decide for yourself what you really want and stick to it? Because as for now you easily jump from 1 CHARS being able to hold a byte, i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being addressable unit like it is in C.
Each variant has a right to exist and has its own consequences. If you decide 1 CHARS = 1, then how do I access address units? Octets? If you decide 1 CHARS is to be byte width, how do I read a non-textual file?
P.S. Most of UNIX text processing programs use "char" and don't care of locales still, but there's some kind of general consensus that they should be converted. So what's your argument about? I don't understand it.
Again, you overengineer the standard in a domain nobody has much experience with, and skip fixing defects that affect practical everyday tasks.
Aleksej Saushev <a...@inbox.ru> writes: >an...@mips.complang.tuwien.ac.at (Anton Ertl) writes: >> Granted, there are only few >> standard programs that don't have an environmental dependency on >> 1 CHARS = 1, and all maintained systems support these programs, so >> there would be little problem with such a change, but I see little >> point in having such a change. Better propose standardizing >> 1 CHARS = 1.
>You're not consistent in your opinion that we should use UNICODE: >either 1 CHARS = 1, and you use one-octet encodings on octet-addressing >platforms,
Yes, that's the way things work without xchars. With xchars, you can use variable-width encodings like UTF-8, and UTF-8 is compatible with 8-bit chars.
>or 1 CHARS may be any other value, and you return to address >units, which are octets in many cases.
And? The words where u refers to the number of characters still deal with u chars, not u address units.
Aleksej Saushev <a...@inbox.ru> writes: >If you want 1 CHARS = 1 always, then you >should get rid of it
rid of what?
> and assume that you address bytes/characters or >octets, whatever you decide.
On word-addressed machines 1 CHARS = 1, but a character is not a byte or octet, but a word.
>So, what is the point in dragging this "CHARS" stuff?
It's in the current standard and nobody (not even you) has submitted an RfD for making it obsolescent.
>This brings another problem of Standard Forth: lack of internal consistency. >You either have overengineered parts, impractical parts, or lack of standard >tools to solve every day practical tasks (like reading non-textual streams).
>Could you and Anton decide for yourself what you really want and stick to it? >Because as for now you easily jump from 1 CHARS being able to hold a byte, >i.e. real character be it 32-bit wide or octet-wide, to 1 CHARS being >addressable unit like it is in C.
I can only guess what you mean here, but maybe the following can clear things up: A char is a fixed-width memory unit, and on byte-addressed machines it is a byte in all maintained systems. There are also xchars (in the xchars proposal); they have a variable-width representation in memory, i.e., each xchar is stored in one or more chars. The "len" in this proposal always refers to the number of chars, not to the number of xchars.
>Each variant has right to exist and has its own consequences. >If you decide 1 CHARS = 1, then how I access address units?
Easy in that case: c@ and c!
> Octets?
No octets in the standard yet. If you have a Forth system on a word-addressed machine, you have to use system-specific code to deal with octets.
>If you decide 1 CHARS to be byte width, how do I read non-textual file?
Use BIN.
A more interesting case is word-addressed machines: How should they deal with BIN? But I guess if the people implementing and programming on such systems feel the need for standardization in this regard, they will come forward and start discussing it.
>P.S. Most of UNIX text processing programs use "char" and don't care of >locales still, but there's some kind of general consensus that they >should be converted.
Converted to what? Consensus among whom?
> So what's your argument about? I don't understand it.