20060822 Updated solution section.
20060821 First draft.
Rationale
=========
Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
in the Forth 200x internationalisation and XCHAR proposals.
Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very similar
operations. S\" behaves like S", but uses the '\' character as an escape
character for the entry of characters that cannot be used with S".
This technique is widespread in languages other than Forth.
It has benefit in areas such as
1) construction of multi line strings for display by operating system
services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.
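For example, an HTTP-style header block (items 1 and 2 above) could be
written directly as a single string literal. This is only a sketch,
using the escapes described later in this proposal; \m denotes a CR/LF
pair:

    s\" HTTP/1.1 200 OK\mContent-Type: text/plain\m\m"  \ -- caddr u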
The majority of current Forth systems contain code, either in the kernel
or in application code, that assumes char=byte=au. To avoid breaking
existing code, we have to live with this practice.
Considerations
--------------
We are trying to integrate several issues:
1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality
Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the impact
of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel with 1..4
byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.
The basis of the current approach is to use the terminology of primitive
characters and extended characters. A primitive character (called a
pchar here) is a fixed-width unit handled by EMIT and friends. It
corresponds to the current ANS definition of a character. An extended
character (called an xchar here) consists of one or more primitive
characters and represents the encoding for a "display unit". A string is
represented by caddr/len in terms of primitive characters.
The consequences of this are:
1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display unit,
but may have several additional comms channels. The impact of a
keyboard driver having to convert Chinese or Russian characters into
a (say) UTF-8 sequence is minimal compared to handling the key stroke
sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they are
described as strings.
5) No conflict arises with the XCHARs proposal.
Multiple encodings can be handled if they share a common primitive
character size - nearly all of these are described in terms of octets:
TCP/IP, UTF-8, UTF-16, UTF-32, ...
The XCHARs proposal can be used to handle extended characters on the
stack. XEMIT and friends allow us to handle some additional odd-ball
requirements such as 9-bit control characters, e.g. for the MDB bus used
by vending machines.
Solution
--------
To ease discussion we refer to characters handled by C@, C! and friends
as "primitive characters" or pchars. Characters that may be wider than a
pchar are called "extended characters" or xchars. These are compatible
with the XCHARs proposal. This proposal does not require systems to
handle xchars, but does not disenfranchise those that do.
S\" is used like S" but treats the '\' character specially. One or more
characters after the '\' indicate what is substituted. The following
list is what is currently available in the Forth systems surveyed.
\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10) - for HTML etc.
\n newline - CRLF for Windows/DOS, LF for Unices
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the first
non-hex character
\\ backslash itself
\ before any other character represents that character
The following three of these cause parsing and readability problems. As
far as I know, requiring characters to come in 8 bit units will not
upset any systems. Systems with characters less than 7 bits are non-
compliant, and I know of no 7 bit CPUs. All current systems use
character units of 8 bits or more.
\[0-7]+ Octal numerical character value, finishes at the first
non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the first
non-hex character
Why do we need two representations, both of variable length? This
proposal selects the hexadecimal representation, requiring two hex
digits. A consequence of this is that xchars must be represented as a
sequence of pchars. Although initially seen as a problem by some people,
it avoids at least the following problems:
1) Endian issues when transmitting an xchar, e.g. big-endian host to
little-endian comms channel
2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
system.
3) Ambiguity in distinguishing the end of the number from a
following character such as '0' or 'A'.
At least one system (Gforth) already supports UTF-8 as its native
character set, and one system (JaxForth) used UTF-16. These systems are
not affected.
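As an illustration of the fixed two-digit form (not part of the
normative text; hex digits are shown in upper case here):

    s\" \x12AB"         \ 3 pchars: $12 followed by the literal 'A' 'B'
    s\" \xE2\x82\xAC"   \ the Euro sign U+20AC as its three UTF-8 pchars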
\ before any other character represents that character
This is an unnecessary general case, and so is not mandated. By making
it an ambiguous condition, we do not disenfranchise existing
implementations, and leave the way open for future extensions.
Proposal
========
6.2.xxxx S\"
s-slash-quote CORE EXT
Interpretation:
Interpretation semantics for this word are undefined.
Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by " (double-quote), using the translation rules
below. Append the run-time semantics given below to the current
definition.
Translation rules:
Characters are processed one at a time and appended to the compiled
string. If the character is a '\' character it is processed by
parsing and substituting one or more characters as follows:
\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10)
\n implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\xAB A and B are hexadecimal digits. The resulting character is
the value of the two-digit hexadecimal number AB.
\\ backslash itself
\ before any other character constitutes an ambiguous
condition.
Run-time: ( -- c-addr u )
Return c-addr and u describing a string consisting of the translation
of the characters ccc. A program shall not alter the returned string.
See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"
Ambiguous conditions occur:
If a hex value is more than two characters
If \x is not followed by two hexadecimal characters
Reference Implementation
========================
(as yet untested)
Taken from the VFX Forth source tree and modified to remove most
implementation dependencies. Assumes the use of the # and $ numeric
prefixes to indicate decimal and hexadecimal respectively.
Another implementation (with some deviations) can be found at
http://b2.complang.tuwien.ac.at/cgi-bin/viewcvs.cgi/*checkout*/gforth/quotes.fs?root=gforth
decimal
: PLACE \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
2dup 2>r \ write count last
1 chars + swap move
2r> c! \ to avoid in-place problems
;
: $, \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
dup >r
here place
r> 1 chars + allot
align
;
: addchar \ char string --
\ *G Add the character to the end of the counted string.
tuck count + c!
1 swap c+!
;
: append \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
>r
tuck r@ count + swap cmove \ add source to end
r> c+! \ add length to count
;
: extract2H \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the string,
\ ** returning the remaining string and the converted number.
base @ >r hex
0 0 2over >number 2drop drop
>r 2 chars /string r>
r> base !
;
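\ For example (illustration only):  s" 1FQ" extract2H
\ leaves the remaining string "Q" and the value $1F. Note that a longer
\ run of hex digits would all be converted by >NUMBER while only two
\ characters are skipped, which is why more than two digits after \x
\ is an ambiguous condition.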
create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
7 c, \ \a
8 c, \ \b
char c c, \ \c
char d c, \ \d
#27 c, \ \e
#12 c, \ \f
char g c, \ \g
char h c, \ \h
char i c, \ \i
char j c, \ \j
char k c, \ \k
#10 c, \ \l
char m c, \ \m
#10 c, \ \n (Unices only)
char o c, \ \o
char p c, \ \p
char " c, \ \q
#13 c, \ \r
char s c, \ \s
9 c, \ \t
char u c, \ \u
#11 c, \ \v
char w c, \ \w
char x c, \ \x
char y c, \ \y
0 c, \ \z
create CRLF$ \ -- addr ; CR/LF as counted string
2 c, #13 c, #10 c,
internal
: addEscape \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
over 0= \ zero length check
if drop exit endif
>r \ -- caddr len ; R: -- dest
over c@ [char] x = if \ hex number?
1 chars /string extract2H r> addchar exit
endif
over c@ [char] m = if \ CR/LF pair?
1 chars /string #13 r@ addchar #10 r> addchar exit
endif
over c@ [char] n = if \ newline?
1 chars /string crlf$ count r> append exit
endif
over c@ [char] a [char] z 1+ within if
over c@ [char] a - EscapeTable + c@ r> addchar
else
over c@ r> addchar
endif
1 chars /string
;
external
: parse\" \ caddr len dest -- caddr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters much as C does. The
\ ** translated string is a counted string at *\i{dest}
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" "
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
dup >r 0 swap c! \ zero destination
begin \ -- caddr len ; R: -- dest
dup
while
over c@ [char] " <> \ check for terminator
while
over c@ [char] \ = if \ deal with escapes
1 /string r@ addEscape
else \ normal character
over c@ r@ addchar 1 /string
endif
repeat then
dup \ step over terminating "
if 1 /string endif
r> drop
;
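\ Usage sketch (hypothetical): translate a string containing escapes
\ into the counted string at PAD and display it:
\   s" alpha\tbeta" pad parse\" 2drop  pad count type
\ would display alpha and beta separated by a tab.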
: readEscaped \ "string" -- caddr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{PAD}.
source >in @ /string tuck \ -- len caddr len
pad parse\" nip
- >in +!
pad
;
: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
readEscaped count state @ if
compile (s") $,
then
; IMMEDIATE
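A brief usage sketch (the proposal leaves interpretation semantics
undefined; this reference code happens to return a transient string in
PAD when interpreting):

    s\" col1\tcol2\n" type                     \ interpreted use
    : banner   s\" *** Hello ***\n" type ;     \ compiled use
    banner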
Test Cases
==========
TBD.
How would the following
s\" \"
be handled? Win32Forth treats incomplete strings
s" incomplete
as being correctly terminated at the cr/lf boundary.
> The following three of these cause parsing and readability problems. As
> far as I know, requiring characters to come in 8 bit units will not
> upset any systems. Systems with characters less than 7 bits are non-
> compliant, and I know of no 7 bit CPUs. All current systems use
> character units of 8 bits or more.
>
> \[0-7]+ Octal numerical character value, finishes at the first
> non-octal character
> \x[0-9a-f]+ Hex numerical character value, finishes at the first
> non-hex character
>
> Why do we need two representations, both of variable length? This
> proposal selects the hexadecimal representation, requiring two hex
> digits. A consequence of this is that xchars must be represented as a
> sequence of pchars. Although initially seen as a problem by some people,
> it avoids at least the following problems:
>
> 1) Endian issues when transmitting an xchar, e.g. big-endian host to
> little-endian comms channel
> 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
> system.
> 3) Does not have problems in distinguishing the end of the number from a
> following character such as '0' or 'A'.
>
> At least one system (Gforth) already supports UTF-8 as it's native
> character set, and one system (JaxForth) used UTF-16. These systems are
> not affected.
>
I'm confused by the previous, and how to terminate an octal or hex
string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?
[snipped]
The current definition of s" does not define what happens in this
circumstance. Consequently this proposal does not define this
condition either. Your solution would be just as valid for s\" as s".
I find it moderately interesting that the rather standard \<newline> is
not included. Traditionally this means: ignore the line break.
>> \[0-7]+ Octal numerical character value, finishes at the first
>> non-octal character
>> \x[0-9a-f]+ Hex numerical character value, finishes at the first
>> non-hex character
>>
>> Why do we need two representations, both of variable length? This
>> proposal selects the hexadecimal representation, requiring two hex
>> digits. A consequence of this is that xchars must be represented as a
>> sequence of pchars. Although initially seen as a problem by some people,
>> it avoids at least the following problems:
>>
>> 1) Endian issues when transmitting an xchar, e.g. big-endian host to
>> little-endian comms channel
>> 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
>> system.
>> 3) Does not have problems in distinguishing the end of the number from a
>> following character such as '0' or 'A'.
>
> I'm confused by the previous, and how to terminate an octal or hex
> string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?
This is a problem of the existing solutions. This proposal suggests that
\x should be followed by only two characters. Thus your \x12AB would
produce the sequence 12, 'A', and 'B'.
That would be a useful enhancement; but perhaps \c might be clearer,
as it differentiates between a silent space as in \<newline> and
\ <newline>, and permits comments.
s\" abcdefg\c  \ continue on a new line
    hijklmn"   \ strip leading blanks & catenate, giving
abcdefghijklmn
>
>
>
> >> \[0-7]+ Octal numerical character value, finishes at the first
> >> non-octal character
> >> \x[0-9a-f]+ Hex numerical character value, finishes at the first
> >> non-hex character
>
> >> Why do we need two representations, both of variable length? This
> >> proposal selects the hexadecimal representation, requiring two hex
> >> digits. A consequence of this is that xchars must be represented as a
> >> sequence of pchars. Although initially seen as a problem by some people,
> >> it avoids at least the following problems:
>
> >> 1) Endian issues when transmitting an xchar, e.g. big-endian host to
> >> little-endian comms channel
> >> 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
> >> system.
> >> 3) Does not have problems in distinguishing the end of the number from a
> >> following character such as '0' or 'A'.
>
> > I'm confused by the previous, and how to terminate an octal or hex
> > string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?
>
> This is a problem of the existing solutions. This proposal suggests that
> \x should be followed by only two characters. Thus your \x12AB would
> produce the sequence 12, 'A', and 'B'.
Ah, thanks, clear.
--
Regards
Alex McDonald
>How would the following
>
> s\" \"
>
>be handled? Win32Forth treats incomplete strings
>
> s" incomplete
It's a badly formed string, and so ambiguous. I've added this to the
ambiguous conditions list.
>I'm confused by the previous, and how to terminate an octal or hex
>string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?
This was part of the discussion, so we define \xABcdef as generating
the primitive character AB and cdef is then parsed.
The octal notation is not specified in the normative part of the
proposal.
Stephen
--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
That's what the standard prescribes in Section 3.4.1:
|[If no delimiter character is present], the string continues up to
|and including the last character in the parse area, and the number in
|>IN is changed to the length of the input buffer, thus emptying the
|parse area.
Since the proposal uses the usual "parse ... delimited by ..." idiom,
I expect that it works the same way, modulo not interpreting the " in
\" as delimiter. Maybe this could be made clearer in the proposal.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/
No existing practice in Forth.
> but perhaps \c might be clearer,
>as it differentiates between a silent space as in \<newline> and \
><newline> and permits comments.
>
>s\" abcdefg\c \ continue on a new line
> hijklmn" \ blank strip leading & catenate for
>abcdefghijklmn
In C one can construct a longer literal string by writing two adjacent
literal strings, separated only by white space and comments, e.g.:
#include <stdio.h>

int main()
{
printf("hello, " /* comment */
"world");
return 0;
}
Note that this allows a little more flexibility about where the string
starts in the next line. Inspired by this, we could do it in Forth
with words like +" and +\", which would extend a string started with
S" or S\". But there is no existing practice for that either, so it is
not for this proposal.
Pretty good. There's always room for improvement:
- Test cases should be added before the CfV.
- I guess that you want \xAB to represent a (primitive) character.
This does not come out clearly (actually, if there was no mention of
XCHARS and definition of "primitive characters" in the informative
sections, this would be clearer).
- It seems that the detailed description of an existing solution in
the "Solution" section is confusing, because it is very similar to the
proposal, but still different. Better leave it away and just mention
the issues (like fixed-length vs. variable-length \x) in a discussion
section.
>- Test cases should be added before the CfV.
Volunteer? You? The gForth test suite?
>- I guess that you want \xAB to represent a (primitive) character.
>This does not come out clearly (actually, if there was no mention of
>XCHARS and definition of "primitive characters" in the informative
>sections, this would be clearer).
Given the problems with the definition of char throughout the
document, the definition of char in terms of primitive characters
*has* to be done in a different section of the document.
For example, if char=16 bits on a byte-addressed machine, there
is no way for a standard program to write a byte to a file!
If you use a variable width character set such as UTF-8, what does
CMOVE mean?
The only practical solutions I see are
a) define char=byte
b) define char=implementation defined unit
Given the amount of code that currently assumes char=byte=au, the
least code breakage and maximum instant compliance is to replace
"char" in the document by "primitive char" ("pchar") and then to
define "extended char" ("xchar") in terms of pchars. The vast
majority of systems can then happily impose char=byte=au.
>- It seems that the detailed description of an existing solution in
>the "Solution" section is confusing, because it is very similar to the
>proposal, but still different. Better leave it away and just mention
>the issues (like fixed-length vs. variable-length \x) in a discussion
>section.
Revamped and posted separately.
>21 August 2006, Stephen Pelc
Here's the latest version
Stephen
RfD - S\" and quoted strings with escapes
21 August 2006, Stephen Pelc
20070712 Redrafted non-normative portions.
20060822 Updated solution section.
20060821 First draft.
Rationale
=========
Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as dicussed
in the Forth 200x internationalisation and XCHAR proposals.
Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very
similar operations. S\" behaves like S", but uses the '\' character
as an escape character for the entry of characters that cannot be
used with S".
This technique is widespread in languages other than Forth.
It has benefit in areas such as
1) construction of multiline strings for display by operating
system services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.
The majority of current Forth systems contain code, either in the
kernel or in application code, that assumes char=byte=au. To avoid
breaking existing code, we have to live with this practice.
The following list describes what is currently available in the
surveyed Forth systems that support escaped strings.
\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10) - for HTML etc.
\n newline - CRLF for Windows/DOS, LF for Unices
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character
\\ backslash itself
\ before any other character represents that character
Considerations
--------------
We are trying to integrate several issues:
1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality
Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the
impact of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel
with 1..4 byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.
The basis of the current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here) is a fixed-width unit handled by EMIT and
friends as well as C@, C! and friends. A pchar corresponds to the
current ANS definition of a character. Characters that may be
wider than a pchar are called "extended characters" or xchars.
The xchars are an integer multiple of pchars. An xchar consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.
The consequences of this are:
1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display
unit, but may have several additional comms channels. The
impact of a keyboard driver having to convert Chinese or Russian
characters into a (say) UTF-8 sequence is minimal compared to
handling the key stroke sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they
are described as strings.
5) No conflict arises with the XCHARs proposal.
Multiple encodings can be handled if they share a common primitive
character size - nearly all encodings are described in terms of
octets, e.g. TCP/IP, UTF-8, UTF-16, UTF-32, ...
Approach
--------
This proposal does not require systems to handle xchars, and does
not disenfranchise those that do.
S\" is used like S" but treats the '\' character specially. One
or more characters after the '\' indicate what is substituted.
As far as I know, requiring characters to come in 8 bit units will
not upset any systems. Systems with characters of fewer than 7 bits
are non-compliant, and I know of no 7 bit CPUs. All current systems
use character units of 8 bits or more.
Of observed current practice, the following two escapes cause
parsing and readability problems.
\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character
Why do we need two representations, both of variable length?
This proposal selects the hexadecimal representation, requiring
two hex digits. A consequence of this is that xchars must be
represented as a sequence of pchars. Although initially seen as a
problem by some people, it avoids at least the following problems:
1) Endian issues when transmitting an xchar, e.g. big-endian host
to little-endian comms channel
2) Issues when an xchar is larger than a cell, e.g. UTF-32 on
a 16 bit system.
3) Ambiguity in distinguishing the end of the number from a
following character such as '0' or 'A'.
At least one system (Gforth) already supports UTF-8 as its native
character set, and one system (JaxForth) used UTF-16. These systems
are not affected.
Proposal
========
Labelling
=========
ENVIRONMENT? impact
name stack conditions
Ambiguous conditions occur:
If a hex value is more than two characters
If \x is not followed by two hexadecimal characters
If the string is incorrectly formed
Reference Implementation
========================
(as yet untested)
Taken from the VFX Forth source tree and modified to remove most
implementation dependencies. Assumes the use of the # and $ numeric
prefixes to indicate decimal and hexadecimal respectively.
decimal
Test Cases
==========
TBD.
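A possible starting point (an untested sketch, assuming the usual
T{ ... -> ... }T test harness, char = 8 bits, and a system that also
provides interpretation semantics for S\" as the surveyed systems do):

T{ s\" \x41" drop c@ -> 65 }T    \ \x41 is the single pchar 'A'
T{ s\" \x41" nip -> 1 }T         \ and the string is one pchar long
T{ s\" \q" drop c@ -> 34 }T      \ \q is a double quote
T{ s\" a\\b" nip -> 3 }T         \ \\ yields a single backslash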
--
> ... we define \xABcdef as generating
> the primitive character AB and cdef is then parsed.
...
How un-Forthlike! 'cdef' isn't space delimited. This is likely to create
hard-to-find bugs. It would be better to barf over it.
Jerry
--
Engineering is the art of making what you want from things you can get.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Space delimited where exactly?
abcd\xABdef
abcd\xAB def
abcd \xAB def
abcd \xABdef
Only the first and third are consistent, and only the first is what the
author intended; no spaces. I don't think Stephen meant parsed in the
sense of parsed and compiled.
--
Regards
Alex McDonald
> 21 August 2006, Stephen Pelc
> 20060822 Updated solution section.
> 20060821 First draft.
[..]
> \a BEL (alert, ASCII 7)
> \b BS (backspace, ASCII 8)
What is supposed to *happen* (if anything) when the programmer
does
S\" \e[7m\a\m\t\tHello, world!\e[27m" TYPE ,
or writes the string to mass storage for later use etc.?
[..]
-marcel
Whatever your output device does with a string that contains these
device dependent control sequences. Storing it (presumably for later
use) doesn't change the fact that s\" doesn't specify (and neither does
s") the intent or meaning of the string. Did you have some other
perspective on this?
--
Regards
Alex McDonald
How do you feel about Greg Bailey's suggestion of some years back that
we introduce the data type 'byte' or 'octet' with a small set of
operators to handle explicitly 8-bit units? That's sort of moving in
the opposite direction from what you suggest, but seems an equally valid
approach, I think. Greg's solution leaves everything regarding chars in
place, while introducing a new opportunity for situations in which you
need exactly 8 bits (e.g. comms, I/O).
Cheers,
Elizabeth
--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310-491-3356
5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454
Hawthorne, CA 90250
http://www.forth.com
"Forth-based products and Services for real-time
applications since 1973."
==================================================
> Marcel Hendrix wrote:
>> Peter Knaggs <pkn...@bournemouth.ac.uk> writes Re: RfD: Escaped Strings
>>> 21 August 2006, Stephen Pelc
>>> 20060822 Updated solution section.
>>> 20060821 First draft.
>> [..]
>>> \a BEL (alert, ASCII 7)
>>> \b BS (backspace, ASCII 8)
>> What is supposed to *happen* (if anything) when the programmer
>> does
>> S\" \e[7m\a\m\t\tHello, world!\e[27m" TYPE ,
>> or writes the string to mass storage for later use etc.?
[..]
> Whatever your output device does with a string that contains these
> device dependent control sequences.
The proposal might suggest that a programmer has now a guaranteed
way to cause some hitherto impossible (at least in standard Forth) output
actions to happen (like tab, line-down, begin-of-line, erasing (?) backspace,
ESC sequences on a VT100 terminal, etc.)
> Storing it (presumably for later
> use) doesn't change the fact that s\" doesn't specify (and neither does
> s") the intent or meaning of the string. Did you have some other
> perspective on this?
What happens when an S\" string is written to a file opened with
R/W or R/W BIN and then read back?
-marcel
I suppose S\" \n" TYPE would need to be defined as the equivalent of CR,
and that use of other escaped strings would have an environmental
dependency. The same is true of EMIT, which I'm sure many use for that
purpose.
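For example, under that reading the two phrases below would produce the
same output (a sketch only, with the environmental dependency noted
above):

    s\" first\nsecond" type
    s" first" type cr s" second" type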
>
>> Storing it (presumably for later
>> use) doesn't change the fact that s\" doesn't specify (and neither does
>> s") the intent or meaning of the string. Did you have some other
>> perspective on this?
>
> What happens when an S\" string is written to a file opened with
> R/W or R/W BIN and then read back?
>
> -marcel
>
I would expect standard behaviour; only READ-LINE is allowed to
interpret the characters and look for up to two line terminator
characters (which are implementation defined); ditto for WRITE-LINE.
I would presume that the intention is that \n is the same line
terminator used by READ-LINE and WRITE-LINE; perhaps the proposal needs
to state this explicitly: that S\" \n" WRITE-FILE is the equivalent of
S" " WRITE-LINE.
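In code, that suggested equivalence would read something like the
following sketch (END-LINE-A and END-LINE-B are hypothetical names):

    : end-line-a  ( fid -- ior )  s\" \n" rot write-file ;
    : end-line-b  ( fid -- ior )  s" " rot write-line ;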
--
Regards
Alex McDonald
> I would presume that the intention is that \n is the same line
> terminator used by READ-LINE and WRITE-LINE; perhaps the proposal needs
> to explicitly state this; that S\" \n" WRITE-FILE the equivalent of S" "
> WRITE-LINE.
This points out another possible problem:
* How many READ-LINEs are needed to read back in S\" \lhello,\mworld!\n\n(fini)\x00" ?
* Will it be the same string that was written out?
-marcel
This would depend on the line terminator for your operating system. In a
system which uses \l as the line terminator I would suggest five lines:
1:
2: hello,\r
3: world!
4:
5: (fini)\x00
While a system which uses \r would have four lines:
1: \lhello,
2: \lworld!
3:
4: (fini)\x00
And in a system which uses \r\l there would also be four lines:
1: \lhello,
2: world!
3:
4: (fini)\x00
In other words the behaviour would be environmentally dependent. This is
no different than in other languages.
I would like to remind people that the point of the standard is not
necessarily to make all standard programs portable between forth
systems, but to allow programmers to be portable. As Elizabeth puts it,
the standard provides a set of entitlements to the programmer, or a set
of assumptions which the programmer is entitled to make about a standard
system.
--
Peter Knaggs
It's your proposal:-) Feel free to be inspired by the tests in Gforth:
s" 123" drop 10 parse-num-x 123 <> throw drop .s
s" 123a" drop 10 parse-num 123 <> throw drop .s
s" x1fg" drop \-escape 31 <> throw drop .s
s" 00129" drop \-escape 10 <> throw drop .s
s" a" drop \-escape 7 <> throw drop .s
\"-parse " s" " str= 0= throw .s
\"-parse \a\b\c\e\f\n\r\t\v\100\x40xabcde" dump
s\" \a\bcd\e\fghijklm\12op\"\rs\tu\v" \-escape-table over str= 0= throw
s\" \w\0101\x041\"\\" name wAA"\ str= 0= throw
s\" s\\\" \\" ' evaluate catch 0= throw
However, given that the current Gforth implementation does not
completely match your proposal, you have to adapt it.
>>- I guess that you want \xAB to represent a (primitive) character.
>>This does not come out clearly (actually, if there was no mention of
>>XCHARS and definition of "primitive characters" in the informative
>>sections, this would be clearer).
>
>Given the problems with the definition of char throughout the
>document, the definition of char in terms of primitve characters
>*has* to be done in a different section of the document.
The definition that the XCHARS proposal makes is that chars are
primitive characters.
>For example, if char=16 bits on a byte-addressed machine, there
>is no way for a standard program to write a byte to a file!
Yes, there is no standard way to deal with bytes. Bytes are not (yet)
a standard concept.
>If you use a variable width character set such as UTF-8, what does
>CMOVE mean?
CMOVE ( from to count -- )
Copy count characters (in your terminology, primitive characters) from
FROM to TO, character by character, starting at the low addresses.
>The only practical solutions I see are
>a) define char=byte
>b) define char=implementation defined unit
>
>Given the amount of code that currently assumes char=byte=au, the
>least code breakage and maximum instant compliance is to replace
>"char" in the document by "primitive char" ("pchar") and then to
>define "extended char" ("xchar") in terms of pchars.
That's what the XCHARS proposal does, except that it says char where
you say pchar, and it says xchar where you sometimes say char.
What do you, as a programmer, expect it to do? The proposal does not
specify anything other than that the string will be equivalent to a
string created via
here 27 c, [char] 7 c, 7 c, 13 c, 10 c, 9 c, 9 c, ( ... ) here over - 1 chars /
What the user output device does when you TYPE this string, in
whatever way it was created, is not defined by the proposal or
anywhere in the Forth-94 standard. You could declare an environmental
dependency on outputting to an ANSI terminal (emulator).
>What happens when an S\" string is written to a file opened with
>R/W or R/W BIN and then read back?
With BIN on both reading and writing I would expect the string to come
back unchanged. With BIN missing on both, the \m might be changed to
something else. With BIN missing on exactly one of them, pretty much
anything goes.
Yes.
>This points out another possible problems:
> * How many READ-LINEs are needed to read back in S\" \lhello,\mworld!\n\n(fini)\x00" ?
That's implementation-defined, just like the result of -1 3 /. I
would expect at least 3: up to the first \n, from the first to the
second, and the rest.
Note that S\" does not make a difference here. You could create the
same file in other ways.
> * Will it be the same string that was written out?
Obviously, the READ-LINEs will consume the newlines without putting
them in the resulting strings.
>How do you feel about Greg Bailey's suggestion of some years back that
>we introduce the data type 'byte' or 'octet' with a small set of
>operators to handle explicitly 8-bit units? That's sort of moving in
>the opposite direction from what you suggest, but seems an equally valid
>approach, I think. Greg's solution leaves everything regarding chars in
>place, while introducing a new opportunity for situations in which you
>need exactly 8 bits (e.g. comms, I/O).
Greg's solution has merit, especially for word/cell addressed
machines, however the discussion in it indicates that life
isn't that simple unless you use his alternative 2.
Given that nearly all comms and character systems are defined
in bytes, most CPUs are byte-addressed, and many Forth systems
and/or programmers assume char=byte=au, the least effort is to
permit wide characters (xchars) without breaking the assumption
or code.
For those who haven't seen it, Greg's proposal is attached below.
Stephen
====================================
From: Greg Bailey [greg at minerva dot com]
Sent: Tuesday, June 01, 1999 7:41 PM
To: 'ANSForth real mailgroup'
Cc: 'Localisation and Internationalisation'; 'ark-gvb-i'
Subject: Octet String Prospectus
Problem Statement:
------------------
Most standards defining interoperable data structures, such as for
example those used in networking and cryptography, do so in terms of
sequences of octets. Even in embedded applications, these standards
are increasingly relevant, and indeed supporting them is often a
critical application requirement.
The most commonly encountered computer architectures today address
their memories in units of 8 bit bytes, and Standard Forth
applications have no difficulty in manipulating octet sequences
directly when running on typical systems, with eight bit character
sets, for such machines.
However, such applications are environmentally dependent upon this
common combination in which addresses are in units of bytes or octets,
*and* in which characters are eight bits wide; or upon machines whose
addresses are in units such as 4-bit nibbles which divide 8, and whose
characters are also eight bits wide. On these families of
architectures portable software may manipulate octet sequences by
treating them as
characters.
If, however, either character size or address units are larger
than eight bits, we do not document standard ways of allocating,
manipulating, or performing I/O using sequences of octets.
This proposal provides mechanism that may be used by standard
programs to manipulate sequences of octets on any standard system
which supports it.
(Actual packaging TBD. Should probably be an extension, but if
so it will depend upon presence of the DOUBLE extension; and it
will include additions to the FILE extension if both are present.)
Discussion of common practice and architectural tradeoffs:
----------------------------------------------------------
Many systems and applications have been written for "cell addressed"
machines with 16 bit and larger address units. Many strategies have
been used for addressing characters, which were generally equivalent
to octets, on such machines. In general the hardware does not
directly support linear addressing of bytes, characters, or octets,
so this type of arithmetically usable address has generally been
simulated in
software. The most commonly used strategy has been to multiply the
physical, cell address by the number of octets held within a cell, and
add to this product the relative position of the octet within the
cell, in order to form a linear octet address. Coding strategies for
employing this additional, synthetic address data type depend on the
nature of the underlying CPU. Since there is usually a substantial
performance penalty for using these synthetic addresses, it has been
common practice to use the octet address data type only in conjunction
with octet operators, and to use native cell addresses for all other
purposes.
Since the dynamic range required of this synthetic data type is
one or more bits larger than for native address units, it follows
that if the machine supports full cell width cell addresses, then
an address capable of identifying any stored character or octet
within the memory must be greater than one cell in width.
A number of practical systems have used cell width octet addresses
with varying degrees of success. For example, a number of the 16-bit
minicomputers have been restricted architecturally to 15 bit cell
addressing; in fact, in some cases, the 16th bit has been used to
mark indirect addresses. On such systems, it has been possible to
address all of memory with a 16 bit octet address, with no negative
side effects.
Less successful have been efforts to use 16 bit synthetic octet
addresses on machines that support full 16 bit cell addressing.
One strategy is to limit octet addressing to the low half of
memory. Another is to "float" octet addressing upon each task's
private memory. Yet another subdivides octet addressable space
into a static, common region and another which is "floated". Each
of these strategies has inflicted pain upon programmers who have
had to live with them.
A slightly less obvious form of this pain has been experienced
when maintaining a single source base that runs on both cell and
octet addressed machines. In a typical synthetic addressing scheme
for such 16 bit machines, it is possible to convert a cell address
into the synthetic address of its first octet by simply doubling
the cell address. The advantage of this transformation was that
all the system had to do was specify which operators took octet
addresses as opposed to cell addresses, and expect the programmer
to use the conversion operator when needed. This avoided the need
for special allocation and declaration functions for octet space.
The disadvantage is that, when running on an octet addressed machine,
the conversion operators were no-ops. The consequences of failing to
use a conversion operator, or of using the wrong address type with a
given function, were nil. As a result, a programmer could change
such a program inattentively, test it on an octet addressed machine,
and never discover the bugs thus introduced until the program was
later run on a cell addressed machine. Practical experience has
shown that this error is easy to make, hard to detect, and is a
direct consequence of having an octet address that is of the same
size and the same value as is the regular memory address on octet
addressed machines. As a result, it appears that from the
perspective of human factors this is an architecture to be avoided.
Based on this experience, it is proposed that explicit octet
addressing be done using an ordered pair. This practice has actually
been used in a number of systems, and is also the method often
used in hardware and software support for octet sequences on
large cell addressed mainframes.
Synopsis of proposed architecture:
----------------------------------
The ordered pair of an Octet Address consists of a Base Address
and an Octet Index. The Base Address is the standard Address of
the beginning of a memory allocation declared for an Octet Sequence.
All Octet Addresses within that allocation share the same Base
Address, and there is no portable method for transforming an Octet
Address with a given Base Address to use a different Base Address.
The Octet Index is a zero relative positive integer denoting the
position of an octet within the sequence which starts at the Base
Address.
On the stack, the Base Address is on top. Arithmetic on Octet
Addresses is meaningful only when subtracting the address of
one octet from that of another within the same sequence, or
when adding or subtracting a scalar to or from the address of
an octet. This structure and these rules allow the application
to use double operators such as M+ and D- for the valid arithmetic
if those operators are assumed present; otherwise, since such valid
arithmetic never involves carries or borrows between the Index and
Base parts of the Octet Address, they are amenable to simple
arithmetic operations using standard CORE operators and similarly
for machine code.
For example, the difference between two Octet Addresses that may be
validly compared may be computed
ROT 2DROP - ( in lieu of D- )
and an Octet Address may be decremented using
SWAP 1- SWAP ( in lieu of -1 M+ )
Incrementation is of course done by the dedicated operator below.
Finally, this arrangement leads to syntax which is analogous to
that which is commonly used with arrays in Forth. If PACKET has
been declared as an octet sequence, the phrase:
5 PACKET
places on the stack the formal Octet Address of the sixth octet
in that sequence since PACKET simply provides the Base Address
for that sequence. In a loop,
I PACKET
or 4 + DUP PACKET
occurs naturally as it does with arrays, helping out with stack
bloat that would occur if "indexing" were not available and
arithmetic on the double form was the only way to navigate.
I believe, based on considerable experience, that this is the
cleanest way to deal with this issue. In fact, it is precisely
the solution that ATHENA uses for data structures defined as
sequences of *bits*, where it has served well, led to readable
code, and produced no glaring inconsistencies. Based on this,
the minimum set of things we might need is:
OCTETS ( n1 - n2) Clone defn from CHARS
OCTET+ ( 8-addr1 - 8-addr2) Clone defn from CHAR+
8@ ( 8-addr - u) Clone defn from C@
8! ( u 8-addr) Clone defn from C!
8MOVE ( 8-addr1 8-addr2 u) Clone defn from CMOVE
It is strictly coincidental that "8" looks very much like "B"
at first glance ;-)
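On the common byte-addressed systems where char = octet = address
unit, these could collapse to very thin definitions, e.g. (a sketch
only, keeping the ordered-pair convention above with the Base Address
on top):

: OCTETS   ( n1 -- n2 )  ;                   \ one octet per address unit
: OCTET+   ( idx base -- idx' base )  swap 1+ swap ;
: 8@       ( idx base -- u )   + c@ ;
: 8!       ( u idx base -- )   + c! ;
: 8MOVE    ( idx1 base1 idx2 base2 u -- )
   >r + >r + r> r> cmove ;                   \ resolve both pairs, then CMOVE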
Storage for octet sequences is allocated using the present
conventions for allocating and identifying *aligned* addresses.
For example,
CREATE PACKET 536 OCTETS ALLOT
... , ... ALIGN HERE 64 OCTETS ALLOT ...
For the purpose of complying with standards, the first form is
more likely to be used. The requirement for ALIGNing Base
Addresses facilitates efficient implementations on the universe
of equipment.
Addition of octet sequence support to the FILE extension must be
done in such a way that it is independent of character size, which
may be larger than an octet. However, as written all FILE operators
function in terms of lengths and positions whose units are characters.
Because more than one octet position may map onto the same character
position, dealing with the same file ID in terms of both octets and
characters would be problematic.
Instead, the following is proposed:
OCT ( fam1 - fam2)
Modify the implementation-defined file access method fam1
to additionally select an octet oriented, as opposed to character
or file oriented, access method. When a file ID has been opened
with the OCT access method, all file positions and sizes used in
association with that file are in units of octets instead of
characters. In addition, it is an ambiguous condition to use
READ-FILE, READ-LINE, WRITE-FILE, WRITE-LINE, or INCLUDE-FILE
with such a file ID. INCLUDED is not mentioned in this list
because it does not consume a file ID.
READ-OCTET ( 8-addr u1 fileid - u2 ior) Clone from READ-FILE
Note ambiguous condition if used with a fileid not opened as OCT
WRITE-OCTET ( 8-addr u fileid - ior) Clone from WRITE-FILE
Note ambiguous condition if used with a fileid not opened as OCT
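For the same char = octet case, OCT and the new file words might be
little more than renamings (again only a sketch, writing the 8-addr
pair out as idx base):

: OCT          ( fam1 -- fam2 )  ;           \ nothing to change
: READ-OCTET   ( idx base u1 fileid -- u2 ior )
   >r >r + r> r> read-file ;
: WRITE-OCTET  ( idx base u fileid -- ior )
   >r >r + r> r> write-file ;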
This appears to be the minimum necessary change. READ-FILE and
WRITE-FILE are not overloaded because experience indicates that
having different arguments for the same function depending on a
flag leads to maintenance problems.
If written, this proposal will of course have to include a number
of details in sections 2, 3, and 4 as well as 11 and whatever is
assigned for this extension.
ALTERNATIVE STRUCTURE 1:
------------------------
If the TC strongly feels that this is too much solution for the
problem, there is a simpler alternative that is logically self
consistent:
1. An octet is guaranteed to fit inside the storage allocation
for a character.
2. Therefore, omit all of this except the FILE wordset part.
3. In the FILE wordset, include OCT but simply note that in
this access method octets are read from and written to the
device, sizes and positions are in octets, and the data
are read into and written from character storage such that
octets are right justified and zero filled into characters
on READ-FILE, only the low order eight bits of each character
are written by WRITE-FILE, and that READ-LINE, WRITE-LINE,
and INCLUDE-FILE are ambiguous with an OCT file handle.
The disadvantage of this is that while it would allow everyone
with the AU=byte=char=octet dependency to congratulate themselves
as having complied without doing any work, it would not address
the physical storage structures commonly used by hardware and
operating systems for cell addressed equipment, and would be
inefficient on byte addressed machines with large characters.
ALTERNATIVE STRUCTURE 2:
------------------------
It might be more useful to use the initial structure above but to
de-ambiguify READ-FILE and WRITE-FILE by incorporating the
conventions in item 3. of alternative 1 above. What this would buy
is that an existing AU=byte=octet=char application that had to
be converted in a hurry to use say 16 bit characters could adapt
to such a system by using OCT as file access method with no other
changes (assuming it was coded with CHARS and CHAR+ as needed)
and still operate upon its octet sequence structures with reduced
efficiency. For that matter, it could run on cell addressed
hardware with similarly reduced efficiency. In either case, at
leisure and if necessary the application could be upgraded to
actually use the Octet Addressing functions, but in the meanwhile
there would be a fast and dirty way to solve the problem with
minimal effort.
At present I think that Alternative 2 would be the wisest of these
three. Perhaps the part of Alternative 2 taken from Alternative 1
could be the OCTET extension, and the rest of it could be called
OCTET EXT.
Or, if one felt more strongly about it, OCT could be added to the
base FILE wordset along with the change in behavior of that wordset
per Alternative 1, and the rest of 2 implemented as simply the
OCTET wordset with no OCTET EXT (as yet). For those maintaining
typical systems, that could require as little as adding OCT as a
no-op.
Obviously it would be nice to have a first draft that might pass,
so these packaging issues should be more or less resolved first.
In that regard the central question is, to me, how essential and
therefore how non-optional each of these layers should be.
-----------------------------------------------------------------
> ...
> Reference Implementation
> ========================
> (as yet untested)
>
> Taken from the VFX Forth source tree and modified to remove most
> implementation dependencies. Assumes the use of the # and $ numeric
> prefixes to indicate decimal and hexadecimal respectively.
> ...
The reference implementation relies on appending the parsed string to a
**counted string** in PAD. A first glance at the code suggests that it will
break if the string being parsed is greater than 255 chars. Does the proposal
imply S\" should not parse strings greater than this length? DPANS94 does not
appear to set an upper limit on the number of characters which may be parsed by
S". It does require that S" support a minimum of 80 chars.
Krishna Myneni
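One possible guard (a sketch, not part of the posted reference code)
would be for addchar to refuse to grow the counted string past 255
pchars rather than wrap silently:

: addchar  \ char string --
  dup c@ 255 = abort" escaped string too long for counted buffer"
  tuck count + c!
  1 swap c+!
;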
Well, if the 'cdef' is ignored, I'd call that the rational choice. I
assumed it would be parsed as a word and either compiled or flagged as
a "not in the dictionary" error. Silly me!
It's not clear whether the escape chars are required to be lower-case
or case-insensitive.
Forth systems have moved away from case-sensitivity, so let's
not introduce it into new proposals. I note Forth Inc code uses
mainly upper-case escapes in their S\" strings.