20060822 Updated solution section. 20060821 First draft.
Rationale
=========

Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
   in the Forth 200x internationalisation and XCHAR proposals.
Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very similar
operations. S\" behaves like S", but uses the '\' character as an escape
character for the entry of characters that cannot be used with S".
This technique is widespread in languages other than Forth.
It has benefit in areas such as
1) construction of multi line strings for display by operating system
   services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.
The majority of current Forth systems contain code, either in the kernel or in application code, that assumes char=byte=au. To avoid breaking existing code, we have to live with this practice.
Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the impact
        of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel with 1..4
        byte characters, e.g. UTF-8.
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.
The basis of the current approach is to use the terminology of primitive characters and extended characters. A primitive character (called a pchar here) is a fixed-width unit handled by EMIT and friends. It corresponds to the current ANS definition of a character. An extended character (called an xchar here) consists of one or more primitive characters and represents the encoding for a "display unit". A string is represented by caddr/len in terms of primitive characters.
The consequences of this are:
1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display unit,
   but may have several additional comms channels. The impact of a
   keyboard driver having to convert Chinese or Russian characters into
   a (say) UTF-8 sequence is minimal compared to handling the key stroke
   sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they are
   described as strings.
5) No conflict arises with the XCHARs proposal.
Multiple encodings can be handled if they share a common primitive character size - nearly all of these are described in terms of octets: TCP/IP, UTF-8, UTF-16, UTF-32, ...
The XCHARs proposal can be used to handle extended characters on the stack. XEMIT and friends allow us to handle some additional odd-ball requirements such as 9-bit control characters, e.g. for the MDB bus used by vending machines.
Solution
--------
To ease discussion we refer to characters handled by C@, C! and friends
as "primitive characters" or pchars. Characters that may be wider than a
pchar are called "extended characters" or xchars. These are compatible
with the XCHARs proposal. This proposal does not require systems to
handle xchars, but does not disenfranchise those that do.
S\" is used like S" but treats the '\' character specially. One or more characters after the '\' indicate what is substituted. The following list is what is currently available in the Forth systems surveyed.
\a           BEL (alert, ASCII 7)
\b           BS (backspace, ASCII 8)
\e           ESC (not in C99, ASCII 27)
\f           FF (form feed, ASCII 12)
\l           LF (ASCII 10)
\m           CR/LF pair (ASCII 13, 10) - for HTML etc.
\n           newline - CRLF for Windows/DOS, LF for Unices
\q           double-quote (ASCII 34)
\r           CR (ASCII 13)
\t           HT (tab, ASCII 9)
\v           VT (ASCII 11)
\z           NUL (ASCII 0)
\"           "
\[0-7]+      Octal numerical character value, finishes at the first
             non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the first
             non-hex character
\\           backslash itself
\            before any other character represents that character
The following three of these cause parsing and readability problems. As
far as I know, requiring characters to come in 8 bit units will not
upset any systems. Systems with characters less than 7 bits are
non-compliant, and I know of no 7 bit CPUs. All current systems use
character units of 8 bits or more.

\[0-7]+      Octal numerical character value, finishes at the first
             non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the first
             non-hex character
Why do we need two representations, both of variable length? This proposal selects the hexadecimal representation, requiring two hex digits. A consequence of this is that xchars must be represented as a sequence of pchars. Although initially seen as a problem by some people, it avoids at least the following problems:
1) Endian issues when transmitting an xchar, e.g. big-endian host to
   little-endian comms channel.
2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
   system.
3) Does not have problems in distinguishing the end of the number from a
   following character such as '0' or 'A'.

At least one system (Gforth) already supports UTF-8 as its native
character set, and one system (JaxForth) used UTF-16. These systems are
not affected.
\ before any other character represents that character
This is an unnecessary general case, and so is not mandated. By making it an ambiguous condition, we do not disenfranchise existing implementations, and leave the way open for future extensions.
Proposal
========

6.2.xxxx S\"    s-slash-quote    CORE EXT

Interpretation:
Interpretation semantics for this word are undefined.

Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by " (double-quote), using the translation rules
below. Append the run-time semantics given below to the current
definition.

Translation rules:
Characters are processed one at a time and appended to the compiled
string. If the character is a '\' character it is processed by parsing
and substituting one or more characters as follows:

\a     BEL (alert, ASCII 7)
\b     BS (backspace, ASCII 8)
\e     ESC (not in C99, ASCII 27)
\f     FF (form feed, ASCII 12)
\l     LF (ASCII 10)
\m     CR/LF pair (ASCII 13, 10)
\n     implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\q     double-quote (ASCII 34)
\r     CR (ASCII 13)
\t     HT (tab, ASCII 9)
\v     VT (ASCII 11)
\z     NUL (ASCII 0)
\"     "
\xAB   A and B are Hexadecimal numerical characters. The resulting
       character is the conversion of these two characters.
\\     backslash itself
\      before any other character constitutes an ambiguous condition.

Run-time: ( -- c-addr u )
Return c-addr and u describing a string consisting of the translation of
the characters ccc. A program shall not alter the returned string.
Taken from the VFX Forth source tree and modified to remove most implementation dependencies. Assumes the use of the # and $ numeric prefixes to indicate decimal and hexadecimal respectively.
: PLACE         \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
  2dup 2>r              \ write count last
  1 chars + swap move
  2r> c!                \ to avoid in-place problems
;

: $,            \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
  dup >r
  here place
  r> 1 chars + allot
  align
;

: addchar       \ char string --
\ *G Add the character to the end of the counted string.
  tuck count + c!
  1 swap c+!
;

: append        \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
  >r
  tuck  r@ count +  swap cmove          \ add source to end
  r> c+!                                \ add length to count
;
: extract2H     \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the string,
\ ** returning the remaining string and the converted number.
\ ** Conversion is limited to the first two characters so that a
\ ** following hex digit such as 'A' is left in the string.
  base @ >r  hex
  0 0 2over drop 2 >number 2drop drop
  >r  2 chars /string  r>
  r> base !
;
create EscapeTable      \ -- addr
\ *G Table of translations for \a..\z.
        7 c,    \ \a
        8 c,    \ \b
   char c c,    \ \c
   char d c,    \ \d
      #27 c,    \ \e
      #12 c,    \ \f
   char g c,    \ \g
   char h c,    \ \h
   char i c,    \ \i
   char j c,    \ \j
   char k c,    \ \k
      #10 c,    \ \l
   char m c,    \ \m
      #10 c,    \ \n (Unices only)
   char o c,    \ \o
   char p c,    \ \p
   char " c,    \ \q
      #13 c,    \ \r
   char s c,    \ \s
        9 c,    \ \t
   char u c,    \ \u
      #11 c,    \ \v
   char w c,    \ \w
   char x c,    \ \x
   char y c,    \ \y
        0 c,    \ \z

internal
: addEscape     \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
  over 0=                       \ zero length check
  if  drop  exit  endif
  >r                            \ -- caddr len ; R: -- dest
  over c@ [char]
Peter Knaggs wrote: > 21 August 2006, Stephen Pelc
> \a           BEL (alert, ASCII 7)
> \b           BS (backspace, ASCII 8)
> \e           ESC (not in C99, ASCII 27)
> \f           FF (form feed, ASCII 12)
> \l           LF (ASCII 10)
> \m           CR/LF pair (ASCII 13, 10) - for HTML etc.
> \n           newline - CRLF for Windows/DOS, LF for Unices
> \q           double-quote (ASCII 34)
> \r           CR (ASCII 13)
> \t           HT (tab, ASCII 9)
> \v           VT (ASCII 11)
> \z           NUL (ASCII 0)
> \"           "
> \[0-7]+      Octal numerical character value, finishes at the
>              first non-octal character
> \x[0-9a-f]+  Hex numerical character value, finishes at the first
>              non-hex character
> \\           backslash itself
> \            before any other character represents that character
How would the following
s\" \"
be handled? Win32Forth treats incomplete strings
s" incomplete
as being correctly terminated at the cr/lf boundary.
> The following three of these cause parsing and readability problems. As > far as I know, requiring characters to come in 8 bit units will not > upset any systems. Systems with characters less than 7 bits are non- > compliant, and I know of no 7 bit CPUs. All current systems use > character units of 8 bits or more.
> \[0-7]+ Octal numerical character value, finishes at the first > non-octal character > \x[0-9a-f]+ Hex numerical character value, finishes at the first > non-hex character
> Why do we need two representations, both of variable length? This > proposal selects the hexadecimal representation, requiring two hex > digits. A consequence of this is that xchars must be represented as a > sequence of pchars. Although initially seen as a problem by some people, > it avoids at least the following problems:
> 1) Endian issues when transmitting an xchar, e.g. big-endian host to > little-endian comms channel > 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit > system. > 3) Does not have problems in distinguishing the end of the number from a > following character such as '0' or 'A'.
> At least one system (Gforth) already supports UTF-8 as it's native > character set, and one system (JaxForth) used UTF-16. These systems are > not affected.
I'm confused by the previous, and how to terminate an octal or hex string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?
> be handled? Win32Forth treats incomplete strings
> s" incomplete
> as being correctly terminated at the cf/lf boundary.
The current definition of s" does not define what happens in this circumstance. Consequently this proposal does not define this condition either. Your solution would be just as valid for s\" as s".

I find it moderately interesting that the rather standard \<newline> is not included. Traditionally this means ignore the line break.
>> \[0-7]+ Octal numerical character value, finishes at the first >> non-octal character >> \x[0-9a-f]+ Hex numerical character value, finishes at the first >> non-hex character
>> Why do we need two representations, both of variable length? This >> proposal selects the hexadecimal representation, requiring two hex >> digits. A consequence of this is that xchars must be represented as a >> sequence of pchars. Although initially seen as a problem by some people, >> it avoids at least the following problems:
>> 1) Endian issues when transmitting an xchar, e.g. big-endian host to >> little-endian comms channel >> 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit >> system. >> 3) Does not have problems in distinguishing the end of the number from a >> following character such as '0' or 'A'.
> I'm confused by the previous, and how to terminate an octal or hex > string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?
This is a problem of the existing solutions. This proposal suggests that \x should be followed by only two characters. Thus your \x12AB would produce the sequence 12, 'A', and 'B'.
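To make the two-digit rule concrete, here is a small sketch. It assumes a system whose S\" follows this proposal (exactly two hex digits after \x); HEX-DEMO is just a name made up for the illustration.

```forth
\ Sketch only: assumes S\" with the fixed two-digit \x rule proposed here.
\ HEX-DEMO is a hypothetical name, not part of the proposal.
: hex-demo ( -- c-addr u )  s\" \x12AB" ;

hex-demo nip .              \ number of pchars in the compiled string
hex-demo drop c@ .          \ first pchar, converted from the digits 12
hex-demo drop char+ c@ emit \ second pchar, the literal character A
```

Under the proposed rule the string holds three pchars: hex 12, 'A' and 'B'. A system with the older variable-length \x would instead try to convert all four digits.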
> > be handled? Win32Forth treats incomplete strings
> > s" incomplete
> > as being correctly terminated at the cf/lf boundary.
> The current definition of s" does not define what happens in this > circumstance. Consequently this proposal does not not define this > condition either. Your solution would be just as valid for s\" as s".
> It find it moderately interesting that the rather standard \<newline> is > not. Traditionally this means ignore the line break.
That would be a useful enhancement; but perhaps \c might be clearer, as it differentiates between a silent space as in \<newline> and \ <newline> and permits comments.
s\" abcdefg\c  \ continue on a new line
    hijklmn"   \ blank strip leading & catenate for
abcdefghijklmn
Alex McDonald <b...@rivadpm.com> writes: >How would the following
> s\" \"
>be handled? Win32Forth treats incomplete strings
> s" incomplete
>as being correctly terminated at the cf/lf boundary.
That's what the standard prescribes in Section 3.4.1:
|[If no delimiter character is present], the string continues up to |and including the last character in the parse area, and the number in |>IN is changed to the length of the input buffer, thus emptying the |parse area.
Since the proposal uses the usual "parse ... delimited by ..." idiom, I expect that it works the same way, modulo not interpreting the " in \" as delimiter. Maybe this could be made clearer in the proposal.
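The Section 3.4.1 behaviour can be seen with plain PARSE. This is only a sketch: QUOTED and LEN are hypothetical names, and it says nothing about how any particular S\" is implemented.

```forth
\ QUOTED is a hypothetical word using the "parse delimited by" idiom.
: quoted ( "ccc<quote>" -- c-addr u )  [char] " parse ;

variable len        \ hypothetical: remembers the length QUOTED returned

quoted abc
nip len !           \ c-addr points into the input buffer, so keep only u
len @ .             \ length of the text up to the end of the line
```

With no closing quote on the line, PARSE returns everything up to the end of the parse area, which is the same outcome Win32Forth gives for an unterminated s".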
Alex McDonald <b...@rivadpm.com> writes: >On Jul 13, 10:08 am, Peter Knaggs <pkna...@bournemouth.ac.uk> wrote: >> It find it moderately interesting that the rather standard \<newline> is >> not. Traditionally this means ignore the line break.
>That would be a useful enhancement;
No existing practice in Forth.
> but perhaps \c might be clearer, >as it differentiates between a silent space as in \<newline> and \ ><newline> and permits comments.
>s\" abcdefg\c \ continue on a new line > hijklmn" \ blank strip leading & catenate for >abcdefghijklmn
In C one can construct a longer literal string by writing two adjacent literal strings, separated only by white space and comments. E.g., "abcdefg" at the end of one line followed by "hijklmn" on the next line is compiled as the single string "abcdefghijklmn".

Note that this allows a little more flexibility about where the string starts in the next line. Inspired by this, we could do it in Forth with words like +" and +\", which would extend a string started with S" or S\". But there is no existing practice for this either, so it is not for this proposal.
Peter Knaggs <pkna...@bournemouth.ac.uk> writes: >21 August 2006, Stephen Pelc
Pretty good. There's always room for improvement:
- Test cases should be added before the CfV.
- I guess that you want \xAB to represent a (primitive) character. This does not come out clearly (actually, if there was no mention of XCHARS and definition of "primitive characters" in the informative sections, this would be clearer).
- It seems that the detailed description of an existing solution in the "Solution" section is confusing, because it is very similar to the proposal, but still different. Better to leave it out and just mention the issues (like fixed-length vs. variable-length \x) in a discussion section.
On Fri, 13 Jul 2007 12:24:39 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote: >- Test cases should be added before the CfV.
Volunteer? You? The gForth test suite?
>- I guess that you want \xAB to represent a (primitive) character. >This does not come out clearly (actually, if there was no mention of >XCHARS and definition of "primitive characters" in the informative >sections, this would be clearer).
Given the problems with the definition of char throughout the document, the definition of char in terms of primitive characters *has* to be done in a different section of the document.
For example, if char=16 bits on a byte-addressed machine, there is no way for a standard program to write a byte to a file!
If you use a variable width character set such as UTF-8, what does CMOVE mean?
The only practical solutions I see are
a) define char=byte
b) define char=implementation defined unit
Given the amount of code that currently assumes char=byte=au, the least code breakage and maximum instant compliance is to replace "char" in the document by "primitive char" ("pchar") and then to define "extended char" ("xchar") in terms of pchars. The vast majority of systems can then happily impose char=byte=au.
>- It seems that the detailed description of an existing solution in >the "Solution" section is confusing, because it is very similar to the >proposal, but still different. Better leave it away and just mention >the issues (like fixed-length vs. variable-length \x) in a discussion >section.
Revamped and posted separately.
-- Stephen Pelc, stephen...@mpeforth.com MicroProcessor Engineering Ltd - More Real, Less Time 133 Hill Lane, Southampton SO15 5AF, England tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691 web: http://www.mpeforth.com - free VFX Forth downloads
Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
   in the Forth 200x internationalisation and XCHAR proposals.
Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very similar
operations. S\" behaves like S", but uses the '\' character as an escape
character for the entry of characters that cannot be used with S".
This technique is widespread in languages other than Forth.
It has benefit in areas such as
1) construction of multiline strings for display by operating system
   services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.
The majority of current Forth systems contain code, either in the kernel or in application code, that assumes char=byte=au. To avoid breaking existing code, we have to live with this practice.
The following list describes what is currently available in the surveyed Forth systems that support escaped strings.
\a           BEL (alert, ASCII 7)
\b           BS (backspace, ASCII 8)
\e           ESC (not in C99, ASCII 27)
\f           FF (form feed, ASCII 12)
\l           LF (ASCII 10)
\m           CR/LF pair (ASCII 13, 10) - for HTML etc.
\n           newline - CRLF for Windows/DOS, LF for Unices
\q           double-quote (ASCII 34)
\r           CR (ASCII 13)
\t           HT (tab, ASCII 9)
\v           VT (ASCII 11)
\z           NUL (ASCII 0)
\"           "
\[0-7]+      Octal numerical character value, finishes at the first
             non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the first
             non-hex character
\\           backslash itself
\            before any other character represents that character
Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the impact
        of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel with 1..4
        byte characters, e.g. UTF-8.
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.
The basis of the current approach is to use the terminology of primitive characters and extended characters. A primitive character (called a pchar here) is a fixed-width unit handled by EMIT and friends as well as C@, C! and friends. A pchar corresponds to the current ANS definition of a character. Characters that may be wider than a pchar are called "extended characters" or xchars. The xchars are an integer multiple of pchars. An xchar consists of one or more primitive characters and represents the encoding for a "display unit". A string is represented by caddr/len in terms of primitive characters.
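As an illustration of an xchar being a sequence of pchars, consider U+00E9 in UTF-8, which is the two octets C3 A9. The sketch below assumes 8-bit pchars and an S\" with the \xAB rule from this proposal; E-ACUTE is a hypothetical name.

```forth
\ Assumes S\" as proposed and 8-bit pchars. E-ACUTE is a hypothetical name.
\ U+00E9 encoded in UTF-8 is the two octets C3 A9.
: e-acute ( -- c-addr u )  s\" \xC3\xA9" ;

e-acute nip .     \ the length counts pchars, not xchars
e-acute type      \ a UTF-8 display would render a single display unit
```

The caddr/len pair reports two pchars for this one xchar, so COUNT, CMOVE and the file words keep working on pchars without knowing the encoding.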
The consequences of this are:
1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display unit,
   but may have several additional comms channels. The impact of a
   keyboard driver having to convert Chinese or Russian characters into
   a (say) UTF-8 sequence is minimal compared to handling the key stroke
   sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they are
   described as strings.
5) No conflict arises with the XCHARs proposal.
Multiple encodings can be handled if they share a common primitive character size - nearly all encodings are described in terms of octets, e.g. TCP/IP, UTF-8, UTF-16, UTF-32, ...
Approach
--------
This proposal does not require systems to handle xchars, and does not
disenfranchise those that do.
S\" is used like S" but treats the '\' character specially. One or more characters after the '\' indicate what is substituted. The following three of these cause parsing and readability problems. As far as I know, requiring characters to come in 8 bit units will not upset any systems. Systems with characters less than 7 bits are non-compliant, and I know of no 7 bit CPUs. All current systems use character units of 8 bits or more.
Of observed current practice, the following two are problematic.

\[0-7]+      Octal numerical character value, finishes at the first
             non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the first
             non-hex character

Why do we need two representations, both of variable length? This
proposal selects the hexadecimal representation, requiring two hex
digits. A consequence of this is that xchars must be represented as a
sequence of pchars. Although initially seen as a problem by some people,
it avoids at least the following problems:
1) Endian issues when transmitting an xchar, e.g. big-endian host to
   little-endian comms channel.
2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
   system.
3) Does not have problems in distinguishing the end of the number from a
   following character such as '0' or 'A'.

At least one system (Gforth) already supports UTF-8 as its native
character set, and one system (JaxForth) used UTF-16. These systems are
not affected.
\ before any other character represents that character
This is an unnecessary general case, and so is not mandated. By making it an ambiguous condition, we do not disenfranchise existing implementations, and leave the way open for future extensions.
Proposal
========

6.2.xxxx S\"    s-slash-quote    CORE EXT

Interpretation:
Interpretation semantics for this word are undefined.

Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by " (double-quote), using the translation rules
below. Append the run-time semantics given below to the current
definition.

Translation rules:
Characters are processed one at a time and appended to the compiled
string. If the character is a '\' character it is processed by parsing
and substituting one or more characters as follows:

\a     BEL (alert, ASCII 7)
\b     BS (backspace, ASCII 8)
\e     ESC (not in C99, ASCII 27)
\f     FF (form feed, ASCII 12)
\l     LF (ASCII 10)
\m     CR/LF pair (ASCII 13, 10)
\n     implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\q     double-quote (ASCII 34)
\r     CR (ASCII 13)
\t     HT (tab, ASCII 9)
\v     VT (ASCII 11)
\z     NUL (ASCII 0)
\"     "
\xAB   A and B are Hexadecimal numerical characters. The resulting
       character is the conversion of these two characters.
\\     backslash itself
\      before any other character constitutes an ambiguous condition.

Run-time: ( -- c-addr u )
Return c-addr and u describing a string consisting of the translation of
the characters ccc. A program shall not alter the returned string.
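A short usage sketch of the translation rules, assuming a system that provides S\" as specified above. GREETING is a hypothetical name, and the escapes were chosen to avoid the implementation dependent \n.

```forth
\ Sketch only: assumes S\" as specified above. GREETING is a hypothetical name.
\ Contents: H i <HT> " t h e r e " <CR> <LF>
: greeting ( -- c-addr u )  s\" Hi\t\qthere\q\r\l" ;

greeting nip .            \ length after translation
greeting drop 2 chars + c@ .   \ the pchar produced by \t
greeting type             \ send the translated string to the display
```

Each escape contributes one pchar here, so the compiled string is 12 pchars long even though the source text between the quotes is longer.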
Labelling
=========
ENVIRONMENT? impact
  name  stack  conditions

Ambiguous conditions occur:
  If a hex value is more than two characters
  If \x is not followed by two hexadecimal characters
  If the string is incorrectly formed

Reference Implementation
========================
(as yet untested)

Taken from the VFX Forth source tree and modified to remove most
implementation dependencies. Assumes the use of the # and $ numeric
prefixes to indicate decimal and hexadecimal respectively.
: PLACE         \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
  2dup 2>r              \ write count last
  1 chars + swap move
  2r> c!                \ to avoid in-place problems
;

: $,            \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
  dup >r
  here place
  r> 1 chars + allot
  align
;

: addchar       \ char string --
\ *G Add the character to the end of the counted string.
  tuck count + c!
  1 swap c+!
;

: append        \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
  >r
  tuck  r@ count +  swap cmove          \ add source to end
  r> c+!                                \ add length to count
;
: extract2H     \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the string,
\ ** returning the remaining string and the converted number.
\ ** Conversion is limited to the first two characters so that a
\ ** following hex digit such as 'A' is left in the string.
  base @ >r  hex
  0 0 2over drop 2 >number 2drop drop
  >r  2 chars /string  r>
  r> base !
;
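As a check on the two-digit rule, here is a stand-alone sketch of the extraction step. EXTRACT-2HEX and DEMO are hypothetical names chosen so as not to clash with the reference code, and the sketch assumes the input holds at least two characters.

```forth
\ Hypothetical stand-alone variant: converts at most two hex digits.
\ Assumes the string holds at least two characters.
: extract-2hex ( c-addr len -- c-addr' len' u )
  base @ >r  hex
  0 0 2over drop 2 >number 2drop drop   \ convert only the first two chars
  >r  2 chars /string  r>               \ step past the two digits
  r> base ! ;                           \ restore the caller's base

\ DEMO is a hypothetical test word: digits 41 followed by the text BC.
: demo ( -- c-addr' len' u )  s" 41BC" extract-2hex ;

demo .      \ value of the first two digits, shown in the current base
type        \ the remaining string is left for further parsing
```

Limiting >NUMBER to two characters is what keeps a following 'B' or 'C' from being swallowed as extra hex digits, which is the behaviour the \xAB rule asks for.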
create EscapeTable      \ -- addr
\ *G Table of translations for \a..\z.
        7 c,    \ \a
        8 c,    \ \b
   char c c,    \ \c
   char d c,    \ \d
      #27 c,    \ \e
      #12 c,    \ \f
   char g c,    \ \g
   char h c,    \ \h
   char i c,    \ \i
   char j c,    \ \j
   char k c,    \ \k
      #10 c,    \ \l
   char m c,    \ \m
      #10 c,    \ \n (Unices only)
   char o c,    \ \o
   char p c,    \ \p
   char " c,    \ \q
      #13 c,    \ \r
   char s c,    \ \s
        9 c,    \ \t
   char u c,    \ \u
      #11 c,    \ \v
   char w c,    \ \w
   char x c,    \ \x
   char y c,    \ \y
        0 c,    \ \z

internal
: addEscape     \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
  over 0=                       \ zero length check
  if  drop  exit  endif
  >r                            \ -- caddr len ; R: -- dest
  over c@ [char] x = if         \ hex number?
    1 chars /string extract2H r> addchar  exit
  endif
  over c@ [char] m = if         \ CR/LF
Stephen Pelc wrote: > ... we define \xABcdef as generating > the primitive character AB and cdef is then parsed.
...
How un-Forthlike! 'cdef' isn't space delimited. This is likely to create hard-to-find bugs. It would be better to barf over it.
Jerry
--
Engineering is the art of making what you want from things you can get.
Only the first and third are consistent, and only the first is what the author intended; no spaces. I don't think Stephen meant parsed in the sense of parsed and compiled.
> What is supposed to *happen* (if anything) when the programmer > does
> S\" \e[7m\a\m\t\tHello, world!\e[27m" TYPE ,
> or writes the string to mass storage for later use etc.?
> [..]
> -marcel
Whatever your output device does with a string that contains these device dependent control sequences. Storing it (presumably for later use) doesn't change the fact that s\" doesn't specify (and neither does s") the intent or meaning of the string. Did you have some other perspective on this?
> Given the problems with the definition of char throughout the
> document, the definition of char in terms of primitive characters
> *has* to be done in a different section of the document.

> For example, if char=16 bits on a byte-addressed machine, there
> is no way for a standard program to write a byte to a file!

> If you use a variable width character set such as UTF-8, what does
> CMOVE mean?

> The only practical solutions I see are
> a) define char=byte
> b) define char=implementation defined unit

> Given the amount of code that currently assumes char=byte=au, the
> least code breakage and maximum instant compliance is to replace
> "char" in the document by "primitive char" ("pchar") and then to
> define "extended char" ("xchar") in terms of pchars. The vast
> majority of systems can then happily impose char=byte=au.
How do you feel about Greg Bailey's suggestion of some years back that we introduce the data type 'byte' or 'octet' with a small set of operators to handle explicitly 8-bit units? That's sort of moving in the opposite direction from what you suggest, but seems an equally valid approach, I think. Greg's solution leaves everything regarding chars in place, while introducing a new opportunity for situations in which you need exactly 8 bits (e.g. comms, I/O).
Cheers, Elizabeth
-- ================================================== Elizabeth D. Rather (US & Canada) 800-55-FORTH FORTH Inc. +1 310-491-3356 5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454 Hawthorne, CA 90250 http://www.forth.com
"Forth-based products and Services for real-time applications since 1973." ==================================================
Alex McDonald <b...@rivadpm.com> writes Re: RfD: Escaped Strings
> Marcel Hendrix wrote:
>> Peter Knaggs <pkna...@bournemouth.ac.uk> writes Re: RfD: Escaped Strings
>>> 21 August 2006, Stephen Pelc
>>> 20060822 Updated solution section.
>>> 20060821 First draft.
>> [..]
>>> \a BEL (alert, ASCII 7)
>>> \b BS (backspace, ASCII 8)
>> What is supposed to *happen* (if anything) when the programmer
>> does
>> S\" \e[7m\a\m\t\tHello, world!\e[27m" TYPE ,
>> or writes the string to mass storage for later use etc.?
[..]
> Whatever your output device does with a string that contains these
> device dependent control sequences.

The proposal might suggest that a programmer now has a guaranteed way to cause some hitherto impossible (at least in standard Forth) output actions to happen (like tab, line-down, begin-of-line, erasing (?) backspace, ESC sequences on a VT100 terminal, etc.)

> Storing it (presumably for later
> use) doesn't change the fact that s\" doesn't specify (and neither does
> s") the intent or meaning of the string. Did you have some other
> perspective on this?
What happens when an S\" string is written to a file opened with R/W or R/W BIN and then read back?
>>> What is supposed to *happen* (if anything) when the programmer >>> does
>>> S\" \e[7m\a\m\t\tHello, world!\e[27m" TYPE ,
>>> or writes the string to mass storage for later use etc.?
> [..]
>> Whatever your output device does with a string that contains these
>> device dependent control sequences.

> The proposal might suggest that a programmer has now a guaranteed
> way to cause some hitherto impossible (at least in standard Forth) output
> actions to happen (like tab, line-down, begin-of-line, erasing (?) backspace,
> ESC sequences on a VT100 terminal, etc.)
I suppose S\" \n" TYPE would need to be defined as the equivalent of CR, and that use of other escaped strings would have an environmental dependency. The same is true of EMIT, which I'm sure many use for that purpose.
>> Storing it (presumably for later
>> use) doesn't change the fact that s\" doesn't specify (and neither does
>> s") the intent or meaning of the string. Did you have some other
>> perspective on this?

> What happens when an S\" string is written to a file opened with
> R/W or R/W BIN and then read back?
> -marcel
I would expect standard behaviour; only READ-LINE is allowed to interpret the characters and look for up to two line terminator characters (which are implementation defined); ditto for WRITE-LINE.
I would presume that the intention is that \n is the same line terminator used by READ-LINE and WRITE-LINE. Perhaps the proposal needs to state this explicitly: that S\" \n" WRITE-FILE is the equivalent of S" " WRITE-LINE.
Alex McDonald <b...@rivadpm.com> writes Re: RfD: Escaped Strings
> Marcel Hendrix wrote:
[..]
>> What happens when an S\" string is written to a file opened with
>> R/W or R/W BIN and then read back?
[..]
> I would expect standard behaviour; only READ-LINE is allowed to
> interpret the characters and look for up to two line terminator
> characters (which are implementation defined); ditto for WRITE-LINE.
> I would presume that the intention is that \n is the same line
> terminator used by READ-LINE and WRITE-LINE; perhaps the proposal needs
> to explicitly state this; that S\" \n" WRITE-FILE the equivalent of S" "
> WRITE-LINE.
This points out another possible problem:
 * How many READ-LINEs are needed to read back in S\" \lhello,\mworld!\n\n(fini)\x00" ?
 * Will it be the same string that was written out?
Marcel Hendrix wrote:
> Alex McDonald <b...@rivadpm.com> writes Re: RfD: Escaped Strings
>> Marcel Hendrix wrote:
> [..]
>>> What happens when an S\" string is written to a file opened with
>>> R/W or R/W BIN and then read back?
> [..]
>> I would expect standard behaviour; only READ-LINE is allowed to
>> interpret the characters and look for up to two line terminator
>> characters (which are implementation defined); ditto for WRITE-LINE.

>> I would presume that the intention is that \n is the same line
>> terminator used by READ-LINE and WRITE-LINE; perhaps the proposal needs
>> to explicitly state this; that S\" \n" WRITE-FILE the equivalent of S" "
>> WRITE-LINE.

> This points out another possible problem:
> * How many READ-LINEs are needed to read back in S\" \lhello,\mworld!\n\n(fini)\x00" ?

> * Will it be the same string that was written out?
This would depend on the line terminator for your operating system. In a system which uses \l as the line terminator I would suggest five lines:
1:
2: hello,\r
3: world!
4:
5: (fini)\x00

While a system which uses \r would have four lines:

1: \lhello,
2: \lworld!
3:
4: (fini)\x00

And in a system which uses \r\l there would also be four lines:

1: \lhello,
2: world!
3:
4: (fini)\x00

In other words the behaviour would be environmentally dependent. This is no different than in other languages.
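The line counts above can be checked on a particular system with a small experiment. The following is only a sketch: it assumes the system already provides S\" with the escapes from the proposal plus the File word set, and the names `write-sample`, `count-lines`, `line-buf` and `/line` are invented for this example. The number it reports is implementation-defined, which is the point being made.

```forth
\ Sketch only: assumes S\" and the File word set are available.
256 constant /line                      \ hypothetical buffer size
create line-buf  /line 2 + chars allot  \ READ-LINE needs u1+2 chars of room

: write-sample  ( c-addr u -- )         \ c-addr u = file name
  w/o create-file throw >r
  s\" \lhello,\mworld!\n\n(fini)\x00" r@ write-file throw
  r> close-file throw ;

: count-lines  ( c-addr u -- n )        \ c-addr u = file name
  r/o open-file throw >r
  0                                     \ running line count
  begin
    line-buf /line r@ read-line throw   ( n u2 flag )
  while
    drop 1+                             \ discard the length, count the line
  repeat
  drop  r> close-file throw ;
```

A possible use is `s" sample.tmp" 2dup write-sample count-lines .`, where the file name is again just a placeholder. Opening the file with BIN on both sides may give a different answer from the text-mode access used here.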
I would like to remind people that the point of the standard is not necessarily to make all standard programs portable between Forth systems, but to allow programmers to be portable. As Elizabeth puts it, the standard provides a set of entitlements to the programmer, or a set of assumptions which the programmer is entitled to make about a standard system.
It's your proposal:-) Feel free to be inspired by the tests in Gforth:
s" 123" drop 10 parse-num-x 123 <> throw drop .s
s" 123a" drop 10 parse-num 123 <> throw drop .s
s" x1fg" drop \-escape 31 <> throw drop .s
s" 00129" drop \-escape 10 <> throw drop .s
s" a" drop \-escape 7 <> throw drop .s
\"-parse " s" " str= 0= throw .s
\"-parse \a\b\c\e\f\n\r\t\v\100\x40xabcde" dump
s\" \a\bcd\e\fghijklm\12op\"\rs\tu\v" \-escape-table over str= 0= throw
s\" \w\0101\x041\"\\" name wAA"\ str= 0= throw
s\" s\\\" \\" ' evaluate catch 0= throw
However, given that the current Gforth implementation does not completely match your proposal, you have to adapt it.
>>- I guess that you want \xAB to represent a (primitive) character.
>>This does not come out clearly (actually, if there was no mention of
>>XCHARS and definition of "primitive characters" in the informative
>>sections, this would be clearer).

>Given the problems with the definition of char throughout the
>document, the definition of char in terms of primitive characters
>*has* to be done in a different section of the document.
The definition that the XCHARS proposal makes is that chars are primitive characters.
>For example, if char=16 bits on a byte-addressed machine, there
>is no way for a standard program to write a byte to a file!
Yes, there is no standard way to deal with bytes. Bytes are not (yet) a standard concept.
>If you use a variable width character set such as UTF-8, what does
>CMOVE mean?
CMOVE ( from to count -- )
Copy count characters (in your terminology, primitive characters) from FROM to TO, character by character, starting at the low addresses.
>The only practical solutions I see are
>a) define char=byte
>b) define char=implementation defined unit

>Given the amount of code that currently assumes char=byte=au, the
>least code breakage and maximum instant compliance is to replace
>"char" in the document by "primitive char" ("pchar") and then to
>define "extended char" ("xchar") in terms of pchars.
That's what the XCHARS proposal does, except that it says char where you say pchar, and it says xchar where you sometimes say char.
>> Marcel Hendrix wrote:
>>> What is supposed to *happen* (if anything) when the programmer
>>> does
>>> S\" \e[7m\a\m\t\tHello, world!\e[27m" TYPE ,
What do you, as a programmer expect it to do? The proposal does not specify anything other than that the string will be equivalent to a string created via
here 27 c, char [ c, char 7 c, char m c, 7 c, 13 c, 10 c, 9 c, 9 c, ( ... ) here over - 1 chars /
What the user output device does when you TYPE this string, in whatever way it was created, is not defined by the proposal or anywhere in the Forth-94 standard. You could declare an environmental dependency on outputting to an ANSI terminal (emulator).
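As a rough illustration of that equivalence, the sketch below lays the bytes down by hand and compares them with what S\" produces. It assumes a system that already provides S\" with the escapes from the proposal, COMPARE from the String word set, and characters that occupy one address unit. The names `by-hand`, `/by-hand` and `same?` are made up for this example, and the text after the two tabs is shortened to "Hi".

```forth
\ Sketch only: assumes S\" is available and 1 char = 1 address unit.
create by-hand                               \ hypothetical name
  27 c,  char [ c,  char 7 c,  char m c,     \ \e [ 7 m
   7 c,                                      \ \a  (BEL)
  13 c,  10 c,                               \ \m  (CR LF)
   9 c,   9 c,                               \ \t \t
  char H c,  char i c,                       \ H i
here by-hand - constant /by-hand             \ length of the hand-built string

: same?  ( -- flag )                         \ true if both strings match
  by-hand /by-hand  s\" \e[7m\a\m\t\tHi"  compare 0= ;
```

On a system whose S\" follows the proposal, `same?` is expected to leave a true flag. Neither construction says anything about what a terminal does when the string is sent to it with TYPE.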
>What happens when an S\" string is written to a file opened with
>R/W or R/W BIN and then read back?
With BIN on both reading and writing I would expect the string to come back unchanged. With BIN missing on both, the \m might be changed to something else. With BIN missing on exactly one of them, pretty much anything goes.
m...@iae.nl (Marcel Hendrix) writes:
>Alex McDonald <b...@rivadpm.com> writes Re: RfD: Escaped Strings
>> I would presume that the intention is that \n is the same line
>> terminator used by READ-LINE and WRITE-LINE; perhaps the proposal needs
>> to explicitly state this; that S\" \n" WRITE-FILE the equivalent of S" "
>> WRITE-LINE.
Yes.
>This points out another possible problem:
> * How many READ-LINEs are needed to read back in S\" \lhello,\mworld!\n\n(fini)\x00" ?
That's implementation-defined, just like the result of -1 3 /. I would expect at least 3: up to the first \n, from the first to the second, and the rest.
Note that S\" does not make a difference here. You could create the same file in other ways.
> * Will it be the same string that was written out?
Obviously, the READ-LINEs will consume the newlines without putting them in the resulting strings.
On Fri, 13 Jul 2007 09:03:32 -1000, Elizabeth D Rather <erather...@forth.com> wrote:
>How do you feel about Greg Bailey's suggestion of some years back that
>we introduce the data type 'byte' or 'octet' with a small set of
>operators to handle explicitly 8-bit units? That's sort of moving in
>the opposite direction from what you suggest, but seems an equally valid
>approach, I think. Greg's solution leaves everything regarding chars in
>place, while introducing a new opportunity for situations in which you
>need exactly 8 bits (e.g. comms, I/O).
Greg's solution has merit, especially for word/cell addressed machines, however the discussion in it indicates that life isn't that simple unless you use his alternative 2.
Given that nearly all comms and character systems are defined in bytes, most CPUs are byte-addressed, and many Forth systems and/or programmers assume char=byte=au, the least effort is to permit wide characters (xchars) without breaking the assumption or code.
For those who haven't seen it, Greg's proposal is attached below.
Stephen
====================================
From: Greg Bailey [greg at minerva dot com]
Sent: Tuesday, June 01, 1999 7:41 PM
To: 'ANSForth real mailgroup'
Cc: 'Localisation and Internationalisation'; 'ark-gvb-i'
Subject: Octet String Prospectus

Problem Statement:
------------------
Most standards defining interoperable data structures, such as those used in networking and cryptography, do so in terms of sequences of octets. Even in embedded applications, these standards are increasingly relevant, and indeed supporting them is often a critical application requirement.

The most commonly encountered computer architectures today address their memories in units of 8 bit bytes, and Standard Forth applications have no difficulty in manipulating octet sequences directly when running on typical systems, with eight bit character sets, for such machines.
However, such applications are environmentally dependent upon this common combination in which addresses are in units of bytes or octets, *and* in which characters are eight bits wide; or upon machines whose addresses are in units such as 4-bit nibbles which divide 8, and whose characters are also eight bits wide. On these families of architectures portable software may manipulate octet sequences by treating them as characters.
If, however, either character size or address units are larger than eight bits, we do not document standard ways of allocating, manipulating, or performing I/O using sequences of octets.
This proposal provides a mechanism that may be used by standard programs to manipulate sequences of octets on any standard system which supports it.
(Actual packaging TBD. Should probably be an extension, but if so it will depend upon presence of the DOUBLE extension; and it will include additions to the FILE extension if both are present.)
Discussion of common practice and architectural tradeoffs:
----------------------------------------------------------
Many systems and applications have been written for "cell addressed" machines with 16 bit and larger address units. Many strategies have been used for addressing characters, which were generally equivalent to octets, on such machines. In general the hardware does not directly support linear addressing of bytes, characters, or octets, so this type of arithmetically usable address has generally been simulated in software. The most commonly used strategy has been to multiply the physical, cell address by the number of octets held within a cell, and add to this product the relative position of the octet within the cell, in order to form a linear octet address. Coding strategies for employing this additional, synthetic address data type depend on the nature of the underlying CPU. Since there is usually a substantial performance penalty for using these synthetic addresses, it has been common practice to use the octet address data type only in conjunction with octet operators, and to use native cell addresses for all other purposes.
Since the dynamic range required of this synthetic data type is one or more bits larger than for native address units, it follows that if the machine supports full cell width cell addresses, then an address capable of identifying any stored character or octet within the memory must be greater than one cell in width.
A number of practical systems have used cell width octet addresses with varying degrees of success. For example, a number of the 16-bit minicomputers have been restricted architecturally to 15 bit cell addressing; in fact, in some cases, the 16th bit has been used to mark indirect addresses. On such systems, it has been possible to address all of memory with a 16 bit octet address, with no negative side effects.
Less successful have been efforts to use 16 bit synthetic octet addresses on machines that support full 16 bit cell addressing. One strategy is to limit octet addressing to the low half of memory. Another is to "float" octet addressing upon each task's private memory. Yet another subdivides octet addressable space into a static, common region and another which is "floated". Each of these strategies has inflicted pain upon programmers who have had to live with them.
A slightly less obvious form of this pain has been experienced when maintaining a single source base that runs on both cell and octet addressed machines. In a typical synthetic addressing scheme for such 16 bit machines, it is possible to convert a cell address into the synthetic address of its first octet by simply doubling the cell address. The advantage of this transformation was that all the system had to do was specify which operators took octet addresses as opposed to cell addresses, and expect the programmer to use the conversion operator when needed. This avoided the need for special allocation and declaration functions for octet space. The disadvantage is that, when running on an octet addressed machine, the conversion operators were no-ops. The consequences of failing to use a conversion operator, or of using the wrong address type with a given function, were nil. As a result, a programmer could change such a program inattentively, test it on an octet addressed machine, and never discover the bugs thus introduced until the program was later run on a cell addressed machine. Practical experience has shown that this error is easy to make, hard to detect, and is a direct consequence of having an octet address that is of the same size and the same value as is the regular memory address on octet addressed machines. As a result, it appears that from the perspective of human factors this is an architecture to be avoided.
Based on this experience, it is proposed that explicit octet addressing be done using an ordered pair. This practice has actually been used in a number of systems, and is also the method often used in hardware and software support for octet sequences on large cell addressed mainframes.
Synopsis of proposed architecture:
----------------------------------

The ordered pair of an Octet Address consists of a Base Address and an Octet Index. The Base Address is the standard Address of the beginning of a memory allocation declared for an Octet Sequence. All Octet Addresses within that allocation share the same Base Address, and there is no portable method for transforming an Octet Address with a given Base Address to use a different Base Address. The Octet Index is a zero relative positive integer denoting the position of an octet within the sequence which starts at the Base Address.
On the stack, the Base Address is on top. Arithmetic on Octet Addresses is meaningful only when subtracting the address of one octet from that of another within the same sequence, or when adding or subtracting a scalar to or from the address of an octet. This structure and these rules allow the application to use double operators such as M+ and D- for the valid arithmetic if those operators are assumed present; otherwise, since such valid arithmetic never involves carries or borrows between the Index and Base parts of the Octet Address, they are amenable to simple arithmetic operations using standard CORE operators and similarly for machine code. For example, the difference between two Octet Addresses that may be validly compared may be computed
ROT 2DROP - ( in lieu of D- )
and an Octet Address may be decremented using
SWAP 1- SWAP ( in lieu of -1 M+ )
Incrementation is of course done by the dedicated operator below.
Finally, this arrangement leads to syntax which is analogous to that which is commonly used with arrays in Forth. If PACKET has been declared as an octet sequence, the phrase:
5 PACKET
places on the stack the formal Octet Address of the sixth octet in that sequence since PACKET simply provides the Base Address for that sequence. In a loop,
I PACKET
or 4 + DUP PACKET
occurs naturally as it does with arrays, helping out with stack bloat that would occur if "indexing" were not available and arithmetic on the double form was the only way to navigate.
I believe, based on considerable experience, that this is the cleanest way to deal with this issue. In fact, it is precisely the solution that ATHENA uses for data structures defined as sequences of *bits*, where it has served well, led to readable code, and produced no glaring inconsistencies. Based on this, the minimum set of things we might need is:
OCTETS   ( n1 - n2)               Clone defn from CHARS
OCTET+   ( 8-addr1 - 8-addr2)     Clone defn from CHAR+
8@       ( 8-addr - u)            Clone defn from C@
8!       ( u 8-addr)              Clone defn from C!
8MOVE    ( 8-addr1 8-addr2 u)     Clone defn from CMOVE
It is strictly coincidental that "8" looks very much like "B" at first glance ;-)
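On a byte-addressed system with 8-bit characters, most of the minimum word set listed above collapses to thin wrappers over the character operators. The following is only a sketch under that assumption, using the ordered pair with the Base Address on top of the stack. `packet-data` and the 16-octet size are invented for the example, 8MOVE is left out, and a cell-addressed system would need real index arithmetic in place of the plain `+`.

```forth
\ Sketch only: assumes 1 octet = 1 char = 1 address unit.
: octets  ( n1 -- n2 ) ;                       \ no scaling on such a system
: octet+  ( index base -- index' base )  swap 1+ swap ;
: 8@      ( index base -- u )   + c@ ;
: 8!      ( u index base -- )   + c! ;

create packet-data  16 octets allot            \ hypothetical 16-octet buffer
: packet  ( index -- index base )  packet-data ;

65 5 packet 8!        \ store 65 in the sixth octet
5 packet 8@ .         \ fetch it back and print it
```

Because `packet` only pushes the Base Address, the phrase `5 packet` reads like an array reference, which is the syntax the prospectus describes.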
Storage for octet sequences is allocated using the present conventions for allocating and identifying *aligned*
> Taken from the VFX Forth source tree and modified to remove most
> implementation dependencies. Assumes the use of the # and $ numeric
> prefixes to indicate decimal and hexadecimal respectively.
> ...
The reference implementation relies on appending the parsed string to a **counted string** in PAD. A first glance at the code suggests that it will break if the string being parsed is greater than 255 chars. Does the proposal imply S\" should not parse strings greater than this length? DPANS94 does not appear to set an upper limit on the number of characters which may be parsed by S". It does require that S" support a minimum of 80 chars.
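One way to make the limit explicit, rather than letting the count byte wrap silently, is to check it before each character is appended. This is a sketch and not part of the proposal: `addchar-checked` is a hypothetical variant of the reference `addchar`, the `c+!` definition is a stand-in for the non-standard word the reference code uses, and the 255 limit assumes an 8-bit count.

```forth
\ Sketch only: a length check bolted onto the reference addchar.
: c+!  ( n c-addr -- )             \ stand-in for the non-standard C+!
  dup >r  c@ +  r> c! ;

: addchar-checked  ( char $dest -- )
  dup c@ 255 = abort" escaped string longer than 255 chars"
  tuck count + c!                  \ store char after the current end
  1 swap c+! ;                     \ bump the count byte
```

Whether S\" should refuse such strings or switch to a buffer that is not counted is left open by the question above.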
> Only the first and third are consistent, and only the first is what the
> author intended; no spaces. I don't think Stephen meant parsed in the
> sense of parsed and compiled.
Well, if the 'cdef' is ignored, I'd call that the rational choice. I assumed parsed as a word and either compiled or flagged as a "not in the dictionary" error. Silly me!
Jerry
--
Engineering is the art of making what you want from things you can get.