Peter Knaggs wrote:
> 21 August 2006, Stephen Pelc
> 20060822 Updated solution section.
> 20060821 First draft.
> Rationale
> =========
> Problem
> -------
> The word S" 6.1.2165 is the primary word for generating strings.
> In more complex applications, it suffers from several deficiencies:
> 1) the S" string can only contain printable characters,
> 2) the S" string cannot contain the '"' character,
> 3) the S" string cannot be used with wide characters as discussed
> in the Forth 200x internationalisation and XCHAR proposals.
> Current practice
> ----------------
> At least SwiftForth, Gforth and VFX Forth support S\" with very similar
> operations. S\" behaves like S", but uses the '\' character as an escape
> character for the entry of characters that cannot be used with S".
> This technique is widespread in languages other than Forth.
> It has benefit in areas such as
> 1) construction of multi line strings for display by operating system
> services,
> 2) construction of HTTP headers,
> 3) generation of GSM modem and Telnet control strings.
> The majority of current Forth systems contain code, either in the kernel
> or in application code, that assumes char=byte=au. To avoid breaking
> existing code, we have to live with this practice.
> Considerations
> --------------
> We are trying to integrate several issues:
> 1) no/least code breakage
> 2) minimal standards changes
> 3) variable width character sets
> 4) small system functionality
> Item 1) is about the common char=byte=au assumption.
> Item 2) includes the use of COUNT to step through memory and the impact
> of char in the file word sets.
> Item 3) has to rationalise a fixed width serial/comms channel with 1..4
> byte characters, e.g. UTF-8.
> Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.
> The basis of the current approach is to use the terminology of primitive
> characters and extended characters. A primitive character (called a
> pchar here) is a fixed-width unit handled by EMIT and friends. It
> corresponds to the current ANS definition of a character. An extended
> character (called an xchar here) consists of one or more primitive
> characters and represents the encoding for a "display unit". A string is
> represented by caddr/len in terms of primitive characters.
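To make the pchar/xchar distinction concrete, here is a small sketch in Python rather than Forth. It assumes pchars are octets and that the xchar encoding is UTF-8; neither is mandated by the text above, and the variable names are only illustrative.

```python
# Sketch only: assumes pchar = octet and UTF-8 as the xchar encoding.
text = "Gr\u00fc\u00dfe"            # five xchars: G r u-umlaut sharp-s e
pchars = text.encode("utf-8")        # the octets an EMIT-style word would see

xchar_count = len(text)              # number of "display units"
pchar_count = len(pchars)            # this is the len in caddr/len

# Width of each xchar, measured in pchars.
widths = [len(ch.encode("utf-8")) for ch in text]

print(xchar_count, pchar_count, widths)
```

The string length that existing words see is the pchar count, so code that only copies, compares or transmits the string does not need to know where the xchar boundaries fall.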
> The consequences of this are:
> 1) No existing code is broken.
> 2) Most systems have only one keyboard and only one screen/display unit,
> but may have several additional comms channels. The impact of a
> keyboard driver having to convert Chinese or Russian characters into
> a (say) UTF-8 sequence is minimal compared to handling the key stroke
> sequences. Similarly on display.
> 3) Comms channels and files work as expected.
> 4) 16-bit embedded systems can handle all character widths as they are
> described as strings.
> 5) No conflict arises with the XCHARs proposal.
> Multiple encodings can be handled if they share a common primitive
> character size - nearly all of these are described in terms of octets:
> TCP/IP, UTF-8, UTF-16, UTF-32, ...
> The XCHARs proposal can be used to handle extended characters on the
> stack. XEMIT and friends allow us to handle some additional odd-ball
> requirements such as 9-bit control characters, e.g. for the MDB bus used
> by vending machines.
> Solution
> --------
> To ease discussion we refer to characters handled by C@, C! and friends
> as "primitive characters" or pchars. Characters that may be wider than a
> pchar are called "extended characters" or xchars. These are compatible
> with the XCHARs proposal. This proposal does not require systems to
> handle xchars, but does not disenfranchise those that do.
> S\" is used like S" but treats the '\' character specially. One or more
> characters after the '\' indicate what is substituted. The following
> list is what is currently available in the Forth systems surveyed.
> \a BEL (alert, ASCII 7)
> \b BS (backspace, ASCII 8)
> \e ESC (not in C99, ASCII 27)
> \f FF (form feed, ASCII 12)
> \l LF (ASCII 10)
> \m CR/LF pair (ASCII 13, 10) - for HTML etc.
> \n newline - CRLF for Windows/DOS, LF for Unices
> \q double-quote (ASCII 34)
> \r CR (ASCII 13)
> \t HT (tab, ASCII 9)
> \v VT (ASCII 11)
> \z NUL (ASCII 0)
> \" "
> \[0-7]+ Octal numerical character value, finishes at the
> first non-octal character
> \x[0-9a-f]+ Hex numerical character value, finishes at the first
> non-hex character
> \\ backslash itself
> \ before any other character represents that character
How would the following be correctly terminated at the cr/lf boundary?
> The following three of these cause parsing and readability problems. As
> far as I know, requiring characters to come in 8 bit units will not
> upset any systems. Systems with characters less than 7 bits are non-
> compliant, and I know of no 7 bit CPUs. All current systems use
> character units of 8 bits or more.
> \[0-7]+ Octal numerical character value, finishes at the first
> non-octal character
> \x[0-9a-f]+ Hex numerical character value, finishes at the first
> non-hex character
> Why do we need two representations, both of variable length? This
> proposal selects the hexadecimal representation, requiring two hex
> digits. A consequence of this is that xchars must be represented as a
> sequence of pchars. Although initially seen as a problem by some people,
> it avoids at least the following problems:
> 1) Endian issues when transmitting an xchar, e.g. big-endian host to
> little-endian comms channel
> 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
> system.
> 3) Does not have problems in distinguishing the end of the number from a
> following character such as '0' or 'A'.
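As an illustration of point 3, the following is a rough Python sketch of a decoder for the escapes listed earlier, with \x taking exactly two hex digits. The function name `decode_escapes`, its `newline` parameter, the acceptance of upper-case hex digits and the one-pchar-per-source-character assumption are all mine and are not taken from any of the Forth systems surveyed.

```python
# Hypothetical decoder for the escape list above; not code from any Forth system.
SIMPLE = {
    "a": b"\x07", "b": b"\x08", "e": b"\x1b", "f": b"\x0c",
    "l": b"\x0a", "m": b"\x0d\x0a", "q": b'"', "r": b"\x0d",
    "t": b"\x09", "v": b"\x0b", "z": b"\x00", '"': b'"', "\\": b"\\",
}
HEX_DIGITS = "0123456789abcdefABCDEF"   # accepting upper case is an assumption


def decode_escapes(src, newline=b"\n"):
    """Translate the body of an escaped string into a sequence of pchars."""
    out = bytearray()
    i = 0
    while i < len(src):
        ch = src[i]
        i += 1
        if ch != "\\":
            out += ch.encode("latin-1")  # assumes one pchar per source character
            continue
        if i >= len(src):
            raise ValueError("dangling backslash")
        esc = src[i]
        i += 1
        if esc == "x":
            digits = src[i:i + 2]        # exactly two digits, never more
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("\\x needs exactly two hex digits")
            out.append(int(digits, 16))
            i += 2
        elif esc == "n":
            out += newline               # platform dependent, so passed in
        elif esc in SIMPLE:
            out += SIMPLE[esc]
        else:
            out += esc.encode("latin-1") # any other character stands for itself
    return bytes(out)
```

With this sketch, `decode_escapes(r"\x41BC")` yields the three pchars `A`, `B`, `C`, because the number ends after two digits and the following `B` and `C` are ordinary characters. An xchar is written as its pchars, e.g. `r"\xe2\x82\xac"` for the UTF-8 encoding of the euro sign.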
> At least one system (Gforth) already supports UTF-8 as its native
> character set, and one system (JaxForth) used UTF-16. These systems are
> not affected.
I'm confused by the previous, and how to terminate an octal or hex