Message from discussion RfD: Escaped Strings
Alex McDonald  
Newsgroups: comp.lang.forth
From: Alex McDonald <b...@rivadpm.com>
Date: Thu, 12 Jul 2007 20:16:51 +0100
Local: Thurs, Jul 12 2007 3:16 pm
Subject: Re: RfD: Escaped Strings

Peter Knaggs wrote:
> 21 August 2006, Stephen Pelc

> 20060822 Updated solution section.
> 20060821 First draft.

> Rationale
> =========

> Problem
> -------
> The word S" 6.1.2165 is the primary word for generating strings.
> In more complex applications, it suffers from several deficiencies:
> 1) the S" string can only contain printable characters,
> 2) the S" string cannot contain the '"' character,
> 3) the S" string cannot be used with wide characters as discussed
>    in the Forth 200x internationalisation and XCHAR proposals.

> Current practice
> ----------------
> At least SwiftForth, Gforth and VFX Forth support S\" with very similar
> operations. S\" behaves like S", but uses the '\' character as an escape
> character for the entry of characters that cannot be used with S".

> This technique is widespread in languages other than Forth.

> It has benefit in areas such as
> 1) construction of multi line strings for display by operating system
>    services,
> 2) construction of HTTP headers,
> 3) generation of GSM modem and Telnet control strings.

> The majority of current Forth systems contain code, either in the kernel
> or in application code, that assumes char=byte=au. To avoid breaking
> existing code, we have to live with this practice.

> Considerations
> --------------
> We are trying to integrate several issues:

> 1) no/least code breakage
> 2) minimal standards changes
> 3) variable width character sets
> 4) small system functionality

> Item 1) is about the common char=byte=au assumption.
> Item 2) includes the use of COUNT to step through memory and the impact
>         of char in the file word sets.
> Item 3) has to rationalise a fixed width serial/comms channel with 1..4
>         byte characters, e.g. UTF-8
> Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.

> The basis of the current approach is to use the terminology of primitive
> characters and extended characters. A primitive character (called a
> pchar here) is a fixed-width unit handled by EMIT and friends. It
> corresponds to the current ANS definition of a character. An extended
> character (called an xchar here) consists of one or more primitive
> characters and represents the encoding for a "display unit". A string is
> represented by caddr/len in terms of primitive characters.

> The consequences of this are:

> 1) No existing code is broken.
> 2) Most systems have only one keyboard and only one screen/display unit,
>    but may have several additional comms channels. The impact of a
>    keyboard driver having to convert Chinese or Russian characters into
>    a (say) UTF-8 sequence is minimal compared to handling the key stroke
>    sequences. Similarly on display.
> 3) Comms channels and files work as expected.
> 4) 16-bit embedded systems can handle all character widths as they are
>    described as strings.
> 5) No conflict arises with the XCHARs proposal.

> Multiple encodings can be handled if they share a common primitive
> character size - nearly all of these are described in terms of octets:
> TCP/IP, UTF-8, UTF-16, UTF-32, ...

> The XCHARs proposal can be used to handle extended characters on the
> stack. XEMIT and friends allow us to handle some additional odd-ball
> requirements such as 9-bit control characters, e.g. for the MDB bus used
> by vending machines.

> Solution
> --------
> To ease discussion we refer to characters handled by C@, C! and friends
> as "primitive characters" or pchars. Characters that may be wider than a
> pchar are called "extended characters" or xchars. These are compatible
> with the XCHARs proposal. This proposal does not require systems to
> handle xchars, but does not disenfranchise those that do.

> S\" is used like S" but treats the '\' character specially. One or more
> characters after the  '\' indicate what is substituted. The following
> list is what is currently available in the Forth systems surveyed.

> \a      BEL (alert, ASCII 7)
> \b      BS (backspace, ASCII 8)
> \e      ESC (not in C99, ASCII 27)
> \f      FF (form feed, ASCII 12)
> \l      LF (ASCII 10)
> \m      CR/LF pair (ASCII 13, 10) - for HTML etc.
> \n      newline - CRLF for Windows/DOS, LF for Unices
> \q      double-quote (ASCII 34)
> \r      CR (ASCII 13)
> \t      HT (tab, ASCII 9)
> \v      VT (ASCII 11)
> \z      NUL (ASCII 0)
> \"      "
> \[0-7]+ Octal numerical character value, finishes at the
>         first non-octal character
> \x[0-9a-f]+  Hex numerical character value, finishes at the first
>         non-hex character
> \\      backslash itself
> \       before any other character represents that character

How would the following

   s\" \"

be handled? Win32Forth treats incomplete strings

   s" incomplete

as being correctly terminated at the CR/LF boundary.
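
To make the question concrete, here is a minimal Python sketch of a scanner that looks for the closing quote while honouring the backslash. It is not taken from any Forth system; the function name and the (body, terminated) return shape are assumptions made for illustration only.

```python
def scan_escaped_string(line, start=0):
    """Scan line[start:] for the closing '"' of an S\\" string.

    Hypothetical helper: returns (body, terminated). A backslash protects
    the character after it, so an escaped quote does not close the string.
    """
    i = start
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            i += 2                      # skip the backslash and the escaped char
            continue
        if ch == '"':
            return line[start:i], True  # unescaped quote closes the string
        i += 1
    # Ran off the end of the line without finding a closing quote.
    return line[start:], False


# The text that follows `s\" ` in the example above is just `\"`.
body, terminated = scan_escaped_string('\\"')
```

With this scanner the only quote on the line is consumed as an escape, so `terminated` comes back false. A system that behaves like Win32Forth would presumably accept the body as ending at the line boundary; another system might report an error instead.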

> The following three of these cause parsing and readability problems. As
> far as I know, requiring characters to come in 8 bit units will not
> upset any systems. Systems with characters less than 7 bits are non-
> compliant, and I know of no 7 bit CPUs. All current systems use
> character units of 8 bits or more.

> \[0-7]+      Octal numerical character value, finishes at the first
>              non-octal character
> \x[0-9a-f]+  Hex numerical character value, finishes at the first
>              non-hex character

> Why do we need two representations, both of variable length? This
> proposal selects the hexadecimal representation, requiring two hex
> digits. A consequence of this is that xchars must be represented as a
> sequence of pchars. Although initially seen as a problem by some people,
> it avoids at least the following problems:

> 1) Endian issues when transmitting an xchar, e.g. big-endian host to
>    little-endian comms channel
> 2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
>    system.
> 3) Does not have problems in distinguishing the end of the number from a
>    following character such as '0' or 'A'.
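
As a rough illustration of point 1, the Python sketch below spells one xchar as a sequence of two-digit hex escapes and compares that with storing it as a single 16-bit value. The euro sign is an arbitrary example character and the variable names are assumptions, not part of the proposal.

```python
euro = "\u20ac"  # EURO SIGN, chosen only as an example xchar

# As UTF-8 the xchar is three octets (pchars) in a fixed order.
utf8_pchars = euro.encode("utf-8")

# The same octets written as two-digit hex escapes, one per pchar.
as_escapes = "".join("\\x%02X" % b for b in utf8_pchars)

# Stored as one 16-bit value, the octet order depends on the host.
little_endian = (0x20AC).to_bytes(2, "little")
big_endian = (0x20AC).to_bytes(2, "big")
```

The escape spelling is the same on any host, whereas the two 16-bit layouts differ in octet order.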

> At least one system (Gforth) already supports UTF-8 as its native
> character set, and one system (JaxForth) used UTF-16. These systems are
> not affected.

I'm confused by the preceding section, and by how an octal or hex escape
is terminated. Is \x12AB equivalent to the three pchars 0x12, 'A' and 'B',
or is it the single value 0x12AB?
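
The two readings can be put side by side in a short Python sketch. Both decoders are hypothetical and assume well-formed input (no error handling): one takes exactly two hex digits, as the proposal's text says it requires, and the other runs to the first non-hex character, as in the surveyed list.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_hex_fixed(text):
    """Treat \\x as followed by exactly two hex digits."""
    out, i = [], 0
    while i < len(text):
        if text.startswith("\\x", i):
            out.append(int(text[i + 2:i + 4], 16))  # two digits, no more
            i += 4
        else:
            out.append(ord(text[i]))                # ordinary character
            i += 1
    return out


def decode_hex_greedy(text):
    """Treat \\x as followed by hex digits up to the first non-hex char."""
    out, i = [], 0
    while i < len(text):
        if text.startswith("\\x", i):
            j = i + 2
            while j < len(text) and text[j] in HEX_DIGITS:
                j += 1                              # swallow every hex digit
            out.append(int(text[i + 2:j], 16))
            i = j
        else:
            out.append(ord(text[i]))
            i += 1
    return out


fixed = decode_hex_fixed("\\x12AB")    # [0x12, ord('A'), ord('B')]
greedy = decode_hex_greedy("\\x12AB")  # [0x12AB]
```

Under the fixed-width rule the escape ends after "12" and 'A' and 'B' are ordinary characters; under the variable-length rule they are swallowed into the number.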

[snipped]

