RfD: Escaped Strings
Newsgroups: comp.lang.forth
From: Peter Knaggs <pkna...@bournemouth.ac.uk>
Date: Wed, 11 Jul 2007 18:23:53 +0100
Local: Wed, Jul 11 2007 1:23 pm
Subject: RfD: Escaped Strings
21 August 2006, Stephen Pelc

20060822 Updated solution section.
20060821 First draft.

Rationale
=========

Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
    in the Forth 200x internationalisation and XCHAR proposals.

Current practice
----------------
At least SwiftForth, Gforth and VFX Forth support S\" with very similar
operations. S\" behaves like S", but uses the '\' character as an escape
character for the entry of characters that cannot be used with S".

This technique is widespread in languages other than Forth.

It has benefit in areas such as
1) construction of multi line strings for display by operating system
    services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.

The majority of current Forth systems contain code, either in the kernel
or in application code, that assumes char=byte=au. To avoid breaking
existing code, we have to live with this practice.

Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the impact
         of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel with 1..4
         byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.

The basis of the current approach is to use the terminology of primitive
characters and extended characters. A primitive character (called a
pchar here) is a fixed-width unit handled by EMIT and friends. It
corresponds to the current ANS definition of a character. An extended
character (called an xchar here) consists of one or more primitive
characters and represents the encoding for a "display unit". A string is
represented by caddr/len in terms of primitive characters.

The consequences of this are:

1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display unit,
    but may have several additional comms channels. The impact of a
    keyboard driver having to convert Chinese or Russian characters into
    a (say) UTF-8 sequence is minimal compared to handling the key stroke
    sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they are
    described as strings.
5) No conflict arises with the XCHARs proposal.

Multiple encodings can be handled if they share a common primitive
character size - nearly all of these are described in terms of octets:
TCP/IP, UTF-8, UTF-16, UTF-32, ...

The XCHARs proposal can be used to handle extended characters on the
stack. XEMIT and friends allow us to handle some additional odd-ball
requirements such as 9-bit control characters, e.g. for the MDB bus used
by vending machines.

Solution
--------
To ease discussion we refer to characters handled by C@, C! and friends
as "primitive characters" or pchars. Characters that may be wider than a
pchar are called "extended characters" or xchars. These are compatible
with the XCHARs proposal. This proposal does not require systems to
handle xchars, but does not disenfranchise those that do.

S\" is used like S" but treats the '\' character specially. One or more
characters after the  '\' indicate what is substituted. The following
list is what is currently available in the Forth systems surveyed.

\a      BEL (alert, ASCII 7)
\b      BS (backspace, ASCII 8)
\e      ESC (not in C99, ASCII 27)
\f      FF (form feed, ASCII 12)
\l      LF (ASCII 10)
\m      CR/LF pair (ASCII 13, 10) - for HTML etc.
\n      newline - CRLF for Windows/DOS, LF for Unices
\q      double-quote (ASCII 34)
\r      CR (ASCII 13)
\t      HT (tab, ASCII 9)
\v      VT (ASCII 11)
\z      NUL (ASCII 0)
\"      "
\[0-7]+ Octal numerical character value, finishes at the
         first non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the first
         non-hex character
\\      backslash itself
\       before any other character represents that character

The following three of these cause parsing and readability problems. As
far as I know, requiring characters to come in 8 bit units will not
upset any systems. Systems with characters less than 7 bits are non-
compliant, and I know of no 7 bit CPUs. All current systems use
character units of 8 bits or more.

\[0-7]+      Octal numerical character value, finishes at the first
              non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the first
              non-hex character

Why do we need two representations, both of variable length? This
proposal selects the hexadecimal representation, requiring two hex
digits. A consequence of this is that xchars must be represented as a
sequence of pchars. Although initially seen as a problem by some people,
it avoids at least the following problems:

1) Endian issues when transmitting an xchar, e.g. big-endian host to
    little-endian comms channel
2) Issues when an xchar is larger than a cell, e.g. UTF-32 on a 16 bit
    system.
3) Difficulty in distinguishing the end of the number from a
    following character such as '0' or 'A'.
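
To illustrate the point, here is a minimal sketch of an xchar written as
a sequence of pchars. It assumes a system that provides S\" with the
two-digit \x rule selected above, 8-bit pchars, and UTF-8 as the
encoding; the word name cafe$ is made up for this example.

```forth
decimal

\ U+00E9 (e with acute accent) is the two-octet UTF-8 sequence C3 A9,
\ so it is written as two pchars; no endian or cell-size question arises.
: cafe$  ( -- c-addr u )
   s\" caf\xC3\xA9" ;

cafe$ nip .                              \ length in pchars; should display 5
cafe$ drop 3 chars + c@ hex . decimal    \ first pchar of the xchar; should display C3
```

The string length counts pchars, not display units, which is exactly the
caddr/len representation described earlier.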

At least one system (Gforth) already supports UTF-8 as its native
character set, and one system (JaxForth) used UTF-16. These systems are
not affected.

\          before any other character represents that character

This is an unnecessary general case, and so is not mandated. By making
it an ambiguous condition, we do not disenfranchise existing
implementations, and leave the way open for future extensions.

Proposal
========

6.2.xxxx S\"
s-slash-quote CORE EXT

Interpretation:
    Interpretation semantics for this word are undefined.

Compilation: ( "ccc<quote>" -- )
    Parse ccc delimited by " (double-quote), using the translation rules
    below. Append the run-time semantics given below to the current
    definition.

Translation rules:
    Characters are processed one at a time and appended to the compiled
    string. If the character is a '\' character it is processed by
    parsing and substituting one or more characters as follows:

    \a      BEL (alert, ASCII 7)
    \b      BS (backspace, ASCII 8)
    \e      ESC (not in C99, ASCII 27)
    \f      FF (form feed, ASCII 12)
    \l      LF (ASCII 10)
    \m      CR/LF pair (ASCII 13, 10)
    \n      implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
    \q      double-quote (ASCII 34)
    \r      CR (ASCII 13)
    \t      HT (tab, ASCII 9)
    \v      VT (ASCII 11)
    \z      NUL (ASCII 0)
    \"      "
    \xAB    A and B are Hexadecimal numerical characters. The resulting
            character is the conversion of these two characters.
    \\      backslash itself
    \       before any other character constitutes an ambiguous
            condition.

Run-time: ( -- c-addr u )
    Return c-addr and u describing a string consisting of the translation
    of the characters ccc. A program shall not alter the returned string.

See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"

Ambiguous conditions occur:
    If a hex value is more than two characters
    If \x is not followed by two hexadecimal characters
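
A short usage sketch, assuming a system that implements S\" as specified
above (the word names are invented for illustration). Because the
interpretation semantics are undefined, the strings are compiled inside
colon definitions.

```forth
\ Tab, hex escape for '!' (ASCII 21 hex) and the system's newline.
: .greeting  ( -- )
   s\" Hello,\tworld\x21\n" type ;

\ HTTP header lines end in CR/LF regardless of host, so \m is used
\ rather than \n; the second \m gives the blank line ending the header.
: http-header  ( -- c-addr u )
   s\" Content-Type: text/html\m\m" ;

\ A double quote inside the string, written both ways.
: .quoted  ( -- )
   s\" say \"hi\" or \qhi\q" type ;
```

Under these assumptions http-header returns a 27-character string: 23
characters of text followed by two CR/LF pairs.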

Reference Implementation
========================
(as yet untested)

Taken from the VFX Forth source tree and modified to remove most
implementation dependencies. Assumes the use of the # and $ numeric
prefixes to indicate decimal and hexadecimal respectively.

Another implementation (with some deviations) can be found at
http://b2.complang.tuwien.ac.at/cgi-bin/viewcvs.cgi/*checkout*/gforth...

decimal

: PLACE         \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
   2dup 2>r                  \ write count last
   1 chars + swap move
   2r> c!                    \ to avoid in-place problems
;

: $,            \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
   dup >r
   here place
   r> 1 chars + allot
   align
;

: addchar       \ char string --
\ *G Add the character to the end of the counted string.
   tuck count + c!
   1 swap c+!
;

: append        \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
   >r
   tuck  r@ count +  swap cmove          \ add source to end
   r> c+!                                \ add length to count
;

: extract2H     \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the
\ ** string, returning the remaining string and the converted
\ ** number. The base is forced to hex and restored afterwards.
   base @ >r  hex
   0 0 2over drop 2 >number 2drop drop   \ convert at most two digits
   >r  2 chars /string r>
   r> base !
;

create EscapeTable      \ -- addr
\ *G Table of translations for \a..\z.
   7 c,         \ \a
   8 c,         \ \b
   char c c,    \ \c
   char d c,    \ \d
   #27 c,       \ \e
   #12 c,       \ \f
   char g c,    \ \g
   char h c,    \ \h
   char i c,    \ \i
   char j c,    \ \j
   char k c,    \ \k
   #10 c,       \ \l
   char m c,    \ \m
   #10 c,       \ \n (Unices only)
   char o c,    \ \o
   char p c,    \ \p
   char " c,     \ \q
   #13 c,       \ \r
   char s c,    \ \s
   9 c,         \ \t
   char u c,    \ \u
   #11 c,       \ \v
   char w c,    \ \w
   char x c,    \ \x
   char y c,    \ \y
   0 c,         \ \z

create CRLF$    \ -- addr ; CR/LF as counted string
  2 c,  #13 c,  #10 c,

internal
: addEscape     \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
   over 0=                              \ zero length check
   if  drop  exit  endif
   >r                                        \ -- caddr len ; R: -- dest
   over c@ [char]
...

Newsgroups: comp.lang.forth
From: Alex McDonald <b...@rivadpm.com>
Date: Thu, 12 Jul 2007 20:16:51 +0100
Local: Thurs, Jul 12 2007 3:16 pm
Subject: Re: RfD: Escaped Strings

How would the following

   s\" \"

be handled? Win32Forth treats incomplete strings

   s" incomplete

as being correctly terminated at the cr/lf boundary.

I'm confused by the previous, and how to terminate an octal or hex
string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?

[snipped]


Newsgroups: comp.lang.forth
From: Peter Knaggs <pkna...@bournemouth.ac.uk>
Date: Fri, 13 Jul 2007 10:08:11 +0100
Local: Fri, Jul 13 2007 5:08 am
Subject: Re: RfD: Escaped Strings

Alex McDonald wrote:

> How would the following

>   s\" \"

> be handled? Win32Forth treats incomplete strings

>   s" incomplete

> as being correctly terminated at the cr/lf boundary.

The current definition of s" does not define what happens in this
circumstance. Consequently this proposal does not define this
condition either. Your solution would be just as valid for s\" as s".

I find it moderately interesting that the rather standard \<newline> is
not included. Traditionally this means ignore the line break.

This is a problem of the existing solutions. This proposal suggests that
\x should be followed by only two characters. Thus your \x12AB would
produce the pchar with hex value 12, followed by 'A' and 'B'.

Newsgroups: comp.lang.forth
From: Alex McDonald <b...@rivadpm.com>
Date: Fri, 13 Jul 2007 02:41:17 -0700
Local: Fri, Jul 13 2007 5:41 am
Subject: Re: RfD: Escaped Strings
On Jul 13, 10:08 am, Peter Knaggs <pkna...@bournemouth.ac.uk> wrote:

That would be a useful enhancement; but perhaps \c might be clearer,
as it differentiates between a silent space as in \<newline> and \
<newline> and permits comments.

s\" abcdefg\c           \ continue on a new line
    hijklmn"            \ strip leading blanks & catenate, giving
                        \ abcdefghijklmn

Ah, thanks, clear.

--
Regards
Alex McDonald


Newsgroups: comp.lang.forth
From: stephen...@mpeforth.com (Stephen Pelc)
Date: Fri, 13 Jul 2007 09:49:43 GMT
Local: Fri, Jul 13 2007 5:49 am
Subject: Re: RfD: Escaped Strings
On Thu, 12 Jul 2007 20:16:51 +0100, Alex McDonald <b...@rivadpm.com>
wrote:

>How would the following

>   s\" \"

>be handled? Win32Forth treats incomplete strings

>   s" incomplete

It's a badly formed string, and so ambiguous. I've added this to the
ambiguous conditions list.

>I'm confused by the previous, and how to terminate an octal or hex
>string. Is \x12AB the equivalent of pchars 12 'A' and 'B', or is it 0x12AB?

This was part of the discussion, so we define \xABcdef as generating
the primitive character AB and cdef is then parsed.
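
As a small sketch of that rule (assuming a system whose S\" follows the
proposal; hex-demo is a made-up name), the hex escape consumes exactly
two digits and the characters after it are ordinary text:

```forth
: hex-demo  ( -- c-addr u )
   s\" \x41BC" ;      \ 41 hex is 'A'; B and C are then parsed normally

hex-demo type        \ should display ABC on such a system
```

A system with a variable-length \x would instead try to convert 41BC as
one number, which is the behaviour the two-digit rule is meant to avoid.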

The octal notation is not specified in the normative part of the
proposal.

Stephen

--
Stephen Pelc, stephen...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads


Newsgroups: comp.lang.forth
From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Date: Fri, 13 Jul 2007 12:08:30 GMT
Local: Fri, Jul 13 2007 8:08 am
Subject: Re: RfD: Escaped Strings

Alex McDonald <b...@rivadpm.com> writes:
>How would the following

>   s\" \"

>be handled? Win32Forth treats incomplete strings

>   s" incomplete

>as being correctly terminated at the cr/lf boundary.

That's what the standard prescribes in Section 3.4.1:

|[If no delimiter character is present], the string continues up to
|and including the last character in the parse area, and the number in
|>IN is changed to the length of the input buffer, thus emptying the
|parse area.

Since the proposal uses the usual "parse ... delimited by ..." idiom,
I expect that it works the same way, modulo not interpreting the " in
\" as delimiter.  Maybe this could be made clearer in the proposal.

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/


Newsgroups: comp.lang.forth
From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Date: Fri, 13 Jul 2007 12:15:31 GMT
Local: Fri, Jul 13 2007 8:15 am
Subject: Re: RfD: Escaped Strings

Alex McDonald <b...@rivadpm.com> writes:
>On Jul 13, 10:08 am, Peter Knaggs <pkna...@bournemouth.ac.uk> wrote:
>> I find it moderately interesting that the rather standard \<newline> is
>> not included. Traditionally this means ignore the line break.

>That would be a useful enhancement;

No existing practice in Forth.

> but perhaps \c might be clearer,
>as it differentiates between a silent space as in \<newline> and \
><newline> and permits comments.

>s\" abcdefg\c           \ continue on a new line
>    hijklmn"            \ blank strip leading & catenate for
>abcdefghijklmn

In C one can construct a longer literal string by writing two adjacent
literal strings, separated only by white space and comments.  E.g.:

#include <stdio.h>

int main(void)
{
  printf("hello, " /* comment */
         "world");
  return 0;
}

Note that this allows a little more flexibility about where the string
starts in the next line.  Inspired by this, we could do it in Forth
with words like +" and +\", which would extend a string started with
S" or S\".  But no existing practice, either, so not for this
proposal.

- anton


Newsgroups: comp.lang.forth
From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Date: Fri, 13 Jul 2007 12:24:39 GMT
Local: Fri, Jul 13 2007 8:24 am
Subject: Re: RfD: Escaped Strings

Peter Knaggs <pkna...@bournemouth.ac.uk> writes:
>21 August 2006, Stephen Pelc

Pretty good.  There's always room for improvement:

- Test cases should be added before the CfV.

- I guess that you want \xAB to represent a (primitive) character.
This does not come out clearly (actually, if there was no mention of
XCHARS and definition of "primitive characters" in the informative
sections, this would be clearer).

- It seems that the detailed description of an existing solution in
the "Solution" section is confusing, because it is very similar to the
proposal, but still different.  Better leave it away and just mention
the issues (like fixed-length vs. variable-length \x) in a discussion
section.
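
A minimal sketch of what such test cases might look like, assuming a
system that implements S\" per the proposal and has 8-bit ASCII pchars.
The helper names (expect=, nth-char, esc1) are invented here and are not
part of the proposal or of any existing test suite.

```forth
decimal

: expect=  ( x1 x2 -- )              \ abort unless the two values match
   <> abort" escape test failed" ;

: nth-char  ( c-addr u n -- char )   \ fetch pchar n (0-based) of a string
   nip chars + c@ ;

: esc1  ( -- c-addr u )  s\" \a\z\q\\\x7F" ;

esc1 nip          5 expect=      \ five pchars in total
esc1 0 nth-char   7 expect=      \ \a    BEL
esc1 1 nth-char   0 expect=      \ \z    NUL
esc1 2 nth-char  34 expect=      \ \q    double quote
esc1 3 nth-char  92 expect=      \ \\    backslash
esc1 4 nth-char 127 expect=      \ \x7F  DEL
```

The string is compiled inside a colon definition because the proposal
leaves the interpretation semantics of S\" undefined.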

- anton


Newsgroups: comp.lang.forth
From: stephen...@mpeforth.com (Stephen Pelc)
Date: Fri, 13 Jul 2007 14:05:19 GMT
Local: Fri, Jul 13 2007 10:05 am
Subject: Re: RfD: Escaped Strings
On Fri, 13 Jul 2007 12:24:39 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:
>- Test cases should be added before the CfV.

Volunteer? You? The gForth test suite?

>- I guess that you want \xAB to represent a (primitive) character.
>This does not come out clearly (actually, if there was no mention of
>XCHARS and definition of "primitive characters" in the informative
>sections, this would be clearer).

Given the problems with the definition of char throughout the
document, the definition of char in terms of primitive characters
*has* to be done in a different section of the document.

For example, if char=16 bits on a byte-addressed machine, there
is no way for a standard program to write a byte to a file!

If you use a variable width character set such as UTF-8, what does
CMOVE mean?

The only practical solutions I see are
a) define char=byte
b) define char=implementation defined unit

Given the amount of code that currently assumes char=byte=au, the
least code breakage and maximum instant compliance is to replace
"char" in the document by "primitive char" ("pchar") and then to
define "extended char" ("xchar") in terms of pchars. The vast
majority of systems can then happily impose char=byte=au.

>- It seems that the detailed description of an existing solution in
>the "Solution" section is confusing, because it is very similar to the
>proposal, but still different.  Better leave it away and just mention
>the issues (like fixed-length vs. variable-length \x) in a discussion
>section.

Revamped and posted separately.


