RfD: Escaped Strings S\" (version 5)


Peter Knaggs

Oct 30, 2007, 5:16:12 PM
30 October 2007, Stephen Pelc/Peter Knaggs

20071030 Clarification of case sensitivity:
Escape character is case sensitive,
Hex digits are not.
20070913 Added clarifications.
20070719 Modified ambiguous condition.
Added ambiguous conditions to definition of S\".
Added test cases.
Corrected Reference Implementation.
20070712 Redrafted non-normative portions.
20060822 Updated solution section.
20060821 First draft.

Rationale
=========

Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
in the Forth 200x internationalisation and XCHAR proposals.

Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very
similar operations. S\" behaves like S", but uses the '\' character
as an escape character for the entry of characters that cannot be
used with S".

This technique is widespread in languages other than Forth.

It has benefit in areas such as
1) construction of multiline strings for display by operating
system services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.

The majority of current Forth systems contain code, either in the
kernel or in application code, that assumes char=byte=au. To avoid
breaking existing code, we have to live with this practice.

The following list describes what is currently available in the
surveyed Forth systems that support escaped strings.

\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10) - for HTML etc.
\n newline - CRLF for Windows/DOS, LF for Unices
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character
\\ backslash itself
\ before any other character represents that character
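
To make the list above concrete, here is a minimal sketch of the escapes
in use. It assumes a system that already provides S\" with the
behaviour surveyed here; the word names HTTP-OK$ and QUOTED$ are
hypothetical and appear in none of the surveyed systems.

```forth
\ Assumption: S\" is available as described in the survey above.
\ HTTP-OK$ and QUOTED$ are illustrative names only.

: http-ok$ ( -- c-addr u )
   \ \m appends a CR/LF pair, so the status line ends with CR LF
   \ and a second \m supplies the blank line that ends the header.
   S\" HTTP/1.0 200 OK\m\m" ;

: quoted$ ( -- c-addr u )
   \ \q embeds a double-quote, which plain S" cannot contain.
   S\" say \qhi\q" ;
```

Typing `quoted$ type` at the console should display the text with the
embedded quotes, while `http-ok$` yields a string that can be passed
unchanged to a socket or file write word.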

Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the
impact of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel
with 1..4 byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.

The basis of the current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here) is a fixed-width unit handled by EMIT and
friends as well as C@, C! and friends. A pchar corresponds to the
current ANS definition of a character. Characters that may be
wider than a pchar are called "extended characters" or xchars.
The xchars are an integer multiple of pchars. An xchar consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.

The consequences of this are:

1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display
unit, but may have several additional comms channels. The
impact of a keyboard driver having to convert Chinese or Russian
characters into a (say) UTF-8 sequence is minimal compared to
handling the key stroke sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they
are described as strings.
5) No conflict arises with the XCHARs proposal.

Multiple encodings can be handled if they share a common primitive
character size - nearly all encodings are described in terms of
octets, e.g. TCP/IP, UTF-8, UTF-16, UTF-32, ...

Approach
--------
This proposal does not require systems to handle xchars, and does
not disenfranchise those that do.

S\" is used like S" but treats the '\' character specially. One
or more characters after the '\' indicate what is substituted.
Three of the observed escapes (the two numeric forms and the catch-all
rule, all discussed below) cause parsing and readability
problems. As far as I know, requiring characters to come in
8 bit units will not upset any systems. Systems with characters
less than 7 bits are non-compliant, and I know of no 7 bit CPUs.
All current systems use character units of 8 bits or more.

Of observed current practice, the following two are problematic.
\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character

Why do we need two representations, both of variable length?
This proposal selects the hexadecimal representation, requiring
two hex digits. A consequence of this is that xchars must be
represented as a sequence of pchars. Although initially seen as a
problem by some people, it avoids at least the following problems:
1) Endian issues when transmitting an xchar, e.g. big-endian host
to little-endian comms channel
2) Issues when an xchar is larger than a cell, e.g. UTF-32 on
a 16 bit system.
3) Does not have problems in distinguishing the end of the
number from a following character such as '0' or 'A'.
At least one system (Gforth) already supports UTF-8 as its native
character set, and one system (JaxForth) used UTF-16. These systems
are not affected.

\ before any other character represents that character

This is an unnecessary general case, and so is not mandated. By
making it an ambiguous condition, we do not disenfranchise
existing implementations, and leave the way open for future
extensions.

Note that now the number-prefix extension has been accepted, 3.4.1
Parsing contains a definition of <hexdigit> to be a case insensitive
hexadecimal digit [0-9a-fA-F].
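
As an illustration of the two-digit rule, the following sketch shows an
xchar written as a sequence of pchars, and a hex escape followed
directly by a digit. It assumes 8-bit chars and a system providing S\"
as proposed; the word names EURO$ and TAB0$ are hypothetical.

```forth
\ Assumption: S\" as proposed, with 8-bit chars.
\ EURO$ and TAB0$ are illustrative names only.

: euro$ ( -- c-addr u )
   \ U+20AC in UTF-8 is the three octets E2 82 AC, so the xchar
   \ is entered as three pchars and no endian question arises.
   S\" \xE2\x82\xAC" ;

: tab0$ ( -- c-addr u )
   \ Exactly two hex digits are consumed, so this is HT (09)
   \ followed by the character '0', not a single value 090.
   S\" \x090" ;
```

With a variable-length \x form, the second string could not be written
without some extra terminator after the number.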

Proposal
========

6.2.xxxx S\"
s-slash-quote CORE EXT

Interpretation:
Interpretation semantics for this word are undefined.

Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by " (double-quote), using the translation
rules below. Append the run-time semantics given below to the
current definition.

Translation rules:
Characters are processed one at a time and appended to the
compiled string. If the character is a '\' character it is
processed by parsing and substituting one or more characters
as follows, where the character after the backslash is case
sensitive:
\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10)
\n implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\x<hexdigit><hexdigit>
The resulting character is the conversion of these two
hexadecimal digits. An ambiguous condition exists if \x
is not followed by two hexadecimal characters.
\\ backslash itself
\ An ambiguous condition exists if a \ is placed before any
character, other than those defined in 6.2.xxxx S\".

Run-time: ( -- c-addr u )
Return c-addr and u describing a string consisting of the translation
of the characters ccc. A program shall not alter the returned string.

See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"

Labelling
=========
Ambiguous conditions occur:
If \x is not followed by two hexadecimal characters.
If a \ is placed before any character, other than those defined
in 6.2.xxxx S\".


Reference Implementation
========================
Taken from the VFX Forth source tree and modified to remove
implementation dependencies.

Another implementation (with some deviations) can be found at
http://b2.complang.tuwien.ac.at/cgi-bin/viewcvs.cgi/*checkout*/gforth/quotes.fs?root=gforth

decimal

: c+! \ c c-addr --
\ *G Add character C to the contents of address C-ADDR.
tuck c@ + swap c!
;

: addchar \ char string --
\ *G Add the character to the end of the counted string.
tuck count + c!
1 swap c+!
;

: append \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
>r
tuck r@ count + swap cmove \ add source to end
r> c+! \ add length to count
;

: extract2H \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number (regardless of BASE) from the
\ ** start of the string, returning the remaining string
\ ** and the converted number.
base @ >r hex
0 0 2over drop 2 >number 2drop drop
>r 2 /string r>
r> base !
;

create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
7 c, \ \a
8 c, \ \b
char c c, \ \c
char d c, \ \d
27 c, \ \e
12 c, \ \f
char g c, \ \g
char h c, \ \h
char i c, \ \i
char j c, \ \j
char k c, \ \k
10 c, \ \l
char m c, \ \m
10 c, \ \n (Unices only)
char o c, \ \o
char p c, \ \p
char " c, \ \q
13 c, \ \r
char s c, \ \s
9 c, \ \t
char u c, \ \u
11 c, \ \v
char w c, \ \w
char x c, \ \x
char y c, \ \y
0 c, \ \z

create CRLF$ \ -- addr ; CR/LF as counted string
2 c, 13 c, 10 c,

: addEscape \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
over 0= \ zero length check
if drop exit then
>r \ -- caddr len ; R: -- dest
over c@ [char] x = if \ hex number?
1 /string extract2H r> addchar exit
then
over c@ [char] m = if \ CR/LF pair
1 /string 13 r@ addchar 10 r> addchar exit
then
over c@ [char] n = if \ newline? (CR/LF on this system)
1 /string crlf$ count r> append exit
then
over c@ [char] a [char] z 1+ within if
over c@ [char] a - EscapeTable + c@ r> addchar
else
over c@ r> addchar
then
1 /string
;

: parse\" \ caddr len dest -- caddr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters much as C does. The
\ ** translated string is a counted string at *\i{dest}
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" "
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
dup >r 0 swap c! \ zero destination
begin \ -- caddr len ; R: -- dest
dup
while
over c@ [char] " <> \ check for terminator
while
over c@ [char] \ = if \ deal with escapes
1 /string r@ addEscape
else \ normal character
over c@ r@ addchar 1 /string
then
repeat then
dup \ step over terminating "
if 1 /string then
r> drop
;

: readEscaped \ "string" -- caddr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{PAD}.
source >in @ /string tuck \ -- len caddr len
pad parse\" nip
- >in +!
pad
;

: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
readEscaped count state @ if
postpone sliteral
then
; IMMEDIATE


Test Cases
==========

HEX
{ : GC5 S\" XY" ; -> }
{ GC5 SWAP DROP -> 2 }
{ GC5 DROP DUP C@ SWAP CHAR+ C@ -> 58 59 }

{ S\" " SWAP DROP -> 0 }

{ S\" \a" SWAP C@ -> 1 07 } \ BEL Bell
{ S\" \b" SWAP C@ -> 1 08 } \ BS Backspace
{ S\" \e" SWAP C@ -> 1 1B } \ ESC Escape
{ S\" \f" SWAP C@ -> 1 0C } \ FF Formfeed
{ S\" \l" SWAP C@ -> 1 0A } \ LF Linefeed
{ S\" \q" SWAP C@ -> 1 22 } \ " Double Quote
{ S\" \r" SWAP C@ -> 1 0D } \ CR Carriage Return
{ S\" \t" SWAP C@ -> 1 09 } \ TAB Horizontal Tab
{ S\" \v" SWAP C@ -> 1 0B } \ VT Vertical Tab
{ S\" \z" SWAP C@ -> 1 00 } \ NUL No Character
{ S\" \"" SWAP C@ -> 1 22 } \ " Double Quote
{ S\" \\" SWAP C@ -> 1 5C } \ \ Back Slash

{ S\" \n" 2DROP -> } \ System dependent
{ S\" \m" SWAP DUP C@ SWAP CHAR+ C@ -> 2 0D 0A } \ CR/LF pair
{ S\" \x0F0" SWAP DUP C@ SWAP CHAR+ C@ -> 2 0F 30 } \ Given Char
{ S\" \x1Fa" SWAP DUP C@ SWAP CHAR+ C@ -> 2 1F 61 }
{ S\" \xaBc" SWAP DUP C@ SWAP CHAR+ C@ -> 2 AB 63 }

{ S\" S\\\" \\a\"" EVALUATE SWAP C@ -> 1 7 } \ Evaluate S\"


Credits
=======
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441,
fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Peter Knaggs, pkn...@bournemouth.ac.uk
School of Design, Engineering and Computing,
University of Bournemouth, Dorset BH12 5BB, England
tel: +44 (0)12 0296 5625,
fax: +44 (0)12 0296 5314
web: http://dec.bournemouth.ac.uk/staff/pknaggs

Marcel Hendrix

Oct 31, 2007, 3:42:53 PM
Peter Knaggs <pkn...@bournemouth.ac.uk> wrote Re: RfD: Escaped Strings S\" (version 5)
[..]

> Translation rules:
> Characters are processed one at a time and appended to the
> compiled string. If the character is a '\' character it is
> processed by parsing and substituting one or more characters
> as follows, where the character after the backslash is case
> sensitive:
> \a BEL (alert, ASCII 7)
[..]

> \\ backslash itself
> \ An ambiguous condition exists if a \ is placed before any
> character, other than those defined in 6.2.xxxx S\".
[..]

Why was it necessary to make this an ambiguous condition?
S\" is not used by any systems not represented in the Forth 200x effort.
IMHO it is a bit silly (for a standards effort) not to mention all \<char>
codes in use today, and/or to allow future vendor-specific extensions that
will break portability of code and require work-arounds.

A general solution could be to require that a deferred (and standardized)
hook word is executed in case an unknown code is encountered. This would
guarantee that any future S\" problems can be fixed by user code.
Thinking this through should convince most people that simply forbidding
non-standard \-codes is by far preferable.

-marcel

Elizabeth D Rather

Oct 31, 2007, 5:09:34 PM

I don't have any strong feelings about this particular instance of an
"ambiguous condition", but want to suggest that a more appropriate way
of looking at it is that the proposed standard would *guarantee* success
with the listed codes, but make no guarantees about others. That's a
more positive view than saying that codes not on the list are
"forbidden". In general, that's how most "ambiguous conditions" are
intended: some, of course, are errors, but others are merely cases in
which no specific behavior is mandated.

IMO hooks such as you suggest shouldn't be mandated, but implementors
who see a need or value can provide them.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310-491-3356
5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454
Hawthorne, CA 90250
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Anton Ertl

Nov 3, 2007, 1:40:30 PM
m...@iae.nl (Marcel Hendrix) writes:
>> \ An ambiguous condition exists if a \ is placed before any
>> character, other than those defined in 6.2.xxxx S\".
>[..]
>
>Why was it necessary to make this an ambiguous condition?

The proposal does not define what the system should do, so the use by
a program is an ambiguous condition.

>S\" is not used by any systems not represented in the Forth 200x effort.
>IMHO it is a bit silly (for a standards effort) not to mention all \<char>
>codes in use today

That would require all systems to support all these codes in
order to support the proposal, although some are redundant.

>and/or to allow future vendor-specific extensions that
>will break portability of code and require work-arounds.

The standard (and such proposals) describe only the supported
interface between programs and systems. There is no mechanism there
that forbids systems to have extensions in the areas not specified by
the standard (how could such a mechanism work?).

We can only hope that system implementors are sensible and do not
squat with lots of system-specific extensions on undefined codes.
OTOH, if they implement a common extension across many systems, that
could establish common practice for the next standard.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/

Anton Ertl

Nov 3, 2007, 1:53:35 PM
Peter Knaggs <pkn...@bournemouth.ac.uk> writes:
>Proposal
>========
>
>6.2.xxxx S\"
>s-slash-quote CORE EXT
>
>Interpretation:
> Interpretation semantics for this word are undefined.

Do we really want to have that restriction on S\" ? Which of the
systems that have S\" have this restriction? Even the reference
implementation works interpretively. We should just put in the usual
restrictions against ticking, POSTPONEing, and [COMPILE]ing to allow
state-smart implementations.

> \x<hexdigit><hexdigit>

And <hexdigit> refers to the definition that came with
X:number-prefixes, right? Clever. Maybe this should be made clearer
by including 3.4.1.3 here:

>See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"

Or maybe that's too far removed, and the reference should be given
earlier.

Dick van Oudheusden

Dec 2, 2007, 5:32:15 AM
On 30 Oct, 22:16, Peter Knaggs <pkna...@bournemouth.ac.uk> wrote:
>
> : parse\" \ caddr len dest -- caddr' len'
> \ *G Parses a string up to an unescaped '"', translating '\'
> \ ** escapes to characters much as C does. The
> \ ** translated string is a counted string at *\i{dest}
> \ ** The supported escapes (case sensitive) are:
> \ *D \a BEL (alert)
> \ *D \b BS (backspace)
> \ *D \e ESC (not in C99)
> \ *D \f FF (form feed)
> \ *D \l LF (ASCII 10)
> \ *D \m CR/LF pair - for HTML etc.
> \ *D \n newline - CRLF for Windows/DOS, LF for Unices
> \ *D \q double-quote
> \ *D \r CR (ASCII 13)
> \ *D \t HT (tab)
> \ *D \v VT
> \ *D \z NUL (ASCII 0)
> \ *D \" "
> \ *D \xAB Two char Hex numerical character value
> \ *D \\ backslash itself
> \ *D \ before any other character represents that character

Perhaps it is also a good idea to standardize the underlying word

parse\" ( -- c-addr u)

in this proposal so that programmers can use it for other words, like
.\" and ,\" and so on?

Dick

Ureir

Dec 2, 2007, 6:04:44 AM
Dick van Oudheusden wrote :
> .\" and ,\" and so on?
>
> Dick

It's useless, they can be written with standard S\" word:

: .\" POSTPONE S\" POSTPONE TYPE ; IMMEDIATE

: .\( ['] S\" EXECUTE TYPE ; IMMEDIATE

and so on...

--
Regards,

Ureir.

Anton Ertl

Dec 2, 2007, 5:44:04 AM
Dick van Oudheusden <dvoudh...@gmail.com> writes:
>On 30 Oct, 22:16, Peter Knaggs <pkna...@bournemouth.ac.uk> wrote:
>>
>> : parse\" \ caddr len dest -- caddr' len'
...

>Perhaps it is also a good idea to standardize the underlying word
>
> parse\" ( -- c-addr u)
>
>in this proposal so that programmers can use it for other words, like
>.\" and ,\" and so on?

Yes, such a word would be useful, especially if S\" has no
interpretation semantics (as currently proposed) or if it has
interpretation semantics, but cannot be ticked or POSTPONEd to allow
state-smart implementations (as I suggested).

However, the word with the ( "string<">" -- c-addr u ) stack effect is
called READESCAPED in the reference implementation.

A lower-level word like the PARSE\" ( c-addr1 u1 c-addr2 -- c-addr2 u2 )
from the reference implementation might be useful in a few additional
cases and thus might be a better choice. One thing that has to be
considered is the length of the buffer at c-addr2. With the current
escape sequences it is good enough if the buffer has length u1, and
that will also be the case for any likely candidates for escape
sequences, so the stack effect above may be ok.

An advantage of standardizing a word that passes the destination
buffer rather than a word that writes to a system-supplied buffer is
that we don't need to specify the lifetime of the result. The
disadvantage is that we need to consider the buffer length, what
happens on overflow (not an issue here), and the programmer can make a
mistake and supply a buffer smaller than the specified length, leading
to a buffer overflow.
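
The buffer-length point can be sketched with a small wrapper around the
lower-level word. This assumes the reference implementation's
parse\" ( caddr len dest -- caddr' len' ) is loaded; because that word
builds a counted string at dest, the buffer needs one char more than
the source and the source must not exceed 255 chars. The names ESC-BUF
and TRANSLATE are hypothetical.

```forth
\ Assumption: parse\" from the reference implementation is loaded.
\ ESC-BUF and TRANSLATE are illustrative names only.

create esc-buf  256 chars allot   \ count char plus up to 255 chars

: translate ( c-addr1 u1 -- c-addr2 u2 )
   \ No escape expands to more chars than it occupies in the
   \ source, so checking the source length bounds the result.
   dup 255 > abort" source too long for counted-string buffer"
   esc-buf parse\"  2drop          \ discard the unparsed remainder
   esc-buf count ;
```

For example, `s" a\tb" translate` leaves a three-character string: 'a',
HT and 'b'. The result lives in ESC-BUF until the next call, which is
exactly the lifetime question that a caller-supplied buffer avoids
having to specify.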

Anton Ertl

Dec 2, 2007, 6:12:07 AM
Ureir <herve.Remove...@AndAlsoThis.hpeignelin.freesurf.fr> writes:
>Dick van Oudheusden wrote :
>> On 30 Oct, 22:16, Peter Knaggs <pkna...@bournemouth.ac.uk> wrote:
>> Perhaps it is also a good idea to standardize the underlying word
>>
>> parse\" ( -- c-addr u)
>>
>> in this proposal so that programmers can use it for other words, like
>> .\" and ,\" and so on?
>>
>> Dick
>
>It's useless, they can be written with standard S\" word:
>
>: .\" POSTPONE S\" POSTPONE TYPE ; IMMEDIATE

That won't be standard, if the proposal excludes POSTPONEing S\" in
order to allow STATE-smart implementations like the reference
implementation.

>: .\( ['] S\" EXECUTE TYPE ; IMMEDIATE

That's not supported by the current proposal, because it only
defines the compilation semantics for S\".
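
A sketch of how .\" could be written without POSTPONEing or ticking
S\" at all, by building on the parsing word instead. It assumes
readEscaped from the reference implementation is loaded and that
SLITERAL (String word set) is present; it is one possible definition,
not part of the proposal.

```forth
\ Assumption: readEscaped ( "ccc<quote>" -- caddr ) from the
\ reference implementation is loaded; it returns a counted string.

: .\" ( "ccc<quote>" -- )
   \ Compile-only sketch: parse and translate at compile time,
   \ compile the result as a string literal, then compile TYPE.
   readEscaped count
   postpone sliteral
   postpone type ; immediate
```

Used inside a definition, e.g. `: hello .\" hi\tthere\m" ;`, this
displays the translated string at run time. It depends only on the
parsing factor, which is the case being made for standardising such a
word.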

Dick van Oudheusden

Dec 3, 2007, 7:00:18 AM
On 2 dec, 11:44, an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
> However, the word with the ( "string<">" -- c-addr u ) stack effect is
> called READESCAPED in the reference implementation.
>

I chose this stack behaviour because it would be consistent with the
standard word

6.2.2008 PARSE ( char "ccc<char>" -- c-addr u )

The suggested word parse\" is then a join of the words readEscaped and
parse\" from the reference implementation.

> A lower-level word like the PARSE\" ( c-addr1 u1 c-addr2 -- c-addr2 u2 )
> from the reference implementation might be useful in a few additional
> cases and thus might be a better choice. One thing that has to be
> considered is the length of the buffer at c-addr2. With the current
> escape sequences it is good enough if the buffer has length u1, and
> that will also be the case for any likely candidates for escape
> sequences, so the stack effect above may be ok.
>
> An advantage of standardizing a word that passes the destination
> buffer rather than a word that writes to a system-supplied buffer is
> that we don't need to specify the lifetime of the result. The
> disadvantage is that we need to consider the buffer length, what
> happens on overflow (not an issue here), and the programmer can make a
> mistake and supply buffer smaller than the specified length, leading
> to a buffer overflow.

Agreed.

Dick
