RfD - Escaped Strings (long)

Stephen Pelc

Aug 21, 2006, 10:22:23 AM
RfD - S\" and quoted strings with escapes
21 August 2006, Stephen Pelc

20060821 First draft

Rationale
=========

Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
in the Forth 200x internationalisation and XCHAR proposals.

Current practice
----------------
At least SwiftForth, gForth and VFX Forth support S\" with very
similar operations. S\" behaves like S", but uses the '\' character
as an escape character for the entry of characters that cannot be
used with S".

This technique is widespread in languages other than Forth.

It has benefit in areas such as
1) construction of multiline strings for display by operating
system services,
2) construction of HTTP headers,
3) generation of GSM modem control strings.

The majority of current Forth systems contain code, either in the
kernel or in application code, that assumes char=byte=au. To avoid
breaking existing code, we have to live with this practice.

Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the
impact of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel
with 1..4 byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.

The basis of my current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here) is a fixed-width unit handled by EMIT and
friends. It corresponds to the current ANS definition of a
character. An extended character (called an xchar here) consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.
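
To make the pchar/xchar split concrete, here is a minimal sketch that counts the xchars in a UTF-8 string held as caddr/len in pchars. It assumes 8-bit pchars. The names XCHARS and EURO are invented for illustration and come from neither this proposal nor the XCHARs proposal.

```forth
\ Sketch only: assumes 8-bit pchars and UTF-8 encoded data.
: xchars  ( c-addr u -- n )   \ u is a pchar count, n an xchar count
   0 swap 0 ?do               \ -- c-addr n
      over c@ 192 and 128 <>  \ not a 10xxxxxx continuation octet?
      if 1+ then              \ then it starts a new xchar
      swap char+ swap         \ step to the next pchar
   loop nip ;

create euro  226 c, 130 c, 172 c,   \ U+20AC as three UTF-8 pchars
euro 3 xchars .                     \ prints: 1
```

The string length stays 3 (pchars); only the interpretation of those pchars yields one display unit.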

The consequences of this are:

1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display
unit, but may have several additional comms channels. The
impact of a keyboard driver having to convert Chinese or Russian
characters into a (say) UTF-8 sequence is minimal compared to
handling the key stroke sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they
are described as strings.
5) No conflict arises with the XCHARs proposal.

Multiple encodings can be handled if they share a common primitive
character size - nearly all of these are described in terms of octets:
TCP/IP, UTF-8, UTF-16, UTF-32, ...

The XCHARs proposal can be used to handle extended characters on the
stack. XEMIT and friends allow us to handle some additional odd-ball
requirements such as 9-bit control characters, e.g. for the MDB
bus used by vending machines.

Solution
--------
To ease discussion we refer to characters handled by C@, C! and
friends as "primitive characters" or pchars. Characters that may
be wider than a pchar are called "extended characters" or xchars.
These are compatible with the XCHARs proposal. This proposal
does not require systems to handle xchars, but does not
disenfranchise those that do.

S\" is used like S" but treats the '\' character specially. One
or more characters after the '\' indicate what is substituted.
The following list is what is currently available in the Forth
systems surveyed.

\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10) - for HTML etc.
\n newline - CRLF for Windows/DOS, LF for Unices
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character
\\ backslash itself
\ before any other character represents that character

The following three of these cause parsing and readability
problems. As far as I know, requiring characters to come in
8 bit units will not upset any systems. Systems with characters
less than 7 bits are non-compliant, and I know of no 7 bit CPUs.
All current systems use character units of 8 bits or more.

\[0-7]+ Octal numerical character value, finishes at the
first non-octal character
\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character

Why do we need two representations, both of variable length?
This proposal selects the hexadecimal representation, requiring
two hex digits. A consequence of this is that xchars must be
represented as a sequence of pchars. Although initially seen as a
problem by some people, it avoids the endian problems involved
in storing an xchar.
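
As an illustration of an xchar written as a sequence of pchars, the sketch below spells the euro sign (U+20AC) as its three UTF-8 octets. It assumes a system that already provides S\" with the two-digit \x rule and 8-bit pchars. EURO$ and .OCTETS are hypothetical helper names.

```forth
\ Sketch only: assumes S\" as proposed here and 8-bit pchars.
: euro$  ( -- c-addr u )  s\" \xE2\x82\xAC" ;   \ U+20AC in UTF-8

: .octets  ( c-addr u -- )   \ show each pchar as two hex digits
   base @ >r hex
   0 ?do  dup c@ 0 <# # # #> type space  char+  loop  drop
   r> base ! ;

euro$ .octets   \ prints: E2 82 AC
```

The octets are stored in exactly the order written, so the source text alone fixes the byte order.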

\ before any other character represents that character

This is an unnecessary general case, and so is not mandated.


Proposal
========

6.2.xxxx S\"
s-slash-quote CORE EXT

Interpretation:
Interpretation semantics for this word are undefined.

Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by " (double-quote), using the translation
rules below. Append the run-time semantics given below to the
current definition.

Translation rules:
Characters are processed one at a time and appended to the
compiled string. If the character is a '\' character it is
processed by parsing and substituting one or more characters
as follows:
\a BEL (alert, ASCII 7)
\b BS (backspace, ASCII 8)
\e ESC (not in C99, ASCII 27)
\f FF (form feed, ASCII 12)
\l LF (ASCII 10)
\m CR/LF pair (ASCII 13, 10)
\n implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\q double-quote (ASCII 34)
\r CR (ASCII 13)
\t HT (tab, ASCII 9)
\v VT (ASCII 11)
\z NUL (ASCII 0)
\" "
\xAB A and B are Hexadecimal numerical characters. The resulting
character is the conversion of these two characters.
\\ backslash itself

Run-time: ( -- c-addr u )
Return c-addr and u describing a string consisting of the translation
of the characters ccc. A program shall not alter the returned string.

See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"
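
A short usage sketch of the translation rules above, assuming a system that implements S\" exactly as specified in this section. The word names are hypothetical and serve only as examples.

```forth
\ Sketch only: assumes S\" behaves as in the Proposal above.
: .header  ( -- )   \ a minimal HTTP header, each line ended by CR/LF
   s\" HTTP/1.0 200 OK\mContent-Type: text/html\m\m" type ;

: .quoted  ( -- )   \ prints: say "hi"
   s\" say \qhi\q" type ;

: cls$  ( -- c-addr u )   \ ESC [ 2 J, four characters long
   s\" \e[2J" ;
```

Without S\" each of these needs several EMIT calls or a hand-built buffer, which is the code reduction the rationale refers to.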

Labelling
=========
ENVIRONMENT? impact
name stack conditions

Ambiguous conditions occur:
If a hex value is more than two characters
If \x is not followed by two hexadecimal characters


Reference Implementation
========================
(as yet untested)
Taken from the VFX Forth source tree and modified to remove most
implementation dependencies. Assumes the use of the # and $ numeric
prefixes to indicate decimal and hexadecimal respectively.

decimal

: PLACE \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
2dup 2>r \ write count last
1 chars + swap move
2r> c! \ to avoid in-place problems
;

: $, \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
dup >r
here place
r> 1 chars + allot
align
;

: addchar \ char string --
\ *G Add the character to the end of the counted string.
tuck count + c!
1 swap c+!
;

: append \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
>r
tuck r@ count + swap cmove \ add source to end
r> c+! \ add length to count
;

: extract2H \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the
\ ** string, returning the remaining string and the
\ ** converted number.
base @ >r hex
0 0 2over drop 2 >number 2drop drop \ convert at most two digits
>r 2 chars /string r>
r> base !
;

create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
7 c, \ \a
8 c, \ \b
char c c, \ \c
char d c, \ \d
#27 c, \ \e
#12 c, \ \f
char g c, \ \g
char h c, \ \h
char i c, \ \i
char j c, \ \j
char k c, \ \k
#10 c, \ \l
char m c, \ \m
#10 c, \ \n (Unices only)
char o c, \ \o
char p c, \ \p
char " c, \ \q
#13 c, \ \r
char s c, \ \s
9 c, \ \t
char u c, \ \u
#11 c, \ \v
char w c, \ \w
char x c, \ \x
char y c, \ \y
0 c, \ \z

create CRLF$ \ -- addr ; CR/LF as counted string
2 c, #13 c, #10 c,

internal
: addEscape \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
over 0= \ zero length check
if drop exit endif
>r \ -- caddr len ; R: -- dest
over c@ [char] x = if \ hex number?
1 chars /string extract2H r> addchar exit
endif
over c@ [char] m = if \ CR/LF pair?
1 chars /string #13 r@ addchar #10 r> addchar exit
endif
over c@ [char] n = if \ newline? (CR/LF in this version)
1 chars /string crlf$ count r> append exit
endif
over c@ [char] a [char] z within? if
over c@ [char] a - EscapeTable + c@ r> addchar
else
over c@ r> addchar
endif
1 chars /string
;
external

: parse\" \ caddr len dest -- caddr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters much as C does. The
\ ** translated string is a counted string at *\i{dest}
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" "
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
dup >r 0 swap c! \ zero destination
begin \ -- caddr len ; R: -- dest
dup
while
over c@ [char] " <> \ check for terminator
while
over c@ [char] \ = if \ deal with escapes
1 /string r@ addEscape
else \ normal character
over c@ r@ addchar 1 /string
endif
repeat then
dup \ step over terminating "
if 1 /string endif
r> drop
;

: readEscaped \ "string" -- caddr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{PAD}.
source >in @ /string tuck \ -- len caddr len
pad parse\" nip
- >in +!
pad
;

: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
readEscaped count state @ if
compile (s") $,
then
; IMMEDIATE


Test Cases
==========
Forth source to test the reference implementation.


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Jean-François Michaud

Aug 21, 2006, 11:25:54 AM

Stephen Pelc wrote:
> RfD - S\" and quoted strings with escapes
> 21 August 2006, Stephen Pelc
>
> 20060821 First draft
>
> Rationale
> =========
>
> Problem
> -------
> The word S" 6.1.2165 is the primary word for generating strings.
> In more complex applications, it suffers from several deficiencies:
> 1) the S" string can only contain printable characters,
> 2) the S" string cannot contain the '"' character,
> 3) the S" string cannot be used with wide characters as dicussed
> in the Forth 200x internationalisation and XCHAR proposals.
>
> Current practice
> ----------------
> At least SwiftForth, gForth and VFX Forth support S\" with very
> similar operations. S\" behaves like S", but uses the '\' character
> as an escape character for the entry of characters that cannot be
> used with S".

> This technique is widespread in languages other than Forth.

[snip]

Understandably, those are problems, but is complicating S" by
adding support for escape characters a Forth-like solution to pursue
(I'm thinking no)?

The problem of the double quote character is directly linked to the
functioning of S" and is indeed a problem as far as S" is concerned.
This does lead to the thought that S" might not be an appropriate name
and that the stopper (double quote) might not be appropriate either.

As far as the other two problems are concerned, it seems to me that
they are not directly related to S" itself but to the limitation
that we can only type printable/representable
characters. Looking at the problem in perspective, it seems to me
that it is not localized to S", but applies to any interpreted string of
characters, which means that S" is not the culprit. This limitation
doesn't prevent S" from functioning properly, but it does prevent what
is contained within its quotation marks from being interpreted
properly.

In such a case, wouldn't it be more appropriate to simply add words
that specifically support UTF-8, UTF-16, etc, instead of adding
complexity to S" which doesn't require the increase in complexity?

Regards
Jean-Francois Michaud

Stephen Pelc

Aug 21, 2006, 11:52:45 AM
On 21 Aug 2006 08:25:54 -0700,
"Jean-François Michaud" <com...@comcast.net>
wrote:

>In such a case, wouldn't it be more appropriate to simply add words
>that specifically support UTF-8, UTF-16, etc, instead of adding
>complexity to S" which doesn't require the increase in complexity?

Internationalisation is *much* more complex than just adding
support for another character set. It is discussed in some
tedious detail in the wide character and internationalisation
proposals in the download section of our web site.

Stephen

Dennis Ruffer

Aug 21, 2006, 1:50:36 PM
On 2006-08-21 10:22:23 -0400, steph...@mpeforth.com (Stephen Pelc) said:

> RfD - S\" and quoted strings with escapes

I'm sorry that I don't have an Open Firmware spec at hand right now,
but I do prefer the way that they accomplished this. Effectively, they
add only one escape sequence to the reserved set, which accomplishes a
much more flexible list of what can be embedded into a string. Here is
a very simple example:

" "(0A0D)starting logger "

The "( escape starts a sequence of hex digit pairs which is terminated by a ).

I can't remember if the 1275 spec includes the "" as a shortcut for
including a quote, but Apple did implement it.

I also added one other escape sequence which allowed compile time variability.

" BuildResults/"{Build}/Obj"

The "{ escape executes the word that is terminated by the } and appends
the string generated by that word to the constructed string.

While I do commend you Stephen for broaching the issue, I think we can
do better than replicating C's escape sequences. While I don't expect
OF or my extension to win any popularity contests, I do think the
alternative should be mentioned in whatever concept does win, for
historical purposes, at least.

DaR

Stephen Pelc

Aug 21, 2006, 2:18:43 PM
On Mon, 21 Aug 2006 13:50:36 -0400, Dennis Ruffer
<dru...@speakeasy.net> wrote:

>While I do commend you Stephen for broaching the issue, I think we can
>do better than replicating C's escape sequences.

I know, I know. But standards and their acceptance are about common
practice and least code breakage. This one is based on SwiftForth,
gForth and VFX Forth, which covers a lot of users.

Stephen

jacko

Aug 21, 2006, 2:44:42 PM
hi

> On 2006-08-21 10:22:23 -0400, steph...@mpeforth.com (Stephen Pelc) said:
>
> > RfD - S\" and quoted strings with escapes

most forth would use utf-8 for compactness.

utf-16 would be needed for string buffer to allow easy counting and
modification.

\ seems to be common for escape character, and is defined to allow end
of line comments.

c escape sequences may take a longer parse implementation => bigger
compiler

if there is standard code to implement this then why extend the
standard??

Alex McDonald

Aug 21, 2006, 6:10:45 PM

Stephen Pelc wrote:
> RfD - S\" and quoted strings with escapes
> 21 August 2006, Stephen Pelc
>
> 20060821 First draft
>
> Rationale
> =========
>
> Problem
> -------
> The word S" 6.1.2165 is the primary word for generating strings.
> In more complex applications, it suffers from several deficiencies:
> 1) the S" string can only contain printable characters,
> 2) the S" string cannot contain the '"' character,

Way back in IBM360 BAL land, "abcd""def" was the answer; a double
doublequote collapsed to a single doublequote, and parsing of the string
continued. Could S" be extended to accept "" without breaking existing
code?

[snipped]

> \z NUL (ASCII 0)

\0 ?


> \[0-7]+ Octal numerical character value, finishes at the
> first non-octal character
> \x[0-9a-f]+ Hex numerical character value, finishes at the
> first non-hex character
> \\ backslash itself
> \ before any other character represents that character
>
> The following three of these cause parsing and readability
> problems. As far as I know, requiring characters to come in
> 8 bit units will not upset any systems. Systems with characters
> less than 7 bits are non-compliant, and I know of no 7 bit CPUs.
> Al current systems use character units of 8 bits or more.
>
> \[0-7]+ Octal numerical character value, finishes at the
> first non-octal character
> \x[0-9a-f]+ Hex numerical character value, finishes at the
> first non-hex character
>
> Why do we need two representations, both of variable length?
> This proposal selects the hexadecimal representation, requiring
> two hex digits. A consequence of this is that xchars must be
> represented as a sequence of pchars. Although initially seen as a
> problem by some people, it avoids the endian problems involved
> in storing an xchar.
>

Here I would propose

\unnnn
and
\Unnnnnnnn

for UTF16 and UTF32 support. Python iirc supports this construct. It
avoids any ambiguity over endianness problems.

[rest snipped]

--
Regards
Alex McDonald

Stephen Pelc

Aug 21, 2006, 6:54:35 PM
On 21 Aug 2006 15:10:45 -0700, "Alex McDonald"
<alex...@btopenworld.com> wrote:

>Way back in IBM360 BAL land, "abcd""def" was the answer; a double
>doublequote colapsed to a single doublequote, and parsing of the string
>continued. Could S" be extended to accept "" without breaking existing
>code?

Has it any common practice in Forth?

>> Why do we need two representations, both of variable length?
>> This proposal selects the hexadecimal representation, requiring
>> two hex digits. A consequence of this is that xchars must be
>> represented as a sequence of pchars. Although initially seen as a
>> problem by some people, it avoids the endian problems involved
>> in storing an xchar.
>>
>
>Here I would propose
>
>\unnnn
>and
>\Unnnnnnnn
>
>for UTF16 and UTF32 support. Python iirc supports this construct. It
>avoids any ambiguity over endianess problems.

What terminates it? If you want say '00' immediately after
\Uxxxxxx do you write \Uxxxxxx00 which I believe to be
ambiguous. Variable length extensions without a terminator
are dangerous!

The use of hex characters is not just to provide wide
character support, but also allow insertion of control
codes into comms channels, e.g. Telnet IAC handling.

Anton is pushing hard for UTF-8 support. I argue that separated
octets supports UTF-8/16/32 without any required changes.
Another advantage of the octet approach is that it enables
16 bit embedded systems to support characters of any size
wider than a cell. With UTF-8 this is required even on a
32 bit Forth.

Stephen

Alex McDonald

Aug 21, 2006, 7:17:41 PM
Stephen Pelc wrote:
> On 21 Aug 2006 15:10:45 -0700, "Alex McDonald"
> <alex...@btopenworld.com> wrote:
>
> >Way back in IBM360 BAL land, "abcd""def" was the answer; a double
> >doublequote colapsed to a single doublequote, and parsing of the string
> >continued. Could S" be extended to accept "" without breaking existing
> >code?
>
> Has it any common practice in Forth?

Err... no. But it would be useful as it's a common case, and could be
trivially implemented.

>
> >> Why do we need two representations, both of variable length?
> >> This proposal selects the hexadecimal representation, requiring
> >> two hex digits. A consequence of this is that xchars must be
> >> represented as a sequence of pchars. Although initially seen as a
> >> problem by some people, it avoids the endian problems involved
> >> in storing an xchar.
> >>
> >
> >Here I would propose
> >
> >\unnnn
> >and
> >\Unnnnnnnn
> >
> >for UTF16 and UTF32 support. Python iirc supports this construct. It
> >avoids any ambiguity over endianess problems.
>
> What terminates it? If you want say '00' immediately after
> \Uxxxxxx do you write \Uxxxxxx00 which I believe to be
> ambiguous. Variable length extensions without a terminator
> are dangerous!

They're fixed length; \u has 4 digits, \U has 8.

>
> The use of hex characters is not just to provide wide
> character support, but also allow insertion of control
> codes into comms channels, e.g. Telnet IAC handling.
>
> Anton is pushing hard for UTF-8 support. I argue that separated
> octets supports UTF-8/16/32 without any required changes.

\x12\x34 has a specific storage order, as does \x1234, if I get the
details of the proposal correctly. They're endian sensitive. \u and \U
don't have that problem; they're stored as required by the endianness
of the target, not in the order specified.

> Another advantage of the octet approach is that it enables
> 16 bit embedded systems to support characters of any size
> wider than a cell. With UTF-8 this is required even on a
> 32 bit Forth.

I wasn't considering UTF8, just UTF16 and 32 support.

>
> Stephen
>
> --
> Stephen Pelc, steph...@mpeforth.com
> MicroProcessor Engineering Ltd - More Real, Less Time
> 133 Hill Lane, Southampton SO15 5AF, England
> tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
> web: http://www.mpeforth.com - free VFX Forth downloads

--
Regards
Alex McDonald

Marcel Hendrix

Aug 21, 2006, 9:24:58 PM
"Alex McDonald" <alex...@btopenworld.com> writes Re: RfD - Escaped Strings (long)
[..]

> Way back in IBM360 BAL land, "abcd""def" was the answer; a double
> doublequote colapsed to a single doublequote, and parsing of the string
> continued. Could S" be extended to accept "" without breaking existing
> code?

[..]

Wil Baden has suggested it long ago, and I have it in iForth since a
few years. It does not impact existing code.

However, I never use this extension, as a mix of S" ..." and S~ ...~
proves to fix all practical problems and is (IMHO) easier to read.
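
The thread does not show iForth's S~, so the following is only a guess at a minimal equivalent: the same idea as S" with a tilde as the delimiter. It assumes PARSE (Core Ext) and SLITERAL (String word set) are available. GREET is a hypothetical example word.

```forth
\ Sketch only: not iForth's actual definition.
: s~  ( "ccc<tilde>" -- c-addr u )
   [char] ~ parse                        \ string is still in the input buffer
   state @ if  postpone sliteral  then   \ compiling: copy it into the definition
; immediate

: greet  ( -- )  s~ say "hi"~ type ;
greet   \ prints: say "hi"
```

When interpreting, the returned string lives in the input buffer and is only valid until the buffer is refilled.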

I do not like S\" because my fear has always been that it obstructs the
traditional ways of redirecting strings to other devices: now if
redirection is wanted, the complete string must be reparsed by EMIT
and/or TYPE. Traditionally one redefines CR , BS etc., which is much
easier.

-marcel


Stephen Pelc

Aug 22, 2006, 5:55:38 AM
On 21 Aug 2006 16:17:41 -0700, "Alex McDonald"
<alex...@btopenworld.com> wrote:

>\x12\x34 has a specific storage order, as does \x1234, if I get the
>details of the proposal correctly. They're endian sensitive. \u and \U
>don't have that problem; they're stored as required by the endianness
>of the target, not in the order specified.

How would you handle a little-endian comms channel on a big-endian
machine?

Stephen Pelc

Aug 22, 2006, 6:08:14 AM
On Tue, 22 Aug 2006 01:24:58 GMT, m...@iae.nl (Marcel Hendrix) wrote:

>I do not like S\" because my fear has always been that it obstructs the
>traditional ways of redirecting strings to other devices: now if
>redirection is wanted, the complete string must be reparsed by EMIT
>and/or TYPE. Traditionally one redefines CR , BS etc., which is much
>easier.

Having tried to avoid it for some time, MPE implemented S\"
to handle HTML and XML over sockets. It reduced our code enough
to justify its existence, especially when the \m option for a
CR/LF pair was used.

Stephen

Alex McDonald

Aug 22, 2006, 9:06:04 AM
Stephen Pelc wrote:
> On 21 Aug 2006 16:17:41 -0700, "Alex McDonald"
> <alex...@btopenworld.com> wrote:
>
> >\x12\x34 has a specific storage order, as does \x1234, if I get the
> >details of the proposal correctly. They're endian sensitive. \u and \U
> >don't have that problem; they're stored as required by the endianness
> >of the target, not in the order specified.
>
> How would handle a little-endian comms channel on a big-endian
> machine?

This is not a "source compile big-endian"/"target execute little-endian"
problem, it's a mixed-endianness problem. So, little-endian comms on a
big-endian box has to be done explicitly, by using \x.

I propose \u and \U in addition to \x, not in replacement of it. Where
endianness is consistent for the target, \u and \U will always work,
whereas \x requires you to recognise the endianness of the target in the
string itself (unless it's stored as is and reparsed on TYPE and EMIT,
as Marcel has noted).

\u8301 is stored as 83 01 or 01 83 depending on the endianness of the
target; it's an xchar, not two pchars
\x8301 Is this an xchar or two pchars?
\x83\x01 is always 83 01; that is, it's stored as two pchars that just
happen to be an xchar for a big-endian machine.

On the issue of variable length, as \<not-an-escaped-char> is
<not-an-escaped-char> then \x3132\abcd is unambiguously the string
12abcd and \x313233\\ is 123\ (assuming that \x generates pchars).
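
The storage-order difference under discussion can be sketched as follows. This is not part of either proposal. W!-BE and W!-LE are hypothetical helpers that lay down one 16-bit value as two 8-bit pchars, in the order a \u escape might choose for a big-endian or little-endian target respectively.

```forth
\ Sketch only: assumes 8-bit pchars.
: w!-be  ( u c-addr -- )   \ high octet first
   over 8 rshift 255 and over c!
   char+  swap 255 and swap c! ;

: w!-le  ( u c-addr -- )   \ low octet first
   over 255 and over c!
   char+  swap 8 rshift 255 and swap c! ;

create buf  2 chars allot
hex 8301 decimal buf w!-be   buf c@ .  buf char+ c@ .   \ prints: 131 1
hex 8301 decimal buf w!-le   buf c@ .  buf char+ c@ .   \ prints: 1 131
```

Writing \x83\x01 always corresponds to the first case, whatever the target.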

Albert van der Horst

Aug 22, 2006, 3:04:08 PM
In article <44e9c14f....@news.demon.co.uk>,
Stephen Pelc <steph...@INVALID.mpeforth.com> wrote:

<SNIP>

>
>: PLACE \ c-addr1 u c-addr2 --
>\ *G Copy the string described by c-addr1 u to a counted string at
>\ ** the memory address described by c-addr2.
> 2dup 2>r \ write count last
> 1 chars + swap move
> 2r> c! \ to avoid in-place problems
>;

The use of a byte count is unfortunate.
With a cell count it is called $! in my book.

(PLACE is called $!-BD in my book, BD is an abbreviation for a
pejorative.)
Like in
"monkey.jpg" GET-FILE BUFFER $!
BUFFER $@ DISPLAY

>
>: addchar \ char string --
>\ *G Add the character to the end of the counted string.
> tuck count + c!
> 1 swap c+!
>;

(with cell count)
$C+ in my book

>
>: append \ c-addr u $dest --
>\ *G Add the string described by C-ADDR U to the counted string at
>\ ** $DEST. The strings must not overlap.
> >r
> tuck r@ count + swap cmove \ add source to end
> r> c+! \ add length to count
>;

(with cell count)
$+! in my book,

I find the use of non-textual's and the association with
@ ! +! very helpful in reading, once you are used to it.
The use of $ binds the wordset together.

>--
>Stephen Pelc, steph...@mpeforth.com

Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
alb...@spenarnc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

Albert van der Horst

Aug 22, 2006, 3:15:04 PM
In article <1156198245.9...@h48g2000cwc.googlegroups.com>,

Alex McDonald <alex...@btopenworld.com> wrote:
>
>Stephen Pelc wrote:
>> RfD - S\" and quoted strings with escapes
>> 21 August 2006, Stephen Pelc
>>
>> 20060821 First draft
>>
>> Rationale
>> =========
>>
>> Problem
>> -------
>> The word S" 6.1.2165 is the primary word for generating strings.
>> In more complex applications, it suffers from several deficiencies:
>> 1) the S" string can only contain printable characters,
>> 2) the S" string cannot contain the '"' character,
>
>Way back in IBM360 BAL land, "abcd""def" was the answer; a double
>doublequote colapsed to a single doublequote, and parsing of the string
>continued. Could S" be extended to accept "" without breaking existing
>code?

Because
S" AAP""AAP"
is illegal in an ISO FORTH program, implementors are free to add a
meaning to it. In doing so they don't break any legal ISO program. So
I added this to ciforth (lina, wina), later also adopted in chforth.
"abcd""def" is understood in ciforth, and this also doesn't break
any legal ISO programs.

>--
>Regards
>Alex McDonald
>

--

Groetjes Albert

Anton Ertl

Aug 24, 2006, 2:05:22 PM
"Jean-François Michaud" <com...@comcast.net> writes:

>
>Stephen Pelc wrote:
>> 3) the S" string cannot be used with wide characters as dicussed
>> in the Forth 200x internationalisation and XCHAR proposals.

Actually S" can be used for xchars, at least with the UTF-8 and 8bit
encodings (and probably a number of others).

>In such a case, wouldn't it be more appropriate to simply add words
>that specifically support UTF-8, UTF-16, etc, instead of adding
>complexity to S" which doesn't require the increase in complexity?

UTF-8 is no problem with S". And the proposal does not add complexity
to S", but introduces another word S\". S\" was introduced in Gforth
before xchars, so it is obvious that the reason for S\" is not to
support xchars.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2006: http://www.complang.tuwien.ac.at/anton/euroforth2006/

Anton Ertl

Aug 24, 2006, 2:15:32 PM
m...@iae.nl (Marcel Hendrix) writes:
>I do not like S\" because my fear has always been that it obstructs the
>traditional ways of redirecting strings to other devices: now if
>redirection is wanted, the complete string must be reparsed by EMIT
>and/or TYPE. Traditionally one redefines CR , BS etc., which is much
>easier.

Sorry, I don't understand that scenario. Do you mean redirection
where CR, BS etc. send other characters than usual? Can you give a
concrete example?

Anton Ertl

Aug 24, 2006, 2:18:58 PM
Albert van der Horst <alb...@spenarnc.xs4all.nl> writes:
>Because
>S" AAP""AAP"
>is illegal in an ISO FORTH program, implementors are free to add a
>meaning to it.

But

: " ." hi!" ;
s" foo"" type

is a standard program and implementors are not free to change its meaning.

>So
>I added this to ciforth (lina, wina), later also adopted in chforth.
>"abcd""def" is understood in ciforth, and this also doesn't break
>any legal ISO programs.

So what does ciforth do for the code above? A standard system does:

: " ." hi!" ; ok
s" foo"" type hi!foo ok

Marcel Hendrix

Aug 24, 2006, 6:02:38 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: RfD - Escaped Strings (long)

> A standard system does:

> : " ." hi!" ; ok
> s" foo"" type hi!foo ok

Interesting!

-marcel

-- -----
Gforth 0.6.2, Copyright (C) 1995-2003 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
.( hello) hello ok
.( hello)) hello ok
.( hello))) hello
*the terminal*:3: Undefined word
.( hello)))
^^^^^^^^
Backtrace:
$18F850B4 throw
$18F90B10 no.extensions

-- ---------
MPE VFX Forth for Windows IA32
© MicroProcessor Engineering Ltd, 1998-2005

Version: 3.80 [build 1937]
Build date: 29 September 2005

Free dictionary = 7726106 bytes [7545kb]


.( hello)) hello
Err# -13 ERR: Undefined word.
-> .( hello))
^

Anton Ertl

Aug 24, 2006, 2:21:49 PM

Not in Gforth.

>\n newline - CRLF for Windows/DOS, LF for Unices
>\q double-quote (ASCII 34)

Just write:
\q "

>\r CR (ASCII 13)
>\t HT (tab, ASCII 9)
>\v VT (ASCII 11)
>\z NUL (ASCII 0)
>\" "

IIRC not supported by PFE, and costs a line or so more than the rest
to implement, but the mnemonic value makes this worth the price IMO.

>\[0-7]+ Octal numerical character value, finishes at the
> first non-octal character
>\x[0-9a-f]+ Hex numerical character value, finishes at the
> first non-hex character

IIRC C has a limit on the length for at least one of these, ensuring
that one can do a character in this way followed by 0. The \u and \U
proposals would do this for us.

>\\ backslash itself
>\ before any other character represents that character

Actually we should restrict that to a few characters, in particular \
(C also has ' and ?, but it's unclear to me why), and leave the others
undefined in standard programs, such that we can extend this with new
escape sequences should the need ever arise. Since you already
mention \\ explicitly, just write:

\ before any other character constitutes an ambiguous condition.
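
As a rough sketch of what that wording permits: a system can translate the
known single-character escapes and report anything else as unknown, leaving
it free to treat the rest as an ambiguous condition. The word name
escape>char is made up for this illustration, and the table holds only
escapes whose values are quoted in this thread.

```forth
\ Hypothetical sketch, not the reference implementation.
\ Map the character after '\' to its value; flag is false if unknown.
: escape>char ( c -- c' true | c false )
    case
        [char] e of 27 true endof    \ ESC
        [char] q of 34 true endof    \ double quote
        [char] r of 13 true endof    \ CR
        [char] t of  9 true endof    \ HT
        [char] v of 11 true endof    \ VT
        [char] z of  0 true endof    \ NUL
        [char] " of 34 true endof    \ double quote
        [char] \ of 92 true endof    \ backslash itself
        dup false swap               \ unknown: keep c, ENDCASE drops the copy
    endcase ;
```

A caller that gets a false flag can throw, warn, or define a new escape
later without breaking standard programs.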

>\[0-7]+ Octal numerical character value, finishes at the
> first non-octal character
>\x[0-9a-f]+ Hex numerical character value, finishes at the
> first non-hex character
>
>Why do we need two representations, both of variable length?
>This proposal selects the hexadecimal representation, requiring
>two hex digits.

I think that's exactly the other way round compared to the C way, but
since \x... is quite obscure in C anyway, I don't think this will
confuse many people.
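
To make the fixed two-digit rule concrete, here is a minimal sketch using
>NUMBER. The word name hex-pair is an assumption for this illustration and
is not part of the proposal. Because exactly two characters are consumed, a
hex digit that follows the escape is ordinary text.

```forth
\ Hypothetical helper: convert exactly two hex digits at c-addr.
\ flag is true only if both characters were valid hex digits.
: hex-pair ( c-addr -- u flag )
    base @ >r  hex          \ save BASE, switch to hex
    0 0 rot 2 >number       \ -- ud c-addr' u'
    nip 0=                  \ -- ud flag    u'=0 means both digits converted
    nip                     \ -- u flag     drop the high cell of ud
    r> base ! ;             \ restore BASE
```

For example, S" 41DEAD" DROP hex-pair leaves 65 and a true flag, and the
DEAD that follows is untouched. Upper-case digits are used because
acceptance of lower-case digits by >NUMBER varies between systems.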

> A consequence of this is that xchars must be
>represented as a sequence of pchars.

Not necessarily. One can write them down directly (they are usually
printable), or use the octal notation with arbitrary length (ouch),
or, with a \u or \U extension, use that.

> Although initially seen as a
>problem by some people, it avoids the endian problems involved
>in storing an xchar.

There is no char order problem with xchars. A given xchar is stored
as a sequence of chars in a defined way, with any encoding I know of.
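
To illustrate that point, here is a minimal sketch of a UTF-8 encoder: the
char sequence depends only on the code point, not on the byte order of the
host. The word names utf8-tail and utf8! are made up for this illustration,
and the code assumes 8-bit chars and cells of at least 32 bits.

```forth
hex   \ numbers in this block are hexadecimal
\ Continuation byte 10xxxxxx taken from xc shifted right by `shift` bits.
: utf8-tail ( xc shift -- byte )  rshift 3F and 80 or ;

\ Store xc as UTF-8 starting at c-addr; len is the number of chars used.
: utf8! ( xc c-addr -- len )
    >r
    dup 80 u< if  r> c!  1 exit  then            \ 0xxxxxxx
    dup 800 u< if
        dup 6 rshift C0 or  r@ c!                \ 110xxxxx
        0 utf8-tail  r> char+ c!  2 exit
    then
    dup 10000 u< if
        dup 0C rshift E0 or  r@ c!               \ 1110xxxx
        dup 6 utf8-tail  r@ char+ c!
        0 utf8-tail  r> 2 chars + c!  3 exit
    then
    dup 12 rshift F0 or  r@ c!                   \ 11110xxx
    dup 0C utf8-tail  r@ char+ c!
    dup 6 utf8-tail  r@ 2 chars + c!
    0 utf8-tail  r> 3 chars + c!  4 ;
decimal
```

With this sketch, U+20AC (the euro sign) is stored as the three chars
E2 82 AC on any host.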

>\ before any other character represents that character
>
>This is an unnecessary general case, and so is not mandated.

As mentioned above, this should be made explicitly ambiguous.

>Ambiguous conditions occur:
> If a hex value is more than two characters

When would that happen?

>Reference Implementation
>========================

Another implementation (with some deviations) can be found at
http://b2.complang.tuwien.ac.at/cgi-bin/viewcvs.cgi/*checkout*/gforth/quotes.fs?root=gforth

>Test Cases
>==========
>Forth source to test the reference implementation.

There seems to be something missing here.

Anton Ertl

Aug 24, 2006, 4:55:02 PM
m...@iae.nl (Marcel Hendrix) writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: RfD - Escaped Strings (long)
>
>> A standard system does:
>
>> : " ." hi!" ; ok
>> s" foo"" type hi!foo ok
>
>Interesting!
>
>-marcel
>
>-- -----
>Gforth 0.6.2, Copyright (C) 1995-2003 Free Software Foundation, Inc.
>Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
>Type `bye' to exit
>.( hello) hello ok
>.( hello)) hello ok

There is a word ) in Gforth; it's the counterpart to assert(. We
should probably define it as compile-only, then you would get an error
message in this case.

>.( hello))) hello
>*the terminal*:3: Undefined word

But no word )).

>.( hello)))
> ^^^^^^^^

In the meantime we have improved error reporting to point out the
wrong part correctly even in such unusual situations:

.( hello))) hello
:1: Undefined word
.( hello)>>>))<<<

Marcel Hendrix

Aug 24, 2006, 7:53:57 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: RfD - Escaped Strings (long)

> m...@iae.nl (Marcel Hendrix) writes:


>> I do not like S\" because my fear has always been that it obstructs the
>> traditional ways of redirecting strings to other devices: now if
>> redirection is wanted, the complete string must be reparsed by EMIT
>> and/or TYPE. Traditionally one redefines CR , BS etc., which is much
>> easier.

> Sorry, I don't understand that scenario. Do you mean redirection
> where CR, BS etc. send other characters than usual? Can you give a
> concrete example?

Sending text to an LCD or a vocoder? Crosscompiling on Windows for a
Linux machine or vv. (what does \n mean?)

Or say CR must do more than usual, e.g. incrementing a line counter.
With S\" it might be necessary to inspect strings for occurrences of
\l \m \n .
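
A minimal sketch of the traditional approach, assuming the application
routes every line break through CR. The variable name #lines is made up for
this illustration.

```forth
variable #lines   0 #lines !

\ New CR: count the line, then call the previous CR.
\ Inside the definition, CR still refers to the old word.
: cr ( -- )  1 #lines +!  cr ;
```

Some systems print a redefinition warning here. A newline embedded in a
string and sent with TYPE bypasses this CR, which is the concern.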

The proposal also defines \e, so enterprising souls will start emitting
embedded escape sequences to set terminal characteristics. These
sequences will mess up logfiles. Without S\" one has to define a
new word to emit the esc sequence, and interleave it with normal text.
(much easier to manipulate when logging or such becomes necessary).
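
A minimal sketch of that style, assuming an ANSI-compatible terminal (ESC [ 1 m
turns bold on, ESC [ 0 m resets attributes). The names plain?, esc-seq,
bold-on, bold-off and demo are made up for this illustration.

```forth
variable plain?   false plain? !    \ true: suppress terminal control codes

\ Send ESC followed by the given string, unless plain output is wanted.
: esc-seq ( c-addr u -- )
    plain? @ if  2drop exit  then
    27 emit type ;

: bold-on  ( -- )  s" [1m" esc-seq ;
: bold-off ( -- )  s" [0m" esc-seq ;

: demo ( -- )  ." normal " bold-on ." bold" bold-off ."  normal again" cr ;
```

Setting plain? to true before writing a logfile keeps the escape sequences
out of it, without touching any of the text strings.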

-marcel


Anton Ertl

Aug 24, 2006, 6:09:16 PM
m...@iae.nl (Marcel Hendrix) writes:
>Or say CR must do more than usual, e.g. incrementing a line counter.
>With S\" it might be necessary to inspect strings for occurrences of
>\l \m \n .

Big issue:

\ count output lines to generate sync lines for output

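: count-nls ( addr u -- )
    \ add 1 to out-nls for each newline char in the string
    bounds u+do
        i c@ nl-char = negate out-nls +!
    loop ;

\ wrap the deferred TYPE so that all output passes through count-nls
:noname ( addr u -- )
    2dup count-nls
    defers type ;
is type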

This comes straight out of prims2x.fs. As you can see, counting \l is
enough.

>The proposal also defines \e, so enterprising souls will start emitting
>embedded escape sequences to set terminal characteristics. These
>sequences will mess up logfiles. Without S\" one has to define a
>new word to emit the esc sequence, and interleave it with normal text.
>(much easier to manipulate when logging or such becomes necessary).

That's just one option. Or one defines S\" in ANS Forth. Or a
thousand other ways. Tough luck for the log-file.

Albert van der Horst

Aug 24, 2006, 4:20:16 PM
In article <2006Aug2...@mips.complang.tuwien.ac.at>,

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Albert van der Horst <alb...@spenarnc.xs4all.nl> writes:
>>Because
>>S" AAP""AAP"
>>is illegal in an ISO FORTH program, implementors are free to add a
>>meaning to it.
>
>But
>
>: " ." hi!" ;
>s" foo"" type
>
>is a standard program and implementors are not free to change its meaning.
>
>>So
>>I added this to ciforth (lina, wina), later also adopted in chforth.
>>"abcd""def" is understood in ciforth, and this also doesn't break
>>any legal ISO programs.
>
>So what does ciforth do for the code above? A standard system does:
>
>: " ." hi!" ; ok
>s" foo"" type hi!foo ok

Is it? In ciforth from the last " the parsing goes on to the
next " or to the end of the parsing area, in this case the end
of the line, or maybe the end of the file.

Thanks for pointing this out.
I guess I'll make S" a REQUIRED word. A ciforther wouldn't
use S" anyway. It has more issues, as I have recently discovered.
(It permanently assigns storage to the string, in interpreter
mode this is a no-no, however convenient.)

HERE S" AAP" 2DROP HERE =
0 OK

Anyway, the name ciforth (close to iso/computer intelligence)
implies that it is an experimental Forth that doesn't try or claim
to implement ISO to a full legal depth. Such is stated
in the introduction of the documentation.
It is intended to run non-pathological ;-) iso-programs with the
least surprises, sacrificing speed, but not simplicity.

In practice there are not too many Forth programs around that
are intended to be portable, and they are all heavily
non-pathological.


--

Stephen Pelc

Aug 25, 2006, 5:45:21 AM
On Thu, 24 Aug 2006 18:21:49 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>>\ before any other character represents that character
>
>Actually we should restrict that to a few characters, in particular \
>(C also has ' and ?, but it's unclear to me why), and leave the others
>undefined in standard programs, such that we can extend this with new
>escape sequences should the need ever arise. Since you already
>mention \\ explicitly, just write:
>
>\ before any other character constitutes an ambiguous condition.

Accepted and added to working draft.

>There is no char order problem with xchars. A given xchar is stored
>as a sequence of chars in a defined way, with any encoding I know of.

UTF-32 big endian to little-endian comms channel?

>>Test Cases
>>==========
>>Forth source to test the reference implementation.
>
>There seems to be something missing here.

Yes.

Stephen

dbu

Aug 26, 2006, 3:53:00 AM
Stephen Pelc schrieb:

> RfD - S\" and quoted strings with escapes
> 21 August 2006, Stephen Pelc
>

> SNIP <


>
> Reference Implementation
> ========================
> (as yet untested)
> Taken from the VFX Forth source tree and modified to remove most
> implementation dependencies. Assumes the use of the # and $ numeric
> prefixes to indicate decimal and hexadecimal respectively.

Hello Stephen,

I had to replace L: with CREATE and WITHIN? with BETWEEN to make
it work under Win32Forth. Since WITHIN? isn't standard you should add
its definition to the reference implementation.

regards
Dirk Busch

Stephen Pelc

Aug 26, 2006, 7:44:09 AM
On 26 Aug 2006 00:53:00 -0700, "dbu" <di...@win32forth.org> wrote:

>I had to replace L: with CREATE and WITHIN? with BETWEEN to make
>it work under Win32Forth. Since WITHIN? isn't standard you should add
>its definition to the reference implementation.

Thank you. Fixed in the current draft.
L: -> CREATE
WITHIN? -> 1+ WITHIN
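
For anyone whose system lacks it, the substitution above reads as a range
test that includes both ends. A one-line sketch under that reading (an
assumption drawn from the substitution, not the VFX Forth source):

```forth
\ Inclusive range test built on the CORE EXT word WITHIN.
: within? ( n lo hi -- flag )  1+ within ;
```

So 5 1 5 within? is true, while 5 1 5 WITHIN is false.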

Anton Ertl

Aug 26, 2006, 7:51:47 AM
steph...@mpeforth.com (Stephen Pelc) writes:
>On Thu, 24 Aug 2006 18:21:49 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>>There is no char order problem with xchars. A given xchar is stored
>>as a sequence of chars in a defined way, with any encoding I know of.
>
>UTF-32 big endian to little-endian comms channel?

IIRC using UTF-32 in an xchars setting with 8-bit or 16-bit chars is
not backwards compatible, and probably nobody in his right mind would
do that (use UTF-8 instead).

For UTF-32 with 32-bit chars there is no char ordering problem; there
is a byte ordering problem when you send it over an 8-bit channel, but
that problem already exists without S\".

Dave Thompson

Sep 4, 2006, 12:31:36 AM
On Thu, 24 Aug 2006 18:21:49 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

> steph...@mpeforth.com (Stephen Pelc) writes:
> >RfD - S\" and quoted strings with escapes
> >21 August 2006, Stephen Pelc
> >
> >20060821 First draft

> >\e ESC (not in C99, ASCII 27)

(or any other C) (nor are several others I haven't marked)

> >\[0-7]+ Octal numerical character value, finishes at the
> > first non-octal character
> >\x[0-9a-f]+ Hex numerical character value, finishes at the
> > first non-hex character
>
> IIRC C has a limit on the length for at least one of these, ensuring
> that one can do a character in this way followed by 0. The \u and \U
> proposals would do this for us.
>

C octal is limited to three, hex is unlimited and has more successors
it can 'swallow' but I believe was added only at the first standard
(C89) which also added a solution: In C two (or more) string literals
(by definition quoted) that are adjacent (after preprocessing,
removing comments, and handling escapes within each) are concatenated.
Thus "\x11DEAD" is only one character plus a terminator but
"\x11" "DEAD" is a single string (usable anywhere a single string is)
of 5 characters plus a terminator.

> >\ before any other character represents that character
>
> Actually we should restrict that to a few characters, in particular \
> (C also has ' and ?, but it's unclear to me why), and leave the others

' because the same escapes are used in both double-quoted string
literals and single-quoted character literals.

? because of trigraphs, a feature so stupid I don't want to explain
it. Just trust me on this one.

> >\[0-7]+ Octal numerical character value, finishes at the
> > first non-octal character
> >\x[0-9a-f]+ Hex numerical character value, finishes at the
> > first non-hex character
> >
> >Why do we need two representations, both of variable length?
> >This proposal selects the hexadecimal representation, requiring
> >two hex digits.
>
> I think that's exactly the other way round compared to the C way, but
> since \x... is quite obscure in C anyway, I don't think this will
> confuse many people.
>

It is indeed the other way round w.r.t. variable length, as above.
I don't think \x is that obscure in C, but much of the code I work on
involves communication protocols and/or cryptography, both of which
tend to use arbitrary (looking) 'binary' data. But I don't think the
simple and I hope common cases will confuse people either way.

> > A consequence of this is that xchars must be
> >represented as a sequence of pchars.
>
> Not necessarily. One can write them down directly (they are usually
> printable), or use the octal notation with arbitrary length (ouch),
> or, with a \u or \U extension, use that.
>

Given xchars are in fact (intended to be) Unicode/10646, I would like
\u and \U; they are unambiguous AND useful. They may be a little
bulkier but I don't think that's often a big problem, and if it is for
you just write your own version -- that's the FORTH way!


- David.Thompson1 at worldnet.att.net
