RfD: Escaped Strings version 4


Peter Knaggs

Aug 9, 2007, 2:22:30 PM
RfD: Escaped Strings S\"
19 July 2007, Stephen Pelc

20070719 Modified ambiguous condition
         Added ambiguous conditions to definition of S\"
         Added test cases
         Corrected Reference Implementation
20070712 Redrafted non-normative portions.
20060822 Updated solution section.
20060821 First draft.

Rationale
=========

Problem
-------
The word S" 6.1.2165 is the primary word for generating strings.
In more complex applications, it suffers from several deficiencies:
1) the S" string can only contain printable characters,
2) the S" string cannot contain the '"' character,
3) the S" string cannot be used with wide characters as discussed
in the Forth 200x internationalisation and XCHAR proposals.

Current practice
----------------
At least SwiftForth, Gforth and VFX Forth support S\" with very
similar operations. S\" behaves like S", but uses the '\' character
as an escape character for the entry of characters that cannot be
used with S".

This technique is widespread in languages other than Forth.

It has benefit in areas such as

1) construction of multiline strings for display by operating
system services,
2) construction of HTTP headers,
3) generation of GSM modem and Telnet control strings.

The majority of current Forth systems contain code, either in the
kernel or in application code, that assumes char=byte=au. To avoid
breaking existing code, we have to live with this practice.

The following list describes what is currently available in the
surveyed Forth systems that support escaped strings.

\a           BEL (alert, ASCII 7)
\b           BS (backspace, ASCII 8)
\e           ESC (not in C99, ASCII 27)
\f           FF (form feed, ASCII 12)
\l           LF (ASCII 10)
\m           CR/LF pair (ASCII 13, 10) - for HTML etc.
\n           newline - CRLF for Windows/DOS, LF for Unices
\q           double-quote (ASCII 34)
\r           CR (ASCII 13)
\t           HT (tab, ASCII 9)
\v           VT (ASCII 11)
\z           NUL (ASCII 0)
\"           "
\[0-7]+      Octal numerical character value, finishes at the
             first non-octal character
\x[0-9a-f]+  Hex numerical character value, finishes at the
             first non-hex character
\\           backslash itself
\            before any other character represents that character

Considerations
--------------
We are trying to integrate several issues:

1) no/least code breakage
2) minimal standards changes
3) variable width character sets
4) small system functionality

Item 1) is about the common char=byte=au assumption.
Item 2) includes the use of COUNT to step through memory and the
impact of char in the file word sets.
Item 3) has to rationalise a fixed width serial/comms channel
with 1..4 byte characters, e.g. UTF-8
Item 4) should enable 16 bit systems to handle UTF-8 and UTF-32.

The basis of the current approach is to use the terminology of
primitive characters and extended characters. A primitive character
(called a pchar here) is a fixed-width unit handled by EMIT and
friends as well as C@, C! and friends. A pchar corresponds to the
current ANS definition of a character. Characters that may be
wider than a pchar are called "extended characters" or xchars.
The xchars are an integer multiple of pchars. An xchar consists
of one or more primitive characters and represents the encoding
for a "display unit". A string is represented by caddr/len
in terms of primitive characters.

The consequences of this are:

1) No existing code is broken.
2) Most systems have only one keyboard and only one screen/display
unit, but may have several additional comms channels. The
impact of a keyboard driver having to convert Chinese or Russian
characters into a (say) UTF-8 sequence is minimal compared to
handling the key stroke sequences. Similarly on display.
3) Comms channels and files work as expected.
4) 16-bit embedded systems can handle all character widths as they
are described as strings.
5) No conflict arises with the XCHARs proposal.

Multiple encodings can be handled if they share a common primitive
character size - nearly all encodings are described in terms of
octets, e.g. TCP/IP, UTF-8, UTF-16, UTF-32, ...
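
As a rough illustration of the caddr/len convention described above, the
sketch below assumes a system that already provides S\" as proposed later
in this document, 8-bit primitive characters, and UTF-8 as the xchar
encoding. The word name euro$ is made up for this example. The Euro sign
is a single xchar, but the string length is counted in pchars:

```forth
\ Assumptions: S\" exists as proposed; pchars are 8 bits; UTF-8 encoding.
\ euro$ is a hypothetical name used only for this illustration.
: euro$ ( -- c-addr u )  S\" \xE2\x82\xAC" ;  \ U+20AC as three pchars
euro$ nip .   \ prints the length in pchars: 3
```

Existing words such as TYPE, MOVE and COUNT see only the three pchars,
which is why no existing code needs to change.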

Approach
--------
This proposal does not require systems to handle xchars, and does
not disenfranchise those that do.

S\" is used like S" but treats the '\' character specially. One
or more characters after the '\' indicate what is substituted.
Three of the escapes observed in current practice cause parsing
and readability problems; they are discussed below.

As far as I know, requiring characters to come in 8 bit units
will not upset any systems. Systems with characters of less than
8 bits are non-compliant, and I know of no 7 bit CPUs. All
current systems use character units of 8 bits or more.

Of observed current practice, the following two are problematic.

\[0-7]+ Octal numerical character value, finishes at the
first non-octal character

\x[0-9a-f]+ Hex numerical character value, finishes at the
first non-hex character

Why do we need two representations, both of variable length?
This proposal selects the hexadecimal representation, requiring
two hex digits. A consequence of this is that xchars must be
represented as a sequence of pchars. Although initially seen as a
problem by some people, it avoids at least the following problems:

1) Endian issues when transmitting an xchar, e.g. big-endian host
to little-endian comms channel

2) Issues when an xchar is larger than a cell, e.g. UTF-32 on
a 16 bit system.

3) Does not have problems in distinguishing the end of the
number from a following character such as '0' or 'A'.
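
Point 3 can be sketched with >NUMBER. This is only an illustration, not
part of the proposal: the names var-hex and two-hex are hypothetical, and
the interpreted use of S" assumes the FILE word set is present. The
variable-length form swallows a following '0', while the fixed two-digit
form leaves it alone:

```forth
hex
\ Variable-length form: converts every leading hex digit it finds.
: var-hex ( c-addr u -- n )  0 0 2swap >number 2drop drop ;
\ Fixed two-digit form: looks at no more than two characters.
: two-hex ( c-addr u -- n )  2 min var-hex ;
s" 0F0" var-hex .   \ F0 : the trailing '0' was taken as a digit
s" 0F0" two-hex .   \ F  : the trailing '0' remains an ordinary character
decimal
```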

At least one system (Gforth) already supports UTF-8 as its native
character set, and one system (JaxForth) used UTF-16. These systems
are not affected.

\ before any other character represents that character

This is an unnecessary general case, and so is not mandated. By
making it an ambiguous condition, we do not disenfranchise
existing implementations, and leave the way open for future
extensions.


Proposal
========

6.2.xxxx S\"
s-slash-quote CORE EXT

Interpretation:
Interpretation semantics for this word are undefined.

Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by " (double-quote), using the translation
rules below. Append the run-time semantics given below to the
current definition.

Translation rules:
Characters are processed one at a time and appended to the
compiled string. If the character is a '\' character it is
processed by parsing and substituting one or more characters
as follows:

\a    BEL (alert, ASCII 7)
\b    BS (backspace, ASCII 8)
\e    ESC (not in C99, ASCII 27)
\f    FF (form feed, ASCII 12)
\l    LF (ASCII 10)
\m    CR/LF pair (ASCII 13, 10)
\n    implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
\q    double-quote (ASCII 34)
\r    CR (ASCII 13)
\t    HT (tab, ASCII 9)
\v    VT (ASCII 11)
\z    NUL (ASCII 0)
\"    "
\xAB  A and B are hexadecimal numerical characters. The resulting
      character is the conversion of these two characters. An
      ambiguous condition exists if \x is not followed by two
      hexadecimal characters.
\\    backslash itself
\     An ambiguous condition exists if a \ is placed before any
      character, other than those defined in 6.2.xxxx S\".

Run-time: ( -- c-addr u )
Return c-addr and u describing a string consisting of the translation
of the characters ccc. A program shall not alter the returned string.

See: 3.4.1 Parsing, 6.2.0855 C" , 11.6.1.2165 S" , A.6.1.2165 S"
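
A few usage sketches may help when reading the definition above. They
assume a system that already provides S\" with the proposed semantics;
the word names .http-ok, cls$ and say are invented for the example.

```forth
\ Assumption: the system provides S\" with the semantics proposed above.
\ All three word names are hypothetical.
: .http-ok ( -- )  S\" HTTP/1.0 200 OK\m\m" type ;  \ status line, then blank line
: cls$ ( -- c-addr u )  S\" \x1B[2J" ;              \ ESC [ 2 J
: say ( -- )  S\" He said \"hi\"\n" type ;           \ embedded quotes, newline
cls$ nip .   \ 4
```

Note that \x1B consumes exactly two hex digits, so the '[' that follows
is an ordinary character.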

Labelling
=========
Ambiguous conditions occur:
  If \x is not followed by two hexadecimal characters.
  If a \ is placed before any character, other than those defined
  in 6.2.xxxx S\".


Reference Implementation
========================
Taken from the VFX Forth source tree and modified to remove most
implementation dependencies. Assumes the use of the # and $ numeric
prefixes to indicate decimal and hexadecimal respectively.

Another implementation (with some deviations) can be found at
http://b2.complang.tuwien.ac.at/cgi-bin/viewcvs.cgi/*checkout*/gforth/quotes.fs?root=gforth

decimal

: PLACE \ c-addr1 u c-addr2 --
\ *G Copy the string described by c-addr1 u to a counted string at
\ ** the memory address described by c-addr2.
2dup 2>r \ write count last
1 chars + swap move
2r> c! \ to avoid in-place problems
;

: $, \ caddr len --
\ *G Lay the string into the dictionary at *\fo{HERE}, reserve
\ ** space for it and *\fo{ALIGN} the dictionary.
dup >r
here place
r> 1 chars + allot
align
;

: addchar \ char string --
\ *G Add the character to the end of the counted string.
tuck count + c!
1 swap c+!
;

: append \ c-addr u $dest --
\ *G Add the string described by C-ADDR U to the counted string at
\ ** $DEST. The strings must not overlap.
>r
tuck r@ count + swap cmove \ add source to end
r> c+! \ add length to count
;

: extract2H \ caddr len -- caddr' len' u
\ *G Extract a two-digit hex number from the start of the
\ ** string, returning the remaining string and the converted
\ ** number. The current base is preserved.
  base @ >r hex
  0 0 2over drop 2 >number 2drop drop
  >r 2 /string r>
  r> base !
;

create EscapeTable \ -- addr
\ *G Table of translations for \a..\z.
7 c, \ \a
8 c, \ \b
char c c, \ \c
char d c, \ \d
#27 c, \ \e
#12 c, \ \f
char g c, \ \g
char h c, \ \h
char i c, \ \i
char j c, \ \j
char k c, \ \k
#10 c, \ \l
char m c, \ \m
#10 c, \ \n (Unices only)
char o c, \ \o
char p c, \ \p
char " c, \ \q
#13 c, \ \r
char s c, \ \s
9 c, \ \t
char u c, \ \u
#11 c, \ \v
char w c, \ \w
char x c, \ \x
char y c, \ \y
0 c, \ \z

create CRLF$ \ -- addr ; CR/LF as counted string
2 c, #13 c, #10 c,

internal
: addEscape \ caddr len dest -- caddr' len'
\ *G Add an escape sequence to the counted string at dest,
\ ** returning the remaining string.
  over 0= \ zero length check
  if drop exit endif
  >r \ -- caddr len ; R: -- dest
  over c@ [char] x = if \ hex number?
    1 /string extract2H r> addchar exit
  endif
  over c@ [char] m = if \ CR/LF pair?
    1 /string #13 r@ addchar #10 r> addchar exit
  endif
  over c@ [char] n = if \ newline? (CR/LF in this implementation)
    1 /string crlf$ count r> append exit
  endif
  over c@ [char] a [char] z 1+ within if
    over c@ [char] a - EscapeTable + c@ r> addchar
  else
    over c@ r> addchar
  endif
  1 /string
;
external

: parse\" \ caddr len dest -- caddr' len'
\ *G Parses a string up to an unescaped '"', translating '\'
\ ** escapes to characters much as C does. The
\ ** translated string is a counted string at *\i{dest}
\ ** The supported escapes (case sensitive) are:
\ *D \a BEL (alert)
\ *D \b BS (backspace)
\ *D \e ESC (not in C99)
\ *D \f FF (form feed)
\ *D \l LF (ASCII 10)
\ *D \m CR/LF pair - for HTML etc.
\ *D \n newline - CRLF for Windows/DOS, LF for Unices
\ *D \q double-quote
\ *D \r CR (ASCII 13)
\ *D \t HT (tab)
\ *D \v VT
\ *D \z NUL (ASCII 0)
\ *D \" "
\ *D \xAB Two char Hex numerical character value
\ *D \\ backslash itself
\ *D \ before any other character represents that character
  dup >r 0 swap c! \ zero destination
  begin \ -- caddr len ; R: -- dest
    dup
  while
    over c@ [char] " <> \ check for terminator
  while
    over c@ [char] \ = if \ deal with escapes
      1 /string r@ addEscape
    else \ normal character
      over c@ r@ addchar 1 /string
    endif
  repeat then
  dup \ step over terminating "
  if 1 /string endif
  r> drop
;

: readEscaped \ "string" -- caddr
\ *G Parses an escaped string from the input stream according to
\ ** the rules of *\fo{parse\"} above, returning the address
\ ** of the translated counted string in *\fo{PAD}.
source >in @ /string tuck \ -- len caddr len
pad parse\" nip
- >in +!
pad
;

: S\" \ "string" -- caddr u
\ *G As *\fo{S"}, but translates escaped characters using
\ ** *\fo{parse\"} above.
readEscaped count state @ if
compile (s") $,
then
; IMMEDIATE


Test Cases
==========

( The same tests as for S" )

{ : GC5 S\" XY" ; -> }
{ GC5 SWAP DROP -> 2 }
{ GC5 DROP DUP C@ SWAP CHAR+ C@ -> 58 59 }

( The following are inspired by the Gforth test suite )

{ S\" " SWAP DROP -> 0 }

{ S\" \a" SWAP C@ -> 1 07 } \ BEL Bell
{ S\" \b" SWAP C@ -> 1 08 } \ BS Backspace
{ S\" \e" SWAP C@ -> 1 1B } \ ESC Escape
{ S\" \f" SWAP C@ -> 1 0C } \ FF Formfeed
{ S\" \l" SWAP C@ -> 1 0A } \ LF Linefeed
{ S\" \q" SWAP C@ -> 1 22 } \ " Double Quote
{ S\" \r" SWAP C@ -> 1 0D } \ CR Carriage Return
{ S\" \t" SWAP C@ -> 1 09 } \ TAB Horizontal Tab
{ S\" \v" SWAP C@ -> 1 0B } \ VT Vertical Tab
{ S\" \z" SWAP C@ -> 1 00 } \ NUL No Character
{ S\" \"" SWAP C@ -> 1 22 } \ " Double Quote
{ S\" \\" SWAP C@ -> 1 5C } \ \ Back Slash

{ S\" \n" 2DROP -> } \ System dependent
{ S\" \m" SWAP DUP C@ SWAP CHAR+ C@ -> 2 0D 0A } \ CR/LF pair
{ S\" \x1Fa" SWAP DUP C@ SWAP CHAR+ C@ -> 2 1F 61 } \ Specified Char

{ S\" S\\\" \\a\"" EVALUATE SWAP C@ -> 1 7 }

Peter Knaggs

Aug 9, 2007, 2:22:55 PM

hel...@gmail.com

Aug 9, 2007, 4:13:33 PM

Peter Knaggs wrote:
> Test Cases
> ==========
>
> ...

They are missing HEX and something like
TESTING S\"

Regards,
-Helmar

Ed

Aug 10, 2007, 12:54:52 AM

"Peter Knaggs" <pkn...@bournemouth.ac.uk> wrote in message news:46BB5B66...@bournemouth.ac.uk...

> RfD: Escaped Strings S\"
> 19 July 2007, Stephen Pelc
> ...

Are the escape chars required to be lower-case?

Peter Knaggs

Aug 10, 2007, 4:17:59 AM
hel...@gmail.com wrote:
> Peter Knaggs wrote:
>> Test Cases
>> ==========
>>
>> ...
>
> They are missing HEX and something like
> TESTING S\"

The entire test suite is in HEX.

The test cases appear in the rationale for each individual word being
tested, in a "Testing" section. I see no need for the TESTING heading.
Anyhow this would be folded into the character tests (CHAR [CHAR] [ ] BL S")

Peter Fälth

Aug 10, 2007, 10:44:28 AM

> Translation rules:
> Characters are processed one at a time and appended to the
> compiled string. If the character is a '\' character it is
> processed by parsing and substituting one or more characters
> as follows:
>
> \a BEL (alert, ASCII 7)
> \b BS (backspace, ASCII 8)
> \e ESC (not in C99, ASCII 27)
> \f FF (form feed, ASCII 12)
> \l LF (ASCII 10)
> \m CR/LF pair (ASCII 13, 10)
> \n implementation dependent newline, e.g. CR/LF, LF, or LF/CR.
> \q double-quote (ASCII 34)
> \r CR (ASCII 13)
> \t HT (tab, ASCII 9)
> \v VT (ASCII 11)
> \z NUL (ASCII 0)
> \" "
> \xAB A and B are Hexadecimal numerical characters. The resulting
> character is the conversion of these two characters. An
> ambiguous conditions exists if \x is not followed by two
> hexadecimal characters.
> \\ backslash itself
> \ An ambiguous condition exists if a \ is placed before any
> character, other than those defined in 6.2.xxx s\".
>
I suggest also defining \u and \U for inputting 4 and 8 hex digit
Unicode code points. In my system \u20AC (the euro sign) will insert
the UTF-8 sequence E282AC into the string.

Peter Fälth

Anton Ertl

Aug 11, 2007, 5:12:13 AM
Peter Knaggs <pkn...@bournemouth.ac.uk> writes:
>RfD: Escaped Strings S\"
>19 July 2007, Stephen Pelc
>
>20070719 Modified ambiguous condition
> Added ambiguous conditions to definition of S\"

Ok.

> Added test cases

I have now changed the development version of Gforth so that it passes
the tests.

> Corrected Reference Implementation

There were still some non-standard words in there. I have
eliminated/defined all non-standard words and put the result on

http://www.forth200x.org/reference-implementations/escaped-strings.fs

This runs on the current development Gforth (not on Gforth-0.6.2 due
to the use of the # number prefix).

Concerning the question about the case sensitivity of the escapes,
both Gforth and the reference implementation treat them
case-sensitively.

>A consequence of this is that xchars must be
>represented as a sequence of pchars.

That's ok, but most of the justifications are nonsense. A much
better justification is that this allows any sequence of bytes to be
generated with S\" even if that sequence is not a proper xchar string;
and one needs such binary strings in various applications.

>Although initially seen as a
>problem by some people, it avoids at least the following problems:
>
>1) Endian issues when transmitting an xchar, e.g. big-endian host
> to little-endian comms channel

If there are byte order issues when transmitting xchars (e.g., for
UTF-32), that has to be dealt with at transmission, not at generation
of strings containing xchars.

>2) Issues when an xchar is larger than a cell, e.g. UTF-32 on
> a 16 bit system.

Since S\" is generating a string, the cell size is irrelevant, and
this is not an issue.

>3) Does not have problems in distinguishing the end of the
> number from a following character such as '0' or 'A'.

That's a very good justification.

> \z NUL (ASCII 0)

\0 seems to be a better candidate, because it is more in line with the
usage in other languages (in particular, C and its children, which
inspired this approach).

> \xAB A and B are Hexadecimal numerical characters.

As in "3.2.1.2 Digit conversion" (i.e. only upper case is standard at
the moment) or as in the X:number-prefixes (case-insensitive)?

>{ S\" \x1Fa" SWAP DUP C@ SWAP CHAR+ C@ -> 2 1F 61 } \ Specified Char

You might also add

{ S\" \x0F0" SWAP DUP C@ SWAP CHAR+ C@ -> 2 0F 30 }

which might catch some non-conformant implementations that the test
above doesn't catch.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/

Stephen Pelc

Aug 13, 2007, 10:15:06 AM
On Fri, 10 Aug 2007 07:44:28 -0700, Peter Fälth
<peter...@tin.it> wrote:

>I suggest also defining \u and \U for inputting 4 and 8 hex digit
>Unicode code points. In my system \u20AC (the euro sign) will insert
>the UTF-8 sequence E282AC into the string.

That suggestion leads to six forms, which is why I gave up and
define extended characters as a stream of primitive characters.
UTF-8 encoding or char number?
UTF-16 little or big-endian?
UTF-32 little or big-endian?

Stephen


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Anton Ertl

Aug 13, 2007, 10:36:35 AM
steph...@mpeforth.com (Stephen Pelc) writes:
>On Fri, 10 Aug 2007 07:44:28 -0700, Peter Fälth
><peter...@tin.it> wrote:
>
>>I suggest also defining \u and \U for inputting 4 and 8 hex digit
>>Unicode code points. In my system \u20AC (the euro sign) will insert
>>the UTF-8 sequence E282AC into the string.
>
>That suggestion leads to six forms, which is why I gave up and
>define extended characters as a stream of primitive characters.
> UTF-8 encoding or char number?
> UTF-16 little or big-endian?
> UTF-32 little or big-endian?

I think it would be premature to include this stuff in the present
proposal. It may be included in the Xchars proposal.

That being said, I think you are mistaken about what the \U and \u
forms mean, and how that relates to encodings:

It should be the Unicode character number in any case, so only \U
would be needed, and \u would be a convenience so you don't need to
say \U0000xxxx for the frequent characters in the BMP. The Unicode
number is unique and not influenced by the byte order or encoding.
The internal encoding used by the system determines whether this xchar
is represented in UTF-8, UTF-16, or UTF-32, and with which byte order.

Example:

The Euro sign mentioned above: You can write it as a literal Euro in
the string, or as \U000020AC or \u20AC. This would be equivalent to
the following \x sequence in various encodings:

Sequence             Encoding
\xE2\x82\xAC         UTF-8
\x20\xAC             UTF-16 big-endian
\xAC\x20             UTF-16 little-endian
\x00\x00\x20\xAC     UTF-32 big-endian
\xAC\x20\x00\x00     UTF-32 little-endian

(Well, actually, that assumes UTF-16 and UTF-32 xchars with 8-bit
chars, which is not a proper xchars implementation, but I hope it
gets the idea across).
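
The UTF-8 row of the table can be reproduced with a small encoder. This
is only a sketch under stated assumptions: it handles code points below
$10000, does not reject surrogates, uses a single static buffer, and
relies on the $ number prefix. The names u8buf, u8len, u8, , >utf8 and
.hex-bytes are made up for the example and are not part of any proposal.

```forth
\ Sketch only: code points below $10000, no surrogate check, not re-entrant.
create u8buf 3 chars allot
variable u8len
: u8, ( byte -- )  u8buf u8len @ chars + c!  1 u8len +! ;
: >utf8 ( u -- c-addr len )
  0 u8len !
  dup $80 < if  u8,  else            \ one byte:   0xxxxxxx
  dup $800 < if                      \ two bytes:  110xxxxx 10xxxxxx
    dup 6 rshift $C0 or u8,
    $3F and $80 or u8,
  else                               \ three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    dup 12 rshift $E0 or u8,
    dup 6 rshift $3F and $80 or u8,
    $3F and $80 or u8,
  then then
  u8buf u8len @ ;
: .hex-bytes ( c-addr u -- )         \ print each pchar in hex
  base @ >r hex
  over + swap ?do  i c@ .  loop
  r> base ! ;
$20AC >utf8 .hex-bytes   \ E2 82 AC
```

A \u escape along the lines discussed here would call something like
>utf8 at parse time and append the resulting pchars to the string.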

Peter Fälth

Aug 13, 2007, 10:58:15 AM
On Aug 13, 4:15 pm, stephen...@mpeforth.com (Stephen Pelc) wrote:
> On Fri, 10 Aug 2007 07:44:28 -0700, Peter Fälth
>
> <peter.fa...@tin.it> wrote:
> >I suggest also defining \u and \U for inputting 4 and 8 hex digit
> >Unicode code points. In my system \u20AC (the euro sign) will insert
> >the UTF-8 sequence E282AC into the string.
>
> That suggestion leads to six forms, which is why I gave up and
> define extended characters as a stream of primitive characters.
> UTF-8 encoding or char number?
> UTF-16 little or big-endian?
> UTF-32 little or big-endian?
>
> Stephen

No it does not! What follows the \u is the 4 digit hex number of
the Unicode code point. This is always the same and independent of
encoding or endianness. S\" will then translate this to the encoding
and endianness used in the specific system. If I write the string
S\" Please pay me 10\u20AC" this will be portable to whatever your
system uses for Unicode encoding. On my system it stores E282AC in the
bytestream. On a Windows system using UTF-16 it will store AC20
at the character position. It is when I input individual bytes with
\x that I need to keep track of the 6 cases. I want to avoid this.
Peter

>
> --
> Stephen Pelc, stephen...@mpeforth.com

Stephen Pelc

Aug 13, 2007, 12:23:32 PM
On Mon, 13 Aug 2007 07:58:15 -0700, Peter Fälth
<peter...@tin.it> wrote:

>> That suggestion leads to six forms, which is why I gave up and
>> define extended characters as a stream of primitive characters.
>> UTF-8 encoding or char number?
>> UTF-16 little or big-endian?
>> UTF-32 little or big-endian?
>>
>> Stephen
>
>No it does not! What follows the \u is the 4 digit hex number of
>the Unicode code point. This is always the same and independent of
>encoding or endianness. S\" will then translate this to the encoding
>and endianness used in the specific system. If I write the string
>S\" Please pay me 10\u20AC" this will be portable to whatever your
>system uses for Unicode encoding. On my system it stores E282AC in the
>bytestream. On a Windows system using UTF-16 it will store AC20
>at the character position. It is when I input individual bytes with
>\x that I need to keep track of the 6 cases. I want to avoid this.

My bad! Thanks for the explanation. I assume that \U is followed by an
8 digit hex number. Although this notation solves the problems of the
host, is it enough when the string is sent across a comms channel to
another box of the other endianness?

Stephen

--
Stephen Pelc, steph...@mpeforth.com


MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691

Peter Knaggs

Aug 13, 2007, 4:34:59 PM
Peter Fälth wrote:
>
> I suggest also defining \u and \U for inputting 4 and 8 hex digit
> Unicode code points. In my system \u20AC (the euro sign) will insert
> the UTF-8 sequence E282AC into the string.

This assumes the system is using Unicode. There is nothing to mandate
that at present. If you were providing a non-Unicode system, would \u
and \U reflect the native encoding or would you insist on a full Unicode
conversion?

Peter Fälth

Aug 13, 2007, 4:50:59 PM
On Aug 13, 6:23 pm, stephen...@mpeforth.com (Stephen Pelc) wrote:
> On Mon, 13 Aug 2007 07:58:15 -0700, Peter Fälth
>
> <peter.fa...@tin.it> wrote:
> >> That suggestion leads to six forms, which is why I gave up and
> >> define extended characters as a stream of primitive characters.
> >> UTF-8 encoding or char number?
> >> UTF-16 little or big-endian?
> >> UTF-32 little or big-endian?
>
> >> Stephen
>
> >No it does not! What follows the \u is the 4 digit hex number of
> >the Unicode code point. This is always the same and independent of
> >encoding or endianness. S\" will then translate this to the encoding
> >and endianness used in the specific system. If I write the string
> >S\" Please pay me 10\u20AC" this will be portable to whatever your
> >system uses for Unicode encoding. On my system it stores E282AC in the
> >bytestream. On a Windows system using UTF-16 it will store AC20
> >at the character position. It is when I input individual bytes with
> >\x that I need to keep track of the 6 cases. I want to avoid this.
>
> My bad! Thanks for the explanation. I assume that \U is followed by an
> 8 digit hex number. Although this notation solves the problems of the
> host, is it enough when the string is sent across a comms channel to
> another box of the other endianness?
>
> Stephen

Yes, \U is for 8 digits.

If the string is in UTF-8 there would be no problems with endianness.
I assume that for communicating there would be a protocol that
specifies how strings are sent.

Peter

> --
> Stephen Pelc, stephen...@mpeforth.com


> MicroProcessor Engineering Ltd - More Real, Less Time
> 133 Hill Lane, Southampton SO15 5AF, England
> tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691

Peter Fälth

Aug 13, 2007, 5:11:38 PM

No, \u and \U should always reflect the Unicode code point. The system
should then try to convert this to the encoding in use. If this
fails, a predefined character would be inserted to show a failed
conversion. This could be a ? or box character. In a system with
Latin-1, all codes above $FF will fail; all others will be a direct
translation. For other encodings the translation will require more
work. There are libraries in both Linux and Windows that can handle
this.
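
For the Latin-1 case the conversion described above is small enough to
sketch directly. The word name >latin1 and the choice of '?' as the
substitute are assumptions made for this illustration only; the $ number
prefix is assumed as well.

```forth
\ Hypothetical helper: map a Unicode code point to a Latin-1 character,
\ substituting '?' when the code point has no Latin-1 equivalent.
: >latin1 ( u -- char )  dup $FF u> if  drop [char] ?  then ;
$E9   >latin1 .   \ 233 : e-acute maps directly
$20AC >latin1 .   \ 63  : the Euro sign becomes '?'
```

Other target encodings would need a lookup table or an operating system
conversion library instead of a single comparison.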

Peter

Peter Knaggs

Aug 14, 2007, 6:47:37 PM
Anton Ertl wrote:
>
> There were still some non-standard words in there. I have
> eliminated/defined all non-standard words and put the result on

Thanks, I have now removed PLACE and $, as they are no longer required.

> This runs on the current development Gforth (not on Gforth-0.6.2 due
> to the use of the # number prefix).

The use of DECIMAL means that the # prefix is not required, so I have
removed it.


> Concerning the question about the case sensitivity of the escapes,
> both Gforth and the reference implementation treat them
> case-sensitively.

Agreed. Although if we are to consider \u and \U then escapes would have
to be case sensitive.


>> Although initially seen as a
>> problem by some people, it avoids at least the following problems:
>>
>> 1) Endian issues when transmitting an xchar, e.g. big-endian host
>> to little-endian comms channel
>
> If there are byte order issues when transmitting xchars (e.g., for
> UTF-32), that has to be dealt with at transmission, not at generation
> of strings containing xchars.

This is an argument in favour of \u.


>> \z NUL (ASCII 0)
>
> \0 seems to be a better candidate, because it is more in line with the
> usage in other languages (in particular, C and it's children, which
> inspired this approach).

\0 is a side effect of allowing octal values. This naturally leads on to
the ability to specify characters in decimal \ddd or hex \0xhh. If we
are going to use number prefixes then we should use \#0 or \$00. I am
not suggesting this, as there is little point in having multiple methods
for entering specific character codes.


>> \xAB A and B are Hexadecimal numerical characters.
>
> As in "3.2.1.2 Digit conversion" (i.e. only upper case is standard at
> the moment) or as in the X:number-prefixes (case-insensitive).

Good question, I would say upper-case as per the current document. If we
change this to allow lower-case then so be it.


>> { S\" \x1Fa" SWAP DUP C@ SWAP CHAR+ C@ -> 2 1F 61 } \ Specified Char
>
> You might also add
>
> { S\" \x0F0" SWAP DUP C@ SWAP CHAR+ C@ -> 2 0F 30 }
>
> which might catch some non-conformant implementations that the test
> above doesn't catch.

Good idea, done.

Anton Ertl

Aug 16, 2007, 3:52:42 AM
Peter Knaggs <pkn...@bournemouth.ac.uk> writes:

>Anton Ertl wrote:
>> Concerning the question about the case sensitivity of the escapes,
>> both Gforth and the reference implementation treat them
>> case-sensitively.
>
>Agreed. Although if we are to consider \u and \U then escapes would have
>to be case sensitive.

"Although" sounds as if there would be a conflict, but there isn't.

>>> \z NUL (ASCII 0)
>>
>> \0 seems to be a better candidate, because it is more in line with the
>> usage in other languages (in particular, C and it's children, which
>> inspired this approach).
>
>\0 is a side effect of allowing octal values.

In C it is. However, the proposal does not specify octal notation for
characters, so in Forth it would not be. We still might want to
support \0 without supporting octal notation, for the reasons given
above.

> This naturally leads on to
>the ability to specify characters in decimal \ddd or hex \0xhh.

Not really. In particular, it has not led to such consequences in C,
it conflicts with octal notation in C, and we (and C) already have
\x<h><h> for hex.

>>> \xAB A and B are Hexadecimal numerical characters.
>>
>> As in "3.2.1.2 Digit conversion" (i.e. only upper case is standard at
>> the moment) or as in the X:number-prefixes (case-insensitive).
>
>Good question, I would say upper-case as per the current document.

In that case I suggest adding a reference to 3.2.1.2.

> If we
>change this to allow lower-case then so be it.

In that case we would change it in 3.2.1.2.
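
The two readings of "hexadecimal digit" discussed in this thread can be
contrasted in a short sketch. Both word names are hypothetical, and the
code only illustrates the distinction between upper-case-only and
case-insensitive digits; it is not taken from either document.

```forth
\ Hypothetical helpers illustrating the two readings of "hexadecimal digit".
: hexdigit-strict ( char -- u true | false )   \ 0-9 and A-F only
  dup [char] 0 [char] 9 1+ within if  [char] 0 -  true exit then
  dup [char] A [char] F 1+ within if  [char] A - 10 +  true exit then
  drop false ;
: hexdigit-any ( char -- u true | false )      \ also accepts a-f
  dup [char] a [char] f 1+ within if  [char] a - [char] A +  then
  hexdigit-strict ;
char f hexdigit-strict .   \ 0 : lower case is rejected
char f hexdigit-any . .    \ flag (true), then the digit value 15
```

With the strict reading, \x1f would be an ambiguous condition while \x1F
is well defined; with the relaxed reading both yield the same character.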
