RfD: XCHAR wordset

Bernd Paysan

Jul 14, 2007, 3:56:12 PM

to fort...@yahoogroups.com

Since it's time to post RfDs, I want to throw in the updated proposal for
the XCHAR wordset. I hope I have included all comments so far, and I also
included a reference implementation.

Problem:

ASCII is only appropriate for the English language. Most western
languages however fit somewhat into the Forth frame, since a byte is
sufficient to encode the few special characters in each (although the
same encoding cannot always be used; Latin-1 is the most widely used).
For other languages, different char-sets have to be used,
several of them variable-width. The most prominent representative is
UTF-8. Let's call these extended characters XCHARs. Since ANS Forth
specifies ASCII encoding, only ASCII-compatible encodings may be
used. Fortunately, being ASCII compatible has so many benefits that
most encodings actually are ASCII compatible.
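
The benefit of ASCII compatibility shows at the byte level. The
following is a minimal sketch, assuming UTF-8, one byte per character,
and a system that accepts the $ hex prefix. The name utf8-class is made
up for this illustration and is not part of the proposal. Every byte of
a multi-byte UTF-8 sequence has its high bit set, so code that only
looks for ASCII bytes (for example a parser scanning for a blank) never
mistakes part of an extended character for an ASCII one.

```forth
\ Illustration only: classify a single UTF-8 byte.
\ 0 = ASCII, 1 = lead byte of a multi-byte sequence, 2 = continuation byte.
\ utf8-class is a hypothetical helper, not a word of the XCHAR proposal.
\ Invalid bytes such as $FF are not detected; they are reported as lead bytes.
: utf8-class ( byte -- n )
  dup $80 u< IF drop 0 EXIT THEN        \ 0xxxxxxx: plain ASCII
  $C0 and $80 = IF 2 ELSE 1 THEN ;      \ 10xxxxxx: continuation byte
```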

Proposal

Datatypes:

xc is an extended char on the stack. It occupies one cell, and is a
subset of unsigned cell. Note: UTF-8 cannot store more than
31 bits; on 16-bit systems, only the 16-bit (UCS-2) subset of the
character set can be used. Small embedded systems can keep
xchars always in memory, because all words directly dealing with
the xc datatype are in the XCHAR EXT wordset.

xc_addr is the address of an XCHAR in memory. Alignment requirements are
the same as c_addr. The memory representation of an XCHAR differs
from the stack location, and depends on the encoding used. An XCHAR
may use a variable number of address units in memory.

encoding is a cell-sized opaque data type identifying a particular encoding.

Common encodings:

Input and files are commonly encoded as either ISO Latin-1 or UTF-8. The
encoding depends on settings of the computer system such as the LANG
environment variable on Unix. You can use the system consistently only
when you don't change the encoding, or only use the ASCII
subset. Typically the base system is ASCII-only and is then
extended in an encoding-specific way.

Side issues to be considered:

Many Forth systems today are case insensitive, to accept lower case
standard words. It is sufficient to be case insensitive for the ASCII
subset to make this work - this saves a large code mapping table for
comparison of other symbols. Case is mostly an issue of European
languages (latin, greek, and cyrillic), but similar issues exist
between traditional and simplified Chinese, and between different
Latin code pages in UCS, e.g. full width vs. normal half width latin
letters. Some encodings (not UTF-8) might give surprises when you use
a case insensitive ASCII-compare that's 8-bit save, but not aware of
the current encoding.

Words:

XC-SIZE ( xc -- u ) XCHAR EXT
Computes the memory size of the XCHAR xc in address units.

X-SIZE ( xc_addr u1 -- u2 ) XCHAR
Computes the memory size of the first XCHAR stored at xc_addr in
address units.

XC@+ ( xc_addr1 -- xc_addr2 xc ) XCHAR EXT
Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XC!+ ( xc xc_addr1 -- xc_addr2 ) XCHAR EXT
Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
characters large. xc_addr2 points to the first memory location after
xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
into the buffer, flag is true, otherwise flag is false, and xc_addr2
u2 equal xc_addr1 u1. XC!+? is save for buffer overflows, and
therefore preferred over XC!+.

XCHAR+ ( xc_addr1 -- xc_addr2 ) XCHAR EXT
Adds the size of the XCHAR stored at xc_addr1 to this address, giving
xc_addr2.

XCHAR- ( xc_addr1 -- xc_addr2 ) XCHAR EXT
Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
work for every possible encoding.

XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
u2 is the remaining buffer after stepping over the first XCHAR in the
buffer.

-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
Step backward by one xchar in the buffer defined by xcaddr1 u1,
starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
after stepping backward over the last XCHAR in the buffer. Unlike
XCHAR-, -XSTRING can be implemented in encodings that have only a
forward-working string size.
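
For UTF-8, stepping backward inside a bounded buffer could look like
the following sketch. It assumes UTF-8, one byte per address unit, and
a system that accepts the $ hex prefix. The names utf8-cont? and
utf8-string- are hypothetical, chosen so that they do not clash with
the proposed words. Unlike a bare backward scan, this version never
reads below the start of the buffer.

```forth
\ Illustration only, not part of the proposal.
\ True if byte is a UTF-8 continuation byte (10xxxxxx).
: utf8-cont? ( byte -- flag )  $C0 and $80 = ;

\ Drop the last UTF-8 sequence from the buffer addr u1.
\ Stops at the buffer start, so an empty buffer stays empty.
: utf8-string- ( addr u1 -- addr u2 )
  BEGIN dup WHILE                   \ anything left to step over?
    1-  2dup + c@ utf8-cont? 0=     \ removed byte starts a sequence?
  UNTIL THEN ;
```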

-TRAILING-GARBAGE ( xcaddr u1 -- xcaddr u2 ) XCHAR
Examine the last XCHAR in the buffer xcaddr u1. If the encoding is
correct and it represents a full character, u2 equals u1; otherwise, u2
represents the string without the last (garbled) XCHAR.

X-WIDTH ( xc_addr u -- n ) XCHAR
n is the number of monospace ASCII characters that take the same space to
display as the XCHAR string starting at xc_addr, using u address units.

XKEY ( -- xc ) XCHAR EXT
Reads an XCHAR from the terminal.

XEMIT ( xc -- ) XCHAR EXT
Prints an XCHAR on the terminal.

SET-ENCODING ( encoding -- ) XCHAR EXT
Sets the input encoding to the specified encoding.

GET-ENCODING ( -- encoding ) XCHAR EXT
Returns the current encoding.

Encodings are implementation-specific; example encoding names are:

ISO-LATIN-1 ( -- encoding ) XCHAR EXT
ISO Latin1 encoding (one byte per character)

UTF-8 ( -- encoding ) XCHAR EXT
UTF-8 encoding (UCS codepage, byte-oriented variable length encoding)

The following words behave differently when the XCHAR extension is present:

CHAR ( "<spaces>name" -- xc )
Skip leading space delimiters. Parse name delimited by a space. Put the
value of its first XCHAR onto the stack.

[CHAR]
Interpretation: Interpretation semantics for this word are undefined.
Compilation: ( "<spaces>name" -- )
Skip leading space delimiters. Parse name delimited by a space. Append the
run-time semantics given below to the current definition.
Run-time: ( -- xc )
Place xc, the value of the first XCHAR of name, on the stack.

Reference implementation:

-------------------------xchar.fs----------------------------
\ xchar reference implementation: UTF-8 (and ISO-LATIN-1)

\ environmental dependency: characters are stored as bytes
\ environmental dependency: lower case words accepted

base @ hex

80 Value maxascii

: xc-size ( xc -- n )
dup maxascii u< IF drop 1 EXIT THEN \ special case ASCII
$800 2 >r
BEGIN 2dup u>= WHILE 5 lshift r> 1+ >r dup 0= UNTIL THEN
2drop r> ;

: xc@+ ( xcaddr -- xcaddr' u )
count dup maxascii u< IF EXIT THEN \ special case ASCII
7F and 40 >r
BEGIN dup r@ and WHILE r@ xor
6 lshift r> 5 lshift >r >r count
3F and r> or
REPEAT r> drop ;

: xc!+ ( xc xcaddr -- xcaddr' )
over maxascii u< IF tuck c! char+ EXIT THEN \ special case ASCII
>r 0 swap 3F
BEGIN 2dup u> WHILE
2/ >r dup 3F and 80 or swap 6 rshift r>
REPEAT 7F xor 2* or r>
BEGIN over 80 u< 0= WHILE tuck c! char+ REPEAT nip ;

: xc!+? ( xc xcaddr u -- xcaddr' u' flag )
>r over xc-size r@ over u< IF ( xc xc-addr1 len r: u1 )
\ not enough space
drop nip r> false
ELSE
>r xc!+ r> r> swap - true
THEN ;

\ scan to next/previous character

: xchar+ ( xcaddr -- xcaddr' ) xc@+ drop ;
: xchar- ( xcaddr -- xcaddr' )
BEGIN 1 chars - dup c@ C0 and maxascii <> UNTIL ;

: xstring+ ( xcaddr u -- xcaddr u' )
over + xchar+ over - ;
: xstring- ( xcaddr u -- xcaddr u' )
over + xchar- over - ;

: +xstring ( xc-addr1 u1 -- xc-addr2 u2 )
over dup xchar+ swap - /string ;
: -xstring ( xc-addr1 u1 -- xc-addr2 u2 )
over dup xchar- swap - /string ;

\ skip trailing garbage

: x-size ( xcaddr u1 -- u2 ) drop
\ length of UTF-8 char starting at u8-addr (accesses only u8-addr)
c@
dup $80 u< IF drop 1 exit THEN
dup $c0 u< IF drop 1 EXIT THEN \ really is a malformed character
dup $e0 u< IF drop 2 exit THEN
dup $f0 u< IF drop 3 exit THEN
dup $f8 u< IF drop 4 exit THEN
dup $fc u< IF drop 5 exit THEN
dup $fe u< IF drop 6 exit THEN
drop 1 ; \ also malformed character

: -trailing-garbage ( xcaddr u1 -- xcaddr u2 )
2dup + dup xchar- ( addr u1 end1 end2 )
2dup dup over over - x-size + = IF \ last character ok
2drop
ELSE
nip nip over -
THEN ;

\ utf key and emit

: xkey ( -- xc )
key dup maxascii u< IF EXIT THEN \ special case ASCII
7F and 40 >r
BEGIN dup r@ and WHILE r@ xor
6 lshift r> 5 lshift >r >r key
3F and r> or
REPEAT r> drop ;

: xemit ( xc -- )
dup maxascii u< IF emit EXIT THEN \ special case ASCII
0 swap 3F
BEGIN 2dup u> WHILE
2/ >r dup 3F and 80 or swap 6 rshift r>
REPEAT 7F xor 2* or
BEGIN dup 80 u< 0= WHILE emit REPEAT drop ;

\ utf size

\ uses wcwidth ( xc -- n )

: wc, ( n low high -- ) 1+ , , , ;

Create wc-table \ derived from wcwidth source code, for UCS32
0 0300 0357 wc,
0 035D 036F wc,
0 0483 0486 wc,
0 0488 0489 wc,
0 0591 05A1 wc,
0 05A3 05B9 wc,
0 05BB 05BD wc,
0 05BF 05BF wc,
0 05C1 05C2 wc,
0 05C4 05C4 wc,
0 0600 0603 wc,
0 0610 0615 wc,
0 064B 0658 wc,
0 0670 0670 wc,
0 06D6 06E4 wc,
0 06E7 06E8 wc,
0 06EA 06ED wc,
0 070F 070F wc,
0 0711 0711 wc,
0 0730 074A wc,
0 07A6 07B0 wc,
0 0901 0902 wc,
0 093C 093C wc,
0 0941 0948 wc,
0 094D 094D wc,
0 0951 0954 wc,
0 0962 0963 wc,
0 0981 0981 wc,
0 09BC 09BC wc,
0 09C1 09C4 wc,
0 09CD 09CD wc,
0 09E2 09E3 wc,
0 0A01 0A02 wc,
0 0A3C 0A3C wc,
0 0A41 0A42 wc,
0 0A47 0A48 wc,
0 0A4B 0A4D wc,
0 0A70 0A71 wc,
0 0A81 0A82 wc,
0 0ABC 0ABC wc,
0 0AC1 0AC5 wc,
0 0AC7 0AC8 wc,
0 0ACD 0ACD wc,
0 0AE2 0AE3 wc,
0 0B01 0B01 wc,
0 0B3C 0B3C wc,
0 0B3F 0B3F wc,
0 0B41 0B43 wc,
0 0B4D 0B4D wc,
0 0B56 0B56 wc,
0 0B82 0B82 wc,
0 0BC0 0BC0 wc,
0 0BCD 0BCD wc,
0 0C3E 0C40 wc,
0 0C46 0C48 wc,
0 0C4A 0C4D wc,
0 0C55 0C56 wc,
0 0CBC 0CBC wc,
0 0CBF 0CBF wc,
0 0CC6 0CC6 wc,
0 0CCC 0CCD wc,
0 0D41 0D43 wc,
0 0D4D 0D4D wc,
0 0DCA 0DCA wc,
0 0DD2 0DD4 wc,
0 0DD6 0DD6 wc,
0 0E31 0E31 wc,
0 0E34 0E3A wc,
0 0E47 0E4E wc,
0 0EB1 0EB1 wc,
0 0EB4 0EB9 wc,
0 0EBB 0EBC wc,
0 0EC8 0ECD wc,
0 0F18 0F19 wc,
0 0F35 0F35 wc,
0 0F37 0F37 wc,
0 0F39 0F39 wc,
0 0F71 0F7E wc,
0 0F80 0F84 wc,
0 0F86 0F87 wc,
0 0F90 0F97 wc,
0 0F99 0FBC wc,
0 0FC6 0FC6 wc,
0 102D 1030 wc,
0 1032 1032 wc,
0 1036 1037 wc,
0 1039 1039 wc,
0 1058 1059 wc,
1 0000 10FF wc,
2 1100 115f wc,
0 1160 11FF wc,
0 1712 1714 wc,
0 1732 1734 wc,
0 1752 1753 wc,
0 1772 1773 wc,
0 17B4 17B5 wc,
0 17B7 17BD wc,
0 17C6 17C6 wc,
0 17C9 17D3 wc,
0 17DD 17DD wc,
0 180B 180D wc,
0 18A9 18A9 wc,
0 1920 1922 wc,
0 1927 1928 wc,
0 1932 1932 wc,
0 1939 193B wc,
0 200B 200F wc,
0 202A 202E wc,
0 2060 2063 wc,
0 206A 206F wc,
0 20D0 20EA wc,
2 2329 232A wc,
0 302A 302F wc,
2 2E80 303E wc,
0 3099 309A wc,
2 3040 A4CF wc,
2 AC00 D7A3 wc,
2 F900 FAFF wc,
0 FB1E FB1E wc,
0 FE00 FE0F wc,
0 FE20 FE23 wc,
2 FE30 FE6F wc,
0 FEFF FEFF wc,
2 FF00 FF60 wc,
2 FFE0 FFE6 wc,
0 FFF9 FFFB wc,
0 1D167 1D169 wc,
0 1D173 1D182 wc,
0 1D185 1D18B wc,
0 1D1AA 1D1AD wc,
2 20000 2FFFD wc,
2 30000 3FFFD wc,
0 E0001 E0001 wc,
0 E0020 E007F wc,
0 E0100 E01EF wc,
here wc-table - Constant #wc-table

\ inefficient table walk:

: wcwidth ( xc -- n )
wc-table #wc-table over + swap ?DO
dup I 2@ within IF I 2 cells + @ UNLOOP EXIT THEN
3 cells +LOOP 1 ;

: x-width ( xcaddr u -- n )
0 rot rot over + swap ?DO
I xc@+ swap >r wcwidth +
r> I - +LOOP ;

: char ( "name" -- xc ) bl word count drop xc@+ nip ;
: [char] ( "name" -- rt:xc ) char postpone Literal ; immediate

\ switching encoding is only recommended at startup
\ only two encodings are supported: UTF-8 and ISO-LATIN-1

80 Constant utf-8
100 Constant iso-latin-1

: set-encoding to maxascii ;
: get-encoding maxascii ;

base !
-------------------------xchar.fs----------------------------

Experience:

Built into Gforth (development version) and recent versions of bigFORTH.
Open issues are file reading and writing (conversion on the fly or leave as
it is?).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bruce McFarling

Jul 15, 2007, 9:06:25 AM

How hard would it be to extend the reference implementation to UTF-32?

Erratum:

XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
characters large. xc_addr2 points to the first memory location after
xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
into the buffer, flag is true, otherwise flag is false, and xc_addr2
u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and
therefore preferred over XC!+.

Anton Ertl

Jul 15, 2007, 1:24:56 PM

Bernd Paysan <bernd....@gmx.de> writes:
>xc_addr is the address of an XCHAR in memory. Alignment requirements are
> the same as c_addr. The memory representation of an XCHAR differs
> from the stack location, and depends on the encoding used. An XCHAR
                 ^^^^^^^^
                 representation?

>Common encodings:
...

>Side issues to be considered:

These appear to be subsections that should be put in informative
sections, not the normative "Proposal" section.

>Many Forth systems today are case insensitive, to accept lower case
>standard words. It is sufficient to be case insensitive for the ASCII
>subset to make this work - this saves a large code mapping table for
>comparison of other symbols. Case is mostly an issue of European
>languages (latin, greek, and cyrillic), but similar issues exist
>between traditional and simplified Chinese, and between different
>Latin code pages in UCS, e.g. full width vs. normal half width latin
>letters. Some encodings (not UTF-8) might give surprises when you use
>a case insensitive ASCII-compare that's 8-bit save, but not aware of
>the current encoding.

Even in UTF-8 you can compose letters, e.g. an Umlaut-a from a
diaeresis and an a, and that would be encoded differently than the
Latin-1-derived Umlaut-a.
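
To make this concrete, here are the two UTF-8 byte sequences involved,
as a small sketch. The names a-precomposed, a-decomposed and
same-bytes? are made up for the illustration. It assumes one byte per
character, the $ hex prefix, and COMPARE from the STRING wordset.

```forth
\ Illustration only: two encodings of what displays as the same letter.
\ U+00E4 (precomposed a-umlaut) is C3 A4 in UTF-8.
create a-precomposed  $C3 c, $A4 c,
\ U+0061 U+0308 (a followed by combining diaeresis) is 61 CC 88.
create a-decomposed   $61 c, $CC c, $88 c,

\ A bytewise comparison sees two different strings.
: same-bytes? ( -- flag )
  a-precomposed 2  a-decomposed 3  compare 0= ;
```

same-bytes? returns false, so any byte-oriented comparison treats the
two spellings as different strings.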

Anyway, that's not a problem we should try to solve at the Forth
level, or at least not in this proposal.

>Words:
>
>XC-SIZE ( xc -- u ) XCHAR EXT
>Computes the memory size of the XCHAR xc in address units.
>
>X-SIZE ( xc_addr u1 -- u2 ) XCHAR
>Computes the memory size of the first XCHAR stored at xc_addr in
>address units.

...

>XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
>Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
>characters large.

Shouldn't the granularity of the size specifications be the same
(i.e., either aus or chars) throughout the wordset?

> xc_addr2 points to the first memory location after
>xc, u2 is the remaining size of the buffer.

In what units? The size units are missing in most of the rest of the
word specifications, but I do not mention this again.

>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
>Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
>u2 is the remaining buffer after stepping over the first XCHAR in the
>buffer.
>
>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
>Step backward by one xchar in the buffer defined by xcaddr1 u1,
>starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
>after stepping backward over the last XCHAR in the buffer. Unlike
>XCHAR-, -XSTRING can be implemented in encodings that have only a
>forward-working string size.

The asymmetry in the stack effects of XSTRING+ and -XSTRING is
probably hard to remember and may be confusing.

>X-WIDTH ( xc_addr u -- n ) XCHAR
>n is the number of monospace ASCII characters that take the same space to
>display as the XCHAR string starting at xc_addr, using u address units.

Maybe mention that this is only relevant for monospaced displays/fonts.

>SET-ENCODING ( encoding -- ) XCHAR EXT
>Sets the input encoding to the specified encoding

So there's an input encoding and an internal encoding?

Are all inputs affected? I would set file encodings per-file.

What about the output encoding?

>The following words behave differently when the XCHAR extension is present:
>
>CHAR ( "<spaces>name" -- xc )
>Skip leading space delimiters. Parse name delimited by a space. Put the
>value of its first XCHAR onto the stack.
>
>[CHAR]
>Interpretation: Interpretation semantics for this word are undefined.
> Compilation: ( "<spaces>name" -- )
>Skip leading space delimiters. Parse name delimited by a space. Append the
>run-time semantics given below to the current definition.
> Run-time: ( -- xc )
>Place xc, the value of the first XCHAR of name, on the stack.

I would call that an extended behaviour, not a different behaviour,
because the behaviour will be the same for Forth-94 programs.

>Experience:
>
>Built into Gforth (development version) and recent versions of bigFORTH.

There's also at least one other implementation, lxf-ntf by Peter Falth.

>Open issues are file reading and writing (conversion on the fly or leave as
>it is?).

We have not implemented it yet, but for text files the conversion to
and from the internal representation should be performed by
READ/WRITE-FILE/LINE. If you read it in unconverted (i.e., as
binary), the program has to keep track of which buffer contains which
encoding, and do the conversion itself, which is error-prone,
inconvenient, and the proposal does not supply words for that. But,
as mentioned above, if you really want that, you can have it by
treating the file as binary.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/

Bernd Paysan

Jul 15, 2007, 3:33:52 PM

Bruce McFarling wrote:

> How hard would it be to extend the reference implementation to UTF-32?

UTF-32 is not ASCII compatible, unless you have a system where 1 CHAR = 32
bits.

> Erratum:
>
> XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
> Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
> characters large. xc_addr2 points to the first memory location after
> xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
> into the buffer, flag is true, otherwise flag is false, and xc_addr2
> u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and
> therefore preferred over XC!+.

Thanks, there was another save/safe error, as well.

Bernd Paysan

Jul 15, 2007, 4:02:32 PM

Anton Ertl wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>>xc_addr is the address of an XCHAR in memory. Alignment requirements are
>> the same as c_addr. The memory representation of an XCHAR differs
>> from the stack location, and depends on the encoding used. An XCHAR
>                  ^^^^^^^^
>                  representation?

Yes.

>>Common encodings:
> ...
>>Side issues to be considered:
>
> These appear to be subsections that should be put in informative
> sections, not the normative "Proposal" section.

Moved it to an appendix

>>XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
>>Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
>>characters large.
>
> Shouldn't the granularity of the size specifications be the same
> (i.e., either aus or chars) throughout the wordset?

Should be AUs.

>> xc_addr2 points to the first memory location after
>>xc, u2 is the remaining size of the buffer.
>
> In what units? The size units are missing in most of the rest of the
> word specifications, but I do not mention this again.
>
>>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
>>Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
>>u2 is the remaining buffer after stepping over the first XCHAR in the
>>buffer.
>>
>>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
>>Step backward by one xchar in the buffer defined by xcaddr1 u1,
>>starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
>>after stepping backward over the last XCHAR in the buffer. Unlike
>>XCHAR-, -XSTRING can be implemented in encodings that have only a
>>forward-working string size.
>
> The asymmetry in the stack effects of XSTRING+ and -XSTRING is
> probably hard to remember and may be confusing.

Oops, got it wrong, the description is actually of +XSTRING and XSTRING-.
The sign is on the side of the string which gets modified, and indicates
the direction (+ towards higher addresses, - towards lower). The sample
implementation also contains the opposite partner of each of those, but
that doesn't make too much sense (if you extend the buffer, you can as well
use XCHAR+ and XCHAR-).

>>X-WIDTH ( xc_addr u -- n ) XCHAR
>>n is the number of monospace ASCII characters that take the same space to
>>display as the XCHAR string starting at xc_addr, using u address
>>units.
>
> Maybe mention that this is only relevant for monospaced displays/fonts.

It applies to fonts where the width of each character is an integer
multiple of the width of an ASCII character. Calling that "monospaced"
stretches the word a bit ;-).

>>SET-ENCODING ( encoding -- ) XCHAR EXT
>>Sets the input encoding to the specified encoding
>
> So there's an input encoding and an internal encoding?

Actually, there's just an encoding, which is both internal (for words like
XCHAR+), and external (for XKEY/XEMIT).

> Are all inputs affected? I would set file encodings per-file.
>
> What about the output encoding?

So far, only one encoding at a time is supported.

>>The following words behave differently when the XCHAR extension is present:
>>
>>CHAR ( "<spaces>name" -- xc )
>>Skip leading space delimiters. Parse name delimited by a space. Put the
>>value of its first XCHAR onto the stack.
>>
>>[CHAR]
>>Interpretation: Interpretation semantics for this word are undefined.
>> Compilation: ( "<spaces>name" -- )
>>Skip leading space delimiters. Parse name delimited by a space. Append
>>the run-time semantics given below to the current definition.
>> Run-time: ( -- xc )
>>Place xc, the value of the first XCHAR of name, on the stack.
>
> I would call that an extended behaviour, not a different behaviour,
> because the behaviour will be the same for Forth-94 programs.
>
>>Experience:
>>
>>Built into Gforth (development version) and recent versions of bigFORTH.
>
> There's also at least one other implementation, lxf-ntf by Peter Falth.

Fine.

>>Open issues are file reading and writing (conversion on the fly or leave
>>as it is?).
>
> We have not implemented it yet, but for text files the conversion to
> and from the internal representation should be performed by
> READ/WRITE-FILE/LINE. If you read it in unconverted (i.e., as
> binary), the program has to keep track of which buffer contains which
> encoding, and do the conversion itself, which is error-prone,
> inconvenient, and the proposal does not supply words for that. But,
> as mentioned above, if you really want that, you can have it by
> treating the file as binary.

I think for file encodings, we should have a word that sets the encoding of
a file, like SET-FILE-ENCODING ( encoding fd -- ior ), and we also need a
tag in the file to set the encoding while interpreting, i.e.
SET-SOURCE-ENCODING (sets the encoding of the source file).

Alex McDonald

Jul 15, 2007, 4:33:30 PM

to forth200x

Bernd Paysan wrote:

[snipped]

Unfortunately, on first analysis, this is one proposal that Win32Forth
will not be adopting any time soon.

Windows is UTF-16, which is not ASCII compliant. Although Windows
provides APIs to translate from locale to locale, there is no method in
Win32Forth to automatically identify which parameters would be required
to be translated from XCHARs to UTF-16 and back; the programmer would be
responsible for coding the conversions.
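
The per-character part of such a conversion is small. The sketch below
turns one code point into UTF-16 code units. xc>utf16 is a hypothetical
name, not part of the proposal or of Win32Forth, and the code assumes
cells of at least 32 bits and a system that accepts the $ (hex) and
# (decimal) number prefixes. It does not validate its input; for
example, it does not reject code points in the surrogate range.

```forth
\ Illustration only: convert one code point to UTF-16 code units.
\ Returns one unit and the count 1, or a surrogate pair and the count 2.
: xc>utf16 ( xc -- u16 1 | hi lo 2 )
  dup $10000 u< IF 1 EXIT THEN   \ fits in a single 16-bit unit
  $10000 -
  dup #10 rshift $D800 +         \ high surrogate from the upper 10 bits
  swap $3FF and $DC00 +          \ low surrogate from the lower 10 bits
  2 ;
```

Finding which API parameters need this treatment is the harder part and
is not addressed by the sketch.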

We would need something like the proposal Anton made at EuroForth 2006
(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
Function Call Interface), with extensions to identify string pointers,
before implementing this.

--
Regards
Alex McDonald

Anton Ertl

Jul 16, 2007, 5:46:32 AM

Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>> Bernd Paysan <bernd....@gmx.de> writes:
>>>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
>>>Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
>>>u2 is the remaining buffer after stepping over the first XCHAR in the
>>>buffer.
>>>
>>>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
>>>Step backward by one xchar in the buffer defined by xcaddr1 u1,
>>>starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
>>>after stepping backward over the last XCHAR in the buffer. Unlike
>>>XCHAR-, -XSTRING can be implemented in encodings that have only a
>>>forward-working string size.
>>
>> The asymmetry in the stack effects of XSTRING+ and -XSTRING is
>> probably hard to remember and may be confusing.
>
>Oops, got it wrong, the description is actually of +XSTRING and XSTRING-.
>The sign is on the side of the string which gets modified, and indicates
>the direction (+ towards higher addresses, - towards lower). The sample
>implementation also contains the opposite partner of each of those, but
>that doesn't make too much sense (if you extend the buffer, you can as well
>use XCHAR+ and XCHAR-).

Hmm, your mistake may indicate that this naming is error-prone,
especially in implementations where the opposite partners exist.

>>>SET-ENCODING ( encoding -- ) XCHAR EXT
>>>Sets the input encoding to the specified encoding
>>
>> So there's an input encoding and an internal encoding?
>
>Actually, there's just an encoding, which is both internal (for words like
>XCHAR+), and external (for XKEY/XEMIT).

I think that no word for changing the internal encoding should be
standardized. Or if you standardize it, it should fail if the new
internal encoding is not an extension of the old one (i.e.,
ASCII->Latin-1 ok, ASCII->UTF-8 ok, but Latin-1->UTF-8 fails); since
this is a one-way street, GET-ENCODING makes little sense.

Otherwise a standard program could contain strings in different,
incompatible encodings, some of them in system-controlled strings
(e.g., word names), controlled by a global state variable. This would
be worse than STATE and BASE. No need to introduce another such
mistake.
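
The one-way rule could be sketched as follows. Everything here is
hypothetical: the enc-... constants, extends? and widen-encoding are
illustrative names only, and the proposal itself uses an opaque
encoding type rather than small integers.

```forth
\ Illustration only: an encoding switch that can widen but never narrow.
0 constant enc-ascii
1 constant enc-latin-1
2 constant enc-utf-8
variable current-encoding  enc-ascii current-encoding !

\ True if every string valid in old keeps its meaning in new.
: extends? ( new old -- flag )
  2dup = IF 2drop true EXIT THEN   \ same encoding: nothing changes
  nip enc-ascii = ;                \ both Latin-1 and UTF-8 extend ASCII

\ ior is 0 on success, -1 if the change was refused.
: widen-encoding ( new -- ior )
  dup current-encoding @ extends?
  IF current-encoding ! 0 ELSE drop -1 THEN ;
```

Starting from ASCII, enc-latin-1 widen-encoding succeeds, and a later
enc-utf-8 widen-encoding is refused and leaves the encoding unchanged.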

>>>Open issues are file reading and writing (conversion on the fly or leave
>>>as it is?).
>>
>> We have not implemented it yet, but for text files the conversion to
>> and from the internal representation should be performed by
>> READ/WRITE-FILE/LINE. If you read it in unconverted (i.e., as
>> binary), the program has to keep track of which buffer contains which
>> encoding, and do the conversion itself, which is error-prone,
>> inconvenient, and the proposal does not supply words for that. But,
>> as mentioned above, if you really want that, you can have it by
>> treating the file as binary.
>
>I think for file encodings, we should have a word that sets the encoding of
>a file, like SET-FILE-ENCODING ( encoding fd -- ior ),

The primary method should work through OPEN-FILE and CREATE-FILE
(e.g., by specifying the encoding in the fam). But yes, a word like
SET-FILE-ENCODING is useful when the program learns about the encoding
later (e.g., when the encoding is specified at the start of the file).

> and we also need a
>tag in the file to set the encoding while interpreting, i.e.
>SET-SOURCE-ENCODING (sets the encoding of the source file).

That sounds sensible.

Anton Ertl

Jul 16, 2007, 7:28:57 AM

Alex McDonald <bl...@rivadpm.com> writes:
>Bernd Paysan wrote:
>
>[snipped]
>
>Unfortunately, on first analysis, this is one proposal that Win32Forth
>will not be adopting any time soon.
>
>Windows is UTF-16, which is not ASCII compliant. Although Windows
>provides APIs to translate from locale to locale, there is no method in
>Win32Forth to automatically identify which parameters would be required
>to be translated from XCHARs to UTF-16 and back; the programmer would be
>responsible for coding the conversions.

I don't see that you are any worse off with xchars in this situation
than with chars.

>We would need something like the proposal Anton made at EuroForth 2006
>(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
>Function Call Interface), with extensions to identify string pointers,
>before implementing this.

For strings my approach in the C interface is that one needs to
convert explicitly. Even without Unicode, you already have the
problem of needing zero-termination in C and explicit length counts in
Forth. Hmm, maybe we need some support words for the conversion.

Alex McDonald

Jul 16, 2007, 8:39:13 AM

On Jul 16, 12:28 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Alex McDonald <b...@rivadpm.com> writes:
> >Bernd Paysan wrote:
>
> >[snipped]
>
> >Unfortunately, on first analysis, this is one proposal that Win32Forth
> >will not be adopting any time soon.
>
> >Windows is UTF-16, which is not ASCII compliant. Although Windows
> >provides APIs to translate from locale to locale, there is no method in
> >Win32Forth to automatically identify which parameters would be required
> >to be translated from XCHARs to UTF-16 and back; the programmer would be
> >responsible for coding the conversions.
>
> I don't see that you are any worse off with xchars in this situation
> than with chars.

The au would be 16 bits, with a max of 127 characters in a counted
string. This might be considered too short. It would be a pretty big
change as well, as there are a good few COUNTs and C@s in a lot of
Win32Forth code.

I didn't see an X-STRING-SIZE (a poor name, I know) in Bernd's
proposal; for conversion between encodings I would have thought it
useful.

As a general note, it's worth following the Unicode 5.0 standard's
handling of malformed Unicode, that is, throwing an error in all such
cases. The XCHARS standard should be explicit about which Unicode
processing standard it adheres to (or insist that the implementor name
the standard).

>
> >We would need something like the proposal Anton made at EuroForth 2006
> >(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
> >Function Call Interface), with extensions to identify string pointers,
> >before implementing this.
>
> For strings my approach in the C interface is that one needs to
> convert explicitly. Even without Unicode, you already have the
> problem of needing zero-termination in C and explicit length counts in
> Forth. Hmm, maybe we need some support words for the conversion.

There's also a Java-style null ("modified UTF-8"), encoded as 0xc0
0x80. It has some advantages, as C won't stop on it when using
strlen(), and strings with embedded nulls can be correctly passed to C
(for instance, when using C to write to file).

Win32Forth makes sure strings are null terminated (and the programmer
needs to be aware of this when allocating buffers for string handling;
they need to be one byte longer than required by the string).
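
The buffer-size point can be illustrated with a short sketch. >zstring
is a hypothetical helper name (not a Win32Forth word); it copies a
Forth string into a caller-supplied buffer and appends the terminating
zero byte, which is why the buffer must hold u+1 bytes. It assumes one
byte per character.

```forth
\ Illustration only: copy the string c-addr u to buf and append a NUL.
\ buf must provide room for u+1 bytes.
: >zstring ( c-addr u buf -- buf )
  dup >r            \ keep buf as the result
  swap 2dup + >r    \ remember buf+u, where the NUL goes
  move              \ copy u bytes from c-addr to buf
  0 r> c!           \ store the terminating zero byte
  r> ;
```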

Bernd Paysan

Jul 16, 2007, 10:58:38 AM

Anton Ertl wrote:
>>Windows is UTF-16, which is not ASCII compliant. Although Windows
>>provides APIs to translate from locale to locale, there is no method in
>>Win32Forth to automatically identify which parameters would be required
>>to be translated from XCHARs to UTF-16 and back; the programmer would be
>>responsible for coding the conversions.
>
> I don't see that you are any worse off with xchars in this situation
> than with chars.

It's somewhat worse, because Windows has "A" prototypes, which convert the
current code page (can be multibyte) into UTF-16 on the fly. The "W"
prototypes take UTF-16 directly. But there's some light: UTF-8 is one of
the code pages in Windows (number 65001), and you can at least use
MultiByteToWideChar to convert data.

Actually, it might be possible to change the current code page to UTF-8, but
I didn't see a hint how to do that other than for console i/o (SetConsoleCP
and SetConsoleOutputCP). I must honestly admit that I don't like the way the
online MSDN presents information. The internal search is horrible, and
it's one of the rare sites where even Google is confused.

>>We would need something like the proposal Anton made at EuroForth 2006
>>(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
>>Function Call Interface), with extensions to identify string pointers,
>>before implementing this.
>
> For strings my approach in the C interface is that one needs to
> convert explicitly. Even without Unicode, you already have the
> problem of needing zero-termination in C and explicit length counts in
> Forth. Hmm, maybe we need some support words for the conversion.

Windows strings are usually not C strings, but buffers with start address
and size (i.e. Forth strings).

Alex McDonald

Jul 16, 2007, 1:17:36 PM

On Jul 16, 3:58 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Anton Ertl wrote:
> >>Windows is UTF-16, which is not ASCII compliant. Although Windows
> >>provides APIs to translate from locale to locale, there is no method in
> >>Win32Forth to automatically identify which parameters would be required
> >>to be translated from XCHARs to UTF-16 and back; the programmer would be
> >>responsible for coding the conversions.
>
> > I don't see that you are any worse off with xchars in this situation
> > than with chars.
>
> It's somewhat worse, because Windows has "A" prototypes, which convert the
> current code page (can be multibyte) into UTF-16 on the fly. The "W"
> prototypes take UTF-16 directly. But there's some light: UTF-8 is one of
> the code pages in Windows (number 65001), and you can at least use
> MultiByteToWideChar to convert data.
>
> Actually, it might be possible to change the current code page to UTF-8, but
> I didn't see a hint how to do that other than for console i/o (SetConsoleCP
> and SetConsoleOutputCP).

It isn't possible, for reasons related to the A form of the functions;
they aren't designed to be used for anything other than byte=char code
pages. 65001, the codepage for UTF-8, isn't a valid code page for the
SetConsolexxx functions either. It can only be used by the
MultiByteToWideChar function and its reverse WideCharToMultiByte.

It's possible to build a UTF-8 Forth for Windows, but only if we know
where all the string parameters are in the A calls and trampoline them
to the W equivalents. Someone did this for cygwin;
http://www.okisoft.co.jp/esc/utf8-cygwin/ but it appears that the
cygwin maintainers rejected it; http://www.cygwin.com/ml/cygwin-patches/2006-q3/msg00014.html.

It's not that hard as there are a limited number of A form calls and
they aren't being added to or changed in any way. The problem is where
there are only W functions with no A equivalents. Then we're into
stupid territory trampolining functions by the gazillion.

If we're to go UTF-8 in Windows, the Forth implementor can cover TYPE
and the like, but the rest is then the programmer's responsibility.
Ugly. But less ugly than a UTF-16 Forth with a 16bit char and an 8 bit
au, where everyone else's code breaks as it's not ANS and they're
COUNTing and C@ing.

> I must honestly admit that I don't like the way the online
> access to MSDN presents information. The internal search is horrible, and
> it's one of the rare sites where even Google is confused.

I download it; it's large (over 1GB) but free.

Anton Ertl

Jul 16, 2007, 3:54:41 PM

Alex McDonald <bl...@rivadpm.com> writes:
>On Jul 16, 12:28 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
>wrote:
>> Alex McDonald <b...@rivadpm.com> writes:
>> >Bernd Paysan wrote:
>>
>> >[snipped]
>>
>> >Unfortunately, on first analysis, this is one proposal that Win32Forth
>> >will not be adopting any time soon.
>>
>> >Windows is UTF-16, which is not ASCII compliant. Although Windows
>> >provides APIs to translate from locale to locale, there is no method in
>> >Win32Forth to automatically identify which parameters would be required
>> >to be translated from XCHARs to UTF-16 and back; the programmer would be
>> >responsible for coding the conversions.
>>
>> I don't see that you are any worse off with xchars in this situation
>> than with chars.
>
>The au would be 16bits,

I guess you mean that the minimum size of an xchar would be 16bits.

> with a max of 127 characters in a counted
>string.

And here you mean 127 xchars in a counted string.

> This might be considered too short.

Yes, the short count is one of the disadvantages of counted strings.
The difference between 127 and 255 does not make a big difference IMO,
though.

Anyway, I think you misunderstood my point: Currently you have ASCII
strings consisting of 8-bit characters, and when dealing with Windows
functions taking or returning UTF-16 strings, you have to translate
from ASCII to UTF-16 and back, no? What's the difference if you have
to translate from UTF-8 to UTF-16 and back?

I see several options for dealing with your situation; some of that is
outlined in Section 3 of
<http://www.complang.tuwien.ac.at/papers/ertl%26paysan05.ps.gz>.

The current situation is that you have 8-bit aus and 8-bit chars, and
you want to pass/return UTF-16 strings to/from Windows functions.
Currently you have to convert between Forth strings and Windows
strings at some point. Your options are:

a) Switch to 16-bit chars with 8-bit aus, i.e., the Jax4th way. This
would give you the Unicode BMP with simple chars, or complete UTF-16
Unicode with xchars, but it would break a lot of code that assumes
1 chars = 1. Therefore I don't think this would be practical.

b) This could be addressed by using 16-bit chars with 16-bit aus. That
would require an unusual address representation, but an unusual
address representation has been used in the past in Win32Forth
(addresses relative to some base), so that should be doable. Still,
it would require converting addresses on calling Windows or C code,
and this model would probably be confusing to Forth programmers.

c) Have 8-bit chars and aus and UTF-16 xchars. Ordinary char strings
would then be incompatible with xchar strings (e.g., TYPE would work
on one of them, but not for both), but it's probably still possible to
make this workable (easier than a, but probably harder than b).

d) Have 8-bit chars and aus, and UTF-8 xchars. Then you need to
translate the strings when dealing with UTF-16 Windows strings, but it
should have the least problems when dealing with existing Forth code
and you don't need to translate addresses. Also, in many cases a
counted string can accommodate many more xchars than the other
options.

The best alternative appears to be d to me, but xchars are
implementable in the other cases, too (although c would cause some
problems there).
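To put rough numbers on the counted-string remark in d, here is a small Python sketch. It assumes a one-byte count, i.e. at most 255 address units of payload; the function name is made up for the example.

```python
def xchars_that_fit(text: str, encoding: str, limit: int = 255) -> int:
    """Count how many leading characters of text fit into limit bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > limit:
            break
        used += size
        count += 1
    return count


ascii_text = "x" * 300      # 1 byte each in UTF-8, 2 in UTF-16
cyrillic_text = "ж" * 300   # 2 bytes each in both encodings
cjk_text = "中" * 300       # 3 bytes each in UTF-8, 2 in UTF-16
```

For ASCII-heavy text UTF-8 holds 255 xchars against 127 for UTF-16; for Cyrillic the two are equal, and for CJK text UTF-8 holds fewer (85). So "in many cases" is the right qualifier.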

>I didn't see an X-STRING-SIZE (a poor name, I know) in Bernd's
>proposal; for conversion between encodings I would have thought it
>useful.

You mean the number of xchars in a string? I don't think it is very
useful except maybe for conversions to fixed-width representations
(i.e., UTF-32). Standardizing conversion words might be more
generally useful.
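For reference, counting the xchars in a UTF-8 string is cheap, since every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new character. A Python sketch, with a made-up function name and assuming well-formed input:

```python
def xchar_count(utf8: bytes) -> int:
    """Count characters by counting the bytes that are not continuations."""
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)


sample = "a\u00e9\u20ac".encode("utf-8")   # 1 + 2 + 3 = 6 bytes, 3 xchars
```

The count equals the number of cells a UTF-32 conversion would need, which is the one case mentioned above where it is directly useful.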

>As a general note, it's worth following the Unicode 5.0 standard for
>malformed Unicode; to throw an error in all such cases. The XCHARS
>standard should be explicit about which Unicode processing standard it
>adheres to (or insist that the implementor name the standard).

Actually xchars is not tied to Unicode, although we expect that it
will be mostly used for some Unicode encoding, in particular UTF-8.

>> For strings my approach in the C interface is that one needs to
>> convert explicitly. Even without Unicode, you already have the
>> problem of needing zero-termination in C and explicit length counts in
>> Forth. Hmm, maybe we need some support words for the conversion.

...

>Win32Forth makes sure strings are null terminated (and the programmer
>needs to be aware of this when allocating buffers for string handling;
>they need to be one byte longer than required by the string).

I guess this works only for a subset of words, not, e.g., for MOVE.
Also, if, e.g., READ-FILE wrote a 0 in the character beyond the end of
the buffer I have passed to READ-FILE, I would consider the Forth
system broken. So in many cases you will still have to convert a
Forth string to a zero-terminated string when you pass it to C.
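The explicit conversion is small. A Python sketch of the idea, where `memory` stands in for the Forth address space and the function name is invented for the example:

```python
def to_c_string(memory: bytes, addr: int, u: int) -> bytes:
    """Copy the u bytes starting at addr and append a terminating NUL."""
    # The copy leaves the original buffer untouched, so nothing is written
    # beyond the end of the caller's string.
    return memory[addr:addr + u] + b"\x00"


memory = b"READ-FILE buffer contents"
z = to_c_string(memory, 10, 6)   # the "addr u" pair selects "buffer"
```

The cost is one extra buffer per call, and somebody has to decide how long that buffer lives.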

Bruce McFarling

Jul 16, 2007, 8:09:19 PM

On Jul 15, 3:33 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Bruce McFarling wrote:
> > How hard would it be to extend the reference implementation to UTF-32?

> UTF-32 is not ASCII compatible, unless you have a system where 1 CHAR = 32
> bit.

So the XCHAR proposal is compatible with 8-bit code pages and with
UTF-8, but not with UCS16 or UTF32?

Bernd Paysan

Jul 15, 2007, 6:10:57 PM

Alex McDonald wrote:

> Bernd Paysan wrote:
>
> [snipped]
>
> Unfortunately, on first analysis, this is one proposal that Win32Forth
> will not be adopting any time soon.
>
> Windows is UTF-16, which is not ASCII compliant. Although Windows
> provides APIs to translate from locale to locale, there is no method in
> Win32Forth to automatically identify which parameters would be required
> to be translated from XCHARs to UTF-16 and back; the programmer would be
> responsible for coding the conversions.

It's a bit different. Windows has 'A' type prototypes for strings (which use
the current code page, and byte-oriented strings; multi-byte encodings are
supported for that), and 'W' type prototypes for wide char strings, i.e.
UTF-16. Windows has a code page integer reserved for UTF-8 (65001), so you
can at least use the MultiByteToWideChar to convert UTF-8 into UTF-16 by
using a Windows API function, and not your own brain.

I don't know if you can set the current code page with setlocale() or
SetLocaleInfo to UTF-8, but you might get unwanted side-effects by doing
so. I'm not sure how, because Windows NT and later translate all A string
functions to W internally anyway, before doing anything. But I can't see
any suggestions to set the code page deliberately other than SetConsoleCP
and SetConsoleOutputCP (for Console programs).

> We would need something like the proposal Anton made at EuroForth 2006
> (http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
> Function Call Interface), with extensions to identify string pointers,
> before implementing this.

--

Bernd Paysan

Jul 17, 2007, 5:13:05 AM

Bruce McFarling wrote:

Strictly speaking, you could use the XCHAR proposal with UCS-2 or
UCS-4/UTF-32. As long as XC@+ gives you ASCII values when reading through
an XCHAR string, it's ok. The problem is that XCHAR strings won't be
compatible with ASCII strings, so you'd need a whole lot of new words for
dealing with XCHAR strings. And that's what we want to avoid: Strings are
still strings, whether you have an XCHAR encoding or ASCII.

For Windows systems, it means that probably one or two dozen API calls need
to be wrapped in a conversion mapper between UTF-8 and UTF-16. I don't see
that this should cause implementers a big headache; maybe I find some hours
to add that to bigFORTH on Windows, as proof of the concept.

There are some serious issues with UTF-16 and UTF-32, e.g. endianness.
Basically, a file or a string can start with a "silent" endianness-switching
character. This requires a global state of endianness, which is awful when
you are dealing with more than one string at the same time (and then also
makes it mandatory for every string to start with the endian marker).
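A short Python sketch of the ambiguity; the function name is made up, and only the two UTF-16 byte order marks are checked:

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order announced by a leading BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return "unknown"


marked_le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
marked_be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
bare = "A".encode("utf-16-le")          # the two bytes 41 00, no marker

# Without a marker the same two bytes decode to different characters.
as_le = bare.decode("utf-16-le")        # "A"
as_be = bare.decode("utf-16-be")        # U+4100
```

A string without the marker is only readable if the byte order is known from somewhere else, which is the global state complained about above.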

IMHO, Ken Thompson did the only reasonable thing by inventing UTF-8; the
other Unicode encodings are too broken to be useful.

Alex McDonald

Jul 17, 2007, 8:07:25 AM

On Jul 16, 8:54 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)

wrote:
> Alex McDonald <b...@rivadpm.com> writes:
> >On Jul 16, 12:28 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
> >wrote:
> >> Alex McDonald <b...@rivadpm.com> writes:
> >> >Bernd Paysan wrote:
>
> >> >[snipped]
>
> >> >Unfortunately, on first analysis, this is one proposal that Win32Forth
> >> >will not be adopting any time soon.
>
> >> >Windows is UTF-16, which is not ASCII compliant. Although Windows
> >> >provides APIs to translate from locale to locale, there is no method in
> >> >Win32Forth to automatically identify which parameters would be required
> >> >to be translated from XCHARs to UTF-16 and back; the programmer would be
> >> >responsible for coding the conversions.
>
> >> I don't see that you are any worse off with xchars in this situation
> >> than with chars.
>
> >The au would be 16bits,
>
> I guess you mean that the minimum size of an xchar would be 16bits.

I meant the au. See your reply below...

>
> > with a max of 127 characters in a counted
> >string.
>
> And here you mean 127 xchars in a counted string.

Yes.

>
> > This might be considered too short.
>
> Yes, the short count is one of the disadvantages of counted strings.
> The difference between 127 and 255 does not make a big difference IMO,
> though.

True, but Windows filenames can "get on up there" pretty quickly.
Counted strings are nice, but a pain in this circumstance.

>
> Anyway, I think you misunderstood my point: Currently you have ASCII
> strings consisting of 8-bit characters, and when dealing with Windows
> functions taking or returning UTF-16 strings, you have to translate
> from ASCII to UTF-16 and back, no? What's the difference if you have
> to translate from UTF-8 to UTF-16 and back?

There are A-form system calls (for instance, TextOutA) that translate
between ASCII (or ANSI, to use MS' prehistoric and seriously
inaccurate terminology) and UTF-16. It's automatic, whereas UTF-8 to
UTF-16 requires a separate call to MultiByteToWideChar for each
string parameter, and a call to the W form (TextOutW). The A form
can't deal with UTF-8.

.procs textout
Location ProcName Prm ProcEP LibName
-------- -------- --- -------- --------
0048F744 TextOutW - 77F17EBC gdi32.dll
0048F72C TextOutA - 77F1BBDC gdi32.dll
004852A4 TextOut - 77F1BBDC gdi32.dll

As you can see from the entry point (ProcEP column), TextOut is an
alias for TextOutA. This Windows behaviour is to provide backward
compatibility. C/C++ code compiled with UTF-16 automatically
translates TextOut to TextOutW calls (some magic in the included .h
files, if I understand correctly).

Win32Forth already has support for this, in that we force a call to
the A type if it is present (we use dynamic linking, and we always
look for SomeNameA first, falling back to SomeName if it isn't
present). It can be changed to force a W very easily. What it can't do
so easily is translate the strings.
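The lookup rule can be sketched in a few lines of Python. This is an illustration of the rule as described here, not Win32Forth's code: the dictionary stands in for a library's export table, the TextOut addresses are the ones from the listing above, and the GetTickCount address is a placeholder.

```python
def resolve(exports: dict, name: str, suffix: str = "A"):
    """Return the entry for name+suffix if exported, else the one for name."""
    return exports.get(name + suffix, exports.get(name))


exports = {
    "TextOutA": 0x77F1BBDC,
    "TextOutW": 0x77F17EBC,
    "GetTickCount": 0x1000,   # placeholder address; no A or W variant
}

default_entry = resolve(exports, "TextOut")       # picks TextOutA
wide_entry = resolve(exports, "TextOut", "W")     # forced W form
plain_entry = resolve(exports, "GetTickCount")    # falls back to bare name
```

Switching the default suffix is the easy part; the arguments still arrive in the wrong encoding.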

Perhaps a wrapper might work here; a set of forced U calls, as in
TextOutU, with automagic translation, where TextOutU does the needed
work and calls TextOutW. Win32Forth doesn't care which library
contains the name; in fact, it's difficult to write code in Win32Forth
that binds a specific name to a specific library. It's also
transparent to the programmer. Perhaps this is the escape route.
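A rough Python sketch of what such a U wrapper would do. `fake_text_out_w` is a stand-in that only records its arguments, laid out like TextOutW (device context, x, y, string, length in 16-bit units); `text_out_u` is the hypothetical wrapper.

```python
calls = []


def fake_text_out_w(hdc, x, y, wide: bytes, count: int) -> int:
    """Stand-in for the W call: just record what it was given."""
    calls.append((hdc, x, y, wide, count))
    return 1


def text_out_u(hdc, x, y, utf8: bytes) -> int:
    """Convert the UTF-8 argument to UTF-16, then call the W form."""
    wide = utf8.decode("utf-8").encode("utf-16-le")
    # The W form wants the length in 16-bit units, not in bytes.
    return fake_text_out_w(hdc, x, y, wide, len(wide) // 2)


result = text_out_u(0, 10, 20, "n\u00e9".encode("utf-8"))
```

Every wrapper needs to know which arguments are strings and which are lengths, and that knowledge is per call.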

>
> I see several options for dealing with your situation; some of that is
> outlined in Section 3 of
> <http://www.complang.tuwien.ac.at/papers/ertl%26paysan05.ps.gz>.
>
> The current situation is that you have 8-bit aus and 8-bit chars, and
> you want to pass/return UTF-16 strings to/from Windows functions.
> Currently you have to convert between Forth strings and Windows
> strings at some point. Your options are:
>
> a) Switch to 16-bit chars with 8-bit aus, i.e., the Jax4th way. This
> would give you the Unicode BMP with simple chars, or complete UTF-16
> Unicode with xchars, but it would break a lot of code that assumes
> 1 chars = 1. Therefore I don't think this would be practical.
>
> b) This could be addressed by using 16-bit chars with 16-bit aus. That
> would require an unusual address representation, but an unusual
> address representation has been used in the past in Win32Forth
> (addresses relative to some base), so that should be doable. Still,
> it would require converting addresses on calling Windows or C code,
> and this model would probably be confusing to Forth programmers.

That was my original assertion (see above).

>
> c) Have 8-bit chars and aus and UTF-16 xchars. Ordinary char strings
> would then be incompatible with xchar strings (e.g., TYPE would work
> on one of them, but not for both), but it's probably still possible to
> make this workable (easier than a, but probably harder than b).

XTYPE? XEMIT is in the proposal.

>
> d) Have 8-bit chars and aus, and UTF-8 xchars. Then you need to
> translate the strings when dealing with UTF-16 Windows strings, but it
> should have the least problems when dealing with existing Forth code
> and you don't need to translate addresses. Also, in many cases a
> counted string can accommodate many more xchars than the other
> options.
>
> The best alternative appears to be d to me, but xchars are
> implementable in the other cases, too (although c would cause some
> problems there).

Me too. The issue is identifying the strings that need translating
back and forth automatically. Having the programmer deal with this is
too much; it needs to be pretty transparent.

>
> >I didn't see an X-STRING-SIZE (a poor name, I know) in Bernd's
> >proposal; for conversion between encodings I would have thought it
> >useful.
>
> You mean the number of xchars in a string? I don't think it is very
> useful except maybe for conversions to fixed-width representations
> (i.e., UTF-32). Standardizing conversion words might be more
> generally useful.

For conversion to UTF-16, it would be helpful; however, as it should
be transparent, perhaps the Forth programmer shouldn't need to know.

>
> >As a general note, it's worth following the Unicode 5.0 standard for
> >malformed Unicode; to throw an error in all such cases. The XCHARS
> >standard should be explicit about which Unicode processing standard it
> >adheres to (or insist that the implementor name the standard).
>
> Actually xchars is not tied to Unicode, although we expect that it
> will be mostly used for some Unicode encoding, in particular UTF-8.
>
>
>
> >> For strings my approach in the C interface is that one needs to
> >> convert explicitly. Even without Unicode, you already have the
> >> problem of needing zero-termination in C and explicit length counts in
> >> Forth. Hmm, maybe we need some support words for the conversion.
> ...
> >Win32Forth makes sure strings are null terminated (and the programmer
> >needs to be aware of this when allocating buffers for string handling;
> >they need to be one byte longer than required by the string).
>
> I guess this works only for a subset of words, not, e.g., for MOVE.
> Also, if, e.g., READ-FILE wrote a 0 in the character beyond the end of
> the buffer I have passed to READ-FILE, I would consider the Forth
> system broken. So in many cases you will still have to convert a
> Forth string to a zero-terminated string when you pass it to C.

The null is silent in terms of the count; it's just there one byte
past the end, and as you note, not for all words. It works well in
practice; 99% of the code in Win32Forth doesn't even know it's there,
as it resembles padding for alignment.

Stephen Pelc

Jul 17, 2007, 5:41:52 AM

On Sun, 15 Jul 2007 21:33:30 +0100, Alex McDonald <bl...@rivadpm.com>
wrote:

>Unfortunately, on first analysis, this is one proposal that Win32Forth
>will not be adopting any time soon.
>
>Windows is UTF-16, which is not ASCII compliant. Although Windows
>provides APIs to translate from locale to locale, there is no method in
>Win32Forth to automatically identify which parameters would be required
>to be translated from XCHARs to UTF-16 and back; the programmer would be
>responsible for coding the conversions.

There are two reasons for the XCHARs proposal, and these reasons
may cause some conflict.
1) Universal encoding, which may be variable width, e.g. UTF-8
2) Handling internationalisation of applications.

Side issues are how to handle comms control strings, modem commands,
and all the other comms standards which rely on 7/8 bit characters.
These issues are explored in the internationalisation and wide
character proposals on our website at
http://www.mpeforth.com/arena.htm

The important issue for internationalisation is to separate
the three phases of deployment:
DCS = Development Character Set - what the underlying Forth uses
OCS = Operating Character Set - what the OS or hardware uses
ACS = Application Character Set - what the application uses.

A (real-life) example is of a Russian banker/engineer using an
application set to Cyrillic in Hong Kong using a Chinese version
of Windows with an application programmed (in English) using
a Forth that assumes char=byte=au.

Such scenarios occur daily and the universal encoding is not
yet present, nor is it likely to be in programming environments
that have to cover small embedded systems with vending machines
and Telnet through large commercial PC applications.

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Alex McDonald

Jul 17, 2007, 10:11:11 AM

On Jul 17, 1:07 pm, Alex McDonald <b...@rivadpm.com> wrote:

>
> Perhaps a wrapper might work here; a set of forced U calls, as in
> TextOutU, with automagic translation, where TextOutU does the needed
> work and calls TextOutW. Win32Forth doesn't care which library
> contains the name; in fact, it's difficult to write code in Win32Forth
> that binds a specific name to a specific library. It's also
> transparent to the programmer. Perhaps this is the escape route.
>

Or perhaps not. There appear to be a large number indeed of these
calls; a hangover from Windows 3.1 by the look of them. More thought
required.

--
Regards
Alex McDonald

Anton Ertl

Jul 17, 2007, 10:36:45 AM

Alex McDonald <bl...@rivadpm.com> writes:
>On Jul 16, 8:54 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
>wrote:
>> Alex McDonald <b...@rivadpm.com> writes:
>> >The au would be 16bits,
>>
>> I guess you mean that the minimum size of an xchar would be 16bits.
>
>I meant the au. See your reply below...
>
>>
>> > with a max of 127 characters in a counted
>> >string.

Ok, if an au is 16 bits, a char is at least 16 bits, so there would be
up to 65535 chars in a counted string, and with UTF-16 up to 65536
xchars in a counted string.

>> I see several options for dealing with your situation; some of that is
>> outlined in Section 3 of
>> <http://www.complang.tuwien.ac.at/papers/ertl%26paysan05.ps.gz>.
>>
>> The current situation is that you have 8-bit aus and 8-bit chars, and
>> you want to pass/return UTF-16 strings to/from Windows functions.
>> Currently you have to convert between Forth strings and Windows
>> strings at some point. Your options are:

...

>> b) This could be addressed by using 16-bit chars with 16-bit aus. That
>> would require an unusual address representation, but an unusual
>> address representation has been used in the past in Win32Forth
>> (addresses relative to some base), so that should be doable. Still,
>> it would require converting addresses on calling Windows or C code,
>> and this model would probably be confusing to Forth programmers.
>
>That was my original assertion (see above).

Xchars would be straightforward for full Unicode support (using
UTF-16) in such a system; of course, plain chars already give you the
BMP, so the main reason for supporting xchars in such a system would
be to be able to run programs using the xchars words; the exotic
character sets outside the BMP would probably only be a minor benefit.

>> c) Have 8-bit chars and aus and UTF-16 xchars. Ordinary char strings
>> would then be incompatible with xchar strings (e.g., TYPE would work
>> on one of them, but not for both), but it's probably still possible to
>> make this workable (easier than a, but probably harder than b).
>
>XTYPE?

Yes. But there are lots of words that deal with strings, so adding X
versions of all of these would not be so nice (reminds me of the wide
character proposal). The beauty of Xchars is that string words work
with them just as usual, but c would break that.

>> The best alternative appears to be d to me, but xchars are
>> implementable in the other cases, too (although c would cause some
>> problems there).
>
>Me too. The issue is identifying the strings that need translating
>back and forth automatically. Having the programmer deal with this is
>too much; it needs to be pretty transparent.

The reason why I think that string conversion between the Forth and
the foreign representation has to be done explicitly by the programmer
is allocation of a buffer for the string and the lifetime of the
buffer. In many cases this can be very stylized, and maybe wrappers
can be generated mostly automatically, but I think for some calls one
may have to do something special.

>> >I didn't see an X-STRING-SIZE (a poor name, I know) in Bernd's
>> >proposal; for conversion between encodings I would have thought it
>> >useful.
>>
>> You mean the number of xchars in a string? I don't think it is very
>> useful except maybe for conversions to fixed-width representations
>> (i.e., UTF-32). Standardizing conversion words might be more
>> generally useful.
>
>For conversion to UTF-16, it would be helpful;

In which way? You cannot compute the exact buffer size from that
(there are characters that take 32 bits in UTF-16).
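A small Python sketch of why the xchar count is not enough: the number of 16-bit units depends on which characters are in the string. The function name is made up for the example.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit units needed to hold text in UTF-16."""
    # Characters above U+FFFF need a surrogate pair, i.e. two units.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


bmp_only = "ab"               # 2 xchars, 2 units
with_clef = "a\U0001D11E"     # 2 xchars, 3 units
```

Both strings contain two xchars, but they need different buffer sizes, so an exact size requires a pass over the actual characters.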

Bernd Paysan

Jul 17, 2007, 11:12:40 AM

Alex McDonald wrote:
> Or perhaps not. There appear to be a large number indeed of these
> calls; a hangover from Windows 3.1 by the look of them. More thought
> required.

How many of those calls do you actually need? In bigFORTH, I need about a
dozen or so. But then, bigFORTH/MINOS uses only low-level stuff of Windows,
and rolls all the high level by itself. For a system like that, wrapper
words probably are the best strategy.

The other option is to make the conversion itself easy. Convert the strings
to UTF-16 before calling the *W word, and provide a Forth word that's easy
to use, e.g. >utf-16 and +>utf-16. The two variants are necessary, because
you'll need to manage the temporary buffer. The >utf-16 will be the first
string converted, and reinitialize the temporary buffer. +>utf-16 will
start at the end of the previous string. Both take Forth strings and return
Windows strings (both are in addr u format). For the returned strings, a
similar utf-16> and +utf-16> should do it, as well (using the same
temporary buffer).
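The buffer discipline for the two outgoing words could look roughly like this. It is a Python sketch rather than Forth; the class and method names are invented, and strings are returned as (offset, length in bytes) pairs into the shared buffer instead of real addresses.

```python
class Utf16Scratch:
    """One shared temporary buffer for strings passed to W calls."""

    def __init__(self):
        self.buffer = bytearray()

    def to_utf16(self, utf8: bytes):
        """Like >utf-16: reinitialize the buffer, then convert."""
        self.buffer.clear()
        return self.append_utf16(utf8)

    def append_utf16(self, utf8: bytes):
        """Like +>utf-16: convert behind the previously converted string."""
        start = len(self.buffer)
        self.buffer += utf8.decode("utf-8").encode("utf-16-le")
        return start, len(self.buffer) - start


scratch = Utf16Scratch()
first = scratch.to_utf16(b"ab")                         # starts a new call
second = scratch.append_utf16("\u00e9".encode("utf-8")) # second argument
```

A call with two string arguments uses the first word once and the second word once; the next call starts over and reuses the space.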

The rule is: Don't try to be too smart ;-).

Bernd Paysan

Jul 17, 2007, 11:51:09 AM

Stephen Pelc wrote:
> A (real-life) example is of a Russian banker/engineer using an
> application set to Cyrillic in Hong Kong using a Chinese version
> of Windows with an application programmed (in English) using
> a Forth that assumes char=byte=au.

But that's a shortcoming of the historic systems in question. Consider the
same man using Linux or Plan-9 instead. They use UTF-8 as default encoding
(Plan-9 since late 1992, Linux for several years now), and they have proper
i18n, so it's not a "Chinese version". It's just a system with Chinese
users, everything else identical to a European Linux.

But let's assume you do the above with a XCHAR extended version of MINOS (to
be done for Windows). The system has been set to Cyrillic (koi-8-r), which
is ASCII compatible, and uses 8-bit characters. MINOS converts all strings
on the fly from the internal encoding (here: koi-8-r) into UTF-16, and
passes it to the *W words like TextOutW. This works on a Chinese version of
Windows, too, provided you have the necessary fonts installed.
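The on-the-fly conversion can be sketched in Python as follows. This is only an illustration of the idea, not MINOS code; the function name is made up, and the internal encoding is passed in as a parameter.

```python
def to_utf16le(data: bytes, internal_encoding: str) -> bytes:
    """Convert a string from the internal encoding to UTF-16 for a W call."""
    return data.decode(internal_encoding).encode("utf-16-le")


word = "Привет"
koi8 = word.encode("koi8_r")    # 6 bytes, one per letter
utf8 = word.encode("utf-8")     # 12 bytes, two per letter

from_koi8 = to_utf16le(koi8, "koi8_r")
from_utf8 = to_utf16le(utf8, "utf-8")
```

Whatever the internal encoding is, the W call receives the same UTF-16 data, so the choice of internal encoding stays a private matter of the Forth system.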

Actually, I suppose it won't be allowed to have koi-8-r as internal
encoding, but the user will be forced to UTF-8 as internal encoding. This
means the Russian programmer can continue to write Cyrillic comments and
diagnostic messages (and maybe also Cyrillic words), and the Chinese user
can have his Chinese dialog box texts and so on, just like MINOS on Linux.

> Such scenarios occur daily and the universal encoding is not
> yet present,

Well, I think the universal encoding is there. There are just many
non-universal encodings there as well, which don't deserve too much
attention.

My approach is: If you have a problem, look for standard solutions to the
problem. UCS and UTF-8 are widely used standards, e.g. XHTML and XML
default to UTF-8 encoding (you can also use UTF-16, when you start with a
byte order mark).

Microsoft apparently put quite some energy into UTF-16, and thus makes it
more difficult to use UTF-8 as internal encoding than on other systems, but
since even Microsoft accepts UTF-8 as default external encoding for XML, a
retreat is in sight. .NET has UTF-8 as first-class citizen for
System.Text.Encoding. Internally, .NET uses UTF-16, though. Since UTF-16 is
a variable width encoding, as well (surrogates, encoding 20 bits split up
in two characters which encode 10 bits each), it's a bit pointless, because
you really can't treat a UTF-16 string like a UCS-2 string.
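The surrogate arithmetic, as a Python sketch (the function name is made up for the example):

```python
def to_surrogates(code_point: int):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


pair = to_surrogates(0x1D11E)            # (0xD834, 0xDD1E)
```

Any code that indexes a UTF-16 string by 16-bit unit can land between the two halves of such a pair.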

> nor is it likely to be in programming environments
> that have to cover small embedded systems with vending machines
> and Telnet through large commercial PC applications.

Small embedded systems with their own graphic IO will quite likely use a
single 8 bit character set; if you do telnet to a small embedded system,
even UTF-8 is fine (the embedded system doesn't need to know).

Alex McDonald

Jul 17, 2007, 5:22:33 PM

Anton Ertl wrote:
> Alex McDonald <bl...@rivadpm.com> writes:
>> On Jul 16, 8:54 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
>> wrote:
>>> Alex McDonald <b...@rivadpm.com> writes:
>>>> The au would be 16bits,
>>> I guess you mean that the minimum size of an xchar would be 16bits.
>> I meant the au. See your reply below...
>>
>>>> with a max of 127 characters in a counted
>>>> string.
>
> Ok, if an au is 16 bits, a char is at least 16 bits, so there would be
> up to 65535 chars in a counted string, and with UTF-16 up to 65536
> xchars in a counted string.

My bad. I'm getting cross-eyed. (This is a Very Bad Pun, given the name
of the proposal.)

That's just too much; there are 297 API calls in the STC version of
Win32Forth, and a quick scan shows that 50% or thereabouts involve one
or more strings. A few involve return strings.

The lifetime of the buffer isn't a problem, as they would all be
transient; the UTF-8 to UTF-16 conversion and back calls demand separate
buffers, and the UTF-16 only needs to live for the lifetime of the call.

>>>> I didn't see an X-STRING-SIZE (a poor name, I know) in Bernd's
>>>> proposal; for conversion between encodings I would have thought it
>>>> useful.
>>> You mean the number of xchars in a string? I don't think it is very
>>> useful except maybe for conversions to fixed-width representations
>>> (i.e., UTF-32). Standardizing conversion words might be more
>>> generally useful.
>> For conversion to UTF-16, it would be helpful;
>
> In which way? You cannot compute the exact buffer size from that
> (there are characters that take 32 bits in UTF-16).

True. Another cross-eyed moment.

>
> - anton

I'm rapidly coming to the conclusion that I need to look seriously at
UTF-8 internally, UTF-16 for the OS, and work through some of the
problems (probably by testing out some version). I'll report back.
Thanks to you and Bernd for the discussions.

--
Regards
Alex McDonald

Stephen Pelc

Jul 19, 2007, 1:23:15 PM

On Tue, 17 Jul 2007 17:51:09 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>Stephen Pelc wrote:
>> A (real-life) example is of a Russian banker/engineer using an
>> application set to Cyrillic in Hong Kong using a Chinese version
>> of Windows with an application programmed (in English) using
>> a Forth that assumes char=byte=au.
>
>But that's a shortcoming of the historic systems in question. Consider the
>same man using Linux or Plan-9 instead. They use UTF-8 as default encoding
>(Plan-9 since late 1992, Linux for several years now), and they have proper
>i18n, so it's not a "Chinese version". It's just a system with Chinese
>users, everything else identical to a European Linux.

You know, the funny thing is that standards have to deal with reality
and common practice, and Windows is covered by both of those.

If there is anything I've proposed that disenfranchises future best
practice, please let me know.

> Small embedded systems with their own graphic IO will quite likely
> use a single 8 bit character set; if you do telnet to a small embedded
> system, even UTF-8 is fine (the embedded system doesn't need to
> know).

I regularly need to send non UTF-8 characters to/from a PC over some
comms channel. This is precisely why KEY, EMIT and friends need to be
decoupled from XKEY, XEMIT and friends. At present, for certain
classes of PC programming, the assumption that KEY and friends
handle an arbitrary 8-bit unit is as common as char=byte=au.

There are some seriously weird devices at the other ends of
serial and other comms channels. Forth200x has to recognise
this. Many people still do binary transfers over these links
- it's fast and efficient.

Bernd Paysan

Jul 20, 2007, 7:33:04 AM

Stephen Pelc wrote:
>>But that's a shortcoming of the historic systems in question. Consider the
>>same man using Linux or Plan-9 instead. They use UTF-8 as default encoding
>>(Plan-9 since late 1992, Linux for several years now), and they have
>>proper i18n, so it's not a "Chinese version". It's just a system with
>>Chinese users, everything else identical to a European Linux.
>
> You know, the funny thing is that standards have to deal with reality
> and common practice, and Windows is covered by both of those.

Yes, but I'm quite convinced now that it is possible to write a UTF-8 Forth
under Windows, which takes the major headaches out of the way. The
headaches are: Different code-pages (some with multibyte encodings), or
UTF-16, which would require a 16 bit char type, and the need of XCHAR
words, since UTF-16 still is a variable length encoding.

> If there is anything I've proposed that disenfranchises future best
> practice, please let me know.

The three different character sets certainly are not future best practice.
It's not a problem to have a different I/O encoding (which is handled by
translation from internal to IO), but otherwise, a single encoding should
be used.

>> Small embedded systems with their own graphic IO will quite likely
>> use a single 8 bit character set; if you do telnet to a small embedded
>> system, even UTF-8 is fine (the embedded system doesn't need to
>> know).
>
> I regularly need to send non UTF-8 characters to/from a PC over some
> comms channel. This is precisely why KEY, EMIT and friends need to be
> decoupled from XKEY, XEMIT and friends. At present, for certain
> classes of PC programming, the assumption that KEY and friends
> handle an arbitrary 8-bit unit is as common as char=byte=au.

Yes, I'm completely ok with that.

> There are some seriously weird devices at the other ends of
> serial and other comms channels. Forth200x has to recognise
> this. Many people still do binary transfers over these links
> - it's fast and efficient.

Sure.

Anton Ertl

Jul 23, 2007, 7:38:31 AM

Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:

>> Shouldn't the granularity of the size specifications be the same
>> (i.e., either aus or chars) throughout the wordset?
>
>Should be AUs.

Hmm, thinking a little longer about it, all the string words use chars
for their granularity, so the xchars words should use chars, too. Not
that it makes a difference in practice.

Bruce McFarling

Jul 23, 2007, 9:23:26 AM

On Jul 23, 7:38 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Hmm, thinking a little longer about it, all the string words use chars
> for their granularity, so the xchars words should use chars, too. Not
> that it makes a difference in practice.

If XCHAR means eXtended CHAR, and an XCHAR in memory always occupies a
whole number of CHARs (sometimes a variable number), then char-sized
granularity is feasible. It would seem that translating code that works
with UTF-8 so that it works with 16-bit chars and UTF-16 would not be
made any worse by having char granularity.

For the Atlantic economic space, utf-8 is fine ... for much of the
Atlantic economic space, Latin-1 is fine ... and since the main
difference that comes into play between utf-8 and the utf-16's is
storage space inflation, the ability to do on the fly translation
between I/O and the Forth implementation gets most of the way there in
the East Asian Economic Space. IOW, flexibility in file source-encoding
can mostly cover for a little rigidity in the internal character set in
use, for the situations where it is most critical, which is where
there is a need to access existing databases and their character
encoding is entrenched by its own set of existing practices.

With on the fly translation and internal processing in utf-8, the
buffers in memory would be sized in bytes=chars, and how many xchars
you end up with would have to be determined on an ongoing basis ...
whether the original source is utf-8 or not.
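
As a rough illustration of determining the xchar count of a byte-sized buffer, the sketch below counts the characters in a UTF-8 string. It assumes char = byte and well-formed UTF-8, and utf8-count is a name invented here, not a word from the proposal.

```forth
\ Count the xchars in a UTF-8 string by counting every byte that is not
\ a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
\ Assumes well-formed UTF-8 and 8-bit chars.
: utf8-count ( c-addr u -- n )
  0 swap 0 ?DO                       ( c-addr n )
    over I chars + c@ $C0 and $80 <> IF 1+ THEN
  LOOP nip ;

s" abc" utf8-count .   \ prints 3
```

The buffer length stays in chars; the xchar count has to be recomputed whenever the contents change.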

Bruce McFarling

Jul 23, 2007, 9:42:35 AM

On Jul 16, 5:46 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> I think that no word for changing the internal encoding should be
> standardized. Or if you standardize it, it should fail if the new
> internal encoding is not an extension of the old one (i.e.,
> ASCII->Latin-1 ok, ASCII->UTF-8 ok, but Latin-1->UTF-8 fails); since
> this is a one-way street, GET-ENCODING makes little sense.

And I think that this kind of setting should never be made a one-way
street, but I am happy with not having any "set internal encoding" at
all.

... IOW, "even where everything is fluid, we build boats so that we
have a place to stand".

> Otherwise a standard program could contain strings in different,
> incompatible encodings, some of them in system-controlled strings
> (e.g., word names), controlled by a global state variable. This would
> be worse than STATE and BASE. No need to introduce another such
> mistake.

> The primary method should work through OPEN-FILE and CREATE-FILE
> (e.g., by specifying the encoding in the fam). But yes, a word like
> SET-FILE-ENCODING is useful when the program learns about the encoding
> later (e.g., when the encoding is specified at the start of the file).

As I've just noted, SET-FILE-ENCODING is the real standardization hook
for not being boxed in by the implementation defined encoding ... as
long as we can talk to file and other i/o in the encoding it uses,
then it makes much less difference what encoding we use internally.

For proponents of utf-8 uber alles, SET-FILE-ENCODING is important for
reading a file that uses an ASCII header that contains code-page
information ... with essentially all relevant code pages lying within
the UCS16 character set, a 512 byte table gives the information
required to translate any code page to utf-8 on the fly.

And of course, for reading well-formed files in one of the utf-16s,
it's critical, since you would open it in your default utf-16 encoding,
and if the first character reads as FFFE (a byte-swapped FEFF), would
reset the file encoding to the other utf-16 encoding.

GET-FILE-ENCODING / SET-FILE-ENCODING, obviously, have none of the
"trap door" implication of GET-INTERNAL-ENCODING / SET-INTERNAL-
ENCODING.

Bernd Paysan

Jul 23, 2007, 10:59:15 AM

Bruce McFarling wrote:
> And of course, for reading well-formed files in one of the utf-16s,
> it's critical, since you would open it in your default utf-16 encoding,
> and if the first character reads as FFFE (a byte-swapped FEFF), would
> reset the file encoding to the other utf-16 encoding.

Actually, you can use the UTF-16 start mark of a file to jump out of UTF-8
encoding, as well, since both FF FE and FE FF are illegal UTF-8 characters.
So it is possible to open a text file and autodetect the three different
widespread UTF encodings with the first two bytes.
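
A minimal sketch of that two-byte autodetection follows. The enc- constants and the word bom>encoding are hypothetical names, not part of the proposal. It also assumes that a buffer without a UTF-16 mark is UTF-8, which a real system would want to check further.

```forth
\ Hypothetical encoding tokens, invented for this sketch.
0 CONSTANT enc-utf8
1 CONSTANT enc-utf16le
2 CONSTANT enc-utf16be

\ Guess the encoding from the first two bytes of a buffer.
\ FF FE is the byte order mark U+FEFF stored little-endian,
\ FE FF is the same mark stored big-endian.
: bom>encoding ( c-addr u -- enc )
  2 < IF drop enc-utf8 EXIT THEN
  dup c@ 8 lshift swap char+ c@ or   ( b0*256+b1 )
  dup $FFFE = IF drop enc-utf16le EXIT THEN
  $FEFF = IF enc-utf16be EXIT THEN
  enc-utf8 ;
```

Note that a little-endian UTF-32 file also starts with FF FE, so this sketch would report it as UTF-16LE.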

For converting other encodings than latin-1 to Unicode, there's not much
hope to make that easy (latin-1 is the first code-page, so conversion is
straight forward). koi8-r and the cyrillic code page are quite different
(isn't there a collating sequence in Russian? The koi8-r page looks like
there is, the same as the latin ABC/greek alpha-beta-gamma one, with the
extra letters appended behind, but Unicode seemed to ignore that). gb2312,
big5, and the CJK Unicode code pages are very different, as well. Note that
there *is* a collating sequence for Chinese characters, as well (otherwise,
using a dictionary would be hell - with the collating sequence, it's only
heck ;-).

Bruce McFarling

Jul 23, 2007, 11:52:15 AM

On Jul 23, 10:59 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> For converting other encodings than latin-1 to Unicode, there's not much
> hope to make that easy

In what sense? Use the character as an index into a table of UCS16
characters for that code page, convert that character to utf-8, and
you are done. A "code page base" value, with 0 turning off code page
to utf-8 conversion, would be sufficient for triggering translation in
that direction.
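
The table lookup described here might be sketched as below. All names (cp-table, cp!, cp@, ucs2>utf8, cp>utf8) are invented for illustration. The table uses one cell per entry for simplicity rather than packed 16-bit entries, and the encoder only covers code points below $10000 and does not reject surrogates. ucs2>utf8 plays the role that XC!+ would play on a system with UTF-8 xchars.

```forth
\ Hypothetical per-code-page table: entry n holds the Unicode code
\ point for byte value n.
CREATE cp-table  256 cells allot
cp-table 256 cells erase

: cp! ( xc byte -- )  cells cp-table + ! ;
: cp@ ( byte -- xc )  cells cp-table + @ ;

\ The lower half of an ASCII-compatible code page maps to itself.
: ascii>cp ( -- )  128 0 DO I I cp! LOOP ;
ascii>cp

\ Store code point xc (below $10000) as UTF-8 at c-addr and return
\ the address just behind the stored bytes.
: ucs2>utf8 ( xc c-addr -- c-addr' )
  over $80 u< IF  tuck c! char+ EXIT THEN
  over $800 u< IF
    over 6 rshift $C0 or over c! char+
    swap $3F and $80 or over c! char+ EXIT THEN
  over 12 rshift $E0 or over c! char+
  over 6 rshift $3F and $80 or over c! char+
  swap $3F and $80 or over c! char+ ;

\ Translate one code-page byte and append it to a UTF-8 buffer.
: cp>utf8 ( byte c-addr -- c-addr' )  swap cp@ swap ucs2>utf8 ;

\ Example entry: in KOI8-R, byte $C1 is Cyrillic small a, U+0430.
$0430 $C1 cp!
```

Filling the upper 128 entries from the published mapping for a code page is all that is needed per code page; the reverse direction needs a search structure, as noted below.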

It's converting Unicode to other code page encodings that is more
cumbersome. I have no idea whether a search in a table, a b-tree, or
some hash based approach is most efficient in general ... and of
course, there may be different answers for different priorities on
space and speed.

Bruce McFarling

Jul 23, 2007, 12:00:54 PM

On Jul 17, 5:13 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> There are some serious issues with UTF-16 and UTF-32, e.g. endianness.
> Basically, a file or a string can start with a "silent" endianness-switching
> character. This requires a global state of endianness, which is awful when
> you are dealing with more than one string at the same time (and then also
> makes it mandatory for every string to start with the endian marker).

If this is externalized to SET-FILE-ENCODING and GET-FILE-ENCODING,
then there is no need for a mutable global endianness state ... each
file can have its own endianness state. Indeed, even a utf-16 system
could have a fixed endianness and cope with files in the other
endianness in this way.

Under this, the most natural way to expose the implementation encoding
available would be as an environment query.

Bruce McFarling

Jul 23, 2007, 12:19:03 PM

On Jul 20, 7:33 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:

> The three different character sets certainly are not future best practice.
> It's not a problem to have a different I/O encoding (which is handled by
> translation from internal to IO), but otherwise, a single encoding should
> be used.

This would seem to contradict the remark upthread:

> >> Small embedded systems with their own graphic IO will quite likely
> >> use a single 8 bit character set; if you do telnet to a small embedded
> >> system, even UTF-8 is fine (the embedded system doesn't need to
> >> know).

Best practice with no dead weight of an established code base and no
external network, and best practice for a specific situation, can
easily be two different things. DCS, OCS, and ACS is a very useful
framework for analyzing what best practice is in a particular
situation, even when the result in many cases is DCS=OCS=ACS.

A small embedded device that has its own 8-bit character set, and which
can report to a larger system what that character set is, is one
scenario that can involve OCS=ACS for the small embedded device, but
OCS<>ACS and DCS<>ACS for the larger system on the other end of the wire.

And porting code from one system that uses one OCS to another system
that uses another OCS is a scenario that can readily involve DCS!=OCS
on one end, the other, or both.

Bernd Paysan

Jul 23, 2007, 12:10:22 PM

Bruce McFarling wrote:

> On Jul 23, 10:59 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
>> For converting other encodings than latin-1 to Unicode, there's not much
>> hope to make that easy
>
> In what sense? Use the character as an index into a table of UCS16
> characters for that code page, convert that character to utf-8, and
> you are done.

In the sense of a memory-saving way to do it. Yes, you can always have a
full table (or in case of gb2312 perhaps a compressed table), and use that.

> It's converting Unicode to other code page encodings that is more
> cumbersome. I have no idea whether a search in a table, a b-tree, or
> some hash based approach is most efficient in general ... and of
> course, there may be different answers for different priorities on
> space and speed.

Indeed. It might well be that different code pages have
different "optimal" approaches. Converting a Unicode xchar to ISO-Latin-1
doesn't need a table, it's simply

: utf8>latin1 ( xc -- xc ) dup $FF > IF drop [char] ? THEN ;

Cyrillic might need a small table for the cyrillic page itself. Chinese
needs a large table (and then, a table is quite ok, because it's
sufficiently densely populated).

Bernd Paysan

Jul 23, 2007, 12:19:42 PM

Bruce McFarling wrote:

> Under this, the most natural way to expose the implementation encoding
> available would be as an environment query.

Yes, I thought about that, as well. I've been modifying the proposal with
the discussion results as we go on, and removing SET-ENCODING and
GET-ENCODING is part of it. XCHAR-ENCODING is now an environment query,
which returns a string like "UTF-8". There must be some other standard I
can refer to for unambiguous names (e.g. MIME or HTTP RFCs). I suggest
starting from here, and using the preferred MIME name (if there is one,
otherwise the name itself) as unique identifier:

http://www.iana.org/assignments/character-sets

Bruce McFarling

Jul 23, 2007, 2:21:43 PM

On Jul 23, 12:19 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Yes, I thought about that, as well. I've modified the proposal with the
> discussion results as we go on, and removing SET-ENCODING and GET-ENCODING
> is part of it. XCHAR-ENCODING is now an environment query, which returns a
> string like "UTF-8". There must be some other standard where I can refer to
> for unambiguous names (e.g. MIME or HTTP RFCs). I'll suggest to start from
> here, and use the preferred MIME name (if there is one, otherwise the name
> itself) as unique identifier:

MIME is good. MIME encodings do not distinguish big endian and little
endian in standards that are defined in 16-bit or 31-bit integer
spaces, like ucs16, utf-16 or utf-32. This is not a real issue for the
internal character encoding ... in the possible rare case, such as an
8-bit CHAR Forth with UTF-16 XCHARS, it could be specified explicitly
that the internal encoding has the same endianness as the
implementation itself, if only to forestall pointless semantic
quibbling.

However, for SET-FILE-ENCODING, a one-to-one correspondence between
MIME encoding names and encoding tokens would entail some provision
for endianness ... the simplest would be to specify little endian where
not specified in the MIME standard, and provide a word to convert an
encoding token to big endian (or its exact mirror image).

John Doty

Jul 23, 2007, 2:56:54 PM

Bernd Paysan wrote:

> For converting other encodings than latin-1 to Unicode, there's not much
> hope to make that easy (latin-1 is the first code-page, so conversion is
> straight forward). koi8-r and the cyrillic code page are quite different
> (isn't there a collating sequence in Russian? The koi8-r page looks like
> there is, the same as the latin ABC/greek alpha-beta-gamma one, with the
> extra letters appended behind, but Unicode seemed to ignore that). gb2312,
> big5, and the CJK Unicode code pages are very different, as well. Note that
> there *is* a collating sequence for Chinese characters, as well (otherwise,
> using a dictionary would be hell - with the collating sequence, it's only
> heck ;-).

Is there really a collating sequence for Chinese characters? Japanese
certainly doesn't have one for the Chinese characters (kanji) it uses.
Every dictionary maker uses its own order.

To look up an unfamiliar word in my paper dictionaries, the procedure is:

1. Decide where the word starts. This is nontrivial, since there are no
spaces between words and there is no clear distinction between "word"
and "common phrase". Also, common paper dictionaries aren't good for
technical terms. So it's easy to get lost.

2. Look up the first kanji in a kanji dictionary. This needs a whole set
of skills including counting strokes, recognizing which radical it will
be indexed under, distinguishing between similar radicals, and
recognizing changes in style. And sometimes you'll encounter one that
isn't in your dictionary.

3. Guess which pronunciation the kanji has in this word. In Japanese,
pronunciation of kanji shifts with context.

4. Look up the word phonetically in a Japanese-English dictionary. At
least phonetic dictionary order is *almost* standardized.

5. Interpret and iterate as needed.

Not really hell: a lot more fun and useful than a crossword puzzle. But
it does eat up valuable time.

I recall the first time I encountered the word "読み込み". A compound of
gerund forms of two common verbs, but a standard dictionary is not very
helpful ("読み" might mean "insight"). Took me awhile to understand it
means "operand fetch".

Fortunately, these days computers help a lot here: they can index kanji
in multiple ways and match words multiple ways. Digital dictionaries are
much better about listing technical terms than any paper dictionary I've
found. I can now sit in the back of the lecture hall with my laptop and
look up unfamiliar words from the slides during a talk.

But for LSE64 I duck all these encoding issues by using mbtowc() and
friends. That keeps LSE64 consistent with other software in my world,
although basically I stick with ASCII anyway. And I have much better
ways to spend my time than inventing another approach here.

To see how this paints over a very serious mess, check out:

http://examples.oreilly.com/cjkvinfo/doc/cjk.inf

It doesn't seem to me from reading this that there is any common
standard collating sequence for Chinese characters. Various ones, some
partially correlated, but still different in detail. Even tables can't
really work all the time: the distinction between character identity and
style is blurry.

--
John Doty, Noqsi Aerospace, Ltd.
http://www.noqsi.com/
--
Specialization is for robots.

Bernd Paysan

Jul 23, 2007, 4:55:02 PM

John Doty wrote:
> Is there really a collating sequence for Chinese characters?

Yes, actually, there are several (but for all practical purposes,
dictionaries use a single one for simplified Chinese, and all the others
are for traditional Chinese only). The main problem is how many glyphs you
include in your collating sequence, and debates about how to write a
particular glyph (see the revisions of the GB tables, where some glyphs
were moved around for using a simplified radical).

> Japanese
> certainly doesn't have one for the Chinese characters (kanji) it uses.
> Every dictionary maker uses its own order.

That's mostly past in China; maybe you find other sorting orders in Taiwan.

There's a second sorting order, that's by pinyin. You always need both
sorting orders in a dictionary, since you either read a glyph (then you go
through the radical/stroke order), or you hear it, then you go through
pinyin. Dictionaries often use pinyin as their primary sorting order, and
the glyph order as secondary, with an indirection (table driven).

> To look up an unfamiliar word in my paper dictionaries, the procedure is:
>
> 1. Decide where the word starts. This is nontrivial, since there are no
> spaces between words and there is no clear distinction between "word"
> and "common phrase". Also, common paper dictionaries aren't good for
> technical terms. So it's easy to get lost.

This one is much easier in Chinese, because of its completely different
grammar. People sometimes have problems where sentences start (that's why
they use punctuation marks now), but words are dead easy.

> 2. Look up the first kanji in a kanji dictionary. This needs a whole set
> of skills including counting strokes, recognizing which radical it will
> be indexed under, distinguishing between similar radicals, and
> recognizing changes in style.

Sounds remarkably similar to the Chinese system, apart from the "each
dictionary maker uses his own order". You need to be sufficiently skilled
in the art of calligraphy to know how a glyph is written.

> And sometimes you'll encounter one that
> isn't in your dictionary.

Yes, that happens. Despite all the efforts to standardize the Chinese
written language over the past 2200 years, the number of glyphs around
still seems to be unbounded. Usually, when you find such a glyph, asking
some native speaker also won't help - they don't know more than the
dictionary.

> 3. Guess which pronunciation the kanji has in this word. In Japanese,
> pronunciation of kanji shifts with context.

Fortunately, pronunciation only shifts with dialect in Chinese.

> 4. Look up the word phonetically in a Japanese-English dictionary. At
> least phonetic dictionary order is *almost* standardized.
>
> 5. Interpret and iterate as needed.
>
> Not really hell: a lot more fun and useful than a crossword puzzle. But
> it does eat up valuable time.

Indeed.

> I recall the first time I encountered the word "読み込み". A compound of
> gerund forms of two common verbs, but a standard dictionary is not very
> helpful ("読み" might mean "insight"). Took me awhile to understand it
> means "operand fetch".

From my Chinese knowledge, it seems to be easier to decipher. The first sign
("du") means "read" (and it's not part of my dictionary, since the usual
way to say "read" is "see book" (kan shu), so I found it with gucharmap),
and all the rest is Japanese. The usual way to decipher Japanese from
Chinese is to discard all the Japanese stuff, and guess from the ambiguous
meaning the remaining words have.

> To see how this paints over a very serious mess, check out:
>
> http://examples.oreilly.com/cjkvinfo/doc/cjk.inf
>
> It doesn't seem to me from reading this that there is any common
> standard collating sequence for Chinese characters. Various ones, some
> partially correlated, but still different in detail. Even tables can't
> really work all the time: the distinction between character identity and
> style is blurry.

Yes, it's not easy. One particular problem is that the glyph space is so
large, and if you combine them all, you end up with a lot more glyphs than
you expect. Unicode is a very typical example: Here's a set of codepages
with lots of CJK glyphs. And here's another one, discontiguous with the
previous, containing glyphs we forgot last time. And oops, we made a
mistake, this glyph really should be written like that, and that means it
ends up somewhere completely different in the sorting order ;-).

John Doty

Jul 23, 2007, 7:04:32 PM

Very interesting!

Bernd Paysan wrote:
> John Doty wrote:
> ...

>> 2. Look up the first kanji in a kanji dictionary. This needs a whole set
>> of skills including counting strokes, recognizing which radical it will
>> be indexed under, distinguishing between similar radicals, and
>> recognizing changes in style.
>
> Sounds remarkably similar to the Chinese system, apart from the "each
> dictionary maker uses his own order". You need to be sufficiently skilled
> in the art of calligraphy to know how a glyph is written.

I wonder what the future holds here. Computers are destroying
calligraphic knowledge: some of my Japanese friends tell me that after
years of computer use, they cannot write Japanese by hand anymore.
Calligraphy is irrelevant to typing Japanese: you type phonetically and
then coax the computer into substituting appropriate kanji as needed. On
my American keyboard, I type "yomikomi", what I first see is "よみこみ"
(on a Japanese keyboard I'd type this directly) and then I hit the space
bar to get "読み込み", in this case what I wanted. Hitting the space bar
again would have given me a menu of alternatives. This is using Apple's
"Kotoeri" input method, but others I've seen are similar.

> Fortunately, pronunciation only shifts with dialect in Chinese.

In Japanese it's really difficult. Once I had a transcript of a meeting
that I hadn't attended and I really needed to know about a certain
decision they had made. I showed it to a native Japanese colleague. He
pointed to a character and said "If I knew which pronunciation the
speaker used here, I could tell whether the decision was yes or no. But
from this I can't tell"!

> ...

>
>> I recall the first time I encountered the word "読み込み". A compound of
>> gerund forms of two common verbs, but a standard dictionary is not very
>> helpful ("読み" might mean "insight"). Took me awhile to understand it
>> means "operand fetch".
>
> From my Chinese knowledge, it seems to be easier to decipher. The first sign
> ("du") means "read" (and it's not part of my dictionary, since the usual
> way to say "read" is "see book" (kan shu), so I found it with gucharmap),
> and all the rest is Japanese. The usual way to decipher Japanese from
> Chinese is to discard all the Japanese stuff, and guess from the ambiguous
> meaning the remaining words have.

Well, the hiragana part is largely particles and inflections, so you use
it to guess at sentence structure, which helps you find the words in the
first place. And gerunds are tricky: lots of meanings you can attach to
"reading", and in non-technical Japanese it seems to tend more toward
understanding and interpretation than the mechanical act of reading.

But the way I actually figured it out was pretty similar to the way you
describe: I knew 読 was the root of the verb "to read" (when you're
learning to read, it's one of the first you encounter). And 込 consists
of two radicals, one meaning "carry" and the other "entrance". In
context it was preceded by the name of a variable (this was in a box in
a flowchart). Japanese is very postfixy, like Forth. So, "read and carry
in". Light comes on ;-)

But I've resisted making "読み込み" a synonym for "@" in LSE64 ;-)

sl...@jedit.org

Jul 23, 2007, 8:09:35 PM

On Jul 23, 7:04 pm, John Doty <j...@whispertel.LoseTheH.net> wrote:
> But I've resisted making "読み込み" a synonym for "@" in LSE64 ;-)

Does LSE64 have full Unicode support?

Slava

John Doty

Jul 23, 2007, 9:51:24 PM

It uses C99 wide characters encoded in 64 bit cells (using mbtowc() et
al. and copying) so if you have your locale set appropriately, yes.
Pretty trivial from my point of view. Of course what's lurking behind
the C99 façade isn't trivial...

Bernd Paysan

Jul 28, 2007, 6:04:23 PM

Bernd Paysan wrote:
> Yes, but I'm quite convinced now that it is possible to write a UTF-8
> Forth under Windows, which takes the major headaches out of the way.

Here's a report of my work on bigFORTH for Windows with UTF-8: Fortunately,
almost everything was prepared (by bigFORTH on Linux), so there was little
to change; mainly the draw method of the font object, and the selection IO
stuff.

The minor things to be done were:

* Choose different fonts with more characters in
* Fix a bug in the font chooser ;-)
* Declare MultiByteToWideChar and WideCharToMultiByte
* Words for >utf16 and utf16> using a scratch pad.
* Changing the used fonts to Courier New and Arial, which contain at least
some unicode characters.

By default, European Windows doesn't have Asian language support installed,
but you can install that stuff fairly simply (you just have to go to the
language settings, and find a checkbox to select there). This gives you
fonts with CJK characters and such.

The next problem is input methods. I can select different keyboards with
the hotkey, but didn't get much further (Russian, Greek and similar work
out of the box, Chinese doesn't). Apparently, the input method editor
requires some special interfacing, and I haven't yet found out how it is
supposed to work. Any pointer in that direction is highly welcome.

Bernd Paysan

Jul 29, 2007, 4:19:35 PM

Bernd Paysan wrote:
> Apparently, the input method editor
> requires some special interfacing, which I haven't found out yet how it is
> supposed to work. Any pointer in that direction is highly welcome.

Found it - at least the most important part - the DefWindowProc comes in A
and W flavor, and I certainly have to use the W flavor.

Still annoying: Courier new isn't as "fixed" as I want it (characters 1 or 2
ASCII characters wide).

And even more annoying: The input method editor reacts only when I try to
move my bigFORTH window (there's something similar with the appearance of
the task bar icon - this is significantly delayed unless you move the
bigFORTH window). This doesn't happen with other applications, so I
probably found another bug in Windows ;-). Any hint how to work around this
one would be welcome.

Bernd Paysan

Jul 29, 2007, 5:23:21 PM

Bernd Paysan wrote:

> Bernd Paysan wrote:
>> Apparently, the input method editor
>> requires some special interfacing, which I haven't found out yet how it
>> is supposed to work. Any pointer in that direction is highly welcome.
>
> Found it - at least the most important part - the DefWindowProc comes in A
> and W flavor, and I certainly have to use the W flavor.
>
> Still annoying: Courier new isn't as "fixed" as I want it (characters 1 or
> 2 ASCII characters wide).
>
> And even more annoying: The input method editor reacts only when I try to
> move my bigFORTH window (there's something similar with the appearance of
> the task bar icon - this is significantly delayed unless you move the
> bigFORTH window). This doesn't happen with other applications, so I
> probably found another bug in Windows ;-). Any hint how to work around
> this one would be welcome.

For anybody who wants to try a UTF-8 Forth under Windows, I've put a
current snapshot under http://www.jwdt.com/~paysan/bigforth-2.1.8.exe. So
far, only the GUI and the file content encoding is UTF-8; file names are
still ANSI only (or rather ASCII only). This is a proof of concept
that an internal UTF-8 encoding is not a major obstacle for a Windows Forth
(with the XCHAR set implemented, for sure). Well, at least with MINOS as
GUI - if you use Windows directly, you'll probably have to clutter your
application with >UTF16 and UTF16>.

Alex McDonald

Jul 30, 2007, 5:30:00 AM

On Jul 29, 10:23 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Bernd Paysan wrote:
> > Bernd Paysan wrote:
> >> Apparently, the input method editor
> >> requires some special interfacing, which I haven't found out yet how it
> >> is supposed to work. Any pointer in that direction is highly welcome.
>
> > Found it - at least the most important part - the DefWindowProc comes in A
> > and W flavor, and I certainly have to use the W flavor.
>
> > Still annoying: Courier new isn't as "fixed" as I want it (characters 1 or
> > 2 ASCII characters wide).
>
> > And even more annoying: The input method editor reacts only when I try to
> > move my bigFORTH window (there's something similar with the appearance of
> > the task bar icon - this is significantly delayed unless you move the
> > bigFORTH window). This doesn't happen with other applications, so I
> > probably found another bug in Windows ;-). Any hint how to work around
> > this one would be welcome.

I'll have a dig around the Windows doc for you; it's probably a
message you need to respond to from DefWindowProc when using IME.

>
> For anybody who wants to try a UTF-8 Forth under Windows, I've put a
> current snapshot under http://www.jwdt.com/~paysan/bigforth-2.1.8.exe. So
> far, only the GUI and the file content encoding is UTF-8, file names are
> still ANSI only (or rather ASCII only). This is a proof of the concept,
> that an internal UTF-8 encoding is not a major obstacle for a Windows Forth
> (with the XCHAR set implemented, for sure). Well, at least with MINOS as
> GUI - if you use Windows directly, you'll probably have to clutter your
> application with >UTF16 and UTF16>.
>
> --
> Bernd Paysan
> "If you want it done right, you have to do it yourself"
> http://www.jwdt.com/~paysan/

Good work; it's encouraging for us Win32Forth types. I'll certainly
test it out as far as I can over the next few days.

--
Regards
Alex McDonald

Bernd Paysan

Jul 30, 2007, 8:08:27 AM

Alex McDonald wrote:
>> > And even more annoying: The input method editor reacts only when I try
>> > to move my bigFORTH window (there's something similar with the
>> > appearance of the task bar icon - this is significantly delayed unless
>> > you move the bigFORTH window). This doesn't happen with other
>> > applications, so I probably found another bug in Windows ;-). Any hint
>> > how to work around this one would be welcome.
>
> I'll have a dig around the Windows doc for you; it's probably a
> message you need to respond to from DefWindowProc when using IME.

I forward all callbacks I don't handle myself to DefWindowProc. I've put a
~~ in front of my callback handler (do-callback) to see what kind of
message I get - I don't get any (they arrive only when I start moving the
window). Note that there are only a few seconds between clicking on
the IME stuff and moving the window - if you are too slow, the IME ignores
the request.

The event handler in bigFORTH is not entirely trivial. There's an idle loop
which uses SetTimer and WaitMessage. It also removes all timer messages
with PeekMessage (those messages won't be translated and dispatched). The
main event loop takes out messages with PeekMessage, then translates and
dispatches them. Changing that (i.e. translating and dispatching the timer
events) didn't help.

Peter Fälth

Aug 13, 2007, 9:21:59 AM

On Jul 14, 9:56 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Since it's time to post RfDs, I want to throw in the updated proposal for
> the XCHAR wordset. I hope I have included all comments so far, and I also
> included a reference implementation.
>

I would like to give my input. I implemented the xchar proposal when it
was first discussed and now have more than one year of experience with
both Linux and Windows implementations.

My implementation uses utf8 as the xchar encoding. This has the benefit
that all ANS Forth programs load and work as expected without any
problems. Programs using an 8-bit encoding will need to be converted for
the strings to display correctly. For me this is only one file, the one
containing my name!

On Windows, 15 system calls had to be modified to use the wide (W)
version of the call, with an s>wz inserted to convert the string from
utf8 to ucs2.

On Linux no changes other than using a utf8 locale were needed.

The main work was to get the command-line editing to work as I wanted.
A variable-size character needs more thinking; the trick is to work
with strings rather than with individual characters.

I have not implemented xemit and xkey but instead modified emit and key
(on Linux they did not even need to be modified). char, [char], parse,
word and parse-word have also been modified. Letting key and emit handle
xchars does not prevent me from sending non-utf8 control chars to them.

I would also suggest stating in the RfD that when the xchar wordset is
loaded, the Forth system uses utf8 strings for input and output. I think
this would simplify things and avoid a lot of different implementations.
Note that this does not prevent a system from using, for example, ucs2
internally. I looked at doing this for my Windows version but in the end
decided not to, to keep it closer to the Linux version.

Contact me if you want to test my Forth (ntf/lxf)

regards
Peter Fälth

I have made additional specific comments in the rfd below

> Problem:
>
> ASCII is only appropriate for the English language. Most western
> languages however fit somewhat into the Forth frame, since a byte is
> sufficient to encode the few special characters in each (though not
> always the same encoding can be used; latin-1 is most widely used,
> though). For other languages, different char-sets have to be used,
> several of them variable-width. Most prominent representant is
> UTF-8. Let's call these extended characters XCHARs. Since ANS Forth
> specifies ASCII encoding, only ASCII-compatible encodings may be
> used. Furtunately, being ASCII compatible has so many benefits that
> most encodings actually are ASCII compatible.
>
> Proposal
>
> Datatypes:
>
> xc is an extended char on the stack. It occupies one cell, and is a
> subset of unsigned cell. Note: UTF-8 can not store more that
> 31 bits; on 16 bit systems, only the UCS16 subset of the UTF-8
> character set can be used. Small embedded systems can keep
> xchars always in memory, because all words directly dealing with
> the xc datatype are in the XCHAR EXT wordset.
>
> xc_addr is the address of an XCHAR in memory. Alignment requirements are
> the same as c_addr. The memory representation of an XCHAR differs
> from the stack location, and depends on the encoding used. An XCHAR
> may use a variable number of address units in memory.
>
> encoding cell-sized opaque data type identifying a particular encoding.
>
> Common encodings:
>
> Input and files commonly are either encoded iso-latin-1 or utf-8. The
> encoding depends on settings of the computer system such as the LANG
> environment variable on Unix. You can use the system consistently only
> when you don't change the encoding, or only use the ASCII
> subset. Typical use is that the base system is ASCII only, and then
> extended encoding-specific.
>
> Side issues to be considered:
>
> Many Forth systems today are case insensitive, to accept lower case
> standard words. It is sufficient to be case insensitive for the ASCII
> subset to make this work - this saves a large code mapping table for
> comparison of other symbols. Case is mostly an issue of European
> languages (latin, greek, and cyrillic), but similar issues exist
> between traditional and simplified Chinese, and between different
> Latin code pages in UCS, e.g. full width vs. normal half width latin
> letters. Some encodings (not UTF-8) might give surprises when you use
> a case insensitive ASCII-compare that's 8-bit save, but not aware of
> the current encoding.
>
> Words:
>
> XC-SIZE ( xc -- u ) XCHAR EXT
> Computes the memory size of the XCHAR xc in address units.
>
> X-SIZE ( xc_addr u1 -- u2 ) XCHAR
> Computes the memory size of the first XCHAR stored at xc_addr in
> address units.

Why does it need a string as argument? Only the xcaddr is needed.

I think this also needs a better name so that it is not confused with
xc-size above. I use xcs ( xcaddr -- u ), but that is hardly better!
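
A minimal sketch of such a single-argument word for UTF-8, for
illustration only. This is not the code from ntf/lxf; it assumes one
character per address unit, looks only at the lead byte, and ignores
sequences longer than four bytes.

```forth
decimal

\ Sketch: size in bytes of the UTF-8 sequence whose lead byte is at xcaddr.
\ Assumption: 1 char = 1 address unit; no validation of the trailing bytes.
: xcs ( xcaddr -- u )
  c@
  dup 128 u< IF drop 1 EXIT THEN   \ 0xxxxxxx  plain ASCII
  dup 192 u< IF drop 1 EXIT THEN   \ 10xxxxxx  stray continuation byte
  dup 224 u< IF drop 2 EXIT THEN   \ 110xxxxx  two-byte sequence
  dup 240 u< IF drop 3 EXIT THEN   \ 1110xxxx  three-byte sequence
  drop 4 ;                         \ 11110xxx  four-byte sequence
```

Treating a stray continuation byte as size 1 mirrors what the reference
x-size does for malformed input, so a scan over bad data still advances.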

>
> XC@+ ( xc_addr1 -- xc_addr2 xc ) XCHAR EXT
> Fetchs the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
> location after xc.
>
> XC!+ ( xc xc_addr1 -- xc_addr2 ) XCHAR EXT
> Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
> location after xc.
>
> XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
> Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
> characters large. xc_addr2 points to the first memory location after
> xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
> into the buffer, flag is true, otherwise flag is false, and xc_addr2
> u2 equal xc_addr1 u1. XC!+? is save for buffer overflows, and
> therefore preferred over XC!+.

I use xc!-step ( xc xc_addr1 u1 -- xc_addr2 u2 ) with similar effect:
u2 is zero when the operation did not succeed or the end of the buffer
is reached. I also have an xc@-step. I am considering introducing the
flag, as that gives some more information and avoids a dup in most
uses.

>
> XCHAR+ ( xc_addr1 -- xc_addr2 ) XCHAR EXT
> Adds the size of the XCHAR stored at xc_addr1 to this address, giving
> xc_addr2.
>
> XCHAR- ( xc_addr1 -- xc_addr2 ) XCHAR EXT
> Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
> XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
> work for every possible encoding.
>
> XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
> Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
> u2 is the remaining buffer after stepping over the first XCHAR in the
> buffer.
>
> -XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
> Step backward by one xchar in the buffer defined by xcaddr1 u1,
> starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
> after stepping backward over the last XCHAR in the buffer. Unlike
> XCHAR-, -XSTRING can be implemented in encodings that have only a
> forward-working string size.

This naming is confusing. In the implementation you show 4 versions,
some of them not correctly implemented according to the description. I use
: x/string ( xcaddr u -- xcaddr1 u1 ) over x-size /string ;
Do we need more?

> -TRAILING-GARBAGE ( xcaddr u1 -- xcaddr u2 ) XCHAR
> Examine the last XCHAR in the buffer xcaddr u1 - if the encoding is
> correct and it repesents a full character, u2 equals u1, otherwise, u2
> represents the string without the last (garbled) XCHAR.

I have not yet needed this

>
> X-WIDTH ( xc_addr u -- n ) XCHAR
> n is the number of monospace ASCII characters that take the same space to
> display as the the XCHAR string starting at xc_addr, using u address units.

I name this xc-width. I also have xc-length, which counts the number of
xchars in a string.
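
For UTF-8, counting xchars amounts to counting the bytes that are not
continuation bytes. The sketch below is an assumed implementation of
such an xc-length, not the one from ntf/lxf; it assumes one character
per address unit and does not validate the input.

```forth
decimal

\ Sketch: number of xchars in a UTF-8 string.
\ Every byte that is not of the form 10xxxxxx starts a new xchar.
: xc-length ( xcaddr u -- n )
  0 rot rot                        ( 0 xcaddr u )
  over + swap ?DO                  ( n ) loop from xcaddr to xcaddr+u
    I c@ 192 and 128 <> IF 1+ THEN \ count lead bytes and ASCII
  LOOP ;
```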
>
> XKEY ( -- xc ) XCHAR EXT
> Reads an XCHAR from the terminal.
>
> XEMIT ( xc -- ) XCHAR EXT
> Prints an XCHAR on the terminal.

This functionality is implemented in key and emit. You can still send
and receive whatever the device allows.

>
> SET-ENCODING ( encoding -- ) XCHAR EXT
> Sets the input encoding to the specified encoding
>
> GET-ENCODING ( -- encoding ) XCHAR EXT
> Returns the current encoding.

The only encoding allowed in my system is utf8, so these are not needed.

>
> Encodings are implementation specific, example encoding names can be
>
> ISO-LATIN-1 ( -- encoding ) XCHAR EXT
> ISO Latin1 encoding (one byte per character)
>
> UTF-8 ( -- encoding ) XCHAR EXT
> UTF-8 encoding (UCS codepage, byte-oriented variable length encoding)
>
> The following words behave different when the XCHAR extension is present:
>
> CHAR ( "<spaces>name" -- xc )
> Skip leading space delimiters. Parse name delimited by a space. Put the
> value of its first XCHAR onto the stack.
>
> [CHAR]
> Interpretation: Interpretation semantics for this word are undefined.
> Compilation: ( "<spaces>name" -- )
> Skip leading space delimiters. Parse name delimited by a space. Append the
> run-time semantics given below to the current definition.
> Run-time: ( -- xc )
> Place xc, the value of the first XCHAR of name, on the stack.

parse, parse-word and word are also modified accordingly.

>
> Reference implementation:
>
> -------------------------xchar.fs----------------------------
> \ xchar reference implementation: UTF-8 (and ISO-LATIN-1)
>
> \ environmental dependency: characters are stored as bytes
> \ environmental dependency: lower case words accepted
>
> base @ hex
>
> 80 Value maxascii
>
> : xc-size ( xc -- n )
> dup maxascii u< IF drop 1 EXIT THEN \ special case ASCII
> $800 2 >r
> BEGIN 2dup u>= WHILE 5 lshift r> 1+ >r dup 0= UNTIL THEN
> 2drop r> ;
>
> : xc@+ ( xcaddr -- xcaddr' u )
> count dup maxascii u< IF EXIT THEN \ special case ASCII
> 7F and 40 >r
> BEGIN dup r@ and WHILE r@ xor
> 6 lshift r> 5 lshift >r >r count
> 3F and r> or
> REPEAT r> drop ;
>
> : xc!+ ( xc xcaddr -- xcaddr' )
> over maxascii u< IF tuck c! char+ EXIT THEN \ special case ASCII
> >r 0 swap 3F
> BEGIN 2dup u> WHILE
> 2/ >r dup 3F and 80 or swap 6 rshift r>
> REPEAT 7F xor 2* or r>
> BEGIN over 80 u< 0= WHILE tuck c! char+ REPEAT nip ;
>
> : xc!+? ( xc xcaddr u -- xcaddr' u' flag )
> >r over xc-size r@ over u< IF ( xc xc-addr1 len r: u1 )
> \ not enough space
> drop nip r> false
> ELSE
> >r xc!+ r> r> swap - true
> THEN ;
>
> \ scan to next/previous character
>
> : xchar+ ( xcaddr -- xcaddr' ) xc@+ drop ;
> : xchar- ( xcaddr -- xcaddr' )
> BEGIN 1 chars - dup c@ C0 and maxascii <> UNTIL ;
>
> : xstring+ ( xcaddr u -- xcaddr u' )
> over + xchar+ over - ;

This adds an xchar after the buffer. How can we know that there is one?

> : xstring- ( xcaddr u -- xcaddr u' )
> over + xchar- over - ;
>
> : +xstring ( xc-addr1 u1 -- xc-addr2 u2 )
> over dup xchar+ swap - /string ;
> : -xstring ( xc-addr1 u1 -- xc-addr2 u2 )
> over dup xchar- swap - /string ;

This adds an xchar in front of the buffer.
How can we know that there is one?

>
> \ skip trailing garbage
>
> : x-size ( xcaddr u1 -- u2 ) drop
> \ length of UTF-8 char starting at u8-addr (accesses only u8-addr)
> c@
> dup $80 u< IF drop 1 exit THEN
> dup $c0 u< IF drop 1 EXIT THEN \ really is a malformed character
> dup $e0 u< IF drop 2 exit THEN
> dup $f0 u< IF drop 3 exit THEN
> dup $f8 u< IF drop 4 exit THEN
> dup $fc u< IF drop 5 exit THEN
> dup $fe u< IF drop 6 exit THEN
> drop 1 ; \ also malformed character
>
> : -trailing-garbage ( xcaddr u1 -- xcaddr u2 )
> 2dup + dup xchar- ( addr u1 end1 end2 )
> 2dup dup over over - x-size + = IF \ last character ok
> 2drop
> ELSE
> nip nip over -
> THEN ;
>
> \ utf key and emit
>
> : xkey ( -- xc )
> key dup maxascii u< IF EXIT THEN \ special case ASCII
> 7F and 40 >r
> BEGIN dup r@ and WHILE r@ xor
> 6 lshift r> 5 lshift >r >r key
> 3F and r> or
> REPEAT r> drop ;
>
> : xemit ( xc -- )
> dup maxascii u< IF emit EXIT THEN \ special case ASCII
> 0 swap 3F
> BEGIN 2dup u> WHILE
> 2/ >r dup 3F and 80 or swap 6 rshift r>
> REPEAT 7F xor 2* or
> BEGIN dup 80 u< 0= WHILE emit REPEAT drop ;
>
> \ utf size
>
> \ uses wcwidth ( xc -- n )
>
> : wc, ( n low high -- ) 1+ , , , ;
>
> Create wc-table \ derived from wcwidth source code, for UCS32
> 0 0300 0357 wc,
> 0 035D 036F wc,
> 0 0483 0486 wc,
> 0 0488 0489 wc,
> 0 0591 05A1 wc,
> 0 05A3 05B9 wc,
> 0 05BB 05BD wc,
> 0 05BF 05BF wc,
> 0 05C1 05C2 wc,
> 0 05C4 05C4 wc,
> 0 0600 0603 wc,
> 0 0610 0615 wc,
> 0 064B 0658 wc,
> 0 0670 0670 wc,
> 0 06D6 06E4 wc,
> 0 06E7 06E8 wc,
> 0 06EA 06ED wc,
> 0 070F 070F wc,
> 0 0711 0711 wc,
> 0 0730 074A wc,
> 0 07A6 07B0 wc,
> 0 0901 0902 wc,
> 0 093C 093C wc,
> 0 0941 0948 wc,
> 0 094D 094D wc,
> 0 0951 0954 wc,
> 0 0962 0963 wc,
> 0 0981 0981 wc,
> 0 09BC 09BC wc,
> 0 09C1 09C4 wc,
> 0 09CD 09CD wc,
> 0 09E2 09E3 wc,
> 0 0A01 0A02 wc,
> 0 0A3C 0A3C wc,
> 0 0A41 0A42 wc,
> 0 0A47 0A48 wc,
> 0 0A4B 0A4D wc,
> 0 0A70 0A71 wc,
> 0 0A81 0A82 wc,
> 0 0ABC 0ABC wc,
> 0 0AC1 0AC5 wc,
> 0 0AC7 0AC8 wc,
> 0 0ACD 0ACD wc,
> 0 0AE2 0AE3 wc,
> 0 0B01 0B01 wc,
> 0 0B3C 0B3C wc,
> 0 0B3F 0B3F wc,
> 0 0B41 0B43 wc,
> 0 0B4D 0B4D wc,
> 0 0B56 0B56 wc,
> 0 0B82 0B82 wc,
> 0 0BC0 0BC0 wc,
> 0 0BCD 0BCD wc,
> 0 0C3E 0C40 wc,
> 0 0C46 0C48 wc,
> 0 0C4A 0C4D wc,
> 0 0C55 0C56 wc,
> 0 0CBC 0CBC wc,
> 0 0CBF 0CBF wc,
> 0 0CC6 0CC6 wc,
> 0 0CCC 0CCD wc,
> 0 0D41 0D43 wc,
> 0 0D4D 0D4D wc,
> 0 0DCA 0DCA wc,
> 0 0DD2 0DD4 wc,
> 0 0DD6 0DD6 wc,
> 0 0E31 0E31 wc,
> 0 0E34 0E3A wc,
> 0 0E47 0E4E wc,
> 0 0EB1 0EB1 wc,
> 0 0EB4 0EB9 wc,
> 0 0EBB 0EBC wc,
> 0 0EC8 0ECD wc,
> 0 0F18 0F19 wc,
> 0 0F35 0F35 wc,
> 0 0F37 0F37 wc,
> 0 0F39 0F39 wc,
> 0 0F71 0F7E wc,
> 0 0F80 0F84 wc,
> 0 0F86 0F87 wc,
> 0 0F90 0F97 wc,
> 0 0F99 0FBC wc,
> 0 0FC6 0FC6 wc,
> 0 102D 1030 wc,
> 0 1032 1032 wc,
> 0 1036 1037 wc,
> 0 1039 1039 wc,
> 0 1058 1059 wc,
> 1 0000 1100 wc,
> 2 1100 115f wc,
> 0 1160 11FF wc,
> 0 1712 1714 wc,
> 0 1732 1734 wc,
> 0 1752 1753 wc,
> 0 1772 1773 wc,
> 0 17B4 17B5 wc,
> 0 17B7 17BD wc,
> 0 17C6 17C6 wc,
> 0 17C9 17D3 wc,
> 0 17DD 17DD wc,
> 0 180B 180D wc,
> 0 18A9 18A9 wc,
> 0 1920 1922 wc,
> 0 1927 1928 wc,
> 0 1932 1932 wc,
> 0 1939 193B wc,
> 0 200B 200F wc,
> 0 202A 202E wc,
> 0 2060 2063 wc,
> 0 206A 206F wc,
> 0 20D0 20EA wc,
> 2 2329 232A wc,
> 0 302A 302F wc,
> 2 2E80 303E wc,
> 0 3099 309A wc,
> 2 3040 A4CF wc,
> 2 AC00 D7A3 wc,
> 2 F900 FAFF wc,
> 0 FB1E FB1E wc,
> 0 FE00 FE0F wc,
> 0 FE20 FE23 wc,
> 2 FE30 FE6F wc,
> 0 FEFF FEFF wc,
> 2 FF00 FF60 wc,
> 2 FFE0 FFE6 wc,
> 0 FFF9 FFFB wc,
> 0 1D167 1D169 wc,
> 0 1D173 1D182 wc,
> 0 1D185 1D18B wc,
> 0 1D1AA 1D1AD wc,
> 2 20000 2FFFD wc,
> 2 30000 3FFFD wc,
> 0 E0001 E0001 wc,
> 0 E0020 E007F wc,
> 0 E0100 E01EF wc,
> here wc-table - Constant #wc-table
>
> \ inefficient table walk:
>
> : wcwidth ( xc -- n )
> wc-table #wc-table over + swap ?DO
> dup I 2@ within IF I 2 cells + @ UNLOOP EXIT THEN
> 3 cells +LOOP 1 ;
>
> : x-width ( xcaddr u -- n )
> 0 rot rot over + swap ?DO
> I xc@+ swap >r wcwidth +
> r> I - +LOOP ;
>
> : char ( "name" -- xc ) bl word count drop xc@+ nip ;
> : [char] ( "name" -- rt:xc ) char postpone Literal ; immediate
>
> \ switching encoding is only recommended at startup
> \ only two encodings are supported: UTF-8 and ISO-LATIN-1
>
> 80 Constant utf-8
> 100 Constant iso-latin-1
>
> : set-encoding to maxascii ;
> : get-encoding maxascii ;
>
> base !
> -------------------------xchar.fs----------------------------
>
> Experience:
>
> Build into Gforth (development version) and recent versions of bigFORTH.
> Open issues are file reading and writing (conversion on the fly or leave as
> it is?).

Bernd Paysan

Aug 13, 2007, 12:17:43 PM

Peter Fälth wrote:

>> Words:
>>
>> XC-SIZE ( xc -- u ) XCHAR EXT
>> Computes the memory size of the XCHAR xc in address units.
>>
>> X-SIZE ( xc_addr u1 -- u2 ) XCHAR
>> Computes the memory size of the first XCHAR stored at xc_addr in
>> address units.
>
> Why does it need to have a string as argument only the xcaddr is
> needed?

X-SIZE can be at most u1 (i.e. this avoids overrunning the buffer if the
encoding is potentially endless). You are probably OK if you do

: x-size ( xc_addr u1 -- u2 ) >r xcs r> umin ;

But since read access over buffer boundaries is much less harmful than write
access, XCS would be enough (whatever we call it).

>> XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
>> Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
>> characters large. xc_addr2 points to the first memory location after
>> xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
>> into the buffer, flag is true, otherwise flag is false, and xc_addr2
>> u2 equal xc_addr1 u1. XC!+? is save for buffer overflows, and
>> therefore preferred over XC!+.
>
> I use xc!-step ( xc xc_addr1 u1 -- xc_addr2 u2 ) with similar effect
> u2 is zero when the operation did not succeed or the end of the buffer
> is reached. I also have a xc@-step. I am considering introducing the
> flag as that gives some more information and avoids a dup in most
> uses.

XC@+? would have four return values - bad style.

>> -XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
>> Step backward by one xchar in the buffer defined by xcaddr1 u1,
>> starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
>> after stepping backward over the last XCHAR in the buffer. Unlike
>> XCHAR-, -XSTRING can be implemented in encodings that have only a
>> forward-working string size.
>
> This nameing is confusing. In the implementation you show 4 versions,
> some
> of then not correctly implemented according to the description. I use
> : x/string ( xcaddr u -- xcaddr1 u1 ) over x-size /string ;
> do we need more?

I'm thinking about changing the name of the string manipulator to +X/STRING
(with a -X/STRING moving backward one char). However, there needs to be a
way to step back one character, and since some encodings can do that only
by forward scanning through the buffer, the string format is necessary (and
it can only step back at the end of the string).

>> -TRAILING-GARBAGE ( xcaddr u1 -- xcaddr u2 ) XCHAR
>> Examine the last XCHAR in the buffer xcaddr u1 - if the encoding is
>> correct and it repesents a full character, u2 equals u1, otherwise, u2
>> represents the string without the last (garbled) XCHAR.
>
> I have not yet needed this

It's good if you don't want to confuse a UTF-8 terminal.

>> X-WIDTH ( xc_addr u -- n ) XCHAR
>> n is the number of monospace ASCII characters that take the same space to
>> display as the the XCHAR string starting at xc_addr, using u address
>> units.
>
> This I name xc-width. I have also xc-length that counts the number of
> xchars in a string.

I'm not sure what you need the second one for (other than inside a
conversion function to UTF-32, because it will fail for UTF-16 already if
you have something in the later codepages).

>> XKEY ( -- xc ) XCHAR EXT
>> Reads an XCHAR from the terminal.
>>
>> XEMIT ( xc -- ) XCHAR EXT
>> Prints an XCHAR on the terminal.
>
> Functionality implemented in key and emit. You can still send and
> recive whatever the device allows you

I map that into KEY and EMIT in bigFORTH as well, because these are
vectored. So you can't get your individual UTF-8 bytes with KEY and EMIT.

>>
>> SET-ENCODING ( encoding -- ) XCHAR EXT
>> Sets the input encoding to the specified encoding
>>
>> GET-ENCODING ( -- encoding ) XCHAR EXT
>> Returns the current encoding.
>
> Only encoding allowed in my system is utf8 therefore not needed

The consensus here is that you can query the encoding as environment query,
but you can't set it.

> Also parse, parse-word and word modified accordingly

For UTF-8 terminators? Ok, might make sense.

>> : xstring+ ( xcaddr u -- xcaddr u' )
>> over + xchar+ over - ;
>
> This adds an xchar after the buffer. How can we know that there is
> one?

Actually, only xstring- and +xstring (better +x/string) really make sense.

Peter Fälth

Aug 13, 2007, 6:21:02 PM

On Aug 13, 6:17 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Peter Fälth wrote:
> >> Words:
>
> >> XC-SIZE ( xc -- u ) XCHAR EXT
> >> Computes the memory size of the XCHAR xc in address units.
>
> >> X-SIZE ( xc_addr u1 -- u2 ) XCHAR
> >> Computes the memory size of the first XCHAR stored at xc_addr in
> >> address units.
>
> > Why does it need to have a string as argument only the xcaddr is
> > needed?
>
> X-SIZE can be at most u1 (i.e. avoid overflow if the encoding is potentially
> endless). You probably are ok if you do
>
> : x-size ( xc_addr u1 -- u2 ) >r xcs r> umin ;
>
> But since read access over buffer boundaries is much less harmful than write
> access, XCS would be enough (whatever we call it).

Aren't we trying to correct an error committed somewhere else with
this? If at xcaddr we have a utf8 character with 3 bytes and u1 is 2,
something has gone wrong earlier.

>
> >> XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
> >> Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
> >> characters large. xc_addr2 points to the first memory location after
> >> xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
> >> into the buffer, flag is true, otherwise flag is false, and xc_addr2
> >> u2 equal xc_addr1 u1. XC!+? is save for buffer overflows, and
> >> therefore preferred over XC!+.
>
> > I use xc!-step ( xc xc_addr1 u1 -- xc_addr2 u2 ) with similar effect
> > u2 is zero when the operation did not succeed or the end of the buffer
> > is reached. I also have a xc@-step. I am considering introducing the
> > flag as that gives some more information and avoids a dup in most
> > uses.
>
> XC@+? would have four return values - bad style.

Then maybe my xc@-step ( xcaddr1 u1 -- xcaddr2 u2 xc ) is an
alternative. It works well in a loop to process xchars until you reach
the end of the buffer.
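
A minimal sketch of such a stepping word and the loop it enables, for
UTF-8 only. This is an assumed implementation, not the one from ntf/lxf:
it handles sequences of up to four bytes, assumes one character per
address unit, and does no validation, so the buffer must hold complete,
well-formed sequences. The word .xchars is a hypothetical example.

```forth
decimal

\ Sketch: fetch one xchar from the front of the buffer and shrink the buffer.
: xc@-step ( xcaddr1 u1 -- xcaddr2 u2 xc )
  over c@                                   ( a u b )
  dup 128 u< IF >r 1 /string r> EXIT THEN   \ ASCII: one byte, done
  dup 224 u< IF 31 and 1 ELSE               \ 110xxxxx: 1 continuation byte
  dup 240 u< IF 15 and 2 ELSE               \ 1110xxxx: 2 continuation bytes
                 7 and 3 THEN THEN          ( a u xc n )
  >r >r 1 /string r> r>                     \ step over the lead byte
  0 ?DO                                     ( a u xc )
    6 lshift >r over c@ 63 and r> or        \ merge six payload bits
    >r 1 /string r>                         \ step over the continuation byte
  LOOP ;

\ Usage: print the code point of every xchar in the buffer.
: .xchars ( xcaddr u -- )
  BEGIN dup WHILE xc@-step u. REPEAT 2drop ;
```

The loop needs no separate flag: it simply runs until the remaining
length is zero, which is why a truncated final sequence has to be
excluded beforehand.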
>

> >> -XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
> >> Step backward by one xchar in the buffer defined by xcaddr1 u1,
> >> starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
> >> after stepping backward over the last XCHAR in the buffer. Unlike
> >> XCHAR-, -XSTRING can be implemented in encodings that have only a
> >> forward-working string size.
>
> > This nameing is confusing. In the implementation you show 4 versions,
> > some
> > of then not correctly implemented according to the description. I use
> > : x/string ( xcaddr u -- xcaddr1 u1 ) over x-size /string ;
> > do we need more?
>
> I'm thinking about changing the name of the string manipulator to +X/STRING
> (with a -X/STRING moving backward one char). However, there needs to be a
> way to step back one character, and since some encodings can do that only
> by forward scanning through the buffer, the string format is necessary (and
> it can only step back at the end of the string).

Wouldn't string/x be the natural name when we remove at the end?

: string/x ( xcaddr u -- xcaddr1 u1 ) over + xchar- over - ;
and
: x/string ( xcaddr u -- xcaddr1 u1 ) over dup xchar+ swap - /string ;
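
Because the backward step receives the whole string, it can be written
so that it never reads in front of the buffer. The following is a
self-contained sketch for UTF-8 only; it does not use xchar- and is an
illustration under the assumption of one character per address unit,
not a proposed reference implementation.

```forth
decimal

\ Sketch: drop the last xchar of a UTF-8 string, staying inside the buffer.
\ Scans backward over continuation bytes (10xxxxxx) until a lead byte
\ or ASCII byte has been removed, or until the string is empty.
: string/x ( xcaddr u1 -- xcaddr u2 )
  BEGIN dup WHILE                  \ stop if nothing is left
    1-  2dup + c@                  ( xcaddr u' byte ) last byte of the rest
    192 and 128 <>                 \ true if it was not a continuation byte
  UNTIL THEN ;
```

On an empty string the word returns the string unchanged, so no byte
before xcaddr is ever fetched.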

>
> >> -TRAILING-GARBAGE ( xcaddr u1 -- xcaddr u2 ) XCHAR
> >> Examine the last XCHAR in the buffer xcaddr u1 - if the encoding is
> >> correct and it repesents a full character, u2 equals u1, otherwise, u2
> >> represents the string without the last (garbled) XCHAR.
>
> > I have not yet needed this
>
> It's good if you don't want to confuse an UTF-8 terminal.

OK

>
> >> X-WIDTH ( xc_addr u -- n ) XCHAR
> >> n is the number of monospace ASCII characters that take the same space to
> >> display as the the XCHAR string starting at xc_addr, using u address
> >> units.
>
> > This I name xc-width. I have also xc-length that counts the number of
> > xchars in a string.
>
> I'm not sure what you need the second one for (other than inside a
> conversion function to UTF-32, because it will fail for UTF-16 already if
> you have something in the later codepages).

You are probably right. I went back and examined my files and saw that
it is no longer used.

> >> XKEY ( -- xc ) XCHAR EXT
> >> Reads an XCHAR from the terminal.
>
> >> XEMIT ( xc -- ) XCHAR EXT
> >> Prints an XCHAR on the terminal.
>
> > Functionality implemented in key and emit. You can still send and
> > recive whatever the device allows you
>
> I do that map into KEY and EMIT in bigFORTH as well, because these are
> vectorized. So you can't get your individual UTF-8 bytes with KEY and EMIT.

On Linux it will work to read and write each byte. On Windows I use
the console functions; they give and take ucs2 (I am not sure whether
utf16 is also implemented in the console).

>
>
>
> >> SET-ENCODING ( encoding -- ) XCHAR EXT
> >> Sets the input encoding to the specified encoding
>
> >> GET-ENCODING ( -- encoding ) XCHAR EXT
> >> Returns the current encoding.
>
> > Only encoding allowed in my system is utf8 therefore not needed
>
> The consensus here is that you can query the encoding as environment query,
> but you can't set it.

Good

>
> > Also parse, parse-word and word modified accordingly
>
> For UTF-8 terminators? Ok, might make sense.
>
> >> : xstring+ ( xcaddr u -- xcaddr u' )
> >> over + xchar+ over - ;
>
> > This adds an xchar after the buffer. How can we know that there is
> > one?
>
> Actually, only xstring- and +xstring (better +x/string) really make sense.
>
> --
> Bernd Paysan
> "If you want it done right, you have to do it yourself"
> http://www.jwdt.com/~paysan/

Peter

Bernd Paysan

Aug 14, 2007, 9:12:20 AM

Peter Fälth wrote:
>> But since read access over buffer boundaries is much less harmful than
>> write access, XCS would be enough (whatever we call it).
>
> Aren't we trying to correct an error committed somewhere else with
> this?
> If at xcaddr we have an utf8 character with 3 bytes and u1 is 2 it has
> gone wrong earlier.

Well, indeed. So XCS or XC-SIZE with ( xc_addr -- n ) should be enough.

>> XC@+? would have four return values - bad style.
>
> Then maybe my xc@-step ( xcaddr1 u1 -- xcaddr2 u2 xc ) is an
> alternative
> it works well in a loop to process xchars until you reach the end of
> the buffer

Yes, and error handling for wrong buffer ends could be done with other means
(e.g. throwing an error or returning the garbage code).

> Wouldn't string/x be the natural name when we remove at the end?

Or string\x, or x\string or whatever indicates backwards best.

Thanks for jumping into that discussion.

m_l...@yahoo.com

Sep 4, 2007, 12:03:56 PM

On Jul 14, 11:56 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:
...

> Open issues are file reading and writing (conversion on the fly or leave as
> it is?).

IMO, it's better to leave it as it is.
Imagine a language with multiple encodings and a program converting
from one encoding into another.
(Example: in Russia, the most popular encodings are Windows-1251,
CP-866 (console-mode Windows apps) and KOI8-R. Encoding detectors,
converters, and decoders of erroneously converted text are a popular
kind of software there. Say a text in 866 was converted from KOI8-R
into 1251; since it was not KOI8-R, the result is unreadable.)

As far as I have heard, there's a similar situation with Chinese/Japanese/
Korean: there's at least one M$ encoding and Unicode.

Given that Russians most likely won't use Unicode and won't test with
Unicode, and that Europeans won't test their systems with Cyrillic or
Eastern scripts, it's better to leave the text as it is.

I mean, NOBODY WILL TEST ON-THE-FLY CONVERSION for all possible
languages, therefore IT'S BETTER NOT TO HAVE IT.

(You cannot delegate the conversion to the operating system because
for some human languages there are bugs... It should not be so, but it
is so. You cannot rely on the built-in stuff. For example, I began to
see Cyrillic characters in the Debian Linux console only after I
installed Chinese stuff.;)