RfD: XCHAR wordset
Newsgroups: comp.lang.forth
From: Bernd Paysan <bernd.pay...@gmx.de>
Date: Sat, 14 Jul 2007 21:56:12 +0200
Local: Sat, Jul 14 2007 3:56 pm
Subject: RfD: XCHAR wordset
Since it's time to post RfDs, I want to throw in the updated proposal for
the XCHAR wordset. I hope I have included all comments so far, and I also
included a reference implementation.

Problem:

ASCII is only appropriate for the English language. Most western
languages however fit somewhat into the Forth frame, since a byte is
sufficient to encode the few special characters in each (though not
always the same encoding can be used; latin-1 is most widely used,
though). For other languages, different char-sets have to be used,
several of them variable-width. The most prominent representative is
UTF-8. Let's call these extended characters XCHARs. Since ANS Forth
specifies ASCII encoding, only ASCII-compatible encodings may be
used. Fortunately, being ASCII compatible has so many benefits that
most encodings actually are ASCII compatible.

Proposal

Datatypes:

xc      is an extended char on the stack. It occupies one cell, and is a
        subset of unsigned cell. Note: UTF-8 cannot store more than
        31 bits; on 16 bit systems, only the UCS16 subset of the UTF-8
        character set can be used. Small embedded systems can keep
        xchars always in memory, because all words directly dealing with
        the xc datatype are in the XCHAR EXT wordset.

xc_addr is the address of an XCHAR in memory. Alignment requirements are
        the same as c_addr. The memory representation of an XCHAR differs
        from the stack location, and depends on the encoding used. An XCHAR
        may use a variable number of address units in memory.

encoding    cell-sized opaque data type identifying a particular encoding.

Common encodings:

Input and files commonly are either encoded iso-latin-1 or utf-8. The
encoding depends on settings of the computer system such as the LANG
environment variable on Unix. You can use the system consistently only
when you don't change the encoding, or only use the ASCII
subset. Typical use is that the base system is ASCII only, and then
extended encoding-specific.

Side issues to be considered:

Many Forth systems today are case insensitive, to accept lower case
standard words. It is sufficient to be case insensitive for the ASCII
subset to make this work - this saves a large code mapping table for
comparison of other symbols. Case is mostly an issue of European
languages (latin, greek, and cyrillic), but similar issues exist
between traditional and simplified Chinese, and between different
Latin code pages in UCS, e.g. full width vs. normal half width latin
letters. Some encodings (not UTF-8) might give surprises when you use
a case insensitive ASCII-compare that's 8-bit save, but not aware of
the current encoding.
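
A minimal sketch of the ASCII-only case folding described above, assuming
characters are stored as bytes. The names ascii-upc and ascii-ci= are made up
for this illustration and are not part of the proposal. Values of 80 hex and
above pass through unchanged, so in UTF-8 no byte of a multi-byte sequence is
altered.

```forth
\ Sketch only: ASCII-restricted case-insensitive comparison.
\ ascii-upc and ascii-ci= are hypothetical names.

: ascii-upc ( c -- c' )  \ fold a-z to A-Z, leave all other values alone
    dup [char] a - 26 u< IF  32 -  THEN ;

: ascii-ci= ( c-addr1 u1 c-addr2 u2 -- flag )
    rot over <> IF  drop 2drop false EXIT  THEN  \ lengths differ
    ( c-addr1 c-addr2 u )
    0 ?DO
        over i chars + c@ ascii-upc   \ character i of the first string
        over i chars + c@ ascii-upc   \ character i of the second string
        <> IF  2drop false UNLOOP EXIT  THEN
    LOOP
    2drop true ;
```

Because the comparison works byte by byte, two strings that differ only in
the case of a non-ASCII letter compare as different.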

Words:

XC-SIZE ( xc -- u ) XCHAR EXT
Computes the memory size of the XCHAR xc in address units.

X-SIZE ( xc_addr u1 -- u2 ) XCHAR
Computes the memory size of the first XCHAR stored at xc_addr in
address units.

XC@+ ( xc_addr1 -- xc_addr2 xc ) XCHAR EXT
Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XC!+ ( xc xc_addr1 -- xc_addr2 ) XCHAR EXT
Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
characters large. xc_addr2 points to the first memory location after
xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
into the buffer, flag is true, otherwise flag is false, and xc_addr2
u2 equal xc_addr1 u1. XC!+? is save for buffer overflows, and
therefore preferred over XC!+.
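
As a usage sketch for XC!+?, the following word stores the same xchar
repeatedly until the buffer is full. fill-with and buf are hypothetical
names, and XC!+? is assumed to be available (for example from the reference
implementation below). The loop relies on the address and remaining size
being returned unchanged when the xchar does not fit.

```forth
\ Sketch only: fill-with and buf are hypothetical names.

create buf  8 chars allot

: fill-with ( xc xc-addr u1 -- u2 )  \ u2 = size left unused
    BEGIN  2>r dup 2r>  xc!+?  0= UNTIL  \ stop when xc no longer fits
    nip nip ;

hex 20AC decimal  buf 8  fill-with .
```

With UTF-8 as the encoding, U+20AC occupies three bytes, so two copies
should fit into the eight-byte buffer and two bytes should remain.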

XCHAR+ ( xc_addr1 -- xc_addr2 ) XCHAR EXT
Adds the size of the XCHAR stored at xc_addr1 to this address, giving
xc_addr2.

XCHAR- ( xc_addr1 -- xc_addr2 ) XCHAR EXT
Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
work for every possible encoding.

XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
u2 is the remaining buffer after stepping over the first XCHAR in the
buffer.

-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
Step backward by one xchar in the buffer defined by xcaddr1 u1,
starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
after stepping backward over the last XCHAR in the buffer. Unlike
XCHAR-, -XSTRING can be implemented in encodings that have only a
forward-working string size.

-TRAILING-GARBAGE ( xcaddr u1 -- xcaddr u2 ) XCHAR
Examine the last XCHAR in the buffer xcaddr u1 - if the encoding is
correct and it represents a full character, u2 equals u1, otherwise, u2
represents the string without the last (garbled) XCHAR.
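
One place where -TRAILING-GARBAGE seems useful is reading a file in
fixed-size chunks, where a chunk can end in the middle of an XCHAR. A
sketch, assuming READ-FILE from the FILE wordset; read-chunk is a
hypothetical name.

```forth
\ Sketch only: read-chunk is a hypothetical name.

: read-chunk ( c-addr u1 fid -- c-addr u2 ior )
    >r over swap r> read-file            ( c-addr u ior )
    >r  dup IF  -trailing-garbage  THEN  r> ;  \ skip the check if nothing was read
```

The bytes that are cut off are not returned by this sketch. A real reader
would have to read them again, for instance by stepping back with
REPOSITION-FILE before the next call.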

X-WIDTH ( xc_addr u -- n ) XCHAR
n is the number of monospace ASCII characters that take the same space to
display as the XCHAR string starting at xc_addr, using u address units.
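
A short usage sketch for X-WIDTH: right-aligning a string in a field of n
display columns. xtype.r is a hypothetical name. The padding is computed from
the display width and not from the size in address units, which is the whole
point of the word.

```forth
\ Sketch only: xtype.r is a hypothetical name.

: xtype.r ( xc-addr u n -- )  \ type the string right-aligned in n columns
    >r 2dup x-width          ( xc-addr u width )
    r> swap - 0 max spaces   \ pad on the left; no padding if too wide
    type ;
```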

XKEY ( -- xc ) XCHAR EXT
Reads an XCHAR from the terminal.

XEMIT ( xc -- ) XCHAR EXT
Prints an XCHAR on the terminal.

SET-ENCODING ( encoding -- ) XCHAR EXT
Sets the input encoding to the specified encoding

GET-ENCODING ( -- encoding ) XCHAR EXT
Returns the current encoding.

Encodings are implementation specific; example encoding names could be:

ISO-LATIN-1 ( -- encoding ) XCHAR EXT
ISO Latin1 encoding (one byte per character)

UTF-8 ( -- encoding ) XCHAR EXT
UTF-8 encoding (UCS codepage, byte-oriented variable length encoding)

The following words behave different when the XCHAR extension is present:

CHAR ( "<spaces>name" -- xc )
Skip leading space delimiters.  Parse name delimited by a space.  Put the
value of its first XCHAR onto the stack.

[CHAR]
Interpretation: Interpretation semantics for this word are undefined.
        Compilation:    ( "<spaces>name" -- )
Skip leading space delimiters.  Parse name delimited by a space.  Append the
run-time semantics given below to the current definition.
        Run-time:       ( -- xc )
Place xc, the value of the first XCHAR of name, on the stack.

Reference implementation:

-------------------------xchar.fs----------------------------
\ xchar reference implementation: UTF-8 (and ISO-LATIN-1)

\ environmental dependency: characters are stored as bytes
\ environmental dependency: lower case words accepted

base @ hex

80 Value maxascii

: xc-size ( xc -- n )
    dup      maxascii u< IF  drop 1  EXIT  THEN \ special case ASCII
    $800  2 >r
    BEGIN  2dup u>=  WHILE  5 lshift r> 1+ >r  dup 0= UNTIL  THEN
    2drop r> ;

: xc@+ ( xcaddr -- xcaddr' u )
    count  dup maxascii u< IF  EXIT  THEN  \ special case ASCII
    7F and  40 >r
    BEGIN   dup r@ and  WHILE  r@ xor
            6 lshift r> 5 lshift >r >r count
            3F and r> or
    REPEAT  r> drop ;

: xc!+ ( xc xcaddr -- xcaddr' )
    over maxascii u< IF  tuck c! char+  EXIT  THEN \ special case ASCII
    >r 0 swap  3F
    BEGIN  2dup u>  WHILE
            2/ >r  dup 3F and 80 or swap 6 rshift r>
    REPEAT  7F xor 2* or  r>
    BEGIN   over 80 u< 0= WHILE  tuck c! char+  REPEAT  nip ;

: xc!+? ( xc xcaddr u -- xcaddr' u' flag )
    >r over xc-size r@ over u< IF ( xc xc-addr1 len r: u1 )
        \ not enough space
        drop nip r> false
    ELSE
        >r xc!+ r> r> swap - true
    THEN ;

\ scan to next/previous character

: xchar+ ( xcaddr -- xcaddr' )  xc@+ drop ;
: xchar- ( xcaddr -- xcaddr' )
    BEGIN  1 chars - dup c@ C0 and maxascii <>  UNTIL ;

: xstring+ ( xcaddr u -- xcaddr u' )
    over + xchar+ over - ;
: xstring- ( xcaddr u -- xcaddr u' )
    over + xchar- over - ;

: +xstring ( xc-addr1 u1 -- xc-addr2 u2 )
    over dup xchar+ swap - /string ;
: -xstring ( xc-addr1 u1 -- xc-addr2 u2 )
    over dup xchar- swap - /string ;

\ skip trailing garbage

: x-size ( xcaddr u1 -- u2 ) drop
    \ length of UTF-8 char starting at u8-addr (accesses only u8-addr)
    c@
    dup $80 u< IF drop 1 exit THEN
    dup $c0 u< IF drop 1 EXIT THEN \ really is a malformed character
    dup $e0 u< IF drop 2 exit THEN
    dup $f0 u< IF drop 3 exit THEN
    dup $f8 u< IF drop 4 exit THEN
    dup $fc u< IF drop 5 exit THEN
    dup $fe u< IF drop 6 exit THEN
    drop 1 ; \ also malformed character

: -trailing-garbage ( xcaddr u1 -- xcaddr u2 )
    2dup + dup xchar- ( addr u1 end1 end2 )
    2dup dup over over - x-size + = IF \ last character ok
        2drop
    ELSE
        nip nip over -
    THEN ;

\ utf key and emit

: xkey ( -- xc )
    key dup maxascii u< IF  EXIT  THEN  \ special case ASCII
    7F and  40 >r
    BEGIN  dup r@ and  WHILE  r@ xor
            6 lshift r> 5 lshift >r >r key
            3F and r> or
    REPEAT  r> drop ;

: xemit ( xc -- )
    dup maxascii u< IF  emit  EXIT  THEN \ special case ASCII
    0 swap  3F
    BEGIN  2dup u>  WHILE
            2/ >r  dup 3F and 80 or swap 6 rshift r>
    REPEAT  7F xor 2* or
    BEGIN   dup 80 u< 0= WHILE emit  REPEAT  drop ;

\ utf size

\ uses wcwidth ( xc -- n )

: wc, ( n low high -- )  1+ , , , ;

Create wc-table \ derived from wcwidth source code, for UCS32
0 0300 0357 wc,
0 035D 036F wc,
0 0483 0486 wc,
0 0488 0489 wc,
0 0591 05A1 wc,
0 05A3 05B9 wc,
0 05BB 05BD wc,
0 05BF 05BF wc,
0 05C1 05C2 wc,
0 05C4 05C4 wc,
0 0600 0603 wc,
0 0610 0615 wc,
0 064B 0658 wc,
0 0670 0670 wc,
0 06D6 06E4 wc,
0 06E7 06E8 wc,
0 06EA 06ED wc,
0 070F 070F wc,
0 0711 0711 wc,
0 0730 074A wc,
0 07A6 07B0 wc,
0 0901 0902 wc,
0 093C 093C wc,
0 0941 0948 wc,
0 094D 094D wc,
0 0951 0954 wc,
0 0962 0963 wc,
0 0981 0981 wc,
0 09BC 09BC wc,
0 09C1 09C4 wc,
0 09CD 09CD wc,
0 09E2 09E3 wc,
0 0A01 0A02 wc,
0 0A3C 0A3C wc,
0 0A41 0A42 wc,
0 0A47 0A48 wc,
0 0A4B 0A4D wc,
0 0A70 0A71 wc,
0 0A81 0A82 wc,
0 0ABC 0ABC wc,
0 0AC1 0AC5 wc,
0 0AC7 0AC8 wc,
0 0ACD 0ACD wc,
0 0AE2 0AE3 wc,
0 0B01 0B01 wc,
0 0B3C 0B3C wc,
0 0B3F 0B3F wc,
0 0B41 0B43 wc,
0 0B4D 0B4D wc,
0 0B56 0B56 wc,
0 0B82 0B82 wc,
0 0BC0 0BC0 wc,
0 0BCD 0BCD wc,
0 0C3E 0C40 wc,
0 0C46 0C48 wc,
0 0C4A 0C4D wc,
0 0C55 0C56 wc,
0 0CBC 0CBC wc,
0 0CBF 0CBF wc,
0 0CC6 0CC6 wc,
0 0CCC 0CCD wc,
0 0D41 0D43 wc,
0 0D4D 0D4D
...

Newsgroups: comp.lang.forth
From: Bruce McFarling <agil...@netscape.net>
Date: Sun, 15 Jul 2007 06:06:25 -0700
Local: Sun, Jul 15 2007 9:06 am
Subject: Re: RfD: XCHAR wordset
How hard would it be to extend the reference implementation to UTF-32?

Erratum:

XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
characters large. xc_addr2 points to the first memory location after
xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
into the buffer, flag is true, otherwise flag is false, and xc_addr2
u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and
therefore preferred over XC!+.


Newsgroups: comp.lang.forth
From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Date: Sun, 15 Jul 2007 17:24:56 GMT
Local: Sun, Jul 15 2007 1:24 pm
Subject: Re: RfD: XCHAR wordset
Bernd Paysan <bernd.pay...@gmx.de> writes:
>xc_addr is the address of an XCHAR in memory. Alignment requirements are
>        the same as c_addr. The memory representation of an XCHAR differs
>        from the stack location, and depends on the encoding used. An XCHAR

                        ^^^^^^^^
representation?

>Common encodings:
...
>Side issues to be considered:

These appear to be subsections that should be put in informative
sections, not the normative "Proposal" section.

>Many Forth systems today are case insensitive, to accept lower case
>standard words. It is sufficient to be case insensitive for the ASCII
>subset to make this work - this saves a large code mapping table for
>comparison of other symbols. Case is mostly an issue of European
>languages (latin, greek, and cyrillic), but similar issues exist
>between traditional and simplified Chinese, and between different
>Latin code pages in UCS, e.g. full width vs. normal half width latin
>letters. Some encodings (not UTF-8) might give surprises when you use
>a case insensitive ASCII-compare that's 8-bit save, but not aware of
>the current encoding.

Even in UTF-8 you can compose letters, e.g. an Umlaut-a from a
diaeresis and an a, and that would be encoded differently than the
Latin-1-derived Umlaut-a.

Anyway, that's not a problem we should try to solve at the Forth
level, or at least not in this proposal.

>Words:

>XC-SIZE ( xc -- u ) XCHAR EXT
>Computes the memory size of the XCHAR xc in address units.

>X-SIZE ( xc_addr u1 -- u2 ) XCHAR
>Computes the memory size of the first XCHAR stored at xc_addr in
>address units.
...
>XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
>Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
>characters large.

Shouldn't the granularity of the size specifications be the same
(i.e., either aus or chars) throughout the wordset?

> xc_addr2 points to the first memory location after
>xc, u2 is the remaining size of the buffer.

In what units?  The size units are missing in most of the rest of the
word specifications, but I do not mention this again.

>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
>Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
>u2 is the remaining buffer after stepping over the first XCHAR in the
>buffer.

>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
>Step backward by one xchar in the buffer defined by xcaddr1 u1,
>starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
>after stepping backward over the last XCHAR in the buffer. Unlike
>XCHAR-, -XSTRING can be implemented in encodings that have only a
>forward-working string size.

The asymmetry in the stack effects of XSTRING+ and -XSTRING is
probably hard to remember and may be confusing.

>X-WIDTH ( xc_addr u -- n ) XCHAR
>n is the number of monospace ASCII characters that take the same space to
>display as the XCHAR string starting at xc_addr, using u address units.

Maybe mention that this is only relevant for monospaced displays/fonts.

>SET-ENCODING ( encoding -- ) XCHAR EXT
>Sets the input encoding to the specified encoding

So there's an input encoding and an internal encoding?

Are all inputs affected?  I would set file encodings per-file.

What about the output encoding?

>The following words behave different when the XCHAR extension is present:

>CHAR ( "<spaces>name" -- xc )
>Skip leading space delimiters.  Parse name delimited by a space.  Put the
>value of its first XCHAR onto the stack.

>[CHAR]
>Interpretation: Interpretation semantics for this word are undefined.
>        Compilation:    ( "<spaces>name" -- )
>Skip leading space delimiters.  Parse name delimited by a space.  Append the
>run-time semantics given below to the current definition.
>        Run-time:       ( -- xc )
>Place xc, the value of the first XCHAR of name, on the stack.

I would call that an extended behaviour, not a different behaviour,
because the behaviour will be the same for Forth-94 programs.

>Experience:

>Build into Gforth (development version) and recent versions of bigFORTH.

There's also at least one other implementation, lxf-ntf by Peter Falth.

>Open issues are file reading and writing (conversion on the fly or leave as
>it is?).

We have not implemented it yet, but for text files the conversion to
and from the internal representation should be performed by
READ/WRITE-FILE/LINE.  If you read it in unconverted (i.e., as
binary), the program has to keep track of which buffer contains which
encoding, and do the conversion itself, which is error-prone,
inconvenient, and the proposal does not supply words for that.  But,
as mentioned above, if you really want that, you can have it by
treating the file as binary.

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/


Newsgroups: comp.lang.forth
From: Bernd Paysan <bernd.pay...@gmx.de>
Date: Sun, 15 Jul 2007 21:33:52 +0200
Local: Sun, Jul 15 2007 3:33 pm
Subject: Re: RfD: XCHAR wordset

Bruce McFarling wrote:
> How hard would it be to extend the reference implementation to UTF-32?

UTF-32 is not ASCII compatible, unless you have a system where 1 CHAR = 32
bit.
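
To illustrate, here is a small sketch under the assumption of 8-bit chars:
the letter A stored as UTF-32 (little-endian) contains three zero bytes,
while in UTF-8 it is the single ASCII byte. The names nuls, utf32-A and
utf8-A are made up for the example.

```forth
\ Sketch only: nuls, utf32-A and utf8-A are hypothetical names.
\ Environmental dependency: characters are stored as bytes.

: nuls ( c-addr u -- n )  \ count the zero bytes in a buffer
    0 rot rot  over + swap ?DO  i c@ 0= IF  1+  THEN  LOOP ;

create utf32-A  65 c, 0 c, 0 c, 0 c,   \ "A" as UTF-32, little-endian
create utf8-A   65 c,                  \ "A" as UTF-8, identical to ASCII

utf32-A 4 nuls .   \ prints 3
utf8-A  1 nuls .   \ prints 0
```

Any word that walks a string byte by byte and expects ASCII values would
treat those zero bytes as characters of their own.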

> Erratum:

> XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
> Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
> characters large. xc_addr2 points to the first memory location after
> xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
> into the buffer, flag is true, otherwise flag is false, and xc_addr2
> u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and
> therefore preferred over XC!+.

Thanks, there was another save/safe error, as well.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/


Newsgroups: comp.lang.forth
From: Bernd Paysan <bernd.pay...@gmx.de>
Date: Sun, 15 Jul 2007 22:02:32 +0200
Local: Sun, Jul 15 2007 4:02 pm
Subject: Re: RfD: XCHAR wordset

Anton Ertl wrote:
> Bernd Paysan <bernd.pay...@gmx.de> writes:
>>xc_addr is the address of an XCHAR in memory. Alignment requirements are
>>        the same as c_addr. The memory representation of an XCHAR differs
>>        from the stack location, and depends on the encoding used. An
>>        XCHAR
>                         ^^^^^^^^
> representation?

Yes.

>>Common encodings:
> ...
>>Side issues to be considered:

> These appear to be subsections that should be put in informative
> sections, not the normative "Proposal" section.

Moved it to an appendix

>>XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
>>Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
>>characters large.

> Shouldn't the granularity of the size specifications be the same
> (i.e., either aus or chars) throughout the wordset?

Should be AUs.

> The asymmetry in the stack effects of XSTRING+ and -XSTRING is
> probably hard to remember and may be confusing.

Oops, got it wrong, the description is actually of +XSTRING and XSTRING-.
The sign is on the side of the string which gets modified, and indicates
the direction (+ towards higher addresses, - towards lower). The sample
implementation also contains the opposite partner of each of those, but
that doesn't make too much sense (if you extend the buffer, you can as well
use XCHAR+ and XCHAR-).
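
As a sketch of how +XSTRING would typically be used, here is a word that
counts the xchars in a string. xc-count is a hypothetical name, +xstring is
copied from the reference implementation, and XCHAR+ is assumed to be
present. The string is assumed to be well formed, for example after
-TRAILING-GARBAGE; a truncated last character would make the remaining
length wrap around.

```forth
\ Sketch only: xc-count is a hypothetical name.

: +xstring ( xc-addr1 u1 -- xc-addr2 u2 )  \ as in the reference implementation
    over dup xchar+ swap - /string ;

: xc-count ( xc-addr u -- n )  \ number of xchars in the string
    0 >r
    BEGIN  dup  WHILE  +xstring  r> 1+ >r  REPEAT
    2drop r> ;
```

The + sits on the left because the start of the string is what moves, and it
moves towards higher addresses.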

>>X-WIDTH ( xc_addr u -- n ) XCHAR
>>n is the number of monospace ASCII characters that take the same space to
>>display as the XCHAR string starting at xc_addr, using u address
>>units.

> Maybe mention that this is only relevant for monospaced displays/fonts.

Fonts where each character takes an integer multiple width of ASCII
characters. Calling that "monospaced" is a bit stretching the
word "monospaced" ;-).

>>SET-ENCODING ( encoding -- ) XCHAR EXT
>>Sets the input encoding to the specified encoding

> So there's an input encoding and an internal encoding?

Actually, there's just an encoding, which is both internal (for words like
XCHAR+), and external (for XKEY/XEMIT).

> Are all inputs affected?  I would set file encodings per-file.

> What about the output encoding?

So far, only one encoding at a time is supported.

Fine.

>>Open issues are file reading and writing (conversion on the fly or leave
>>as it is?).

> We have not implemented it yet, but for text files the conversion to
> and from the internal representation should be performed by
> READ/WRITE-FILE/LINE.  If you read it in unconverted (i.e., as
> binary), the program has to keep track of which buffer contains which
> encoding, and do the conversion itself, which is error-prone,
> inconvenient, and the proposal does not supply words for that.  But,
> as mentioned above, if you really want that, you can have it by
> treating the file as binary.

I think for file encodings, we should have a word that sets the encoding of
a file, like SET-FILE-ENCODING ( encoding fd -- ior ), and we also need a
tag in the file to set the encoding while interpreting, i.e.
SET-SOURCE-ENCODING (sets the encoding of the source file).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/


Newsgroups: comp.lang.forth
From: Alex McDonald <b...@rivadpm.com>
Date: Sun, 15 Jul 2007 21:33:30 +0100
Local: Sun, Jul 15 2007 4:33 pm
Subject: Re: RfD: XCHAR wordset

Bernd Paysan wrote:

[snipped]

Unfortunately, on first analysis, this is one proposal that Win32Forth
will not be adopting any time soon.

Windows is UTF-16, which is not ASCII compliant. Although Windows
provides APIs to translate from locale to locale, there is no method in
Win32Forth to automatically identify which parameters would be required
to be translated from XCHARs to UTF-16 and back; the programmer would be
responsible for coding the conversions.

We would need something like the proposal Anton made at EuroForth 2006
(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
Function Call Interface), with extensions to identify string pointers,
before implementing this.

--
Regards
Alex McDonald


Newsgroups: comp.lang.forth
From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Date: Mon, 16 Jul 2007 09:46:32 GMT
Local: Mon, Jul 16 2007 5:46 am
Subject: Re: RfD: XCHAR wordset

Hmm, your mistake may indicate that this naming is error-prone,
especially in implementations where the opposite partners exist.

>>>SET-ENCODING ( encoding -- ) XCHAR EXT
>>>Sets the input encoding to the specified encoding

>> So there's an input encoding and an internal encoding?

>Actually, there's just an encoding, which is both internal (for words like
>XCHAR+), and external (for XKEY/XEMIT).

I think that no word for changing the internal encoding should be
standardized.  Or if you standardize it, it should fail if the new
internal encoding is not an extension of the old one (i.e.,
ASCII->Latin-1 ok, ASCII->UTF-8 ok, but Latin-1->UTF-8 fails); since
this is a one-way street, GET-ENCODING makes little sense.

Otherwise a standard program could contain strings in different,
incompatible encodings, some of them in system-controlled strings
(e.g., word names), controlled by a global state variable.  This would
be worse than STATE and BASE.  No need to introduce another such
mistake.
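
A rough sketch of the one-way rule, only to make the idea concrete. All
names are hypothetical, and encodings are modelled as small integers, which
the proposal does not require.

```forth
\ Sketch only: all names here are hypothetical, not part of the proposal.

0 constant enc-ascii
1 constant enc-latin1
2 constant enc-utf8

variable cur-enc  enc-ascii cur-enc !

: extends? ( new old -- flag )  \ may old be replaced by new?
    2dup = IF  2drop true EXIT  THEN  \ no change at all
    nip enc-ascii = ;                 \ otherwise only ASCII can be extended

: extend-encoding ( encoding -- )
    dup cur-enc @ extends? 0= abort" encoding is not an extension"
    cur-enc ! ;

enc-utf8 extend-encoding      \ ASCII -> UTF-8 is accepted
\ enc-latin1 extend-encoding  \ would abort now: UTF-8 -> Latin-1
```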

>>>Open issues are file reading and writing (conversion on the fly or leave
>>>as it is?).

>> We have not implemented it yet, but for text files the conversion to
>> and from the internal representation should be performed by
>> READ/WRITE-FILE/LINE.  If you read it in unconverted (i.e., as
>> binary), the program has to keep track of which buffer contains which
>> encoding, and do the conversion itself, which is error-prone,
>> inconvenient, and the proposal does not supply words for that.  But,
>> as mentioned above, if you really want that, you can have it by
>> treating the file as binary.

>I think for file encodings, we should have a word that sets the encoding of
>a file, like SET-FILE-ENCODING ( encoding fd -- ior ),

The primary method should work through OPEN-FILE and CREATE-FILE
(e.g., by specifying the encoding in the fam).  But yes, a word like
SET-FILE-ENCODING is useful when the program learns about the encoding
later (e.g., when the encoding is specified at the start of the file).

> and we also need a
>tag in the file to set the encoding while interpreting, i.e.
>SET-SOURCE-ENCODING (sets the encoding of the source file).

That sounds sensible.

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/


Newsgroups: comp.lang.forth
From: an...@mips.complang.tuwien.ac.at (Anton Ertl)
Date: Mon, 16 Jul 2007 11:28:57 GMT
Local: Mon, Jul 16 2007 7:28 am
Subject: Re: RfD: XCHAR wordset

Alex McDonald <b...@rivadpm.com> writes:
>Bernd Paysan wrote:

>[snipped]

>Unfortunately, on first analysis, this is one proposal that Win32Forth
>will not be adopting any time soon.

>Windows is UTF-16, which is not ASCII compliant. Although Windows
>provides APIs to translate from locale to locale, there is no method in
>Win32Forth to automatically identify which parameters would be required
>to be translated from XCHARs to UTF-16 and back; the programmer would be
>responsible for coding the conversions.

I don't see that you are any worse off with xchars in this situation
than with chars.

>We would need something like the proposal Anton made at EuroForth 2006
>(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
>Function Call Interface), with extensions to identify string pointers,
>before implementing this.

For strings my approach in the C interface is that one needs to
convert explicitly.  Even without Unicode, you already have the
problem of needing zero-termination in C and explicit length counts in
Forth.  Hmm, maybe we need some support words for the conversion.
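
Such support words might look roughly like the following sketch. The names
>cstring and cstring> are hypothetical, chars are assumed to be bytes, and
the caller has to supply a buffer of at least u+1 chars.

```forth
\ Sketch only: >cstring and cstring> are hypothetical names.
\ Environmental dependency: characters are stored as bytes.

: >cstring ( c-addr u buf -- buf )  \ copy string to buf, append a NUL
    dup >r                \ keep buf as the result
    2dup + >r             \ remember where the NUL goes
    swap move
    0 r> c!
    r> ;

: cstring> ( c-addr -- c-addr u )   \ length up to, not including, the NUL
    dup BEGIN  dup c@  WHILE  char+  REPEAT
    over - ;

create cbuf  16 chars allot
s" hello" cbuf >cstring cstring> type
```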

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2007: http://www.complang.tuwien.ac.at/anton/euroforth2007/


Newsgroups: comp.lang.forth
From: Alex McDonald <b...@rivadpm.com>
Date: Mon, 16 Jul 2007 05:39:13 -0700
Local: Mon, Jul 16 2007 8:39 am
Subject: Re: RfD: XCHAR wordset
On Jul 16, 12:28 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

The au would be 16bits, with a max of 127 characters in a counted
string. This might be considered too short.  It would be a pretty big
change as well, as there are a good few COUNTs and C@ in a lot of
Win32Forth code.

I didn't see an X-STRING-SIZE (a poor name, I know) in Bernd's
proposal; for conversion between encodings I would have thought it
useful.

As a general note, it's worth following the Unicode 5.0 standard for
malformed Unicode; to throw an error in all such cases. The XCHARS
standard should be explicit about which Unicode processing standard it
adheres to (or insist that the implementor name the standard).

> >We would need something like the proposal Anton made at EuroForth 2006
> >(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
> >Function Call Interface), with extensions to identify string pointers,
> >before implementing this.

> For strings my approach in the C interface is that one needs to
> convert explicitly.  Even without Unicode, you already have the
> problem of needing zero-termination in C and explicit length counts in
> Forth.  Hmm, maybe we need some support words for the conversion.

There's also a Java style null ("modified UTF-8"), encoded as 0xc0
0x80. It has some advantages, as C won't stop on it when using
strlen(), and strings with imbedded nulls can be correctly passed to C
(for instance, when using C to write to file).
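
A sketch of what storing such a null could look like on top of XC!+ from
the proposal. xc!+m is a hypothetical name and chars are assumed to be
bytes.

```forth
\ Sketch only: xc!+m is a hypothetical name; XC!+ is assumed to exist.

: xc!+m ( xc xc-addr -- xc-addr' )  \ like XC!+, but 0 is stored as C0 80
    over IF  xc!+ EXIT  THEN   \ anything but 0: normal encoding
    nip                        \ drop the 0
    192 over c!  char+         \ C0 hex
    128 over c!  char+ ;       \ 80 hex
```

Tracing the reference XC@+ by hand suggests that it decodes the two bytes
C0 80 back to 0, so no separate fetch word would be needed with that
implementation.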

Win32Forth makes sure strings are null terminated (and the programmer
needs to be aware of this when allocating buffers for string handling;
they need to be one byte longer than required by the string).