Since it's time to post RfDs, I want to throw in the updated proposal for the XCHAR wordset. I hope I have included all comments so far, and I also included a reference implementation.
Problem:
ASCII is only appropriate for the English language. Most western languages, however, fit somewhat into the Forth frame, since a byte is sufficient to encode the few special characters in each (though not always with the same encoding; Latin-1 is the most widely used). For other languages, different character sets have to be used, several of them variable-width. The most prominent representative is UTF-8. Let's call these extended characters XCHARs. Since ANS Forth specifies ASCII encoding, only ASCII-compatible encodings may be used. Fortunately, being ASCII-compatible has so many benefits that most encodings actually are ASCII-compatible.
Proposal
Datatypes:
xc is an extended char on the stack. It occupies one cell, and is a subset of unsigned cell. Note: UTF-8 cannot store more than 31 bits; on 16 bit systems, only the UCS16 subset of the UTF-8 character set can be used. Small embedded systems can keep xchars always in memory, because all words directly dealing with the xc datatype are in the XCHAR EXT wordset.
xc_addr is the address of an XCHAR in memory. Alignment requirements are the same as c_addr. The memory representation of an XCHAR differs from the stack location, and depends on the encoding used. An XCHAR may use a variable number of address units in memory.
encoding is a cell-sized opaque data type identifying a particular encoding.
Common encodings:
Input and files commonly are either encoded iso-latin-1 or utf-8. The encoding depends on settings of the computer system such as the LANG environment variable on Unix. You can use the system consistently only when you don't change the encoding, or only use the ASCII subset. Typical use is that the base system is ASCII only, and then extended encoding-specific.
Side issues to be considered:
Many Forth systems today are case insensitive, to accept lower case standard words. It is sufficient to be case insensitive for the ASCII subset to make this work - this saves a large code mapping table for comparison of other symbols. Case is mostly an issue of European languages (latin, greek, and cyrillic), but similar issues exist between traditional and simplified Chinese, and between different Latin code pages in UCS, e.g. full width vs. normal half width latin letters. Some encodings (not UTF-8) might give surprises when you use a case insensitive ASCII-compare that's 8-bit save, but not aware of the current encoding.
Words:
XC-SIZE ( xc -- u ) XCHAR EXT Computes the memory size of the XCHAR xc in address units.
X-SIZE ( xc_addr u1 -- u2 ) XCHAR Computes the memory size of the first XCHAR stored at xc_addr in address units.
XC@+ ( xc_addr1 -- xc_addr2 xc ) XCHAR EXT Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory location after xc.
XC!+ ( xc xc_addr1 -- xc_addr2 ) XCHAR EXT Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory location after xc.
XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT Stores the XCHAR xc into the buffer starting at address xc_addr1, u1 characters large. xc_addr2 points to the first memory location after xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit into the buffer, flag is true, otherwise flag is false, and xc_addr2 u2 equal xc_addr1 u1. XC!+? is save for buffer overflows, and therefore preferred over XC!+.
XCHAR+ ( xc_addr1 -- xc_addr2 ) XCHAR EXT Adds the size of the XCHAR stored at xc_addr1 to this address, giving xc_addr2.
XCHAR- ( xc_addr1 -- xc_addr2 ) XCHAR EXT Goes backward from xc_addr1 until it finds an XCHAR so that the size of this XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to work for every possible encoding.
XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2 u2 is the remaining buffer after stepping over the first XCHAR in the buffer.
-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR Step backward by one xchar in the buffer defined by xcaddr1 u1, starting at the end of the buffer. xcaddr1 u2 is the remaining buffer after stepping backward over the last XCHAR in the buffer. Unlike XCHAR-, -XSTRING can be implemented in encodings that have only a forward-working string size.
-TRAILING-GARBAGE ( xcaddr u1 -- xcaddr u2 ) XCHAR Examine the last XCHAR in the buffer xcaddr u1 - if the encoding is correct and it represents a full character, u2 equals u1, otherwise, u2 represents the string without the last (garbled) XCHAR.
X-WIDTH ( xc_addr u -- n ) XCHAR n is the number of monospace ASCII characters that take the same space to display as the XCHAR string starting at xc_addr, using u address units.
XKEY ( -- xc ) XCHAR EXT Reads an XCHAR from the terminal.
XEMIT ( xc -- ) XCHAR EXT Prints an XCHAR on the terminal.
SET-ENCODING ( encoding -- ) XCHAR EXT Sets the input encoding to the specified encoding.
GET-ENCODING ( -- encoding ) XCHAR EXT Returns the current encoding.
Encodings are implementation specific; example encoding names can be:
ISO-LATIN-1 ( -- encoding ) XCHAR EXT ISO Latin1 encoding (one byte per character)
The following words behave differently when the XCHAR extension is present:
CHAR ( "<spaces>name" -- xc ) Skip leading space delimiters. Parse name delimited by a space. Put the value of its first XCHAR onto the stack.
[CHAR] Interpretation: Interpretation semantics for this word are undefined. Compilation: ( "<spaces>name" -- ) Skip leading space delimiters. Parse name delimited by a space. Append the run-time semantics given below to the current definition. Run-time: ( -- xc ) Place xc, the value of the first XCHAR of name, on the stack.
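To make the difference between address units and XCHARs concrete, here is a minimal sketch that counts the XCHARs in a string. It is not part of the proposal: it assumes UTF-8 and 1 char = 1 address unit, and the names UTF8-LEAD? and XC-COUNT are made up for this illustration. It relies on the fact that every UTF-8 continuation byte has the form 10xxxxxx.

```forth
\ Sketch only: assumes UTF-8 and 1 char = 1 address unit.
\ UTF8-LEAD? and XC-COUNT are hypothetical names, not proposed words.
decimal

\ true unless c is a continuation byte (binary 10xxxxxx)
: utf8-lead? ( c -- flag )  192 and 128 <> ;

\ count the bytes in xc-addr u that start an XCHAR
: xc-count ( xc-addr u -- n )
    0 rot rot  over + swap        \ n end start
    ?do  i c@ utf8-lead? if 1+ then  loop ;

create demo  195 c, 164 c, 97 c,  \ a-umlaut (two bytes), then "a"
demo 3 xc-count .                 \ prints 2
```

The string occupies three address units but holds only two XCHARs, which is why every size in the word descriptions needs to say which unit it is in.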
\ environmental dependency: characters are stored as bytes
\ environmental dependency: lower case words accepted

base @ hex

80 Value maxascii

: xc-size ( xc -- n )
    dup maxascii u< IF drop 1 EXIT THEN \ special case ASCII
    $800 2 >r
    BEGIN 2dup u>= WHILE 5 lshift r> 1+ >r dup 0= UNTIL THEN
    2drop r> ;

: xc@+ ( xcaddr -- xcaddr' u )
    count dup maxascii u< IF EXIT THEN \ special case ASCII
    7F and 40 >r
    BEGIN dup r@ and WHILE r@ xor
        6 lshift r> 5 lshift >r >r count
        3F and r> or
    REPEAT r> drop ;

: xc!+ ( xc xcaddr -- xcaddr' )
    over maxascii u< IF tuck c! char+ EXIT THEN \ special case ASCII
    >r 0 swap 3F
    BEGIN 2dup u> WHILE
        2/ >r dup 3F and 80 or swap 6 rshift r>
    REPEAT 7F xor 2* or r>
    BEGIN over 80 u< 0= WHILE tuck c! char+ REPEAT nip ;

: xc!+? ( xc xcaddr u -- xcaddr' u' flag )
    >r over xc-size r@ over u< IF ( xc xc-addr1 len r: u1 )
        \ not enough space
        drop nip r> false
    ELSE
        >r xc!+ r> r> swap - true
    THEN ;
\ scan to next/previous character

: xchar+ ( xcaddr -- xcaddr' ) xc@+ drop ;
: xchar- ( xcaddr -- xcaddr' )
    BEGIN 1 chars - dup c@ C0 and maxascii <> UNTIL ;

: xstring+ ( xcaddr u -- xcaddr u' ) over + xchar+ over - ;
: xstring- ( xcaddr u -- xcaddr u' ) over + xchar- over - ;

: +xstring ( xc-addr1 u1 -- xc-addr2 u2 ) over dup xchar+ swap - /string ;
: -xstring ( xc-addr1 u1 -- xc-addr2 u2 ) over dup xchar- swap - /string ;
\ skip trailing garbage

: x-size ( xcaddr u1 -- u2 ) drop
    \ length of UTF-8 char starting at u8-addr (accesses only u8-addr)
    c@
    dup $80 u< IF drop 1 exit THEN
    dup $c0 u< IF drop 1 EXIT THEN \ really is a malformed character
    dup $e0 u< IF drop 2 exit THEN
    dup $f0 u< IF drop 3 exit THEN
    dup $f8 u< IF drop 4 exit THEN
    dup $fc u< IF drop 5 exit THEN
    dup $fe u< IF drop 6 exit THEN
    drop 1 ; \ also malformed character

: -trailing-garbage ( xcaddr u1 -- xcaddr u2 )
    2dup + dup xchar- ( addr u1 end1 end2 )
    2dup dup over over - x-size + = IF \ last character ok
        2drop
    ELSE
        nip nip over -
    THEN ;
\ utf key and emit

: xkey ( -- xc )
    key dup maxascii u< IF EXIT THEN \ special case ASCII
    7F and 40 >r
    BEGIN dup r@ and WHILE r@ xor
        6 lshift r> 5 lshift >r >r key
        3F and r> or
    REPEAT r> drop ;

: xemit ( xc -- )
    dup maxascii u< IF emit EXIT THEN \ special case ASCII
    0 swap 3F
    BEGIN 2dup u> WHILE
        2/ >r dup 3F and 80 or swap 6 rshift r>
    REPEAT 7F xor 2* or
    BEGIN dup 80 u< 0= WHILE emit REPEAT drop ;
How hard would it be to extend the reference implementation to UTF-32?
Erratum:
XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT Stores the XCHAR xc into the buffer starting at address xc_addr1, u1 characters large. xc_addr2 points to the first memory location after xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit into the buffer, flag is true, otherwise flag is false, and xc_addr2 u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and therefore preferred over XC!+.
Bernd Paysan <bernd.pay...@gmx.de> writes:
>xc_addr is the address of an XCHAR in memory. Alignment requirements are
> the same as c_addr. The memory representation of an XCHAR differs
> from the stack location, and depends on the encoding used. An XCHAR
                 ^^^^^^^^
                 representation?
>Common encodings: ... >Side issues to be considered:
These appear to be subsections that should be put in informative sections, not the normative "Proposal" section.
>Many Forth systems today are case insensitive, to accept lower case >standard words. It is sufficient to be case insensitive for the ASCII >subset to make this work - this saves a large code mapping table for >comparison of other symbols. Case is mostly an issue of European >languages (latin, greek, and cyrillic), but similar issues exist >between traditional and simplified Chinese, and between different >Latin code pages in UCS, e.g. full width vs. normal half width latin >letters. Some encodings (not UTF-8) might give surprises when you use >a case insensitive ASCII-compare that's 8-bit save, but not aware of >the current encoding.
Even in UTF-8 you can compose letters, e.g. an Umlaut-a from a diaeresis and an a, and that would be encoded differently than the Latin-1-derived Umlaut-a.
Anyway, that's not a problem we should try to solve at the Forth level, or at least not in this proposal.
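For readers who want to see the composition issue at the byte level, here is a small sketch. It assumes UTF-8 and the String wordset (for COMPARE); the names PRECOMPOSED, DECOMPOSED and SAME-BYTES? are invented for this illustration.

```forth
\ Sketch only: two UTF-8 spellings of the same visible letter.
\ PRECOMPOSED, DECOMPOSED and SAME-BYTES? are hypothetical names.
decimal

create precomposed  195 c, 164 c,          \ U+00E4 (a with diaeresis)
create decomposed    97 c, 204 c, 136 c,   \ U+0061 "a" + U+0308 combining diaeresis

\ true if both strings are byte-for-byte identical
: same-bytes? ( c-addr1 u1 c-addr2 u2 -- flag )  compare 0= ;

precomposed 2 decomposed 3 same-bytes? .   \ prints 0
```

A plain byte comparison treats the two spellings as different strings, so a dictionary search would not match them without a normalization step that is outside this proposal.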
>Words:
>XC-SIZE ( xc -- u ) XCHAR EXT >Computes the memory size of the XCHAR xc in address units.
>X-SIZE ( xc_addr u1 -- u2 ) XCHAR >Computes the memory size of the first XCHAR stored at xc_addr in >address units. ... >XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT >Stores the XCHAR xc into the buffer starting at address xc_addr1, u1 >characters large.
Shouldn't the granularity of the size specifications be the same (i.e., either aus or chars) throughout the wordset?
> xc_addr2 points to the first memory location after >xc, u2 is the remaining size of the buffer.
In what units? The size units are missing in most of the rest of the word specifications, but I do not mention this again.
>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR >Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2 >u2 is the remaining buffer after stepping over the first XCHAR in the >buffer.
>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR >Step backward by one xchar in the buffer defined by xcaddr1 u1, >starting at the end of the buffer. xcaddr1 u2 is the remaining buffer >after stepping backward over the last XCHAR in the buffer. Unlike >XCHAR-, -XSTRING can be implemented in encodings that have only a >forward-working string size.
The asymmetry in the stack effects of XSTRING+ and -XSTRING is probably hard to remember and may be confusing.
>X-WIDTH ( xc_addr u -- n ) XCHAR >n is the number of monospace ASCII characters that take the same space to >display as the the XCHAR string starting at xc_addr, using u address units.
Maybe mention that this is only relevant for monospaced displays/fonts.
>SET-ENCODING ( encoding -- ) XCHAR EXT >Sets the input encoding to the specified encoding
So there's an input encoding and an internal encoding?
Are all inputs affected? I would set file encodings per-file.
What about the output encoding?
>The following words behave different when the XCHAR extension is present:
>CHAR ( "<spaces>name" -- xc ) >Skip leading space delimiters. Parse name delimited by a space. Put the >value of its first XCHAR onto the stack.
>[CHAR] >Interpretation: Interpretation semantics for this word are undefined. > Compilation: ( "<spaces>name" -- ) >Skip leading space delimiters. Parse name delimited by a space. Append the >run-time semantics given below to the current definition. > Run-time: ( -- xc ) >Place xc, the value of the first XCHAR of name, on the stack.
I would call that an extended behaviour, not a different behaviour, because the behaviour will be the same for Forth-94 programs.
>Experience:
>Build into Gforth (development version) and recent versions of bigFORTH.
There's also at least one other implementation, lxf-ntf by Peter Falth.
>Open issues are file reading and writing (conversion on the fly or leave as >it is?).
We have not implemented it yet, but for text files the conversion to and from the internal representation should be performed by READ/WRITE-FILE/LINE. If you read it in unconverted (i.e., as binary), the program has to keep track of which buffer contains which encoding, and do the conversion itself, which is error-prone, inconvenient, and the proposal does not supply words for that. But, as mentioned above, if you really want that, you can have it by treating the file as binary.
Bruce McFarling wrote: > How hard would it be to extend the reference implementation to UTF-32?
UTF-32 is not ASCII compatible, unless you have a system where 1 CHAR = 32 bit.
> Erratum:
> XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT > Stores the XCHAR xc into the buffer starting at address xc_addr1, u1 > characters large. xc_addr2 points to the first memory location after > xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit > into the buffer, flag is true, otherwise flag is false, and xc_addr2 > u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and > therefore preferred over XC!+.
Thanks, there was another save/safe error, as well.
Anton Ertl wrote:
> Bernd Paysan <bernd.pay...@gmx.de> writes:
>>xc_addr is the address of an XCHAR in memory. Alignment requirements are
>> the same as c_addr. The memory representation of an XCHAR differs
>> from the stack location, and depends on the encoding used. An
>> XCHAR
>                 ^^^^^^^^
>                 representation?
Yes.
>>Common encodings: > ... >>Side issues to be considered:
> These appear to be subsections that should be put in informative > sections, not the normative "Proposal" section.
Moved it to an appendix.
>>XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT >>Stores the XCHAR xc into the buffer starting at address xc_addr1, u1 >>characters large.
> Shouldn't the granularity of the size specifications be the same > (i.e., either aus or chars) throughout the wordset?
>> xc_addr2 points to the first memory location after >>xc, u2 is the remaining size of the buffer.
> In what units? The size units are missing in most of the rest of the > word specifications, but I do not mention this again.
>>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR >>Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2 >>u2 is the remaining buffer after stepping over the first XCHAR in the >>buffer.
>>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR >>Step backward by one xchar in the buffer defined by xcaddr1 u1, >>starting at the end of the buffer. xcaddr1 u2 is the remaining buffer >>after stepping backward over the last XCHAR in the buffer. Unlike >>XCHAR-, -XSTRING can be implemented in encodings that have only a >>forward-working string size.
> The assymetry in the stack effects of XSTRING+ and -XSTRING is > probably hard to remember and may be confusing.
Oops, got it wrong, the description is actually of +XSTRING and XSTRING-. The sign is on the side of the string which gets modified, and indicates the direction (+ towards higher addresses, - towards lower). The sample implementation also contains the opposite partner of each of those, but that doesn't make too much sense (if you extend the buffer, you can as well use XCHAR+ and XCHAR-).
>>X-WIDTH ( xc_addr u -- n ) XCHAR >>n is the number of monospace ASCII characters that take the same space to >>display as the the XCHAR string starting at xc_addr, using u address >>units.
> Maybe mention that this is only relevant for monospaced displays/fonts.
Fonts where each character takes an integer multiple of the width of an ASCII character. Calling that "monospaced" is stretching the word a bit ;-).
>>SET-ENCODING ( encoding -- ) XCHAR EXT >>Sets the input encoding to the specified encoding
> So there's an input encoding and an internal encoding?
Actually, there's just an encoding, which is both internal (for words like XCHAR+), and external (for XKEY/XEMIT).
> Are all inputs affected? I would set file encodings per-file.
>>The following words behave different when the XCHAR extension is present:
>>CHAR ( "<spaces>name" -- xc ) >>Skip leading space delimiters. Parse name delimited by a space. Put the >>value of its first XCHAR onto the stack.
>>[CHAR] >>Interpretation: Interpretation semantics for this word are undefined. >> Compilation: ( "<spaces>name" -- ) >>Skip leading space delimiters. Parse name delimited by a space. Append >>the run-time semantics given below to the current definition. >> Run-time: ( -- xc ) >>Place xc, the value of the first XCHAR of name, on the stack.
> I would call that an extended behaviour, not a different behaviour, > because the behaviour will be the same for Forth-94 programs.
>>Experience:
>>Build into Gforth (development version) and recent versions of bigFORTH.
> There's also at least one other implementation, lxf-ntf by Peter Falth.
Fine.
>>Open issues are file reading and writing (conversion on the fly or leave >>as it is?).
> We have not implemented it yet, but for text files the conversion to > and from the internal representation should be performed by > READ/WRITE-FILE/LINE. If you read it in unconverted (i.e., as > binary), the program has to keep track of which buffer contains which > encoding, and do the conversion itself, which is error-prone, > inconvenient, and the proposal does not supply words for that. But, > as mentioned above, if you really want that, you can have it by > treating the file as binary.
I think for file encodings, we should have a word that sets the encoding of a file, like SET-FILE-ENCODING ( encoding fd -- ior ), and we also need a tag in the file to set the encoding while interpreting, i.e. SET-SOURCE-ENCODING (sets the encoding of the source file).
Unfortunately, on first analysis, this is one proposal that Win32Forth will not be adopting any time soon.
Windows is UTF-16, which is not ASCII compliant. Although Windows provides APIs to translate from locale to locale, there is no method in Win32Forth to automatically identify which parameters would be required to be translated from XCHARs to UTF-16 and back; the programmer would be responsible for coding the conversions.
We would need something like the proposal Anton made at EuroForth 2006 (http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C Function Call Interface), with extensions to identify string pointers, before implementing this.
Bernd Paysan <bernd.pay...@gmx.de> writes: >Anton Ertl wrote: >> Bernd Paysan <bernd.pay...@gmx.de> writes: >>>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR >>>Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2 >>>u2 is the remaining buffer after stepping over the first XCHAR in the >>>buffer.
>>>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR >>>Step backward by one xchar in the buffer defined by xcaddr1 u1, >>>starting at the end of the buffer. xcaddr1 u2 is the remaining buffer >>>after stepping backward over the last XCHAR in the buffer. Unlike >>>XCHAR-, -XSTRING can be implemented in encodings that have only a >>>forward-working string size.
>> The assymetry in the stack effects of XSTRING+ and -XSTRING is >> probably hard to remember and may be confusing.
>Oops, got it wrong, the description is actually of +XSTRING and XSTRING-. >The sign is on the side of the string which gets modified, and indicates >the direction (+ towards higher addresses, - towards lower). The sample >implementation also contains the opposite partner of each of those, but >that doesn't make too much sense (if you extend the buffer, you can as well >use XCHAR+ and XCHAR-).
Hmm, your mistake may indicate that this naming is error-prone, especially in implementations where the opposite partners exist.
>>>SET-ENCODING ( encoding -- ) XCHAR EXT >>>Sets the input encoding to the specified encoding
>> So there's an input encoding and an internal encoding?
>Actually, there's just an encoding, which is both internal (for words like >XCHAR+), and external (for XKEY/XEMIT).
I think that no word for changing the internal encoding should be standardized. Or if you standardize it, it should fail if the new internal encoding is not an extension of the old one (i.e., ASCII->Latin-1 ok, ASCII->UTF-8 ok, but Latin-1->UTF-8 fails); since this is a one-way street, GET-ENCODING makes little sense.
Otherwise a standard program could contain strings in different, incompatible encodings, some of them in system-controlled strings (e.g., word names), controlled by a global state variable. This would be worse than STATE and BASE. No need to introduce another such mistake.
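The "one-way street" rule could be sketched roughly as follows. This is only an illustration of the idea: the encoding tokens, INTERNAL-ENCODING, EXTENDS? and EXTEND-ENCODING are all hypothetical names, and real encodings would be opaque, implementation-defined values.

```forth
\ Sketch only: every name below is hypothetical, not a proposed word.
decimal

0 constant enc-ascii
1 constant enc-latin1
2 constant enc-utf8

variable internal-encoding   enc-ascii internal-encoding !

\ true if new is the same as old, or old is plain ASCII
: extends? ( new old -- flag )
    2dup = >r  nip enc-ascii =  r> or ;

\ switch to new only if it extends the current internal encoding
: extend-encoding ( new -- )
    dup internal-encoding @ extends? 0=
    abort" new encoding does not extend the old one"
    internal-encoding ! ;

enc-latin1 extend-encoding     \ ASCII -> Latin-1: accepted
\ enc-utf8 extend-encoding     \ Latin-1 -> UTF-8: would abort
```

Because the switch can only ever add characters, strings already held by the system stay valid, and there is nothing useful to restore, which is the argument against a GET-ENCODING counterpart.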
>>>Open issues are file reading and writing (conversion on the fly or leave >>>as it is?).
>> We have not implemented it yet, but for text files the conversion to >> and from the internal representation should be performed by >> READ/WRITE-FILE/LINE. If you read it in unconverted (i.e., as >> binary), the program has to keep track of which buffer contains which >> encoding, and do the conversion itself, which is error-prone, >> inconvenient, and the proposal does not supply words for that. But, >> as mentioned above, if you really want that, you can have it by >> treating the file as binary.
>I think for file encodings, we should have a word that sets the encoding of >a file, like SET-FILE-ENCODING ( encoding fd -- ior ),
The primary method should work through OPEN-FILE and CREATE-FILE (e.g., by specifying the encoding in the fam). But yes, a word like SET-FILE-ENCODING is useful when the program learns about the encoding later (e.g., when the encoding is specified at the start of the file).
> and we also need a >tag in the file to set the encoding while interpreting, i.e. >SET-SOURCE-ENCODING (sets the encoding of the source file).
Alex McDonald <b...@rivadpm.com> writes: >Bernd Paysan wrote:
>[snipped]
>Unfortunately, on first analysis, this is one proposal that Win32Forth >will not be adopting any time soon.
>Windows is UTF-16, which is not ASCII compliant. Although Windows >provides APIs to translate from locale to locale, there is no method in >Win32Forth to automatically identify which parameters would be require >to be translated from XHCARS to UTF-16 and back; the programmer would be >responsible for coding the conversions.
I don't see that you are any worse off with xchars in this situation than with chars.
>We would need something like the proposal Anton made at EuroForth 2006 >(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C >Function Call Interface), with extensions to identify string pointers, >before implementing this.
For strings my approach in the C interface is that one needs to convert explicitly. Even without Unicode, you already have the problem of needing zero-termination in C and explicit length counts in Forth. Hmm, maybe we need some support words for the conversion.
> Alex McDonald <b...@rivadpm.com> writes: > >Bernd Paysan wrote:
> >[snipped]
> >Unfortunately, on first analysis, this is one proposal that Win32Forth > >will not be adopting any time soon.
> >Windows is UTF-16, which is not ASCII compliant. Although Windows > >provides APIs to translate from locale to locale, there is no method in > >Win32Forth to automatically identify which parameters would be require > >to be translated from XHCARS to UTF-16 and back; the programmer would be > >responsible for coding the conversions.
> I don't see that you are any worse off with xchars in this situation > than with chars.
The au would be 16 bits, with a max of 127 characters in a counted string. This might be considered too short. It would be a pretty big change as well, as there are a good few COUNTs and C@s in a lot of Win32Forth code.
I didn't see an X-STRING-SIZE (a poor name, I know) in Bernd's proposal; for conversion between encodings I would have thought it useful.
As a general note, it's worth following the Unicode 5.0 standard for malformed Unicode, that is, throwing an error in all such cases. The XCHARS standard should be explicit about which Unicode processing standard it adheres to (or insist that the implementor name the standard).
> >We would need something like the proposal Anton made at EuroForth 2006 > >(http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C > >Function Call Interface), with extensions to identify string pointers, > >before implementing this.
> For strings my approach in the C interface is that one needs to > convert explicitly. Even without Unicode, you already have the > problem of needing zero-termination in C and explicit length counts in > Forth. Hmm, maybe we need some support words for the conversion.
There's also a Java-style null ("modified UTF-8"), encoded as 0xc0 0x80. It has some advantages, as C won't stop on it when using strlen(), and strings with embedded nulls can be correctly passed to C (for instance, when using C to write to a file).
Win32Forth makes sure strings are null terminated (and the programmer needs to be aware of this when allocating buffers for string handling; they need to be one byte longer than required by the string).
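As a rough illustration of the explicit conversion and the "one byte longer" rule, a support word might look like the sketch below. >ZSTRING and ZBUF are hypothetical names, not Win32Forth words, and the code assumes 1 char = 1 address unit and a caller-supplied buffer of at least u+1 chars.

```forth
\ Sketch only: >ZSTRING is a hypothetical name, not a Win32Forth word.
\ buf must have room for u+1 chars; assumes 1 char = 1 address unit.
: >zstring ( c-addr u buf -- buf )
    dup >r                \ keep buf for the result
    swap 2dup + >r        \ remember where the terminator goes
    move                  \ copy the u chars into buf
    0 r> c!               \ append the zero byte
    r> ;

create zbuf 4 allot       \ 3 chars plus the terminator
s" abc" zbuf >zstring drop
```

Note that a string containing an embedded zero byte would still be cut short on the C side, which is the case the 0xc0 0x80 encoding mentioned above is meant to address.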