RfD: XCHAR wordset (for UTF-8 and alike)


Bernd Paysan

Sep 25, 2005, 6:16:25 PM
to
Problem:

ASCII is only appropriate for the English language. Most Western languages,
however, fit somewhat into the Forth frame, since a byte is sufficient to
encode the few special characters in each (though not always with the same
encoding; latin-1 is the most widely used). For other languages, different
character sets have to be used, several of them variable-width. The most
prominent representative is UTF-8. Let's call these extended characters
XCHARs. Since ANS Forth specifies ASCII encoding, only ASCII-compatible
encodings may be used.

Proposal

Datatypes:

xc is an extended char on the stack. It occupies one cell, and is
a subset of unsigned cell. Note: UTF-8 cannot store more than 31
bits; on 16-bit systems, only the UCS16 subset of the UTF-8
character set can be used.
xc_addr is the address of an XCHAR in memory. Alignment requirements are
the same as for c_addr. The memory representation of an XCHAR differs
from the stack representation, and depends on the encoding used. An XCHAR
may use a variable number of address units in memory.

Common encodings:

Input and files are commonly encoded either in iso-latin-1 or in utf-8. The
encoding depends on settings of the computer system, such as the LANG
environment variable on Unix. You can use the system consistently only when
you don't change the encoding, or when you only use the ASCII subset.

Words:

XC-SIZE ( xc -- u )
Computes the memory size of the XCHAR xc in address units.

XC@+ ( xc_addr1 -- xc_addr2 xc )
Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XC!+ ( xc xc_addr1 -- xc_addr2 )
Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XCHAR+ ( xc_addr1 -- xc_addr2 )
Adds the size of the XCHAR stored at xc_addr1 to this address, giving
xc_addr2.

XCHAR- ( xc_addr1 -- xc_addr2 )
Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
work for every possible encoding.

X-SIZE ( xc_addr u -- n )
n is the number of monospace ASCII characters that take the same space to
display as the XCHAR string starting at xc_addr, using u address units.

XKEY ( -- xc )
Reads an XCHAR from the terminal.

XEMIT ( xc -- )
Prints an XCHAR on the terminal.

The following words behave differently when the XCHAR extension is present:

CHAR ( "<spaces>name" -- xc )
Skip leading space delimiters. Parse name delimited by a space. Put the
value of its first XCHAR onto the stack.

[CHAR]
Interpretation: Interpretation semantics for this word are undefined.
Compilation: ( "<spaces>name" -- )
Skip leading space delimiters. Parse name delimited by a space. Append the
run-time semantics given below to the current definition.
Run-time: ( -- xc )
Place xc, the value of the first XCHAR of name, on the stack.

Reference implementation:

Unfortunately, both the Gforth and the bigFORTH implementation have several
system-specific parts.
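
For illustration only, here is a minimal, system-independent sketch of the
UTF-8 case on a byte-addressed system (no handling of malformed sequences,
sequences limited to four bytes, and the helper U8-TAIL is not part of the
proposal):

: xc-size ( xc -- u )   \ bytes needed to encode code point xc
  dup $80    u< IF drop 1 EXIT THEN
  dup $800   u< IF drop 2 EXIT THEN
  dup $10000 u< IF drop 3 EXIT THEN
  drop 4 ;

: u8-tail ( xc_addr xc1 -- xc_addr' xc2 )   \ merge one continuation byte
  6 lshift >r count $3F and r> or ;

: xc@+ ( xc_addr1 -- xc_addr2 xc )
  count
  dup $80 u< IF EXIT THEN                            \ 1 byte: plain ASCII
  dup $E0 u< IF $1F and u8-tail EXIT THEN            \ 2-byte sequence
  dup $F0 u< IF $0F and u8-tail u8-tail EXIT THEN    \ 3-byte sequence
  $07 and u8-tail u8-tail u8-tail ;                  \ 4-byte sequence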

Experience:

Built into Gforth (development version) and recent versions of bigFORTH.
Open issues are file reading and writing (conversion on the fly or leave as
it is?).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bruce McFarling

Sep 26, 2005, 2:29:05 AM
to

Bernd Paysan wrote:
> Problem:

> ASCII is only appropriate for the English language. Most Western languages,
> however, fit somewhat into the Forth frame, since a byte is sufficient to
> encode the few special characters in each (though not always with the same
> encoding; latin-1 is the most widely used).

> For other languages, different character sets have to be used, several of
> them variable-width. The most prominent representative is UTF-8. Let's call
> these extended characters XCHARs. Since ANS Forth specifies ASCII
> encoding, only ASCII-compatible encodings may be used.

> Experience:

> Built into Gforth (development version) and recent versions of bigFORTH.
> Open issues are file reading and writing (conversion on the fly or leave as
> it is?).

The first thing to settle is whether XCHARS are "these" extended
character sets that are upwardly compatible with printable ASCII, or
"this" extended character set. And I could well see a wish to use, eg,
UTF-8 in file storage (if my primary targets were Europe, Africa, and
the Americas) and UTF-16 in processing.

It seems to me that, since you can always tell where a UTF character
begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
but you need to know WHICH it is as well as endianness for UTF16
and UTF32, the most coherent thing to do is to have AN XCHAR
representation for processing and a set of file modes that specify the
kind of file you are loading:

* ASCII (latin-1, etc, any fixed 8-bit code pages)
* UTF8
* UTF16 (endianness of your system)
* UTF32 (endianness of your system)
* UTF16B
* UTF16L
* UTF32B
* UTF32L

Then if the file mode matches the system mode, you just load the file,
if it mismatches, it is translated on the fly on reading and writing.
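
By way of a concrete (and purely hypothetical) example, a bulk translation
from a fixed 8-bit code page such as latin-1 into whatever XCHAR encoding
the system uses could be written with XC!+ from the RfD; the name and the
assumption that the destination buffer is large enough are mine:

: latin1>xchars ( src u dest -- dest u2 )   \ sketch, no overflow checking
  dup 2swap over + swap ?DO   ( dest dest-ptr )
    i c@ swap xc!+            \ each latin-1 byte is the same code point
  LOOP
  over - ;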

Obviously the system mode would be a thing for a system query.

Bernd Paysan

Sep 26, 2005, 5:35:12 AM
to fort...@yahoogroups.com
Bruce McFarling wrote:

> The first thing to settle is whether XCHARS are "these" extended
> character sets that are upwardly compatible with printable ASCII, or
> "this" extended character set. And I could well see a wish to use, eg,
> UTF-8 in file storage (if my primary targets were Europe, Africa, and
> the Americas) and UTF-16 in processing.
>
> It seems to me that, since you can always tell where a UTF character
> begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
> but you need to know WHICH it is as well as endianness for UTF16
> and UTF32, the most coherent thing to do is to have AN XCHAR
> representation for processing and a set of file modes that specify the
> kind of file you are loading:
>
> * ASCII (latin-1, etc, any fixed 8-bit code pages)

Though, depending on the fixed code-page, the translation will be different
(latin-1 different from latin-2).

> * UTF8
> * UTF16 (endianness of your system)
> * UTF32 (endianness of your system)
> * UTF16B
> * UTF16L
> * UTF32B
> * UTF32L

You can add a few other encodings. UCS16 managed to have an easy conversion
from several previous ASCII-compatible encodings, even though the code
pages of the non-ASCII portion move within UCS16 (e.g. the GB2312 format).
Which encodings are actually known to the Forth system would be the subject
of a query, too.

> Then if the file mode matches the system mode, you just load the file,
> if it mismatches, it is translated on the fly on reading and writing.
>
> Obviously the system mode would be a thing for a system query.

Exactly.

Stephen Pelc

Sep 26, 2005, 6:17:02 AM
to
On Mon, 26 Sep 2005 00:16:25 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>ASCII is only appropriate for the English language. Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though). For other
>languages, different char-sets have to be used, several of them
>variable-width. Most prominent representant is UTF-8. Let's call these
>extended characters XCHARs. Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

How does this fit in with the wide character and internationalisation
proposals at
www.mpeforth.com/arena/
i18n.propose.v7.PDF
i18n.widechar.v7.PDF
These proposals/RFCs are from the application developers point of
view. There's a sample implementation in the file
LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
is derived from 15+ years of experience. From the file header:

"You are free to use this code in any way, as long as the MPE
copyright notice in this section is retained.

This code is an implementation of the draft ANS internationalisation
specification available from the download area of the MPE web site.
The implementation provides more functionality than is required by
the ANS draft standard and provides enough hooks to be the basis of
a practical system."



>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.

IMHO standardising a word that can't be guaranteed to work is not
beneficial. If you must step back through a string, extend the
definition of /STRING to form /-STRING or some such, such that
the start of the string must be at the start of a character.

IMHO your approach is from the implementor's perspective, which is
valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
that what *applications* do with strings is at a *much* higher level
than implementors issues.

Can we merge the application developer issues with the kernel
issues? These include cleaning up the meaning of character,
byte/octet access, file words and so on.

I look forward to discussing these issues at EuroForth 2005.

Stephen


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Bernd Paysan

Sep 26, 2005, 8:31:48 AM
to
Stephen Pelc wrote:
> How does this fit in with the wide character and internationalisation
> proposals at
> www.mpeforth.com/arena/
> i18n.propose.v7.PDF
> i18n.widechar.v7.PDF
> These proposals/RFCs are from the application developers point of
> view. There's a sample implementation in the file
> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
> is derived from 15+ years of experience. From the file header:

The main difference with the i18n.widechar.v7.PDF proposal is that our
proposal (Anton's and mine) doesn't distinguish between development character
set and application character set. I think this distinction is unnatural
and only valid in a historical context, e.g. the different code-pages used
in DOS-based Windows, and wide characters, which won't coexist with ASCII.

The string-based localization proposal in i18n.propose.v7.PDF is orthogonal
to the character issue, and works regardless of the coding system, as
strings always stay strings.

I would welcome it if you set up an RfD for your proposal.

>>XCHAR- ( xc_addr1 -- xc_addr2 )
>>Goes backward from xc_addr1 until it finds an XCHAR so that the size of
>>this XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed
>>to work for every possible encoding.
>
> IMHO standardising a word that can't be guaranteed to work is not
> beneficial. If you must step back through a string, extend the
> definition of /STRING to form /-STRING or some such, such that
> the start of the string must be at the start of a character.

Quite a number of variable width wide-char encodings, especially UTF-8,
allow stepping both forward and backward a character at a time. Another
possible compromise is to simply outlaw those variable width wide-char
encodings that don't allow stepping back. UTF-8 allows finding the next and
the previous character regardless of where you point. Some of the Chinese
encodings can do the same: the first byte of a double-byte glyph there has
the MSB set, the second clear.

It's like seeking in a file. Not all files allow seeking (pipes and sockets,
e.g., won't). Seeking is a useful activity, though. Adding an X/STRING
( xc_addr u n -- xc_addr' u' ) isn't much trouble. n would be the
number of XCHARs to step forward (positive) or backward (negative).
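
A sketch of that word, built on XCHAR+ / XCHAR- and the standard /STRING
(the single-step helpers are only illustrative, and there is no bounds
checking), might be:

: +x/string ( xc_addr1 u1 -- xc_addr2 u2 )   \ skip one xchar forward
  over dup xchar+ swap - /string ;

: -x/string ( xc_addr1 u1 -- xc_addr2 u2 )   \ step one xchar backward
  over dup xchar- swap - /string ;

: x/string ( xc_addr1 u1 n -- xc_addr2 u2 )
  dup 0< IF
    negate 0 ?DO -x/string LOOP
  ELSE
    0 ?DO +x/string LOOP
  THEN ;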

The question is rather what XCHAR- should do when it fails. It can throw an
error, just as it can when it encounters a bad encoding.

> IMHO your approach is from the implementor's perspective, which is
> valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
> that what *applications* do with strings is at a *much* higher level
> than implementors issues.

Especially when they finally use some OS function to paint the text on the
screen. On the other hand, when they use something integrated into the
Forth system (like MINOS), they use the DCS to display things on screen.

Using UTF-8 internally is even possible for a Windows Forth, though you then
have to go through hoops to call TextOutW correctly (AFAIK it doesn't even
know how to deal with combining characters). So far, I haven't ported the
UTF-8 stuff to Windows, and concluded that it's easier to make the Windows
MINOS version use the same iso-latin-1 DCS as it always did. But then,
bigFORTH on Windows is not really supported.

> Can we merge the application developer issues with the kernel
> issues? These include cleaning up the meaning of character,
> byte/octet access, file words and so on.

Good idea.

Stephen Pelc

Sep 26, 2005, 11:13:32 AM
to
On Mon, 26 Sep 2005 14:31:48 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>The main difference with the i18n.widechar.v7.PDF proposal is that our
>proposal (Anton's and mine) doesn't distinguish between development character
>set and application character set. I think this distinction is unnatural
>and only valid in a historical context, e.g. the different code-pages used
>in DOS-based Windows, and wide characters, which won't coexist with ASCII.

Unfortunately I have to disagree here. Even if you can get to one
encoding from the UTF-xxx family in the long term, applications
written in South Africa (development character set, DCS) must be able
to be hosted and configured on a PC running a Chinese-xxx version
of some operating system (operating character set, OCS) and used by
a Russian-xxx speaker (application character set, ACS). This is a
mix that has been seen "in the wild" - it is not a hypothetical scenario.

The impact of ACS is not necessarily in the encoding, but in
how the application presents information and the order of
text substitutions, e.g. subject/verb/object and time/manner/place.
Then there's the date/time display nightmare and ...

I really wish we could embrace a single encoding, but there are
Forth applications out there with 15-20 years of history.

>I would welcome it if you set up an RfD for your proposal.

Let's reserve time for it at EuroForth. Those who want to join a mail
list for this topic should email me directly. I will re-establish
the locale and other mailing lists when our servers have recovered
from the plumbing alterations at Hill Lane.

>Another
>possible compromise is to simply outlaw those variable width wide-char
>encodings that don't allow stepping back.

Tell that to an application developer and they will ignore you. Such
encodings exist and are used. In our experience, stepping back through
strings is most often encountered in file handling and affects DCS and
OCS rather than ACS.

>> Can we merge the application developer issues with the kernel
>> issues? These include cleaning up the meaning of character,
>> byte/octet access, file words and so on.
>
>Good idea.

Will you be at EuroForth?

Albert van der Horst

Sep 26, 2005, 5:30:20 AM
to
In article <p6kj03-...@vimes.paysan.nom>,

Bernd Paysan <bernd....@gmx.de> wrote:
>Problem:
>
>ASCII is only appropriate for the English language.

Hardly. English has given up one of the most important
advantages of a phonetic system. It is unpronounceable.
I am thinking about a phonetically correct spelling of
English, and it would need a host of diacritical marks,
just like every other language.

> Most Western languages,
>however, fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always with the same
>encoding; latin-1 is the most widely used). For other languages, different
>character sets have to be used, several of them variable-width. The most
>prominent representative is UTF-8. Let's call these extended characters
>XCHARs. Since ANS Forth specifies ASCII encoding, only ASCII-compatible
>encodings may be used.

>
>Proposal
<SNIP>

One of the problems, and I think it is a design issue we have
inherited from C, is the mess resulting from using characters
as address units (in Forth parlance).
In Forth, with all the embedded programming, we really need
a means to address bytes. I would like to split off from
the character handling in Forth all that is in fact intended
to handle, let's say, assembler-level programming.
This would make character handling much cleaner, and a better
starting point for extending the real character handling.

It is my hope that we need not introduce a new type for chars besides
the byte type that we need anyhow, and the normal CHAR.
Why would CHAR <some extended character> not fit in a Forth
character (provided we do not try to use it at the same time for
things like a length, as exemplified by the ugly word COUNT)?

In fact bytes are somehow already in place through the concept of the
address unit. We only need to flesh it out a little.
Note that there is still *no* Forth word to fetch or store
the content of an address unit.
An address unit is the smallest part of memory that can be
addressed, i.e. fetched or stored. But it can't be, because there
are no words for it.

>--

Groetjes Albert

--
Albert van der Horst,Oranjestr 8,3511 RA UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
alb...@spenarnc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

Bernd Paysan

Sep 26, 2005, 5:13:48 PM
to
Stephen Pelc wrote:

> On Mon, 26 Sep 2005 14:31:48 +0200, Bernd Paysan <bernd....@gmx.de>
> wrote:
>
>>The main difference with the i18n.widechar.v7.PDF proposal is that our
>>proposal (Anton's and mine) doesn't distinguish between development
>>character set and application character set. I think this distinction is
>>unnatural and only valid in a historical context, e.g. the different
>>code-pages used in DOS-based Windows, and wide characters, which won't
>>coexist with ASCII.
>
> Unfortunately I have to disagree here. Even if you can get to one
> encoding from the UTF-xxx family in the long term, applications
> written in South Africa (development character set, DCS) must be able
> to be hosted and configured on a PC running a Chinese-xxx version
> of some operating system (operating character set, OCS) and used by
> a Russian-xxx speaker (application character set, ACS). This is a
> mix that has been seen "in the wild" - it is not a hypothetical scenario.

The way it works in Unix/Linux (the platform where it really works) is to
use a single encoding, UTF-8, for everything. Unix platforms and Linux have
been delivered with UTF-8 support for some years now, and recently it's often
the default setting. I have absolutely no problem installing a SuSE system
with two dozen languages all available to the user, depending only on the
$LANG variable - and sharing documents with each other.

AFAIK, even Windows has some variants that ship with a multi-language
system, though in Windows, lots of system internals depend on the language
(such as the "Program Files" directory, or "My Documents"). Windows
supports Unicode as one of the codespaces, though UTF-8 support would be
left to the application (several do use it already, but most of them are
ported over from Unix).

But the XCHAR proposal is really not about having UTF-8 everywhere, but
about dealing with variable-width wide characters. Fixed-width wide characters
are a subset of that, though they take the ASCII compatibility away, and
being incompatible with the DCS opens the can of worms you have with your
OCS != DCS != ACS.

> The impact of ACS is not necessarily in the encoding, but in
> how the application presents information and the order of
> text substitutions, e.g. subject/verb/object and time/manner/place.
> Then there's the date/time display nightmare and ...

That's another question, but not bound to the character encoding itself.

> I really wish we could embrace a single encoding, but there are
> Forth applications out there with 15-20 years of history.

The vast majority of Forth programs, however, have DCS=OCS=ACS. And since
the OCS is now often enough UTF-8 by default, we should be able to handle that.

There might be a place for a more complicated scheme even in the future, but
so far I see DCS != OCS != ACS as a result of bad decisions in operating
system design. Such things are better solved outside the scope of a
general standard (i.e. in a rather specific standard "how do I overcome this
particular problem with the popular brainfuck operating system").

Having DCS != OCS/ACS is something that works for batch compiled programming
languages. There's still the problem of the string constants, but the
localization mapping handles that (you don't have strings in the user's
language around in your primary source code).

This, however, means that you enforce a particular way to deal with your
development system and your localization. That particular way is something
I really don't want in Forth. E.g. I could write some turtle graphics for
children, and it certainly is necessary that it can be used in their
native language. On the other hand, it's quite obvious that it will use the
Forth interpreter. So it's definitely DCS, and the localization is a file
with lots of ' xxx alias yyy commands.

It all reminds me of target compilers. You jump through hoops because you
don't have your target system available. That is all well if you need it.
It's not something that should have an impact on the design of a Forth
system where build=host=target.

>>Another
>>possible compromise is to simply outlaw those variable width wide-char
>>encodings that don't allow stepping back.
>
> Tell that to an application developer and they will ignore you.

That's true.

> Such encodings exist and are used.

Unfortunately. For me, these encodings are other people's problems ;-).

> In our experience, stepping back through
> strings is most often encountered in file handling and affects DCS and
> OCS rather than ACS.

I use stepping backwards mostly in editing code, that's ACS.

>>> Can we merge the application developer issues with the kernel
>>> issues? These include cleaning up the meaning of character,
>>> byte/octet access, file words and so on.
>>
>>Good idea.
>
> Will you be at EuroForth?

Unfortunately not. I originally booked my holiday before, but then had to
shift my trip by three weeks. So I'm now on the other side of the
world when EuroForth is on :-(.

Bruce McFarling

Sep 26, 2005, 11:37:16 PM
to

Albert van der Horst wrote:
> It is my hope that we need not introduce a new type for chars besides
> the byte type that we need anyhow, and the normal CHAR.
> Why would CHAR <some extended character> not fit in a Forth
> character (provided we do not try to use it at the same time for
> things like a length, as exemplified by the ugly word COUNT)?

But the RfD is moving in the direction you want, in which characters
are treated as character set entities. After all, while a UTF-8
encoding is perfectly regular, any given character may be one, two,
three, or four bytes long.

COUNT is perfectly useful and clean. It's just using it to count, with
the attendant limitation of counts to the width of a uniform-width
character set, that is obsolete.

Bruce McFarling

Sep 27, 2005, 1:22:34 AM
to

Stephen Pelc wrote:
> How does this fit in with the wide character and internationalisation
> proposals at
> www.mpeforth.com/arena/
> i18n.propose.v7.PDF
> i18n.widechar.v7.PDF
> These proposals/RFCs are from the application developers point of
> view. There's a sample implementation in the file
> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
> is derived from 15+ years of experience. From the file header:

WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
of text processing and place them in the realm of networking standards
compliance. And a subset of the XCHAR words would suggest how to
handle them:

OCTET-SIZE ( -- u )
The memory size of a Byte in address units.

OCTET@+ ( oct_addr1 -- oct_addr2 oct )
Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
location after oct.

OCTET!+ ( oct oct_addr1 -- oct_addr2 )
Stores the OCTET oct at oct_addr1. oct_addr2 points to the first memory
location after oct.

OCTET+ ( oct_addr1 -- oct_addr2 )
Adds the size of an OCTET to oct_addr1, giving oct_addr2.

OCTET- ( oct_addr1 -- oct_addr2 )
Subtracts the size of an OCTET from oct_addr1, giving oct_addr2.

After all, XCHARs do not get rid of the possibility that CHARs may be
16 bits wide, though they may be of use for 8-bit data when the CHARs
are 16 bits wide.

Bruce McFarling

Sep 27, 2005, 1:26:20 AM
to
Stephen Pelc wrote:
> IMHO your approach is from the implementor's perspective, which is
> valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
> that what *applications* do with strings is at a *much* higher level
> than implementors issues.

It's not from an implementor's perspective, because I ain't an implementor.
It's from a text processing perspective. Almost all applications must
use strings to communicate with the user, but only text processing
applications have to ... err, process text.

Since I have a bit of a non-professional interest in text processing
(for my job, I mostly just generate it), I'll have a crack at
"addressing" the interaction between this and the i18n proposal.

The ACS only critically depends on the language in use if it is an
8-bit code page. If it is UTF-8, UTF-16 or UTF-32, it does not change
when the language changes, that being the point of the Unicode
Transformation Formats. Display, input, etc. may have to change, but not
the character set per se. And while the XCHAR proposal is focused on
UTF-8, it also fits UTF-16, especially for a historian,
archaeologist, or anthropologist who needs to work with archaic or
uncommon languages that may require characters above the 16-bit plane.

If the ACS is an 8-bit code page, the only thing that is likely to
change as a result of something you've done in the i18n system is
sorting order. AND SORTING ORDER IS NOT A CHARACTER SET ISSUE! It's a
language issue.

OK, now, any translation REQUIRED between the system and the DCS is
built into the implementation. That includes any text files read or
written by the system.

So, if "ASCII" is taken to mean, "ASCII, possibly extended by a
language-specific code page", the four most common OCS/ACS combinations
are:

ACS=ASCII, DCS=ASCII
ACS=ASCII, DCS=UTF-#
ACS=UTF-#, DCS=ASCII
ACS=UTF-#, DCS=UTF-#

The translation issues are:

* ACS=ASCII, DCS=ASCII
They happen to be different code pages. KEY, EMIT, [CHAR] and CHAR may
have an issue of which code page you are talking about. But neither of
them involves XCHARs.

* ACS=ASCII, DCS=UTF-#
The only question is whether XKEY/XEMIT are in Application space or
Developer space, or are transitions between the two.

I don't see how the input can be FROM developer space and output TO
developer space (programming utilities, after all, are only
applications that happen to work in the developer's language, so ACS
happens to EQUAL DCS), so there are only two possibilities:

** If XKEY/XEMIT are entirely in Application Space, no possible dramas,
no matter what character set that is. As XKEY's they are just
arbitrary chunks of bits measured in arbitrary address units.

** If XKEY/XEMIT bring ACS characters into Developer Space and then out
again, then translations occur from ASCII + "SET LANGUAGE" code page to
UTF. If the application is internationalised, all characters emitted
will be from input or from resource files, so there is never any "CS
won't translate" problem.

* ACS=UTF-#, DCS=ASCII

** If XKEY/XEMIT bring ACS into Developer space, there is a potential
translation problem, in that not all UTF-# encoded characters will fit
into any given 8-bit code page.

* ACS=UTF-#, DCS=UTF-#

** For this, there is no XKEY/XEMIT translation barrier, even if they
are different UTF's (say the developer is Han Chinese, and so prefers
to develop in UTF-16, or is working with an OS that relies on UTF-16,
but is writing for an Atlantic Zone audience internationalised into
English/Spanish/French/Portuguese and so prefers UTF-8 as the ACS),
since there are well-defined translations between any well-encoded UTF
character. There is translation overhead, but that is all.

** For this, the problem is that there need to be DIFFERENT "XKEY"'s if
they are different encodings of the same character set.

To my mind, XCHARs belong to the Application Character Set, since the
kind of thing that can be portable between systems is more text
processing applications than how a particular system may talk to its
underlying operating system.

Further, XCHARs are quite clearly NEEDED for text processing in an
ACS, since CHARs suffice for ASCII code-page encodings, but not for
UTF-# encodings of THE SAME CS, and an ASCII code page does not accommodate
all character sets.

And for things like searching source for a particular definition, just
set the ACS to the DCS.

This is orthogonal to my earlier comment. My earlier comment presumes
that XCHARS are for what might be termed the "Memory Storage CS", not
for what may be termed the "Permanent Storage CS", which may well be
different. XCHARS define a translation between the stack and Memory
Storage. File words bring parts of files into Memory Storage. Hence
my argument that there should be file modes that handle that
translation (which can be done in bulk). And indeed, in a certain
sense that needs to be done in the file word, because the file words
are designed to bring parts of files into ALLOCATED parts of storage,
so the file words should only bring as much as can fit into the
allocated part of storage under the Memory Storage CS.

On the other hand, while XCHARs are required in ACS land, the ACS is
subject to change. And it doesn't make sense to change it "behind the
back" of the I18N words. So that suggests that the SET LANGUAGE system
ought to include an ability to set the default working character set
encoding and the default permanent storage character set encoding.

There is no need for a portable program to SET the ACS encoding. But
it may have to be able to QUERY the ACS encoding, and then to be able
to associate that with a particular collection of text in memory so
that if necessary it can RESTORE the ACS encoding to what was in place
when that text went into memory.

Bernd Paysan

Sep 27, 2005, 4:57:17 AM
to
Bruce McFarling wrote:
> WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
> of text processing and place them in the realm of networking standards
> compliance. And a subset of the XCHAR words would suggest how to
> handle them:
>
> OCTET-SIZE ( -- u )
> The memory size of a Byte in address units.
>
> OCTET@+ ( oct_addr1 -- oct_addr2 oct )
> Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
> location after oct.
>
> OCTET!+ ( oct oct_addr1 -- oct_addr2 )
> Stores the OCTET oct at oct_addr1. oct_addr2 points to the first memory
> location after oct.
>
> OCTET+ ( oct_addr1 -- oct_addr2 )
> Adds the size of an OCTET to oct_addr1, giving oct_addr2.
>
> OCTET- ( oct_addr1 -- oct_addr2 )
> Subtracts the size of an OCTET from oct_addr1, giving oct_addr2.
>
> After all, XCHARs do not get rid of the possibility that CHARs may be
> 16 bits wide, though they may be of use for 8-bit data when the CHARs
> are 16 bits wide.

Another missing part of my XCHAR proposal is how to change the way these
XCHARs are handled. ATM, I say the system deals with that, depending on
user settings (e.g. LANG environment variable). What's obvious is that
there's a way to deal with several encodings, and OCTET could be one of
them.

OCTET-SIZE still would be ( xc -- u ), to fit into the general stack
picture, but the u would not depend on xc.

Since the actually available encodings are rather system-dependent, I
suggest that the system documentation lists available encodings and ways to
set them. E.g.

XC-CODING ( xc-id -- ) set XC encoding.

XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

ASCII ( -- xc-id ) Format: ASCII characters. The lowest 7 bits of xc are
stored in memory; it is not defined what happens with bit 8.

OCTET ( -- xc-id ) Format: Octets. The lowest 8 bits of xc are stored in
memory. This encoding is compatible with packed ASCII strings.

UTF-8 ( -- xc-id ) Format: UTF-8 characters. This encoding is compatible
with packed ASCII strings.

UTF-16 ( -- xc-id ) Format: UTF-16 characters. This encoding is not
compatible with packed ASCII strings, but ASCII strings can be converted.
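
Purely by way of illustration, and with all of the names above being
suggestions rather than existing words, usage might look like this (the
helper and the file name are made up):

utf-8 xc-coding                        \ xchars in memory are UTF-8 from now on

: open-utf8-file ( c-addr u -- fid )   \ hypothetical helper
  r/o open-file throw
  utf-8 over xc-file-mode ;            \ mark the file as UTF-8 encoded

s" data.txt" open-utf8-file  ( fid )   \ ready for the usual file words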

This, however, is the part of the system which is still open, so I can't say
there is enough experience to push an RfD through.

Bruce McFarling

Sep 27, 2005, 7:26:52 AM
to

Bernd Paysan wrote:
> OCTET-SIZE still would be ( xc -- u ), to fit into the general stack
> picture, but the u would not depend on xc.

Or not be a word at all, but rather be a query, since it won't be
changing and won't need any magic going on behind the back of the
author of portable code to make the portable code work.

> Since the actually available encodings are rather system-dependent, I
> suggest that the system documentation lists available encodings and ways to
> set them. E.g.

> XC-CODING ( xc-id -- ) set XC encoding.

> XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

I would stress that more important than the ability to generate xc-id's
is the ability to get the CURRENT xc-id. Scenario: you get some text
and it is stored in memory somewhere. Then you take an action that you
know MIGHT result in a switch in character set, and you get some text,
and it is stored in memory somewhere.

So, if YOU didn't SET the xc-id's, how do you know how to switch back
and forth between them, or even whether you need to?

SET-XCHAR ( xc-id -- )
GET-XCHAR ( -- xc-id )

is the core. That lets you get the xc-id when you store the first set
of information in memory, lets you get the xc-id when you store the
second set of information in memory, test for equality to see if you
have to take care, reset to the "old" xc-id when appropriate.
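
That save-and-restore pattern could be packaged up roughly like this (a
sketch using the suggested, non-standard names, and assuming the xt has no
stack effect of its own):

: with-encoding ( xc-id xt -- )   \ run xt with xc-id selected, then restore
  get-xchar >r
  swap set-xchar
  catch
  r> set-xchar                    \ restore the previous encoding even on error
  throw ;

Usage would then be along the lines of  utf-16 ' import-text with-encoding
(with IMPORT-TEXT equally hypothetical).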

If there are going to be these:

> ASCII ( -- xc-id ) Format: ASCII characters. The lowest 7 bits of xc are
> stored in memory; it is not defined what happens with bit 8.
>
> OCTET ( -- xc-id ) Format: Octets. The lowest 8 bits of xc are stored in
> memory. This encoding is compatible with packed ASCII strings.
>
> UTF-8 ( -- xc-id ) Format: UTF-8 characters. This encoding is compatible
> with packed ASCII strings.
>
> UTF-16 ( -- xc-id ) Format: UTF-16 characters. This encoding is not
> compatible with packed ASCII strings, but ASCII strings can be converted.

There should also be LANGUAGE-XCHAR ( -- ) to synchronise the xc-id
with the current language. An implementation of XCHAR's that did not
have I18N implemented would reset xc-id to the system default.

Albert van der Horst

Sep 27, 2005, 8:11:52 AM
to
In article <1127792236.4...@g43g2000cwa.googlegroups.com>,

It is not clean to store an integer (the count) in a character.
It is not useful to have a count limited to 256 in Britain,
65536 in Japan, and 4 billion in China.

Albert van der Horst

Sep 27, 2005, 8:20:36 AM
to
In article <1127798554....@g14g2000cwa.googlegroups.com>,

Bruce McFarling <agi...@netscape.net> wrote:
>
>Stephen Pelc wrote:
>> How does this fit in with the wide character and internationalisation
>> proposals at
>> www.mpeforth.com/arena/
>> i18n.propose.v7.PDF
>> i18n.widechar.v7.PDF
>> These proposals/RFCs are from the application developers point of
>> view. There's a sample implementation in the file
>> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
>> is derived from 15+ years of experience. From the file header:
>
>WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
>of text processing and place them in the realm of networking standards
>compliance. And a subset of the XCHAR words would suggest how to
>handle them:
>
>OCTET-SIZE ( -- u )
>The memory size of a Byte in address units.

A byte is an address unit, not only by definition but for all
practical purposes.
Can't we just condemn those that don't to declare an
"environmental dependency on an address unit not containing
8 bits"?
By the way, Chuck Moore would have to define OCTET-SIZE as one
quarter anyway. How is that?

>
>OCTET@+ ( oct_addr1 -- oct_addr2 oct )
>Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
>location after oct.

Much too verbose for such a basic word.
Why not OCTET -> B

<SNIP>

>After all, XCHARs do not get rid of the possibility that CHARs may be
>16 bits wide, though they may be of use for 8-bit data when the CHARs
>are 16 bits wide.

CHAR's should not be used for 8-bit data.
XCHAR's should not be used to free CHAR's of the chore to handle
8-bit data, because of a refusal to use bytes (or OCTET's).

So,
do we really need XCHAR ?

Groetjes Albert

Anton Ertl

Sep 27, 2005, 12:09:09 PM
to
Bernd Paysan <bernd....@gmx.de> writes:
>Problem:
>
>ASCII is only appropriate for the English language. Most Western languages,
>however, fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always with the same
>encoding; latin-1 is the most widely used).

Actually Unicode (in its UCS-4/UTF-32 encoding) would also fit in the
ANS Forth frame. However, most near-ANS code around has an
environmental dependency on 1 chars = 1 au, and I think that more
existing programs work with a system that uses 1-au chars and xchars
(even when processing wider xchars) than with a system that uses n-au
chars (n>1).

> Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

That sounds like a requirement and should therefore be part of the
proposal, not the problem description.

The on-stack representation of ASCII characters should certainly be
ASCII. For the in-memory representation that would also have some
advantages: in particular, programs that access individual characters
using char (not xchar) words would work correctly on strings
consisting only of ASCII characters (and ANS Forth does not give any
guarantee for other characters anyway).

>Proposal

I would have waited for some more time (and experience) before making
such a proposal (I am still unsure which words to include and which
not). But since you made it, let's collect the feedback.

>Words:
>
>XC-SIZE ( xc -- u )
>Computes the memory size of the XCHAR xc in address units.
>
>XC@+ ( xc_addr1 -- xc_addr2 xc )
>Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.
>
>XC!+ ( xc xc_addr1 -- xc_addr2 )
>Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.

This is unsafe, as it writes an unknown amount of data behind
xc_addr1. One can use it safely in combination with XC-SIZE, but then
it is easier to use XC!+? (see below).

Providing this word, but not XC!+? discourages safe programming
practices and encourages creating buffer overflows.

In other words, this might become Forth's strcat().

It's probably best not to standardize this word.

>XCHAR+ ( xc_addr1 -- xc_addr2 )
>Adds the size of the XCHAR stored at xc_addr1 to this address, giving
>xc_addr2.
>
>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.
>
>X-SIZE ( xc_addr u -- n )
>n is the number of monospace ASCII characters that take the same space to
>display as the XCHAR string starting at xc_addr, using u address units.

Maybe another name would be harder to confuse with XC-SIZE. How about
X-WIDTH or XC-WIDTH?

>XKEY ( -- xc )
>Reads an XCHAR from the terminal.
>
>XEMIT ( xc -- )
>Prints an XCHAR on the terminal.

Currently Gforth also implements:

+X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like 1 /STRING

-X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like -1 /STRING

XC@ ( xc-addr -- xc )
like C@

DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
safe version of XC!+, f specifies success

-TRAILING-GARBAGE ( addr u1 -- addr u2 )
remove trailing incomplete xc
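
For what it's worth, the last one can be sketched in terms of the proposed
words, assuming an encoding like UTF-8 where XCHAR- lands on the start of
the final (possibly truncated) xchar; this is not the actual Gforth source,
and it may peek a few address units past the end of the buffer:

: -trailing-garbage ( xc-addr u1 -- xc-addr u2 )
  dup 0= IF EXIT THEN            \ empty string: nothing to do
  2dup + dup xchar-              ( addr u1 end last )
  dup xc@+ drop                  ( addr u1 end last past )  \ past = end of last xchar
  rot u> IF                      ( addr u1 last )  \ it sticks out past the end
    nip over -                   \ cut the incomplete xchar off
  ELSE
    drop                         \ the last xchar is complete
  THEN ;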

Of course, some of these can be defined from others, but it's not
clear to me yet which ones are the set that we want to select.

>The following words behave differently when the XCHAR extension is present:

That is actually a compatible extension of ANS Forth's CHAR and
[CHAR]; for ASCII characters they behave exactly the same, and for
others ANS Forth does not specify a behaviour. So I would not say
"behave differently", but use wording such as "extend the semantics of
..."

>Open issues are file reading and writing (conversion on the fly or leave as
>it is?).

Definitely conversion on the fly. There must be only one character
encoding in memory. However, we have not implemented that yet.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.complang.tuwien.ac.at/forth/ansforth/forth200x.html
EuroForth 2005: http://www.complang.tuwien.ac.at/anton/euroforth2005/

Anton Ertl

Sep 27, 2005, 12:55:53 PM
to
"Bruce McFarling" <agi...@netscape.net> writes:
>The first thing to settle is whether XCHARS are "these" extended
>character sets that are upwardly compatible with printable ASCII, or
>"this" extended character set. And I could well see a wish to use, eg,
>UTF-8 in file storage (if my primary targets were Europe, Africa, and
>the Americas) and UTF-16 in processing.

Xchars can be used for any fixed-width encodings (even for a
fixed-width encoding with three chars/xchar), and for any
variable-width encodings that satisfy the requirements (e.g., UTF-8
and UTF-16).

That being said, I don't see a point in using UTF-16 for processing;
it combines the disadvantages of a fixed-width encoding with the
disadvantages of a variable-width encoding. If you want fixed-width,
use UTF-32; if you want variable-width, use UTF-8.

>It seems to me that, since you can always tell where a UTF character
>begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
>but you need to know WHICH it is as well as endianness for UTF16
>and UTF32, the most coherent thing to do is to have AN XCHAR
>representation for processing and a set of file modes that specify the
>kind of file you are loading:
>
>* ASCII (latin-1, etc, any fixed 8-bit code pages)
>* UTF8
>* UTF16 (endianness of your system)
>* UTF32 (endianness of your system)
>* UTF16B
>* UTF16L
>* UTF32B
>* UTF32L
>
>Then if the file mode matches the system mode, you just load the file,
>if it mismatches, it is translated on the fly on reading and writing.

Yes, that's somewhat like what I have in mind. Except that currently
I am only envisioning conversions between various 8-bit encodings and
UTF-8; but if there really are people around with UTF-16 files, adding
a converter for them is not a big issue.

>Obviously the system mode would be a thing for a system query.

Ideally programs should be written with the Xchars words such that
they do not need to know the encoding used in the system.

Anton Ertl

Sep 27, 2005, 1:08:20 PM
to
steph...@mpeforth.com (Stephen Pelc) writes:
>>XCHAR- ( xc_addr1 -- xc_addr2 )
>>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>>work for every possible encoding.
>
>IMHO standardising a word that can't be guaranteed to work is not
>beneficial.

This word is guaranteed to work (if there is at least one character
right before xc_addr1).

If you are thinking about encodings where you cannot find the previous
character, they are not supported by Xchars. And I consider this a
virtue, not a deficiency.

>I look forward to discussing these issues at EuroForth 2005.

I will be there.

Anton Ertl

Sep 27, 2005, 1:16:40 PM
to
steph...@mpeforth.com (Stephen Pelc) writes:
>Unfortunately I have to disagree here. Even if you can get to one
>encoding from the UTF-xxx family in the long term, applications
>written in South Africa (development character set, DCS) must be able
>to be hosted and configured on a PC running a Chinese-xxx version
>of some operating system (operating character set, OCS) and used by
>a Russian-xxx speaker (application character set, ACS). This is a
>mix that has been seen "in the wild" - it is not a hypothetical scenario.

No problem:

DCS: Unicode (encoded as UTF-8 or UTF-32)
OCS: Unicode (encoded as UTF-8 or UTF-32)
ACS: Unicode (encoded as UTF-8 or UTF-32)

So once your condition above is satisfied, this is not an issue at the
character set and encoding level, and is thus outside the scope of the
xchars words.

>The impact of ACS is not necessarily in the encoding, but in
>how the application presents information and the order of
>text substitutions, e.g. subject/verb/object and time/manner/place.
>Then there's the date/time display nightmare and ...

Well, that's internationalisation. Xchars don't solve (much of) that.

Anton Ertl

Sep 27, 2005, 1:30:59 PM
to
"Bruce McFarling" <agi...@netscape.net> writes:

>
>Stephen Pelc wrote:
>WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
>of text processing and place them in the realm of networking standards
>compliance.

Bytes are not in ANS Forth, and are therefore not used in text
processing.

With Xchars, one might use Chars as bytes: Nearly all systems
implement chars as bytes anyway, and probably a number of programs use
chars for bytes, so one might standardize on that.

The disadvantage of such a step in the Xchars context would be that
the in-memory representation for UTF-16 and UTF-32 would no longer be
fully ASCII-compatible (one ASCII Xchar would become more than one
Char).

But I don't believe that UTF-16 or UTF-32 and multi-au Chars will
become significant, so one might just as well settle down to using
Chars for bytes.

>And a subset of the XCHAR words would suggest how to
>handle them:

Well, since octets are fixed-width, it may be better to model the
octet words on the Char or Cell words than on the Xchar words.
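
On the common systems where 1 chars = 1 address unit = one octet, that
modelling makes the octet words almost trivial; a sketch (with OCTET-SIZE
following Bruce's ( -- u ) signature):

1 constant octet-size
: octet@+ ( oct_addr1 -- oct_addr2 oct )  count ;
: octet!+ ( oct oct_addr1 -- oct_addr2 )  tuck c! char+ ;
: octet+  ( oct_addr1 -- oct_addr2 )      char+ ;
: octet-  ( oct_addr1 -- oct_addr2 )      1 chars - ;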

Anton Ertl

Sep 27, 2005, 1:39:52 PM
to
Bernd Paysan <bernd....@gmx.de> writes:
>Another missing part of my XCHAR proposal is how to change the way these
>XCHARs are handled.

No, that's not missing. There should not be any switching between
encodings. There is one encoding in the Forth system that should be
able to represent anything, and everything is converted to that
encoding on input, and from that encoding on output. No need to
switch anything.

If you allowed switching, then:

- Either you would have to change the encoding of all the strings in the
Forth system. This is impossible.

- Or the program would have to keep track of which strings are in
which encoding and always switch around. That's cumbersome and
error-prone.

>XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

IMO the encoding should be part of the fam, and not be set on the fly.
Or do you envision files that mix UTF-8 and, say UTF-16? So we might
have words like

UTF-8 ( fam1 -- fam2 )

latin-1 ( fam1 -- fam2 )

Bruce McFarling

Sep 28, 2005, 12:49:04 AM
to

Anton Ertl wrote:
> That being said, I don't see a point in using UTF-16 for processing;

To save memory space, if your primary language uses a wide character
set in the first plane (where most UTF-8 encodings are three bytes
long). Also if you know what language you are working in, you know
whether or not you are going to stay down in the first plane, so the
variable width issue may be moot.

Not that I had those in mind when I wrote that, rather I had in mind
that as soon as you assume away something, you will find out that
someone else has a strong preference for it, so I tried to avoid
assuming away anything.

Bruce McFarling

Sep 28, 2005, 12:57:54 AM
to

Albert van der Horst wrote:
> Much too verbose for such a basic word.
> Why not OCTET -> B
>
> <SNIP>
>
> >After all, XCHARs do not get rid of the possibility that CHARs may be
> >16 bits wide, though they may be of use for 8-bit data when the CHARs
> >are 16 bits wide.

> CHAR's should not be used for 8-bit data.
> XCHAR's should not be used to free CHAR's of the chore to handle
> 8-bit data, because of a refusal to use bytes (or OCTET's).

> So,
> do we really need XCHAR ?

Yes, of course, because XCHARS is not about address units but about
character set units. XCHARS handle extended character data, where we
know perfectly well that sometimes it is one octet long, sometimes it
is two octets long, sometimes it is four octets long, sometimes it
ranges from one to four octets long, and sometimes it ranges from two
to four octets long. So XCHAR+, XCHAR-, XCHAR@+, and XCHAR!+ are
things that are likely to benefit from optimisation and especially
handy for portability given that you could write and test for, say,
UTF-8, and then have code that works for a fixed width 16-bit character
set.

So when I say "XCHAR's may be a byte wide", that's dependent on the
character set encoding in use, not the system and system-specific
address unit.

Bruce McFarling

Sep 28, 2005, 1:23:02 AM
to

Albert van der Horst wrote:
[Bruce]

> >COUNT is perfectly useful and clean. It's just using it to count, with
> >the attendant limitation of counts to the width of a uniform-width
> >character set, that is obsolete.

> It is not clean to store an integer (the count) in a character.
> It is not useful to have a count limited to 256 in Britain,
> 65536 in Japan, and 4 billion in China.

I didn't say THAT was clean or useful. In fact, I said that THAT is
obsolete. But CHAR@+ is perfectly clean and useful, however confusing
the string of letters you use to do it.

Stephen Pelc

Sep 28, 2005, 6:12:59 AM
to
On Tue, 27 Sep 2005 17:39:52 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>No, that's not missing. There should not be any switching between
>encodings. There is one encoding in the Forth system that should be
>able to represent anything, and everything is converted to that
>encoding on input, and from that encoding on output. No need to
>switch anything.

The key word is "should". However, reality intervenes. There are
apps out there that use multiple encodings. A standard formalises
current practice - it is *not* a design for the future.

If you push through a standard that disenfranchises existing
substantial apps, the developers of those apps will ignore
the standard. Is this what you want?

The preferred route, I suggest, is to provide GET-ENCODING and
SET-ENCODING. In your system, you can always be non-compliant
for the moment. You will then have an environmental dependency on
UTF8. This is no worse than the widely accepted char=byte=au
dependency.

Bruce McFarling

Sep 28, 2005, 6:22:56 AM
to

Albert van der Horst wrote:

> >OCTET@+ ( oct_addr1 -- oct_addr2 oct )
> >Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
> >location after oct.

> Much too verbose for such a basic word.
> Why not OCTET -> B

I don't know why not. I'm pretty confident that few people are likely
to have OCTET@+ lying around, and if they do, it's odds-on it does that
anyway. B@+? The B could stand for "buffer", or "block". OTOH,
BYTE@+ is fine by me.

Could call it "BCOUNT" in homage to established naming conventions for
CHAR@+, which is called COUNT, or OCOUNT.

Or OC@+ in homage to the yank television show that the young'uns here
like so much.

Bruce McFarling

Sep 28, 2005, 6:28:35 AM
to
Stephen Pelc wrote:

> The preferred route, I suggest, is to provide GET-ENCODING and
> SET-ENCODING. In your system, you can always be non-compliant
> for the moment. You will then have an environmental dependency on
> UTF8. This is no worse than the widely accepted char=byte=au
> dependency.

Note that an implementation may support only one encoding, in which case
GET-ENCODING will always return the same encoding, and SET-ENCODING will
either do nothing or throw an error if the encoding being set is not the
supported one.

It certainly is not unreasonable for gforth to focus on UTF-8, which is
emerging as a de facto standard in much of Linux-oriented open source.
A standard that did not accommodate UTF-8 would be flawed. But
prescribing in advance of common practice will limit the uptake of the
standard and therefore the portability of source relying on it.

Bernd Paysan

Sep 28, 2005, 8:00:58 AM
to
Anton Ertl wrote:

> One can use it safely in combination with XC-SIZE, but then
> it is easier to use XC!+? (see below).

Well, the reference implementation of XC!+? then is

: xc!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
  >r over xc-size r@ over u< IF  ( xc xc-addr1 len ) ( R: u1 )
    \ not enough space: leave the buffer unchanged and return false
    drop nip r> false
  ELSE
    \ store the xchar and return the shortened remainder of the buffer
    >r xc!+ r> r> swap - true
  THEN ;

> In other words, this might become Forth's strcat().

You at least know that there is an upper bound for how much you might
overwrite (not the case with strcat). Well, the upper bound depends on the
encoding, and we don't guarantee now that -1 XC-SIZE will return the
maximum one.

Stephen Pelc

Sep 28, 2005, 12:13:10 PM
to
On 28 Sep 2005 03:28:35 -0700, "Bruce McFarling"
<agi...@netscape.net> wrote:

>It certainly is not unreasonable for gforth to focus on UTF-8, which is
>emerging as a de facto standard in much of Linux-oriented open source.
>A standard that did not accommodate UTF-8 would be flawed. But
>prescribing in advance of common practice will limit the uptake of the
>standard and therefore the portability of source relying on it.

I've been discussing applications that have been shipping for 15 or
more years. Internationalisation and the consequent "char" issues
have been around for a long time, and some of our clients handle
them daily. I just don't want their *requirements* to be locked
out.

The DCS, OCS and ACS terminology stems from issues that exist for
real applications. It is certainly rare for encodings to change
after program initialisation (although some multilingual word
processors have worked that way) but it is common that an app
has to select the encoding at startup.

Anton Ertl

Sep 28, 2005, 1:43:44 PM
to
steph...@mpeforth.com (Stephen Pelc) writes:
>It is certainly rare for encodings to change
>after program initialisation (although some multilingual word
>processors have worked that way) but it is common that an app
>has to select the encoding at startup.

Sounds to me that we are in agreement then. Gforth uses the standard
Unix mechanism (the LANG environment variable) for determining the
encoding on startup. No switching words needed.

As for multilingual word processors, that's a good reason for using a
universal character set and encoding rather than switching around.

Anton Ertl

Sep 28, 2005, 1:49:50 PM
to
steph...@mpeforth.com (Stephen Pelc) writes:
>On Tue, 27 Sep 2005 17:39:52 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>No, that's not missing. There should not be any switching between
>>encodings. There is one encoding in the Forth system that should be
>>able to represent anything, and everything is converted to that
>>encoding on input, and from that encoding on output. No need to
>>switch anything.
>
>The key word is "should". However, reality intervenes. There are
>apps out there that use multiple encodings. A standard formalises
>current practice - it is *not* a design for the future.

It makes no sense to standardize a current practice that has no
future.

But as I said before, IMO it's a little too early for the xchars
proposal, because there is not enough practice with it.

In the Linux world, UTF-8 is the present.

>If you push through a standard that disenfranchises existing
>substantial apps, the developers of those apps will ignore
>the standard. Is this what you want?

I have read enough statements from Forth vendors that it's impossible
to write substantial apps in ANS Forth, so supposedly the programmers
of those substantial apps are ignoring the standard already.

The existing apps will continue to work on the systems where they
worked before and be as non-standard as they ever were.

It seems to me that you are thinking about requirements of your
customers that most of the others don't have, and that hopefully will
go away at some point even for your customers.

>The preferred route, I suggest, is to provide GET-ENCODING and
>SET-ENCODING.

That's the worst possible design; or maybe having an ENCODING variable
would be even worse.

In general, the global-state approach is always causing problems,
whether it's STATE or BASE or something else.

If you want to support different encodings, the encoding should be
stored with the data. But then we would be dealing with something
that's much different from current Forth strings. And the words for
dealing with that stuff would probably be much different from the
xchars words.

Xchars were designed for dealing with one encoding used throughout the
Forth system. Several encodings are compatible with the requirements
of xchars, and a Forth system might let you choose on startup which
encoding to use, but you cannot switch around between encodings.

Anton Ertl

Sep 28, 2005, 2:15:12 PM
to
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>
>> One can use it safely in combination with XC-SIZE, but then
>> it is easier to use XC!+? (see below).
>
>Well, the reference implementation of XC!+? then is

My point is that you should include XC!+? in the proposal and probably
delete XC!+ from it.

BTW, concerning a reference implementation of xchars, a reference
implementation for the 8bit (or a general fixed-width) encoding should
be easy (although not very exciting).

>> In other words, this might become Forth's strcat().
>
>You at least know that there is an upper bound for how much you might
>overwrite (not the case with strcat).

True, but XC!+ can be enough to overwrite an xt, and that can be
enough to break into the system.

>Well, the upper bound depends on the
>encoding, and we don't guarantee now that -1 XC-SIZE will return the
>maximum one.

Even if an upper bound could be determined, making use of that would
require additional programmer effort, and it's a bad idea to design
words that require that; you need to educate the programmers about
that, and even if they know about it, it's still easier to make errors
when the required effort is higher.