RfD: XCHAR wordset (for UTF-8 and alike)

Bernd Paysan

Sep 25, 2005, 10:16:25 PM

Problem:

ASCII is only appropriate for the English language. Most western languages,
however, fit somewhat into the Forth frame, since a byte is sufficient to
encode the few special characters in each (though the same encoding cannot
always be used; latin-1 is the most widely used). For other
languages, different char-sets have to be used, several of them
variable-width. The most prominent representative is UTF-8. Let's call these
extended characters XCHARs. Since ANS Forth specifies ASCII encoding, only
ASCII-compatible encodings may be used.

Proposal

Datatypes:

xc is an extended char on the stack. It occupies one cell, and is
a subset of unsigned cell. Note: UTF-8 cannot store more than 31
bits; on 16 bit systems, only the UCS16 subset of the UTF-8
character set can be used.
xc_addr is the address of an XCHAR in memory. Alignment requirements are
the same as c_addr. The memory representation of an XCHAR differs
from the stack representation, and depends on the encoding used. An XCHAR
may use a variable number of address units in memory.

Common encodings:

Input and files commonly are either encoded iso-latin-1 or utf-8. The
encoding depends on settings of the computer system such as the LANG
environment variable on Unix. You can use the system consistently only when
you don't change the encoding, or only use the ASCII subset.

Words:

XC-SIZE ( xc -- u )
Computes the memory size of the XCHAR xc in address units.

XC@+ ( xc_addr1 -- xc_addr2 xc )
Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XC!+ ( xc xc_addr1 -- xc_addr2 )
Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XCHAR+ ( xc_addr1 -- xc_addr2 )
Adds the size of the XCHAR stored at xc_addr1 to this address, giving
xc_addr2.

XCHAR- ( xc_addr1 -- xc_addr2 )
Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
work for every possible encoding.

X-SIZE ( xc_addr u -- n )
n is the number of monospace ASCII characters that take the same space to
display as the XCHAR string starting at xc_addr, using u address units.

XKEY ( -- xc )
Reads an XCHAR from the terminal.

XEMIT ( xc -- )
Prints an XCHAR on the terminal.
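
As an illustration only (this is not the Gforth or bigFORTH source), the
memory-access words above could look roughly like this for UTF-8. The sketch
assumes that one char is one 8-bit address unit and that cells are at least
32 bits wide; it does no validation of malformed sequences.

```forth
\ Sketch only: UTF-8 versions of XC-SIZE, XC@+, XC!+ and XCHAR+.
\ Assumes 1 CHARS = 1 address unit = 8 bits, cells of at least 32 bits.
HEX

: XC-SIZE ( xc -- u )
  DUP      80 U< IF DROP 1 EXIT THEN
  DUP     800 U< IF DROP 2 EXIT THEN
  DUP   10000 U< IF DROP 3 EXIT THEN
  DUP  200000 U< IF DROP 4 EXIT THEN
  DUP 4000000 U< IF DROP 5 EXIT THEN
  DROP 6 ;

: XC@+ ( xc_addr1 -- xc_addr2 xc )
  COUNT DUP 80 U< IF EXIT THEN      \ plain ASCII byte
  7F AND  40 >R                     \ R: mask of the next length bit
  BEGIN  DUP R@ AND  WHILE          \ another continuation byte follows
    R@ XOR  6 LSHIFT                \ clear length bit, make room for 6 bits
    R> 5 LSHIFT >R                  \ move the length-bit mask along
    >R COUNT 3F AND R> OR           \ merge payload of the continuation byte
  REPEAT  R> DROP ;

: XC!+ ( xc xc_addr1 -- xc_addr2 )
  OVER 80 U< IF TUCK C! CHAR+ EXIT THEN
  OVER XC-SIZE >R                   \ R: total number of bytes n
  OVER R@ 1- 6 * RSHIFT             \ payload bits of the lead byte
  FF00 R@ RSHIFT FF AND OR          \ add the n-bit length prefix
  OVER C! CHAR+
  R> 1-                             \ continuation bytes still to write
  BEGIN  DUP  WHILE
    1- >R
    OVER R@ 6 * RSHIFT 3F AND 80 OR \ next 6 payload bits, tagged 10xxxxxx
    OVER C! CHAR+
    R>
  REPEAT  DROP NIP ;

: XCHAR+ ( xc_addr1 -- xc_addr2 )  XC@+ DROP ;

DECIMAL
```

With this loaded, storing the Euro sign with HEX 20AC PAD XC!+ should leave
PAD+3 on the stack and the bytes E2 82 AC at PAD.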

The following words behave differently when the XCHAR extension is present:

CHAR ( "<spaces>name" -- xc )
Skip leading space delimiters. Parse name delimited by a space. Put the
value of its first XCHAR onto the stack.

[CHAR]
Interpretation: Interpretation semantics for this word are undefined.
Compilation: ( "<spaces>name" -- )
Skip leading space delimiters. Parse name delimited by a space. Append the
run-time semantics given below to the current definition.
Run-time: ( -- xc )
Place xc, the value of the first XCHAR of name, on the stack.

Reference implementation:

Unfortunately, both the Gforth and the bigFORTH implementation have several
system-specific parts.

Experience:

Built into Gforth (development version) and recent versions of bigFORTH.
Open issues are file reading and writing (conversion on the fly or leave as
it is?).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bruce McFarling

Sep 26, 2005, 6:29:05 AM

Bernd Paysan wrote:
> Problem:

> ASCII is only appropriate for the English language. Most western languages
> however fit somewhat into the Forth frame, since a byte is sufficient to
> encode the few special characters in each (though not always the same
> encoding can be used; latin-1 is most widely used, though).

> For other languages, different char-sets have to be used, several of
> them variable-width. Most prominent representant is UTF-8. Let's call
> these extended characters XCHARs. Since ANS Forth specifies ASCII
> encoding, only ASCII-compatible encodings may be used.

> Experience:

> Build into Gforth (development version) and recent versions of bigFORTH.
> Open issues are file reading and writing (conversion on the fly or leave as
> it is?).

The first thing to settle is whether XCHARS are "these" extended
character sets that are upwardly compatible with printable ASCII, or
"this" extended character set. And I could well see a wish to use, eg,
UTF-8 in file storage (if my primary targets were Europe, Africa, and
the Americas) and UTF-16 in processing.

It seems to me that, since you can always tell where a UTF character
begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
but you need to know WHICH it is as well as the endianness for UTF16
and UTF32, the most coherent thing to do is to have AN XCHAR
representation for processing and a set of file modes that specify the
kind of file you are loading:

* ASCII (latin-1, etc, any fixed 8-bit code pages)
* UTF8
* UTF16 (endianness of your system)
* UTF32 (endianness of your system)
* UTF16B
* UTF16L
* UTF32B
* UTF32L

Then if the file mode matches the system mode, you just load the file,
if it mismatches, it is translated on the fly on reading and writing.
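
To make the "translated on the fly" step concrete, here is a rough sketch of
the reading half for a UTF16B file: it fetches one character from big-endian
UTF-16 bytes in memory and leaves it on the stack as a code point, which could
then be stored in whatever the system mode is. The word names are made up for
this example, and it assumes 8-bit address units and cells of at least 32
bits. Unpaired surrogates are not checked.

```forth
\ Sketch only; U16BE@ and UTF16BE@+ are hypothetical names.
HEX

: U16BE@ ( addr -- u )              \ fetch one big-endian 16-bit unit
  COUNT 8 LSHIFT SWAP C@ OR ;

: UTF16BE@+ ( addr1 -- addr2 xc )
  DUP U16BE@ SWAP 2 + SWAP          \ first code unit, address advanced
  DUP FC00 AND D800 = IF            \ high surrogate: a second unit follows
    3FF AND 0A LSHIFT               \ upper 10 bits of the offset
    OVER U16BE@ 3FF AND OR          \ lower 10 bits from the low surrogate
    10000 +                         \ offset is relative to U+10000
    SWAP 2 + SWAP
  THEN ;

DECIMAL
```

A little-endian variant would only differ in the 16-bit fetch, which is why
the endianness has to be part of the file mode.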

Obviously the system mode would be a thing for a system query.

Bernd Paysan

Sep 26, 2005, 9:35:12 AM

to fort...@yahoogroups.com

Bruce McFarling wrote:

> The first thing to settle is whether XCHARS are "these" extended
> character sets that are upwardly compatible with printable ASCII, or
> "this" extended character set. And I could well see a wish to use, eg,
> UTF-8 in file storage (if my primary targets were Europe, Africa, and
> the Americas) and UTF-16 in processing.
>
> It seems to me that, since you can always tell where a UTF character
> begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
> but you need to know know WHICH it is as well as endianess for UTF16
> and UTF32, the most coherent thing to do is to have AN XCHAR
> representation for processing and a set of file modes that specify the
> kind of file you are loading:
>
> * ASCII (latin-1, etc, any fixed 8-bit code pages)

Though, depending on the fixed code-page, the translation will be different
(latin-1 different from latin-2).

> * UTF8
> * UTF16 (endedness of your system)
> * UTF32 (endedness of your system)
> * UTF16B
> * UTF16L
> * UTF32B
> * UTF32L

You can add a few other encodings. UCS16 managed to have an easy conversion
from several previous ASCII-compatible encodings, even though the code
pages of the non-ASCII portion move within UCS16 (e.g. the GB2312 format).
Which encodings are actually known to the Forth system would be the subject
of a query, too.

> Then if the file mode matches the system mode, you just load the file,
> if it mismatches, it is translated on the fly on reading and writing.
>
> Obviously the system mode would be a thing for a system query.

Exactly.

Stephen Pelc

Sep 26, 2005, 10:17:02 AM

On Mon, 26 Sep 2005 00:16:25 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>ASCII is only appropriate for the English language. Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though). For other
>languages, different char-sets have to be used, several of them
>variable-width. Most prominent representant is UTF-8. Let's call these
>extended characters XCHARs. Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

How does this fit in with the wide character and internationalisation
proposals at
www.mpeforth.com/arena/
i18n.propose.v7.PDF
i18n.widechar.v7.PDF
These proposals/RFCs are from the application developer's point of
view. There's a sample implementation in the file
LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
is derived from 15+ years of experience. From the file header:

"You are free to use this code in any way, as long as the MPE
copyright notice in this section is retained.

This code is an implementation of the draft ANS internationalisation
specification available from the download area of the MPE web site.
The implementation provides more functionality than is required by
the ANS draft standard and provides enough hooks to be the basis of
a practical system."

>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.

IMHO standardising a word that can't be guaranteed to work is not
beneficial. If you must step back through a string, extend the
definition of /STRING to form /-STRING or some such, such that
the start of the string must be at the start of a character.

IMHO your approach is from the implementor's perspective, which is
valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
that what *applications* do with strings is at a *much* higher level
than implementors issues.

Can we merge the application developer issues with the kernel
issues? These include cleaning up the meaning of character,
byte/octet access, file words and so on.

I look forward to discussing these issues at EuroForth 2005.

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Bernd Paysan

Sep 26, 2005, 12:31:48 PM

Stephen Pelc wrote:
> How does this fit in with the wide character and internationalisation
> proposals at
> www.mpeforth.com/arena/
> i18n.propose.v7.PDF
> i18n.widechar.v7.PDF
> These proposals/RFCs are from the application developers point of
> view. There's a sample implementation in the file
> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
> is derived from 15+ years of experience. From the file header:

The main difference with the i18n.widechar.v7.PDF proposal is that our
proposal (Anton's and my) doesn't distinguish between development character
set and application character set. I think this distinction is unnatural
and only valid in a historical context, e.g. the different code-pages used
in DOS-based Windows, and wide characters, which won't coexist with ASCII.

The string-based localization proposal in i18n.propose.v7.PDF is orthogonal
to the character issue, and works regardless of the coding system, as
strings always stay strings.

I would welcome it if you set up an RfD for your proposal.

>>XCHAR- ( xc_addr1 -- xc_addr2 )
>>Goes backward from xc_addr1 until it finds an XCHAR so that the size of
>>this XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed
>>to work for every possible encoding.
>
> IMHO standardising a word that can't be guaranteed to work is not
> beneficial. If you must step back through a string, extend the
> definition of /STRING to form /-STRING or some such, such that
> the start of the string must be at the start of a character.

Quite a number of variable-width wide-char encodings, especially UTF-8,
allow stepping both forward and backward a character at a time. Another
possible compromise is to simply outlaw those variable-width wide-char
encodings that don't allow stepping back. UTF-8 lets you find the next and
the previous character regardless of where you point to. Some of the Chinese
encodings can do the same: the first byte of a double-byte glyph there has
the MSB set, the second clear.
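
For UTF-8 this works because every continuation byte has the bit pattern
10xxxxxx, so a start byte can be recognised from anywhere inside a character.
A minimal sketch follows, assuming 8-bit address units; U8-CONT? and XC-SYNC
are hypothetical helper names, and there is no check against running off the
start of the buffer.

```forth
\ Sketch only: resynchronising and stepping back in UTF-8.
HEX

: U8-CONT? ( byte -- flag )         \ true for continuation bytes 80..BF
  C0 AND 80 = ;

: XC-SYNC ( addr -- xc_addr )       \ start of the character containing addr
  BEGIN  DUP C@ U8-CONT?  WHILE  1 CHARS -  REPEAT ;

: XCHAR- ( xc_addr1 -- xc_addr2 )   \ start of the previous character
  1 CHARS - XC-SYNC ;

DECIMAL
```

An encoding without such a self-synchronising property would have to scan
forward from a known character boundary instead, or fail.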

It's like seeking in a file. Not all files allow seeking (pipes and sockets
won't, e.g.). Seeking is a useful activity, though. Adding a X/STRING
( xc_addr u n -- xc_addr' u' ) isn't much of a trouble. n would be the
number of XCHARs to step forward (positive) or backward (negative).
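
A rough sketch of such an X/STRING for UTF-8, built on stepping words like
the ones in the proposal. It assumes 8-bit address units; U8LEN is a
hypothetical helper, only sequences of up to four bytes are handled, and
nothing stops the caller from stepping outside the string.

```forth
\ Sketch only: X/STRING for UTF-8 without bounds or validity checks.
HEX

: U8LEN ( lead-byte -- n )          \ sequence length from the lead byte
  DUP 80 U< IF DROP 1 EXIT THEN
  DUP E0 U< IF DROP 2 EXIT THEN
  DUP F0 U< IF DROP 3 EXIT THEN
  DROP 4 ;

: XCHAR+ ( xc_addr1 -- xc_addr2 )  DUP C@ U8LEN CHARS + ;

: XCHAR- ( xc_addr1 -- xc_addr2 )
  BEGIN  1 CHARS -  DUP C@ C0 AND 80 <>  UNTIL ;

DECIMAL

: X/STRING ( xc_addr1 u1 n -- xc_addr2 u2 )
  >R OVER                           \ working copy of the address
  R@ 0< IF
    R> NEGATE 0 ?DO XCHAR- LOOP     \ n negative: step backward
  ELSE
    R> 0 ?DO XCHAR+ LOOP            \ n positive: step forward
  THEN
  2 PICK - /STRING ;                \ adjust address and length together
```

The length u stays in address units, as in the rest of the proposal; only n
counts XCHARs.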

The question is rather what XCHAR- should do when it fails. It can throw an
error, just as it can when it encounters a bad encoding.

> IMHO your approach is from the implementor's perspective, which is
> valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
> that what *applications* do with strings is at a *much* higher level
> than implementors issues.

Especially when they finally use some OS function to paint the text on the
screen. On the other hand, when they use something integrated into the
Forth system (like MINOS), they use the DCS to display things on screen.

Using UTF-8 internally is even possible for a Windows Forth, though you then
have to jump through hoops to call TextOutW correctly (AFAIK it doesn't even
know how to deal with combining characters). So far, I haven't ported the
UTF-8 stuff to Windows, and concluded that it's easier to make the Windows
MINOS version use the same iso-latin-1 DCS as it always did. But then,
bigFORTH on Windows is not really supported.

> Can we merge the application developer issues with the kernel
> issues? These inclue cleaning up the meaning of character,
> byte/octet access, file words and so on.

Good idea.

Stephen Pelc

Sep 26, 2005, 3:13:32 PM

On Mon, 26 Sep 2005 14:31:48 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>The main difference with the i18n.widechar.v7.PDF proposal is that our
>proposal (Anton's and my) doesn't distinguish between development character
>set and application character set. I think this distinction is unnatural
>and only valid in a historical context, e.g. the different code-pages used
>in DOS-based Windows, and wide characters, which won't coexist with ASCII.

Unfortunately I have to disagree here. Even if you can get to one
encoding from the UTF-xxx family in the long term, applications
written in South Africa (development character set, DCS) must be able
to be hosted and configured on a PC running a Chinese-xxx version
of some operating system (operating character set, OCS) and used by
a Russian-xxx speaker (application character set, ACS). This is a
mix that has been seen "in the wild" - it is not a scenario.

The impact of ACS is not necessarily in the encoding, but in
how the application presents information and the order of
text substitutions, e.g. subject/verb/object and time/manner/place.
Then there's the date/time display nightmare and ...

I really wish we could embrace a single encoding, but there are
Forth applications out there with 15-20 years of history.

>I would welcome it when you set up an RfD for your proposal.

Let's reserve time for it at EuroForth. Those who want to join a mail
list for this topic should email me directly. I will re-establish
the locale and other mailing lists when our servers have recovered
from the plumbing alterations at Hill Lane.

>Another
>possible compromise is to simply outlaw those variable width wide-char
>encodings that don't allow stepping back.

Tell that to an application developer and they will ignore you. Such
encodings exist and are used. In our experience, stepping back through
strings is most often encountered in file handling and affects DCS and
OCS rather than ACS.

>> Can we merge the application developer issues with the kernel
>> issues? These inclue cleaning up the meaning of character,
>> byte/octet access, file words and so on.
>
>Good idea.

Will you be at EuroForth?

Albert van der Horst

Sep 26, 2005, 9:30:20 AM

In article <p6kj03-...@vimes.paysan.nom>,
Bernd Paysan <bernd....@gmx.de> wrote:
>Problem:
>
>ASCII is only appropriate for the English language.

Hardly. English has given up one of the most important
advantages of a phonetic system. It is unpronounceable.
I am thinking about a phonetically correct spelling of
English and it would need a host of diacritical marks,
just like every other language.

> Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though). For other
>languages, different char-sets have to be used, several of them
>variable-width. Most prominent representant is UTF-8. Let's call these
>extended characters XCHARs. Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.
>
>Proposal
<SNIP>

One of the problems, and I think it is a design issue, we have
inherited from C, is the mess resulting from using characters
as address units (in Forth parlance.)
In Forth with all the embedded programming we really need
a means to address bytes. I would like to split off from
the character handling in Forth, all that is in fact intended
to handle let's say assembler level programming.
This would make character handling much cleaner, and a better
starting point for extending the real character handling.

It is my hope that we need not introduce a new type for char's beside
the byte type that we need anyhow, and the normal CHAR.
Why would CHAR <some extended character> not fit in a Forth
character (provided we do not try it at the same time for
things like a length as exemplified by the ugly word COUNT.)

In fact bytes are somehow in place by the concept of
address unit. We only need to flesh it out a little.
Note that there is *no* Forth word to fetch or store
the content of an address unit. Still.
An address unit is the smallest part of memory that can be
addressed, i.e. fetched or stored. But it can't be, because there
are no words for it.

>--

Groetjes Albert

--
Albert van der Horst,Oranjestr 8,3511 RA UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
alb...@spenarnc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

Bernd Paysan

Sep 26, 2005, 9:13:48 PM

Stephen Pelc wrote:

> On Mon, 26 Sep 2005 14:31:48 +0200, Bernd Paysan <bernd....@gmx.de>
> wrote:
>
>>The main difference with the i18n.widechar.v7.PDF proposal is that our
>>proposal (Anton's and my) doesn't distinguish between development
>>character set and application character set. I think this distinction is
>>unnatural and only valid in a historical context, e.g. the different
>>code-pages used in DOS-based Windows, and wide characters, which won't
>>coexist with ASCII.
>
> Unfortunately I have to disagree here. Even if you can get to one
> encoding from the UTF-xxx family in the long term, applications
> written in South Africa (development character set, DCS) must be able
> to be hosted and configured on a PC running a Chinese-xxx version
> of some operating system (operating character set, OCS)and used by
> a Russian-xxx speaker (application character set, ACS). This is a
> mix that has been seen "in the wild" - it is not a scenario.

The way it works in Unix/Linux (the platform where it really works) is to
use a single encoding, UTF-8, for everything. Unix platforms and Linux have
been delivered with UTF-8 support for some years now, and recently it's often
the default setting. I have absolutely no problem installing a SuSE with
two dozen languages all available to the user, just depending on the $LANG
variable, and the users can share documents with each other.

AFAIK, even Windows has some variants that ship with a multi-language
system, though in Windows, lots of system internals depend on the language
(such as the "Program Files" directory, or "My Documents"). Windows
supports Unicode as one of the codespaces, though UTF-8 support would be
left to the application (several do use it already, but most of them are
ported over from Unix).

But the XCHAR proposal is really not about having UTF-8 everywhere, but
about dealing with variable-width wide characters. Fixed wide characters
are a subset of that; though that takes the ASCII compatibility away, and
being incompatible to the DCS opens the can of worms you have with your
OCS!=DCS!=ACS.

> The impact of ACS is not necessarily in the encoding, but in
> how the application presents information and the order of
> text substitutions, e.g. subject/verb/object and time/manner/place.
> Then there's the date/time display nightmare and ...

That's another question, but not bound to the character encoding itself.

> I really wish we could embrace a single encoding, but there are
> Forth applications out there with 15-20 years of history.

The vast majority of Forth programs however is DCS=OCS=ACS. And since OCS
now is often enough UTF-8 by default, we should be able to handle that.

There might be a place for a more complicated scheme even in the future, but
so far, I see DCS != OCS != ACS as a result of bad decisions in operating
system design. Such things are better solved outside the scope of a
general standard (i.e. in a rather specific standard, "how do I overcome this
particular problem with the popular brainfuck operating system").

Having DCS != OCS/ACS is something that works for batch compiled programming
languages. There's still the problem of the string constants, but the
localization mapping handles that (you don't have strings in the user's
language around in your primary source code).

This however means that you enforce a particular way to deal with your
development system and your localization. This particular way is something
I really don't want in Forth. E.g. I could write some turtle graphics for
children, and it certainly is necessary that it has to be used in their
native language. On the other hand, it's quite obvious that it will use the
Forth interpreter. So it's definitely DCS, and the localization is a file
with lots of ' xxx alias yyy commands.

It all reminds me of target compilers. You jump through hoops because you
don't have your target system available. This is all well if you need it.
It's not something that should have an impact on the design of a Forth
system where build=host=target.

>>Another
>>possible compromise is to simply outlaw those variable width wide-char
>>encodings that don't allow stepping back.
>
> Tell that to an application developer and they will ignore you.

That's true.

> Such encodings exist and are used.

Unfortunately. For me, these encodings are other people's problems ;-).

> In our experience, stepping back through
> strings is most often encountered in file handling and affects DCS and
> OCS rather than ACS.

I use stepping backwards mostly in editing code, that's ACS.

>>> Can we merge the application developer issues with the kernel
>>> issues? These inclue cleaning up the meaning of character,
>>> byte/octet access, file words and so on.
>>
>>Good idea.
>
> Will you be at EuroForth?

Unfortunately not. I had originally booked my holiday for before EuroForth,
but I had to shift my trip by three weeks, so I'll be on the other side of
the world when EuroForth takes place :-(.

Bruce McFarling

Sep 27, 2005, 3:37:16 AM

Albert van der Horst wrote:
> It is my hope that we need not introduce a new type for char's beside
> the byte type that we need anyhow, and the normal CHAR.
> Why would CHAR <some extended character> not fit in a Forth
> character (provided we do not try it at the same time for
> things like a length as exemplified by the ugly word COUNT.)

But the RfD is moving in the direction you want, in which characters
are treated as character set entities. After all, while a UTF-8
encoding is perfectly regular, any given character may be one, two,
three, or four bytes long.

COUNT is perfectly useful and clean. It's just using it to count, with
the attendant limitation of counts to the width of a uniform-width
character set, that is obsolete.

Bruce McFarling

Sep 27, 2005, 5:22:34 AM

Stephen Pelc wrote:
> How does this fit in with the wide character and internationalisation
> proposals at
> www.mpeforth.com/arena/
> i18n.propose.v7.PDF
> i18n.widechar.v7.PDF
> These proposals/RFCs are from the application developers point of
> view. There's a sample implementation in the file
> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
> is derived from 15+ years of experience. From the file header:

WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
of text processing and place them in the realm of networking standards
compliance. And a subset of the XCHAR words would suggest how to
handle them:

OCTET-SIZE ( -- u )
The memory size of an OCTET in address units.

OCTET@+ ( oct_addr1 -- oct_addr2 oct )
Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
location after oct.

OCTET!+ ( oct oct_addr1 -- oct_addr2 )
Stores the OCTET oct at oct_addr1. oct_addr2 points to the first memory
location after oct.

OCTET+ ( oct_addr1 -- oct_addr2 )
Adds the size of an OCTET to oct_addr1, giving oct_addr2.

OCTET- ( oct_addr1 -- oct_addr2 )
Subtracts the size of an OCTET from oct_addr1, giving oct_addr2.

After all, XCHARs do not get rid of the possibility that CHARs may be
16 bits wide, though they may be of use for 8-bit data when the CHARs
are 16 bits wide.
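
On the common case of a byte-addressed machine these words are nearly
trivial. A minimal sketch, assuming one address unit is exactly 8 bits and
that C@ and C! access a single address unit (neither is guaranteed by the
standard, and a system with wider address units would have to pack and unpack
octets instead):

```forth
\ Sketch only: OCTET words for a system with 8-bit address units.
1 CONSTANT OCTET-SIZE               \ one octet occupies one address unit

: OCTET@+ ( oct_addr1 -- oct_addr2 oct )
  DUP C@ 255 AND                    \ fetch and keep only the low 8 bits
  SWAP OCTET-SIZE + SWAP ;

: OCTET!+ ( oct oct_addr1 -- oct_addr2 )
  TUCK C! OCTET-SIZE + ;

: OCTET+ ( oct_addr1 -- oct_addr2 )  OCTET-SIZE + ;
: OCTET- ( oct_addr1 -- oct_addr2 )  OCTET-SIZE - ;
```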

Bruce McFarling

Sep 27, 2005, 5:26:20 AM

Stephen Pelc wrote:
> IMHO your approach is from the implementor's perspective, which is
> valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
> that what *applications* do with strings is at a *much* higher level
> than implementors issues.

It's not from an implementor's perspective, because I ain't an implementor.
It's from a text processing perspective. Almost all applications must
use strings to communicate with the user, but only text processing
applications have to .... errr process text.

Since I have a bit of a non-professional interest in text processing
(for my job, I mostly just generate it), I'll have a crack at
"addressing" the interaction between this and the i18n proposal.

The ACS only critically depends on the language in use if it is an
8-bit code page. If it is UTF-8, UTF-16 or UTF-32 it does not change
when the language changes, that being the point of the Unicode
Transformation Formats. Display, input, etc. may have to change, but not
the character set per se. And while the XCHAR proposal is focused on
UTF-8, it also fits into UTF-16, especially for a historian,
archeologist, or anthropologist who needs to work with archaic or
uncommon languages that may require characters above the 16-bit plane.

If the ACS is an 8-bit code page, the only thing that is likely to
change as a result of something you've done in the i18n system is
sorting order. AND SORTING ORDER IS NOT A CHARACTER SET ISSUE! It's a
language issue.

OK, now, any translation REQUIRED between the system and the DCS is
built into the implementation. That includes any text files read or
written by the system.

So, if "ASCII" is taken to mean, "ASCII, possibly extended by a
language-specific code page", the four most common ACS/DCS combinations
are:

ACS=ASCII, DCS=ASCII
ACS=ASCII, DCS=UTF-#
ACS=UTF-#, DCS=ASCII
ACS=UTF-#, DCS=UTF-#

The translation issues are:

* ACS=ASCII, DCS=ASCII
They happen to be different code pages. KEY, EMIT, [CHAR] and CHAR may
have an issue of which code page you are talking about. But neither
involves XCHARs.

* ACS=ASCII, DCS=UTF-#
The only question is whether XKEY/XEMIT is in Application space or
Developer space or are transitions between the two.

I don't see how the input can be FROM developer space and output TO
developer space (programming utilities, after all, are only
applications that happen to work in the developers languages, so ACS
happens to EQUAL DCS), so there are only two possibilities:

** If XKEY/XEMIT are entirely in Application Space, no possible dramas,
no matter what character set that is. As XKEY's they are just
arbitrary chunks of bits measured in arbitrary address units.

** If XKEY/XEMIT bring ACS characters into Developer Space and then out
again, then translations occur from ASCII + "SET LANGUAGE" code page to
UTF. If the application is internationalised, all characters emitted
will be from input or from resource files, so there is never any "CS
won't translate" problem.

* ACS=UTF-#, DCS=ASCII

** If XKEY/XEMIT bring ACS into Developer space, there is a potential
translation problem, in that not all UTF-# encoded characters will fit
into any given 8-bit code page.

* ACS=UTF-#, DCS=UTF-#

** For this, there is no XKEY/XEMIT translation barrier, even if they
are different UTF's (say the developer is Han Chinese, and so prefers
to develop in UTF-16, or is working with an OS that relies on UTF-16,
but is writing for an Atlantic Zone audience internationalised into
English/Spanish/French/Portuguese and so prefers UTF-8 as the ACS),
since there are well-defined translations between any well-encoded UTF
characters. There is translation overhead, but that is all.

** For this, the problem is that there need to be DIFFERENT "XKEY"'s if
they are different encodings of the same character set.

To my mind, XCHARs belong to the Application Character Set, since the
kind of thing that can be portable between systems is more text
processing applications than how a particular system may talk to its
underlying operating systems.

Further XCHARS are quite clearly NEEDED for the text processing in an
ACS, since CHARs suffice for ASCII code-page encodings, but not for
UTF-# encodings of THE SAME CS, and ASCII code-page does not accommodate
all character sets.

And for things like searching source for a particular definition, just
set the ACS to the DCS.

This is orthogonal to my earlier comment. My earlier comment presumes
that XCHARS are for what might be termed the "Memory Storage CS", not
for what may be termed the "Permanent Storage CS", which may well be
different. XCHARS define a translation between the stack and Memory
Storage. File words bring parts of files into Memory Storage. Hence
my argument that there should be file modes that handle that
translation (which can be done in bulk). And indeed, in a certain
sense that needs to be done in the file word, because the file words
are designed to bring parts of files into ALLOCATED parts of storage,
so the file words should only bring as much as can fit into the
allocated part of storage under the Memory Storage CS.

On the other hand, while XCHARs are required in ACS land, the ACS is
subject to change. And it doesn't make sense to change it "behind the
back" of the I18N words. So that suggests that the SET LANGUAGE system
ought to include an ability to set the default working character set
encoding and the default permanent storage character set encoding.

There is no need for a portable program to SET the ACS encoding. But
it may have to be able to QUERY the ACS encoding, and then to be able
to associate that with a particular collection of text in memory so
that if necessary it can RESTORE the ACS encoding to what was in place
when that text went into memory.

Bernd Paysan

Sep 27, 2005, 8:57:17 AM

Bruce McFarling wrote:
> WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
> of text processing and place them in the realm of networking standards
> compliance. And a subset of the XCHAR words would suggest how to
> handle them:
>
> OCTET-SIZE ( -- u )
> The memory size of a Byte in address units.
>
> OCTET@+ ( oct_addr1 -- oct_addr2 oct )
> Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first
> memory
> location after xc.
>
> OCTET!+ ( oct oct_addr1 -- oct_addr2 )
> Stores the OCTET oct at oct_addr1. oct_addr2 points to the first memory
> location after xc.
>
> OCTET+ ( oct_addr1 -- oct_addr2 )
> Adds the size of an OCTET to oct_addr1, giving oct_addr2.
>
> OCTET- ( oct_addr1 -- oct_addr2 )
> Subracts the size of an OCTET from oct_addr1, giving oct_addr2.
>
> After all, XCHARs do not get rid of the possibility that CHARs may be
> 16 bits wide, though they may be of use for 8-bit data when the CHARs
> are 16 bits wide.

Another missing part of my XCHAR proposal is how to change the way these
XCHARs are handled. ATM, I say the system deals with that, depending on
user settings (e.g. LANG environment variable). What's obvious is that
there's a way to deal with several encodings, and OCTET could be one of
them.

OCTET-SIZE still would be ( xc -- u ), to fit into the general stack
picture, but the u would not depend on xc.

Since the actually available encodings are rather system-dependent, I
suggest that the system documentation lists available encodings and ways to
set them. E.g.

XC-CODING ( xc-id -- ) set XC encoding.

XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

ASCII ( -- xc-id ) Format: ASCII characters. The lowest 7 bits of xc are
stored in memory; it is not defined what happens with bit 8.

OCTET ( -- xc-id ) Format: Octets. The lowest 8 bits of xc are stored in
memory. This encoding is compatible with packed ASCII strings.

UTF-8 ( -- xc-id ) Format: UTF-8 characters. This encoding is compatible
with packed ASCII strings.

UTF-16 ( -- xc-id ) Format: UTF-16 characters. This encoding is not
compatible with packed ASCII strings, but ASCII strings can be converted.

This however is the part of the system which is still open, so I can't say
there is enough experience to push a RfD through.

Bruce McFarling

Sep 27, 2005, 11:26:52 AM

Bernd Paysan wrote:
> OCTET-SIZE still would be ( xc -- u ), to fit into the general stack
> picture, but the u would not depend on xc.

Or not be a word at all, but rather be a query, since it won't be
changing and won't need any magic going on behind the back of the
author of portable code to make the portable code work.

> Since the actually available encodings are rather system-dependent, I
> suggest that the system documentation lists available encodings and ways to
> set them. E.g.

> XC-CODING ( xc-id -- ) set XC encoding.

> XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

I would stress that more important than the ability to generate xc-id's
is the ability to get the CURRENT xc-id. Scenario: you get some text
and it is stored in memory somewhere. Then you take an action that you
know MIGHT result in a switch in character set, and you get some text,
and it is stored in memory somewhere.

So, if YOU didn't SET the xc-id's, how do you know how to switch back
and forth between them, or even whether you need to?

SET-XCHAR ( xc-id -- )
GET-XCHAR ( -- xc-id )

is the core. That lets you get the xc-id when you store the first set
of information in memory, lets you get the xc-id when you store the
second set of information in memory, test for equality to see if you
have to take care, reset to the "old" xc-id when appropriate.
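The get, compare, and restore pattern described here could be sketched as follows. This is only an illustration: the variable `current-xc-id`, the numeric ids, and the helper `with-xchar` are hypothetical, and no text is actually re-encoded; only the bookkeeping is shown. CATCH and THROW come from the optional Exception word set.

```forth
\ Hypothetical encoding ids; a real system would define its own.
0 constant enc-ascii
1 constant enc-utf-8

variable current-xc-id          \ assumed global holding the active encoding
enc-ascii current-xc-id !

: get-xchar ( -- xc-id )  current-xc-id @ ;
: set-xchar ( xc-id -- )  current-xc-id ! ;

\ Run xt with xc-id in effect, then restore whatever was active before,
\ even if xt throws.
: with-xchar ( i*x xt xc-id -- j*x )
    get-xchar >r  set-xchar
    catch
    r> set-xchar  throw ;
```

With these definitions, `' get-xchar enc-utf-8 with-xchar` leaves the UTF-8 id on the stack, and a following `get-xchar` returns the ASCII id again.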

If there are going to be these:

> ASCII ( -- xc-id ) Format: ASCII characters. The lowest 7 bits of xc are
> stored in memory; it is not defined what happens with bit 8.
>
> OCTET ( -- xc-id ) Format: Octets. The lowest 8 bits of xc are stored in
> memory. This encoding is compatible with packed ASCII strings.
>
> UTF-8 ( -- xc-id ) Format: UTF-8 characters. This encoding is compatible
> with packed ASCII strings.
>
> UTF-16 ( -- xc-id ) Format: UTF-16 characters. This encoding is not
> compatible with packed ASCII strings, but ASCII strings can be converted.

There should also be LANGUAGE-XCHAR ( -- ) to synchronise the xc-id
with the current language. An implementation of XCHAR's that did not
have I18N implemented would reset xc-id to the system default.

Albert van der Horst

Sep 27, 2005, 12:11:52 PM

In article <1127792236.4...@g43g2000cwa.googlegroups.com>,

It is not clean to store an integer (the count) in a character.
It is not useful to have a count limited to 256 in Britain,
65536 in Japan, and 4 billion in China.

Albert van der Horst

Sep 27, 2005, 12:20:36 PM

In article <1127798554....@g14g2000cwa.googlegroups.com>,

Bruce McFarling <agi...@netscape.net> wrote:
>
>Stephen Pelc wrote:
>> How does this fit in with the wide character and internationalisation
>> proposals at
>> www.mpeforth.com/arena/
>> i18n.propose.v7.PDF
>> i18n.widechar.v7.PDF
>> These proposals/RFCs are from the application developers point of
>> view. There's a sample implementation in the file
>> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
>> is derived from 15+ years of experience. From the file header:
>
>WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
>of text processing and place them in the realm of networking standards
>compliance. And a subset of the XCHAR words would suggest how to
>handle them:
>
>OCTET-SIZE ( -- u )
>The memory size of a Byte in address units.

A byte is an address unit. Not only by definition but for all
practical purposes.
Can't we just condemn those that don't to declare an
"environmental dependency on an address unit not to contain
8 bits"?
By the way Chuck Moore would have to define OCTET-SIZE as one
quarter, anyway. How is that?

>
>OCTET@+ ( oct_addr1 -- oct_addr2 oct )
>Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
>location after oct.

Much too verbose for such a basic word.
Why not OCTET -> B

<SNIP>

>After all, XCHARs do not get rid of the possibility that CHARs may be
>16 bits wide, though they may be of use for 8-bit data when the CHARs
>are 16 bits wide.

CHAR's should not be used for 8-bit data.
XCHAR's should not be used to free CHAR's of the chore to handle
8-bit data, because of a refusal to use bytes (or OCTET's).

So,
do we really need XCHAR ?

Groetjes Albert

Anton Ertl

Sep 27, 2005, 4:09:09 PM

Bernd Paysan <bernd....@gmx.de> writes:
>Problem:
>
>ASCII is only appropriate for the English language. Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though).

Actually Unicode (in its UCS-4/UTF-32 encoding) would also fit in the
ANS Forth frame. However, most near-ANS code around has an
environmental dependency on 1 chars = 1 au, and I think that more
existing programs work with a system that uses 1-au chars and xchars
(even when processing wider xchars) than with a system that uses n-au
chars (n>1).

> Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

That sounds like a requirement and should therefore be part of the
proposal, not the problem description.

The on-stack representation of ASCII characters should certainly be
ASCII. For the in-memory representation that would also have some
advantages: in particular, programs that access individual characters
using char (not xchar) words would work correctly on strings
consisting only of ASCII characters (and ANS Forth does not give any
guarantee for other characters anyway).

>Proposal

I would have waited for some more time (and experience) before making
such a proposal (I am still unsure which words to include and which
not). But since you made it, let's collect the feedback.

>Words:
>
>XC-SIZE ( xc -- u )
>Computes the memory size of the XCHAR xc in address units.
>
>XC@+ ( xc_addr1 -- xc_addr2 xc )
>Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.
>
>XC!+ ( xc xc_addr1 -- xc_addr2 )
>Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.

This is unsafe, as it writes an unknown amount of data behind
xc_addr1. One can use it safely in combination with XC-SIZE, but then
it is easier to use XC!+? (see below).

Providing this word, but not XC!+? discourages safe programming
practices and encourages creating buffer overflows.

In other words, this might become Forth's strcat().

It's probably best not to standardize this word.

>XCHAR+ ( xc_addr1 -- xc_addr2 )
>Adds the size of the XCHAR stored at xc_addr1 to this address, giving
>xc_addr2.
>
>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.
>
>X-SIZE ( xc_addr u -- n )
>n is the number of monospace ASCII characters that take the same space to
>display as the XCHAR string starting at xc_addr, using u address units.

Maybe another name would be harder to confuse with XC-SIZE. How about
X-WIDTH or XC-WIDTH?

>XKEY ( -- xc )
>Reads an XCHAR from the terminal.
>
>XEMIT ( xc -- )
>Prints an XCHAR on the terminal.

Currently Gforth also implements:

+X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like 1 /STRING

-X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like -1 /STRING

XC@ ( xc-addr -- xc )
like C@

DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
safe version of XC!+, f specifies success

-TRAILING-GARBAGE ( addr u1 -- addr u2 )
remove trailing incomplete xc

Of course, some of these can be defined from others, but it's not
clear to me yet which ones are the set that we want to select.
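To illustrate how some of these can be built from others, here is a minimal sketch that derives an XC@ and a +X/STRING from a fetch-and-advance word and a step-forward word. To stay self-contained it supplies its own UTF-8 stand-ins for XC@+ and XCHAR+ under made-up names (`u8@+`, `u8char+`, `u8-len`); they are not taken from Gforth's source. Assumptions: 1 char = 1 au = 8 bits, cells of at least 32 bits, well-formed input, and sequences of at most four octets. /STRING comes from the String word set.

```forth
\ Sequence length implied by the lead octet of a UTF-8 character.
: u8-len ( lead-octet -- u )
    dup 128 u< if drop 1 exit then
    dup 224 u< if drop 2 exit then
    dup 240 u< if drop 3 exit then
    drop 4 ;

\ Stand-in for XCHAR+: skip one whole character.
: u8char+ ( xc-addr1 -- xc-addr2 )  dup c@ u8-len chars + ;

\ Stand-in for XC@+: fetch one character and advance past it.
: u8@+ ( xc-addr1 -- xc-addr2 xc )
    count dup 128 u< if exit then                     \ ASCII: one octet
    dup 224 u< if 31 and 1 else
    dup 240 u< if 15 and 2 else 7 and 3 then then     ( addr acc n )
    0 do  >r count 63 and  r> 6 lshift or  loop ;     \ add 6 bits per octet

\ Derived words: nothing below depends on the encoding itself.
: u8@ ( xc-addr -- xc )  u8@+ nip ;
: +u8/string ( xc-addr1 u1 -- xc-addr2 u2 )  over dup u8char+ swap - /string ;
```

The two derived words only use the primitives, so the same definitions would carry over unchanged to a fixed-width encoding once `u8@+` and `u8char+` are replaced.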

>The following words behave different when the XCHAR extension is present:

That is actually a compatible extension of ANS Forth's CHAR and
[CHAR]; for ASCII characters they behave exactly the same, and for
others ANS Forth does not specify a behaviour. So I would not say
"behave different", but use wording such as "extend the semantics of
..."

>Open issues are file reading and writing (conversion on the fly or leave as
>it is?).

Definitely conversion on the fly. There must be only one character
encoding in memory. However, we have not implemented that yet.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.complang.tuwien.ac.at/forth/ansforth/forth200x.html
EuroForth 2005: http://www.complang.tuwien.ac.at/anton/euroforth2005/

Anton Ertl

Sep 27, 2005, 4:55:53 PM

"Bruce McFarling" <agi...@netscape.net> writes:
>The first thing to settle is whether XCHARS are "these" extended
>character sets that are upwardly compatible with printable ASCII, or
>"this" extended character set. And I could well see a wish to use, eg,
>UTF-8 in file storage (if my primary targets were Europe, Africa, and
>the Americas) and UTF-16 in processing.

Xchars can be used for any fixed-width encodings (even for a
fixed-width encoding with three chars/xchar), and for any
variable-width encodings that satisfy the requirements (e.g., UTF-8
and UTF-16).

That being said, I don't see a point in using UTF-16 for processing;
it combines the disadvantages of a fixed-width encoding with the
disadvantages of a variable-width encoding. If you want fixed-width,
use UTF-32; if you want variable-width, use UTF-8.

>It seems to me that, since you can always tell where a UTF character
>begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
>but you need to know WHICH it is as well as endianness for UTF16
>and UTF32, the most coherent thing to do is to have AN XCHAR
>representation for processing and a set of file modes that specify the
>kind of file you are loading:
>
>* ASCII (latin-1, etc, any fixed 8-bit code pages)
>* UTF8
>* UTF16 (endedness of your system)
>* UTF32 (endedness of your system)
>* UTF16B
>* UTF16L
>* UTF32B
>* UTF32L
>
>Then if the file mode matches the system mode, you just load the file,
>if it mismatches, it is translated on the fly on reading and writing.

Yes, that's somewhat like what I have in mind. Except that currently
I am only envisioning conversions between various 8-bit encodings and
UTF-8; but if there really are people around with UTF-16 files, adding
a converter for them is not a big issue.

>Obviously the system mode would be a thing for a system query.

Ideally programs should be written with the Xchars words such that
they do not need to know the encoding used in the system.

Anton Ertl

Sep 27, 2005, 5:08:20 PM

steph...@mpeforth.com (Stephen Pelc) writes:
>>XCHAR- ( xc_addr1 -- xc_addr2 )
>>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>>work for every possible encoding.
>
>IMHO standardising a word that can't be guaranteed to work is not
>beneficial.

This word is guaranteed to work (if there is at least one character
right before xc_addr1).

If you are thinking about encodings where you cannot find the previous
character, they are not supported by Xchars. And I consider this a
virtue, not a deficiency.
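UTF-8 is an example of an encoding where the previous character can always be found: continuation octets have the bit pattern 10xxxxxx, so stepping back means skipping over them until some other octet turns up. A minimal sketch, assuming 1 char = 1 au = 8 bits and well-formed text in front of the address; the name `u8char-` is a stand-in for illustration, not part of the proposal.

```forth
: u8char- ( xc-addr1 -- xc-addr2 )
    begin
        1 chars -                 \ step back one octet
        dup c@ 192 and 128 <>     \ done unless this is a continuation octet
    until ;
```

An encoding in which a trailing octet can be mistaken for the start of a character offers no such test, which is the kind of encoding excluded here.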

>I look forward to discussing these issues at EuroForth 2005.

I will be there.

Anton Ertl

Sep 27, 2005, 5:16:40 PM

steph...@mpeforth.com (Stephen Pelc) writes:
>Unfortunately I have to disagree here. Even if you can get to one
>encoding from the UTF-xxx family in the long term, applications
>written in South Africa (development character set, DCS) must be able
>to be hosted and configured on a PC running a Chinese-xxx version
>of some operating system (operating character set, OCS) and used by
>a Russian-xxx speaker (application character set, ACS). This is a
>mix that has been seen "in the wild" - it is not a scenario.

No problem:

DCS: Unicode (encoded as UTF-8 or UTF-32)
OCS: Unicode (encoded as UTF-8 or UTF-32)
ACS: Unicode (encoded as UTF-8 or UTF-32)

So once your condition above is satisfied, this is not an issue at the
character set and encoding level, and is thus outside the scope of the
xchars words.

>The impact of ACS is not necessarily in the encoding, but in
>how the application presents information and the order of
>text substitutions, e.g. subject/verb/object and time/manner/place.
>Then there's the date/time display nightmare and ...

Well, that's internationalisation. Xchars don't solve (much of) that.

Anton Ertl

Sep 27, 2005, 5:30:59 PM

"Bruce McFarling" <agi...@netscape.net> writes:

>
>Stephen Pelc wrote:
>WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
>of text processing and place them in the realm of networking standards
>compliance.

Bytes are not in ANS Forth, and are therefore not used in text
processing.

With Xchars, one might use Chars as bytes: Nearly all systems
implement chars as bytes anyway, and probably a number of programs use
chars for bytes, so one might standardize on that.

The disadvantage of such a step in the Xchars context would be that
the in-memory representation for UTF-16 and UTF-32 would no longer be
fully ASCII-compatible (one ASCII Xchar would become more than one
Char).

But I don't believe that UTF-16 or UTF-32 and multi-au Chars will
become significant, so one might just as well settle down to using
Chars for bytes.

>And a subset of the XCHAR words would suggest how to
>handle them:

Well, since octets are fixed-width, it may be better to model the
octet words on the Char or Cell words than on the Xchar words.

Anton Ertl

Sep 27, 2005, 5:39:52 PM

Bernd Paysan <bernd....@gmx.de> writes:
>Another missing part of my XCHAR proposal is how to change the way these
>XCHARs are handled.

No, that's not missing. There should not be any switching between
encodings. There is one encoding in the Forth system that should be
able to represent anything, and everything is converted to that
encoding on input, and from that encoding on output. No need to
switch anything.

If you allowed switching, then:

- Either you would have to change the encoding of all the strings in the
Forth system. This is impossible.

- Or the program would have to keep track of which strings are in
which encoding and always switch around. That's cumbersome and
error-prone.

>XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

IMO the encoding should be part of the fam, and not be set on the fly.
Or do you envision files that mix UTF-8 and, say, UTF-16? So we might
have words like

UTF-8 ( fam1 -- fam2 )

latin-1 ( fam1 -- fam2 )

Bruce McFarling

Sep 28, 2005, 4:49:04 AM

Anton Ertl wrote:
> That being said, I don't see a point in using UTF-16 for processing;

To save memory space, if your primary language uses a wide character
set in the first plane (where most UTF-8 encodings are three bytes
long). Also if you know what language you are working in, you know
whether or not you are going to stay down in the first plane, so the
variable width issue may be moot.
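The space argument can be checked with a little arithmetic. The sketch below computes how many octets a code point needs in UTF-8 and in UTF-16. The word names are invented for illustration, code points are assumed to be valid Unicode values up to U+10FFFF, and cells are assumed to be at least 32 bits wide.

```forth
: utf8-octets ( xc -- u )
    dup   128 u< if drop 1 exit then    \ U+0000..U+007F
    dup  2048 u< if drop 2 exit then    \ U+0080..U+07FF
    dup 65536 u< if drop 3 exit then    \ U+0800..U+FFFF (rest of the first plane)
    drop 4 ;                            \ U+10000 and above

: utf16-octets ( xc -- u )
    65536 u< if 2 else 4 then ;         \ one code unit, or a surrogate pair
```

For U+4E2D (decimal 20013), a common CJK ideograph, `20013 utf8-octets` gives 3 and `20013 utf16-octets` gives 2. For ASCII text the relation reverses: `65 utf8-octets` gives 1 and `65 utf16-octets` gives 2.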

Not that I had those in mind when I wrote that, rather I had in mind
that as soon as you assume away something, you will find out that
someone else has a strong preference for it, so I tried to avoid
assuming away anything.

Bruce McFarling

Sep 28, 2005, 4:57:54 AM

Albert van der Horst wrote:

> Much too verbose for such a basic word.
> Why not OCTET -> B
>
> <SNIP>
>
> >After all, XCHARs do not get rid of the possibility that CHARs may be
> >16 bits wide, though they may be of use for 8-bit data when the CHARs
> >are 16 bits wide.

> CHAR's should not be used for 8-bit data.
> XCHAR's should not be used to free CHAR's of the chore to handle
> 8-bit data, because of a refusal to use bytes (or OCTET's).

> So,
> do we really need XCHAR ?

Yes, of course, because XCHARS is not about address units but about
character set units. XCHARS handle extended character data, where we
know perfectly well that sometimes it is one octet long, sometimes it
is two octets long, sometimes it is four octets long, sometimes it
ranges from one to four octets long, and sometimes it ranges from two
to four octets long. So XCHAR+, XCHAR-, XCHAR@+, and XCHAR!+ are
things that are likely to benefit from optimisation and especially
handy for portability given that you could write and test for, say,
UTF-8, and then have code that works for a fixed width 16-bit character
set.

So when I say "XCHAR's may be a byte wide", that's dependent on the
character set encoding in use, not the system and system-specific
address unit.

Bruce McFarling

Sep 28, 2005, 5:23:02 AM

Albert van der Horst wrote:

[Bruce]

> >COUNT is perfectly useful and clean. It's just using it to count, with
> >the attendant limitation of counts to the width of a uniform width
> >character set that is obsolete.

> It is not clean to store an integer (the count) in a character.
> It is not useful to have a count limited to 256 in Britain,
> 65536 in Japan, and 4 billion in China.

I didn't say THAT was clean or useful. In fact, I said that THAT is
obsolete. But CHAR@+ is perfectly clean and useful, however confusing
the string of letters you use to do it.

Stephen Pelc

Sep 28, 2005, 10:12:59 AM

On Tue, 27 Sep 2005 17:39:52 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>No, that's not missing. There should not be any switching between
>encodings. There is one encoding in the Forth system that should be
>able to represent anything, and everything is converted to that
>encoding on input, and from that encoding on output. No need to
>switch anything.

The key word is "should". However, reality intervenes. There are
apps out there that use multiple encodings. A standard formalises
current practice - it is *not* a design for the future.

If you push through a standard that disenfranchises existing
substantial apps, the developers of those apps will ignore
the standard. Is this what you want?

The preferred route, I suggest, is to provide GET-ENCODING and
SET-ENCODING. In your system, you can always be non-compliant
for the moment. You will then have an environmental dependency on
UTF8. This is no worse than the widely accepted char=byte=au
dependency.

Bruce McFarling

Sep 28, 2005, 10:22:56 AM

Albert van der Horst wrote:

> >OCTET@+ ( oct_addr1 -- oct_addr2 oct )
> >Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
> >location after oct.

> Much too verbose for such a basic word.
> Why not OCTET -> B

I don't know why not. I'm pretty confident that few people are likely
to have OCTET@+ lying around, and if they do it's odds-on it does that
anyway. B@+? The B could stand for "buffer", or "block". OTOH,
BYTE@+ is fine by me.

Could call it "BCOUNT" in homage to established naming conventions for
CHAR@+, which is called COUNT, or OCOUNT.

Or OC@+ in homage to the yank television show that the young'uns here
like so much.

Bruce McFarling

Sep 28, 2005, 10:28:35 AM

Stephen Pelc wrote:

> The preferred route, I suggest, is to provide GET-ENCODING and
> SET-ENCODING. In your system, you can always be non-compliant
> for the moment. You will then have an environmental dependency on
> UTF8. This is no worse than the widely accepted char=byte=au
> dependency.

Note that an implementation may only do one encoding, in which case
GET-ENCODING will always get the same encoding, and SET-ENCODING will
either do nothing or throw an error if the encoding set is not the
supported one.
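For a system that supports exactly one encoding, the two words could be as small as the following sketch. The constant `utf-8-id` and its value are assumptions, neither word is part of any standard, and throwing -21 ("unsupported operation" in the ANS Forth exception table) is just one possible choice for the error case.

```forth
1 constant utf-8-id     \ hypothetical id of the only supported encoding

: get-encoding ( -- enc-id )  utf-8-id ;

\ Accept the supported encoding silently, refuse anything else.
: set-encoding ( enc-id -- )
    utf-8-id <> if  -21 throw  then ;
```

A portable program can then write `get-encoding` and later `set-encoding` without caring whether the system underneath ever switches at all.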

It certainly is not unreasonable for gforth to focus on UTF-8, which is
emerging as a de facto standard in much of Linux oriented open source.
A standard that did not accommodate UTF-8 would be flawed. But
prescribing in advance of common practice will limit the uptake of the
standard and therefore the portability of source relying on it.

Bernd Paysan

Sep 28, 2005, 12:00:58 PM

Anton Ertl wrote:

> One can use it safely in combination with XC-SIZE, but then
> it is easier to use XC!+? (see below).

Well, the reference implementation of XC!+? then is

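: xc!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
    >r over xc-size r@ over u< if  ( xc xc-addr1 len  r: u1 )
        \ not enough space: store nothing, return the original address and length
        drop nip r> false
    else
        \ store xc, then shrink the remaining length by its size
        >r xc!+ r> r> swap - true
    then ;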

> In other words, this might become Forth's strcat().

You at least know that there is an upper bound for how much you might
overwrite (not the case with strcat). Well, the upper bound depends on the
encoding, and we don't guarantee now that -1 XC-SIZE will return the
maximum one.
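For UTF-8 restricted to code points up to U+10FFFF, the bound is four address units, as the following sketch of a UTF-8 store shows. All names (`u8-size`, `lead!+`, `cont!+`, `u8!+`) are made up for illustration and are not the Gforth implementation. Assumptions: 1 char = 1 au = 8 bits, cells of at least 32 bits, and no xc above U+10FFFF; larger values are not checked and would be stored incorrectly.

```forth
: u8-size ( xc -- u )
    dup   128 u< if drop 1 exit then
    dup  2048 u< if drop 2 exit then
    dup 65536 u< if drop 3 exit then
    drop 4 ;

\ Store the lead octet: the bits of xc above `shift`, or'ed with `mark`.
: lead!+ ( xc addr shift mark -- xc addr' )
    >r >r  over r> rshift  r> or  over c!  char+ ;

\ Store bits shift..shift+5 of xc as a continuation octet (10xxxxxx).
: cont!+ ( xc addr shift -- xc addr' )
    >r  over r> rshift  63 and 128 or  over c!  char+ ;

\ Writes u8-size octets, so never more than four, starting at xc-addr1.
: u8!+ ( xc xc-addr1 -- xc-addr2 )
    over u8-size
    dup 1 = if drop   0   0 lead!+                                nip exit then
    dup 2 = if drop   6 192 lead!+  0 cont!+                      nip exit then
    dup 3 = if drop  12 224 lead!+  6 cont!+  0 cont!+            nip exit then
    drop             18 240 lead!+ 12 cont!+  6 cont!+  0 cont!+  nip ;
```

Storing the euro sign with `8364 buf u8!+` writes the three octets 226 130 172 and returns `buf` advanced by three.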

Stephen Pelc

Sep 28, 2005, 4:13:10 PM

On 28 Sep 2005 03:28:35 -0700, "Bruce McFarling"
<agi...@netscape.net> wrote:

>It certainly is not unreasonable for gforth to focus on UTF-8, which is
>emerging as a de facto standard in much of Linux oriented open source.
>A standard that did not accommodate UTF-8 would be flawed. But
>prescribing in advance of common practice will limit the uptake of the
>standard and therefore the portability of source relying on it.

I've been discussing applications that have been shipping for 15 or
more years. Internationalisation and the consequent "char" issues
have been around for a long time, and some of our clients handle
them daily. I just don't want their *requirements* to be locked
out.

The DCS, OCS and ACS terminology stems from issues that exist for
real applications. It is certainly rare for encodings to change
after program initialisation (although some multilingual word
processors have worked that way) but it is common that an app
has to select the encoding at startup.

Anton Ertl

Sep 28, 2005, 5:43:44 PM

steph...@mpeforth.com (Stephen Pelc) writes:
>It is certainly rare for encodings to change
>after program initialisation (although some multilingual word
>processors have worked that way) but it is common that an app
>has to select the encoding at startup.

Sounds to me that we are in agreement then. Gforth uses the standard
Unix mechanism (the LANG environment variable) for determining the
encoding on startup. No switching words needed.

As for multilingual word processors, that's a good reason for using a
universal character set and encoding rather than switching around.

Anton Ertl

Sep 28, 2005, 5:49:50 PM

steph...@mpeforth.com (Stephen Pelc) writes:
>On Tue, 27 Sep 2005 17:39:52 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>No, that's not missing. There should not be any switching between
>>encodings. There is one encoding in the Forth system that should be
>>able to represent anything, and everything is converted to that
>>encoding on input, and from that encoding on output. No need to
>>switch anything.
>
>The key word is "should". However, reality intervenes. There are
>apps out there that use multiple encodings. A standard formalises
>current practice - it is *not* a design for the future.

It makes no sense to standardize a current practice that has no
future.

But as I said before, IMO it's a little too early for the xchars
proposal, because there is not enough practice with it.

In the Linux world, UTF-8 is the present.

>If you push through a standard that disenfranchises existing
>substantial apps, the developers of those apps will ignore
>the standard. Is this what you want?

I have read enough statements from Forth vendors that it's impossible
to write substantial apps in ANS Forth, so supposedly the programmers
of those substantial apps are ignoring the standard already.

The existing apps will continue to work on the systems where they
worked before and be as non-standard as they ever were.

It seems to me that you are thinking about requirements of your
customers that most of the others don't have, and that hopefully will
go away at some point even for your customers.

>The preferred route, I suggest, is to provide GET-ENCODING and
>SET-ENCODING.

That's the worst possible design; or maybe having an ENCODING variable
would be even worse.

In general, the global-state approach is always causing problems,
whether it's STATE or BASE or something else.

If you want to support different encodings, the encoding should be
stored with the data. But then we would be dealing with something
that's much different from current Forth strings. And the words for
dealing with that stuff would probably be much different from the
xchars words.

Xchars were designed for dealing with one encoding used throughout the
Forth system. Several encodings are compatible with the requirements
of xchars, and a Forth system might let you choose on startup which
encoding to use, but you cannot switch around between encodings.

Anton Ertl

Sep 28, 2005, 6:15:12 PM

Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>
>> One can use it safely in combination with XC-SIZE, but then
>> it is easier to use XC!+? (see below).
>
>Well, the reference implementation of XC!+? then is

My point is that you should include XC!+? in the proposal and probably
delete XC!+ from it.

BTW, concerning a reference implementation of xchars, a reference
implementation for the 8bit (or a general fixed-width) encoding should
be easy (although not very exciting).
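A sketch of what such a fixed-width 8-bit implementation might look like, assuming 1 char = 1 au = 8 bits. The word names and stack effects follow the proposal and the Gforth list quoted earlier in the thread; the bodies are an illustration and are not taken from any existing system. On a system that already provides xchar words, these definitions simply shadow them. XC!+? is Bernd's definition from his earlier post, here running on top of the 8-bit words.

```forth
: xc-size   ( xc -- u )                     drop 1 chars ;
: xc@+      ( xc-addr1 -- xc-addr2 xc )     count ;
: xc!+      ( xc xc-addr1 -- xc-addr2 )     tuck c! char+ ;
: xchar+    ( xc-addr1 -- xc-addr2 )        char+ ;
: xchar-    ( xc-addr1 -- xc-addr2 )        1 chars - ;
: xc@       ( xc-addr -- xc )               c@ ;
: +x/string ( xc-addr1 u1 -- xc-addr2 u2 )  1 /string ;

\ Bounded store: f is true if xc fitted into the u1 address units available.
: xc!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
    >r over xc-size r@ over u< if
        drop nip r> false           \ no room: nothing is written
    else
        >r xc!+ r> r> swap - true   \ written: advance and shrink the length
    then ;
```

With a two-character buffer `buf`, `65 buf 2 xc!+?` stores the character and returns the advanced address, a remaining length of 1, and true, while `66 buf 0 xc!+?` returns the address and length unchanged with false.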

>> In other words, this might become Forth's strcat().
>
>You at least know that there is an upper bound for how much you might
>overwrite (not the case with strcat).

True, but XC!+ can be enough to overwrite an xt, and that can be
enough to break into the system.

>Well, the upper bound depends on the
>encoding, and we don't guarantee now that -1 XC-SIZE will return the
>maximum one.

Even if an upper bound could be determined, making use of that would
require additional programmer effort, and it's a bad idea to design
words that require that; you need to educate the programmers about
that, and even if they know about it, it's still easier to make errors
when the required effort is higher.

Albert van der Horst

Sep 28, 2005, 11:38:44 PM

In article <1127884982.2...@f14g2000cwb.googlegroups.com>,

Of course, I agree to that. Here in the Netherlands the shorter C@+ is
in common use.

Groetjes Albert

Elizabeth D Rather

Sep 29, 2005, 12:36:59 AM

"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
news:2005Sep2...@mips.complang.tuwien.ac.at...

>
> I have read enough statements from Forth vendors that it's impossible
> to write substantial apps in ANS Forth, so supposedly the programmers
> of those substantial apps are ignoring the standard already.

That statement refers to the need for dependencies on things such as
underlying OS (and its interface), device drivers, and other extensions.
Wise programmers (IMO) stick to ANS Forth for everything not involving such
extensions, which is often the bulk of the app.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310-491-3356
5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454
Hawthorne, CA 90250
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Bruce McFarling

Sep 29, 2005, 3:49:38 AM

Anton Ertl wrote:
> steph...@mpeforth.com (Stephen Pelc) writes:
> >It is certainly rare for encodings to change
> >after program initialisation (although some multilingual word
> >processors have worked that way) but it is common that an app
> >has to select the encoding at startup.

> Sounds to me that we are in agreement then. Gforth uses the standard
> Unix mechanism (the LANG environment variable) for determining the
> encoding on startup. No switching words needed.

Except that is at the startup of gforth, not necessarily the startup of
the application. And that does not address someone who uses gforth as
a buffer against the expert-friendliness of Linux.

Bruce McFarling

Sep 29, 2005, 3:52:07 AM

Stephen Pelc wrote:
> I've been discussing applications that have been shipping for 15 or
> more years. Internationalisation and the consequent "char" issues
> have been around for a long time, and some of our clients handle
> them daily. I just don't want their *requirements* to be locked
> out.

Yes, noted. I don't want their requirements locked out either, because
it interferes with uptake of a putative standard and limits
portability.

Bruce McFarling

Sep 29, 2005, 4:00:11 AM

Anton Ertl wrote:
> >The key word is "should". However, reality intervenes. There are
> >apps out there that use multiple encodings. A standard formalises
> >current practice - it is *not* a design for the future.

> It makes no sense to standardize a current practice that has no
> future.

It makes no sense to standardise in a way that locks out a substantial
part of the present, since then the standard will not be viable and
won't have been respected in the available code base when the future
arrives.

> In the Linux world, UTF-8 is the present.

No standard can be limited to the Linux world, just as no standard
should shut out the Linux world.

> I have read enough statements from Forth vendors that it's impossible
> to write substantial apps in ANS Forth, so supposedly the programmers
> of those substantial apps are ignoring the standard already.

That's an all or nothing reading of what turn out to be qualified
statements. It may be impossible to write the entirety of substantial
apps in ANS Forth alone. There is nothing in that statement that
suggests the programmers of those apps are ignoring the standard.
After all, the standard does not *require* you to write the entirety of
an app in ANS Forth alone.

And XCHARs are right in the nitty gritty of low level support words for
text processing that are really appealing to have standardised, whether
formally or as a de facto toolkit.

Bruce McFarling

Sep 29, 2005, 4:15:58 AM

Anton Ertl wrote:
[Bernd]

> >XC-SIZE ( xc -- u )
> >Computes the memory size of the XCHAR xc in address units.

> >XC!+ ( xc xc_addr1 -- xc_addr2 )

> >Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
> >location after xc.

> This is unsafe, as it writes an unknown amount of data behind
> xc_addr1. One can use it safely in combination with XC-SIZE, but then
> it is easier to use XC!+? (see below).

> DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )

> safe version of XC!+, f specifies success

I'm not sure about the level of this. An au length of a sequence of
XCHARs in memory seems handier, to me, for most things, and I
definitely prefer "know in advance" to "try it and clean up if it
fails".

One thing that occurs to me is that XC-SIZE seems to entail MOVE>,
analogous to CMOVE> in address units.

Bruce McFarling

Sep 29, 2005, 4:18:27 AM

Anton Ertl wrote:
> BTW, concerning a reference implementation of xchars, a reference
> implementation for the 8bit (or a general fixed-width) encoding should
> be easy (although not very exciting).

Reference implementations for UTF-32, UTF-16 and UTF-8 would be enough
to give the idea. And of course code-page-ASCII is even easier than
UTF-32.

Stephen Pelc

Sep 29, 2005, 9:30:12 AM

On Wed, 28 Sep 2005 17:43:44 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Sounds to me that we are in agreement then. Gforth uses the standard
>Unix mechanism (the LANG environment variable) for determining the
>encoding on startup. No switching words needed.

If OCS <> ACS, then switching may be needed.

>As for multilingual word processors, that's a good reason for using a
>universal character set and encoding rather than switching around.

Yes for a new design, not necessarily for an existing app being
ported. Standard = current practice. Some of the biggest issues
in ANS94 come from the introduction of new practice. The good
new parts come from the embodiment of best current practice, even
if it came from another language, e.g. CATCH and THROW.

Stephen Pelc

Sep 29, 2005, 10:13:59 AM

On Wed, 28 Sep 2005 17:49:50 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>>The key word is "should". However, reality intervenes. There are
>>apps out there that use multiple encodings. A standard formalises
>>current practice - it is *not* a design for the future.
>
>It makes no sense to standardize a current practice that has no
>future.

Yes it does! It encourages take up of current best practice after
the first port. Application developers simply will not discard a
large and proven code base just because you say they should.

We have been involved in two ports of large commercial Forth
applications, FigForth -> Forth83 and Forth83 -> ANS94. The
final application generates 10-16Mb of binary. Even the first
stage build requires compiling 250,000 lines of code. Until
you understand the mindset of these developers and the
management issues of large applications, you will not
understand why I'm taking this approach.

In essence you want to go from A -> B directly. I'm saying that
acceptance of B requires some people to go A -> C -> B. The
end point is not in dispute, it's the journey that counts.

>I have read enough statements from Forth vendors that it's impossible
>to write substantial apps in ANS Forth, so supposedly the programmers
>of those substantial apps are ignoring the standard already.

I for one do not subscribe to that point of view. What many/some
vendors have said is
a) the standard does not cover enough
b) we were out of time to do more
c) we welcome your taking up the challenge.

>>The preferred route, I suggest, is to provide GET-ENCODING and
>>SET-ENCODING.
>
>That's the worst possible design; or maybe having an ENCODING variable
>would be even worse.
>
>In general, the global-state approach is always causing problems,
>whether it's STATE or BASE or something else.

That's why GET-ENCODING and SET-ENCODING are suggested - they hide
the implementation of the storage.

>Xchars were designed for dealing with one encoding used throughout
>the Forth system. Several encodings are compatible with the
>requirements of xchars, and a Forth system might let you choose on
>startup which encoding to use, but you cannot switch around between
>encodings.

The implication of XCHARs is then that they cannot be used when
ACS <> DCS or OCS <> DCS. This breaks XCHARs for application
development on current Forths.

Bruce McFarling

Sep 30, 2005, 3:56:17 AM

Stephen Pelc wrote:
> The implication of XCHARs is then that they cannot be used when
> ACS <> DCS or OCS <> DCS. This breaks XCHARs for application
> development on current Forths.

Or that they cannot be used in a multi-tasking situation when the ACS
of one task is not the same as the ACS of another task.

On the other hand, GET-ENCODING SET-ENCODING can *accommodate* "UTF-8
uber alles" if SET-ENCODING is:

SET-ENCODING
( xc-id -- flag )
\ flag=FALSE: encoding is not available
\ flag=TRUE: atomic XCHAR encoding is available (it is always possible
\   to find the beginning of the current char from an arbitrary memory
\   address within the string)
\ flag=1: XCHAR encoding is available, but the encoding is not atomic (a
\   valid start of character address is required and you can only move
\   forward)

Then a "UTF-8 uber alles" system simply refuses any other encodings for
XCHARs, and accepts code with system dependencies on AU=1CHAR=8bits,
and XCHAR-ENCODING=UTF-8. Systems that can accommodate those
dependencies (and whatever else they do not have a vanilla-ANS prelude
file for) are able to run those programs. Let the best approach win,
without forcing anybody to lose.
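
As a rough illustration of the three-valued flag described above, here is a
minimal Forth sketch. The encoding ids (ENC-UTF8, ENC-LATIN1,
ENC-FORWARD-ONLY), the variable XCHAR-ENCODING and the word names
STRICT-SET-ENCODING, MULTI-SET-ENCODING and NEED-ATOMIC are all invented for
the example; none of them is defined in the RfD or in any existing system
that I know of.

```forth
\ Sketch only: all names below are hypothetical, not part of any proposal.
0 constant enc-utf8
1 constant enc-latin1
2 constant enc-forward-only   \ stands in for some non-atomic encoding

variable xchar-encoding   enc-utf8 xchar-encoding !

\ "UTF-8 uber alles": UTF-8 is accepted (and atomic), all else refused.
: strict-set-encoding ( xc-id -- flag )
  enc-utf8 = ;                         \ TRUE for UTF-8, FALSE otherwise

\ A system that can switch between several encodings.
: multi-set-encoding ( xc-id -- flag )
  dup enc-utf8 =  over enc-latin1 =  or if
    xchar-encoding !  true exit        \ available and atomic
  then
  dup enc-forward-only = if
    xchar-encoding !  1 exit           \ available, but forward-only
  then
  drop false ;                         \ not available

\ A program that steps backwards through strings insists on TRUE.
: need-atomic ( flag -- )
  true <> abort" atomic XCHAR encoding required" ;
```

A program written against UTF-8 could begin with ENC-UTF8 followed by either
variant and then NEED-ATOMIC, and would behave the same on both kinds of
system; only a request for some other encoding separates them.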

Bernd Paysan

Sep 30, 2005, 2:36:29 PM

Stephen Pelc wrote:
> That's why GET-ENCODING and SET-ENCODING are suggested - they hide
> the implementation of the storage.

So far, I suggest that this part should be defined elsewhere. The XCHAR
wordset itself is orthogonal to the ACS/OCS/DCS separation, and can be
(ab)used to handle that (with SET-ENCODING/GET-ENCODING and the encodings
that live behind that).

Being able to change the encoding also requires knowing what to call these
locales, so either dictionary names have to be defined, or the locale
specifier is a string, like with setlocale - and then you need to tell the
user what the string means.

I think we can agree that we need to handle encodings other than ASCII
and fixed-width wide characters - this is Forth200x, after all, and these
things exist. We need to handle several encodings (switchable) on some
systems in some cases, but not on others.

Encoding changes apparently belong to an internationalization wordset. They
correspond to the C "setlocale" function, and you already have
SET-LANGUAGE/GET-LANGUAGE and SET-COUNTRY/GET-COUNTRY words in your I18N
proposal. Let's keep them, they are fine. The LOCALE words apparently try
to address my concern about how to list and display the available locales
(though I think there should be at least one known locale-id to start from,
e.g. FORTH-LSID, which corresponds to the DCS). So SET-ENCODING/GET-ENCODING
fit perfectly into the LOCALE wordset.

Even systems without internationalization may need XCHAR, because there
are widely used environments with UTF-8 as default character set. But they
don't need to switch between encodings.

Anton Ertl

Sep 30, 2005, 4:24:47 PM

"Bruce McFarling" <agi...@netscape.net> writes:
>
>Anton Ertl wrote:
[>>Someone wrote:]

>> >The key word is "should". However, reality intervenes. There are
>> >apps out there that use multiple encodings. A standard formalises
>> >current practice - it is *not* a design for the future.

....

>> In the Linux world, UTF-8 is the present.
>
>No standard can be limited to the Linux world, just as no standard
>should shut out the Linux world.

My statement was refuting someone's statement about a design for the
future. It was not intended to be exhaustive, much less limiting.

E.g., in Plan9 UTF-8 has been the present since 1992.

I am no expert on Windows, but AFAIK Unicode (or its 16-bit subset) is
the standard character set of Windows NT and its offspring (also for
more than ten years). Jax4th on WNT supported Unicode in 1993.

So, universal character sets are not something that is in the distant
future.

[reinserted missing context]

>>> If you push through a standard that disenfranchises existing
>>> substantial apps, the developers of those apps will ignore
>>> the standard. Is this what you want?

>> I have read enough statements from Forth vendors that it's impossible
>> to write substantial apps in ANS Forth, so supposedly the programmers
>> of those substantial apps are ignoring the standard already.
>
>That's an all or nothing reading of what turn out to be qualified
>statements.

Well, I fail to see the qualification in the statement I responded to.

>It may be impossible to write the entirety of substantial
>apps in ANS Forth alone. There is nothing in that statement that
>suggests the programmers of those apps are ignoring the standard.
>After all, the standard does not *require* you to write the entirety of
>an app in ANS Forth alone.

Well, xchars don't require anyone to write an entire app in ANS Forth
alone, either.

Brad Eckert

Sep 30, 2005, 5:03:18 PM

Should XCHARS be variable length? A given character could be a byte, a
16-bit char, 32-bit char, etc. Then why not support xts too. When TYPE
encounters an xt in a string it would execute it. You can let your
imagination run with that.

I think the purpose of a wordset is to lay down rules for things that
you can't portably do in ANS Forth, like emit a character or string
using the more generalized characters. The other stuff sounds like an
exercise in creating useful data structures. If that's what we're
after, are there already common file formats that contain data
structures that deal with wide character sets?

Brad

Anton Ertl

Sep 30, 2005, 4:42:12 PM

steph...@mpeforth.com (Stephen Pelc) writes:
>On Wed, 28 Sep 2005 17:49:50 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>>It makes no sense to standardize a current practice that has no
>>future.
>
>Yes it does! It encourages take up of current best practice after
>the first port.

?

>Application developers simply will not discard a
>large and proven code base just because you say they should.

Straw man argument.

>In essence you want to go from A -> B directly. I'm saying that
>acceptance of B requires some people to go A -> C -> B. The
>end point is not in dispute, it's the journey that counts.

Let's say what we are talking about: For me A and B are:

A: single character set, single 8-bit encoding
B: single character set, single, possibly variable-width encoding

And most Forth programmers are at A right now. I see no reason for
all of us to go through:

C: multiple character sets, multiple, possibly nasty, encodings

Of course, you have customers who are currently at C. I don't know if
going to B is viable for them, and what immediate steps they should
take, but I don't see that those people who are at A need to go there.

>>>The preferred route, I suggest, is to provide GET-ENCODING and
>>>SET-ENCODING.
>>
>>That's the worst possible design; or maybe having an ENCODING variable
>>would be even worse.
>>
>>In general, the global-state approach is always causing problems,
>>whether it's STATE or BASE or something else.
>
>That's why GET-ENCODING and SET-ENCODING are suggested - they hide
>the implementation of the storage.

It's still global state, with all its problems. Often, a better
design is to have a context wrapper, like

ENCODING-EXECUTE ( enc-id xt -- )

which executes xt in a context where the encoding is enc-id. That
would be safe against exceptions and makes reusable programming
easier.
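
A minimal sketch of such a wrapper, assuming the encoding is held in a
hypothetical variable CURRENT-ENCODING and that CATCH and THROW from the
Exception wordset are available. The enc-ids are invented for the example,
and a real system would presumably do more than store a number when it
switches encodings.

```forth
\ Sketch only: CURRENT-ENCODING and the enc-ids are hypothetical.
variable current-encoding
0 constant enc-8bit
1 constant enc-utf8
enc-8bit current-encoding !

: encoding-execute ( i*x enc-id xt -- j*x )
  current-encoding @ >r        \ remember the caller's encoding
  swap current-encoding !      \ switch to enc-id
  catch                        \ run xt, trapping any THROW
  r> current-encoding !        \ restore, whether or not xt threw
  throw ;                      \ re-raise the exception; 0 THROW does nothing

\ Example use: the word sees enc-utf8 only while it runs.
: show-encoding ( -- ) current-encoding @ . ;
enc-utf8 ' show-encoding encoding-execute
```

Because the old value is restored before the THROW is passed on, a failure
inside xt cannot leave the system stuck in the wrong encoding, which is the
exception-safety point made above.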

But actually in the case of encodings, if you want to support multiple
encodings, they should be stored with the data, maybe with each
character (I believe Emacs does something like this).

>The implication of XCHARs is then that they cannot be used when
>ACS <> DCS or OCS <> DCS.

Yes.

>This breaks XCHARs for application
>development on current Forths.

It may make xchars inappropriate for some applications on some Forths,
but they work well enough on one, no, according to Bernd two current
Forth systems.

Anton Ertl

Sep 30, 2005, 5:13:56 PM

"Bruce McFarling" <agi...@netscape.net> writes:
>
>Anton Ertl wrote:

>> Gforth uses the standard
>> Unix mechanism (the LANG environment variable) for determining the
>> encoding on startup. No switching words needed.
>
>Except that is at the startup of gforth, not necessarily the startup of
>the application.

Sure, so what?

>And that does not address someone who uses gforth as
>a buffer against the expert-friendliness of Linux.

Using expert-friendly Gforth as a buffer against expert-friendly
Linux? Hmm.

Anyway, once Gforth is potentially poisoned with strings from one
encoding, the only reasonable way to change the encoding is to start
Gforth from scratch; we cannot recode all the strings lying around
with the earlier encoding. If anybody really has a problem with
exiting and restarting Gforth, one could write a word that exec()s
Gforth (which has the same effect, except possibly wrt open files and
stuff).

Anton Ertl

Sep 30, 2005, 5:21:33 PM

steph...@mpeforth.com (Stephen Pelc) writes:
>On Wed, 28 Sep 2005 17:43:44 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>>As for multilingual word processors, that's a good reason for using a
>>universal character set and encoding rather than switching around.
>
>Yes for a new design, not necessarily for an existing app being
>ported. Standard = current practice. Some of the biggest issues
>in ANS94 come from the introduction of new practice. The good
>new parts come from the embodiment of best current practice, even
>if it came from another language, e.g. CATCH and THROW.

Well, then take a look at Java, which has used a universal character
set (AFAIK the 16-bit subset of Unicode) since 1995, and does not
provide for multiple encodings.

Anton Ertl

Sep 30, 2005, 5:31:19 PM

"Bruce McFarling" <agi...@netscape.net> writes:
>
>Anton Ertl wrote:
>[Bernd]
>> >XC-SIZE ( xc -- u )
>> >Computes the memory size of the XCHAR xc in address units.
>
>> >XC!+ ( xc xc_addr1 -- xc_addr2 )
>> >Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>> >location after xc.
>
>> This is unsafe, as it writes an unknown amount of data behind
>> xc_addr1. One can use it safely in combination with XC-SIZE, but then
>> it is easier to use XC!+? (see below).
>
>> DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
>> safe version of XC!+, f specifies success
>
>I'm not sure about the level of this. An au length of a sequence of
>XCHARs in memory seems handier, to me, for most things, and I
>definitely prefer "know in advance" to "try it and clean up if it
>fails".

There is no need to clean up after XC!+?. It does the "know in
advance" internally. I think you'll have to try programming with both
words to see how it works.
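
To make the "know in advance internally" point concrete, here is a minimal
sketch for UTF-8. The names U8-SIZE, U8!+ and U8!+? are hypothetical stand-ins
for XC-SIZE, XC!+ and XC!+?, chosen so as not to clash with a system that
already provides those words. The sketch assumes 1 CHARS = 1 address unit = 8
bits, cells of at least 32 bits, the $ hex prefix, and code points up to
U+10FFFF (four bytes at most); it does no validation of the code point.

```forth
\ Sketch only: hypothetical UTF-8 stand-ins for XC-SIZE, XC!+ and XC!+?.

: u8-size ( xc -- u )                 \ bytes needed to store xc
  dup $80  u< if drop 1 exit then
  dup $800 u< if drop 2 exit then
  $10000   u< if 3 else 4 then ;

: u8!+ ( xc c-addr1 -- c-addr2 )      \ unchecked store, like XC!+
  over u8-size >r                     \ R: n = encoded length
  r@ 1 > if
    1 r@ 1- do                        \ i = n-1 .. 1: continuation bytes
      over $3F and $80 or  over i + c!
      swap 6 rshift swap              \ drop the six bits just stored
    -1 +loop
    swap  $F00 r@ rshift $FF and  or  swap   \ lead marker: $C0, $E0 or $F0
  then
  tuck c!  r> + ;

: u8!+? ( xc c-addr1 u1 -- c-addr2 u2 f )   \ checked store, like XC!+?
  >r over u8-size                     \ xc c-addr1 n   R: u1
  dup r@ u> if                        \ too big: leave the buffer untouched
    drop nip r> false exit
  then
  r> swap - >r                        \ R: u2 = u1 - n
  u8!+ r> true ;
```

The size check happens before anything is written, so a FALSE result leaves
memory, address and remaining count as they were; the caller only has to
decide what to do with the character that did not fit.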

Stephen Pelc

Sep 30, 2005, 6:13:21 PM

On Fri, 30 Sep 2005 17:21:33 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Well, then take a look at Java, which has used a universal character
>set (AFAIK the 16-bit subset of Unicode) since 1995, and does not
>provide for multiple encodings.

At that stage in Java's life, there was not a substantial legacy
of 15 year old applications.

Stephen Pelc

Sep 30, 2005, 6:26:11 PM

On Fri, 30 Sep 2005 16:42:12 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>>Application developers simply will not discard a
>>large and proven code base just because you say they should.
>
>Straw man argument.

I disagree. Talk to Willem and Nick at EuroForth.

>A: single character set, single 8-bit encoding
>B: single character set, single, possibly variable-width encoding
>
>And most Forth programmers are at A right now. I see no reason for
>all of us to go through:
>
>C: multiple character sets, multiple, possibly nasty, encodings
>
>Of course, you have customers who are currently at C. I don't know if
>going to B is viable for them, and what immediate steps they should
>take, but I don't see that those people who are at A need to go there.

Are you really telling the Forth developers with:
a) The largest code base
b) the largest client base
c) the most experience of multiple languages and encodings
that they don't count in Forth200x?

>>The implication of XCHARs is then that they cannot be used when
>>ACS <> DCS or OCS <> DCS.
>
>Yes.

A large number of embedded systems will use an 8 bit DCS for a long
while into the future, regardless of what any Forth200x standard says.
Such systems *do* and will *have* to use multiple encodings for a
long time to come.

Bruce McFarling

Oct 1, 2005, 1:41:12 AM

Anton Ertl wrote:

> Anyway, once Gforth is potentially poisoned with strings from one
> encoding, the only reasonable way to change the encoding is to start
> Gforth from scratch; we cannot recode all the strings lying around
> with the earlier encoding. If anybody really has a problem with
> exiting and restarting Gforth, one could write a word that exec()s
> Gforth (which has the same effect, except possibly wrt open files and
> stuff).

Nobody said every system had to support every encoding, or that a
system that wished to target UTF-8 as a universal encoding was not free
to do so. And as I mentioned before, in the GNU/Linux opensource
space, UTF-8 as the only XCHAR is perfectly justifiable. That leaves
byte-wide CHARs for backward compatibility when working with code-pages.

However, just now trying to work out how encodings are juggled when
going between ACS/DCS/OCS, it has become clear to me that two XCHAR's
are needed. Anton's view of XCHAR is as a joint DCS/OCS when it
becomes necessary to work with multiple-CHAR encodings. Note that this
may equally well be UTF-8 when CHAR applies to 8-bit values or UTF-16
when CHAR applies to 16-bit values. Additionally, this may be in
resource constrained situations, as with no memory limit UTF-32 fits
into 32 bit wide CHARS.

(As a side note, the main appeal of the latter is in migrating a Forth
targeted to a simplified ideogram encoding that fits into the first
Unicode plane to full Unicode recognition. Philosophically, as well as
from experience lecturing classes with large numbers of Chinese and
Taiwanese students, I don't want to lock out that migration path.)

The definitions of some existing standards in terms of ASCII7, and the
ease of extension to ISO 8 bit sets, require byte word handling.

B@ B! BMOVE B, BYTES BYTE+

as referred to in Pelc and Knaggs (2001), citing Greg Bailey,
accommodates that. And additionally, note that a known size,
specialised access to bizarre address units can be defined in a
portable manner, and in any event bizarre address units typically
entail unportable code in any respect.

And finally, the statement in Knaggs and Pelc (2001) that:
"Because of the rarity of multibyte character sets, we believe that
they need be handled only by the LOCALE wordset proposal for
internationalisation of applications."

no longer applies with the growing adoption of UTF-8 in the open source
community.

Also, the assumption that Unicode is 16 bits no longer applies.

Therefore:

(1) I would propose that XCHARS be adopted for standard use with
variable-CHAR width OCS.

(2) A system has one variable width OCS encoding.

(3) A system may also adopt an OCS variable width encoding as its DCS,
provided that it is upwardly compatible with ASCII7, as both UTF8 with
8bit CHARS and UTF16 with 16bit CHARS are. Obviously only source that
is encoded in the ASCII7 subset may be considered portable.

OK now, what I was interested in was variable character sets where
standardisation would actually be useful, which is the ACS. A subset
of the XCHAR functionality provides operations on non-atomic variable
width encodings, for which you can go forward from the start address of
a well formed character, but you cannot necessarily go back from the
final address pointing to a well-formed character. The full XCHAR
functionality provides operations on atomic variable width encodings.
So:

(4) Call these "VCHAR"s.
(5) The discussion of GET-ENCODING SET-ENCODING applies to VCHARs.
Since by presumption one sets the encoding before starting to work in a
"non-native" set, if one works with multiple sets one knows beforehand
that it is necessary to also store the appropriate vc-id in an
appropriate place.
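
For readers unsure what "atomic" buys you in practice, here is a minimal
sketch of stepping backwards in UTF-8, where every continuation byte has the
bit pattern 10xxxxxx and so can be recognised on its own. The names U8-TAIL?,
U8-START and U8CHAR- are hypothetical (U8CHAR- plays the role of XCHAR-), and
the sketch assumes 1 CHARS = 1 address unit = 8 bits and well-formed input.

```forth
\ Sketch only: hypothetical helper names, well-formed UTF-8 assumed.

: u8-tail? ( c -- f )                 \ continuation byte 10xxxxxx?
  $C0 and $80 = ;

: u8-start ( c-addr1 -- c-addr2 )     \ start of the char containing c-addr1
  begin dup c@ u8-tail? while 1 chars - repeat ;

: u8char- ( c-addr1 -- c-addr2 )      \ start of the previous char
  1 chars - u8-start ;
```

In a non-atomic encoding no such test on a single byte exists, which is why
those encodings can only be scanned forward from a known start of character.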

Bruce McFarling

Oct 1, 2005, 2:54:53 AM

Anton Ertl wrote:
> I am no expert on Windows, but AFAIK Unicode (or its 16-bit subset) is
> the standard character set of Windows NT and its offspring (also for
> more than ten years). Jax4th on WNT supported Unicode in 1993.

> So, universal character sets are not something that is in the distant
> future.

But then they found out that UTF-16 was not big enough to be universal,
and in particular allocated plane 2 for some of the more elegant Han
ideographs.

Since Windows NT relies on the 16-bit subset of UTF-16, it would not be
surprising to see a Forth implemented for Windows NT to have 16-bit
CHARS, and then need to have something like XCHARs to upgrade to full
Unicode.

What I was saying was not that universal-across-all-languages character
SETS were things we would see in the distant future, but that A
universal-across-all-systems character SET ENCODING (with possibly some
specialised legacy cases) is a possible future, and not the present.

And nothing about "distant" future. You added that out of whole cloth
to make a better straw man to knock down. (Not that I mind, being
faced with straw men versions of what you have said makes it easier to
see where you have been vague or confusing in your expression).

Bruce McFarling

Oct 1, 2005, 3:07:00 AM

Brad Eckert wrote:

> Should XCHARS be variable length?

Yes, that's the whole point. Variable length character set encodings
are becoming more common, and since UTF-8 is the easiest upgrade path
to full Unicode from classic C character=byte, anybody who wants to
talk to internationalised Linux applications is going to want to handle
variable length character sets.

Bernd's discussion of XCHAR's introduced into gforth to cope with this
exact issue is, I think, a good layout of the basic functionality.

Stephen's raising the I18N OCS/DCS/ACS issues is pertinent as well. I
was attracted to XCHARs as an ACS tool, but the XCHAR's in gforth are
in effect an OCS tool.

For working with fixed width ACS characters, provided they are as big
or bigger than the DCS character set (CHAR) and one bit narrower than
the cell size (to avoid signedness/unsignedness problems), the WCHARs of
Pelc and Knaggs (2001) cope with that. But they are for a constant width
character set, not a variable width character set. If your WCHAR is
32-bit Unicode and you are building an IP-packet in UTF-8 in memory,
you have to HAVE the XCHAR functionality, whether you get it from
somewhere else or program it yourself. My humble proposal was to get
at that kind of ACS issue by having VCHARs in parallel with XCHARs, and
include a GET-ENCODING / SET-ENCODING that works with VCHARs.

Bruce McFarling

Oct 1, 2005, 3:14:41 AM

Bernd Paysan wrote:

> Stephen Pelc wrote:
> > That's why GET-ENCODING and SET-ENCODING are suggested - they hide
> > the implementation of the storage.

> So far, I suggest that this part should be defined elsewhere. The XCHAR
> wordset itself is orthogonal to the ACS/OCS/DCS separation, and can be
> (ab)used to handle that (with SET-ENCODING/GET-ENCODING and the encodings
> that live behind that).

The functionality is orthogonal, but the set-encoding issues are tied
up with ACS/OCS/DCS. Or at least, that's my story, and I'm sticking to
it until the next smart person comes by and shakes it loose.

Basically, you don't WANT get-encoding to TOUCH the OCS or the DCS.
You only WANT it to touch the ACS. Anton's issues with building a
system first and then adding a mutating XCHAR after as being a problem
are, I think, perfectly valid, and they are examples of WHY the OCS/DCS
ought to be considered to be hardwired. The portability issue is
making it easier to share tools between various systems, and XCHAR does
that. And since it is a subset of the capabilities available with
fixed width chars, it is perfectly generic to ANY OCS, with the
exception that some efforts at variable width character encodings prior
to UTF-8 (and then UTF-16) were not atomic, and required a "start at
the beginning and scroll forward" approach.

The only way to make XCHARs mutable in some instances and immutable in
others is to have two parallel sets of words, a system-defined one
(XCHAR) and an application settable one (what I have impertinently
labelled VCHAR).

Bruce McFarling

Oct 1, 2005, 3:19:34 AM

Anton Ertl wrote:
> And most Forth programmers are at A right now. I see no reason for
> all of us to go through:

> C: multiple character sets, multiple, possibly nasty, encodings

Precisely what is there in what Stephen has said that mandates going
through C?

The question is the difference between (1) "I want A->B available,
without having to go through C", and (2) "I want A->B to be the only
standardised option, to discourage going through C".

A standard that gives (1), while accommodating those who by force of
circumstance have been forced to start from C, is in my mind better. I
don't think there is any need to push people at "A" to skip "C" ...
given the option the appeal of "B" will be a strong enough pull on its
own.

And promulgating the option as widely as possible requires accommodating
as many different present starting positions as possible.

Bruce McFarling

Oct 1, 2005, 3:26:47 AM

Anton Ertl wrote:
[Quoth I]

> >I'm not sure about the level of this. An au length of a sequence of
> >XCHARs in memory seems handier, to me, for most things, and I
> >definitely prefer "know in advance" to "try it and clean up if it
> >fails".

> There is no need to clean up after XC!+?. It does the "know in
> advance" internally. I think you'll have to try programming with both
> words to see how it works.

Yes there is, you have to work out what to do with the darn thing when
XC!+ fails. There may be no need to clean up the MEMORY, but whatever
process you are doing TRIED to put a character somewhere and failed.
On the other hand, if you already know from using X-SIZE (which I think
needs a better name to avoid confusion with XC-SIZE) that there is
space, no problems.

One word that would be handy would be XC-MAX ( -- u ) for the maximum
possible variable character length, or even better XC-SAFE ( n -- u )
for the maximum possible width of a string of that many variable width
characters. That would simplify setting up a work buffer with enough
room in the first place. But the way I see the three words,

X-SIZE tells me whether I have enough room in memory for this string
typed in or loaded from a file
XC-SIZE tells me what PART of the X-SIZE is covered by this particular
character, for inserts and deletes
XC!+ moves the thing a character at a time.

I prefer XC-LEN for X-SIZE
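
A minimal sketch of what XC-MAX and XC-SAFE could look like on a UTF-8
system, under the assumption that only code points up to U+10FFFF are stored,
so that no character needs more than four address units (the original 31-bit
form of UTF-8 would need six). The names U8-MAX, U8-SAFE and XBUFFER: are
hypothetical.

```forth
\ Sketch only: hypothetical names; 4 au per char assumes UTF-8 up to U+10FFFF.
4 constant u8-max                     \ plays the role of XC-MAX ( -- u )

: u8-safe ( n -- u )                  \ plays the role of XC-SAFE
  u8-max * ;                          \ worst-case size of n characters

: xbuffer: ( n "name" -- )            \ reserve room for n characters
  create u8-safe allot ;

20 xbuffer: line-buf                  \ room for any 20-character string
```

A buffer sized this way can take any string of up to n characters with the
unchecked store word, at the cost of reserving up to four times the space
that plain ASCII text would need.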

Anton Ertl

Oct 1, 2005, 8:01:12 AM

"Brad Eckert" <nospaa...@tinyboot.com> writes:
>Should XCHARS be variable length?

The words are designed to support (a class of) variable-length
character encodings. But you can also use xchars for fixed-width
character encodings.

>Then why not support xts too.

There is no way to know that something is an xt.

E.g., if the xc returned by XC@+ has the numeric value 12345, how do
you know if it's a character or an xt?

>I think the purpose of a wordset is to lay down rules for things that
>you can't portably do in ANS Forth, like emit a character or string
>using the more generalized characters. The other stuff sounds like an
>exercise in creating useful data structures.

What other stuff?

> If that's what we're
>after, are there already common file formats that contain data
>structures that deal with wide character sets?

The typical case will be a plain text or HTML file with UTF-8
encoding, e.g., <http://www.columbia.edu/kermit/utf8.html>.

But there are also some non-text files that contain UTF-8 in certain
parts, e.g., Java .class files.

However, the proposal posted by Bernd Paysan does not contain any
stuff for dealing with files; what would be needed is

- for whole files: A way to tell the file words that a
file is in a specific encoding, such that the READ and WRITE words can
perform the appropriate conversion.

- for file parts: The files would be accessed in binary mode, and
the program would have to convert individual strings between the
encoding in the file and the Forth system's encoding with conversion
words.

Anton Ertl

Oct 1, 2005, 4:53:57 PM

steph...@mpeforth.com (Stephen Pelc) writes:
>On Fri, 30 Sep 2005 16:42:12 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>>Application developers simply will not discard a
>>>large and proven code base just because you say they should.
>>
>>Straw man argument.
>
>I disagree. Talk to Willem and Nick at EuroForth.

Just to spell it out for you: The straw man is that somebody says they
should discard a large and proven code base. You put up this straw
man, and are now beating on it.

>>A: single character set, single 8-bit encoding
>>B: single character set, single, possibly variable-width encoding
>>
>>And most Forth programmers are at A right now. I see no reason for
>>all of us to go through:
>>
>>C: multiple character sets, multiple, possibly nasty, encodings
>>
>>Of course, you have customers who are currently at C. I don't know if
>>going to B is viable for them, and what immediate steps they should
>>take, but I don't see that those people who are at A need to go there.
>
>Are you really telling the Forth developers with:
>a) The largest code base
>b) the largest client base
>c) the most experience of multiple languages and encodings
>that they don't count in Forth200x?

No. This discussion is about xchars, not Forth200x.

As for Forth200x, I guess there will be stuff in there that they will
find useful, but, as with ANS Forth, not everything they find useful
will end up in Forth200x.

>A large number of embedded systems will use an 8 bit DCS for a long
>while into the future

Why the DCS in particular? I see two reasons why not:

- For cross-compiled systems the DCS is on the host, which should
easily be able to handle universal character sets and their encodings.

- If UTF-8 is used, most of the DCS usage will be ASCII, which has
the same size for UTF-8 and 8bit encodings.

>Such systems *do* and will *have* to use multiple encodings for a
>long time to come.

A large number of embedded systems do use multiple encodings? I think
it's a small proportion of embedded systems.

Anton Ertl

Oct 1, 2005, 6:18:29 PM

"Bruce McFarling" <agi...@netscape.net> writes:
>Anton Ertl wrote:
>
>> Anyway, once Gforth is potentially poisoned with strings from one
>> encoding, the only reasonable way to change the encoding is to start
>> Gforth from scratch; we cannot recode all the strings lying around
>> with the earlier encoding. If anybody really has a problem with
>> exiting and restarting Gforth, one could write a word that exec()s
>> Gforth (which has the same effect, except possibly wrt open files and
>> stuff).
>
>Nobody said every system had to support every encoding, or that a
>system that wished to target UTF-8 as a universal encoding was not free
>to do so. And as I mentioned before, in the GNU/Linux opensource
>space, UTF-8 as the only XCHAR is perfectly justifiable.

Gforth supports two encodings at the moment: 8bit and UTF-8.

>Anton's view of XCHAR is as a joint DCS/OCS

Don't put things in my mouth. My view is that a real (interactive)
Forth system should not differentiate between DCS and ACS.
Cross-compilers may differentiate, because they have a clear border
between compile-time and run-time, but making the difference in real
Forth systems would impose that line there, too.

As for the OCS, that is a non-entity in Forth code. Forth code does
not care how the Forth system talks to the OS. E.g., the Forth system
might represent strings in UTF-8, and on some OSs (e.g., MS Windows)
the Forth system might convert this to UTF-16 when talking to the OS
without the Forth program noticing.

>Note that this
>may equally well be UTF-8 when CHAR applies to 8-bit values or UTF-16
>when CHAR applies to 16-bit values.

Actually, I think that many more programs break when you break the
assumption 1 chars = 1 than when you break the assumption that CHAR
words can be used to pick ASCII strings apart, so I would probably
choose 1 chars = 1 on byte-addressable machines even when xchars deal
with UTF-16 or UTF-32. OTOH, the second assumption is guaranteed by
the standard, while the first is not, so it's a tough choice to make.
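
To make the first assumption concrete, here is a small sketch of the same word written with and without it. The names blanks-1 and blanks are invented for the example; both count the blanks in an ASCII string.

```forth
\ Relies on 1 CHARS = 1: the character count is used directly as an
\ address offset, and LOOP steps by one address unit.
: blanks-1 ( c-addr u -- n )
  0 rot rot  over + swap ?DO
    I c@ bl = IF 1+ THEN
  LOOP ;

\ Same job without the assumption: scale the count with CHARS and
\ step by 1 CHARS.
: blanks ( c-addr u -- n )
  0 rot rot  chars over + swap ?DO
    I c@ bl = IF 1+ THEN
  1 chars +LOOP ;
```

On a byte-addressed system with 8-bit chars the two behave identically, which is why code of the first kind is so common and why changing the size of a char breaks so much.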

Anton Ertl

Oct 1, 2005, 6:47:23 PM

"Bruce McFarling" <agi...@netscape.net> writes:
>
>Anton Ertl wrote:
>[Quoth I]
>> >I'm not sure about the level of this. An au length of a sequence of
>> >XCHARs in memory seems handier, to me, for most things, and I
>> >definitely prefer "know in advance" to "try it and clean up if it
>> >fails".
>
>> There is no need to clean up after XC!+?. It does the "know in
>> advance" internally. I think you'll have to try programming with both
>> words to see how it works.
>
>Yes there is, you have to work out what to do with the darn thing when
>XC!+ fails.

Just as you have to work out what to do when the XC-SIZE check fails.

>On the other hand, if you already know from using X-SIZE (which I think
>needs a better name to avoid confusion with XC-SIZE) that there is
>space, no problems.

Well, if you use X-SIZE (which should be called XC-DISPLAY-WIDTH or
somesuch) to check whether XC!+ can be safely used, you have already
made a mistake and probably introduced a buffer overflow error. If
you had used XC!+? instead, this would not happen.

>One word that would be handy would be XC-MAX ( -- u ) for the maximum
>possible variable character length, or even better XC-SAFE ( n -- u )
>for the maximum possible width of a string of that many variable width
>characters.

That makes sense in those cases where we know (or have a
not-too-excessive upper bound for) the number of xcs in the target
string. However, I am not convinced that those cases are all that
frequent (it still might be worthwhile). OTOH, the number of cases
where we have to put a string together by storing characters in it is
not that big, either.
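
For what it is worth, the two suggested words are trivial to sketch for UTF-8. The value 6 assumes the original 31-bit UTF-8 definition (4 would do if code points are limited to U+10FFFF); the names XC-MAX and XC-SAFE are the ones suggested above, while fits? is a hypothetical helper added for the example.

```forth
6 constant xc-max                 \ longest UTF-8 sequence, in chars (assumed)

\ Upper bound on the memory needed for n xchars.
: xc-safe ( n -- u )  xc-max * ;

\ True if a buffer of buf-size chars can hold n xchars in the worst case.
: fits? ( n buf-size -- flag )  swap xc-safe u< 0= ;
```

The bound is pessimistic by design: a buffer sized this way for mostly-ASCII text is up to six times larger than needed.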

>But the way I see the three words,
>
>X-SIZE tells me whether I have enough room in memory for this string
>typed in or loaded from a file

I am not sure what that means, but the X-SIZE defined by Bernd Paysan
does something completely different.

Bernd Paysan

Oct 2, 2005, 1:17:52 PM

Anton Ertl wrote:
> Well, if you use X-SIZE (which should be called XC-DISPLAY-WIDTH or
> somesuch)

Yes, to avoid confusion. Should it work on whole strings (that's what I find
useful), or on single characters? (I use wcwidth() to implement it;
wcwidth() works on a single UCS32 character.)

Bernd Paysan

Oct 2, 2005, 5:18:57 PM

Bernd Paysan wrote:
> Reference implementation:
>
> Unfortunately, both the Gforth and the bigFORTH implementation have
> several system-specific parts.

I've cleaned up the implementation to create a reference implementation.
Here it is (for UTF-8 and ISO-LATIN-1 as fallback support). It takes some
parts of the discussion into account (+X/STRING, -X/STRING, XC!+?,
XC-DISPLAY-WIDTH instead of X-SIZE). It ignores that the line editor
(ACCEPT) will need major changes.

\ xchar reference implementation: UTF-8 (and ISO-LATIN-1)

\ environmental dependency: characters are stored as bytes
\ environmental dependency: lower case words accepted

base @ hex

80 Value maxascii

: xc-size ( xc -- u ) \ u: memory size of xc in chars
  dup maxascii u< IF  drop 1 chars  EXIT  THEN \ special case ASCII
  800 2 >r \ limit for 2-byte sequences; size counter on the return stack
  BEGIN  2dup u< 0=  WHILE  5 lshift r> char+ >r  dup 0= UNTIL  THEN
  2drop r> ;

: xc@+ ( xcaddr -- xcaddr' xc )
  count dup maxascii u< IF  EXIT  THEN \ special case ASCII
  7F and 40 >r \ mask on the return stack tracks the next length bit
  BEGIN  dup r@ and  WHILE  r@ xor
    6 lshift r> 5 lshift >r >r count
    \ dup C0 and 80 <> abort" malformed character"
    3F and r> or
  REPEAT  r> drop ;

: xc!+ ( xc xcaddr -- xcaddr' )
  over maxascii u< IF  tuck c! char+  EXIT  THEN \ special case ASCII
  >r 0 swap 3F \ 0 marks the end of the bytes collected on the stack
  BEGIN  2dup u>  WHILE
    2/ >r dup 3F and 80 or swap 6 rshift r>
  REPEAT  7F xor 2* or r> \ build the lead byte, fetch the address
  BEGIN  over 80 u< 0=  WHILE  tuck c! char+  REPEAT  nip ;

: xc!+? ( xc xcaddr u -- xcaddr' u' flag ) \ flag is false if xc did not fit
  >r over xc-size r@ over u< IF ( xc xc-addr1 len r: u1 )
    \ not enough space
    drop nip r> false
  ELSE
    >r xc!+ r> r> swap - true
  THEN ;

\ scan to next/previous character

: xchar+ ( xcaddr -- xcaddr' ) xc@+ drop ;
: xchar- ( xcaddr -- xcaddr' )
BEGIN 1 chars - dup c@ C0 and maxascii <> UNTIL ;

: +x/string ( xcaddr u -- xcaddr' u' )
over + xchar+ over - ;
: -x/string ( xcaddr u -- xcaddr' u' )
over + xchar- over - ;

\ utf key and emit

: xkey ( -- xc )
  key dup maxascii u< IF  EXIT  THEN \ special case ASCII
  7F and 40 >r
  BEGIN  dup r@ and  WHILE  r@ xor
    6 lshift r> 5 lshift >r >r key
    \ dup C0 and 80 <> abort" malformed character"
    3F and r> or
  REPEAT  r> drop ;

: xemit ( xc -- )
  dup maxascii u< IF  emit EXIT  THEN \ special case ASCII
  0 swap 3F
  BEGIN  2dup u>  WHILE
    2/ >r dup 3F and 80 or swap 6 rshift r>
  REPEAT  7F xor 2* or
  BEGIN  dup 80 u< 0=  WHILE  emit  REPEAT  drop ;

\ utf size

\ uses wcwidth ( xc -- n )

: wc, ( n low high -- ) 1+ , , , ;

Create wc-table \ derived from wcwidth source code, for UCS32
0 0300 0357 wc,
0 035D 036F wc,
0 0483 0486 wc,
0 0488 0489 wc,
0 0591 05A1 wc,
0 05A3 05B9 wc,
0 05BB 05BD wc,
0 05BF 05BF wc,
0 05C1 05C2 wc,
0 05C4 05C4 wc,
0 0600 0603 wc,
0 0610 0615 wc,
0 064B 0658 wc,
0 0670 0670 wc,
0 06D6 06E4 wc,
0 06E7 06E8 wc,
0 06EA 06ED wc,
0 070F 070F wc,
0 0711 0711 wc,
0 0730 074A wc,
0 07A6 07B0 wc,
0 0901 0902 wc,
0 093C 093C wc,
0 0941 0948 wc,
0 094D 094D wc,
0 0951 0954 wc,
0 0962 0963 wc,
0 0981 0981 wc,
0 09BC 09BC wc,
0 09C1 09C4 wc,
0 09CD 09CD wc,
0 09E2 09E3 wc,
0 0A01 0A02 wc,
0 0A3C 0A3C wc,
0 0A41 0A42 wc,
0 0A47 0A48 wc,
0 0A4B 0A4D wc,
0 0A70 0A71 wc,
0 0A81 0A82 wc,
0 0ABC 0ABC wc,
0 0AC1 0AC5 wc,
0 0AC7 0AC8 wc,
0 0ACD 0ACD wc,
0 0AE2 0AE3 wc,
0 0B01 0B01 wc,
0 0B3C 0B3C wc,
0 0B3F 0B3F wc,
0 0B41 0B43 wc,
0 0B4D 0B4D wc,
0 0B56 0B56 wc,
0 0B82 0B82 wc,
0 0BC0 0BC0 wc,
0 0BCD 0BCD wc,
0 0C3E 0C40 wc,
0 0C46 0C48 wc,
0 0C4A 0C4D wc,
0 0C55 0C56 wc,
0 0CBC 0CBC wc,
0 0CBF 0CBF wc,
0 0CC6 0CC6 wc,
0 0CCC 0CCD wc,
0 0D41 0D43 wc,
0 0D4D 0D4D wc,
0 0DCA 0DCA wc,
0 0DD2 0DD4 wc,
0 0DD6 0DD6 wc,
0 0E31 0E31 wc,
0 0E34 0E3A wc,
0 0E47 0E4E wc,
0 0EB1 0EB1 wc,
0 0EB4 0EB9 wc,
0 0EBB 0EBC wc,
0 0EC8 0ECD wc,
0 0F18 0F19 wc,
0 0F35 0F35 wc,
0 0F37 0F37 wc,
0 0F39 0F39 wc,
0 0F71 0F7E wc,
0 0F80 0F84 wc,
0 0F86 0F87 wc,
0 0F90 0F97 wc,
0 0F99 0FBC wc,
0 0FC6 0FC6 wc,
0 102D 1030 wc,
0 1032 1032 wc,
0 1036 1037 wc,
0 1039 1039 wc,
0 1058 1059 wc,
1 0000 10FF wc,
2 1100 115f wc,
0 1160 11FF wc,
0 1712 1714 wc,
0 1732 1734 wc,
0 1752 1753 wc,
0 1772 1773 wc,
0 17B4 17B5 wc,
0 17B7 17BD wc,
0 17C6 17C6 wc,
0 17C9 17D3 wc,
0 17DD 17DD wc,
0 180B 180D wc,
0 18A9 18A9 wc,
0 1920 1922 wc,
0 1927 1928 wc,
0 1932 1932 wc,
0 1939 193B wc,
0 200B 200F wc,
0 202A 202E wc,
0 2060 2063 wc,
0 206A 206F wc,
0 20D0 20EA wc,
2 2329 232A wc,
0 302A 302F wc,
2 2E80 303E wc,
0 3099 309A wc,
2 3040 A4CF wc,
2 AC00 D7A3 wc,
2 F900 FAFF wc,
0 FB1E FB1E wc,
0 FE00 FE0F wc,
0 FE20 FE23 wc,
2 FE30 FE6F wc,
0 FEFF FEFF wc,
2 FF00 FF60 wc,
2 FFE0 FFE6 wc,
0 FFF9 FFFB wc,
0 1D167 1D169 wc,
0 1D173 1D182 wc,
0 1D185 1D18B wc,
0 1D1AA 1D1AD wc,
2 20000 2FFFD wc,
2 30000 3FFFD wc,
0 E0001 E0001 wc,
0 E0020 E007F wc,
0 E0100 E01EF wc,
here wc-table - Constant #wc-table

\ inefficient table walk:

: wcwidth ( xc -- n )
  wc-table #wc-table over + swap ?DO
    dup I 2@ within IF  drop I 2 cells + @  UNLOOP EXIT  THEN
  3 cells +LOOP  drop 1 ; \ not in the table: one column

: xc-display-width ( addr u -- n )
  0 rot rot over + swap ?DO
    I xc@+ swap >r wcwidth +
  r> I - +LOOP ;

: char ( "name" -- xc ) bl word count drop xc@+ nip ;
: [char] ( "name" -- rt:xc ) char postpone Literal ; immediate

\ switching encoding is only recommended at startup
\ only two encodings are supported: UTF-8 and ISO-LATIN-1

80 Constant utf-8
100 Constant iso-latin-1

: set-encoding ( enc -- ) to maxascii ;
: get-encoding ( -- enc ) maxascii ;

base !

Bruce McFarling

Oct 3, 2005, 1:15:41 AM

Bernd Paysan wrote:

> Anton Ertl wrote:
> > Well, if you use X-SIZE (which should be called XC-DISPLAY-WIDTH or
> > somesuch)

> Yes, to avoid confusion. Should it work on whole strings (that's what I find
> useful), or on single characters (I use wcwidth() to implement it;
> wcwidth() works on a UCS32 character).

It should work on whole strings. If you need it for a single character in
memory, you can hand over the address with a character count of 1; if you
need the size of a single character on the stack, to know how much to slide
stuff up to make room, XC-SIZE gives that.

Bruce McFarling

Oct 3, 2005, 1:25:54 AM

Anton Ertl wrote:
> Just as you have to work out what to do when the XC-SIZE check fails.

Yes, but it's "X-SIZE", under whatever name, that you will normally be
using. XC!+? really is encouraging you to

> Well, if you use X-SIZE (which should be called XC-DISPLAY-WIDTH or
> somesuch) to check whether XC!+ can be safely used, you have already
> made a mistake and probably introduced a buffer overflow error. If
> you had used XC!+? instead, this would not happen.

If X-SIZE is XC-DISPLAY-WIDTH, that's not the one I want. Indeed, I
find it hard to believe that "display width" is what you really mean.
If "X-SIZE" means length in characters, it's the function that X-SIZE
sounds like that I really want. Something that converts a ( xca u )
string into a ( xca au ) sized structure in memory.

> That makes sense in those cases where we know (or have a
> not-too-excessive upper bound for) the number of xcs in the target
> string. However, I am not convinced that those cases are all that
> frequent (it still might be worthwhile). OTOH, the number of cases
> where we have to put a string together by storing characters in it is
> not that big, either.

Of course. Normally we will have a string of variable-length characters
with a character count, and wish to convert it to a string of
address-unit-sized chunks to move around in memory.

I want something that will tell that to me in advance, not something
that makes me "try it and see if it fits".

Bruce McFarling

Oct 3, 2005, 1:51:30 AM

Bernd Paysan wrote:
> X-SIZE ( xc_addr u -- n )
> n is the number of monospace ASCII characters that take the same space to
> display as the XCHAR string starting at xc_addr, using u address units.

This would seem to be impossible to standardise. First, it will often
have a fractional part, second it depends on display font, both the
size of the monospace font and even the choice of font in some
particular character sets.

What is possible to standardise is the conversion between character
counts and address units.

XC-LENGTH ( xc-addr au -- u )
u is the number of variable length characters completely contained in
the area of memory beginning at xc-addr and extending for au address
units.

XC-SPACE ( xc-addr u -- au )
au is the number of address units required to contain the string of
variable length characters beginning at xc-addr and extending for u
variable length characters.
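
For UTF-8 at least, both words can be sketched in a few lines. This is only an illustration under stated assumptions: characters are stored as bytes, the input is well-formed UTF-8 of at most 4 bytes per character, and the helper u8-len is a made-up name. The stack effects follow the definitions above.

```forth
base @ hex

\ Encoded length implied by the first byte of a UTF-8 sequence.
: u8-len ( lead-byte -- n )
  dup 80 u< IF  drop 1 EXIT  THEN
  dup E0 u< IF  drop 2 EXIT  THEN
  F0 u< IF  3  ELSE  4  THEN ;

\ Number of xchars completely contained in the au address units at xc-addr.
: xc-length ( xc-addr au -- u )
  0 rot rot                            ( u addr au )
  BEGIN  dup  WHILE
    over c@ u8-len                     ( u addr au n )
    2dup u< IF  drop 2drop EXIT  THEN  \ last xchar is cut off: stop
    /string                            ( u addr' au' )
    rot 1+ rot rot
  REPEAT  2drop ;

\ Address units occupied by the first u xchars at xc-addr.
: xc-space ( xc-addr u -- au )
  over swap                            ( addr0 addr u )
  0 ?DO  dup c@ u8-len chars +  LOOP
  swap - ;

\ Sample data: "a", a-umlaut, euro sign = 1 + 2 + 3 bytes.
create sample  61 c, C3 c, A4 c, E2 c, 82 c, AC c,

base !
```

With this sample, a 5-byte window contains only two complete characters, because the 3-byte euro sign is cut off.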

Bernd Paysan

Oct 3, 2005, 10:54:32 AM

Bruce McFarling wrote:

> Bernd Paysan wrote:
>> X-SIZE ( xc_addr u -- n )
>> n is the number of monospace ASCII characters that take the same space to
>> display as the XCHAR string starting at xc_addr, using u address
>> units.
>
> This would seem to be impossible to standardise. First, it will often
> have a fractional part, second it depends on display font, both the
> size of the monospace font and even the choice of font in some
> particular character sets.

Don't worry about it being doable or not. It already has been done in C (the
wcwidth() function returns the display width of a single UCS32 character).
The reference implementation I posted yesterday is based on the wcwidth
source code.

To avoid an unnecessarily long name, we could say XC-WIDTH instead of
XC-DISPLAY-WIDTH.

> What is possible to standardise is the conversion between character
> counts and address units.

That's easier to do, but offers little value.

There's another point to discuss: In my proposal, I introduced XKEY and
XEMIT. I think this is not optimal - when the XCHAR set is available, KEY
and EMIT should react correctly on XCHARs.

Stephen Pelc

Oct 3, 2005, 11:35:41 AM

On Mon, 03 Oct 2005 12:54:32 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>There's another point to discuss: In my proposal, I introduced XKEY and
>XEMIT. I think this is not optimal - when the XCHAR set is available, KEY
>and EMIT should react correctly on XCHARs.

This is a really nasty one! Consider setting up a serial protocol
which starts with 8 bit actions (7 bit ASCII perhaps plus control
and check characters), e.g. Telnet IAC or MDB vending. Later data
transfers may be in a different encoding.

IMHO, even in UTF-8 you cannot get away from multiple encodings
if the protocol set up and handling is part of the Forth app.
This is true of most embedded Forth apps.

Bernd Paysan

Oct 3, 2005, 12:33:08 PM

Stephen Pelc wrote:

> On Mon, 03 Oct 2005 12:54:32 +0200, Bernd Paysan <bernd....@gmx.de>
> wrote:
>
>>There's another point to discuss: In my proposal, I introduced XKEY and
>>XEMIT. I think this is not optimal - when the XCHAR set is available, KEY
>>and EMIT should react correctly on XCHARs.
>
> This is a really nasty one! Consider setting up a serial protocol
> which starts with 8 bit actions (7 bit ASCII perhaps plus control
> and check characters), e.g. Telnet IAC or MDB vending. Later data
> transfers may be in a different encoding.
>
> IMHO, even in UTF-8 you cannot get away from multiple encodings
> if the protocol set up and handling is part of the Forth app.
> This is true of most embedded Forth apps.

KEY and EMIT in ANS Forth are defined as words that interact with the
keyboard and the terminal. I know, KEY and EMIT are a nice abstraction that
can be used for whatever serial communication protocol you use, and people
(including myself) use this abstraction.

It's obvious that when I define a KEY/EMIT pair for other purposes than the
terminal, I have to take the encoding of the particular protocol into
account. But when I work with the terminal, and hit the "ä" key, I want to
receive something my Forth system can deal with easily.

That's how KEY and EMIT are defined now:

6.1.1320 EMIT CORE
( x -- )
If x is a graphic character in the implementation-defined character set,
display x. The effect of EMIT for all other values of x is
implementation-defined.
When passed a character whose character-defining bits have a value between
hex 20 and 7E inclusive, the corresponding standard character, specified by
3.1.2.1 Graphic characters, is displayed. Because different output devices
can respond differently to control characters, programs that use control
characters to perform specific functions have an environmental dependency.
Each EMIT deals with only one character.

6.1.1750 KEY CORE
( -- char )
Receive one character char, a member of the implementation-defined
character set. Keyboard events that do not correspond to such characters
are discarded until a valid character is received, and those events are
subsequently unavailable.
All standard characters can be received. Characters received by KEY are not
displayed.
Any standard character returned by KEY has the numeric value specified in
3.1.2.1 Graphic characters. Programs that require the ability to receive
control characters have an environmental dependency.

Anton Ertl

Oct 3, 2005, 1:29:48 PM

Bernd Paysan <bernd....@gmx.de> writes:
>To avoid an unnecessary long name, we could say XC-WIDTH instead of
>XC-DISPLAY-WIDTH.

I think the potential for confusion is higher with the shorter name,
and it is not used often enough to merit a shorter name.

>There's another point to discuss: In my proposal, I introduced XKEY and
>XEMIT. I think this is not optimal - when the XCHAR set is available, KEY
>and EMIT should react correctly on XCHARs.

Here's what I wrote about this issue when we started this:

|- Words that deal with characters, but not with addresses, e.g., EMIT
|and KEY (any others?): One might consider letting them process xcs, so
|that definitions using them would automatically also be usable for xcs
|without needing to be rewritten; however, most words using EMIT or KEY
|probably also do character-address arithmetic, so they have to be
|adapted to work with xcs anyway. My gut feeling is that fewer programs
|need to be adapted, and in less problematic ways, if we let EMIT and
|KEY work on cs, and introduce new words XEMIT and XKEY that deal with
|xcs:
|
|XEMIT ( xc -- )
|XKEY ( -- xc )

I am still not sure if my gut feeling was right or not. I guess the
best course is to do some experiments with both approaches, and record
how they went.

Jerry Avins

Oct 3, 2005, 6:55:22 PM

In many implementations, KEY and EMIT are vectored routines, often by
means of a USER variable. In one embedded system I use, I have code to
switch between the UART and a keypad and LCD. Why can't XKEY and XEMIT
be handled the same way?
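
A minimal sketch of what such vectoring could look like for XEMIT, using only ANS Forth words. A plain VARIABLE stands in for the USER variable, the two device words are dummies that merely record which one was called, and all names here are invented for the example.

```forth
variable 'xemit        \ execution token of the current output routine
variable last-device   \ demo only: records which driver ran

: uart-xemit ( xc -- )  drop 1 last-device ! ;  \ stand-in for a UART driver
: lcd-xemit  ( xc -- )  drop 2 last-device ! ;  \ stand-in for an LCD driver

\ The vectored word: every caller goes through the stored token.
: vxemit ( xc -- )  'xemit @ execute ;

: >uart ( -- )  ['] uart-xemit 'xemit ! ;
: >lcd  ( -- )  ['] lcd-xemit  'xemit ! ;

>uart                  \ default device
```

Each driver can then apply whatever encoding its device needs, while callers only ever use the vectored word.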

Jerry
--
Engineering is the art of making what you want from things you can get.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Bruce McFarling

Oct 4, 2005, 3:50:45 AM

Bernd Paysan wrote:

> Bruce McFarling wrote:
> > This would seem to be impossible to standardise. First, it will often
> > have a fractional part, second it depends on display font, both the
> > size of the monospace font and even the choice of font in some
> > particular character sets.

> Don't worry about it being doable or not. It already has been done in C (the
> wcwidth() function returns the display width of a single UCS32 character).
> The reference implementation I posted yesterday bases on the wcwidth
> sourcecode.

How does it handle fractional columns?

> To avoid an unnecessary long name, we could say XC-WIDTH instead of
> XC-DISPLAY-WIDTH.

> > What is possible to standardise is the conversion between character
> > counts and address units.

> That's easier to do, but offers little value.

That's the function that allows me to check that there is space in a
buffer for an XCHAR string before I start storing it, and avoid being
forced to use the ugly XC!+?

> There's another point to discuss: In my proposal, I introduced XKEY and
> XEMIT. I think this is not optimal - when the XCHAR set is available, KEY
> and EMIT should react correctly on XCHARs.

This seems straightforward if XCHARs span the OCS and DCS, which in
part requires that the CHAR in use is a well-behaved subset of the
XCHAR. If XCHARs are in the ACS space, maybe there could be issues.

Two system/applications that I am keeping in mind when thinking about
this are (1) a microcontroller that is either a web server or an active
agent talking to a web server and (2) Forth running in a thread in a
desktop system to provide a customised set of text processing
facilities.

An XCHAR to handle the OCS seems to be more likely in (2), and I can't
see how KEY/EMIT working with the DCS CHAR gives any problem as long as
the XCHAR is an upwardly compatible superset of the CHAR. For Unicode
there shouldn't be any problem since a standard DCS has to include
printable ASCII7.

Stephen Pelc

Oct 4, 2005, 10:58:44 AM

On 3 Oct 2005 20:50:45 -0700, "Bruce McFarling" <agi...@netscape.net>
wrote:

>Two system/applications that I am keeping in mind when thinking about
>this are (1) a microcontroller that is either a web server or an active
>agent talking to a web server and (2) Forth running in a thread in a
>desktop system to provide a customised set of text processing
>facilities.
>
>An XCHAR to handle the OCS seems to be more likely in (2), and I can't
>see how KEY/EMIT working with the DCS CHAR gives any problem as long as
>the XCHAR is an upwardly compatible superset of the CHAR. For Unicode
>there shouldn't be any problem since a standard DCS has to include
>printable ASCII7.

We face exactly this problem in PowerNet. Many TCP-based protocols
perform command sequences in which codes such as $Fx have special
meanings. We also use a vectored I/O scheme for KEY, EMIT and friends.

I really don't like the idea of introducing a set of encoding issues
to KEY and EMIT, which I prefer to consider as a transport system
e.g. for XModem data transfers. So somewhat reluctantly, XKEY XEMIT
and friends may be appropriate.

However, from the application programmer's point of view, XKEY and
friends are usually implementation details. Far more important are
the words that portably handle strings. In the CCS application,
which supports about 20 languages in several encodings, application
programmers work with string stacks and structures as well as with
primitives. For the mid-term an application may have to deal with:
7/8 bit DCS
NVT + control/status codes for IP protocols
legacy 8 bit code pages
ad-hoc legacy comms protocols
multi-byte legacy encodings
UTF-8
UTF-16
UTF-32 ?

In order to handle this cleanly, a standard will first have to:
clean up the meaning of char in the existing standard
clean up the char dependencies in the file word sets
define byte/octet access
- essential for comms protocols
- essential for embedded systems
Once this is done, the necessary tools for Xstrings will be
available. Currently, our and our clients' opinion is that
run-time selection of encoding is essential, and that Xstring
words should be separated from DCS activities. If the Xstring
words eventually become synonyms for DCS words, that would be
a bonus.

Bernd Paysan

Oct 4, 2005, 11:14:29 AM

Bruce McFarling wrote:
>> Don't worry about it being doable or not. It already has been done in C
>> (the wcwidth() function returns the display width of a single UCS32
>> character). The reference implementation I posted yesterday bases on the
>> wcwidth sourcecode.
>
> How does it handle fractional columns?

There are no fractional columns. This is for monospaced fonts, which are
allowed to have characters which span 0, 1, or 2 columns (0 are "combining
characters").
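
A cut-down sketch of the idea, with only three ranges taken from the table in the reference implementation; everything else is treated as one column. The name cols is invented for the example, and a real table needs many more entries (including zero-width ranges inside the wide ones).

```forth
base @ hex

\ Columns occupied by xc in a monospaced terminal font (incomplete sketch).
: cols ( xc -- n )
  dup 0300 0358 within IF  drop 0 EXIT  THEN  \ combining marks 0300..0357
  dup 1100 1160 within IF  drop 2 EXIT  THEN  \ Hangul Jamo 1100..115F
  dup 3040 A4D0 within IF  drop 2 EXIT  THEN  \ wide range 3040..A4CF
  drop 1 ;                                    \ everything else: one column

base !
```

The result is always a whole number of columns, so the question of fractional widths does not arise for terminal output.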

> That's the function that allows me to check that there is space in a
> buffer for an XCHAR string before I start storing it, and avoid being
> forced to use the ugly XC!+?

All XCHAR strings have their size in AUs, not in XCHARs. You don't need to
go back and forward.

> This seems straightforward if XCHARs span the OCS and DCS, which in
> part requires that the CHAR in use is a well-behaved subset of the
> XCHAR. If XCHARs are in the ACS space, maybe there could be issues.

There are always issues when XCHARs are in the ACS space. That's why we
don't like the ACS space idea at all. Basically, I would never implement a
Forth with a separate ACS. Stephen Pelc did, and he says he needs to, but
that's up to him. I think Windows 9x is dead, and even there, you could
install a Unicode package to be compatible with Windows NT, which had
Unicode from the beginning - though with UCS16 encoding.

> Two system/applications that I am keeping in mind when thinking about
> this are (1) a microcontroller that is either a web server or an active
> agent talking to a web server and (2) Forth running in a thread in a
> desktop system to provide a customised set of text processing
> facilities.
>
> An XCHAR to handle the OCS seems to be more likely in (2), and I can't
> see how KEY/EMIT working with the DCS CHAR gives any problem as long as
> the XCHAR is an upwardly compatible superset of the CHAR. For Unicode
> there shouldn't be any problem since a standard DCS has to include
> printable ASCII7.

As Stephen Pelc noticed, the problems start to arise when you use KEY and
EMIT for other things than for the keyboard and the terminal.

Bruce McFarling

Oct 4, 2005, 2:26:38 PM

Stephen Pelc wrote:

...

> I really don't like the idea of introducing a set of encoding issues
> to KEY and EMIT, which I prefer to consider as a transport system
> e.g. for XModem data transfers. So somewhat reluctantly, XKEY XEMIT
> and friends may be appropriate.

It would also be less than satisfactory to "have to" have standard file
words or non-standard KEY and EMIT in a microcontroller Forth where
8-bit clean KEY and EMIT would do, just because KEY and EMIT are
defined NOT to be 8-bit clean when Xstrings are in use.

I'm getting the impression that as far as anything OTHER THAN treating
KEY and EMIT as a transport system, the horse has already bolted the
barn. If it was going to be used any other way, the groundwork would
have had to have been laid in 1994.

> However, from the application programmer's point of view, XKEY and
> friends are usually implementation details. Far more important are
> the words that portably handle strings.

Yes. !

...

> For the mid-term an application may have to deal with:
> 7/8 bit DCS
> NVT + control/status codes for IP protocols
> legacy 8 bit code pages
> ad-hoc legacy comms protocols
> multi-byte legacy encodings
> UTF-8
> UTF-16
> UTF-32 ?

That's a superset of what I was thinking about (though generically I
would not want to interfere with the IP/comms protocols, I'll just have
to take that part on faith).

If the endian-issues are in order for UTF-16, I figure that UTF-32 is
in place. Except for the endian transport issues, UTF-32 is basically
what UTF-16 and UTF-8 decode into.

> In order to handle this cleanly, a standard will first have to:
> clean up the meaning of char in the existing standard

Is this what CHAR vs WCHAR is getting at, reducing overloading on CHAR
to help eliminate ambiguities?

> clean up the char dependencies in the file word sets

I noted a reference to BIN mode wrt this in an I18N proposal, and the
proposal to make the counts in file words apply to address units. That
certainly would not distress me. Though an OCT fam that works with B@
B! BMOVE B, BYTES BYTE+ (cf, below) is all I would really care about
myself.

> define byte/octet access
> - essential for comms protocols
> - essential for embedded systems

YES! (repeat ! a random non zero number of times).

> Once this is done, the necessary tools for Xstrings will be
> available. Currently, our and our clients' opinion is that
> run-time selection of encoding is essential, and that Xstring
> words should be separated from DCS activities. If the Xstring
> words eventually become synonyms for DCS words, that would be
> a bonus.

Is that run-time selection for BOTH the OCS and ACS, or just the ACS?

Where I would like to see common Xstring semantics already available is
in the ACS, where run-time selection of encoding is a must.

I still think that what Anton is on about is Xstrings in the OCS [*].
Not many OS's will switch their native encoding on the fly. Even more,
you wouldn't want to find yourself in KOI-7 (that is, lower case ASCII7
positions are taken by upper case Cyrillic counter-parts) in your UTF-8
system messages because of a mislaid SET-ENCODING.

[* Oh, and a bit of his preferred OCS conquering the ACS and taking
over the world]

Stephen Pelc

Oct 4, 2005, 7:49:40 PM

On 4 Oct 2005 07:26:38 -0700, "Bruce McFarling" <agi...@netscape.net>
wrote:

>> In order to handle this cleanly, a standard will first have to:

>> clean up the meaning of char in the existing standard
>
>Is this what CHAR vs WCHAR is getting at, reducing overloading on CHAR
>to help eliminate ambiguities?

Say that DCS = 16 bit Unicode, the common relationship char=byte=au
fails, counted strings use 16 bit counts, COUNT can no longer be used
to step through bytes in memory ...

So yes.

>> clean up the char dependencies in the file word sets
>
>I noted a reference to BIN mode wrt this in an I18N proposal, and the
>proposal to make the counts in file words apply to address units. That
>certainly would not distress me. Though an OCT fam that works with B@
>B! BMOVE B, BYTES BYTE+ (cf, below) is all I would really care about
>myself.

Some of the file set words are defined in terms of characters rather
than address units or bytes.

>Is that run-time selection for BOTH the OCS and ACS, or just the ACS?

Just the ACS - but the OCS <> DCS issue is common at present.

We should at least address the issue where DCS <> OCS <> ACS, if
only because the problem actually exists in current Forth
applications.

Bruce McFarling

Oct 5, 2005, 7:12:56 AM

Stephen Pelc wrote:

> On 4 Oct 2005 07:26:38 -0700, "Bruce McFarling" <agi...@netscape.net>
> wrote:

> >> In order to handle this cleanly, a standard will first have to:
> >> clean up the meaning of char in the existing standard

> >Is this what CHAR vs WCHAR is getting at, reducing overloading on CHAR
> >to help eliminate ambiguities?

> Say that DCS = 16 bit Unicode, the common relationship char=byte=au
> fails, counted strings use 16 bit counts, COUNT can no longer be used
> to step through bytes in memory ...

> So yes.

I think that extending the OCTET words to include byte-get (B@+) and
byte-put (B!+) is cleaner. Pragmatically, there could well be more
code relying on char = 1 byte than code relying on the purist use of
CHAR, as Anton argues. Two alternatives would therefore appear to be:

(1a) CHAR is the DCS; it's allowed to be any uniform-width character
that is upwardly compatible with ASCII-7, whether an 8-bit code page,
plane 0 of UTF-16, or UTF-32. Where you used to use CHAR as BYTES, use
B@ etc.

(2a) CHAR is unsigned 8-bit, WCHAR is a uniform-width unsigned
character DCS, 1 CHARS <= 1 WCHARS <= 1 CELLS
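
For illustration, here is what the byte words might look like on the common byte-addressed system with 8-bit chars, where they are just synonyms for the character words. The names B@ B! B@+ B!+ come from the discussion; the stack effects shown are an assumption, since nothing in this thread pins them down.

```forth
\ Assumes: one address unit = one byte = one char.
: b@  ( addr -- byte )  c@ ;
: b!  ( byte addr -- )  c! ;

\ Fetch or store a byte and step to the next address.
: b@+ ( addr -- addr' byte )  dup 1+ swap b@ ;
: b!+ ( byte addr -- addr' )  tuck b! 1+ ;
```

Under alternative (1a), only these definitions would need to change on a system with wider chars, while code written against them stays the same.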

> >> clean up the char dependencies in the file word sets

> >I noted a reference to BIN mode wrt this in an I18N proposal, and the
> >proposal to make the counts in file words apply to address units. That
> >certainly would not distress me. Though an OCT fam that works with B@
> >B! BMOVE B, BYTES BYTE+ (cf, below) is all I would really care about
> >myself.

> Some of the file set words are defined in terms of characters rather
> than address units or bytes.

In ordinary character mode, that makes sense. It's in binary mode that
it's clearly broken. The minimal change may be to fix it where it's
broken ... to modify where it says "characters" in the file access
words to "characters (or in a binary mode, address units)".

And of course if you have address unit counts in binary mode, and you
have the byte words, you can save and restore arbitrary data
structures composed of bytes, cells, chars, etc.

(1b) Following from the discussion above, in binary mode, replace
character counts by address unit counts in file access words that
involve data counts.

(2b) Following (2a) above, where it says chars, it means unsigned bytes.

Nothing I have for personal use goes beyond latin-1, and it mostly
stays within the CHAR universe, and mostly treats files as sequences of
characters, so the more convenient for me is for CHARS to scale up
(alternative 1) in the spirit of ANS'94 and to have BYTE words to be
able to avoid assuming char=byte. So alternative 1 would be handier
for me, but 2 would not be any great dramas.

> >Is that run-time selection for BOTH the OCS and ACS, or just the ACS?

> Just the ACS - but the OCS <> DCS issue is common at present.

> We should at least address the issue where DCS <> OCS <> ACS, if
> only because the problem actually exists in current Forth
> applications.

The XCHAR semantics would cope with operating system character sets
whether they are uniform width or variable width. If that was the
original intention, maybe name them as OCHARs.

Then the X in XCHAR could be read as eXternal for the ACS. If that is
the ACS, then GET-ENCODING SET-ENCODING is required.

The rationale for using variable width semantics for standard base OCS
and ACS functionality is straightforward -- GET and PUT cope equally
well with variable width and uniform width character encodings, and as
the OCS and ACS are not necessarily under the control of the Forth
implementation, the more generic approach is preferable.

Anton Ertl

Oct 5, 2005, 8:44:25 AM

Bernd Paysan <bernd....@gmx.de> writes:
>All XCHAR strings have their size in AUs, not in XCHARs.

No, they have their sizes in chars, not xchars or aus. Having their
sizes in chars is essential for compatibility with existing string
words like TYPE.

Anton Ertl

Oct 5, 2005, 8:51:50 AM

steph...@mpeforth.com (Stephen Pelc) writes:
>Once this is done, the necessary tools for Xstrings will be
>available.

What are Xstrings?

The central idea of xchars is that programs using just normal string
words (like TYPE) work for strings that contain xchars. And since
referring to individual characters is much rarer than dealing with
strings, that kind of transition requires much less porting effort
than one where we change the size of a char (because that affects a
lot of uses of strings, and most of the code I have seen assumes that
1 chars=1).

We tried that approach for the internal character and string handling
in Gforth, and it worked out rather painlessly.

Anton Ertl

Oct 5, 2005, 6:02:32 PM

Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>> Well, if you use X-SIZE (which should be called XC-DISPLAY-WIDTH or
>> somesuch)
>
>Yes, to avoid confusion. Should it work on whole strings (that's what I find
>useful), or on single characters (I use wcwidth() to implement it;
>wcwidth() works on a UCS32 character).

I, too, thought that having one for strings might be more useful.
Since we are still in the experimentation stage, one could provide
both, and see which one is used more. I see you have done so in your
reference implementation; I feel that the XC prefix is not the right
one for a word working on a string, though (i.e., I would call the
character version XC-DISPLAY-WIDTH, and the string version
S-DISPLAY-WIDTH, or just DISPLAY-WIDTH).

Bruce McFarling

Oct 6, 2005, 1:38:25 AM

Anton Ertl wrote:

> steph...@mpeforth.com (Stephen Pelc) writes:
> >Once this is done, the necessary tools for Xstrings will be
> >available.

> What are Xstrings?

> The central idea of xchars is that programs using just normal string
> words (like TYPE) work for strings that contain xchars. And since
> referring to individual characters is much rarer than dealing with
> strings, that kind of transition requires much less porting effort
> than one where we change the size of a char (because that affects a
> lot of uses of strings, and most of the code I have seen assumes that
> 1 chars=1).

> We tried that approach for the internal character and string handling
> in Gforth, and it worked out rather painlessly.

In that context the monospace fonts make sense, because you normally
have to have the DCS up and running before you have a proper rich text
display, and it is perfectly possible that you run without ever having
a proper rich text display, or the rich text display is off site, as in
a html server. That makes TYPE a system utility. System utilities do
not do much actual text processing, and the text processing that they
do tends to fall in regular expression pattern recognition.

Under that, the definition of XCHAR is: an extended DCS character set
where each code point is defined in terms of one or more CHAR.

(Anton's position is clearly that the extension DCS should be chosen to
recover the original Forth position of DCS=OCS=ACS, with translation on
the fly to bridge any gaps. But that is to one side of the proposal
that XCHARs be an additional word set that does that for the DCS.)

The semantics are perfectly general (apart from setting encodings and
handling rich text display, where the latter is not handled at this
level in any event), but the specifics of how they integrate to
existing standard words will vary substantially depending on whether
you are allowing the DCS to be a variable width proper superset of
ASCII7, or supporting an application that is offering a translation
dictionary between Russian encoded in KOI-8 and French encoded in
latin-1 (and obviously using UTF-16 plane 0 internally to remain sane).

Anton Ertl

Oct 8, 2005, 8:35:01 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Bernd Paysan <bernd....@gmx.de> writes:
>>To avoid an unnecessary long name, we could say XC-WIDTH instead of
>>XC-DISPLAY-WIDTH.
>
>I think the potential for confusion is higher with the shorter name,
>and it is not used often enough to merit a shorter name.
>
>>There's another point to discuss: In my proposal, I introduced XKEY and
>>XEMIT. I think this is not optimal - when the XCHAR set is available, KEY
>>and EMIT should react correctly on XCHARs.
>
>Here's what I wrote about this issue when we started this:
>
>|- Words that deal with characters, but not with addresses, e.g., EMIT
>|and KEY (any others?): One might consider letting them process xcs, so
>|that definitions using them would automatically also be usable for xcs
>|without needing to be rewritten; however, most words using EMIT or KEY
>|probably also do character-address arithmetic, so they have to be
>|adapted to work with xcs anyway. My gut feeling is that less programs
>|need to be adapted, and in less problematic ways if we let EMIT and
>|KEY work on cs, and introduce new words XEMIT and XKEY that deal with
>|xcs:
>|
>|XEMIT ( xc -- )
>|XKEY ( -- xc )

I have thought a little more about this. IIRC the original example I
had in mind was code like

: mytype ( addr u -- )
  over + swap do
    i c@ emit
  loop ;

The idea was to let code like that continue to work. In order to do
that, xchars must not be passed to EMIT, and therefore XEMIT was
introduced. If EMIT emits the chars that make up an xchar in the
in-memory representation in the right order, this would output the
corresponding xchar.

However, this thought is based on the assumption that the in-memory
representation of xchars and the representation in the outside world
are the same. Otherwise the emitted characters would have to be
buffered and assembled into an xchar, and the xchar would have to be
translated to the external representation and then output.

Given this issue, and the fact that the amount of code that would
benefit from that is probably very small, and the additional confusion
(and potential for errors) of having both EMIT and XEMIT, I now think
that it is better to extend EMIT and KEY rather than to introduce XEMIT
and XKEY.

Currently this opinion has not been tested by implementing the
extended versions of EMIT and KEY in Gforth and looking what breaks,
so take this with a grain of salt (hmm, 357 occurrences of EMIT and 283
occurrences of KEY in Gforth; that might require quite a bit of
work just for checking).

Anton Ertl

Oct 8, 2005, 5:52:27 PM

Bernd Paysan <bernd....@gmx.de> writes:
>: xc@+ ( xcaddr -- xcaddr' u )
> count dup maxascii u< IF EXIT THEN \ special case ASCII
> 7F and 40 >r
> BEGIN dup r@ and WHILE r@ xor
> 6 lshift r> 5 lshift >r >r count
>\ dup C0 and 80 <> abort" malformed character"
> 3F and r> or
> REPEAT r> drop ;

That decodes UTF-8 into the Unicode number, which is the obvious
on-stack representation for Unicode characters. However, I wonder
what we would miss if we just used a more UTF-8-like on-stack
representation, implemented like this:

\ assumptions:
\ little-endian (big-endian would prefer different encoding)
\ no cell alignment restrictions
\ a whole cell starting at xc-addr1 is accessible
\ 32-bit cells
\ 8-bit chars
\ 1 chars = 1
\ no xchars with more than 4 bytes (true for Unicode)
: xc@+ ( xc-addr1 -- xc-addr2 xc )
  dup c@ >r
  r@ $80 u< if 1+ r> exit then
  r@ $c0 u< if true abort" malformed xchar" then
  r@ $e0 u< if dup 2 + swap @ $0000ffff and r> drop exit then
  r@ $f0 u< if dup 3 + swap @ $00ffffff and r> drop exit then
  r@ $f8 u< if dup 4 + swap @ ( $ffffffff and ) r> drop exit then
  true abort" malformed xchar" ;

Well, this code is somewhat faster, a lot less portable, and less
robust than yours, but that idea can also be implemented with other
code that would still be a little faster than yours. My question is:
What is the Unicode number good for, or can we just as well use a
different on-stack representation as long as the ASCII chars still
have ASCII codes?
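
As a rough illustration of the "other code" mentioned above, here is a
sketch of the same packed on-stack representation that does not rely on
endianness, unaligned fetches, or readable memory beyond the xchar. It
assumes 8-bit chars, 1 chars = 1 and cells of at least 32 bits; the names
UTF8-LEN and PACKED-XC@+ are made up for this example and are not part of
the proposal.

```forth
\ Sketch only: fetch a UTF-8 sequence and pack its bytes into one cell,
\ first byte in the least significant position.
\ UTF8-LEN and PACKED-XC@+ are hypothetical names.
: utf8-len ( lead-byte -- n )  \ bytes in the sequence, 0 if not a lead byte
  dup $80 u< if drop 1 exit then
  dup $c0 u< if drop 0 exit then      \ continuation byte
  dup $e0 u< if drop 2 exit then
  dup $f0 u< if drop 3 exit then
  $f8 u< if 4 else 0 then ;

: packed-xc@+ ( xc-addr1 -- xc-addr2 xc )
  dup c@ utf8-len dup 0= abort" malformed xchar"   ( addr n )
  2dup + rot rot                                   ( addr2 addr n )
  0 swap 0 do                                      ( addr2 addr xc )
    over i + c@  i 8 * lshift  or                  \ byte i goes to bits 8i..8i+7
  loop
  nip ;
```

With this representation ASCII characters still have their ASCII codes and
two xchars can be compared for equality, but the numeric order of the
packed values is not the order of the Unicode numbers.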

Bernd Paysan

Oct 3, 2005, 5:06:52 PM

Anton Ertl wrote:
> Here's what I wrote about this issue when we started this:
>
> |- Words that deal with characters, but not with addresses, e.g., EMIT
> |and KEY (any others?): One might consider letting them process xcs, so
> |that definitions using them would automatically also be usable for xcs
> |without needing to be rewritten; however, most words using EMIT or KEY
> |probably also do character-address arithmetic, so they have to be
> |adapted to work with xcs anyway. My gut feeling is that less programs
> |need to be adapted, and in less problematic ways if we let EMIT and
> |KEY work on cs, and introduce new words XEMIT and XKEY that deal with
> |xcs:
> |
> |XEMIT ( xc -- )
> |XKEY ( -- xc )
>
> I am still not sure if my gut feeling was right or not. I guess the
> best course is to do some experiments with both approaches, and record
> how they went.

So far, in Gforth, we use the separation of KEY/EMIT and XKEY/XEMIT, while
in bigFORTH, XKEY and XEMIT are aliases to KEY and EMIT. The latter seemed
more natural with the way bigFORTH deals with IO (vector table).

Fortunately, as long as you use memory-based words like TYPE, you can get a
deterministic IO effect, since we don't recode strings.

Bernd Paysan

Oct 8, 2005, 9:48:10 PM

Anton Ertl wrote:
> What is the Unicode number good for, or can we just as well use a
> different on-stack representation as long as the ASCII chars still
> have ASCII codes?

With a Unicode number, I can use other functions that are written with
Unicode in mind, like wcwidth. I can look into the Unicode code pages if I
need a character. It's, after all, a standard. And I can use XCHARs to
store whatever variable width data I have (that's something Klaus
Schleisiek did with a similar encoding for a long time).

Anton Ertl

Oct 9, 2005, 3:42:49 PM

Bernd Paysan <bernd....@gmx.de> writes:
>\ xchar reference implementation: UTF-8 (and ISO-LATIN-1)

And here are some example definitions using xchars.

One thing that I noticed is that it is actually not that easy to find
examples where characters are dealt with individually.

The following word works like TYPE, but prints the string
back-to-front.

: revtype1 ( xc-addr u -- )
  over >r + begin
    dup r@ u> while
      xchar- dup xc@ emit
  repeat
  r> 2drop ;

One other thing I noticed is that often, instead of converting an
xchar to the on-stack representation, it can just as well be treated
as a string (and this is often more efficient):

: revtype2 ( xc-addr u -- )
  over >r + begin
    dup r@ u> while
      0 -x/string over swap type
  repeat
  r> 2drop ;

Here's another example, implementation of the widely-available word
SCAN that searches for a character in a string. First, here is an
xchar variant of the non-xchar version in Gforth:

: scan1 ( xc-addr1 u1 xc -- xc-addr2 u2 )
  >r
  BEGIN
    dup
  WHILE
    over xc@ r@ <>
  WHILE
    +x/string
  REPEAT THEN
  rdrop ;

And here is a version that deals with the xchar as string:

: xc->s ( xc -- xc-addr u )
  \ convert xc into ALLOCATEd in-memory representation
  dup xc-size dup chars allocate throw swap ( xc xc-addr u )
  2dup 2>r xc!+? 0= abort" bug" 2drop 2r> ;

: scan2 ( xc-addr1 u1 xc -- xc-addr2 u2 )
  xc->s 2dup 2>r search 0= if \ no match
    dup /string then
  2r> drop free throw ;

In many cases, the programmer can also provide the xchar as string and
call SEARCH directly instead of through SCAN2.

Finally, here's a primitive implementation of ACCEPT for xchars.

: accept1 ( c-addr +n -- +n2 )
  over >r begin
    key dup #cr <> while ( c-addr1 u1 xc )
      dup 2swap xc!+? 0= if
        drop #bell then
      emit
  repeat
  2drop r> - ;

Anton Ertl

Oct 11, 2005, 10:04:29 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>: accept1 ( c-addr +n -- +n2 )
> over >r begin
> key dup #cr <> while ( c-addr1 u1 xc )
> dup 2swap xc!+? 0= if
> drop #bell then
> emit
> repeat
> 2drop r> - ;

This has a bug. Here's a correct version:

: accept1 ( c-addr +n -- +n2 )
over >r begin
key dup #cr <> while ( c-addr1 u1 xc )

dup 2swap xc!+? >r rot r> 0= if

Anton Ertl

Oct 11, 2005, 10:07:03 AM

steph...@mpeforth.com (Stephen Pelc) writes:
>How does this fit in with the wide character and internationalisation
>proposals at
> www.mpeforth.com/arena/
> i18n.propose.v7.PDF
> i18n.widechar.v7.PDF

Here's the comparison I made with the widechar paper:

|Pelc and Knaggs [pelc&knaggs01widechar] identified the same problems
|as we did (in particular the widespread environmental dependency on
|1 chars=1, and similar to us propose adding new words for dealing
|with wider characters: They propose adding wide-character versions of
|the existing character and string words, for use with wide
|fixed-width encodings; the system and old-style applications would
|continue to use the regular character and string words, but
|applications could be converted to use these wide-character words.
|In contrast, we propose adding words that support variable-width
|encodings, but only for words that deal with individual characters;
|the string words work just as well for strings containing extended
|characters as for strings containing classical characters. Our
|approach requires less conversion work, so we propose applying it
|throughout the system instead of just to application data.

Essentially, widechars and xchars are alternative proposals for the
same purpose, widechars for fixed-width character sets, and xchars for
variable-width character sets. A system could probably implement both
of them, but an application would probably use only one of them.

Whether the system uses extended characters internally is somewhat,
but not completely, orthogonal to the question of which kind of
extended characters are used. With both extensions, you can leave the
system mostly untouched (such that it does not necessarily work with
extended characters), and see the extension only as an offering for
applications. However, there are some differences between the
extensions here:

- For widechars, the unextended characters are typically 8-bit,
whereas for xchars with UTF-8 encoding the unextended characters are
ASCII.

- For xchars, some of the string words used by applications have to be
adapted (e.g., ACCEPT). These words are typically also used by the
system, making the system somewhat capable of handling xchars
internally already (one could introduce an ASCII-ACCEPT for internal
use by the system, but that would be pointless, as an xchar-extended
ACCEPT works well for ASCII characters).

The transition can be performed word-by-word, eventually making the
whole system xchars-capable, including the system-internal stuff.
E.g., Gforth's PARSE implementation has not been adapted to xchars
yet and cannot deal with non-ASCII delimiters, but still works with
ASCII delimiters even if the input stream contains non-ASCII
characters.

Overall, with xchars, the transition can be made gradually, and,
OTOH, the xchars-capability of words like ACCEPT that are used in
applications and the system makes the system half xchars-capable as
soon as you have xchars. So going all the way to a system that can
handle xchars everywhere is just another step in that direction.

- For widechars, the wordsets for ordinary chars and widechars are
completely separate, and the data has to be kept separately, too, so
a gradual transition is not possible. All the code that deals with
a given data structure (e.g., the input stream) would have to be
converted at once. In addition, the system would have to change the
implementation of the old words to behave correctly for the new data
structure (consider a word like SOURCE). So conversion of the whole
system to widechars appears hard and will probably be avoided by the
system implementor.

I can imagine a system that implements xchars and widechars, uses
xchars internally, and provides widechars for those applications that
already use them. There might be some gotchas in that concept, but at
the moment I don't see them.

Concerning the internationalisation paper, that covers things that
xchars do not cover. It seems mostly orthogonal to the xchars, at
least in concept (probably not in all the actually proposed words).

The Beez'

Oct 11, 2005, 1:00:14 PM

> It is not clean to store an integer (the count) in a character.
> It is not useful to have a count limited to 256 in Britain
> 65536 in Japan and 4 billion in China.
My idea. I do agree with you, we do not need XCHAR, because CHAR is
*already* an abstraction for characters! IMHO, we need a new string! I
tried to formulate an abstract string wordset for it, but due to other
activities, I've not elaborated it further. Maybe I should .. :-(

Maybe we should. Maybe only a way to check encoding is useful..

Hans Bezemer

Bruce McFarling

Oct 12, 2005, 11:16:14 AM

The Beez' wrote:

> My idea. I do agree with you, we do not need XCHAR, because CHAR is
> *already* an abstraction for characters!

But chars is an abstraction for characters that are a constant number
of address units. "CHAR+" always adds the same amount to an address.
That breaks down when you have an encoding that ranges from one to
three bytes, depending on the character that has been encoded. Or an
encoding that is normally two bytes, but is sometimes four bytes.

Like, for example, UTF-8 and UTF-16.

It's not like these are exotic or extremely unusual character sets to
encounter. Most Chinese language web pages are transported across the
internet as UTF-8 encoded Unicode characters.
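
To make the difference concrete, here is a small sketch that counts the
characters in a UTF-8 string. It assumes 8-bit chars and 1 chars = 1, and
the name UTF8-COUNT is made up for this example. The byte count and the
character count only agree for pure ASCII, which is exactly the case
where CHAR+ arithmetic still works.

```forth
\ Sketch only: count code points by counting every byte that is not a
\ continuation byte (continuation bytes have the bit pattern 10xxxxxx).
\ UTF8-COUNT is a hypothetical name.
: utf8-count ( c-addr u -- n )
  0 rot rot               ( n c-addr u )
  over + swap ?do         \ loop over the addresses of all bytes
    i c@ $c0 and $80 <> if 1+ then
  loop ;
```

For the six bytes 41 C3 A4 E2 82 AC (an ASCII letter, a two-byte and a
three-byte character) this gives 3, while the string length as seen by
TYPE is 6.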

jmdra...@yahoo.com

Oct 12, 2005, 3:43:54 PM

Bruce McFarling wrote:
> Brad Eckert wrote:
>
> > Should XCHARS be variable length?
>
> Yes, that's the whole point. Variable length character set encodings
> are becoming more common, and since UTF-8 is the easiest upgrade path
> to full Unicode from classic C character=byte, anybody who wants to
> talk to internationalised Linux applications is going to want to handle
> variable length character sets.

Well, one alternative to variable length XCHARS would be to look at
UTF-8 as "compressed" or "packed" Unicode. If you have some UTF-8
data to process, first "unpack" it into a buffer, then do whatever
manipulations you need to do on CELL wide data. Then "pack" it
back into UTF-8 when you're finished.
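
A minimal sketch of the "unpack" step, to show what that would involve.
It assumes well-formed UTF-8 (nothing is validated), 8-bit chars,
1 chars = 1, and a destination area with room for one cell per input
byte; the names UTF8-DECODE+ and UTF8>CELLS are made up for this example.

```forth
\ Sketch only: unpack a UTF-8 string into one code point per cell.
\ UTF8-DECODE+ and UTF8>CELLS are hypothetical names.
: utf8-decode+ ( c-addr1 -- c-addr2 u )  \ decode one sequence
  count dup $80 u< if exit then          \ ASCII: one byte
  dup $e0 u< if $1f and 1 else           \ payload bits, continuation count
  dup $f0 u< if $0f and 2 else
                $07 and 3 then then      ( c-addr bits n )
  0 do                                   ( c-addr bits )
    6 lshift swap count $3f and rot or   \ append 6 bits per continuation byte
  loop ;

: utf8>cells ( c-addr u a-addr -- n )    \ n = number of cells written
  dup >r >r                    ( c-addr u ) ( R: a-addr dest )
  over + swap                  ( end c-addr )
  begin 2dup u> while          \ input left?
    utf8-decode+ r@ !          \ store the code point
    r> cell+ >r                \ advance the destination
  repeat
  2drop  r> r> - 1 cells / ;
```

The "pack" step would be the mirror image: walk the cells and store each
one with a word like XC!+ from the proposal.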

Regards,

John M. Drake

Anton Ertl

Oct 12, 2005, 4:13:20 PM

jmdra...@yahoo.com writes:
>Well, one alternative to variable length XCHARS would be to look at
>UTF-8 as "compressed" or "packed" Unicode. If you have some UTF-8
>data to process, first "unpack" it into a buffer, then do whatever
>manipulations you need to do on CELL wide data. Then "pack" it
>back into UTF-8 when you're finished.

Yes, you can work that way with xchars, or you could just use big
(4-byte) chars and do the conversion on I/O; or (to avoid portability
problems for programs that assume 1 chars=1) you could introduce a
wide fixed-width character data type (widechar proposal by Pelc and
Knaggs), and do the conversion on I/O.

I, too, originally thought that that was the way to go. But when I
thought about example programs, I found that there are not that many
examples where you have to deal with individual characters (examples
that I came up with were anagram or palindrome checkers, i.e., stuff
that's not very common).

In most cases you deal with strings of characters, or you can write
the code such that it works on strings (see the scan2 example). And
for that kind of code, variable-width characters (done the right way)
work just as well as fixed-width characters, so why deal with all the
trouble that widening fixed-width characters would cause?

Brad Eckert

Oct 12, 2005, 4:47:43 PM

I haven't dealt with UTF-8 strings, but they look like regular byte
strings to me except that sometimes substrings need to be treated as
characters. I think Forth source code expressed in UTF-8 format would
compile with no problems, for example.

Are there editors that support UTF-8?

Brad

Anton Ertl

Oct 12, 2005, 4:50:56 PM

"Brad Eckert" <nospaa...@tinyboot.com> writes:
>I haven't dealt with UTF-8 strings, but they look like regular byte
>strings to me except that sometimes substrings need to be treated as
>characters.

Correct.

>I think Forth source code expressed in UTF-8 format would
>compile with no problems, for example.

Yes. Case-insensitive systems might cause trouble in rare cases,
however (hmm, I think we have not adapted that part of Gforth yet,
either).

>Are there editors that support UTF-8?

Yes. <http://www.cl.cam.ac.uk/~mgk25/unicode.html#apps> lists vim,
emacs, yudit, mined2000, joe, cooledit, qemacs, abiword (if you count
that as editor). Vim works nicely (once you install the appropriate
locale and set the locale environment variables accordingly), but I
have had little luck with Emacs.

[trailing full quote]

Please quote properly
<http://www.complang.tuwien.ac.at/anton/mail-news-errors.html#quoting>

Bruce McFarling

Oct 13, 2005, 9:44:20 AM

Anton Ertl wrote:
> I, too, originally thought that that was the way to go. But when I
> thought about example programs, I found that there are not that many
> examples where you have to deal with individual characters (examples
> that I came up with were anagram or palindrome checkers, i.e., stuff
> that's not very common).

Obviously a big area is regular expression handling, and when you have
something that is naturally expressed in terms of "ok, this substring matches,
now skip ahead a character and ...", the real key character level
operations are exactly the core of the XCHAR semantics:

* get a character ["XC-GET" --> XC@+ ( xc-addr -- xc-addr+ xc ) ]
* put a character ["XC-PUT" --> XC!+ ( xc xc-addr -- xc-addr+ ) ]
* advance a character ["XC++" --> XCHAR+ ( xc-addr -- xc-addr+ ) ]

Assuming that XCHARs are defined as "a character encoding in which a
character is made up of one or more CHARs", exact string matching is
the same with XCHARs and CHARs -- two chunks of memory a certain number
of au's long, are they identical? The basic difference that breaks
down the assumptions of CHAR is that CHARs do not need the address of a
character to know how much to add to an address, while variable width
characters need to know what actual character they are talking about to
know how to increment. And since the same information is needed to
fetch and store as to increment, making GET and PUT atomic operations
is a natural.

Personally I would not mind just the XCHAR semantics even if the target
is fixed-width wide characters, provided that there is a word to get
the "safe buffer size" for a given character count, or equivalently the
maximum size of an XCHAR.

Albert van der Horst

Oct 13, 2005, 1:09:19 PM

In article <1129135663.7...@g14g2000cwa.googlegroups.com>,

Brad Eckert <nospaa...@tinyboot.com> wrote:
>I haven't dealt with UTF-8 strings, but they look like regular byte
>strings to me except that sometimes substrings need to be treated as
>characters. I think Forth source code expressed in UTF-8 format would
>compile with no problems, for example.

If we could have a new FIND+ that works on what I want to call areas:
addr <number of address units>
Didn't we want to obsolete the old FIND anyway?

Substrings would be representable by areas too.

Old strings would be <addr count> and would be converted to an area by
a word which is theoretically CHARS and would be a NOP most of the
time.

Now FIND+ couldn't care less whether the name to be found would be
an IP-address, a Chinese or a German word as long as the count in
address units is right. By the definition of address units this count
would be integral, even if Chinese chars are 7.5 German characters
or 15 Cambodian characters.

Areas would be the choice for communication buffers for messages
too.
I could paraphrase Anton Ertl's remark by saying that areas are
what mostly is transferred, foregoing the need for individual
characters, and accommodating any type of characters at the same time.

And by the way, thanks Brad for pointing out again the crucial fact
that xchars are variable width. Somehow I missed that or didn't
realize.

>
>Brad
>

Groetjes Albert

--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
alb...@spenarnc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

jmdra...@yahoo.com

Oct 13, 2005, 7:37:43 PM

Anton Ertl wrote:
> jmdra...@yahoo.com writes:
> >Well, one alternative to variable length XCHARS would be to look at
> >UTF-8 as "compressed" or "packed" Unicode. If you have some UTF-8
> >data to process, first "unpack" it into a buffer, then do whatever
> >manipulations you need to do on CELL wide data. Then "pack" it
> >back into UTF-8 when you're finished.
>
> Yes, you can work that way with xchars, or you could just use big
> (4-byte) chars and do the conversion on I/O; or (to avoid portability
> problems for programs that assume 1 chars=1) you could introduce a
> wide fixed-width character data type (widechar proposal by Pelc and
> Knaggs), and do the conversion on I/O.
>
> I, too, originally thought that that was the way to go. But when I
> thought about example programs, I found that there are not that many
> examples where you have to deal with individual characters (examples
> that I came up with were anagram or palindrome checkers, i.e., stuff
> that's not very common).

You seem to be taking the view that Chuck Moore took regarding
ColorForth. (Forth processes words, not characters). Anyway,
there's an obvious application that seems to have skipped your
mind. Text editors. In most text editors you deal with the
individual characters. In ColorForth you can't "edit" a word.
You can only delete then retype it. That's fine for writing
programs (especially ColorForth ones) but I'm not sold on that
for general text editing.

Anyway, using "variable length" Xchars turns the easiest of
text editing tasks into the most complex. I'm talking about
"overwrite mode". With fixed length chars that simply means
replacing one char with another. But with variable length
chars you'd have to do a deletion/insertion because you
couldn't be sure ahead of time if the new char would be the
same length as the char it was replacing.

> In most cases you deal with strings of characters, or you can write
> the code such that it works on strings (see the scan2 example). And
> for that kind of code, variable-width characters (done the right way)
> work just as well as fixed-width characters, so why deal with all the
> trouble that widening fixed-width characters would cause?
>
> - anton

What "trouble" do you see widening fixed-width chars causing?
It seems quite simple to me. In fact all of the words in the
XChar wordset can be trivially implemented if you assume fixed
char width. The only problem I see with fixed-width chars is
that they take up more space. Furthermore processing
fixed-width chars is potentially faster. Take XC@+ for
instance. With a fixed-width char this is simply @+.

Regards,

John M. Drake

Bruce McFarling

Oct 14, 2005, 5:22:29 AM

jmdra...@yahoo.com wrote:
> Anyway, using "variable length" Xchars turns the easiest of
> text editing tasks into the most complex. I'm talking about
> "overwrite mode". With fixed length chars that simply means
> replacing one char with another. But with variable length
> chars you'd have to do a deletion/insertion because you
> couldn't be sure ahead of time if the new char would be the
> same length as the char it was replacing.

Yes, having "insert" mode as the basic operation and "overwrite" mode
as an elaboration built on top of that is natural with variable width
characters.

It's not as if having a standardised set of names for the actions you
need to work with variable width characters will somehow "force"
someone to treat their constant-width characters as if they are
variable width. Indeed, there already seems to be an extant proposal
for "wide" constant-width characters that may be wider than the CHAR
character, WCHARs.

But a WCHAR does not *enable* portable code that can cope when *faced
with* variable width characters, and an XCHAR does.

jmdra...@yahoo.com

Oct 14, 2005, 3:45:32 PM

Fair enough. Have both options available and the end implementor
can choose what's best for his project. Glad to see these kinds of
discussions. Sometimes we focus too much on the problems ignoring
obvious solutions. Like the reason given for why there's no graphics
standard. "Oh, it wouldn't work on this 1960s teletype machine I
have in my basement" rather than developing something that could
work portably across workstations. It's like there's at least 3
implementations of OpenGL for Forth that are all different.

Regards,

John M. Drake

Anton Ertl

Oct 14, 2005, 6:56:48 PM

jmdra...@yahoo.com writes:

>
>Anton Ertl wrote:
>> I, too, originally thought that that was the way to go. But when I
>> thought about example programs, I found that there are not that many
>> examples where you have to deal with individual characters (examples
>> that I came up with were anagram or palindrome checkers, i.e., stuff
>> that's not very common).
>
>You seem to be taking the view that Chuck Moore took regarding
>ColorForth. (Forth processes words, not characters). Anyway,
>there's an obvious application that seems to have skipped your
>mind. Text editors. In most text editors you deal with the
>individual characters.

Sure. So if BMW or any of the other editors in Forth is to be
extended to work with xchars, they need some changes. The point is
that most applications around need changes only in a few places (or,
if they are lucky, nowhere), not everywhere where strings are handled.

>Anyway, using "variable length" Xchars turns the easiest of
>text editing tasks into the most complex. I'm talking about
>"overwrite mode". With fixed length chars that simply means
>replacing one char with another. But with variable length
>chars you'd have to do a deletion/insertion because you
>couldn't be sure ahead of time if the new char would be the
>same length as the char it was replacing.

So what? Unless the editor is optimized for overwrite-only, turning
an overwrite into a delete-forward followed by an insert should be
peanuts.
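
For a plain contiguous line buffer, the delete-forward-then-insert
approach could look roughly like the following sketch. It assumes UTF-8,
8-bit chars, 1 chars = 1, offsets that always point at the first byte of
a character, and a buffer that never overflows (there are no bounds
checks); all names in it are made up for this example.

```forth
\ Sketch only: overwrite mode built from delete-forward plus insert.
\ All names here are hypothetical.
80 constant ebuf-size
create ebuf ebuf-size allot
variable elen  0 elen !          \ number of bytes in use

: utf8-size ( lead-byte -- n )   \ bytes in the sequence starting here
  dup $80 u< if drop 1 exit then
  dup $e0 u< if drop 2 exit then
  $f0 u< if 3 else 4 then ;

: ebuf-tail ( pos -- c-addr u )  \ the bytes from offset pos to the end
  dup ebuf + swap elen @ swap - ;

: delete-xchar ( pos -- )        \ remove the character at offset pos
  dup ebuf + c@ utf8-size        ( pos n )
  2dup + ebuf-tail               ( pos n src u )
  2swap negate elen +!           ( src u pos )
  ebuf + swap move ;             \ close the gap

: insert-bytes ( c-addr u pos -- )  \ insert u bytes at offset pos
  >r  r@ ebuf-tail >r            ( c-addr u src ) ( R: pos n )
  2dup + r> move                 ( c-addr u )  \ open a gap of u bytes
  dup elen +!
  r> ebuf + swap move ;          \ copy the new bytes into the gap

: overwrite-xchar ( c-addr u pos -- )
  dup elen @ u< if dup delete-xchar then   \ nothing to delete at the end
  insert-bytes ;
```

The only extra work compared to fixed-width chars is the two MOVEs, and
those are needed for insert mode anyway.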

BTW, overwriting a character with another does not necessarily mean
replacing a character with another. E.g., in the buffer
representation I used in my last editor (hole at the cursor),
overwriting a character means deleting a character from the
behind-cursor part of the buffer, and inserting the replacement at the
before-cursor part, even for fixed-width characters. Alternatively,
you could replace the character in-place in the behind-cursor part,
then do a cursor-right, which in turn consists of deleting the char
from the behind-cursor part, and inserting it at the before-cursor
part. As you can see, that data structure is optimized for inserting
and deleting.

>> In most cases you deal with strings of characters, or you can write
>> the code such that it works on strings (see the scan2 example). And
>> for that kind of code, variable-width characters (done the right way)
>> work just as well as fixed-width characters, so why deal with all the
>> trouble that widening fixed-width characters would cause?
>>
>> - anton
>
>What "trouble" do you see widening fixed-width chars causing?

That depends on how you do it:

- make 1 CHARS > 1: lots of code breaks.

- add widechars: the transition from ordinary chars to widechars is
harder than the transition to xchars. And you have to transition
large parts of an application at the same time.

>It seems quite simple to me. In fact all of the words in the
>XChar wordset can be trivially implemented if you assume fixed
>char width.

Yes. However, using UTF-32 (i.e., fixed-width) xchars with 8-bit
chars would mean that a string containing ASCII characters would be
represented as xchars differently from chars. So, should string words
like READ-LINE produce an xchar string or a char string? You would
have to introduce another set of string words, like for widechars.
And while we are at it, you probably should be using widechars anyway,
because they are designed for fixed-width encodings, whereas xchars
are designed for variable-width encodings.