RfD: XCHAR wordset (for UTF-8 and alike)

Bernd Paysan

Sep 25, 2005, 6:16:25 PM
Problem:

ASCII is only appropriate for the English language. Most western languages,
however, fit somewhat into the Forth frame, since a byte is sufficient to
encode the few special characters of each (though not always with the same
encoding; latin-1 is the most widely used). For other languages, different
character sets have to be used, several of them variable-width. The most
prominent representative is UTF-8. Let's call these extended characters
XCHARs. Since ANS Forth specifies ASCII encoding, only ASCII-compatible
encodings may be used.

Proposal

Datatypes:

xc is an extended char on the stack. It occupies one cell and is
a subset of unsigned cell. Note: UTF-8 cannot store more than 31
bits; on 16-bit systems, only the UCS16 subset of the UTF-8
character set can be used.
xc_addr is the address of an XCHAR in memory. Alignment requirements are
the same as for c_addr. The memory representation of an XCHAR differs
from the stack representation, and depends on the encoding used. An XCHAR
may use a variable number of address units in memory.

Common encodings:

Input and files are commonly encoded as either iso-latin-1 or utf-8. The
encoding depends on settings of the computer system, such as the LANG
environment variable on Unix. You can use the system consistently only if
you don't change the encoding, or if you use only the ASCII subset.

Words:

XC-SIZE ( xc -- u )
Computes the memory size of the XCHAR xc in address units.

XC@+ ( xc_addr1 -- xc_addr2 xc )
Fetches the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XC!+ ( xc xc_addr1 -- xc_addr2 )
Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
location after xc.

XCHAR+ ( xc_addr1 -- xc_addr2 )
Adds the size of the XCHAR stored at xc_addr1 to this address, giving
xc_addr2.

XCHAR- ( xc_addr1 -- xc_addr2 )
Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
work for every possible encoding.

X-SIZE ( xc_addr u -- n )
n is the number of monospace ASCII characters that take the same space to
display as the XCHAR string starting at xc_addr, using u address units.

XKEY ( -- xc )
Reads an XCHAR from the terminal.

XEMIT ( xc -- )
Prints an XCHAR on the terminal.

The following words behave differently when the XCHAR extension is present:

CHAR ( "<spaces>name" -- xc )
Skip leading space delimiters. Parse name delimited by a space. Put the
value of its first XCHAR onto the stack.

[CHAR]
Interpretation: Interpretation semantics for this word are undefined.
Compilation: ( "<spaces>name" -- )
Skip leading space delimiters. Parse name delimited by a space. Append the
run-time semantics given below to the current definition.
Run-time: ( -- xc )
Place xc, the value of the first XCHAR of name, on the stack.

Reference implementation:

Unfortunately, both the Gforth and the bigFORTH implementation have several
system-specific parts.
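
For illustration only, here is a minimal, system-independent sketch of
XC-SIZE and XC@+ for the UTF-8 case (assuming 1 chars = 1 au and
Gforth-style $-prefixed hex literals; this is not the actual Gforth or
bigFORTH code):

\ bytes needed to encode the code point xc in UTF-8
: xc-size ( xc -- u )
  dup      $80 u< if drop 1 exit then
  dup     $800 u< if drop 2 exit then
  dup   $10000 u< if drop 3 exit then
  dup  $200000 u< if drop 4 exit then
  dup $4000000 u< if drop 5 exit then
  drop 6 ;

\ length in bytes of the UTF-8 sequence started by this lead byte
: lead-size ( byte -- n )
  dup $E0 u< if drop 2 exit then
  dup $F0 u< if drop 3 exit then
  dup $F8 u< if drop 4 exit then
  dup $FC u< if drop 5 exit then
  drop 6 ;

\ fetch one xchar and advance the address past it
: xc@+ ( xc-addr1 -- xc-addr2 xc )
  count dup $80 u< if exit then     \ single-byte (ASCII) fast path
  dup lead-size >r                  \ r: total length n
  $7F r@ rshift and                 \ payload bits of the lead byte
  r> 1- 0 ?do                       \ accumulate the continuation bytes
    6 lshift >r count $3F and r> or
  loop ;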

Experience:

Built into Gforth (development version) and recent versions of bigFORTH.
Open issues are file reading and writing (conversion on the fly or leave as
it is?).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Bruce McFarling

Sep 26, 2005, 2:29:05 AM

Bernd Paysan wrote:
> Problem:

> ASCII is only appropriate for the English language. Most western languages
> however fit somewhat into the Forth frame, since a byte is sufficient to
> encode the few special characters in each (though not always the same
> encoding can be used; latin-1 is most widely used, though).

> For other languages, different char-sets have to be used, several of
> them variable-width. Most prominent representant is UTF-8. Let's call
> these extended characters XCHARs. Since ANS Forth specifies ASCII
> encoding, only ASCII-compatible encodings may be used.

> Experience:

> Build into Gforth (development version) and recent versions of bigFORTH.
> Open issues are file reading and writing (conversion on the fly or leave as
> it is?).

The first thing to settle is whether XCHARS are "these" extended
character sets that are upwardly compatible with printable ASCII, or
"this" extended character set. And I could well see a wish to use, eg,
UTF-8 in file storage (if my primary targets were Europe, Africa, and
the Americas) and UTF-16 in processing.

It seems to me that, since you can always tell where a UTF character
begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
but you need to know WHICH it is as well as the endianness for UTF16
and UTF32, the most coherent thing to do is to have AN XCHAR
representation for processing and a set of file modes that specify the
kind of file you are loading:

* ASCII (latin-1, etc, any fixed 8-bit code pages)
* UTF8
* UTF16 (endianness of your system)
* UTF32 (endianness of your system)
* UTF16B
* UTF16L
* UTF32B
* UTF32L

Then if the file mode matches the system mode, you just load the file,
if it mismatches, it is translated on the fly on reading and writing.

Obviously the system mode would be a thing for a system query.

Bernd Paysan

Sep 26, 2005, 5:35:12 AM
to fort...@yahoogroups.com
Bruce McFarling wrote:

> The first thing to settle is whether XCHARS are "these" extended
> character sets that are upwardly compatible with printable ASCII, or
> "this" extended character set. And I could well see a wish to use, eg,
> UTF-8 in file storage (if my primary targets were Europe, Africa, and
> the Americas) and UTF-16 in processing.
>
> It seems to me that, since you can always tell where a UTF character
> begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
> but you need to know know WHICH it is as well as endianess for UTF16
> and UTF32, the most coherent thing to do is to have AN XCHAR
> representation for processing and a set of file modes that specify the
> kind of file you are loading:
>
> * ASCII (latin-1, etc, any fixed 8-bit code pages)

Though, depending on the fixed code-page, the translation will be different
(latin-1 different from latin-2).

> * UTF8
> * UTF16 (endedness of your system)
> * UTF32 (endedness of your system)
> * UTF16B
> * UTF16L
> * UTF32B
> * UTF32L

You can add a few other encodings. UCS16 managed to have an easy conversion
from several previous ASCII-compatible encodings, even though the code
pages of the non-ASCII portion move within UCS16 (e.g. the GB2312 format).
Which encodings are actually known to the Forth system would be the subject
of a query, too.

> Then if the file mode matches the system mode, you just load the file,
> if it mismatches, it is translated on the fly on reading and writing.
>
> Obviously the system mode would be a thing for a system query.

Exactly.

Stephen Pelc

Sep 26, 2005, 6:17:02 AM
On Mon, 26 Sep 2005 00:16:25 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>ASCII is only appropriate for the English language. Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though). For other
>languages, different char-sets have to be used, several of them
>variable-width. Most prominent representant is UTF-8. Let's call these
>extended characters XCHARs. Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

How does this fit in with the wide character and internationalisation
proposals at
www.mpeforth.com/arena/
i18n.propose.v7.PDF
i18n.widechar.v7.PDF
These proposals/RFCs are from the application developers point of
view. There's a sample implementation in the file
LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
is derived from 15+ years of experience. From the file header:

"You are free to use this code in any way, as long as the MPE
copyright notice in this section is retained.

This code is an implementation of the draft ANS internationalisation
specification available from the download area of the MPE web site.
The implementation provides more functionality than is required by
the ANS draft standard and provides enough hooks to be the basis of
a practical system."



>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.

IMHO standardising a word that can't be guaranteed to work is not
beneficial. If you must step back through a string, extend the
definition of /STRING to form /-STRING or some such, such that
the start of the string must be at the start of a character.

IMHO your approach is from the implementor's perspective, which is
valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
that what *applications* do with strings is at a *much* higher level
than implementors issues.

Can we merge the application developer issues with the kernel
issues? These include cleaning up the meaning of character,
byte/octet access, file words and so on.

I look forward to discussing these issues at EuroForth 2005.

Stephen


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Bernd Paysan

Sep 26, 2005, 8:31:48 AM
Stephen Pelc wrote:
> How does this fit in with the wide character and internationalisation
> proposals at
> www.mpeforth.com/arena/
> i18n.propose.v7.PDF
> i18n.widechar.v7.PDF
> These proposals/RFCs are from the application developers point of
> view. There's a sample implementation in the file
> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
> is derived from 15+ years of experience. From the file header:

The main difference from the i18n.widechar.v7.PDF proposal is that our
proposal (Anton's and mine) doesn't distinguish between development character
set and application character set. I think this distinction is unnatural
and only valid in a historical context, e.g. the different code-pages used
in DOS-based Windows, and wide characters, which won't coexist with ASCII.

The string-based localization proposal in i18n.propose.v7.PDF is orthogonal
to the character issue, and works regardless of the coding system, as
strings always stay strings.

I would welcome it if you set up an RfD for your proposal.

>>XCHAR- ( xc_addr1 -- xc_addr2 )
>>Goes backward from xc_addr1 until it finds an XCHAR so that the size of
>>this XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed
>>to work for every possible encoding.
>
> IMHO standardising a word that can't be guaranteed to work is not
> beneficial. If you must step back through a string, extend the
> definition of /STRING to form /-STRING or some such, such that
> the start of the string must be at the start of a character.

Quite a number of variable-width wide-char encodings, especially UTF-8,
allow both stepping forward and backward a character at a time. Another
possible compromise is to simply outlaw those variable-width wide-char
encodings that don't allow stepping back. UTF-8 allows finding the next and
the previous character regardless of where you point. Some of the Chinese
encodings can do the same: the first byte of a double-byte glyph there has
the MSB set, the second clear.
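
For the UTF-8 case, stepping back is particularly simple, since continuation
bytes are recognizable by their top two bits. A sketch (assuming 1 chars =
1 au and that xc_addr1 points just past a well-formed character; illustrative
only):

: xchar- ( xc-addr1 -- xc-addr2 )
  begin 1- dup c@ $C0 and $80 <> until ;   \ skip 10xxxxxx continuation bytes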

It's like seeking in a file. Not all files allow seeking (pipes and sockets
won't, e.g.). Seeking is a useful activity, though. Adding an X/STRING
( xc_addr u n -- xc_addr' u' ) isn't much trouble. n would be the
number of XCHARs to step forward (positive) or backward (negative).
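
A possible sketch of such an X/STRING in terms of XCHAR+ and the XCHAR-
sketched above (hypothetical, illustrative only):

: x/string ( xc-addr1 u1 n -- xc-addr2 u2 )
  >r over + swap                 \ keep the end address, work on the start
  r> dup 0< if
    negate 0 ?do xchar- loop     \ step back |n| characters
  else
    0 ?do xchar+ loop            \ step forward n characters
  then
  tuck - ;                       \ new length = end - new start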

The question is rather what XCHAR- should do when it fails. It could throw an
exception, just as it could when it encounters a malformed encoding.

> IMHO your approach is from the implementor's perspective, which is
> valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
> that what *applications* do with strings is at a *much* higher level
> than implementors issues.

Especially when they finally use some OS function to paint the text on the
screen. On the other hand, when they use something integrated into the
Forth system (like MINOS), they use the DCS to display things on screen.

Using UTF-8 internally is even possible for a Windows Forth, though you then
have to jump through hoops to call TextOutW correctly (AFAIK it doesn't even
know how to deal with combining characters). So far, I haven't ported the
UTF-8 stuff to Windows, and concluded that it's easier to make the Windows
MINOS version use the same iso-latin-1 DCS as it always did. But then,
bigFORTH on Windows is not really supported.

> Can we merge the application developer issues with the kernel
> issues? These include cleaning up the meaning of character,
> byte/octet access, file words and so on.

Good idea.

Stephen Pelc

Sep 26, 2005, 11:13:32 AM
On Mon, 26 Sep 2005 14:31:48 +0200, Bernd Paysan <bernd....@gmx.de>
wrote:

>The main difference with the i18n.widechar.v7.PDF proposal is that our
>proposal (Anton's and my) doesn't distinguish between development character
>set and application character set. I think this distinction is unnatural
>and only valid in a historical context, e.g. the different code-pages used
>in DOS-based Windows, and wide characters, which won't coexist with ASCII.

Unfortunately I have to disagree here. Even if you can get to one
encoding from the UTF-xxx family in the long term, applications
written in South Africa (development character set, DCS) must be able
to be hosted and configured on a PC running a Chinese-xxx version
of some operating system (operating character set, OCS) and used by
a Russian-xxx speaker (application character set, ACS). This is a
mix that has been seen "in the wild" - it is not a hypothetical scenario.

The impact of ACS is not necessarily in the encoding, but in
how the application presents information and the order of
text substitutions, e.g. subject/verb/object and time/manner/place.
Then there's the date/time display nightmare and ...

I really wish we could embrace a single encoding, but there are
Forth applications out there with 15-20 years of history.

>I would welcome it when you set up an RfD for your proposal.

Let's reserve time for it at EuroForth. Those who want to join a mail
list for this topic should email me directly. I will re-establish
the locale and other mailing lists when our servers have recovered
from the plumbing alterations at Hill Lane.

>Another
>possible compromise is to simply outlaw those variable width wide-char
>encodings that don't allow stepping back.

Tell that to an application developer and they will ignore you. Such
encodings exist and are used. In our experience, stepping back through
strings is most often encountered in file handling and affects DCS and
OCS rather than ACS.

>> Can we merge the application developer issues with the kernel
>> issues? These inclue cleaning up the meaning of character,
>> byte/octet access, file words and so on.
>
>Good idea.

Will you be at EuroForth?

Albert van der Horst

Sep 26, 2005, 5:30:20 AM
In article <p6kj03-...@vimes.paysan.nom>,

Bernd Paysan <bernd....@gmx.de> wrote:
>Problem:
>
>ASCII is only appropriate for the English language.

Hardly. English has given up one of the most important
advantages of a phonetic system. It is unpronounceable.
I am thinking about a phonetically correct spelling of
English, and it would need a host of diacritical marks,
just like every other language.

> Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though). For other
>languages, different char-sets have to be used, several of them
>variable-width. Most prominent representant is UTF-8. Let's call these
>extended characters XCHARs. Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

>
>Proposal
<SNIP>

One of the problems, and I think it is a design issue we have
inherited from C, is the mess resulting from using characters
as address units (in Forth parlance).
In Forth, with all the embedded programming, we really need
a means to address bytes. I would like to split off from
the character handling in Forth all that is in fact intended
to handle, let's say, assembler-level programming.
This would make character handling much cleaner, and a better
starting point for extending the real character handling.

It is my hope that we need not introduce a new type for chars besides
the byte type that we need anyhow, and the normal CHAR.
Why would CHAR <some extended character> not fit in a Forth
character (provided we do not try to use it at the same time for
things like a length, as exemplified by the ugly word COUNT)?

In fact bytes are somehow in place by the concept of
address unit. We only need to flesh it out a little.
Note that there is *no* Forth word to fetch or store
the content of an address unit. Still.
An address unit is the smallest part of memory that can
addressed, i.e. fetched or stored. But it can't because there
are no words for it.

>--

Groetjes Albert

--
Albert van der Horst,Oranjestr 8,3511 RA UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
alb...@spenarnc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

Bernd Paysan

Sep 26, 2005, 5:13:48 PM
Stephen Pelc wrote:

> On Mon, 26 Sep 2005 14:31:48 +0200, Bernd Paysan <bernd....@gmx.de>
> wrote:
>
>>The main difference with the i18n.widechar.v7.PDF proposal is that our
>>proposal (Anton's and my) doesn't distinguish between development
>>character set and application character set. I think this distinction is
>>unnatural and only valid in a historical context, e.g. the different
>>code-pages used in DOS-based Windows, and wide characters, which won't
>>coexist with ASCII.
>
> Unfortunately I have to disagree here. Even if you can get to one
> encoding from the UTF-xxx family in the long term, applications
> written in South Africa (development character set, DCS) must be able
> to be hosted and configured on a PC running a Chinese-xxx version
> of some operating system (operating character set, OCS)and used by
> a Russian-xxx speaker (application character set, ACS). This is a
> mix that has been seen "in the wild" - it is not a scenario.

The way it works in Unix/Linux (the platform where it really works) is to
use a single encoding, UTF-8, for everything. Unix platforms and Linux have
now been delivered with UTF-8 support for some years, and recently it's often
the default setting. I have absolutely no problem installing a SuSE with
two dozen languages all available to the user, just depending on the $LANG
variable - sharing documents with each other.

AFAIK, even Windows has some variants that ship with a multi-language
system, though in Windows, lots of system internals depend on the language
(such as the "Program Files" directory, or "My Documents"). Windows
supports Unicode as one of the codespaces, though UTF-8 support would be
left to the application (several do use it already, but most of them are
ported over from Unix).

But the XCHAR proposal is really not about having UTF-8 everywhere, but
about dealing with variable-width wide characters. Fixed-width wide characters
are a subset of that, though they take the ASCII compatibility away, and
being incompatible with the DCS opens the can of worms you have with your
OCS!=DCS!=ACS.

> The impact of ACS is not necessarily in the encoding, but in
> how the application presents information and the order of
> text substitutions, e.g. subject/verb/object and time/manner/place.
> Then there's the date/time display nightmare and ...

That's another question, but not bound to the character encoding itself.

> I really wish we could embrace a single encoding, but there are
> Forth applications out there with 15-20 years of history.

The vast majority of Forth programs however is DCS=OCS=ACS. And since OCS
now is often enough UTF-8 by default, we should be able to handle that.

There might be a place for a more complicated scheme even in the future, but
so far, I see DCS != OCS != ACS as a result of bad decisions in operating
system design. Such things are better solved outside the scope of a
general standard (i.e. in a rather specific standard "how do I overcome this
particular problem with the popular brainfuck operating system").

Having DCS != OCS/ACS is something that works for batch compiled programming
languages. There's still the problem of the string constants, but the
localization mapping handles that (you don't have strings in the user's
language around in your primary source code).

This however means that you enforce a particular way of dealing with your
development system and your localization. This particular way is something
I really don't want in Forth. E.g. I could write some turtle graphics for
children, and it is certainly necessary that it can be used in their
native language. On the other hand, it's quite obvious that it will use the
Forth interpreter. So it's definitely DCS, and the localization is a file
with lots of ' xxx alias yyy commands.

It all reminds me of target compilers. You jump through hoops because you
don't have your target system available. This is all well if you need it.
It's not something that should have an impact on the design of a Forth
system where build=host=target.

>>Another
>>possible compromise is to simply outlaw those variable width wide-char
>>encodings that don't allow stepping back.
>
> Tell that to an application developer and they will ignore you.

That's true.

> Such encodings exist and are used.

Unfortunately. For me, these encodings are other people's problems ;-).

> In our experience, stepping back through
> strings is most often encountered in file handling and affects DCS and
> OCS rather than ACS.

I use stepping backwards mostly in editing code, that's ACS.

>>> Can we merge the application developer issues with the kernel
>>> issues? These inclue cleaning up the meaning of character,
>>> byte/octet access, file words and so on.
>>
>>Good idea.
>
> Will you be at EuroForth?

Unfortunately not. I originally booked holiday before, but unfortunately, I
had to shift my trip by three weeks. So I'm now on the other side of the
world when EuroForth is :-(.

Bruce McFarling

Sep 26, 2005, 11:37:16 PM

Albert van der Horst wrote:
> It is my hope that we need not introduce a new type for char's beside
> the byte type that we need anyhow, and the normal CHAR.
> Why would CHAR <some extended character> not fit in a Forth
> character (provided we do not try it at the same time for
> things like a length as exemplified by the ugly word COUNT.)

But the RfD is moving in the direction you want, in which characters
are treated as character set entities. After all, while a UTF-8
encoding is perfectly regular, any given character may be one, two,
three, or four bytes long.

COUNT is perfectly useful and clean. It's just using it to count, with
the attendant limitation of counts to the width of a uniform-width
character set, that is obsolete.

Bruce McFarling

Sep 27, 2005, 1:22:34 AM

Stephen Pelc wrote:
> How does this fit in with the wide character and internationalisation
> proposals at
> www.mpeforth.com/arena/
> i18n.propose.v7.PDF
> i18n.widechar.v7.PDF
> These proposals/RFCs are from the application developers point of
> view. There's a sample implementation in the file
> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
> is derived from 15+ years of experience. From the file header:

WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
of text processing and place them in the realm of networking standards
compliance. And a subset of the XCHAR words would suggest how to
handle them:

OCTET-SIZE ( -- u )
The memory size of a byte in address units.

OCTET@+ ( oct_addr1 -- oct_addr2 oct )
Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first memory
location after oct.

OCTET!+ ( oct oct_addr1 -- oct_addr2 )
Stores the OCTET oct at oct_addr1. oct_addr2 points to the first memory
location after oct.

OCTET+ ( oct_addr1 -- oct_addr2 )
Adds the size of an OCTET to oct_addr1, giving oct_addr2.

OCTET- ( oct_addr1 -- oct_addr2 )
Subtracts the size of an OCTET from oct_addr1, giving oct_addr2.

After all, XCHARs do not get rid of the possibility that CHARs may be
16 bits wide, though they may be of use for 8-bit data when the CHARs
are 16 bits wide.
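
On a system where chars are octets (1 chars = 1 au), these collapse to
trivial definitions; a sketch under that assumption:

: octet@+ ( oct-addr1 -- oct-addr2 oct )  count ;
: octet!+ ( oct oct-addr1 -- oct-addr2 )  tuck c! char+ ;
: octet+  ( oct-addr1 -- oct-addr2 )      char+ ;
: octet-  ( oct-addr1 -- oct-addr2 )      1 chars - ;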

Bruce McFarling

Sep 27, 2005, 1:26:20 AM
Stephen Pelc wrote:
> IMHO your approach is from the implementor's perspective, which is
> valuable. But all our (Willem, Nick, Peter, Stephen) reviews showed
> that what *applications* do with strings is at a *much* higher level
> than implementors issues.

It's not from an implementor's perspective, because I ain't an implementor.
It's from a text processing perspective. Almost all applications must
use strings to communicate with the user, but only text processing
applications have to .... errr, process text.

Since I have a bit of an non-professional interest in text processing
(for my job, I mostly just generate it), I'll have a crack at
"addressing" the interaction between this and the i18n proposal.

The ACS only critically depends on the language in use if it is an
8-bit code page. If it is UTF-8, UTF-16 or UTF-32 it does not change
when the language changes, that being the point of the Unicode
Translation Formats. Display, input, etc. may have to change, but not
the character set per se. And while the XCHAR proposal is focused on
UTF-8, it also fits into UTF-16, especially for a historian,
archeologist, or anthropologist who needs to work with archaic or
uncommon languages that may require characters above the 16-bit plane.

If the ACS is an 8-bit code page, the only thing that is likely to
change as a result of something you've done in the i18n system is
sorting order. AND SORTING ORDER IS NOT A CHARACTER SET ISSUE! It's a
language issue.

OK, now, any translation REQUIRED between the system and the DCS is
built into the implementation. That includes any text files read or
written by the system.

So, if "ASCII" is taken to mean, "ASCII, possibly extended by a
language-specific code page", the four most common OCS/ACS combinations
are:

ACS=ASCII, DCS=ASCII
ACS=ASCII, DCS=UTF-#
ACS=UTF-#, DCS=ASCII
ACS=UTF-#, DCS=UTF-#

The translation issues are:

* ACS=ASCII, DCS=ASCII
They happen to be different code pages. KEY, EMIT, [CHAR] and CHAR may
have an issue of which code page you are talking about. But neither
are XCHAR's.

* ACS=ASCII, DCS=UTF-#
The only question is whether XKEY/XEMIT is in Application space or
Developer space or are transitions between the two.

I don't see how the input can be FROM developer space and output TO
developer space (programming utilities, after all, are only
applications that happen to work in the developers languages, so ACS
happens to EQUAL DCS), so there are only two possibilities:

** If XKEY/XEMIT are entirely in Application Space, no possible dramas,
no matter what character set that is. As XKEY's they are just
arbitrary chunks of bits measured in arbitrary address units.

** If XKEY/XEMIT bring ACS characters into Developer Space and then out
again, then translations occur from ASCII + "SET LANGUAGE" code page to
UTF. If the application is internationalised, all characters emitted
will be from input or from resource files, so there is never any "CS
won't translate" problem.

* ACS=UTF-#, DCS=ASCII

** If XKEY/XEMIT bring ACS into Developer space, there is a potential
translation problem, in that not all UTF-# encoded characters will fit
into any given 8-bit code page.

* ACS=UTF-#, DCS=UTF-#

** For this, there is no XKEY/XEMIT translation barrier, even if they
are different UTF's (say the developer is Han Chinese, and so prefers
to develop in UTF-16, or is working with an OS that relies on UTF-16,
but is writing for an Atlantic Zone audience internationalised into
English/Spanish/French/Portuguese and so prefers UTF-8 as the ACS),
since there are well-defined translations between any well-encoded UTF
characters. There is translation overhead, but that is all.

** For this, the problem is that there need to be DIFFERENT "XKEY"'s if
they are different encodings of the same character set.

To my mind, XCHARS's belong to the Application Character Set, since the
kind of thing that can be portable between systems is more text
processing applications than how a particular system may talk to its
underlying operating systems.

Further, XCHARS are quite clearly NEEDED for the text processing in an
ACS, since CHARs suffice for ASCII code-page encodings, but not for
UTF-# encodings of THE SAME CS, and an ASCII code page does not accommodate
all character sets.

And for things like searching source for a particular definition, just
set the ACS to the DCS.

This is orthogonal to my earlier comment. My earlier comment presumes
that XCHARS are for what might be termed the "Memory Storage CS", not
for what may be termed the "Permanent Storage CS", which may well be
different. XCHARS define a translation between the stack and Memory
Storage. File words bring parts of files into Memory Storage. Hence
my argument that there should be file modes that handle that
translation (which can be done in bulk). And indeed, in a certain
sense that needs to be done in the file word, because the file words
are designed to bring parts of files into ALLOCATED parts of storage,
so the file words should only bring as much as can fit into the
allocated part of storage under the Memory Storage CS.

On the other hand, while XCHARs are required in ACS land, the ACS is
subject to change. And it doesn't make sense to change it "behind the
back" of the I18N words. So that suggests that the SET LANGUAGE system
ought to include an ability to set the default working character set
encoding and the default permanent storage character set encoding.

There is no need for a portable program to SET the ACS encoding. But
it may have to be able to QUERY the ACS encoding, and then to be able
to associate that with a particular collection of text in memory so
that if necessary it can RESTORE the ACS encoding to what was in place
when that text went into memory.

Bernd Paysan

Sep 27, 2005, 4:57:17 AM
Bruce McFarling wrote:
> WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
> of text processing and place them in the realm of networking standards
> compliance. And a subset of the XCHAR words would suggest how to
> handle them:
>
> OCTET-SIZE ( -- u )
> The memory size of a Byte in address units.
>
> OCTET@+ ( oct_addr1 -- oct_addr2 oct )
> Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first
> memory
> location after xc.
>
> OCTET!+ ( oct oct_addr1 -- oct_addr2 )
> Stores the OCTET oct at oct_addr1. oct_addr2 points to the first memory
> location after xc.
>
> OCTET+ ( oct_addr1 -- oct_addr2 )
> Adds the size of an OCTET to oct_addr1, giving oct_addr2.
>
> OCTET- ( oct_addr1 -- oct_addr2 )
> Subracts the size of an OCTET from oct_addr1, giving oct_addr2.
>
> After all, XCHARs do not get rid of the possibility that CHARs may be
> 16 bits wide, though they may be of use for 8-bit data when the CHARs
> are 16 bits wide.

Another missing part of my XCHAR proposal is how to change the way these
XCHARs are handled. ATM, I say the system deals with that, depending on
user settings (e.g. LANG environment variable). What's obvious is that
there's a way to deal with several encodings, and OCTET could be one of
them.

OCTET-SIZE still would be ( xc -- u ), to fit into the general stack
picture, but the u would not depend on xc.

Since the actually available encodings are rather system-dependent, I
suggest that the system documentation lists available encodings and ways to
set them. E.g.

XC-CODING ( xc-id -- ) set XC encoding.

XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

ASCII ( -- xc-id ) Format: ASCII characters. The lowest 7 bits of xc are
stored in memory; it is not defined what happens with bit 8.

OCTET ( -- xc-id ) Format: Octets. The lowest 8 bits of xc are stored in
memory. This encoding is compatible with packed ASCII strings.

UTF-8 ( -- xc-id ) Format: UTF-8 characters. This encoding is compatible
with packed ASCII strings.

UTF-16 ( -- xc-id ) Format: UTF-16 characters. This encoding is not
compatible with packed ASCII strings, but ASCII strings can be converted.

This however is the part of the system which is still open, so I can't say
there is enough experience to push an RfD through.

Bruce McFarling

Sep 27, 2005, 7:26:52 AM

Bernd Paysan wrote:
> OCTET-SIZE still would be ( xc -- u ), to fit into the general stack
> picture, but the u would not depend on xc.

Or not be a word at all, but rather a query, since it won't be
changing and won't need any magic going on behind the back of the
author of portable code to make the portable code work.

> Since the actually available encodings are rather system-dependent, I
> suggest that the system documentation lists available encodings and ways to
> set them. E.g.

> XC-CODING ( xc-id -- ) set XC encoding.

> XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

I would stress that more important than the ability to generate xc-id's
is the ability to get the CURRENT xc-id. Scenario: you get some text
and it is stored in memory somewhere. Then you take an action that you
know MIGHT result in a switch in character set, and you get some text,
and it is stored in memory somewhere.

So, if YOU didn't SET the xc-id's, how do you know how to switch back
and forth between them, or even whether you need to?

SET-XCHAR ( xc-id -- )
GET-XCHAR ( -- xc-id )

is the core. That lets you get the xc-id when you store the first set
of information in memory, lets you get the xc-id when you store the
second set of information in memory, test for equality to see if you
have to take care, reset to the "old" xc-id when appropriate.

If there are going to be these:

> ASCII ( -- xc-id ) Format: ASCII characters. The lowest 7 bits of xc are
> stored in memory; it is not defined what happens with bit 8.
>
> OCTET ( -- xc-id ) Format: Octets. The lowest 8 bits of xc are stored in
> memory. This encoding is compatible with packed ASCII strings.
>
> UTF-8 ( -- xc-id ) Format: UTF-8 characters. This encoding is compatible
> with packed ASCII strings.
>
> UTF-16 ( -- xc-id ) Format: UTF-16 characters. This encoding is not
> compatible with packed ASCII strings, but ASCII strings can be converted.

There should also be LANGUAGE-XCHAR ( -- ) to synchronise the xc-id
with the current language. An implementation of XCHAR's that did not
have I18N implemented would reset xc-id to the system default.

Albert van der Horst

Sep 27, 2005, 8:11:52 AM
In article <1127792236.4...@g43g2000cwa.googlegroups.com>,

It is not clean to store an integer (the count) in a character.
It is not useful to have a count limited to 256 in Britain,
65536 in Japan, and 4 billion in China.

Albert van der Horst

Sep 27, 2005, 8:20:36 AM
In article <1127798554....@g14g2000cwa.googlegroups.com>,

Bruce McFarling <agi...@netscape.net> wrote:
>
>Stephen Pelc wrote:
>> How does this fit in with the wide character and internationalisation
>> proposals at
>> www.mpeforth.com/arena/
>> i18n.propose.v7.PDF
>> i18n.widechar.v7.PDF
>> These proposals/RFCs are from the application developers point of
>> view. There's a sample implementation in the file
>> LIB\INTERNATIONAL.FTH in the VFX Forth distribution. The file
>> is derived from 15+ years of experience. From the file header:
>
>WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
>of text processing and place them in the realm of networking standards
>compliance. And a subset of the XCHAR words would suggest how to
>handle them:
>
>OCTET-SIZE ( -- u )
>The memory size of a Byte in address units.

A byte is an address unit. Not only by definition but for all
practical purposes.
Can't we just condemn those that don't to declare an
"environmental dependency on an address unit not containing
8 bits"?
By the way, Chuck Moore would have to define OCTET-SIZE as one
quarter, anyway. How is that?

>
>OCTET@+ ( oct_addr1 -- oct_addr2 oct )
>Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first
>memory
>location after xc.

Much too verbose for such a basic word.
Why not OCTET -> B

<SNIP>

>After all, XCHARs do not get rid of the possibility that CHARs may be
>16 bits wide, though they may be of use for 8-bit data when the CHARs
>are 16 bits wide.

CHAR's should not be used for 8-bit data.
XCHAR's should not be used to free CHAR's of the chore to handle
8-bit data, because of a refusal to use bytes (or OCTET's).

So,
do we really need XCHAR ?

Groetjes Albert

Anton Ertl

Sep 27, 2005, 12:09:09 PM
Bernd Paysan <bernd....@gmx.de> writes:
>Problem:
>
>ASCII is only appropriate for the English language. Most western languages
>however fit somewhat into the Forth frame, since a byte is sufficient to
>encode the few special characters in each (though not always the same
>encoding can be used; latin-1 is most widely used, though).

Actually Unicode (in its UCS-4/UTF-32 encoding) would also fit in the
ANS Forth frame. However, most near-ANS code around has an
environmental dependency on 1 chars = 1 au, and I think that more
existing programs work with a system that uses 1-au chars and xchars
(even when processing wider xchars) than with a system that uses n-au
chars (n>1).

> Since ANS Forth specifies ASCII encoding, only
>ASCII-compatible encodings may be used.

That sounds like a requirement and should therefore be part of the
proposal, not the problem description.

The on-stack representation of ASCII characters should certainly be
ASCII. For the in-memory representation that would also have some
advantages: in particular, programs that access individual characters
using char (not xchar) words would work correctly on strings
consisting only of ASCII characters (and ANS Forth does not give any
guarantee for other characters anyway).

>Proposal

I would have waited for some more time (and experience) before making
such a proposal (I am still unsure which words to include and which
not). But since you made it, let's collect the feedback.

>Words:
>
>XC-SIZE ( xc -- u )
>Computes the memory size of the XCHAR xc in address units.
>
>XC@+ ( xc_addr1 -- xc_addr2 xc )
>Fetchs the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.
>
>XC!+ ( xc xc_addr1 -- xc_addr2 )
>Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>location after xc.

This is unsafe, as it writes an unknown amount of data behind
xc_addr1. One can use it safely in combination with XC-SIZE, but then
it is easier to use XC!+? (see below).

Providing this word, but not XC!+? discourages safe programming
practices and encourages creating buffer overflows.

In other words, this might become Forth's strcat().

It's probably best not to standardize this word.

>XCHAR+ ( xc_addr1 -- xc_addr2 )
>Adds the size of the XCHAR stored at xc_addr1 to this address, giving
>xc_addr2.
>
>XCHAR- ( xc_addr1 -- xc_addr2 )
>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>work for every possible encoding.
>
>X-SIZE ( xc_addr u -- n )
>n is the number of monospace ASCII characters that take the same space to
>display as the the XCHAR string starting at xc_addr, using u address units.

Maybe another name would be harder to confuse with XC-SIZE. How about
X-WIDTH or XC-WIDTH?

>XKEY ( -- xc )
>Reads an XCHAR from the terminal.
>
>XEMIT ( xc -- )
>Prints an XCHAR on the terminal.

Currently Gforth also implements:

+X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like 1 /STRING

-X/STRING ( xc-addr1 u1 -- xc-addr2 u2 )
like -1 /STRING

XC@ ( xc-addr -- xc )
like C@

DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
safe version of XC!+, f specifies success

-TRAILING-GARBAGE ( addr u1 -- addr u2 )
remove trailing incomplete xc

Of course, some of these can be defined from others, but it's not
clear to me yet which ones are the set that we want to select.

>The following words behave different when the XCHAR extension is present:

That is actually a compatible extension of ANS Forth's CHAR and
[CHAR]; for ASCII characters they behave exactly the same, and for
others ANS Forth does not specify a behaviour. So I would not say
"behave different", but use wording such as "extend the semantics of
..."

>Open issues are file reading and writing (conversion on the fly or leave as
>it is?).

Definitely conversion on the fly. There must be only one character
encoding in memory. However, we have not implemented that yet.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.complang.tuwien.ac.at/forth/ansforth/forth200x.html
EuroForth 2005: http://www.complang.tuwien.ac.at/anton/euroforth2005/

Anton Ertl

Sep 27, 2005, 12:55:53 PM
"Bruce McFarling" <agi...@netscape.net> writes:
>The first thing to settle is whether XCHARS are "these" extended
>character sets that are upwardly compatible with printable ASCII, or
>"this" extended character set. And I could well see a wish to use, eg,
>UTF-8 in file storage (if my primary targets were Europe, Africa, and
>the Americas) and UTF-16 in processing.

Xchars can be used for any fixed-width encodings (even for a
fixed-width encoding with three chars/xchar), and for any
variable-width encodings that satisfy the requirements (e.g., UTF-8
and UTF-16).

That being said, I don't see a point in using UTF-16 for processing;
it combines the disadvantages of a fixed-width encoding with the
disadvantages of a variable-width encoding. If you want fixed-width,
use UTF-32; if you want variable-width, use UTF-8.

>It seems to me that, since you can always tell where a UTF character
>begins and ends when you know whether it is UTF-32, UTF-16, or UTF-8,
>but you need to know know WHICH it is as well as endianess for UTF16
>and UTF32, the most coherent thing to do is to have AN XCHAR
>representation for processing and a set of file modes that specify the
>kind of file you are loading:
>
>* ASCII (latin-1, etc, any fixed 8-bit code pages)
>* UTF8
>* UTF16 (endedness of your system)
>* UTF32 (endedness of your system)
>* UTF16B
>* UTF16L
>* UTF32B
>* UTF32L
>
>Then if the file mode matches the system mode, you just load the file,
>if it mismatches, it is translated on the fly on reading and writing.

Yes, that's somewhat like what I have in mind. Except that currently
I am only envisioning conversions between various 8-bit encodings and
UTF-8; but if there really are people around with UTF-16 files, adding
a converter for them is not a big issue.

>Obviously the system mode would be a thing for a system query.

Ideally programs should be written with the Xchars words such that
they do not need to know the encoding used in the system.

Anton Ertl

Sep 27, 2005, 1:08:20 PM
steph...@mpeforth.com (Stephen Pelc) writes:
>>XCHAR- ( xc_addr1 -- xc_addr2 )
>>Goes backward from xc_addr1 until it finds an XCHAR so that the size of this
>>XCHAR added to xc_addr2 gives xc_addr1. Note: XCHAR- isn't guaranteed to
>>work for every possible encoding.
>
>IMHO standardising a word that can't be guaranteed to work is not
>beneficial.

This word is guaranteed to work (if there is at least one character
right before xc_addr1).

If you are thinking about encodings where you cannot find the previous
character, they are not supported by Xchars. And I consider this a
virtue, not a deficiency.

>I look forward to discussing these issues at EuroForth 2005.

I will be there.

Anton Ertl

Sep 27, 2005, 1:16:40 PM
steph...@mpeforth.com (Stephen Pelc) writes:
>Unfortunately I have to disagree here. Even if you can get to one
>encoding from the UTF-xxx family in the long term, applications
>written in South Africa (development character set, DCS) must be able
>to be hosted and configured on a PC running a Chinese-xxx version
>of some operating system (operating character set, OCS)and used by
>a Russian-xxx speaker (application character set, ACS). This is a
>mix that has been seen "in the wild" - it is not a scenario.

No problem:

DCS: Unicode (encoded as UTF-8 or UTF-32)
OCS: Unicode (encoded as UTF-8 or UTF-32)
ACS: Unicode (encoded as UTF-8 or UTF-32)

So once your condition above is satisfied, this is not an issue at the
character set and encoding level, and is thus outside the scope of the
xchars words.

>The impact of ACS is not necessarily in the encoding, but in
>how the application presents information and the order of
>text substitutions, e.g. subject/verb/object and time/manner/place.
>Then there's the date/time display nightmare and ...

Well, that's internationalisation. Xchars don't solve (much of) that.

Anton Ertl

Sep 27, 2005, 1:30:59 PM
"Bruce McFarling" <agi...@netscape.net> writes:

>
>Stephen Pelc wrote:
>WRT the 8bit issue, XCHARs, if successful, remove bytes from the realm
>of text processing and place them in the realm of networking standards
>compliance.

Bytes are not in ANS Forth, and are therefore not used in text
processing.

With Xchars, one might use Chars as bytes: Nearly all systems
implement chars as bytes anyway, and probably a number of programs use
chars for bytes, so one might standardize on that.

The disadvantage of such a step in the Xchars context would be that
the in-memory representation for UTF-16 and UTF-32 would no longer be
fully ASCII-compatible (one ASCII Xchar would become more than one
Char).

But I don't believe that UTF-16 or UTF-32 and multi-au Chars will
become significant, so one might just as well settle down to using
Chars for bytes.

>And a subset of the XCHAR words would suggest how to
>handle them:

Well, since octets are fixed-width, it may be better to model the
octet words on the Char or Cell words than on the Xchar words.

Anton Ertl

Sep 27, 2005, 1:39:52 PM
Bernd Paysan <bernd....@gmx.de> writes:
>Another missing part of my XCHAR proposal is how to change the way these
>XCHARs are handled.

No, that's not missing. There should not be any switching between
encodings. There is one encoding in the Forth system that should be
able to represent anything, and everything is converted to that
encoding on input, and from that encoding on output. No need to
switch anything.

If you allowed switching, then:

- Either you would have to change the encoding of all the strings in the
Forth system. This is impossible.

- Or the program would have to keep track of which strings are in
which encoding and always switch around. That's cumbersome and
error-prone.

>XC-FILE-MODE ( xc-id fid -- ) set file fid to xc-id XC encoding mode.

IMO the encoding should be part of the fam, and not be set on the fly.
Or do you envision files that mix UTF-8 and, say UTF-16? So we might
have words like

UTF-8 ( fam1 -- fam2 )

latin-1 ( fam1 -- fam2 )
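
A hypothetical usage, assuming such fam-modifying words existed
(illustrative only):

s" readme.txt" r/o utf-8 open-file throw   ( fid )  \ open as a UTF-8 text file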

Bruce McFarling

Sep 28, 2005, 12:49:04 AM

Anton Ertl wrote:
> That being said, I don't see a point in using UTF-16 for processing;

To save memory space, if your primary language uses a wide character
set in the first plane (where most UTF-8 encodings are three bytes
long). Also if you know what language you are working in, you know
whether or not you are going to stay down in the first plane, so the
variable width issue may be moot.

Not that I had those in mind when I wrote that, rather I had in mind
that as soon as you assume away something, you will find out that
someone else has a strong preference for it, so I tried to avoid
assuming away anything.

Bruce McFarling

Sep 28, 2005, 12:57:54 AM

Albert van der Horst wrote:
> Much too verbose for such a basic word.
> Why not OCTET -> B
>
> <SNIP>
>
> >After all, XCHARs do not get rid of the possibility that CHARs may be
> >16 bits wide, though they may be of use for 8-bit data when the CHARs
> >are 16 bits wide.

> CHAR's should not be used for 8-bit data.
> XHAR's should not be used to free CHAR's of the chore to handle
> 8-bit data, because of a refusal to use bytes (or OCTET's).

> So,
> do we really need XCHAR ?

Yes, of course, because XCHARS is not about address units but about
character set units. XCHARS handle extended character data, where we
know perfectly well that sometimes it is one octet long, sometimes it
is two octets long, sometimes it is four octets long, sometimes it
ranges from one to four octets long, and sometimes it ranges from two
to four octets long. So XCHAR+, XCHAR-, XCHAR@+, and XCHAR!+ are
things that are likely to benefit from optimisation and are especially
handy for portability, given that you could write and test for, say,
UTF-8, and then have code that works for a fixed-width 16-bit character
set.

So when I say "XCHAR's may be a byte wide", that's dependent on the
character set encoding in use, not the system and system-specific
address unit.

Bruce McFarling

Sep 28, 2005, 1:23:02 AM

Albert van der Horst wrote:
[Bruce]

> >COUNT is perfectly useful and clean. Its just using it to count, with
> >the attendant limitation of counts to the width of a uniform width
> >character set that is obsolete.

> It is not clean to store an integer (the count) in a character.
> It is not useful to have a count limited to 256 in Britain
> 65526 in Japan and 4 billion in China.

I didn't say THAT was clean or useful. In fact, I said that THAT is
obsolete. But CHAR@+ is perfectly clean and useful, however confusing
the string of letters you use to do it.

Stephen Pelc

Sep 28, 2005, 6:12:59 AM
On Tue, 27 Sep 2005 17:39:52 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>No, that's not missing. There should not be any switching between
>encodings. There is one encoding in the Forth system that should be
>able to represent anything, and everything is converted to that
>encoding on input, and from that encoding on output. No need to
>switch anything.

The key word is "should". However, reality intervenes. There are
apps out there that use multiple encodings. A standard formalises
current practice - it is *not* a design for the future.

If you push through a standard that disenfranchises existing
substantial apps, the developers of those apps will ignore
the standard. Is this what you want?

The preferred route, I suggest, is to provide GET-ENCODING and
SET-ENCODING. In your system, you can always be non-compliant
for the moment. You will then have an environmental dependency on
UTF8. This is no worse than the widely accepted char=byte=au
dependency.

Bruce McFarling

Sep 28, 2005, 6:22:56 AM

Albert van der Horst wrote:

> >OCTET@+ ( oct_addr1 -- oct_addr2 oct )
> >Fetches the OCTET oct at oct_addr1. oct_addr2 points to the first
> >memory
> >location after xc.

> Much too verbose for such a basic word.
> Why not OCTET -> B

I don't know why not. I'm pretty confident that few people are likely
to have OCTET@+ lying around, and if they do, it's odds-on it does that
anyway. B@+? The B could stand for "buffer", or "block". OTOH,
BYTE@+ is fine by me.

Could call it "BCOUNT" in homage to established naming conventions for
CHAR@+, which is called COUNT, or OCOUNT.

Or OC@+ in homage to the yank television show that the young'uns here
like so much.

Bruce McFarling

Sep 28, 2005, 6:28:35 AM
Stephen Pelc wrote:

> The preferred route, I suggest, is to provide GET-ENCODING and
> SET-ENCODING. In your system, you can always be non-compliant
> for the moment. You will then have an environmental dependency on
> UTF8. This is no worse than the widely accepted char=byte=au
> dependency.

Note that an implementation may only do one encoding, in which case
GET-ENCODING will always get the same encoding, and SET-ENCODING will
either do nothing or throw an error if the encoding set is not the
supported one.
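
A sketch of that degenerate case (hypothetical names, assuming a
UTF-8-only system and an xc-id word UTF-8 as proposed earlier):

: get-encoding ( -- xc-id )   utf-8 ;
: set-encoding ( xc-id -- )   utf-8 <> abort" unsupported encoding" ;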

It certainly is not unreasonable for gforth to focus on UTF-8, which is
emerging as a de facto standard in much of Linux oriented open source.
A standard that did not accommodate UTF-8 would be flawed. But
prescribing in advance of common practice will limit the uptake of the
standard and therefore the portability of source relying on it.

Bernd Paysan

Sep 28, 2005, 8:00:58 AM
Anton Ertl wrote:

> One can use it safely in combination with XC-SIZE, but then
> it is easier to use XC!+? (see below).

Well, the reference implementation of XC!+? then is

: xc!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
  >r over xc-size r@ over u< IF ( xc xc-addr1 len r: u1 )
    \ not enough space
    drop nip r> false
  else
    >r xc!+ r> r> swap - true
  then ;

> In other words, this might become Forth's strcat().

You at least know that there is an upper bound for how much you might
overwrite (not the case with strcat). Well, the upper bound depends on the
encoding, and we don't guarantee now that -1 XC-SIZE will return the
maximum one.

Stephen Pelc

Sep 28, 2005, 12:13:10 PM
On 28 Sep 2005 03:28:35 -0700, "Bruce McFarling"
<agi...@netscape.net> wrote:

>It certainly is not unreasonable for gforth to focus on UTF-8, which is
>emerging as a de facto standard in much of Linux oriented open source.
>A standard that did not accomodate UTF-8 would be flawed. But
>prescribing in advance of common practice will limit the uptake of the
>standard and therefore the portability of source relying on it.

I've been discussing applications that have been shipping for 15 or
more years. Internationalisation and the consequent "char" issues
have been around for a long time, and some of our clients handle
them daily. I just don't want their *requirements* to be locked
out.

The DCS, OCS and ACS terminology stems from issues that exist for
real applications. It is certainly rare for encodings to change
after program initialisation (although some multilingual word
processors have worked that way) but it is common that an app
has to select the encoding at startup.

Anton Ertl

Sep 28, 2005, 1:43:44 PM
steph...@mpeforth.com (Stephen Pelc) writes:
>It is certainly rare for encodings to change
>after program initialisation (although some multilingual word
>processors have worked that way) but it is common that an app
>has to select the encoding at startup.

Sounds to me that we are in agreement then. Gforth uses the standard
Unix mechanism (the LANG environment variable) for determining the
encoding on startup. No switching words needed.

As for multilingual word processors, that's a good reason for using a
universal character set and encoding rather than switching around.

Anton Ertl

Sep 28, 2005, 1:49:50 PM
steph...@mpeforth.com (Stephen Pelc) writes:
>On Tue, 27 Sep 2005 17:39:52 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>No, that's not missing. There should not be any switching between
>>encodings. There is one encoding in the Forth system that should be
>>able to represent anything, and everything is converted to that
>>encoding on input, and from that encoding on output. No need to
>>switch anything.
>
>The key word is "should". However, reality intervenes. There are
>apps out there that use multiple encodings. A standard formalises
>current practice - it is *not* a design for the future.

It makes no sense to standardize a current practice that has no
future.

But as I said before, IMO it's a little too early for the xchars
proposal, because there is not enough practice with it.

In the Linux world, UTF-8 is the present.

>If you push through a standard that disenfranchises existing
>substantial apps, the developers of those apps will ignore
>the standard. Is this what you want?

I have read enough statements from Forth vendors that it's impossible
to write substantial apps in ANS Forth, so supposedly the programmers
of those substantial apps are ignoring the standard already.

The existing apps will continue to work on the systems where they
worked before and be as non-standard as they ever were.

It seems to me that you are thinking about requirements of your
customers that most of the others don't have, and that hopefully will
go away at some point even for your customers.

>The preferred route, I suggest, is to provide GET-ENCODING and
>SET-ENCODING.

That's the worst possible design; or maybe having an ENCODING variable
would be even worse.

In general, the global-state approach is always causing problems,
whether it's STATE or BASE or something else.

If you want to support different encodings, the encoding should be
stored with the data. But then we would be dealing with something
that's much different from current Forth strings. And the words for
dealing with that stuff would probably be much different from the
xchars words.
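
Purely as an illustration of the "encoding stored with the data" idea -
every name below is made up and not part of the xchars proposal:

\ a string stored in the dictionary as [enc-id][length][payload]
: enc-string, ( c-addr u enc-id -- xs-addr )
  here >r  ,  dup ,             \ lay down enc-id and length
  here over allot swap move     \ copy the payload into the dictionary
  align  r> ;
: enc-string@ ( xs-addr -- c-addr u enc-id )
  dup @ >r  cell+ dup @ >r  cell+  r> r> ;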

Xchars were designed for dealing with one encoding used throughout the
Forth system. Several encodings are compatible with the requirements
of xchars, and a Forth system might let you choose on startup which
encoding to use, but you cannot switch around between encodings.

Anton Ertl

unread,
Sep 28, 2005, 2:15:12 PM9/28/05
to
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>
>> One can use it safely in combination with XC-SIZE, but then
>> it is easier to use XC!+? (see below).
>
>Well, the reference implementation of XC!+? then is

My point is that you should include XC!+? in the proposal and probably
delete XC!+ from it.

BTW, concerning a reference implementation of xchars, a reference
implementation for the 8bit (or a general fixed-width) encoding should
be easy (although not very exciting).
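
For the fixed-width 8-bit case, such a reference implementation might
look like the following sketch, assuming one XCHAR is exactly one char
and one address unit:

: xc-size ( xc -- u )                  drop 1 chars ;
: xc@+    ( xc-addr1 -- xc-addr2 xc )  count ;
: xc!+    ( xc xc-addr1 -- xc-addr2 )  tuck c! char+ ;
: xchar+  ( xc-addr1 -- xc-addr2 )     char+ ;
: xchar-  ( xc-addr1 -- xc-addr2 )     1 chars - ;
: x-size  ( xc-addr u -- n )           nip ;   \ display width = au count
: xkey    ( -- xc )                    key ;
: xemit   ( xc -- )                    emit ;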

>> In other words, this might become Forth's strcat().
>
>You at least know that there is an upper bound for how much you might
>overwrite (not the case with strcat).

True, but XC!+ can be enough to overwrite an xt, and that can be
enough to break into the system.

>Well, the upper bound depends on the
>encoding, and we don't guarantee now that -1 XC-SIZE will return the
>maximum one.

Even if an upper bound could be determined, making use of that would
require additional programmer effort, and it's a bad idea to design
words that require that; you need to educate the programmers about
that, and even if they know about it, it's still easier to make errors
when the required effort is higher.

Albert van der Horst

unread,
Sep 28, 2005, 7:38:44 PM9/28/05
to
In article <1127884982.2...@f14g2000cwb.googlegroups.com>,

Of course, I agree to that. Here in the Netherlands the shorter C@+ is
in common use.

Groetjes Albert

Elizabeth D Rather

unread,
Sep 28, 2005, 8:36:59 PM9/28/05
to
"Anton Ertl" <an...@mips.complang.tuwien.ac.at> wrote in message
news:2005Sep2...@mips.complang.tuwien.ac.at...

>
> I have read enough statements from Forth vendors that it's impossible
> to write substantial apps in ANS Forth, so supposedly the programmers
> of those substantial apps are ignoring the standard already.

That statement refers to the need for dependencies on things such as
underlying OS (and its interface), device drivers, and other extensions.
Wise programmers (IMO) stick to ANS Forth for everything not involving such
extensions, which is often the bulk of the app.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310-491-3356
5155 W. Rosecrans Ave. #1018 Fax: +1 310-978-9454
Hawthorne, CA 90250
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Bruce McFarling

unread,
Sep 28, 2005, 11:49:38 PM9/28/05
to

Anton Ertl wrote:
> steph...@mpeforth.com (Stephen Pelc) writes:
> >It is certainly rare for encodings to change
> >after program initialisation (although some multilingual word
> >processors have worked that way) but it is common that an app
> >has to select the encoding at startup.

> Sounds to me that we are in agreement then. Gforth uses the standard
> Unix mechanism (the LANG environment variable) for determining the
> encoding on startup. No switching words needed.

Except that is at the startup of gforth, not necessarily the startup of
the application. And that does not address someone who uses gforth as
a buffer against the expert-friendliness of Linux.

Bruce McFarling

unread,
Sep 28, 2005, 11:52:07 PM9/28/05
to

Stephen Pelc wrote:
> I've been discussing applications that have been shipping for 15 or
> more years. Internationalisation and the consequent "char" issues
> have been around for a long time, and some of our clients handle
> them daily. I just don't want their *requirements* to be locked
> out.

Yes, noted. I don't want their requirements locked out either, because
it interferes with uptake of a putative standard and limits
portability.

Bruce McFarling

unread,
Sep 29, 2005, 12:00:11 AM9/29/05
to

Anton Ertl wrote:
> >The key word is "should". However, reality intervenes. There are
> >apps out there that use multiple encodings. A standard formalises
> >current practice - it is *not* a design for the future.

> It makes no sense to standardize a current practice that has no
> future.

It makes no sense to standardise in a way that locks out a substantial
part of the present, since then the standard will not be viable and
won't have been respected in the available code base when the future
arrives.

> In the Linux world, UTF-8 is the present.

No standard can be limited to the Linux world, just as no standard
should shut out the Linux world.

> I have read enough statements from Forth vendors that it's impossible
> to write substantial apps in ANS Forth, so supposedly the programmers
> of those substantial apps are ignoring the standard already.

That's an all or nothing reading of what turn out to be qualified
statements. It may be impossible to write the entirety of substantial
apps in ANS Forth alone. There is nothing in that statement that
suggests the programmers of those apps are ignoring the standard.
After all, the standard does not *require* you to write the entirety of
an app in ANS Forth alone.

And XCHARs are right in the nitty-gritty of low-level support words for
text processing that it is really appealing to have standardised, whether
formally or as a de facto toolkit.

Bruce McFarling

unread,
Sep 29, 2005, 12:15:58 AM9/29/05
to

Anton Ertl wrote:
[Bernd]

> >XC-SIZE ( xc -- u )
> >Computes the memory size of the XCHAR xc in address units.

> >XC!+ ( xc xc_addr1 -- xc_addr2 )


> >Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
> >location after xc.

> This is unsafe, as it writes an unknown amount of data behind
> xc_addr1. One can use it safely in combination with XC-SIZE, but then
> it is easier to use XC!+? (see below).

> DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )


> safe version of XC!+, f specifies success

I'm not sure about the level of this. The length in address units of a
sequence of XCHARs in memory seems handier, to me, for most things, and I
definitely prefer "know in advance" to "try it and clean up if it
fails".

One thing that occurs to me is that XC-SIZE seems to entail MOVE>,
analogous to CMOVE> in address units.

Bruce McFarling

unread,
Sep 29, 2005, 12:18:27 AM9/29/05
to

Anton Ertl wrote:
> BTW, concerning a reference implementation of xchars, a reference
> implementation for the 8bit (or a general fixed-width) encoding should
> be easy (although not very exciting).

Reference implementations for UTF-32, UTF-16 and UTF-8 would be enough
to give the idea. And of course code-page-ASCII is even easier than
UTF-32.
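
As one data point, here is a sketch of XC-SIZE for UTF-8, assuming 8-bit
chars and the original 31-bit UTF-8 ranges (the other words need similar,
slightly longer bit-fiddling):

: xc-size ( xc -- u )
  dup 128 u<      if drop 1 exit then   \ below $80: plain ASCII
  dup 2048 u<     if drop 2 exit then   \ below $800
  dup 65536 u<    if drop 3 exit then   \ below $10000
  dup 2097152 u<  if drop 4 exit then   \ below $200000
  dup 67108864 u< if drop 5 exit then   \ below $4000000
  drop 6 ;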

Stephen Pelc

unread,
Sep 29, 2005, 5:30:12 AM9/29/05
to
On Wed, 28 Sep 2005 17:43:44 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Sounds to me that we are in agreement then. Gforth uses the standard
>Unix mechanism (the LANG environment variable) for determining the
>encoding on startup. No switching words needed.

If OCS <> ACS, then switching may be needed.

>As for multilingual word processors, that's a good reason for using a
>universal character set and encoding rather than switching around.

Yes for a new design, not necessarily for an existing app being
ported. Standard = current practice. Some of the biggest issues
in ANS94 come from the introduction of new practice. The good
new parts come from the embodiment of best current practice, even
if it came from another language, e.g. CATCH and THROW.

Stephen Pelc

unread,
Sep 29, 2005, 6:13:59 AM9/29/05
to
On Wed, 28 Sep 2005 17:49:50 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>>The key word is "should". However, reality intervenes. There are
>>apps out there that use multiple encodings. A standard formalises
>>current practice - it is *not* a design for the future.
>
>It makes no sense to standardize a current practice that has no
>future.

Yes it does! It encourages take up of current best practice after
the first port. Application developers simply will not discard a
large and proven code base just because you say they should.

We have been involved in two ports of large commercial Forth
applications, FigForth -> Forth83 and Forth83 -> ANS94. The
final application generates 10-16Mb of binary. Even the first
stage build requires compiling 250,000 lines of code. Until
you understand the mindset of these developers and the
management issues of large applications, you will not
understand why I'm taking this approach.

In essence you want to go from A -> B directly. I'm saying that
acceptance of B requires some people to go A -> C -> B. The
end point is not in dispute, it's the journey that counts.

>I have read enough statements from Forth vendors that it's impossible
>to write substantial apps in ANS Forth, so supposedly the programmers
>of those substantial apps are ignoring the standard already.

I for one do not subscribe to that point of view. What many/some
vendors have said is
a) the standard does not cover enough
b) we were out of time to do more
c) we welcome your taking up the challenge.

>>The preferred route, I suggest, is to provide GET-ENCODING and
>>SET-ENCODING.
>
>That's the worst possible design; or maybe having an ENCODING variable
>would be even worse.
>
>In general, the global-state approach is always causing problems,
>whether it's STATE or BASE or something else.

That's why GET-ENCODING and SET-ENCODING are suggested - they hide
the implementation of the storage.

>Xchars were designed for dealing with one encoding used throughout
>the Forth system. Several encodings are compatible with the
>requirements of xchars, and a Forth system might let you choose on
>startup which encoding to use, but you cannot switch around between
>encodings.

The implication of XCHARs is then that they cannot be used when
ACS <> DCS or OCS <> DCS. This breaks XCHARs for application
development on current Forths.

Bruce McFarling

unread,
Sep 29, 2005, 11:56:17 PM9/29/05
to

Stephen Pelc wrote:
> The implication of XCHARs is then that they cannot be used when
> ACS <> DCS or OCS <> DCS. This breaks XCHARs for application
> development on current Forths.

Or that they cannot be used in a multi-tasking situation when the ACS
of one task is not the same as the ACS of another task.

On the other hand, GET-ENCODING and SET-ENCODING can *accommodate* "UTF-8
uber alles" if SET-ENCODING is:

SET-ENCODING ( xc-id -- flag )
\ flag=FALSE: the encoding is not available
\ flag=TRUE:  an atomic XCHAR encoding is available (it is always possible
\             to find the beginning of the current char from an arbitrary
\             memory address within the string)
\ flag=1:     an XCHAR encoding is available, but it is not atomic (a valid
\             start-of-character address is required and you can only move
\             forward)

Then a "UTF-8 uber alles" system simply refuses any other encodings for
XCHARs, and accepts code with system dependencies on AU=1CHAR=8bits,
and XCHAR-ENCODING=UTF-8. Systems that can accomodate those
dependencies (and whetever else they do not have a vanilla-ANS prelude
file for) are able to run those programs. Let the best approach win,
without forcing anybody to lose.
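
A sketch of what that could look like; UTF8-ID is a made-up constant
standing for whatever xc-id the system assigns to UTF-8:

1 constant utf8-id                              \ hypothetical xc-id
: set-encoding ( xc-id -- flag )  utf8-id = ;   \ TRUE: UTF-8 is atomic
: get-encoding ( -- xc-id )       utf8-id ;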

Bernd Paysan

unread,
Sep 30, 2005, 10:36:29 AM9/30/05
to
Stephen Pelc wrote:
> That's why GET-ENCODING and SET-ENCODING are suggested - they hide
> the implementation of the storage.

So far, I suggest that this part should be defined elsewhere. The XCHAR
wordset itself is orthogonal to the ACS/OCS/DCS separation, and can be
(ab)used to handle that (with SET-ENCODING/GET-ENCODING and the encodings
that live behind that).

Being able to change the encoding also requires knowing how to name these
locales, so either dictionary names have to be defined, or the locale
specifier is a string, as with setlocale - and then you need to tell the
user what the string means.

I think we can agree that we need to handle encodings other than ASCII
and fixed-width wide characters - this is Forth200x, after all, and these
things exist. We need to handle several encodings (switchable) on some
systems in some cases, but not on others.

Encoding changes apparently belong to an internationalization wordset.
They correspond to the C "setlocale" function, and you already have
SET-LANGUAGE/GET-LANGUAGE and SET-COUNTRY/GET-COUNTRY words in your I18N
proposal. Let's keep them, they are fine. The LOCALE words apparently try
to address my concern about how to list and display the available locales
(though I think there should be at least one known locale-id to start
from, e.g. FORTH-LSID, which corresponds to the DCS). So
SET-ENCODING/GET-ENCODING fit perfectly into the LOCALE wordset.

Systems without internationalization may already need XCHAR, because
there are widely used environments with UTF-8 as the default character
set. But they don't need to switch between encodings.

Anton Ertl

unread,
Sep 30, 2005, 12:24:47 PM9/30/05
to
"Bruce McFarling" <agi...@netscape.net> writes:
>
>Anton Ertl wrote:
[>>Someone wrote:]

>> >The key word is "should". However, reality intervenes. There are
>> >apps out there that use multiple encodings. A standard formalises
>> >current practice - it is *not* a design for the future.
....

>> In the Linux world, UTF-8 is the present.
>
>No standard can be limited to the Linux world, just as no standard
>should shut out the Linux world.

My statement was refuting someone's statement about a design for the
future. It was not intended to be exhaustive, much less limiting.

E.g., in Plan9 UTF-8 has been the present since 1992.

I am no expert on Windows, but AFAIK Unicode (or its 16-bit subset) is
the standard character set of Windows NT and its offspring (also for
more than ten years). Jax4th on WNT supported Unicode in 1993.

So, universal character sets are not something that is in the distant
future.

[reinserted missing context]


>>> If you push through a standard that disenfranchises existing
>>> substantial apps, the developers of those apps will ignore
>>> the standard. Is this what you want?

>> I have read enough statements from Forth vendors that it's impossible


>> to write substantial apps in ANS Forth, so supposedly the programmers
>> of those substantial apps are ignoring the standard already.
>
>That's an all or nothing reading of what turn out to be qualified
>statements.

Well, I fail to see the qualification in the statement I responded to.

>It may be impossible to write the entirety of substantial
>apps in ANS Forth alone. There is nothing in that statement that
>suggests the programmers of those apps are ignoring the standard.
>After all, the standard does not *require* you to write the entirety of
>an app in ANS Forth alone.

Well, xchars don't require anyone to write an entire app in ANS Forth
alone, either.

Brad Eckert

unread,
Sep 30, 2005, 1:03:18 PM9/30/05
to
Should XCHARS be variable length? A given character could be a byte, a
16-bit char, a 32-bit char, etc. Then why not support xts too? When TYPE
encounters an xt in a string it would execute it. You can let your
imagination run with that.

I think the purpose of a wordset is to lay down rules for things that
you can't portably do in ANS Forth, like emit a character or string
using the more generalized characters. The other stuff sounds like an
exercise in creating useful data structures. If that's what we're
after, are there already common file formats that contain data
structures that deal with wide character sets?

Brad

Anton Ertl

unread,
Sep 30, 2005, 12:42:12 PM9/30/05
to
steph...@mpeforth.com (Stephen Pelc) writes:
>On Wed, 28 Sep 2005 17:49:50 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>>It makes no sense to standardize a current practice that has no
>>future.
>
>Yes it does! It encourages take up of current best practice after
>the first port.

?

>Application developers simply will not discard a
>large and proven code base just because you say they should.

Straw man argument.

>In essence you want to go from A -> B directly. I'm saying that
>acceptance of B requires some people to go A -> C -> B. The
>end point is not in dispute, it's the journey that counts.

Lets say what we are talking about: For me A and B are:

A: single character set, single 8-bit encoding
B: single character set, single, possibly variable-width encoding

And most Forth programmers are at A right now. I see no reason for
all of us to go through:

C: multiple character sets, multiple, possibly nasty, encodings

Of course, you have customers who are currently at C. I don't know if
going to B is viable for them, and what immediate steps they should
take, but I don't see that those people who are at A need to go there.

>>>The preferred route, I suggest, is to provide GET-ENCODING and
>>>SET-ENCODING.
>>
>>That's the worst possible design; or maybe having an ENCODING variable
>>would be even worse.
>>
>>In general, the global-state approach is always causing problems,
>>whether it's STATE or BASE or something else.
>
>That's why GET-ENCODING and SET-ENCODING are suggested - they hide
>the implementation of the storage.

It's still global state, with all its problems. Often, a better
design is to have a context wrapper, like

ENCODING-EXECUTE ( enc-id xt -- )

which executes xt in a context where the encoding is enc-id. That
would be safe against exceptions and makes reusable programming
easier.
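
A minimal sketch of such a wrapper, assuming the current encoding lives
in a hypothetical ENCODING value:

0 value encoding                  \ hypothetical holder for the current enc-id

: encoding-execute ( enc-id xt -- )
  encoding >r  swap to encoding   \ save the old encoding, install the new one
  catch                           \ run xt, trapping any exception
  r> to encoding                  \ restore the old encoding
  throw ;                         \ re-raise the exception, if any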

But actually in the case of encodings, if you want to support multiple
encodings, they should be stored with the data, maybe with each
character (I believe Emacs does something like this).

>The implication of XCHARs is then that they cannot be used when
>ACS <> DCS or OCS <> DCS.

Yes.

>This breaks XCHARs for application
>development on current Forths.

It may make xchars inappropriate for some applications on some Forths,
but they work well enough on one - no, according to Bernd, two - current
Forth systems.

Anton Ertl

unread,
Sep 30, 2005, 1:13:56 PM9/30/05
to
"Bruce McFarling" <agi...@netscape.net> writes:
>
>Anton Ertl wrote:
>> Gforth uses the standard
>> Unix mechanism (the LANG environment variable) for determining the
>> encoding on startup. No switching words needed.
>
>Except that is at the startup of gforth, not necessarily the startup of
>the application.

Sure, so what?

>And that does not address someone who uses gforth as
>a buffer against the expert-friendliness of Linux.

Using expert-friendly Gforth as a buffer against expert-friendly
Linux? Hmm.

Anyway, once Gforth is potentially poisoned with strings from one
encoding, the only reasonable way to change the encoding is to start
Gforth from scratch; we cannot recode all the strings lying around
with the earlier encoding. If anybody really has a problem with
exiting and restarting Gforth, one could write a word that exec()s
Gforth (which has the same effect, except possibly wrt open files and
stuff).

Anton Ertl

unread,
Sep 30, 2005, 1:21:33 PM9/30/05
to
steph...@mpeforth.com (Stephen Pelc) writes:
>On Wed, 28 Sep 2005 17:43:44 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>>As for multilingual word processors, that's a good reason for using a
>>universal character set and encoding rather than switching around.
>
>Yes for a new design, not necessarily for an existing app being
>ported. Standard = current practice. Some of the biggest issues
>in ANS94 come from the introduction of new practice. The good
>new parts come from the embodiment of best current practice, even
>if it came from another language, e.g. CATCH and THROW.

Well, then take a look at Java, which has used a universal character
set (AFAIK the 16-bit subset of Unicode) since 1995, and does not
provide for multiple encodings.

Anton Ertl

unread,
Sep 30, 2005, 1:31:19 PM9/30/05
to
"Bruce McFarling" <agi...@netscape.net> writes:
>
>Anton Ertl wrote:
>[Bernd]
>> >XC-SIZE ( xc -- u )
>> >Computes the memory size of the XCHAR xc in address units.
>
>> >XC!+ ( xc xc_addr1 -- xc_addr2 )
>> >Stores the XCHAR xc at xc_addr1. xc_addr2 points to the first memory
>> >location after xc.
>
>> This is unsafe, as it writes an unknown amount of data behind
>> xc_addr1. One can use it safely in combination with XC-SIZE, but then
>> it is easier to use XC!+? (see below).
>
>> DEFER XC!+? ( xc xc-addr1 u1 -- xc-addr2 u2 f )
>> safe version of XC!+, f specifies success
>
>I'm not sure about the level of this. An au length of a sequence of
>XCHARs in memory seems handier, to me, for most things, and I
>definitely prefer "know in advance" to "try it and clean up if it
>fails".

There is no need to clean up after XC!+?. It does the "know in
advance" internally. I think you'll have to try programming with both
words to see how it works.

Stephen Pelc

unread,
Sep 30, 2005, 2:13:21 PM9/30/05
to
On Fri, 30 Sep 2005 17:21:33 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Well, then take a look at Java, which has used a universal character
>set (AFAIK the 16-bit subset of Unicode) since 1995, and does not
>provide for multiple encodings.

At that stage in Java's life, there was not a substantial legacy
of 15 year old applications.

Stephen Pelc

unread,
Sep 30, 2005, 2:26:11 PM9/30/05
to
On Fri, 30 Sep 2005 16:42:12 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>>Application developers simply will not discard a
>>large and proven code base just because you say they should.
>
>Straw man argument.

I disagree. Talk to Willem and Nick at EuroForth.

>A: single character set, single 8-bit encoding
>B: single character set, single, possibly variable-width encoding
>
>And most Forth programmers are at A right now. I see no reason for
>all of us to go through:
>
>C: multiple character sets, multiple, possibly nasty, encodings
>
>Of course, you have customers who are currently at C. I don't know if
>going to B is viable for them, and what immediate steps they should
>take, but I don't see that those people who are at A need to go there.

Are you really telling the Forth developers with:
a) The largest code base
b) the largest client base
c) the most experience of multiple languages and encodings
that they don't count in Forth200x?

>>The implication of XCHARs is then that they cannot be used when
>>ACS <> DCS or OCS <> DCS.
>
>Yes.

A large number of embedded systems will use an 8-bit DCS for a long
while into the future, regardless of what any Forth200x standard says.
Such systems *do* and will *have* to use multiple encodings for a
long time to come.

Bruce McFarling

unread,
Sep 30, 2005, 9:41:12 PM9/30/05
to
Anton Ertl wrote:

> Anyway, once Gforth is potentially poisoned with strings from one
> encoding, the only reasonable way to change the encoding is to start
> Gforth from scratch; we cannot recode all the strings lying around
> with the earlier encoding. If anybody really has a problem with
> exiting and restarting Gforth, one could write a word that exec()s
> Gforth (which has the same effect, except possibly wrt open files and
> stuff).

Nobody said every system had to support every encoding, or that a
system that wished to target UTF-8 as a universal encoding was not free
to do so. And as I mentioned before, in the GNU/Linux open-source
space, UTF-8 as the only XCHAR encoding is perfectly justifiable. That
leaves byte-wide CHARs for backward compatibility when working with
code pages.

However, just now trying to work out how encodings are juggled when
going between ACS/DCS/OCS, it has become clear to me that two XCHARs
are needed. Anton's view of XCHAR is as a joint DCS/OCS when it
becomes necessary to work with multiple-CHAR encodings. Note that this
may equally well be UTF-8 when CHAR applies to 8-bit values or UTF-16
when CHAR applies to 16-bit values. Additionally, this may be in
resource-constrained situations; with no memory limit, UTF-32 fits
into 32-bit-wide CHARs.

(As a side note, the latter appeals most for migrating a Forth targeted
at a simplified ideogram encoding that fits into the first Unicode plane
towards full Unicode recognition; philosophically, as well as from
experience lecturing classes with large numbers of Chinese and Taiwanese
students, I don't want to lock out that migration path.)

The definition of some existing standards in terms of ASCII7, and the
ease of extension to ISO 8-bit sets, requires byte-word handling.

B@ B! BMOVE B, BYTES BYTE+

as referred to in Pelc and Knaggs (2001), citing Greg Bailey,
accommodates that. Additionally, note that with a known size,
specialised access to bizarre address units can be defined in a
portable manner, and in any event bizarre address units typically
entail unportable code.
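
On a system where 1 CHAR = 1 AU = 8 bits, those byte words collapse to
trivial aliases - a sketch (systems with wider chars need real byte
access code instead):

: b@    ( b-addr -- byte )        c@ ;
: b!    ( byte b-addr -- )        c! ;
: bmove ( b-addr1 b-addr2 u -- )  move ;
: b,    ( byte -- )               c, ;
: bytes ( n1 -- n2 )              chars ;
: byte+ ( b-addr1 -- b-addr2 )    char+ ;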

And finally, the statement in Knaggs and Pelc (2001) that:
"Because of the rarity of multibyte character sets, we believe that
they need be handled only by the LOCALE wordset proposal for
internationalisation of applications."

no longer applies with the growing adoption of UTF-8 in the open source
community.

Also, the assumption that Unicode is 16 bits no longer applies.

Therefore:

(1) I would propose that XCHARS be adopted for standard use with
variable-CHAR width OCS.

(2) A system has one variable width OCS encoding.

(3) A system may also adopt an OCS variable-width encoding as its DCS,
provided that it is upwardly compatible with ASCII7, as both UTF-8 with
8-bit CHARs and UTF-16 with 16-bit CHARs are. Obviously only source that
is encoded in the ASCII7 subset may be considered portable.


OK now, what I was interested in was variable character sets where
standardisation would actually be useful, which is the ACS. A subset of
the XCHAR functionality provides operations on non-atomic variable-width
encodings, for which you can go forward from the start address of a
well-formed character, but you cannot necessarily go back from the final
address pointing to a well-formed character. The full XCHAR functionality
provides operations on atomic variable-width encodings. So:

(4) Call these "VCHAR"s.
(5) The discussion of GET-ENCODING and SET-ENCODING applies to VCHARs.
Since by presumption one sets the encoding before starting to work in a
"non-native" set, if one works with multiple sets one knows beforehand
that it is necessary to also store the appropriate vc-id in an
appropriate place.

Bruce McFarling

unread,
Sep 30, 2005, 10:54:53 PM9/30/05
to

Anton Ertl wrote:
> I am no expert on Windows, but AFAIK Unicode (or its 16-bit subset) is
> the standard character set of Windows NT and its offspring (also for
> more than ten years). Jax4th on WNT supported Unicode in 1993.

> So, universal character sets are not something that is in the distant
> future.

But then they found out that UTF-16 was not big enough to be universal,
and in particular allocated plane 2 for some of the more elegant Han
ideographs.

Since Windows NT relies on the 16-bit subset of UTF-16, it would not be
surprising to see a Forth implemented for Windows NT to have 16-bit
CHARS, and then need to have something like XCHARs to upgrade to full
Unicode.

What I was saying was not that universal-across-all-languages character
SETS were things we would see in the distant future, but that A
universal-across-all-systems character SET ENCODING (with possibly some
specialised legacy cases) is a possible future, and not the present.

And nothing about "distant" future. You added that out of whole cloth
to make a better straw man to knock down. (Not that I mind, being
faced with straw men versions of what you have said makes it easier to
see where you have been vague or confusing in your expression).

Bruce McFarling

unread,
Sep 30, 2005, 11:07:00 PM9/30/05
to

Brad Eckert wrote:

> Should XCHARS be variable length?

Yes, that's the whole point. Variable length character set encodings
are becoming more common, and UTF-8 since its the easiest upgrade path
to full Unicode from classic C character=byte, anybody who wants to
talk to internationalised Linux applications is going to want to handle
variable length character sets.

Bernd's discussion of the XCHARs introduced into gforth to cope with this
exact issue is, I think, a good layout of the basic functionality.

Stephen's raising of the I18N OCS/DCS/ACS issues is pertinent as well. I
was attracted to XCHARs as an ACS tool, but the XCHARs in gforth are
in effect an OCS tool.

For working with fixed-width ACS characters, provided they are as big
as or bigger than the DCS character set (CHAR) and one bit narrower than
the cell size (to avoid signedness/unsignedness problems), the WCHARs of
Pelc and Knaggs (2001) cope with that. But that is for a constant-width
character set, not a variable-width character set. If your WCHAR is
32-bit Unicode and you are building an IP packet in UTF-8 in memory,
you have to HAVE the XCHAR functionality, whether you get it from
somewhere else or program it yourself. My humble proposal was to get
at that kind of ACS issue by having VCHARs in parallel with XCHARs, and
including a GET-ENCODING / SET-ENCODING that works with VCHARs.

Bruce McFarling

unread,
Sep 30, 2005, 11:14:41 PM9/30/05
to
Bernd Paysan wrote:

> Stephen Pelc wrote:
> > That's why GET-ENCODING and SET-ENCODING are suggested - they hide
> > the implementation of the storage.

> So far, I suggest that this part should be defined elsewhere. The XCHAR
> wordset itself is orthogonal to the ACS/OCS/DCS separation, and can be
> (ab)used to handle that (with SET-ENCODING/GET-ENCODING and the encodings
> that live behind that).

The functionality is orthogonal, but the set-encoding issues are tied
up with ACS/OCS/DCS. Or at least, that's my story, and I'm sticking to
it until the next smart person comes by and shakes it loose.

Basically, you don't WANT get-encoding to TOUCH the OCS or the DCS.
You only WANT it to touch the ACS. Anton's concerns about building a
system first and then adding a mutating XCHAR afterwards are, I think,
perfectly valid, and they are examples of WHY the OCS/DCS ought to be
considered hardwired. The portability issue is about making it easier to
share tools between various systems, and XCHAR does that. And since it
is a subset of the capabilities available with fixed-width chars, it is
perfectly generic to ANY OCS, with the exception that some efforts at
variable-width character encodings prior to UTF-8 (and then UTF-16) were
not atomic, and required a "start at the beginning and scroll forward"
approach.

The only way to make XCHARs mutable in some instances and immutable in
others is to have two parallel sets of words, a system-defined one
(XCHAR) and an application settable one (what I have impertinently
labelled VCHAR).

Bruce McFarling

unread,
Sep 30, 2005, 11:19:34 PM9/30/05
to

Anton Ertl wrote:
> And most Forth programmers are at A right now. I see no reason for
> all of us to go through:

> C: multiple character sets, multiple, possibly nasty, encodings

Precisely what is there in what Stephen has said that mandates going
through C?

The question is the difference between (1) "I want A->B available,
without having to go through C", and (2) "I want A->B to be the only
standardised option, to discourage going through C".

A standard that gives (1), while accommodating those who by force of
circumstance have been forced to start from C, is in my mind better. I
don't think there is any need to push people at "A" to skip "C" ...
given the option, the appeal of "B" will be a strong enough pull on its
own.

And promulgating the option as widely as possible requires accommodating
as many different present starting positions as possible.

Bruce McFarling

unread,
Sep 30, 2005, 11:26:47 PM9/30/05