type

NN

Apr 2, 2021, 8:29:26 AM

https://forth-standard.org/standard/core/TYPE

For a suggested reference implementation, how about:

: type ( a u -- )
swap >r begin dup 0> while
r> count emit >r 1-
repeat drop r> drop ;

Brian Fox

Apr 2, 2021, 9:06:48 AM

On ITC systems DO/LOOP is faster because the comparison and branching
and return stack operations are not separated by NEXT.

So this is arguably better on those systems:

: TYPE ( caddr u --) ( PAUSE) OVER + SWAP DO I C@ EMIT LOOP ;

And if OVER + SWAP is a primitive in the system like BOUNDS
it's even smaller and faster.

I am acutely aware of these differences working with a 3 MHz processor.
:)
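
For systems that lack BOUNDS, a minimal sketch follows. BOUNDS is not part of the Forth-2012 CORE word set; the definition below is the usual high-level one, and the name type-b is made up for this example. ?DO is assumed to be available, so that a zero-length string prints nothing.

```forth
\ Common high-level definition; many systems provide it as a primitive.
: bounds ( c-addr u -- c-addr+u c-addr )  over + swap ;

\ Hypothetical name, chosen to avoid redefining the system's TYPE.
: type-b ( c-addr u -- )  bounds ?do  i c@ emit  loop ;
```

Where BOUNDS is a code word, the loop setup shrinks to a single call, which is the saving described above.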

NN

Apr 2, 2021, 11:41:32 AM

Fair enough.
BTW, I think your example would fail on an empty string, which is why the
examples in the link use ?do

: type ( a u -- ) abs bounds ?do i c@ emit loop ;

If it's good enough, someone can add it to the list.

Paul Rubin

Apr 2, 2021, 3:59:15 PM

NN <novembe...@gmail.com> writes:
> For suggested ref implementations how about.
> : type ( a u -- )
> swap >r begin dup 0> while
> r> count emit >r 1-
> repeat drop r> drop ;

Is something wrong with DO ?

: type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;

I first used ?DO to avoid the 0 test, but that turns out to be in
CORE-EXT rather than CORE.

P Falth

Apr 2, 2021, 6:13:29 PM

And how is EMIT defined?
I use something like

VARIABLE tmp
: EMIT tmp c! tmp 1 type ;

or if you want it to work with unicode chars

: EMIT tmp dup >r xc!+ r@ - r> swap type ;

BR
Peter

Brian Fox

Apr 2, 2021, 6:30:25 PM

On my system EMIT is primitive since it has to talk directly to
a Video chip. (TMS9918) So that gave me the freedom to do TYPE
the way I did.

Doug Hoffman

Apr 2, 2021, 6:52:03 PM

On 4/2/21 3:59 PM, Paul Rubin wrote:

> Is something wrong with DO ?
>
> : type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;

I think you need a 2drop:

: type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE 2drop THEN ;

-Doug

Brian Fox

Apr 2, 2021, 8:50:51 PM

On 2021-04-02 11:41 AM, NN wrote:

> Fair enough.
> BTW, I think your example would fail on an empty string which is the
> examples in the link use ?do
>
> : type ( a u -- ) abs bounds ?do i c@ emit loop ;
>
> If its good enough someone can add it to the list.
>

You are correct. I implement ?DO in my kernel and use it for
TYPE. I was typing from the "hip".

Since a U is specified for the string length I think ABS
is incorrect. I have never seen it placed in TYPE.

It would be wise perhaps for Forth79 and FIG Forth DO LOOPS.

dxforth

Apr 2, 2021, 9:02:56 PM

Also the memory saved.

dxforth

Apr 2, 2021, 9:19:54 PM

A distinction only a Standard could make. For a small system to not
include ?DO would be to needlessly waste memory.

dxforth

Apr 3, 2021, 1:17:53 AM

: TYPE ?DUP IF 0 DO COUNT EMIT LOOP THEN DROP ;

Paul Rubin

Apr 3, 2021, 2:12:40 AM

dxforth <dxf...@gmail.com> writes:
> A distinction only a Standard could make. For a small system to not
> include ?DO would be to needlessly waste memory.

I don't understand why DO exists instead of ?DO being the default.

Paul Rubin

Apr 3, 2021, 2:15:41 AM

Doug Hoffman <dhoff...@gmail.com> writes:
> I think you need a 2drop:

Yep. Or I suppose I could have used ?DUP but that word always makes me
squirm because of its variable stack effect.

Paul Rubin

Apr 3, 2021, 2:17:54 AM

P Falth <peter....@gmail.com> writes:
> And how is EMIT defined?

I've usually thought of EMIT as a primitive that writes directly to a
hardware port, on low level systems. TYPE then uses EMIT.

dxforth

Apr 3, 2021, 3:25:49 AM

?DO came much later and involves an extra test you may not need.

Anton Ertl

Apr 3, 2021, 4:28:41 AM

Paul Rubin <no.e...@nospam.invalid> writes:
>NN <novembe...@gmail.com> writes:
>> For suggested ref implementations how about.
>> : type ( a u -- )
>> swap >r begin dup 0> while
>> r> count emit >r 1-
>> repeat drop r> drop ;
>
>Is something wrong with DO ?

Yes.

>: type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;
>
>I first used ?DO to avoid the 0 test, but that turns out to be in
>CORE-EXT rather than CORE.

So what. It's still the better word for this purpose. It's not your
job to find workarounds for systems without ?DO.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020

Anton Ertl

Apr 3, 2021, 5:10:47 AM

NN <novembe...@gmail.com> writes:
>https://forth-standard.org/standard/core/TYPE
>
>For suggested ref implementations how about.

You can suggest a reference implementation for TYPE there.

>: type ( a u -- )
> swap >r begin dup 0> while
> r> count emit >r 1-
> repeat drop r> drop ;

Others have presented versions that use ?DO (or DO surrounded by IF).
There are the following variants:

1) Use the address as the loop index:

: type1 ( c-addr u -- )
over + swap ?do i c@ emit loop ;

2) Use the array index as loop index:

: type2 ( c-addr u -- )
0 ?do dup i + c@ emit loop drop ;

3) Use the length as loop limit, don't use the loop index:

: type3 ( c-addr u -- )
0 ?do dup c@ emit 1+ loop drop ;

or less obvious

: type3 ( c-addr u -- )
0 ?do count emit loop drop ;

or you could use a variant of FOR..NEXT that supports 0-trip loops.

If you want to cater for "1 chars > 1" systems (try to get one for
testing!), these definitions become more complicated:

: type1 ( c-addr u -- )
chars over + swap ?do i c@ emit 1 chars +loop ;

: type2 ( c-addr u -- )
0 ?do dup i chars + c@ emit loop drop ;

: type3 ( c-addr u -- )
0 ?do dup c@ emit char+ loop drop ;

In this case the differences between TYPE1, TYPE2 and TYPE3 are small,
but in general, when looping over arrays, I prefer to use the address
as loop index (i.e. TYPE1), because it usually means that I don't have
to keep the (base or running) address elsewhere throughout the loop
body, resulting in less stack juggling.
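
A small harness for comparing such variants, including the zero-length case raised earlier in the thread, might look like the sketch below. The word check is a hypothetical helper, not something from the standard or from any post here, and the chars-agnostic forms of the three variants are used for brevity.

```forth
: type1 ( c-addr u -- )  over + swap ?do  i c@ emit  loop ;
: type2 ( c-addr u -- )  0 ?do  dup i + c@ emit  loop drop ;
: type3 ( c-addr u -- )  0 ?do  count emit  loop drop ;

\ Hypothetical helper: run a TYPE candidate on a normal and an empty
\ string, then verify that the data stack depth is unchanged.
: check ( xt -- )
  depth 1- swap >r          \ remember the depth below the xt; keep xt on R
  s" abc" r@ execute cr     \ normal string
  s" " r> execute           \ empty string: nothing should be printed
  depth 1- <> abort" stack depth changed" ;

' type1 check  ' type2 check  ' type3 check
```

Replacing ?DO with a plain DO in any variant makes the empty-string call run the loop body anyway, which is the failure mode pointed out above.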

Anton Ertl

Apr 3, 2021, 5:31:35 AM

P Falth <peter....@gmail.com> writes:
>or if you want it to work with unicode chars
>
>: EMIT tmp dup >r xc!+ r@ - r> swap type ;

EMIT is defined to work on chars, not on xchars. We have XEMIT for
printing one xchar on the stack (or TYPE for printing one or more
xchars in memory).

And in general, EMIT cannot be extended to work like XEMIT: EMIT
prints the raw byte; and on, e.g., a system like Gforth (with UTF-8
encoding) where XEMIT takes a Unicode code point number as input and
produces UTF-8 as output, the output for, e.g., $C4 XEMIT consists of
two bytes ($c3 $84), while $C4 EMIT just outputs $c4. That's why the
xchar wordset contains XEMIT (unlike what I proposed in 2005); the use
of EMIT and KEY for dealing with raw bytes was pointed out by Stephen
Pelc.

I know that you do not use codepoint numbers for the on-stack
representation of characters, but something derived from the in-memory
(i.e., string) representation. It makes me wonder if, with your
on-stack representation, XEMIT can output any raw byte, or if this
results in the byte followed by a number of 0 bytes for certain byte
values.
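
To make the $C4 example concrete, here is a sketch of the code point to UTF-8 conversion that a word like XEMIT has to perform on a UTF-8 system. The names u8buf, u8len, u8, and >utf8 are invented for this illustration and do not come from Gforth or from the xchar word set; the $ number prefix from Forth-2012 is assumed, and surrogates or out-of-range values are not checked.

```forth
create u8buf 4 chars allot      \ scratch buffer for one encoded character
variable u8len                  \ number of bytes currently in u8buf

: u8, ( byte -- )  u8buf u8len @ + c!  1 u8len +! ;

: (encode) ( xc -- )
  dup $80 u< if  u8, exit  then                       \ 1 byte: ASCII
  dup $800 u< if                                      \ 2 bytes
    dup 6 rshift $C0 or u8,
    $3F and $80 or u8,  exit  then
  dup $10000 u< if                                    \ 3 bytes
    dup 12 rshift $E0 or u8,
    dup 6 rshift $3F and $80 or u8,
    $3F and $80 or u8,  exit  then
  dup 18 rshift $F0 or u8,                            \ 4 bytes
  dup 12 rshift $3F and $80 or u8,
  dup 6 rshift $3F and $80 or u8,
  $3F and $80 or u8, ;

\ Encode code point xc; the result is valid until the next call.
: >utf8 ( xc -- c-addr u )  0 u8len !  (encode)  u8buf u8len @ ;
```

With this, $C4 >utf8 yields the two bytes $C3 $84, and passing that string to a raw-byte TYPE sends exactly those bytes to the terminal.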

P Falth

Apr 3, 2021, 6:06:10 AM

On a forth hosted on an OS, I found it convenient to implement
TYPE as the primitive. On my lxf, 32-bit under Linux, TYPE is

: (type) ( addr len -- ) swap 1 4 syscall3 drop ;

syscall3 implements a kernel call with 3 parameters
( plus the syscall nr)

Peter

Doug Hoffman

Apr 3, 2021, 7:44:34 AM

Some Forths don't default to always showing the stack depth when
execution is done. I think Gforth is one of those but maybe it can be
configured to do so. It is a pain to type .s every time after you test a
word. Anyway, I am guessing that is why you didn't notice that your TYPE
was leaving something on the stack.

I only tested the edge case where the string length was zero. My fault.
dxforth caught it and showed one possible correction.

-Doug

P Falth

Apr 3, 2021, 8:22:20 AM

On Saturday, 3 April 2021 at 11:31:35 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >or if you want it to work with unicode chars
> >
> >: EMIT tmp dup >r xc!+ r@ - r> swap type ;
> EMIT is defined to work on chars, not on xchars. We have XEMIT for
> printing one xchar on the stack (or TYPE for printing one or more
> xchars in memory).
>
> And in general, EMIT cannot be extended to work like XEMIT: EMIT
> prints the raw byte; and on, e.g., a system like Gforth (with UTF-8
> encoding) where XEMIT takes a Unicode code point number as input and
> produces UTF-8 as output, the output for, e.g., $C4 XEMIT consists of
> two bytes ($c3 $84), while $C4 EMIT just outputs $c4. That's why the
> xchar wordset contains XEMIT (unlike what I proposed in 2005); the use
> of EMIT and KEY for dealing with raw bytes was pointed out by Stephen
> Pelc.
>
> I know that you do not use codepoint numbers for the on-stack
> representation of characters, but something derived from the in-memory
> (i.e., string) representation. It makes me wonder if, with your
> on-stack representation, XEMIT can output any raw byte, or if this
> results in the byte followed by a number of 0 bytes for certain byte
> values.
> - anton

I found that my stack representation was a dead end, creating problems
with no other gains. I switched to stack representation being the codepoint.

My systems are also unicode only, no other encodings supported. For this
reason I have emit=xemit and key=xkey. This has so far not given me any
problem.

$C4 EMIT outputs Ä.
What else should it output?
In fact that was the reason I started with unicode about 20 years ago.
To be able to spell and type my last name properly. It is Fälth!

BR
Peter

Anton Ertl

Apr 3, 2021, 8:53:05 AM

P Falth <peter....@gmail.com> writes:
>I found that my stack representation was a dead end, creating problems
>with no other gains.

Interesting. What were the problems?

>My systems are also unicode only, no other encodings supported. For this
>reason I have emit=xemit and key=xkey. This has so far not given me any
>problem.
>
> $C4 EMIT outputs Ä.

>What else should it output?

Just the raw byte $c4. For xchars, there is XEMIT, which should
output UTF-8 on your system.

>To be able to spell and type my last name properly. It is F=C3=A4lth!

And then Google Groups mangles it into quoted-printable encoding:-)

P Falth

Apr 3, 2021, 9:20:59 AM

On Saturday, 3 April 2021 at 14:53:05 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >I found that my stack representation was a dead end, creating problems
> >with no other gains.
> Interesting. What were the problems?

I tried to avoid coding and decoding by just using c!/c@, w!/w@ and a 3@3! to get 3
bytes. easy to implement but:

I could not take a codepoint directly and use it. I had to store it, look at the dump
or @ it to understand how I should write it.

sorting order was completely lost

No other language used a similar encoding

> >My systems are also unicode only, no other encodings supported. For this
> >reason I have emit=xemit and key=xkey. This has so far not given me any
> >problem.
> >
> > $C4 EMIT outputs Ä.
> >What else should it output?
> Just the raw byte $c4. For xchars, there is XEMIT, which should
> output UTF-8 on your system.

the first paragraph in forth2012 of EMIT says

6.1.1320 EMIT CORE ( x – – )
If x is a graphic character in the implementation-defined character set, display x. The
effect of EMIT for all other values of x is implementation-defined.

$C4 is Ä in my implementation-defined character set so it displays that!

>
> >To be able to spell and type my last name properly. It is F=C3=A4lth!
>
> And then Google Groups mangles it into quoted-printable encoding:-)

After soon 27 years in Italy I have become used to different spellings and
pronunciations of my name. The 2 dots have almost always disappeared

Peter

Anton Ertl

Apr 3, 2021, 10:39:44 AM

P Falth <peter....@gmail.com> writes:
>On Saturday, 3 April 2021 at 14:53:05 UTC+2, Anton Ertl wrote:

>> P Falth <peter....@gmail.com> writes:
>> > $C4 EMIT outputs Ä.
>> >What else should it output?
>> Just the raw byte $c4. For xchars, there is XEMIT, which should
>> output UTF-8 on your system.

>
>the first paragraph in forth2012 of EMIT says
>

>6.1.1320 EMIT CORE ( x – – )
>If x is a graphic character in the implementation-defined character set, display x. The
>effect of EMIT for all other values of x is implementation-defined.

The second paragraph says

|When passed a character whose character-defining bits have a value
|between hex 20 and 7E inclusive, the corresponding standard character,
|specified by 3.1.2.1 Graphic characters, is displayed. Because
|different output devices can respond differently to control
|characters, programs that use control characters to perform specific
|functions have an environmental dependency. Each EMIT deals with only
|one character.

It does not really say what happens for other characters. In any
case, the intention of adding XEMIT was so that EMIT could be used for
raw bytes. And several systems (I think all, but yours) work that
way. I guess we should fix the definition of EMIT.

Paul Rubin

Apr 3, 2021, 12:07:29 PM

P Falth <peter....@gmail.com> writes:
> My systems are also unicode only, no other encodings supported. For this
> reason I have emit=xemit and key=xkey. This has so far not given me any
> problem.

That means more complicated methods for actual binary i/o, I guess.
It's nice to be able to write raw bytes when you want to.

P Falth

Apr 3, 2021, 1:43:09 PM

I checked VfX. The windows version outputs Ä for $C4 EMIT
but for $20ac (€) it does not work. probably it supports Latin-1 or similar.
Linux version does not work for EMIT, same output as gforth.

Already when we discussed this 20? years ago I was against xemit and
xkey. My position has always been that emit and key should work with
the implemented character encoding. If you need to send byte by byte
these functions should be named differently, like pemit and pkey.
If input or output is redirected key and emit should behave according
to the specifics of the new source for example by being deferred

Peter

P Falth

Apr 3, 2021, 1:45:28 PM

WRITE-FILE and READ-FILE works for that!

Peter

NN

Apr 4, 2021, 8:02:35 AM

128 emit € <- depends on the font.

Anton Ertl

Apr 4, 2021, 10:05:43 AM

P Falth <peter....@gmail.com> writes:
>I checked VfX. The windows version outputs Ä for $C4 EMIT

That's strange. Stephen Pelc argued against the extension of EMIT to
work on xchars and for EMIT to process raw bytes. His argument
resulted in the introduction of XEMIT for dealing with xchars. My
guess is that this behaviour is not intentional.

On Linux Gforth, SwiftForth and VFX seem to process raw bytes. iForth
behaves strangely:

iforth "1 $c3 emit .s 1 $a4 emit .s bye"

shows an empty stack twice, so EMIT apparently consumes two stack
items.

lxf behaves as you described.

>Already when we discussed this 20? years ago I was against xemit and
>xkey. My position has always been that emit and key should work with
>the implemented character encoding. If you need to send byte by byte
>these functions should be named differently, like pemit and pkey.

Well, the decision has been to add XEMIT and XKEY, and that's water
down the river. We also have EMIT and KEY, and the intention at some
time was for them to deal with raw bytes, but that intention has not
been reflected in the text of the standard document yet. I don't
expect a proposal for PEMIT and PKEY (if somebody writes it) to be
successful, but who knows.

Originally I proposed to let EMIT and KEY work on xchars (as XEMIT and
XKEY do now) and was not enthusiastic about EMIT and KEY for raw
bytes. But it has the advantage that the Forth-94 code like

: type1 ( c-addr u -- )
over + swap ?do i c@ emit loop ;

works as intended on Forth-2012, even when running on a UTF-8 system
and passing a UTF-8 string to TYPE1. However, the difference between
VFX for Windows and for Linux indicates that in practice, this
advantage is not used (at least not by VFX users on Windows).

>If input or output is redirected key and emit should behave according
>to the specifics of the new source for example by being deferred

A deferred word should implement a common interface. In case of EMIT,
all implementations should process a raw byte, or all implementations
should process a code point. If you want EMIT to behave like XEMIT,
then it should do that even when redirected to a serial port;
conversely, if we want EMIT to process raw bytes, all the
implementations of EMIT should do that.
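
The "common interface" point can be sketched with DEFER and IS (CORE-EXT in Forth-2012). Every name below (out-byte, to-console, to-counter, type-via, byte-count) is hypothetical and chosen only for illustration; the sketch assumes the raw-byte convention, so each implementation plugged into the deferred word takes exactly one byte.

```forth
defer out-byte ( byte -- )      \ one interface: a single raw byte per call

variable byte-count

: to-console ( byte -- )  emit ;                   \ assumes EMIT passes raw bytes
: to-counter ( byte -- )  drop 1 byte-count +! ;   \ stand-in for another device

\ A TYPE-like word written only against the deferred interface.
: type-via ( c-addr u -- )  over + swap ?do  i c@ out-byte  loop ;

' to-console is out-byte
```

Because both implementations agree on the stack effect and on the meaning of the argument, type-via behaves consistently whichever one is installed; mixing a byte-based and a code-point-based implementation behind the same name would break that.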

P Falth

Apr 4, 2021, 10:08:41 AM

No it also depends on what codepage you use. That could be Windows 1252.
It is definitely not unicode.

P Falth

Apr 4, 2021, 1:31:19 PM

It will work on Linux but not on a Windows console.
For that to work on a Windows console I need to define emit as

\ emit for raw bytes

variable tmpbytes
variable #tmp
variable #expected

: XCS ( xcaddr -- n ) \ size of xc in bytes stored at addr
c@
dup $80 u< if drop 1 exit then
dup $e0 u< if drop 2 exit then
dup $f0 u< if drop 3 exit then
dup $f8 u< if drop 4 exit then
dup $fc u< if drop 5 exit then
dup $fe u< if drop 6 exit then
drop 6 ;

: leadingchar ( pchar -- )
tmpbytes c!
1 #tmp !
tmpbytes xcs 1- #expected ! ;

: trailingchar ( pchar -- )
tmpbytes #tmp @ + c!
1 #tmp +!
-1 #expected +! ;

: emit ( pchar -- )
#tmp @ if trailingchar else leadingchar then
#expected @ 0= if tmpbytes #tmp @ type 0 #tmp ! then ;

Where type is defined like
: prtmp here 4096 + aligned ;

: type ( addr len -- )
dup if
prtmp over 2* utf>wc16-string 2/
0 temp 2swap swap conout @ writeconsolew drop exit then
2drop ;

the utf8 string is transformed to utf16 and sent to the WriteConsoleW
system call.

So first I need to split the character to print it as raw bytes and then
in EMIT put them together again. I guess this becomes cooked bytes now!

The Windows console has no similarities with a Linux VT-console.
Fortunately Microsoft has introduced the Windows Terminal.
It has almost full support for UTF8 and VT codes. Printing utf8
strings to the console is now working. It is not even limited to the
first 64K codepoints as before. Reading does not work yet.

> >If input or output is redirected key and emit should behave according
> >to the specifics of the new source for example by being deferred
> A deferred word should implement a common interface. In case of EMIT,
> all implementations should process a raw byte, or all implementations
> should process a code point. If you want EMIT to behave like XEMIT,
> then it should do that even when redirected to a serial port;
> conversely, if we want EMIT to process raw bytes, all the
> implementations of EMIT should do that.

What I mean is that if emit is redirected to a serial port I would regard that
as having a character set of 0-255 and every byte will be raw bytes.

Peter

Anton Ertl

Apr 7, 2021, 5:22:19 AM

P Falth <peter....@gmail.com> writes:
>It will work on Linux but not on a Windows console.
>

>So first I need to split the character to print it as raw bytes and then
>in EMIT put them together again. I guess this becomes cooked bytes now!
>
>The Windows console has no similarities with a Linux VT-console.
>Fortunately Microsoft has introduced the Windows Terminal.
>It has almost full support for UTF8 and VT codes. Printing utf8
>strings to the console is now working.

Ruvim reports in
<https://forth-standard.org/proposals/emit-and-non-ascii-values#reply-627>:

|I tested SP-Forth/4 in Windows (by setting UTF-8 code page in the
|console via chcp 65001 command), and in Linux. The test:
|
|HEX C3 EMIT A4 EMIT
|
|outputs ä
|
|In SP-Forth the word EMIT is implemented via TYPE (that is via WRITE-FILE).
|
|In the test
|
|HEX C3 EMIT KEY DROP A4 EMIT
|
|we can see that after the first emit nothing is shown, and after the
|second emit the character ä is shown.

I don't know how SP-Forth/4 calls Windows, and whether Ruvim used the
Windows Terminal, but it's apparently possible to implement EMIT in
the "raw byte" way on Windows, too.

I wonder if in VFX on Windows the "typical use" case works as intended
if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
don't pass the test) unless you tell the system that you use UTF-8
(but these days, UTF-8 is the default setting).

>> >If input or output is redirected key and emit should behave according
>> >to the specifics of the new source for example by being deferred
>> A deferred word should implement a common interface. In case of EMIT,
>> all implementations should process a raw byte, or all implementations
>> should process a code point. If you want EMIT to behave like XEMIT,
>> then it should do that even when redirected to a serial port;
>> conversely, if we want EMIT to process raw bytes, all the
>> implementations of EMIT should do that.
>
>What I mean is that if emit is redirected to a serial port I would regard that
>as having a character set of 0-255 and every byte will be raw bytes.

But if the serial port is connected to something expecting UTF-8, the
behaviour would differ from the behaviour of EMIT on the console: the
"typical use" example would work, while it does not work on the
console.

In any case, I have written a proposal on the wording of EMIT
<https://forth-standard.org/proposals/emit-and-non-ascii-values#contribution-184>,
and you may want to contribute to it.

none albert

Apr 7, 2021, 10:19:57 AM

In article <a526461b-316d-4baa...@googlegroups.com>,
P Falth <peter....@gmail.com> wrote:

>On Friday, 2 April 2021 at 21:59:15 UTC+2, Paul Rubin wrote:
>> NN <novembe...@gmail.com> writes:

>> > For suggested ref implementations how about.

>> > : type ( a u -- )
>> > swap >r begin dup 0> while
>> > r> count emit >r 1-
>> > repeat drop r> drop ;

>> Is something wrong with DO ?
>>

>> : type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;
>>
>> I first used ?DO to avoid the 0 test, but that turns out to be in
>> CORE-EXT rather than CORE.
>

>And how is EMIT defined?

>I use something like
>
>VARIABLE tmp
>: EMIT tmp c! tmp 1 type ;

>
>or if you want it to work with unicode chars
>
>: EMIT tmp dup >r xc!+ r@ - r> swap type ;

Indeed it is much better to have
: TYPE 1 ( stdout) WRITE-FILE THROW ;

You don't need a tmp as long as you can point into
the data stack:
: EMIT DSP@ 1 TYPE DROP ;

>
>BR
>Peter

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

none albert

Apr 7, 2021, 10:35:53 AM

In article <64189ef8-0164-4239...@googlegroups.com>,

Let's try
\----------------
HEX
CREATE buffer C4 ,
: doit buffer 1 TYPE ;
\----------------
lina -c doemit.frt
doemit | hd
00000000 c4 |.|
00000001
Anybody who expects something different?

Now xterm prints an upper-case A with an umlaut, and the Linux terminal
shows a square with a question mark. So the interpretation of the character
has little to do with Forth I'd say, more with the terminal.

>
>Already when we discussed this 20? years ago I was against xemit and
>xkey. My position has always been that emit and key should work with
>the implemented character encoding. If you need to send byte by byte
>these functions should be named differently, like pemit and pkey.
>If input or output is redirected key and emit should behave according
>to the specifics of the new source for example by being deferred

There is a difference if one types <esc>[A or <cursor> up on this keyboard.
It is in timing. So I can differentiate between the two and
cursor up can be recognized as a single (extended) key.
: XKEY KEY BEGIN KEY? WHILE 8 LSHIFT KEY OR REPEAT ;

>
>Peter

P Falth

Apr 7, 2021, 1:14:36 PM

How do you manage to type <esc>[A ?
I cannot manage the fingers on my keyboard to do it.
It also returns immediately on <esc>

P Falth

Apr 7, 2021, 5:49:36 PM

On Wednesday, 7 April 2021 at 11:22:19 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >It will work on Linux but not on a Windows console.
> >
> >So first I need to split the character to print it as raw bytes and then
> >in EMIT put them together again. I guess this becomes cooked bytes now!
> >
> >The Windows console has no similarities with a Linux VT-console.
> >Fortunately Microsoft has introduced the Windows Terminal.
> >It has almost full support for UTF8 and VT codes. Printing utf8
> >strings to the console is now working.
> Ruvim reports in
> <https://forth-standard.org/proposals/emit-and-non-ascii-values#reply-627>:
>
> |I tested SP-Forth/4 in Windows (by setting UTF-8 code page in the
> |console via chcp 65001 command), and in Linux. The test:
> |
> |HEX C3 EMIT A4 EMIT
> |
> |outputs ä
> |
> |In SP-Forth the word EMIT is implemented via TYPE (that is via WRITE-FILE).
> |
> |In the test
> |
> |HEX C3 EMIT KEY DROP A4 EMIT
> |
> |we can see that after the first emit nothing is shown, and after the
> |second emit the character ä is shown.
>
> I don't know how SP-Forth/4 calls Windows, and whether Ruvim used the
> Windows Terminal, but it's apparently possible to implement EMIT in
> the "raw byte" way on Windows, too.

It has always been possible to set the codepage to 65001 for output.
It has not always worked correctly. One problem being wrong number of
written bytes reported. Input of utf8 has never worked and unfortunately not
even now with the new Windows Terminal

> I wonder if in VFX on Windows the "typical use" case works as intended
> if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
> don't pass the test) unless you tell the system that you use UTF-8
> (but these days, UTF-8 is the default setting).

No that does not make any change. But loading xchar.fth in VFX and
defining:
VARIABLE etmp
: EMIT etmp dup >r xc!+ r@ - r> swap type ;

Make emit work as I want!
$20ac emit € ok

This works on both Windows and Linux versions.
Also in Gforth that definition works.

> >> >If input or output is redirected key and emit should behave according
> >> >to the specifics of the new source for example by being deferred
> >> A deferred word should implement a common interface. In case of EMIT,
> >> all implementations should process a raw byte, or all implementations
> >> should process a code point. If you want EMIT to behave like XEMIT,
> >> then it should do that even when redirected to a serial port;
> >> conversely, if we want EMIT to process raw bytes, all the
> >> implementations of EMIT should do that.
> >
> >What I mean is that if emit is redirected to a serial port I would regard that
> >as having a character set of 0-255 and every byte will be raw bytes.
> But if the serial port is connected to something expecting UTF-8, the
> behaviour would differ from the behaviour of EMIT on the console: the
> "typical use" example would work, while it does not work on the
> console.
>
> In any case, I have written a proposal on the wording of EMIT
> <https://forth-standard.org/proposals/emit-and-non-ascii-values#contribution-184>,
> and you may want to contribute to it.

I have seen that. I have registered and will contribute

BR
Peter

Coos Haak

Apr 8, 2021, 6:57:24 AM

Op Wed, 7 Apr 2021 10:14:34 -0700 (PDT) schreef P Falth:

> How do you manage to type <esc>[A ?
> I cannot manage the fingers on my keyboard to do it.
> It also returns immediately on <esc>
>

HEX
: <ESC> 1B EMIT ;
: UP <ESC> S" [A" TYPE ;

groet, Coos

Anton Ertl

Apr 8, 2021, 7:32:10 AM

P Falth <peter....@gmail.com> writes:
>It has always been possible to set the codepage to 65001 for output.
>It has not always worked correctly. One problem being wrong number of
>written bytes reported.

Reported by whom? And why is that a problem?

>> I wonder if in VFX on Windows the "typical use" case works as intended
>> if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
>> don't pass the test) unless you tell the system that you use UTF-8
>> (but these days, UTF-8 is the default setting).
>
>No that does not make any change. But loading xchar.fth in VFX and
>defining:
>VARIABLE etmp
>: EMIT etmp dup >r xc!+ r@ - r> swap type ;
>
>Make emit work as I want!
>$20ac emit € ok

A simpler implementation of what you want is, after loading VFX's
xchar.fth:

: emit xemit ;

Or you could just use XEMIT directly, which is what I would recommend
if you want to deal with an xchar.

P Falth

Apr 8, 2021, 9:21:03 AM

On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >It has always been possible to set the codepage to 65001 for output.
> >It has not always worked correctly. One problem being wrong number of
> >written bytes reported.
> Reported by whom? And why is that a problem?

Google is our friend here. Here is one example from a Perl bug
https://github.com/perl/perl5/issues/13794

WriteFile returns characters written and not bytes. If your function
checks what has been written it can see a lower value than expected
and try to write the "missing" bytes. This is obviously not a problem
in our case as we expect the display to consume the whole string and
do not check.

>
> >> I wonder if in VFX on Windows the "typical use" case works as intended
> >> if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
> >> don't pass the test) unless you tell the system that you use UTF-8
> >> (but these days, UTF-8 is the default setting).
> >
> >No that does not make any change. But loading xchar.fth in VFX and
> >defining:
> >VARIABLE etmp
> >: EMIT etmp dup >r xc!+ r@ - r> swap type ;
> >
> >Make emit work as I want!
> >$20ac emit € ok
>
> A simpler implementation of what you want is, after loading VFX's
> xchar.fth:
>
> : emit xemit ;
>
> Or you could just use XEMIT directly, which is what I would recommend
> if you want to deal with an xchar.

I could also do
: EMIT dup $80 < if emit else xemit then ;

But this is silly! You mention somewhere that we do not have an XTYPE
as TYPE knows how to deal correctly with the string. I think the same should
be true for EMIT and KEY. They should know how to deal with a codepoint if
I implement Unicode support in my Forth.

This works on my systems. (I hope google does not mess this up too much)
'A' emit A ok
'Ä' emit Ä ok
'$' emit $ ok
'£' emit £ ok
'€' emit € ok

On Gforth I get
'A' emit A ok
'Ä' emit � ok
'$' emit $ ok
'£' emit � ok
'€' emit � ok

And you are saying that my system is non standard!
If I can enter a character from my keyboard I also expect EMIT to display it.

Yes I could have used XEMIT and both examples would have been the same.
But I see no use of them in any programs, people continue to use emit and key.
How many systems have implemented unicode support as part of the core
system and not as a loadable file buried in a library?

BR
Peter

Ruvim

Apr 8, 2021, 2:48:29 PM

On 2021-04-08 16:21, P Falth wrote:
> On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:

[...]

>> A simpler implementation of what you want is, after loading VFX's
>> xchar.fth:
>>
>> : emit xemit ;
>>
>> Or you could just use XEMIT directly, which is what I would recommend
>> if you want to deal with an xchar.
>
> I could also do
> : EMIT dup $80 < if emit else xemit then ;
>
> But this is silly!

Having EMIT that is equivalent to XEMIT is also silly.
Do you think it's worth deprecating XEMIT?

> You mention somewhere that we do not have an XTYPE
> as TYPE knows how to deal correctly with the string. I think the same should
> be true for EMIT and KEY. They should know how to deal with a codepoint if
> I implement Unicode support in my Forth.
>
> This works on my systems. ( I hope google does not mess up this to much)
> 'A' emit A ok
> 'Ä' emit Ä ok
> '$' emit $ ok
> '£' emit £ ok
> '€' emit € ok
>
> On Gforth I get
> 'A' emit A ok
> 'Ä' emit � ok
> '$' emit $ ok
> '£' emit � ok
> '€' emit � ok

The optional Extended-Character word set suggests that [CHAR] and
character literals return an xchar (code point), and that a program should
then use XEMIT to print it, as in:

'Ä' xemit

Perhaps emit may throw an exception if the given pchar cannot be a part
of a correct xchar in the sequence.

> And you are saying that my system is non standard!

Actually not because of that.

> If I can enter a character from my keyboard I also expect EMIT to display it.

Yes.

So the sequence "KEY EMIT" should be always correct (ditto "XKEY XEMIT")

Hence if EMIT handles only pchar, then KEY should also return only pchar.

The idea is that the expressions

s" Ä" type

s" Ä" over 1 type 1 /string type

s" Ä" drop dup c@ emit char+ c@ emit

should all produce the same result when UTF-8 encoding is used.

How would you explain it if they produced different results?
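
As a rough illustration of the third expression generalized to a string of any length, here is a minimal sketch. The name type-by-pchar is hypothetical (it is not in the standard or in this thread). It sends the string to EMIT one primitive character at a time, so on a system where the equivalence above holds it should display the same thing as TYPE.

```forth
\ Hypothetical helper: output a string one pchar at a time with EMIT.
: type-by-pchar ( c-addr u -- )
  0 ?do            \ loop u times; ?DO skips the body when u is 0
    dup c@ emit    \ fetch and emit the current pchar
    char+          \ advance to the next pchar
  loop drop ;      \ discard the final address

s" abc" type-by-pchar   \ compare against:  s" abc" type
```

Running both forms on a non-ASCII string is a quick way to see whether a given system treats EMIT as a pchar word.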

> Yes I could have used XEMIT and both examples would have been the same.

> But I see no use of them in any programs, people continue to use emit and key.

--
Ruvim

P Falth

Apr 8, 2021, 3:28:11 PM

On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
> On 2021-04-08 16:21, P Falth wrote:
> > On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:
> [...]
> >> A simpler implementation of what you want is, after loading VFX's
> >> xchar.fth:
> >>
> >> : emit xemit ;
> >>
> >> Or you could just use XEMIT directly, which is what I would recommend
> >> if you want to deal with an xchar.
> >
> > I could also do
> > : EMIT dup $80 < if emit else xemit then ;
> >
> > But this is silly!
> Having EMIT that is equivalent to XEMIT is also silly.
> Do you think it's worth to deprecate XEMIT ?

Yes and also XKEY, that is why I respond to this thread.

My Windows system uses UTF-16 for type. The UTF-8 string is converted
to UTF-16 before calling the OS WriteConsoleW function.
The Linux version types the UTF-8 string directly.

One of the problems with EMIT being restricted to pchars is that you need
to know the encoding of the underlying string as in your example above.

If EMIT takes a Unicode codepoint, as I suggest, the encoding does not
need to be known to the programmer.

Peter

Paul Rubin

Apr 8, 2021, 3:47:31 PM

P Falth <peter....@gmail.com> writes:
> This works on my systems. ( I hope google does not mess up this to much)
> 'A' emit A ok

> 'Ä' emit Ä ok ...

> And you are saying that my system is non standard!
> If I can enter a character from my keyboard I also expect EMIT to display it.

At odds here is an idea, I guess under dispute, that in the old days 1
character was 1 byte so EMIT would always send a byte; but now with
Unicode, some chars have multibyte encoding. So we have EMIT for bytes
and XEMIT for codepoints under whatever encoding.

We also had the idea that KEY would read a character (i.e. byte) from a
keyboard, but for decades before anyone cared about Unicode, keyboards
had cursor keys and function keys that send escape sequences. Should we
expect KEY to properly read those and encode them somehow? Are there
even Unicode codepoints for them (I don't know)?

What does your keyboard actually transmit when you type "Ä" (capital A
with umlaut, codepoint 00C4)? My guess is it actually sends an ISO
8859-1 character (single byte) which also happens to be 00C4 so your
EMIT possibly has to translate it to some other encoding like UTF-8 on
output. Do you want EMIT to also be able to display CJK characters if
your keyboard can transmit them? Or maybe your system simply displays
ISO 8859-1 directly and doesn't bother with Unicode. Today that is in
some ways an annoying legacy system, but it was a workable way to deal
with European alphabets for a while, and maybe still is for your
particular application.

It seems to me that 1) there's no point having EMIT and XEMIT as
separate words if they both do the same thing; 2) having a simple way to
read and write single bytes is still important.

P Falth

Apr 8, 2021, 4:55:53 PM

On Thursday, 8 April 2021 at 21:47:31 UTC+2, Paul Rubin wrote:
> P Falth <peter....@gmail.com> writes:
> > This works on my systems. ( I hope google does not mess up this to much)
> > 'A' emit A ok
> > 'Ä' emit Ä ok ...
> > And you are saying that my system is non standard!
> > If I can enter a character from my keyboard I also expect EMIT to display it.
> At odds here is an idea, I guess under dispute, that in the old days 1
> character was 1 byte so EMIT would always send a byte; but now with
> Unicode, some chars have multibyte encoding. So we have EMIT for bytes
> and XEMIT for codepoints under whatever encoding.
>
> We also had the idea that KEY would read a character (i.e. byte) from a
> keyboard, but for decades before anyone cared about Unicode, keyboards
> had cursor keys and function keys that send escape sequences. Should we
> expect KEY to properly read those and encode them somehow? Are there
> even Unicode codepoints for them (I don't know)?

KEY in my systems does not return function or cursor keys. EKEY does
this. They are coded in the 32 bit space not covered by Unicode codepoints.

>
> What does your keyboard actually transmit when you type "Ä" (capital A
> with umlaut, codepoint 00C4)? My guess is it actually send an ISO
> 8859-1 character (single byte) which also happens to be 00C4 so your
> EMIT possibly has to translate it to some other encoding like UTF-8 on
> output. Do you want EMIT to also be able to display CJK characters if
> your keyboard can transmit them? Or maybe your system simply displays
> ISO 8859-1 directly and doesn't bother with Unicode. Today that is in
> some ways an annoying legacy system, but it was workable way to deal
> with European alphabets for a while, and maybe still is for your
> particular application.

My systems, ntf on Windows and lxf on Linux, have supported Unicode for 20 years.
I have had no problems having emit and key working on Unicode codepoints.

The way to achieve this is very different on Windows and Linux.
On Windows a 2 byte codepoint is returned directly from the OS with KEY. This limits
the codepoints to the first 64K. This was a problem of the Windows console.
(Microsoft has now improved the console and my ntf64 can use the complete
Unicode codepoints). In the same way EMIT uses the OS WriteConsoleW to directly
write the 16 bit codepoint to the screen.

On Linux characters arrive as UTF-8 streams that are converted by KEY to the proper
codepoint. EMIT saves the codepoint as UTF-8 in a string that is sent to type

Internally strings are UTF-8 encoded. On Windows they are translated to UTF-16
inside TYPE before being sent to the OS for output.
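
The Linux-style EMIT described above (encode the codepoint as UTF-8 into a small buffer, then hand it to TYPE) might look roughly like the sketch below. The names u8buf, utf8!+ and xc-emit are hypothetical and this is not the actual lxf source; a system with the xchar word set could use XC!+ in place of utf8!+. No checking for surrogates or values above $10FFFF is done.

```forth
\ Sketch only: all names here are illustrative assumptions.
create u8buf 4 chars allot          \ room for the longest UTF-8 sequence

: utf8!+ ( xc addr -- addr' )       \ store xc as UTF-8 at addr, return next free address
  over $80 u< if tuck c! char+ exit then              \ 1 byte: plain ASCII
  over $800 u< if                                     \ 2 bytes
    over 6 rshift $C0 or over c! char+
    swap $3F and $80 or over c! char+ exit
  then
  over $10000 u< if                                   \ 3 bytes
    over 12 rshift $E0 or over c! char+
    over 6 rshift $3F and $80 or over c! char+
    swap $3F and $80 or over c! char+ exit
  then
  over 18 rshift $F0 or over c! char+                 \ 4 bytes
  over 12 rshift $3F and $80 or over c! char+
  over 6 rshift $3F and $80 or over c! char+
  swap $3F and $80 or over c! char+ ;

: xc-emit ( xc -- )                 \ encode into u8buf, then TYPE the bytes
  u8buf tuck utf8!+ over - type ;
```

With this, `$20AC xc-emit` stores the three bytes E2 82 AC in u8buf and passes them to TYPE, which is the path `$20ac emit` takes in the VFX example earlier in the thread.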

Two very different implementations, due to different operating system capabilities,
but totally transparent while using the systems.

> It seems to me that 1) there's no point having EMIT and XEMIT as
> separate words if they both do the same thing; 2) having a simple way to
> read and write single bytes is still important.

Yes XEMIT is not needed in my opinion
KEY and EMIT in my systems can read and write from 0 to 0x10FFFF.
0-0xFF is included in that range

BR
Peter

Ruvim

Apr 8, 2021, 4:57:25 PM

On 2021-04-08 22:28, P Falth wrote:
> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
>> On 2021-04-08 16:21, P Falth wrote:
>>> On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:
>> [...]
>>>> A simpler implementation of what you want is, after loading VFX's
>>>> xchar.fth:
>>>>
>>>> : emit xemit ;
>>>>
>>>> Or you could just use XEMIT directly, which is what I would recommend
>>>> if you want to deal with an xchar.
>>>
>>> I could also do
>>> : EMIT dup $80 < if emit else xemit then ;
>>>
>>> But this is silly!
>> Having EMIT that is equivalent to XEMIT is also silly.
>> Do you think it's worth to deprecate XEMIT ?
>
> Yes and also XKEY, that is why I respond to this thread.

But then we will have the problem (1) below.

[...]

>>> This works on my systems. ( I hope google does not mess up this to much)
>>> 'A' emit A ok
>>> 'Ä' emit Ä ok

>>> On Gforth I get
>>> 'A' emit A ok
>>> 'Ä' emit � ok

>> The optional Extended-Character word set suggests that [CHAR] and
>> character literal return xchar (code point) and then a program should
>> use XEMIT to print it as:
>>
>> 'Ä' xemit
>>
>> Perhaps emit may throw an exception if the given pchar cannot be a part
>> of a correct xchar in the sequence.

>>> If I can enter a character from my keyboard I also expect EMIT to display it.
>> Yes.
>> So the sequence "KEY EMIT" should be always correct (ditto "XKEY XEMIT")
>>
>> Hence if EMIT handles only pchar, then KEY should also return only pchar.
>>
>>
>>
>> The idea is that the expressions
>>
>> s" Ä" type
>>
>> s" Ä" over 1 type 1 /string type
>>
>> s" Ä" drop dup c@ emit char+ c@ emit
>>
>>
>> should all produce the same result when UTF-8 encoding is used.
>
> My windows system uses UTF-16 for type. The UTF-8 string is converted
> to UTF-16 before calling the OS WriteConsoleW function.
> The Linux version types the UTF-8 string directly.
>
> One of the problems with EMIT being restricted to pchars is that you need
> to know the encoding of the underlying string as in your example above.

Actually, you don't need to know the encoding. I mentioned encoding just
for the sake of the third expression, but it can be replaced by the
following variant:

s" Ä" over c@ emit 1 /string type

Now these three expressions should produce the same result regardless of
the encoding. Also, the result should be the same for any non-empty string.

And that is why the correct programs can continue to use EMIT and KEY.

If your system produces different results, how can you explain that? (1)

--
Ruvim

P Falth

Apr 8, 2021, 5:30:37 PM

No, this still depends on knowing the encoding of the string. The right way to
write it is

s" Ä" over xc@+ emit drop +x/string type

or with xemit if your emit and xemit are not the same

Peter

Ruvim

Apr 8, 2021, 7:30:14 PM

On 2021-04-09 00:30, P Falth wrote:
> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>> On 2021-04-08 22:28, P Falth wrote:
>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:

[...]

>>>> The idea is that the expressions
>>>>
>>>> s" Ä" type
>>>>
>>>> s" Ä" over 1 type 1 /string type
>>>>

[...]

>>> One of the problems with EMIT being restricted to pchars is that you need
>>> to know the encoding of the underlying string as in your example above.
>> Actually, you don't need to know the encoding.

[...]

>>
>> s" Ä" over c@ emit 1 /string type
>
> No this is still depending on knowing the encoding of the string.

If EMIT is restricted to pchar, why does this depend on the encoding
that a Forth system uses under the hood?

In a standard Forth system the results should be the same independently
of the encoding. Otherwise the system just is not standard compliant in
this aspect.

> The right way to write it is
>
> s" Ä" over xc@+ emit drop +x/string type
>
> or with xemit if your emit and xemit are not the same

Of course, by Forth-2012 you have to use xemit in this case. And
then this variant is also possible.

>> Now these tree expressions should produce the same result regardless of
>> the encoding. Also, the result should be the same for any non-empty string.

>> If your system produce the different results, how you can explain that?

Don't you think that your system, that uses the same encoding
independently of the platform, should produce the same result in Windows
and in Linux for each expression from my three above?

If not, what is your ground?

--
Ruvim

P Falth

Apr 9, 2021, 2:06:10 AM

On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
> On 2021-04-09 00:30, P Falth wrote:
> > On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
> >> On 2021-04-08 22:28, P Falth wrote:
> >>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
> [...]
> >>>> The idea is that the expressions
> >>>>
> >>>> s" Ä" type
> >>>>
> >>>> s" Ä" over 1 type 1 /string type
> >>>>
> [...]
> >>> One of the problems with EMIT being restricted to pchars is that you need
> >>> to know the encoding of the underlying string as in your example above.
> >> Actually, you don't need to know the encoding.
> [...]
> >>
> >> s" Ä" over c@ emit 1 /string type
> >
> > No this is still depending on knowing the encoding of the string.
> If EMIT is restricted to pchar, why this is depending on the encoding
> that a Forth system uses under the hood?

You use c@ to access a string you do not know the encoding of!

> In a standard Forth system the results should be the same independently
> of the encoding. Otherwise the system just is not standard compliant in
> this aspect.
> > The right way to write it is
> >
> > s" Ä" over xc@+ emit drop +x/string type
> >
> > or with xemit if your emit and xemit are not the same
> Of course, by the Forth-2012 you have to use xemit in this case. And
> then this variant is also possible.
> >> Now these tree expressions should produce the same result regardless of
> >> the encoding. Also, the result should be the same for any non-empty string.
> >> If your system produce the different results, how you can explain that?
> Don't you think that your system, that uses the same encoding
> independently of the platform, should produce the same result in Windows
> and in Linux for each expression from my three above?
>
> If not, what is your ground?

Internally both my Linux and Windows systems use UTF-8 encoded strings.
But the Windows systems translate this to a 16-bit char representation
inside type, to be able to write it to the screen with the WriteConsoleW
OS function. You remove one part of a multibyte char and send the remaining
string to type, which will see an illegal UTF-8 char to translate and will fail.

Br
Peter

>
>
> --
> Ruvim

Ruvim

Apr 9, 2021, 3:51:47 AM

On 2021-04-09 09:06, P Falth wrote:
> On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
>> On 2021-04-09 00:30, P Falth wrote:
>>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>>>> On 2021-04-08 22:28, P Falth wrote:
>>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
>> [...]
>>>>>> The idea is that the expressions
>>>>>>
>>>>>> s" Ä" type
>>>>>>
>>>>>> s" Ä" over 1 type 1 /string type
>>>>>>
>> [...]
>>>>> One of the problems with EMIT being restricted to pchars is that you need
>>>>> to know the encoding of the underlying string as in your example above.
>>>> Actually, you don't need to know the encoding.
>> [...]
>>>>
>>>> s" Ä" over c@ emit 1 /string type
>>>
>>> No this is still depending on knowing the encoding of the string.
>> If EMIT is restricted to pchar, why this is depending on the encoding
>> that a Forth system uses under the hood?
>
> You use c@ to access a string you do not know the encoding of!

It's allowed. And c@ returns pchar independently of the encoding.

Having a list of primitive characters that compose a text, what is your
suggested way to output them?

The Standard allows to just apply EMIT to each of them.

Do you think to also eliminate the notion of primitive character (pchar)
at all?

>> In a standard Forth system the results should be the same independently
>> of the encoding. Otherwise the system just is not standard compliant in
>> this aspect.
>>> The right way to write it is
>>>
>>> s" Ä" over xc@+ emit drop +x/string type
>>>
>>> or with xemit if your emit and xemit are not the same
>> Of course, by the Forth-2012 you have to use xemit in this case. And
>> then this variant is also possible.
>>>> Now these tree expressions should produce the same result regardless of
>>>> the encoding. Also, the result should be the same for any non-empty string.
>>>> If your system produce the different results, how you can explain that?
>> Don't you think that your system, that uses the same encoding
>> independently of the platform, should produce the same result in Windows
>> and in Linux for each expression from my three above?
>>
>> If not, what is your ground?
>
> Internally both my Linux and Windows systems uses UTF-8 encoded strings.
> But the Windows systems translate this to an 16 bit char representation
> inside type, to be able to write it to the screen with the WriteConsoleW
> OS function. You remove 1 part of a multibyte char and send the remaining
> string to type that will see an illegal utf8 char to translate and will fail.

I see, but a user should not have to immerse themselves in the implementation details.

The idea of the standard is that a standard program produces the same
results on the different standard systems.

In this case there are programs with an environmental dependency
concerning the graphical characters set (but not the character
encoding!), and your systems meet this dependency.

So I interpret the different results of the programs as a drawback in
the implementation of TYPE in ntf, and of EMIT in both ntf and lxf systems.

--
Ruvim

P Falth

Apr 9, 2021, 6:19:42 AM

On Friday, 9 April 2021 at 09:51:47 UTC+2, Ruvim wrote:
> On 2021-04-09 09:06, P Falth wrote:
> > On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
> >> On 2021-04-09 00:30, P Falth wrote:
> >>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
> >>>> On 2021-04-08 22:28, P Falth wrote:
> >>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
> >> [...]
> >>>>>> The idea is that the expressions
> >>>>>>
> >>>>>> s" Ä" type
> >>>>>>
> >>>>>> s" Ä" over 1 type 1 /string type
> >>>>>>
> >> [...]
> >>>>> One of the problems with EMIT being restricted to pchars is that you need
> >>>>> to know the encoding of the underlying string as in your example above.
> >>>> Actually, you don't need to know the encoding.
> >> [...]
> >>>>
> >>>> s" Ä" over c@ emit 1 /string type
> >>>
> >>> No this is still depending on knowing the encoding of the string.
> >> If EMIT is restricted to pchar, why this is depending on the encoding
> >> that a Forth system uses under the hood?
> >
> > You use c@ to access a string you do not know the encoding of!
> It's allowed. And c@ returns pchar independently of the encoding.

Yes, you are always allowed to fetch a byte with c@. Now take this string,
s" €Fälth". If the encoding is UTF-8 the dump of it is
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
00741C40 E2 82 AC 46 C3 A4 6C 74 68
and the length is 9 bytes.

If converted to UTF-16 it will look like this
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
00741C40 AC 20 46 00 E4 00 6C 00 74 00 68 00
and have the length 12 bytes.

So you can use c@ to fetch the individual bytes but they will be different
and there will also be a different number of them. What do you expect to
be able to do with those bytes?
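
To make the byte-versus-codepoint difference concrete, here is a small sketch that counts the code points in the UTF-8 dump above by skipping continuation bytes (those of the form 10xxxxxx). The names euro-falth and utf8-count are hypothetical, and the sketch assumes well-formed UTF-8; the bytes are laid down with C, so the example does not depend on the encoding of the source file.

```forth
\ The 9 bytes of the UTF-8 dump above, stored explicitly.
create euro-falth
  $E2 c, $82 c, $AC c,  $46 c,  $C3 c, $A4 c,  $6C c, $74 c, $68 c,

\ Hypothetical helper: count code points in a well-formed UTF-8 string.
: utf8-count ( c-addr u -- n )
  0 rot rot                      \ n c-addr u
  0 ?do
    dup c@ $C0 and $80 <> if     \ not a continuation byte: a new code point starts
      swap 1+ swap
    then
    char+
  loop drop ;

euro-falth 9 utf8-count .        \ prints 6
```

Nine bytes hold six code points here, which is why stepping through this string with C@ and CHAR+ visits three positions that are not the start of any character.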

> Having a list of primitive characters that compose a text, what is your
> suggested way to output them?

The correct way to output a string of text is with TYPE

> The Standard allows to just apply EMIT to each of them.
>
> Do you think to also eliminate the notion of primitive character (pchar)
> at all?

Yes in my opinion. It is a distinction that is not needed and just creates
confusion. It exposes the encoding of strings instead of hiding it and
making it transparent for the user.

> >> In a standard Forth system the results should be the same independently
> >> of the encoding. Otherwise the system just is not standard compliant in
> >> this aspect.
> >>> The right way to write it is
> >>>
> >>> s" Ä" over xc@+ emit drop +x/string type
> >>>
> >>> or with xemit if your emit and xemit are not the same
> >> Of course, by the Forth-2012 you have to use xemit in this case. And
> >> then this variant is also possible.
> >>>> Now these tree expressions should produce the same result regardless of
> >>>> the encoding. Also, the result should be the same for any non-empty string.
> >>>> If your system produce the different results, how you can explain that?
> >> Don't you think that your system, that uses the same encoding
> >> independently of the platform, should produce the same result in Windows
> >> and in Linux for each expression from my three above?
> >>
> >> If not, what is your ground?
> >
> > Internally both my Linux and Windows systems uses UTF-8 encoded strings.
> > But the Windows systems translate this to an 16 bit char representation
> > inside type, to be able to write it to the screen with the WriteConsoleW
> > OS function. You remove 1 part of a multibyte char and send the remaining
> > string to type that will see an illegal utf8 char to translate and will fail.
> I see, but a user should not immerse into the implementation details.
>
> The idea of the standard is that a standard program produces the same
> results on the different standard systems.

Yes on this I agree fully

Peter

Ruvim

Apr 9, 2021, 9:45:19 AM

On 2021-04-09 13:19, P Falth wrote:
> On Friday, 9 April 2021 at 09:51:47 UTC+2, Ruvim wrote:
>> On 2021-04-09 09:06, P Falth wrote:
>>> On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
>>>> On 2021-04-09 00:30, P Falth wrote:
>>>>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>>>>>> On 2021-04-08 22:28, P Falth wrote:
>>>>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
>>>> [...]
>>>>>>>> The idea is that the expressions
>>>>>>>>
>>>>>>>> s" Ä" type
>>>>>>>>
>>>>>>>> s" Ä" over 1 type 1 /string type
>>>>>>>>
>>>> [...]
>>>>>>> One of the problems with EMIT being restricted to pchars is that you need
>>>>>>> to know the encoding of the underlying string as in your example above.
>>>>>> Actually, you don't need to know the encoding.
>>>> [...]
>>>>>>
>>>>>> s" Ä" over c@ emit 1 /string type
>>>>>
>>>>> No this is still depending on knowing the encoding of the string.
>>>> If EMIT is restricted to pchar, why this is depending on the encoding
>>>> that a Forth system uses under the hood?
>>>
>>> You use c@ to access a string you do not know the encoding of!
>> It's allowed. And c@ returns pchar independently of the encoding.
>
> Yes you are always allowed to fetch a byte with c@.

In some system it might be two bytes, or even four bytes.

> Take now this string
> s" €Fälth" if the encoding is UTF-8 the dump of it is
> Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
> 00741C40 E2 82 AC 46 C3 A4 6C 74 68
> the lenght is 9 bytes
>
> If converted to UTF-16 it will look like this
> Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
> 00741C40 AC 20 46 00 E4 00 6C 00 74 00 68 00
> and have the lenght 12 bytes
>
> So you can use c@ to fetch the individual bytes but the will be different
> and there will also be a different number of them. What do you expect to
> be able to do with those bytes?

NB: if the system uses UTF-16 then a pchar size shall be 2 bytes.

So, if we have two systems, one uses UTF-8, and another uses UTF-16,
then for the primitive characters of the given string that correspond to
(F,l,t,h) C@ should return the same value in the both systems, since the
corresponding code points are in the range $20-$7F, and this range is
fixed in the standard.

For the remaining primitive characters in the string C@ returns distinct
values between two systems, but they should be greater than $7F in any
case. The number of these primitive characters is also different.

What can a program do with primitive characters?

Such words like SEARCH , COMPARE , STARTS-WITH , SUBSTRING-AFTER , SPLIT
, REPLACE , HASH and many other can be implemented in the terms of
pchars. A program can represent strings in the form of suffix tree, or
other data structures — also relying on pchars only. (2)
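
As one hedged example of (2), a prefix test can be written purely in terms of pchars. The definition below is only a sketch of what a word like STARTS-WITH might be; the stack effect is an assumption, and it relies on COMPARE from the String word set, which compares the two strings character by character without interpreting them.

```forth
\ Sketch: does string 1 begin with string 2? Works on pchars only.
: starts-with ( c-addr1 u1 c-addr2 u2 -- flag )
  rot over u< if          \ string 1 is shorter than the prefix
    2drop drop false exit
  then
  tuck compare 0= ;       \ compare the first u2 pchars of string 1 with the prefix
```

Because it never looks inside a pchar, the same definition gives the same answer whether the underlying strings are ASCII, UTF-8, or some other encoding, as long as both strings use the same one.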

Also, a program can change case for the alphabetical characters in the
range $20-$7F.

And certainly a program can output strings by output primitive
characters one by one.

>> Having a list of primitive characters that compose a text, what is your
>> suggested way to output them?
>
> The correct way to output a string of text is with TYPE

It's a list of primitive chars.

>> The Standard allows to just apply EMIT to each of them.
>>
>> Do you think to also eliminate the notion of primitive character (pchar)
>> at all?
>
> Yes in my opinion.

And then the words C@ and C! too?

> It is a distinction that is not needed

In the case of variable-width encoding, many operations (like examples
(2) above) become far less efficient if you operate on xchars instead of
pchars.

Another problem is that it breaks backward compatibility.

> and just creates confusion.
> It exposes the encoding of strings instead of hiding it and
> making it transparent for the user.

I don't see how it exposes the encoding of strings (in a bad sense). All
my examples in (2) are encoding independent.

--
Ruvim

none albert

Apr 9, 2021, 10:29:20 AM

In article <485b6239-c89f-4b33...@googlegroups.com>,

P Falth <peter....@gmail.com> wrote:
>
>Internally both my Linux and Windows systems uses UTF-8 encoded strings.
>But the Windows systems translate this to an 16 bit char representation
>inside type, to be able to write it to the screen with the WriteConsoleW
>OS function. You remove 1 part of a multibyte char and send the remaining
>string to type that will see an illegal utf8 char to translate and will fail.

My Forth uses buffers where the length in bytes is known. It doesn't
give a rat's ass whether a terminal shows that as Chinese characters.
Nowhere in my Forth is there any concern about individual characters.
Any attempt at case-insensitivity would spoil that (and would have to be
removed in the Chinese version of my Forth), so it is loadable.

>
>Br
>Peter

Groetjes Albert

none albert

Apr 9, 2021, 10:53:31 AM

In article <874kggr...@nightsong.com>,
Paul Rubin <no.e...@nospam.invalid> wrote:
<SNIP>

>What does your keyboard actually transmit when you type "Ä" (capital A
>with umlaut, codepoint 00C4)?

A keyboard doesn't transmit code points, it transmits scan codes.
You can change the key tops of most keyboards but a typical English
PC keyboard has no key with A umlaut.
If the key that is mostly marked A is pressed down it transmits
30 (0x1E). If you release the key it transmits 0x9E.
There is no such thing as uppercase keys. There are no separate a and A
keys on a keyboard. There are shift keys with their own scan codes.
The difference between a and A only exists in the imagination of your
computer, provided it succeeds in interpreting the shift keys correctly.
(and frankly even the association between scan code 0x1E and
the character a/A is in a table that can be tampered with.)

[There may be an exception in e.g. a numlock key that changes a keyboard's
behaviour]

P Falth

Apr 9, 2021, 12:05:18 PM

No that was proven to be a dead end. Jack Woehr created Jax forth 1993.
It had char=2 and used UCS-2 (UTF-16 limited to 2 bytes) 2048 bytes block etc.
It did not catch on, nobody to my knowledge continued that approach.
I still have a copy. It started without problem. My example string worked also.

When I started developing Unicode support for my system I tested the idea
of having a variable sized char, CHAR+ could be 1+ 2+ 3+ or 4+. I soon gave up
and adopted the idea of a specific xchar wordset that was being developed by
Anton and Bernd at that time.

Why do you insist on using C@ to traverse a string when XC@+ hides all details
and works perfectly?

> So, if we have two systems, one uses UTF-8, and another uses UTF-16,
> then for the primitive characters of the given string that correspond to
> (F,l,t,h) C@ should return the same value in the both systems, since the
> corresponding code points are in the range $20-$7F, and this range is
> fixed in the standard.
>
> For the rest primitive characters in the string C@ returns distinct
> values between two systems, but they should be greater than $7F in any
> case. The number of these primitive characters is also different.
>
>
> What a program can do with primitive characters?
>
> Such words like SEARCH , COMPARE , STARTS-WITH , SUBSTRING-AFTER , SPLIT
> , REPLACE , HASH and many other can be implemented in the terms of
> pchars. A program can represent strings in the form of suffix tree, or
> other data structures — also relying on pchars only. (2)

They worked perfectly well on my systems before pchars were invented and
they continue to work afterwards. COMPARE does a binary comparison,
byte by byte.

> Also, a program can change case for the alphabetical characters in the
> range $20-$7F.

My case conversion works for all Unicode codepoints that have a case property.
Or at least it used to; my translation tables have not been updated for some time.

> And certainly a program can output strings by output primitive
> characters one by one.
> >> Having a list of primitive characters that compose a text, what is your
> >> suggested way to output them?
> >
> > The correct way to output a string of text is with TYPE
> It's a list of primitive chars.

But surely they can be stored in a string?

> >> The Standard allows to just apply EMIT to each of them.
> >>
> >> Do you think to also eliminate the notion of primitive character (pchar)
> >> at all?
> >
> > Yes in my opinion.
> And then the words C@ and C! two?

They are fundamental and have many uses!
Not everything you manipulate in memory are strings.

Peter

Ruvim

Apr 9, 2021, 1:18:49 PM

It seems you miss the option that an address unit can be also two bytes
in this case, and then a char size is still 1 (i.e., 1 address unit).

> Jack Woehr created Jax forth 1993.
> It had char=2 and used UCS-2 (UTF-16 limited to 2 bytes) 2048 bytes block etc.
> It did not catch on, nobody to my knowledge continued that approach.
> I still have a copy. It started without problem. My example string worked also.

> When I started developing Unicode support for my system I tested the idea
> of having a variable sized char, CHAR+ could be 1+ 2+ 3+ or 4+. I soon gave up
> and adopted the idea of a specific xchar wordset that was being developed by
> Anton and Berndt at that time.

> Why do you insist of using C@ to traverse a string when XC@+ hides all details
> and works perfectly?

I only insist to have a choice, and to have the *ability* of traversing
a string via C@ — because it's backward compatible (old programs
continue to work in new UTF-based systems), it's simpler, and it has
better performance.

And when you need to treat the actual code points beyond a char, you can
use X* words.

[...]

>> What a program can do with primitive characters?
>>
>> Such words like SEARCH , COMPARE , STARTS-WITH , SUBSTRING-AFTER , SPLIT
>> , REPLACE , HASH and many other can be implemented in the terms of
>> pchars. A program can represent strings in the form of suffix tree, or
>> other data structures — also relying on pchars only. (2)
>
> They worked perfectly well on my systems before pchars were invented and
> they continue to work afterwards. COMPARE do a binary comparison,
> byte by byte.
>

>> Also, a program can change case for the alphabetical characters in the
>> range $20-$7F.
>
> My case conversion works for all Unicode codepoints that have a case property.
> Or used at least, my translation tables are not updated for some time.

But your conversion works beyond pchar, whereas I mentioned only the things
that can be done in a portable manner within the frame of pchars.

>> And certainly a program can output strings by output primitive
>> characters one by one.
>>>> Having a list of primitive characters that compose a text, what is your
>>>> suggested way to output them?
>>>
>>> The correct way to output a string of text is with TYPE
>> It's a list of primitive chars.
>
> But surely they can be stored in a string?

Yes.

Actually it's a possible approach that the standard doesn't provide a
way to print a pchar, but only a string. Although it doesn't solve any
problem, since the expression:

s" Ä" over 1 type 1 /string type

should be equivalent to the expression:

s" Ä" type

(for any non-empty string)

And a program still can define:

variable _tmp
: my-emit ( pchar -- ) _tmp c! _tmp 1 type ;

So it's better to just standardize this word.

The only choice is what the old EMIT word should do, and which word
to introduce: PEMIT or XEMIT.

If you introduce PEMIT (and change EMIT) then C@ EMIT becomes non-standard.

If you introduce XEMIT (and keep EMIT) then [CHAR] X EMIT becomes
non-standard for non-ASCII X.

It seems the second variant is better for backward compatibility.

>>>> The Standard allows to just apply EMIT to each of them.
>>>>
>>>> Do you think to also eliminate the notion of primitive character (pchar)
>>>> at all?
>>>
>>> Yes in my opinion.

>> And then the words C@ and C! too?

>
> They are fundamental and have many uses!
> Not everything you manipulate in memory are strings.

Then what data type does C@ return?

NB: in the standard 'char' usually means 'pchar'
"Unless otherwise stated, a "character" refers to a primitive character"
(3.1.2.3)

--
Ruvim

P Falth

Apr 9, 2021, 4:30:21 PM

That will also work on my systems when your strings/chars are below $7F.
Where we differ are when the char is between $80 and $FF.
What should $C4 emit do?

> And when you need to treat the actual code points beyond a char, you can
> use X* words.

Yes we agree on this!

But only if your strings are UTF-8, Should that be standardized?
And if your TYPE allows to type incomplete UTF-8 sequences.

Is it really a good idea to have TYPE type incomplete sequences?
Could that not lead to a security problem?

Ruvim

Apr 9, 2021, 6:57:57 PM

On 2021-04-09 23:30, P Falth wrote:
> On Friday, 9 April 2021 at 19:18:49 UTC+2, Ruvim wrote:
>> On 2021-04-09 19:05, P Falth wrote:
>>> On Friday, 9 April 2021 at 15:45:19 UTC+2, Ruvim wrote:

[...]

>>>> NB: if the system uses UTF-16 then a pchar size shall be 2 bytes.

[...]

>>> Why do you insist of using C@ to traverse a string when XC@+ hides all details
>>> and works perfectly?
>> I only insist to have a choice, and to have the *ability* of traversing
>> a string via C@ — because it's backward compatible (old programs
>> continue to work in new UTF-based systems), it's simpler, and it has
>> better performance.
>
> That will also work on my systems when your strings/chars are below $7F.
> Where we differ are when the char is between $80 and $FF.
> What should $C4 emit do?

By the current standard, it should send $C4 as a primitive character to
the user output device. If it's a terminal, then what is shown depends
on the character encoding (and the terminal's state), and it's an
implementation defined thing.

But iff EMIT doesn't treat some pchar argument as a primitive character
then such a string exists that the following word:

: test-emit ( c-addr u -- ) cr 2dup type cr
dup if over c@ emit 1 /string then type cr
;

will show two different lines when it's applied to this string.

>> And when you need to treat the actual code points beyond a char, you can
>> use X* words.
>
> Yes we agree on this!

>>

>> Actually it's a possible approach that the standard doesn't provide a
>> way to print a pchar, but only a string. Although it doesn't solve any
>> problem,
>> since the expression:
>> s" Ä" over 1 type 1 /string type
>>
>> should be equivalent to the expression:
>>
>> s" Ä" type
>
> But only if your strings are UTF-8, Should that be standardized?

Not only. This equivalence is true (should be true) for *any* encoding
(when the argument is any non-empty string that contains characters from
this encoding).

E.g., it's true for ASCII, ISO 8859-1, UTF-8, UTF-16, UTF-32, or
anything else.

A particular character encoding is not fixed and, I think, it should not
be fixed in the standard. A system can provide a function to convert a
string from the internal representation to the given encoding.

> And if your TYPE allows to type incomplete UTF-8 sequences.

BTW, incomplete characters can appear in any variable-width character
encoding, e.g. in UTF-16 as well as in UTF-8.

> Is it really a good idea to have TYPE type incomplete sequences?

It's a good idea since it makes programs simpler and moves some
complexity to the side of the underlying system.

And a program may be unaware whether a sequence is complete or not.

A terminal just doesn't show an incomplete character until it's
completed (or made incorrect and then it's replaced by some special
character).

> Could that not lead to a security problem?

It should not lead to any problem, since a user may compose and send to
the terminal even incorrect strings.

--
Ruvim

dxforth

Apr 9, 2021, 11:00:20 PM

On 9/04/2021 23:45, Ruvim wrote:
> On 2021-04-09 13:19, P Falth wrote:
>>
>> Yes you are always allowed to fetch a byte with c@.
>
> In some system it might be two bytes, or even four bytes.

Indeed. Even ANS file functions are expressed in terms of 'characters' -
which can be anything from 8 bits to 1 cell width.
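
A minimal sketch of what this means for code that walks a string: stepping
with CHAR+ (rather than 1+) keeps the loop correct whatever the char size
is. The names type-portable and .char-size are made up for illustration;
ADDRESS-UNIT-BITS is the standard environment query.

```forth
\ Sketch only: a TYPE-like loop that makes no assumption about char size.
: type-portable ( c-addr u -- )
  0 ?do  dup c@ emit  char+  loop  drop ;

\ Report how this system sizes a char (the output differs between systems).
: .char-size ( -- )
  1 chars . ." address unit(s) per char"
  s" ADDRESS-UNIT-BITS" environment? if
    ." , " . ." bits per address unit"
  then cr ;
```

On a byte-addressed system with 8-bit chars, CHAR+ is the same as 1+, so
nothing is lost by writing the loop this way.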

none albert

Apr 10, 2021, 8:35:03 AM

My 2 cent. The UTF-blabla are sequential data structures.
It is a design error to make single character output the base
instead of the writing of the whole sequential data structure.

Ruvim

Apr 10, 2021, 9:51:00 AM

On 2021-04-10 15:21, albert wrote:
> In article <s4r4c2$77e$1...@gioia.aioe.org>, dxforth <dxf...@gmail.com> wrote:
>> On 9/04/2021 23:45, Ruvim wrote:
>>> On 2021-04-09 13:19, P Falth wrote:
>>>>
>>>> Yes you are always allowed to fetch a byte with c@.
>>>
>>> In some system it might be two bytes, or even four bytes.
>>
>> Indeed. Even ANS file functions are expressed in terms of 'characters' -
>> which can be anything from 8 bits to 1 cell width.
>
> My 2 cent. The UTF-blabla are sequential data structures.

JFYI, UTF-32 is not a variable-width character encoding but fixed-width
at the moment.

> It is a design error to make single character output the base
> instead of the writing of the whole sequential data structure.

When we write text from a socket into a file, we call WRITE-FILE in a
loop, and it doesn't matter where a boundary occurs between the
written parts.

When we send text from a socket to the output device, you suggest to
parse extended characters before sending them and maintain a buffer for
the last uncompleted extended character. And it's a bad design.

In a good design, an API is the same for sending text to the output
device and for writing text to a file, or to a socket.

--
Ruvim

minf...@arcor.de

Apr 10, 2021, 12:35:18 PM

There are two different tasks:

1) send a UTFx text stream to an output device, then let the device do the rendering
and cursor control, and bother no more.

2) do UTFx text processing in your own machine, then use some UTFx-aware
string processing words (e.g. based on the XCHAR wordset)

Many people don't separate cleanly between these tasks and get lost in
UTF en/decoding intricacies. Umlauts are easy, try Taiwanese...

Ruvim

Apr 10, 2021, 12:59:50 PM

On 2021-04-09 09:06, P Falth wrote:

> On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
>> On 2021-04-09 00:30, P Falth wrote:
>>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>>>> On 2021-04-08 22:28, P Falth wrote:
>>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
>> [...]
>>>>>> The idea is that the expressions
>>>>>>
>>>>>> s" Ä" type
>>>>>>
>>>>>> s" Ä" over 1 type 1 /string type
>>>>

>>>> s" Ä" over c@ emit 1 /string type

[...]

>>>> Now these three expressions should produce the same result regardless of
>>>> the encoding. Also, the result should be the same for any non-empty string.
>>>> If your system produces different results, how can you explain that?
>> Don't you think that your system, that uses the same encoding
>> independently of the platform, should produce the same result in Windows
>> and in Linux for each expression from my three above?
>>
>> If not, what is your ground?
>
> Internally both my Linux and Windows systems uses UTF-8 encoded
> strings.
>
> But the Windows systems translate this to an 16 bit char representation
> inside type, to be able to write it to the screen with the WriteConsoleW
> OS function.

By my test, in Windows 10 (at least, but not in Windows 7) translating
the encoding (and possibly buffering incomplete characters) can be avoided
for output to the console.

For that you need to set the output code page to 65001 (UTF-8) via
SetConsoleOutputCP, and then just use WriteFile with stdout handle. Also
restore the original code page before terminating.

In Windows 7, using CP 65001 and WriteFile with the console handle, you
are still required to send only completed UTF-8 strings, and WriteFile
returns the number of written extended characters, instead of the number
of written bytes [1].

But Windows 7 is already discontinued, so I would support in a Forth
system only ASCII for it (and for earlier versions of Windows).

[1] https://github.com/microsoft/terminal/issues/396#issuecomment-480675296

--
Ruvim

P Falth

Apr 10, 2021, 1:29:24 PM

Yes the last years Microsoft has done a lot of work to move the console to
UTF-8, they now also support VT-sequences. Unfortunately ReadFile does not
yet support reading of UTF-8 sequences from the console. For my new 64
bit Forth I have converted the windows version to use this.

My 32 bit ntf runs on Windows 2000 and forward. I have no plans to change this

Peter

none albert

Apr 10, 2021, 1:44:30 PM

In article <s4sag2$q3q$1...@dont-email.me>, Ruvim <ruvim...@gmail.com> wrote:
>On 2021-04-10 15:21, albert wrote:
>> In article <s4r4c2$77e$1...@gioia.aioe.org>, dxforth <dxf...@gmail.com> wrote:
>>> On 9/04/2021 23:45, Ruvim wrote:
>>>> On 2021-04-09 13:19, P Falth wrote:
>>>>>
>>>>> Yes you are always allowed to fetch a byte with c@.
>>>>
>>>> In some system it might be two bytes, or even four bytes.
>>>
>>> Indeed. Even ANS file functions are expressed in terms of 'characters' -
>>> which can be anything from 8 bits to 1 cell width.
>>
>> My 2 cent. The UTF-blabla are sequential data structures.
>
>JFYI, UTF-32 is not a variable-width character encoding but fixed-width
>at the moment.
>
>> It is a design error to make single character output the base
>> instead of the writing of the whole sequential data structure.
>
>When we write text from a socket into a file, we call WRITE-FILE in a
>loop, and it doesn't matter where is a boundary occurs between the
>written parts.

That is exactly the way I want to do it.

>
>When we send text from a socket to the output device, you suggest to
>parse extended characters before sending them and maintain a buffer for
>the last uncompleted extended character. And it's a bad design.

That is not what I want. At the level of sending the socket
may be unaware of what are the characters. That is good.

>
>In a good design, an API is the same for sending text to the output
>device and for writing text to a file, or to a socket.

right.
>
>
>--
>Ruvim

Anton Ertl

Apr 12, 2021, 11:46:53 AM

P Falth <peter....@gmail.com> writes:
>On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:

>> P Falth <peter....@gmail.com> writes:
>> >It has always been possible to set the codepage to 65001 for output.
>> >It has not always worked correctly. One problem being wrong number of
>> >written bytes reported.
>> Reported by whom? And why is that a problem?

>
>Google is our friend here. Here is one example from a Perl bug
>https://github.com/perl/perl5/issues/13794
>
>WriteFile returns characters written and not bytes. If your function
>checks what has been written it can see a lower value than expected
>and try to write the "missing" bytes. This is obviously not a problem
>in our case as we expect the display to consume the whole string and
>do not check.

Certainly Forth's WRITE-FILE and TYPE do not report how many bytes
(aka chars) nor how many xchars were written. If WRITE-FILE is
implemented on top of something like Unix write() which can write less
than the whole buffer thanks to EINTR, then WRITE-FILE should
implement a loop. And if Windows' WriteFile combines something
EINTR-like with returning the number of code points written, then
WRITE-FILE has to work around that. That's not nice, but on a system
that can support Windows, the additional code for working around that
won't be a problem.
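
A minimal sketch of such a loop, under stated assumptions: partial-write is
a made-up stand-in for an OS primitive that may accept only part of the
buffer and reports how many chars it took (here it simply TYPEs at most
three chars at a time). A real WRITE-FILE would call the OS instead and
would also have to handle errors.

```forth
\ Hypothetical primitive: takes at most 3 chars, returns the count taken.
: partial-write ( c-addr u -- n )
  3 min tuck type ;

\ Keep calling the primitive until the whole buffer has been written.
: write-all ( c-addr u -- )
  begin dup 0> while
    2dup partial-write      ( c-addr u n )
    /string                 \ skip the n chars already written
  repeat 2drop ;
```

If the primitive reported code points instead of chars, as described for
WriteFile above, the /string step would need the char count of what was
actually written rather than the returned value.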

>> A simpler implementation of what you want is, after loading VFX's
>> xchar.fth:
>>
>> : emit xemit ;
>>
>> Or you could just use XEMIT directly, which is what I would recommend
>> if you want to deal with an xchar.
>
>I could also do
>: EMIT dup $80 < if emit else xemit then ;
>
>But this is silly!

Yes. It just is more complex than : emit xemit ; but otherwise the
same.

>You mention somewhere that we do not have an XTYPE
>as TYPE knows how to deal correctly with the string. I think the same should
>be true for EMIT and KEY.

Yes that was one option when we did the xchars proposal. This was
actually what we originally proposed (section 2.2 of
<http://www.euroforth.org/ef05/ertl-paysan05.pdf>). But Stephen Pelc
pointed out that he and his customers use EMIT and KEY for dealing
with raw bytes, so the proposal was changed to add XEMIT and XKEY for
dealing with xchars, while EMIT and KEY was left alone.

Of course, leaving EMIT alone means that a standard program can use
only printable ASCII characters with EMIT (and only get ASCII
characters out of KEY), so a : emit xemit ; implementation is
Forth-2012, even though it's not what the committee intended.

>This works on my systems. ( I hope google does not mess up this to much)
>'A' emit A ok

>'=C3=84' emit =C3=84 ok
>'$' emit $ ok
>'=C2=A3' emit =C2=A3 ok
>'=E2=82=AC' emit =E2=82=AC ok

>
>On Gforth I get
>'A' emit A ok

>'=C3=84' emit =EF=BF=BD ok
>'$' emit $ ok
>'=C2=A3' emit =EF=BF=BD ok
>'=E2=82=AC' emit =EF=BF=BD ok

Without quoted-printable:

'ä' emit �

This example uses a literal written in the Forth-2012 syntax 'C' and
it contains a UTF-8 character (i.e., it uses the Xchar wordset). In
this context, the programmer can be expected to use Forth-2012 XEMIT.

Consider this Forth-94 program:

: type1 ( c-addr u -- )
over + swap ?do i c@ emit loop ;

With a raw-byte EMIT, this works even if we pass some UTF-8 to it:

s" ä" type1

With EMIT=XEMIT, it does not. For such a program written before
Forth-2012, the programmer could not be expected to write it
differently to accommodate UTF-8 strings, but fortunately this is not
necessary if the system supports raw-byte EMIT.

This example is not particularly strong, because one can come up with
an analogous example where EMIT=XEMIT looks better:

\ Forth-94 program
: emit1 emit ;

\ a Forth-2012 user tries to use it with UTF-8:
'ä' emit1

So the question is which kind of usage occurs more often, and
conversely, which kind of EMIT causes more breakage. My experience is
that even older versions of Gforth, SwiftForth, and VFX (all of which
use raw-byte EMIT and KEY and were not written for xchars) handle
UTF-8 input and output nicely, with two exceptions: 1) command-line
editing and 2) position indication for error messages. These two
exceptions have little to do with EMIT, however.

The rest works nicely, though, including command-line input without
editing. Consider what happens in a Forth system when it processes
the following example:

: ä s" ä" ;
ä dump

I don't know how much EMIT happens in this code, but Gforth definitely
inputs every (p)char with KEY and then uses C!. This would definitely
break for non-ASCII characters if KEY=XKEY.

To demonstrate how well older (Xchar-unaware) code works with UTF-8 in
the presence of raw-byte KEY and EMIT, I used the example above on
older Forth systems that were written UTF-8 unaware.

SwiftForth i386-Linux 3.4.4 31-Jul-2012
: ä s" ä" ; ok
ä dump
8082AE5 C3 A4 .. ok

Continuing to
VFX Forth for Linux IA32
Version: 4.30 RC1 [build 0324]
Build date: 1 June 2009
: ä s" ä" ; ok
ä dump
080B:7356 C3 A4 C3 04 64 75 6D 70 00 00 00 00 00 00 00 00 C$C.dump........

GForth 0.4.0, Copyright (C) 1998 Free Software Foundation, Inc.
GForth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
: ä s" ä" ; ok
ä dump
F752422D: C3 A4 - ..

So we see that UTF-8 works relatively nicely on code that was not
written to be aware of UTF-8, as long as KEY and EMIT work on raw
bytes.

>If I can enter a character from my keyboard I also expect EMIT to display it.

That would certainly be a desirable property, but IMO the ability of
running existing programs with few, if any changes is more desirable.

>Yes I could have used XEMIT and both examples would have been the same.

>But I see no use of them in any programs, people continue to use emit and key.

Which is fine in many cases, as shown above.

>How many systems have implemented unicode support as part of the core
>system and not as a loadable file buried in a library?

Gforth has UTF-8 in gforth.fi (the standard image), and uses it in
UTF-8 setups.

But that hardly matters: Most stuff works on strings, and, as
demonstrated above, even for code that works on individual bytes, the
magic of UTF-8 makes most of this code work fine. That's one reason
why UTF-8 has won.

Anton Ertl

Apr 12, 2021, 11:51:05 AM

P Falth <peter....@gmail.com> writes:
>If EMIT take a Unicode codepoint as I suggest the encoding does not
>need to be know to the programmer

If you want to output Unicode code points, use XEMIT (and set up the
system appropriately).

If you want to output raw bytes, use EMIT.

In neither case, does the programmer need to know the encoding.

Anton Ertl

Apr 12, 2021, 12:08:51 PM

Paul Rubin <no.e...@nospam.invalid> writes:
> but for decades before anyone cared about Unicode, keyboards
>had cursor keys and function keys that send escape sequences. Should we
>expect KEY to properly read those and encode them somehow? Are there
>even Unicode codepoints for them (I don't know)?

No.

Already Forth-94 has EKEY for this kind of usage, and Forth-2012 has
extended it with words like EKEY>FKEY and K-UP to allow checking for
cursor keys.

>What does your keyboard actually transmit when you type "Ä" (capital A
>with umlaut, codepoint 00C4)?

xev reports:

KeyPress event, serial 34, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198524391, (117,102), root:(155,197),
state 0x0, keycode 50 (keysym 0xffe1, Shift_L), same_screen YES,
XLookupString gives 0 bytes:
XmbLookupString gives 0 bytes:
XFilterEvent returns: False

KeyPress event, serial 37, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198525304, (117,102), root:(155,197),
state 0x1, keycode 48 (keysym 0xc4, Adiaeresis), same_screen YES,
XLookupString gives 2 bytes: (c3 84) "Ä"
XmbLookupString gives 2 bytes: (c3 84) "Ä"
XFilterEvent returns: False

KeyRelease event, serial 37, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198525463, (117,102), root:(155,197),
state 0x1, keycode 48 (keysym 0xc4, Adiaeresis), same_screen YES,
XLookupString gives 2 bytes: (c3 84) "Ä"
XFilterEvent returns: False

KeyRelease event, serial 37, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198525780, (117,102), root:(155,197),
state 0x1, keycode 50 (keysym 0xffe1, Shift_L), same_screen YES,
XLookupString gives 0 bytes:
XFilterEvent returns: False

>My guess is it actually send an ISO
>8859-1 character (single byte) which also happens to be 00C4 so your
>EMIT possibly has to translate it to some other encoding like UTF-8 on
>output.

As you can see, the keyboard transmits events that contain key codes,
and X translates the event to a keysym (which seems to use Latin-1 or
the Unicode code point) and a string (which uses UTF-8). Let's take
an example where there is no overlap between Unicode code points and
Latin-1: Pressing AltGr-E on a German keyboard, giving the Euro sign:

KeyPress event, serial 37, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198870996, (161,147), root:(199,242),
state 0x0, keycode 108 (keysym 0xfe03, ISO_Level3_Shift), same_screen YES,
XKeysymToKeycode returns keycode: 92
XLookupString gives 0 bytes:
XmbLookupString gives 0 bytes:
XFilterEvent returns: False

KeyPress event, serial 37, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198872498, (161,147), root:(199,242),
state 0x80, keycode 26 (keysym 0x20ac, EuroSign), same_screen YES,
XLookupString gives 3 bytes: (e2 82 ac) "€"
XmbLookupString gives 3 bytes: (e2 82 ac) "€"
XFilterEvent returns: False

KeyRelease event, serial 37, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198872574, (161,147), root:(199,242),
state 0x80, keycode 26 (keysym 0x20ac, EuroSign), same_screen YES,
XLookupString gives 3 bytes: (e2 82 ac) "€"
XFilterEvent returns: False

KeyRelease event, serial 37, synthetic NO, window 0x4400001,
root 0x13f, subw 0x0, time 198872931, (161,147), root:(199,242),
state 0x80, keycode 108 (keysym 0xfe03, ISO_Level3_Shift), same_screen YES,
XKeysymToKeycode returns keycode: 92
XLookupString gives 0 bytes:
XFilterEvent returns: False

So X uses the Unicode code point for the keysym.

Anton Ertl

Apr 12, 2021, 12:21:56 PM

P Falth <peter....@gmail.com> writes:
>On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:

>> On 2021-04-09 00:30, P Falth wrote:
>> > On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>> >> s" Ä" over c@ emit 1 /string type
>> >
>> > No this is still depending on knowing the encoding of the string.
>> If EMIT is restricted to pchar, why this is depending on the encoding
>> that a Forth system uses under the hood?

>
>You use c@ to access a string you do not know the encoding of!

Yes, and the nice thing about UTF-8 is that, with a raw-byte c@, a
raw-byte EMIT, and a raw-byte TYPE, this works.

>Internally both my Linux and Windows systems uses UTF-8 encoded strings.

>But the Windows systems translate this to an 16 bit char representation
>inside type, to be able to write it to the screen with the WriteConsoleW
>OS function. You remove 1 part of a multibyte char and send the remaining
>string to type that will see an illegal utf8 char to translate and will fail.

I see two good ways to solve this:

1) You use a different Windows function that works more like Unix'
write(); I don't know if WriteConsoleA is such a function.

2) Your TYPE and EMIT buffer incomplete code points, and only convert
complete code points and send it to WriteConsoleW.

Your current way passes the Windows pain on to the Forth programmers.
Not a good idea.
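
A rough sketch of the core of option 2, assuming UTF-8 strings and
byte-sized chars: a word that tells how many leading bytes of a buffer end
on a code-point boundary, so that TYPE could convert and send those and
hold back the rest until more bytes arrive. All three names are invented
for illustration, and no validation of the UTF-8 is attempted.

```forth
\ Sequence length implied by a lead byte (1 for ASCII or a stray
\ continuation byte, which is passed through unchecked).
: utf8-seq-len ( byte -- n )
  dup $C0 < if drop 1 exit then
  dup $E0 < if drop 2 exit then
  dup $F0 < if drop 3 exit then
  drop 4 ;

: continuation? ( byte -- flag )  $C0 and $80 = ;

\ u' = number of leading bytes that do not end in an incomplete sequence.
: utf8-complete ( c-addr u -- u' )
  dup 0= if nip exit then
  dup >r                          ( c-addr u ) ( R: u )
  begin
    1-                            ( c-addr i )
    2dup + c@ continuation?       \ still inside a sequence?
    over 0> and                   \ stop at the start of the buffer
    over r@ swap - 4 < and        \ look back over at most 4 bytes
  while repeat
  tuck + c@ utf8-seq-len          ( i need )
  over r@ swap -                  ( i need have )
  > if r> drop else drop r> then ;
```

For the two bytes E2 82 (a Euro sign missing its last byte) the word
returns 0, so nothing would be sent yet; once AC arrives, all three bytes
can be converted and passed on together.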

Anton Ertl

Apr 12, 2021, 12:58:22 PM

P Falth <peter....@gmail.com> writes:
>On Friday, 9 April 2021 at 15:45:19 UTC+2, Ruvim wrote:

>> On 2021-04-09 13:19, P Falth wrote:
>> > Take now this string
>> > s" €Fälth" if the encoding is UTF-8 the dump of it is
>> > Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
>> > 00741C40 E2 82 AC 46 C3 A4 6C 74 68
>> > the length is 9 bytes
>> >
>> > If converted to UTF-16 it will look like this
>> > Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
>> > 00741C40 AC 20 46 00 E4 00 6C 00 74 00 68 00
>> > and have the length 12 bytes
>> >
>> > So you can use c@ to fetch the individual bytes but they will be different
>> > and there will also be a different number of them. What do you expect to
>> > be able to do with those bytes?
>> NB: if the system uses UTF-16 then a pchar size shall be 2 bytes.

>
>No that was proven to be a dead end.

Ok, if we exclude 2-byte chars, we also exclude UTF-16 as xchar
encoding. So the second memory block is not a valid string encoding,
just a stretch of memory. So with the first memory block I expect to
be able to TYPE it, and to EMIT each byte from the first byte to the
last.

>Why do you insist of using C@ to traverse a string when XC@+ hides all details
>and works perfectly?

You can use either: You can use C@ to traverse a string byte-by-byte
(and, e.g., output it byte-by-byte with EMIT), and you can use XC@+ to
traverse the string codepoint-by-codepoint (and, e.g., output it with
XEMIT). In many cases, you don't need codepoints, so doing stuff
byte-by-byte, or the whole string at once will do fine (and is more
efficient).
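
A minimal sketch of the two traversals side by side, assuming a system
that provides the Xchar wordset (XC@+ and XEMIT) and byte-sized chars. The
names type-bytes, type-xchars and xchar-count are made up for illustration.

```forth
\ Byte-by-byte: works on any string, whatever its encoding.
: type-bytes ( c-addr u -- )
  over + swap ?do  i c@ emit  loop ;

\ Codepoint-by-codepoint: assumes the string holds only complete,
\ well-formed xchars (XC@+ may read past the end otherwise).
: type-xchars ( xc-addr u -- )
  over + >r                       ( xc-addr ) ( R: end )
  begin dup r@ u< while  xc@+ xemit  repeat
  drop r> drop ;

\ Count code points rather than bytes.
: xchar-count ( xc-addr u -- n )
  over + >r  0 swap               ( n xc-addr ) ( R: end )
  begin dup r@ u< while
    xc@+ drop  swap 1+ swap
  repeat
  drop r> drop ;
```

On a system whose xchar encoding is UTF-8, the 9-byte string from the dump
above would count as 6 xchars, while both output words send the same 9
bytes to the terminal.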

>> Such words like SEARCH , COMPARE , STARTS-WITH , SUBSTRING-AFTER , SPLIT
>> , REPLACE , HASH and many other can be implemented in the terms of
>> pchars. A program can represent strings in the form of suffix tree, or
>> other data structures - also relying on pchars only. (2)

>
>They worked perfectly well on my systems before pchars were invented and
>they continue to work afterwards. COMPARE do a binary comparison,
>byte by byte.

Exactly. No need to split the string into code points.

Concerning pchars: pchar=char. It's just that Stephen Pelc thought it
necessary (and apparently still thinks so) that you should rename
chars into pchars once xchars come into play. The idea is that char
might be confused with xchar, while pchar makes it clear that it is
not xchar. However, your last paragraph shows that pchar may cause
more confusion than it avoids.

I would prefer to replace in the standard document all occurrences of
"pchar" with "char". But if the consensus is that pchar is necessary,
we should replace all occurrences of "char" with "pchar".

>My case conversion works for all Unicode codepoints that have a case property.
>Or used at least, my translation tables are not updated for some time.

Note that the case conversion depends on the locale:

[c8b:~:66284] echo i |LANG=tr_TR.utf8 tr '[:lower:]' '[:upper:]'
i
[c8b:~:66285] echo i |LANG=de_AT.utf8 tr '[:lower:]' '[:upper:]'
I

Somehow tr is not particularly good at case conversion, IMO the first
command should produce a capital I with a dot. This still
demonstrates the point I was trying to make.

Anyway, ignoring that wrinkle, yes, byte-by-byte case conversion only
converts ASCII characters. But for xchar case conversion, the xchar
words are not enough. You need a case conversion word (maybe TOUPPER
( xc1 -- xc2 )), or knowledge about the encoding.
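
A small sketch of the byte-by-byte variant, assuming byte-sized chars and
an ASCII-compatible encoding such as UTF-8. The names ascii-upper and
upcase-ascii are invented here. Bytes from $80 upwards are left alone, so
multi-byte sequences pass through unchanged (and unconverted).

```forth
\ Convert one ASCII lower-case letter; anything else is returned as is.
: ascii-upper ( char -- char' )
  dup [char] a [char] z 1+ within if $20 - then ;

\ Upper-case a string in place, one byte at a time.
: upcase-ascii ( c-addr u -- )
  over + swap ?do  i c@ ascii-upper i c!  loop ;
```

Applied to the bytes 61 C3 A4 7A (UTF-8 for "aäz") this gives
41 C3 A4 5A: the two ASCII letters are converted and the ä is not.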

Anton Ertl

Apr 12, 2021, 1:19:03 PM

P Falth <peter....@gmail.com> writes:
>On Friday, 9 April 2021 at 19:18:49 UTC+2, Ruvim wrote:

>> On 2021-04-09 19:05, P Falth wrote:
>> > On Friday, 9 April 2021 at 15:45:19 UTC+2, Ruvim wrote:
>> >> On 2021-04-09 13:19, P Falth wrote:
>> >>> On Friday, 9 April 2021 at 09:51:47 UTC+2, Ruvim wrote:
>> >>>> On 2021-04-09 09:06, P Falth wrote:
>> >>>>> On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
>> >>>>>> On 2021-04-09 00:30, P Falth wrote:
>> >>>>>>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>> >>>>>>>> On 2021-04-08 22:28, P Falth wrote:
>> >>>>>>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:

>What should $C4 emit do?

Send the raw byte $C4.

>> Actually it's a possible approach that the standard doesn't provide a
>> way to print a pchar, but only a string. Although it doesn't solve any
>> problem, since the expression:
>> s" Ä" over 1 type 1 /string type
>> should be equivalent to the expression:
>>
>> s" Ä" type

>
>But only if your strings are UTF-8

Also works for any other encoding, as long as it's consistent between
program and environment:

LANG=de_AT.iso88591 xterm -e gforth
Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
$c4 emit \ output: Ä ok
s" Ä" dump \ output:
55EAADA4C2F0: C4 - .
ok
s" Ä" over 1 type 1 /string type \ output: Ä ok

>And if your TYPE allows to type incomplete UTF-8 sequences.

Yes, that's necessary. If the OS does not do this for you, it costs a
little pain in TYPE (and you use TYPE to implement EMIT), but it
avoids spreading the pain to all users of TYPE and EMIT.

>Is it really a good idea to have TYPE type incomplete sequences?

Yes.

>Could that not lead to a security problem?

Such as?

>> NB: in the standard 'char' usually means 'pchar'
>> "Unless otherwise stated, a "character" refers to a primitive character"

char=pchar always. As for "character", unless it is preceded by
"extended", it is the same as "primitive character".

Anton Ertl

Apr 12, 2021, 1:29:30 PM

"minf...@arcor.de" <minf...@arcor.de> writes:
>There are two different tasks:
>
>1) send a UTFx text stream to an output device, then let the device do the rendering
>and cursor control, and bother no more.
>
>2) do UTFx text processing in your own machine, then use some UTFx-aware
>string processing words (e.g. based on the XCHAR wordset)

1 is a very common special case of 2.

>Many people don't separate cleanly between these tasks and get lost in
>UTF en/decoding intricacies.

Who are these many people? Because 1 is a special case of 2, there is
no need for separation. But if you need to do something outside 1,
you need to know how to do it.

>Umlauts are easy, try Taiwanese...

What problem do you see with Taiwanese?

P Falth

Apr 13, 2021, 6:56:16 AM

On Monday, 12 April 2021 at 18:21:56 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
> >> On 2021-04-09 00:30, P Falth wrote:
> >> > On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
> >> >> s" Ä" over c@ emit 1 /string type
> >> >
> >> > No this is still depending on knowing the encoding of the string.
> >> If EMIT is restricted to pchar, why this is depending on the encoding
> >> that a Forth system uses under the hood?
> >
> >You use c@ to access a string you do not know the encoding of!
> Yes, and the nice thing about UTF-8 is that, with a raw-byte c@, a
> raw-byte EMIT, and a raw-byte TYPE, this works.

But only when you know the encoding is UTF-8

Maybe it is time to standardize on UTF-8 as representation for strings!
But I would do it only for the external representation of text. Write-line
and Read-Line expect input in UTF-8 and will output the same.

Internal representation of strings I would let open to the implementation.
In my 64 bit Forths I use for the edit-line function (the command-line editor)
UTF-16 for windows and UCS4 for Linux. It was simpler to implement in that
way. Before delivering the result the string is converted to UTF-8.

> >Internally both my Linux and Windows systems uses UTF-8 encoded strings.
> >But the Windows systems translate this to an 16 bit char representation
> >inside type, to be able to write it to the screen with the WriteConsoleW
> >OS function. You remove 1 part of a multibyte char and send the remaining
> >string to type that will see an illegal utf8 char to translate and will fail.
>
> I see two good ways to solve this:
>
> 1) You use a different Windows function that works more like Unix'
> write(); I don't know if WriteConsoleA is such a function.

Only on a relatively new Windows 10 (from the last 2 years).
My ntf works from Windows 2000 and onwards. I will not break that.
My 64 bit ntf64 uses WriteFile with codepage 65001 (utf8).
WriteFile is preferred as it can be redirected.

> 2) Your TYPE and EMIT buffer incomplete code points, and only convert
> complete code points and send it to WriteConsoleW.

That is an ugly hack that I will not implement. The standard should hide
implementation details, not force them on you.

BR
Peter

P Falth

Apr 13, 2021, 7:17:18 AM

I think UTF-16 is a valid string representation also with byte chars.
It just needs to be used with the appropriate xchar words.

Please remove pchar from the standard. It only adds confusion.
Just say instead that KEY and EMIT are limited to chars from 0 to 255.
Also please remove the "raw char" statements.

I went back to Forth-94 and compared it with Forth-2012. There has been a
significant change in the appendices on KEY and EKEY.

I have always regarded EKEY as the low-level word that could accept anything
from the keyboard. KEY was then constructed from EKEY. This is also
shown as an example in Forth-94. In the 2012 standard this has been reversed:
KEY is now the low-level word that can generate a series of chars from one key-press.

Peter

P Falth

Apr 13, 2021, 7:23:09 AM

On Monday, 12 April 2021 at 19:19:03 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >On Friday, 9 April 2021 at 19:18:49 UTC+2, Ruvim wrote:
> >> On 2021-04-09 19:05, P Falth wrote:
> >> > On Friday, 9 April 2021 at 15:45:19 UTC+2, Ruvim wrote:
> >> >> On 2021-04-09 13:19, P Falth wrote:
> >> >>> On Friday, 9 April 2021 at 09:51:47 UTC+2, Ruvim wrote:
> >> >>>> On 2021-04-09 09:06, P Falth wrote:
> >> >>>>> On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
> >> >>>>>> On 2021-04-09 00:30, P Falth wrote:
> >> >>>>>>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
> >> >>>>>>>> On 2021-04-08 22:28, P Falth wrote:
> >> >>>>>>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
> >What should $C4 emit do?
> Send the raw byte $C4.

And for $C0 emit?

$C0 is a byte that can never be part of a UTF-8 string.

P Falth

Apr 13, 2021, 9:10:26 AM

That works because you run this on Linux with the terminal set to UTF-8.

On VFX for Windows I get:
VFX Forth for Windows x86
© MicroProcessor Engineering Ltd, 1998-2020
Version: 5.11 [build 3793]
Build date: 21 October 2020

Free dictionary = 7472078 bytes [7296kb]

: ä s" ä" ; ok
ä dump

004D:DC56 E4 00 C3 00 00 00 00 00 00 00 00 00 00 00 00 00 d.C.............
ok
ä . 1 ok-1

It looks like VFX uses the Windows-1252 code page under Windows.
Its character repertoire is a subset of Unicode.

Peter

Ruvim

Apr 14, 2021, 9:54:05 AM

On 2021-04-13 14:23, P Falth wrote:
> On Monday, 12 April 2021 at 19:19:03 UTC+2, Anton Ertl wrote:
>> P Falth <peter....@gmail.com> writes:

[...]

>>> What should $C4 emit do?
>> Send the raw byte $C4.
>
> and for $C0 emit
>
> $C0 is a byte that never can be part of an UTF-8 string.

It is no different with TYPE:

here $c0 c, 1 type

What should TYPE print?

How to deal with incorrect sequences is a problem for the underlying layer.

If the user output device is a file (due to redirection), then $c0 is
just written to the file as is.

If the user output device is a terminal, and $c0 produces an incorrect
sequence in the current encoding, then the replacement character
("�" U+FFFD) is usually shown.
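
As a rough illustration in Python (not Forth; a Python decoder stands in here for whatever layer sits below TYPE, which is an assumption of this sketch), a strict UTF-8 decoder rejects the lone byte $C0, while a lenient one substitutes U+FFFD much as many terminals do:

```python
lone_byte = b"\xc0"

# A strict decoder rejects the byte outright.
try:
    lone_byte.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

# A lenient decoder substitutes the replacement character U+FFFD.
shown = lone_byte.decode("utf-8", errors="replace")

print(accepted, shown == "\ufffd")
```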

--
Ruvim

Ruvim

Apr 28, 2021, 9:07:26 AM

On 2021-04-13 14:17, P Falth wrote:
> On Monday, 12 April 2021 at 18:58:22 UTC+2, Anton Ertl wrote:

>> Ok, if we exclude 2-byte chars, we also exclude UTF-16 as xchar
>> encoding.

> I think UTF-16 is a valid string representation also with byte chars.
> I just needs to be used with the appropriate xchar words.

The phrase:

align here char A c, char B c, char C c, 3 type cr

shall print "ABC"

If the char size is one byte and UTF-16 is used, then this phrase prints
something else.

So UTF-16 is not compatible with byte chars. Only two-byte chars can be
used with UTF-16.

--
Ruvim

P Falth

Apr 28, 2021, 11:45:42 AM

On Wednesday, 28 April 2021 at 15:07:26 UTC+2, Ruvim wrote:
> On 2021-04-13 14:17, P Falth wrote:
> > On Monday, 12 April 2021 at 18:58:22 UTC+2, Anton Ertl wrote:
>
> >> Ok, if we exclude 2-byte chars, we also exclude UTF-16 as xchar
> >> encoding.
> > I think UTF-16 is a valid string representation also with byte chars.
> > I just needs to be used with the appropriate xchar words.
> The phrase:
>
> align here char A c, char B c, char C c, 3 type cr
>
> shall print "ABC"

That is only guaranteed to work with 7-bit ASCII.
Try it with my favorite Swedish letters ÅÄÖ:

align here char Å c, char Ä c, char Ö c, 3 type cr

This will work only if the encoding is Latin-1, Windows-1252, or another
8-bit character set that includes European characters.

It does not work with UTF-8 or UTF-16.

A system supports Unicode when the XCHAR wordset is loaded.
The correct way to write the above is

align here char Å xc, char Ä xc, char Ö xc, here over - type

and that will work with UTF-8 or UTF-16 encoded strings, depending on what
the system implementer has chosen as the string encoding.
Also, UTF-16 is a variable-width encoding, so a character can be 2 or 4 bytes.
UCS-4 is also a possible encoding for strings and is always 4 bytes.
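
The sizes involved can be checked with a short Python sketch (Python, not Forth; it assumes that CHAR yields the code point and that C, keeps only one byte of it, which is how the failing phrase is modelled here):

```python
text = "ÅÄÖ"

# Bytes needed for the same three characters in each encoding.
sizes = {
    name: len(text.encode(name))
    for name in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}

# Keeping one byte per code point (the assumed effect of `char Å c,`)
# reproduces the Latin-1 bytes but not the UTF-8 bytes.
low_bytes = bytes(ord(ch) & 0xFF for ch in text)
matches_latin1 = low_bytes == text.encode("latin-1")
matches_utf8 = low_bytes == text.encode("utf-8")

print(sizes, matches_latin1, matches_utf8)
```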

> If a char size is one byte, and UTF-16 is used, then this phrase prints
> something else.
>
> So UTF-16 is not compatible with byte chars. Only two bytes chars can be
> used with UTF-16.

No, only the XCHAR wordset can be used with a Unicode system.
I need c! and c@ to work on bytes, independent of the encoding of strings.
But I can never expect c! and c@ to work on Unicode code points.

BR
Peter

>
> --
> Ruvim

Ruvim

Apr 28, 2021, 12:32:57 PM

On 2021-04-28 18:45, P Falth wrote:
> On Wednesday, 28 April 2021 at 15:07:26 UTC+2, Ruvim wrote:
>> On 2021-04-13 14:17, P Falth wrote:
>>> On Monday, 12 April 2021 at 18:58:22 UTC+2, Anton Ertl wrote:
>>
>>>> Ok, if we exclude 2-byte chars, we also exclude UTF-16 as xchar
>>>> encoding.
>>>
>>> I think UTF-16 is a valid string representation also with byte chars.
>>> I just needs to be used with the appropriate xchar words.
>>>
>> The phrase:
>>
>> align here char A c, char B c, char C c, 3 type cr
>>
>> shall print "ABC"
>
> That is only guaranteed to work with 7 bit ascii.

Yes, but it's guaranteed! If this test fails, the system is not
compliant (regardless of support for the Extended-Character word set).

A formal testcase:

T{ align here char A c, char B c, char C c, 3 s" ABC" compare -> 0 }T

> Try it with my favorit Swedish letters ÅÄÖ
>
> align here char Å c, char Ä c, char Ö c, 3 type cr

Yes, this is not guaranteed.

> Will work only if the encoding is latin-1 or windows 1252 or another
> 8 bit char set with European characters in it.
>
> It does not work with UTF-8 or16

> A system supports Unicode when the XCHAR wordset is loaded
> The correct way to write the above is
>
> align here char Å xc, char Ä xc, char Ö xc, here over - type

Agreed

In the most general case:

... here over - 1 chars u/ type

(just in case the char size is more than one address unit)

I would prefer to represent strings as the address and the size in
address units, but we have what we have.

> and that will work with UTF-8 or 16 encoded strings depending on what
> the system implementer has chosen as string encoding.
> Also UTF-16 is a variable width encoding so the char can be 2 or 4 bytes
> UCS-4 is also a possible encoding for strings and is always 4 bytes.
>
>> If a char size is one byte, and UTF-16 is used, then this phrase prints
>> something else.
>>
>> So UTF-16 is not compatible with byte chars. Only two bytes chars can be
>> used with UTF-16.
>
> No only the XCHAR wordset can be used with a Unicode system.
> I need c! and c@ to work on bytes independent on the encoding of strings.

Then you need to introduce b! and b@ that always work on octets.

> But I can never expect c!, c@ to work on unicode codepoints.

They don't work on Unicode code points.
They work on primitive characters.

--
Ruvim

Anton Ertl

Apr 28, 2021, 1:09:28 PM

Ruvim <ruvim...@gmail.com> writes:
>I would prefer to represent strings as the address and the size in
>address units, but we have what we have.

Already accepted for the next standard in 2016:
<http://www.forth200x.org/char-is-1.html>

>> But I can never expect c!, c@ to work on unicode codepoints.
>
>They don't work on Unicode code points.

Correct. The Unicode term for what c@ and c! work on is "code unit"
(for UTF-8 it's a byte).

Ruvim

Apr 28, 2021, 5:17:29 PM

On 2021-04-28 20:05, Anton Ertl wrote:
> Ruvim <ruvim...@gmail.com> writes:
>> I would prefer to represent strings as the address and the size in
>> address units, but we have what we have.
>
> Already accpeted for the next standard in 2016:
> <http://www.forth200x.org/char-is-1.html>

Yes, I know. But I mean the cases when char > au.

Concerning this new char=au restriction on systems, it excludes UTF-16
on byte-addressed machines. Only UTF-8 is available.

Windows uses UTF-16 internally, so it could be worthwhile to have char = 2 au
in some cases. Now such a case is only possible with the corresponding
environmental restriction.

But then (regardless of that new restriction) you cannot read/write
byte-granular buffers into binary files via standard words, since they
count in chars.

It seems a better choice would be for all words to work with buffers
counted in address units. But this train left a long time ago.

--
Ruvim

Anton Ertl

Apr 29, 2021, 2:08:40 AM

Ruvim <ruvim...@gmail.com> writes:
>Concerning this new char=au restriction on systems, it excludes UTF-16
>on byte-addressed machines. Only UTF-8 is available.

Pretty much, unless you want to have au=16bits.

>Windows uses UTF-16 internally, so it could be worth to have char = 2 au
>in some case.

It isn't. There has been only one attempt at a Windows Forth with 1
chars=2 (JaxForth), and it did not catch on. All the other Windows
Forths stayed with 1 chars=1, and this common practice is why we
finally standardized that. As for dealing with Windows,
<https://utf8everywhere.org/> gives advice on that.

>It seems a better choice would be that all words work with buffers
>counted in the address units. But this train left a long time ago.

Given that 1 chars = 1, this train is never going to leave.

Stephen Pelc

Apr 29, 2021, 7:31:41 AM

On Thu, 29 Apr 2021 05:55:03 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>It isn't. There has been only one attempt at a Windows Forth with 1
>chars=2 (JaxForth), and it did not catch on. All the other Windows
>Forths stayed with 1 chars=1, and this common practice is why we
>finally standardized that. As for dealing with Windows,
><https://utf8everywhere.org/> gives advice on that.

Please note that Peter Knaggs' Forth has been severely impacted
by "1 chars = 1".

Stephen

--
Stephen Pelc, ste...@vfxforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, +44 (0)78 0390 3612, +34 649 662 974
web: http://www.mpeforth.com - free VFX Forth downloads

Anton Ertl

Apr 29, 2021, 9:06:57 AM

ste...@mpeforth.com (Stephen Pelc) writes:
>On Thu, 29 Apr 2021 05:55:03 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>It isn't. There has been only one attempt at a Windows Forth with 1
>>chars=2 (JaxForth), and it did not catch on. All the other Windows
>>Forths stayed with 1 chars=1, and this common practice is why we
>>finally standardized that. As for dealing with Windows,
>><https://utf8everywhere.org/> gives advice on that.
>
>Please not that Peter Knaggs' Forth has been severely impacted
>by "1 chars = 1".

My understanding is that he is not going to implement "1 chars = 1",
so there is no impact. The vast majority of programs that rely on 1
chars = 1 will continue to not work on his system, while the few
programs that have no such dependence continue to work on it (unless
they run afoul of another dependence).

If anything, we were too late in standardizing "1 chars = 1". If we
had standardized that earlier, he probably would have avoided the
mistake of implementing a "1 chars = 2" system.

P Falth

Apr 29, 2021, 3:08:06 PM

On Wednesday, 28 April 2021 at 18:32:57 UTC+2, Ruvim wrote:
> On 2021-04-28 18:45, P Falth wrote:
> > On Wednesday, 28 April 2021 at 15:07:26 UTC+2, Ruvim wrote:
> >> On 2021-04-13 14:17, P Falth wrote:
> >>> On Monday, 12 April 2021 at 18:58:22 UTC+2, Anton Ertl wrote:
> >>
> >>>> Ok, if we exclude 2-byte chars, we also exclude UTF-16 as xchar
> >>>> encoding.
> >>>
> >>> I think UTF-16 is a valid string representation also with byte chars.
> >>> I just needs to be used with the appropriate xchar words.
> >>>
> >> The phrase:
> >>
> >> align here char A c, char B c, char C c, 3 type cr
> >>
> >> shall print "ABC"
> >
> > That is only guaranteed to work with 7 bit ascii.
> Yes, but it's guaranteed! If this test fails, the system is not
> compliant (regardless support of the Extended-Character word set).
>
> A formal testcase:
>
> T{ align here char A c, char B c, char C c, 3 s" ABC" compare -> 0 }
> > Try it with my favorit Swedish letters ÅÄÖ
> >
> > align here char Å c, char Ä c, char Ö c, 3 type cr
> Yes, this is not guaranteed.

So FORTH is forever tied to 7-bit ASCII.

What about making a completely standalone extended char wordset,
where ASCII and Unicode can live in parallel? The XCHAR wordset
would need to be completed with:
XS" for Unicode strings, XTYPE, XCHAR, [XCHAR].

For output and input we need XREAD-LINE and XWRITE-LINE, or
maybe better XS>UTF8 and UTF8>XS to convert strings before/after
writing/reading.

I do not think anybody disagrees that UTF-8 is the external
representation of text.

BR
Peter

dxforth

Apr 29, 2021, 6:38:26 PM

When features aren't designed-in from the beginning they will always
be a patch. Like it or not Forth is forever tied to the threaded-
code 7-bit thinking of the 70's. For this reason there are new
languages (if one can stomach them). One can always change one's
name :)

Ruvim

Apr 30, 2021, 3:28:29 AM

On 2021-04-29 08:55, Anton Ertl wrote:
> Ruvim <ruvim...@gmail.com> writes:
>> Concerning this new char=au restriction on systems, it excludes UTF-16
>> on byte-addressed machines. Only UTF-8 is available.
>
> Pretty much, unless you want to have au=16bits.
>
>> Windows uses UTF-16 internally, so it could be worth to have char = 2 au
>> in some case.
>
> It isn't. There has been only one attempt at a Windows Forth with 1
> chars=2 (JaxForth), and it did not catch on. All the other Windows
> Forths stayed with 1 chars=1, and this common practice is why we
> finally standardized that. As for dealing with Windows,
> <https://utf8everywhere.org/> gives advice on that.
>
>> It seems a better choice would be that all words work with buffers
>> counted in the address units. But this train left a long time ago.
>
> Given that 1 chars = 1, this train is never going to leave.

Well, you have convinced me :)

But that doesn't settle the question of which is the better way to represent
the size of contiguous data structures: in internal units (chars in this
case) or in external units (au in this case).

--
Ruvim

Ruvim

Apr 30, 2021, 3:58:01 AM

On 2021-04-29 22:08, P Falth wrote:
> On Wednesday, 28 April 2021 at 18:32:57 UTC+2, Ruvim wrote:
>> On 2021-04-28 18:45, P Falth wrote:
>>> On Wednesday, 28 April 2021 at 15:07:26 UTC+2, Ruvim wrote:
>>>> On 2021-04-13 14:17, P Falth wrote:
>>>>> On Monday, 12 April 2021 at 18:58:22 UTC+2, Anton Ertl wrote:
>>>>
>>>>>> Ok, if we exclude 2-byte chars, we also exclude UTF-16 as xchar
>>>>>> encoding.
>>>>>
>>>>> I think UTF-16 is a valid string representation also with byte chars.
>>>>> I just needs to be used with the appropriate xchar words.
>>>>>
>>>> The phrase:
>>>>
>>>> align here char A c, char B c, char C c, 3 type cr
>>>>
>>>> shall print "ABC"
>>>
>>> That is only guaranteed to work with 7 bit ascii.
>> Yes, but it's guaranteed! If this test fails, the system is not
>> compliant (regardless support of the Extended-Character word set).
>>
>> A formal testcase:
>>
>> T{ align here char A c, char B c, char C c, 3 s" ABC" compare -> 0 }
>>> Try it with my favorit Swedish letters ÅÄÖ
>>>
>>> align here char Å c, char Ä c, char Ö c, 3 type cr
>> Yes, this is not guaranteed.
>
> So FORTH is forever tied to 7-bit ASCII

That's not quite the right way to put it, I think.

A standard *program* that doesn't rely on the Extended-Character word set
is limited to 7-bit ASCII.

But that doesn't prevent programs from having environmental dependencies and
using various 8-bit (or other, according to char size) character encodings.

> What about making a completely standalone extended char wordset.
> Where ASCII and Unicode can live in parallel. The XHAR wordset
> needs to be completed with:
> XS" for unicode strings, XTYPE, XCHAR, [XCHAR].

Let's think about it.

Do you mean to also introduce XC@ and XC! ?

If not, then how does XS" differ from S" in the presence of the
Extended-Character word set?

If yes, I don't think it is worthwhile to introduce a different
word set. It's better to just implement all the standard words that work
with a different char size in a separate word list.

> For output and input we need XREAD-LINE and XWRITE-LINE or
> maybe better XS>UTF8 and UTF8>XS to convert strings before/after
> writing/reading.

>
> That UTF-8 is the external representation of text I do not think anybody
> disagrees with.

--
Ruvim

Anton Ertl

Apr 30, 2021, 5:07:00 AM

P Falth <peter....@gmail.com> writes:
>On Wednesday, 28 April 2021 at 18:32:57 UTC+2, Ruvim wrote:
>> T{ align here char A c, char B c, char C c, 3 s" ABC" compare -> 0 }

>> > Try it with my favorit Swedish letters ÅÄÖ
>> >
>> > align here char Å c, char Ä c, char Ö c, 3 type cr

>> Yes, this is not guaranteed.
>
>So FORTH is forever tied to 7-bit ASCII

Forth is tied to ASCII in that the ASCII test case above continues to be
guaranteed to work. This particular idiom does not work for your favourite
Swedish letters, but there are other idioms that work, e.g.

s" ÅÄÖ" type

And actually for most things you want to do with (potential) UTF-8
data, you do not even need xchar words. String words are sufficient
most of the time. Gforth works nicely with UTF-8 data, and if you
look at the number of occurrences of xchar words and compare them with
the number of occurrences of string words, you see that xchar words are
used rarely.

xchar                        String-Alternative
  3 x-size
  1 xc!+                      43 move
  3 xc!+?                     43 move
  5 xc,                        9 mem,
  5 xc-size                  114 nip
  5 xc@+                      43 move
  4 xchar+
  1 xemit                    151 type
  2 xkey
  1 xkey?

xchar ext                    String-Alternative
  2 +x/string                 71 /string
  0 -trailing-garbage
  1 ekey>xchar
  6 x-width
  3 xc-width                   6 x-width
  6 xchar-
  0 xhold                      4 holds
  1 x\string-                 71 /string

And most of the xchar stuff is used in the line editor, which is not a
common application use.

>What about making a completely standalone extended char wordset.
>Where ASCII and Unicode can live in parallel. The XHAR wordset
>needs to be completed with:
>XS" for unicode strings, XTYPE, XCHAR, [XCHAR].

What would these words offer that S", TYPE, 18.6.2.0895 CHAR and
18.6.2.2520 [CHAR] don't?

>For output and input we need XREAD-LINE and XWRITE-LINE

What would these words offer that READ-LINE and WRITE-LINE don't?

>or
>maybe better XS>UTF8 and UTF8>XS to convert strings before/after
>writing/reading.

What's the point of these words?

The point of the xchar words and UTF-8 is that they fit nicely in a
software world that is full of char=byte-oriented software. And
that's why UTF-8 has won.

If you want to have words for strings with 16-bit code units, go
ahead, but better call them something else, maybe wchar words (W for
Windows, not for wide char, because the latter would be 32 bits). I
think it's an unnecessary complication, as is evidenced by the fact
that no Windows Forths up to now, including those used in
international applications, have introduced such a wordset (not even
JaxForth).

But note that with WCHAR WC, WTYPE for 16-bit code units, you still
cannot do, e.g.

here wchar 𝔸 wc, 1 wtype

because 𝔸 (U+01D538) does not fit in a 16-bit code unit. If you want
that to work, you need a wide-character wordset (with 32-bit
characters), but you do not need counterparts for all the words that
the xchar wordset has for dealing with multi-character code points.
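
The arithmetic can be checked with a short Python sketch (Python, not Forth; the variable names are made up for the illustration). It prints the code point, its UTF-16 code units, its UTF-8 bytes, and its UTF-32 size:

```python
import struct

ch = "\U0001D538"          # the double-struck capital A used above
code_point = ord(ch)

# Split the UTF-16 encoding into 16-bit code units.
utf16 = ch.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)

utf8 = ch.encode("utf-8")
utf32 = ch.encode("utf-32-le")

print(hex(code_point), [hex(u) for u in units], utf8.hex(), len(utf32))
```

One code point needs two 16-bit code units (a surrogate pair), four bytes in UTF-8, and a single 32-bit unit.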

So, what do you want:

1) code unit = code point (i.e., 32-bit code units)?

2) code unit = 16 bits (the Windows way)?

3) Neither.

My vote is for 3.

P Falth

Apr 30, 2021, 8:03:18 AM

XC@+ and XC!+ are already in the Standard.
They are also much better suited to working with strings.

>
> If no, then how does XS" differ from S" in presence of the
> Extended-Character word set?

It could be in another encoding. UCS-4 is one valid alternative.

BR
Peter

P Falth

Apr 30, 2021, 8:16:04 AM

This is also my experience from almost 20 years of using a Forth
that is UTF-8 internally. I have never had a problem loading a program
from another author because of Unicode or UTF-8.

But Ruvim is good at finding examples where it could potentially break.

Maybe it is time to standardize UTF-8 as internal string representation
of Unicode characters in Forth. And then also list the programming
habits that can break.

> >What about making a completely standalone extended char wordset.
> >Where ASCII and Unicode can live in parallel. The XHAR wordset
> >needs to be completed with:
> >XS" for unicode strings, XTYPE, XCHAR, [XCHAR].
> What would these words offer that S", TYPE, 18.6.2.0895 CHAR and
> 28.6.2.2520 [CHAR] don't?
> >For output and input we need XREAD-LINE and XWRITE-LINE
> What would these words offer that READ-LINE and WRITE-LINE don't?
> >or
> >maybe better XS>UTF8 and UTF8>XS to convert strings before/after
> >writing/reading.
> What's the point of these words?

If you do not standardize UTF-8 as internal representation you need
to translate your string representation to what the external world expects.

One alternative to UTF-8 could be UCS-4, it wastes time but is simple.
This is how C uses wide chars

> The point of the xchar words and UTF-8 is that they fit nicely in a
> software world that is full of char=byte-oriented software. And
> that's why UTF-8 has won.
>
> If you want to have words for strings with 16-bit code units, go
> ahead, but better call them something else, maybe wchar words (W for
> Windows, not for wide char, because the latter would be 32 bits). I
> think it's an unnecessary complication, as is evidenced by the fact
> that no Windows Forths up to now, including those used in
> international applications, have introduced such a wordset (not even
> JaxForth).
>
> But note that with WCHAR WC, WTYPE for 16-bit code units, you still
> cannot do, e.g.
>
> here wchar 𝔸 wc, 1 wtype
>
> because 𝔸 (U+01D538) does not fit in a 16-bit code unit. If you want
> that to work, you need a wide-character wordset (with 32-bit
> characters), but you do not need counterparts for all the words that
> the xchar wordset has for dealing with multi-character code points.
>
> So, what do you want:
>
> 1) code unit = code point (i.e., 32-bit code units)?
>
> 2) code unit = 16 bits (the Windows way)?
>
> 3) Neither.
>
> My vote is for 3.

I would vote for Standardizing UTF-8 as internal string representation
of Unicode strings in Forth.

Peter

Anton Ertl

Apr 30, 2021, 9:24:52 AM

Ruvim <ruvim...@gmail.com> writes:
>But it doesn't concern the question what is a better way to represent
>contiguous data structures size: use internal units (chars in this
>case), or use external units (au in this case).

Words that are used for dealing with raw memory, such as READ-FILE,
WRITE-FILE, MOVE, FILL, and ERASE, specify chars (READ-FILE,
WRITE-FILE, FILL) or aus (MOVE, ERASE), and 1 chars = 1 eliminates the
need for au versions of READ-FILE, WRITE-FILE, and FILL.

Concerning arrays, there are three convenient representations:

1) addr-start u-elems

2) addr-start u-aus

3) addr-start addr-end (where the last byte is at addr-end-1)

Sometimes one representation is more convenient, sometimes a different
one. The common convention in Forth is to use 1, and in my experience
the benefits from following such a convention trump the benefit of
using a locally most convenient representation.
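
For concreteness, here is how the three representations relate, as a Python sketch (not Forth; addresses are modelled as plain integers, and the function name and the element size of 8 address units are hypothetical choices for the example):

```python
def from_elems(start, n_elems, elem_size):
    """Derive all three forms from representation 1 (addr-start u-elems)."""
    n_aus = n_elems * elem_size
    return {
        "elems": (start, n_elems),        # 1) addr-start u-elems
        "aus": (start, n_aus),            # 2) addr-start u-aus
        "end": (start, start + n_aus),    # 3) addr-start addr-end
    }

reps = from_elems(start=1000, n_elems=5, elem_size=8)
last_byte = reps["end"][1] - 1            # last byte is at addr-end-1

print(reps, last_byte)
```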

Concerning heterogeneous data structures (e.g., Forth-2012 structs),
their size is given in aus, what else?

Anton Ertl

Apr 30, 2021, 9:47:43 AM

P Falth <peter....@gmail.com> writes:
>Maybe it is time to standardize UTF-8 as internal string representation
>of Unicode characters in Forth.

Given that we standardize 1 chars = 1, UTF-8 for Unicode is pretty
much inevitable for byte-addressed machines. Ok, that leaves the door
open for word-addressed machines, but is it really an issue?

>> >maybe better XS>UTF8 and UTF8>XS to convert strings before/after
>> >writing/reading.
>> What's the point of these words?

>
>If you do not standardize UTF-8 as internal representation you need
>to translate your string representation to what the external world expects.

IMO that's the job for READ-FILE etc. One thing that's not yet
standardized is to specify the external encoding in the fam, but I
have not yet felt the need, and don't know anything about existing
practice out there. Gforth just uses the environment variables (in
particular, LANG) for determining what internal encoding to assume,
and just does not recode on input/output; so it assumes a homogeneous
environment. As time passes, this becomes more and more of a reality:
Everything is UTF-8.

>One alternative to UTF-8 could be UCS-4, it wastes time but is simple.
>This is how C uses wide chars

From what I read, if you use C on Windows, you get a 16-bit wchar_t,
contrary to the intentions of the C standard (which are indeed to have
a 32-bit wchar_t), as can be seen from the lack of support for
multi-wchar_t characters.

Anyway, using plain chars (=bytes) in C works nicely, thanks to UTF-8,
and for the few cases where you need to deal with individual code
points, standard C has functions for dealing with multi-byte
characters (unlike for multi-wchar_t characters).

>I would vote for Standardizing UTF-8 as internal string representation
>of Unicode strings in Forth.

Good to know. Of course, before you can vote, somebody has to propose
it for standardization.

Ruvim

Apr 30, 2021, 12:17:57 PM

On 2021-04-30 15:03, P Falth wrote:
> On Friday, 30 April 2021 at 09:58:01 UTC+2, Ruvim wrote:
>> On 2021-04-29 22:08, P Falth wrote:
>>> So FORTH is forever tied to 7-bit ASCII
>> It's not quite correct expression, I think.
>>
>> A standard *program* that doesn't rely on Extended-Character word set,
>> is limited to 7-bit ASCII.
>>
>> But it didn't prevent programs to have environmental dependencies, and
>> to use various 8-bit (or other according to char size) character encodings.
>>
>>
>>> What about making a completely standalone extended char wordset.
>>> Where ASCII and Unicode can live in parallel. The XHAR wordset
>>> needs to be completed with:
>>> XS" for unicode strings, XTYPE, XCHAR, [XCHAR].
>> Let's think about it.
>>
>> Do you mean to also introduce XC@ anc XC! ?
>
> XC@+ and XC!+ are already in the Standard.
> They are also much better suited to working with strings
>
>>
>> If no, then how does XS" differ from S" in presence of the
>> Extended-Character word set?
>
> It could be in another encoding. UCS-4 is one valid alternative

XC@+ shall be compatible with S":

T{ s" ABC" drop xc@+ nip char A = -> -1 }T

And then XC@+ cannot be compatible with an XS" that differs from S".

>> If yes, I don't think that it is worthwhile to introduce the different
>> word set. It's better to just implement all the standard words that work
>> with different char size in the separate word list.

So for another YS" you would also need another YC!, YC@, etc.,
and then you would just place them in a different word list under the same
names: S" C! C@ XC@+ etc.

If you want to be portable, it's enough to find consensus on the name of
this word list only.

--
Ruvim

Ruvim

Apr 30, 2021, 12:46:04 PM

On 2021-04-30 14:47, Anton Ertl wrote:
> Ruvim <ruvim...@gmail.com> writes:
>> But it doesn't concern the question what is a better way to represent
>> contiguous data structures size: use internal units (chars in this
>> case), or use external units (au in this case).
>
> Words that are used for dealing with raw memory, such as READ-FILE,
> WRITE-FILE, MOVE, FILL, and ERASE, specify chars (READ-FILE,
> WRITE-FILE, FILL) or aus (MOVE, ERASE), and 1 chars = 1 eliminates the
> need for au versions of READ-FILE, WRITE-FILE, and FILL.

True.

But if sizes were counted in au (when au <> char), it would be enough to have
*only* the au versions of all these words. Just as a history lesson.

> Concerning arrays, there are three convenient representations:
>
> 1) addr-start u-elems
>
> 2) addr-start u-aus
>
> 3) addr-start addr-end (where the last byte is at addr-end-1)
>
> Sometimes one representation is more convenient, sometimes a different
> one. The common convention in Forth is to use 1, and in my experience
> the benefits from following such a convention trump the benefit of
> using a locally most convenient representation.

In my experience, when address arithmetic is involved, representing size
in au produces shorter source code than representing size in internal units.

> Concerning heterogeneous data structures (e.g., Forth-2012 structs),
> their size is given in aus, what else?

And it's more universal, as this example shows.

--
Ruvim

Anton Ertl

Apr 30, 2021, 1:47:42 PM

Ruvim <ruvim...@gmail.com> writes:
>On 2021-04-30 14:47, Anton Ertl wrote:
>> Concerning arrays, there are three convenient representations:
>>
>> 1) addr-start u-elems
>>
>> 2) addr-start u-aus
>>
>> 3) addr-start addr-end (where the last byte is at addr-end-1)
>>
>> Sometimes one representation is more convenient, sometimes a different
>> one. The common convention in Forth is to use 1, and in my experience
>> the benefits from following such a convention trump the benefit of
>> using a locally most convenient representation.
>
>
>By my experience, when address arithmetic is involved, representing size
>in au produces shorter source code than representing size in internal units.

I think it depends on the programming style. If you work mostly with
indices and convert to addresses only for the access, representation 1
is the most convenient and probably also results in shorter code. If,
OTOH, you work with addresses all the time, representations 3 or 2 are
more convenient.

My experience is that working with addresses rather than indices often
results in fewer stack items to juggle, and code that may be faster;
but the code is also harder to debug, because it's harder to see when
an address is wrong than when an index is wrong.

Anyway, the convention is to use representation 1, and because the
benefit of having a common convention trumps the benefits of
alternative representations, thinking about alternative
representations is barking up the wrong tree.

Maybe we should instead be defining looping constructs:

( addr u-elem ) u-size ado i @ . aloop \ prints elements from first to last
( addr u-elem ) u-size -ado i @ . aloop \ prints elements from last to first

where u-size is a constant.

Paul Rubin

May 1, 2021, 12:45:20 PM

dxforth <dxf...@gmail.com> writes:
> When features aren't designed-in from the beginning they will always
> be a patch. Like it or not Forth is forever tied to the threaded-
> code 7-bit thinking of the 70's. For this reason there are new
> languages (if one can stomach them).

That is actually kind of a liberating thought. It allows trying to
figure out the spirit of Forth and reimplement it without worrying too
much about compatibility. Several regulars here find ANS tried to
abstract too much, so they reverted back to 70s methods. Maybe it's
preferable to skip ahead to let's say 1990s methods, whatever those
might be. At minimum they would likely involve utf-8 strings and a
frame pointer register saved in the return stack.

Anton Ertl

May 1, 2021, 1:35:45 PM

Paul Rubin <no.e...@nospam.invalid> writes:
>dxforth <dxf...@gmail.com> writes:
>> When features aren't designed-in from the beginning they will always
>> be a patch. Like it or not Forth is forever tied to the threaded-
>> code 7-bit thinking of the 70's. For this reason there are new
>> languages (if one can stomach them).
>
>That is actually kind of a liberating thought.

He likes to write emotional-appeal statements. Apparently you fell
for this one.

>Maybe it's
>preferable to skip ahead to let's say 1990s methods, whatever those
>might be. At minimum they would likely involve utf-8 strings and a
>frame pointer register saved in the return stack.

Whatever floats your boat. UTF-8 strings work nicely with Gforth, and
mostly nicely with SwiftForth and VFX (mainly because UTF-8 is great).
VFX uses a frame pointer register and AFAIK saves it on the return
stack.

But what the typical complaints are about is the complexity of these
systems. They include various features that you want to have when you
are a user of the system (i.e., when you develop application code),
but that get in the way when you want to get a feeling of mastery and
understanding.

S Jack

May 1, 2021, 2:05:14 PM

On Saturday, May 1, 2021 at 11:45:20 AM UTC-5, Paul Rubin wrote:

> Maybe it's
> preferable to skip ahead to let's say 1990s methods, whatever those
> might be. At minimum they would likely involve utf-8 strings and a
> frame pointer register saved in the return stack.

Chrome forces me to use a unicode terminal but I prefer to work
in codepage. The luit filter allows me to do this:
$ ru # set locale to KOI8-R
$ luit # run filter
$ frog # my Fig-forth

Unicode output for me is an application.
Unicode text is a cell array of 16-bit codepoints.
The array is printed by use of UEMIT ( u+xxxx -- ) which
converts a codepoint to a unicode string and prints it.
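
A minimal sketch of such a word, assuming the "unicode string" is UTF-8
and the codepoint fits in 16 bits. This is not the actual UEMIT from the
system described here; the buffer ubuf and the helper >utf8 are
hypothetical names, and surrogate values are not treated specially.

```forth
hex
create ubuf 3 chars allot        \ scratch buffer, at most 3 bytes

\ Encode a codepoint in 0..FFFF as UTF-8 into ubuf.
: >utf8 ( u -- c-addr len )
  dup 80 u< if  ubuf c!  ubuf 1 exit  then       \ 1 byte: 0xxxxxxx
  dup 800 u< if                                  \ 2 bytes: 110xxxxx 10xxxxxx
    dup 6 rshift C0 or  ubuf c!
    3F and 80 or  ubuf char+ c!
    ubuf 2 exit
  then
  dup 0C rshift E0 or  ubuf c!                   \ 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  dup 6 rshift 3F and 80 or  ubuf char+ c!
  3F and 80 or  ubuf char+ char+ c!
  ubuf 3 ;

\ Convert the codepoint to a UTF-8 string and print it.
: uemit ( u -- ) >utf8 type ;
decimal
```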

Printing unicode to the terminal under luit is not a good idea;
the cursor for the unicode and the cursor for codepage output are
not in sync. It's best to open another window for unicode output.
--
me

Ruvim

May 1, 2021, 2:14:23 PM

On 2021-04-30 20:28, Anton Ertl wrote:
> Ruvim <ruvim...@gmail.com> writes:
>> On 2021-04-30 14:47, Anton Ertl wrote:
>>> Concerning arrays, there are three convenient representations:
>>>
>>> 1) addr-start u-elems
>>>
>>> 2) addr-start u-aus
>>>
>>> 3) addr-start addr-end (where the last byte is at addr-end-1)
>>>
>>> Sometimes one representation is more convenient, sometimes a different
>>> one. The common convention in Forth is to use 1, and in my experience
>>> the benefits from following such a convention trump the benefit of
>>> using a locally most convenient representation.
>>
>>
>> By my experience, when address arithmetic is involved, representing size
>> in au produces shorter source code than representing size in internal units.
>
> I think it depends on the programming style. If you work mostly with
> indices and convert to addresses only for the access, representation 1
> is the most convenient and probably also results in shorter code. If,
> OTOH, you work with addresses all the time, representations 3 or 2 are
> more convenient.
>
> My experience is that working with addresses rather than indices often
> results in fewer stack items to juggle, and code that may be faster;
> but the code is also harder to debug, because it's harder to see when
> an address is wrong than when an index is wrong.
>

> Anyway, the convention is to use representation 1,

Could you point to some examples (other than character strings)?

I haven't noticed such a convention yet.

Concerning arrays, an ( addr u-elem ) pair carries too little
information to be useful. Usually only addr is used. This addr is an
object reference from which you can get the number of elements, their
size (if any), the overall size, the n-th element, etc.

If you pass an array as an opaque data block, you probably pass it as an
( addr u-au ) pair, i.e. with the size in address units. If you pass it
as an array, you pass just its addr.
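
A minimal sketch of what such an object reference could look like,
assuming a two-cell header in front of the data. The layout and all word
names below are hypothetical, chosen only for this illustration; there is
no bounds checking.

```forth
\ Assumed layout: | u-elems | u-elem-size | data ... |
: array ( u-elems u-elem-size "name" -- )
  create 2dup swap , ,  * allot ;

: array-len       ( arr -- u-elems )  @ ;
: array-elem-size ( arr -- u-aus )    cell+ @ ;
: array-data      ( arr -- addr )     2 cells + ;

\ Overall data size in address units.
: array-size ( arr -- u-aus )
  dup array-len swap array-elem-size * ;

\ Address of the n-th element (0-based).
: array-nth ( n arr -- addr )
  dup >r array-elem-size *  r> array-data + ;
```

For example, after `10 1 cells array a`, the phrase `a array-data
a array-size` yields the ( addr u-au ) pair for passing the data as an
opaque block.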

> and because the
> benefit of having a common convention trumps the benefits of
> alternative representations, thinking about alternative
> representations is barking up the wrong tree.
>
> Maybe we should instead be defining looping constructs:
>
> ( addr u-elem ) u-size ado i @ . aloop \ prints elements from first to last
> ( addr u-elem ) u-size -ado i @ . aloop \ prints elements from last to first
>
> where u-size is a constant.

Yes, +loop is very rarely used with a non-constant step.

--
Ruvim

S Jack

May 1, 2021, 7:37:15 PM

Example:
https://drive.google.com/file/d/1CK4pUZUib8Rij-AtIf8bMTmCjIq2LxCr/view?usp=sharing
--
me