type

NN

Apr 2, 2021, 8:29:26 AM
https://forth-standard.org/standard/core/TYPE

As a suggested reference implementation, how about:

: type ( a u -- )
swap >r begin dup 0> while
r> count emit >r 1-
repeat drop r> drop ;

Brian Fox

Apr 2, 2021, 9:06:48 AM
On ITC systems DO/LOOP is faster because the comparison and branching
and return stack operations are not separated by NEXT.

So this is arguably better on those systems:

: TYPE ( caddr u --) ( PAUSE) OVER + SWAP DO I C@ EMIT LOOP ;

And if OVER + SWAP is a primitive in the system like BOUNDS
it's even smaller and faster.

I am acutely aware of these differences working with a 3 MHz processor.
:)
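
A minimal sketch of the BOUNDS idea, for readers whose system lacks the word. BOUNDS is not a standard word, TYPE-B is a hypothetical name chosen so that TYPE is not redefined, and the code assumes 1 CHARS = 1. It uses ?DO rather than DO so that a zero-length string runs the loop body zero times.

```forth
\ Sketch only. Many systems already provide BOUNDS; it is defined
\ here just to keep the example self-contained.
: bounds ( c-addr u -- c-addr+u c-addr ) over + swap ;

\ TYPE-B is a hypothetical name, not a word from the thread.
: type-b ( c-addr u -- ) bounds ?do i c@ emit loop ;

s" hello" type-b   \ a non-empty string
s" " type-b        \ zero length: ?DO skips the loop body
```

Whether this beats the spelled-out OVER + SWAP depends on whether BOUNDS is a primitive on the system in question.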



NN

Apr 2, 2021, 11:41:32 AM
Fair enough.
BTW, I think your example would fail on an empty string, which is why the
examples in the link use ?do

: type ( a u -- ) abs bounds ?do i c@ emit loop ;

If it's good enough someone can add it to the list.

Paul Rubin

Apr 2, 2021, 3:59:15 PM
NN <novembe...@gmail.com> writes:
> For suggested ref implementations how about.
> : type ( a u -- )
> swap >r begin dup 0> while
> r> count emit >r 1-
> repeat drop r> drop ;

Is something wrong with DO ?

: type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;

I first used ?DO to avoid the 0 test, but that turns out to be in
CORE-EXT rather than CORE.

P Falth

Apr 2, 2021, 6:13:29 PM
And how is EMIT defined?
I use something like

VARIABLE tmp
: EMIT tmp c! tmp 1 type ;

or if you want it to work with unicode chars

: EMIT tmp dup >r xc!+ r@ - r> swap type ;

BR
Peter

Brian Fox

Apr 2, 2021, 6:30:25 PM
On my system EMIT is primitive since it has to talk directly to
a Video chip. (TMS9918) So that gave me the freedom to do TYPE
the way I did.



Doug Hoffman

Apr 2, 2021, 6:52:03 PM
On 4/2/21 3:59 PM, Paul Rubin wrote:

> Is something wrong with DO ?
>
> : type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;

I think you need a 2drop:

: type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE 2drop THEN ;

-Doug

Brian Fox

Apr 2, 2021, 8:50:51 PM
On 2021-04-02 11:41 AM, NN wrote:

> Fair enough.
> BTW, I think your example would fail on an empty string which is the
> examples in the link use ?do
>
> : type ( a u -- ) abs bounds ?do i c@ emit loop ;
>
> If its good enough someone can add it to the list.
>

You are correct. I implement ?DO in my kernel and use it for
TYPE. I was typing from the hip.

Since a U is specified for the string length I think ABS
is incorrect. I have never seen it placed in TYPE.

It would be wise perhaps for Forth79 and FIG Forth DO LOOPS.

dxforth

Apr 2, 2021, 9:02:56 PM
Also the memory saved.

dxforth

Apr 2, 2021, 9:19:54 PM
A distinction only a Standard could make. For a small system to not
include ?DO would be to needlessly waste memory.

dxforth

Apr 3, 2021, 1:17:53 AM
: TYPE ?DUP IF 0 DO COUNT EMIT LOOP THEN DROP ;

Paul Rubin

Apr 3, 2021, 2:12:40 AM
dxforth <dxf...@gmail.com> writes:
> A distinction only a Standard could make. For a small system to not
> include ?DO would be to needlessly waste memory.

I don't understand why DO exists instead of ?DO being the default.

Paul Rubin

Apr 3, 2021, 2:15:41 AM
Doug Hoffman <dhoff...@gmail.com> writes:
> I think you need a 2drop:

Yep. Or I suppose I could have used ?DUP but that word always makes me
squirm because of its variable stack effect.

Paul Rubin

Apr 3, 2021, 2:17:54 AM
P Falth <peter....@gmail.com> writes:
> And how is EMIT defined?

I've usually thought of EMIT as a primitive that writes directly to a
hardware port, on low level systems. TYPE then uses EMIT.

dxforth

Apr 3, 2021, 3:25:49 AM
?DO came much later and involves an extra test you may not need.

Anton Ertl

Apr 3, 2021, 4:28:41 AM
Paul Rubin <no.e...@nospam.invalid> writes:
>NN <novembe...@gmail.com> writes:
>> For suggested ref implementations how about.
>> : type ( a u -- )
>> swap >r begin dup 0> while
>> r> count emit >r 1-
>> repeat drop r> drop ;
>
>Is something wrong with DO ?

Yes.

>: type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;
>
>I first used ?DO to avoid the 0 test, but that turns out to be in
>CORE-EXT rather than CORE.

So what. It's still the better word for this purpose. It's not your
job to find workarounds for systems without ?DO.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020

Anton Ertl

Apr 3, 2021, 5:10:47 AM
NN <novembe...@gmail.com> writes:
>https://forth-standard.org/standard/core/TYPE
>
>For suggested ref implementations how about.

You can suggest a reference implementation for TYPE there.

>: type ( a u -- )
> swap >r begin dup 0> while
> r> count emit >r 1-
> repeat drop r> drop ;

Others have presented versions that use ?DO (or DO surrounded by IF).
There are the following variants:

1) Use the address as the loop index:

: type1 ( c-addr u -- )
over + swap ?do i c@ emit loop ;

2) Use the array index as loop index:

: type2 ( c-addr u -- )
0 ?do dup i + c@ emit loop drop ;

3) Use the length as loop limit, don't use the loop index:

: type3 ( c-addr u -- )
0 ?do dup c@ emit 1+ loop drop ;

or less obvious

: type3 ( c-addr u -- )
0 ?do count emit loop drop ;

or you could use a variant of FOR..NEXT that supports 0-trip loops.

If you want to cater for "1 chars > 1" systems (try to get one for
testing!), these definitions become more complicated:

: type1 ( c-addr u -- )
chars over + swap ?do i c@ emit 1 chars +loop ;

: type2 ( c-addr u -- )
0 ?do dup i chars + c@ emit loop drop ;

: type3 ( c-addr u -- )
0 ?do dup c@ emit char+ loop drop ;

In this case the differences between TYPE1, TYPE2 and TYPE3 are small,
but in general, when looping over arrays, I prefer to use the address
as loop index (i.e. TYPE1), because it usually means that I don't have
to keep the (base or running) address elsewhere throughout the loop
body, resulting in less stack juggling.
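
To illustrate the stack-juggling point on a loop that also carries a result, here is a small sketch that sums the bytes of a string both ways. SUM1 and SUM2 are hypothetical names, not words from any system discussed here, and the code assumes 1 CHARS = 1.

```forth
\ Address as loop index: only the running sum lives on the data stack.
: sum1 ( c-addr u -- n )
  0 rot rot over + swap ?do  i c@ +  loop ;

\ Length as loop limit: the running address must be kept on the stack
\ next to the sum, so each iteration needs ROT and SWAP.
: sum2 ( c-addr u -- n )
  0 rot rot 0 ?do  count rot + swap  loop drop ;

s" AB" sum1 .   \ prints 131 (65 + 66)
s" AB" sum2 .   \ prints 131
```

Both give the same result; the difference is only in how much the loop body has to shuffle.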

Anton Ertl

Apr 3, 2021, 5:31:35 AM
P Falth <peter....@gmail.com> writes:
>or if you want it to work with unicode chars
>
>: EMIT tmp dup >r xc!+ r@ - r> swap type ;

EMIT is defined to work on chars, not on xchars. We have XEMIT for
printing one xchar on the stack (or TYPE for printing one or more
xchars in memory).

And in general, EMIT cannot be extended to work like XEMIT: EMIT
prints the raw byte; and on, e.g., a system like Gforth (with UTF-8
encoding) where XEMIT takes a Unicode code point number as input and
produces UTF-8 as output, the output for, e.g., $C4 XEMIT consists of
two bytes ($c3 $84), while $C4 EMIT just outputs $c4. That's why the
xchar wordset contains XEMIT (unlike what I proposed in 2005); the use
of EMIT and KEY for dealing with raw bytes was pointed out by Stephen
Pelc.
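
As a sketch of the distinction, assuming a system with the xchar wordset loaded, UTF-8 as the xchar encoding, and a raw-byte EMIT (RAW-AE and XCHAR-AE are hypothetical names):

```forth
\ Two raw bytes: the UTF-8 encoding of U+00C4, sent one byte at a time.
: raw-ae   ( -- ) $C3 emit $84 emit ;

\ One code point: XEMIT does the encoding itself.
: xchar-ae ( -- ) $C4 xemit ;
```

On such a system with a UTF-8 terminal, both words should show the same character. On a system where EMIT behaves like XEMIT, RAW-AE would instead output two separate characters.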

I know that you do not use codepoint numbers for the on-stack
representation of characters, but something derived from the in-memory
(i.e., string) representation. It makes me wonder if, with your
on-stack representation, XEMIT can output any raw byte, or if this
results in the byte followed by a number of 0 bytes for certain byte
values.

P Falth

Apr 3, 2021, 6:06:10 AM
On a forth hosted on an OS, I found it convenient to implement
TYPE as the primitive. On my lxf, 32-bit under Linux, TYPE is

: (type) ( addr len -- ) swap 1 4 syscall3 drop ;

syscall3 implements a kernel call with 3 parameters
( plus the syscall nr)

Peter

Doug Hoffman

Apr 3, 2021, 7:44:34 AM
Some Forths don't default to always showing the stack depth when
execution is done. I think Gforth is one of those but maybe it can be
configured to do so. It is a pain to type .s every time after you test a
word. Anyway, I am guessing that is why you didn't notice that your TYPE
was leaving something on the stack.

I only tested the edge case where the string length was zero. My fault.
dxforth caught it and showed one possible correction.

-Doug
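
A leftover cell like this can be caught without typing .s after every test by comparing DEPTH before and after the call. The sketch below is one way to do it; TYPE-IF, TYPE-?DUP and LEAKED are hypothetical names. TYPE-IF is the IF ... ELSE 2DROP version from earlier in the thread, and TYPE-?DUP is the ?DUP version.

```forth
: type-if   ( c-addr u -- ) dup if 0 do dup c@ emit 1+ loop else 2drop then ;
: type-?dup ( c-addr u -- ) ?dup if 0 do count emit loop then drop ;

\ Run xt on the string "abc", report how many cells it left behind,
\ and clean those cells up again.
: leaked ( xt -- n )
  depth >r                 \ remember the depth, including xt itself
  s" abc" rot execute
  depth r> - 1+            \ n = cells left behind
  dup >r  0 max 0 ?do drop loop  r> ;

' type-if   leaked .   \ prints abc1 : the address is still on the stack
' type-?dup leaked .   \ prints abc0
```

The empty-string case needs its own check, since a version can be balanced for one and not the other.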


P Falth

Apr 3, 2021, 8:22:20 AM
On Saturday, 3 April 2021 at 11:31:35 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >or if you want it to work with unicode chars
> >
> >: EMIT tmp dup >r xc!+ r@ - r> swap type ;
> EMIT is defined to work on chars, not on xchars. We have XEMIT for
> printing one xchar on the stack (or TYPE for printing one or more
> xchars in memory).
>
> And in general, EMIT cannot be extended to work like XEMIT: EMIT
> prints the raw byte; and on, e.g., a system like Gforth (with UTF-8
> encoding) where XEMIT takes a Unicode code point number as input and
> produces UTF-8 as output, the output for, e.g., $C4 XEMIT consists of
> two bytes ($c3 $84), while $C4 EMIT just outputs $c4. That's why the
> xchar wordset contains XEMIT (unlike what I proposed in 2005); the use
> of EMIT and KEY for dealing with raw bytes was pointed out by Stephen
> Pelc.
>
> I know that you do not use codepoint numbers for the on-stack
> representation of characters, but something derived from the in-memory
> (i.e., string) representation. It makes me wonder if, with your
> on-stack representation, XEMIT can output any raw byte, or if this
> results in the byte followed by a number of 0 bytes for certain byte
> values.
> - anton

I found that my stack representation was a dead end, creating problems
with no other gains. I switched to stack representation being the codepoint.

My systems are also unicode only, no other encodings supported. For this
reason I have emit=xemit and key=xkey. This has so far not given me any
problem.

$C4 EMIT outputs Ä.
What else should it output?
In fact that was the reason I started with unicode about 20 years ago.
To be able to spell and type my last name properly. It is Fälth!

BR
Peter

Anton Ertl

Apr 3, 2021, 8:53:05 AM
P Falth <peter....@gmail.com> writes:
>I found that my stack representation was a dead end, creating problems
>with no other gains.

Interesting. What were the problems?

>My systems are also unicode only, no other encodings supported. For this
>reason I have emit=3Dxemit and key=3Dxkey. This has so far not given me any=
>=20
>problem.
>
> $C4 EMIT outputs =C3=84.=20
>What else should it output?

Just the raw byte $c4. For xchars, there is XEMIT, which should
output UTF-8 on your system.

>To be able to spell and type my last name properly. It is F=C3=A4lth!

And then Google Groups mangles it into quoted-printable encoding:-)

P Falth

Apr 3, 2021, 9:20:59 AM
On Saturday, 3 April 2021 at 14:53:05 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >I found that my stack representation was a dead end, creating problems
> >with no other gains.
> Interesting. What were the problems?

I tried to avoid coding and decoding by just using c!/c@, w!/w@ and a 3@3! to get 3
bytes. easy to implement but:

I could not take a codepoint directly and use it. I had to store it, look at the dump
or @ it to understand how I should write it.

sorting order was completely lost

No other language used a similar encoding

> >My systems are also unicode only, no other encodings supported. For this
> >reason I have emit=3Dxemit and key=3Dxkey. This has so far not given me any=
> >=20
> >problem.
> >
> > $C4 EMIT outputs =C3=84.=20
> >What else should it output?
> Just the raw byte $c4. For xchars, there is XEMIT, which should
> output UTF-8 on your system.

the first paragraph in forth2012 of EMIT says

6.1.1320 EMIT CORE ( x -- )
If x is a graphic character in the implementation-defined character set, display x. The
effect of EMIT for all other values of x is implementation-defined.

$C4 is Ä in my implementation-defined character set so it displays that!

>
> >To be able to spell and type my last name properly. It is F=C3=A4lth!
>
> And then Google Groups mangles it into quoted-printable encoding:-)

After soon 27 years in Italy I have become used to different spellings and
pronunciations of my name. The 2 dots have almost always disappeared

Peter

Anton Ertl

Apr 3, 2021, 10:39:44 AM
P Falth <peter....@gmail.com> writes:
>On Saturday, 3 April 2021 at 14:53:05 UTC+2, Anton Ertl wrote:
>> P Falth <peter....@gmail.com> writes:
>> > $C4 EMIT outputs Ä.
>> >What else should it output?
>> Just the raw byte $c4. For xchars, there is XEMIT, which should
>> output UTF-8 on your system.
>
>the first paragraph in forth2012 of EMIT says
>
>6.1.1320 EMIT CORE ( x -- )
>If x is a graphic character in the implementation-defined character set, display x. The
>effect of EMIT for all other values of x is implementation-defined.

The second paragraph says

|When passed a character whose character-defining bits have a value
|between hex 20 and 7E inclusive, the corresponding standard character,
|specified by 3.1.2.1 Graphic characters, is displayed. Because
|different output devices can respond differently to control
|characters, programs that use control characters to perform specific
|functions have an environmental dependency. Each EMIT deals with only
|one character.

It does not really say what happens for other characters. In any
case, the intention of adding XEMIT was so that EMIT could be used for
raw bytes. And several systems (I think all, but yours) work that
way. I guess we should fix the definition of EMIT.

Paul Rubin

Apr 3, 2021, 12:07:29 PM
P Falth <peter....@gmail.com> writes:
> My systems are also unicode only, no other encodings supported. For this
> reason I have emit=xemit and key=xkey. This has so far not given me any
> problem.

That means more complicated methods for actual binary i/o, I guess.
It's nice to be able to write raw bytes when you want to.

P Falth

Apr 3, 2021, 1:43:09 PM
I checked VfX. The windows version outputs Ä for $C4 EMIT
but for $20ac (€) it does not work. probably it supports Latin-1 or similar.
Linux version does not work for EMIT, same output as gforth.

Already when we discussed this 20? years ago I was against xemit and
xkey. My position has always been that emit and key should work with
the implemented character encoding. If you need to send byte by byte
these functions should be named differently, like pemit and pkey.
If input or output is redirected, key and emit should behave according
to the specifics of the new source, for example by being deferred.

Peter

P Falth

Apr 3, 2021, 1:45:28 PM
WRITE-FILE and READ-FILE works for that!

Peter

NN

Apr 4, 2021, 8:02:35 AM
128 emit € <- depends on the font.

Anton Ertl

Apr 4, 2021, 10:05:43 AM
P Falth <peter....@gmail.com> writes:
>I checked VfX. The windows version outputs Ä for $C4 EMIT

That's strange. Stephen Pelc argued against the extension of EMIT to
work on xchars and for EMIT to process raw bytes. His argument
resulted in the introduction of XEMIT for dealing with xchars. My
guess is that this behaviour is not intentional.

On Linux Gforth, SwiftForth and VFX seem to process raw bytes. iForth
behaves strangely:

iforth "1 $c3 emit .s 1 $a4 emit .s bye"

shows an empty stack twice, so EMIT apparently consumes two stack
items.

lxf behaves as you described.

>Already when we discussed this 20? years ago I was against xemit and
>xkey. My position has always been that emit and key should work with
>the implemented character encoding. If you need to send byte by byte
>these functions should be named differently, like pemit and pkey.

Well, the decision has been to add XEMIT and XKEY, and that's water
down the river. We also have EMIT and KEY, and the intention at some
time was for them to deal with raw bytes, but that intention has not
been reflected in the text of the standard document yet. I don't
expect a proposal for PEMIT and PKEY (if somebody writes it) to be
successful, but who knows.

Originally I proposed to let EMIT and KEY work on xchars (as XEMIT and
XKEY do now) and was not enthusiastic about EMIT and KEY for raw
bytes. But it has the advantage that the Forth-94 code like

: type1 ( c-addr u -- )
over + swap ?do i c@ emit loop ;

works as intended on Forth-2012, even when running on a UTF-8 system
and passing an UTF-8 string to TYPE1. However, the difference between
VFX for Windows and for Linux indicates that in practice, this
advantage is not used (at least not by VFX users on Windows).

>If input or output is redirected key and emit should behave according
>to the specifics of the new source for example by being defered

A deferred word should implement a common interface. In case of EMIT,
all implementations should process a raw byte, or all implementations
should process a code point. If you want EMIT to behave like XEMIT,
then it should do that even when redirected to a serial port;
conversely, if we want EMIT to process raw bytes, all the
implementations of EMIT should do that.
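
A sketch of what "common interface" means for a deferred output word, under the raw-byte reading: every word plugged in takes one char and treats it as one byte, so callers do not need to know where the output goes. BYTE-OUT, CONSOLE-BYTE, COUNTING-BYTE, SINK-COUNT and TYPE-VIA are hypothetical names; DEFER and IS are from CORE EXT in Forth-2012, and the code assumes 1 CHARS = 1.

```forth
defer byte-out ( char -- )

: console-byte  ( char -- ) emit ;        \ send the byte to the console
variable sink-count
: counting-byte ( char -- ) drop 1 sink-count +! ;  \ stand-in for another device

: type-via ( c-addr u -- ) over + swap ?do i c@ byte-out loop ;

' console-byte is byte-out   s" abc" type-via       \ prints abc
0 sink-count !
' counting-byte is byte-out  s" abc" type-via
sink-count @ .                                      \ prints 3
```

If one of the plugged-in words expected a code point and another a raw byte, TYPE-VIA would be correct for only one of them, which is the problem described above.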

P Falth

Apr 4, 2021, 10:08:41 AM
No it also depends on what codepage you use. That could be Windows 1252.
It is definitely not unicode.

P Falth

Apr 4, 2021, 1:31:19 PM
It will work on Linux but not on a Windows console.
For that to work on a Windows console I need to define emit as

\ emit for raw bytes

variable tmpbytes
variable #tmp
variable #expected


: XCS ( xcaddr -- n ) \ size of xc in bytes stored at addr
c@
dup $80 u< if drop 1 exit then
dup $e0 u< if drop 2 exit then
dup $f0 u< if drop 3 exit then
dup $f8 u< if drop 4 exit then
dup $fc u< if drop 5 exit then
dup $fe u< if drop 6 exit then
drop 6 ;


: leadingchar ( pchar -- )
tmpbytes c!
1 #tmp !
tmpbytes xcs 1- #expected ! ;

: trailingchar ( pchar -- )
tmpbytes #tmp @ + c!
1 #tmp +!
-1 #expected +! ;

: emit ( pchar -- )
#tmp @ if trailingchar else leadingchar then
#expected @ 0= if tmpbytes #tmp @ type 0 #tmp ! then ;

Where type is defined like
: prtmp here 4096 + aligned ;

: type ( addr len -- )
dup if
prtmp over 2* utf>wc16-string 2/
0 temp 2swap swap conout @ writeconsolew drop exit then
2drop ;

the utf8 string is transformed to utf16 and sent to the WriteConsoleW
system call.

So first I need to split the character to print it as raw bytes and then
in EMIT put them together again. I guess this becomes cooked bytes now!

The Windows console has no similarities with a Linux VT-console.
Fortunately Microsoft has introduced the Windows Terminal.
It has almost full support for UTF8 and VT codes. Printing utf8
strings to the console is now working. It is not even limited to the
first 64K codepoints as before. Reading does not work yet.

> >If input or output is redirected key and emit should behave according
> >to the specifics of the new source for example by being defered
> A deferred word should implement a common interface. In case of EMIT,
> all implementations should process a raw byte, or all implementations
> should process a code point. If you want EMIT to behave like XEMIT,
> then it should do that even when redirected to a serial port;
> conversely, if we want EMIT to process raw bytes, all the
> implementations of EMIT should do that.

What I mean is that if emit is redirected to a serial port I would regard that
as having a character set of 0-255 and every byte will be raw bytes.

Peter

Anton Ertl

Apr 7, 2021, 5:22:19 AM
P Falth <peter....@gmail.com> writes:
>It will work on Linux but not on a Windows console.
>
>So first I need to split the character to print it as raw bytes and then
>in EMIT put them together again. I guess this becomes cooked bytes now!
>
>The Windows console has no similarities with a Linux VT-console.
>Fortunately Microsoft has introduced the Windows Terminal.
>It has almost full support for UTF8 and VT codes. Printing utf8
>strings to the console is now working.

Ruvim reports in
<https://forth-standard.org/proposals/emit-and-non-ascii-values#reply-627>:

|I tested SP-Forth/4 in Windows (by setting UTF-8 code page in the
|console via chcp 65001 command), and in Linux. The test:
|
|HEX C3 EMIT A4 EMIT
|
|outputs ä
|
|In SP-Forth the word EMIT is implemented via TYPE (that is via WRITE-FILE).
|
|In the test
|
|HEX C3 EMIT KEY DROP A4 EMIT
|
|we can see that after the first emit nothing is shown, and after the
|second emit the character ä is shown.

I don't know how SP-Forth/4 calls Windows, and whether Ruvim used the
Windows Terminal, but it's apparently possible to implement EMIT in
the "raw byte" way on Windows, too.

I wonder if in VFX on Windows the "typical use" case works as intended
if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
don't pass the test) unless you tell the system that you use UTF-8
(but these days, UTF-8 is the default setting).

>> >If input or output is redirected key and emit should behave according
>> >to the specifics of the new source for example by being defered
>> A deferred word should implement a common interface. In case of EMIT,
>> all implementations should process a raw byte, or all implementations
>> should process a code point. If you want EMIT to behave like XEMIT,
>> then it should do that even when redirected to a serial port;
>> conversely, if we want EMIT to process raw bytes, all the
>> implementations of EMIT should do that.
>
>What I mean is that if emit is redirected to a serial port I would regard that
>as having a character set of 0-255 and every byte will be raw bytes.

But if the serial port is connected to something expecting UTF-8, the
behaviour would differ from the behaviour of EMIT on the console: the
"typical use" example would work, while it does not work on the
console.

In any case, I have written a proposal on the wording of EMIT
<https://forth-standard.org/proposals/emit-and-non-ascii-values#contribution-184>,
and you may want to contribute to it.

none albert

Apr 7, 2021, 10:19:57 AM
In article <a526461b-316d-4baa...@googlegroups.com>,
P Falth <peter....@gmail.com> wrote:
>On Friday, 2 April 2021 at 21:59:15 UTC+2, Paul Rubin wrote:
>> NN <novembe...@gmail.com> writes:
>> > For suggested ref implementations how about.
>> > : type ( a u -- )
>> > swap >r begin dup 0> while
>> > r> count emit >r 1-
>> > repeat drop r> drop ;
>> Is something wrong with DO ?
>>
>> : type ( a u -- ) dup IF 0 DO dup c@ emit 1+ LOOP ELSE drop THEN ;
>>
>> I first used ?DO to avoid the 0 test, but that turns out to be in
>> CORE-EXT rather than CORE.
>
>And how is EMIT defined?
>I use something like
>
>VARIABLE tmp
>: EMIT tmp c! tmp 1 type ;
>
>or if you want it to work with unicode chars
>
>: EMIT tmp dup >r xc!+ r@ - r> swap type ;

Indeed it is much better to have
: TYPE 1 ( stdout) WRITE-FILE THROW ;

You don't need a tmp as long as you can point into
the data stack:
: EMIT DSP@ 1 TYPE DROP ;

>
>BR
>Peter

Groetjes Albert
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

none albert

Apr 7, 2021, 10:35:53 AM
In article <64189ef8-0164-4239...@googlegroups.com>,
Let's try
\----------------
HEX
CREATE buffer C4 ,
: doit buffer 1 TYPE ;
\----------------
lina -c doemit.frt
doemit | hd
00000000 c4 |.|
00000001
Anybody who expects something different?

Now xterm prints an upper-case A with an umlaut, and the Linux terminal
shows a square with a question mark. So the interpretation of the character
has little to do with Forth I'd say, more with the terminal.
>
>Already when we discussed this 20? years ago I was against xemit and
>xkey. My position has always been that emit and key should work with
>the implemented character encoding. If you need to send byte by byte
>these functions should be named differently, like pemit and pkey.
>If input or output is redirected key and emit should behave according
>to the specifics of the new source for example by being defered

There is a difference if one types <esc>[A or <cursor> up on this keyboard.
It is in timing. So I can differentiate between the two and
cursor up can be recognized as a single (extended) key.
: XKEY KEY BEGIN KEY? WHILE 8 LSHIFT KEY OR REPEAT ;

>
>Peter

P Falth

Apr 7, 2021, 1:14:36 PM
How do you manage to type <esc>[A ?
I cannot manage the fingers on my keyboard to do it.
It also returns immediately on <esc>

P Falth

Apr 7, 2021, 5:49:36 PM
On Wednesday, 7 April 2021 at 11:22:19 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >It will work on Linux but not on a Windows console.
> >
> >So first I need to split the character to print it as raw bytes and then
> >in EMIT put them together again. I guess this becomes cooked bytes now!
> >
> >The Windows console has no similarities with a Linux VT-console.
> >Fortunately Microsoft has introduced the Windows Terminal.
> >It has almost full support for UTF8 and VT codes. Printing utf8
> >strings to the console is now working.
> Ruvim reports in
> <https://forth-standard.org/proposals/emit-and-non-ascii-values#reply-627>:
>
> |I tested SP-Forth/4 in Windows (by setting UTF-8 code page in the
> |console via chcp 65001 command), and in Linux. The test:
> |
> |HEX C3 EMIT A4 EMIT
> |
> |outputs ä
> |
> |In SP-Forth the word EMIT is implemented via TYPE (that is via WRITE-FILE).
> |
> |In the test
> |
> |HEX C3 EMIT KEY DROP A4 EMIT
> |
> |we can see that after the first emit nothing is shown, and after the
> |second emit the character ä is shown.
>
> I don't know how SP-Forth/4 calls Windows, and whether Ruvim used the
> Windows Terminal, but it's apparently possible to implement EMIT in
> the "raw byte" way on Windows, too.

It has always been possible to set the codepage to 65001 for output.
It has not always worked correctly. One problem being wrong number of
written bytes reported. Input of utf8 has never worked and unfortunately not
even now with the new Windows Terminal

> I wonder if in VFX on Windows the "typical use" case works as intended
> if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
> don't pass the test) unless you tell the system that you use UTF-8
> (but these days, UTF-8 is the default setting).

No that does not make any change. But loading xchar.fth in VFX and
defining:
VARIABLE etmp
: EMIT etmp dup >r xc!+ r@ - r> swap type ;

Make emit work as I want!
$20ac emit € ok

This works on both Windows and Linux versions.
Also in Gforth that definition works.

> >> >If input or output is redirected key and emit should behave according
> >> >to the specifics of the new source for example by being defered
> >> A deferred word should implement a common interface. In case of EMIT,
> >> all implementations should process a raw byte, or all implementations
> >> should process a code point. If you want EMIT to behave like XEMIT,
> >> then it should do that even when redirected to a serial port;
> >> conversely, if we want EMIT to process raw bytes, all the
> >> implementations of EMIT should do that.
> >
> >What I mean is that if emit is redirected to a serial port I would regard that
> >as having a character set of 0-255 and every byte will be raw bytes.
> But if the serial port is connected to something expecting UTF-8, the
> behaviour would differ from the behaviour of EMIT on the console: the
> "typical use" example would work, while it does not work on the
> console.
>
> In any case, I have written a proposal on the wording of EMIT
> <https://forth-standard.org/proposals/emit-and-non-ascii-values#contribution-184>,
> and you may want to contribute to it.

I have seen that. I have registered and will contribute

BR
Peter

Coos Haak

Apr 8, 2021, 6:57:24 AM
Op Wed, 7 Apr 2021 10:14:34 -0700 (PDT) schreef P Falth:

> How do you manage to type <esc>[A ?
> I cannot manage the fingers on my keyboard to do it.
> It also returns immediately on <esc>
>
HEX
: <ESC> 1B EMIT ;
: UP <ESC> S" [A" TYPE ;

groet, Coos

Anton Ertl

Apr 8, 2021, 7:32:10 AM
P Falth <peter....@gmail.com> writes:
>It has always been possible to set the codepage to 65001 for output.
>It has not always worked correctly. One problem being wrong number of
>written bytes reported.

Reported by whom? And why is that a problem?

>> I wonder if in VFX on Windows the "typical use" case works as intended
>> if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
>> don't pass the test) unless you tell the system that you use UTF-8
>> (but these days, UTF-8 is the default setting).
>
>No that does not make any change. But loading xchar.fth in VFX and
>defining:
>VARIABLE etmp
>: EMIT etmp dup >r xc!+ r@ - r> swap type ;
>
>Make emit work as I want!
>$20ac emit € ok

A simpler implementation of what you want is, after loading VFX's
xchar.fth:

: emit xemit ;

Or you could just use XEMIT directly, which is what I would recommend
if you want to deal with an xchar.

P Falth

Apr 8, 2021, 9:21:03 AM
On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:
> P Falth <peter....@gmail.com> writes:
> >It has always been possible to set the codepage to 65001 for output.
> >It has not always worked correctly. One problem being wrong number of
> >written bytes reported.
> Reported by whom? And why is that a problem?

Google is our friend here. Here is one example from a Perl bug
https://github.com/perl/perl5/issues/13794

WriteFile returns characters written and not bytes. If your function
checks what has been written it can see a lower value than expected
and try to write the "missing" bytes. This is obviously not a problem
in our case as we expect the display to consume the whole string and
do not check.

>
> >> I wonder if in VFX on Windows the "typical use" case works as intended
> >> if you do "chcp 65001" first. On Linux you also don't get UTF-8 (and
> >> don't pass the test) unless you tell the system that you use UTF-8
> >> (but these days, UTF-8 is the default setting).
> >
> >No that does not make any change. But loading xchar.fth in VFX and
> >defining:
> >VARIABLE etmp
> >: EMIT etmp dup >r xc!+ r@ - r> swap type ;
> >
> >Make emit work as I want!
> >$20ac emit € ok
>
> A simpler implementation of what you want is, after loading VFX's
> xchar.fth:
>
> : emit xemit ;
>
> Or you could just use XEMIT directly, which is what I would recommend
> if you want to deal with an xchar.

I could also do
: EMIT dup $80 < if emit else xemit then ;

But this is silly! You mention somewhere that we do not have an XTYPE
as TYPE knows how to deal correctly with the string. I think the same should
be true for EMIT and KEY. They should know how to deal with a codepoint if
I implement Unicode support in my Forth.

This works on my systems. (I hope Google does not mess this up too much)
'A' emit A ok
'Ä' emit Ä ok
'$' emit $ ok
'£' emit £ ok
'€' emit € ok

On Gforth I get
'A' emit A ok
'Ä' emit � ok
'$' emit $ ok
'£' emit � ok
'€' emit � ok

And you are saying that my system is non standard!
If I can enter a character from my keyboard I also expect EMIT to display it.

Yes I could have used XEMIT and both examples would have been the same.
But I see no use of them in any programs, people continue to use emit and key.
How many systems have implemented unicode support as part of the core
system and not as a loadable file buried in a library?

BR
Peter

Ruvim

Apr 8, 2021, 2:48:29 PM
On 2021-04-08 16:21, P Falth wrote:
> On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:
[...]
>> A simpler implementation of what you want is, after loading VFX's
>> xchar.fth:
>>
>> : emit xemit ;
>>
>> Or you could just use XEMIT directly, which is what I would recommend
>> if you want to deal with an xchar.
>
> I could also do
> : EMIT dup $80 < if emit else xemit then ;
>
> But this is silly!

Having EMIT that is equivalent to XEMIT is also silly.
Do you think it's worth deprecating XEMIT?

> You mention somewhere that we do not have an XTYPE
> as TYPE knows how to deal correctly with the string. I think the same should
> be true for EMIT and KEY. They should know how to deal with a codepoint if
> I implement Unicode support in my Forth.
>
> This works on my systems. (I hope Google does not mess this up too much.)
> 'A' emit A ok
> 'Ä' emit Ä ok
> '$' emit $ ok
> '£' emit £ ok
> '€' emit € ok
>
> On Gforth I get
> 'A' emit A ok
> 'Ä' emit � ok
> '$' emit $ ok
> '£' emit � ok
> '€' emit � ok

The optional Extended-Character word set suggests that [CHAR] and
character literals return an xchar (code point), and that a program should
then use XEMIT to print it:

'Ä' xemit

Perhaps emit may throw an exception if the given pchar cannot be a part
of a correct xchar in the sequence.


> And you are saying that my system is non standard!

Actually not because of that.


> If I can enter a character from my keyboard I also expect EMIT to display it.

Yes.

So the sequence "KEY EMIT" should always be correct (ditto "XKEY XEMIT").

Hence if EMIT handles only pchars, then KEY should also return only pchars.



The idea is that the expressions

s" Ä" type

s" Ä" over 1 type 1 /string type

s" Ä" drop dup c@ emit char+ c@ emit


should all produce the same result when UTF-8 encoding is used.
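
One way to try this claim on a given system is to print the same string with TYPE and then one primitive character at a time. The following is only a minimal sketch: `emit-each` is an illustrative name, not a standard word, and `?DO` is assumed to be available from CORE-EXT.

```forth
\ Output a string one primitive character (pchar) at a time.
\ emit-each is a hypothetical helper name used for illustration.
: emit-each ( c-addr u -- )
  0 ?do            \ loop u times, zero times for an empty string
    dup c@ emit    \ fetch the current pchar and send it
    char+          \ advance to the next pchar
  loop drop ;
```

On a system where EMIT passes pchars through unchanged, `s" Ä" type` and `s" Ä" emit-each` would be expected to display the same thing.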


How would you explain it if they produce different results?


> Yes I could have used XEMIT and both examples would have been the same.

> But I see no use of them in any programs, people continue to use emit and key.



--
Ruvim

P Falth

unread,
Apr 8, 2021, 3:28:11 PMApr 8
to
On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
> On 2021-04-08 16:21, P Falth wrote:
> > On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:
> [...]
> >> A simpler implementation of what you want is, after loading VFX's
> >> xchar.fth:
> >>
> >> : emit xemit ;
> >>
> >> Or you could just use XEMIT directly, which is what I would recommend
> >> if you want to deal with an xchar.
> >
> > I could also do
> > : EMIT dup $80 < if emit else xemit then ;
> >
> > But this is silly!
> Having EMIT that is equivalent to XEMIT is also silly.
> Do you think it's worth to deprecate XEMIT ?

Yes, and also XKEY; that is why I am responding to this thread.
My Windows system uses UTF-16 for TYPE. The UTF-8 string is converted
to UTF-16 before calling the OS WriteConsoleW function.
The Linux version types the UTF-8 string directly.

One of the problems with EMIT being restricted to pchars is that you need
to know the encoding of the underlying string, as in your example above.

If EMIT takes a Unicode codepoint, as I suggest, the encoding does not
need to be known to the programmer.

Peter

Paul Rubin

unread,
Apr 8, 2021, 3:47:31 PMApr 8
to
P Falth <peter....@gmail.com> writes:
> This works on my systems. (I hope Google does not mess this up too much.)
> 'A' emit A ok
> 'Ä' emit Ä ok ...
> And you are saying that my system is non standard!
> If I can enter a character from my keyboard I also expect EMIT to display it.

At odds here is an idea, I guess under dispute, that in the old days 1
character was 1 byte so EMIT would always send a byte; but now with
Unicode, some chars have multibyte encoding. So we have EMIT for bytes
and XEMIT for codepoints under whatever encoding.

We also had the idea that KEY would read a character (i.e. byte) from a
keyboard, but for decades before anyone cared about Unicode, keyboards
had cursor keys and function keys that send escape sequences. Should we
expect KEY to properly read those and encode them somehow? Are there
even Unicode codepoints for them (I don't know)?

What does your keyboard actually transmit when you type "Ä" (capital A
with umlaut, codepoint 00C4)? My guess is it actually sends an ISO
8859-1 character (single byte) which also happens to be 00C4 so your
EMIT possibly has to translate it to some other encoding like UTF-8 on
output. Do you want EMIT to also be able to display CJK characters if
your keyboard can transmit them? Or maybe your system simply displays
ISO 8859-1 directly and doesn't bother with Unicode. Today that is in
some ways an annoying legacy system, but it was a workable way to deal
with European alphabets for a while, and maybe still is for your
particular application.

It seems to me that 1) there's no point having EMIT and XEMIT as
separate words if they both do the same thing; 2) having a simple way to
read and write single bytes is still important.

P Falth

unread,
Apr 8, 2021, 4:55:53 PMApr 8
to
On Thursday, 8 April 2021 at 21:47:31 UTC+2, Paul Rubin wrote:
> P Falth <peter....@gmail.com> writes:
> > This works on my systems. (I hope Google does not mess this up too much.)
> > 'A' emit A ok
> > 'Ä' emit Ä ok ...
> > And you are saying that my system is non standard!
> > If I can enter a character from my keyboard I also expect EMIT to display it.
> At odds here is an idea, I guess under dispute, that in the old days 1
> character was 1 byte so EMIT would always send a byte; but now with
> Unicode, some chars have multibyte encoding. So we have EMIT for bytes
> and XEMIT for codepoints under whatever encoding.
>
> We also had the idea that KEY would read a character (i.e. byte) from a
> keyboard, but for decades before anyone cared about Unicode, keyboards
> had cursor keys and function keys that send escape sequences. Should we
> expect KEY to properly read those and encode them somehow? Are there
> even Unicode codepoints for them (I don't know)?

KEY in my systems does not return function or cursor keys. EKEY does
this. They are coded in the part of the 32-bit space not covered by Unicode codepoints.

>
> What does your keyboard actually transmit when you type "Ä" (capital A
> with umlaut, codepoint 00C4)? My guess is it actually send an ISO
> 8859-1 character (single byte) which also happens to be 00C4 so your
> EMIT possibly has to translate it to some other encoding like UTF-8 on
> output. Do you want EMIT to also be able to display CJK characters if
> your keyboard can transmit them? Or maybe your system simply displays
> ISO 8859-1 directly and doesn't bother with Unicode. Today that is in
> some ways an annoying legacy system, but it was workable way to deal
> with European alphabets for a while, and maybe still is for your
> particular application.

My systems, ntf on Windows and lxf on Linux, have supported Unicode for 20 years.
I have had no problems having emit and key work on Unicode codepoints.

The way to achieve this is very different on Windows and Linux.
On Windows a 2-byte codepoint is returned directly from the OS by KEY. This limits
the codepoints to the first 64K. This was a limitation of the Windows console.
(Microsoft has now improved the console and my ntf64 can use the complete
range of Unicode codepoints.) In the same way EMIT uses the OS function
WriteConsoleW to write the 16-bit codepoint directly to the screen.

On Linux characters arrive as UTF-8 streams that are converted by KEY to the proper
codepoint. EMIT stores the codepoint as UTF-8 in a string that is sent to TYPE.
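
The "encode, then TYPE" path described for Linux can be sketched without the xchar word set, where `XC!+` would normally do the encoding. This is only an illustration under stated assumptions: the names `utf8-buf`, `utf8-len`, `utf8,`, `cont,`, `>utf8` and `utf8-emit` are made up here, chars are assumed to be 8-bit, `$` hex literals follow Forth-2012, and there is no check for surrogates or for values above $10FFFF.

```forth
\ Hypothetical helpers: encode a code point as UTF-8, then TYPE it.
create utf8-buf 4 chars allot     \ room for the longest sequence
variable utf8-len                 \ number of bytes currently stored

: utf8,  ( byte -- )              \ append one byte to utf8-buf
  utf8-buf utf8-len @ chars + c!  1 utf8-len +! ;

: cont,  ( xc shift -- xc )       \ append continuation byte 10xxxxxx
  over swap rshift  $3F and  $80 or  utf8, ;

: >utf8  ( xc -- c-addr u )       \ encode xc into utf8-buf
  0 utf8-len !
  dup $80 < if  utf8,  else
  dup $800 < if  dup 6 rshift $C0 or utf8,  0 cont, drop  else
  dup $10000 < if  dup 12 rshift $E0 or utf8,  6 cont,  0 cont, drop  else
    dup 18 rshift $F0 or utf8,  12 cont,  6 cont,  0 cont, drop
  then then then
  utf8-buf utf8-len @ ;

: utf8-emit  ( xc -- )  >utf8 type ;
```

With these definitions, `$20AC utf8-emit` would send the three bytes E2 82 AC to TYPE, which a UTF-8 terminal shows as the euro sign.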

Internally strings are UTF-8 encoded. On Windows they are translated to UTF-16
inside TYPE before being sent to the OS for output.

Two very different implementations due to different operating system capabilities,
but totally transparent when using the systems.

> It seems to me that 1) there's no point having EMIT and XEMIT as
> separate words if they both do the same thing; 2) having a simple way to
> read and write single bytes is still important.

Yes, XEMIT is not needed in my opinion.
KEY and EMIT in my systems can read and write from 0 to 0x10FFFF.
0-0xFF is included in that range.

BR
Peter

Ruvim

unread,
Apr 8, 2021, 4:57:25 PMApr 8
to
On 2021-04-08 22:28, P Falth wrote:
> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
>> On 2021-04-08 16:21, P Falth wrote:
>>> On Thursday, 8 April 2021 at 13:32:10 UTC+2, Anton Ertl wrote:
>> [...]
>>>> A simpler implementation of what you want is, after loading VFX's
>>>> xchar.fth:
>>>>
>>>> : emit xemit ;
>>>>
>>>> Or you could just use XEMIT directly, which is what I would recommend
>>>> if you want to deal with an xchar.
>>>
>>> I could also do
>>> : EMIT dup $80 < if emit else xemit then ;
>>>
>>> But this is silly!
>> Having EMIT that is equivalent to XEMIT is also silly.
>> Do you think it's worth to deprecate XEMIT ?
>
> Yes and also XKEY, that is why I respond to this thread.

But then we will have the problem (1) below.


[...]
>>> This works on my systems. (I hope Google does not mess this up too much.)
>>> 'A' emit A ok
>>> 'Ä' emit Ä ok

>>> On Gforth I get
>>> 'A' emit A ok
>>> 'Ä' emit � ok

>> The optional Extended-Character word set suggests that [CHAR] and
>> character literal return xchar (code point) and then a program should
>> use XEMIT to print it as:
>>
>> 'Ä' xemit
>>
>> Perhaps emit may throw an exception if the given pchar cannot be a part
>> of a correct xchar in the sequence.


>>> If I can enter a character from my keyboard I also expect EMIT to display it.
>> Yes.
>> So the sequence "KEY EMIT" should be always correct (ditto "XKEY XEMIT")
>>
>> Hence if EMIT handles only pchar, then KEY should also return only pchar.
>>
>>
>>
>> The idea is that the expressions
>>
>> s" Ä" type
>>
>> s" Ä" over 1 type 1 /string type
>>
>> s" Ä" drop dup c@ emit char+ c@ emit
>>
>>
>> should all produce the same result when UTF-8 encoding is used.
>
> My windows system uses UTF-16 for type. The UTF-8 string is converted
> to UTF-16 before calling the OS WriteConsoleW function.
> The Linux version types the UTF-8 string directly.
>
> One of the problems with EMIT being restricted to pchars is that you need
> to know the encoding of the underlying string as in your example above.

Actually, you don't need to know the encoding. I mentioned encoding just
for the sake of the third expression, but it can be replaced by the
following variant:

s" Ä" over c@ emit 1 /string type


Now these three expressions should produce the same result regardless of
the encoding. Also, the result should be the same for any non-empty string.

And that is why correct programs can continue to use EMIT and KEY.


If your system produces different results, how can you explain that? (1)


--
Ruvim

P Falth

unread,
Apr 8, 2021, 5:30:37 PMApr 8
to
No, this still depends on knowing the encoding of the string. The right way to
write it is

s" Ä" over xc@+ emit drop +x/string type

or with xemit if your emit and xemit are not the same

Peter

Ruvim

unread,
Apr 8, 2021, 7:30:14 PMApr 8
to
On 2021-04-09 00:30, P Falth wrote:
> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>> On 2021-04-08 22:28, P Falth wrote:
>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
[...]
>>>> The idea is that the expressions
>>>>
>>>> s" Ä" type
>>>>
>>>> s" Ä" over 1 type 1 /string type
>>>>
[...]

>>> One of the problems with EMIT being restricted to pchars is that you need
>>> to know the encoding of the underlying string as in your example above.
>> Actually, you don't need to know the encoding.
[...]
>>
>> s" Ä" over c@ emit 1 /string type
>
> No this is still depending on knowing the encoding of the string.

If EMIT is restricted to pchars, why does this depend on the encoding
that a Forth system uses under the hood?

In a standard Forth system the results should be the same independently
of the encoding. Otherwise the system is simply not standard-compliant in
this respect.




> The right way to write it is
>
> s" Ä" over xc@+ emit drop +x/string type
>
> or with xemit if your emit and xemit are not the same

Of course, under Forth-2012 you have to use xemit in this case. And
then this variant is also possible.




>> Now these three expressions should produce the same result regardless of
>> the encoding. Also, the result should be the same for any non-empty string.

>> If your system produce the different results, how you can explain that?


Don't you think that your system, that uses the same encoding
independently of the platform, should produce the same result in Windows
and in Linux for each expression from my three above?

If not, what are your grounds?


--
Ruvim

P Falth

unread,
Apr 9, 2021, 2:06:10 AMApr 9
to
On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
> On 2021-04-09 00:30, P Falth wrote:
> > On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
> >> On 2021-04-08 22:28, P Falth wrote:
> >>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
> [...]
> >>>> The idea is that the expressions
> >>>>
> >>>> s" Ä" type
> >>>>
> >>>> s" Ä" over 1 type 1 /string type
> >>>>
> [...]
> >>> One of the problems with EMIT being restricted to pchars is that you need
> >>> to know the encoding of the underlying string as in your example above.
> >> Actually, you don't need to know the encoding.
> [...]
> >>
> >> s" Ä" over c@ emit 1 /string type
> >
> > No this is still depending on knowing the encoding of the string.
> If EMIT is restricted to pchar, why this is depending on the encoding
> that a Forth system uses under the hood?

You use c@ to access a string you do not know the encoding of!


> In a standard Forth system the results should be the same independently
> of the encoding. Otherwise the system just is not standard compliant in
> this aspect.
> > The right way to write it is
> >
> > s" Ä" over xc@+ emit drop +x/string type
> >
> > or with xemit if your emit and xemit are not the same
> Of course, by the Forth-2012 you have to use xemit in this case. And
> then this variant is also possible.
> >> Now these three expressions should produce the same result regardless of
> >> the encoding. Also, the result should be the same for any non-empty string.
> >> If your system produce the different results, how you can explain that?
> Don't you think that your system, that uses the same encoding
> independently of the platform, should produce the same result in Windows
> and in Linux for each expression from my three above?
>
> If not, what is your ground?

Internally both my Linux and Windows systems use UTF-8 encoded strings.
But the Windows systems translate this to a 16-bit char representation
inside TYPE, to be able to write it to the screen with the WriteConsoleW
OS function. You remove one part of a multibyte char and send the remaining
string to TYPE, which will see an illegal UTF-8 char to translate and will fail.
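
The failure mode described here, a string whose first byte is the tail of a multibyte sequence, can be detected before translation. The following is a minimal sketch under the assumption of UTF-8 strings with 8-bit chars; `utf8-cont?` and `starts-mid-xchar?` are illustrative names, not words from ntf, lxf or the standard.

```forth
\ A UTF-8 continuation byte has the bit pattern 10xxxxxx.
: utf8-cont?  ( byte -- flag )  $C0 and $80 = ;

\ True if the string begins in the middle of a multibyte sequence.
: starts-mid-xchar?  ( c-addr u -- flag )
  dup if  drop c@ utf8-cont?  else  2drop false  then ;
```

For the UTF-8 bytes C3 84 (capital A with umlaut), the whole string gives false, while the string left after dropping the first byte gives true. What a TYPE should do with such a string (pass the bytes through, substitute a replacement character, or fail) is the open question in this thread.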

Br
Peter


>
>
> --
> Ruvim

Ruvim

unread,
Apr 9, 2021, 3:51:47 AMApr 9
to
On 2021-04-09 09:06, P Falth wrote:
> On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
>> On 2021-04-09 00:30, P Falth wrote:
>>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>>>> On 2021-04-08 22:28, P Falth wrote:
>>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
>> [...]
>>>>>> The idea is that the expressions
>>>>>>
>>>>>> s" Ä" type
>>>>>>
>>>>>> s" Ä" over 1 type 1 /string type
>>>>>>
>> [...]
>>>>> One of the problems with EMIT being restricted to pchars is that you need
>>>>> to know the encoding of the underlying string as in your example above.
>>>> Actually, you don't need to know the encoding.
>> [...]
>>>>
>>>> s" Ä" over c@ emit 1 /string type
>>>
>>> No this is still depending on knowing the encoding of the string.
>> If EMIT is restricted to pchar, why this is depending on the encoding
>> that a Forth system uses under the hood?
>
> You use c@ to access a string you do not know the encoding of!

It's allowed. And c@ returns pchar independently of the encoding.

Having a list of primitive characters that compose a text, what is your
suggested way to output them?

The Standard allows you to just apply EMIT to each of them.

Do you also want to eliminate the notion of primitive character (pchar)
altogether?




>> In a standard Forth system the results should be the same independently
>> of the encoding. Otherwise the system just is not standard compliant in
>> this aspect.
>>> The right way to write it is
>>>
>>> s" Ä" over xc@+ emit drop +x/string type
>>>
>>> or with xemit if your emit and xemit are not the same
>> Of course, by the Forth-2012 you have to use xemit in this case. And
>> then this variant is also possible.
>>>> Now these three expressions should produce the same result regardless of
>>>> the encoding. Also, the result should be the same for any non-empty string.
>>>> If your system produce the different results, how you can explain that?
>> Don't you think that your system, that uses the same encoding
>> independently of the platform, should produce the same result in Windows
>> and in Linux for each expression from my three above?
>>
>> If not, what is your ground?
>
> Internally both my Linux and Windows systems uses UTF-8 encoded strings.
> But the Windows systems translate this to an 16 bit char representation
> inside type, to be able to write it to the screen with the WriteConsoleW
> OS function. You remove 1 part of a multibyte char and send the remaining
> string to type that will see an illegal utf8 char to translate and will fail.


I see, but a user should not have to delve into the implementation details.

The idea of the standard is that a standard program produces the same
results on the different standard systems.

In this case there are programs with an environmental dependency
concerning the graphical characters set (but not the character
encoding!), and your systems meet this dependency.

So I interpret the different results of the programs as a drawback in
the implementation of TYPE in ntf, and of EMIT in both ntf and lxf systems.



--
Ruvim

P Falth

unread,
Apr 9, 2021, 6:19:42 AMApr 9
to
On Friday, 9 April 2021 at 09:51:47 UTC+2, Ruvim wrote:
> On 2021-04-09 09:06, P Falth wrote:
> > On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
> >> On 2021-04-09 00:30, P Falth wrote:
> >>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
> >>>> On 2021-04-08 22:28, P Falth wrote:
> >>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
> >> [...]
> >>>>>> The idea is that the expressions
> >>>>>>
> >>>>>> s" Ä" type
> >>>>>>
> >>>>>> s" Ä" over 1 type 1 /string type
> >>>>>>
> >> [...]
> >>>>> One of the problems with EMIT being restricted to pchars is that you need
> >>>>> to know the encoding of the underlying string as in your example above.
> >>>> Actually, you don't need to know the encoding.
> >> [...]
> >>>>
> >>>> s" Ä" over c@ emit 1 /string type
> >>>
> >>> No this is still depending on knowing the encoding of the string.
> >> If EMIT is restricted to pchar, why this is depending on the encoding
> >> that a Forth system uses under the hood?
> >
> > You use c@ to access a string you do not know the encoding of!
> It's allowed. And c@ returns pchar independently of the encoding.

Yes, you are always allowed to fetch a byte with c@. Now take this string:
s" €Fälth". If the encoding is UTF-8, the dump of it is
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
00741C40 E2 82 AC 46 C3 A4 6C 74 68
and the length is 9 bytes.

If converted to UTF-16 it will look like this
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
00741C40 AC 20 46 00 E4 00 6C 00 74 00 68 00
and have a length of 12 bytes.

So you can use c@ to fetch the individual bytes, but they will be different
and there will also be a different number of them. What do you expect to
be able to do with those bytes?
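
A dump like the ones above can be produced with C@ alone. This is a small sketch assuming 8-bit chars; `.bytes` is a hypothetical name, and `?DO` is taken from CORE-EXT.

```forth
\ Print each char of a string as two hex digits, preserving BASE.
: .bytes  ( c-addr u -- )
  base @ >r  hex                    \ save BASE, switch to hex
  0 ?do
    dup c@  0 <# # # #> type space  \ two-digit hex, then a blank
    char+
  loop drop
  r> base ! ;                       \ restore the saved BASE
```

On a system that stores source strings as UTF-8, `s" €Fälth" .bytes` would be expected to show the nine bytes from the first dump.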

> Having a list of primitive characters that compose a text, what is your
> suggested way to output them?

The correct way to output a string of text is with TYPE

> The Standard allows to just apply EMIT to each of them.
>
> Do you think to also eliminate the notion of primitive character (pchar)
> at all?

Yes, in my opinion. It is a distinction that is not needed and just creates
confusion. It exposes the encoding of strings instead of hiding it and
making it transparent for the user.

> >> In a standard Forth system the results should be the same independently
> >> of the encoding. Otherwise the system just is not standard compliant in
> >> this aspect.
> >>> The right way to write it is
> >>>
> >>> s" Ä" over xc@+ emit drop +x/string type
> >>>
> >>> or with xemit if your emit and xemit are not the same
> >> Of course, by the Forth-2012 you have to use xemit in this case. And
> >> then this variant is also possible.
> >>>> Now these three expressions should produce the same result regardless of
> >>>> the encoding. Also, the result should be the same for any non-empty string.
> >>>> If your system produce the different results, how you can explain that?
> >> Don't you think that your system, that uses the same encoding
> >> independently of the platform, should produce the same result in Windows
> >> and in Linux for each expression from my three above?
> >>
> >> If not, what is your ground?
> >
> > Internally both my Linux and Windows systems uses UTF-8 encoded strings.
> > But the Windows systems translate this to an 16 bit char representation
> > inside type, to be able to write it to the screen with the WriteConsoleW
> > OS function. You remove 1 part of a multibyte char and send the remaining
> > string to type that will see an illegal utf8 char to translate and will fail.
> I see, but a user should not immerse into the implementation details.
>
> The idea of the standard is that a standard program produces the same
> results on the different standard systems.

Yes on this I agree fully

Peter

Ruvim

unread,
Apr 9, 2021, 9:45:19 AMApr 9
to
On 2021-04-09 13:19, P Falth wrote:
> On Friday, 9 April 2021 at 09:51:47 UTC+2, Ruvim wrote:
>> On 2021-04-09 09:06, P Falth wrote:
>>> On Friday, 9 April 2021 at 01:30:14 UTC+2, Ruvim wrote:
>>>> On 2021-04-09 00:30, P Falth wrote:
>>>>> On Thursday, 8 April 2021 at 22:57:25 UTC+2, Ruvim wrote:
>>>>>> On 2021-04-08 22:28, P Falth wrote:
>>>>>>> On Thursday, 8 April 2021 at 20:48:29 UTC+2, Ruvim wrote:
>>>> [...]
>>>>>>>> The idea is that the expressions
>>>>>>>>
>>>>>>>> s" Ä" type
>>>>>>>>
>>>>>>>> s" Ä" over 1 type 1 /string type
>>>>>>>>
>>>> [...]
>>>>>>> One of the problems with EMIT being restricted to pchars is that you need
>>>>>>> to know the encoding of the underlying string as in your example above.
>>>>>> Actually, you don't need to know the encoding.
>>>> [...]
>>>>>>
>>>>>> s" Ä" over c@ emit 1 /string type
>>>>>
>>>>> No this is still depending on knowing the encoding of the string.
>>>> If EMIT is restricted to pchar, why this is depending on the encoding
>>>> that a Forth system uses under the hood?
>>>
>>> You use c@ to access a string you do not know the encoding of!
>> It's allowed. And c@ returns pchar independently of the encoding.
>
> Yes you are always allowed to fetch a byte with c@.

In some systems it might be two bytes, or even four bytes.

> Take now this string
> s" €Fälth" if the encoding is UTF-8 the dump of it is
> Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
> 00741C40 E2 82 AC 46 C3 A4 6C 74 68
> the lenght is 9 bytes
>
> If converted to UTF-16 it will look like this
> Address 0 1 2 3 4 5 6 7 8 9 A B C D E F
> 00741C40 AC 20 46 00 E4 00 6C 00 74 00 68 00
> and have the lenght 12 bytes
>
> So you can use c@ to fetch the individual bytes but the will be different
> and there will also be a different number of them. What do you expect to
> be able to do with those bytes?

NB: if the system uses UTF-16 then the pchar size shall be 2 bytes.

So, if we have two systems, one using UTF-8 and the other using UTF-16,
then for the primitive characters of the given string that correspond to
(F,l,t,h) C@ should return the same value in both systems, since the
corresponding code points are in the range $20-$7F, and this range is
fixed in the standard.

For the remaining primitive characters in the string, C@ returns different
values on the two systems, but they should be greater than $7F in any
case. The number of these primitive characters is also different.
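
For the UTF-8 case, the gap between the number of pchars and the number of code points can be computed with C@ only. The sketch below assumes a well-formed UTF-8 string with 8-bit chars; `xchar-count` is an illustrative name, not a standard word.

```forth
\ Count code points in a well-formed UTF-8 string by counting
\ every byte that is not a continuation byte (10xxxxxx).
: xchar-count  ( c-addr u -- n )
  0 swap 0 ?do                  ( c-addr n )
    over i chars + c@
    $C0 and $80 <> if  1+  then
  loop nip ;
```

For the nine UTF-8 bytes of the example string quoted above, this gives 6, the same as the number of 16-bit units in the UTF-16 form.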


What can a program do with primitive characters?

Words such as SEARCH , COMPARE , STARTS-WITH , SUBSTRING-AFTER , SPLIT
, REPLACE , HASH and many others can be implemented in terms of
pchars. A program can represent strings as a suffix tree or in
other data structures, also relying on pchars only. (2)

Also, a program can change the case of the alphabetical characters in
the range $20-$7F.

And certainly a program can output strings by outputting primitive
characters one by one.
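
As an illustration of pchar-only processing, here is a sketch of in-place upper-casing restricted to the ASCII letters. The names `ascii-upper` and `upper!` are made up for this example. It assumes that pchars above $7F should be left untouched, which holds for UTF-8 because no byte of a multibyte sequence falls in the ASCII range.

```forth
\ Convert one pchar to upper case if it is an ASCII lower-case letter.
: ascii-upper  ( pchar -- pchar' )
  dup [char] a - 26 u< if  32 -  then ;

\ Upper-case the ASCII letters of a string in place.
: upper!  ( c-addr u -- )
  0 ?do
    dup c@ ascii-upper over c!   \ rewrite the current pchar
    char+
  loop drop ;
```

The unsigned comparison `26 u<` covers both range bounds at once: pchars below `a` wrap around to a large unsigned value and fail the test.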




>> Having a list of primitive characters that compose a text, what is your
>> suggested way to output them?
>
> The correct way to output a string of text is with TYPE

It's a list of primitive chars.



>> The Standard allows to just apply EMIT to each of them.
>>
>> Do you think to also eliminate the notion of primitive character (pchar)
>> at all?
>
> Yes in my opinion.

And then the words C@ and C! too?


> It is a distinction that is not needed

In the case of variable-width encoding, many operations (like examples
(2) above) become far less efficient if you operate on xchars instead of
pchars.

Another problem is that it breaks backward compatibility.



> and just creates confusion.
> It exposes the encoding of strings instead of hiding it and
> making it transparent for the user.

I don't see how it exposes the encoding of strings (in a bad sense). All
my examples in (2) are encoding independent.



--
Ruvim

none albert

unread,
Apr 9, 2021, 10:29:20 AMApr 9
to
In article <485b6239-c89f-4b33...@googlegroups.com>,
P Falth <peter....@gmail.com> wrote:
>
>Internally both my Linux and Windows systems uses UTF-8 encoded strings.
>But the Windows systems translate this to an 16 bit char representation
>inside type, to be able to write it to the screen with the WriteConsoleW
>OS function. You remove 1 part of a multibyte char and send the remaining
>string to type that will see an illegal utf8 char to translate and will fail.

My Forth uses buffers where the length in bytes is known. It doesn't
give a rat's ass whether a terminal shows that as Chinese characters.
Nowhere in my Forth is there any concern about individual characters.
Any attempt at case-insensitivity would spoil that (and would have to be
removed in the Chinese version of my Forth), so it is loadable.

>
>Br
>Peter

Groetjes Albert

none albert

unread,
Apr 9, 2021, 10:53:31 AMApr 9
to
In article <874kggr...@nightsong.com>,
Paul Rubin <no.e...@nospam.invalid> wrote:
<SNIP>
>What does your keyboard actually transmit when you type "Ä" (capital A
>with umlaut, codepoint 00C4)?

A keyboard doesn't transmit code points, it transmits scan codes.
You can change the key tops of most keyboards, but a typical English
PC keyboard has no key with A umlaut.
If the key that is usually marked A is pressed down, it transmits
30 (0x1E). If you release the key, it transmits 0x9E.
There is no such thing as an uppercase key. There are no separate a and A
keys on a keyboard. There are shift keys with their own scan codes.
The difference between a and A only exists in the imagination of your
computer, provided it succeeds in interpreting the shift keys correctly.
(And frankly, even the association between scan code 0x1E and
the character a/A is in a table that can be tampered with.)

[There may be an exception in e.g. a numlock key that changes a keyboard's
behaviour.]

P Falth

unread,
Apr 9, 2021, 12:05:18 PMApr 9
to
No, that was proven to be a dead end. Jack Woehr created Jax forth in 1993.
It had char=2 and used UCS-2 (UTF-16 limited to 2 bytes), 2048-byte blocks, etc.
It did not catch on; to my knowledge nobody continued that approach.
I still have a copy. It started without problems, and my example string also worked.

When I started developing Unicode support for my system I tested the idea
of having a variable-sized char, where CHAR+ could be 1+ 2+ 3+ or 4+. I soon gave up
and adopted the idea of a specific xchar wordset that was being developed by
Anton and Bernd at that time.

Why do you insist on using C@ to traverse a string when XC@+ hides all details
and works perfectly?

> So, if we have two systems, one uses UTF-8, and another uses UTF-16,
> then for the primitive characters of the given string that correspond to
> (F,l,t,h) C@ should return the same value in the both systems, since the
> corresponding code points are in the range $20-$7F, and this range is
> fixed in the standard.
>
> For the rest primitive characters in the string C@ returns distinct
> values between two systems, but they should be greater than $7F in any
> case. The number of these primitive characters is also different.
>
>
> What a program can do with primitive characters?
>
> Such words like SEARCH , COMPARE , STARTS-WITH , SUBSTRING-AFTER , SPLIT
> , REPLACE , HASH and many other can be implemented in the terms of
> pchars. A program can represent strings in the form of suffix tree, or
> other data structures — also relying on pchars only. (2)

They worked perfectly well on my systems before pchars were invented and
they continue to work afterwards. COMPARE does a binary comparison,
byte by byte.

> Also, a program can change case for the alphabetical characters in the
> range $20-$7F.

My case conversion works for all Unicode codepoints that have a case property.
Or it used to, at least; my translation tables have not been updated for some time.

> And certainly a program can output strings by output primitive
> characters one by one.
> >> Having a list of primitive characters that compose a text, what is your
> >> suggested way to output them?
> >
> > The correct way to output a string of text is with TYPE
> It's a list of primitive chars.

But surely they can be stored in a string?



> >> The Standard allows to just apply EMIT to each of them.
> >>
> >> Do you think to also eliminate the notion of primitive character (pchar)
> >> at all?
> >
> > Yes in my opinion.
> And then the words C@ and C! two?

They are fundamental and have many uses!
Not everything you manipulate in memory is a string.

Peter

Ruvim

unread,
Apr 9, 2021, 1:18:49 PMApr 9
to
It seems you are overlooking the option that an address unit can also be two
bytes in this case, and then the char size is still 1 (i.e., 1 address unit).


> Jack Woehr created Jax forth 1993.
> It had char=2 and used UCS-2 (UTF-16 limited to 2 bytes) 2048 bytes block etc.
> It did not catch on, nobody to my knowledge continued that approach.
> I still have a copy. It started without problem. My example string worked also.


> When I started developing Unicode support for my system I tested the idea
> of having a variable sized char, CHAR+ could be 1+ 2+ 3+ or 4+. I soon gave up
> and adopted the idea of a specific xchar wordset that was being developed by
> Anton and Bernd at that time.


> Why do you insist of using C@ to traverse a string when XC@+ hides all details
> and works perfectly?

I only insist to have a choice, and to have the *ability* of traversing
a string via C@ — because it's backward compatible (old programs
continue to work in new UTF-based systems), it's simpler, and it has
better performance.

And when you need to treat the actual code points beyond a char, you can
use X* words.



[...]
>> What a program can do with primitive characters?
>>
>> Such words like SEARCH , COMPARE , STARTS-WITH , SUBSTRING-AFTER , SPLIT
>> , REPLACE , HASH and many other can be implemented in the terms of
>> pchars. A program can represent strings in the form of suffix tree, or
>> other data structures — also relying on pchars only. (2)
>
> They worked perfectly well on my systems before pchars were invented and
> they continue to work afterwards. COMPARE do a binary comparison,
> byte by byte.
>

>> Also, a program can change case for the alphabetical characters in the
>> range $20-$7F.
>
> My case conversion works for all Unicode codepoints that have a case property.
> Or used at least, my translation tables are not updated for some time.

But your conversion works beyond pchars, whereas I mentioned only the things
that can be done in a portable manner within the framework of pchars.



>> And certainly a program can output strings by output primitive
>> characters one by one.
>>>> Having a list of primitive characters that compose a text, what is your
>>>> suggested way to output them?
>>>
>>> The correct way to output a string of text is with TYPE
>> It's a list of primitive chars.
>
> But surely they can be stored in a string?

Yes.



Actually, one possible approach is for the standard to provide no way to
print a pchar, only a string. But that doesn't solve any problem, since
the expression:

s" Ä" over 1 type 1 /string type

should be equivalent to the expression:

s" Ä" type

(for any non-empty string)

And a program still can define:

variable _tmp
: my-emit ( pchar -- ) _tmp c! _tmp 1 type ;

So it's better to just standardize this word.
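
To make the point concrete, here is a minimal sketch of a word that outputs a string one primitive character at a time through such a word. The name type-by-pchar is hypothetical, and the sketch assumes only that TYPE accepts a one-pchar string. If the equivalence above holds on a system, this should produce the same output as TYPE for any string:

```forth
variable _tmp
: my-emit ( pchar -- )  _tmp c!  _tmp 1 type ;

\ Hypothetical name: output a string pchar by pchar via my-emit.
: type-by-pchar ( c-addr u -- )
   chars over + swap        \ ( end-addr start-addr )
   ?do  i c@ my-emit  1 chars +loop ;

s" abc" type-by-pchar   \ expected to match:  s" abc" type
```

?DO is used so that an empty string produces no output.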


The only choice is what the old EMIT word should do, and which word
to introduce: PEMIT or XEMIT.


If you introduce PEMIT (and change EMIT) then C@ EMIT becomes non-standard.

If you introduce XEMIT (and keep EMIT) then [CHAR] X EMIT becomes
non-standard for non-ASCII X.


It seems the second variant is better for backward compatibility.




>>>> The Standard allows to just apply EMIT to each of them.
>>>>
>>>> Do you think to also eliminate the notion of primitive character (pchar)
>>>> at all?
>>>
>>> Yes in my opinion.
>> And then the words C@ and C! too?
>
> They are fundamental and have many uses!
> Not everything you manipulate in memory are strings.

Then what data type does C@ return?


NB: in the standard, 'char' usually means 'pchar':
"Unless otherwise stated, a "character" refers to a primitive character"
(3.1.2.3)



--
Ruvim

P Falth

Apr 9, 2021, 4:30:21 PM
That will also work on my systems when your strings/chars are below $7F.
Where we differ is when the char is between $80 and $FF.
What should $C4 emit do?

> And when you need to treat the actual code points beyond a char, you can
> use X* words.

Yes we agree on this!
But only if your strings are UTF-8. Should that be standardized?
And if your TYPE allows typing incomplete UTF-8 sequences.

Is it really a good idea to have TYPE type incomplete sequences?
Could that not lead to a security problem?

Ruvim

Apr 9, 2021, 6:57:57 PM
On 2021-04-09 23:30, P Falth wrote:
> On Friday, 9 April 2021 at 19:18:49 UTC+2, Ruvim wrote:
>> On 2021-04-09 19:05, P Falth wrote:
>>> On Friday, 9 April 2021 at 15:45:19 UTC+2, Ruvim wrote:
[...]
>>>> NB: if the system uses UTF-16 then a pchar size shall be 2 bytes.
[...]

>>> Why do you insist on using C@ to traverse a string when XC@+ hides all details
>>> and works perfectly?
>> I only insist on having a choice, and on having the *ability* to traverse
>> a string via C@, because it's backward compatible (old programs
>> continue to work in new UTF-based systems), it's simpler, and it has
>> better performance.
>
> That will also work on my systems when your strings/chars are below $7F.
> Where we differ is when the char is between $80 and $FF.
> What should $C4 emit do?

By the current standard, it should send $C4 as a primitive character to
the user output device. If it's a terminal, then what is shown depends
on the character encoding (and the terminal's state), and it's an
implementation defined thing.

But iff EMIT doesn't treat some pchar argument as a primitive character,
then there exists a string for which the following word:

: test-emit ( c-addr u -- ) cr 2dup type cr
dup if over c@ emit 1 /string then type cr
;

will show two different lines when it's applied to this string.


>> And when you need to treat the actual code points beyond a char, you can
>> use X* words.
>
> Yes we agree on this!



>>
>> Actually it's a possible approach that the standard doesn't provide a
>> way to print a pchar, but only a string. Although it doesn't solve any
>> problem,
>> since the expression:
>> s" Ä" over 1 type 1 /string type
>>
>> should be equivalent to the expression:
>>
>> s" Ä" type
>
> But only if your strings are UTF-8. Should that be standardized?

Not only. This equivalence is true (should be true) for *any* encoding
(when the argument is any non-empty string that contains characters from
this encoding).

E.g., it's true for ASCII, ISO 8859-1, UTF-8, UTF-16, UTF-32, or
anything else.


A particular character encoding is not fixed and, I think, it should not
be fixed in the standard. A system can provide a function to convert a
string from the internal representation to the given encoding.



> And if your TYPE allows typing incomplete UTF-8 sequences.

BTW, incomplete characters can appear in any variable-width character
encoding, e.g. in UTF-16 as well as in UTF-8.
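
For illustration, a program that does care could check whether a string ends in the middle of a character. Below is a minimal sketch for UTF-8 only, assuming 8-bit chars; all three word names are hypothetical, and a tail with no lead byte is treated as malformed rather than incomplete:

```forth
\ True if the byte is a UTF-8 continuation byte (10xxxxxx).
: utf8-cont? ( byte -- flag )  $C0 and $80 = ;

\ Sequence length announced by a lead byte; 1 for ASCII or an invalid lead.
: utf8-len ( lead -- n )
   dup $E0 and $C0 = if drop 2 exit then
   dup $F0 and $E0 = if drop 3 exit then
       $F8 and $F0 = if 4 else 1 then ;

\ True if the string ends with a lead byte that announces more bytes
\ than are actually present after it.
: utf8-incomplete-tail? ( c-addr u -- flag )
   tuck +                      \ ( u end-addr )
   swap 4 min 0 ?do            \ examine at most the last 4 bytes
      1- dup c@                \ ( addr byte )
      dup utf8-cont? 0= if     \ reached the lead (or an ASCII) byte
         utf8-len  i 1+ >      \ announced length > bytes present?
         nip unloop exit
      then drop
   loop
   drop false ;

create buf  $C3 c, $84 c,      \ the two bytes of "Ä" in UTF-8
buf 2 utf8-incomplete-tail? .  \ false: the sequence is complete
buf 1 utf8-incomplete-tail? .  \ true: lead byte without its continuation
```

A similar check for UTF-16 would look for a trailing high surrogate instead.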

> Is it really a good idea to have TYPE type incomplete sequences?

It's a good idea since it makes programs simpler and moves some
complexity to the side of the underlying system.

And a program may be unaware whether a sequence is complete or not.

A terminal just doesn't show an incomplete character until it's
completed (or made incorrect and then it's replaced by some special
character).


> Could that not lead to a security problem?

It should not lead to any problem, since a user may compose and send
even incorrect strings to the terminal.



--
Ruvim

dxforth

Apr 9, 2021, 11:00:20 PM
On 9/04/2021 23:45, Ruvim wrote:
> On 2021-04-09 13:19, P Falth wrote:
>>
>> Yes you are always allowed to fetch a byte with c@.
>
> In some systems it might be two bytes, or even four bytes.

Indeed. Even ANS file functions are expressed in terms of 'characters' -
which can be anything from 8 bits to 1 cell width.
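
A program can ask the system how wide a character actually is. A minimal sketch, assuming the system answers the standard environment queries ADDRESS-UNIT-BITS and MAX-CHAR (the word name .char-info is hypothetical):

```forth
\ Report the character size of the host system.
: .char-info ( -- )
   cr ." address units per char: " 1 chars .
   s" ADDRESS-UNIT-BITS" environment? if
      cr ." bits per address unit: " .
   then
   s" MAX-CHAR" environment? if
      cr ." max char value: " u.
   then ;

.char-info
```

If a query is unknown to the system, ENVIRONMENT? returns false and that line is simply skipped.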

none albert

Apr 10, 2021, 8:35:03 AM
My 2 cents. The UTF-blabla are sequential data structures.
It is a design error to make single character output the base
instead of the writing of the whole sequential data structure.