Is STC one threading model to rule them all?

Phillip Eaton

May 16, 2021, 5:38:30 PM

Something that's been bugging me for a while now...

From what I understand, when Forth originated in the 60s, it was a Virtual Machine on top of some other language on a mini-computer. Due to these circumstances, tight memory constraints and the desire for portability, Indirect Threaded Code (ITC) and Direct Threaded Code (DTC) were most common.

Eventually, however, starting with popular 8 and 16 bit CPUs, Forth in the early 80s (and maybe earlier) was able to replace the other OSs/languages and its VM effectively became the physical machine of the underlying computer hardware.

At this point, I'd have thought that memory constraints were becoming somewhat irrelevant and so ITC/DTC should have been superseded by Subroutine Threaded Code (STC), as what ITC/DTC saves in memory is negated by the complexity and performance degradation of using an inner interpreter.

But 30+ years on, even though Chuck's Machine Forth seems to have started getting closer to the underlying CPU instructions, this hasn't happened. Of the many "new" Forths to have been created, many seem to have been based on the old ITC/DTC Forths of the past.

Isn't building an STC Forth from assembly language the most efficient, simple and performant way of doing things? I'm guessing it would also be easier to optimise automatically, too.

I'm not a Forth Guru, I've never built a Forth from scratch, I'm happy to take an existing Forth and use it above and below the hood as necessary to build my end product.

I'm currently using CamelForth 6809 (DTC) and, whilst the 8/16-bit 6809 is particularly Forth-friendly with its two hardware stacks and NEXT is very efficient, it's still bugging me that being STC would strip out an extra level of complexity and perhaps give a performance boost.

So, I'm interested in opinions here:
1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?
2. If 1. is true, why are so many people still working with non-STC Forths nowadays?
3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?

Thanks for your thoughts.

dxforth

May 16, 2021, 11:50:12 PM

IMO if you don't care for other people's Forth, insist on creating
your own, and still want time left to write apps, threaded code isn't
a bad choice. With increasing CPU speed and memory, the need for
optimizing native-code compilers has probably decreased. OTOH if
the goal is to demonstrate the superiority of Forth by beating the
latest bloated C compiler, it is a choice that is limited.

Anton Ertl

May 17, 2021, 4:45:07 AM

Phillip Eaton <pjea...@gmail.com> writes:
>At this point, I'd have thought that memory constraints were becoming
>somewhat irrelevant and so ITC/DTC should have been superseded by
>Subroutine Threaded Code, as what ITC/DTC saves in memory is negated by
>the complexity and performance degradation of using an inner interpreter.
>
>But 30+ years on, even though Chuck's Machine Forth seems to have started
>getting closer to the underlying CPU instructions, this hasn't happened.
>Of the many "new" Forths to have been created, many seem to have been
>based on the old ITC/DTC Forths of the past.
>
>Isn't building an STC Forth from assembly language the most efficient,
>simple and performant way of doing things? I'm guessing it would also be
>easier to optimise automatically, too.

First of all, there is a difference between STC and native code,
although there is no hard boundary between them.

It seems to me that commercial Forth systems have switched to native
code in the 1990s, and there are also non-commercial Forth systems
like FLK and ntf/lxf that use native code.

Native code is certainly fastest (unless you do things like mixing
code and data on CPUs that don't like that). It's not the simplest,
but I think that the reluctance to do native code among
build-your-own-Forth implementors is more due to not having a good
model to go by than due to the actual complexity.

Whether STC is faster than DTC depends on the actual CPU; you can find
some results on <http://www.complang.tuwien.ac.at/forth/threading/>.
In straight-line code, STC has to perform a call and a return per
word, while DTC performs only one jump per word, but does some
additional work on the data side.
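
As a rough illustration of the two dispatch styles, here is a toy Python model. It is not taken from any real Forth system, and all names in it (`VM`, `lit`, `SQUARE`, `stc_square` and so on) are made up for the sketch. The loop in `run` plays the role of NEXT, entering a colon definition corresponds to NEST (DOCOLON), and reaching its end corresponds to UNNEST (EXIT). The subroutine-threaded flavour is simply one host-language call and return per word. The model only counts dispatches; it says nothing about real cycle costs on any CPU.

```python
# Toy model of threaded code vs. subroutine threading.
# Assumption: primitives are Python functions and colon definitions are
# Python lists of words; no real Forth system works on these types.

class VM:
    def __init__(self):
        self.data = []        # data stack
        self.ret = []         # return stack of saved (thread, ip) pairs
        self.dispatches = 0   # number of times NEXT fetched a word

    def run(self, thread):
        ip = 0
        while True:
            if ip == len(thread):              # end of definition: UNNEST
                if not self.ret:
                    return                     # outermost thread finished
                thread, ip = self.ret.pop()
                continue
            word = thread[ip]                  # NEXT: fetch the next word...
            ip += 1                            # ...and advance the IP
            self.dispatches += 1
            if callable(word):
                word(self)                     # primitive: run its code
            else:
                self.ret.append((thread, ip))  # colon definition: NEST
                thread, ip = word, 0


def lit(n):
    """Return a primitive that pushes the constant n."""
    def push(vm):
        vm.data.append(n)
    return push

def dup(vm):
    vm.data.append(vm.data[-1])

def mul(vm):
    b = vm.data.pop()
    a = vm.data.pop()
    vm.data.append(a * b)

SQUARE = [dup, mul]            # : SQUARE  DUP * ;

vm = VM()
vm.run([lit(7), SQUARE])       # 7 SQUARE


# Subroutine-threaded flavour: every word is an ordinary call and return.
def stc_dup(stack):
    stack.append(stack[-1])

def stc_mul(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)

def stc_square(stack):         # a colon definition is just a list of calls
    stc_dup(stack)
    stc_mul(stack)

stc_stack = [7]
stc_square(stc_stack)
```

Both versions leave 49 on the stack. In the threaded version the cost of entering and leaving SQUARE sits in the NEST and UNNEST branches of `run`; in the subroutine-threaded version it is the call and return of `stc_square`.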

>I'm currently using CamelForth 6809 (DTC) and, whilst the 8/16-bit 6809 is
>particularly Forth-friendly with its two hardware stacks and NEXT is very
>efficient, it's still bugging me that being STC would strip out an extra
>level of complexity and perhaps give a performance boost.

For straight-line code, STC costs 12 cycles per primitive (JSR 7, RTS
5). My 6809 is a little rusty and my 6809 book is hiding from me, but
it seems to me that JMP [,X++] would implement DTC dispatch, and that
it costs 12 cycles, too. The 6809 has 64KB address space, so the
extra byte per compiled word is a good reason to use DTC.
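
The space side of that trade-off can be put into numbers with a small sketch. The cycle figures below are the ones quoted in this post (stated there as being from memory), and the byte sizes are an assumption: a compiled STC word is taken to be a 3-byte JSR (opcode plus 16-bit address) and a compiled DTC word a bare 2-byte address. The word count of 1000 is purely hypothetical.

```python
# Per-word dispatch cost as quoted in the post (from the author's memory).
STC_CYCLES_PER_WORD = 7 + 5      # JSR + RTS
DTC_CYCLES_PER_WORD = 12         # JMP [,X++]

# Assumed encoding sizes: JSR opcode + 16-bit address vs. a bare address.
STC_BYTES_PER_WORD = 3
DTC_BYTES_PER_WORD = 2

def thread_size(words, bytes_per_word):
    """Bytes needed to compile `words` word references."""
    return words * bytes_per_word

words = 1000                      # hypothetical number of compiled references
stc_bytes = thread_size(words, STC_BYTES_PER_WORD)
dtc_bytes = thread_size(words, DTC_BYTES_PER_WORD)
extra_bytes = stc_bytes - dtc_bytes
same_speed = STC_CYCLES_PER_WORD == DTC_CYCLES_PER_WORD
```

Under these assumptions the dispatch cost is equal, while STC spends one extra byte per compiled reference, which adds up quickly in a 64KB address space.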

>So, I'm interested in opinions here:

>1. Is my assertion correct that, after the early 80s, STC is a no-brainer
>and the other models are obsolete?
>2. If 1. is true, why are so many people still working with non-STC Forths
>nowadays?

Many of the build-it-yourself implementors seem to be more interested
in the "feeling of mastery and understanding" than in performance.
The fact that people are building Forth systems for the 8086 or the
6809 shows that these systems are not intended to exploit modern
hardware to its fullest.

The existing material on threaded code is apparently more amenable to
being adopted for a build-it-yourself implementation than the material
for native-code systems. FLK has no accompanying material; cmForth
has (Footsteps in an Empty Valley), but given its focus on
unavailable hardware it's apparently not inspiring. Gforth uses
techniques that are useful for native-code compilers, and has material
explaining that (in particular "The new Gforth Header" [paysan19]),
but apparently the complexity of Gforth is a deterrent. The
commercial systems were not designed to inspire build-it-yourself
implementors, and are certainly not accompanied by such material.

I have started on a system that is intended to be a model for modern
concepts, including native-code generation, but have not gotten very
far before other things required my time. We'll see when I will find
the time, and whether that system will fulfill its intended role.

>3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?

In Gforth, we are using a DTC base for portability reasons: it can be
implemented with gcc. This also allows us to have native code for
most straight-line code with very little machine-specific code,
falling back to DTC for control flow and non-relocatable primitives.

@InProceedings{paysan19,
author = {Bernd Paysan and M. Anton Ertl},
title = {The new {Gforth} Header},
crossref = {euroforth19},
pages = {5--20},
url = {http://www.euroforth.org/ef19/papers/paysan.pdf},
url-slides = {http://www.euroforth.org/ef19/papers/paysan-slides.pdf},
video = {https://wiki.forth-ev.de/doku.php/events:ef2019:header},
OPTnote = {refereed},
abstract = {The new Gforth header is designed to directly
implement the requirements of Forth-94 and
Forth-2012. Every header is an object with a fixed
set of fields (code, parameter, count, name, link)
and methods (\texttt{execute}, \texttt{compile,},
\texttt{(to)}, \texttt{defer@}, \texttt{does},
\texttt{name>interpret}, \texttt{name>compile},
\texttt{name>string}, \texttt{name>link}). The
implementation of each method can be changed
per-word (prototype-based object-oriented
programming). We demonstrate how to use these
features to implement optimization of constants,
\texttt{fvalue}, \texttt{defer}, \texttt{immediate},
\texttt{to} and other dual-semantics words, and
\texttt{synonym}.}
}
@Proceedings{euroforth19,
title = {35th EuroForth Conference},
booktitle = {35th EuroForth Conference},
year = {2019},
key = {EuroForth'19},
url = {http://www.euroforth.org/ef19/papers/proceedings.pdf}
}

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2020: https://euro.theforth.net/2020

Stephen Pelc

May 17, 2021, 6:19:46 AM

On Mon, 17 May 2021 06:20:36 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Whether STC is faster than DTC depends on the actual CPU; you can find
>some results on <http://www.complang.tuwien.ac.at/forth/threading/>.
>In straight-line code, STC has to perform a call and a return per
>word, while DTC performs only one jump per word, but does some
>additional work on the data side.

That does not allow for colon definitions and the overheads of
the NEST and UNNEST routines. Once these are factored in, most STC
systems show a 2.2:1 performance advantage (measurements on 68k in
the late 1990s) and others.

Stephen

--
Stephen Pelc, ste...@vfxforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, +44 (0)78 0390 3612, +34 649 662 974
web: http://www.mpeforth.com - free VFX Forth downloads

Stephen Pelc

May 17, 2021, 6:43:21 AM

On Sun, 16 May 2021 14:38:29 -0700 (PDT), Phillip Eaton
<pjea...@gmail.com> wrote:

>But 30+ years on, even though Chuck's Machine Forth seems to have started
>getting closer to the underlying CPU instructions, this hasn't happened.
>Of the many "new" Forths to have been created, many seem to have been
>based on the old ITC/DTC Forths of the past.

It is very easy to write an ITC or DTC Forth, especially if you have
read the source code for an existing one. The popularity of JonesForth
comes from it being documented, not from it being a good Forth.
That you can use an existing assembler reduces the tool-making load
very considerably.

>Isn't building an STC Forth from assembly language the most efficient,
>simple and performant way of doing things? I'm guessing it would also be
>easier to optimise automatically, too.

Yebbut. You also need to have a rock-solid assembler and disassembler
for debugging STC systems. Most STC systems without optimisation will
be bigger than the corresponding ITC or DTC system. For 8 and 16 bit
CPUs with a 64k address limit, DTC is often the best choice unless
performance is an overriding objective, in which case STC wins.

For 32 bit and above, it's just a no-brainer now that RAM is
reasonably cheap. Basic STC is usually twice as fast as any ITC
or DTC Forth, but it will be larger (say 20%). You only recover
the size with optimisation, at which point the performance advantage
goes up to 4:1 or more for pattern matching generators and 10:1 or
more for analytical compilers. However, a good analytical compiler is
a big piece of code, 5000 lines or more for a full VFX code generator.
Writing an analytical Forth compiler is probably much more work than
a hobbyist is prepared to commit to.

>So, I'm interested in opinions here:

>1. Is my assertion correct that, after the early 80s, STC is a no-brainer
>and the other models are obsolete?

No - you have to allow for the implementation cost.

>2. If 1. is true, why are so many people still working with non-STC Forths
>nowadays?

>3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?

16 bit address spaces.

If you really bother about code size, you'll probably go for token
threaded code (TTC). In around 2000, we worked on a mobile phone for
teenagers. The phone was limited to 1Mb of something. With TTC and
tuning, we got the games down to less than 400 kb and the phone was
viable again.
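
To show why token threading saves space, here is a toy Python sketch. It is not how the phone project was built; the token numbers, the three primitives and the one-byte inline literal are all invented for illustration. Each compiled word is a single byte that indexes a table of primitives, instead of a 16-bit or 32-bit address.

```python
# Toy token-threaded code: each compiled word is a one-byte table index.
# Assumption: token numbers and primitives are invented for this sketch.

def op_lit(stack, code, ip):
    stack.append(code[ip])      # a one-byte literal follows the token
    return ip + 1               # skip over the literal

def op_dup(stack, code, ip):
    stack.append(stack[-1])
    return ip

def op_mul(stack, code, ip):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)
    return ip

TABLE = [op_lit, op_dup, op_mul]    # token 0, 1, 2
LIT, DUP, MUL = 0, 1, 2

def run(code):
    """Inner interpreter: fetch a byte, dispatch through the table."""
    stack = []
    ip = 0
    while ip < len(code):
        token = code[ip]
        ip = TABLE[token](stack, code, ip + 1)
    return stack

program = bytes([LIT, 7, DUP, MUL])  # 7 DUP *
result = run(program)

ttc_size = len(program)              # one byte per token or literal
cell_size = 2                        # assumed 16-bit cells for comparison
dtc_size = cell_size * len(program)  # same thread with one cell per item
```

In this sketch the token-threaded thread takes 4 bytes where a thread of 16-bit cells would take 8; the price is the extra table lookup on every dispatch.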

Converting CamelForth to STC would be an interesting exercise and you
will learn a great deal. Feel free to discuss this offline with me.

none albert

May 17, 2021, 8:00:02 AM

In article <f9afceb2-2eef-4c4b...@googlegroups.com>,

Phillip Eaton <pjea...@gmail.com> wrote:
>Something that's been bugging me for a while now...
>
>

>So, I'm interested in opinions here:
>1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?
>2. If 1. is true, why are so many people still working with non-STC Forths nowadays?
>3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?

1. No.
2. See 3.
3. An indirect threaded model abstracts away many of the differences between
CPUs, which gives advantages.

An example of this is vectored execution, meaning that behaviour is changed
by changing a pointer.
In ITC all behaviour is via a pointer, but for high level code this is
a data item pointing to a data item, which makes it even less problematic.
Another example is data/code separation. An implementation of
ITC can have all code in a block that is neither added to nor changed.
That is something no virus objects to. It also in no way restricts what
the code can do.
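
Vectored execution can be pictured with a small Python sketch. This is only an analogy, not ciforth code: the names `vectors`, `plain_emit`, `upper_emit` and `type_` are made up. The caller always goes through a pointer cell, so changing the cell changes the behaviour of code that was compiled earlier, without touching that code.

```python
# Toy vectored execution. Assumption: all names here are invented.
vectors = {}                      # table of pointer cells

def plain_emit(out, ch):
    out.append(ch)

def upper_emit(out, ch):
    out.append(ch.upper())

vectors["emit"] = plain_emit      # initial behaviour stored in the cell

def type_(out, text):
    """Output a string, always dispatching through the 'emit' cell."""
    for ch in text:
        vectors["emit"](out, ch)

first = []
type_(first, "ok")                # uses plain_emit

vectors["emit"] = upper_emit      # revector: only the pointer changes
second = []
type_(second, "ok")               # same caller, new behaviour
```

Here `type_` itself is never modified; only the data cell it reads from changes, which is the property described above.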

As you pointed out, memory shortage is less of an issue.
This means that if ITC is less memory-efficient, that is not objectionable.

That made me decide to keep the ITC of fig throughout ciforth.
>
>Thanks for your thoughts.
--
"in our communism country Viet Nam, people are forced to be
alive and in the western country like US, people are free to
die from Covid 19 lol" duc ha
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Krishna Myneni

May 17, 2021, 5:52:14 PM

On 5/16/21 4:38 PM, Phillip Eaton wrote:
> Something that's been bugging me for a while now...

...

>
> So, I'm interested in opinions here:
> 1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?
> 2. If 1. is true, why are so many people still working with non-STC Forths nowadays?
> 3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?
>

...

For my desktop use of Forth, the threading model of the Forth system is
maybe a 3rd order consideration. More important than maximum efficiency
are other considerations such as ease of use, presence of features such
as floating point support, external library interfaces, good
documentation, and numerous source code examples which will run on the
system and illustrate its use for different types of applications.
Progress in hardware clock speeds means that more applications can run
within reasonable times so the intrinsic efficiency becomes even less of
an issue.

There are applications, of course, for which efficiency is the main
consideration. Even in this space, one can often isolate the
bottleneck(s) and either use efficient external libraries (provided the
Forth system can interface with them), or use the traditional Forth way
of writing efficient assembler code, callable from Forth words, to deal
with bottlenecks.

Krishna Myneni

Jan Coombs

May 18, 2021, 11:59:55 AM

On Mon, 17 May 2021 10:43:19 GMT
ste...@mpeforth.com (Stephen Pelc) wrote:

> Converting CamelForth to STC would be an interesting exercise and you
> will learn a great deal. Feel free to discuss this offline with me.

CamelForth 8051 already is subroutine threaded.

Jan Coombs
--

Marcel Hendrix

May 18, 2021, 3:05:37 PM

On Sunday, May 16, 2021 at 11:38:30 PM UTC+2, pjea...@gmail.com wrote:
> Something that's been bugging me for a while now...

[..]

> So, I'm interested in opinions here:

That is asking for heated discussions.

> 1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?

The other models all have something going for them so they aren't considered obsolete, certainly not if this is about opinions and aired in a Forth group.

> 2. If 1. is true, why are so many people still working with non-STC Forths nowadays?

Don't have to react here...

> 3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?

IMO, a DTC system is great to generate code for, because every word is like compressed source and very easy to parse and optimize (token-threaded would be even better). Once source is translated to machine code it can be unbeatably fast. I use a variant of this method in iForth.

Unfortunately, the optimized words break the clean DTC model, and introspection / debugging / speed of development and maintenance will suffer greatly / unbearably. It takes a few decades to find the right balance between the various models. I am about half of the way.

-marcel

none albert

May 19, 2021, 4:10:03 AM

In article <f9afceb2-2eef-4c4b...@googlegroups.com>,
Phillip Eaton <pjea...@gmail.com> wrote:

<SNIP>

>I'm currently using CamelForth 6809 (DTC) and, whilst the 8/16-bit 6809 is particularly Forth-friendly with its two hardware stacks and NEXT is very
>efficient, it's still bugging me that being STC would strip out an extra level of complexity and perhaps give a performance boost.

If you're modifying CamelForth 6809, make sure to check out
m6809forth.html on my site (see below). I used CamelForth to make a
Forth according to the ciforth model.
I shaved ca 40 states off the 178 states for the U* instruction.
Also that word used unprotected stack space, a potential hazard
if you want to use interrupts.
Also MOVE was improved, using 16-bit moves as soon as possible.

>
>So, I'm interested in opinions here:
>1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?
>2. If 1. is true, why are so many people still working with non-STC Forths nowadays?
>3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?
>
>Thanks for your thoughts.

Heinrich Hohl

May 19, 2021, 6:11:37 AM

On Sunday, May 16, 2021 at 11:38:30 PM UTC+2, pjea...@gmail.com wrote:

> So, I'm interested in opinions here:
> 1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?

This article by Brad Rodriguez gives an excellent overview of the different threading models:

https://www.bradrodriguez.com/papers/moving1.htm

According to this article, each threading model has advantages and drawbacks.
For best results, you must choose the threading model according to the target CPU
and the available memory.

Henry

anti...@math.uni.wroc.pl

May 19, 2021, 1:45:55 PM

Phillip Eaton <pjea...@gmail.com> wrote:
> So, I'm interested in opinions here:
> 1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?
> 2. If 1. is true, why are so many people still working with non-STC Forths nowadays?
> 3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?

Note little technical problem: most current Forth implementations
seem to use machine stack as data stack. STC may force you
to use machine stack as control stack. If there is single
machine-assisted stack, then you gain in dispatch/control and lose
on data stack access. AFAICS this is related to locals:
benchmarks seem to indicate that on many implementations
keeping data on data stack is faster than locals. Reasonable
guess is that with locals on return stack and machine stack
as return stack access to locals would be faster than access
to data stack. Add register allocation for locals and
one can expect much better performance.

But gain is expected only if you are willing to invest effort
in Forth compiler and substantial change. My personal guess
is that current implementations are essentially trapped in
local optimum. Changes like going to STC may easily degrade
performance unless you are willing to adapt whole implementation
around new paradigm.

Another question is why people program in Forth? Arguably
speed and size of code are not main reason.

--
Waldek Hebisch

dxforth

May 19, 2021, 10:30:01 PM

The nuanced responses have been interesting. Perhaps there's hope for Forth yet :)

Anton Ertl

May 20, 2021, 11:35:03 AM

anti...@math.uni.wroc.pl writes:
>Phillip Eaton <pjea...@gmail.com> wrote:
>> So, I'm interested in opinions here:
>> 1. Is my assertion correct that, after the early 80s, STC is a no-brainer and the other models are obsolete?
>> 2. If 1. is true, why are so many people still working with non-STC Forths nowadays?
>> 3. If 1. is false, what are the key reasons why ITC/DTC are still relevant?
>
>Note little technical problem: most current Forth implementations
>seem to use machine stack as data stack. STC may force you
>to use machine stack as control stack. If there is single
>machine-assisted stack, then you gain in dispatch/control and lose
>on data stack access.

On architectures that have a specific stack for call/return it is
usually quite beneficial for STC to use this stack as return stack.
Native code does not call and return as much as STC, so for native
code the benefit is less, and with inlining even less. Likewise, for
data stack accesses, sophisticated Forth compilers keep several stack
items in registers, so accesses through a data stack pointer are less
important for performance.

While Intel and AMD CPUs have some optimizations for push or pop, I
don't think that there is a big difference. For call and return, I
expect more of a difference, because these CPUs have a hardware return
stack for predicting the return address, and that has a better
prediction accuracy (when each ret is paired with its call, and no
return-address manipulation is performed) than the indirect branch
predictor that is used when you return by using an indirect branch.

We actually have a number of native-code systems for IA-32 and AMD64
that explore various different ways to implement the stacks:

* VFX and SwiftForth use esp/rsp as return stack pointer and ebp/rbp
as data stack pointer. They keep the TOS in a register at call
boundaries.

* iForth uses rsp for the data stack (without keeping TOS in a
register) and something else for the return stack. It uses inlining
and tail-call optimization.

* bigForth uses xchg to switch data stack pointer and return stack
pointer into esp when the corresponding stack is accessed.

My impression is that VFX and iForth have similar performance;
sometimes one is faster, sometimes the other, so esp is not that
decisive for performance. It's hard to be sure about the effect, so
if you are curious, one way would be to replace the stack-specific
instructions (such as pop) with the more general instructions (such as
mov and add), and see what effect it has.

bigForth used a less sophisticated compilation technology, so its
not-so-great performance does not necessarily say something about how to
implement stack accesses. One interesting development is that in Zen3
xchg is performed in the register renamer at very little cost
(something that Bernd Paysan had hoped for already with the Pentium
Pro (1995)).

>AFAICS this is related to locals:
>benchmarks seem to indicate that on many implementations
>keeping data on data stack is faster than locals.

Yes. On simple threaded-code implementations locals cause more
executed primitives in many cases. And they also tend to cost more
instructions if locals are in memory.

Sophisticated implementations keep data stack items in registers, but
at least VFX 4 kept locals and return stack items in memory (I have
yet to look at VFX 5 more closely), so locals were naturally much
slower.

>Reasonable
>guess is that with locals on return stack and machine stack
>as return stack access to locals would be faster than access
>to data stack.

I doubt it. Push and pop don't offer that much benefit over mov.

>Add register allocation for locals and
>one can expect much better performance.

Yes.

>Another question is why people program in Forth? Arguably
>speed and size of code are not main reason.

That could be a self-fulfilling prophecy.

Anton Ertl

May 20, 2021, 12:05:37 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Sophisticated implementations keep data stack items in registers, but
>at least VFX 4 kept locals and return stack items in memory (I have
>yet to look at VFX 5 more closely), so locals were naturally much
>slower.

It's the same for VFX 5.11 RC2:

: foo1 swap 1+ swap ;
: foo2 >r 1+ r> ;
: foo3 {: a :} 1+ a ;

see foo1
FOO1
( 004E3E60 488B5500 ) MOV RDX, [RBP]
( 004E3E64 48FFC2 ) INC RDX
( 004E3E67 48895500 ) MOV [RBP], RDX
( 004E3E6B C3 ) RET/NEXT
( 12 bytes, 4 instructions )
ok
see foo2
FOO2
( 004E3EA0 53 ) PUSH RBX
( 004E3EA1 488B5D00 ) MOV RBX, [RBP]
( 004E3EA5 48FFC3 ) INC RBX
( 004E3EA8 5A ) POP RDX
( 004E3EA9 48895D00 ) MOV [RBP], RBX
( 004E3EAD 488BDA ) MOV RBX, RDX
( 004E3EB0 C3 ) RET/NEXT
( 17 bytes, 7 instructions )
ok
see foo3
FOO3
( 004E3EF0 488BD4 ) MOV RDX, RSP
( 004E3EF3 53 ) PUSH RBX
( 004E3EF4 52 ) PUSH RDX
( 004E3EF5 57 ) PUSH RDI
( 004E3EF6 488BFC ) MOV RDI, RSP
( 004E3EF9 4881EC00000000 ) SUB RSP, # 00000000
( 004E3F00 488B5D00 ) MOV RBX, [RBP]
( 004E3F04 488D6D08 ) LEA RBP, [RBP+08]
( 004E3F08 48FFC3 ) INC RBX
( 004E3F0B 488D6DF8 ) LEA RBP, [RBP+-08]
( 004E3F0F 48895D00 ) MOV [RBP], RBX
( 004E3F13 488B5F10 ) MOV RBX, [RDI+10]
( 004E3F17 488B6708 ) MOV RSP, [RDI+08]
( 004E3F1B 488B3F ) MOV RDI, 0 [RDI]
( 004E3F1E C3 ) RET/NEXT
( 47 bytes, 15 instructions )

The same test for iForth shows that iForth's preference for pop and
push is not so great in this case; but iForth manages to keep a return
stack item in a register. It keeps a local in memory:

FORTH> ' foo1 idis
$10226000 : foo1 488BC04883ED088F4500 H.@H.m..E.
$1022600A pop rbx 5B [
$1022600B pop rdi 5F _
$1022600C lea rdi, [rdi 1 +] qword
488D7F01 H...
$10226010 push rdi 57 W
$10226011 push rbx 53 S
$10226012 ; 488B45004883C508FFE0 H.E.H.E..` ok
FORTH> ' foo2 idis
$10226080 : foo2 488BC04883ED088F4500 H.@H.m..E.
$1022608A pop rbx 5B [
$1022608B pop rdi 5F _
$1022608C lea rdi, [rdi 1 +] qword
488D7F01 H...
$10226090 push rdi 57 W
$10226091 push rbx 53 S
$10226092 ; 488B45004883C508FFE0 H.E.H.E..` ok
FORTH> ' foo3 idis
$10226540 : foo3 488BC04883ED088F4500 H.@H.m..E.
$1022654A pop rbx 5B [
$1022654B lea rsi, [rsi #-16 +] qword
488D76F0 H.vp
$1022654F mov [esi] dword, rbx
48891E H..
$10226552 pop rbx 5B [
$10226553 lea rbx, [rbx 1 +] qword
488D5B01 H.[.
$10226557 push rbx 53 S
$10226558 push [rsi] qword FF36 .6
$1022655A add rsi, #16 b# 4883C610 H.F.
$1022655E ; 488B45004883C508FFE0 H.E.H.E..` ok

P Falth

May 20, 2021, 1:14:30 PM

lxf does this
: foo1 swap 1+ swap ; ok
see foo1
A4A230 4098A3 8 C80000 5 normal FOO1

4098A3 8B4500 mov eax , [ebp]
4098A6 40 inc eax
4098A7 894500 mov [ebp] , eax
4098AA C3 ret near
ok
: foo2 >r 1+ r> ; ok
see foo2
A4A248 4098AB 8 C80000 5 normal FOO2

4098AB 8B4500 mov eax , [ebp]
4098AE 40 inc eax
4098AF 894500 mov [ebp] , eax
4098B2 C3 ret near
ok
: foo3 {: a :} 1+ a ; ok
see foo3
A4A260 4098B3 8 C80000 5 normal FOO3

4098B3 8B4500 mov eax , [ebp]
4098B6 40 inc eax
4098B7 894500 mov [ebp] , eax
4098BA C3 ret near
ok

lxf64 does this, only showing the last as they are equal

: foo3 {: a :} 1+ a ; ok
comp foo3
FOO3:
inc QWORD PTR [rbp+0]
ret
ok
asm foo3 foo4 ok
seeasm foo4 5 bytes, 2 instructions
$A0F368 48FF4500 inc qword ptr [rbp]
$A0F36C C3 ret
ok

I am not sure if this is really better. In most cases some other
operations will follow the 1+ and having it already in a register
might be better.

Anton, I have now updated lxf and will send you a copy.
You will also get a preview of lxf64!

BR
Peter

P Falth

May 20, 2021, 1:33:10 PM

When you do a code generator (at least on x86) I have found it beneficial
to regard the stack as an array and use the locations directly.
Push and pop just get in the way.

Then when you are on ARM64 there are no specific push and pop. But
instead every load and store can adjust the base register.

This is drop for example in ARM64 (token threaded code)

ldrb w0, [x19], 1 \ load next opcode in X0, advance IP(X19)
ldr x20, [x21], 8 \ Pop datastack into X20, X21 is stackpointer
ldr w2, [x22, x0, lsl 2] \ Shift opcode 2 bits add to base of table and load address in X2
br x2 \ jump to next instruction

BR
Peter

Phillip Eaton

Jun 8, 2021, 6:23:22 PM

Phillip Eaton wrote:
> Something that's been bugging me for a while now...

ANTON: "First of all, there is a difference between STC and native code,
although there is no hard boundary between them."

PHIL: What actually is the difference? Could it be that native code is
built from scratch, whereas an STC could be a reworked DTC/ITC
implementation? I couldn't find a distinction via Google.

ANTON: "...I think that the reluctance to do native code among
build-your-own-Forth implementors is more due to not having a good
model to go by than due to the actual complexity." "The existing
material on threaded code is apparently more amenable to
being adopted for a build-it-yourself implementation than the material
for native-code systems."

PHIL: This has been my feeling also.

STEPHEN: "That does not allow for colon definitions and the overheads of
the NEST and UNNEST routines. Once these are factored in, most STC
systems show a 2.2:1 performance advantage (measurements on 68k in
the late 1990s) and others."

PHIL: Without any evidence to back it up or a good look into the code,
on 6809 at least, I can't imagine how DTC, even with its single JMP
instruction NEXT, would be faster than a subroutine call and return, due
to the need to run DOCOLON. I guess this would be the case across many CPUs.

STEPHEN: "It is very easy to write an ITC or DTC Forth, especially if
you have read the source code for an existing one." "You also need to
have a rock-solid assembler and disassembler for debugging STC systems."

PHIL: Yes, so the practical investment required to shake off ITC/DTC is not
insignificant. For me, I have a strong 6809 assembler and debugger, so
there is potential!

JAN: "CamelForth 8051 already is subroutine threaded."

PHIL: Thanks for the heads-up, I had a quick look and you're right, I'll
be looking into that!

ALBERT: "An indirect threaded model abstracts away many of the differences
between CPUs, which gives advantages. An example of this is vectored
execution, meaning that behaviour is changed by changing a pointer."

PHIL: I probably don't understand this fully, but apart from code/data
mixing, I don't get why ITC has an advantage here over STC/native?

KRISHNA: "For my desktop use of Forth, the threading model of the Forth
system is maybe a 3rd order consideration."

PHIL: Maybe, but I'm wondering why, in the past 35+ years, so very
few people have decided to optimize from the ground up, given so many
people have written their own Forth.

MARCEL: "The other models all have something going for them so they
aren't considered obsolete, certainly not if this is about opinions and
aired in a Forth group."

PHIL: Apart from saving memory and 'because they're already written', I
haven't really heard any strong reasons as to what they do have going
for them.

MARCEL: "It takes a few decades to find the right balance between the
various models. I am about half of the way."

PHIL: That might actually explain a lot. I'm at the start of those decades.

ALBERT: "If you're modifying CamelForth 6809, make sure to check out
m6809forth.html on my site (see below)."

PHIL: Thanks for the recommendation, I'll check it out.

WALDEK: "Note little technical problem: most current Forth
implementations seem to use machine stack as data stack. STC may force
you to use machine stack as control stack."

PHIL: Now that's one I didn't consider, that could be quite significant.
I'll have to mull it over. Nonetheless, STC implementations do exist, so
it must be doable!

-----

PHIL: So overall then, it would seem that a few people tend to lean in
favour of STC/native code (not least those that have actually done it
e.g. Stephen) and performance test data may support this.

However, a few people cite advantages with ITC/DTC, other than memory
saving, but without really being specific, except for possibly easier
portability.

My conclusion then remains that STC (or perhaps native code, if indeed
there is a significant difference) really should be the default model
for a new implementation, unless there is a RAM limit for the
application, which I don't personally think has existed since the 80's.
Paged memory isn't that hard.

For my own project, I am coding for the 6809 video game console using
CamelForth 6809. There is no display buffer or dedicated vector
generator circuit, so with a 1.5MHz clock, I have to generate the full
vector display in 30,000 CPU cycles to maintain a 50Hz display. That
alone can easily take up all 30,000 cycles, before any program/game
logic. Thus code speed is the number one target and I'm having to lean
heavily on the adage "You can get 80% of the speed of assembler with 20%
of the code in assembler". Or is it 90%/10%...
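
The 30,000-cycle figure above follows directly from the clock rate and the refresh rate. A quick check of that arithmetic in Python (the function name is only an illustrative choice, not something from the project):

```python
def cycles_per_frame(clock_hz: int, frame_rate_hz: int) -> int:
    """CPU cycles available between two display refreshes."""
    return clock_hz // frame_rate_hz


# 1.5 MHz clock and 50 Hz refresh, as stated in the post.
print(cycles_per_frame(1_500_000, 50))  # 30000
```

Everything the program does in one frame, drawing included, has to fit inside that number.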

Once again, many thanks for all of your comments, I really appreciate
the discussion.

Travis Bemann

Jun 8, 2021, 10:40:36 PM

For zeptoforth, which is my Forth for Cortex-M4 and Cortex-M7 microcontrollers, I opted for a subroutine threaded/native code inlining model: code is basically subroutine threaded, but a significant portion of words, especially primitives, are inlined. (Another similar Forth in this regard is Mecrisp-Stellaris.) This has the benefit of giving very good performance, especially since many subroutine calls and link register pushes and pops are eliminated, at the cost of space (because many inlined words take up more memory than the corresponding subroutine call would take). Space is less critical, though, because many of the targeted MCUs have 1 MB of onboard flash, and if one opts to compile to RAM (e.g. while testing code, so one does not wear out flash with repeated writes) they still have a decent amount of RAM. A downside is that SEE can only disassemble code and cannot attempt to recover actual source code.

Travis Bemann

Jun 8, 2021, 11:13:29 PM

An advantage to subroutine threading/native code inlining that I forgot to mention is that it allows folding constants into inlined primitives, making the generated code more compact while simultaneously making it faster. Both my zeptoforth and Mecrisp-Stellaris support this, the latter in a more advanced fashion the latter (zeptoforth admittedly only supports this for a selected set of hard-coded primitives, whereas Mecrisp-Stellaris supports it much more generally).
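
To illustrate the idea of folding a constant into an inlined primitive, here is a minimal Python sketch of a toy compiler. It is not the code generator of zeptoforth or Mecrisp-Stellaris: the pseudo-instructions, the `FOLDABLE` and `INLINE` tables, and the register names are all assumptions made up for this example.

```python
# Hypothetical pseudo-instructions for a machine that keeps the top of
# the data stack in a register called "tos". Not real output of any Forth.
FOLDABLE = {"+": "add tos, #{n}", "and": "and tos, #{n}"}
INLINE = {"+": ["pop r0", "add tos, r0"], "and": ["pop r0", "and tos, r0"]}


def compile_tokens(tokens, fold=True):
    """Compile a list of Forth-like tokens into pseudo-instructions."""
    code = []
    pending = None  # a literal that has been seen but not yet emitted

    def flush():
        # Emit the delayed literal as an ordinary push of a constant.
        nonlocal pending
        if pending is not None:
            code.extend(["push tos", f"mov tos, #{pending}"])
            pending = None

    for tok in tokens:
        if tok.lstrip("-").isdigit():
            flush()
            pending = int(tok)
        elif fold and pending is not None and tok in FOLDABLE:
            # Fold the constant into the primitive as an immediate operand.
            code.append(FOLDABLE[tok].format(n=pending))
            pending = None
        else:
            flush()
            # Inline known primitives, fall back to a subroutine call.
            code.extend(INLINE.get(tok, [f"call {tok}"]))
    flush()
    return code


print(compile_tokens(["8", "+"], fold=False))
print(compile_tokens(["8", "+"]))
```

In this toy model, `8 +` compiles to four pseudo-instructions without folding and to a single one with folding, which is the "more compact and faster" effect described above.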

Travis Bemann

Jun 8, 2021, 11:16:55 PM

On Tuesday, June 8, 2021 at 10:13:29 PM UTC-5, Travis Bemann wrote:
> An advantage to subroutine threading/native code inlining that I forgot to mention is that it allows folding constants into inlined primitives, making the generated code more compact while simultaneously making it faster. Both my zeptoforth and Mecrisp-Stellaris support this, the latter in a more advanced fashion the latter (zeptoforth admittedly only supports this for a selected set of hard-coded primitives, whereas Mecrisp-Stellaris supports it much more generally).

I should proofread my posts before I hit post message; that should be "the latter in a more advanced fashion than the former".

Nils M Holm

Jun 9, 2021, 4:11:51 AM

Phillip Eaton <in...@phillipeaton.com> wrote:
> However, a few people cite advantages with ITC/DTC, other than memory
> saving, but without really being specific, except for possibly easier
> portability.

ITC in particular has the advantage that you can have a FORTH system
without any machine code outside of the primitives. I have such a system
that I used to run under a DOS emulator. Then one day it occurred to me
that I could just write a very simple emulator that intercepts calls to
primitives and emulates them in C. Tried it and, indeed, a 500-line
emulator can run an unaltered FORTH image for DOS under Unix.
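
The approach can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the FORTH-79 system or the C emulator mentioned above: the memory layout, the negative "code addresses" and the tiny set of primitives are invented for illustration. The point is that the image holds nothing but cell values, and the emulator only has to intercept jumps to primitive code.

```python
# Hypothetical "machine code addresses" of the primitives. The image only
# ever refers to them through code fields, so the emulator can intercept them.
DOCOL, EXIT, LIT, ADD, DUP, BYE = -1, -2, -3, -4, -5, -6


class VM:
    def __init__(self, mem):
        self.mem = mem        # the Forth image: a flat list of cells
        self.ip = 0           # instruction pointer into the current thread
        self.w = 0            # address of the current word's code field
        self.ds = []          # data stack
        self.rs = []          # return stack
        self.running = True


def docol(vm):                # enter a colon definition
    vm.rs.append(vm.ip)
    vm.ip = vm.w + 1

def exit_(vm):                # leave a colon definition
    vm.ip = vm.rs.pop()

def lit(vm):                  # push the cell that follows in the thread
    vm.ds.append(vm.mem[vm.ip])
    vm.ip += 1

def add(vm):
    b = vm.ds.pop()
    a = vm.ds.pop()
    vm.ds.append(a + b)

def dup(vm):
    vm.ds.append(vm.ds[-1])

def bye(vm):
    vm.running = False


PRIMS = {DOCOL: docol, EXIT: exit_, LIT: lit, ADD: add, DUP: dup, BYE: bye}


def execute(vm, xt):
    """Run the word whose code field is at address xt until BYE."""
    vm.w = xt
    PRIMS[vm.mem[xt]](vm)
    while vm.running:
        # NEXT for indirect threading: fetch a word address from the
        # thread, then jump through that word's code field.
        vm.w = vm.mem[vm.ip]
        vm.ip += 1
        PRIMS[vm.mem[vm.w]](vm)
    return vm.ds


IMAGE = [
    EXIT, LIT, ADD, DUP, BYE,  # 0..4: code fields of the primitives
    DOCOL, 3, 2, 0,            # 5: DOUBLE is DUP + EXIT
    DOCOL, 1, 21, 5, 4,        # 9: MAIN is LIT 21 DOUBLE BYE
]
print(execute(VM(IMAGE), 9))   # [42]
```

Porting such an image to another host means rewriting only the handful of functions in `PRIMS`; the colon definitions in `IMAGE` stay untouched.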

And then, as you mentioned: size. The system is a full FORTH-79 system
with interpreter, compiler, and device words plus some extensions that
fits in less than 8K bytes of memory. This may not be relevant any
longer these days, but still gives me a sense of satisfaction. :)

--
Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org

Anton Ertl

Jun 9, 2021, 4:52:11 AM

Phillip Eaton <in...@phillipeaton.com> writes:
>Phillip Eaton wrote:
>> Something that's been bugging me for a while now...
>
>
>ANTON: "First of all, there is a difference between STC and native code,
>although there is no hard boundary between them."
>
>PHIL: What actually is the difference? Could it be that native code is
>built from scratch, whereas an STC could be a reworked DTC/ITC
>implementation?

The difference is in how you compile it, and the resulting code. When
you compile CELL+ @, STC is

call cell+
call @

while native code is, e.g.

add tos,8
mov tos,(tos)

and a more sophisticated native-code compiler could combine this into

mov tos,8(tos)

For STC, a simple dumb COMPILE, that just compiles a call is enough.
Immediate words like IF or LITERAL will typically generate non-call
native code, but they don't call COMPILE, for that. It is possible to
let them compile a call with some data behind the call, but I don't
think that there is an STC system that does that; maybe if space is at
a premium, but then people tend to prefer token-threaded code or ITC.

For native-code, the dumb COMPILE, is not good enough. Chuck Moore
has solved that in cmForth by turning the compilation of all
primitives into "immediate" words in his COMPILER vocabulary, but that
approach has not caught on. A more modern approach is the intelligent
"COMPILE,", and in its most flexible form you just attach the
implementation of COMPILE, to an existing word. And for primitives
the typical approach is to define its COMPILE, first, and then base
the EXECUTEd code on that, something like:

: compile,-@ ( xt -- )
    drop
    ... \ compile code for "mov tos,(tos)"
;

: @ [ 0 compile,-@ ] ;
\ and now attach the COMPILE, action to @
' compile,-@ set-optimizer

In-between variants are possible. E.g., bigForth starts out
with STC, but inlines the code for words that are marked as inline
(and primitives typically are marked as inline) and performs peephole
optimization on the resulting code.
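
The difference between a dumb COMPILE, and an intelligent one, plus a peephole pass, can be modelled in a short Python sketch. This is a simplified illustration, not how any particular Forth system is implemented: the `OPTIMIZERS` table stands in for per-word COMPILE, actions, and the instruction strings are the pseudo-assembly used in the example above.

```python
def dumb_compile(word, code):
    """STC: every word becomes a call."""
    code.append(f"call {word}")


# Hypothetical per-word compile actions, playing the role of the
# COMPILE, implementation attached to each primitive.
OPTIMIZERS = {
    "cell+": lambda code: code.append("add tos,8"),
    "@": lambda code: code.append("mov tos,(tos)"),
}


def smart_compile(word, code):
    """Native code: use the word's own compile action if it has one."""
    action = OPTIMIZERS.get(word)
    if action is not None:
        action(code)
    else:
        dumb_compile(word, code)


def peephole(code):
    """Merge 'add tos,N' followed by 'mov tos,(tos)' into one load."""
    out = []
    for insn in code:
        if insn == "mov tos,(tos)" and out and out[-1].startswith("add tos,"):
            offset = out.pop().split(",")[1]
            out.append(f"mov tos,{offset}(tos)")
        else:
            out.append(insn)
    return out


def compile_words(words, compile_fn):
    code = []
    for word in words:
        compile_fn(word, code)
    return code


print(compile_words(["cell+", "@"], dumb_compile))
print(compile_words(["cell+", "@"], smart_compile))
print(peephole(compile_words(["cell+", "@"], smart_compile)))
```

The three printed lists correspond to the three code sequences shown earlier for CELL+ @: two calls, two inlined instructions, and one combined instruction.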

>PHIL: Without any evidence to back it up or a good look into the code,
>on 6809 at least, I can't imagine how DTC, even with it's single JMP
>instruction NEXT, would be faster than a subroutine call and return, due
>to the need to run DOCOLON. I guess this would be the case across many CPUs.

There are 7.5 times as many NEXTs as COL:s (docols) in
http://www.complang.tuwien.ac.at/forth/peep/sorted. So if STC costs
2.2 cycles more per NEXT than DTC, as on the R4000
<http://www.complang.tuwien.ac.at/forth/threading-v1/>, the STC docol
and ;s have to be 16.5 cycles cheaper than the DTC ones to make up for
that. If IF, LITERAL etc. produce native code in the STC system, that
weighs in for STC.
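
The break-even figure above is just the per-NEXT penalty multiplied by the number of NEXTs per colon-definition call. A small Python check of that arithmetic, using the numbers quoted in this post (the function and parameter names are illustrative only):

```python
def break_even_saving(nexts_per_call: float, extra_cycles_per_next: float) -> float:
    """Cycles STC must save on each docol/;s pair to match DTC overall."""
    return nexts_per_call * extra_cycles_per_next


# R4000 figures from the post: 7.5 NEXTs per docol, 2.2 extra cycles per NEXT.
print(round(break_even_saving(7.5, 2.2), 1))  # 16.5
```

If the STC call and return pair saves less than that over the DTC docol and ;s, DTC comes out ahead in this simple model.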

Concerning the 6809, given that DTC NEXT has no speed advantage over
STC NEXT, yes, STC will be faster. How much depends on the
implementation details; and whether it's worth the memory cost depends
on you.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html

EuroForth 2021: https://euro.theforth.net/2021

none albert

Jun 9, 2021, 5:43:28 AM

In article <s9oqkm$1fg4$1...@gioia.aioe.org>,
Phillip Eaton <in...@phillipeaton.com> wrote:
>Phillip Eaton wrote:
>> Something that's been bugging me for a while now...
>
>
>ANTON: "First of all, there is a difference between STC and native code,
>although there is no hard boundary between them."
>
>PHIL: What actually is the difference? Could it be that native code is
>built from scratch, whereas an STC could be a reworked DTC/ITC
>implementation? I couldn't find a distinction via Google.
>
>
>ANTON: "...I think that the reluctance to do native code among
>build-your-own-Forth implementors is more due to not having a good
>model to go by than due to the actual complexity." "The existing
>material on threaded code is apparently more amenable to
>being adopted for a build-it-yourself implementation than the material
>for native-code systems."
>
>PHIL: This has been my feeling also.

One of the reasons to build your own system is to experiment.
Indirect Threaded Systems are easier to experiment with.
Commercial systems give a high priority to speed, which favours
Subroutine Threaded Code.

<SNIP>

Groetjes Albert

Anton Ertl

Jun 9, 2021, 7:49:57 AM

ste...@mpeforth.com (Stephen Pelc) writes:
>On Mon, 17 May 2021 06:20:36 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>Whether STC is faster than DTC depends on tha actual CPU; you can find
>>some results on <http://www.complang.tuwien.ac.at/forth/threading/>.
>>In straight-line code, STC has to perform a call and a return per
>>word, while DTC performs only one jump per word, but does some
>>additional work on the data side.
>
>That does not allow for colon definitions and the overheads of
>the NEST and UNNEST routines. Once these are factored in, most STC
>systems show a 2.2:1 performance advantage (measurements on 68k in
>the late 1990s) and others.

On <http://www.complang.tuwien.ac.at/forth/threading-v1/> I see that
on an 68040 STC requires 4 cycles more per NEXT than DTC. At 7.5
NEXTs per NEST+UNNEST, NEST+UNNEST need to cost 30 cycles less in STC
to make STC break even, which sounds unlikely. 2.2:1 sounds even more
unlikely. Are you sure that you measured only STC vs DTC, not native
code vs. some slower threaded code (ITC, TTC)?

- anton

Stephen Pelc

Jun 10, 2021, 5:59:10 AM

On Wed, 09 Jun 2021 11:41:48 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>ste...@mpeforth.com (Stephen Pelc) writes:
>>On Mon, 17 May 2021 06:20:36 GMT, an...@mips.complang.tuwien.ac.at
>>(Anton Ertl) wrote:
>>
>>>Whether STC is faster than DTC depends on tha actual CPU; you can find
>>>some results on <http://www.complang.tuwien.ac.at/forth/threading/>.
>>>In straight-line code, STC has to perform a call and a return per
>>>word, while DTC performs only one jump per word, but does some
>>>additional work on the data side.
>>
>>That does not allow for colon definitions and the overheads of
>>the NEST and UNNEST routines. Once these are factored in, most STC
>>systems show a 2.2:1 performance advantage (measurements on 68k in
>>the late 1990s) and others.
>
>On <http://www.complang.tuwien.ac.at/forth/threading-v1/> I see that
>on an 68040 STC requires 4 cycles more per NEXT than DTC. At 7.5
>NEXTs per NEST+UNNEST, NEST+UNNEST need to cost 30 cycles less in STC
>to make STC break even, which sounds unlikely. 2.2:1 sounds even more
>unlikely. Are you sure that you measured only STC vs DTC, not native
>code vs. some slower threaded code (ITC, TTC)?

Yes, our results are correct. Measuring a chain of NEXTs is measuring
something with no return stack manipulation. Measuring Forth high
level words requires you to include the return stack timing.

Just measuring NEXT does not measure system performance.

Anton Ertl

Jun 10, 2021, 7:17:48 AM

ste...@mpeforth.com (Stephen Pelc) writes:
>On Wed, 09 Jun 2021 11:41:48 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>ste...@mpeforth.com (Stephen Pelc) writes:
>>>On Mon, 17 May 2021 06:20:36 GMT, an...@mips.complang.tuwien.ac.at
>>>(Anton Ertl) wrote:
>>>
>>>>Whether STC is faster than DTC depends on tha actual CPU; you can find
>>>>some results on <http://www.complang.tuwien.ac.at/forth/threading/>.
>>>>In straight-line code, STC has to perform a call and a return per
>>>>word, while DTC performs only one jump per word, but does some
>>>>additional work on the data side.
>>>
>>>That does not allow for colon definitions and the overheads of
>>>the NEST and UNNEST routines. Once these are factored in, most STC
>>>systems show a 2.2:1 performance advantage (measurements on 68k in
>>>the late 1990s) and others.
>>
>>On <http://www.complang.tuwien.ac.at/forth/threading-v1/> I see that
>>on an 68040 STC requires 4 cycles more per NEXT than DTC. At 7.5
>>NEXTs per NEST+UNNEST, NEST+UNNEST need to cost 30 cycles less in STC
>>to make STC break even, which sounds unlikely. 2.2:1 sounds even more
>>unlikely. Are you sure that you measured only STC vs DTC, not native
>>code vs. some slower threaded code (ITC, TTC)?
>
>Yes, our results are correct.

Not giving an answer is also an answer.

>Measuring a chain of NEXTs is measuring
>something with no return stack manipulation.

True, return stack manipulation is more expensive in STC on the 68k
architecture than in DTC, because you have to get the return address
of the call out of the way in STC, but not in DTC; or you use a
different stack for >R etc. than for calling and returning; or a
different stack for >R and Forth-level call and return than for the
STC mechanism. In the latter two cases, >R without NEXT will be
as expensive in STC as in DTC (but not cheaper).

>Just measuring NEXT does not measure system performance.

True, but it does allow us to evaluate the plausibility of claims.

It would be interesting to do a more comprehensive benchmark, but I
lack an 68040 system, and even if I had one available, I would
probably not find the time.

Stephen Pelc

Jun 11, 2021, 7:24:49 AM

On Thu, 10 Jun 2021 10:50:39 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>>>Are you sure that you measured only STC vs DTC, not native
>>>code vs. some slower threaded code (ITC, TTC)?
>>
>>Yes, our results are correct.
>
>Not giving an answer is also an answer.

Yes, I am sure that we measured only STC vs DTC.