
heap implementation


Hugh Aguilar

Aug 2, 2012, 3:25:46 AM

Does anybody know how to implement a heap on the modern x86
processors?

This question came up on comp.lang.forth:
https://groups.google.com/group/comp.lang.forth/browse_thread/thread/f7cb467ec9d95ce5#
I cross-posted to c.l.f. for this reason.

The problem is that the ANS-Forth standard, and the upcoming
Forth-200x standard, both require that the heap be exactly compatible
with the heap in GLIBC. This is because nobody on the standards
committee knows how to implement a heap, and so they plan on
continuing to rely on GLIBC forever. Unfortunately, the heap in GLIBC
isn't very good. It doesn't have ALLOCATION for one thing (a function
that returns the amount of memory in a node given the node's address).
Also, as described in the thread mentioned above, FREE doesn't
specifically tell you if it has been given an address that isn't a
node. There are other problems. All in all, I would like to get rid of
Forth's dependence on GLIBC altogether --- if people want to use GLIBC
they should program in C.

I can implement a working heap, and I likely will for my next novice-
package release.
http://www.forth.org/novice.html
The problem is that I don't understand how the caches work on the
modern x86 processors, and this is necessary for an efficient
implementation. My novice package has linked lists and self-balancing
trees, so an efficient heap is important.
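
To make the question concrete, here is the sort of thing I have in mind
--- a minimal first-fit free-list heap over a single static arena, in C.
This is only a sketch (the names h_alloc/h_free/h_allocation are made up,
there is no locking, and alignment is handled crudely), but note how
storing the size in a header makes the ALLOCATION query trivial:

#include <stddef.h>

#define ARENA_BYTES (1 << 20)
static unsigned char arena[ARENA_BYTES];   /* the whole heap */

typedef struct blk {
    size_t size;           /* payload bytes in this block */
    int    free;           /* nonzero if available */
    struct blk *next;      /* next block in address order */
} blk;

static blk *head = NULL;

void *h_alloc(size_t n)                    /* like ALLOCATE */
{
    blk *b;
    n = (n + 15) & ~(size_t)15;            /* crude alignment rounding */
    if (head == NULL) {                    /* lazy one-time init */
        head = (blk *)arena;
        head->size = ARENA_BYTES - sizeof(blk);
        head->free = 1;
        head->next = NULL;
    }
    for (b = head; b != NULL; b = b->next) {
        if (!b->free || b->size < n) continue;
        if (b->size >= n + sizeof(blk) + 16) {   /* split off the tail */
            blk *rest = (blk *)((unsigned char *)(b + 1) + n);
            rest->size = b->size - n - sizeof(blk);
            rest->free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->free = 0;
        return b + 1;                      /* memory just past the header */
    }
    return NULL;                           /* arena exhausted */
}

size_t h_allocation(void *p)               /* the ALLOCATION query */
{
    return ((blk *)p - 1)->size;
}

void h_free(void *p)                       /* like FREE */
{
    blk *b = (blk *)p - 1;
    b->free = 1;
    if (b->next && b->next->free) {        /* coalesce forward */
        b->size += sizeof(blk) + b->next->size;
        b->next = b->next->next;
    }
}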

Alex McDonald

Aug 2, 2012, 5:51:56 AM

On Aug 2, 8:25 am, Hugh Aguilar <hughaguila...@nospicedham.yahoo.com>
wrote:
> Does anybody know how to implement a heap on the modern x86
> processors?
>
> This question came up on comp.lang.forth:https://groups.google.com/group/comp.lang.forth/browse_thread/thread/...
> I cross-posted to c.l.f. for this reason.
>
> The problem is that the ANS-Forth standard, and the upcoming
> Forth-200x standard, both require that the heap be exactly compatible
> with the heap in GLIBC. This is because nobody on the standards
> committee knows how to implement a heap, and so they plan on
> continuing to rely on GLIBC forever. Unfortunately, the heap in GLIBC
> isn't very good. It doesn't have ALLOCATION for one thing (a function
> that returns the amount of memory in a node given the node's address).

We've discussed that one before. I still maintain that the problem is
not in the language. If you need the length, store it on allocation.
Win32Forth does this.
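
As a sketch of the idea in C (this is not Win32Forth's actual code, and
my_allocate/my_allocation/my_free are made-up names): wrap the system
allocator and stash the requested length in a header just below the
address you hand out.

#include <stdlib.h>
#include <stddef.h>

/* a real version would pad this header to max_align_t */
typedef struct { size_t size; } hdr;

void *my_allocate(size_t n)
{
    hdr *h = malloc(sizeof(hdr) + n);
    if (h == NULL) return NULL;
    h->size = n;                  /* remember the request */
    return h + 1;                 /* user sees memory past the header */
}

size_t my_allocation(void *p)     /* the ALLOCATION Hugh wants */
{
    return ((hdr *)p - 1)->size;
}

void my_free(void *p)
{
    free((hdr *)p - 1);
}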

> Also, as described in the thread mentioned above, FREE doesn't
> specifically tell you if it has been given an address that isn't a
> node. There are other problems.

If code is written to FREE stuff which it didn't ALLOCATE, an
ALLOCATION to get the length of some random address is going to have
serious issues. If you can't FREE it, you can't ALLOCATION it.

> All in all, I would like to get rid of
> Forth's dependence on GLIBC altogether --- if people want to use GLIBC
> they should program in C.

What Forth dependence? Many Forths use the underlying OS allocator,
not a library. Some don't have support for dynamic memory.

>
> I can implement a working heap, and I likely will for my next novice-
> package release. http://www.forth.org/novice.html
> The problem is that I don't understand how the caches work on the
> modern x86 processors, and this is necessary for an efficient
> implementation. My novice package has linked lists and self-balancing
> trees, so an efficient heap is important.

Avoiding fragmentation and getting the locking right should be your
focus; cache performance issues will pale into insignificance if you
choose a poor algorithm. There are lots of resources on the web that
cover this.

s_dub...@nospicedham.yahoo.com

Aug 6, 2012, 11:17:55 PM

On Thursday, August 2, 2012 2:25:46 AM UTC-5, Hugh Aguilar wrote:
> Does anybody know how to implement a heap on the modern x86
> processors?

This is kind of a loaded question to be asked in c.l.a.x., considering that a 'heap' is a high-level language construct... and the target environment has a large say in the implementation due to how memory is managed, dynamically and otherwise... as does the development language.

ASM I assume?
Self-hosted interpreter?

Typically, a heap region between the static data area and the stack is what has traditionally been used. Alternately, a .bss segment can be used, but ELF for Linux seems to require a compile-time size commitment.

> This question came up on comp.lang.forth:
> https://groups.google.com/group/comp.lang.forth/browse_thread/thread/f7cb467ec9d95ce5#
> I cross-posted to c.l.f. for this reason.
>
> The problem is that the ANS-Forth standard, and the upcoming
> Forth-200x standard, both require that the heap be exactly compatible
> with the heap in GLIBC. This is because nobody on the standards
> committee knows how to implement a heap, and so they plan on
> continuing to rely on GLIBC forever. Unfortunately, the heap in GLIBC
> isn't very good. It doesn't have ALLOCATION for one thing (a function
> that returns the amount of memory in a node given the node's address).
> Also, as described in the thread mentioned above, FREE doesn't
> specifically tell you if it has been given an address that isn't a
> node. There are other problems. All in all, I would like to get rid of
> Forth's dependence on GLIBC altogether --- if people want to use GLIBC
> they should program in C.

Yeah, well, I've interfaced to glibc from assembler, but this is because in Linux the environment is a multitasking OS and you're not free to do I/O just any ole way. -same for mem management, like malloc kinds of operations.

> I can implement a working heap, and I likely will for my next novice-
> package release.
> http://www.forth.org/novice.html
> The problem is that I don't understand how the caches work on the
> modern x86 processors, and this is necessary for an efficient
I'm not much help to you here other than to point you to the manual for the processor you are developing on, and read its section on caching, for ideas and concerns.

I picked up my dead-tree version of the i486 manual 'cause it was handy, to see what it says.
Considering Forth as an interpreter, I/O is many magnitudes slower than cache-miss performance. For an assembly programmer though, data alignment to cache line size is an issue to pay attention to. So for your list structure and pointers, it would be helpful to pad or partition your node data to some factor of your cache line size - as well as assuring their placement in your 'heap' at cache-friendly offsets (meaning page alignment).
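
To make that concrete, a padded node in C might look like the following
sketch (the 64-byte line size is an assumption --- check the manual or
CPUID for the actual part):

#include <stdint.h>

#define LINE 64                    /* assumed cache line size */

struct node {
    struct node *next;
    struct node *prev;
    uint32_t     key;
    /* pad so consecutive nodes never share or straddle a line */
    unsigned char pad[LINE - 2 * sizeof(void *) - sizeof(uint32_t)];
};

An array of these, starting at a line-aligned address, then puts exactly
one node per cache line.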


> implementation. My novice package has linked lists and self-balancing
> trees, so an efficient heap is important.

I must be seriously out of the loop on modern Forths, because it just seems a twisting of the language to consider Forth and Heap in the same discussion; I wonder what Charles Moore would have to say about it. Oh well, fwiw.

Steve

Hugh Aguilar

Aug 7, 2012, 1:35:20 AM

On Aug 6, 8:17 pm, s_dubrov...@nospicedham.yahoo.com wrote:
> Yeah, well, I've interfaced to glibc from assembler, but this is because in Linux the environment is a multitasking OS and you're not free to do I/O just any ole way.  -same for mem management, like malloc kinds of operations.

It may be possible to rewrite MALLOC etc. in GLIBC such that they
support ALLOCATION, while still staying within Linux requirements.

> I picked up my dead-tree version of the i486 manual 'cause it was handy, to see what it says.
> Considering Forth as an interpreter, I/O is many magnitudes slower than cache-miss performance.  For an assembly programmer though, data alignment to cache line size is an issue to pay attention to.  So for your list structure and pointers, it would be helpful to pad or partition your node data to some factor of your cache line size - as well as assuring their placement in your 'heap' at cache-friendly offsets (meaning page alignment).

Sure, it is a good idea to make sure that every memory block starts on
a page boundary. How big is a page? 16 bytes? 32 bytes? If I assume
too large of a page, then I waste memory because I am allocating
memory in multiples of the page size, so there is going to be wasted
space on the end (as much as the page size minus one byte, and
averaging about a half page).

Is there anything more to it than page alignment? Don't we also have
another level of cache above that? It would be worthwhile to get
allocated memory blocks next to each other so they are all within this
super-page. That seems like the complicated part.

> > My novice package has linked lists and self-balancing
> > trees, so an efficient heap is important.
>
> I must be seriously out of the loop on modern Forth's because it just seems a twisting of the language to consider Forth and Heap in the same discussion, I wonder what Charles Moore would have to say about it.  Oh well, fwiw.

Well, Forth is traditionally used on micro-controllers. Those programs
don't typically have a heap at all.

My novice package is primarily for use on desktop computers though. I
rely heavily on linked lists, with all of the nodes being on the heap.
This isn't really what Forth is for, but I'm too lazy to learn Lisp or
anything else, so I'm stretching Forth.

BTW: Forth is generally compiled nowadays --- those threaded
interpreters are pretty much obsolete.

s_dub...@nospicedham.yahoo.com

Aug 9, 2012, 10:16:17 AM

On Tuesday, August 7, 2012 12:35:20 AM UTC-5, Hugh Aguilar wrote:
> On Aug 6, 8:17 pm, s_dubrov...@nospicedham.yahoo.com wrote:
> > Yeah, well, I've interfaced to glibc from assembler, but this is because in Linux the environment is a multitasking OS and you're not free to do I/O just any ole way.  -same for mem management, like malloc kinds of operations.
>
> It may be possible to rewrite MALLOC etc. in GLIBC such that they
> support ALLOCATION, while still staying within Linux requirements.

Or initially malloc all that you can and use your own functions to manage within that memory object.
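
In C, that can be as simple as a bump allocator inside one big malloc'd
region --- a sketch only (hypothetical names, no way to free individual
pieces; a real manager would layer a free list on top):

#include <stdlib.h>
#include <stddef.h>

static unsigned char *pool, *cur, *end;

int pool_init(size_t bytes)        /* grab one big region up front */
{
    pool = malloc(bytes);
    if (pool == NULL) return -1;
    cur = pool;
    end = pool + bytes;
    return 0;
}

void *pool_alloc(size_t n)         /* hand out pieces of it */
{
    void *p;
    n = (n + 15) & ~(size_t)15;    /* keep 16-byte alignment */
    if ((size_t)(end - cur) < n) return NULL;
    p = cur;
    cur += n;
    return p;
}
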
> > I picked up my dead-tree version of the i486 manual 'cause it was handy, to see what it says.
> > Considering Forth as an interpreter, I/O is many magnitudes slower than cache-miss performance.  For an assembly programmer though, data alignment to cache line size is an issue to pay attention to.  So for your list structure and pointers, it would be helpful to pad or partition your node data to some factor of your cache line size - as well as assuring their placement in your 'heap' at cache-friendly offsets (meaning page alignment).
>
> Sure, it is a good idea to make sure that every memory block starts on
> a page boundary. How big is a page? 16 bytes? 32 bytes? If I assume
> too large of a page, then I waste memory because I am allocating
> memory in multiples of the page size, so there is going to be wasted
> space on the end (as much as the page size minus one byte, and
> averaging about a half page).
>
> Is there anything more to it than page alignment? Don't we also have
> another level of cache above that? It would be worthwhile to get
> allocated memory blocks next to each other so they are all within this
> super-page. That seems like the complicated part.
You have page alignment and, more importantly, cache size & cache line size. On the 486, the cache line size is 4 doublewords. That is the minimum fetch size, up to the cache size. I seem to recall that those vary according to the processor. That's why I say to read up on the processor that you are developing on. There can be other issues on multicore CPUs regarding multi-processor accesses to the same cache. These are granular issues, and HLLs don't normally give you that granularity directly. The best you can normally do from an HLL is try to assure your data (say your node structure) maps efficiently to the cache size, in other words, alignments.
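
If the C library is available, you can at least ask it for suitably
aligned blocks. A sketch (posix_memalign is the POSIX call;
_aligned_malloc is the Windows spelling, and the 64 is again an assumed
line size):

#include <stdlib.h>

void *alloc_lines(size_t n)      /* line-aligned storage for nodes */
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n) != 0)
        return NULL;
    return p;
}
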
> > > My novice package has linked lists and self-balancing
> > > trees, so an efficient heap is important.
> >
> > I must be seriously out of the loop on modern Forths, because it just seems a twisting of the language to consider Forth and Heap in the same discussion; I wonder what Charles Moore would have to say about it.  Oh well, fwiw.
>
> Well, Forth is traditionally used on micro-controllers. Those programs
> don't typically have a heap at all.

Right, that is Forth I'm more familiar with.
> My novice package is primarily for use on desktop computers though. I
> rely heavily on linked lists, with all of the nodes being on the heap.
> This isn't really what Forth is for, but I'm too lazy to learn Lisp or
> anything else, so I'm stretching Forth.
Hey, go for it. But it may make sense to call it something different then.

> BTW: Forth is generally compiled nowadays --- those threaded
> interpreters are pretty much obsolete.

Amazing, perhaps 'out of vogue' is a better phrase.

Steve

Hugh Aguilar

Aug 14, 2012, 10:34:46 AM

I am writing Straight Forth --- but that is a cross-compiler for micro-
controllers --- there is very little interest in Forth outside of the
micro-controller world, and not much there. For a popular desktop-
computer OS such as Linux, there are myriad good choices available for
an HLL. Forth could never be anything except a weird hobby.

If I do implement a Forth for a desktop OS, 64-bit Menuet would be a
possibility --- my Forth might actually get used, as it would be the
only HLL available, and Forth is more assembly-language friendly than
most HLLs (the Menuet guys would think of it as a framework for
assembly language, rather than as an HLL). Most people have never
heard of Forth or Menuet though, so it would be an obscure corner in
an obscure corner.

> > BTW: Forth is generally compiled nowadays --- those threaded
> > interpreters are pretty much obsolete.
>
> Amazing, perhaps 'out of vogue' is a better phrase.

Threaded code is only useful when you have very tight memory
constraints, and you are not concerned about speed. All computers had
small memories when Chuck Moore invented Forth. Now you see this on a
few micro-controllers (the MSP430 Launchpad), but not too many. The
only reason why threaded code is done on desktop computers is laziness
--- there are a lot of modern Forth systems that are just warmed over
carcasses from the 1980s, or are threaded-code interpreters written in
C and using a gigantic SWITCH statement --- they just give Forth a bad
name.

BGB

Aug 14, 2012, 10:59:43 AM

actually, a merit of threaded code is that it is much easier to maintain
than a full JIT or native compiler would be.

I don't consider bytecode directly interpreted by a switch to be
threaded code, but such a giant switch may be a reasonable way to
translate bytecode into threaded-code, FWIW (the giant switch figures
out which functions should be called, ...).


Rod Pemberton

Aug 15, 2012, 6:22:52 AM

"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in message
news:fc189e1d-34c4-4696...@iw9g2000pbc.googlegroups.com...
...

[snip]

> Threaded code is only useful when you [...] are
> not concerned about speed.

True.

That's especially so IMO if the language being used has many higher-level
routines all of which are built from a few lower-level routines, as some
Forths are implemented. If on the other hand, the Forth has many, many
specialized, low-level routines in assembly, then it needn't be excessively
slow. The specialized routines eliminate, or help to eliminate, the
slowness. Although, even with the "speed enhancement" code, a threaded
interpreter will still be slower than a compiled program.

The power of the language is also a factor. It seems that to do any general
programming in Forth, one must do a bit to a lot of stack juggling. Each
time, that requires a custom sequence of low-level stack manipulations which
are not optimized. I.e., the use of a stack or stacks - while they
simplified the language's syntax - added overhead for the interpreter.
The interpreter has to interpret more Forth words or "instructions".

For interactive input and simple, direct entry programming, threaded code is
more than fast enough.

> Threaded code is only useful when you have very
> tight memory constraints, [...]

In implementing my Forth interpreter in C, I found that the threaded code
interpreter, as implemented for early Forths, is quite good, but lacking
some things that other languages would need. So, I think threaded
interpreters probably didn't get the glory they should've gotten. The real
issue in Forth's case was the direction Forth's syntax went and Forth's
embrace of a stack. Once a stack was introduced and Forth's syntax changed
from more symbolic to more word based, then the language no longer had the
features that would've enhanced the interpreter's design in the way that a
language like C or Pascal would need. It's almost like Charles Moore was
following a "traditional" language implementation (if it existed back
then...), and then for no apparent reason radically changed course.

I.e., I think much, although probably not all, of C's capability could be
implemented with a threaded interpreter. Although BASIC, would probably be
a better choice, maybe even a very good choice. BASIC always seemed to be
implemented as token-threaded code, usually (8-bit) byte code. The base C
language, without many of C's qualifiers, or maybe without some of C's
typesystem, should be do-able. The big problem for C on an interpreter is
its need for a complicated parser. Some of the language constructs in C
simply don't reduce to something that could be looked up in a dictionary,
like for Forth, or found in a linked list. I.e., stages of work must be
resolved. If the parsing (and additional processing) was done separately,
then the interpreter could handle the equivalent C code as threaded code
just fine IMO. Although, it'd still be slower than compiled, but it'd be
portable...

(Please don't "go there" on the "C is portable" already argument.)

IIRC, there are like five C interpreters out there. You could take a look
to find out how they are implemented, how much of the language they
implement, what they change, restrict, etc. AIR, one of them seemed to
derive from Forth or a Forth-like, threaded interpreter.

> [...] The only reason why threaded code is done on
> desktop computers is laziness [...]

I disagree. It's just that an interpreter is easier to implement than a
compiler. I wouldn't call that lazy. It's quite a bit of work to get even
a simple compiler to work correctly.

You almost always need a powerful assembler to help out with the compiling,
or you need the assembler be a key component in the compiler toolchain. You
have to emit binary code at some point. In order to do so, you either limit
yourself to a very limited set of "canned" instructions, or pre-compiled
simple routines in binary, or you use an assembler.

> [...] there are a lot of modern Forth systems that [...]
> are threaded-code interpreters written in C and using
> a gigantic SWITCH statement [...]

True, at least about the use of C and switch() to implement Forths in C.
Every Forth in C written by someone else that I've seen seems to use a
switch. I actually implemented a threaded-code interpreter in C, instead of
using a switch(). I think my Forth in C (still unreleased) is the only one
that doesn't use a switch() ... Most C programmers do things "the C way."

I'd question whether to call a Forth in C using a switch() as being
"threaded-code". I'd also question whether "a lot" of modern Forths do it
that way. Modern Forths, supposedly, at least according to those on
comp.lang.forth, are compilers, not interpreters, but as a regular there,
you should know that...


Rod Pemberton



Robert Redelmeier

Aug 15, 2012, 10:27:39 PM

Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote in part:
> Threaded code is only useful when you have very tight memory
> constraints, and you are not concerned about speed.

??? how do _you_ define threading? Most [successful] threading is
done to break large problems across cores to finish more rapidly:
video animation/compression, finite element analyses, etc.

Threading can save memory when the OS implements strategies
such as read-only codepages and copy-on-write data for
multiple server instances like apache httpd.

But what sort of threading are you thinking of?


-- Robert

BGB

Aug 15, 2012, 11:08:47 PM

the topic was "threaded code", which is not the same as "multithreading".

basically, the idea is that rather than having, say, an interpreter
which dispatches instructions via a large switch or jump-table, it will
instead have code as a collection of function-pointers (to opcode
handlers) or similar (representing individual operations).

in this case, the execution will work by calling all of these function
pointers or similar in-sequence.

alternatively, it may also be possible to spit out intermediate native
code that is basically a sequence of call instructions or similar.

the advantage is that (while still considerably slower than proper
compiler output) it can also be considerably faster than using a
big-switch or similar.


or, see here:
http://en.wikipedia.org/wiki/Threaded_code


Robert Wessel

Aug 16, 2012, 12:00:36 AM

He means threaded in the Forth sense. Basically compiled words
(functions) are lists of pointers to the words they execute, with the
head of the list pointing to the proper interpreter. Usually
non-primitive words have an interpreter which is no more than several
instructions long, which fetches the next (instruction) word, and
invokes its interpreter. Primitive words are usually just code
sequences pointed to by the heads of their definitions. There are some
nuances for handling conditionals (the Forth 0BRANCH primitive, for
example, modifies the pointer used by the inner interpreter).

It tends to lead to very compact compiled code.
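
A toy version of that inner interpreter in C (primitives only, with a
NULL-terminated thread; a real system adds a return stack so DOCOL can
nest colon words, and 0BRANCH works by adjusting ip as described above):

#include <stdio.h>

typedef struct word word;
typedef void (*prim)(word *self);

struct word {
    prim code;    /* the machine action (DOCOL, DOCON, or a primitive) */
    /* a colon word's parameter field (its thread) would follow here */
};

static void p_hello(word *self) { printf("hello "); }
static void p_world(word *self) { printf("world\n"); }

static word w_hello = { p_hello };
static word w_world = { p_world };

/* the inner interpreter: fetch the next word pointer and jump
   indirectly through its code field */
static void inner(word **ip)
{
    word *w;
    while ((w = *ip++) != NULL)
        w->code(w);
}

int main(void)
{
    word *thread[] = { &w_hello, &w_world, NULL };
    inner(thread);
    return 0;
}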

Hugh Aguilar

Aug 16, 2012, 2:24:38 AM

On Wednesday, August 15, 2012 9:00:36 PM UTC-7, Robert Wessel wrote:
> There are some
> nuances for handling conditionals (the Forth 0BRANCH primitive, for
> example, modifies the pointer used by the inner interpreter).

Branching is what really kills the speed. If a compiler only does a little bit of peephole-optimization on branches, and makes no other effort at optimization at all, it will generate code that is at least a couple of times faster than any threaded implementation (either assembly-language or C) --- that is pretty much all that SwiftForth does, and it is 2 or 3 times faster than Gforth.

BTW, just out of curiosity, does anybody know about Ruby internal workings? Why is Ruby slow on loops --- is it using threaded code? Also, does Racket generate threaded code or machine code? I'm interested in Racket --- is it an assembly-language friendly language?

Anyway, I don't know how we got onto the subject of threading schemes, as I originally asked about heap implementation on the x86.

Rod Pemberton

Aug 16, 2012, 5:44:07 AM

"Robert Redelmeier" <red...@ev1.net.invalid> wrote in message
news:k0hlqq$gbi$1...@speranza.aioe.org...
> Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote in part:
...

> > Threaded code is only useful when you have very tight memory
> > constraints, and you are not concerned about speed.
>
> ??? how do _you_ define threading?
> [...]
>
> But what sort of threading are you thinking of?
>

He's discussing the four types of threaded code historically used in
interpreters, like BASIC or early Forths. Modern Forths are compiled.
You're probably familiar with BASIC using bytecodes. That's also called
token threaded code (TTC). The bytecode is a token. I'm not sure where
the term "threaded-code" comes in exactly. Early Forths used direct
threaded code (DTC) or indirect threaded code (ITC). For the most part,
these use address lists for routines. At some point, they stop
"jumping" to addresses and jump to an assembly routine (if written in
assembly) which does some work. The assembly routine is a "primitive" or
low-level Forth word. DTC just jumps directly to the address. I.e., the
address list is just a list of the addresses used for multiple JMPs without
the actual JMP opcodes. The interpreter performs the JMP. Each routine
returns to a central routine - part of the interpreter - which is why JMPs
are used instead of CALLs. ITC jumps through an intermediate address. I
don't recall what the second address used in ITC is exactly, even though I
wrote an ITC Forth in C... Calling routines using assembly instructions is
known as subroutine threaded code (STC).

E.g., STC:

call $swap
call $fetch
call $store
jmp $exit

E.g., DTC:

$swap
$fetch
$store
$exit

The interpreter gets the $swap address and jumps to it. Repeat. Swap etc.
would be a low-level routine in assembly:

$swap
pop EAX
pop EBX
push EAX
push EBX
jmp $next

$exit exits the current "word" (subroutine as an address list) to return to
the prior "word". $next is a central point in the address or inner
interpreter as Forthers call it. Forth also has an outer or text
interpreter which parses and "compiles" words into the dictionary.

Forth uses a "dictionary" which stores the "words" (subroutines) by adding a
header that contains a name, linked-list, entry point, and an address to a
routine which determines how to process each "word". These special routines
interpret a high-level word, jump to the assembly for a low-level word, or
handle the word's contents as a variable or constant, etc. Many Forth's
have most of their words written in Forth with a small set of low-level
routines. So, you end up with an address pointing to an address pointing to
an address ... until either a "primitive" or non-interpreted word is found
and it does some work. E.g., if we call the $swap $fetch $store above
$fixit, then it's part of another "word" (address list) called 'movit'
where 'movit' is:

(movit)
$dup
$fixit
$dup
$fixit
$exit

(fixit)
$swap
$fetch
$store
$exit

(dup)
;primitive

(swap)
;primitive

Probably, the best online reference is Brad Rodriguez's
"Moving Forth" series which has diagrams:
http://www.bradrodriguez.com/papers/


Rod Pemberton


George Neuner

Aug 16, 2012, 4:59:36 PM

On Wed, 15 Aug 2012 23:24:38 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Wednesday, August 15, 2012 9:00:36 PM UTC-7, Robert Wessel wrote:
>> There are some
>> nuances for handling conditionals (the Forth 0BRANCH primitive, for
>> example, modifies the pointer used by the inner interpreter).
>
>Branching is what really kills the speed. If a compiler only does a little bit
>of peephole-optimization on branches, and makes no other effort at
>optimization at all, it will generate code that is at least a couple of times
>faster than any threaded implementation (either assembly-language or
>C) --- that is pretty much all that SwiftForth does, and it is 2 or 3 times
>faster than Gforth.

Depends on the target ... most modern chips happily prefetch code
through unconditional branches - including subroutine call and return.
For conditional branches they prefetch at least one direction - either
"taken" or if prediction is used, the predicted direction. Some high
performance chips even prefetch both directions for conditional
branches and abandon the incorrect instruction stream when the result
of the conditional becomes known.

The overhead of compiler-generated call-threaded code is largely
calling-convention register saves/restores.


>BTW, just out of curiosity, does anybody know about Ruby internal
>workings? Why is Ruby slow on loops --- is it using threaded code?

"Loops" in Ruby are really repeated invocations of an anonymous
function that is the "body" of the loop. I don't know any detail of
the Ruby compiler, but I doubt that it bothers to perform closure
conversion.

Ruby heap allocates almost all data. In the reference implementation,
all data is an instance of a tagged union that can hold any of the
built-in types (pointer, integer, floating point, character, etc.). I
*think* there is an actual linear string type (at least for constant
strings), but Ruby does *not* have linear arrays ... what Ruby calls
an "array" really is a hash table (a "dictionary" object) using
integer keys.

I've only toyed with Ruby (I really don't like it) so I don't know
anything about its GC or interfacing it to foreign code.


>Also, does Racket generate threaded code or machine code? I'm interested
>in Racket --- is it an assembly-language friendly language?

Racket compiles to byte code, but the VM can JIT compile on x86(_64)
and PowerPC platforms.

WRT friendliness: Racket is a Scheme derivative with tagged immediate
data and (by default) compacting GC. Type tags are embedded into most
ALU register-size types (pointers, small integers, booleans,
characters, etc.) and GC may move data in memory, which means external
code has to be written very carefully.

Racket has a reasonably good C interface ... with care, Racket can
call external C functions and, with a lot more care, C functions also
can call Racket functions. Using special function declarations, the
Racket compiler automagically handles conversion between internal
tagged and external untagged data formats for function arguments and
results. You also can create arrays and structures using untagged
data that can be shared directly with external code. Racket is
embeddable into a C program.

The default garbage collector may compact the heap and so data in
memory may be moved if GC is invoked. There is an (older) alternate
non-moving collector that can be used, but it is subject to heap
fragmentation and really isn't suitable for long-running applications.
GC does not occur while in foreign code, but foreign code can call
Racket functions (both system and user written) and GC might occur at
any time while in Racket code. If the foreign code stores pointers to
shared data - e.g., for asynchronous call backs - or if there is a
chance that GC could be invoked unexpectedly, then either:
- the shared data must be specifically allocated non-moveable, or
- the foreign code must reference data by name and ask Racket for
its current location and/or to "pin" the data in place while it is
in use, or
- use the non-moving GC.


>Anyway, I don't know how we got onto the subject of threading schemes,
>as I originally asked about heap implementation on the x86.

Discussion drift is half the value of these forums 8-)

George

Hugh Aguilar

Aug 16, 2012, 11:40:38 PM

On Thursday, August 16, 2012 1:59:36 PM UTC-7, George Neuner wrote:
> On Wed, 15 Aug 2012 23:24:38 -0700 (PDT), Hugh Aguilar
> >Branching is what really kills the speed. If a compiler only does a little bit
> >of peephole-optimization on branches, and makes no other effort at
> >optimization at all, it will generate code that is at least a couple of times
> >faster than any threaded implementation (either assembly-language or
> >C) --- that is pretty much all that SwiftForth does, and it is 2 or 3 times
> >faster than Gforth.
>
> Depends on the target ... most modern chips happily prefetch code
> through unconditional branches - including subroutine call and return.
> For conditional branches they prefetch at least one direction - either
> "taken" or if prediction is used, the predicted direction. Some high
> performance chips even prefetch both directions for conditional
> branches and abandon the incorrect instruction stream when the result
> of the conditional becomes known.

I don't think that any processor prefetches code through register-indirect jumps, which is what 0BRANCH and BRANCH are doing internally --- that is the speed killer.

> The overhead of compiler-generated call-threaded code is largely
> calling-convention register saves/restores.

That kills the speed too.

> I've only toyed with Ruby (I really don't like it)...

I'm not much interested in Ruby either, mostly because I've heard that it is slow and it is not easy to speed it up with foreign-language libraries. Why do you not like it?

Thanks for your info on Racket. The two languages that I want to learn are Racket and Lua (maybe Python too, just because it is so popular, although I'm not very enthusiastic about that one).

My interest in Racket is as a general-purpose language for desktop computers.

My interest in Lua is as a front-end for CAM software (I have worked as a CNC programmer).

s_dub...@nospicedham.yahoo.com

Aug 16, 2012, 11:41:26 PM

Thxs,

I've found Anton Ertl's explanatory page informative,
http://www.complang.tuwien.ac.at/forth/threaded-code.html#what
esp. his comments about use of gnu c.
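
The GNU C feature Ertl leans on is labels-as-values: && takes a label's
address, and "goto *" jumps to a computed address, which gives direct
threading in plain (GCC-only) C. A tiny sketch with made-up opcodes:

#include <stdio.h>

int main(void)
{
    /* the "threaded code": a list of label addresses */
    void *prog[] = { &&op_inc, &&op_inc, &&op_print, &&op_halt };
    void **ip = prog;
    int acc = 0;

    goto *(*ip++);                         /* NEXT */
op_inc:   acc++;               goto *(*ip++);
op_print: printf("%d\n", acc); goto *(*ip++);
op_halt:  return 0;
}

This prints 2; each handler ends by dispatching the next one itself, so
there is no central switch.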

I've used threaded code in assembly, in the executive of my syscalls, where the syscall vectors to a list of function addresses, terminated by a null pointer. It is an unusual way to program, but handy for testing ideas in a scratchpad sort of way. So for a syscall to fopen(), which would require a number of sub-function developments, those can be explored, debug subroutines dropped in the list, or removed, etc., almost on a level of what an interpreter brings to the development cycle; code refactoring is much easier. I needed to prove that the executive was re-entrant - that a syscall could call another syscall through the executive, without the executive losing state, compared to an outside call to the syscalls. It was easy to do with threaded code: bracket the internal syscall caller's list with debug-reporting subroutines, to test the flow control and stack state.

Steve

BGB

Aug 17, 2012, 2:03:41 AM

On 8/16/2012 10:40 PM, Hugh Aguilar wrote:
> On Thursday, August 16, 2012 1:59:36 PM UTC-7, George Neuner wrote:
>> On Wed, 15 Aug 2012 23:24:38 -0700 (PDT), Hugh Aguilar
>>> Branching is what really kills the speed. If a compiler only does a little bit
>>> of peephole-optimization on branches, and makes no other effort at
>>> optimization at all, it will generate code that is at least a couple of times
>>> faster than any threaded implementation (either assembly-language or
>>> C) --- that is pretty much all that SwiftForth does, and it is 2 or 3 times
>>> faster than Gforth.
>>
>> Depends on the target ... most modern chips happily prefetch code
>> through unconditional branches - including subroutine call and return.
>> For conditional branches they prefetch at least one direction - either
>> "taken" or if prediction is used, the predicted direction. Some high
>> performance chips even prefetch both directions for conditional
>> branches and abandon the incorrect instruction stream when the result
>> of the conditional becomes known.
>
> I don't think that any processor prefetches code through register-indirect jumps, which is what 0BRANCH and BRANCH are doing internally --- that is the speed killer.
>

newer processors can sort of fake it, usually by making the assumption
that the jump will land in the same place it did last time.

granted, this doesn't help much for dispatcher loops...


>> The overhead of compiler-generated call-threaded code is largely
>> calling-convention register saves/restores.
>
> That kills the speed too.
>

ideally, if a person wrote their interpreter in assembler, they could
optimize things, say, by caching state in registers rather than using
the C ABI or similar.

of course, a downside is when much of the interpreter is written in C
anyways, so the ASM parts of the code are still forced to use the C ABI.


for example, VM uses:
ESI=VM context;
EDI=Current Opcode.

combined with each opcode initiating the dispatch of the next opcode, say:
mov edi, [edi+VM_OP_OFFS_NEXT]   ; advance to the next opcode struct
jmp dword [edi+VM_OP_OFFS_FCN]   ; tail-jump into its handler


even without all this, a naive C-based dispatch loop is still generally
faster than a big switch.

example:
while(cur && !ctx->rs)
{ cur=cur->fcn(ctx, cur); }


which can be much faster than, say:
while(!ctx->rs) { VM_StepOpcode(ctx); }

void VM_StepOpcode(VM_Context *ctx)
{
    int op;

    op=*ctx->ip++;                              /* fetch opcode byte */
    if(op>192)op=((op-192)<<8)|(*ctx->ip++);    /* extended two-byte opcodes */
    switch(op)
    {
    case VM_OP_NOP: break;
    ...
    default: ctx->rs=VM_ERR_INVALID_OPCODE; break;
    }
}


>> I've only toyed with Ruby (I really don't like it)...
>
> I'm not much interested in Ruby either, mostly because I've heard that it is slow and it is not easy to speed it up with foreign-language libraries. Why do you not like it?
>
> Thanks for your info on Racket. The two languages that I want to learn are Racket and Lua (maybe Python too, just because it is so popular, although I'm not very enthusiastic about that one).
>
> My interest in Racket is as a general-purpose language for desktop computers.
>
> My interest in Lua is as a front-end for CAM software (I have worked as a CNC programmer).
>

well, AFAIK, at its core Racket is still basically Scheme.

not much messed with Lua or Python.
what little I have seen of Python makes me not exactly want to use it so
much though.


I have my own scripting language, which is sort of partway between:
JavaScript, ActionScript, Java, C, and C++.

performance is a bit weak, but this is partly as it currently runs in a
good-old threaded-code interpreter written mostly in C.

it isn't really a performance-critical language though.

wolfgang kern

Aug 17, 2012, 1:57:33 PM

Hugh Aguilar asked:

> Does anybody know how to implement a heap on the modern x86
> processors?

A heap, in my world, would be nothing other than allocated memory.
It now depends on the OS and required heap-size vs. available-size.

if it's a hiding paged memory model:
you may not know if it's in RAM or paged out on disk.

some slow applications could live with such a virtual hideaway ...,
but speed-aware code shouldn't accept swapped-to-disk memory anyway.

So your question may be more related to the OS involved rather than
to x86 CPUs; the latter (OS independent) allows everything desired.

My own OS (KESYS) never used nor needed any virtual RAM-space on disk.
As any application got its own data-space on disk anyway, I couldn't
find a reason to make a copy of this somewhere else on the disk ... :)

__
wolfgang


George Neuner

Aug 20, 2012, 3:45:52 PM

Hi Hugh,

Apologies for the delay - I was away for the weekend.


On Thu, 16 Aug 2012 20:40:38 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Thursday, August 16, 2012 1:59:36 PM UTC-7, George Neuner wrote:
>> On Wed, 15 Aug 2012 23:24:38 -0700 (PDT), Hugh Aguilar
>>
>> >Branching is what really kills the speed. If a compiler only does a little bit
>> >of peephole-optimization on branches, and makes no other effort at
>> >optimization at all, it will generate code that is at least a couple of times
>> >faster than any threaded implementation (either assembly-language or
>> >C) --- that is pretty much all that SwiftForth does, and it is 2 or 3 times
>> >faster than Gforth.
>>
>> Depends on the target ... most modern chips happily prefetch code
>> through unconditional branches - including subroutine call and return.
>> For conditional branches they prefetch at least one direction - either
>> "taken" or if prediction is used, the predicted direction. Some high
>> performance chips even prefetch both directions for conditional
>> branches and abandon the incorrect instruction stream when the result
>> of the conditional becomes known.
>
>I don't think that any processor prefetches code through register-indirect
>jumps, which is what 0BRANCH and BRANCH are doing internally --- that
>is the speed killer.

Sorry, I missed something ... I thought you were asking about x86.
What chip are you referring to?


Any conditional or indirect branch will introduce a bubble into the
prefetch pipeline stage ... but on modern chips the question of
whether the decode stage will stall depends on if the address resolve
must wait on the register (e.g., for a load to complete) and if the
branch target location is somewhere in cache. Typically a code fetch
from L2 will be fast enough to avoid a stall.

Now the x86 also has memory-indirect branch where the target is
fetched from a location provided as an argument to the instruction.
This can't be scheduled as effectively as a separate register load +
branch, so memory-indirect pretty much is guaranteed to stall unless
the address location happens to be in L1 (and the branch target is in
cache as above).


>> I've only toyed with Ruby (I really don't like it)...
>
>I'm not much interested in Ruby either, mostly because I've heard that it
>is slow and it is not easy to speed it up with foreign-language libraries.
>Why do you not like it?

I have some gripes about Ruby's syntax, but my major complaint is with
Ruby's code blocks (lambda functions).

Ruby's code blocks were inspired by Smalltalk, and are intended to be
defined inline and passed directly as method arguments. However,
unlike Smalltalk, Ruby permits methods to accept only a single inline
code block argument. If you need to use multiple code blocks they
must be defined ahead of time and bound to variables (which can then
be passed as normal arguments or accessed as non-locals from within a
method).

It's the need to define code blocks ahead of time which is the
problem. Despite also borrowing from Lisp, AFAICT, there is no way in
Ruby for a code block (or even a regular function) to close over
variables, so references to non-locals within the block will use the
current values of those variables at execution time. IMO this creates
a lot of potential for programmer confusion between the definition
environment and the execution environment.

[I have similar issues with lambdas in Java and C++. Compared to
Lisp, Scheme, ML, etc., I find all of these lambda implementations to
be quite limiting.]

As always, YMMV.

Also Ruby's programming model is essentially pure OO and, FWIW, I'm
not a raving fan. There's a class of problems for which OO is a good
fit, but it isn't a good fit for every problem and I don't like having
to contort solutions to fit a particular model. Generally I prefer
to work in multiparadigm languages that let me mix and model the
solution in the way that best fits the problem.


>Thanks for your info on Racket ... My interest in Racket is as a
>general-purpose language for desktop computers.

IMO Racket is a pretty good tool for cross platform general
application development. It's available on Windows, Unix/Linux and
MacOS, comes with a number of useful libraries, many additional
contributor libraries are available from an online repository, and
most usefully it includes a portable GUI framework that works pretty
much the same across all platforms.


>My interest in Lua is as a front-end for CAM software (I have worked
>as a CNC programmer).

Sorry, I know nothing about Lua.

George

Hugh Aguilar

Aug 21, 2012, 12:48:11 AM

On Monday, August 20, 2012 12:45:52 PM UTC-7, George Neuner wrote:
> On Thu, 16 Aug 2012 20:40:38 -0700 (PDT), Hugh Aguilar
> <hughag...@nospicedham.yahoo.com> wrote:
>
> >On Thursday, August 16, 2012 1:59:36 PM UTC-7, George Neuner wrote:
> >> Depends on the target ... most modern chips happily prefetch code
> >> through unconditional branches - including subroutine call and return.
> >> For conditional branches they prefetch at least one direction - either
> >> "taken" or if prediction is used, the predicted direction. Some high
> >> performance chips even prefetch both directions for conditional
> >> branches and abandon the incorrect instruction stream when the result
> >> of the conditional becomes known.
> >
>
> >I don't think that any processor prefetches code through register-indirect
> >jumps, which is what 0BRANCH and BRANCH are doing internally --- that
> >is the speed killer.
>
> Sorry, I missed something ... I thought you were asking about x86.
> What chip are you referring to?

I am only interested in the x86 at this time, although the ARM later on.

I thought that no chip at all prefetched code through register-indirect jumps, although I may have been wrong about that.

> Any conditional or indirect branch will introduce a bubble into the
> prefetch pipeline stage ... but on modern chips the question of
> whether the decode stage will stall depends on if the address resolve
> must wait on the register (e.g., for a load to complete) and if the
> branch target location is somewhere in cache. Typically a code fetch
> from L2 will be fast enough to avoid a stall.

> Now the x86 also has memory-indirect branch where the target is
> fetched from a location provided as an argument to the instruction.
> This can't be scheduled as effectively as a separate register load +
> branch, so memory-indirect pretty much is guaranteed to stall unless
> the address location happens to be in L1 (and the branch target is in
> cache as above).

In ITC the cfa is in a register (it is not a literal encoded in the operand), and the cfa contains the address of the code to execute. This is definitely going to stall.

In DTC the cfa is in a register (it is not a literal encoded in the operand), and the cfa is the address of the code to execute. My understanding of what you're saying, is that this won't stall --- although I had assumed that it would.

> >> I've only toyed with Ruby (I really don't like it)...
> >
> >I'm not much interested in Ruby either, mostly because I've heard that it
> >is slow and it is not easy to speed it up with foreign-language libraries.
> >Why do you not like it?
>
> I have some gripes about Ruby's syntax, but my major complaint is with
> Ruby's code blocks (lambda functions).
>
> Ruby's code blocks were inspired by Smalltalk, and are intended to be
> defined inline and passed directly as method arguments. However,
> unlike Smalltalk, Ruby permits methods to accept only a single inline
> code block argument. If you need to use multiple code blocks they
> must be defined ahead of time and bound to variables (which can then
> be passed as normal arguments or accessed as non-locals from within a
> method).
>
> It's the need to define code blocks ahead of time which is the
> problem. Despite also borrowing from Lisp, AFAICT, there is no way in
> Ruby for a code block (or even a regular function) to close over
> variables, so references to non-locals within the block will use the
> current values of those variables at execution time. IMO this creates
> a lot of potential for programmer confusion between the definition
> environment and the execution environment.
>
> [I have similar issues with lambdas in Java and C++. Compared to
> Lisp, Scheme, ML, etc., I find all of these lambda implementations to
> be quite limiting.]

I don't understand why this is a problem. It seems intuitive that the variables should be the current values at run-time --- why would the programmer expect the values at compile-time to be used?

> >Thanks for your info on Racket ... My interest in Racket is as a
> >general-purpose language for desktop computers.
>
> IMO Racket is a pretty good tool for cross platform general
> application development. It's available on Windows, Unix/Linux and
> MacOS, comes with a number of useful libraries, many additional
> contributor libraries are available from an online repository, and
> most usefully it includes a portable GUI framework that works pretty
> much the same across all platforms.

Anyway, I'm not going to worry about Ruby esoterica --- I'm just going to go with Racket. I really don't know very much about computer science --- one of the best things about Racket is that there is beaucoup documentation (Scheme has been taught in schools for a long time, and a lot has been written about it) --- Racket is a learning language, and learning is what I need to do. I had to drop Factor mostly because I didn't understand the concepts and the documentation was all a reference that assumed that the programmer already knew what he was doing.

George Neuner

Aug 22, 2012, 4:13:41 AM

On Mon, 20 Aug 2012 21:48:11 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Monday, August 20, 2012 12:45:52 PM UTC-7, George Neuner wrote:
>> On Thu, 16 Aug 2012 20:40:38 -0700 (PDT), Hugh Aguilar
>> <hughag...@nospicedham.yahoo.com> wrote:

>
>> Any conditional or indirect branch will introduce a bubble into the
>> prefetch pipeline stage ... but on modern chips the question of
>> whether the decode stage will stall depends on if the address resolve
>> must wait on the register (e.g., for a load to complete) and if the
>> branch target location is somewhere in cache. Typically a code fetch
>> from L2 will be fast enough to avoid a stall.
>
>> Now the x86 also has memory-indirect branch where the target is
>> fetched from a location provided as an argument to the instruction.
>> This can't be scheduled as effectively as a separate register load +
>> branch, so memory-indirect pretty much is guaranteed to stall unless
>> the address location happens to be in L1 (and the branch target is in
>> cache as above).
>
>In ITC the cfa is in a register (it is not a literal encoded in the operand),
>and the cfa contains the address of the code to execute. This is
>definitely going to stall.

Typo? I assume you meant the target address is in *memory* (as in a
jump table). This is *not* guaranteed to stall but only to introduce
bubbles into the pipeline. Technically to "stall" means the execution
pipeline drains completely - i.e. the CPU runs out of work to do -
before the code fetch mechanism can provide new work.

Remember that modern x86 (since the Pentium Pro and excepting Atom) are
OoO processors. The CPU can look ahead in the instruction stream and
begin working on an upcoming branch before executing unrelated
instructions that precede the branch.

I'm currently not aware of any OoO ARM, but the design doesn't
constrain execution order. ARM itself doesn't make chips - it
licenses the design - and so the details of implementation are up to
the manufacturer.


If both the jump table location and the target of the branch are in
cache, you take a hit but not necessarily a stall. The degree of the
hit depends on exactly where in the cache hierarchy the requested
items lie.

The x86 since i486 has had separate L1 caches for data and code. Since
Pentium Pro (P6) there has been on chip a combined code/data L2 cache.
Since the Pentium IV the L1 code cache contains not x86 instructions
but rather already decoded instruction traces in the CPU's internal
wide format. Code fetched from the L1 trace cache skips decoding and
is injected directly into the execution stage.

Still the case of a subroutine return followed closely by a memory
indirect branch is likely to cause a pipeline stall ... but this is
because the subroutine return itself *is* a memory indirect branch
(with address pulled from the stack) and the CPU can't handle two
memory indirect branches back-to-back unless all the data and code
involved can be served from the L1 caches.

However, there are ways to tail thread an interpreter so that each
"instruction" subroutine ends with a single indirect branch directly
to the next subroutine. It isn't necessary to return to a dispatcher
after each subroutine.
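
In C, that tail-threading looks something like the following sketch
(hypothetical op structure; with tail-call optimization each trailing
call compiles down to a single indirect jump, so control never returns
to a dispatch loop):

typedef struct op op;
typedef void (*fn)(op *cur, int *acc);

struct op { fn f; op *next; };

/* each handler does its work, then branches straight into the next
   handler --- one indirect branch per "instruction" */
static void op_add1(op *cur, int *acc)
{
    (*acc)++;
    cur->next->f(cur->next, acc);    /* NEXT, inlined at the end */
}

static void op_stop(op *cur, int *acc) { (void)cur; (void)acc; }

int main(void)
{
    int acc = 0;
    op stop = { op_stop, 0 };
    op two  = { op_add1, &stop };
    op one  = { op_add1, &two };
    one.f(&one, &acc);               /* add1, add1, stop */
    return acc;                      /* exits with 2 */
}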


>In DTC the cfa is in a register (it is not a literal encoded in the operand),
>and the cfa is the address of the code to execute. My understanding
>of what you're saying, is that this won't stall --- although I had assumed
>that it would.

Right. Assuming the address value is in the register the cost to
redirect the prefetch is just an extra cycle to access it. Again this
is dependent on the register load having completed before the branch
instruction begins to execute, and on the target of the branch being
somewhere in cache (at least in L2).


>> It's the need to define code blocks ahead of time which is the
>> problem. Despite also borrowing from Lisp, AFAICT, there is no way in
>> Ruby for a code block (or even a regular function) to close over
>> variables, so references to non-locals within the block will use the
>> current values of those variables at execution time. IMO this creates
>> a lot of potential for programmer confusion between the definition
>> environment and the execution environment.
>
>It seems intuitive that the variables should be the current values at run-time
>--- why would the programmer expect the values at compile-time to be
> used?

I don't have handy a good example and it's a bit difficult to explain
my objection, but it's a matter of context and where the programmer's
attention is focused when writing the code. The case that concerns me
is a too often encountered "oops!" where the programmer's train of
thought gets derailed.

The typical use of a code block is _supposed_ (in the sense of
"thought to be") to be a little helper function defined inline at the
point of use. There are 2 problems with this:

Ruby actively discourages using explicit looping code (though it's
possible) in favor of using iterator functions. A Ruby function can
accept a code block argument - but only one. Ruby's built-in iterator
functions all are 1-dimensional and pass only a single argument to the
block when it's executed. It is possible to write your own iterator
functions, but that is an advanced control concept that a lot of
programmers likely will never explore. It is far more likely that a
programmer will contort the code to fit what Ruby provides.

So if you find you need to pass more than one parameter to a code
block, it's very likely that you will pass the extra parameters
indirectly by referencing non-locals. And if you find you need to
pass more than one code block into a function, you have to cut and
paste your inline definitions into explicit forward definitions
(requiring a slightly different syntax) bound to named variables and
then go back and rewrite the call site.

Assuming you got the cut-n-paste correct, you suddenly have a
*reusable* function that wasn't written with reusability in mind. But
you'll tend to forget that and, somewhere along the line, you are
bound to try to reuse one where its external dependencies aren't
satisfied. Worse, it might even appear to work.

And once you discover that your little function really isn't reusable
and want to make it so ... you can't ... at least not without
redesigning and refactoring (some part of) your program. So instead
you are likely to copy/tweak it and from then on have to maintain
redundancies.

All of which might have been avoided had you designed a stateful
"function" object to use instead of the code block in the first place
... but you can't always pass a regular object to an iterator function
(it depends on how it was written), so the program design would have
to have taken a different course from the beginning.

To belabor my point: if Ruby allowed functions to accept multiple code
blocks (as did Smalltalk, which was the inspiration), and if its code
blocks could - at least optionally and with compiler warnings - close
over their external dependencies, the whole issue would go away and
the language would be both more powerful and easier to use.

I'm not in any way claiming these to be fatal flaws in Ruby ... but
the cut-n-paste is just annoying busy work under the best of
circumstances and it can result in a lot of head scratching if/when
the code block reuse bug bites. The worst of it is that I believe it
could easily have been avoided.


> ... I really don't know very much about computer science

FWIW, I have a CS Masters in database/data modeling, accumulated
course work (sans dissertation) for a second Masters in programming
language design/implementation, and 20+ years professional experience
in various areas including medical and HRT industrial imaging, high
performance server apps, some bare metal embedded, and along the way
I've actually managed to use my education a few times to design
databases and create DSL compilers/runtimes. I've forgotten how many
languages I've learned over the years.

For fun I experiment with languages, compilers and runtime systems ...
like so many others I'm searching for a really good way for the
*average* programmer to write programs that *safely* take advantage of
close coupled (i.e. not message passing) parallelism, without having
to explicitly deal with synchronization and without having to learn
strange new languages or weird development paradigms.

But (much more than) enough about me 8-)


It certainly isn't necessary to understand CS theory to apply it. I
think a formal CS education is (or ought to be) necessary for certain
application areas: operating systems, databases, developer tools, etc.
Other important areas like engineering design and support tools have
their own math/science/engineering requirements.

But for run of the mill productivity / education / entertainment
applications, I don't believe a formal CS or CE education is necessary
... maybe desirable, but not necessary.

Program development is as much art as it is science ... really good
developers are kin to renaissance philosophers. IMO, the most
important attributes for a developer to have are attention to detail,
the ability to visualize problems at many different levels of
abstraction and to quickly jump from one abstraction level to another
without getting confused.

As is typical, we're way off topic now and YMMV 8-)
George

Hugh Aguilar
Aug 22, 2012, 11:59:26 PM
On Wednesday, August 22, 2012 1:13:41 AM UTC-7, George Neuner wrote:
> On Mon, 20 Aug 2012 21:48:11 -0700 (PDT), Hugh Aguilar
> <hughag...@nospicedham.yahoo.com> wrote:
> >In ITC the cfa is in a register (it is not a literal encoded in the operand),
> >and the cfa contains the address of the code to execute. This is
> >definitely going to stall.
>
> Typo? I assume you meant the target address is in *memory* (as in a
> jump table). This is *not* guaranteed to stall but only to introduce
> bubbles into the pipeline. Technically to "stall" means the execution
> pipeline drains completely - i.e. the CPU runs out of work to do -
> before the code fetch mechanism can provide new work.

In ITC, the target address is in memory, but it is not a jump-table.

In both DTC and ITC, the compiled code consists of an array of pointers; each cell holds a CFA (code field address).

In DTC (direct threaded code) the code field contains the machine code itself, typically placed right in front of the threaded code.

In ITC (indirect threaded code) the code field, which sits in front of the threaded code, holds a pointer to the machine code. For colon words, this machine code is called DOCOL; for constants it is DOCON; etc.

Chuck Moore used ITC originally. The advantage is that it saves memory. Forth colon words (functions) tend to be small, so there are quite a lot of them. On the old 8-bit processors, DOCOL was quite bulky. With ITC, you only have one copy of DOCOL and every colon word has a pointer to it in its CFA.

DTC was introduced later on. UR/Forth for MS-DOS used DTC. More memory is used because every colon word has a copy of DOCOL stuck in front of it. DTC is much faster though. Also, with the modern processors, the DOCOL code isn't very big.
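
To make the difference concrete, here is a minimal sketch of ITC-style dispatch in C rather than assembly. All of the names (word, next, do_print) are invented for the illustration, and a real system jumps between handlers instead of nesting C calls.

/* Minimal ITC dispatch sketch. The thread is an array of CFAs; each
   code field holds a pointer to "machine code" (a C function here).
   DTC would store the code itself in the code field, removing one
   indirection from next(). */
#include <stdio.h>

typedef struct word word;
typedef void (*handler)(word *);  /* what a code field points to */

struct word {
    handler     code;   /* the code field */
    const char *body;   /* parameter field, just a string here */
};

static word **ip;       /* threaded-code instruction pointer */

static void next(void)  /* NEXT: fetch the next CFA, run its handler */
{
    word *w = *ip++;
    w->code(w);         /* the extra indirection that DTC avoids */
}

static void do_print(word *w) { puts(w->body); next(); }
static void do_bye(word *w)   { (void)w; /* no next(): thread ends */ }

int main(void)
{
    word hello = { do_print, "hello" };
    word world = { do_print, "world" };
    word bye   = { do_bye,   0       };
    word *thread[] = { &hello, &world, &bye };

    ip = thread;
    next();
    return 0;
}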

The PDP-11 was an awesome processor for DTC. It had a JSR (jump to subroutine) instruction that would put the return-address into a register, after pushing the contents of that register to the stack. Essentially, the top value of the return stack was a register. This worked out really well: the DOCOL code just did a JSR to the inner-interpreter, and the IP (instruction pointer) for the threaded code was already in its register. That worked so well that I had to wonder if the DEC designers had Forth in mind when they made their JSR work like that.

This is all ancient history though --- nowadays it is best to just use subroutine-threading.

> However, there are ways to tail thread an interpreter so that each
> "instruction" subroutine ends with a single indirect branch directly
> to the next subroutine. It isn't necessary to return to a dispatcher
> after each subroutine.

I don't understand this at all. Can you elaborate?

> All of which might have been avoided had you designed a stateful
> "function" object to use instead of the code block in the first place
> ... but you can't always pass a regular object to an iterator function
> (it depends on how it was written), so the program design would have
> to have taken a different course from the beginning.

Factor does this with its quotations. All looping is done with quotations.

I'm going to have something like quotations (or closures, or whatever you want to call them) in my Straight Forth. I'm not going to use them for looping the way that Factor does though (I will just have BEGIN loops like traditional Forth). I didn't really like quotations for looping. I am just going to use them for iterators, and possibly for coroutines (if I can figure out the whole coroutine concept).

> > ... I really don't know very much about computer science
>
> FWIW, I have a CS Masters in database/data modeling, accumulated
> course work (sans dissertation) for a second Masters in programming
> language design/implementation, and 20+ years professional experience
> in various areas including medical and HRT industrial imaging, high
> performance server apps, some bare metal embedded, and along the way
> I've actually managed to use my education a few times to design
> databases and create DSL compilers/runtimes. I've forgotten how many
> languages I've learned over the years.

Do you know anything about the PIC24? That is going to be the first target for my Straight Forth --- I'll wait on the ARM until later.

> For fun I experiment with languages, compilers and runtime systems ...
> like so many others I'm searching for a really good way for the
> *average* programmer to write programs that *safely* take advantage of
> close coupled (i.e. not message passing) parallelism, without having
> to explicitly deal with synchronization and without having to learn
> strange new languages or weird development paradigms.

Do you know Erlang?

Do you know anything about the Propeller?

Hugh Aguilar
Aug 23, 2012, 1:51:29 AM
On Wednesday, August 22, 2012 1:13:41 AM UTC-7, George Neuner wrote:
> For fun I experiment with languages, compilers and runtime systems ...
> like so many others I'm searching for a really good way for the
> *average* programmer to write programs that *safely* take advantage of
> close coupled (i.e. not message passing) parallelism, without having
> to explicitly deal with synchronization and without having to learn
> strange new languages or weird development paradigms.

I had a job (www.testra.com) in the mid 1990s writing a Forth development system for the MiniForth processor. That processor packed up to 5 instructions into a single opcode, and they would all execute concurrently. My assembler would rearrange the instructions at compile-time and pack them into the opcodes so as to minimize the number of NOP instructions that had to be introduced, while ensuring that the program did the same thing as if the instructions had been assembled one per opcode and executed sequentially in the order that the programmer wrote them in his source code.

My assembler was doing at compile-time what the Pentium does at run-time when it pushes the instructions down the U and V pipes.
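
As a rough illustration of that compile-time job (in C, with everything invented for the sketch): a greedy scheduler that packs instructions into 5-slot bundles while respecting data dependencies. The real MiniForth assembler of course tracked actual registers and resources; this only shows the shape of the problem.

/* Toy greedy bundler: place an instruction only after its dependency
   sits in a strictly earlier bundle, since slots within one bundle
   execute concurrently.  Illustrative only. */
#include <stdio.h>

#define N     6
#define SLOTS 5

/* dep[i] = instruction whose result i consumes, or -1 for none */
static const int dep[N] = { -1, -1, 0, -1, 2, 3 };

int main(void)
{
    int placed[N] = { 0 };  /* bundle number (1-based), 0 = unplaced */
    int done = 0;

    for (int bundle = 1; done < N; ++bundle) {
        int slots = 0;
        for (int i = 0; i < N && slots < SLOTS; ++i) {
            if (placed[i]) continue;
            if (dep[i] == -1 ||
                (placed[dep[i]] && placed[dep[i]] < bundle)) {
                placed[i] = bundle;
                ++slots; ++done;
                printf("insn %d -> bundle %d\n", i, bundle);
            }
        }
        if (slots == 0) break;  /* would mean a dependency cycle */
    }
    return 0;
}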

That is pretty closely coupled parallelism! For the most part, closely-coupled parallelism is a hardware issue. Loosely-coupled parallelism with message-passing seems more interesting from a software standpoint.

If you are really interested in close-coupled parallelism, you might want to look at the GA144 from Chuck Moore. I don't have any interest in it, but it might be up your alley.

George Neuner
Aug 24, 2012, 7:09:38 PM
On Wed, 22 Aug 2012 20:59:26 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Wednesday, August 22, 2012 1:13:41 AM UTC-7, George Neuner wrote:
>> On Mon, 20 Aug 2012 21:48:11 -0700 (PDT), Hugh Aguilar
>> <hughag...@nospicedham.yahoo.com> wrote:
>>
>> >In ITC the cfa is in a register (it is not a literal encoded in the operand),
>> >and the cfa contains the address of the code to execute. This is
>> >definitely going to stall.
>>
>> Typo? I assume you meant the target address is in *memory* (as in a
>> jump table). This is *not* guaranteed to stall but only to introduce
>> bubbles into the pipeline. Technically to "stall" means the execution
>> pipeline drains completely - i.e. the CPU runs out of work to do -
>> before the code fetch mechanism can provide new work.
>
>In ITC, the target address is in memory, but it is not a jump-table.

The thread list in a DTC or ITC implementation *is* a jump table ...
albeit one that is walked sequentially rather than accessed by
arbitrary index.


>> However, there are ways to tail thread an interpreter so that each
>> "instruction" subroutine ends with a single indirect branch directly
>> to the next subroutine. It isn't necessary to return to a dispatcher
>> after each subroutine.
>
>I don't understand this at all. Can you elaborate?

The exit jump at the end of each subroutine in a DTC or ITC
implementation is an example of a continuation passing style tail
threaded call. However, the conventional implementation by indirect
jump is not necessarily the most efficient one.

Most everything below is fairly obvious if you've thought about it, so
if you have, just nod politely and bear with me 8-)

Code is pseudo assembler and doesn't represent any particular ISA. I
assume a virtual machine architecture with a thread list "instruction"
pointer kept in a register. I assume nothing about the VM's use of
stacks other than one or more additional registers will be required
for stack pointers. In particular, I don't assume whether the
call/return control stack is ISA or software convention.



First thing is to recognize that DTC and ITC both implement a
distributed form of a dispatch loop. Instead of having a centralized
loop to which each subroutine returns: e.g.,

thread = { &A, &B, &C, ..., &exit }

# centralized dispatch

start:  load Rcp,&thread
loop:   load Rx,*Rcp     ; load subroutine address
        add  Rcp,4       ; incr thread list pointer
        call Rx          ; call subroutine
        jump loop

A:      ...
        return

B:      ...
        return

in DTC/ITC, each "instruction" subroutine incorporates one iteration
of the dispatch loop: e.g.,

A:      ...
        load Rx,*Rcp     ; load subroutine address
        add  Rcp,4       ; incr thread list pointer
        jump Rx          ; jump to next subroutine

Depending on the chip and choice of DTC or ITC, in the general case
this results in 3-5 instructions being replicated into each of the
subroutines.

Naturally, there could be a common completion routine to which each
subroutine jumps to finish. This might save a few code bytes per
subroutine, but is likely to execute slower than inline exit code.

So what?

Well ... most ISAs include a simple subroutine call/return mechanism
which involves only an address pushed onto or pulled from a stack.
[I say "most" because there are chips where the ISA call/return
mechanism automatically saves/restores certain CPU state and thus puts
more than just an address onto the stack.]

The return instruction pulls an address from the stack, updates the
stack pointer, and transfers control to the address. In all ways but
one - the address source being a stack vs a list - its behavior
parallels that of the subroutine exit code in a conventional DTC
implementation.

So, it should be obvious that a DTC thread list can be pushed (in
reverse order) onto a stack and the ISA return instruction can be used
to jump from one thread subroutine to the next. On many chips this
will execute faster than the 3 instruction jump, and additionally it
frees up a register that is no longer needed for the thread list
pointer.
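
Here is that idea simulated in portable C (all names invented): each routine's exit pops the next routine's address and jumps to it, which on real hardware collapses to a single RET with the stack pointer aimed at the thread list.

/* Stack-threading sketch: NEXT is "pop an address and go there".
   On x86 that pop+jump is literally one RET instruction.  A C
   simulation can only model it with calls, so the C stack grows;
   real code would not. */
#include <stdio.h>

typedef void (*op)(void);

static op  thread[16];  /* stands in for the hardware stack */
static op *sp;          /* stands in for ESP */

static void ret_next(void) { (*sp++)(); }  /* the simulated RET */

static void op_a(void)    { puts("A"); ret_next(); }
static void op_b(void)    { puts("B"); ret_next(); }
static void op_halt(void) { /* no ret_next(): end of thread */ }

int main(void)
{
    /* pushing the list in reverse order onto a downward-growing
       hardware stack leaves the first routine on top; this
       upward-walking simulation just stores execution order */
    thread[0] = op_a;
    thread[1] = op_b;
    thread[2] = op_halt;
    sp = thread;
    ret_next();
    return 0;
}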

Apart from possible issues with allowable stack size - which can be
handled by partitioning the thread list into blocks and using
trampoline functions to load the stack - (ab)using the return
mechanism in this way has no impact on also using the stack for normal
runtime library or system calls. It does, however, require the VM to
maintain a software convention stack for its own purposes.

More interestingly, (ab)using the return mechanism for threading can
create possibilities for additional optimization. In cases where a
thread subroutine calls library or system functions, in some cases it
may be possible to have a called function return directly to the next
thread subroutine rather than to the one that called it.



>Do you know anything about the PIC24?

Sorry no. I never used any of the PICs. I never worked in the real
small ... my metal work was variously with 80186, 68K and relatively
modern floating point DSPs.

>Do you know Erlang?

Of it. Erlang was designed by a group at Ericsson ... originally for
telephone switch application programming.

Erlang is a declarative functional language. It provides (mostly)
immutable data, pattern matching dispatch, language level lightweight
threads/processes/tasks which communicate by message passing [ala
Hoare's CSP], and a portable concurrent runtime environment.

However, Erlang is a high level language that provides no low level
programming capabilities, so it is dependent on native code services
for hardware control.


>Do you know anything about the Propeller?

If you mean Parallax's chip ... I'm aware of it and that's about it.

George

Hugh Aguilar
Aug 26, 2012, 3:12:55 AM
On Friday, August 24, 2012 4:09:38 PM UTC-7, George Neuner wrote:
> Code is pseudo assembler and doesn't represent any particular ISA. I
> assume a virtual machine architecture with a thread list "instruction"
> pointer kept in a register. I assume nothing about the VM's use of
> stacks other than one or more additional registers will be required
> for stack pointers. In particular, I don't assume whether the
> call/return control stack is ISA or software convention.

I don't know what the acronym ISA means --- I assume that the "A" is for "architecture," as in what registers are available and with what addressing modes or special instructions.

> So, it should be obvious that a DTC thread list can be pushed (in
> reverse order) onto a stack and the ISA return instruction can be used
> to jump from one thread subroutine to the next. On many chips this
> will execute faster than the 3 instruction jump, and additionally it
> frees up a register that is no longer needed for the thread list
> pointer.

That is a pretty clever idea. I hadn't thought of that, so it was non-obvious to me anyway.

You aren't freeing up a register though; you still need 3 registers with permanently assigned purposes: the data stack pointer, return stack pointer and instruction pointer. With subroutine threading, you need 2 registers with permanently assigned purposes (no instruction pointer). This was traditionally the weakness of Forth compared to C --- that C only needed 1 register with a permanently assigned purpose, as it used the return stack for passing data (a separate stack-frame pointer helps, but is not strictly required). This is why it was possible to implement C on register-starved processors such as the 6808 that Forth really couldn't be implemented on. On processors with only a few registers (such as the 6809 or 6812), you get one more general-purpose register with C than with Forth, which helps a lot.

On a lot of old processors (and the archaic x86), the only register that did pre-decrement and post-increment was the return stack pointer (SP on the 8088). For this reason, UR/Forth used SP as the data stack pointer (and kept the top value of the data stack in BX). This sped up the low-level stack-juggling words. For example, DUP is just a PUSH of BX and DROP is just a POP of BX. By comparison, SwiftForth uses EBP as the data stack pointer, and the code is bloated out with a lot of ADD and SUB of #4 to EBP because the pre-decrement and post-increment have to be done manually.

You are using SP as your instruction pointer to boost the speed of the threading, but this precludes using SP as the data-stack pointer to boost the speed of the stack-juggling. This is a bad trade-off --- you are better off to have fast stack-juggling and slightly slower threading --- on the x86 you can use SI for the instruction pointer and have a pretty fast NEXT using the LODSW instruction which does an automatic increment of the SI (that is what UR/Forth did).

What I'm going to do in Straight Forth is subroutine threading. You call functions with CALL just like usual. I'm using ESP as the data stack pointer, however. The first thing that each function does is POP the return address into a general-purpose register to get it off the data stack, where it is in the way of accessing the data. If the function is "primitive" in the sense that it doesn't call any sub-functions, then the return address can be held in that register throughout the execution of the function; at the end we just do an indirect jump through that register to return. If the function is "secondary" in the sense that it does call sub-functions, then it is necessary to push the return address onto the Forth return stack (I'm using EBP as the return-stack pointer).

I'm assuming here that it is worthwhile on the modern x86 to use the ESP as the data-stack pointer. This may not be true though. I could use EBP as the data-stack pointer similar to what SwiftForth does. The code is a lot more bloated that way because of all the ADD and SUB #4 to EBP. It is possible that these ADD and SUB instructions get parallelized though, so they aren't taking up extra clock cycles. The code is more bloated though, so it is not going to fit in the code cache as well. I think that my system is faster than SwiftForth's but I can't be sure without implementing it and testing.

The big problem with the x86 is that nobody knows what it is doing internally. The chip makers just provide vague guidelines for "optimization" but they don't provide any concrete information about what the chip is doing internally (like they did in the early Pentium days when Abrash wrote his 2nd book) because they don't want the other chip makers to know their trade secrets. The result, though, is that x86 assembly-language programming is like voodoo --- you have rituals (the vague guidelines that I mentioned) that you do, but you don't know if they work or not, and if they do work you don't know why.

> More interestingly, (ab)using the return mechanism for threading can
> create possibilities for additional optimization. In cases where a
> thread subroutine calls library or system functions, in some cases it
> may be possible to have a called function return directly to the next
> thread subroutine rather than to the one that called it.

That is true --- you can just JMP into the function and let it do the RTS that starts the next function in the threaded sequence. This is only going to work though, if that library or system function doesn't have any parameters, otherwise you are going to need a wrapper.

I'll give some more thought to your scheme of using SP as the instruction pointer. I don't think it is a good idea on the x86, and I will continue with my current plan. It might be a good idea on processors with little memory available though --- the AVR or MSP430 for example --- not the PIC24 though, because it has plenty of code memory and is only limited in data memory, so subroutine-threading is best there.

BTW, on the MiniForth the NEXT took zero clock cycles --- it got parallelized every time. :-)

Hugh Aguilar
Aug 27, 2012, 4:49:08 AM
On Aug 26, 12:12 am, Hugh Aguilar
<hughaguila...@nospicedham.yahoo.com> wrote:
> I'll give some more thought to your scheme of using SP as the instruction pointer.

I gave some more thought to it and realized that it is a very bad idea
--- every time that an interrupt occurs your code will be corrupted,
and you won't know about the problem until the next time that you
execute that function. Oh well, it seemed like a clever idea at first
glance.

George Neuner
Aug 27, 2012, 2:55:14 PM
On Sun, 26 Aug 2012 00:12:55 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Friday, August 24, 2012 4:09:38 PM UTC-7, George Neuner wrote:
>
>I don't know what the acronym ISA means --- I assume that the "A" is
>for "architecture," as in what registers are available and with what
>addressing modes or special instructions.

Yes. ISA stands for "Instruction Set Architecture".

In a generic discussion it's usually more appropriate to talk about
the programmer visible architecture rather than a specific
implementation of it. E.g., to speak about "x86" rather than deal
with the difference between, say, i486 and Pentium.


>> So, it should be obvious that a DTC thread list can be pushed (in
>> reverse order) onto a stack and the ISA return instruction can be used
>> to jump from one thread subroutine to the next. On many chips this
>> will execute faster than the 3 instruction jump, and additionally it
>> frees up a register that is no longer needed for the thread list
>> pointer.
>
>That is a pretty clever idea. I hadn't thought of that, so it was
>non-obvious to me anyway.
>
>You aren't freeing up a register though; you still need 3 registers with
>permanently assigned purposes: the data stack pointer, return stack pointer
>and instruction pointer.

That depends on the semantics of the VM being implemented. Forth is
low enough level wrt to hardware that it often does permit abusing ISA
pointers as VM pointers, but this isn't possible in general. With a
higher level language or an uncooperative host environment, a dual
stack VM implementation may require up to 5 pointers:

- ISA instruction pointer
- ISA stack pointer
- VM instruction pointer
- VM control stack pointer
- VM evaluation stack pointer

depending on how (or if) threading is implemented.

Subroutine threading permits conflating the ISA instruction pointer
with the VM instruction pointer. DTC or ITC threading doesn't permit
that, but often does permit (ab)using an ISA stack pointer as the VM
instruction pointer as I described.

However, in both cases the VM's architectural stacks still may require
separate pointers.


>You are using SP as your instruction pointer to boost the speed of
>the threading, but this precludes using SP as the data-stack pointer
>to boost the speed of the stack-juggling. This is a bad trade-off

Only if the ISA stack register is special wrt addressing or increment
modes. But with most modern chips any pointer or general register can
be used to implement a software stack with little loss of performance.
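
In C terms the "software stack" is nothing but a pointer and explicit adjustment (a sketch with invented names):

/* A software stack on an arbitrary pointer: push is a pre-decrement
   store, pop a load plus post-increment -- the same ADD/SUB-#4
   sequences discussed above, just spelled in C. */
#include <stdint.h>

#define DEPTH 256
static int32_t  cells[DEPTH];
static int32_t *psp = cells + DEPTH;  /* grows downward, like hardware */

static void    push(int32_t x) { *--psp = x; }
static int32_t pop(void)       { return *psp++; }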


>The big problem with the x86 is that nobody knows what it is doing
>internally. The chip makers just provide vague guidelines for
>"optimization" but they don't provide any concrete information about
>what the chip is doing internally (like they did in the early Pentium
>days when Abrash wrote his 2nd book) because they don't want the
>other chip makers to know their trade secrets.

It's more a case that accurate information is extremely difficult to
provide.

The Pentium was relatively simple. It had dual pipelines, but it
issued in order and the pipelines were asymmetric - only a particular
subset of ALU instructions could use the B pipeline. So it was
possible to fairly accurately predict performance. That is no longer
true.

Performance of modern OoO chips is highly context dependent: it
matters not only what instructions are used and what dependencies
exist among them, but also what chip resources (ALU, FPU, rename
registers, etc.) are instantaneously (cycle by cycle) available for
simultaneous advancement of independent instructions. The possible
combinations are too numerous to document - or to understand if they
ever were documented - so the manufacturers simply don't try.


>> More interestingly, (ab)using the return mechanism for threading can
>> create possibilities for additional optimization. In cases where a
>> thread subroutine calls library or system functions, in some cases it
>> may be possible to have a called function return directly to the next
>> thread subroutine rather than to the one that called it.
>
>That is true --- you can just JMP into the function and let it do the RTS
>that starts the next function in the threaded sequence. This is only going
>to work though, if that library or system function doesn't have any
>parameters, otherwise you are going to need a wrapper.

The problem is not parameters, but return values. You can do this so
long as return values are passed back via "out" parameters: e.g,
instead of

<return type> function( arg1, ..., argN )

you have

void function( arg1, ..., argN, &retval )

[Back in the day (~1960's) this style of programming used to be called
"pure procedural", but that term has long since fallen out of use.]

George

George Neuner
Aug 27, 2012, 4:42:27 PM
On Mon, 27 Aug 2012 01:49:08 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Aug 26, 12:12 am, Hugh Aguilar
><hughaguila...@nospicedham.yahoo.com> wrote:
>
>> I'll give some more thought to your scheme of using SP as the instruction pointer.
>
>I gave some more thought to it and realized that it is a very bad idea
>--- every time than an interrupt occurs your code will be corrupted,

An interrupt should NEVER corrupt the application stack.

It's better in general if interrupts use a separate stack, but
interrupts are no different from any other context switch: excepting
the need for the interrupt to deliberately communicate with the
program, return from an interrupt should leave the program's execution
environment unchanged.

Where communication from the interrupt to the program is necessary, it
always should be accomplished through a deliberate mechanism that
can't inadvertently affect program execution.

Otherwise you have a recipe for disaster no matter what you are doing.

George

Hugh Aguilar
Aug 28, 2012, 12:55:19 AM
On Aug 27, 1:47 pm, George Neuner <gneun...@nospicedham.comcast.net>
wrote:
> On Mon, 27 Aug 2012 01:49:08 -0700 (PDT), Hugh Aguilar
>
> <hughaguila...@nospicedham.yahoo.com> wrote:
> >On Aug 26, 12:12 am, Hugh Aguilar
> ><hughaguila...@nospicedham.yahoo.com> wrote:
>
> >> I'll give some more thought to your scheme of using SP as the instruction pointer.
>
> >I gave some more thought to it and realized that it is a very bad idea
> >--- every time than an interrupt occurs your code will be corrupted,
>
> An interrupt should NEVER corrupt the application stack.

An interrupt doesn't corrupt the application stack. This is because
all data at lower addresses than the stack pointer is garbage and it
is okay for the interrupt to overwrite it, which it does (for most
processors, when an interrupt occurs the return address and processor
flag register are pushed onto the stack).

The problem with your system however, is that the stack pointer is
pointing at the middle of the threaded code when the interrupt occurs
and everything at lower addresses is valid threaded code --- if that
gets overwritten it corrupts the code.

> It's better in general if interrupts use a separate stack...

Not just "better in general" but imperative. Is there any processor in
the world in which interrupts have their own stack which is distinct
from the return stack that the applications are using? That would be
the only processor in which your threading system could be used.

Christopher Head
Aug 28, 2012, 3:51:30 AM
On Mon, 27 Aug 2012 21:55:19 -0700 (PDT)
Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote:

> Not just "better in general" but imperative. Is there any processor in
> the world in which interrupts have their own stack which is distinct
> from the return stack that the applications are using? That would be
> the only processor in which your threading system could be used.

Sure, there are plenty. x86 in protected mode, such as used for
virtually every modern OS: when an interrupt arrives, the CPU loads the
stack pointer register with a special value called "ESP0" from the
task-state segment (TSS) before pushing anything. This is pretty essential for
security since otherwise a second application thread running on a
different CPU could overwrite the interrupt-return frame on the stack,
since the application's stack would be accessible. ARM Cortex M3 I
think does the same thing when full privilege control is set up.

Chris

Hugh Aguilar
Aug 28, 2012, 4:30:53 AM
On Aug 28, 1:02 am, Christopher Head <ch...@nospicedham.is.invalid>
wrote:
> On Mon, 27 Aug 2012 21:55:19 -0700 (PDT)
>
I expected that the ARM might do that, but didn't know the x86 did.

It is ironic that this is available on the big processors, but
threaded code is mostly only useful on small micro-controllers that
have very limited memory (the MSP430). Some of those ARM boards have
pretty limited memory though, so threaded code might be useful yet.

It could be interesting to implement George's wild scheme on the x86
(I've already thought of a name for it: AbForth), just for the fun of
implementing something so weird. Saving memory is still important even
given a gigantic memory, because doing so improves cache performance
--- so it might come out quite efficient.

Does the stack have its own cache distinct from the data cache? If so,
that would help a lot. All of the threaded code would be in the stack
segment, all of the primitives' machine-language code would be in the
code segment, and all of the data as well as the Forth return stack
and parameter stack would be in the data segment. If they each have
their own cache, then it could be quite fast.

I would be interested in doing a 64-bit implementation, as that alone
will put it ahead of the several 32-bit x86 Forth systems already
available. Maybe even on Menuet!

Hugh Aguilar
Sep 4, 2012, 8:24:46 PM
On Aug 26, 12:12 am, Hugh Aguilar
<hughaguila...@nospicedham.yahoo.com> wrote:
> I'm assuming here that it is worthwhile on the modern x86 to use the ESP as the data-stack pointer. This may not be true though. I could use EBP as the data-stack pointer similar to what SwiftForth does. The code is a lot more bloated that way because of all the ADD and SUB #4 to EBP. It is possible that these ADD and SUB instructions get parallelized though, so they aren't taking up extra clock cycles. The code is more bloated though, so it is not going to fit in the code cache as well. I think that my system is faster than SwiftForth's but I can't be sure without implementing it and testing.

I'm seriously considering rewriting my Forth to use George's threading
scheme rather than subroutine threading. I've got two questions that I
hope the CLAX experts can answer (the last time that I considered
myself anything close to an expert on the x86 was in the days of the
Pentium when it had its U and V pipe and the programmer could pretty
much predict what was going on).

1.) Is it more efficient to use the ESP as the parameter stack
pointer, which allows PUSH and POP to be used, or is it more efficient
to use another register and do the ADD and SUB of #4 manually? Does it
make any difference? Isn't it true that PUSH and POP get "decomposed"
into smaller instructions roughly comparable to what you get if you do
this manually using a different register? I mean, that there is a low-
level RISC-like machine-code that the CISC-like x86 code gets compiled
into on the fly, so there is no advantage to using the CISC
instructions that do several operations as compared to the RISC
instructions that do one simple operation each.

2.) My subroutine-threading Forth will inline code and will then patch
the code. This involves poking (now there's an archaic term!) in the
literal values for instructions that have a # addressing-mode operand,
and also the offset values for instructions that use a displacement
addressing mode. Is this dangerous? I am worried that different
assemblers generate different code given the same source-code. This
could be a problem if I switch assemblers for some reason in the
future --- I would be poking data into the middle of machine-code and
hitting the wrong place --- chaos would result! I don't want to be
tied to a specific assembler. For example, Menuet uses a different
version of NASM than is typically used under Linux (I'm using HLA
right now, and it isn't available for Menuet at all, and neither does
it support 64-bit x86 AFAIK).

Because of my concern for #2 above, I'm leaning toward using George's
scheme and forgetting about subroutine-threading for now. Because of
my concern for #1 above however, I'm concerned that George's scheme
might be a lot slower than subroutine threading. Safety is more
important than speed though, so most likely I will make the switch to
George's scheme. Advice?

Philip Lantz
Sep 5, 2012, 1:28:18 AM
Hugh Aguilar wrote:
> > I'm assuming here that it is worthwhile on the modern x86 to use
> > the ESP as the data-stack pointer. This may not be true though. I
> > could use EBP as the data-stack pointer similar to what SwiftForth
> > does. The code is a lot more bloated that way because of all the ADD
> > and SUB #4 to EBP. It is possible that these ADD and SUB
> > instructions get parallelized though, so they aren't taking up extra
> > clock cycles. The code is more bloated though, so it is not going to
> > fit in the code cache as well. I think that my system is faster than
> > SwiftForth's but I can't be sure without implementing it and
> > testing.
>
> 1.) Is it more efficient to use the ESP as the parameter stack
> pointer, which allows PUSH and POP to be used, or is it more efficient
> to use another register and do the ADD and SUB of #4 manually? Does it
> make any difference? Isn't it true that PUSH and POP get "decomposed"
> into smaller instructions roughly comparable to what you get if you do
> this manually using a different register? I mean, that there is a low-
> level RISC-like machine-code that the CISC-like x86 code gets compiled
> into on the fly, so there is no advantage to using the CISC
> instructions that do several operations as compared to the RISC
> instructions that do one simple operation each.

I think it is more efficient to use ESP. It is true that the processor
decomposes instructions into micro-ops, but it also takes care to
optimize frequently used instructions like PUSH and POP. Current Intel
processors have a "Stack Pointer Tracker" which maintains an offset to
the stack pointer and avoids introducing a micro-op to update ESP after
every PUSH or POP. (See the Intel Optimization Manual, section 2.2.2.5.)
I don't know what other manufacturers have done, but I suspect they do
their best to optimize stack operations as well.


> 2.) My subroutine-threading Forth will inline code and will then patch
> the code. This involves poking (now there's an archaic term!) in the
> literal values for instructions that have a # addressing-mode operand,
> and also the offset values for instructions that use a displacement
> addressing mode. Is this dangerous? I am worried that different
> assemblers generate different code given the same source-code. This
> could be a problem if I switch assemblers for some reason in the
> future --- I would be poking data into the middle of machine-code and
> hitting the wrong place --- chaos would result! I don't want to be
> tied to a specific assembler. For example, Menuet uses a different
> version of NASM than is typically used under Linux (I'm using HLA
> right now, and it isn't available for Menuet at all, and neither does
> it support 64-bit x86 AFAIK).

There are some instructions that have more than one encoding, but
usually one is shorter than the other and the assemblers will choose the
shortest encoding. I'm not aware of any cases where an instruction has
multiple encodings that are the same length but include an offset or
immediate in a different place within the instruction. So unless you run
into a perverse assembler that chooses a longer instruction over a
shorter one, I think you'll be fine.

Christopher Head
Sep 5, 2012, 4:09:06 AM
On Tue, 28 Aug 2012 01:30:53 -0700 (PDT)
Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote:

> Does the stack have its own cache distinct from the data cache? If so,
> that would help a lot. All of the threaded code would be in the stack
> segment, all of the primitives' machine-language code would be in the
> code segment, and all of the data as well as the Forth return stack
> and parameter stack would be in the data segment. If they each have
> their own cache, then it could be quite fast.

On that point, no, not on x86 anyway. L0 cache is split between code
and data (only that division, no further subdivision between stack and
non-stack). Higher level caches tend to be completely unified, even
between code and data.

Chris

George Neuner
Sep 5, 2012, 4:27:50 AM
On Tue, 4 Sep 2012 17:24:46 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>1.) Is it more efficient to use the ESP as the parameter stack
>pointer, which allows PUSH and POP to be used, or is it more efficient
>to use another register and do the ADD and SUB of #4 manually?

Internally on the uOP chips there is no difference. Externally (or on
non-uOP chips), the x86 add/sub code will be slightly larger.

>Isn't it true that PUSH and POP get "decomposed"
>into smaller instructions roughly comparable to what you get if you do
>this manually using a different register? I mean, that there is a low-
>level RISC-like machine-code that the CISC-like x86 code gets compiled
>into on the fly, so there is no advantage to using the CISC
>instructions that do several operations as compared to the RISC
>instructions that do one simple operation each.

Not exactly. The sequence of simpler instructions may take longer
(at least initially) to fetch and decode, and it may put more register
pressure on the code than the equivalent complex instruction.


>2.) My subroutine-threading Forth will inline code and will then patch
>the code. This involves poking (now there's an archaic term!) in the
>literal values for instructions that have a # addressing-mode operand,
>and also the offset values for instructions that use a displacement
>addressing mode. Is this dangerous?

Could be ... I'm not sure I understand what you are trying to do.

You definitely will have a problem unless the assembler is a known
quantity. Every one I've ever used has accepted a slightly different
syntax (even disregarding macros). You'd be better off picking one
and embedding it directly into your compiler (or into the runtime if
you are JIT'ing).


>Because of my concern for #2 above, I'm leaning toward using George's
>scheme and forgetting about subroutine-threading for now. Because of
>my concern for #1 above however, I'm concerned that George's scheme
>might be a lot slower than subroutine threading. Safety is more
>important than speed though, so most likely I will make the switch to
>George's scheme. Advice?

Accurate branch prediction matters more than any other consideration.
Unfortunately, performance of various threading schemes is rather
architecture specific.

Pentium (except P1) and Core chips have a stacked call/return
predictor which accurately predicts call-matched return targets.
"Core" chips also can accurately predict call targets that happen to
be in their code trace caches (keeping the VM itself small will help
there). Assuming your system can tolerate larger thread lists,
subroutine threading will be optimum for these chips.

Stack threading can outperform both direct and subroutine threading on
older in-order models - I've seen it done on i386 and i486 and I
believe it would be the same for P1. The reduction in total
instructions executed more than makes up for mispredicted branches.

I don't know about the newer (Atom, Nano, etc.) in-order x86 chips.

For the OoO models, stack vs direct threading provides generally
comparable performance and the question of which is better overall is
more complicated. The pipelining, BTB hardware and branch prediction
rules have been tweaked with each new microarchitecture. Details of
the implementation become very important. For me, the question really
is more about VM complexity and whether you can make use of an
additional pointer register for something else.

There is a pretty good comparison study of a number of x86 versions
which includes details on their pipelining and branch prediction
characteristics at
www.agner.org/optimize/microarchitecture.pdf


Hope this helps.
George

Robert Redelmeier
Sep 5, 2012, 10:46:14 AM
Christopher Head <ch...@nospicedham.is.invalid> wrote in part:
But beware of treating the same memory as code & data.
There are all sorts of penalties for Self-Modifying Code,
usually worse for writing to code cachelines as data
(roundtrip through DRAM) than executing data cachelines.

-- Robert


Rod Pemberton
Sep 5, 2012, 4:05:19 PM
"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in message
news:ebb0da2f-9c1b-4df6...@wm7g2000pbc.googlegroups.com...
...

> It is ironic that this is available on the big processors, but
> threaded code is mostly only useful on small micro-controllers that
> have very limited memory (the MSP430).

I don't know why you're saying that threaded code isn't that useful. You've
been following the c.l.f. discussions, so you know that I've implemented a
Forth interpreter in C. It's compliant with much of the Hayes core test
suite, except floating point and a few other things. It uses threaded code,
not a switch in C. It's much slower than bigForth and Win32Forth, but it's
substantially faster than the ITC version of Gforth. I've done nothing
special for speed.

> Some of those ARM boards have pretty limited memory though,
> so threaded code might be useful yet.

I was searching for some Brainfuck related tools. I ran across Clifford
Wolf's webpages. He implemented a CPU for BF and one of the better BF
compilers. He also has a small embedded VM for small-memory-footprint
microcontrollers, likely based on BF ideas.
http://www.clifford.at/
http://www.clifford.at/embedvm/


Rod Pemberton



Hugh Aguilar
Sep 5, 2012, 4:12:48 PM
No worries --- the modifications are done at compile-time (for the
Forth compiler), not at run-time (for the code being generated). I
may have been unclear about how my subroutine-threading works, as
George seemed confused too (he mentioned assembler syntax, which is
irrelevant).

When the Forth compiler compiles a reference to a function in a
function that it is generating, it can do one of two things. If the
function is big, then it compiles a CALL to that function. If the
function is small and is primitive, then it copies all of the code of
that function into the code being generated (it "inlines" the
function). Sometimes after doing this, it needs to modify some of the
code that it copied. For example, I only have one function for pushing
a literal onto the parameter stack (the literal $deadbeef). After this
code is copied into the code being generated, I poke the correct value
into the middle of that code overwriting $deadbeef. Similarly, all of
the functions that access the local variables need to get a
displacement byte overwritten so they access the correct local
variable.
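
Sketched in C, with a hypothetical template (0x68 is the x86-32 "push imm32" opcode, so the placeholder sits at offset 1); a real compiler would also need the buffer to be writable and executable:

/* Inline-and-patch sketch: copy the code template, then overwrite the
   $deadbeef placeholder with the real literal.  Template bytes and
   offset are illustrative (x86-32, little-endian). */
#include <stdint.h>
#include <string.h>

static const uint8_t lit_template[] = {
    0x68, 0xEF, 0xBE, 0xAD, 0xDE   /* push 0xdeadbeef */
};
#define LIT_OFFSET 1               /* where the imm32 lives */

static uint8_t *compile_literal(uint8_t *here, uint32_t value)
{
    memcpy(here, lit_template, sizeof lit_template);  /* inline the code */
    memcpy(here + LIT_OFFSET, &value, sizeof value);  /* poke the value */
    return here + sizeof lit_template;                /* the new "here" */
}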

Getting back to threading schemes, here are the possibilities:

SwiftForth:
NEXT is a CALL and RET
ESP is the return-stack and EBP is the parameter stack.

my system:
NEXT is a CALL, a POP into ECX and a JMP through ECX
EBP is the return-stack and ESP is the parameter stack.

UR/Forth:
NEXT is a LODSW and a JMP through EAX
EAX is the instruction pointer, EBP is the return-stack and ESP is the
parameter stack.

George's system:
NEXT is a RET
ESP is the instruction pointer, ESI is the return-stack and EBP is the
parameter stack (these two are pretty much whatever you choose).

In all cases above this is for a primitive word. If the word contains
calls to sub-words, then the return address has to be put on the
return stack rather than held in a register.

Hugh Aguilar
Sep 5, 2012, 4:23:15 PM
On Sep 5, 1:40 am, George Neuner <gneun...@nospicedham.comcast.net>
wrote:
> Accurate branch prediction matters more than any other consideration.
> Unfortunately, performance of various threading schemes is rather
> architecture specific.

I think that branch prediction has no effect at all in any threading
scheme, because branching is done by loading the IP (instruction
pointer) with a new value --- the IP is ESP in your scheme and is ESI
in UR/Forth (BTW: I incorrectly stated that the IP is EAX in UR/Forth
in my previous post).

This failure to take advantage of branch prediction is why all
threading schemes are going to be inferior to subroutine threading on
the newer chips (although your scheme or UR/Forth's scheme may have
been better on the old x86 chips).

> Pentium (except P1) and Core chips have a stacked call/return
> predictor which accurately predicts call-matched return targets.
> "Core" chips also can accurately predict call targets that happen to
> be in their code trace caches (keeping the VM itself small will help
> there).  Assuming your system can tolerate larger thread lists,
> subroutine threading will be optimum for these chips.

Is this going to work in my scheme? I don't leave the return address
on the ESP stack and then do a RET at the end --- instead I POP the
return address into ECX at the beginning of the function and do a JMP
through ECX at the end.

Robert Redelmeier
Sep 6, 2012, 8:52:18 AM
Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote in part:
> Getting back to threading schemes, here are the possibilities:
>
> SwiftForth:
> NEXT is a CALL and RET
> ESP is the return-stack and EBP is the parameter stack.

Good branch predictability, risks collision of ESP & EBP stacks

> my system:
> NEXT is a CALL, a POP into ECX and a JMP through ECX
> EBP is the return-stack and ESP is the parameter stack.
>
> UR/Forth:
> NEXT is a LODSW and a JMP through EAX
> EAX is the instruction pointer, EBP is the return-stack and ESP is the
> parameter stack.
>
> George's system:
> NEXT is a RET
> ESP is the instruction pointer, ESI is the return-stack and EBP is the
> parameter stack (these two are pretty much whatever you choose).

All have poor branch predictability and will suffer from stalls.

> In all cases above this is for a primitive word. If the word
> contains calls to sub-words, then the return address has
> to be put on the return stack rather than held in a register.

Using stack or registers usually makes little difference if
stack is in cache. AFAIK, all modern high-perf x86 chips
have hardware CALL/RET internal stacks to speed-up normal
operations (and slow them horribly when mis-paired)

-- Robert

George Neuner
Sep 6, 2012, 3:08:22 PM
On Wed, 5 Sep 2012 13:23:15 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Sep 5, 1:40 am, George Neuner <gneun...@nospicedham.comcast.net>
>wrote:
>> Accurate branch prediction matters more than any other consideration.
>> Unfortunately, performance of various threading schemes is rather
>> architecture specific.
>
>I think that branch prediction has no effect at all in any threading
>scheme, because branching is done by loading the IP (instruction
>pointer) with a new value --- the IP is ESP in your scheme and is ESI
>in UR/Forth (BTW: I incorrectly stated that the IP is EAX in UR/Forth
>in my previous post).

I know that Anton Ertl did a study of various threading schemes on
various x86 (and other designs) and found that branch prediction was
important enough to warrant effort to identify commonly used
instruction sequences on the compiler side and to combine them where
possible into a single "super" instruction on the runtime side.

I don't have a link to his work handy, but it should be pretty easy to
find. He is a regular in comp.arch if you want to ask him directly.


>This failure to take advantage of branch prediction is why all
>threading schemes are going to be inferior to subroutine threading on
>the newer chips (although your scheme or UR/Forth's scheme may have
>been better on the old x86 chips).

Yes. The primary reason for using other threading techniques is space
(see below) in a multi program VM environment. If you only ever run a
single program, then it doesn't matter much ... you are unlikely ever
to use these monsters with a small memory.

I successfully used stack threading with a multi tasking kernel in a
DSP application (explaining why would take too long 8-). It was much
faster than direct threading and much smaller than subroutine
threading. However, the DSP had an alternate register set for
interrupt handling and was running over SRAM so branch timing mostly
was irrelevant - all locations were equally close.


>> Pentium (except P1) and Core chips have a stacked call/return
>> predictor which accurately predicts call-matched return targets.
>> "Core" chips also can accurately predict call targets that happen to
>> be in their code trace caches (keeping the VM itself small will help
>> there).  Assuming your system can tolerate larger thread lists,
>> subroutine threading will be optimum for these chips.
>
>Is this going to work in my scheme? I don't leave the return address
>on the ESP stack and then do a RET at the end --- instead I POP the
>return address into ECX at the beginning of the function and do a JMP
>through ECX at the end.

No. The call/return predictor uses an internal BTB arranged as a
stack which is pushed/popped automatically by call and return
instructions. When the return instruction pulls an address from ESP,
it is compared to the BTB's top of stack. If it matches, the branch
proceeds without delay.

Different chips have different buffer sizes (aka stack depths), so
nested call chains can overflow the buffer (particularly on the older
chips). The buffer isn't spilled/filled, so if a return address is
"aged" out by subsequent calls, it is lost and then normal dynamic
branch rules apply to the return.
[Unfortunately, the normal rules also are somewhat chip specific.]

However, the size of the call/return predictor buffer is less
important to subroutine threading because most thread routines are
leaf functions that don't make further calls.

But to use this, the thread list must contain explicit call
instructions (as in normal code): e.g.,
        call this
        call that
        call the_next_thing
        :

and you can't manipulate the IP so it isn't possible to embed argument
data into the instruction stream - you have to use something like
Java's constant pool instead.
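
A sketch of the constant-pool alternative in C (names invented): literals live in a side table instead of in the instruction stream, so call-threaded code never has to step an IP past inline data.

/* Constant-pool sketch: a literal-pushing word takes an index into a
   side table rather than reading inline argument data. */
#include <stdint.h>

static int32_t  stack[64];
static int32_t *psp = stack + 64;

static const int32_t pool[] = { 42, 1000, -7 };  /* the constant pool */

static void do_lit(int idx)  /* e.g. generated as: do_lit(0); */
{
    *--psp = pool[idx];      /* push a pool entry; no IP fiddling */
}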

George

George Neuner
Sep 6, 2012, 3:12:28 PM
Yes. It's worth noting, however, that since the Pentium 4, the L0
code cache has contained uOP traces rather than x86 instructions.

>Chris
George

Hugh Aguilar
Sep 6, 2012, 9:05:49 PM
On Sep 6, 12:12 pm, George Neuner <gneun...@nospicedham.comcast.net>
wrote:
> On Wed, 5 Sep 2012 13:23:15 -0700 (PDT), Hugh Aguilar
> <hughaguila...@nospicedham.yahoo.com> wrote:
> >> Pentium (except P1) and Core chips have a stacked call/return
> >> predictor which accurately predicts call-matched return targets.
> >> "Core" chips also can accurately predict call targets that happen to
> >> be in their code trace caches (keeping the VM itself small will help
> >> there).  Assuming your system can tolerate larger thread lists,
> >> subroutine threading will be optimum for these chips.
>
> >Is this going to work in my scheme? I don't leave the return address
> >on the ESP stack and then do a RET at the end --- instead I POP the
> >return address into ECX at the beginning of the function and do a JMP
> >through ECX at the end.
>
> No.  The call/return predictor uses an internal BTB arranged as a
> stack which is pushed/popped automatically by call and return
> instructions.  When the return instruction pulls an address from ESP,
> it is compared to the BTB's top of stack.  If it matches, the branch
> proceeds without delay.

Well, I think that kills my scheme. I expected that the use of PUSH
and POP, rather than a manual stack using some other register, was
worthwhile --- but it doesn't have that much of an effect, and it is
messing up the CALL/RET predictor that you describe above --- so I
will switch to doing what SwiftForth does, which is CALL and RET for
both primitive and secondary functions, the ESP is the return-stack
pointer and the EBP is the parameter stack pointer.

> Different chips have different buffer sizes (aka stack depths), so
> nested call chains can overflow the buffer (particular on the older
> chips).  The buffer isn't spilled/filled, so if a return address is
> "aged" out by subsequent calls, it is lost and then normal dynamic
> branch rules apply to the return.
> [Unfortunately, the normal rules also are somewhat chip specific.]
>
> However, the size of the call/return predictor buffer is less
> important to subroutine threading because most thread routines are
> leaf functions that don't make further calls.

What you call "leaf functions" are normally called "primitives" in
Forth --- functions that don't call any sub-functions but just contain
code.

Is the CALL/RET predictor efficient enough that doing a CALL to a
primitive is just as fast as inlining the primitive? This would save a
lot of memory. Forths that do a lot of inlining (SwiftForth) tend to
produce very bloated code. If I did a CALL to a primitive and it was
as fast as inlining the primitive, the program might actually be not
just as fast but faster --- because the code is less bloated and the
cache is more efficient. That would be ironic, as inlining primitives
was traditionally a big part of why subroutine-threading was more
efficient than direct-threading, but now I would be going back to
generating a sequence of CALL instructions just like the old much-
maligned Forth compilers did. :-)

Hugh Aguilar
Sep 6, 2012, 9:11:53 PM
On Sep 6, 5:56 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
> Hugh Aguilar <hughaguila...@nospicedham.yahoo.com> wrote in part:
>
> > Getting back to threading schemes, here are the possibilities:
>
> > SwiftForth:
> > NEXT is a CALL and RET
> > ESP is the return-stack and EBP is the parameter stack.
>
> Good branch predictability, risks collision of ESP & EBP stacks
>
> > my system:
> > NEXT is a CALL, a POP into ECX and a JMP through ECX
> > EBP is the return-stack and ESP is the parameter stack.
>
> > UR/Forth:
> > NEXT is a LODSW and a JMP through EAX
> > EAX is the instruction pointer, EBP is the return-stack and ESP is the
> > parameter stack.
>
> > George's system:
> > NEXT is a RET
> > ESP is the instruction pointer, ESI is the return-stack and EBP is the
> > parameter stack (these two are pretty much whatever you choose).
>
> All have poor branch predictability and will suffer from stalls.

I agree that UR/Forth and George's scheme will both have poor branch
predictability --- that is the major problem with all threading
schemes that involve a sequence of addresses and a NEXT function that
steps through them.

I don't think that my system will have poor branch predictability
though --- it has branch instructions the same as SwiftForth. Perhaps
you meant that my system will not do the CALL/RET prediction, which is
what George said, and which pretty much kills my system.

Also, all Forth systems risk collision of the return-stack and the
parameter-stack. This is because Forth has two stacks, whereas C has
one stack --- this is intrinsic in Forth and no threading scheme is
going to change this fact.

Terje Mathisen
Sep 7, 2012, 2:21:13 AM
George Neuner wrote:
> I know that Anton Ertl did a study of various threading schemes on
> various x86 (and other designs) and found that branch prediction was
> important enough to warrant effort to identify commonly used
> instruction sequences on the compiler side and to combine them where
> possible into a single "super" instruction on the runtime side.
>
> I don't have a link to his work handy, but it should be pretty easy to
> find. He is a regular in comp.arch if you want to ask him directly.

He's not only a regular: We've been discussing this very issue in
comp.arch for the last few weeks, so hop over. :-)
[snip]
> But to use this, the thread list must contain explicit call
> instructions (as in normal code): e.g.,
> call this
> call that
> call the_next_thing
> :
>
> and you can't manipulate the IP so it isn't possible to embed argument
> data into the instruction stream

This is really crucial: using ESP to point at a bunch of Forth word
pointers and then using RET to chain from one to the next will severely
mess up the return stack predictor.

It seems to me that you have to leave a bit of free space below each
such block if the stack should also be available for local PUSH/POP
operations?

OTOH, it can indeed make for very _small_ code, which is getting
increasingly important again. :-)

On the gripping hand, as soon as we go here, we should also be able to
identify groups of short words (code blocks) and stitch them together,
at which point we're pushing into (non-optimizing) compiler territory.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Robert Redelmeier

unread,
Sep 7, 2012, 2:21:22 PM9/7/12
to
Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote in part:
> On Sep 6, 5:56 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
>> Hugh Aguilar <hughaguila...@nospicedham.yahoo.com> wrote in part:
>> > Getting back to threading schemes, here are the possibilities:
>> > SwiftForth:
>> > NEXT is a CALL and RET
>> > ESP is the return-stack and EBP is the parameter stack.
>>
>> Good branch predictability, risks collision of ESP & EBP stacks
>>
>> > my system:
>> > NEXT is a CALL, a POP into ECX and a JMP through ECX
>> > EBP is the return-stack and ESP is the parameter stack.
>
> I don't think that my system will have poor branch predictability
> though --- it has branch instructions the same as SwiftForth. Perhaps
> you meant that my system will not do the CALL/RET prediction, which is
> what George said, and which pretty much kills my system.

Yes. A paired RET is _much_ faster than JMP [ECX], which will be
totally unpredictable and dog-slow -- the POP ECX will have to retire
before the load, rather than a paired RET lining up [speculative] execution.
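
To spell out why (a sketch, from my reading of your description; the
word name is hypothetical) --- each primitive consumes its own return
address, so the CALL never finds a matching RET:

        ; Compiled code is a row of CALLs:
        ;       call  DUP_code
        ;       call  PLUS_code
        ;       ...

DUP_code:
        pop   ecx               ; ecx = address of the next CALL
        push  dword [esp]       ; body: DUP the top of the ESP param stack
        jmp   ecx               ; "NEXT": an indirect jump; the CALL is
                                ; never paired with a RET, so the hardware
                                ; return-address stack is left skewed

The JMP ECX itself is also dynamically predicted like any other indirect
jump, and its target changes on every use.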

> Also, all Forth systems risk collision of the return-stack
> and the parameter-stack. This is because Forth has two
> stacks, whereas C has one stack --- this is intrinsic in
> Forth and no threading scheme is going to change this fact.

This is probably not as bad as it sounds, unless you go into deep
recursion -- a small 8kB ret stack can be above the larger param
stack. Frankly, I would use ESP for call/ret and EBP for params
(short addr modes) as per original design intent. x86 has some
nice hw mmap features you could use to put a guard page between
the stacks.
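
For instance (a sketch for 32-bit Linux, NASM-ish syntax; guard_page is
a hypothetical page-aligned address chosen between the two stacks), one
mprotect call makes a collision fault immediately instead of silently
corrupting the other stack:

        mov   eax, 125          ; __NR_mprotect on i386
        mov   ebx, guard_page   ; page-aligned address between the stacks
        mov   ecx, 4096         ; length: one page
        xor   edx, edx          ; prot = PROT_NONE: no read/write/execute
        int   0x80              ; any later access to that page = SIGSEGV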

-- Robert

George Neuner

unread,
Sep 7, 2012, 4:25:26 PM9/7/12
to
On Thu, 6 Sep 2012 18:05:49 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>Is the CALL/RET predictor efficient enough that doing a CALL to a
>primitive is just as fast as inlining the primitive?

No.

The predictor gives a seamless *return* - at least until/unless the
return address ages from the BTB or the code ages from the trace
cache. The reason is that traces are already decoded and can be
injected directly into the execution stage(s).

There will be a small (few cycles) hiccup if the BTB can't find the
return address, but the code still is in trace cache. In the absence
of a positive result from the call/return predictor, return branches
are *not* further dynamically predicted - so they suffer only from
discontiguous fetch and code caching. This is part of the reason
stack threading still can be faster than direct threading.

Direct (inline address) calls also can be seamless if the target is in
the trace cache. Otherwise they also suffer from discontiguous
fetch, caching, and possibly the need to decode the instructions.

However, indirect jumps (and indirect calls) are dynamically predicted
and basically it is assumed that the last target will be wanted again.
Some chips can correctly handle predictable rotation among a few
addresses, but indirect jump used for arbitrary threading is
guaranteed to be mispredicted most of the time.
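
The classic direct-threaded NEXT is exactly such a jump (a sketch,
assuming ESI holds the Forth IP as the LODS implies):

NEXT:
        lodsd                   ; eax = next word's code address; esi += 4
        jmp   eax               ; indirect jump whose target differs almost
                                ; every time --- the "same as last time"
                                ; prediction misses on most executions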

[The other part of the reason is that stack threading can result in
fewer instructions executed overall - although the uOP designs can
largely negate this advantage if the VM largely (or completely) fits
into the code cache.]

George

George Neuner

unread,
Sep 7, 2012, 4:32:21 PM9/7/12
to
On Fri, 07 Sep 2012 08:21:13 +0200, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote:

>This is really crucial, using ESP to point at a bunch of Forth word
>pointers before using RET to chain from one to the next, will severely
>mess up the return stack predictor.

Yes it will. But the return stack predictor can also be confused by
deep nested call chains where return addresses are "aged" out of the
BTB. So in the absence of a positive predictor result, return
branches are simply taken as directed. Even in the best case - target
in trace cache - this produces a small pipeline bubble ... but you
don't suffer like with a misprediction.

George

Hugh Aguilar

unread,
Sep 12, 2012, 11:35:55 PM9/12/12
to
On Sep 6, 11:27 pm, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> It seems to me that you have to leave a bit of free space below each
> such block if the stack should also be available for local PUSH/POP
> operations?

That isn't possible. The ESP is right in the middle of the threaded
sequence. Forth does have >R and R> for temporary storage --- in this
case, that would be to the EBP stack which is where return addresses
are being held.

> On the gripping hand, as soon as we go here, we should also be able to
> identify groups of short words (code blocks) and stitch them together,
> at which point we're pushing into (non-optimizing) compiler territory.

My compiler will primarily look for combinations of producers and
consumers. For example, LIT and OVER are producers because they push a
new value onto the parameter stack. + and ! are consumers because they
remove a value or two from the parameter stack. In many cases, a
producer is immediately followed by a consumer. Peephole optimization
can combine them together --- the value doesn't get pushed onto the
stack and then popped off again into a register, but it is just loaded
directly into the register.
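
For example (a sketch with hypothetical primitives; EBP as the parameter
stack, and "jmp NEXT" standing for whatever dispatch the threading scheme
uses), fusing the producer LIT with the consumer + means the literal
never makes the round trip through the stack:

LIT_1234:
        sub   ebp, 4
        mov   dword [ebp], 1234 ; producer: push the literal
        jmp   NEXT

PLUS:
        mov   eax, [ebp]        ; consumer: pop the value right back off
        add   ebp, 4
        add   [ebp], eax        ; add into the new top of stack
        jmp   NEXT

; Fused primitive the peephole pass would emit instead:
LIT_PLUS_1234:
        add   dword [ebp], 1234 ; the literal goes straight into the addend
        jmp   NEXT              ; one dispatch instead of two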

Hugh Aguilar

unread,
Sep 12, 2012, 11:54:03 PM9/12/12
to
On Sep 7, 1:29 pm, George Neuner <gneun...@nospicedham.comcast.net>
wrote:
> [The other part of the reason is that stack threading can result in
> fewer instructions executed overall - although the uOP designs can
> largely negate this advantage if the VM largely (or completely) fits
> into the code cache.]

How big is the code cache? How small do I need to make the collection
of primitives so that they will always be fast? Is it true that there
are two levels of cache? If primitives are often used in conjunction,
does it help to locate them next to each other?

I have really decided to switch over to your stack threading. It will
be much simpler to implement than subroutine threading because I don't
have to do all that pasting of code snippets into the generated code
and poking of values into operands --- I just compile pointers to
functions (the cfa in Forth parlance) --- that is really simple.

Stack-threading will be a pretty fun experiment. :-) It might turn out
to be quite fast. If it turns out to be slow however, I don't really
care right now anyway. The purpose of this whole exercise is to write a
cross-compiler for micro-controllers (the PIC24 first, then the ARM
later on). The x86 Forth is the host; if it is slow, then the only
effect will be slower compile times for the micro-controller code ---
but that doesn't matter so long as I'm compiling programs in a matter
of seconds, which I will be with any halfway efficient system.

Maybe someday in the future I will think of an application for the x86
that requires speed, in which case I will start worrying about
subroutine-threading again. It is important to remember that desktop
computer software is given away for free, but that micro-controller
code is sold for money --- so I don't want to get bogged down messing
around with the x86 --- I want to start making some progress on the
PIC24 cross-compiler.

Rod Pemberton

unread,
Sep 13, 2012, 3:47:06 AM9/13/12
to
"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in message
news:d3dfa850-dcc9-49f4...@sd5g2000pbc.googlegroups.com...
> On Sep 6, 11:27 pm, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
> > It seems to me that you have to leave a bit of free space below each
> > such block if the stack should also be available for local PUSH/POP
> > operations?
>
> That isn't possible. The ESP is right in the middle of the threaded
> sequence. Forth does have >R and R> for temporary storage --- in this
> case, that would be to the EBP stack which is where return addresses
> are being held.
>

From what I've seen implementing my Forth interpreter so far, most (not all)
usage of >R together with R> for the system words involves moving and
restoring just a single item. For that situation, the return stack is
acting like a single register. It's not being used as a stack since only
the top of the return stack is being used. So, similar to TOS for the
parameter stack, you could implement a register and rewrite single use >R
and R> code for the register. You'll have to make sure there are no
intermediate words which cause nesting or intermediate >R and R>; those
would corrupt the register. The advantage should be that sequences for
saving and restoring to a register are shorter than those for saving and
restoring values to the return stack, i.e., faster. The 2x words (esp. 2OVER, 2ROT, 2>R,
2R>, 2R@ etc) and LOOP related words seem to be where saving and
restoring more than two data items to the return stack occur.


Rod Pemberton



George Neuner

unread,
Sep 13, 2012, 9:04:36 PM9/13/12
to
On Wed, 12 Sep 2012 20:54:03 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>On Sep 7, 1:29 pm, George Neuner <gneun...@nospicedham.comcast.net>
>wrote:
>> [The other part of the reason is that stack threading can result in
>> fewer instructions executed overall - although the uOP designs can
>> largely negate this advantage if the VM largely (or completely) fits
>> into the code cache.]
>
>How big is the code cache? How small do I need to make the
>collection of primitives so that they will always be fast?

Intel has never officially disclosed the size of the Execution Trace
Cache - there have only ever been qualitative comparisons: e.g., "chip
Y's [ETC] is larger than chip X's".

AFAIK, the organization of the cache has not been disclosed either,
although several researchers have tried to investigate it.

However the Pentium 4 trace cache was said to be 12KuOPs, which on
average equates to roughly 2K x86 instructions. More recent chips
will only be larger.

http://www.eecg.toronto.edu/~moshovos/ACA05/read/Pentium4ArchITJ.pdf
http://www.ptlsim.org/papers/Research/trace_filtering.pdf


>Is it true that there are two levels of cache?

Yes and no. The (L1 code) Execution Trace Cache contains sequences
of *already decoded* uOP instructions. It is separate from the L1
data cache.

L2 is a combined code and data cache. Any code bearing lines in L2
will contain x86 instructions.

So "yes" there are 2 levels, "no" they don't work the same.


>If primitives are often used in conjunction,
>does it help to locate them next to each other?

It helps most to create a separate combination primitive and use that
instead of the sequence.

WRT locality in the ETC, I don't know the answer to that. From what I
have read, it seems that the cache is associative and traces are
dynamically assembled according to execution order, so the same
instruction might be part of multiple traces if it can be reached by
multiple routes.

There doesn't seem to be a whole lot of information available.

George

Hugh Aguilar

unread,
Sep 13, 2012, 11:09:24 PM9/13/12
to
On Sep 13, 12:47 am, "Rod Pemberton"
<do_not_h...@nospicedham.notemailnotz.cmm> wrote:
> From what I've seen implementing my Forth interpreter so far, most (not all)
> usage of >R together with R> for the system words involves moving and
> restoring just a single item.  For that situation, the return stack is
> acting like a single register.  It's not being used as a stack since only
> the top of the return stack is being used.  So, similar to TOS for the
> parameter stack, you could implement a register and rewrite single use >R
> and R> code for the register.  You'll have to make sure there is no
> intermediate words which cause nesting or intermediate >R and R>.  That'd
> corrupt the register.  The advantage should be that sequences for saving and
> restoring to a register are shorter than for the saving and restoring values
> to the return stack, i.e., faster.  The 2x words (esp. 2OVER, 2ROT, 2>R,
> 2R>, 2R@ etc) and LOOP related words seem to be where saving and
> restoring more than two data items to the return stack occur.

Rod! It's a stack! It may be true that the top-value of the return
stack is held in a register (to make R@ faster), but this doesn't mean
that the whole thing isn't behaving like a stack. When you do >R there
is first a push of the top-value from the register into memory, and
then that register is loaded with the new datum. I don't recommend
this because it wastes a register and provides little or no
improvement in speed, but it is pretty common (it does boost speed
somewhat when R@ and I are synonyms, which is likely why it was done).
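
For the record, a sketch of that spill (hypothetical register choice:
EDX caches the top of the return stack, EBP holds the rest of it in
memory, ESP is the parameter stack, and the primitive is entered by a
jump-style dispatch so the top of ESP is Forth data):

TO_R:                           ; Forth >R
        sub   ebp, 4
        mov   [ebp], edx        ; spill the old cached top to memory
        pop   edx               ; the new datum becomes the cached R@
        jmp   NEXT

R_FETCH:                        ; Forth R@ (and I) --- now a single push
        push  edx
        jmp   NEXT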

Anyway, none of this is of concern to me because my Forth won't
support >R R> R@ and all of that anyway. That makes for some pretty
ugly code. I will only support local variables.

One of the weirdest things about ANS-Forth is that it supports both >R
etc. and local variables, and it says that locals "may" be on the
return stack. This isn't possible. You can't have both.

My Forth is going to be a lot simpler than ANS-Forth, and it is going
to make sense (the two goals are related). The only significant new
feature will be closures --- mostly I will be getting rid of 1970s
vintage cruft (for example, I will get rid of DO loops and only
support BEGIN loops).

Hugh Aguilar

unread,
Sep 13, 2012, 11:37:21 PM9/13/12
to
On Sep 13, 6:07 pm, George Neuner <gneun...@nospicedham.comcast.net>
wrote:
> On Wed, 12 Sep 2012 20:54:03 -0700 (PDT), Hugh Aguilar
> <hughaguila...@nospicedham.yahoo.com> wrote:
> >How big is the code cache?  How small do I need to make the
> >collection of primitives so that they will always be fast?
>
> Intel has never officially disclosed the size of the Execution Trace
> Cache - there have only ever been qualitative comparisons: e.g., "chip
> Y's [ETC] is larger than chip X's".
>
> AFAIK, the organization of the cache has not been disclosed either,
> although several researchers have tried to investigate it.
>
> However the Pentium 4 trace cache was said to be 12KuOPs, which on
> average equates to roughly 2K x86 instructions.  More recent chips
> will only be larger.

That is pretty big. If I get most of my primitives in cache, the whole
thing should be quite fast. As the x86 gets more complicated, the code
gets simpler --- I'm back to a fairly simple threaded system with only
peephole optimization, such as was done 20 years ago. That is fine
with me --- makes for less work on my part.

> >If primitives are often used in conjunction,
> >does it help to locate them next to each other?
>
> It helps most to create a separate combination primitive and use that
> instead of the sequence.

I can do peephole optimization so that common producer/consumer combos
(such as OVER + or LIT !) will get compiled as a single primitive. I
won't get over enthusiastic about this because I don't want to have so
many primitives that they don't fit in the cache.

> WRT locality in the ETC, I don't know the answer to that.  From what I
> have read, it seems that the cache is associative and traces are
> dynamically assembled according to execution order, so the same
> instruction might be part of multiple traces if it can be reached by
> multiple routes.

Does it help to align the primitive functions on paragraph boundaries?
If so, how big is a paragraph? From what you have described, it seems
like this doesn't help (and it wastes a lot of memory), so it
shouldn't be done.

Does the entire collection of primitives have to be aligned on some
super-paragraph boundary? I mean, does the cache get loaded from
absolutely delineated blocks of memory?

> There doesn't seem to be a whole lot of information available.

It is very difficult for me to get interested in x86 assembly language
because of this lack of information. As I said before, it is like
voodoo --- you don't know if your techniques are having any effect
without testing, and it is easy to get fooled by anecdotal evidence
into believing that techniques are having an effect when they don't
have any effect whatsoever (a pretty common error among voodoo
priests).

The ARM is better documented, isn't it? The PIC24 certainly is ---
there are no mysteries about what is going on under the hood there.
Micro-controllers pretty much have to be simple like this, because
consistency is *very* important. It is better to have a program that
always runs fast enough, than a program that runs too slow 1 out of
1000 times and faster than necessary the rest of the time.

Rod Pemberton

unread,
Sep 14, 2012, 6:20:27 AM9/14/12
to
"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in message
news:7cf3915c-e3e3-477d...@pz10g2000pbb.googlegroups.com...
...

> Rod! Its a stack! It may be true that the top-value of the return
> stack is held in a register (to make R@ faster), but this doesn't mean
> that the whole thing isn't behaving like a stack.

What I was saying is that much of the return-stack use I'm seeing isn't
as a stack, but as a register -- maybe 50% to 60% of it. So, why use a
stack for those situations?

Many of the base system words use a >R ... R> sequence only once, and
non-nested. That excludes the 2x words and looping related words.
The 2x and looping words use >R >R ... R> R> or more moves.

E.g., typical, single use of >R ... R> sequence:

: BLAH ( no use of R> or >R ) ;
: DOBLAH ( no use of R> or >R ) ;
: SYSWORD DIT DASH DOT >R BLAH DOBLAH R> ;

In a situation like that, the >R ... R> sequence isn't really using the return
stack as a stack. It's being used as a temporary register: one save, one
restore, no intermediate nested saves and restores, i.e., no >R ... R>
sequences in BLAH or DOBLAH. That means that SYSWORD can be
rewritten using a register or variable, say TMP:

: TMP! TMP ! ;
: TMP@ TMP @ ;
: SYSWORD DIT DASH DOT TMP! BLAH DOBLAH TMP@ ;

Of course, TMP! and TMP@ can be optimized primitives. They don't have to be
coded in Forth. If BLAH or DOBLAH did use a >R ... R> sequence, then you
can't do that. You'd overwrite TMP register.

For my Forth, currently (giving away my secrets...) I've got three temporary
registers. You can't access them directly. I've got @ and ! primitives for
each register, as I suggested to you. Two of them are for the
data/parameter stack, and one is for the control-flow/return stack.

One of the convenient things is that a store to one of them is a DROP.

My interpreter, like classic Forth interpreters, saves the return IP on the
return stack. The register for the return stack allows me to pop that IP
and push it later, to get to lower data items without moving the IP to the
data stack, i.e., reduces stack "juggling". That's another convenient
thing.

E.g., I've had SWAP as both coded in Forth and as a primitive. Currently, I
have it in high-level Forth using 'Ar' and 'Br' registers:

: SWAP Ar! Br! Ar@ Br@ ;

Of course, implementing SWAP as a primitive is the optimal solution.

Or, an example of using my 'Xr' register for the return stack, in this case,
for implementing Forth's I word (gets 2nd of 3 elements on return stack):

: I Xr! R> DUP >R Xr@ ;

(FYI, it has a saved IP prior to the two stack items for I because of the
interpreter.)

If you were to do that without 'Xr', it'd be some other sequence, longer,
slower, depending on which stack primitives are the fastest for your Forth:

: I R> R> TUCK >R >R ;
: I R> R> DUP >R SWAP >R ;
: I R> R> DUP -ROT >R >R ;
: I R> R> SWAP OVER >R >R ;

The Xr version is one operation less than the DUP SWAP version. Xr! and Xr@
are faster than R> and >R . The TUCK version has the same number of
operations, but is slower because TUCK is more complicated and because Xr!
and Xr@ are faster than R> and >R .

Forth's I is probably a primitive on many Forths too. But, I'm
demonstrating how extra registers can reduce the operations. It's less
likely the 2x words are primitives. They'll be faster and shorter with Xr!
and Xr@ .

> Anyway, none of this is of concern to me because my Forth won't
> support >R R> R@ and all of that anyway.

Ok. Then, don't you need something to replace the usage of >R R> R@ ? Is
that the ANS locals?

So, two stacks or three? Etc.
a) two: data/parameter and control-flow/return
b) three: data/parameter, return, and control-flow


Rod Pemberton




Hugh Aguilar

unread,
Sep 14, 2012, 4:56:19 PM9/14/12
to
On Sep 14, 3:25 am, "Rod Pemberton"
<do_not_h...@nospicedham.notemailnotz.cmm> wrote:
It is possible to have globally defined "register variables" such as
your TMP. I don't think it is a good idea. What if later on you
rewrite your BLAH or DOBLAH to use TMP? You would have to search through
your source-code for words like SYSWORD that call them and rewrite
those words so that they don't use TMP. If you miss one, chaos will
result!

You are better off to have local variables rather than your three
globally defined register variables. It is possible to implement
locals such that some or all of them are held in registers during the
execution of the function. That is not realistic on the x86 because of
the shortage of registers, but it should work well on a modern
processor such as the PIC24 or ARM that provides plenty of registers.

> One of the convenient things is that a store to one of them is a DROP.
>
> My interpreter, like classic Forth interpreters, saves the return IP on the
> return stack.  The register for the return stack allows me to pop that IP
> and push it later, to get to lower data items without moving the IP to the
> data stack, i.e., reduces stack "juggling".  That's another convenient
> thing.
>
> E.g., I've had SWAP as both coded in Forth and as a primitive.  Currently, I
> have it in high-level Forth using 'Ar' and 'Br' registers:
>
> : SWAP Ar! Br! Ar@ Br@ ;
>
> Of course, implementing SWAP as a primitive is the optimal solution.

This isn't a bad technique. You can write pseudo-primitives. They are
primitive in the sense that they don't call any sub-functions, but
they are written in Forth rather than assembly-language --- yet they
are fast because they use registers rather than the stack.

I wouldn't expose this stuff to the user though.

George Neuner

unread,
Sep 15, 2012, 10:41:11 PM9/15/12
to
On Thu, 13 Sep 2012 20:37:21 -0700 (PDT), Hugh Aguilar
<hughag...@nospicedham.yahoo.com> wrote:

>It is very difficult for me to get interested in x86 assembly language
>because of this lack of information. As I said before, it is like
>voodoo --- you don't know if your techniques are having any effect
>without testing, and it is easy to get fooled by anecdotal evidence
>into believing that techniques are having an effect when they don't
>have any effect whatsoever (a pretty common error among voodoo
>priests).
>
>The ARM is better documented, isn't it?

That depends on your perspective.

x86 is documented about as well as it can be. ARM (at least all the
32-bit versions) is in-order; x86 has been out-of-order since the
Pentium Pro. Execution time and memory consistency in OoO chips is
context sensitive and there are too many possibilities to document in
detail. Instead general rules and software patterns are documented
and it's up to programmers to recognize an applicable pattern in their
own software.

George

Thomas Womack

unread,
Sep 16, 2012, 6:02:40 PM9/16/12
to
In article <udda5852jcud59uim...@4ax.com>,
George Neuner <gneu...@nospicedham.comcast.net> wrote:
>On Thu, 13 Sep 2012 20:37:21 -0700 (PDT), Hugh Aguilar
><hughag...@nospicedham.yahoo.com> wrote:
>
>>It is very difficult for me to get interested in x86 assembly language
>>because of this lack of information. As I said before, it is like
>>voodoo --- you don't know if your techniques are having any effect
>>without testing, and it is easy to get fooled by anecdotal evidence
>>into believing that techniques are having an effect when they don't
>>have any effect whatsoever (a pretty common error among voodoo
>>priests).
>>
>>The ARM is better documented, isn't it?
>
>That depends on your perspective.
>
>x86 is documented about as well as it can be. ARM (at least all the
>32-bit versions) are in-order

That's just not true; the Cortex-A9 (announced 2007 and found in such
esoteric devices as the iPhone 4) is out-of-order, and Cortex-A15
(announced 2010 and to ship in a number of Samsung phones and tablets
by Christmas 2012) is wider out-of-order.

Tom

Hugh Aguilar

unread,
Sep 16, 2012, 7:37:42 PM9/16/12
to
On Sep 13, 8:49 pm, Hugh Aguilar <hughaguila...@nospicedham.yahoo.com>
wrote:
> Does the entire collection of primitives have to be aligned on some
> super-paragraph boundary? I mean, does the cache get loaded from
> absolutely delineated blocks of memory?

This question goes unanswered! :-(

It seems that getting the entire collection of primitives in cache is
the #1 trick for making it run fast. I don't want a situation where my
VM is small enough to fit in cache, but it is straddling a cache
boundary --- that would defeat the whole purpose.

Is it also true, as I said, that the functions don't have to be
aligned on any paragraph boundaries, because as long as they are in the
cache they are going to be fast? I don't want to waste memory
unnecessarily because I want to make the VM as large as possible while
still fitting in the cache.

Rod Pemberton

unread,
Sep 17, 2012, 4:21:26 AM9/17/12
to
"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in message
news:3f60addb-ef5c-4aa5...@j2g2000pbg.googlegroups.com...
> On Sep 13, 8:49 pm, Hugh Aguilar <hughaguila...@nospicedham.yahoo.com>
> wrote:
> > Does the entire collection of primitives have to be aligned on some
> > super-paragraph boundary? I mean, does the cache get loaded from
> > absolutely delineated blocks of memory?
>
> This question goes unanswered! :-(

Sorry Hugh, I was only casually reading the thread and replying to the Forth
stuff, not assembly or x86 microprocessor stuff per c.l.a.x. guidelines.

Who would you prefer an answer from?

> It seems that getting the entire collection of primitives
> in cache is the #1 trick for making it run fast.

I haven't been following the thread.

However, from my out-of-date understanding of microprocessor caches, I don't
know how you're going to make that happen, especially if the cache size of
your microprocessor is small.

The cache is a fixed size. AFAIK, you can't lock code of your choice in the
cache, like your set of primitives (code routines). So, as soon as the VM
or the threaded code or some other code is needed by the microprocessor, the
cache is (re)loaded with that code or data. So, your all primitives get
dumped, or a portion of them do. I don't know if modern caches dump in
their entirety or in small pieces.

In other words, your primitives, threaded code, interpreters or VM, and
other low-level Forth words (functions), like DOCOL, DOVAR, etc, must *all*
be in cache. I.e., AIUI, if your microprocessor's cache is 64KB and you
don't want the cache to reload, 64KB is all you get for your entire Forth
system. Otherwise, the cache will be reloaded with other data as needed.

Of course, modern x86 microprocessors have some very large cache sizes.
So, on many modern x86 microprocessors, this issue might not matter at all.
Have you checked the cache sizes by microprocessor generation for x86
IA-32/x86-64? I think this 5-year-old micro has something like 1MB cache,
but don't quote me. I could be way off on that.

However, it's been two decades since I studied anything related to a
microprocessor cache. So, maybe someone here will fill in correct and
updated information.


Rod Pemberton




wolfgang kern

unread,
Sep 17, 2012, 2:30:09 PM9/17/12
to

Hugh Aguilar wrote:

> Does the entire collection of primitives have to be aligned on some
> super-paragraph boundary? I mean, does the cache get loaded from
> absolutely delineated blocks of memory?

|This question goes unanswered! :-(

I don't know exact details on Intel chips, but AMD and Intel both
use something called code-prefetch size apart from L1-data and L1-
code cache-lines. This instruction prefetch length (in bytes) is
exactly the main criterion for optimal timing, and I remember some
variants reported, varying with CPU family and version, of 15, 16, 32, and 64 bytes.

|It seems that getting the entire collection of primitives in cache is
|the #1 trick for making it run fast. I don't want a situation where my
|VM is small enough to fit in cache, but it is staddling a cache
|boundary --- that would defeat the whole purpose.

|Is it also true what I said, that the functions don't have to be
|aligned on any paragraph boundaries because so long as they are in the
|cache they are going to be fast. I don't want to waste memory
|unnecessarily because I want to make the VM as large as possible while
|still fitting in the cache.

But as long as you use HLL-abstraction layers to fit the hardware's needs,
you don't need to care about lowest-level optimisation anyway.

__
wolfgang


Terje Mathisen

unread,
Sep 17, 2012, 4:09:38 PM9/17/12
to
wolfgang kern wrote:
> Hugh Aguilar wrote:
> |Is it also true what I said, that the functions don't have to be
> |aligned on any paragraph boundaries because so long as they are in the
> |cache they are going to be fast. I don't want to waste memory
> |unnecessarily because I want to make the VM as large as possible while
> |still fitting in the cache.
>
> But as long you use HLL-abstraction layers to fit the hardware needs,
> you dont need to care about lowest-level optimisation anyway.

For an interpreted mini-language (like Forth) it seems pretty obvious
that you want all the core words/functions to fit in L1 code cache.

The actual Forth program is really data, not code.

Hugh Aguilar

unread,
Sep 17, 2012, 7:05:29 PM9/17/12
to
On Sep 17, 11:33 am, "wolfgang kern" <nowh...@never.at> wrote:
> But as long you use HLL-abstraction layers to fit the hardware needs,
> you dont need to care about lowest-level optimisation anyway.

I think what you are saying here is to write the time-critical stuff
in assembly language.

The big problem with threaded code, as pointed out earlier, is that
loops don't do branch prediction. These are functions that reload the
IP (ESP in the case of stack-threading).

I can get around this problem somewhat by having functions that
iterate through a data structure. They would be given an xt of a
function that would be called for each node in the data structure. An
example of this is EACH in my novice package. This works a lot better
when you have closures, which my Forth will --- this is why I'm
providing closures. Branch prediction will work because the entire
loop is a self-contained machine-language routine.
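
Here is a sketch of such an iterator as one self-contained routine
(hypothetical layout: each node's first cell is the link to the next
node; EBX walks the list, EDX holds the toucher's xt, EBP is the
parameter stack):

EACH_loop:
        test  ebx, ebx          ; null link marks the end of the list
        jz    EACH_done
        sub   ebp, 4
        mov   [ebp], ebx        ; pass the current node to the toucher
        call  edx               ; indirect CALL; the toucher's RET pairs it
        mov   ebx, [ebx]        ; follow the link field
        jmp   EACH_loop         ; ordinary backward branch: predicted well
EACH_done:
        ret

Because the toucher's address is the same on every iteration, even the
indirect CALL should predict well.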

Hugh Aguilar

unread,
Sep 17, 2012, 8:00:51 PM9/17/12
to
On Sep 17, 4:18 pm, Hugh Aguilar <hughaguila...@nospicedham.yahoo.com>
wrote:
An advantage of stack-threading is that functions end with RTS (RET on
the x86), so
they can be called with CALL so long as ESP is set up ahead of time to
point to the top of an empty block of memory. The iterator functions
described above can use CALL indirect through a register to call the
function that acts on each node. This should hopefully allow CALL/RET
optimization to work in this case (the usual way of calling Forth
functions in a threaded sequence doesn't allow CALL/RET optimization
to work).

I am going to provide linked lists as a standard data structure and
encourage the use of these (similar to how Factor provides "sequences"
as its standard data structure). If the user uses EACH a lot, rather
than write loops manually with BEGIN WHILE REPEAT, he should get
pretty good speed --- and more readable code than Forth usually admits
to. :-)

In my novice package, I have EACH etc. These allow the parent
function to put data on the parameter stack which is accessible to the
"toucher" function that acts on each node. This is how that function
communicates with its parent. Closures work a lot better, and are a
lot more readable --- that is pretty much the gist of my Forth system.

wolfgang kern

unread,
Sep 18, 2012, 1:46:51 PM9/18/12
to

Hugh Aguilar wrote:
> But as long you use HLL-abstraction layers to fit the hardware needs,
> you dont need to care about lowest-level optimisation anyway.

|I think what you are saying here is to write the time-critical stuff
|in assembly language.

yes.

|The big problem with threaded code, as pointed out earlier, is that
|loops don't do branch prediction. These are functions that reload the
|IP (ESP in the case of stack-threading).

|I can get around this problem somewhat by having functions that
|interate through a data structure. They would be given an xt of a
|function that would be called for each node in the data structure. An
|example of this is EACH in my novice package. This works a lot better
|when you have closures, which my Forth will --- this is why I'm
|providing closures. Branch prediction will work because the entire
|loop is a self-contained machine-language routine.

I'm not sure I fully understood what you want to achieve; if your
Forth should work as a runtime interpreter then it may be similar to
my token-parameter-strings, which are just interpreted by jumps ...

And instead of call/ret pairs I use 'chain-jumping', so all functions
are jumped to by branch-tables and end with a jump to the main entry.

CALL/RET takes more time than two near jumps, and chain-jumping avoids
stack-pollution too. So a thread (string-part or timeslice-multiplexed
instance) only uses one CALL on start and one RET when fully done or
terminated.

And the main branch-code part will always be in the L1 code cache,
because it's more frequently used than anything else, regardless of
which thread or instance requests a function.

__
wolfgang


wolfgang kern

unread,
Sep 18, 2012, 1:21:27 PM9/18/12
to

Terje Mathisen replied:

> wolfgang kern wrote:
>> Hugh Aguilar wrote:
>> |Is it also true what I said, that the functions don't have to be
>> |aligned on any paragraph boundaries because so long as they are in the
>> |cache they are going to be fast. I don't want to waste memory
>> |unnecessarily because I want to make the VM as large as possible while
>> |still fitting in the cache.

>> But as long you use HLL-abstraction layers to fit the hardware needs,
>> you dont need to care about lowest-level optimisation anyway.

> For an interpreted mini-language (like Forth) it seem pretty obvious that
> you want all the core words/functions to fit in L1 code cache.

> The actual Forth program is really data, not code.

thanks, I partly see the problem now. Is this interpreter meant to
act at runtime, or to compile x86 instructions as final binary code?

__
wolfgang





Hugh Aguilar

unread,
Sep 18, 2012, 4:41:50 PM9/18/12
to
On Sep 18, 10:50 am, "wolfgang kern" <nowh...@never.at> wrote:
> Terje Mathisen replied:
> > For an interpreted mini-language (like Forth) it seem pretty obvious that
> > you want all the core words/functions to fit in L1 code cache.
> > The actual Forth program is really data, not code.
>
> thanks, I partly see the problem now. Is this interpreter meant to
> act in runtime or compile x86-instructions as final binary code?

Previously I had been doing subroutine-threading. In this case at
compile-time I generate x86 machine code (containing CALL instructions
to sub-functions). Now I am thinking about switching over to George's
stack-threading in which case at compile-time I generate threaded code
(a sequence of code-field-addresses which are pointers to functions)
and I interpret that threaded code at run-time --- in this case I want
my VM (collection of primitive functions written in machine-code) to
fit in the cache.

Generating threaded code is simpler than generating machine-code, but
likely will result in slower code. If the VM fits in the cache
however, it might be quite fast.

Hugh Aguilar

unread,
Sep 18, 2012, 4:52:20 PM9/18/12
to
That isn't just two near jumps, is it? Your branch table needs to
convert the xt (execution token) into an address to jump to. The xt is
really an index into the branch table, so you have to add it to the
branch-table base address and then fetch the cfa (code field address)
from that location in the array, and then do a jump indirect through
the register holding the cfa. That is 3 instructions, plus the jump at
the end of each primitive back to this code --- so it is 4
instructions total, which is quite a lot.

This is how most of the C-based Forth interpreters work --- they have
a big SWITCH statement. Rod's doesn't do this, he has said --- I've
never looked at his code though, so I don't know how he does it.

Bernhard Schornak

unread,
Sep 18, 2012, 11:20:28 PM9/18/12
to
Actually, it is just one opcode looking like this (GCC/AS):


.section .rdata, "dr"
.p2align 4,,15 # jump table
jt0:.quad XIT # common exit
.quad L00 # branch label 10
.quad L01 # 11
.quad L02 # 12

.text
.p2align 4,,15
... # prepare / evaluate
... # calculate target's index
...
jmp *jt0(, %reg, 8) # 1 clock (bulldozer)

.p2align 4,,15
L00:...
jmp XIT

.p2align 4,,15
L01:...
jmp XIT

.p2align 4,,15
L02:...
jmp XIT

.p2align 4,,15
XIT:... # clean up
ret # return to caller


Where REG holds the index of the label you want to jump to.
It is the fastest way for multiple branch targets, but your
jump table eats one quadword per label. (Replace .quad with
.long and the multiplier 8 with 4 for 32 bit code).



Greetings from Augsburg

Bernhard

Hugh Aguilar

unread,
Sep 19, 2012, 5:27:07 PM9/19/12
to
On Sep 18, 8:35 pm, Bernhard Schornak <schor...@nospicedham.web.de>
wrote:
> Actually, it is just one opcode looking like this (GCC/AS):
>            jmp      *jt0(, %reg, 8)     # 1 clock (bulldozer)

Welcome to the wonderful world of CISC programming! It may just be one
opcode, but it is effectively three instructions. Isn't it true that
this is the same speed as if it were done manually with RISC-like
instructions? There is no gain from using the CISC instructions
nowadays, like there was back in the 8088 days.
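
Spelled out, that one opcode does the work of roughly this RISC-like
sequence (a sketch in AT&T syntax to match Bernhard's listing; %rcx is
just a scratch register here):

        leaq  jt0(, %rax, 8), %rcx  # 1: form the address of the table slot
        movq  (%rcx), %rcx          # 2: load the jump target from the table
        jmp   *%rcx                 # 3: indirect jump to the target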

That is what I picked up earlier, in regard to whether PUSH and POP to
an ESP stack was faster than manually doing it with another register
(such as EBP) as the stack pointer --- and I was told that it wasn't
any faster.

wolfgang kern

unread,
Sep 20, 2012, 4:52:52 AM9/20/12
to

Hugh Aguilar wrote:

[...]
> not sure if I even fully understood what you want to achieve, if your
> Forth should work as runtime interpreter then it may be similar to
> my token-parameter-strings which were just interpreted by jumps ...
>
> And instead of call/ret pairs I use 'chain-jumping', so all functions
> are jumped to by branch-tables and end with a jump to the main entry.
>
> CALL/RET takes more time then two near jumps, and avoids stack-pollution
> too. So a thread (string-part or timeslice multiplexed instances) only
> use one CALL on start and one RET when fully done or terminated.
>
> And the main branch-code part will be always in L1-code-cache, because
> it's more frequenly used than anything else regardless from which thread
> or instance a function may be requested.

|That isn't just two near jumps, is it? Your branch table needs to
|convert the xt (execution token) into an address to jump to. The xt is
|really an index into the branch table, so you have to add it to the
|branch-table base address and then fetch the cfa (code field address)
|from that location in the array, and then do a jump indirect through
|the register holding the cfa. That is 3 instructions, plus the jump at
|the end of each primitive back to this code --- so it is 4
|instructions total, which is quite a lot.

Yes, this needs some instructions and one dependency stall as overhead,
but it's faster than CALL/RET for sure.
I took advantage of x86's SIB-addressing and use the token-values
directly as indexes into the branch-tables:

Main_branch: ;string to interpret is pointed to by ESI
MOV ecx,[esi] ;FN#,SUBFN#,option-flags
MOVZX eax,CL ;FN# into eax (zero-extended)
JMP [FN_table+eax*4] ;256 main function-groups

FNx: ;same for SUBFN..
;function-code is here, may be many lines and SUB-FN-branches too.
ADD esi,... ;adjust as required (skip parameters if any)
jmp Main_branch

SUBFNxy_branch:
MOVZX eax,CH ;SUBFN# into eax
JMP [SUByy+eax*4]

FNend:
ret ;aka DONE

|This is how most of the C-based Forth interpreters work --- they have
|a big SWITCH statement. Rod's doesn't do this he has said --- I've
|never looked at his code though, so I don't know how he does do it.

I don't know Forth at all; my token/parameter interpreter works a bit
like Basic and 'G'. Nested, structured and with many more functions, but
without syntax issues, because I enter the strings with my HEX-editor :)

Do you convert your tokens into single instructions (primitives)?

My collection of functions won't completely fit into L1-cache anyway,
but the main parts and frequently used functions may do.
__
wolfgang


Terje Mathisen

unread,
Sep 20, 2012, 5:51:45 AM9/20/12
to
wolfgang kern wrote:
> I took the advantage of x86's SIB-addressing and use the token-values
> direct as index for the branch-tables:
>
>> Main_branch: ;string to interpret is pointed to by ESI
>> MOV ecx,[esi] ;FN#,SUBFN#,option-flags
>> MOVZX eax,CL ;FN# into eax (zero-extended)
> JMP [FN_table+eax*4] ;256 main function-groups

The above code is almost certainly slower than loading twice:

movzx eax,byte ptr [esi]
mov ecx,[esi]
jmp [fn_table+eax*4]

The reason is that you save a full cycle latency for every jump when you
don't have to go via ECX in order to locate the low byte.

The load to ECX will most probably take place during the AGI stall caused
by loading EAX and then immediately using it for the jump table lookup.

BTW, have you tried skipping the ECX load totally and instead going
directly to EAX, possibly using XOR eax,eax / LODSB instead of MOVZX / INC
ESI?

If EAX always contains zero in the top 24 bits you might get away
without the XOR as well, but it is safer to keep it there. You _can_
skip the XOR before loading the secondary function code though.

Terje Mathisen

unread,
Sep 20, 2012, 6:14:48 AM9/20/12
to
A small addendum here:
Terje Mathisen wrote:
> movzx eax,byte ptr [esi]
> mov ecx,[esi]
> jmp [fn_table+eax*4]

Since a key issue is to fit all the binary words into L1 cache (if
possible) it makes a lot of sense to go for the smallest threading code
possible:

lodsb
jmp [fn_table+eax*4]

and explicit XOR'ing of EAX at the end of each word that has used the
top 24 bits of it, otherwise skip this part.

Sub-fn lookups will then look exactly the same:

subfn1:
lodsb
jmp [subfn1_table+eax*4]

Bernhard Schornak

unread,
Sep 20, 2012, 7:23:56 AM9/20/12
to
The above is the fastest way to jump to the indexed label. GCC uses
a different method with two registers, subtracting the address of a
reference label from the jump target's address to get an offset for
a JMP [displacement]. It saves 4 bytes per label in 64-bit code, but
it's one clock slower than JMP [mem] - due to the dependency of the
read-subtract-jump sequence - and requires one additional register.

JMP [mem] costs 4 clocks (Phenom) or 5 clocks (Bulldozer). The only
caveat is the jump table's larger size in 64 bit code.


Greetings from Augsburg

Bernhard

wolfgang kern

unread,
Sep 20, 2012, 3:42:13 PM9/20/12
to

Terje Mathisen mentioned:

>A small addendum here:
> Terje Mathisen wrote:
>> movzx eax,byte ptr [esi]
>> mov ecx,[esi]
>> jmp [fn_table+eax*4]
>
> Since a key issue is to fit all the binary words into L1 cache (if
> possible) it makes a lot of sense to go for the smallest threading code
> possible:
>
> lodsb
> jmp [fn_table+eax*4]
>
> and explicit XOR'ing of EAX at the end of each word that has used the top
> 24 bits of it, otherwise skip this part.
>
> Sub-fn lookups will then look exactly the same:

> subfn1:
> lodsb
> jmp [subfn1_table+eax*4]

I'm not sure this has anything to do with threaded code at all ....
Yes Terje, that's exactly where I once started; meanwhile I had to
accept that we cannot kill all flies with one stroke :)
__
wolfgang



wolfgang kern

unread,
Sep 20, 2012, 3:35:48 PM9/20/12
to

Terje Mathisen mentioned:

> wolfgang kern wrote:
>> I took the advantage of x86's SIB-addressing and use the token-values
>> direct as index for the branch-tables:
>>
>> Main_branch: ;string to interpret is pointed to by ESI
>> MOV ecx,[esi] ;FN#,SUBFN#,option-flags
>> MOVZX eax,CL ;FN# into eax (zero-extended)
>> JMP [FN_table+eax*4] ;256 main function-groups


> The above code is almost certainly slower than loading twice:
>
> movzx eax,byte ptr [esi]
> mov ecx,[esi]
> jmp [fn_table+eax*4]

I'm not sure the double memory access is always faster than my version;
your suggestion would of course avoid one stall, but OTOH it may
not be too relevant in an IRQ-driven OS. But thanks, I'll check on it.


> The reason is that you save a full cycle latency for every jump when you
> don't have to go via ECX in order to locate the low byte.

Right, but many of my functions have sub-functions and would lose
the additional information given in the option flags...

> The load to ECX will most probably take part during the AGI stall caused
> by loading EAX and then immediately using it for the jump table lookup.

Yes right.
So if someone just needs a byte for a table index without further info,
then a "MOVZX eax,byte [ESI]" might be enough.

> BTW, have you tried skipping the ECX load totally, and instead go directly
> to EAX, possibly using XOR eax,eax/LODSB instead of MOVZX / INC ESI?

Yeah, some of the sub-functions do this. But it would limit the
otherwise given opportunities to pass on more info to a function.

Yes, and not all of my functions have a full set of subfunctions,
so I kept the ecx load in main as a compromise that often pays off.

> If EAX always contains zero in the top 24 bits you might get away without
> the XOR as well, but it is safer to keep it there. You _can_ skip the XOR
> before loading the secondary function code though.

Right Terje, but I once decided to have my tokens 32 bits wide, so the
upper 24 bits may contain additional info or even sub-functions.

ie: eax=50 08 12 22
show a numeric text image (variable type #22 of the current instance) on
screen as INSTRING (right-aligned/engineering/fixed-comma/truncated to
global size/rounded down to a preset maximum display precision) using
colour-scheme x2 (back/fore), which can be altered or globally preset
before this; the 1x in the third byte tells this is a user-editable input
field, so it creates a mouse-focus rectangle and a visible key-request
field which only accepts numeric keys on the fly and waits for a
(numeric/cursor/enter/ESC) key.

The 08 tells that this input value is limited by associated variables
#23/24/25 (min/max/default). Other functions may use this byte as a
sub-function branch. So the user input goes (after checking against
limits) directly to the instance-owned variable #22 (just an example
here for now).

_
wolfgang

