Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

Compiling FORTH on the 68K

7 views

Skip to first unread message

Bob Pendleton

unread,

Nov 7, 1986, 1:00:29 PM11/7/86

A few weeks ago there was some discussion of the execution speed of FORTH on a
68000. Out going news was hung up at our site so I wasn't able to comment at the
time the discussion was going on.

First a little background. FORTH is usually "compiled" to what is called
threaded code. Threaded code is a sequence of pointers to the objects referenced
in the original source code. A FORTH compiler spends most of its time looking up
words in the simple table an writing their addresses into memory. A very small
number of words are compiled by executing them and letting them do anything they
want. This may seem strange to people who are used to complex compilers, but it
works, its fast, and its very flexible.

Threaded code is interpreted by a small loop that gets the next pointer from
memory and then calls the code at that address. The code that is called my
actually do something, an add or a fetch, or it may just call the interpreter
with the address of another sequence of pointers.

The only real problem with threaded code is that it is interpreted. And because
it is interpreted, it is slow. I know that statement will get me some heat from
FORTH fans the world over, but interpreted threaded code is slow compared to
machine code. You're just stuck with the fact that the interpreter must execute
2 or 3 instructions for each pointer it follows.

The obvious solution to this is to compile to "call" or "JSR" threaded code.
Don't just put the address of object in memory, put a hardcoded subroutine call
to the object in memory. This approach gets rid of the interpretive overhead and
can make things run a lot faster. This works well on multiple stack machines
like the 68000, but doesn't work as well as you might expect for single stack
machines like the z80. You can also get into trouble on systems that require
code to be relocatable.

One of the reasons for using threaded code is that it is supposed to be more
compact than equivalent machine code. But, on a machine like the 68000 where a
pointer is 4 bytes and a JSR with a long word address is 6 bytes this argument
breaks down.

Assuming register A6 is the data stack pointer a JSR threaded FORTH system the
word + ( add top to items on the stack ) is ( most likely to be ) compiled as

JSR add$word

Where add$word is something like

add$word:
MOVE.L (A6)+,D0
ADD.L (A6)+,D0
MOVE.L D0,-(A6)
RTS

Except for the RTS instruction the code used to implement + is the same length
as the JSR instruction that invokes it. The RTS is only needed because of the
JSR. So JSR threaded code gives no savings in space over compiling + in line,
and the time required to execute the JSR, RTS pair ( assuming I'm reading the
tables right ) at 36 clocks, almost doubles the 40 clocks needed to execute the
two MOVEs and an ADD. And this is for 32 bit operands. Most primitive operations
in FORTH can be expanded in line with little cost in space when compared to JSR
threading. There are several other optimizations that can be used once you start
compiling to in line code.

A compiler that worked this way would generate many redundant push/pop pairs. A
single flag can be used to eliminate these pairs of instructions. The result is
that in the worst case the generated code would be the same size or perhaps
slightly larger than JSR threaded code, but would usually be 1/3 to 1/2 the size
and run 2 to 4 times faster.

I wrote a FORTH like system for the z80 that compiled code this way. It
generated code that was about the same size as CALL threaded code on the z80
( remember 16 bit addresses, not 32 bit addresses like the 68000 ). The code ran
about 3 times faster than the original CALL threaded version of the same system.

So if somebody says that a JSR threaded FORTH on the Amiga runs 10 iterations of
the sieve in 10 seconds and somebody else says that they have seen a FORTH on
the Amiga that will do 10 iterations in 2 to 3 seconds, I would be inclined to
believe them.

Bob Pendleton.

P.S.
Sorry for being so long winded.

0 new messages