
Useful instructions in Forth machine


rickman

Jul 2, 2006, 3:31:40 PM
I am reworking a Forth CPU I designed a couple of years ago and I would
like to ask for input on the utility of some instructions. When I
originally designed my machine, I looked at the info I could find on
Chuck Moore's work as well as others and to some extent I used the best
of all worlds given my application and tools.

I'm not sure if I need two shift instructions. I have no shift left
since this can be done by DUP +. I have a shift right logical (shift 0
into msb) and a shift right arithmetic (rotate through carry). Do you
think these are both necessary? I see that ColorForth and SEAforth do
not have a shift right logical. Is the logical right shift used for
much? I guess to duplicate it the carry would have to be cleared before
doing the arithmetic right shift, DUP DUP - DROP?
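Another way to synthesize the logical shift is to do the arithmetic shift first and then mask off the replicated sign bit, rather than pre-clearing the carry. A sketch in standard Forth (the 16-bit mask and the word name are my assumptions, not your opcodes):

```forth
\ Sketch: logical right shift built from an arithmetic one.
\ Assumes a 16-bit cell; $7FFF clears the replicated sign bit.
: U2/  ( u -- u>>1 )  2/ $7FFF AND ;
```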

A subtract can be done as an instruction or the top of stack can be
negated followed by an add. Originally I was going for optimizing code
size, so I used the subtract with and without carry (to match the add
and compare instructions). Any opinions on replacing the two subtracts
with a single negate instruction? Any opinions on eliminating the
compare instructions? They can be duplicated with 2DUP -.
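For illustration, the replacements described above sketched in standard Forth (the primed names are placeholders, not the machine's opcodes; on the real machine the subtract result would set the flags, and DROP stands in for discarding it):

```forth
\ Sketch: subtract synthesized from negate-and-add,
\ compare as a subtract whose result is discarded.
: SUB'  ( a b -- a-b )  NEGATE + ;
: CMP'  ( a b -- a b )  2DUP SUB' DROP ;
```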

This processor was optimized for code size, logic size and of course I
tried to keep it fast. I got a satisfactory result, but now I am
targeting a much faster FPGA with much more memory and logic. I still
want to keep it small since that typically goes well with fast. But to
make intelligent choices on the trade offs, I need to know which
instructions are actually used much. BTW, the machine currently has 36
instructions, a few more than the SEAforth processor. I use an 8 bit
opcode with embedded constants for instructions that need an address or
data. The embedded address or data is extended by prefixing with a
literal instruction similar to the way the Transputer did it back in
the 80's, but I use 7 bits of the 8 bit opcode instead of just 4 for
data.

I am sure I could learn a few things about minimizing the instruction
set from the SEAforth processor, but it looks like it will be a slow
painful process to get enough info on how to make good use of the
instruction set to be practical to duplicate it.

One final note, I already looked at references like Koopman's book on
stack computers. I am looking for opinions from the users here.

Jerry Avins

Jul 2, 2006, 6:22:14 PM
rickman wrote:
> I am reworking a Forth CPU I designed a couple of years ago and I would
> like to ask for input on the utility of some instructions. When I
> originally designed my machine, I looked at the info I could find on
> Chuck Moore's work as well as others and to some extent I used the best
> of all worlds given my application and tools.

...

Doesn't that depend on what you want to do and what you will do it on?
If you don't have shifting primitives but end up defining high-level
words for clarity or because you use them often, you will use more
memory than if you do have them. If your processor has a barrel shifter,
it would be a pity not to make it available. The usual 2/ is an
arithmetic right shift that rounds toward -inf. U2/ isn't common. I have
n rshift and n lshift coded in assembler. (It makes no sense for n to
exceed the word length, and it must be positive.)

Whatever you decide is not written in stone.

Jerry
--
Engineering is the art of making what you want from things you can get.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Albert van der Horst

Jul 2, 2006, 5:46:03 PM
In article <1151868700.0...@a14g2000cwb.googlegroups.com>,
rickman <spamgo...@yahoo.com> wrote:
<SNIP>

>
>I am sure I could learn a few things about minimizing the instruction
>set from the SEAforth processor, but it looks like it will be a slow
>painful process to get enough info on how to make good use of the
>instruction set to be practical to duplicate it.

Certainly a lot can be learned from SEAForth, but it will reflect
trade offs made by Chuck Moore. There is also the design decision
to use a fixed, small opcode width. Adding a single opcode would
increase the width. Your design may not face such severe
consequences.

If you are really in for a general purpose processor, you
cannot rule out the need for multiple precision shifts and
arithmetic (widely used for instance for encryption, this is
an area where speed counts.)
It then may be a bad idea to leave out logical shift rights, rotates
through carry and add-with-carry, because "if you need them, you need
them badly."
The form may not be the classical Program Status Word. Look
at the DEC Alpha for alternatives.

E.g. a Forth primitive on a stack machine replacing ADC could be
: M+ ( s,s --d) S>D 2>R S>D 2>R 2R> 2R> D+ ;

>One final note, I already looked at references like Koopman's book on
>stack computers. I am looking for opinions from the users here.

Little of it stuck in my brain. It is more of a scientific foundation
for what we Forthers all know already. Not an experience like
Bertrand Meyer's "Object Oriented software development" or Leo Brodie's
"Thinking Forth".

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
alb...@spenarnc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

rickman

Jul 2, 2006, 10:47:07 PM
Jerry Avins wrote:
> rickman wrote:
> > I am reworking a Forth CPU I designed a couple of years ago and I would
> > like to ask for input on the utility of some instructions. When I
> > originally designed my machine, I looked at the info I could find on
> > Chuck Moore's work as well as others and to some extent I used the best
> > of all worlds given my application and tools.
>
> ...
>
> Doesn't that depend on what you want to do and what you will do it on?
> If you don't have shifting primitives but end up defining high-level
> words for clarity or because you use them often, you will use more
> memory than if you do have them. If your processor has a barrel shifter,
> it would be a pity not to make it available. The usual 2/ is an
> arithmetic right shift that rounds toward -inf. U2/ isn't common. I have
> n rshift and n lshift coded in assembler. (It makes no sense for n to
> exceed the word length, and it must be positive.)
>
> Whatever you decide is not written in stone.

Ok, thanks for the comments about the U2/, etc.

But maybe I didn't make the purpose clear. I am not writing a Forth, I
am designing the instruction set for the processor. I would not
normally consider changing the instruction set because it pretty much
*is* written in stone because of the work required to verify that the
instruction set is working correctly and that it meets all the timing
requirements and to change the software I have written so it will still
work. I am only considering the change because I am changing the
memory modules to host it on a different (read newer) FPGA. At this
point the memory will be up to date with single clock cycle synchronous
memories that are available on nearly all current families of FPGAs.
Then it will easily port to lots of different chips. So while I am
doing that, I want to consider any instruction set changes. Maybe I
should post the full instruction set to see what people think...

This is what I currently have. Every instruction is one clock cycle
(no pipelining) except there is a three cycle delay for accessing the
instruction memory which is byte mapped only. The rest of the machine
is 16 bits. Currently there is no external memory and the IP is 9 bits
(to become 11). When I change the CPU for synchronous memory all
memory reads will be 2 clock cycles (both data memory and instruction
memory). I have considered making a memory read two instructions, 1)
read the memory, 2) push the data on the stack. This would make all
instructions 1 clock and simplify some of the CPU.

Originally I had all instructions as fixed opcodes with sparse encoding
to speed up decoding. So a call or jump required a separate literal
instruction to load an address. I dropped the idea of sparse encoding
as it was not a big improvement and added a 4 bit field for the
beginning of an address. Now an 11 bit jump or call can be coded into
two bytes using 4 bits in the jump or call and 7 bits from the literal
instruction...

-------------------------------
Alternate instruction encoding to shorten the literal for call/jump.
This may make the address path more complex and difficult to make fast
(it turned out ok). A one-byte JMP 0 serves as NOP, freeing one opcode
for RET, which was not needed when JMP had no offset in the instruction
(JMP and RET were the same).

The literal instruction sets the literal flag, all other instructions
clear it. If the literal flag is clear, immediate data from a LIT,
JMP, CALA or CALR is loaded with sign extension. If the literal flag
is set the immediate data is left shifted with the top of return stack.


There are 32 fixed opcode instructions, 4 instructions with immediate
data, and interrupts, which are handled like a CALA instruction:
pushing the return address, setting IP to one of 8 vectors (on two
byte boundaries), but also pushing the PSW to the data stack.

0snn nnnn    LIT   - s,n is literal constant loaded/shifted onto Ret Stack
10cc snnn    JMP   - cc is condition code - UZCO, s is sign, nnn is offset
Cx 1100 snnn CALA  - s is sign, nnnn is offset
Dx 1101 snnn CALR  - s is sign, nnnn is offset
E0 1110 0000 SWAP  - swap the top two items on the data stack
E1 1110 0001 RDUP  - push the top of the return stack to the return stack
E2 1110 0010 OVER  - push the second item on the data stack to data stack
E3 1110 0011 DDUP  - push the top of the data stack to the data stack

E4 1110 0100 SHRL  - shift right logical the top of data stack (CZ)
E5 1110 0101 SHRC  - shift right arithmetic the top of data stack (CZ)
E6 1110 0110 ZFLAG - push the zero flag to the top of data stack
E7 1110 0111 RFTCH - push return stack top to data stack, no pop
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
E8 1110 1000 RDROP - pop the return stack
E9 1110 1001 RET   - pop the return stack to the IP
EA 1110 1010 FETCH - push memory to data stack, pop the return stack
                     (Inst: cycle 1; IMEM FTCH: cycles 2-3)
EB 1110 1011 RFROM - pop the return stack, push to data stack

EC 1110 1100 RSUB  - subtract top of return from return second (OCZ)
ED 1110 1101 RCMP  - compare top of return from return second with a pop (OCZ)
EE 1110 1110 FTCHP - push memory to data stack, increment return stack
                     (Inst: cycle 1; IMEM FTCH: cycles 2-3)
EF 1110 1111 RADRS - add return stack top to second with a pop (OCZ)
-------------------------------------------------------------------------------
F0 1111 0000 ADDNC - ADD data stack top to second with a pop (OCZ)
F1 1111 0001 ADDC  - ADD data stack top to second, carry as lsb (OCZ)
F2 1111 0010 SUBNC - SUB data stack top from second with a pop (OCZ)
F3 1111 0011 SUBC  - SUB data stack top from second, carry as lsb (OCZ)

F4 1111 0100 TORR  - push return stack, pop data stack
F5 1111 0101 BFLG  - toggle byte flag
F6 1111 0110 CMPNC - compare data stack top to second, no stack op (OCZ)
F7 1111 0111 CMPC  - compare data stack top to second, carry as lsb,
                     no stack op (OCZ)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
F8 1111 1000 DDROP - pop data stack
F9 1111 1001 RETI  - pop return stack to IP and data stack to PSW (OCZB)
FA 1111 1010 STORE - pop data stack to memory, pop return stack (OCZB if PSW)
FB 1111 1011 BAND  - AND data stack top and second with a pop (Z)

FC 1111 1100 RADDT - ADD data stack top to return stack top, no stack op (OCZ)
FD 1111 1101 BOR   - OR data stack top and second with a pop (Z)
FE 1111 1110 STORP - pop data stack to memory, inc return stack (OCZB if PSW)
FF 1111 1111 BXOR  - XOR data stack top and second with a pop (Z)

Other than the issue of needing one more opcode to allow me to split
the memory fetch into two instructions, I don't really need to change
anything. I am just looking for information (or even opinion) on what
is good or bad about these opcodes.

BTW, just to make things a bit more clear, I use the top of the return
stack as an address pointer. At one point I had a separate address
register like the C18, but I realized that I could combine that with
the return stack since there are already operations that use it this
way. I gave the return stack the ability to do some simple math to
facilitate Forth looping and ended up with this instruction set.

I would consider using the SEAForth or C18 instruction set, but I don't
understand how to make use of all the instructions. I don't want to
copy an instruction set that I don't understand.

rickman

Jul 2, 2006, 11:18:49 PM
Albert van der Horst wrote:
> In article <1151868700.0...@a14g2000cwb.googlegroups.com>,
> rickman <spamgo...@yahoo.com> wrote:
> <SNIP>
> >
> >I am sure I could learn a few things about minimizing the instruction
> >set from the SEAforth processor, but it looks like it will be a slow
> >painful process to get enough info on how to make good use of the
> >instruction set to be practical to duplicate it.
>
> Certainly a lot can be learned from SEAForth, but it will reflect
> trade offs made by Chuck Moore. There is also the design decision
> to use a fixed, small opcode width. Adding a single opcode would
> increase the width. Your design may not face such severe
> consequences.

Thanks for your comments. I think I understand some of the important
aspects of what Jeff has explained over the years of the various Forth
CPUs they have designed. I find it fascinating to learn how they think
and work. I especially have learned that you don't have to follow the
norm to be productive. But it certainly is a lot easier on the resume!


I ended up with a similarly constrained opcode encoding. Please see my
other post to Jerry if you are interested.

> If you are really in for a general purpose processor, you
> cannot rule out the need for multiple precision shifts and
> arithmetic (widely used for instance for encryption, this is
> an area where speed counts.)
> It then may be a bad idea to leave out logical shift rights, rotates
> through carry and add-with-carry, because "if you need them, you need
> them badly."
> The form may not be the classical Program Status Word. Look
> at the DEC Alpha for alternatives.

The current application is not so much a general purpose processor as
a tightly embedded processor in an FPGA. It does things like manage
DMA and emulate a UART interface (not the UART itself).

The one part I am especially not happy with is the control of memory
word vs. byte accesses. With four opcodes for memory access it would
take four more to create versions with both word and byte. So I have a
bit in the status register that controls whether memory accesses are
byte or word. This bit can be set or cleared by accessing the PSW.
For faster control I added an opcode to toggle the bit. Clearly
toggling is not optimal as you have to know the current state, but
there is not a second opcode to spare to create set and clear opcodes.
I guess I could change the definition to Write Byte Control from data
on the stack. This would make it a 2 byte function. So far it has
worked ok.
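If explicit set/clear ever becomes worth a 2-byte sequence, it could be sketched as a read-modify-write of the PSW. Everything here is hypothetical: PSW@ and PSW! stand for whatever words access the status register, and the flag position is invented:

```forth
\ Hypothetical sketch: explicit byte-mode set/clear via the PSW.
\ PSW@ / PSW! and the byte-flag bit position are assumptions.
8 CONSTANT BFLAG                        \ assumed byte-flag mask
: BYTE-MODE  ( -- )  PSW@ BFLAG OR PSW! ;
: CELL-MODE  ( -- )  PSW@ BFLAG INVERT AND PSW! ;
```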

> E.g. a Forth primitive on a stack machine replacing ADC could be
> : M+ ( s,s --d) S>D 2>R S>D 2>R 2R> 2R> D+ ;

I can't say I understand how this would replace ADC (add with carry).
The trick to the above is how do you implement the D+ without ADC?


> >One final note, I already looked at references like Koopman's book on
> >stack computers. I am looking for opinions from the users here.
>
> Little of it stuck in my brain. It is more of scientific foundation
> for what we Forthers all know already. Not an experience like
> Bertrand Meyer's "Object Oriented software development" or Leo Brodie's
> "Thinking Forth".

My point about Koopman's book is that it contains information on the
static and dynamic instruction frequencies of Forth primitives. I used
that to guide my selection of instructions. I believe it is the only
reference I have seen that actually did this analysis other than a few
postings in c.l.f.

Marcel Hendrix

Jul 3, 2006, 3:14:06 AM
"rickman" <spamgo...@yahoo.com> writes Re: Useful instructions in Forth machine
[..]

> My point about Koopman's book is that it contains information on the
> static and dynamic instruction frequencies of Forth primitives. I used
> that to guide my selection of instructions. I believe it is the only
> reference I have seen that actually did this analysis other than a few
> postings in c.l.f.

The book needs careful analysis before being used as gospel.

E.g. Static/Dynamic Instruction Execution Frequency for Fraq on pages 133/132:

          static   dynamic
CALL        17%      11%
LIT         11%       4%
EXIT         6%      11%
@           11%       7%
DUP          4%       4%
0BRANCH      3%       3%
+            3%       ?
!            3%       ?

Any useful program needs to fetch from memory, do something, and store
back (or do I/O). However, "!" and "+" are in the noise, and "@" is as
frequent as the do-nothings "CALL" and "LIT." Other ALU operations aren't
listed at all. How is this possible?

I think the tables must be strongly influenced by the implementation model,
e.g. a table will change dramatically when the compiler is optimizing for
size or for speed. The tables can be used as is when you have a Forth with
(only) these primitives. Having e.g. "@" and "!" primitives with immediate
addressing in the spare bits should change the frequencies radically
(chaotically ;-)

-marcel

Anton Ertl

Jul 3, 2006, 2:05:59 AM
m...@iae.nl (Marcel Hendrix) writes:
>"rickman" <spamgo...@yahoo.com> writes Re: Useful instructions in Forth machine
>[..]
>> My point about Koopman's book is that it contains information on the
>> static and dynamic instruction frequencies of Forth primitives. I used
>> that to guide my selection of instructions. I believe it is the only
>> reference I have seen that actually did this analysis other than a few
>> postings in c.l.f.
>
>The book needs careful analysis before being used as gospel.
>
>E.g. Static/Dynamic Instruction Execution Frequency for Fraq on pages 133/132:
>
>           static   dynamic
> CALL        17%      11%
> LIT         11%       4%
> EXIT         6%      11%
> @           11%       7%
> DUP          4%       4%
> 0BRANCH      3%       3%
> +            3%       ?
> !            3%       ?
>
>Any useful program needs to fetch from memory, do something, and store
>back (or do I/O). However, "!" and "+" are in the noise, and "@" is as
>frequent as the do-nothings "CALL" and "LIT." Other ALU operations aren't
>listed at all. How is this possible?

IIRC he only listed the top n primitives.

Why do you feel that @ should be more frequent than CALL and LIT? And
why do you call them do-nothings?

I have also done measurements of dynamic primitive frequencies, and
you can find the results in

http://www.complang.tuwien.ac.at/forth/peep/
http://www.complang.tuwien.ac.at/misc/stack-caching-data/

You can find static and dynamic data for more benchmarks in

http://www.complang.tuwien.ac.at/papers/gregg+01euroforth.ps.gz

Note that the high frequency of LITs in the latter data is caused by
the fact that recent versions of Gforth translate constants,
variables, and values into sequences containing LIT.

Looking at the primitives/routines you mentioned, I see in the peep
data (sum of three benchmarks):

15143346 100% NEXTS
1995025 14% col: (call)
1355759 9% @
700996 5% lit
419537 3% +
187718 1% !

Looks pretty close to what Koopman found.

>Having e.g. "@" and "!" primitives with immediate
>addressing in the spare bits should change the frequencies radically
>(chaotically ;-)

Certainly in the Gforth primitives there is only one word to do @ and
one to do !. If you split these words into several primitives, each
will have a lower frequency.

Ah, I see what you mean: Yes, the frequency of lits could go down with
such instructions. And similarly, if the compiler performed
aggressive inlining, the frequency of calls and returns would go down.

One might think that words like MOVE and FILL might increase the @ and
! counts significantly if they are implemented in terms of @ and !
instead of being primitives (as they are here). However, given their
low usage counts (MOVE: 12800; FILL: 4735), I doubt that.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2006: http://www.complang.tuwien.ac.at/anton/euroforth2006/

rickman

Jul 3, 2006, 10:41:02 AM

I was only able to view the table with the counts of various items. I
don't have a way to view PostScript.


> Note that the high frequency of LITs in the latter data is caused by
> the fact that recent versions of Gforth translate constants,
> variables, and values into sequences containing LIT.

That is the closest to the use of LIT in a machine instruction set. In
my machine LIT is the way you would get any data longer than 4 bits
into a register. For actual data values it is how you get the entire
value into the register.


> Looking at the primitives/routines you mentioned, I see in the peep
> data (sum of three benchmarks):
>
> 15143346 100% NEXTS
> 1995025 14% col: (call)
> 1355759 9% @
> 700996 5% lit
> 419537 3% +
> 187718 1% !
>
> Looks pretty close to what Koopman found.
>
> >Having e.g. "@" and "!" primitives with immediate
> >addressing in the spare bits should change the frequencies radically
> >(chaotically ;-)

Spare bits?


> Certainly in the Gforth primitives there is only one word to do @ and
> one to do !. If you split these words into several primitives, each
> will have a lower frequency.

Perhaps, but the overall use of @ is high in the list and it needs to
be efficient in both time and space.

> Ah, I see what you mean: Yes, the frequency of lits could go down with
> such instructions. And similarly, if the compiler performed
> aggressive inlining, the frequency of calls and returns would go down.

If space optimization is important, as it is in my application,
inlining is not very useful except for very short sequences (as short
as the call). I tried to keep calls, returns and jumps at the top of
the efficiency list including the address which is inserted with a
LIT.

Jerry Avins

Jul 3, 2006, 11:48:53 AM
to rickman
rickman wrote:
> Jerry Avins wrote:

...

>> Whatever you decide is not written in stone.
>
> Ok, thanks for the comments about the U2/, etc.
>
> But maybe I didn't make the purpose clear. I am not writing a Forth, I
> am designing the instruction set for the processor.

...

You were clear enough; my mindset got in the way.

We're now in the realm of tradeoffs I'm not entitled to an opinion
about. It seems to me that shifts through and around carry are
important, even though high-level languages don't usually support them,
and that a carry flag can be dispensed with even for multiple-precision
arithmetic. (Bernd once explained how the Alpha manages, but I forget.)

I'm well aware that less is often better, but it seems to me that "less
is more" is just hype. Good luck.

Albert van der Horst

Jul 3, 2006, 12:25:38 PM
In article <1151896729....@j8g2000cwa.googlegroups.com>,

rickman <spamgo...@yahoo.com> wrote:
>Albert van der Horst wrote:
<SNIP>

>
>> E.g. a Forth primitive on a stack machine replacing ADC could be
>> : M+ ( s,s --d) S>D 2>R S>D 2>R 2R> 2R> D+ ;
>
>I can't say I understand how this would replace ADC (add with carry).
>The trick to the above is how do you implement the D+ without ADC?

That is a spec, not an implementation. D+ would be implemented in
terms of M+.

Groetjes Albert

jeth...@gmail.com

Jul 3, 2006, 1:54:41 PM
You want a set of primitives that will make your Forth code efficient.

It's only natural to look at a body of Forth code and figure which
primitives would do that. But to a very large extent people have subtly
adapted their Forth code to be efficient on their systems. And you will
do that with your Forth code written for your system. Notice how Moore
made a system where SWAP was inefficient and then coded in a way that
tended to avoid SWAP . I don't have the exact quote but I got the
strong impression that he didn't miss SWAP at all.

So it's hard to tell what instruction set is good. But it's obvious
that efficient calls and returns are good, and @ ! are a great big
deal.

Here's a tentative suggestion. If you have OVER R@ and R'@ (which
copies the next-to-top item from the return stack onto the data stack)
then that's 3 items you can get to TOS easily. 4 stack items very
easily available. That's a good start for stack manipulation. For
debugging code you can have a dummy return stack that gets used while
interpreting, because the big problem with using the return stack is it
makes debugging hard. The interpreted return stack doesn't need to go
in the final product unless you want it there, it's just for debugging.

Depending on your architecture it might be cheap to have the word that
somebody (I think it was Ewald Pfau) called PLUCK that does ( a b c --
a b c a ). If I used it I'd probably call it LEAP . That also gives you
an extra stack item easily available.

Instructions like R'@ and PLUCK don't help very much for implementing,
say, 2OVER . If you want existing Forth code to run efficiently you'll
want to design your processor for the code. And if you want a processor
that can run efficient code, you won't know how hard it is to write
efficient code for it until you try.

If you have R@ and R'@ do you even want 2OVER ? Do >R >R (or implement
2>R which is supposed to be a little different) and then you can do
OVER OVER ( or 2DUP ) or else R@ R'@ ( or 2R@ which is supposed to be a
little different) and you can put a copy of either pair on the stack
whenever you want. Would you want 2OVER too? I don't know. Maybe.
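One reading of the recipe above, in standard Forth (the primed name is mine; this is a sketch of the idea, not a claim about your opcodes): park the pair on the return stack, duplicate what remains, then restore.

```forth
\ Sketch: 2OVER with no dedicated opcode, via return-stack parking.
: 2OVER'  ( a b c d -- a b c d a b )  >R >R 2DUP R> R> 2SWAP ;
```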

You'll do a lot better with some sort of arithmetic carry if you ever
want efficient multicell arithmetic. If your applications will hardly
ever need that, then no problem.

Marcel Hendrix

Jul 3, 2006, 4:19:13 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: Useful instructions in Forth machine

> m...@iae.nl (Marcel Hendrix) writes:
[..]


>> Any useful program needs to fetch from memory, do something, and store
>> back (or do I/O). However, "!" and "+" are in the noise, and "@" is as
>> frequent as the do-nothings "CALL" and "LIT." Other ALU operations aren't
>> listed at all. How is this possible?

> IIRC he only listed the top n primitives.

I know that :-)
How can there be no primitives that DO something in the top n (except for "+")?

> Why do you feel that @ should be more frequent than CALL and LIT?

You ask me to fix your bike and I call a cab so you can go to the shop and ask
there. The task has not been done.

This could be caused by the way Philip Koopman instrumented the counts.
He mentions CALL _and_ @, so does that mean @ is a primitive (not called)?
As AND is in the list, it is not CALLed? Given the frequency of CALLs with
regard to CPU operations (+ @ ! AND), CALL is used to call high-level
Forth words that don't do any real work (and not to call OR AND - ..).

> And
> why do you call them do-nothings?

What work is done with a call to a non-primitive?
LIT is used to supply addresses for @ and ! ... but wait, the frequency
of LIT is lower than the sum of frequencies for @ plus ! ?

> I have also done measurements of dynamic primitive frequencies, and
> you can find the results in

> http://www.complang.tuwien.ac.at/forth/peep/
> http://www.complang.tuwien.ac.at/misc/stack-caching-data/

> You can find static and dynamic data for more benchmarks in

> http://www.complang.tuwien.ac.at/papers/gregg+01euroforth.ps.gz

> Note that the high frequency of LITs in the latter data is caused by
> the fact that recent versions of Gforth translate constants,
> variables, and values into sequences containing LIT.

> Looking at the primitives/routines you mentioned, I see in the peep
> data (sum of three benchmarks):

> 15143346 100% NEXTS
> 1995025 14% col: (call)
> 1355759 9% @
> 700996 5% lit
> 419537 3% +
> 187718 1% !

> Looks pretty close to what Koopman found.

From this I may assume that Gforth is between 2 and 7 times
slower than an NCC?

>> Having e.g. "@" and "!" primitives with immediate
>> addressing in the spare bits should change the frequencies radically
>> (chaotically ;-)

[..]


> Ah, I see what you mean: Yes, the frequency of lits could go down with
> such instructions. And similarly, if the compiler performed
> aggressive inlining, the frequency of calls and returns would go down.

Yes, that is what I meant. After that is done, the ratio of work/overhead
instructions would become apparent.

-marcel

rickman

Jul 3, 2006, 2:35:06 PM
Albert van der Horst wrote:
> In article <1151896729....@j8g2000cwa.googlegroups.com>,
> rickman <spamgo...@yahoo.com> wrote:
> >Albert van der Horst wrote:
> <SNIP>
> >
> >> E.g. a Forth primitive on a stack machine replacing ADC could be
> >> : M+ ( s,s --d) S>D 2>R S>D 2>R 2R> 2R> D+ ;
> >
> >I can't say I understand how this would replace ADC (add with carry).
> >The trick to the above is how do you implement the D+ without ADC?
>
> That is a spec, not an implementation. D+ would be implemented in
> terms of M+

Now I am more confused. How can you define M+ in terms of D+ and then
say D+ would be implemented in terms of M+?

D+ would be done like this, assuming the lsByte is at the top of the
stack...

: D+ ( d1 d2 -- dsum ) ROT ADD >R ADC R> ;

If you don't have the ADC, you need to somehow capture the result of
the carry from the first ADD and add that into the second ADD somehow.


: D+ ( d1 d2 -- dsum ) ROT ADD >R PSW @ #0001 AND ADD ADD R> ;

rickman

Jul 3, 2006, 3:02:37 PM

Thanks for the advice. I pretty much am aware of the points you are
making. Many seemingly useful instructions are not options because of
the restrictions on keeping the hardware lean. Mainly I am asking for
advice on the few instructions I listed. Like the ADD, ADC, SUB, SUBC,
CMP, CMPC set. I am not sure if I need all six of these. Having the
carry versions make multi-cell arithmetic easier, but maybe I don't
need them all. SUB can be done by inverting and setting the carry then
doing an ADD. CMP can be done with a 2DUP and SUB.
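The invert-and-carry trick reads like this in portable Forth, where INVERT 1+ stands in for "invert, set carry, add" (a sketch; the primed name is a placeholder, not one of the machine's opcodes):

```forth
\ Sketch: SUB as invert + add-with-carry-set (two's complement).
: SUB''  ( a b -- a-b )  INVERT 1+ + ;
```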

One that rubs me the wrong way is the memory access for bytes. I use a
PSW flag to determine if a given access will be a cell or a byte
access. I used one opcode to toggle the flag which can be used in
conjunction with direct PSW accesses. I just can't think of a better
way unless I just ignore byte accesses.

There are a lot of tradeoffs. Some of the instructions were included
because they required a very minimal amount of extra logic. Others
were left out that seem like they should be easy, but they did not fit
the instruction decoding well. I am just asking opinions, but of
course I will have to make the decisions.

Coos Haak

Jul 3, 2006, 4:13:35 PM
On 3 Jul 2006 11:35:06 -0700 rickman wrote:

> Albert van der Horst wrote:
>> In article <1151896729....@j8g2000cwa.googlegroups.com>,
>> rickman <spamgo...@yahoo.com> wrote:
>>>Albert van der Horst wrote:
>> <SNIP>
>>>
>>>> E.g. a Forth primitive on a stack machine replacing ADC could be
>>>> : M+ ( s,s --d) S>D 2>R S>D 2>R 2R> 2R> D+ ;
>>>
>>>I can't say I understand how this would replace ADC (add with carry).
>>>The trick to the above is how do you implement the D+ without ADC?
>>
>> That is a spec, not an implementation. D+ would be implemented in
>> terms of M+
>
> Now I am more confused. How can you define M+ in terms of D+ and then
> say D+ would be implemented in terms of M+?
>
> D+ would be done like this, assuming the lsByte is at the top of the
> stack...
>
>: D+ ( d1 d2 -- dsum ) ROT ADD >R ADC R> ;

No, the msCell is on the top:
: D+ ( d1 d2 -- dsum ) >R ROT ADD SWAP R> ADC ;

>
> If you don't have the ADC, you need to somehow capture the result of
> the carry from the first ADD and add that into the second ADD somehow.
>
>
>: D+ ( d1 d2 -- dsum ) ROT ADD >R PSW @ #0001 AND ADD ADD R> ;

: D+ ( d1 d2 -- dsum ) >R ROT ADD SWAP PSW @ 1 AND ADD R> ADD ;

Note: ISO M+ has this stack picture: ( d1 n -- d2 ) and differs from
Albert's example.
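The carry-propagating D+ in this exchange can be modeled in Python. A
minimal sketch, assuming 16-bit cells and a double kept as (lo, hi)
with the high cell on top of the stack; the function name is
illustrative:

```python
# Assumption: 16-bit cells; a double is (lo, hi) with hi on top of the stack.
MASK = 0xFFFF

def d_plus(lo1, hi1, lo2, hi2):
    """D+ : add the low cells first, then propagate the carry into the high cells."""
    lo = lo1 + lo2
    hi = hi1 + hi2 + (lo >> 16)      # the ADC step: carry out of the low-cell add
    return lo & MASK, hi & MASK
```

For example, 0x0001FFFF + 0x00000001 carries out of the low cell and
yields 0x00020000.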

--
Coos

CHForth, 16 bit DOS applications
http://home.hccnet.nl/j.j.haak/forth.html

Albert van der Horst

Jul 3, 2006, 4:53:31 PM
In article <1151951706.7...@a14g2000cwb.googlegroups.com>,

Sorry about this. M+ would be a primitive, and leave the sum and,
on top of that, the carry on the stack.
Then
: D+ ROT + >R M+ R> + ;

My code for M+ was merely to show exactly what it does.

Anton Ertl

Jul 4, 2006, 2:45:26 AM
Albert van der Horst <alb...@spenarnc.xs4all.nl> writes:
>It then may be a bad idea to leave out logical shift rights, rotates
>through carry and add-with-carry, because "if you need them, you need
>them badly."
>The form may not be the classical Program Status Word. Look
>at the DEC Alpha for alternatives.

Right. No carry bit there, neither in a PSW nor elsewhere. For our
D+ example, this is not too bad (apart from the stack juggling):

: D+ ( d1 d2 -- d )
rot + >r tuck + swap over u> r> swap - ;
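Anton's carry-less recovery via `u>` boils down to an unsigned compare
of the wrapped sum against one addend, as on the DEC Alpha. A minimal
Python sketch, assuming 16-bit cells; the function name is made up:

```python
# Assumption: 16-bit cells. With no carry flag, the carry out of an add
# can be recovered by an unsigned compare of the truncated sum.
MASK = 0xFFFF

def add_carryless(a, b):
    """Add two cells; derive the carry by unsigned comparison (Alpha style)."""
    s = (a + b) & MASK
    carry = 1 if s < (a & MASK) else 0   # sum wrapped => it is below an addend
    return s, carry
```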

Anton Ertl

Jul 4, 2006, 4:13:34 AM
m...@iae.nl (Marcel Hendrix) writes:
>This could be caused by the way Philip Koopman instrumented the counts.
>He mentions CALL _and_ @, so does that mean @ is a primitive (not called?)

Sure. That's the usual thing in classical threaded-code Forths.
Certainly this is the case for the numbers from me.

>> why do you call them do-nothings?
>
>What work is done with a call to a non-primitive?

Leave out the call, and you will see that it does something.

>LIT is used to supply addresses for @ and !

Not usually. There is dovar for that. And addresses often come out
of I, +, etc.

>... but wait, the frequency
>of LIT is lower than the sum of frequencies for @ plus ! ?

For those numbers where variables are translated to literals, the
number of LITs exceeds the number of @ plus !.

>> Looking at the primitives/routines you mentioned, I see in the peep
>> data (sum of three benchmarks):
>
>> 15143346 100% NEXTS
>> 1995025 14% col: (call)
>> 1355759 9% @
>> 700996 5% lit
>> 419537 3% +
>> 187718 1% !
>
>> Looks pretty close to what Koopman found.
>
>From this I may assume that gForth is between 2 and 7 times
>slower than a NCC?

Not sure how you get there from these numbers, but yes, somewhere in
the ballpark, especially the version that these number come from. In
the meantime, Gforth has become twice as fast, without increasing the
proportion of + and ! in the executed primitives, so now it is around
two times slower than a good NCC on these kinds of benchmarks.

rickman

Jul 4, 2006, 11:46:14 PM
Anton Ertl wrote:
> Right. No carry bit there, neither in a PSW nor elsewhere. For our
> D+ example, this is not too bad (apart from the stack juggling):
>
> : D+ ( d1 d2 -- d )
> rot + >r tuck + swap over u> r> swap - ;

Ok, in the primitives I have available,

: D+ ( d1 d2 -- dsum )
rot + >R swap over + swap over sub drop 3 jmpc 0 nop 1 jmp 1 r> swap - ;

I count 18 operations at one byte and 1 clock cycle each. The 3 jmpc
and 1 jmp are a single clock each since the range is between 7 and -8.


Or with the carry it is

: D+ ( d1 d2 -- dsum )
r> swap >r ADD RADRS r> ;

I count 6 clocks and 6 bytes. So clearly double precision arithmetic
is greatly assisted by using a carry.

I'm not so sure about the subtract and compare instructions. I'll
have to take a closer look at them.

dke...@hotmail.com

Jul 5, 2006, 8:15:23 PM

rickman wrote:
---snip---

> One that rubs me the wrong way is the memory access for bytes. I use a
> PSW flag to determine if a given access will be a cell or a byte
> access. I used one opcode to toggle the flag which can be used in
> conjunction with direct PSW accesses. I just can't think of a better
> way unless I just ignore byte accesses.
>
Hi
If you keep track of data and instruction fetches as being different,
you can use full byte addressing, but instead of masking off
part of the word, you swap the bytes. Doing the masking when needed could
be another simple operation. For instructions, you only
fetch from even addresses. The odd address bit could be
used for special purposes, to flag things like clearing the PSW
on nesting.
Not sure if this makes much sense under your terms.
Dwight

rickman

Jul 5, 2006, 10:05:29 PM

It makes sense, but that is not really the issue. I don't have trouble
masking off the byte, the problem is knowing when to do a byte access
and when to do a word access. The instruction memory and data memory
are different, so that is not a concern.

The issue is instructions... I have one instruction to toggle the byte
access flag, which seems a bit crude. If I wanted to have separate
instructions for byte and cell accesses, as is typical of most
processors, I would need to add four more. Or I could just drop byte
accesses and free up an opcode.

dke...@hotmail.com

Jul 6, 2006, 7:42:20 PM

Hi
That was part of my point. Make all addressing in byte steps.
Make all fetches full width. You can either fetch the value
at an address that isn't word aligned by rotating the value such
that the desired byte is in the LSBs, or do two fetches and combine
( not as easy ). This way, you completely eliminate any special
instructions at the cost of needing to mask when only the byte part
is needed. This often isn't an issue, since one may need to mask
parts of the byte anyway.
By using the rotation method, the actual memory access is still
the full word value. It is just rotated, based on the address.
This changes how one would access instructions. For a 16 bit
machine, instruction fetches would only be done on even addresses.
This can be extended to any word width.
You then have no byte fetch instruction to worry about. Addresses
for instruction fetch could either be such that the A0 bit was used
for special purposes ( like call/jump ) or simply shift the instruction
address left before putting to the address bus and putting 0 at A0.
Both have advantages and disadvantages.
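The rotation scheme above can be sketched in Python. This is a minimal
model, assuming 16-bit words, little-endian byte placement, and memory
modeled as a list of words; the helper name is made up for
illustration:

```python
# Assumptions: 16-bit words, little-endian byte order, memory as a word list.
def fetch_rotated(mem, byte_addr):
    """Full-width fetch, rotated so the addressed byte lands in the LSBs."""
    word = mem[byte_addr >> 1]
    if byte_addr & 1:                      # odd address: swap the two bytes
        word = ((word >> 8) | (word << 8)) & 0xFFFF
    return word                            # mask with 0xFF if only the byte is wanted
```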
Dwight

rickman

Jul 6, 2006, 11:26:49 PM
dke...@hotmail.com wrote:

> rickman wrote:
> > It makes sense, but that is not really the issue. I don't have trouble
> > masking off the byte, the problem is knowing when to do a byte access
> > and when to do a word access. The instruction memory and data memory
> > are different, so that is not a concern.
> >
> > The issue is instructions... I have one instruction to toggle the byte
> > access flag which seems a bit crude. If I wanted to have separate
> > instructions for byte and cell accesses, as is typical of most
> > processors, I would need to add four more. Or I could just drop byte
> > accesses and free up an opcode.
>
> Hi
> That was part of my point. Make all addressing in byte steps.
> Make all fetches to be full width. You can either fetch the value
> at an address that isn't word aligned by rotating the value such
> that the desired byte is in the LSBs or do two fetches and combine
> > ( not as easy ). This way, you completely eliminate any special
> instructions at the cost of needing to mask when only the byte part
> is needed. This often isn't an issue since one may need to mask
> parts of the byte anyway.
> By using the rotation method, the actual memory access is still
> the full word value. It is just rotated, based on the address.

That works ok for reads, but what about writes?

> This changes how one would access instructions. For a 16 bit
> machine, instruction fetches would only be done on even addresses.
> This can be extended to any word width.
> You then have no byte fetch instruction to worry about. Addresses
> for instruction fetch could either be such that the A0 bit was used
> for special purposes ( like call/jump ) or simply shift the instruction
> address left before putting to the address bus and putting 0 at A0.
> Both have advantages and disadvantages.

The instructions *are* bytes and the instruction memory is separate and
byte wide. To use the lsb as an instruction would require a
restructuring of the opcodes. All immediate data, regardless of
whether it is address or data for the stack, is encoded as literal
instructions. Each one has the msb clear and shifts 7 bits into the
top of the return stack with sign extend (ms 7 bits first). The
literal operation ends when another instruction is executed. So the
literal has to be followed by another instruction anyway. Since the
lsb is required to address instructions as bytes, it can't be used as a
flag.
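The literal scheme just described can be sketched in Python. This is a
minimal model, assuming 16-bit cells and 8-bit opcodes with the msb
clear carrying 7 payload bits each, most significant 7 bits first,
sign-extended from the first opcode; the decoder name is illustrative:

```python
# Assumptions: 16-bit cells; literal opcodes have the msb clear, 7 payload bits.
def decode_literal(opcodes):
    """Accumulate consecutive literal opcodes into one sign-extended value."""
    value = 0
    first = True
    for op in opcodes:
        assert op & 0x80 == 0                      # literal opcodes: msb clear
        if first:
            # sign-extend the first 7-bit group from its top bit
            value = op - 0x80 if op & 0x40 else op
            first = False
        else:
            value = (value << 7) | (op & 0x7F)     # shift in 7 more bits
    return value & 0xFFFF
```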

Or am I missing something?

dke...@hotmail.com

Jul 7, 2006, 10:00:09 AM

rickman wrote:
> dke...@hotmail.com wrote:
> > rickman wrote:
> > > It makes sense, but that is not really the issue. I don't have trouble
> > > masking off the byte, the problem is knowing when to do a byte access
> > > and when to do a word access. The instruction memory and data memory
> > > are different, so that is not a concern.
> > >
> > > The issue is instructions... I have one instruction to toggle the byte
> > > access flag which seems a bit crude. If I wanted to have separate
> > > instructions for byte and cell accesses, as is typical of most
> > > processors, I would need to add four more. Or I could just drop byte
> > > accesses and free up an opcode.
> >
> > Hi
> > That was part of my point. Make all addressing in byte steps.
> > Make all fetches to be full width. You can either fetch the value
> > at an address that isn't word aligned by rotating the value such
> > that the desired byte is in the LSBs or do two fetches and combine
> > ( not as easy ). This way, you completely illiminate any special
> > instructions at the cost of needing to mask when only the byte part
> > is needed. This often isn't an issue since one may need to mask
> > parts of the byte anyway.
> > By using the rotation method, the actual memory access is still
> > the full word value. It is just rotated, based on the address.
>
> That works ok for reads, but what about writes?

Writes are not as handy. Depending on what is being done, one
could pack the two bytes before the write, do a read-modify-write,
or waste a byte of memory. Speed and the application would determine
which is best. Yes, it isn't as good as an instruction, but you are
looking for a way to not have a byte instruction and still handle bytes.
The advantage of the rotated fetch and store is that the masking
operation is the same, regardless of the address used. It is about
the easiest way to do word fetches and still have a sense of bytes.
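The read-modify-write byte store mentioned here can be sketched in
Python, under the same assumptions as the rotated fetch (16-bit words,
little-endian byte placement); the helper name is made up:

```python
# Assumptions: 16-bit words, little-endian byte placement, memory as a word list.
def store_byte_rmw(mem, byte_addr, value):
    """Byte store as read-modify-write on word-wide memory."""
    word = mem[byte_addr >> 1]
    if byte_addr & 1:
        word = (word & 0x00FF) | ((value & 0xFF) << 8)   # replace the high byte
    else:
        word = (word & 0xFF00) | (value & 0xFF)          # replace the low byte
    mem[byte_addr >> 1] = word
```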

>
> > This changes how one would access instructions. For a 16 bit
> > machine, instruction fetches would only be done on even addresses.
> > This can be extended to any word width.
> > You then have no byte fetch instruction to worry about. Addresses
> > for instruction fetch could either be such that the A0 bit was used
> > for special purposes ( like call/jump ) or simply shift the instruction
> > address left before putting to the address bus and putting 0 at A0.
> > Both have advantages and disadvantages.
>
> The instructions *are* bytes and the instruction memory is separate and
> byte wide. To use the lsb as an instruction would require a
> restructuring of the opcodes. All immediate data, regardless of
> whether it is address or data for the stack, is encoded as literal
> instructions. Each one has the msb clear and shifts 7 bits into the
> top of the return stack with sign extend (ms 7 bits first). The
> literal operation ends when another instruction is executed. So the
> literal has to be followed by another instruction anyway. Since the
> lsb is requried to address instructions as bytes, it can't be used as a
> flag.
>
> Or am I missing something?

OK, your instruction memory need not be the same as data.
Dwight
