RTX2000 optimization

Brad Eckert

unread,

Oct 22, 2012, 11:53:22 AM10/22/12

to

Hi All,

I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule.

Does anyone here have a feel for the correspondence between Forth source primitives and generated code?

This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.

Andrew Haley

unread,

Oct 22, 2012, 12:21:12 PM10/22/12

to

Brad Eckert <hwf...@gmail.com> wrote:
> Hi All,
>
> I've been thinking about Novix style processors like the
> RTX2000. There are many Forth sequences that can be compacted into
> one instruction, so with a good optimizer the chip can execute
> several Forth (source) primitives in one machine cycle. I suspect
> though that such optimization opportunities are the exception rather
> than the rule.

It happens a lot bacause many phrases are things like OVER + and
R> DROP . Also, ; pairs with just about everything. Novix had more
of these phrases than RTX2000 because of the way its encoding was
done.

Andrew.

visua...@rocketmail.com

unread,

Oct 22, 2012, 12:56:29 PM10/22/12

to

On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
> Hi All, I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule. Does anyone here have a feel for the correspondence between Forth source primitives and generated code? This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.

Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!

Mark Wills

unread,

Oct 22, 2012, 3:10:28 PM10/22/12

to

On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:
> Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words!

Hmmm... I'm not following you. What is the function/significance of
this special bit? Where is it set? In the call instruction or the
return instruction?

visua...@rocketmail.com

unread,

Oct 22, 2012, 3:55:11 PM10/22/12

to

On Monday, October 22, 2012 3:10:29 PM UTC-4, M.R.W Wills wrote:
> On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote: > Despite allowing several Forth commands inside one RTX2000 16 bit word a great advantage is the ability to set a special Return from Subroutine bit - that is a real speed accelerator - so a whole executable program may need only one RTX2000 16 bit word! Imagine what additional possibilities you can have with 32 bit words! Hmmm... I'm not following you. What is the function/significance of this special bit? Where is it set? In the call instruction or the return instruction?

There is no special return instruction. Return is achieved by one bit only:

For all instructions except Subroutine Calls or Branch instructions, bit 5 of the instruction code represents the Subroutine Return Bit. If this bit is set to 1, a Return is performed whereby the return address is popped from the Return Stack.

Source: Intersil HS-RTX2010RH Data Sheet March 2000 File Number 3961.3, p. 28
http://www.intersil.com/content/dam/Intersil/documents/fn39/fn3961.pdf

Subroutine Return Bit, ibid., p. 31, HARRIS RTX2000 ADVANCE INFORMATION, May 1988, p. 16

Coos Haak

unread,

Oct 22, 2012, 4:02:13 PM10/22/12

to

Op Mon, 22 Oct 2012 12:10:28 -0700 (PDT) schreef Mark Wills:

There is no return instruction, but any instruction may have its return bit
set. : SQR DUP + ; may be one instruction word existing of DUP, plus and
the return bit.

--
Coos

CHForth, 16 bit DOS applications
http://home.hccnet.nl/j.j.haak/forth.html

rickman

unread,

Oct 22, 2012, 4:50:33 PM10/22/12

to

<this post turned out to be a bit longer than I expected...>

Stack machines are inherently pretty simple because of the limited
nature of registers and the limited instruction set typically
implemented. I did one about ten years ago and found it really consists
of three execution units, all of which can operate in parallel.

1) The Instruction Unit contains the instruction memory with the address
generation (instruction fetch) as well as the decode (although the
decode could also be spread among the units).

2) The Address Unit contains the return stack with stack pointer and the
various logic that manipulates elements on the stack, such as
auto-increment of addresses and loop counter functions. The address for
main memory is provided by this unit in my machine.

3) The Data Unit contains the data stack with stack pointer, the ALU and
any other special logic for operations on the data stack. I also
included the memory interface in this unit even though the address comes
from the Address Unit.

Each of these three units is present in any dual stack CPU design. Any
given instruction may involve any combination of the three units or may
leave some idle. If instructions leave any unit idle it can be combined
with another instruction that uses just that unit in a compatible way.
For example, as others have indicated, any instruction that is not using
the Address Unit, e.g. ADD, SUB, etc. can execute an instruction that
only uses the Address Unit, e.g. RET, LOOP, etc. Both instructions have
to be using the Instruction Unit in a compatible way which is typically
not hard since most instructions are doing the equivalent of NEXT (or
NOP in other terms).

The only issue with such combining of instructions is the instruction
encoding. I encoded to minimize the amount of program space needed
which resulted in a minimum width instruction optimized per Koopman's
data for instruction frequency. I looked at separating the opcodes for
each unit in essence making the MISC equivalent of a VLIW processor, if
you can have such a thing... lol I couldn't quite squeeze the
instruction into a 9 bit word which is a memory width commonly available
in FPGAs. I am now looking at using an FPGA that only supports 16 bit
wide memory and don't like the idea of multiplexing the instructions,
that just adds a level of logic to the timing path. A wider instruction
might just minimize the decode logic and provide for more instruction
parallelism at the same time.

Another way of paralleling instructions is to just pick unused opcodes
and implement the most common parallel instructions. This won't improve
timing of the design and may worsen it a bit, but will provide for fewer
instructions in a program.

The obvious one, return as a separate bit, in parallel with everything,
can only be used with about half the instructions in my CPU design. The
return stack is used in a lot of them. But if the bit is free...

Info that would be VERY useful to me is frequency of use of instructions
in combination. This could be pulled from existing code by measuring
how often instructions are found adjacent to each other. This may not
be a perfect measure, but it would be a great start. Can anyone
generate a metric on this similar to Koopman's data on instruction use?

Rick

daveyrotten

unread,

Oct 22, 2012, 4:54:33 PM10/22/12

to

Are you sure you're not confusing machine cycles with address fetches? Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster. Maybe I'm not understanding what you mean by machine cycle. I'm a huge fan of James Bowman's J1, but I don't believe it packs instructions in memory at all. It simply executes them very fast thru the use of dual port memory. I think of the B16 and J1 as two sides of the same coin. The B16 is very memory efficient, but slower than the J1. The J1 is very fast, but not as memory efficient as the B16. The B16 is programmed directly in Forth, using about 32 built-in words. The J1 is similar but I believe has a few more possible Forth words as primitive instructions.

visua...@rocketmail.com

unread,

Oct 22, 2012, 5:14:19 PM10/22/12

to

On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote:
> Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster.

The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store.

daveyrotten

unread,

Oct 22, 2012, 5:26:19 PM10/22/12

to

Ok, sorry. I guess the compiler must pick instructions that have no interdependency to pack together. I'm not familiar with the RTX2000 itself.

visua...@rocketmail.com

unread,

Oct 22, 2012, 7:00:03 PM10/22/12

to

On Monday, October 22, 2012 5:26:20 PM UTC-4, daveyrotten wrote:
> On Monday, October 22, 2012 4:14:19 PM UTC-5, visua...@rocketmail.com wrote: > On Monday, October 22, 2012 4:54:33 PM UTC-4, daveyrotten wrote: > > > Paysan's B16 (taking a cue from Moore's work) packs something like 3 instructions in a single memory word. But each of those instructions still takes a complete processor clock cycle to execute (ie, they execute in sequence). So it saves memory space but doesn't really execute 3 times faster. > > > > The RTX2000 family of RISCs is able to execute several commands within one clock cycle. Only memory access needs two clock cycles: one to address memory, and one to fetch/store. Ok, sorry. I guess the compiler must pick instructions that have no interdependency to pack together. I'm not familiar with the RTX2000 itself.

I am. I did several designs, 1988-1995:
http://www.somersetweb.com/BruehlConsult/Projects/RTX2000-MINI.html
http://www.somersetweb.com/BruehlConsult/Projects/mc-RISC-EMUF.html
http://www.somersetweb.com/BruehlConsult/Projects/S5-4MB-RTX2000.html
http://www.somersetweb.com/BruehlConsult/Projects/S5-F-1MBd-RTX2000.html

Rod Pemberton

unread,

Oct 22, 2012, 9:51:24 PM10/22/12

to

"Brad Eckert" <hwf...@gmail.com> wrote in message
news:e1479cfa-a969-40ab...@googlegroups.com...

> I've been thinking about Novix style processors like the RTX2000.
> There are many Forth sequences that can be compacted into one
> instruction, so with a good optimizer the chip can execute several
> Forth (source) primitives in one machine cycle. I suspect though
> that such optimization opportunities are the exception rather than the
> rule.
>

The question for both you (and Rick) is if you create new, faster, more
powerful, multiple operation instructions, how do you ensure they are used?
Without an optimizer, it's likely the instruction will have a low
instruction frequency. I.e., a person is unlikely to use it. In which
case, there is no point in using or implementing it.
(This is repeated later in a reply to Rick.)

> Does anyone here have a feel for the correspondence between Forth
> source primitives and generated code?

Generally, Forth's built using "primitives" or low-level words generally
need 30 to 40 or so. I kept track of how many are needed for certain
Forths. There are a few posts by me to c.l.f. with counts and specific
words used.

Rod Pemberton

Rod Pemberton

unread,

Oct 22, 2012, 9:52:27 PM10/22/12

to

"rickman" <gnu...@gmail.com> wrote in message
news:k64bj1$m4n$1...@dont-email.me...

> On 10/22/2012 3:10 PM, Mark Wills wrote:
> > On Oct 22, 5:56 pm, visualfo...@rocketmail.com wrote:

> >> Despite allowing several Forth commands inside one RTX2000 16 bit
> >> word a great advantage is the ability to set a special Return from
> >> Subroutine bit - that is a real speed accelerator - so a whole
> >> executable program may need only one RTX2000 16 bit word!
> >> Imagine what additional possibilities you can have with 32 bit words!
> >
> > Hmmm... I'm not following you. What is the function/significance of
> > this special bit? Where is it set? In the call instruction or the
> > return instruction?
>
> <this post turned out to be a bit longer than I expected...>
>

I'm versed in (old) microprocessor design. So, that's easily taken care of:
[SNIP]

> The only issue with such combining of instructions is the instruction
> encoding. I encoded to minimize the amount of program space needed
> which resulted in a minimum width instruction optimized per Koopman's
> data for instruction frequency. I looked at separating the opcodes for
> each unit in essence making the MISC equivalent of a VLIW processor, if
> you can have such a thing... lol I couldn't quite squeeze the
> instruction into a 9 bit word which is a memory width commonly available

> in FPGAs. [...]

Forth's built using "primitives" generally need 30 to 40 or so, i.e.,
5-bits. Why do you need 9-bits (512)? I'd guess that you're encoding
things other than just the instruction, e.g., control-bits, offsets, modes,
etc.

Koopman's and Ertl's instruction frequency data is basically the same.
Using their data is a good choice though.

(This is also posted earlier in the thread:)
The question for both you (and Brad) is if you create new, faster, more

powerful, multiple operation instructions, how do you ensure they are used?
Without an optimizer, it's likely the instruction will have a low
instruction frequency. I.e., a person is unlikely to use it. In which
case, there is no point in using or implementing it.

> The obvious one, return as a separate bit, in parallel with everything,
> can only be used with about half the instructions in my CPU design. The
> return stack is used in a lot of them. But if the bit is free...

Years ago, there was a processor that used a few bits, like two, for
conditional execution of each instruction. I don't recall what it was, or
if it was a Forth processor. It might've been a bit-slice design...

> Info that would be VERY useful to me is frequency of use of instructions
> in combination. This could be pulled from existing code by measuring
> how often instructions are found adjacent to each other. This may not
> be a perfect measure, but it would be a great start. Can anyone
> generate a metric on this similar to Koopman's data on instruction use?

Anton Ertl also has instruction frequency data. I don't recall if he showed
combinations or not. I know he or someone created the concept of "super
operators" for Forth, which I think is what you're asking about. If he
doesn't respond, I'll attempt to locate for you what I previously found.

Rod Pemberton

visua...@rocketmail.com

unread,

Oct 22, 2012, 10:44:15 PM10/22/12

to

On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:

> Hi All, I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule. Does anyone here have a feel for the correspondence between Forth source primitives and generated code? This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.

The trick is to make the right use of the available bits.

The RTX2000 uses two important tricks to make it run fast:
First of all, one bit switches between two types of commands: general commands and subroutine calls. This bit could be bit zero. If bit zero is zero, the other bits form the address of the subroutine to be called. That's obvious, because you don't need odd addresses for subroutines. But the RTX2000 uses bit 15 and shifts the address to fit.
Second, the aforementioned trick with the return from subroutine bit.

These both accomplish speed first hand.

The remaining bits should be used to decode all Forth primitives which are needed. I am sure you don't need a frequency of calling, because the Forth OS itself needs all of them.

If there are enough bits, it will be possible to use these to run several primitives in one clock cycle - RISCs normally use only one clock cycle per command, the RTX2000 method allows up to four commands run per one clock cycle - it's some kind of Super-RISC. These commands running in one clock cycle should generate the next level of primitives. And of course there is the possibility to use some bits to switch between different kinds of decoding. The more bits, the more possibilities there are, and more commands can be made to run in one clock cycle, not necessarily in parallel.

Hugh Aguilar

unread,

Oct 23, 2012, 1:03:13 AM10/23/12

to

On Oct 22, 6:48 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
wrote:

> Years ago, there was a processor that used a few bits, like two, for
> conditional execution of each instruction. I don't recall what it was, or
> if it was a Forth processor. It might've been a bit-slice design...

Isn't that the way that the ARM works?

Andrew Haley

unread,

Oct 23, 2012, 4:19:41 AM10/23/12

to

Hugh Aguilar <hughag...@yahoo.com> wrote:
> On Oct 22, 6:48?pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>

> wrote:
>> Years ago, there was a processor that used a few bits, like two, for

>> conditional execution of each instruction. ?I don't recall what it was, or
>> if it was a Forth processor. ?It might've been a bit-slice design...

>
> Isn't that the way that the ARM works?

It was, but it's mostly been dropped in ARM 64 because "Benchmarking
shows that modern branch predictors work well enough that predicated
execution of instructions does not offer sufficient benefit to justify
its significant use of opcode space, and its implementation cost in
advanced implementations."

Andrew.

Stephen Pelc

unread,

Oct 23, 2012, 6:52:34 AM10/23/12

to

On Mon, 22 Oct 2012 08:53:22 -0700 (PDT), Brad Eckert
<hwf...@gmail.com> wrote:

>I've been thinking about Novix style processors like the RTX2000. There are=
> many Forth sequences that can be compacted into one instruction, so with a=
> good optimizer the chip can execute several Forth (source) primitives in o=
>ne machine cycle. I suspect though that such optimization opportunities are=

> the exception rather than the rule.

See:
http://www.complang.tuwien.ac.at/anton/euroforth/ef04/pelc-bailey04.pdf
This machine has been run in an FPGA.

http://www-users.cs.york.ac.uk/~chrisb/main-pages/fpga/fpga-research-interests

>Does anyone here have a feel for the correspondence between Forth source pr=
>imitives and generated code?

Many address generating phrases such as "dup 8 + @" can be collapsed
to single instructions. However, the result is not a Forth purist's
CPU.

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Anton Ertl

unread,

Oct 23, 2012, 8:44:15 AM10/23/12

to

rickman <gnu...@gmail.com> writes:
>Info that would be VERY useful to me is frequency of use of instructions
>in combination. This could be pulled from existing code by measuring
>how often instructions are found adjacent to each other. This may not
>be a perfect measure, but it would be a great start. Can anyone
>generate a metric on this similar to Koopman's data on instruction use?

http://www.complang.tuwien.ac.at/forth/peep/

in particular:

http://www.complang.tuwien.ac.at/forth/peep/sorted

There is also a later paper that shows different kinds of data:

http://www.complang.tuwien.ac.at/anton/euroforth/ef01/gregg01.pdf

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2012: http://www.euroforth.org/ef12/

van...@vsta.org

unread,

Oct 23, 2012, 1:54:46 PM10/23/12

to

Andrew Haley <andr...@littlepinkcloud.invalid> wrote:

> Hugh Aguilar <hughag...@yahoo.com> wrote:
>>> Years ago, there was a processor that used a few bits, like two, for
>>> conditional execution of each instruction.

>> Isn't that the way that the ARM works?
> It was, but it's mostly been dropped in ARM 64 because "Benchmarking
> shows that modern branch predictors work well enough that predicated
> execution of instructions does not offer sufficient benefit to justify
> its significant use of opcode space, and its implementation cost in
> advanced implementations."

FWIW, the Propeller CPU has conditional execution too. I did a fair amount
of hand-coding of assembly for it, and came away not liking its instruction
set very much at all. MIPS is my all-time favorite, but I'd even take 32-bit
x86 over Propeller.

--
Andy Valencia
Home page: http://www.vsta.org/andy/
To contact me: http://www.vsta.org/contact/andy.html

rickman

unread,

Oct 23, 2012, 5:19:01 PM10/23/12

to

I was constrained by the memories available in FPGAs. Many allow
somewhat flexible word widths of 1, 2, 4, 8, 9, 16 and 18 bits, some
even 32 and 36 bits. Multiplexing multiple instructions in one word has
a downside in that it requires extra levels of logic in the instruction
decode path which affects *all* instructions. I wanted to avoid that.

So I started with a 4 bit instruction and found that rather limiting,
mainly in the impact on performance since most code as around twice as
long as it could be with larger instructions. Literals (both data and
address) were especially problematic. Looking at Koopman's data it was
clear that anything which could optimize the address fields of calls and
other instructions, including immediate data would be a boon.

So I tried 8 bit words and used a variable bit with instruction with the
remaining bits as immediate data. This was combined with a data
extension scheme similar to that used by the Transputer. They used 4
bit instructions with 4 bits of immediate data which would be shifted
into larger words. Since this would be the most commonly used
instruction I gave it one bit with 7 bit immediate data. The first
invocation of a literal instruction pushes the top of return stack with
the 7 bit data, sign extended. Each subsequent invocation of the
literal instruction shifts 7 more bits into the top of return stack.
Calls and Jumps have a four field which is combined with the top of
return stack if a literal has been pushed, or just sign extended if not.

There remains some 16 opcodes for the various instructions for
manipulating data. The 8 bit machine was used in one design.

I considered a 9 bit version to fully utilize the block RAM in most
FPGAs. The immediate data fields were extended by one bit which I think
was significant for jumps and calls, 5 bits vs. 4). In the case of
general opcodes the extra bit could be used to provide 32 instructions
rather than 16, but I didn't feel this gave much benefit and complicated
the instruction decode which was already more complex than I preferred.
Another alternative was to use the extra bit to flag a combined Return
instruction. I found it could only be used with about half the opcodes
I was using because of conflicts.

> Koopman's and Ertl's instruction frequency data is basically the same.
> Using their data is a good choice though.

It is a LOT better than no data at all which is what I have otherwise.

> (This is also posted earlier in the thread:)
> The question for both you (and Brad) is if you create new, faster, more
> powerful, multiple operation instructions, how do you ensure they are used?
> Without an optimizer, it's likely the instruction will have a low
> instruction frequency. I.e., a person is unlikely to use it. In which
> case, there is no point in using or implementing it.

Who is this "person"? My design was for me and if I generated enough
code to analyze statistically, I would pick the instructions to combine
from analyzing my code. It's not like I am selling this design for
others to use... not that I wouldn't mind sharing, but you have to read
my mind for much of the details. One person asked and I gave him my
block diagram with labeled control points and my opcode cheat sheet. He
couldn't make heads or tails out of it... lol

>> The obvious one, return as a separate bit, in parallel with everything,
>> can only be used with about half the instructions in my CPU design. The
>> return stack is used in a lot of them. But if the bit is free...
>
> Years ago, there was a processor that used a few bits, like two, for
> conditional execution of each instruction. I don't recall what it was, or
> if it was a Forth processor. It might've been a bit-slice design...

I've heard of that as well as other "unique" features. An ancient
Univac machine had a bit in the address field that flagged indirect.
The address fetched had the same bit... they had to add a indirect
counter to get out of the infinite loops that could happen.

>> Info that would be VERY useful to me is frequency of use of instructions
>> in combination. This could be pulled from existing code by measuring
>> how often instructions are found adjacent to each other. This may not
>> be a perfect measure, but it would be a great start. Can anyone
>> generate a metric on this similar to Koopman's data on instruction use?
>
> Anton Ertl also has instruction frequency data. I don't recall if he showed
> combinations or not. I know he or someone created the concept of "super
> operators" for Forth, which I think is what you're asking about. If he
> doesn't respond, I'll attempt to locate for you what I previously found.

That would be greatly interesting. I'm surprised I didn't notice this
before. I wish there was some market for a machine like this, but then
others would have done this before me. I know Bernd would have been all
over this years ago if the market existed as well as others.

It seems that if the CPU isn't pipelined and blazing fast, it isn't
interesting to most FPGA users. They prefer very high performance CPUs
that access MBs of external memory and use 1000's of LUTs. My design
has an extensible word size, but will likely never have a C compiler for
it.

That reminds me of the ZPU. It is a stack machine designed to run C.
It was also designed to be as tiny as possible in the minimal
configuration with other versions running faster but using more
resources. The ported the gcc tools for it. I think the minimal
version is slightly smaller than my design, but very slow, maybe 10x...
or would that be 10/?

Rick

rickman

unread,

Oct 23, 2012, 5:32:14 PM10/23/12

to

On 10/23/2012 8:44 AM, Anton Ertl wrote:
> rickman<gnu...@gmail.com> writes:
>> Info that would be VERY useful to me is frequency of use of instructions
>> in combination. This could be pulled from existing code by measuring
>> how often instructions are found adjacent to each other. This may not
>> be a perfect measure, but it would be a great start. Can anyone
>> generate a metric on this similar to Koopman's data on instruction use?
>
> http://www.complang.tuwien.ac.at/forth/peep/
>
> in particular:
>
> http://www.complang.tuwien.ac.at/forth/peep/sorted
>
> There is also a later paper that shows different kinds of data:
>
> http://www.complang.tuwien.ac.at/anton/euroforth/ef01/gregg01.pdf
>
> - anton

Thank you. I'll take a look at this data.

Rick

rickman

unread,

Oct 23, 2012, 5:36:45 PM10/23/12

to

On 10/22/2012 9:51 PM, Rod Pemberton wrote:
> "Brad Eckert"<hwf...@gmail.com> wrote in message
> news:e1479cfa-a969-40ab...@googlegroups.com...
>> I've been thinking about Novix style processors like the RTX2000.
>> There are many Forth sequences that can be compacted into one
>> instruction, so with a good optimizer the chip can execute several
>> Forth (source) primitives in one machine cycle. I suspect though
>> that such optimization opportunities are the exception rather than the
>> rule.
>>
>
> The question for both you (and Rick) is if you create new, faster, more
> powerful, multiple operation instructions, how do you ensure they are used?
> Without an optimizer, it's likely the instruction will have a low
> instruction frequency. I.e., a person is unlikely to use it. In which
> case, there is no point in using or implementing it.
> (This is repeated later in a reply to Rick.)

We aren't talking about Forth coding really. We are talking about the
assembly language for a machine. I don't think instructions will go
unused just because they are mapped to Forth in a more complicated way
than 1 to 1 (or 1/2 to 1).

>> Does anyone here have a feel for the correspondence between Forth
>> source primitives and generated code?
>
> Generally, Forth's built using "primitives" or low-level words generally
> need 30 to 40 or so. I kept track of how many are needed for certain
> Forths. There are a few posts by me to c.l.f. with counts and specific
> words used.

Don't confuse Forth low level primitives (which are really HLL
primitives selected to be convenient for the programmer writing a Forth)
and assembly language which has to be selected in part based on what is
practical and efficient to implement. Chuck's machine only uses 32
opcodes and you can get by with as few as 16.

Rick

rickman

unread,

Oct 23, 2012, 5:47:25 PM10/23/12

to

On 10/22/2012 10:44 PM, visua...@rocketmail.com wrote:
> On Monday, October 22, 2012 11:53:24 AM UTC-4, Brad Eckert wrote:
>> Hi All, I've been thinking about Novix style processors like the RTX2000. There are many Forth sequences that can be compacted into one instruction, so with a good optimizer the chip can execute several Forth (source) primitives in one machine cycle. I suspect though that such optimization opportunities are the exception rather than the rule. Does anyone here have a feel for the correspondence between Forth source primitives and generated code? This kind of architecture is pretty good if you can get a wide instruction from code space every clock. You have calls and you have everything else, where everything else is a kind of compact VLIW. Implementation can be simple, as shown by James Bowman's J1.
>
> The trick is to make the right use of the available bits.
>
> The RTX2000 uses two important tricks to make it run fast:
> First of all, one bit switches between two types of commands: general commands and subroutine calls. This bit could be bit zero. If bit zero is zero, the other bits form the address of the subroutine to be called. That's obvious, because you don't need odd addresses for subroutines. But the RTX2000 uses bit 15 and shifts the address to fit.
> Second, the aforementioned trick with the return from subroutine bit.
>
> These both accomplish speed first hand.
>
> The remaining bits should be used to decode all Forth primitives which are needed. I am sure you don't need a frequency of calling, because the Forth OS itself needs all of them.

With a 16 bit or larger instruction, you have enough bits to "encode"
each execution unit in a dual stack machine separately. Then you don't
need a separate bit for the return operation. It can't be used in
parallel with any other Instruction Unit operation or a Return Stack
operation since both of these are used to do a return. It can only be
used in parallel with a purely Data Stack operation.

The next time I look at a design on an FPGA I will take a look at a 16
or 18 bit instruction word that encodes the execution units operations
separately. Part of the problem is that 16 is too many! What to do
with the remainder?

> If there are enough bits, it will be possible to use these to run several primitives in one clock cycle - RISCs normally use only one clock cycle per command, the RTX2000 method allows up to four commands run per one clock cycle - it's some kind of Super-RISC. These commands running in one clock cycle should generate the next level of primitives. And of course there is the possibility to use some bits to switch between different kinds of decoding.. The more bits, the more possibilities there are, and more commands can be made to run in one clock cycle, not necessarily in parallel.

Primitives can only be run together if they don't conflict. That's why
I want to look at which primitives occur together in the code.

Rick

visua...@rocketmail.com

unread,

Oct 23, 2012, 7:00:55 PM10/23/12

to

On Tuesday, October 23, 2012 5:47:32 PM UTC-4, rickman wrote:

> > On 10/22/2012 10:44 PM, visualforth.com wrote:
> > The RTX2000 uses two important tricks to make it run fast:

> > Second, the aforementioned trick with the return from subroutine bit.

> With a 16 bit or larger instruction, you have enough bits to "encode" each
> execution unit in a dual stack machine separately. Then you don't need a
> separate bit for the return operation. It can't be used in parallel with any
> other Instruction Unit operation or a Return Stack operation since both of
> these are used to do a return. It can only be used in parallel with a purely
> Data Stack operation.

If you say so ....

rickman

unread,

Oct 23, 2012, 9:07:11 PM10/23/12

to

I don't understand. Are you agreeing or passively saying you don't agree?

Rick

visua...@rocketmail.com

unread,

Oct 23, 2012, 9:44:49 PM10/23/12

to

On Tuesday, October 23, 2012 9:07:21 PM UTC-4, rickman wrote:
> On 10/23/2012 7:00 PM, visualforth.com wrote: > On Tuesday, October 23, 2012 5:47:32 PM UTC-4, rickman wrote: >>> On 10/22/2012 10:44 PM, visualforth.com wrote: >>> The RTX2000 uses two important tricks to make it run fast: >>> Second, the aforementioned trick with the return from subroutine bit. > >> With a 16 bit or larger instruction, you have enough bits to "encode" each >> execution unit in a dual stack machine separately. Then you don't need a >> separate bit for the return operation. It can't be used in parallel with any >> other Instruction Unit operation or a Return Stack operation since both of >> these are used to do a return. It can only be used in parallel with a purely >> Data Stack operation. > > If you say so ....

> I don't understand. Are you agreeing or passively saying you don't agree?
> Rick

I have to confess: communication is not easy.
I wrote "If you say so ..." to address that you have your own opinion in this matter.

My point of view is - I forgot to mention - that a special return bit closes a definition, and instead of having a separate return instruction which needs one clock cycle more, the next address is the start address of the following word. When the RTX2000 was introduced, IIRC any other microprocessor had a special return instruction, which needed a full memory access and normally a lot of clock cycles. That's why back then the main stream teaching was: "avoid subroutines". The structure of Forth is based on "subroutines" - to accelerate that process, a special functionality has been needed: the subroutine bit.
Now, fifteen years later, this may be taken for granted.

Rod Pemberton

unread,

Oct 24, 2012, 6:56:18 AM10/24/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k671kb$1uc$1...@dont-email.me...

> On 10/22/2012 9:52 PM, Rod Pemberton wrote:
> > "rickman"<gnu...@gmail.com> wrote in message
> > news:k64bj1$m4n$1...@dont-email.me...

...

> >> Info that would be VERY useful to me is frequency of use of
> >> instructions in combination. This could be pulled from existing
> >> code by measuring how often instructions are found adjacent to
> >> each other. This may not be a perfect measure, but it would be a
> >> great start. Can anyone generate a metric on this similar to
> >> Koopman's data on instruction use?
> >
> > Anton Ertl also has instruction frequency data. I don't recall if he
> > showed combinations or not. I know he or someone created the concept

> > of "superoperators" for Forth, which I think is what you're asking

> > about. If he doesn't respond, I'll attempt to locate for you what I
> > previously found.
>
> That would be greatly interesting.

Anton Ertl provided links to instruction frequencies. It lists combinations
of instructions also.

The pdf he posted uses the term "superinstruction". It also mentions
another paper by Proebstring using the term "superoperators".

Todd Proebstring "Optimizing an ANSI C Interpreter with Superoperators"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.4382&rep=rep1&type=pdf

> It seems that if the CPU isn't pipelined and blazing fast, it isn't
> interesting to most FPGA users.

Without pipelining, I doubt it's of interest to anyone. The 6502 proved how
effective pipelining in a microprocessor is in increasing performance.

> That reminds me of the ZPU. It is a stack machine designed to run C.
>

The name seemed familiar. But, I didn't recall it.

It seems you mentioned this circa 2008 in c.l.f. in a thread in which I
responded.

Rod Pemberton

Rod Pemberton

unread,

Oct 24, 2012, 8:47:20 AM10/24/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k672lj$82o$1...@dont-email.me...

For temporary reasons, most of my Forth stack operators, like DUP SWAP etc,
aren't currently low-level Forth "primitives". They're implemented in
high-level Forth using an even lower set of actual stack "primitives", which
I'll call sub-operators for this thread. These sub-operators could be
considered to be "stack assembly instructions" for my Forth. Currently,
only >R and R> are actually "primitives" coded in C. Eventually, DUP DROP
SWAP OVER will be actual low-level Forth "primitives", as they once were.

So, a word like OVER can be coded in many ways. I have 22 different
definitions just for OVER in a list, and one can construct many more. E.g.,

: OVER >R DUP R> SWAP ;
: OVER 1 PICK ;
: OVER SWAP TUCK ;
: OVER SWAP DUP -ROT ;
: OVER NUP SWAP ;
etc.

Which definition you choose affects how fast OVER executes and depends on
what operations you have available and on how fast each of those operations
are.

E.g., if >R >R DUP SWAP and -ROT are all very fast machine instructions for
your processor, then ">R DUP R> SWAP" and "SWAP DUP -ROT" should be fast
sequences. And, one sequence will be faster than the other, depending on
how fast each instruction is. But, if -ROT is not a machine instruction,
e.g., perhaps coded as "ROT ROT" or "SWAP >R SWAP R>", or -ROT is a very
slow machine instruction, then "SWAP DUP -ROT" is more expensive than ">R
DUP R> SWAP".

The instructions also need to be balanced acrossed many Forth words.
Originally, I had the 2xxx series words defined in terms of the simpler
stack operators, like SWAP DUP etc. When I converted to sub-operators, the
number of operations per definition dropped dramatically for some of the
2xxx definitions. Since the sub-operators are primitives, some of the 2xxx
definitions became faster. However, after converting all of the simpler
stack operators to sub-operators, a few of the non-primitive simple stack
operators ended up with more operations per definition. So some words
became much faster, while others became slightly slower. And, the former
primitives became real slow, but they'll be converted back eventually.

Let's take a look at 2OVER for my Forth interpreter. I could easily
implement it as a Forth low-level "primitive". Or, I could define it in
high-level Forth. Or, I could define it in high-level Forth in terms of
sub-operators which are actually the low-level "primitives".

In high-level Forth, 2OVER can be defined:

: 2OVER 2>R 2DUP 2R> 2SWAP ;

In terms of my sub-operators, my 2OVER definition has ten words it's
definition. Clearly, that's many more words than the four in definition
above. But, 2>R 2R> 2DUP and 2SWAP are also implemented in high-level
Forth using the same set of sub-operators. So, 2>R 2R> 2DUP and 2SWAP
aren't "primitives" nor are they a single sub-operator each.

The 2>R sequence has six items.
The 2DUP sequence has six items.
The 2R> sequence has six items.
The 2SWAP sequence has eight items.

The 2OVER sequence has ten items.

So, 2OVER is only 10 items using a single sequence of sub-operators instead
of a total 28 items using four sequences of sub-operators for the four words
in the high-level definition. If the 2>R 2DUP 2R> and 2SWAP words are
defined in terms of standard Forth words:

The 2>R sequence has three items.
The 2DUP sequence has ten items over multiple words.
The 2R> sequence has three items.
The 2SWAP sequence has twelve items over multiple words.

In this case, the counts changed, but it just happens that it's total is 28
also... Usually, it's more. Of course, you'd rather have a 2DUP of six
items instead of ten, i.e., balance. If the instructions are unbalanced,
then heavy use of a single Forth word will slow the code speed way down.
I.e., many 2DUP's of ten items is much worse than many 2DUP's with six
items, even if other used words are made slightly slower, like 2>R and 2R>.
Of course, you don't know if your user's code will follow the measured
instruction frequencies or not. But, at this point, you're the only user
...

As a primitive, 2OVER will be a small C routine which is compiled to
optimized, machine code.

Rod Pemberton

Anton Ertl

unread,

Oct 24, 2012, 8:26:29 AM10/24/12

to

"Rod Pemberton" <do_no...@notemailnotz.cnm> writes:
>The pdf he posted uses the term "superinstruction". It also mentions
>another paper by Proebstring using the term "superoperators".
>
>Todd Proebstring "Optimizing an ANSI C Interpreter with Superoperators"
>http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.4382&rep=rep1&type=pdf

Superinstructions combine adjacent VM instructions in a sequential
representation of a program. Superoperators combine adjacent
operators in a tree representation. E.g., consider a C statement like

a[i+3] = b[j]

with a tree representation (shown as S-expression) like

(c!
(+
(var a)
(+
(var i)
(const 3)))
(c@
(+
(var b)
(var j))))

and s sequential representation like

var_b var_j + c@ var_a var_i 3 + + c!

Then you can combine a tree pattern like (c! * (c@ *)) into a
superoperator (say c@!, resulting in the following tree:

(c@!
(+
(var a)
(+
(var i)
(const 3)))
(+
(var b)
(var j)))

You cannot do that with superinstructions.

And you can combine the sequence

c@ var_*

into a superinstruction c@var_*, resulting in the following sequence:

var_b var_j + c@var_a var_i 3 + + c!

You cannot do that with superoperators.

Helmut Eller compared superoperators with superinstructions in his
diploma thesis
<http://www.complang.tuwien.ac.at/Diplomarbeiten/eller05.ps.gz> and
found superinstructions to be slightly better.

Alex McDonald

unread,

Oct 24, 2012, 9:36:28 AM10/24/12

to

On Oct 24, 1:43 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
wrote:

> "rickman" <gnu...@gmail.com> wrote in message
>
> news:k672lj$82o$1...@dont-email.me...
>
>
>
>
>
>
>
>
>
> > On 10/22/2012 9:51 PM, Rod Pemberton wrote:

> > > "Brad Eckert"<hwfw...@gmail.com> wrote in message

I believe the minimal set of primitives out of which all stack
juggling words can be built is DUP DROP SWAP >R R>.

: over >r dup r> swap ;
: nip swap drop ;
: tuck swap over ;
: rot >r swap r> swap ;
: -rot swap >r swap r> ;
: 2swap rot >r rot r> ;
: 2dup over over ;
: 2over 2>r 2dup 2r> 2swap ;
: 2drop drop drop ;
: 2nip 2swap 2drop ;
: 2rot 2>r 2swap 2r> 2swap ;
: r@ r> dup >r ;
: 2>r swap >r >r ;
: 2r> r> r> swap ;
: 2r@ 2r> 2dup 2>r ;

In your "sub-operator" notation, such sequences can be greatly
simplified. Assuming an addressable stack (that is, we don't need to
POP to get entries off the stack to access them, such as on the x86)
there are 6 of these operations, 3 for each stack. SGET S[n] and RGET
R[n] fetch entries from the stacks by fixed offset from a stack
pointer, SPUT S[n] and RPUT R[n] store entries on the stacks, and SPTR
+ n and RPTR+ n adjust the stack pointers. In the example below, only
the S operators are shown, since the rstack operations and other
juggling have been optimised away. [reg] is a virtual register.

: 2OVER 2>R 2DUP 2R> 2SWAP ;

block ( S: 4 -- 6 R: 0 -- 0 )
[1] block-begin [2]
[5] sget S[2]
[4] sget S[3]
[5] sput S[-2]
[4] sput S[-1]
[3] sptr+ -2
[2] block-end [1]

Hugh Aguilar

unread,

Oct 24, 2012, 11:16:22 AM10/24/12

to

On Oct 23, 10:54 am, van...@vsta.org wrote:
> Andrew Haley <andre...@littlepinkcloud.invalid> wrote:

I looked at the Propeller advertisements, and the seemed pretty cool
to have 8 processors on a board working together. I've been intending
to get one. What was the problem with the instruction set? Your
comment about the 32-bit x86 seems to imply that the problem with the
propeller is register starvation, as that is certainly the problem
with the 32-bit x86.

The MIPS has a lot of registers and an orthogonal instruction set. It
seems to be not very popular compared to the ARM though. I recently
bought a Microstik-II from MicroChip. It has both the PIC24 and the
PIC32. I am programming the PIC24 which will be the first target for
Straight Forth. The PIC32 (which is a MIPS) seems like an obvious
second target. I've programmed the PIC16 and PIC24 in the past, and
I'm pretty impressed with MicroChip development software and their
technical support. What is it that you like about the MIPS? Are you
programming the PIC32? Isn't it true that the MIPS was always an
academic exercise in the past for VHDL, but the PIC32 is the only
commercial product featuring it?

Brad Eckert

unread,

Oct 24, 2012, 12:55:15 PM10/24/12

to

On Tuesday, October 23, 2012 2:47:32 PM UTC-7, rickman wrote:
>
> The next time I look at a design on an FPGA I will take a look at a 16
> or 18 bit instruction word that encodes the execution units operations
> separately. Part of the problem is that 16 is too many! What to do
> with the remainder?
>

Look at the J1, which weighs in at 200 lines of Verilog. There isn't much instruction decoding, it's mostly small bit fields controlling different muxes.

Paul Rubin

unread,

Oct 24, 2012, 1:04:57 PM10/24/12

to

Brad Eckert <hwf...@gmail.com> writes:
> Look at the J1, which weighs in at 200 lines of Verilog. There isn't
> much instruction decoding, it's mostly small bit fields controlling
> different muxes.

I wonder how the code density for these tiny Forth cores compares with
the typical C target.

Andy Valencia

unread,

Oct 24, 2012, 1:58:57 PM10/24/12

to

> I looked at the Propeller advertisements, and the seemed pretty cool
> to have 8 processors on a board working together. I've been intending
> to get one. What was the problem with the instruction set? Your
> comment about the 32-bit x86 seems to imply that the problem with the
> propeller is register starvation, as that is certainly the problem
> with the 32-bit x86.

It feels like one of those calculator pseudo-assembly type of
instruction sets, application specific and designed around the
specific target hardware. But no, number of registers itself was
not the problem.

> What is it that you like about the MIPS?

Very clean instruction set designed by people who had clearly done compiler
work, hand coded assembly, and OS work. This was true from the math through
the memory accesses and addressing modes all the way down into the design of
exceptions and interrupts.

Andy

Brad Eckert

unread,

Oct 24, 2012, 3:00:29 PM10/24/12

to

On Wednesday, October 24, 2012 10:04:58 AM UTC-7, Paul Rubin wrote:
>
> I wonder how the code density for these tiny Forth cores compares with
> the typical C target.

The J1 was used because the Microblaze CPU was too big, too slow, and the C application (for an IP camera) was too big to fit in available memory. The Forth version was 62% smaller: 6349 vs 16380 bytes. See http://excamera.com/sphinx/fpga-j1.html

rickman

unread,

Oct 24, 2012, 3:25:45 PM10/24/12

to

On 10/24/2012 6:56 AM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message

>> It seems that if the CPU isn't pipelined and blazing fast, it isn't
>> interesting to most FPGA users.
>
> Without pipelining, I doubt it's of interest to anyone. The 6502 proved how
> effective pipelining in a microprocessor is in increasing performance.

I don't follow that. There are lots of MCUs that aren't pipelined. The
need for speed is a spectrum. Often there are other things that are
more important requirements, like size. How many LUTs is the Leon
processor, 5000? Fast, but large.

Rick

rickman

unread,

Oct 24, 2012, 3:33:51 PM10/24/12

to

On 10/24/2012 9:36 AM, Alex McDonald wrote:
> On Oct 24, 1:43 pm, "Rod Pemberton"<do_not_h...@notemailnotz.cnm>
> wrote:
>> "rickman"<gnu...@gmail.com> wrote in message
>>
>> news:k672lj$82o$1...@dont-email.me...
>>

>>> We aren't talking about Forth coding really. We are talking about the
>>> assembly language for a machine. I don't think instructions will go
>>> unused just because they are mapped to Forth in a more complicated way
>>> than 1 to 1 (or 1/2 to 1).
>>

Interesting. Has anyone noticed that Chuck's GA144 does not include
SWAP as a primitive? I find it very odd, but workable. In fact, the
absence of this operator led me to discover that - . + - performs a
subtraction without adding the 1 constant!

(for those of you who aren't familiar with the F18A assembly language, +
is an add, . is a nop to let the carry propagate and - is a 1's
complement, not a 2's complement. )

It seems that in a lot of cases when you think you want a swap, you can
do very well without it!

Rick

rickman

unread,

Oct 24, 2012, 3:44:30 PM10/24/12

to

Coded in 'C'? Perhaps I missed something. I was talking about a Forth
like CPU. Are you referring to a Forth compiler on a PC?

> So, a word like OVER can be coded in many ways. I have 22 different
> definitions just for OVER in a list, and one can construct many more. E.g.,
>
> : OVER>R DUP R> SWAP ;
> : OVER 1 PICK ;
> : OVER SWAP TUCK ;
> : OVER SWAP DUP -ROT ;
> : OVER NUP SWAP ;
> etc.

Yes it can, but why? Over is a very simple construct for either a Forth
like CPU or a Forth compiler.

> Which definition you choose affects how fast OVER executes and depends on
> what operations you have available and on how fast each of those operations
> are.
>
> E.g., if>R>R DUP SWAP and -ROT are all very fast machine instructions for
> your processor, then ">R DUP R> SWAP" and "SWAP DUP -ROT" should be fast
> sequences. And, one sequence will be faster than the other, depending on
> how fast each instruction is. But, if -ROT is not a machine instruction,
> e.g., perhaps coded as "ROT ROT" or "SWAP>R SWAP R>", or -ROT is a very
> slow machine instruction, then "SWAP DUP -ROT" is more expensive than ">R
> DUP R> SWAP".

I don't follow that a four instruction sequence should be "fast" in any
real sense. I don't follow at all why you would have -ROT as a
primitive and not OVER.

> The instructions also need to be balanced acrossed many Forth words.
> Originally, I had the 2xxx series words defined in terms of the simpler
> stack operators, like SWAP DUP etc. When I converted to sub-operators, the
> number of operations per definition dropped dramatically for some of the
> 2xxx definitions. Since the sub-operators are primitives, some of the 2xxx
> definitions became faster. However, after converting all of the simpler
> stack operators to sub-operators, a few of the non-primitive simple stack
> operators ended up with more operations per definition. So some words
> became much faster, while others became slightly slower. And, the former
> primitives became real slow, but they'll be converted back eventually.

Not sure why you are looking at this. If you want to optimize the
implementation, why not look at what is used most often and start by
optimizing those operators? That is why I went to Koopman's book. Not
much return on optimizing things that aren't used so much.

> ....

>
> As a primitive, 2OVER will be a small C routine which is compiled to
> optimized, machine code.

Ok, so this is running on some other computer.

Rick

rickman

unread,

Oct 24, 2012, 4:03:46 PM10/24/12

to

This is all intended to be for discussion. It ended up being a bit
long. Please don't take it as anything other than friendly discussion
and please feel free to disagree with me.

Yeah. You can talk about subroutines being slow or bad or whatever.
But they get used. I'm not sure what your point is about this. I don't
ever remember being taught to "avoid subroutines". In fact, I was
taught just the opposite in a structured programming class. They talked
about the advantages of subroutines, like information hiding, decision
hiding, module reuse, etc... Forth carries that to the extreme for
sure, so optimizing the subroutine linkage is useful.

But oddly enough, if you want to optimize returns, the return "bit"
doesn't work as well as you might think for a number of reasons. As I
said before, only about half the instructions can be combined withe the
return function. In Forth code it can be hard to use this bit. If your
last word before the ; is an ELSE, what instruction do you combine the
return with? If the last word before the ; is another definition and
not a primitive you can't combine a return with a call... or can you?

Both of these situations allow optimization in a different way. The
last definition call of a word definition can be changed from a call to
a jump, this is called "tail recursion", IIRC. The ELSE before a ; just
means the tail recursion is performed for both cases of the IF.

Words like LOOP or R> can't be combined with the return because they are
using the return stack.

Don't forget there is a cost to that return "bit". It is a bit in the
instruction word that might be put to other, more productive use. The
Univac 1108 had a cascaded indirect bit that has never been used in any
machine since because all of the address bits were better used as...
address bits. The return bit may be good in some ways, but can't be
used in many cases.

Also, don't overrate the savings of the return "bit". It saves an
instruction access, but in my machine it is only a single clock cycle.
It is only in standard CPUs that a subroutine uses multiple clock cycles.

Rick

rickman

unread,

Oct 24, 2012, 4:10:14 PM10/24/12

to

I compared my code to a RISC instruction set once, don't recall which
one. They had 16 bit instructions and the code did a simple memory copy
loop. The RAM sizes were the same. Mine had twice as many instructions
but the RISC instructions were twice as large. If the RISC were large
enough it might do each instruction in one clock cycle. But my entire
CPU would probably be smaller than the register file on the RISC.

Remember that this CPU design is optimized for implementing Forth and I
was comparing assembly to assembly. Who knows how it would do with C
vs. Forth on appropriate CPUs.

I do believe that MISC is the way to go for many applications. But
CISC/RISC is firmly entrenched.

Rick

Paul Rubin

unread,

Oct 24, 2012, 4:15:38 PM10/24/12

to

rickman <gnu...@gmail.com> writes:
> Don't forget there is a cost to that return "bit". It is a bit in the
> instruction word that might be put to other, more productive use.

I could imagine an instruction set packed into a bit stream using
Huffman codes. It would use more decoding hardware but the programs
would be smaller. It's possible that "return" would end up as a single
bit, if it was much more common than any other instruction. More likely
it would be 2 or 3 bits.

Paul Rubin

unread,

Oct 24, 2012, 4:21:49 PM10/24/12

to

rickman <gnu...@gmail.com> writes:
> I do believe that MISC is the way to go for many applications. But
> CISC/RISC is firmly entrenched.

Looking at the F18 die photo, most of the F18 area is memory arrays and
the stacks. The ALU is maybe 10% of it. If they went to 6-bit
instructions instead of 5-bit, they could add a bunch more primitives,
more registers, a few built-in constants, etc. Perhaps the ALU would
become 2x or 3x larger, but there might well be significant improvements
in code size and performance to make up for it.

rickman

unread,

Oct 24, 2012, 4:25:19 PM10/24/12

to

That is essentially what I did with my instruction encoding. I had to
work with a fixed instruction size, so I used the extra bits for data
(address mostly). In an 8 bit instruction a leading 0 meant the rest of
the instruction was literal data. A leading 1 followed by the next
three bits said it was some form of call (rel or abs) or jump (several
different conditions) with four bits of immediate address to be appended
to any literal previously loaded. The remaining 16 instructions were
direct opcodes of which return was one, a full 8 bits.

Take a look at Forth code sometime and consider how often tail recursion
can be used and you will see that the "return" instruction is not so
often used. In addition the remaining cases often can't be combined
with the return bit.

At least that is what I found when I tried it.

Rick

Rod Pemberton

unread,

Oct 25, 2012, 12:30:21 AM10/25/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k69fbv$a6n$1...@dont-email.me...

Are we talking about the same thing?

http://en.wikipedia.org/wiki/Instruction_pipeline
http://en.wikipedia.org/wiki/Pipelining

AFAIK, the 6502 was the first microprocessor with pipelining. And, the Cray
CDC 6600 had the first central processing unit with it.

If we're talking about the same thing and a microprocessor design isn't
using pipelining, how do you get any acceptable level of performance? Do
you just use faster logic and clock?

Rod Pemberton

Rod Pemberton

unread,

Oct 25, 2012, 12:42:35 AM10/25/12

to

"Alex McDonald" <bl...@rivadpm.com> wrote in message
news:172e6a33-bc52-4fe9...@10g2000vbu.googlegroups.com...

IIRC, you're the guy who says I snip too much. Well, it's all there.
Reformatted. I don't know who is going to read through all that or
enjoy scrolling for six pages just to get to this, which is almost
another topic ...

> I believe the minimal set of primitives out of which all stack
> juggling words can be built is DUP DROP SWAP >R R>.
>
> : over >r dup r> swap ;
> : nip swap drop ;
> : tuck swap over ;
> : rot >r swap r> swap ;
> : -rot swap >r swap r> ;
> : 2swap rot >r rot r> ;
> : 2dup over over ;
> : 2over 2>r 2dup 2r> 2swap ;
> : 2drop drop drop ;
> : 2nip 2swap 2drop ;
> : 2rot 2>r 2swap 2r> 2swap ;
> : r@ r> dup >r ;
> : 2>r swap >r >r ;
> : 2r> r> r> swap ;
> : 2r@ 2r> 2dup 2>r ;

Oops, I'm missing the 2r@ definition using 2xxx words ...

Yes. But, the minimal set is not likely to be the fastest solution.

E.g., 2r@ calls 2r> which calls 3 words, 2>r which calls 3 words, and 2dup
which calls over twice which calls four other words for 8 words, plus the
overhead for calling DOCOL/ENTER SEMIS/EXIT. A single, optimized
definition for 2r@ can be shorter and therefore faster:

: 2r@ r> r> dup >r swap dup >r ;

That's only trivially faster, one operation less, assuming DUP DROP SWAP
>R R> are primitives. Unfortunately, I can't cite my 2r@ in sub-operators
as faster. It's slightly worse. Other definitions are improved more. My
sub-operators improve some of the 2xxx words quite a bit.

> In your "sub-operator" notation, such sequences can be greatly
> simplified. Assuming an addressable stack (that is, we don't need to
> POP to get entries off the stack to access them, such as on the x86)
> there are 6 of these operations, 3 for each stack. SGET S[n] and RGET
> R[n] fetch entries from the stacks by fixed offset from a stack
> pointer, SPUT S[n] and RPUT R[n] store entries on the stacks, and SPTR
> + n and RPTR+ n adjust the stack pointers. In the example below, only
> the S operators are shown, since the rstack operations and other
> juggling have been optimised away. [reg] is a virtual register.
>
> : 2OVER 2>R 2DUP 2R> 2SWAP ;
> block ( S: 4 -- 6 R: 0 -- 0 )
> [1] block-begin [2]
> [5] sget S[2]
> [4] sget S[3]
> [5] sput S[-2]
> [4] sput S[-1]
> [3] sptr+ -2
> [2] block-end [1]

Well, I haven't done a 3-level analysis, but did do a 2-level direct
read/write of the data stack. I'm using a highly modified version, by me,
of Peter Sovietov's Forth Wizard in Javascript.

Anyway, I concluded that the 2-level direct read/write wasn't all that
optimal. I found a different combination of pushing/popping which I believe
to provide better overall results. But, eventually, SWAP DROP DUP R> >R
will be primitives again. So, this may be all for naught.

For Forthers here, SGET and SPUT would be like PICK and the uncommon
opposite operation: PLACE.

Rod Pemberton

Rod Pemberton

unread,

Oct 25, 2012, 12:50:12 AM10/25/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k69gf5$hic$1...@dont-email.me...

> On 10/24/2012 8:47 AM, Rod Pemberton wrote:

...

> > For temporary reasons, most of my Forth stack operators, like DUP SWAP
> > etc, aren't currently low-level Forth "primitives". They're implemented
> > in high-level Forth using an even lower set of actual stack
> > "primitives", which I'll call sub-operators for this thread. These
> > sub-operators could be considered to be "stack assembly instructions"
> > for my Forth. Currently, only>R and R> are actually "primitives"
> > coded in C. Eventually, DUP DROP SWAP OVER will be actual
> > low-level Forth "primitives", as they once were.
>
> Coded in 'C'? Perhaps I missed something. I was talking about a Forth
> like CPU. Are you referring to a Forth compiler on a PC?
>

Yes. You're talking about a Forth like CPU. I'm talking about my Forth
interpreter in C. I'm attempting to discuss similarities which are useful.

> > So, a word like OVER can be coded in many ways. I have 22 different
> > definitions just for OVER in a list, and one can construct many more.
> > E.g.,
> >
> > : OVER>R DUP R> SWAP ;
> > : OVER 1 PICK ;
> > : OVER SWAP TUCK ;
> > : OVER SWAP DUP -ROT ;
> > : OVER NUP SWAP ;
> > etc.
>
> Yes it can, but why? Over is a very simple construct for either a Forth
> like CPU or a Forth compiler.

If you implement it as a single instruction, that's great. But, you're not
going to fit all Forth words into your instruction set. So, some words are
going to be made up of multiple Forth words comprised of multiple
instructions. This was just an example of how I went about optimizing such
a sequence.

Also, some Forth stack words don't fit most CPU's well. For example, some
words shift a portion of the data on the stack, e.g., ROT TUCK. Stack
shifts are not well suited to most processors. Stacks can push or pop or
use other instructions to direct reads and writes, but shifts require a copy
routine.

> > Which definition you choose affects how fast OVER executes and depends
> > on what operations you have available and on how fast each of those
> > operations are.
> >
> > E.g., if>R>R DUP SWAP and -ROT are all very fast machine instructions
> > for your processor, then ">R DUP R> SWAP" and "SWAP DUP -ROT"
> > should be fast sequences. And, one sequence will be faster than the
> > other, depending on how fast each instruction is. But, if -ROT is not a
> > machine instruction, e.g., perhaps coded as "ROT ROT" or "SWAP>R
> > SWAP R>", or -ROT is a very slow machine instruction, then "SWAP
> > DUP -ROT" is more expensive than ">R DUP R> SWAP".
>
> I don't follow that a four instruction sequence should be "fast" in any
> real sense. I don't follow at all why you would have -ROT as a
> primitive and not OVER.
>

You're going to be implementing instructions for your processor. Which
instructions you implement and which you don't implement affects how each
Forth word is coded. That affects their speed.

> > The instructions also need to be balanced acrossed many Forth words.
> > Originally, I had the 2xxx series words defined in terms of the simpler
> > stack operators, like SWAP DUP etc. When I converted to sub-operators,
> > the number of operations per definition dropped dramatically for some
> > of the 2xxx definitions. Since the sub-operators are primitives, some
> > of the 2xxx definitions became faster. However, after converting all of
> > the simpler stack operators to sub-operators, a few of the non-primitive
> > simple stack operators ended up with more operations per definition.
> > So some words became much faster, while others became slightly slower.
> > And, the former primitives became real slow, but they'll be converted
> > back eventually.
>
> Not sure why you are looking at this. If you want to optimize the
> implementation, why not look at what is used most often and start by
> optimizing those operators? That is why I went to Koopman's book. Not
> much return on optimizing things that aren't used so much.

There is no quarantee that the instruction frequencies will be the most used
in the code that will be run. It merely demonstrates which was the most
used in the code that was sampled.

I.e., let's say I used Koopman's dynamic instruction frequencies to design a
processor. TUCK isn't in the list. So, I didn't put any effort into
optimizing TUCK. TUCK is really slow and long sequence for my imaginary
processor. Later, I fall in "love" with TUCK. So, I begin to use it
everywhere in my imaginary Forth code. Will my code be fast? No.

The frequencies are only useful on the sampled code and similar code. If
you're not using a code optimizer, then you (or your future users) are going
to have to selectively preference the optimized instructions over the
unoptimized.

Rod Pemberton

Rod Pemberton

unread,

Oct 25, 2012, 12:55:05 AM10/25/12

to

"Rod Pemberton" <do_no...@notemailnotz.cnm> wrote in message
news:k6afoh$laf$1...@speranza.aioe.org...

Correction:

> Oops, I'm missing the 2r@ definition using 2xxx words ...
>
> Yes. But, the minimal set is not likely to be the fastest solution.
>
> E.g., 2r@ calls 2r> which calls 3 words, 2>r which calls 3 words, and 2dup
> which calls over twice which calls four other words for 8 words, plus the
> overhead for calling DOCOL/ENTER SEMIS/EXIT. A single, optimized
> definition for 2r@ can be shorter and therefore faster:
>
> : 2r@ r> r> dup >r swap dup >r ;
>
> That's only trivially faster, one operation less, assuming DUP DROP SWAP
> >R R> are primitives. Unfortunately, I can't cite my 2r@ in sub-operators
> as faster. It's slightly worse. Other definitions are improved more. My
> sub-operators improve some of the 2xxx words quite a bit.

Sorry, I miscounted as 7 vs. 8. It's 7 vs. 14. So, it's 50% faster.

Rod Pemberton

Paul Rubin

unread,

Oct 25, 2012, 1:10:43 AM10/25/12

to

"Rod Pemberton" <do_no...@notemailnotz.cnm> writes:
> If we're talking about the same thing and a microprocessor design isn't
> using pipelining, how do you get any acceptable level of performance? Do
> you just use faster logic and clock?

You just get less performance. A non-pipelined processor typically uses
3-4 cycles for basic instructions that a pipelined one does in 1 cycle
(throughput). That may still be enough for your application. If it's
not, pick a different processor, usually at a cost in die area and power
consumption.

Paul Rubin

unread,

Oct 25, 2012, 2:24:02 AM10/25/12

to

Brad Eckert <hwf...@gmail.com> writes:
> The J1 was used because the Microblaze CPU was too big, too slow, and
> the C application (for an IP camera) was too big to fit in available
> memory. The Forth version was 62% smaller: 6349 vs 16380 bytes. See
> http://excamera.com/sphinx/fpga-j1.html

The J1 is really cool and I remember looking at it before. I guess it
uses RAM blocks for the stacks? It's very clever, the instruction words
are almost like microcode. The hardware folks here would know better
than me, but I wonder if the J1 architecture is tuned for FPGA
implementation in a way that would lose some advantage (compared to
something like b16) in silicon.

I notice the paper doesn't say how many CLB's are consumed. I see all
the opcodes are used, but a couple of them could be synthesized from
others, so maybe it's possible to repurpose them for new instructions.
Or maybe some unused combinations of opcodes and flags could be
hijacked, if checking for them doesn't cause delays.

It's interesting that 22% of instructions in his code are literals. I
wonder if most of them are either 0 or 1.

I can understand a lot of the Verilog by reading it. I'd sure like to
be able to write it too.

Alex McDonald

unread,

Oct 25, 2012, 7:49:18 AM10/25/12

to

On Oct 25, 5:38 am, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
wrote:
> "Alex McDonald" <b...@rivadpm.com> wrote in message
>
> news:172e6a33-bc52-4fe9...@10g2000vbu.googlegroups.com...>

[snip]

The idea is to use the primitives. 2r@ above is

: 2r@ r> r> dup >r swap dup >r ;

block ( S: 0 -- 2 R: 2 -- 2 )
[1] block-begin [2]
[5] rget R[0]
[4] rget R[1]
[7] sput [5] S[-2]
[6] sput [4] S[-1]

[3] sptr+ -2
[2] block-end

Using the compounds 2r> and so on compiles to;

: 2r@ 2r> 2dup 2>r ;

^
Warning -4100 2r@ is redefined
block ( S: 0 -- 2 R: 2 -- 2 )
[1] block-begin [2]
[5] rget R[0]
[4] rget R[1]
[7] sput [5] S[-2]
[6] sput [4] S[-1]

[3] sptr+ -2
[2] block-end

i.e. exactly the same sequence of primitives.

What I'm suggesting is that you re-write all the operations in terms
of simpler, non-ANS Forth words like SGET, SPUT and so on. The
examples above were simply there to show that DUP DROP SWAP >R and R>
could be used to build all the others.

There's a difference between SGET and PICK.

: x 3 pick ;
block ( S: 3 -- 4 R: 0 -- 0 )
[1] block-begin [2]
[4] sget S[2]
[5] sput [4] S[-1]
[3] sptr+ -1
[2] block-end

n PICK puts its results on the stack; SGET and SPUT fetch and store
into virtual registers, which eventually get translated at the machine
code level into real registers; or they may be further optimised
away.

The output I'm showing here is from my experimental SSA based Forth
optimiser that I've been working on intermittently for a couple of
years, based in part on Anton Ertl's RAFTS papers. Progress is slow, I
must admit, but as you can see there are minor differences in the
output between my original post and this one, so work proceeds. Even
in this state it does an excellent and optimal job of removing all the
stack juggling as it refers to each stack entry only once, and it
reveals the issues you showed upthread with the various
implementations of OVER. They're obviously all equivalent, and reduce
to the same sequence of simple "sub-operators".

If the target supports PUSH and POP or if the target doesn't have an
addressable stack and only has PUSH and POP, the translation into SSA
could be changed. For instance, generate an SPOP that is the
equivalent of SGET S[0] SPTR+ 1 and an SPUSH that is SPUT S[-1] SPTR+
-1 . I use the intermediate form shown to allow addressing the stack
on an x86 as an array (hence the S[n] and R[n] notation for elements
of the stack). It makes the SSA analysis easier at the expense of a
slightly more complex code generator if PUSH and POP are to be
generated from these types of sequences.

>
> Rod Pemberton

Alex McDonald

unread,

Oct 25, 2012, 7:54:33 AM10/25/12

to

On Oct 24, 8:33 pm, rickman <gnu...@gmail.com> wrote:
[snip]

>
> Interesting. Has anyone noticed that Chuck's GA144 does not include
> SWAP as a primitive? I find it very odd, but workable. In fact, the
> absence of this operator led me to discover that - . + - performs a
> subtraction without adding the 1 constant!
>
> (for those of you who aren't familiar with the F18A assembly language, +
> is an add, . is a nop to let the carry propagate and - is a 1's
> complement, not a 2's complement. )
>
> It seems that in a lot of cases when you think you want a swap, you can
> do very well without it!
>
> Rick

It's possible to swap using XORs, but I'd be interested to see the
F18A equivalent. For some code sequences, I don't see how it can be
avoided; A B - is not the same as B A -

Brad Eckert

unread,

Oct 25, 2012, 12:26:20 PM10/25/12

to

On Wednesday, October 24, 2012 11:24:05 PM UTC-7, Paul Rubin wrote:
> The J1 is really cool and I remember looking at it before. I guess it
> uses RAM blocks for the stacks? It's very clever, the instruction words
> are almost like microcode. The hardware folks here would know better
> than me, but I wonder if the J1 architecture is tuned for FPGA
> implementation in a way that would lose some advantage (compared to
> something like b16) in silicon.
>

Stacks use LUT RAM, which is fine for small RAMs. It is geared more toward FPGA implementation than the B16, but that's more implementation than ISA.

Forth Day 2010 was an interesting year. Three different CPU cores were presented. The J1, a minimal control processor; the SC20, a more full featured application processor; and Ting's P32, something in between.

Minimal processors are a good fit for Forth. Since the PAUSE chain can be made so fast, interrupts are often not needed. Not having interrupts simplifies the CPU design and saves you from having to do a bunch of verification to see where an asynchronous interrupt could break the system.

Thinking about the different kinds of stack machines led me to start this thread. The Novix style ISA is good and compact. The compiler needs to be smarter than that of a MISC to take advantage of it.

Paul Rubin

unread,

Oct 25, 2012, 1:39:22 PM10/25/12

to

Brad Eckert <hwf...@gmail.com> writes:
> Stacks use LUT RAM, which is fine for small RAMs. It is geared more
> toward FPGA implementation than the B16, but that's more
> implementation than ISA.

Ah, thanks, I see that now, about the LUT ram. I wonder how much
hardware it uses compared to the b16. The amount of verilog code is
very small, but some of the things it does look expensive:

1) the stacks are way bigger than the b16's (in addition to being in LUT's)

2) There is a "barrel shifter" (T = N << (T&0x0F) and T = N >>
(T&0x0F)). I wonder how often code actually uses it.

3) I'm not sure but I saw something to the effect that on every cycle,
it computes all the possible instruction results (T+N, T&N, memory read,
etc.) in parallel, so the result is ready by the time the instruction
fetch completes. The instruction then just selects which result to use.
This sounds power hungry?

The paper mentions that putting the stacks in LUT complicates PICK and
ROLL, and I see in nuc.fs that they are done as unrolled loops. I
wonder what obstacles there might have been to allowing random access to
the stacks. That might have made the J1 a more attractive target for
conventional compilers, without complicating the Verilog too much.

rickman

unread,

Oct 25, 2012, 4:35:02 PM10/25/12

to

On 10/25/2012 12:30 AM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message
> news:k69fbv$a6n$1...@dont-email.me...
>> On 10/24/2012 6:56 AM, Rod Pemberton wrote:
>>> "rickman"<gnu...@gmail.com> wrote in message
>>>> It seems that if the CPU isn't pipelined and blazing fast, it isn't
>>>> interesting to most FPGA users.
>>>
>>> Without pipelining, I doubt it's of interest to anyone. The 6502
>>> proved how effective pipelining in a microprocessor is in increasing
>>> performance.
>>
>> I don't follow that. There are lots of MCUs that aren't pipelined. The
>> need for speed is a spectrum. Often there are other things that are
>> more important requirements, like size. How many LUTs is the Leon
>> processor, 5000? Fast, but large.
>>
>
> Are we talking about the same thing?
>
> http://en.wikipedia.org/wiki/Instruction_pipeline
> http://en.wikipedia.org/wiki/Pipelining
>
> AFAIK, the 6502 was the first microprocessor with pipelining. And, the Cray
> CDC 6600 had the first central processing unit with it.
>
> If we're talking about the same thing and a microprocessor design isn't
> using pipelining, how do you get any acceptable level of performance? Do
> you just use faster logic and clock?

Before I can answer that question, you need to define "acceptable level
of performance". My entire point is that the performance needs of
different applications vary across the spectrum. There are costs
associated with pipelining; it takes more gates, it uses more power and
the performance speed up depends on the code being executed, there are
pipeline stalls and flushes. Pipelining is not a magic formula for
"free" performance. It is a tradeoff.

I think I already mentioned that my design goals had to do with
minimizing the resources used balanced against performance. Using one
block RAM for the stacks, a second for instruction memory and a third
for RAM, it only required a further 600 LUTs to implement a 16 bit
processor. That included a few bells and whistles that I would likely
rip out with any further work. This ran at 50 MHz in a 15 year old FPGA.

That level of performance was fine for the application I had. I came
close to using it again on a design a couple of years ago when a
customer wanted me to upgrade an existing design. But I was able to use
the Lattice device at 90% without an impact on the performance so in the
end adding the functions as logic worked out ok.

I'm looking at a design in a Lattice iCE40 device which would be the
lowest power way of implementing this I believe. I might use this
processor rather than application specific logic. We'll see. It all
depends on how small I can get the processor. It may require an 8 bit
version or even a redesign to 4 or 5 bit data paths.

Rick

rickman

unread,

Oct 25, 2012, 4:43:25 PM10/25/12

to

You are thinking of CISC or RISC processors where the instructions are
more complex. As is true for many MISC designs, every instruction on my
processor is one clock cycle. Pipelining might help to speed up the
performance, but there become complex interactions between adjacent
instructions and interrupt handling becomes more complex.

I'd like to get back to working on this CPU design. I wasn't able to
optimize it in ways I originally set out to do. If I return to it I may
be able to do that.

Rick

rickman

unread,

Oct 25, 2012, 5:35:54 PM10/25/12

to

Yes, I recall there is a sequence using OVER and OR (xor) to do a swap.
I can't recall where I saw it now but I think it is not so optimal and
still requires a temp register... OVER DUP A! OR OR A@ or something
like that.

No, A B - is not the same as B A -, in fact for all of the cases where I
was doing a subtract, it was preceded by a swap, so this was a win-win.
Turns out the 1 constant is a very inefficient operation. Not only
does it occupy a full word of instruction memory, it takes an extra
memory access (the slowest operation on the CPU, about three ALU ops)
and kills the prefetch which hits performance (and power consumption)
some more. Anyone using the number 700 MIPS is kidding themselves.
This CPU only does about half that in any but the most optimal
situations and the "ops" are very basic so you need a lot more of them
to do much.

That is the sort of stuff you need to learn to use the GA144. I spent
some two or three weeks wracking my brain to try to learn enough to code
some tasks. It was rewarding to find things like this that help to
enhance the utility of the device. But it just has so many hurdles that
don't have "tricks" to get around. It is still on my list of chips to
consider, but the design I'm currently looking at doing requires lower
power consumption than the GA144 is capable of. I'm now looking at an
iCE40 FPGA.

Rick

Paul Rubin

unread,

Oct 25, 2012, 6:27:18 PM10/25/12

to

rickman <gnu...@gmail.com> writes:
> Yes, I recall there is a sequence using OVER and OR (xor) to do a
> swap. I can't recall where I saw it now but I think it is not so
> optimal and still requires a temp register... OVER DUP A! OR OR A@ or
> something like that.

push a! pop a

Or if you don't need the third thing on the stack:

over over

Both of these are from http://www.colorforth.com/inst.htm .

rickman

unread,

Oct 25, 2012, 6:58:37 PM10/25/12

to

On 10/25/2012 12:50 AM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message
> news:k69gf5$hic$1...@dont-email.me...
>> On 10/24/2012 8:47 AM, Rod Pemberton wrote:
> ...
>
>>> For temporary reasons, most of my Forth stack operators, like DUP SWAP
>>> etc, aren't currently low-level Forth "primitives". They're implemented
>>> in high-level Forth using an even lower set of actual stack
>>> "primitives", which I'll call sub-operators for this thread. These
>>> sub-operators could be considered to be "stack assembly instructions"
>>> for my Forth. Currently, only>R and R> are actually "primitives"
>>> coded in C. Eventually, DUP DROP SWAP OVER will be actual
>>> low-level Forth "primitives", as they once were.
>>
>> Coded in 'C'? Perhaps I missed something. I was talking about a Forth
>> like CPU. Are you referring to a Forth compiler on a PC?
>>
>
> Yes. You're talking about a Forth like CPU. I'm talking about my Forth
> interpreter in C. I'm attempting to discuss similarities which are useful.

Ok, now that I get the context, what is the point exactly? Why do you
only code the lesser number of primitives when you will eventually code
more words as primitives. I'm not following what point you are trying
to make. Is this an exercise where you hope to learn how best to
optimize primitives?

>>> So, a word like OVER can be coded in many ways. I have 22 different
>>> definitions just for OVER in a list, and one can construct many more.
>>> E.g.,
>>>
>>> : OVER>R DUP R> SWAP ;
>>> : OVER 1 PICK ;
>>> : OVER SWAP TUCK ;
>>> : OVER SWAP DUP -ROT ;
>>> : OVER NUP SWAP ;
>>> etc.
>>
>> Yes it can, but why? Over is a very simple construct for either a Forth
>> like CPU or a Forth compiler.
>
> If you implement it as a single instruction, that's great. But, you're not
> going to fit all Forth words into your instruction set. So, some words are
> going to be made up of multiple Forth words comprised of multiple
> instructions. This was just an example of how I went about optimizing such
> a sequence.

I'm still not following your point. OVER is a very logical word to
implement as a primitive. It is used often enough and requires little
or nothing in the way of extra hardware for a CPU design and should be a
single instruction on most existing CPUs.

> Also, some Forth stack words don't fit most CPU's well. For example, some
> words shift a portion of the data on the stack, e.g., ROT TUCK. Stack
> shifts are not well suited to most processors. Stacks can push or pop or
> use other instructions to direct reads and writes, but shifts require a copy
> routine.

Absolutely. Many Forth words won't map well to either hardware or to
any given software implementation.

>>> Which definition you choose affects how fast OVER executes and depends
>>> on what operations you have available and on how fast each of those
>>> operations are.
>>>
>>> E.g., if>R>R DUP SWAP and -ROT are all very fast machine instructions
>>> for your processor, then ">R DUP R> SWAP" and "SWAP DUP -ROT"
>>> should be fast sequences. And, one sequence will be faster than the
>>> other, depending on how fast each instruction is. But, if -ROT is not a
>>> machine instruction, e.g., perhaps coded as "ROT ROT" or "SWAP>R
>>> SWAP R>", or -ROT is a very slow machine instruction, then "SWAP
>>> DUP -ROT" is more expensive than ">R DUP R> SWAP".
>>
>> I don't follow that a four instruction sequence should be "fast" in any
>> real sense. I don't follow at all why you would have -ROT as a
>> primitive and not OVER.
>>
>
> You're going to be implementing instructions for your processor. Which
> instructions you implement and which you don't implement affects how each
> Forth word is coded. That affects their speed.

Ok. For instructions I picked functions that were -

1) Basic enough to be implemented in one clock cycle
2) Were good primitives in a Forth sense
3) Were high on the list of use frequency according to Koopman's book

not necessarily in that priority.

I did not have a one to one mapping of instructions to existing Forth
words, but many instructions were existing Forth words or very similar.
Some exceptions were adding the top elements of the return stack and
adding the top of data stack to the return stack. These are used to
control looping, although they may change if I return to working on this
design.

>>> The instructions also need to be balanced acrossed many Forth words.
>>> Originally, I had the 2xxx series words defined in terms of the simpler
>>> stack operators, like SWAP DUP etc. When I converted to sub-operators,
>>> the number of operations per definition dropped dramatically for some
>>> of the 2xxx definitions. Since the sub-operators are primitives, some
>>> of the 2xxx definitions became faster. However, after converting all of
>>> the simpler stack operators to sub-operators, a few of the non-primitive
>>> simple stack operators ended up with more operations per definition.
>>> So some words became much faster, while others became slightly slower.
>>> And, the former primitives became real slow, but they'll be converted
>>> back eventually.
>>
>> Not sure why you are looking at this. If you want to optimize the
>> implementation, why not look at what is used most often and start by
>> optimizing those operators? That is why I went to Koopman's book. Not
>> much return on optimizing things that aren't used so much.
>
> There is no quarantee that the instruction frequencies will be the most used
> in the code that will be run. It merely demonstrates which was the most
> used in the code that was sampled.

Then you have no basis for picking any words for optimization. No
matter what words you optimize, if the application doesn't use those
instructions very much the optimization won't be effective for that app.
That is why I rely on data like Koopman's book, it draws on a range of
apps.

> I.e., let's say I used Koopman's dynamic instruction frequencies to design a
> processor. TUCK isn't in the list. So, I didn't put any effort into
> optimizing TUCK. TUCK is really slow and long sequence for my imaginary
> processor. Later, I fall in "love" with TUCK. So, I begin to use it
> everywhere in my imaginary Forth code. Will my code be fast? No.
>
> The frequencies are only useful on the sampled code and similar code. If
> you're not using a code optimizer, then you (or your future users) are going
> to have to selectively preference the optimized instructions over the
> unoptimized.

Ok, I am all ears. If you have a better idea for how to pick primitive
instructions to implement on a dual stack CPU design, I would love to
hear about it. The thing above about balancing across multiple words
has the same flaw of only applying to some applications as any other
method of selection.

Rick

Paul Rubin

unread,

Oct 25, 2012, 7:07:57 PM10/25/12

to

rickman <gnu...@gmail.com> writes:
> Ok, I am all ears. If you have a better idea for how to pick
> primitive instructions to implement on a dual stack CPU design, I
> would love to hear about it.

In the J1, I wonder what it would have cost (in CLB's or speed) to add
some ways for user code to access the stack slots in array- or
register-like fashion. The stacks are implemented as arrays in the
Verilog code already. That could save some typical stack juggling and
make life easier for compilers. Maybe they could be mapped into memory
rather than finding opcode space to do stuff with them.

rickman

unread,

Oct 25, 2012, 7:10:52 PM10/25/12

to

On 10/25/2012 6:27 PM, Paul Rubin wrote:
> rickman<gnu...@gmail.com> writes:
>> Yes, I recall there is a sequence using OVER and OR (xor) to do a
>> swap. I can't recall where I saw it now but I think it is not so
>> optimal and still requires a temp register... OVER DUP A! OR OR A@ or
>> something like that.
>
> push a! pop a

I don't consider this one very usable because of the number of side
effects, using one of 10 levels of the return stack and clobbering the A
register. I'm not saying it can't be used, I just won't consider it
often.

> Or if you don't need the third thing on the stack:
>
> over over
>
> Both of these are from http://www.colorforth.com/inst.htm .

Over over is a 2DUP, not a swap. You could do an OVER, use the two
operands and then do a DROP later in the code. That is the most
efficient when practical I suppose.

Personally, I'll just keep the SWAP as a primitive.

Rick

rickman

unread,

Oct 25, 2012, 7:20:32 PM10/25/12

to

Without looking at the code for J1 (something that is on my list of
things to do) I would say it is not a huge effort. It requires a source
of address and a mux to bring that address into the stack RAM along with
the control hardware of course. This is not a lot of hardware, but
where does the address come from? I don't understand your last sentence
at all.

BTW, how it is coded in Verilog has nothing to do with it. This isn't
software.

Rick

rickman

unread,

Oct 25, 2012, 7:35:30 PM10/25/12

to

On 10/25/2012 1:39 PM, Paul Rubin wrote:
> Brad Eckert<hwf...@gmail.com> writes:
>> Stacks use LUT RAM, which is fine for small RAMs. It is geared more
>> toward FPGA implementation than the B16, but that's more
>> implementation than ISA.
>
> Ah, thanks, I see that now, about the LUT ram. I wonder how much
> hardware it uses compared to the b16. The amount of verilog code is
> very small, but some of the things it does look expensive:

It may use LUT RAM on a Xilinx or Lattice part, but many FPGAs don't
*have* LUT RAM in which case it is done in block RAM. LUT RAM has
limitations and it is not unusual to use block RAM to more fully exploit
the advantages, like full dual read ports.

> 1) the stacks are way bigger than the b16's (in addition to being in LUT's)

If they are "way bigger", they probably aren't in LUT RAM. Most devices
with LUT RAM are only 16 deep. Otherwise you need to use multiplexors
to combined multiple LUTs for each bit.

> 2) There is a "barrel shifter" (T = N<< (T&0x0F) and T = N>>
> (T&0x0F)). I wonder how often code actually uses it.

Good question. Barrel shifters are used in floating point calculations
but are often cheap in FPGAs because you can implement them in
multipliers if you can tolerate the clock delay as the multipliers are
clocked.

> 3) I'm not sure but I saw something to the effect that on every cycle,
> it computes all the possible instruction results (T+N, T&N, memory read,
> etc.) in parallel, so the result is ready by the time the instruction
> fetch completes. The instruction then just selects which result to use.
> This sounds power hungry?

I doubt this is really any more power hungry than most other designs.
Every register that changes dissipates power in all of the logic it
feeds regardless of whether you use that path or not. The only question
is does the J1 have multiple logic paths where otherwise a single path
with multiple functions would be used? An subtractor is just an adder
with an inverted input and a one on the carry. Or do you use an adder
and a subtractor followed by a mux? BTW, the mux to select the function
outputs can be as much logic as the functions themselves and extra power
as well.

> The paper mentions that putting the stacks in LUT complicates PICK and
> ROLL, and I see in nuc.fs that they are done as unrolled loops. I
> wonder what obstacles there might have been to allowing random access to
> the stacks. That might have made the J1 a more attractive target for
> conventional compilers, without complicating the Verilog too much.

I don't know of any reason why LUT RAM is any different from block RAM
in doing PICK or ROLL. I believe ROLL requires you to copy sequential
locations in the stack and PICK just copies one element to the top of
stack. If the stack is in RAM, regardless of which type of RAM is used,
this is all the same. It requires some extra hardware to mux an address
which is not the top of stack, but is not precluded and is the same for
both types.

Rick

Bernd Paysan

unread,

Oct 25, 2012, 9:15:31 PM10/25/12

to

rickman wrote:

> If the last word before the ; is another
> definition and not a primitive you can't combine a return with a
> call... or can you?

Of course, that's called "tail call elimination" and is a standard
technique in compilers on normal CPUs. You turn the call into a jump.
It is a bit more complicated in languages like C with their stack
frames, but well, a C compiler is a bit more complicated than a Forth
compiler. In Forth, it's really just converting the call into a jump.
Maybe you can't do it for every call, because the jumps might not cover
the entire address space, but if you can, it is a useful optimization.

It is about the only optimization I do on the b16 - which doesn't have a
return bit, but since combining calls+returns happens quite frequently
in Forth, I grab this low-hanging fruit.

The RTX2000 had a "horizontal" instruction encoding. This term comes
from microcode engines, and on "horizontal" engines, you could encode
the operations of all available resources (ALUs, incrementers, register
files, etc.) in one wide horizontal line of microcode, while a
"vertical" encoding had a selector for the resource, and one instruction
for the selected resource. The horizontal microcode executed all in
parallel, if useful or not; the code is less dense. The vertical
microcode execute all in sequence, the performance then is less good.

So even if a combination is not always useful or even doable, a
horizontal style will make it possible, because *when* it is doable, it
provides a speed advantage. The RTX2000 allowed to combine more than
one Forth operation into one instruction, IIRC up to three (with the
return being the third). In conventional processors, this sort of
instruction combination was touted "superscalar", and Intel's first
superscalar processor was the Pentium. Though, as Forth instructions
don't do much work, you could argue that mov [offset+reg],reg is already
equivalent to three Forth instructions (lit + @ or lit + !).

As usual in engineering, these different styles have different
tradeoffs.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/

Paul Rubin

unread,

Oct 25, 2012, 9:34:47 PM10/25/12

to

rickman <gnu...@gmail.com> writes:
>> push a! pop a
>
> I don't consider this one very usable because of the number of side
> effects, using one of 10 levels of the return stack and clobbering the
> A register.

I think F18 code won't normally have 10 levels of subroutine calls, so
storing temporaries on the return stack is fine. You could even save
the A register there if you didn't want to clobber it.

> Over over is a 2DUP, not a swap.

Oh of course you're right, I mis-read the page, should have said just
"over" instead of "over over". Because of the circular stack, doing
this repeatedly won't cause overflow. The issue is only if you've got
stuff past the 2nd element that needs to be preserved.

> You could do an OVER, use the two operands and then do a DROP later in
> the code. That is the most efficient when practical I suppose.

You might not even need the DROP later.

Bernd Paysan

unread,

Oct 26, 2012, 11:55:24 AM10/26/12

to

rickman wrote:

> On 10/25/2012 1:10 AM, Paul Rubin wrote:
>> "Rod Pemberton"<do_no...@notemailnotz.cnm> writes:
>>> If we're talking about the same thing and a microprocessor design
>>> isn't
>>> using pipelining, how do you get any acceptable level of
>>> performance? Do you just use faster logic and clock?
>>
>> You just get less performance. A non-pipelined processor typically
>> uses 3-4 cycles for basic instructions that a pipelined one does in 1
>> cycle
>> (throughput). That may still be enough for your application. If
>> it's not, pick a different processor, usually at a cost in die area
>> and power consumption.
>
> You are thinking of CISC or RISC processors where the instructions are
> more complex. As is true for many MISC designs, every instruction on
> my
> processor is one clock cycle. Pipelining might help to speed up the
> performance, but there become complex interactions between adjacent
> instructions and interrupt handling becomes more complex.

The only pipelining that helps performance on a MISC is the instruction
prefetch - and that's something MISCs often do. A typical simple RISC
processor pipeline has four stages: instruction fetch, register read,
ALU operation, register write; a CISC or a RISC with compressed
instructions like ARM Thumb needs an additional decode stage. Since a
stack based processor doesn't need register read and write (TOS and NOS
are directly in the ALU path, the access to the stack to push/pull NOS
is in parallel), no further pipeline stage is necessary.

The concept of packing multiple sequential instructions into one word
helps to reduce the conflicts in such a small pipeline. If your
instruction fetch takes one cycle, you only need to fetch the next
instruction in the last slot of an instruction. You either could reduce
the number of possible instructions there to those which don't conflict
with the instruction fetch (e.g. only ALU and stack operations, no
load/store operations, no branches), or you delay the prefetch if there
is a conflict, or you reduce the number of possible instructions in the
first slot - that's what I do on the b16. The first slot can only do a
call or a nop, and those two choices are selected right when the
instruction has been fetched (do or not do a call). With a complete
prefetch, I could do 3 instructions in 3 cycles, without this prefetch,
I do 3 instructions in 4 cycles, except if the first instruction is a
call - that's single cycle again.

Bernd Paysan

unread,

Oct 26, 2012, 12:26:34 PM10/26/12

to

Paul Rubin wrote:
> 1) the stacks are way bigger than the b16's (in addition to being in
> LUT's)

On the FPGA implementation, I tried to make it possible to use block RAM
for the stacks - so you could make the stacks larger without sacrifying
much resources.

> 2) There is a "barrel shifter" (T = N << (T&0x0F) and T = N >>
> (T&0x0F)). I wonder how often code actually uses it.

I doubt it's much code that uses a barrel shifter.

> 3) I'm not sure but I saw something to the effect that on every cycle,
> it computes all the possible instruction results (T+N, T&N, memory
> read, etc.) in parallel, so the result is ready by the time the
> instruction
> fetch completes. The instruction then just selects which result to
> use. This sounds power hungry?

Yes, this is potentially more power hungry than what the b16 does. The
b16 has a compact ALU, which needs to be set up at the start of the
cycle to produce the correct result (it does +, &, |, xor all with the
same logic element, the full adder, and needs multiplexers setup to feed
the carry through for +). I also only access the RAM when it is
necessary.

Some FPGA tools have problems converting the b16 case statement into a
multiplexer; for these it might help to do that "by hand".

> The paper mentions that putting the stacks in LUT complicates PICK and
> ROLL, and I see in nuc.fs that they are done as unrolled loops. I
> wonder what obstacles there might have been to allowing random access
> to
> the stacks. That might have made the J1 a more attractive target for
> conventional compilers, without complicating the Verilog too much.

For a conventional compiler, I would stronlgy recommend to simply add
the workspace concept from the transputer. E.g. expand the b16
instructions from 5 to 6 bits, add a workspace pointer register, and
have 16 W+x@ and 16 W+x! instructions. If you want to help OOP
programming languages, have two registers, one for the current object,
one for the workspace, and access only 8 locations in each.

However, for these small embedded targets, this extra instructions are
not really necessary. The programs there usually have a completely
static call tree, so you can allocate the locals statically; as the Keil
compiler does for an 8051.

The easiest allocation algorithm starts with leaf procedures: Their
locals go to the smallest addresses (easiest to address; leaf procedures
do most of the work). Then proceed to the next layer in the call graph,
and allocate the locals there, starting with the smallest available
address, too (addresses reserved by leaf calls are not to be used). And
so on, until you get to main(). A function pointer call has to be
treated as if all the functions which pointer has been taken are called
there.

Recursive functions will face a performance hit and need a separate
dynamic stack area (same as with Keil), unless the compiler can optimize
all locals away. Calling conventions should be Forth-compatible, i.e.
you push parameters on the stack when calling a function. The function
then might place the parameters in locals.

We had a student doing some experiments with a relatively simple C
compiler that used a stack based intermediate language, and the
conclusion of the student (who did this as summer practica) was that you
can target C to it, but you get considerably slower and larger code than
with Forth.

Bernd Paysan

unread,

Oct 26, 2012, 12:34:59 PM10/26/12

to

Alex McDonald wrote:
> It's possible to swap using XORs, but I'd be interested to see the
> F18A equivalent. For some code sequences, I don't see how it can be
> avoided; A B - is not the same as B A -

But there is no -. On the b16, "-" is "com +c", and "swap -" is ">r com
r> +c".

Swap itself is "over >r nip r>". You don't need to shuffle the stack
that often. In the 2k bytes Triceps program, there are in total 5
swaps. There are a lot more dups, overs (most of them as 2dups), >rs,
and r>s.

Granted, I use swap a lot more in normal Forth programs, but as I know
that it is an expensive operation on b16, I code around this fact, and
it is possible.

Bernd Paysan

unread,

Oct 26, 2012, 4:27:42 PM10/26/12

to

Bernd Paysan wrote:

> Alex McDonald wrote:
>> It's possible to swap using XORs, but I'd be interested to see the
>> F18A equivalent. For some code sequences, I don't see how it can be
>> avoided; A B - is not the same as B A -
>
> But there is no -. On the b16, "-" is "com +c", and "swap -" is ">r
> com r> +c".

There's a faster swap -: com + com, on the F18A, that's written as

- . + -

rickman

unread,

Oct 26, 2012, 7:03:23 PM10/26/12

to

Now that is the right kind of thinking for programming the F18A!

Rick

rickman

unread,

Oct 26, 2012, 7:17:46 PM10/26/12

to

On 10/26/2012 12:34 PM, Bernd Paysan wrote:
> Alex McDonald wrote:
>> It's possible to swap using XORs, but I'd be interested to see the
>> F18A equivalent. For some code sequences, I don't see how it can be
>> avoided; A B - is not the same as B A -
>
> But there is no -. On the b16, "-" is "com +c", and "swap -" is ">r com
> r> +c".

There you go! So on the F18A a subtract TOS from NOS can be done as
push - pop . + -. A lot faster and much smaller than the stock - . + 1
. +.

> Swap itself is "over>r nip r>". You don't need to shuffle the stack
> that often. In the 2k bytes Triceps program, there are in total 5
> swaps. There are a lot more dups, overs (most of them as 2dups),>rs,
> and r>s.

That may well be why the F18A doesn't have a swap.

> Granted, I use swap a lot more in normal Forth programs, but as I know
> that it is an expensive operation on b16, I code around this fact, and
> it is possible.

Indeed it is.

I have done my share of micro-code back in the day when we actually
designed computers with boards rather than just a chip on a board.
Micro-code is something you spend a lot of time on to optimize. I guess
low level forth is similar.

Rick

rickman

unread,

Oct 26, 2012, 7:27:50 PM10/26/12

to

On 10/26/2012 4:27 PM, Bernd Paysan wrote:
> Bernd Paysan wrote:
>
>> Alex McDonald wrote:
>>> It's possible to swap using XORs, but I'd be interested to see the
>>> F18A equivalent. For some code sequences, I don't see how it can be
>>> avoided; A B - is not the same as B A -
>>
>> But there is no -. On the b16, "-" is "com +c", and "swap -" is ">r
>> com r> +c".
>
> There's a faster swap -: com + com, on the F18A, that's written as
>
> - . + -

Yup, that's what started this swap conversation.

Rick

Rod Pemberton

unread,

Oct 26, 2012, 7:44:12 PM10/26/12

to

"Bernd Paysan" <bernd....@gmx.de> wrote in message
news:1766437.A...@sunwukong.fritz.box...

> Alex McDonald wrote:
> > It's possible to swap using XORs, but I'd be interested to see the
> > F18A equivalent. For some code sequences, I don't see how it can be
> > avoided; A B - is not the same as B A -
>
> But there is no -. On the b16, "-" is "com +c", and "swap -" is ">r com
> r> +c".
>
> Swap itself is "over >r nip r>". You don't need to shuffle the stack
> that often. In the 2k bytes Triceps program, there are in total 5
> swaps. There are a lot more dups, overs (most of them as 2dups), >rs,
> and r>s.
>
> Granted, I use swap a lot more in normal Forth programs, but as I know
> that it is an expensive operation on b16, I code around this fact, and
> it is possible.
>

The question is if 'nip' is effective or useful enough to have it
as a primitive or MISC/RISC instruction.

Swap is a whole bunch of things ...

swap 1 roll
swap nup drip take
swap tuck drop
swap dup rot rot drop
swap dup take
swap over take
swap over rot drop
swap dup rot nip
swap dup -rot drop
swap over >r >r drop r> r>
swap over >r nip r>
swap over dup 3 roll drop drop
swap 0 pick take
swap 1 pick take
swap >r >r 2r>
swap 2>r r> r>
swap over >r xor r@ xor r>
swap over dup >r xor xor r>
swap >r dup r> xor dup >r xor dup r> xor
swap dup >r xor dup r> xor dup >r xor r>

etc.

Rod Pemberton

rickman

unread,

Oct 26, 2012, 7:44:02 PM10/26/12

to

I won't argue this point much, but any operation that uses more than one
level of gates (or LUTs in an FPGA) can be pipelined. Pipelining
doesn't need to have anything to do with the functional units of a
processor. The instruction decode can easily be pipelined by adding
registers. The fetch has the issue of the memory access typically being
slower than much of the rest of the logic, but if that is buffered into
a register the address generation can be pipelined.

I believe the Pentium 4 had something like 15 stages to its pipeline
which illustrates the problem. The Pentium 4 was only faster in some
cases because of the cost of a pipeline flush. Not to mention the huge
transistor count. But then the problem they were starting to have to
what to do with the transistors that were available with each new
process generation!

A MISC can be pipelined by adding registers to the ALU, the decode and
the stack address generation as well as the instruction address
generation. But it adds lots of complexity to the logic.

> The concept of packing multiple sequential instructions into one word
> helps to reduce the conflicts in such a small pipeline. If your
> instruction fetch takes one cycle, you only need to fetch the next
> instruction in the last slot of an instruction. You either could reduce
> the number of possible instructions there to those which don't conflict
> with the instruction fetch (e.g. only ALU and stack operations, no
> load/store operations, no branches), or you delay the prefetch if there
> is a conflict, or you reduce the number of possible instructions in the
> first slot - that's what I do on the b16. The first slot can only do a
> call or a nop, and those two choices are selected right when the
> instruction has been fetched (do or not do a call). With a complete
> prefetch, I could do 3 instructions in 3 cycles, without this prefetch,
> I do 3 instructions in 4 cycles, except if the first instruction is a
> call - that's single cycle again.
>

All of these are design details that depend on a lot more than just the
things you mention. Requiring a branch instruction to be put off to the
next word just so that next instruction word can be fetched efficiently
when it might be followed by another instruction fetch is of dubious
value. This all depends on many design trade offs. I don't see the
value of multiplexing instructions in words in an FPGA CPU which is what
I design. In an IC the requirements and trade offs are entirely different.

Rick

Rod Pemberton

unread,

Oct 26, 2012, 7:56:43 PM10/26/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k6cbc1$u8c$1...@dont-email.me...

> On 10/25/2012 7:54 AM, Alex McDonald wrote:
> > On Oct 24, 8:33 pm, rickman<gnu...@gmail.com> wrote:
> > [snip]

> > It's possible to swap using XORs, but I'd be interested to see the
> > F18A equivalent. For some code sequences, I don't see how it can be
> > avoided; A B - is not the same as B A -
>
> Yes, I recall there is a sequence using OVER and OR (xor) to do a swap.
> I can't recall where I saw it now but I think it is not so optimal and
> still requires a temp register... OVER DUP A! OR OR A@ or something
> like that.

Swapping two values using XOR's was a well known
programming "trick" in the 1980's ...

I'm not sure what Forth sequence you saw, but I have these in a file:

swap over >r xor r@ xor r>
swap over dup >r xor xor r>
swap >r dup r> xor dup >r xor dup r> xor
swap dup >r xor dup r> xor dup >r xor r>

They're much longer than other swap sequences:

swap 1 roll
swap nup drip take
swap tuck drop
swap dup rot rot drop
swap dup take
swap over take
swap over rot drop
swap dup rot nip
swap dup -rot drop
swap over >r >r drop r> r>
swap over >r nip r>
swap over dup 3 roll drop drop
swap 0 pick take
swap 1 pick take
swap >r >r 2r>
swap 2>r r> r>

etc.

Rod Pemberton

rickman

unread,

Oct 26, 2012, 7:59:10 PM10/26/12

to

Yeah, thanks for the update.

I think the point is that nip is an instruction on the B16... I may give
that a think...

Rick

Bernd Paysan

unread,

Oct 26, 2012, 8:17:52 PM10/26/12

to

rickman wrote:
> I believe the Pentium 4 had something like 15 stages to its pipeline
> which illustrates the problem. The Pentium 4 was only faster in some
> cases because of the cost of a pipeline flush. Not to mention the
> huge
> transistor count. But then the problem they were starting to have to
> what to do with the transistors that were available with each new
> process generation!
>
> A MISC can be pipelined by adding registers to the ALU, the decode and
> the stack address generation as well as the instruction address
> generation. But it adds lots of complexity to the logic.

The Pentium 4 (first generation) had a double-pumped ALU. I.e. instead
of splitting up the ALU into two cycles, it completed the work in one
half cycle, and two operations with dependency in the ALU path could be
issued in one cycle. The other 15 cycles were spent on decoding and
scheduling instructions from the trace cache - they were already pre-
decoded.

The ALU is not the bottleneck of any CPU. You can possibly pipeline it,
but at the cost of dependent instructions, which are fairly frequent
(that's why the Pentium 4 team considered this approach - it speeds up
dependent instructions).

In any case, feel free to choose tradeoffs as you like. For the tasks
I've used the b16 with, performance didn't matter. Power consumption
and code size did matter. Power consumption means something like "rush
to finish", but that's considering the most simple structure you can use
- if you can manage it in 30% less time by burning twice the current,
it's no good.

Rod Pemberton

unread,

Oct 26, 2012, 8:57:38 PM10/26/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k6cg75$o2d$1...@dont-email.me...

> On 10/25/2012 12:50 AM, Rod Pemberton wrote:
> > "rickman"<gnu...@gmail.com> wrote in message
> > news:k69gf5$hic$1...@dont-email.me...
> >> On 10/24/2012 8:47 AM, Rod Pemberton wrote:
...

> >>> For temporary reasons, most of my Forth stack operators, like DUP SWAP
> >>> etc, aren't currently low-level Forth "primitives". They're
> >>> implemented in high-level Forth using an even lower set of actual
> >>> stack "primitives", which I'll call sub-operators for this thread.
> >>> These sub-operators could be considered to be "stack assembly
> >>> instructions" for my Forth. Currently, only>R and R> are actually
> >>> "primitives" coded in C. Eventually, DUP DROP SWAP OVER will
> >>> be actual low-level Forth "primitives", as they once were.
> >>
> >> Coded in 'C'? Perhaps I missed something. I was talking about a Forth
> >> like CPU. Are you referring to a Forth compiler on a PC?
> >>
> >
> > Yes. You're talking about a Forth like CPU. I'm talking about my Forth
> > interpreter in C. I'm attempting to discuss similarities which are
> > useful.
>
> Ok, now that I get the context, what is the point exactly? Why do you
> only code the lesser number of primitives when you will eventually code
> more words as primitives.

I'm in the process of converting much of my code from precompiled ITC Forth
in C (similar to a DTC Forth in assembly) to Forth as ASCII text. The goal
was to 1) make it as independent of C as possible and/or reveal hidden C
dependencies, and 2) have only primitives, variables, and other low-level
words in C. The reality is slightly different. What's left are the
primitives, variables, other low-level words in C, and a few critical
high-level Forth words in C, as well a program (re)written entirely in terms
of primitives to implement the outer interpreter.

> I'm not following what point you are trying
> to make.

Whatever operations you chose to implement for your processor affects how
quickly various Forth operations will be.

> Is this an exercise where you hope to learn how best to
> optimize primitives?

No. It's mostly a result of the transformation process. Minimizing
primitives allowed additional code to be moved out of the "C space" to the
pure "Forth space".

> > If you implement it as a single instruction, that's great. But, you're
> > not going to fit all Forth words into your instruction set. So, some
> > words are going to be made up of multiple Forth words comprised
> > of multiple instructions. This was just an example of how I went
> > about optimizing such a sequence.
>
> I'm still not following your point. OVER is a very logical word to
> implement as a primitive. It is used often enough and requires little
> or nothing in the way of extra hardware for a CPU design and should be a
> single instruction on most existing CPUs.

Pick a Forth word that you're not going to implement as an instruction.
Implement it using multiple instructions. Count how many. Now, if you
change your set of instructions, is it implemented in fewer or more
instructions? You want a set of instructions which results in fewer,
especially for the Forth words you think will be used the most.

> > You're going to be implementing instructions for your processor. Which
> > instructions you implement and which you don't implement affects how
> > each Forth word is coded. That affects their speed.
>
> Ok. For instructions I picked functions that were -
>
> 1) Basic enough to be implemented in one clock cycle
> 2) Were good primitives in a Forth sense
> 3) Were high on the list of use frequency according to Koopman's book
>
> not necessarily in that priority.
>
> I did not have a one to one mapping of instructions to existing Forth
> words, but many instructions were existing Forth words or very similar.
> Some exceptions were adding the top elements of the return stack and
> adding the top of data stack to the return stack. These are used to
> control looping, although they may change if I return to working on this
> design.

Ok.

> >> [...] If you want to optimize the

> >> implementation, why not look at what is used most often and start by
> >> optimizing those operators? That is why I went to Koopman's book. Not
> >> much return on optimizing things that aren't used so much.
> >
> > There is no quarantee that the instruction frequencies will be the most
> > used in the code that will be run. It merely demonstrates which was
> > the most used in the code that was sampled.
>
> Then you have no basis for picking any words for optimization. No
> matter what words you optimize, if the application doesn't use those
> instructions very much the optimization won't be effective for that app.

True.

> That is why I rely on data like Koopman's book, it draws on a range of
> apps.

Yes, it's representative of the range of apps that were sampled. As long as
the code to be executed uses instructions similarly, you'll be fine.

> [...]

> The thing above about balancing across multiple words
> has the same flaw of only applying to some applications as any other
> method of selection.

If you attempt to balance all instructions so each takes the same amount
of time, that results in a RISC like design. Yes?

If you implement some instructions for speed and disregard optimization for
others, that results in a CISC like design. Yes?

Rod Pemberton

rickman

unread,

Oct 26, 2012, 11:01:38 PM10/26/12

to

I don't think you understand me. I am not saying pipelining is an
important thing to use. It all depends on your requirements. I am
simply responding to your statement that "The only pipelining that helps
performance on a MISC is the instruction prefetch". I simply don't
agree with that because what holds back your MISC design depends on your
MISC design. BTW, my CPU design doesn't have a prefetch. It just
fetches each instruction in parallel with the current instruction
execution. Because the instruction fetch unit is in parallel with the
other units there is no problem with the instruction needing to be
"prefetched". It would only be if the design were pipelined that the
fetch would be a "prefetch".

I don't intend to even try to pipeline my CPU design mainly because I
don't have the time or interest. I prefer to try to optimize design
tradeoffs which don't complicate the design, but should actually
simplify it. In other words, optimizing by simplifying... much like
what Chuck prefers. But for now I'll keep the SWAP instruction and will
call an XOR an XOR.

I guess stuff like IF ELSE THEN and OR instead of XOR is what you get
when you work just to please yourself! Still, you have to recognize the
brilliance of the man.

Rick

rickman

unread,

Oct 27, 2012, 3:43:00 PM10/27/12

to

On 10/26/2012 8:57 PM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message

> news:k6cg75$o2d$1...@dont-email.me...

>>
>> Ok, now that I get the context, what is the point exactly? Why do you
>> only code the lesser number of primitives when you will eventually code
>> more words as primitives.
>
> I'm in the process of converting much of my code from precompiled ITC Forth
> in C (similar to a DTC Forth in assembly) to Forth as ASCII text. The goal
> was to 1) make it as independent of C as possible and/or reveal hidden C
> dependencies, and 2) have only primitives, variables, and other low-level
> words in C. The reality is slightly different. What's left are the
> primitives, variables, other low-level words in C, and a few critical
> high-level Forth words in C, as well a program (re)written entirely in terms
> of primitives to implement the outer interpreter.

Maybe you can explain the purpose of wanting Forth in ASCII text?
Wouldn't that be the source? Or are you talking about F-code in text?
What would be the purpose of doing this?

>> I'm not following what point you are trying
>> to make.
>
> Whatever operations you chose to implement for your processor affects how
> quickly various Forth operations will be.
>
>> Is this an exercise where you hope to learn how best to
>> optimize primitives?
>
> No. It's mostly a result of the transformation process. Minimizing
> primitives allowed additional code to be moved out of the "C space" to the
> pure "Forth space".

"Allowed"??? By definition that is minimizing primitives. But why?
Typically more primitives are better in most metrics other than possibly
size. The main reason is that you need to draw a line somewhere
otherwise you end up with all assembly. Too many words is just extra
work without much benefit. Minimizing primitives makes the code slower,
possibly a bit more compact, but otherwise to what advantage?

>> I'm still not following your point. OVER is a very logical word to
>> implement as a primitive. It is used often enough and requires little
>> or nothing in the way of extra hardware for a CPU design and should be a
>> single instruction on most existing CPUs.
>
> Pick a Forth word that you're not going to implement as an instruction.
> Implement it using multiple instructions. Count how many. Now, if you
> change your set of instructions, is it implemented in fewer or more
> instructions? You want a set of instructions which results in fewer,
> especially for the Forth words you think will be used the most.

That is the purpose of using surveys like Koopman's. But I have been
thinking lately that this may not be the best way to optimize
instruction sets. Working with just *my* code would not be good for
anything other than the code I've written in the past.

> If you attempt to balance all instructions so each takes the same amount
> of time, that results in a RISC like design. Yes?
>
> If you implement some instructions for speed and disregard optimization for
> others, that results in a CISC like design. Yes?

Does it? One of my design rules is that all instructions take one clock
cycle, period. Adding states to control multi-cycle instructions adds
to the instruction decode impacts the goal of minimizing the LUT usage.
Another goal is minimize RAM usage which is important in space
constrained apps.

Rather than optimizing the word definitions in Forth equally, I expect
to optimize the functions I will implement using the assembly language.
These functions will be the ones that use most of the CPU time.
Koopman's Forth word frequency is a starting point, but I will tweek the
instructions according to what I use in the future.

I took a quick look at 16 bit VLIW instructions this morning. I could
code each of the three execution units separately and use 11 of the 16
bits leaving 5 bits for literal data in some instances. I could also
combine the address unit and the fetch unit since they are closely
related and use 4 bits each for the data and address units. This leaves
8 bits for literal data or possibly other controls when literals are not
used. If I pursue this I may end up with a new set of possible
instructions since address unit and data unit can work independently and
in parallel.

Rick

Bernd Paysan

unread,

Oct 27, 2012, 4:18:24 PM10/27/12

to

rickman wrote:
> I don't think you understand me.

Partly, probably because you insist on calling your design non-pipelined
and your parallel instruction fetch unit not a "prefetch", even though
it is.

My opinion is that the only pipelining that helps a MISC is the
instruction prefetch. You can pipeline other parts, *but it doesn't
help*. I.e. it doesn't provide better performance, because you will be
waiting for dependent instructions. The only low-hanging fruit you have
is to make the instruction fetch unit work in parallel with the other
units. The ALU and the multiplexers in a MISC are all small and fast,
and the first limit you get is the memory timing limit.

> I am not saying pipelining is an
> important thing to use. It all depends on your requirements. I am
> simply responding to your statement that "The only pipelining that
> helps
> performance on a MISC is the instruction prefetch". I simply don't
> agree with that because what holds back your MISC design depends on
> your MISC design. BTW, my CPU design doesn't have a prefetch. It just
> fetches each instruction in parallel with the current instruction
> execution.

This is the definition of a prefetch. You fetch the next instruction
before you complete with the current instruction.

> Because the instruction fetch unit is in parallel with the
> other units there is no problem with the instruction needing to be
> "prefetched". It would only be if the design were pipelined that the
> fetch would be a "prefetch".

As you have just explained, your design is pipelined, and you do a
prefetch. A non-pipelined execution fetches current instruction and
executes it, and when done would start to fetch the next instruction.
Nothing parallel.

> I don't intend to even try to pipeline my CPU design mainly because I
> don't have the time or interest. I prefer to try to optimize design
> tradeoffs which don't complicate the design, but should actually
> simplify it. In other words, optimizing by simplifying... much like
> what Chuck prefers. But for now I'll keep the SWAP instruction and
> will call an XOR an XOR.

Calling an XOR and XOR is really wise, and SWAP is a useful instruction.
I don't know why Chuck wants to deliberately confuse people with his
OR=XOR and -=INVERT. If he wants to be compact, he could use ~ for
invert, & for AND, | for OR, and ^ for XOR. Nobody would complain.

> I guess stuff like IF ELSE THEN and OR instead of XOR is what you get
> when you work just to please yourself! Still, you have to recognize
> the brilliance of the man.

HP renamed IF ELSE THEN to IF condition THEN a ELSE b ENDIF. IF is
really just syntactic sugar. This is a lot less confusing for people
who don't know much about Forth, because it looks exactly like Algol.
But in fact, it's just Forth in disguise.

If we follow the advice above, to use C's short symbols, we could have ?
: ; for IF ELSE THEN, and ; could be a general terminator of controll
structures - only if it sees colon-sys it would terminate the definition
as such.

rickman

unread,

Oct 27, 2012, 7:19:13 PM10/27/12

to

On 10/27/2012 4:18 PM, Bernd Paysan wrote:
> rickman wrote:
>> I don't think you understand me.
>
> Partly, probably because you insist on calling your design non-pipelined
> and your parallel instruction fetch unit not a "prefetch", even though
> it is.

Ok, I would like to understand why you think it is a prefetch. Each
instruction fetches the next sequential instruction unless the
instruction is a branch, call or return in which case the appropriate
instruction is fetched. No instruction is fetched ahead of time, no
instruction is ever tossed away. The instruction to be executed next is
always fetched.

I think the issue is where I put the register in the fetch loop.
Instead of putting the register between the address generation and
memory, the register is at the output of the memory because the block
RAMs have a register there and I don't get a choice. So I moved the
register in the loop from the input of the memory to the output. This
doesn't change the way the single cycle machine works, so I don't call
it a prefetch.

> My opinion is that the only pipelining that helps a MISC is the
> instruction prefetch. You can pipeline other parts, *but it doesn't
> help*. I.e. it doesn't provide better performance, because you will be
> waiting for dependent instructions. The only low-hanging fruit you have
> is to make the instruction fetch unit work in parallel with the other
> units. The ALU and the multiplexers in a MISC are all small and fast,
> and the first limit you get is the memory timing limit.

Pipelining speeds up the cycle time of an operation. If the instruction
fetch is not the limiting function in the cycle time of the machine then
pipelining the function that is limiting the cycle time will provide a
faster cycle time, although the overall performance may or may not
improve.

>> I am not saying pipelining is an
>> important thing to use. It all depends on your requirements. I am
>> simply responding to your statement that "The only pipelining that
>> helps
>> performance on a MISC is the instruction prefetch". I simply don't
>> agree with that because what holds back your MISC design depends on
>> your MISC design. BTW, my CPU design doesn't have a prefetch. It just
>> fetches each instruction in parallel with the current instruction
>> execution.
>
> This is the definition of a prefetch. You fetch the next instruction
> before you complete with the current instruction.

Not really before, at the same time. The instruction fetch does not
depend on the current instruction unless it is a "fetch" type
instruction. The conditionals check flags in FFs from the previous
cycles.

>> Because the instruction fetch unit is in parallel with the
>> other units there is no problem with the instruction needing to be
>> "prefetched". It would only be if the design were pipelined that the
>> fetch would be a "prefetch".
>
> As you have just explained, your design is pipelined, and you do a
> prefetch. A non-pipelined execution fetches current instruction and
> executes it, and when done would start to fetch the next instruction.
> Nothing parallel.

I must have mis-stated something. The design is NOT pipelined.
Everything happens in one clock cycle and is done. A pipelined
operation is split over multiple cycles. My design simply moves the
register in the cycle from the start of the fetch to the start of the
decode. This doesn't change the way the cycle works.

>> I don't intend to even try to pipeline my CPU design mainly because I
>> don't have the time or interest. I prefer to try to optimize design
>> tradeoffs which don't complicate the design, but should actually
>> simplify it. In other words, optimizing by simplifying... much like
>> what Chuck prefers. But for now I'll keep the SWAP instruction and
>> will call an XOR an XOR.
>
> Calling an XOR and XOR is really wise, and SWAP is a useful instruction.
> I don't know why Chuck wants to deliberately confuse people with his
> OR=XOR and -=INVERT. If he wants to be compact, he could use ~ for

> invert,& for AND, | for OR, and ^ for XOR. Nobody would complain.

Well, who came up with IF ELSE THEN? That is pretty well accepted, but
it is the same thing. I can live with those opcode names, -, OR, etc.
I won't use them in my design, at least in part because I want my
opcodes to be all text, ADD, NOP, INVERT... I just wish it was all
documented better...

>> I guess stuff like IF ELSE THEN and OR instead of XOR is what you get
>> when you work just to please yourself! Still, you have to recognize
>> the brilliance of the man.
>
> HP renamed IF ELSE THEN to IF condition THEN a ELSE b ENDIF. IF is
> really just syntactic sugar. This is a lot less confusing for people
> who don't know much about Forth, because it looks exactly like Algol.
> But in fact, it's just Forth in disguise.

I've never liked the term, "syntactic sugar" because it has no defined
meaning.

> If we follow the advice above, to use C's short symbols, we could have ?
> : ; for IF ELSE THEN, and ; could be a general terminator of controll
> structures - only if it sees colon-sys it would terminate the definition
> as such.

Ok. I don't mind typing longer words. I use VHDL when I design FPGAs,
so obviously I like to type. ;7)

Rick

Albert van der Horst

unread,

Oct 28, 2012, 3:59:26 AM10/28/12

to

In article <1762397.1...@sunwukong.fritz.box>,
Bernd Paysan <bernd....@gmx.de> wrote:
<SNIP>

>
>Calling an XOR and XOR is really wise, and SWAP is a useful instruction.
>I don't know why Chuck wants to deliberately confuse people with his
>OR=XOR and -=INVERT. If he wants to be compact, he could use ~ for
>invert, & for AND, | for OR, and ^ for XOR. Nobody would complain.

If you try working with colorforth it becomes painfully obvious.
With the colorforth encoding you've only 80% chance to get the
right key. So typing OR succeeds in 64% of the cases, while XOR
would result in about 49 %. The word - you can type with an 80% hit rate
and the chance of getting INVERT correct becomes infinitesimal.
Remember there is no erase key, only an erase word!

<SNIP>

Rod Pemberton

unread,

Oct 28, 2012, 4:21:47 AM10/28/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k6hq5t$s0f$1...@dont-email.me...
...

> The design is NOT pipelined.

Ok.

> Everything happens in one clock cycle and is done.

Are there still multiple steps or operations per clock? (Yes?)

E.g., an ADD instruction may have four operations: read of memory into an
ALU register, read of uP (microprocessor) register to ALU register, adds ALU
registers, stores ALU result register result back into uP register.

> A pipelined operation is split over multiple cycles.

If Intel designed an internal PLL (phase-locked loop) into their uP's, they
could use a slower external clock with a faster internal clock and then
claim that their CISC processor is non-pipelined ...

So, yours is different how? ;-)

I.e., the point being pipelining is not just splitting an operation over
multiple cycles, but also executing another instruction at the same time.

> My design simply moves the register in the cycle from the
> start of the fetch to the start of the decode. This doesn't
> change the way the cycle works.

So, you should've told BP you had a single register instruction cache
or a 1-element FIFO buffer... :-) Lol!

Joking aside, your design is synchronous, not asynchronous, because of the
FPGA, yes?

Rod Pemberton

Rod Pemberton

unread,

Oct 28, 2012, 5:01:33 AM10/28/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k6hdga$4je$1...@dont-email.me...

> On 10/26/2012 8:57 PM, Rod Pemberton wrote:
> > "rickman"<gnu...@gmail.com> wrote in message
> > news:k6cg75$o2d$1...@dont-email.me...
...

> >> Ok, now that I get the context, what is the point exactly? Why do you
> >> only code the lesser number of primitives when you will eventually code
> >> more words as primitives.
> >
> > I'm in the process of converting much of my code from precompiled ITC
> > Forth in C (similar to a DTC Forth in assembly) to Forth as ASCII text.
> > The goal was to 1) make it as independent of C as possible and/or reveal
> > hidden C dependencies, and 2) have only primitives, variables, and other
> > low-level words in C. The reality is slightly different. What's left
> > are the primitives, variables, other low-level words in C, and a few
> > critical high-level Forth words in C, as well a program (re)written
> > entirely in terms of primitives to implement the outer interpreter.
>
> Maybe you can explain the purpose of wanting Forth in ASCII text?

Didn't I just explain that?

> Wouldn't that be the source?

Yes. It's Forth source.

Previously, the definitions were in C, as ITC address-lists, just like Forth
coded in assembly. Now, they're mostly in ASCII text as high-level Forth
definitions.

> Or are you talking about F-code in text?

No.

> What would be the purpose of doing this?

See 1) and 2) above.

> >> Is this an exercise where you hope to learn how best to
> >> optimize primitives?
> >
> > No. It's mostly a result of the transformation process. Minimizing
> > primitives allowed additional code to be moved out of the "C space" to
> > the pure "Forth space".
>
> "Allowed"??? By definition that is minimizing primitives.

Yes, allowed. I.e., even some of the typical "primitives" can be written in
terms of other more primitive "primitives". A typical example might be C@
or BRANCH. For C@, which is typically a primitive, you could construct it
using @ and AND. The same is true for BRANCH. It's usually a primitive,
but can be easily constructed in Forth for an interpreted Forth (exact
definition depends on implementation specific details):

\ direct addresses
: BRANCH R> @ >R ;

\ relative offset with pointer adjustment
: BRANCH R> DUP @ + CELL+ >R ;

> But why?

I thought I stated that. It's a result of eliminating the C code and
dependencies, and converting it to Forth-in-Forth.

> Typically more primitives are better in most metrics other
> than possibly size.

Yes, that comes later. Once I've got as much Forth-in-Forth as is possible,
and as little Forth-in-C code as is possible, then I can begin
(re)implementing
more primitives and other higher-level words as primitives that will improve
speed.

> The main reason is that you need to draw a line somewhere
> otherwise you end up with all assembly.

... all C.

> Too many words is just extra work without much benefit.

Exactly. You're clearly not going to implement all Forth words as
instructions on your processor either. One of the things I noticed from the
instruction frequencies for Forth, is that even the handful of the most used
Forth words were used like only 10% of the time or so ... None of them
were used 25% or 40% or 70% of the time.

> Minimizing primitives makes the code slower,
> possibly a bit more compact, but otherwise to what advantage?

To allow the transformation from C to Forth occur. Each word that isn't a
primitive can be implemented in high-level Forth. The fewer primitives
there are the more code there is that than can be converted. At some point,
I will reach the absolute minimum that I can comprehend. At that point, the
code is the least dependent on C that I can make it. Then, I can easily
reorganize or rewrite my Forth definitions. After that, I can rework them
for improved speed, e.g., by adding or converting words to primitives.

> > If you attempt to balance all instructions so each takes the same amount
> > of time, that results in a RISC like design. Yes?
> >
> > If you implement some instructions for speed and disregard optimization
> > for others, that results in a CISC like design. Yes?
>
> Does it? One of my design rules is that all instructions take one clock
> cycle, period.

If an XOR to memory has six steps and an AND to register has four steps?

You mentioned MISC. Are breaking AND etc down into RISC-like operations
that become your instructions? E.g., move-register-to-memory becomes an
instruction?

Rod Pemberton

Bernd Paysan

unread,

Oct 28, 2012, 9:45:50 AM10/28/12

to

rickman wrote:
>> Partly, probably because you insist on calling your design
>> non-pipelined and your parallel instruction fetch unit not a
>> "prefetch", even though it is.
>
> Ok, I would like to understand why you think it is a prefetch. Each
> instruction fetches the next sequential instruction unless the
> instruction is a branch, call or return in which case the appropriate
> instruction is fetched. No instruction is fetched ahead of time, no
> instruction is ever tossed away. The instruction to be executed next
> is always fetched.

That's a prefetch. Each instruction fetches the *next* instruction,
ahead of time. Pre=next. I don't know why you are confused.

> I think the issue is where I put the register in the fetch loop.
> Instead of putting the register between the address generation and
> memory, the register is at the output of the memory because the block
> RAMs have a register there and I don't get a choice. So I moved the
> register in the loop from the input of the memory to the output. This
> doesn't change the way the single cycle machine works, so I don't call
> it a prefetch.

If you try to communicate, call things the way they are. If it looks
like a duck, quacks like a duck, swims like a duck, walks like a duck,
flies like a duck, it is called "duck". If you fetch the next
instruction while you are still busy with the current instruction, it is
a prefetch.

>> This is the definition of a prefetch. You fetch the next instruction
>> before you complete with the current instruction.
>
> Not really before, at the same time.

The diagram of a two-stage pipeline is

fetch | execute |
fetch | execute |

You do the fetch *before* you are *done* with execute (my words). So
*concurrent* with execute (your words). Can you transform these two
identical statements into each other so we can agree that we agree?

> The instruction fetch does not
> depend on the current instruction unless it is a "fetch" type
> instruction.

Yes. That's why prefetch is a low-hanging fruit, and done most of the
time. As your example shows.

> I must have mis-stated something. The design is NOT pipelined.
> Everything happens in one clock cycle and is done. A pipelined
> operation is split over multiple cycles.

Well it is, according to your description of what happens. See above.
You fetch the instruction in one cycle and execute it in the next cycle.
"one" + "next" = 2.

The confusion might be that you think of an instruction as doing two
things:

a) execute itself
b) fetch the next instruction

These two mostly independent operations happen in parallel. The
conceptual entity however is *one* instruction, not two of them. One
instruction is executed by

a) fetching it
b) executing it

This is a sequential process, taking two cycles.

Your mental model as above makes it easier for you to deal with the
paralleism, and make sure there are no pipeline stalls in any case. No
argument about that. This is just a discussion about terms.

> My design simply moves the
> register in the cycle from the start of the fetch to the start of the
> decode. This doesn't change the way the cycle works.

It moves the entire fetch operation by one cycle - start of decode is
end of fetch.

> I've never liked the term, "syntactic sugar" because it has no defined
> meaning.

Of course "syntactic sugar" has a defined meaning. It is a syntactic
element that makes the program easier to read, while having no function
of its own:

http://en.wikipedia.org/wiki/Syntactic_sugar

I.e. to get HP's syntax, you define

: endif postpone then ; immediate
: then postpone if ; immediate
: if ; immediate

and then you can write something like

: simple. ( n -- ) if dup 3 > then drop ." many " else . endif ;
\ numbers are 1 2 3 many...

IF in this case is obvious syntactic sugar, as you can leave it away
without changing the functionality of the program.

rickman

unread,

Oct 28, 2012, 2:36:58 PM10/28/12

to

On 10/28/2012 4:21 AM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message
> news:k6hq5t$s0f$1...@dont-email.me...
> ...
>
>> The design is NOT pipelined.
>
> Ok.
>
>> Everything happens in one clock cycle and is done.
>
> Are there still multiple steps or operations per clock? (Yes?)

I'm not sure what you are asking. There are any number of things going
on in parallel. For example, when a stack is pushed, the pointer is
incremented, the memory is written, the next instruction is fetched and
the PC being used is registered. But none of this is pipelined, it all
happens in one clock.

> E.g., an ADD instruction may have four operations: read of memory into an
> ALU register, read of uP (microprocessor) register to ALU register, adds ALU
> registers, stores ALU result register result back into uP register.

You are talking about a RISC or CISC machine. Instructions in a MISC
are typically selected to run in one clock cycle and not take multiple
steps. e.g. R> is just a pop of the return stack, a push on the data
stack and a fetch of the next instruction... one clock cycle and all in
parallel.

>> A pipelined operation is split over multiple cycles.
>
> If Intel designed an internal PLL (phase-locked loop) into their uP's, they
> could use a slower external clock with a faster internal clock and then
> claim that their CISC processor is non-pipelined ...

Really, so their processor won't take N times the instruction time to
recover from a branch?

> So, yours is different how? ;-)
>
> I.e., the point being pipelining is not just splitting an operation over
> multiple cycles, but also executing another instruction at the same time.

Yes, by inserting registers in the MIDDLE of a function so that it takes
multiple clocks to execute, but each stage can be processing a separate
instruction.

>> My design simply moves the register in the cycle from the
>> start of the fetch to the start of the decode. This doesn't
>> change the way the cycle works.
>
> So, you should've told BP you had a single register instruction cache
> or a 1-element FIFO buffer... :-) Lol!
>
> Joking aside, your design is synchronous, not asynchronous, because of the
> FPGA, yes?

Yes, it is synchronous. There are only three ways to create an async
design, one is a custom chip, e.g. the GA144, two is an async FPGA, e.g.
Achronix which pretty much only supports the large players, or three
wire your own stuff from MSI chips. I won't be doing any of these
unless someone with deep pockets comes along asking me to.

Rick

rickman

unread,

Oct 28, 2012, 3:04:27 PM10/28/12

to

On 10/28/2012 9:45 AM, Bernd Paysan wrote:
> rickman wrote:
>>> Partly, probably because you insist on calling your design
>>> non-pipelined and your parallel instruction fetch unit not a
>>> "prefetch", even though it is.
>>
>> Ok, I would like to understand why you think it is a prefetch. Each
>> instruction fetches the next sequential instruction unless the
>> instruction is a branch, call or return in which case the appropriate
>> instruction is fetched. No instruction is fetched ahead of time, no
>> instruction is ever tossed away. The instruction to be executed next
>> is always fetched.
>
> That's a prefetch. Each instruction fetches the *next* instruction,
> ahead of time. Pre=next. I don't know why you are confused.

Aren't the instructions always fetched ahead of the decode? No,
sometimes they are fetched in parallel of the decode and execute, but
wait, that is not the *next* instruction, that is the instruction that
will execute after the instruction you just fetched which is not yet
executing... why, because the CPU is pipelined.

I don't agree. Prefetch is fetching instructions ahead of time when you
are pipelined. The result is that when the code has a branch the
instruction fetched has to be thrown away.

By your definition all processors prefetch because the instruction is
fetched BEFORE it is decoded or executed.

The classic CPU model has a PC register which is the start of each
instruction: fetch, decode, execute with no registers in between. In my
design I "rotated" the registers a bit. This is because I have to have
an instruction register because the memory has this. So the instruction
cycle is... decode, execute, fetch. No pipelining and I don't see how
this makes the fetch a "prefetch". The fetch is done on the next
instruction to be executed, not an instruction that is *anticipated* to
be fetched.

If you insist on calling this a prefetch, it adds nothing to the
conversation because the results that typically are connected to a
prefetch are not attached to my design.

>> I've never liked the term, "syntactic sugar" because it has no defined
>> meaning.
>
> Of course "syntactic sugar" has a defined meaning. It is a syntactic
> element that makes the program easier to read, while having no function
> of its own:
>
> http://en.wikipedia.org/wiki/Syntactic_sugar

Wikipedia doesn't define the language. Regardless, calling something
syntactic sugar has no utility to me.

> I.e. to get HP's syntax, you define
>
> : endif postpone then ; immediate
> : then postpone if ; immediate
> : if ; immediate
>
> and then you can write something like
>
> : simple. ( n -- ) if dup 3> then drop ." many " else . endif ;
> \ numbers are 1 2 3 many...
>
> IF in this case is obvious syntactic sugar, as you can leave it away
> without changing the functionality of the program.
>

Unfortunately you have snipped the context of the syntactic sugar
statement so I have no idea what this is now about.

Rick

rickman

unread,

Oct 28, 2012, 3:06:26 PM10/28/12

to

On 10/28/2012 3:59 AM, Albert van der Horst wrote:
> In article<1762397.1...@sunwukong.fritz.box>,
> Bernd Paysan<bernd....@gmx.de> wrote:
> <SNIP>
>>
>> Calling an XOR and XOR is really wise, and SWAP is a useful instruction.
>> I don't know why Chuck wants to deliberately confuse people with his
>> OR=XOR and -=INVERT. If he wants to be compact, he could use ~ for

>> invert,& for AND, | for OR, and ^ for XOR. Nobody would complain.

>
> If you try working with colorforth it becomes painfully obvious.
> With the colorforth encoding you've only 80% chance to get the
> right key. So typing OR succeeds in 64% of the cases, while XOR
> would result in about 49 %. The word - you can type with an 80% hit rate
> and the chance of getting INVERT correct becomes infinitesimal.
> Remember there is no erase key, only an erase word!

You may just be right!

Rick

rickman

unread,

Oct 28, 2012, 3:23:27 PM10/28/12

to

No, you didn't explain why you need to use source. You gave goals, not
rationals for what you did.

If you don't want to discuss this so I can understand, just say so.
I'll stop asking.

Isn't that rather a DUH! You can only have one word with >50%
frequency. You can only have three words with >30% frequency, etc. Are
you expecting to have lots of words that are used 90% of the time? I
really don't follow.

>> Minimizing primitives makes the code slower,
>> possibly a bit more compact, but otherwise to what advantage?
>
> To allow the transformation from C to Forth occur. Each word that isn't a
> primitive can be implemented in high-level Forth. The fewer primitives
> there are the more code there is that than can be converted. At some point,
> I will reach the absolute minimum that I can comprehend. At that point, the
> code is the least dependent on C that I can make it. Then, I can easily
> reorganize or rewrite my Forth definitions. After that, I can rework them
> for improved speed, e.g., by adding or converting words to primitives.

This sounds very complex.

>>> If you attempt to balance all instructions so each takes the same amount
>>> of time, that results in a RISC like design. Yes?
>>>
>>> If you implement some instructions for speed and disregard optimization
>>> for others, that results in a CISC like design. Yes?
>>
>> Does it? One of my design rules is that all instructions take one clock
>> cycle, period.
>
> If an XOR to memory has six steps and an AND to register has four steps?
>
> You mentioned MISC. Are breaking AND etc down into RISC-like operations
> that become your instructions? E.g., move-register-to-memory becomes an
> instruction?

RISC? I don't use RISC concepts, that would be too complex. MISC - M
for Minimal. There are NO registers, there are TWO STACKS, FULL STOP!
Instructions are things like, move stack to memory, xor popping the top
two items on the data stack and pushing the result (still one
operation), branch/call/return. These may not be the same as Forth
words, but they can be used to make Forth words. There is no XOR to
memory as an instruction.

You can learn a lot about MISC by learning the instruction set of the
F18A. Chuck uses two address registers which is different from my
design. The instructions are similar in scale. They can all be done in
one operation and don't require multiple steps.

Rick

Bernd Paysan

unread,

Oct 28, 2012, 4:09:39 PM10/28/12

to

rickman wrote:
>> It moves the entire fetch operation by one cycle - start of decode is
>> end of fetch.
>
> I don't agree. Prefetch is fetching instructions ahead of time when
> you are pipelined. The result is that when the code has a branch the
> instruction fetched has to be thrown away.
>
> By your definition all processors prefetch because the instruction is
> fetched BEFORE it is decoded or executed.

> The fetch is done on the next
> instruction to be executed, not an instruction that is *anticipated*
> to be fetched.

That's perfectly fine, in a 2-stage pipeline, you don't need to
anticipate anything. You are probably confusing this with a classical
four-stage pipeline design where you have

fetch | decode+r | execute | write
fetch | decode+r | execute | write

In this case, if you execute a branch, you have already fetched *one*
instruction after the branch. This one needs to be thrown away.

With a two-stage pipeline, you don't need to throw anything away. I.e.
you always fetch the rigth instruction.

> If you insist on calling this a prefetch, it adds nothing to the
> conversation because the results that typically are connected to a
> prefetch are not attached to my design.

The starting point of this lengthy conversation was that I said that
MISC processors typically use prefetch, but no other optimization. This
is what you opposed to, but the only disagreement we have is over
language, not over actual facts. Fact is that you parallelize fetch and
execute, which *is* the low-hanging fruit of optimization I was talking
about.

If we disagree about the term ("instruction prefetch" is pretty
generic), then well. Let's say fetching the next instruction is a
degenerated case of prefetch, because the instruction arrives just when
you actually need it, i.e. no need for queuing it up.

There are single-cycle MISC CPUs which do not prefetch at all, e.g. the
J1 CPU. There, the cycle consists of fetching the instruction, and then
asynchronously decoding it and selecting the result of the corresponding
operations (which have completed by the time the instruction is
fetched). It's not the next instrcution which is fetched, it's the
*current* instruction.

>>> I've never liked the term, "syntactic sugar" because it has no
>>> defined meaning.
>>
>> Of course "syntactic sugar" has a defined meaning. It is a syntactic
>> element that makes the program easier to read, while having no
>> function of its own:
>>
>> http://en.wikipedia.org/wiki/Syntactic_sugar
>
> Wikipedia doesn't define the language. Regardless, calling something
> syntactic sugar has no utility to me.

*Nobody* defines language. Language is made up by *usage*, and usage
can be *observed*. An encyclopedia is a collection of observed usage.
Languages work by common understanding and using the same terms for the
same things; if you are in doubt that the way you use your terms are
understood by others, get a life and check reality. Wikipedia can be a
good start.

>> I.e. to get HP's syntax, you define
>>
>> : endif postpone then ; immediate
>> : then postpone if ; immediate
>> : if ; immediate
>>
>> and then you can write something like
>>
>> : simple. ( n -- ) if dup 3> then drop ." many " else . endif ;
>> \ numbers are 1 2 3 many...
>>
>> IF in this case is obvious syntactic sugar, as you can leave it away
>> without changing the functionality of the program.
>>
>
> Unfortunately you have snipped the context of the syntactic sugar
> statement so I have no idea what this is now about.

The context of syntactic sugar was HP's modified IF THEN ELSE ENDIF, as
you can read here. The context is still there.

rickman

unread,

Oct 28, 2012, 4:59:50 PM10/28/12

to

On 10/28/2012 4:09 PM, Bernd Paysan wrote:
> rickman wrote:
>>> It moves the entire fetch operation by one cycle - start of decode is
>>> end of fetch.
>>
>> I don't agree. Prefetch is fetching instructions ahead of time when
>> you are pipelined. The result is that when the code has a branch the
>> instruction fetched has to be thrown away.
>>
>> By your definition all processors prefetch because the instruction is
>> fetched BEFORE it is decoded or executed.
>
> No, please, try to understand what you are doing. A processor which
> doesn't prefetch does
>
> fetch | execute | fetch | execute

This is a retarded processor with pipeline registers, but no
concurrency. It should be...

fetch execute | fetch execute | fetch execute

Why would you draw a line between the fetch and the execute?

You are correct that the fetch and execute operate concurrently. That
is the only difference. I simply put my register in a different place.

>> The fetch is done on the next
>> instruction to be executed, not an instruction that is *anticipated*
>> to be fetched.
>
> That's perfectly fine, in a 2-stage pipeline, you don't need to
> anticipate anything. You are probably confusing this with a classical
> four-stage pipeline design where you have
>
> fetch | decode+r | execute | write
> fetch | decode+r | execute | write
>
> In this case, if you execute a branch, you have already fetched *one*
> instruction after the branch. This one needs to be thrown away.
>
> With a two-stage pipeline, you don't need to throw anything away. I.e.
> you always fetch the rigth instruction.
>
>> If you insist on calling this a prefetch, it adds nothing to the
>> conversation because the results that typically are connected to a
>> prefetch are not attached to my design.
>
> The starting point of this lengthy conversation was that I said that
> MISC processors typically use prefetch, but no other optimization. This
> is what you opposed to, but the only disagreement we have is over
> language, not over actual facts. Fact is that you parallelize fetch and
> execute, which *is* the low-hanging fruit of optimization I was talking
> about.

Ok, I withdraw my objection to your original statement. If you want to
call what I do as pipelining, then ok, my design is MP, minimally
pipelined.

> If we disagree about the term ("instruction prefetch" is pretty
> generic), then well. Let's say fetching the next instruction is a
> degenerated case of prefetch, because the instruction arrives just when
> you actually need it, i.e. no need for queuing it up.
>
> There are single-cycle MISC CPUs which do not prefetch at all, e.g. the
> J1 CPU. There, the cycle consists of fetching the instruction, and then
> asynchronously decoding it and selecting the result of the corresponding
> operations (which have completed by the time the instruction is
> fetched). It's not the next instrcution which is fetched, it's the
> *current* instruction.

Ok, you are winning me over. Here is the difference in what I am doing...

Typical 1 stage pipelining:

Fetch 1 | Execute 1 |
| Fetch 2 | Execute 2 |
| Fetch 3 | Opps execute 2 was a jump! I need
Fetch N now...

This happens because the address of the Fetch 3 is from the PC which was
registered at the end of Fetch 2. I register the PC that was *used* for
Fetch 2 and the Fetch 3 calculates the address before the actual fetch
so if instruction 2 is a jump, the correct instruction for 3 is fetched
always.

>>>> I've never liked the term, "syntactic sugar" because it has no
>>>> defined meaning.
>>>
>>> Of course "syntactic sugar" has a defined meaning. It is a syntactic
>>> element that makes the program easier to read, while having no
>>> function of its own:
>>>
>>> http://en.wikipedia.org/wiki/Syntactic_sugar
>>
>> Wikipedia doesn't define the language. Regardless, calling something
>> syntactic sugar has no utility to me.
>
> *Nobody* defines language. Language is made up by *usage*, and usage
> can be *observed*. An encyclopedia is a collection of observed usage.
> Languages work by common understanding and using the same terms for the
> same things; if you are in doubt that the way you use your terms are
> understood by others, get a life and check reality. Wikipedia can be a
> good start.

Wikipedia is of dubious origins. I never use it as a primary source, as
recommended by *wikipedia*. When I use wikipedia, I always check the
references for anything important and I never use it to prove a point.

I don't *use* syntactic sugar because it doesn't mean anything to me.
If you want to communicate a thought to me related to this, please use
other terms.

Actually, I think I don't like the term because of the way others use
it. They use is as a label rather than discussing the real issue.
Labels tend to simplify an issue and bypass real thought about it.

>>> I.e. to get HP's syntax, you define
>>>
>>> : endif postpone then ; immediate
>>> : then postpone if ; immediate
>>> : if ; immediate
>>>
>>> and then you can write something like
>>>
>>> : simple. ( n -- ) if dup 3> then drop ." many " else . endif ;
>>> \ numbers are 1 2 3 many...
>>>
>>> IF in this case is obvious syntactic sugar, as you can leave it away
>>> without changing the functionality of the program.
>>>
>>
>> Unfortunately you have snipped the context of the syntactic sugar
>> statement so I have no idea what this is now about.
>
> The context of syntactic sugar was HP's modified IF THEN ELSE ENDIF, as
> you can read here. The context is still there.
>

Maybe the real issue is I don't know what this part of the conversation
is about. I think I said something about someone's objection to OR as
the xor instruction and compared it to some of the other "issues" in
Forth using unusual constructs like IF ELSE THEN.

I'm not sure I'm interested in exploring this part further.

Rick

Rod Pemberton

unread,

Oct 29, 2012, 7:06:32 PM10/29/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k6ju0o$5ib$1...@dont-email.me...

> On 10/28/2012 4:21 AM, Rod Pemberton wrote:
> > "rickman"<gnu...@gmail.com> wrote in message
> > news:k6hq5t$s0f$1...@dont-email.me...
...

> >> Everything happens in one clock cycle and is done.
> >
> > Are there still multiple steps or operations per clock? (Yes?)
>
> I'm not sure what you are asking. There are any number of things going
> on in parallel. For example, when a stack is pushed, the pointer is
> incremented, the memory is written, the next instruction is fetched and
> the PC being used is registered. But none of this is pipelined, it all
> happens in one clock.
>

I was asking if the operations are occuring sequentially per one clock - one
after another. But, you just stated that they're parallel - all operations
at the exact same time, roughly, in one clock. The issue is "one clock"
doesn't necessarily mean parallel, to me at least...

You can have a few sequential operations in one clock. E.g.,

parallel
1 2 3 4 - all at the same time - each no more than one clock

sequential - one after another - total of one clock or less
1 - fraction of a clock
2 - fraction of a clock
3 - fraction of a clock
4 - fraction of a clock

> Instructions in a MISC are typically selected to run in one clock
> cycle and not take multiple steps.

Ok.

> > Joking aside, your design is synchronous, not asynchronous, because of
> > the FPGA, yes?
>
> Yes, it is synchronous. There are only three ways to create an async
> design, one is a custom chip, e.g. the GA144, two is an async FPGA, e.g.
> Achronix which pretty much only supports the large players, or three
> wire your own stuff from MSI chips. I won't be doing any of these
> unless someone with deep pockets comes along asking me to.

... clock-less or self-clocking or free-running design with static ram for
registers ...

Are FPGA's required to be clocked? I would've assumed that it's more safe
for the signals to clock, but not a requirement.

Is there a tool that estimates the cycle time or frequency for your clock
based on your circuit design?

Rod Pemberton

Rod Pemberton

unread,

Oct 29, 2012, 7:32:22 PM10/29/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k6k0nn$n8o$1...@dont-email.me...

> On 10/28/2012 5:01 AM, Rod Pemberton wrote:
> > "rickman"<gnu...@gmail.com> wrote in message
> > news:k6hdga$4je$1...@dont-email.me...
> >> On 10/26/2012 8:57 PM, Rod Pemberton wrote:
> >>> "rickman"<gnu...@gmail.com> wrote in message
> >>> news:k6cg75$o2d$1...@dont-email.me...
...

> No, you didn't explain why you need to use source.

It's just an intermediate stage. It may remain in part, or may not. I'd
like to keep much as source. It's easy to change. But, I'm clearly going
to get rid of words that aren't doing much except slow the interpreter down.
If I keep the source and achieve my transformation goals, then I will only
need to compile a very small C program to implement or execute Forth.
Perhaps, maybe, it'll even be suitable for "sandboxing".

> You gave goals, not rationals for what you did.

I don't understand what you mean by that ...

Isn't eliminating dependence on C a rationale too?

> >> Too many words is just extra work without much benefit.
> >
> > Exactly. You're clearly not going to implement all Forth words as
> > instructions on your processor either. One of the things I noticed from
> > the instruction frequencies for Forth, is that even the handful of the
> > most used Forth words were used like only 10% of the time or so ...
> > None of them were used 25% or 40% or 70% of the time.
>
> Isn't that rather a DUH! You can only have one word with >50%
> frequency. You can only have three words with >30% frequency, etc. Are
> you expecting to have lots of words that are used 90% of the time? I
> really don't follow.

It's hard to determine where to concentrate your efforts if everything is of
minimal use. If the language had a small set of more frequently used
operations instead of a large set of minimally used operations, I'd think it
would be better suited to implementation on a RISC style microprocessor,
perhaps MISC too. Instead, people must analyze Forth words and potential
instructions, figure out how to rewrite them, or break them up into other
operations, etc until they think they've found the "magic" set of operations
or sub-operations which should allow them to very effectively implement
Forth. Obviously, even Charles Moore is _still_ working on this ... You're
working on this in your way. I'm working this in my way. Alex McDonald in
his. Bernd Paysan in his. Stephen Pelc in his. Marcel Hendrix in his.
Anton Ertl in his. Andrew Haley in his. Hugh Aguilar in his. Etc. At
what point does a highly effective Forth computation model become available?
... The apparent difficulty in finding an effective model implies to me
that maybe the Forth language is not such a good fit to the hardware. Yes,
I understand that a very simple computational model for Forth ITC
interpreters works. But, this is the era of Forth compilers ...

> RISC? I don't use RISC concepts, that would be too complex. MISC - M
> for Minimal. There are NO registers, there are TWO STACKS, FULL STOP!
> Instructions are things like, move stack to memory, xor popping the top
> two items on the data stack and pushing the result (still one
> operation), branch/call/return. These may not be the same as Forth
> words, but they can be used to make Forth words. There is no XOR to
> memory as an instruction.

Ok.

Rod Pemberton

Mark Wills

unread,

Oct 30, 2012, 5:07:27 AM10/30/12

to

On Oct 29, 11:02 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
wrote:

> I was asking if the operations are occuring sequentially per one clock - one
> after another. But, you just stated that they're parallel - all operations
> at the exact same time, roughly, in one clock. The issue is "one clock"
> doesn't necessarily mean parallel, to me at least...
>
> You can have a few sequential operations in one clock. E.g.,
>
> parallel
> 1 2 3 4 - all at the same time - each no more than one clock
>
> sequential - one after another - total of one clock or less
> 1 - fraction of a clock
> 2 - fraction of a clock
> 3 - fraction of a clock
> 4 - fraction of a clock
>

Irrelevant. Even if they do happen *slightly* one-after-the-other due
to gate propagation delays and the like, all the actions are atomic
from all perspectives except the very circuitry on the chip itself.

Rod Pemberton

unread,

Oct 30, 2012, 9:36:11 AM10/30/12

to

"Mark Wills" <markrob...@yahoo.co.uk> wrote in message
news:8d64f90b-b266-4c0c...@c20g2000vbz.googlegroups.com...

> On Oct 29, 11:02 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
> wrote:

...

> > [in response to statements by rickman, snipped by MW]

Yes, it's entirely irrelevant to an end user and c.l.f.

No, it's not irrelevant when discussing rickman's microprocessor design.

RP

Mark Wills

unread,

Oct 30, 2012, 11:09:13 AM10/30/12

to

On Oct 30, 1:32 pm, "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
wrote:
> "Mark Wills" <markrobertwi...@yahoo.co.uk> wrote in message

> RP- Hide quoted text -
>
> - Show quoted text -

I disagree. The clock cycle is the smallest (atomic) unit of time. If
he was using internal clock doublers and the like then he would have
said so!

rickman

unread,

Oct 30, 2012, 6:50:27 PM10/30/12

to

I can't say I understand the question. If a register isn't clocked, how
does it work? Are you asking if the FPGA uses clocked registers or
async registers with set/reset controls?

Many years ago it was found that it was hard to produce a design using
async logic and even harder to prove that it worked properly, especially
regarding timing. Partly in order to make simulation easier and to make
timing analysis easier, they came up with LSSD (Level Sensitive Static
Design). I won't go into what that entails but it mostly means all
registers use the data and clock enable inputs and a Q output. All logic
connects between those points and/or the I/Os. Async FF inputs are not
used.

If you do this, the logic can be simulated in what is called "unit
delay" simulation where it is assumed that all signals meet setup and
hold timing at each FF and only the logic needs to be considered. The
timing can be analyzed separately with what is called "static timing
analysis". This has been the stalwart of FPGA and ASIC design for many
years.

Async design like the GA144 or a very small number of other devices are
very distant outliers. There are few tools for designing async devices
and very little foundry support.

> Is there a tool that estimates the cycle time or frequency for your clock
> based on your circuit design?

In an FPGA, yes, this is called static timing analysis. You typically
give it a clock rate, in fact, the clock rate is provided to the tools
doing the place and route as well, and other timing constraints (clock
to a given input or output for example). The tool tells you if your
design meets the goals, which signal fail and an analysis of the
detailed timing path.

I guess I've been assuming I was talking to someone who knew a bit about
this. I understand some of the misunderstandings in our conversation now.

Rick

rickman

unread,

Oct 30, 2012, 7:00:41 PM10/30/12

to

On 10/29/2012 7:32 PM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message
> news:k6k0nn$n8o$1...@dont-email.me...
>> On 10/28/2012 5:01 AM, Rod Pemberton wrote:
>>> "rickman"<gnu...@gmail.com> wrote in message
>>> news:k6hdga$4je$1...@dont-email.me...
>>>> On 10/26/2012 8:57 PM, Rod Pemberton wrote:
>>>>> "rickman"<gnu...@gmail.com> wrote in message
>>>>> news:k6cg75$o2d$1...@dont-email.me...
> ...
>
>> No, you didn't explain why you need to use source.
>
> It's just an intermediate stage. It may remain in part, or may not. I'd
> like to keep much as source. It's easy to change. But, I'm clearly going
> to get rid of words that aren't doing much except slow the interpreter down.
> If I keep the source and achieve my transformation goals, then I will only
> need to compile a very small C program to implement or execute Forth.
> Perhaps, maybe, it'll even be suitable for "sandboxing".
>
>> You gave goals, not rationals for what you did.
>
> I don't understand what you mean by that ...
>
> Isn't eliminating dependence on C a rationale too?

It isn't an explanation of the original question. Why did the chicken
cross the road, to get to the other side... There are many ways to
"eliminating dependence on C", the question is why did you choose to use
Forth source in your design rather than compile the Forth?

Ok.

I spent a little time today looking at using a 16 bit VLIW instruction
set today against two snippets of code I wrote just to test instruction
efficiency. One is a memory to memory copy and the other is a memory
fill which I think I saw Chuck use for his CPU once.

Surprisingly enough, I got zero utility of the parallel operation
capability in the first example. I also got virtually no improvement in
the number of instructions which means it used TWICE the code space if
you don't count the literal setup code which was the same!

I'm going to look at some more code, but moving data around in memory is
a big one for the apps I am designing this for.

Rick

Rod Pemberton

unread,

Oct 30, 2012, 7:14:06 PM10/30/12

to

"Mark Wills" <markrob...@yahoo.co.uk> wrote in message

news:40bc1480-4d21-4390...@y6g2000vbb.googlegroups.com...

> I disagree.

Ok.

> The clock cycle is the smallest (atomic) unit of time.

No, it depends on the design.

> If he was using internal clock doublers and the like
> then he would have said so!

You're assuming a clock is needed to drive each stage of a sequential
sequence operations for an instruction. That's not the case. Sometimes,
the internal circuits of a microprocessor are self-clocking or
self-propagating. Once one step is completed, the next starts. In such
designs, a clock is only required at the start or end of the instruction
sequence to ensure a stable, or latched state. Of course, this is older
microprocessor theory, not FPGA microprocessor theory. So, it's entirely
possible microprocessor using an FGPA may not be able to be designed that
way.

"Rather than totally removing the clock signal, some CPU designs allow
certain portions of the device to be asynchronous, such as using
asynchronous ALUs in conjunction with superscalar pipelining to achieve some
arithmetic performance gains."
http://en.wikipedia.org/wiki/Central_processing_unit#Clock_rate

"Unlike a conventional processor, a clockless processor (asynchronous CPU)
has no central clock to coordinate the progress of data through the
pipeline."
http://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchronous_CPU

Rod Pemberton

Rod Pemberton

unread,

Oct 31, 2012, 1:23:12 AM10/31/12

to

"rickman" <gnu...@gmail.com> wrote in message

news:k6pm77$huv$1...@dont-email.me...

[snip, re-org]

> I spent a little time today looking at using a 16 bit VLIW instruction
> set today against two snippets of code I wrote just to test instruction
> efficiency. One is a memory to memory copy and the other is a memory
> fill which I think I saw Chuck use for his CPU once.
>
> Surprisingly enough, I got zero utility of the parallel operation
> capability in the first example. I also got virtually no improvement in
> the number of instructions which means it used TWICE the code space if
> you don't count the literal setup code which was the same!
>
> I'm going to look at some more code, but moving data around in
> memory is a big one for the apps I am designing this for.

The old Amiga's had a few coprocessors. I still find them to be novel some
26 or so years later.

http://en.wikipedia.org/wiki/Original_Amiga_chipset

Does your design include some type of DMA? Or, a "blitter"?

IIRC, you said your MISC design had no registers only two stacks for Forth.
So, what stack arguments are you using to specify a memory move? e.g.,
from, to, data size, length? from, end, to? Or, is this a part of the
instruction?

> [...] the question is why did you choose to use

> Forth source in your design rather than compile the Forth?

Compiled Forth is what I started with. The system was developed for many
years using just that. It's useable. Compiled high-level Forth in C is
very close to Forth in assembly. However, after the ability to parse Forth
source was implemented, I've found it's also a bit more difficult to work
with compiled Forth. It has to be compiled. Forth source doesn't.

Generally, the basic Forth definitions for words are the same, or
sometimes trivially different between the two forms. E.g., a LIT
equivalent is used with numbers in the C code for compiled Forth, but
numbers are used directly in the Forth source. A LIT is compiled into
definitions by the text/outer interpreter when calling NUMBER or somesuch
Forth word(s) ... Some code also can't be made to work exactly the same due
to C issues, but they are generally very close, e.g., except branches, DOES>
etc. Those differences are sometimes noticeable in the compiled Forth, but
generally not easy to detect casually.

I want a true Forth system, not a Forth clone, or near Forth environment.

Early on, there was almost no Forth. I backfilled needed operations using C
code and lots of it. Eventually, I reworked it to be mostly compiled Forth,
but still with quite some C. After the Forth parsing words were
implemented, I could migrate most compiled high-level Forth definitions to
Forth source. Typically, they are the exact same Forth word definitions
either way. Sometimes there are trivial differences, e.g., like LIT. With
a few exceptions, the compiled definitions are the same binary-wise too.
They basically had to be. I was migrating them from compiled Forth to
source Forth one at a time.

Currently, the time for parsing of the Forth dictionary is insignificant and
unnoticeable. So, I don't "see" why compiled Forth would be preferred.

Some of the stuff done over the years was very complicated. It was really
only needed to progress the program to whatever next stage of development I
decided I was working towards. E.g., at one point, I needed to be able to
access variables, e.g., dictionary pointer (DP), in both C and Forth. Most
were simple to do, but synchronising DP was very trying. Eventually, I
eliminated the need for the DP on the C side. But, that required a major
rewrite to eliminate the non-Forth C code that was accessing DP. Converting
the non-Forth C code to Forth was just a total pain. I knew what I was
doing on the C side, but had no idea how to implement the equivalent code
in Forth. I ran into numerous other "bottlenecks" along the way too.

Rod Pemberton