Forth Implemented on a MISC Processor

rickman

Apr 18, 2013, 6:24:32 PM

I'm not sure MISC is very well defined, but I am using the term to apply
to any processor designed to be minimal in its design and
implementation. This would not exclude any architecture or model and
certainly does not require programming in Forth, although that is one of
my interests.

I have worked with stack based CPUs since around 2002 when the work of
some others inspired me to design my own instruction set and
architecture. I have worked on it from time to time and lately have
been back at it. I realized that stack manipulation ops are needed
fairly often and take significant time and code space. So I looked at a
register based design and found that without making the hardware more
complex it was capable of using about a third fewer instructions to
implement a sample application. In fact, the hardware might be simpler.
I found a reasonable way to use the dual port block RAM for registers
and actually end up with register windows for multiple processes such as
interrupts.

Carrying that idea further, but back toward the stack approach, I am
thinking of the stack as more like a window into the memory with a TOS
pointer that can be moved as instructions are executed. So dual
instructions imply the TOS, but specify the other operand allowing
indexing into the stack. This is not well fleshed out as yet, but
initially it looks promising providing some real code space advantages.

One problem with using block RAM as the register file is that everything
has to happen on a clock edge. So when registers are read and written
in one instruction, it takes two clock cycles. If memory is involved,
it takes more clock edges to access the memory block RAM. I'm still
considering the implications of that, but I'm sure it will end up with
multiple clocks per instruction, just not a variable number. Likely
three or four clocks defining "phases" of the instruction. It will be a
bit more complex in some ways than the pure stack design, but should be
faster and may be smaller.

The part I am ill-equipped to handle is writing a Forth compiler to
generate good code for this design. Manually coding using the primitive
instructions would not be so hard, but designing a Forth optimizing
compiler might be a stretch, certainly for me.

--

Rick

Brad Eckert

Apr 19, 2013, 3:28:34 PM

On Thursday, April 18, 2013 3:24:32 PM UTC-7, rickman wrote:
>
> The part I am ill-equipped to handle is writing a Forth compiler to
> generate good code for this design. Manually coding using the primitive
> instructions would not be so hard, but designing a Forth optimizing
> compiler might be a stretch, certainly for me.
>

If your CPU needs a good optimizing compiler, you might consider teaming up with MPE Ltd. It costs them nothing to lend you the tools and let you have a crack at writing a VFX back end.

rickman

Apr 19, 2013, 3:55:32 PM

I think I had mentioned this to Stephen once and got a polite no. He
seems to be a nice guy, but he is in business to make money and if I
can't show a clear profit path, why would I get support?

--

Rick

Jecel

Apr 19, 2013, 6:24:13 PM

Rick,

given the directions your designs are taking you, I wonder if you are familiar with the macroinstructions of Lisp machines like the CADR? They normally were about 16 bits wide and combined stack stuff with interesting operand addressing. The very complex call/returns wouldn't be useful for a Forth machine, but the rest might be inspirational. The Symbolics 3600 moved a bit away from this and the RISC-like K-Machine from LMI was even more different, but they still made use of some of these ideas.

My own Smalltalk computer designs have a stack cache. You can access the 32 elements at the top of the stack by adding the register number to the TOS pointer. Note that this is just a five bit adder, which is very fast. The carry out bit selects between one of two 32 word blocks in the cache and the five bit result from the adder selects a word within the chosen block. When the TOS pointer moves across a block boundary, the two block selectors are updated by following a linked list. This allows stacks from different threads to live in the same cache. Sometimes updating the block selectors means transferring 32 words to or from memory.
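
A rough Python sketch of the address calculation described above, for illustration only. The function name, the pair of block selectors passed in as a tuple, and the choice to add register numbers upward from the TOS pointer are assumptions rather than details of the actual design. The linked-list update of the selectors is left out.

```python
# Hypothetical model of the stack cache addressing; all names are illustrative.

def stack_cache_address(tos, reg, block_selectors):
    """Map register number `reg` (0..31) to a (block, word) pair.

    tos             -- five-bit TOS pointer within the current block
    reg             -- five-bit register number from the instruction
    block_selectors -- (current_block, next_block), assumed to be kept up
                       to date elsewhere when TOS crosses a block boundary
    """
    total = (tos & 0x1F) + (reg & 0x1F)  # the five-bit adder
    carry = total >> 5                   # carry out: 0 or 1
    word = total & 0x1F                  # five-bit result selects the word
    return block_selectors[carry], word


if __name__ == "__main__":
    # TOS near the end of a block: register 3 spills into the second block.
    print(stack_cache_address(30, 3, (7, 12)))
    # TOS near the start of a block: everything stays in the first block.
    print(stack_cache_address(4, 3, (7, 12)))
```

Only the carry bit decides which of the two 32 word blocks is used, so no comparison against a stack limit is needed on each access.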

I also have separate return stacks that share the same cache, but for those you can only access the top element.

-- Jecel

Paul Rubin

Apr 20, 2013, 12:37:44 AM

rickman <gnu...@gmail.com> writes:
> I'm still considering the implications of that, but I'm sure it will
> end up with multiple clocks per instruction, just not a variable
> number. Likely three or four clocks defining "phases" of the
> instruction. It will be a bit more complex in some ways than the pure
> stack design, but should be faster and may be smaller.

Have you read Koopman's stuff about stack hardware, and looked at
Bernd's b16 design?

I thought one of the defining characteristics of MISC was that by giving
up any access to the interior of the stack, you can do most operations
in a single cycle with no pipelining. You sometimes use extra
instructions juggling the stack, but you make up for that in higher ipc,
higher clock frequencies, and less chip area (so more parallelism
through multiple cores, if your problem can use that).

Looking at the F18A die photo though, it appears dominated by memory
arrays: stacks, ram, and rom. The ALU is a tiny sliver in the middle.
Therefore I get the idea that by going to 6 bit instruction and possibly
doubling the ALU's size, the cpu could have been much more powerful at
relatively little cost in silicon. I wonder if experienced GA144
programmers ever get over the pain of seeing 4-7 instruction slots
burned every time they use the constant "1". But, supposedly Chuck and
company did a fair amount of simulation of various options before going
forward with what they have, so maybe they know something I don't.

> The part I am ill-equipped to handle is writing a Forth compiler to
> generate good code for this design. Manually coding using the
> primitive instructions would not be so hard, but designing a Forth
> optimizing compiler might be a stretch, certainly for me.

I'd expect the user program to map just about directly to machine
instructions, similar to Chuck's chips. Arrayforth seems like
more of an assembler than a compiler.

rickman

Apr 20, 2013, 1:02:09 AM

On 4/19/2013 6:24 PM, Jecel wrote:
> Rick,
>
> given the directions your designs are taking you, I wonder if you are familiar with the macroinstructions of Lisp machines like the CADR? They normally were about 16 bits wide and combined stack stuff with interesting operand addressing. The very complex call/returns wouldn't be useful for a Forth machine, but the rest might be inspirational. The Symbolics 3600 moved a bit away from this and the RISC-like K-Machine from LMI was even more different, but they still made use of some of these ideas.

No, I'm not at all familiar with Lisp machines. I did take a language
class in school where we spent two weeks on Lisp so I know the basics of
the language. I remember that it was near the end of the two weeks when
the light bulb came on and I finally "got" what Lisp is about... not
sure if I still get it though, that was nearly 40 years ago.

> My own Smalltalk computer designs have a stack cache. You can access the 32 elements at the top of the stack by adding the register number to the TOS pointer. Note that this is just a five bit adder, which is very fast. The carry out bit selects between one of two 32 word blocks in the cache and the five bit result from the adder selects a word within the chosen block. When the TOS pointer moves across a block boundary, the two block selectors are updated by following a linked list. This allows stacks from different threads to live in the same cache. Sometimes updating the block selectors means transferring 32 words to or from memory.

That is pretty much what I am thinking. The hard part is figuring out
how to specify the offset. I'm trying to keep the instructions small
which may end up being a false economy. But I don't see any way to
figure out whether a small or a large instruction works best other than
to try both. In a 9 bit instruction it's hard to find more than three
bits for the offset and I might end up using just two. Letting the TOS
pointer move around as you work mitigates a small range for the offset.
In the code I am working on at the moment I start with six words of
input and move the TOS into the middle with parameters on both sides of
the pointer.

I'm not so worried about the adders being fast because the carry in
FPGAs is pretty fast compared to the rest of the circuitry. They are
also small, so they don't use a lot of resources. I'm thinking of using a
block RAM for the register memory which would give lots of room for
multiple stacks for multiple processes. I'm thinking interrupts here,
not multitasking.

> I also have separate return stacks that share the same cache, but for those you can only access the top element.

In my register version of the MISC one of the registers was a stack
pointer to be used for the return stack. I haven't given much thought to
the return stack on the hybrid version. I suppose it will be a separate
register in the design although I think I could hard code it as the
first or last location in the register RAM.

I'm also considering making the registers/stacks part of main memory.
The advantage is that the registers/stack of one process could be
accessed by another for comms. For example, the code I'm working on is
actually an interrupt process for a DDS generator. The six words of
parameters define the input controls and state. Adjustments to the
input controls need to be done somehow. The other designs first read
the input and state data from main memory. In this version the main
process just writes the input control words to the data stack for the
interrupt process and the interrupt process saves the time required to
read the input words and read/store the state words.

I just realized that I will have to have a dedicated file of stack
pointers for the various processes. Sounds like a bit of distributed
memory, 16x5. I can allocate 32 stack locations for each of 16
processes having each pointer wrap around if the end is reached. That
keeps the hardware very simple and 32 stack locations should be enough
for any app. 5 bits of base+offset and four bits of process number make
a 9 bit address.
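
A minimal Python sketch of that addressing scheme, assuming the process number forms the upper four bits of the address and the wrapped pointer plus offset forms the lower five. The class and method names are made up for illustration, and the bit ordering is an assumption.

```python
# Hypothetical model: 16 processes x 32 stack words in one 512 word RAM.

PROCESSES = 16     # four bits of process number
STACK_DEPTH = 32   # five bits of pointer + offset


class StackPointerFile:
    """The 16x5 file of TOS pointers, one per process."""

    def __init__(self):
        self.tos = [0] * PROCESSES

    def move(self, process, delta):
        # The pointer wraps around at the ends of the 32 word region.
        self.tos[process] = (self.tos[process] + delta) % STACK_DEPTH

    def address(self, process, offset=0):
        # Five-bit add with wraparound, then prepend the process number.
        slot = (self.tos[process] + offset) % STACK_DEPTH
        return (process << 5) | slot


if __name__ == "__main__":
    spf = StackPointerFile()
    spf.move(3, -1)              # pointer wraps from 0 to 31
    print(spf.tos[3])
    print(spf.address(3, 2))     # offset wraps back into process 3's region
```

Because both the pointer and the sum wrap modulo 32, an address can never leave its own process's region, which is what keeps the hardware simple.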

--

Rick

rickman

Apr 20, 2013, 4:20:53 PM

On 4/20/2013 12:37 AM, Paul Rubin wrote:
> rickman<gnu...@gmail.com> writes:
>> I'm still considering the implications of that, but I'm sure it will
>> end up with multiple clocks per instruction, just not a variable
>> number. Likely three or four clocks defining "phases" of the
>> instruction. It will be a bit more complex in some ways than the pure
>> stack design, but should be faster and may be smaller.
>
> Have you read Koopman's stuff about stack hardware, and looked at
> Bernd's b16 design?

I read Koopman's book a long time ago. This was the basis for the
fundamental instruction organization I have been working with. Because
of the high frequency of use of the call/jump/literal I optimized these
instructions. I think the uCore has done the same thing with a 1 bit op
code for the literal instruction.

I have looked at Bernd's b16. It uses 5 bit instructions which is
something I wanted to get away from since my design is optimized for
FPGAs, which have constraints on the memory width.

> I thought one of the defining characteristics of MISC was that by giving
> up any access to the interior of the stack, you can do most operations
> in a single cycle with no pipelining. You sometimes use extra
> instructions juggling the stack, but you make up for that in higher ipc,
> higher clock frequencies, and less chip area (so more parallelism
> through multiple cores, if your problem can use that).

I think that is Chuck's idea in his chip designs. I would not say MISC
requires any particular architectural feature. I think I understand the
theoretical trade offs. But Chuck and Bernd are designing ASICs while I
am designing in FPGAs. So there are different optimizations that work
best.

> Looking at the F18A die photo though, it appears dominated by memory
> arrays: stacks, ram, and rom. The ALU is a tiny sliver in the middle.
> Therefore I get the idea that by going to 6 bit instruction and possibly
> doubling the ALU's size, the cpu could have been much more powerful at
> relatively little cost in silicon.

I think you are presuming that a 32 or 36 bit processor is "more
powerful" in a meaningful way than the 18 bit processor Chuck designed.
18 bits is enough for many, many apps including high quality audio.
Given the 5 bit instruction size you might think he would have used a
20 bit word size, but he seems to be a real miser on transistors and
wanted to keep it as small as absolutely possible. Remember, a bigger
word size means a bigger RAM too...

I think it would be very easy to get a *lot* more power by going to
newer, finer pitch processes.

> I wonder if experienced GA144
> programmers ever get over the pain of seeing 4-7 instruction slots
> burned every time they use the constant "1". But, supposedly Chuck and
> company did a fair amount of simulation of various options before going
> forward with what they have, so maybe they know something I don't.

You need to truly understand the F18 if you want to judge the tradeoffs
in the design. To do that you need to read the programming tricks
manual... which hasn't been written yet as far as I know. The closest
thing to it is Chuck's web blog which has a lot of info if you want to
spend the time to distill it out. It takes some serious reading to
absorb all that is in there. I would recommend starting with his essay,
"The Map is Not the Territory". Then keep that in mind as you learn...
http://www.colorforth.com/map.htm

>> The part I am ill-equipped to handle is writing a Forth compiler to
>> generate good code for this design. Manually coding using the
>> primitive instructions would not be so hard, but designing a Forth
>> optimizing compiler might be a stretch, certainly for me.
>
> I'd expect the user program to map just about directly to machine
> instructions, similar to Chuck's chips. Arrayforth seems like
> more of an assembler than a compiler.

I don't follow what you are saying. Are you saying the coding should be
done in assembler, which is what Chuck does, or are you saying that the
machine should map closely to Forth?

I am saying that Forth can be compiled to this CPU architecture but it
is not so straightforward. I think it will require an optimizing
compiler to get good usage of the CPU. Of course, the user can code in
the assembly language and build word definitions just like any other
Forth. But if they want to write purely in Forth, it will require some
work on a compiler.

--

Rick

Paul Rubin

Apr 20, 2013, 5:21:51 PM

rickman <gnu...@gmail.com> writes:
>> Looking at the F18A die photo though... I get the idea that by going
>> to 6 bit instruction and possibly doubling the ALU's size, the cpu
>> could have been much more powerful at relatively little cost
>
> I think you are presuming that a 32 or 36 bit processor is "more
> powerful" in a meaningful way than the 18 bit processor

No, I meant the same 18 bit architecture, with three 6-bit instructions
per word instead of three 5-bit instructions plus three leftover bits
sometimes usable as an instruction slot. That would have allowed adding
a bunch more instructions and another register or two, which might
have resulted in tighter code.

> You need to truly understand the F18 if you want to judge the
> tradeoffs in the design. To do that you need to read the programming
> tricks manual... which hasn't been written yet as far as I know.

A fair amount of those tricks are documented, and it's not that
difficult to think up more. But, dealing with such tricks seems
anachronistic in this day and age.

rickman

Apr 20, 2013, 6:03:43 PM

On 4/20/2013 5:21 PM, Paul Rubin wrote:
> rickman<gnu...@gmail.com> writes:
>>> Looking at the F18A die photo though... I get the idea that by going
>>> to 6 bit instruction and possibly doubling the ALU's size, the cpu
>>> could have been much more powerful at relatively little cost
>>
>> I think you are presuming that a 32 or 36 bit processor is "more
>> powerful" in a meaningful way than the 18 bit processor
>
> No, I meant the same 18 bit architecture, with three 6-bit instructions
> per word instead of three 5-bit instructions plus three leftover bits
> sometimes usable as an instruction slot. That would have allowed adding
> a bunch more instructions and another register or two, which might
> have resulted in tighter code.

Oh, I see. It is hard to second guess that sort of decision and those are
*exactly* the decisions I am making now. Fortunately for me only 4, 8,
9, 16 or 18 bits makes sense in FPGAs. I can rule out 4 bits as just
too durn small. 8 bits is not bad, but most FPGAs provide 9 bit wide
memories. So you can bump that up to 16 or 18 if you want to go the
wide word route, like the J1.

Of course you could do other sizes by using more than one block RAM for
each block of program memory. Like three 2 bit RAMs for a 6 bit wide
memory. I still want to pursue the middle of the road 9 bit approach
for now.

One place where the machine is speed limited is in getting addresses for
jumps/calls. With an 8 or, better, 9 bit wide instruction, a number of
bits can be contributed to the address reducing the burden on the
instruction fetch. The 9 bit instruction has space for a 5 bit offset
for jumps giving a loop range of +15/-16 without another opcode fetch.
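
To make the +15/-16 range concrete, here is a small Python sketch of decoding a 5 bit signed offset from a 9 bit instruction. The field layout (offset in the low five bits) and the function names are assumptions for illustration only; the actual opcode encoding is not given here.

```python
# Hypothetical decode of a short relative jump; the layout is assumed.

def sign_extend5(field):
    """Interpret a five-bit field as a two's complement number (-16..+15)."""
    field &= 0x1F
    return field - 32 if field & 0x10 else field


def jump_target(pc, instruction):
    """Add the signed offset in the low five bits of `instruction` to `pc`."""
    return pc + sign_extend5(instruction & 0x1F)


if __name__ == "__main__":
    print(sign_extend5(0b01111))             # largest forward offset
    print(sign_extend5(0b10000))             # largest backward offset
    print(jump_target(100, 0b1010_11100))    # loop back four locations
```

A backward range of 16 covers most tight loops, which is where saving the extra fetch for the address pays off.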

>> You need to truly understand the F18 if you want to judge the
>> tradeoffs in the design. To do that you need to read the programming
>> tricks manual... which hasn't been written yet as far as I know.
>
> A fair amount of those tricks are documented, and it's not that
> difficult to think up more. But, dealing with such tricks seems
> anachronistic in this day and age.

Yes, but it can be amazing at how much more efficient the code can
become. At one point I was stumped, not being able to find the SWAP
instruction! Once I realized there wasn't one I read and reread the
code examples to see how this was handled. He lists a few tricks giving
a swap in three or so instructions. But then it hit me that the
particular section of code was actually *faster* without the swap!
Doing a subtraction in the conventional way requires an inversion and
adding a 1, making a subtract very slow relatively speaking. Invert the
other operand and invert the result and you get the subtraction without
the constant 1 and without the swap! Significantly faster! After
learning that I found that most of my subtractions actually were better
without the SWAP... Chuck Moore is no dummy for sure.
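
The identity behind that trick is easy to check. Here is a short Python sketch, assuming 18 bit two's complement words; the function names are made up for illustration and this models only the arithmetic, not the F18 instruction sequence.

```python
# Check that inverting one operand and the result gives a subtraction.

MASK = (1 << 18) - 1   # 18 bit word


def inv(x):
    """Bitwise inversion within an 18 bit word."""
    return ~x & MASK


def sub_conventional(a, b):
    # a - b as a + ~b + 1: needs the constant 1.
    return (a + inv(b) + 1) & MASK


def sub_inverted(a, b):
    # a - b as ~(~a + b): no constant needed.
    return inv((inv(a) + b) & MASK)


if __name__ == "__main__":
    for a, b in [(5, 7), (7, 5), (0, 1), (MASK, 0), (12345, 54321)]:
        assert sub_conventional(a, b) == sub_inverted(a, b) == (a - b) & MASK
    print(sub_inverted(5, 7))
```

It works because ~x equals -x - 1, so ~(~a + b) = -(b - a - 1) - 1 = a - b.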

--

Rick

Paul Rubin

Apr 21, 2013, 8:07:23 PM

rickman <gnu...@gmail.com> writes:
> Invert the other operand and invert the result and you get the
> subtraction without the constant 1 and without the swap!... Chuck
> Moore is no dummy for sure.

Chuck is a smart dude, that is for sure. But if you're an application
developer rather than a compiler writer, you'd have to be working in an
extremely specialized niche for it to be worth your time sweating over
micro-optimizations like that. Otherwise it just delays your getting
your product out the door, and diverts your attention from stuff that
needs doing.

It sounds to me like you're trying to design a relatively powerful cpu.
Meanwhile though it seems to me, bigger FPGA's are now appearing with
ARM cores as hard macros, similar to how they have had RAM blocks and
DSP slices for a while. I can see the attraction of tiny b16-like
softcores but by the time you're implementing something much fancier in
an fpga, the advantage compared to using a conventional core and
optimizing compiler seems harder to establish.

Now there is something called a PSOC, that sounds something like an FPGA
except instead of logic blocks, there are actually blocks of analog
circuitry with configurable routing between them, along with a
microprocessor core. I don't quite understand it but it sounds pretty
neat.

http://www.element14.com/psoc4

might even be an interesting Forth host.

Brad Eckert

Apr 22, 2013, 12:18:19 PM

On Friday, April 19, 2013 12:55:32 PM UTC-7, rickman wrote:
>
> I think I had mentioned this to Stephen once and got a polite no.
>

I forgot about the soft RTX2000, commissioned by MPE several years ago. Why would they want another stack machine, especially if it competes with one that cost them actual money?

Brad Eckert

Apr 22, 2013, 3:05:52 PM

On Sunday, April 21, 2013 5:07:23 PM UTC-7, Paul Rubin wrote:
>
> Meanwhile though it seems to me, bigger FPGA's are now appearing with
> ARM cores as hard macros, similar to how they have had RAM blocks and
> DSP slices for a while.

I suppose it was inevitable that the FPGA makers would recognize the pent up demand, with all the Microblaze and Nios usage, and respond in a sensible way.

That leaves MPE Ltd in a nice position, having the only high performance Forth cross compiler for ARM. OTOH, there are criteria for choosing a Forth other than performance.

You can still roll your own if high performance isn't a requirement, and if hardware does the heavy lifting maybe it isn't. Then you have to choose what to optimize. If you tune the hardware design to Forth, you get the simplicity of a straightforward compiler. Complexity always seems to have a hidden cost multiplier that almost everybody overlooks.

rickman

Apr 22, 2013, 4:05:57 PM

On 4/22/2013 12:18 PM, Brad Eckert wrote:
>
> I forgot about the soft RTX2000, commissioned by MPE several years ago. Why would they want another stack machine, especially if it competes with one that cost them actual money?

Why does any company design or support more than one processor?
Different applications have different requirements. That is why I
designed my own. At the time there were none out there that would fit
my FPGA and do what I wanted it to do. I bet the RTX2000 soft core is a
lot larger than 500 LUTs.

--

Rick

rickman

Apr 22, 2013, 4:19:54 PM

On 4/21/2013 8:07 PM, Paul Rubin wrote:
> rickman<gnu...@gmail.com> writes:
>> Invert the other operand and invert the result and you get the
>> subtraction without the constant 1 and without the swap!... Chuck
>> Moore is no dummy for sure.
>
> Chuck is a smart dude, that is for sure. But if you're an application
> developer rather than a compiler writer, you'd have to be working in an
> extremely specialized niche for it to be worth your time sweating over
> micro-optimizations like that. Otherwise it just delays your getting
> your product out the door, and diverts your attention from stuff that
> needs doing.

That is the sort of thinking that goes into ARM9 CPUs with half a MB on
chip. If you are trying to get a job done and that job is about writing
a piece of code to sample an ADC and filter the result, it is not at all
uncommon to squeeze the code to run fast and not bloat the core of the
application.

It only took me a little while to see the comparison between the
sea-of-CPUs on the GA144 and an FPGA. When I design HDL code, I
typically don't worry too much about the low level logic that will be
implemented... I worry *just enough*.

> It sounds to me like you're trying to design a relatively powerful cpu.
> Meanwhile though it seems to me, bigger FPGA's are now appearing with
> ARM cores as hard macros, similar to how they have had RAM blocks and
> DSP slices for a while. I can see the attraction of tiny b16-like
> softcores but by the time you're implementing something much fancier in
> an fpga, the advantage compared to using a conventional core and
> optimizing compiler seems harder to establish.

I don't know what "powerful" means. Fast, yes. The ZPU is a stack
processor designed to be flexible in the speed/size tradeoff while
running C code. But their small implementation is some 10 times slower
than my old processor design. That was ok with them because it was very
small, some 300 LUTs at an absolute minimum. I am shooting for a better
speed/size tradeoff with a goal of 100 MIPS operation. Not sure I'll
get there exactly. The comparisons will be interesting.

> Now there is something called a PSOC, that sounds something like an FPGA
> except instead of logic blocks, there are actually blocks of analog
> circuitry with configurable routing between them, along with a
> microprocessor core. I don't quite understand it but it sounds pretty
> neat.
>
> http://www.element14.com/psoc4
>
> might even be an interesting Forth host.

The original 8 bit PSOC used their own CPU; it was slow and the
analog had minimal performance. When the CEO asked about improving
both the digital and analog he got a lot of reasons (excuses) why it was
hard to do. He gave them the goal of designing a new family of PSOC
with "no excuses".

I don't think they quite got a family with "no excuses" but the new
devices are much better with both digital and analog configurable blocks
with either an 8051 (the crappy CPU that won't die) or an ARM Cortex M3
processor, IIRC. By the time it got to market (it was some two years
late) it was not as impressive as it originally seemed.

HDL support is minimal and everything has to be done through the GUI.
The analog still has significant limitations and the digital is
configured through pre-designed modules. Of course it has been a while
since I took a look. Maybe some of this has improved.

Unless I need some pretty good analog (which means discrete chips, not a
PSOC) I am happy with FPGAs. I can do lots of analog stuff in digital
without the analog stuff anyway.

--

Rick

Brad Eckert

Apr 22, 2013, 5:08:14 PM

On Monday, April 22, 2013 1:05:57 PM UTC-7, rickman wrote:
>
> Different applications have different requirements. That is why I
> designed my own. At the time there were none out there that would fit
> my FPGA and do what I wanted it to do. I bet the RTX2000 soft core is a
> lot larger than 500 LUTs.
>

Well, MPE is a software company that already had good Forth and C compilers for the RTX. In theory, RTX fans orphaned by Harris may want to use it to support legacy code or for other reasons.

You, on the other hand, are starting with no pre-existing software tools and no customer base. That's some nice freedom. What I would do is leave gaps in the instruction set so you can add application-specific instructions as needs arise, and keep it simple.

Your stacks are in LUT RAM, which is easy because those are async read. If you implemented an RTX, you would use block RAM for the stacks, and those are sync read, which means you have to pipeline things in order to read data a cycle before the instruction that needs it executes. Either that or fake async read by using clocking tricks.

I've wondered if you could just lump data and stacks together in one block RAM and encode the ISA such that these pre-reads are trivial to decode.

BTW, if you want a cross compiler, I started one a while back but can give it to you to play with. It's much cleaner than the other cross compilers that I've written in the last 20 years.

rickman

Apr 22, 2013, 5:52:36 PM

On 4/22/2013 5:08 PM, Brad Eckert wrote:
> On Monday, April 22, 2013 1:05:57 PM UTC-7, rickman wrote:
>>
>> Different applications have different requirements. That is why I
>> designed my own. At the time there were none out there that would fit
>> my FPGA and do what I wanted it to do. I bet the RTX2000 soft core is a
>> lot larger than 500 LUTs.
>>
> Well, MPE is a software company that already had good Forth and C compilers for the RTX. In theory, RTX fans orphaned by Harris may want to use it to support legacy code or for other reasons.
>
> You, on the other hand, are starting with no pre-existing software tools and no customer base. That's some nice freedom. What I would do is leave gaps in the instruction set so you can add application-specific instructions as needs arise, and keep it simple.

If only it *were* that simple. The 9 bit version of the stack oriented
design has lots of spare opcodes. The register and hybrid designs I'm
looking at use a lot more opcode space so 9 bits get used up very fast.
I haven't fleshed out the hybrid opcodes yet; I may be trying to
design a 12 bit opcode in 9 bits!

> Your stacks are in LUT RAM, which is easy because those are async read. If you implemented an RTX, you would use block RAM for the stacks, and those are sync read, which means you have to pipeline things in order to read data a cycle before the instruction that needs it executes. Either that or fake async read by using clocking tricks.

LUT RAM has that advantage, but with stacks, the sync nature doesn't get
in the way since the read of the stack is done at the end of the
instruction (just like the write) and the TOS pointer is defined, so the
stack RAM output is always defined at the start of the next instruction.
It is the data RAM that is awkward. A write is fine happening at the
end of the instruction, but a read has to happen in the middle. I've
considered a couple of approaches. One is to use the dual port and
clock the reads on the opposite edge so they happen in the middle of the
cycle. The other is to use the address at the *input* to the address
register (TopOfReturnStack) as the address to the block RAM and always
do the read. If you don't need it, so what? A little energy
inefficient, but who's counting Joules here?

> I've wondered if you could just lump data and stacks together in one block RAM and encode the ISA such that these pre-reads are trivial to decode.

That is what I am looking at for the register and hybrid approaches.
This does get into timing issues. For an indirect access you need to
read the address from a register, read memory and then write a
register... three clock edges! Even playing the opposite edge game this
is two clock cycles.

I had been using a single block RAM for the two stacks in the stack
design, one port each. I hadn't thought about combining this with the
main memory, but that is a thought.

> BTW, if you want a cross compiler, I started one a while back but can give it to you to play with. It's much cleaner than the other cross compilers that I've written in the last 20 years.

Sold! What does it run under? My experience is with Win32Forth.
rickman dot 2013 at arius dot com. Shoot me an email or something.

--

Rick