Whatever happened to Eric LaForest?

165 views
Skip to first unread message

Mux

unread,
Dec 4, 2009, 12:29:56 AM12/4/09
to
Published a thesis a couple of years back on stack computers and what
not but has since (?) disappeared. Anyone know if he's still involved
with stack computers / forth?

-Mux

Jecel

unread,
Dec 4, 2009, 11:35:36 AM12/4/09
to

He mentioned to me this April that he had been pushing the research
described in http://is.uwaterloo.ca/charles_laforest.html further and
got a ~200MHz Forth CPU running on an Altera Cyclone II FPGA, using
about 2 cycles per instruction on average.

He also described an interesting register/stack hybrid idea which
would be a MIPS-like processor where the first register would actually
be a data stack and the last register would be the PC counter and
return stack. He wasn't working on it anymore, however, and I last
heard from him in June.

-- Jecel

Eric

unread,
Jan 9, 2010, 8:39:48 PM1/9/10
to

Hi Mux,

Yep, Jecel's right. I'm not working on stack computers and Forth at
the moment since my PhD is not about those.
(My Master's was about multi-ported memories on FPGAs. I'll be
presenting this at the ACM FPGA 2010 conference in Monterey, this
February.)
I suppose I should take the time someday and post the remaining code
and ideas for others to see.

In a nutshell:

Got a 32-bit F21-ish CPU running on a Cyclone II FPGA (DE2 Terasic
board). Discovered that instruction decoding is almost free on FPGA-
based soft-processors, thus one can decode pairs of instructions and
execute them "in parallel" by correctly controlling the stacks and ALU
(no extra control lines). Net result is a sort of hardware peephole
optimizer. Gets much of the benefit of the older horizontally
microcoded machines Koopman describes, but without the need for a
smart compiler. In fact, this 'instruction folding' implements on
stack machines the benefits that pipelining, data forwarding, and
random register access do on conventional MIPS-like machines.

I estimate that average Forth code would speed up 45% with this
technique. As for size and speed: 450 lines of high-level, portable
Verilog code, 1074 LEs with 8192 bits M4K RAM blocks, tested at 216
Mhz. CPI of about 1.9, but that's without the aforementioned
optimizations. On a Stratix II FPGA, I estimate this CPU should run at
420MHz and take 668 LEs area. The limiting factor is the adder carry-
chain delay, hence why some instructions (any that increments PC, R,
A, or TOP) take 2 cycles. Others, like DUP or >R, take 1 cycle.

This is all nice and good, but still only if you program it in Forth.
Compiling C to such a machine has two problems: stack scheduling and
addressing mode translation:

Stack scheduling is the stack analog of register allocation. There is
good existing work on the topic, and my approach begins with
traditional register allocation and then does this:
- Consider all registers in memory
- Map their lifetimes, load onto stack on first use
- Keep copies on the stack if still alive
- Load/stores become permutations/copies
- Must be done globally for good reuse
These simple permutations (OVER. DUP, DROP, etc...) take only a single
cycle and can also be folded into other instructions. This seems (on
paper) to work very well, but it's not tested in actual software yet.

This leave the big problem: addressing mode translation. MIPS-like
systems can generate efficient indirect and offset addressing modes
for free, while F21-like stack machines have only indirect, and
indirect-postincremented addressing modes (@ and @A+ for example)
using special registers (this is the need for index registers I
mention in my thesis). Translating between these mode is, as far as I
know, an open problem. The same one exists when trying to compile C
code to the odd addressing modes of DSP processors. If the compiler
could figure out which registers are used for addressing in loops, and
move them to A or R at the right time as part of stack scheduling,
then it might be possible to machine-generate the same code that one
constructs "by hand" when writing Forth code, which is significantly
different even when implementing the same algorithm.

The register/stack hybrid idea is to take the first two registers and
use them as TOP and NEXT. Writing to TOP pushes TOP to NEXT, and NEXT
to a stack (there's a few quirks around this to get consistent
behaviour, but nothing much). Similarly, the last register, usually
used as a program counter by convention, would be equivalent to R with
a stack under it. The other registers behave as usual. This change
makes it possible to efficiently synthesize stack instructions on a
MIPS-like system, granting all the efficiencies of stack code we love
and enjoy, but without breaking support for conventional code. In
other words, this would allow using a conventional computer
architecture, but you could run Forth code on it pretty much natively.
It would also allow some new optimizations for C compilers, mostly
having to do with function calls and system calls. Furthermore, as a
nice side-effect, the stacks would allow you to make do with fewer
registers, say 8, which means you could now encode the usual R/I/J-
type MIPS instructions into only 16 bits, granting better code density
while still optionally supporting full-width 32-bit instructions, and
decouples the IF stage from the instruction cache.

One last idea I toyed with in a graduate OS design course, was to add
support for system calls and memory range protection to a stack
machine. Needs 3 extra registers and one mode bit, uses the stacks for
system calls and traps, and should grant full machine virtualization
support for simple stack machines. I wrote a simulator and the bare
kernel of an OS (in Flight, the same Forth dialect used in my thesis),
and implemented a few simple system calls and exception handlers. Long
story short: it works, and can switch context in 80 to 100 cycles,
about a 3 to 6 times speedup over Linux on x86. It could potentially
be faster, depending on what's sufficient context to save for a
context switch.

Hope this answers your questions. Feel free to ask more. :)

Cheers,

Eric

rickman

unread,
Jan 10, 2010, 11:11:00 AM1/10/10
to
Wow, that was a lot of info. I have been dabbling with a Forth CPU in
FPGA for a number of years. I started by reading what I could find
which included Koopman's book on stack machines. I studied several
other stack CPU architectures incuding Bernd Paysan's B16 and several
of Chuck Moore's CPUs, especially the ones using separate registers.
I have to say, during all this I managed to entirely miss your work.

My goal was to produce a CPU that would facilitate programming in
Forth, run code quickly and use a minimum amount of resources in the
FPGA. To that last goal I tried very small instructions, initially 4
bit, optimizing the instruction mix using the list Koopman provided
for instruction frequency, both static and dynamic. I decided that 4
bits was just no enough for an efficient machine requiring many more
instructions. This is essentially a trade off between processing
speed and instruction density. I pushed the instruction size up to 8
bits which resulted in many more possible opcodes than needed, then
realizing that most FPGA memory will give you a 9th bit for free, I
added a single bit as a return instruction. So the current
instruction set has more than needed, but does not fully exploit the
parallelism possible...

I guess I didn't describe the architecture. I liked Chuck's idea of
using dedicated address registers. Like Bernd, (although I am pretty
sure it was independent on my part) I decided to go with two stacks,
one for data and one for addresses. This is a step further than just
having a return stack. A memory operation uses one stack for address
and one for data allowing the stack to be a bit simpler having the
operations of memory push one, memory pop one, and the top register
data coming from the stack memory or another source. This helps the
CPU run all instructions in one clock cycle.

When looking at instruction encoding, I realized that 9 bits was
nearly enough to select address and data operations as separate fields
allowing parallel operations... but not quite. I have not looked hard
at the tradeoff between parallelism of the two stacks and instruction
memory usage. But so far, everything I have tried has shown that the
trend to minimization of instructions is a better choice than having a
wider opcode allowing more instructions, up to a point I guess. I
didn't like using 4 bits and the least I have seen in any other
implemented CPU was 5 bits packed into larger instruction memory
words.

One thing you posted that got my attention was when you said
"instruction decoding is almost free on FPGA-based soft-processors".
How is instruction decoding free? The instruction decode has been one
of the stumbling blocks in my efforts. If I could really get this for
"almost free", it would solve a lot of problems.

I would be very interested in learning more about your CPU design
ideas.

Rick

Anton Ertl

unread,
Jan 11, 2010, 11:17:41 AM1/11/10
to
Eric <eric.l...@gmail.com> writes:
>Stack scheduling is the stack analog of register allocation. There is
>good existing work on the topic,

The only work I am aware of is on stack scheduling for basic blocks,
and I am not very happy with what's possible there: If you have a
JVM-like architecture (i.e., single-instruction load/stores from/to
local variables), stack scheduling hardly buys anything unless you
assume that stack manipulation instructions are much cheaper than
load/store-local instructions.

OTOH, if you keep values on the stack across basic block boundaries,
there are savings possible even for such an archtiecture if a locals
access is just as cheap as a stack manipulation instruction. But I
have no good idea for how to find which values to keep on the stack
where and in which order. Hmm, maybe something along the lines of
coagulating register allocation. But even then I don't know how to do
the stuff within a basic block; the work on basic-block stack
scheduling may give some guidance, but I don't tink it fits perfectly.

> and my approach begins with
>traditional register allocation and then does this:
>- Consider all registers in memory
>- Map their lifetimes, load onto stack on first use
>- Keep copies on the stack if still alive
>- Load/stores become permutations/copies
>- Must be done globally for good reuse

This sounds to me like you will often get more items on the stack than
can be easily handled by our shallow stack manipulation primitives
(such as OVER). You can keep maybe two long-lived variables on the
stack, sometimes more, sometimes less depending on what you have to do
for the short-lived stuff.

>This leave the big problem: addressing mode translation. MIPS-like
>systems can generate efficient indirect and offset addressing modes
>for free, while F21-like stack machines have only indirect, and
>indirect-postincremented addressing modes (@ and @A+ for example)
>using special registers (this is the need for index registers I
>mention in my thesis). Translating between these mode is, as far as I
>know, an open problem.

Not if you don't demand efficiency. And even if you demand
efficiency, there's a bit of work on that for DSP architectures; it's
called "storage assignment" optimization and it's about arranging
variables in an order that adjacent accesses often access adjacent
variables. The classical paper on that has Stan Liao as first author.

>If the compiler
>could figure out which registers are used for addressing in loops, and
>move them to A or R at the right time

That's classical register allocation (although the number of registers
is quite meagre).

But is it worth the trouble to implement a pretty complex compiler in
order to compile C efficiently to a Forth machine? If your
requirements include executing C code efficiently, adding
architectural features to make that easy with a straight-forward
compiler seems to be a better approach to me. Maybe something like
the hybrid you propose.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2009: http://www.euroforth.org/ef09/

Eric

unread,
Jan 19, 2010, 9:31:33 AM1/19/10
to
On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote:

> My goal was to produce a CPU that would facilitate programming in
> Forth, run code quickly and use a minimum amount of resources in the
> FPGA.  

Isn't that all our goal? ;)
For a really minimal Forth chip, check out (if you can find it online)
Myron Plichota's Steamer16 (and descendants). It uses 3-bit
instructions and a 3-deep hardware stack, but with a software stack,
manages to run Forth on a very little hardware at the price of more
memory traffic. Still, he reached 20 MIPS with it.

> To that last goal I tried very small instructions, initially 4
> bit, optimizing the instruction mix using the list Koopman provided
> for instruction frequency, both static and dynamic.  I decided that 4
> bits was just no enough for an efficient machine requiring many more
> instructions.  This is essentially a trade off between processing
> speed and instruction density.  I pushed the instruction size up to 8
> bits which resulted in many more possible opcodes than needed, then
> realizing that most FPGA memory will give you a 9th bit for free, I
> added a single bit as a return instruction.  So the current
> instruction set has more than needed, but does not fully exploit the
> parallelism possible...

While I consider to be open the question of whether packing 4, 5, or 8-
bit instructions gives the best density, I do think that it's a bit
moot.

All you end up doing with smaller instructions is synthesizing the
larger ones out of several instructions. I think the best approach is
to use a small instruction set (4 or 5 bits, *maybe* 6 for very
special-purpose stuff that couldn't be made as memory-mapped
functional modules) and decode *pairs* or *triplets* together.

Often, the effect on the stacks, registers, and memory of two
consecutive instructions can be combined together without having to
add any control lines. It's just more clever instruction decoding. I
outline that idea in the Further Work section of my thesis. Some later
work I've done (but not published) supports the idea further.

Lastly, if you really care about packing instructions densely, also
have a look at the instruction stack I describe in there.

<snip>


> This helps the CPU run all instructions in one clock cycle.

Hmm. I've found that this is a fallacy. It's not important. What is
important is keeping the critical path of instructions to a minimum.

For example, I found that the clock speed of my last stack CPU was
limited by the carry-delay of the adder circuitry. This meant that
even an instruction that didn't use an adder, like DUP, would suffer
that speed penalty.

The solution was to break instructions that use the adder into two
cycles, using the first cycle to let the adder output settle, then the
second to store that output. Instructions that didn't use the adder
still took one cycle. The net result is a faster clock cycle, and a
net performance increase.

This is what Chuck does in his designs when some NOPs must be used
after a +. I just explicitly enforce the delay, while he saves
hardware and lets the programmer deal with it.

<snip>


> I didn't like using 4 bits and the least I have seen in any other
> implemented CPU was 5 bits packed into larger instruction memory
> words.

I have to admit, 5 bits does seem to be a sweet spot. Enough
instructions to support most software well, with a few opcodes left
for special uses, like context switching or somesuch...

> One thing you posted that got my attention was when you said
> "instruction decoding is almost free on FPGA-based soft-processors".
> How is instruction decoding free?  The instruction decode has been one
> of the stumbling blocks in my efforts.  If I could really get this for
> "almost free", it would solve a lot of problems.

"Almost free" here means that it contributes little delay. One of my
colleagues here studied how to pipeline MIPS-like soft-processors, and
it ends up that simply folding decoding into the instruction fetching
stage was always a win. There was no need to give it a pipeline stage
of its own.

Similarly, I found that the instruction decoding on my stack CPU was
never anywhere near the critical path in terms of how long it took, so
that means that there is room to do more complex decoding of pairs/
triplets of instructions.

This low delay may be due to the fact that decoding a 4 or 5-bit
instruction requires a single LUT for each control line. That's not
much circuitry in an FPGA.

>I would be very interested in learning more about your CPU design ideas.

Ask away! :)

Cheers,

Eric

rickman

unread,
Jan 19, 2010, 1:02:14 PM1/19/10
to
On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote:
> On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote:
>
> > My goal was to produce a CPU that would facilitate programming in
> > Forth, run code quickly and use a minimum amount of resources in the
> > FPGA.
>
> Isn't that all our goal? ;)

I see the smiley, but there are other things to optimize that are not
so important to me. Such as addressing a large memory or being able
to support C such as the ZPU. I found the ZPU to be very interesting,
but the crowd designing it is dedicated to C'ing it just one way...


> For a really minimal Forth chip, check out (if you can find it online)
> Myron Plichota's Steamer16 (and descendants). It uses 3-bit
> instructions and a 3-deep hardware stack, but with a software stack,
> manages to run Forth on a very little hardware at the price of more
> memory traffic. Still, he reached 20 MIPS with it.

I'm not trying to be absolutely "minimal" in terms of resources. Like
I said, I need a balance with an emphasis on minimizing instruction
memory. This is because I use smaller FPGAs in my designs. For
example, a current module has a Lattice part with six 9-kbit block
rams. This is also why I prefer to conserve that ninth bit if
practical.


> > To that last goal I tried very small instructions, initially 4
> > bit, optimizing the instruction mix using the list Koopman provided
> > for instruction frequency, both static and dynamic. I decided that 4
> > bits was just no enough for an efficient machine requiring many more
> > instructions. This is essentially a trade off between processing
> > speed and instruction density. I pushed the instruction size up to 8
> > bits which resulted in many more possible opcodes than needed, then
> > realizing that most FPGA memory will give you a 9th bit for free, I
> > added a single bit as a return instruction. So the current
> > instruction set has more than needed, but does not fully exploit the
> > parallelism possible...
>
> While I consider to be open the question of whether packing 4, 5, or 8-
> bit instructions gives the best density, I do think that it's a bit
> moot.

Many people talk about "packing" instructions into a larger word. I
have resisted that since it requires more logic and adds delay to the
path. When using FPGA memory, it can be configured in a number of
widths to match the instruction size without muxing (unpacking)... as
long as you match the instruction size to the supported memory
widths... Typically that is 1, 2, 4 and multiples of 9 up to 36
bits. 72 can be supported by using both ports on dual port memories.
So to optimize memory usage, I see 9 bits as the "sweet" spot. I've
been focusing on how to use that size instruction in an optimal way.

Another issue I have with packing instructions is the way that it uses
memory inefficiently. Often some number of slots must be treated as
nops and can't be used. That drives down the packing density. I also
prefer to build literals a piece at a time rather than use a full word
for each. An 8 bit literal can specify a relative jump of +127 to
-128 instructions which is a good range. Although this is not exactly
how my instructions use the literal for jumps.


> All you end up doing with smaller instructions is synthesizing the
> larger ones out of several instructions. I think the best approach is
> to use a small instruction set (4 or 5 bits, *maybe* 6 for very
> special-purpose stuff that couldn't be made as memory-mapped
> functional modules) and decode *pairs* or *triplets* together.

Yes, but it also slow the machine by needing two cycles instead of
one. Of course that is a trade off and must be considered in a
context.


> Often, the effect on the stacks, registers, and memory of two
> consecutive instructions can be combined together without having to
> add any control lines. It's just more clever instruction decoding. I
> outline that idea in the Further Work section of my thesis. Some later
> work I've done (but not published) supports the idea further.

I was seeing that in my two stack architecture, data and address. I
could almost squeeze two independent instruction fields into a 9 bit
instruction, but it was just a little too limiting. Using 5 or 6 bit
instructions might be optimal for allowing instructions to be executed
in parallel, but they don't map well into the available memory
widths.


> Lastly, if you really care about packing instructions densely, also
> have a look at the instruction stack I describe in there.

Yes, I'll look at that, but it will take a bit. I'll have to read a
fair portion of your thesis to get enough background to understand
what that section is saying.


> > This helps the CPU run all instructions in one clock cycle.
>
> Hmm. I've found that this is a fallacy. It's not important. What is
> important is keeping the critical path of instructions to a minimum.

I'm not sure what the fallacy is. I want to keep instructions to one
clock cycle not just for speed, but for simplicity of design and
analysis. My current design executes *all* instructions in one clock
cycle including subroutine calls, interrupts and returns. The
interrupt and corresponding return uses the data and address stacks to
store the return address and processor status in one clock cycle. So
an interrupt can enter the corresponding code in once clock cycle.
That's pretty good interrupt latency! The only fly in the single
cycle ointment is that the synchronous memories used in FPGAs can't
return a result in the same clock cycle. So I look ahead to the
address being written to the address stack on any given cycle and use
that to address memory for reads. The memory must perform the read on
every cycle since it does not know if the next instruction will be a
memory read or not. This is not an issue in the architecture, but
uses power which is wasteful in low power designs.


> For example, I found that the clock speed of my last stack CPU was
> limited by the carry-delay of the adder circuitry. This meant that
> even an instruction that didn't use an adder, like DUP, would suffer
> that speed penalty.

Yes, I have analyzed critical paths a great deal trying to speed up my
design. It is not the data path that limits my speed. It is the
stack addressing. That is one of the places that the instruction
decode must be done to control the logic. The result feeds status
bits for stack errors. I could omit these flags, but it seems to me
to be useful for debugging. As Jeff Fox might say, I can't count!
But the flags can be omitted in a non-debugging version of the CPU or
the flags could be pipelined with a delay of one instruction.


> The solution was to break instructions that use the adder into two
> cycles, using the first cycle to let the adder output settle, then the
> second to store that output. Instructions that didn't use the adder
> still took one cycle. The net result is a faster clock cycle, and a
> net performance increase.
>
> This is what Chuck does in his designs when some NOPs must be used
> after a +. I just explicitly enforce the delay, while he saves
> hardware and lets the programmer deal with it.

I think we are talking a different level of speed. I want to make a
lean, fast machine, but I am not trying to find the limit posed by the
transistor speeds as Chuck often does. Also, working in an FPGA is
very different from workin in an ASIC.


> > I didn't like using 4 bits and the least I have seen in any other
> > implemented CPU was 5 bits packed into larger instruction memory
> > words.
>
> I have to admit, 5 bits does seem to be a sweet spot. Enough
> instructions to support most software well, with a few opcodes left
> for special uses, like context switching or somesuch...
>
> > One thing you posted that got my attention was when you said
> > "instruction decoding is almost free on FPGA-based soft-processors".
> > How is instruction decoding free? The instruction decode has been one
> > of the stumbling blocks in my efforts. If I could really get this for
> > "almost free", it would solve a lot of problems.
>
> "Almost free" here means that it contributes little delay. One of my
> colleagues here studied how to pipeline MIPS-like soft-processors, and
> it ends up that simply folding decoding into the instruction fetching
> stage was always a win. There was no need to give it a pipeline stage
> of its own.
>
> Similarly, I found that the instruction decoding on my stack CPU was
> never anywhere near the critical path in terms of how long it took, so
> that means that there is room to do more complex decoding of pairs/
> triplets of instructions.


There a some places in my design where the decode feeds a control
point that follows other logic. So the speed of the decode does not
slow the design. But there have to be some control points that must
be decoded ahead of the logic, for example, selecting pop/push in the
stack operation. My stack is an addressed memory block, so I have to
select between the counter counting up/down or holding the current
address. Even when the delay of the decode is "free", the logic is
not. More complex decode uses more LUTs.


> This low delay may be due to the fact that decoding a 4 or 5-bit
> instruction requires a single LUT for each control line. That's not
> much circuitry in an FPGA.

But there is more to driving a control point than just the opcode. If
you are running multiple cycles in your instructions, you also have to
decode the state counter and for conditional instructions you need to
decode the controlling signal. I wanted to keep my decode to two LUT
levels, but that was not possible, even with a 4 bit instruction.
That's another reason I gave that up and went to 9 bits. My opcode
format is such that the first n bits can be decoded in pieces to
enable classes of instructions. It never needs more than one LUT to
enable a class, then the individual instructions get 4 opcode bits
plus the enable and any other control signals. I'm not saying this is
terribly efficient, but that is mainly because it is hard to tell the
tools using VHDL how to construct the decoder. At some point I may
work on optimizing this.


> >I would be very interested in learning more about your CPU design ideas.
>
> Ask away! :)

I'll need to read more. What I have looked at so far seems to show
your direction is very different from mine. I see you don't even have
interrupts. What is the point of that? It is the rare CPU that
doesn't use interrupts.

What about your unpublished results?

Rick

Eric

unread,
Jan 21, 2010, 10:04:40 AM1/21/10
to
On Jan 19, 1:02 pm, rickman <gnu...@gmail.com> wrote:
> On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote:
> > On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote:
> I'm not trying to be absolutely "minimal" in terms of resources.  Like
> I said, I need a balance with an emphasis on minimizing instruction
> memory.  This is because I use smaller FPGAs in my designs.  For
> example, a current module has a Lattice part with six 9-kbit block
> rams.  This is also why I prefer to conserve that ninth bit if
> practical.

Ah. So if I understand, you want a high instruction density since you
don't have much instruction memory available on-chip? Then yes, I can
see that 9th bit being valuable.

What's the application? That will definitely affect how many opcodes
you need.

> Many people talk about "packing" instructions into a larger word.  I
> have resisted that since it requires more logic and adds delay to the
> path.  

I don't know what your area constraints are, but I think the logic
required is minimal.
Also, from my experience, that decoding logic has never been in the
critical path, and so it does not add delay.

> Another issue I have with packing instructions is the way that it uses
> memory inefficiently.  Often some number of slots must be treated as
> nops and can't be used.  That drives down the packing density.  

Very true. IIRC, I observed an average of three 5-bit instructions per
32-bit words, out of a maximum of 6 (the last two bits were ignored).
However, if you add an instruction stack to the processor, then you
can use the instruction slots located after calls and conditional
jumps, which would normally be unreachable. I estimated that this
would bring the instruction density up to almost 6 (still can't use
slots after a RET), and the additional cost in area would be recouped
in saved RAM for any RAM larger than a couple hundred words IIRC
(again, the exact numbers are in my thesis).

One option I didn't consider because it would have complicated the
Forth runtime, was to logically extend the last two bits with zeros so
that some instructions could be fit in those two bits, so long as they
end in zeros.

> I also
> prefer to build literals a piece at a time rather than use a full word
> for each.  An 8 bit literal can specify a relative jump of +127 to
> -128 instructions which is a good range.  Although this is not exactly
> how my instructions use the literal for jumps.

Ah. That's a different approach than I took. I wanted simple, general
numbers good for computing and addressing. I'd like to know more about
your architecture: for example, how to you construct a CALL to code
that is beyond 128 instructions away?

> > All you end up doing with smaller instructions is synthesizing the
> > larger ones out of several instructions. I think the best approach is
> > to use a small instruction set (4 or 5 bits, *maybe* 6 for very
> > special-purpose stuff that couldn't be made as memory-mapped
> > functional modules) and decode *pairs* or *triplets* together.
>
> Yes, but it also slow the machine by needing two cycles instead of
> one.  Of course that is a trade off and must be considered in a
> context.

Not quite. What I was saying is that decoding instructions did not
affect the critical path, so decoding more bits made no difference.
What cost and extra cycle for some instructions was waiting for the
adder (when adding numbers or incrementing the PC) to settle.

When decoding instruction pairs, this often means that the effects of
the second instruction can be done 'for free' during the second cycle,
when the first instruction is storing the output of the adder.

> > > This helps the CPU run all instructions in one clock cycle.
>
> > Hmm. I've found that this is a fallacy. It's not important. What is
> > important is keeping the critical path of instructions to a minimum.
>
> I'm not sure what the fallacy is.  I want to keep instructions to one
> clock cycle not just for speed, but for simplicity of design and
> analysis.  

I used to think that too, but I eventually found out that having a
couple bits to encode a state machine to hold the current state of a
multi-cycle instruction added little conceptual complexity. Perhaps
its how I wrote the code, which left the details to the synthesizer,
which made it tractable. Here are some worst-case examples (CALL and
RET), written in Verilog with some of the macros included for
explanation:

`define SS0 state <= `STATE_0;
`define SS1 state <= `STATE_1;
`define SS2 state <= `STATE_2;
`define SS3 state <= `STATE_3;

`define DO_PCFETCH pc <= pc_next; mar <= pc_next; isr <= mem_in; `SS0
`define PUSH_RS(value) rs_top <= (value); `RS_BODY_IN <= (rs_top);
`INC_RS

{`CALL,`STATE_0}: begin
`SS1
end
{`CALL,`STATE_1}: begin
`PUSH_RS(pc_next)
pc <= mem_in;
mar <= mem_in;
`SS2
end
{`CALL,`STATE_2}: begin
`SS3
end
{`CALL,`STATE_3}: begin
`DO_PCFETCH
end

{`RET,`STATE_0}: begin
`POP_RS
pc <= rs_top;
mar <= rs_top;
`SS1
end
{`RET,`STATE_1}: begin
`SS2
end
{`RET,`STATE_2}: begin
`DO_PCFETCH
end

These are just case statements defining a decoder for the opcode and
state. The empty cycles where only the state is being updated are the
waits for the adder outputs. Note that even though CALL takes 4 cycles
and RET takes three, the overall average CPI is still around 1.9
(*without* instruction folding!). On a Cyclone II, this design ran at
216MHz with an effective performance of about 150 MIPS. This is a very
net performance increase than if I had stuck to a single cycle,
without much extra conceptual complexity.

This same method makes decoding pairs of instructions easy. This first
example folds DUP into the first cycle of an immediately following
CALL, leaving the state at 1, thus beginning to decode CALL (after DUP
is shifted out) at its second cycle. The second example does the same
for the 'R> PC@' sequence. This trick is commented out in my current
CPU design, as I'd have to enumerate all the possible instruction
pairings (not that many, really).

`define DO_JUMP pc <= mem_in; mar <= mem_in; `SS1
`define SHIFT_ISR isr <= isr_next;
`define PUSH_DS(value) ds_top <= (value); ds_next <= ds_top;
`DS_BODY_IN <= ds_next; `INC_DS

{`DUP,`CALL}: begin
`PUSH_RS(pc_next)
`DO_JUMP
`PUSH_DS(ds_top)
`SHIFT_ISR
end

{`RFROM,`PCFETCH}: begin
`PUSH_DS(`RS_BODY_OUT)
`POP_RS
`DO_PCFETCH
end

> The only fly in the single
> cycle ointment is that the synchronous memories used in FPGAs can't
> return a result in the same clock cycle.  So I look ahead to the
> address being written to the address stack on any given cycle and use
> that to address memory for reads.  The memory must perform the read on
> every cycle since it does not know if the next instruction will be a
> memory read or not.  This is not an issue in the architecture, but
> uses power which is wasteful in low power designs.

Interesting. So does this mean that you must construct the jump target
address on the address stack just before the jump instruction?

> Yes, I have analyzed critical paths a great deal trying to speed up my
> design.  It is not the data path that limits my speed.  It is the
> stack addressing.  That is one of the places that the instruction
> decode must be done to control the logic.  The result feeds status
> bits for stack errors.  I could omit these flags, but it seems to me
> to be useful for debugging.  As Jeff Fox might say, I can't count!
> But the flags can be omitted in a non-debugging version of the CPU or
> the flags could be pipelined with a delay of one instruction.

Hmm. This is odd, as incrementing/decrementing the 7-bit counter that
I used to address the block memory as a stack cost almost no time. It
certainly was much faster than the 32-bit adders for the PC and for
the ADD instruction.

<snip>


> But there have to be some control points that must
> be decoded ahead of the logic, for example, selecting pop/push in the
> stack operation.  My stack is an addressed memory block, so I have to
> select between the counter counting up/down or holding the current
> address.  Even when the delay of the decode is "free", the logic is
> not.  More complex decode uses more LUTs.

Hmm. If I understand this right, you have to have some values ready
ahead of others? This sounds unnecessarily complicated. This is a sign
that things should be done over multiple cycles.

> But there is more to driving a control point than just the opcode.  If
> you are running multiple cycles in your instructions, you also have to
> decode the state counter and for conditional instructions you need to
> decode the controlling signal.  I wanted to keep my decode to two LUT
> levels, but that was not possible, even with a 4 bit instruction.

Silly question: why are you worrying about such a low-level detail?
Are you very constrained on area?

> > >I would be very interested in learning more about your CPU design ideas.
>
> > Ask away! :)
>
> I'll need to read more.  What I have looked at so far seems to show
> your direction is very different from mine.  I see you don't even have
> interrupts.  What is the point of that?  It is the rare CPU that
> doesn't use interrupts.

I thought interrupts were too messy. I thought it simpler to keep the
CPU simple and small by not using interrupts, and then implementing
two such CPUs sharing a dual-ported RAM, with one CPU being master and
one slaved to watching/dealing with I/O. I simulated such a system and
it works (and ends up being similar to the Sh-Boom design).

> What about your unpublished results?

Well, the code above is some of it. It's just not in any finished
format. Just some rough presentations, hand-written notes on stack
scheduling, some Flight code running on an embarrassingly badly
written CPU simulator, and Verilog CPU code which works, but doesn't
implement more than a CPU, on-chip RAM, and a handful of output lines.

Cheers,

Eric

Brad

unread,
Jan 21, 2010, 2:49:05 PM1/21/10
to
On Jan 21, 8:04 am, Eric <eric.lafor...@gmail.com> wrote:
> On Jan 19, 1:02 pm, rickman <gnu...@gmail.com> wrote:
>
> > On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote:
> > > On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote:
> > I'm not trying to be absolutely "minimal" in terms of resources.  Like
> > I said, I need a balance with an emphasis on minimizing instruction
> > memory.  This is because I use smaller FPGAs in my designs.  For
> > example, a current module has a Lattice part with six 9-kbit block
> > rams.  This is also why I prefer to conserve that ninth bit if
> > practical.
>
> Ah. So if I understand, you want a high instruction density since you
> don't have much instruction memory available on-chip? Then yes, I can
> see that 9th bit being valuable.
>
OTOH, if you want to run really big applications (or have large data
sets) you can execute from SPI flash. Big and cheap. The speed isn't
too bad if library code is kept on-chip. The SPI clock can be 60 to 80
MHz, and some chips have dual and quad modes. Then 8-bit is a
convenient instruction size.

-Brad

Eric

unread,
Jan 22, 2010, 9:45:16 AM1/22/10
to
On Jan 11, 11:17 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Eric <eric.lafor...@gmail.com> writes:
> >Stack scheduling is the stack analog of register allocation. There is
> >good existing work on the topic,
>
> The only work I am aware of is on stack scheduling for basic blocks,
> and I am not very happy with what's possible there: If you have a
> JVM-like architecture (i.e., single-instruction load/stores from/to
> local variables), stack scheduling hardly buys anything unless you
> assume that stack manipulation instructions are much cheaper than
> load/store-local instructions.

Since the stacks in this case are in hardware, then yes, stack
manipulations are much faster (1 cycle instead of 2 for a load or
store, plus whatever overhead of getting the address in place).

> > and my approach begins with
> >traditional register allocation and then does this:
> >- Consider all registers in memory
> >- Map their lifetimes, load onto stack on first use
> >- Keep copies on the stack if still alive
> >- Load/stores become permutations/copies
> >- Must be done globally for good reuse
>
> This sounds to me like you will often get more items on the stack than
> can be easily handled by our shallow stack manipulation primitives
> (such as OVER).  You can keep maybe two long-lived variables on the
> stack, sometimes more, sometimes less depending on what you have to do
> for the short-lived stuff.

Likely. I had allowed -ROT and PICK (I don't remember why, perhaps I
could synthesize them in 2 cycles using instruction folding), which
meant handling four long-lived variables was OK since the copies were
often destroyed right afterwards (adding them, storing to mem,
etc...). That should definitely reduce load/store traffic to locals.

But I definitely concur: doing stack scheduling within basic blocks is
not useful. It has to be global to really have an effect.

> >This leave the big problem: addressing mode translation. MIPS-like
> >systems can generate efficient indirect and offset addressing modes
> >for free, while F21-like stack machines have only indirect, and
> >indirect-postincremented addressing modes (@ and @A+ for example)
> >using special registers (this is the need for index registers I
> >mention in my thesis). Translating between these mode is, as far as I
> >know, an open problem.
>
> Not if you don't demand efficiency.  And even if you demand
> efficiency, there's a bit of work on that for DSP architectures; it's
> called "storage assignment" optimization and it's about arranging
> variables in an order that adjacent accesses often access adjacent
> variables.  The classical paper on that has Stan Liao as first author.

I have to demand efficiency in addressing, otherwise the overhead of
calculating addresses dominates small loops!

I'll have to look into 'storage assignment'. Thanks!

> >If the compiler
> >could figure out which registers are used for addressing in loops, and
> >move them to A or R at the right time
>
> That's classical register allocation (although the number of registers
> is quite meagre).

Not quite. On a MIPS-like system, any register can act as an index or
base address, hence no special treatment is required and plain
register colouring (for example) works.

Stack machines need special-purpose index registers to efficiently do
loop counting and array addressing (register A in Chuck's chips, or
the Q Store on the KDF9, or X and Y on the B5000, etc...).

The challenge for doing register allocation on stack machines is to
detect the registers that are used in loops, and place them in the A
and R registers for fast post-incrementing indirect addressing. This
is definitely a place where 'storage assignment' would also help.

> But is it worth the trouble to implement a pretty complex compiler in
> order to compile C efficiently to a Forth machine?  If your
> requirements include executing C code efficiently, adding
> architectural features to make that easy with a straight-forward
> compiler seems to be a better approach to me.  Maybe something like
> the hybrid you propose.

The hybrid does seem like the best approach: keeps existing compiler
technology and processor architecture, but grants native support for
fast threaded code.

However, my little experiments suggest that on an FPGA platform at
least, given 'optimal' code for a given algorithm, a second-generation
stack machine could have about the same net performance as a similarly
complex MIPS-like machine, but for maybe half the area cost.

This is possible since although it takes twice as many cycles on a
stack machine to do the same work, it tends to run at twice the clock
rate. Hence it would be worth it to carry over existing C code to such
a stack machine.

Here's an example I worked out by hand (more unpublished stuff from
presentation slides). The C and MIPS assembly are straightforward. You
can see that the stack-scheduled translation of the MIPS code spends a
lot of time manipulating the array pointers (t2 and t6) to do load/
stores and to increment their address, without using the A register
for the consecutive accesses (A[j] and A[j+1]).

On the other hand, the hand-coded stack version uses A to good
advantage. On a modern stack machine with instruction folding, this
stack code should run with similar wall-time to the MIPS code on a
MIPS machine. If stack scheduling could use A and R for addressing and
looping, then the compiler might be able to translate the optimized
MIPS code to something like the hand-coded version.


Example: Bubble Sort C code

for (i=n-1; i >= 1; --i){
for (j = 1; j <= i; ++j){
if(A[j] > A[j+1]){
temp = A[j];
A[j] = A[j+1];
A[j+1] = temp;
}
}
}


Optimized MIPS code (inner loop only)

t1 = j - 1
t2 = t1 << 2
t2 = t2 + A
t6 = j << 2
t6 = t6 + A
loop: t3 = [t2]
t7 = [t6]
ble t3, t7, skip
[t2] = t7
[t6] = t3
skip: t2 = t2 + 4
t6 = t6 + 4
br loop

Average cycles per loop: 8
Reasonable stack test: 2 values, 2 addresses


Stack-Scheduled Equivalent to MIPS code

LIT j LIT -1 +
2* 2*
LIT A +
LIT j 2* 2*
LIT A + (t6 t2)
loop: OVER @ SWAP (t6 t3 t2)
DUP @ (t7 t6 t3 t2)
-ROT (t3 t7 t6 t2)
OVER OVER – JMP+ fix (t3 t7 t6 t2)
>R PICK (t2 t7 t6 t2) R(t3)
! R> (t3 t6 t2)
OVER ! (t6 t2)
skip: SWAP LIT 4 + (t2 t6)
SWAP LIT 4 + (t6 t2)
JMP loop
fix: DROP DROP JMP skip (t6 t2)

Average cycles per loop: 29.5 (incl. folds)


Hand-Coded Stack Version

LIT A (t2)
DUP >A (t2) A(t2)
loop: @A+ (t3 t2) A(t6)
@A (t7 t3 t2) A(t6)
OVER OVER - (t7-t3 t7 t3 t2) A(t6)
JMP+ skip (t7 t3 t2) A(t6)
-ROT (t2 t7 t3) A(t6)
>A (t7 t3) A(t2)
!A+ (t3) A(t6)
!A () A(t6)
A> (t6) A(t6)
JMP loop
skip: DROP DROP DROP A> (t6) A(t6)
JMP loop

Average cycles per loop: 16.5 (incl. folds)
(but this is a fundamentally different implementation!)

Eric

Message has been deleted

rickman

unread,
Jan 22, 2010, 12:26:16 PM1/22/10
to
On Jan 21, 10:04 am, Eric <eric.lafor...@gmail.com> wrote:
> On Jan 19, 1:02 pm, rickman <gnu...@gmail.com> wrote:
>
> > On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote:
> > > On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote:
> > I'm not trying to be absolutely "minimal" in terms of resources. Like
> > I said, I need a balance with an emphasis on minimizing instruction
> > memory. This is because I use smaller FPGAs in my designs. For
> > example, a current module has a Lattice part with six 9-kbit block
> > rams. This is also why I prefer to conserve that ninth bit if
> > practical.
>
> Ah. So if I understand, you want a high instruction density since you
> don't have much instruction memory available on-chip? Then yes, I can
> see that 9th bit being valuable.
>
> What's the application? That will definitely affect how many opcodes
> you need.

The initial app was a DMA - I/O controller for a DSP chip. The
project wasn't expanded past the first rev, so the design was never
fully utilized. The second potential app is to off-load some low
speed signal processing from hardware to reduce the overall size of an
FPGA design. The signal processing is rather simple and can be done
in software far faster than required. This app requires some amount
of multiplies in the initial parts of the algorithm. That is the only
part of the processing that might want optimization, perhaps a bit
serial multiply ala Chuck's architectures. Currently an upgrade I am
working on for this design (without the CPU) had to switch to a bit
serial multiplier to make it fit.


> > Many people talk about "packing" instructions into a larger word. I
> > have resisted that since it requires more logic and adds delay to the
> > path.
>
> I don't know what your area constraints are, but I think the logic
> required is minimal.
> Also, from my experience, that decoding logic has never been in the
> critical path, and so it does not add delay.

Yes, it is minimal, but my real point is that it isn't a great choice
for other reasons either, so why waste the resources. Of course, if
you program store is off chip and slower than the CPU, this is a
valuable speedup technique. I am talking about fully embedded CPU
with program store matching the speed of the CPU.


> > Another issue I have with packing instructions is the way that it uses
> > memory inefficiently. Often some number of slots must be treated as
> > nops and can't be used. That drives down the packing density.
>
> Very true. IIRC, I observed an average of three 5-bit instructions per
> 32-bit words, out of a maximum of 6 (the last two bits were ignored).
> However, if you add an instruction stack to the processor, then you
> can use the instruction slots located after calls and conditional
> jumps, which would normally be unreachable. I estimated that this
> would bring the instruction density up to almost 6 (still can't use
> slots after a RET), and the additional cost in area would be recouped
> in saved RAM for any RAM larger than a couple hundred words IIRC
> (again, the exact numbers are in my thesis).

This is my real concern. I believe Chuck's designs use instruction
packing a bit differently. He includes all literals in the *same*
word as the instruction referencing them. So if the word doesn't have
enough room left for the size data you are using, the current word is
packed with nops and the instruction starts in the next word. Yet
another way to waste instruction memory if it is not plentiful as it
is in his designs.


> One option I didn't consider because it would have complicated the
> Forth runtime, was to logically extend the last two bits with zeros so
> that some instructions could be fit in those two bits, so long as they
> end in zeros.

I believe that is how Bernd Paysan's B16 CPU does it. His design is
very similar to mine, except for the instructions and opcode formats.


> > I also
> > prefer to build literals a piece at a time rather than use a full word
> > for each. An 8 bit literal can specify a relative jump of +127 to
> > -128 instructions which is a good range. Although this is not exactly
> > how my instructions use the literal for jumps.
>
> Ah. That's a different approach than I took. I wanted simple, general
> numbers good for computing and addressing. I'd like to know more about
> your architecture: for example, how to you construct a CALL to code
> that is beyond 128 instructions away?

The literal instruction has the msb (bit 8 of 8 downto 0) clear. The
other 8 bits are loaded onto the return stack sign extended. A flag
is set indicating that a literal is loaded. If another literal
instruction is executed immediately after, it is left shifted into the
return stack top and the flag remains set. ANY other instruction
clears the literal flag so that the next literal instruction loads
instead of shifts it data. So addresses (or any other data) are built
up on the return stack one byte at a time using half of the available
opcodes in whatever size opcode you use. To use the data on the data
stack requires an additional instruction to move it, R>. The other
caveat is to use a nop between successive literal words if no other
instructions are needed. This format is extensible to any data word
size without modification.


> > > All you end up doing with smaller instructions is synthesizing the
> > > larger ones out of several instructions. I think the best approach is
> > > to use a small instruction set (4 or 5 bits, *maybe* 6 for very
> > > special-purpose stuff that couldn't be made as memory-mapped
> > > functional modules) and decode *pairs* or *triplets* together.
>
> > Yes, but it also slow the machine by needing two cycles instead of
> > one. Of course that is a trade off and must be considered in a
> > context.
>
> Not quite. What I was saying is that decoding instructions did not
> affect the critical path, so decoding more bits made no difference.
> What cost and extra cycle for some instructions was waiting for the
> adder (when adding numbers or incrementing the PC) to settle.

I understand that. In my designs I don't see the adder being the
critical path. The adder is pretty fast and with nothing else
controlling it or depending on it, it is not in the critical path.
The critical path in my design seems to always be in the stack
pointers. They have to depend on instruction decode and the error
flags depend on them. So even though they are only 8 bits long, they
form the critical path along with the decode that affects them.


> When decoding instruction pairs, this often means that the effects of
> the second instruction can be done 'for free' during the second cycle,
> when the first instruction is storing the output of the adder.

I'm not sure of your architecture, so it is not clear to me what is
happening. There are a lot of subtle optimizations depending on how
your architecture is set up. In my design, the return and data stacks
are capable of running in parallel. So it would make sense to run
instructions in pairs if the decode doesn't get too complex (which I
am not convinced of). I can't quite squeeze both into a single 9 bit
instruction. Eleven bits would do the job quite well I think, but
doesn't fit my FPGA so well. :^(

It isn't the code that gets messy, it is just more bits to include in
the decode. As you add bits to the decode, the amount of logic
increases rapidly. Do you define only specific pairs of instructions
to be concurrently executed? I would have thought you would just
decode both opcodes and enforce the paring in the compiler. Or maybe
I don't get how this works. Is this two opcodes or one opcode
combining the functionality?


> > The only fly in the single
> > cycle ointment is that the synchronous memories used in FPGAs can't
> > return a result in the same clock cycle. So I look ahead to the
> > address being written to the address stack on any given cycle and use
> > that to address memory for reads. The memory must perform the read on
> > every cycle since it does not know if the next instruction will be a
> > memory read or not. This is not an issue in the architecture, but
> > uses power which is wasteful in low power designs.
>
> Interesting. So does this mean that you must construct the jump target
> address on the address stack just before the jump instruction?

The address is always constructed on the return stack. The
instruction flow would be like this...

LIT 0x12
LIT 0x00
JMP

Three instructions, three clock cycles and the next cycle is executing
the code being jumped to. The jump instruction actually has five bits
which are shifted into the address, so up to 13 bits of address
(signed) can be jumped (called) in two instructions and +15 -16
instruction range can be jumped (called) in one instruction. Small
loops can be very efficient this way. The entire instruction space of
my current target fits in two bytes (> 6 kB).


> > Yes, I have analyzed critical paths a great deal trying to speed up my
> > design. It is not the data path that limits my speed. It is the
> > stack addressing. That is one of the places that the instruction
> > decode must be done to control the logic. The result feeds status
> > bits for stack errors. I could omit these flags, but it seems to me
> > to be useful for debugging. As Jeff Fox might say, I can't count!
> > But the flags can be omitted in a non-debugging version of the CPU or
> > the flags could be pipelined with a delay of one instruction.
>
> Hmm. This is odd, as incrementing/decrementing the 7-bit counter that
> I used to address the block memory as a stack cost almost no time. It
> certainly was much faster than the 32-bit adders for the PC and for
> the ADD instruction.

In my design, the 16 data path adder is part of the first layer of
logic following the data stack register. It is followed by several
layers of mux to reach the top of data stack. The first layer uses
"fast" decodes to select subtracting vs. adding vs. logic functions
and are hand optimized. The adder output runs through a large mux to
reach the data top of stack register. The decodes driving the muxes
are "slow" decodes because they have the full time of the first stage
of the data path before they are the slower of the two.

The 8 bit stack counters have several choices of selection and then
run through significant logic to reach the flag registers. It may be
that I am doing something very inefficient here, but I have looked at
it and can't find anything in particular to speed it up. In FPGAs the
delay through adders is not slow at all. It can take some 6 or 8 bits
of carry to equal one LUT delay and the routing between LUTs can be
even slower. So normally adders are not critical path items just
because of the adders.


> <snip>
>
> > But there have to be some control points that must
> > be decoded ahead of the logic, for example, selecting pop/push in the
> > stack operation. My stack is an addressed memory block, so I have to
> > select between the counter counting up/down or holding the current
> > address. Even when the delay of the decode is "free", the logic is
> > not. More complex decode uses more LUTs.
>
> Hmm. If I understand this right, you have to have some values ready
> ahead of others? This sounds unnecessarily complicated. This is a sign
> that things should be done over multiple cycles.

Not *values*, control points. In your data path adder, how do you
make it a subtraction instead of an add? Most FPGAs support an input
to the adder which switches it between the two modes. But this input
has to come from the instruction decode logic. So any LUTs in that
path will define the slow path through the adder to the register, not
the adder by itself. If a small adder (such as the stack pointer) has
more logic in the path, it should be the critical path instead of a
larger adder with less logic. BYMMV...


> > But there is more to driving a control point than just the opcode. If
> > you are running multiple cycles in your instructions, you also have to
> > decode the state counter and for conditional instructions you need to
> > decode the controlling signal. I wanted to keep my decode to two LUT
> > levels, but that was not possible, even with a 4 bit instruction.
>
> Silly question: why are you worrying about such a low-level detail?
> Are you very constrained on area?

Speed and simplicity. I am a hardware guy, so I look to keep the
implementation as simple at the hardware level. Its not so easy
though. Design a simple architecture and every feature added muddies
the "simple' waters. Interrupts added another input bit to the decode
logic. A master reset adds a bit to the decode logic. It quickly
gets out of hand and with a 9 bit opcode, I gave up. :^)


> > > >I would be very interested in learning more about your CPU design ideas.
>
> > > Ask away! :)
>
> > I'll need to read more. What I have looked at so far seems to show
> > your direction is very different from mine. I see you don't even have
> > interrupts. What is the point of that? It is the rare CPU that
> > doesn't use interrupts.
>
> I thought interrupts were too messy. I thought it simpler to keep the
> CPU simple and small by not using interrupts, and then implementing
> two such CPUs sharing a dual-ported RAM, with one CPU being master and
> one slaved to watching/dealing with I/O. I simulated such a system and
> it works (and ends up being similar to the Sh-Boom design).

It may work, but it uses *twice* the logic which is not an option in
my case. Interrupts are just a subroutine call with a PSW save. Save
the return address on the return stack and the PSW on the data
stack... one clock cycle. I implemented one of the 8 levels of
interrupt as a "debug" interrupt which interrupts all the time. The
net effect is to execute one instruction before calling the interrupt
routine again... aka single stepping. Or course many Forthers will
tell me this is not needed, but better to have a debug feature I only
need once than to have a problem that takes me weeks to solve.


> > What about your unpublished results?
>
> Well, the code above is some of it. It's just not in any finished
> format. Just some rough presentations, hand-written notes on stack
> scheduling, some Flight code running on an embarrassingly badly
> written CPU simulator, and Verilog CPU code which works, but doesn't
> implement more than a CPU, on-chip RAM, and a handful of output lines.


Thanks for sharing. I have some real work to get done, so I need to
focus on that for a couple of weeks. But I'll be back.

Rick

rickman

unread,
Jan 22, 2010, 12:28:56 PM1/22/10
to

Right now I am using the 9th bit for a parallel return on many
opcodes. The opcodes that don't have the parallel return use the lsbs
as data which is flexible. So it is possible to have the same (or
similar) CPU compiled for either the 8 bit or 9 bit opcode format.

Rick

Reply all
Reply to author
Forward
0 new messages