Published a thesis a couple of years back on stack computers and what not but has since (?) disappeared. Anyone know if he's still involved with stack computers / forth?
> Published a thesis a couple of years back on stack computers and what > not but has since (?) disappeared. Anyone know if he's still involved > with stack computers / forth?
He mentioned to me this April that he had been pushing the research described in http://is.uwaterloo.ca/charles_laforest.html further and got a ~200MHz Forth CPU running on an Altera Cyclone II FPGA, using about 2 cycles per instruction on average.
He also described an interesting register/stack hybrid idea which would be a MIPS-like processor where the first register would actually be a data stack and the last register would be the PC counter and return stack. He wasn't working on it anymore, however, and I last heard from him in June.
On Dec 4 2009, 12:29 am, Mux <yvo.z...@gmail.com> wrote:
> Published a thesis a couple of years back on stack computers and what > not but has since (?) disappeared. Anyone know if he's still involved > with stack computers / forth?
> -Mux
Hi Mux,
Yep, Jecel's right. I'm not working on stack computers and Forth at the moment since my PhD is not about those. (My Master's was about multi-ported memories on FPGAs. I'll be presenting this at the ACM FPGA 2010 conference in Monterey, this February.) I suppose I should take the time someday and post the remaining code and ideas for others to see.
In a nutshell:
Got a 32-bit F21-ish CPU running on a Cyclone II FPGA (DE2 Terasic board). Discovered that instruction decoding is almost free on FPGA- based soft-processors, thus one can decode pairs of instructions and execute them "in parallel" by correctly controlling the stacks and ALU (no extra control lines). Net result is a sort of hardware peephole optimizer. Gets much of the benefit of the older horizontally microcoded machines Koopman describes, but without the need for a smart compiler. In fact, this 'instruction folding' implements on stack machines the benefits that pipelining, data forwarding, and random register access do on conventional MIPS-like machines.
I estimate that average Forth code would speed up 45% with this technique. As for size and speed: 450 lines of high-level, portable Verilog code, 1074 LEs with 8192 bits M4K RAM blocks, tested at 216 Mhz. CPI of about 1.9, but that's without the aforementioned optimizations. On a Stratix II FPGA, I estimate this CPU should run at 420MHz and take 668 LEs area. The limiting factor is the adder carry- chain delay, hence why some instructions (any that increments PC, R, A, or TOP) take 2 cycles. Others, like DUP or >R, take 1 cycle.
This is all nice and good, but still only if you program it in Forth. Compiling C to such a machine has two problems: stack scheduling and addressing mode translation:
Stack scheduling is the stack analog of register allocation. There is good existing work on the topic, and my approach begins with traditional register allocation and then does this: - Consider all registers in memory - Map their lifetimes, load onto stack on first use - Keep copies on the stack if still alive - Load/stores become permutations/copies - Must be done globally for good reuse These simple permutations (OVER. DUP, DROP, etc...) take only a single cycle and can also be folded into other instructions. This seems (on paper) to work very well, but it's not tested in actual software yet.
This leave the big problem: addressing mode translation. MIPS-like systems can generate efficient indirect and offset addressing modes for free, while F21-like stack machines have only indirect, and indirect-postincremented addressing modes (@ and @A+ for example) using special registers (this is the need for index registers I mention in my thesis). Translating between these mode is, as far as I know, an open problem. The same one exists when trying to compile C code to the odd addressing modes of DSP processors. If the compiler could figure out which registers are used for addressing in loops, and move them to A or R at the right time as part of stack scheduling, then it might be possible to machine-generate the same code that one constructs "by hand" when writing Forth code, which is significantly different even when implementing the same algorithm.
The register/stack hybrid idea is to take the first two registers and use them as TOP and NEXT. Writing to TOP pushes TOP to NEXT, and NEXT to a stack (there's a few quirks around this to get consistent behaviour, but nothing much). Similarly, the last register, usually used as a program counter by convention, would be equivalent to R with a stack under it. The other registers behave as usual. This change makes it possible to efficiently synthesize stack instructions on a MIPS-like system, granting all the efficiencies of stack code we love and enjoy, but without breaking support for conventional code. In other words, this would allow using a conventional computer architecture, but you could run Forth code on it pretty much natively. It would also allow some new optimizations for C compilers, mostly having to do with function calls and system calls. Furthermore, as a nice side-effect, the stacks would allow you to make do with fewer registers, say 8, which means you could now encode the usual R/I/J- type MIPS instructions into only 16 bits, granting better code density while still optionally supporting full-width 32-bit instructions, and decouples the IF stage from the instruction cache.
One last idea I toyed with in a graduate OS design course, was to add support for system calls and memory range protection to a stack machine. Needs 3 extra registers and one mode bit, uses the stacks for system calls and traps, and should grant full machine virtualization support for simple stack machines. I wrote a simulator and the bare kernel of an OS (in Flight, the same Forth dialect used in my thesis), and implemented a few simple system calls and exception handlers. Long story short: it works, and can switch context in 80 to 100 cycles, about a 3 to 6 times speedup over Linux on x86. It could potentially be faster, depending on what's sufficient context to save for a context switch.
Hope this answers your questions. Feel free to ask more. :)
Wow, that was a lot of info. I have been dabbling with a Forth CPU in FPGA for a number of years. I started by reading what I could find which included Koopman's book on stack machines. I studied several other stack CPU architectures incuding Bernd Paysan's B16 and several of Chuck Moore's CPUs, especially the ones using separate registers. I have to say, during all this I managed to entirely miss your work.
My goal was to produce a CPU that would facilitate programming in Forth, run code quickly and use a minimum amount of resources in the FPGA. To that last goal I tried very small instructions, initially 4 bit, optimizing the instruction mix using the list Koopman provided for instruction frequency, both static and dynamic. I decided that 4 bits was just no enough for an efficient machine requiring many more instructions. This is essentially a trade off between processing speed and instruction density. I pushed the instruction size up to 8 bits which resulted in many more possible opcodes than needed, then realizing that most FPGA memory will give you a 9th bit for free, I added a single bit as a return instruction. So the current instruction set has more than needed, but does not fully exploit the parallelism possible...
I guess I didn't describe the architecture. I liked Chuck's idea of using dedicated address registers. Like Bernd, (although I am pretty sure it was independent on my part) I decided to go with two stacks, one for data and one for addresses. This is a step further than just having a return stack. A memory operation uses one stack for address and one for data allowing the stack to be a bit simpler having the operations of memory push one, memory pop one, and the top register data coming from the stack memory or another source. This helps the CPU run all instructions in one clock cycle.
When looking at instruction encoding, I realized that 9 bits was nearly enough to select address and data operations as separate fields allowing parallel operations... but not quite. I have not looked hard at the tradeoff between parallelism of the two stacks and instruction memory usage. But so far, everything I have tried has shown that the trend to minimization of instructions is a better choice than having a wider opcode allowing more instructions, up to a point I guess. I didn't like using 4 bits and the least I have seen in any other implemented CPU was 5 bits packed into larger instruction memory words.
One thing you posted that got my attention was when you said "instruction decoding is almost free on FPGA-based soft-processors". How is instruction decoding free? The instruction decode has been one of the stumbling blocks in my efforts. If I could really get this for "almost free", it would solve a lot of problems.
I would be very interested in learning more about your CPU design ideas.
Eric <eric.lafor...@gmail.com> writes: >Stack scheduling is the stack analog of register allocation. There is >good existing work on the topic,
The only work I am aware of is on stack scheduling for basic blocks, and I am not very happy with what's possible there: If you have a JVM-like architecture (i.e., single-instruction load/stores from/to local variables), stack scheduling hardly buys anything unless you assume that stack manipulation instructions are much cheaper than load/store-local instructions.
OTOH, if you keep values on the stack across basic block boundaries, there are savings possible even for such an archtiecture if a locals access is just as cheap as a stack manipulation instruction. But I have no good idea for how to find which values to keep on the stack where and in which order. Hmm, maybe something along the lines of coagulating register allocation. But even then I don't know how to do the stuff within a basic block; the work on basic-block stack scheduling may give some guidance, but I don't tink it fits perfectly.
> and my approach begins with >traditional register allocation and then does this: >- Consider all registers in memory >- Map their lifetimes, load onto stack on first use >- Keep copies on the stack if still alive >- Load/stores become permutations/copies >- Must be done globally for good reuse
This sounds to me like you will often get more items on the stack than can be easily handled by our shallow stack manipulation primitives (such as OVER). You can keep maybe two long-lived variables on the stack, sometimes more, sometimes less depending on what you have to do for the short-lived stuff.
>This leave the big problem: addressing mode translation. MIPS-like >systems can generate efficient indirect and offset addressing modes >for free, while F21-like stack machines have only indirect, and >indirect-postincremented addressing modes (@ and @A+ for example) >using special registers (this is the need for index registers I >mention in my thesis). Translating between these mode is, as far as I >know, an open problem.
Not if you don't demand efficiency. And even if you demand efficiency, there's a bit of work on that for DSP architectures; it's called "storage assignment" optimization and it's about arranging variables in an order that adjacent accesses often access adjacent variables. The classical paper on that has Stan Liao as first author.
>If the compiler >could figure out which registers are used for addressing in loops, and >move them to A or R at the right time
That's classical register allocation (although the number of registers is quite meagre).
But is it worth the trouble to implement a pretty complex compiler in order to compile C efficiently to a Forth machine? If your requirements include executing C code efficiently, adding architectural features to make that easy with a straight-forward compiler seems to be a better approach to me. Maybe something like the hybrid you propose.
On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote:
> My goal was to produce a CPU that would facilitate programming in > Forth, run code quickly and use a minimum amount of resources in the > FPGA.
Isn't that all our goal? ;) For a really minimal Forth chip, check out (if you can find it online) Myron Plichota's Steamer16 (and descendants). It uses 3-bit instructions and a 3-deep hardware stack, but with a software stack, manages to run Forth on a very little hardware at the price of more memory traffic. Still, he reached 20 MIPS with it.
> To that last goal I tried very small instructions, initially 4 > bit, optimizing the instruction mix using the list Koopman provided > for instruction frequency, both static and dynamic. I decided that 4 > bits was just no enough for an efficient machine requiring many more > instructions. This is essentially a trade off between processing > speed and instruction density. I pushed the instruction size up to 8 > bits which resulted in many more possible opcodes than needed, then > realizing that most FPGA memory will give you a 9th bit for free, I > added a single bit as a return instruction. So the current > instruction set has more than needed, but does not fully exploit the > parallelism possible...
While I consider to be open the question of whether packing 4, 5, or 8- bit instructions gives the best density, I do think that it's a bit moot.
All you end up doing with smaller instructions is synthesizing the larger ones out of several instructions. I think the best approach is to use a small instruction set (4 or 5 bits, *maybe* 6 for very special-purpose stuff that couldn't be made as memory-mapped functional modules) and decode *pairs* or *triplets* together.
Often, the effect on the stacks, registers, and memory of two consecutive instructions can be combined together without having to add any control lines. It's just more clever instruction decoding. I outline that idea in the Further Work section of my thesis. Some later work I've done (but not published) supports the idea further.
Lastly, if you really care about packing instructions densely, also have a look at the instruction stack I describe in there.
<snip>
> This helps the CPU run all instructions in one clock cycle.
Hmm. I've found that this is a fallacy. It's not important. What is important is keeping the critical path of instructions to a minimum.
For example, I found that the clock speed of my last stack CPU was limited by the carry-delay of the adder circuitry. This meant that even an instruction that didn't use an adder, like DUP, would suffer that speed penalty.
The solution was to break instructions that use the adder into two cycles, using the first cycle to let the adder output settle, then the second to store that output. Instructions that didn't use the adder still took one cycle. The net result is a faster clock cycle, and a net performance increase.
This is what Chuck does in his designs when some NOPs must be used after a +. I just explicitly enforce the delay, while he saves hardware and lets the programmer deal with it.
<snip>
> I didn't like using 4 bits and the least I have seen in any other > implemented CPU was 5 bits packed into larger instruction memory > words.
I have to admit, 5 bits does seem to be a sweet spot. Enough instructions to support most software well, with a few opcodes left for special uses, like context switching or somesuch...
> One thing you posted that got my attention was when you said > "instruction decoding is almost free on FPGA-based soft-processors". > How is instruction decoding free? The instruction decode has been one > of the stumbling blocks in my efforts. If I could really get this for > "almost free", it would solve a lot of problems.
"Almost free" here means that it contributes little delay. One of my colleagues here studied how to pipeline MIPS-like soft-processors, and it ends up that simply folding decoding into the instruction fetching stage was always a win. There was no need to give it a pipeline stage of its own.
Similarly, I found that the instruction decoding on my stack CPU was never anywhere near the critical path in terms of how long it took, so that means that there is room to do more complex decoding of pairs/ triplets of instructions.
This low delay may be due to the fact that decoding a 4 or 5-bit instruction requires a single LUT for each control line. That's not much circuitry in an FPGA.
>I would be very interested in learning more about your CPU design ideas.
On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote:
> On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote:
> > My goal was to produce a CPU that would facilitate programming in > > Forth, run code quickly and use a minimum amount of resources in the > > FPGA.
> Isn't that all our goal? ;)
I see the smiley, but there are other things to optimize that are not so important to me. Such as addressing a large memory or being able to support C such as the ZPU. I found the ZPU to be very interesting, but the crowd designing it is dedicated to C'ing it just one way...
> For a really minimal Forth chip, check out (if you can find it online) > Myron Plichota's Steamer16 (and descendants). It uses 3-bit > instructions and a 3-deep hardware stack, but with a software stack, > manages to run Forth on a very little hardware at the price of more > memory traffic. Still, he reached 20 MIPS with it.
I'm not trying to be absolutely "minimal" in terms of resources. Like I said, I need a balance with an emphasis on minimizing instruction memory. This is because I use smaller FPGAs in my designs. For example, a current module has a Lattice part with six 9-kbit block rams. This is also why I prefer to conserve that ninth bit if practical.
> > To that last goal I tried very small instructions, initially 4 > > bit, optimizing the instruction mix using the list Koopman provided > > for instruction frequency, both static and dynamic. I decided that 4 > > bits was just no enough for an efficient machine requiring many more > > instructions. This is essentially a trade off between processing > > speed and instruction density. I pushed the instruction size up to 8 > > bits which resulted in many more possible opcodes than needed, then > > realizing that most FPGA memory will give you a 9th bit for free, I > > added a single bit as a return instruction. So the current > > instruction set has more than needed, but does not fully exploit the > > parallelism possible...
> While I consider to be open the question of whether packing 4, 5, or 8- > bit instructions gives the best density, I do think that it's a bit > moot.
Many people talk about "packing" instructions into a larger word. I have resisted that since it requires more logic and adds delay to the path. When using FPGA memory, it can be configured in a number of widths to match the instruction size without muxing (unpacking)... as long as you match the instruction size to the supported memory widths... Typically that is 1, 2, 4 and multiples of 9 up to 36 bits. 72 can be supported by using both ports on dual port memories. So to optimize memory usage, I see 9 bits as the "sweet" spot. I've been focusing on how to use that size instruction in an optimal way.
Another issue I have with packing instructions is the way that it uses memory inefficiently. Often some number of slots must be treated as nops and can't be used. That drives down the packing density. I also prefer to build literals a piece at a time rather than use a full word for each. An 8 bit literal can specify a relative jump of +127 to -128 instructions which is a good range. Although this is not exactly how my instructions use the literal for jumps.
> All you end up doing with smaller instructions is synthesizing the > larger ones out of several instructions. I think the best approach is > to use a small instruction set (4 or 5 bits, *maybe* 6 for very > special-purpose stuff that couldn't be made as memory-mapped > functional modules) and decode *pairs* or *triplets* together.
Yes, but it also slow the machine by needing two cycles instead of one. Of course that is a trade off and must be considered in a context.
> Often, the effect on the stacks, registers, and memory of two > consecutive instructions can be combined together without having to > add any control lines. It's just more clever instruction decoding. I > outline that idea in the Further Work section of my thesis. Some later > work I've done (but not published) supports the idea further.
I was seeing that in my two stack architecture, data and address. I could almost squeeze two independent instruction fields into a 9 bit instruction, but it was just a little too limiting. Using 5 or 6 bit instructions might be optimal for allowing instructions to be executed in parallel, but they don't map well into the available memory widths.
> Lastly, if you really care about packing instructions densely, also > have a look at the instruction stack I describe in there.
Yes, I'll look at that, but it will take a bit. I'll have to read a fair portion of your thesis to get enough background to understand what that section is saying.
> > This helps the CPU run all instructions in one clock cycle.
> Hmm. I've found that this is a fallacy. It's not important. What is > important is keeping the critical path of instructions to a minimum.
I'm not sure what the fallacy is. I want to keep instructions to one clock cycle not just for speed, but for simplicity of design and analysis. My current design executes *all* instructions in one clock cycle including subroutine calls, interrupts and returns. The interrupt and corresponding return uses the data and address stacks to store the return address and processor status in one clock cycle. So an interrupt can enter the corresponding code in once clock cycle. That's pretty good interrupt latency! The only fly in the single cycle ointment is that the synchronous memories used in FPGAs can't return a result in the same clock cycle. So I look ahead to the address being written to the address stack on any given cycle and use that to address memory for reads. The memory must perform the read on every cycle since it does not know if the next instruction will be a memory read or not. This is not an issue in the architecture, but uses power which is wasteful in low power designs.
> For example, I found that the clock speed of my last stack CPU was > limited by the carry-delay of the adder circuitry. This meant that > even an instruction that didn't use an adder, like DUP, would suffer > that speed penalty.
Yes, I have analyzed critical paths a great deal trying to speed up my design. It is not the data path that limits my speed. It is the stack addressing. That is one of the places that the instruction decode must be done to control the logic. The result feeds status bits for stack errors. I could omit these flags, but it seems to me to be useful for debugging. As Jeff Fox might say, I can't count! But the flags can be omitted in a non-debugging version of the CPU or the flags could be pipelined with a delay of one instruction.
> The solution was to break instructions that use the adder into two > cycles, using the first cycle to let the adder output settle, then the > second to store that output. Instructions that didn't use the adder > still took one cycle. The net result is a faster clock cycle, and a > net performance increase.
> This is what Chuck does in his designs when some NOPs must be used > after a +. I just explicitly enforce the delay, while he saves > hardware and lets the programmer deal with it.
I think we are talking a different level of speed. I want to make a lean, fast machine, but I am not trying to find the limit posed by the transistor speeds as Chuck often does. Also, working in an FPGA is very different from workin in an ASIC.
> > I didn't like using 4 bits and the least I have seen in any other > > implemented CPU was 5 bits packed into larger instruction memory > > words.
> I have to admit, 5 bits does seem to be a sweet spot. Enough > instructions to support most software well, with a few opcodes left > for special uses, like context switching or somesuch...
> > One thing you posted that got my attention was when you said > > "instruction decoding is almost free on FPGA-based soft-processors". > > How is instruction decoding free? The instruction decode has been one > > of the stumbling blocks in my efforts. If I could really get this for > > "almost free", it would solve a lot of problems.
> "Almost free" here means that it contributes little delay. One of my > colleagues here studied how to pipeline MIPS-like soft-processors, and > it ends up that simply folding decoding into the instruction fetching > stage was always a win. There was no need to give it a pipeline stage > of its own.
> Similarly, I found that the instruction decoding on my stack CPU was > never anywhere near the critical path in terms of how long it took, so > that means that there is room to do more complex decoding of pairs/ > triplets of instructions.
There a some places in my design where the decode feeds a control point that follows other logic. So the speed of the decode does not slow the design. But there have to be some control points that must be decoded ahead of the logic, for example, selecting pop/push in the stack operation. My stack is an addressed memory block, so I have to select between the counter counting up/down or holding the current address. Even when the delay of the decode is "free", the logic is not. More complex decode uses more LUTs.
> This low delay may be due to the fact that decoding a 4 or 5-bit > instruction requires a single LUT for each control line. That's not > much circuitry in an FPGA.
But there is more to driving a control point than just the opcode. If you are running multiple cycles in your instructions, you also have to decode the state counter and for conditional instructions you need to decode the controlling signal. I wanted to keep my decode to two LUT levels, but that was not possible, even with a 4 bit instruction. That's another reason I gave that up and went to 9 bits. My opcode format is such that the first n bits can be decoded in pieces to enable classes of instructions. It never needs more than one LUT to enable a class, then the individual instructions get 4 opcode bits plus the enable and
...
On Jan 19, 1:02 pm, rickman <gnu...@gmail.com> wrote:
> On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote: > > On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote: > I'm not trying to be absolutely "minimal" in terms of resources. Like > I said, I need a balance with an emphasis on minimizing instruction > memory. This is because I use smaller FPGAs in my designs. For > example, a current module has a Lattice part with six 9-kbit block > rams. This is also why I prefer to conserve that ninth bit if > practical.
Ah. So if I understand, you want a high instruction density since you don't have much instruction memory available on-chip? Then yes, I can see that 9th bit being valuable.
What's the application? That will definitely affect how many opcodes you need.
> Many people talk about "packing" instructions into a larger word. I > have resisted that since it requires more logic and adds delay to the > path.
I don't know what your area constraints are, but I think the logic required is minimal. Also, from my experience, that decoding logic has never been in the critical path, and so it does not add delay.
> Another issue I have with packing instructions is the way that it uses > memory inefficiently. Often some number of slots must be treated as > nops and can't be used. That drives down the packing density.
Very true. IIRC, I observed an average of three 5-bit instructions per 32-bit words, out of a maximum of 6 (the last two bits were ignored). However, if you add an instruction stack to the processor, then you can use the instruction slots located after calls and conditional jumps, which would normally be unreachable. I estimated that this would bring the instruction density up to almost 6 (still can't use slots after a RET), and the additional cost in area would be recouped in saved RAM for any RAM larger than a couple hundred words IIRC (again, the exact numbers are in my thesis).
One option I didn't consider because it would have complicated the Forth runtime, was to logically extend the last two bits with zeros so that some instructions could be fit in those two bits, so long as they end in zeros.
> I also > prefer to build literals a piece at a time rather than use a full word > for each. An 8 bit literal can specify a relative jump of +127 to > -128 instructions which is a good range. Although this is not exactly > how my instructions use the literal for jumps.
Ah. That's a different approach than I took. I wanted simple, general numbers good for computing and addressing. I'd like to know more about your architecture: for example, how to you construct a CALL to code that is beyond 128 instructions away?
> > All you end up doing with smaller instructions is synthesizing the > > larger ones out of several instructions. I think the best approach is > > to use a small instruction set (4 or 5 bits, *maybe* 6 for very > > special-purpose stuff that couldn't be made as memory-mapped > > functional modules) and decode *pairs* or *triplets* together.
> Yes, but it also slow the machine by needing two cycles instead of > one. Of course that is a trade off and must be considered in a > context.
Not quite. What I was saying is that decoding instructions did not affect the critical path, so decoding more bits made no difference. What cost and extra cycle for some instructions was waiting for the adder (when adding numbers or incrementing the PC) to settle.
When decoding instruction pairs, this often means that the effects of the second instruction can be done 'for free' during the second cycle, when the first instruction is storing the output of the adder.
> > > This helps the CPU run all instructions in one clock cycle.
> > Hmm. I've found that this is a fallacy. It's not important. What is > > important is keeping the critical path of instructions to a minimum.
> I'm not sure what the fallacy is. I want to keep instructions to one > clock cycle not just for speed, but for simplicity of design and > analysis.
I used to think that too, but I eventually found out that having a couple bits to encode a state machine to hold the current state of a multi-cycle instruction added little conceptual complexity. Perhaps its how I wrote the code, which left the details to the synthesizer, which made it tractable. Here are some worst-case examples (CALL and RET), written in Verilog with some of the macros included for explanation:
`define SS0 state <= `STATE_0; `define SS1 state <= `STATE_1; `define SS2 state <= `STATE_2; `define SS3 state <= `STATE_3;
{`CALL,`STATE_0}: begin `SS1 end {`CALL,`STATE_1}: begin `PUSH_RS(pc_next) pc <= mem_in; mar <= mem_in; `SS2 end {`CALL,`STATE_2}: begin `SS3 end {`CALL,`STATE_3}: begin `DO_PCFETCH end
{`RET,`STATE_0}: begin `POP_RS pc <= rs_top; mar <= rs_top; `SS1 end {`RET,`STATE_1}: begin `SS2 end {`RET,`STATE_2}: begin `DO_PCFETCH end
These are just case statements defining a decoder for the opcode and state. The empty cycles where only the state is being updated are the waits for the adder outputs. Note that even though CALL takes 4 cycles and RET takes three, the overall average CPI is still around 1.9 (*without* instruction folding!). On a Cyclone II, this design ran at 216MHz with an effective performance of about 150 MIPS. This is a very net performance increase than if I had stuck to a single cycle, without much extra conceptual complexity.
This same method makes decoding pairs of instructions easy. This first example folds DUP into the first cycle of an immediately following CALL, leaving the state at 1, thus beginning to decode CALL (after DUP is shifted out) at its second cycle. The second example does the same for the 'R> PC@' sequence. This trick is commented out in my current CPU design, as I'd have to enumerate all the possible instruction pairings (not that many, really).
{`DUP,`CALL}: begin `PUSH_RS(pc_next) `DO_JUMP `PUSH_DS(ds_top) `SHIFT_ISR end
{`RFROM,`PCFETCH}: begin `PUSH_DS(`RS_BODY_OUT) `POP_RS `DO_PCFETCH end
> The only fly in the single > cycle ointment is that the synchronous memories used in FPGAs can't > return a result in the same clock cycle. So I look ahead to the > address being written to the address stack on any given cycle and use > that to address memory for reads. The memory must perform the read on > every cycle since it does not know if the next instruction will be a > memory read or not. This is not an issue in the architecture, but > uses power which is wasteful in low power designs.
Interesting. So does this mean that you must construct the jump target address on the address stack just before the jump instruction?
> Yes, I have analyzed critical paths a great deal trying to speed up my > design. It is not the data path that limits my speed. It is the > stack addressing. That is one of the places that the instruction > decode must be done to control the logic. The result feeds status > bits for stack errors. I could omit these flags, but it seems to me > to be useful for debugging. As Jeff Fox might say, I can't count! > But the flags can be omitted in a non-debugging version of the CPU or > the flags could be pipelined with a delay of one instruction.
Hmm. This is odd, as incrementing/decrementing the 7-bit counter that I used to address the block memory as a stack cost almost no time. It certainly was much faster than the 32-bit adders for the PC and for the ADD instruction.
<snip>
> But there have to be some control points that must > be decoded ahead of the logic, for example, selecting pop/push in the > stack operation. My stack is an addressed memory block, so I have to > select between the counter counting up/down or holding the current > address. Even when the delay of the decode is "free", the logic is > not. More complex decode uses more LUTs.
Hmm. If I understand this right, you have to have some values ready ahead of others? This sounds unnecessarily complicated. This is a sign that things should be done over multiple cycles.
> But there is more to driving a control point than just the opcode. If > you are running multiple cycles in your instructions, you also have to > decode the state counter and for conditional instructions you need to > decode the controlling signal. I wanted to keep my decode to two LUT > levels, but that was not possible, even with a 4 bit instruction.
Silly question: why are you worrying about such a low-level detail? Are you very constrained on area?
> > >I would be very interested in learning more about your CPU design ideas.
> > Ask away! :)
> I'll need to read more. What I have looked at so far seems to show > your direction is very different from mine. I see you don't even have > interrupts. What is the point of that? It is the rare CPU that > doesn't use interrupts.
I thought interrupts were too messy. I thought it simpler to keep the CPU simple and small by not using interrupts, and then implementing two such CPUs sharing a dual-ported RAM, with one CPU being master and one slaved to watching/dealing with I/O. I simulated such a system and it works (and ends up being similar to the Sh-Boom design).
> What about your unpublished results?
Well, the code above is some of it. It's just not in any finished format. Just some rough presentations, hand-written notes on stack scheduling, some Flight code running on an embarrassingly badly written CPU simulator, and Verilog CPU code which works, but doesn't implement more than a CPU, on-chip RAM, and a handful of output
...
On Jan 21, 8:04 am, Eric <eric.lafor...@gmail.com> wrote:
> On Jan 19, 1:02 pm, rickman <gnu...@gmail.com> wrote:
> > On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote: > > > On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote: > > I'm not trying to be absolutely "minimal" in terms of resources. Like > > I said, I need a balance with an emphasis on minimizing instruction > > memory. This is because I use smaller FPGAs in my designs. For > > example, a current module has a Lattice part with six 9-kbit block > > rams. This is also why I prefer to conserve that ninth bit if > > practical.
> Ah. So if I understand, you want a high instruction density since you > don't have much instruction memory available on-chip? Then yes, I can > see that 9th bit being valuable.
OTOH, if you want to run really big applications (or have large data sets) you can execute from SPI flash. Big and cheap. The speed isn't too bad if library code is kept on-chip. The SPI clock can be 60 to 80 MHz, and some chips have dual and quad modes. Then 8-bit is a convenient instruction size.
On Jan 11, 11:17 am, an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Eric <eric.lafor...@gmail.com> writes: > >Stack scheduling is the stack analog of register allocation. There is > >good existing work on the topic,
> The only work I am aware of is on stack scheduling for basic blocks, > and I am not very happy with what's possible there: If you have a > JVM-like architecture (i.e., single-instruction load/stores from/to > local variables), stack scheduling hardly buys anything unless you > assume that stack manipulation instructions are much cheaper than > load/store-local instructions.
Since the stacks in this case are in hardware, then yes, stack manipulations are much faster (1 cycle instead of 2 for a load or store, plus whatever overhead of getting the address in place).
> > and my approach begins with > >traditional register allocation and then does this: > >- Consider all registers in memory > >- Map their lifetimes, load onto stack on first use > >- Keep copies on the stack if still alive > >- Load/stores become permutations/copies > >- Must be done globally for good reuse
> This sounds to me like you will often get more items on the stack than > can be easily handled by our shallow stack manipulation primitives > (such as OVER). You can keep maybe two long-lived variables on the > stack, sometimes more, sometimes less depending on what you have to do > for the short-lived stuff.
Likely. I had allowed -ROT and PICK (I don't remember why, perhaps I could synthesize them in 2 cycles using instruction folding), which meant handling four long-lived variables was OK since the copies were often destroyed right afterwards (adding them, storing to mem, etc...). That should definitely reduce load/store traffic to locals.
But I definitely concur: doing stack scheduling within basic blocks is not useful. It has to be global to really have an effect.
> >This leave the big problem: addressing mode translation. MIPS-like > >systems can generate efficient indirect and offset addressing modes > >for free, while F21-like stack machines have only indirect, and > >indirect-postincremented addressing modes (@ and @A+ for example) > >using special registers (this is the need for index registers I > >mention in my thesis). Translating between these mode is, as far as I > >know, an open problem.
> Not if you don't demand efficiency. And even if you demand > efficiency, there's a bit of work on that for DSP architectures; it's > called "storage assignment" optimization and it's about arranging > variables in an order that adjacent accesses often access adjacent > variables. The classical paper on that has Stan Liao as first author.
I have to demand efficiency in addressing, otherwise the overhead of calculating addresses dominates small loops!
I'll have to look into 'storage assignment'. Thanks!
> >If the compiler > >could figure out which registers are used for addressing in loops, and > >move them to A or R at the right time
> That's classical register allocation (although the number of registers > is quite meagre).
Not quite. On a MIPS-like system, any register can act as an index or base address, hence no special treatment is required and plain register colouring (for example) works.
Stack machines need special-purpose index registers to efficiently do loop counting and array addressing (register A in Chuck's chips, or the Q Store on the KDF9, or X and Y on the B5000, etc...).
The challenge for doing register allocation on stack machines is to detect the registers that are used in loops, and place them in the A and R registers for fast post-incrementing indirect addressing. This is definitely a place where 'storage assignment' would also help.
> But is it worth the trouble to implement a pretty complex compiler in > order to compile C efficiently to a Forth machine? If your > requirements include executing C code efficiently, adding > architectural features to make that easy with a straight-forward > compiler seems to be a better approach to me. Maybe something like > the hybrid you propose.
The hybrid does seem like the best approach: keeps existing compiler technology and processor architecture, but grants native support for fast threaded code.
However, my little experiments suggest that on an FPGA platform at least, given 'optimal' code for a given algorithm, a second-generation stack machine could have about the same net performance as a similarly complex MIPS-like machine, but for maybe half the area cost.
This is possible since although it takes twice as many cycles on a stack machine to do the same work, it tends to run at twice the clock rate. Hence it would be worth it to carry over existing C code to such a stack machine.
Here's an example I worked out by hand (more unpublished stuff from presentation slides). The C and MIPS assembly are straightforward. You can see that the stack-scheduled translation of the MIPS code spends a lot of time manipulating the array pointers (t2 and t6) to do load/ stores and to increment their address, without using the A register for the consecutive accesses (A[j] and A[j+1]).
On the other hand, the hand-coded stack version uses A to good advantage. On a modern stack machine with instruction folding, this stack code should run with similar wall-time to the MIPS code on a MIPS machine. If stack scheduling could use A and R for addressing and looping, then the compiler might be able to translate the optimized MIPS code to something like the hand-coded version.
Example: Bubble Sort C code
for (i=n-1; i >= 1; --i){ for (j = 1; j <= i; ++j){ if(A[j] > A[j+1]){ temp = A[j]; A[j] = A[j+1]; A[j+1] = temp; } }
> On Jan 19, 1:02 pm, rickman <gnu...@gmail.com> wrote:
> > On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote: > > > On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote: > > I'm not trying to be absolutely "minimal" in terms of resources. Like > > I said, I need a balance with an emphasis on minimizing instruction > > memory. This is because I use smaller FPGAs in my designs. For > > example, a current module has a Lattice part with six 9-kbit block > > rams. This is also why I prefer to conserve that ninth bit if > > practical.
> Ah. So if I understand, you want a high instruction density since you > don't have much instruction memory available on-chip? Then yes, I can > see that 9th bit being valuable.
> What's the application? That will definitely affect how many opcodes > you need.
The initial app was a DMA - I/O controller for a DSP chip. The project wasn't expanded past the first rev, so the design was never fully utilized. The second potential app is to off-load some low speed signal processing from hardware to reduce the overall size of an FPGA design. The signal processing is rather simple and can be done in software far faster than required. This app requires some amount of multiplies in the initial parts of the algorithm. That is the only part of the processing that might want optimization, perhaps a bit serial multiply ala Chuck's architectures. Currently an upgrade I am working on for this design (without the CPU) had to switch to a bit serial multiplier to make it fit.
> > Many people talk about "packing" instructions into a larger word. I > > have resisted that since it requires more logic and adds delay to the > > path.
> I don't know what your area constraints are, but I think the logic > required is minimal. > Also, from my experience, that decoding logic has never been in the > critical path, and so it does not add delay.
Yes, it is minimal, but my real point is that it isn't a great choice for other reasons either, so why waste the resources. Of course, if you program store is off chip and slower than the CPU, this is a valuable speedup technique. I am talking about fully embedded CPU with program store matching the speed of the CPU.
> > Another issue I have with packing instructions is the way that it uses > > memory inefficiently. Often some number of slots must be treated as > > nops and can't be used. That drives down the packing density.
> Very true. IIRC, I observed an average of three 5-bit instructions per > 32-bit words, out of a maximum of 6 (the last two bits were ignored). > However, if you add an instruction stack to the processor, then you > can use the instruction slots located after calls and conditional > jumps, which would normally be unreachable. I estimated that this > would bring the instruction density up to almost 6 (still can't use > slots after a RET), and the additional cost in area would be recouped > in saved RAM for any RAM larger than a couple hundred words IIRC > (again, the exact numbers are in my thesis).
This is my real concern. I believe Chuck's designs use instruction packing a bit differently. He includes all literals in the *same* word as the instruction referencing them. So if the word doesn't have enough room left for the size data you are using, the current word is packed with nops and the instruction starts in the next word. Yet another way to waste instruction memory if it is not plentiful as it is in his designs.
> One option I didn't consider because it would have complicated the > Forth runtime, was to logically extend the last two bits with zeros so > that some instructions could be fit in those two bits, so long as they > end in zeros.
I believe that is how Bernd Paysan's B16 CPU does it. His design is very similar to mine, except for the instructions and opcode formats.
> > I also > > prefer to build literals a piece at a time rather than use a full word > > for each. An 8 bit literal can specify a relative jump of +127 to > > -128 instructions which is a good range. Although this is not exactly > > how my instructions use the literal for jumps.
> Ah. That's a different approach than I took. I wanted simple, general > numbers good for computing and addressing. I'd like to know more about > your architecture: for example, how to you construct a CALL to code > that is beyond 128 instructions away?
The literal instruction has the msb (bit 8 of 8 downto 0) clear. The other 8 bits are loaded onto the return stack sign extended. A flag is set indicating that a literal is loaded. If another literal instruction is executed immediately after, it is left shifted into the return stack top and the flag remains set. ANY other instruction clears the literal flag so that the next literal instruction loads instead of shifts it data. So addresses (or any other data) are built up on the return stack one byte at a time using half of the available opcodes in whatever size opcode you use. To use the data on the data stack requires an additional instruction to move it, R>. The other caveat is to use a nop between successive literal words if no other instructions are needed. This format is extensible to any data word size without modification.
> > > All you end up doing with smaller instructions is synthesizing the > > > larger ones out of several instructions. I think the best approach is > > > to use a small instruction set (4 or 5 bits, *maybe* 6 for very > > > special-purpose stuff that couldn't be made as memory-mapped > > > functional modules) and decode *pairs* or *triplets* together.
> > Yes, but it also slow the machine by needing two cycles instead of > > one. Of course that is a trade off and must be considered in a > > context.
> Not quite. What I was saying is that decoding instructions did not > affect the critical path, so decoding more bits made no difference. > What cost and extra cycle for some instructions was waiting for the > adder (when adding numbers or incrementing the PC) to settle.
I understand that. In my designs I don't see the adder being the critical path. The adder is pretty fast and with nothing else controlling it or depending on it, it is not in the critical path. The critical path in my design seems to always be in the stack pointers. They have to depend on instruction decode and the error flags depend on them. So even though they are only 8 bits long, they form the critical path along with the decode that affects them.
> When decoding instruction pairs, this often means that the effects of > the second instruction can be done 'for free' during the second cycle, > when the first instruction is storing the output of the adder.
I'm not sure of your architecture, so it is not clear to me what is happening. There are a lot of subtle optimizations depending on how your architecture is set up. In my design, the return and data stacks are capable of running in parallel. So it would make sense to run instructions in pairs if the decode doesn't get too complex (which I am not convinced of). I can't quite squeeze both into a single 9 bit instruction. Eleven bits would do the job quite well I think, but doesn't fit my FPGA so well. :^(
> > > > This helps the CPU run all instructions in one clock cycle.
> > > Hmm. I've found that this is a fallacy. It's not important. What is > > > important is keeping the critical path of instructions to a minimum.
> > I'm not sure what the fallacy is. I want to keep instructions to one > > clock cycle not just for speed, but for simplicity of design and > > analysis.
> I used to think that too, but I eventually found out that having a > couple bits to encode a state machine to hold the current state of a > multi-cycle instruction added little conceptual complexity. Perhaps > its how I wrote the code, which left the details to the synthesizer, > which made it tractable. Here are some worst-case examples (CALL and > RET), written in Verilog with some of the macros included for > explanation:
> `define SS0 state <= `STATE_0; > `define SS1 state <= `STATE_1; > `define SS2 state <= `STATE_2; > `define SS3 state <= `STATE_3;
> {`CALL,`STATE_0}: begin > `SS1 > end > {`CALL,`STATE_1}: begin > `PUSH_RS(pc_next) > pc <= mem_in; > mar <= mem_in; > `SS2 > end > {`CALL,`STATE_2}: begin > `SS3 > end > {`CALL,`STATE_3}: begin > `DO_PCFETCH > end
> {`RET,`STATE_0}: begin > `POP_RS > pc <= rs_top; > mar <= rs_top; > `SS1 > end > {`RET,`STATE_1}: begin > `SS2 > end > {`RET,`STATE_2}: begin > `DO_PCFETCH > end
> These are just case statements defining a decoder for the opcode and > state. The empty cycles where only the state is being updated are the > waits for the adder outputs. Note that even though CALL takes 4 cycles > and RET takes three, the overall average CPI is still around 1.9 > (*without* instruction folding!). On a Cyclone II, this design ran at > 216MHz with an effective performance of about 150 MIPS. This is a very > net performance increase than if I had stuck to a single cycle, > without much extra conceptual complexity.
> This same method makes decoding pairs of instructions easy. This first > example folds DUP into the first cycle of an immediately following > CALL, leaving the state at 1, thus beginning to decode CALL (after DUP > is shifted out) at its second cycle. The second example does the same > for the 'R> PC@' sequence. This trick is commented out in my current > CPU design, as I'd have to enumerate all the possible instruction > pairings (not that many, really).
> On Jan 21, 8:04 am, Eric <eric.lafor...@gmail.com> wrote:> On Jan 19, 1:02 pm, rickman <gnu...@gmail.com> wrote:
> > > On Jan 19, 9:31 am, Eric <eric.lafor...@gmail.com> wrote: > > > > On Jan 10, 11:11 am, rickman <gnu...@gmail.com> wrote: > > > I'm not trying to be absolutely "minimal" in terms of resources. Like > > > I said, I need a balance with an emphasis on minimizing instruction > > > memory. This is because I use smaller FPGAs in my designs. For > > > example, a current module has a Lattice part with six 9-kbit block > > > rams. This is also why I prefer to conserve that ninth bit if > > > practical.
> > Ah. So if I understand, you want a high instruction density since you > > don't have much instruction memory available on-chip? Then yes, I can > > see that 9th bit being valuable.
> OTOH, if you want to run really big applications (or have large data > sets) you can execute from SPI flash. Big and cheap. The speed isn't > too bad if library code is kept on-chip. The SPI clock can be 60 to 80 > MHz, and some chips have dual and quad modes. Then 8-bit is a > convenient instruction size.
> -Brad
Right now I am using the 9th bit for a parallel return on many opcodes. The opcodes that don't have the parallel return use the lsbs as data which is flexible. So it is possible to have the same (or similar) CPU compiled for either the 8 bit or 9 bit opcode format.