MISC - Stack Based vs. Register Based

rickman
Mar 29, 2013, 5:00:50 PM

I have been working with stack based MISC designs in FPGAs for some
years. All along I have been comparing my work to the work of others.
These others were the conventional RISC type processors supplied by the
FPGA vendors as well as the many processor designs done by individuals
or groups as open source.

So far my CPUs have always ranked reasonably well in terms of speed, but
more importantly to me, very well in terms of size and code density. My
efforts have shown it is hard to improve on code density by a significant
degree while simultaneously minimizing the resources used by the design.
Careful selection of the instruction set can both improve code density
and minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the other.

The last couple of days I was looking at some code I plan to use and
realized that it could be a lot more efficient if I could find a way to
use more parallelism inside the CPU and use fewer instructions. So I
started looking at defining separate opcodes for the two primary
function units in the design, the data stack and the return stack. Each
has its own ALU. The data stack has a full complement of capabilities
while the return stack can only add, subtract and compare. The return
stack is actually intended to be an "address" processing unit.

While trying to figure out how to maximize the parallel capabilities of
these units, I realized that many operations were just stack
manipulations. Then I read the thread about the relative "cost" of
stack ops vs memory accesses and I realized these were what I needed to
optimize. I needed to find a way to not use an instruction and a clock
cycle for moving data around on the stack.

In the thread on stack ops it was pointed out repeatedly that very often
the stack operands would be optimized to register operands, meaning they
wouldn't need to do the stack ops at all really. So I took a look at a
register based MISC design. Guess what, I don't see the disadvantage!
I have pushed this around for a couple of days and although I haven't
done a detailed design, I think I have looked at it enough to realize
that I can design a register oriented MISC CPU that will run as fast as,
if not faster than, my stack based design and it will use fewer
instructions. I still need to add some features like support for a
stack in memory, in other words, pre-increment/post-decrement (or the
other way around...), but I don't see where this is a bad design. It
may end up using *less* logic as well. My stack design provides access
to the stack pointers which require logic for both the pointers and
muxing them into the data stack for reading.

I guess looking at other people's designs (such as Chuck's) has changed
my perspective over the years so that I am willing and able to do
optimizations in ways I would not have wanted to do in the past. But I
am a bit surprised that there has been so much emphasis on stack
oriented MISC machines when it may well be that register based MISC
designs are also very efficient, at least if you aren't building them to
service a C compiler or trying to match some ideal RISC model.

--

Rick

glen herrmannsfeldt
Mar 29, 2013, 5:43:36 PM

In comp.arch.fpga rickman <gnu...@gmail.com> wrote:
> I have been working with stack based MISC designs in FPGAs for some
> years. All along I have been comparing my work to the work of others.
> These others were the conventional RISC type processors supplied by the
> FPGA vendors as well as the many processor designs done by individuals
> or groups as open source.

You might also look at some VLIW designs. Not that the designs
themselves will be useful, but maybe some of the ideas that were used
would help.

Well, much of the idea of RISC is that code density isn't very
important, and that many of the more complicated instructions made
assembly language programming easier, but compilers didn't use them.

> So far my CPUs have always ranked reasonably well in terms of speed, but
> more importantly to me, very well in terms of size and code density.

Do you mean the size of CPU (lines of verilog) or size of the programs
that run on it?

> My
> efforts have shown it hard to improve on code density by a significant
> degree while simultaneously minimizing the resources used by the design.
> Careful selection of the instruction set can both improve code density
> and minimize logic used if measured together, but there is always a
> tradeoff. One can always be improved at the expense of the other.

Seems to me that much of the design of VAX was to improve code
density when main memory was still a very significant part of the
cost of a machine. The large number of addressing modes allowed for
efficient use of bits. It also greatly complicated making an efficient
pipelined processor. With little overlap and a microprogrammed CPU,
it is easy to go sequentially through the instruction bytes and process
them in order. Most of the complication is in the microcode itself.

But every level of logic adds delay. Using a wide bus to fast memory
is more efficient than a complicated decoder. But sometimes RISC went
too far. In early RISC, there was the idea of one cycle per instruction.
They couldn't do that for multiply, so they added multiply-step, an
instruction that you execute many times for each multiply operation.
(And maybe no divide at all.)

For VLIW, a very wide instruction word allows for specifying many
different operations at the same time. It relies on complicated
compilers to optimally pack the multiple operations into the
instruction stream.

> The last couple of days I was looking at some code I plan to use and
> realized that it could be a lot more efficient if I could find a way to
> use more parallelism inside the CPU and use fewer instructions. So I
> started looking at defining separate opcodes for the two primary
> function units in the design, the data stack and the return stack. Each
> has its own ALU. The data stack has a full complement of capabilities
> while the return stack can only add, subtract and compare. The return
> stack is actually intended to be an "address" processing unit.

This kind of thought is what led to the branch delay slot on many
RISC processors. There is work to be done for branches, especially
for branch to or return from subroutine. Letting the processor know
early allows for more overlap. So, on many early (maybe not now) RISC
machines the instruction after a branch instruction is executed before
the branch is taken. (It might be a NOOP, though.) Again, reduced code
density (if it is a NOOP) in exchange for a faster instruction cycle.

> While trying to figure out how to maximize the parallel capabilities of
> these units, I realized that many operations were just stack
> manipulations. Then I read the thread about the relative "cost" of
> stack ops vs memory accesses and I realized these were what I needed to
> optimize. I needed to find a way to not use an instruction and a clock
> cycle for moving data around on the stack.

Well, consider the x87 stack instructions. Some operations exist in
forms that do and don't pop something off the stack. A small increase
in the number of opcodes used allows for avoiding many POP instructions.

> In the thread on stack ops it was pointed out repeatedly that very often
> the stack operands would be optimized to register operands, meaning they
> wouldn't need to do the stack ops at all really. So I took a look at a
> register based MISC design. Guess what, I don't see the disadvantage!

Well, actually the 8087 was designed to be either a register or stack
machine, as many instructions can index into the stack. If you keep
track of the current stack depth, you can find data in the stack like
in registers. Well, it was supposed to be able to do that.

> I have pushed this around for a couple of days and although I haven't
> done a detailed design, I think I have looked at it enough to realize
> that I can design a register oriented MISC CPU that will run as fast, if
> not faster than my stack based design and it will use fewer
> instructions. I still need to add some features like support for a
> stack in memory, in other words, pre-increment/post-decrement (or the
> other way around...), but I don't see where this is a bad design. It
> may end up using *less* logic as well. My stack design provides access
> to the stack pointers which require logic for both the pointers and
> muxing them into the data stack for reading.

Well, the stack design has the advantage that you can use instruction
bits either for a memory address or for the operation, allowing for much
smaller instructions. But that only works as long as everything is in
the right place on the stack.
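
To put rough numbers on it (purely as an illustration, not anyone's
actual encoding): a zero-operand stack ADD needs only an opcode, maybe
5 bits, while a three-register ADD on a 32-register RISC spends
3 x 5 = 15 bits just naming registers before the opcode is even counted.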

> I guess looking at other peoples designs (such as Chuck's) has changed
> my perspective over the years so that I am willing and able to do
> optimizations in ways I would not have wanted to do in the past. But I
> am a bit surprised that there has been so much emphasis on stack
> oriented MISC machines which it may well be that register based MISC
> designs are also very efficient, at least if you aren't building them to
> service a C compiler or trying to match some ideal RISC model.

Seems to me that there hasn't been much done in stack machines since
the B5500, the first computer I ever used when I was about nine.

-- glen

AKE
Mar 29, 2013, 9:19:27 PM

On Friday, March 29, 2013 9:00:50 PM UTC, rickman wrote:

> I have been working with stack based MISC designs in FPGAs for some
> years. All along I have been comparing my work to the work of others.
> These others were the conventional RISC type processors supplied by the
> FPGA vendors as well as the many processor designs done by individuals
> or groups as open source.
>
> Careful selection of the instruction set can both improve code density
> and minimize logic used if measured together, but there is always a
> tradeoff. One can always be improved at the expense of the other.
>
> While trying to figure out how to maximize the parallel capabilities of
> these units, I realized that many operations were just stack
> manipulations. Then I read the thread about the relative "cost" of
> stack ops vs memory accesses and I realized these were what I needed to
> optimize. I needed to find a way to not use an instruction and a clock
> cycle for moving data around on the stack.
>
> In the thread on stack ops it was pointed out repeatedly that very often
> the stack operands would be optimized to register operands, meaning they
> wouldn't need to do the stack ops at all really. So I took a look at a
> register based MISC design. Guess what, I don't see the disadvantage!
>
> I have pushed this around for a couple of days and although I haven't
> done a detailed design, I think I have looked at it enough to realize
> that I can design a register oriented MISC CPU that will run as fast, if
> not faster than my stack based design and it will use fewer
> instructions.
>
> But I am a bit surprised that there has been so much emphasis on stack
> oriented MISC machines which it may well be that register based MISC
> designs are also very efficient, at least if you aren't building them to
> service a C compiler or trying to match some ideal RISC model.
>

Hi Rick -- I think your thread here is very interesting because it extracts one of the intriguing points emerging from the extended discussion in the previous thread.

Will be interested to see what others think of your investigations and findings!

However, let me see if I'm understanding your points correctly before giving my opinion:

You were heading in the direction of a MISC processor that had opcodes for at least a minimum set of 'ordinary' stack operations a la Forth, e.g. dup, drop, swap, over, rot, etc. That would give a high degree of matching between a stack based language like Forth and your processor, which would do away with the need for a complicated optimising compiler sitting between Forth and your processor; the compiler would just be Forth itself, and the optimiser would be the human being. (An additional advantage, which maybe you don't care so much about, would be the ability to count cycles in code, as each ordinary operation would have a known and deterministic execution time on your processor.)

Then you saw the repeated remarks from the other thread about how good optimising Forth compilers for register machines essentially unroll stack operations by doing a 're-labelling' on registers, which has led to your exploration of that design as an alternative possibility.

So far hopefully correct (albeit perhaps mangled in trying to simplify)?

My take-away from the previous thread was that by building a *register based* MISC you'd still require a good middle layer -- i.e. you'd still need a reasonably smart compiler to transform the human Forth code, unroll the stack operations, and produce the efficient register based machine code.

Whereas if you went with the stack-based MISC design, the compiler layer would be much simpler.

For me, one of the key points emerging out of the previous thread was that it's not just a question of optimising between two elements -- hardware and programming language, but rather there's an 'iron triangle' between hardware / compiler / and programming language.

So the trade-offs are between all three layers: the simplicity/complexity of the processor and its opcodes, the necessity/complexity of the optimising compiler as middle-ware, and the simplicity/expressiveness of the language that the human being programs in.

If you take the ideal programming language for your processor design as Forth, then really you're down to a trade-off: either the machine closely matches the human language, reducing the need for middleware, or the machine diverges from the human language, in which case the middleware becomes more necessary.

The greater the divergence, the more the middleware has to be clever.

I've pulled out some of what I thought were key quotes from the previous thread that are hopefully also relevant in this discussion.


AKE wrote:
"So what would be the situation if one were writing Forth code for a Forth chip? Is it correct that in this case, each of the ordinary operations (rot, swap, dup, over, etc.) would be atomic operations, i.e. take one machine cycle? And would this remove the ambiguity of cost-counting by essentially removing the compiler layer? [So] is it correct to say that IF one were using a 'simple' Forth chip, then one would have similar ability to predict precisely how any Forth implementation of an algorithm would cost to run?"

Brad Eckert wrote:
"That's exactly the case. The language is close enough to the instruction set that a simple compiler does a good job. That means a bug-free compiler and easy to understand object code. In fact, since the chip is so simple (look up the J1 for an example) you can write a simulation in Forth or C to make exact measurements. "

Stephen Pelc wrote:
"In a private email, someone commented to me that stack machine hardware is a substitute for a good code generator."

Andrew Haley wrote:
"That's true: by closing the semantic gap betwen language and machine,
you make the need for a complex layer of software go away."

Stephen Pelc wrote:
"At the cost of writing another one. The choice is to buy an off the
shelf chip and write a complex compiler, or to write a simple compiler
and a CPU."

Bernd Paysan wrote:
"Then you can compare: The cost of an off-the-shelf 8051 is between $10k and
$100k. The cost of integrating it into a chip is 3 months of Bernd's
salery, and, due to the bugs in the $100k industry standard stuff, you need
three mask spins to get it right ($300k), and are one year late to market
(another year for struggling with the 8051, to get the software right). The
cost of writing a Forth CPU is one week of Bernd's salery, and you save 10
cents die size cost per chip you sell, and we did have first-time-right
silicons. Plus you don't need to write a good code generator for the 8051,
which is for sure in the "priceless" category."

Cheers,
AKE

Arlet Ottens
Mar 30, 2013, 3:20:42 AM

On 03/29/2013 10:00 PM, rickman wrote:
> I have been working with stack based MISC designs in FPGAs for some
> years. All along I have been comparing my work to the work of others.
> These others were the conventional RISC type processors supplied by the
> FPGA vendors as well as the many processor designs done by individuals
> or groups as open source.
>
> So far my CPUs have always ranked reasonably well in terms of speed, but
> more importantly to me, very well in terms of size and code density. My
> efforts have shown it hard to improve on code density by a significant
> degree while simultaneously minimizing the resources used by the design.
> Careful selection of the instruction set can both improve code density
> and minimize logic used if measured together, but there is always a
> tradeoff. One can always be improved at the expense of the other.
>

I once made a CPU design for an FPGA that had multiple stacks. There was
a general purpose stack "A", two index stacks "X" and "Y", and a return
stack "R". ALU operations worked between A and any other stack, so they
only required 2 bits in the opcode. There was also a move instruction
that could move data from a source to a destination stack.

Having access to multiple stacks means you spend less time shuffling
data on the stack. There's no more need for swap, over, rot and similar
stack manipulation instructions. The only primitive operations you need
are push and pop.

For instance, I had a load instruction that could load from memory using
the address in the X stack, and push the result on the A stack. The cool
part is that the X stack itself isn't changed by this operation, so the
same address can be used multiple times. So, you could do a

LOAD (X) ; load from (X) and push on A
1 ; push literal on A
ADD ; add top two elements of A
STORE (X) ; pop A, and store in (X)

to increment a location in memory.

And if you wanted to increment X to access the next memory location,
you'd do:

1 ; push literal on A
ADD X ; pop X, pop A, add, and push result on A.
MOVE A, X ; pop A, and push on X

It was an 8 bit architecture with 9 bit instructions (to match the FPGA
block RAM + parity bit). Having 9 bit instructions allows an 8 bit
literal push to be encoded in 1 instruction.
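
To make the point concrete (the exact bit layout below is invented just
to illustrate the idea, not the real encoding), the 9th bit can simply
flag "the low 8 bits are a literal", so an 8 bit constant never costs a
second instruction word:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 9-bit instruction word: if bit 8 is set, the low 8 bits
   are a literal pushed on the A stack; otherwise they select an opcode.
   The layout is invented purely to illustrate the idea. */
#define LIT_FLAG 0x100u

static void decode(uint16_t insn)
{
    if (insn & LIT_FLAG)
        printf("push literal %u on A\n", insn & 0xFFu);
    else
        printf("execute opcode 0x%02X\n", insn & 0xFFu);
}

int main(void)
{
    decode(LIT_FLAG | 42u);   /* one-word literal push of 42 */
    decode(0x07u);            /* some ALU/stack opcode       */
    return 0;
}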

Feel free to e-mail if you want more details.



Rod Pemberton
Mar 30, 2013, 5:54:57 PM

"rickman" <gnu...@gmail.com> wrote in message
news:kj4vae$msi$1...@dont-email.me...
Are those your actual results or did you just reiterate what is on
Wikipedia? Yes, that's a serious question. Read the MISC page:
http://en.wikipedia.org/wiki/Minimal_instruction_set_computer

See ... ?!

Code density is a CISC concept. I don't see how it applies to
your MISC project. Increasing code density for a MISC processor
means implementing more powerful instructions, i.e., those that do
more work, while minimizing bytes in the instruction opcode
encoding. Even if you implement CISC-like instructions, you can't
forgo the MISC instructions you already have in order to add the
CISC-like instructions. So, to do that, you'll need to increase
the size of the instruction set, as well as implement a more
complicated instruction decoder. I.e., that means the processor
will no longer be MISC, but MISC+minimal CISC hybrid, or pure
CISC...

No offense, but you seem to be "reinventing the wheel" in terms of
microprocessor design. You're coming to the same conclusions that
were found in the 1980's, e.g., concluding a register based
machine can perform better than a stack based machine, except
you've applied it to MISC in an FPGA package... How is that a new
conclusion?

Also, please read up on CISC and RISC:
http://en.wikipedia.org/wiki/Reduced_instruction_set_computing
http://en.wikipedia.org/wiki/Complex_instruction_set_computing

You posted to two groups. Which group was the " ... thread about
the relative 'cost' of stack ops vs memory accesses ..." posted
on? comp.lang.forth? comp.arch.fpga? I can look it up, but I'd
rather not.

Also, you cross-posted to comp.arch.fpga. While they'll likely be
familiar with FPGAs, most there are not going to be familiar with
the features of stack-based processors or Forth processors that
you discuss indirectly within your post. They might not be
familiar with ancient CISC concepts such as "code density" either,
or understand why it was important at one point in time. E.g., I
suspect this Forth related stuff from above won't be widely
understood on c.a.f. without clarification:

"peoples[sic] designs (such as Chuck's)"
- various Forth processors by Charles Moore

"the data stack and the return stack."
- interpreted Forth machine model


Rod Pemberton


Arlet Ottens
Mar 31, 2013, 3:11:29 AM

On 03/30/2013 10:54 PM, Rod Pemberton wrote:

>>
>> I guess looking at other peoples designs (such as Chuck's) has
>> changed my perspective over the years so that I am willing and
>> able to do optimizations in ways I would not have wanted to do
>> in the past. But I am a bit surprised that there has been so
>> much emphasis on stack oriented MISC machines which it may well
>> be that register based MISC designs are also very efficient,
>> at least if you aren't building them to service a C compiler or
>> trying to match some ideal RISC model.
>>
>
> Are those your actual results or did you just reiterate what is on
> Wikipedia? Yes, that's a serious question. Read the MISC page:
> http://en.wikipedia.org/wiki/Minimal_instruction_set_computer
>
> See ... ?!

It sounds to me like rickman is questioning the (unsupported) claims on
Wikipedia that stack based machines have an advantage in size and/or
simplicity, not reiterating them.

> Code density is a CISC concept. I don't see how it applies to
> your MISC project. Increasing code density for a MISC processor
> means implementing more powerful instructions, i.e., those that do
> more work, while minimizing bytes in the instruction opcode
> encoding. Even if you implement CISC-like instructions, you can't
> forgo the MISC instructions you already have in order to add the
> CISC-like instructions. So, to do that, you'll need to increase
> the size of the instruction set, as well as implement a more
> complicated instruction decoder. I.e., that means the processor
> will no longer be MISC, but MISC+minimal CISC hybrid, or pure
> CISC...

It is perfectly possible to trade one type of MISC processor for another
one. The choice between stack and register based is an obvious one. If
you switch from stack to register based, there's no need to keep stack
manipulation instructions around.

> Also, you cross-posted to comp.arch.fpga. While they'll likely be
> familiar with FPGAs, most there are not going to be familiar with
> the features of stack-based processors or Forth processors that
> you discuss indirectly within your post. They might not be
> familiar with ancient CISC concepts such as "code density" either,
> or understand why it was important at one point in time. E.g., I
> suspect this Forth related stuff from above won't be widely
> understood on c.a.f. without clarification:

The design of simple and compact processors is of great interest to many
FPGA engineers. Plenty of FPGA designs need some sort of control
processor, and for cost reduction it's important to use minimal
resources. Like rickman said, this involves a careful balance between
implementation complexity, speed, and code density, while also
considering how much work it is to write/maintain the software that's
running on the processor.

Code density is still critically important. Fast memory is small, both
on FPGA as well as general purpose processors.


rickman
Mar 29, 2013, 8:28:45 PM

On 3/29/2013 5:43 PM, glen herrmannsfeldt wrote:
> In comp.arch.fpga rickman<gnu...@gmail.com> wrote:
>> I have been working with stack based MISC designs in FPGAs for some
>> years. All along I have been comparing my work to the work of others.
>> These others were the conventional RISC type processors supplied by the
>> FPGA vendors as well as the many processor designs done by individuals
>> or groups as open source.
>
> You might also look at some VLIW designs. Not that the designs
> themselves will be useful, but maybe some of the ideas that were used
> would help.
>
> Well, much of the idea of RISC is that code density isn't very
> important, and that many of the more complicated instructions made
> assembly language programming easier, but compilers didn't use them.

I am somewhat familiar with VLIW. I am also very familiar with
microcode which is the extreme VLIW. I have coded microcode for an I/O
processor on an attached array processor. That's like saying I coded a
DMA controller in a DSP chip, but before DSP chips were around.


>> So far my CPUs have always ranked reasonably well in terms of speed, but
>> more importantly to me, very well in terms of size and code density.
>
> Do you mean the size of CPU (lines of verilog) or size of the programs
> that run on it?

I don't count lines of HDL code. I can't see that as a useful metric
for much. I am referring to code density of the compiled code for the
target CPU. All of my work is in FPGAs and memory is limited inside the
chip... well, at least in the small ones... lol Some FPGAs now have
literally Mbits of memory on chip. Still, nearly all of my work is
miniature and low power, sometimes *very* low power. So there may only
be 6 block memories in the FPGA for example.


>> My
>> efforts have shown it hard to improve on code density by a significant
>> degree while simultaneously minimizing the resources used by the design.
>> Careful selection of the instruction set can both improve code density
>> and minimize logic used if measured together, but there is always a
>> tradeoff. One can always be improved at the expense of the other.
>
> Seems to me that much of the design of VAX was to improve code
> density when main memory was still a very significant part of the
> cost of a machine. The large number of addressing modes allowed for
> efficient use of bits. It also greatly complicated making an efficient
> pipelined processor. With little overlap and a microprogrammed CPU,
> it is easy to go sequentially through the instruction bytes and process
> them in order. Most of the complication is in the microcode itself.

One of the big differences is that they were not much limited in the
size of the machine itself. They could throw lots of gates at the
problem and not worry too much other than how it impacted speed. I am
again more limited in the small FPGAs I use and I expect I will achieve
higher clock speeds by keeping the machine simple as well. Also, as an
architectural decision, there is no microcode and all instructions are
one clock cycle. This helps with simplification and also makes
interrupt processing very simple.


> But every level of logic adds delay. Using a wide bus to fast memory
> is more efficient that a complicated decoder. But sometimes RISC went
> too far. In early RISC, there was the idea of one cycle per instruction.
> They couldn't do that for multiply, so they added multiply-step, an
> instruction that you execute many times for each multiply operation.
> (And maybe no divide at all.)

I"m not sure what your point is. What part of this is "too far"? This
is exactly the type of design I am doing, but to a greater extent.


> For VLIW, a very wide instruction word allows for specifying many
> different operations at the same time. It relies on complicated
> compilers to optimally pack the multiple operations into the
> instruction stream.

Yes, in theory that is what VLIW is. This is just one step removed from
microcode where the only limitation to how parallel operations can be is
the data/address paths themselves. The primary application of VLIW I
have seen is in the TI 6000 series DSP chips. But in reality this is
not what I consider VLIW. This design uses eight CPUs which are mostly
similar, but not quite identical, with two sets of four CPU units sharing
a register file, IIRC. In reality each of the eight CPUs gets its own
32 bit instruction stream. They all operate in lock step, but you can't
do eight FIR filters. I think that of the four in a set, two are set up
to do full math, etc., and two are able to generate addresses. So this is
really two CPUs with dual MACs and two address generators as it ends up
being used most of the time. But then they make it run at clocks of
over 1 GHz so it is a damn fast DSP and handles most of the cell phone
calls as part of the base station.

Regardless of whether it is a "good" example of VLIW, it was a marketing
success.


>> The last couple of days I was looking at some code I plan to use and
>> realized that it could be a lot more efficient if I could find a way to
>> use more parallelism inside the CPU and use fewer instructions. So I
>> started looking at defining separate opcodes for the two primary
>> function units in the design, the data stack and the return stack. Each
>> has its own ALU. The data stack has a full complement of capabilities
>> while the return stack can only add, subtract and compare. The return
>> stack is actually intended to be an "address" processing unit.
>
> This kind of thought is what lead to the branch delay slot on many
> RISC processors. There is work to be done for branches, especially
> for branch to or return from subroutine. Letting the processor know
> early allows for more overlap. So, on many early (maybe not now) RISC
> machines the instruction after a branch instruction is executed before
> the branch is taken. (It might be a NOOP, though.) Again, reduced code
> density (if it is NOOP) in exchange for a faster instruction cycle.)

That is a result of the heavy pipelining that is being done. I like to
say my design is not pipelined, but someone here finally convinced me
that my design *is* pipelined with the execution in parallel with the
next instruction fetch, but there is never a need to stall or flush
because there are no conflicts.

I want a design to be fast, but not at the expense of complexity. That
is one way I think like Chuck Moore. Keep it simple and that gives you
speed.


>> While trying to figure out how to maximize the parallel capabilities of
>> these units, I realized that many operations were just stack
>> manipulations. Then I read the thread about the relative "cost" of
>> stack ops vs memory accesses and I realized these were what I needed to
>> optimize. I needed to find a way to not use an instruction and a clock
>> cycle for moving data around on the stack.
>
> Well, consider the x87 stack instructions. Some operations exist in
> forms that do and don't pop something off the stack. A small increase
> in the number of opcodes used allows for avoiding many POP instructions.

In Forth speak a POP would be a DROP. That is not often used in Forth
really, or in my apps. I just wrote some code for my stack CPU and I
think there were maybe two DROPs in just over 100 instructions. I am
talking about the DUPs, SWAPs, OVERs and such. They end up being needed
enough that it makes the register design look good... at least at first
blush. I am looking at how to organize a register based instruction set
without expanding the size of the instructions. I'm realizing that is
one issue with registers, you have to specify them. But I don't need to
make the machine totally general like the goal for RISC. That is so
writing compilers is easier. I don't need to consider that.
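
To put a rough number on that cost (just an illustration, not a final
encoding): with 8 registers, a two-operand instruction spends 2 x 3 = 6
bits naming registers, leaving only 2 or 3 opcode bits in a 9 bit word,
where a zero-operand stack instruction keeps the whole word for the
opcode.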


>> In the thread on stack ops it was pointed out repeatedly that very often
>> the stack operands would be optimized to register operands, meaning they
>> wouldn't need to do the stack ops at all really. So I took a look at a
>> register based MISC design. Guess what, I don't see the disadvantage!
>
> Well, actually the 8087 was designed to be either a register or stack
> machine, as many instructions can index into the stack. If you keep
> track of the current stack depth, you can find data in the stack like
> in registers. Well, it was supposed to be able to do that.
>
>> I have pushed this around for a couple of days and although I haven't
>> done a detailed design, I think I have looked at it enough to realize
>> that I can design a register oriented MISC CPU that will run as fast, if
>> not faster than my stack based design and it will use fewer
>> instructions. I still need to add some features like support for a
>> stack in memory, in other words, pre-increment/post-decrement (or the
>> other way around...), but I don't see where this is a bad design. It
>> may end up using *less* logic as well. My stack design provides access
>> to the stack pointers which require logic for both the pointers and
>> muxing them into the data stack for reading.
>
> Well, the stack design has the advantage that you can use instruction
> bits either for a memory address or for the operation, allowing for much
> smaller instructions. But that only works as long as everything is in
> the right place on the stack.

Yes, it is a tradeoff between instruction size and the number of ops
needed to get a job done. I'm looking at trimming the instruction size
down to give a workable subset for register operations.


>> I guess looking at other peoples designs (such as Chuck's) has changed
>> my perspective over the years so that I am willing and able to do
>> optimizations in ways I would not have wanted to do in the past. But I
>> am a bit surprised that there has been so much emphasis on stack
>> oriented MISC machines which it may well be that register based MISC
>> designs are also very efficient, at least if you aren't building them to
>> service a C compiler or trying to match some ideal RISC model.
>
> Seems to me that there hasn't been much done in stack machines since
> the B5500, the first computer I ever used when I was about nine.

I wouldn't say that. There may not be many commercial designs out
there, but there have been a few stack based CPU chips (mostly from the
work of Chuck Moore) and there are a number of stack CPUs internal to
chips. Just ask Bernd Paysan. He has done several I believe. Mine has
only been run in FPGAs.

--

Rick

rickman
Mar 30, 2013, 6:50:02 PM

More than headed in the direction of MISC. I designed a MISC processor
somewhere around 2002 that I used in a design for a DMA controller/bus
interface. The CPU didn't need to do a lot, but it needed to be fast
enough and responsive. The MISC processor did the job well.

One point I want to make about the instruction set without getting into
it totally. In a MISC, the goal is nominally to minimize the hardware.
When implementing a stack CPU it is not realistic to implement all of
the Forth primitives because of the data paths they require. ROT is a
good example. DUP, SWAP, etc involve just TOS and NOS. ROT involves
the top three stack items. This requires more hardware for the muxes
and in this case it even needs an extra register since the stack itself
can be implemented by a block RAM or distributed RAM. In my design only
the TOS is an actual register.

This means Forth primitives which are not instructions need to be built
up. When used in code these primitives can be inefficient depending on
what is used around them. So an optimizer still has room to work.
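
To make that concrete, here is a toy C model of the datapath (names,
widths and details invented; it only shows why ROT costs more than DUP
or SWAP when TOS is the only register and the rest of the stack sits in
RAM):

#include <stdint.h>

/* Toy model of the datapath: only TOS is a register, the rest of the
   stack sits in a small RAM indexed by a stack pointer. Names, widths
   and the missing overflow checks are all just for illustration. */
#define DEPTH 32
static uint16_t tos;               /* the one real register           */
static uint16_t ram[DEPTH];        /* NOS and deeper items            */
static uint8_t  sp = DEPTH / 2;    /* ram[sp] is next-on-stack (NOS)  */

/* DUP and SWAP touch only TOS and NOS: a single RAM access each. */
static void op_dup(void)  { ram[--sp] = tos; }
static void op_swap(void) { uint16_t t = ram[sp]; ram[sp] = tos; tos = t; }

/* ROT ( a b c -- b c a ) also needs the third item: two RAM reads and
   two RAM writes, so with a single-port RAM it costs an extra temporary
   register plus extra mux paths (or extra cycles). */
static void op_rot(void)
{
    uint16_t nos   = ram[sp];
    uint16_t third = ram[sp + 1];
    ram[sp + 1] = nos;
    ram[sp]     = tos;
    tos = third;
}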


> Then, after seeing the repeated remarks from the other thread, about how good optimising Forth compilers for register machines essentially unroll stack operations by doing a 're-labelling' on registers, which has led to your explorations of that design as an alternative possibility.

I'm not sure the connection is quite that linear, but I guess it got me
thinking, what can a MISC register based design look like. I had never
explored that before because the typical register CPU has a 16 bit
opcode to allow lots of flexibility in the instruction functionality.
But that is not useful in my approach to a CPU design because I don't
want complex instructions, even the limited complexity RISC type
instructions. So I started thinking of just how a register instruction
set could be minimized. That is what I've been doing the last couple of
days. I think I have something that is comparable to my stack based
design in flexibility and is a bit more efficient in code density. But
there is a bit more work to do.


> So far hopefully correct (albeit perhaps mangled in trying to simplify)?
>
> My take-away from the previous thread was that by building a *register based* MISC you'd still require a good middle layer -- i.e. you'd still need a reasonably smart compiler to transform the human Forth code, unroll the stack operations, and produce the efficient register based machine code.

I'm a hardware guy, I don't deal with compilers much, either writing or
even using, at least not for these MISC machines. I mostly do something
more like an assembler and let it go at a few macros. You'll have to
ask someone else about middle layers.


> Whereas if you went with the stack-based MISC design, the compiler layer would be much simpler.

Perhaps, but there is still room for an optimizer. If the CPU maps so
well to Forth that you don't need an optimizer, I think you will be
leaving a lot on the table in terms of efficiency at the compute level.
That is, there may be redundancies in the actual data movements. For
example, in my machine a literal is created on the return stack because
most literals are actually addresses and addresses are used on the
return stack. Think of the return stack as a memory address generator.
So *literal* Forth that puts literals on the data stack will require
extra movements. But then I'm not sure literals are used as addresses
in Forth exactly. I guess it depends on the style of Forth used. IMHO
STC is the only way to use a stack machine for Forth.

I'm not sure all this is clear. I don't have a lot of time to write
tonight so my writing may not be well organized.


> For me, one of the key points emerging out of the previous thread was that it's not just a question of optimising between two elements -- hardware and programming language, but rather there's an 'iron triangle' between hardware / compiler / and programming language.

I would agree with that. In fact, this is a good point to mention the
ZPU. This is a stack CPU design with very different goals than most
MISC you will find. They specifically designed an instruction set
architecture which could span a range of implementation complexities
(and therefore speed) while being 100% compatible. They wanted a low end
implementation that would be small enough to fit even small FPGAs and
did. Their design is actually a bit smaller than mine! The kicker
is... they wanted it to run C code and wrote a back end to produce code
for it. I believe someone has even used this CPU for a commercial app.
The support is a mailing list and I haven't seen much traffic lately
so I'm not sure how much it is being used just now. Still this is
rather interesting, a stack CPU to implement C code... who'da thunk it?


> So the trade-offs are between all three layers: the simplicity/complexity of the processor and its opcodes, the necessity/complexity of the optimising compiler as middle-ware, and the simplicity/expressiveness of the language that the human being programs in.

I would agree with that only mentioning that everyone has different
goals for "optimizing".


> If you take the ideal programming language for your processor design as Forth, then really you're down to the trade-off between machine closely matching the human language -- and reducing the need for middleware, or machine diverging from human language -- in which case the middleware becomes more necessary.

I don't agree that this will be optimal in many ways. I guess it makes
the compiler "optimal" since it becomes an assembler, right?


> The greater the divergence, the more the middleware has to be clever.
>
> I've pulled out some of what I thought were key quotes from the previous thread that are hopefully also relevent in this discussion.
>
>
> AKE wrote:
> "So what would be the situation if one were writing Forth code for a Forth chip? Is it correct that in this case, each of the ordinary operations (rot, swap, dup, over, etc.) would be atomic operations, i.e. take one machine cycle? And would this remove the ambiguity of cost-counting by essentially removing the compiler layer? [So] is it correct to say that IF one were using a 'simple' Forth chip, then one would have similar ability to predict precisely how any Forth implementation of an algorithm would cost to run?"

If your compiler is so simple it is an assembler, then that is certainly
easy to do. Otherwise I think there are still ways to optimize Forth
code even on a Forth-ish CPU. So there will be some complexity to the
middle-ware.


> Brad Eckert wrote:
> "That's exactly the case. The language is close enough to the instruction set that a simple compiler does a good job. That means a bug-free compiler and easy to understand object code. In fact, since the chip is so simple (look up the J1 for an example) you can write a simulation in Forth or C to make exact measurements."

Yes, I agree. In my machine you don't even need tools to count the
cycles other than perhaps an adder. It is *always* one clock per
instruction.


> Stephen Pelc wrote:
> "In a private email, someone commented to me that stack machine hardware is a substitute for a good code generator."

Again I think you can still use good compilers to optimize even Forth
code on a MISC CPU.


> Andrew Haley wrote:
> "That's true: by closing the semantic gap betwen language and machine,
> you make the need for a complex layer of software go away."

Certainly minimized.


> Stephen Pelc wrote:
> "At the cost of writing another one. The choice is to buy an off the
> shelf chip and write a complex compiler, or to write a simple compiler
> and a CPU."

:^)


> Bernd Paysan wrote:
> "Then you can compare: The cost of an off-the-shelf 8051 is between $10k and
> $100k. The cost of integrating it into a chip is 3 months of Bernd's
> salery, and, due to the bugs in the $100k industry standard stuff, you need
> three mask spins to get it right ($300k), and are one year late to market
> (another year for struggling with the 8051, to get the software right). The
> cost of writing a Forth CPU is one week of Bernd's salery, and you save 10
> cents die size cost per chip you sell, and we did have first-time-right
> silicons. Plus you don't need to write a good code generator for the 8051,
> which is for sure in the "priceless" category."

Yes, I remember Bernd writing about this adventure in some detail in
another thread I believe. He has done a couple of internal chip CPUs.
My hat is off to him.

--

Rick

rickman
Mar 30, 2013, 7:00:43 PM

On 3/30/2013 3:20 AM, Arlet Ottens wrote:
> On 03/29/2013 10:00 PM, rickman wrote:
>> I have been working with stack based MISC designs in FPGAs for some
>> years. All along I have been comparing my work to the work of others.
>> These others were the conventional RISC type processors supplied by the
>> FPGA vendors as well as the many processor designs done by individuals
>> or groups as open source.
>>
>> So far my CPUs have always ranked reasonably well in terms of speed, but
>> more importantly to me, very well in terms of size and code density. My
>> efforts have shown it hard to improve on code density by a significant
>> degree while simultaneously minimizing the resources used by the design.
>> Careful selection of the instruction set can both improve code density
>> and minimize logic used if measured together, but there is always a
>> tradeoff. One can always be improved at the expense of the other.
>>
>
> I once made a CPU design for an FPGA that had multiple stacks. There was
> a general purpose stack "A", two index stacks "X" and "Y", and a return
> stack "R". ALU operations worked between A and any other stack, so they
> only required 2 bits in the opcode. There was also a move instruction
> that could move data from a source to a destination stack.

Let's see, this would be a hybrid really, between register based and
stack based as it has multiple stacks which must be selected in the
instruction set like registers, just not many.


> Having access to multiple stacks means you spend less time shuffling
> data on the stack. There's no more need for swap, over, rot and similar
> stack manipulation instructions. The only primitive operations you need
> are push and pop.

There is the extra hardware required to implement multiple stacks. Why
have quite so many? I use the return stack for addresses. That works
pretty well. Maybe A, X and R?


> For instance, I had a load instruction that could load from memory using
> the address in the X stack, and push the result on the A stack. The cool
> part is that the X stack itself isn't changed by this operation, so the
> same address can be used multiple time. So, you could do a
>
> LOAD (X) ; load from (X) and push on A
> 1 ; push literal on A
> ADD ; add top two elements of A
> STORE (X) ; pop A, and store in (X)
>
> to increment a location in memory.

But you aren't quite done yet, at least if you care about stack
overflows. Chuck's designs don't care. If he has data on the bottom of
the stack he can just leave it. Otherwise you need to drop the address
on the X stack.

I looked at this when designing my MISC processor. I ended up with two
fetch and two stores. One just does the fetch or store and pops the
address. The other does a post increment and retains the address to be
used in a loop. This one would have to be dropped at the end.
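
Roughly, and only as a sketch of the semantics rather than the real
opcodes, the two fetch flavors behave like this:

#include <stdint.h>

/* Sketch of the two fetch flavours only: invented names, no overflow
   checks, addresses kept on the return stack as in my design. */
static uint16_t mem[65536];               /* full 64K word memory         */
static uint16_t dstack[16], rstack[16];
static uint8_t  dsp, rsp;                 /* next free slot on each stack */

static void fetch_pop(void)               /* fetch that consumes the address */
{
    uint16_t addr = rstack[--rsp];
    dstack[dsp++] = mem[addr];
}

static void fetch_postinc(void)           /* fetch that keeps the address, bumped */
{
    uint16_t addr = rstack[rsp - 1];
    dstack[dsp++] = mem[addr];
    rstack[rsp - 1] = addr + 1;           /* address stays for the loop; drop it at the end */
}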


> And if you wanted to increment X to access the next memory location,
> you'd do:
>
> 1 ; push literal on A
> ADD X ; pop X, pop A, add, and push result on A.
> MOVE A, X ; pop A, and push on X

Add an autoincrement to the index stacks and these three instructions go
away.


> It was an 8 bit architecture with 9 bit instructions (to match the FPGA
> block RAM + parity bit). Having 9 bit instructions allows an 8 bit
> literal push to be encoded in 1 instruction.

I was all about 9 bit instructions and 18 bit address/data bus sizes
until I started working with the iCE40 parts which only have 8 bit
memory. I think the parity bit is a comms industry thing. That one bit
makes a *big* difference in the instruction set capabilities.


> Feel free to e-mail if you want more details.

Thanks.

--

Rick

Arlet Ottens
Mar 31, 2013, 7:20:53 AM

On 03/31/2013 12:00 AM, rickman wrote:

>> I once made a CPU design for an FPGA that had multiple stacks. There was
>> a general purpose stack "A", two index stacks "X" and "Y", and a return
>> stack "R". ALU operations worked between A and any other stack, so they
>> only required 2 bits in the opcode. There was also a move instruction
>> that could move data from a source to a destination stack.
>
> Let's see, this would be a hybrid really, between register based and
> stack based as it has multiple stacks which must be selected in the
> instruction set like registers, just not many.

Exactly. And as a hybrid, it offers some advantages from both kinds of
designs.

>
>
>> Having access to multiple stacks means you spend less time shuffling
>> data on the stack. There's no more need for swap, over, rot and similar
>> stack manipulation instructions. The only primitive operations you need
>> are push and pop.
>
> There is the extra hardware required to implement multiple stacks. Why
> have quite so many? I use the return stack for addresses. That works
> pretty well. Maybe A, X and R?

I had all stacks implemented in the same block RAM, just using different
sections of it. But you are right, in my implementation I had reserved
room for the Y stack, but never really implemented it. Just using the X
was sufficient for the application I needed the CPU for. Of course, for
other applications, having an extra register/stack that you can use as
a memory pointer could be useful, so I left the Y register in the
instruction encoding. For 3 registers you need 2 bits anyway, so it
makes sense to allow for 4.

> But you aren't quite done yet, at least if you care about stack
> overflows. Chuck's designs don't care. If he has data on the bottom of
> the stack he can just leave it. Otherwise you need to drop the address
> on the X stack.

Correct, if you no longer need the address, you need to drop it from the
stack. On the other hand, if you let it drop automatically, and you need
it twice, you would have to dup it. Intuitively, I would say that in
inner loops it would be more common that you'd want to reuse an address
(possibly with offset or autoinc/dec).

> I was all about 9 bit instructions and 18 bit address/data bus sizes
> until I started working with the iCE40 parts which only have 8 bit
> memory. I think the parity bit is a comms industry thing. That one bit
> makes a *big* difference in the instruction set capabilities.

Agreed. As soon as you use the parity bits, you're tied to a certain
architecture. On the other hand, if you've already chosen a certain
architecture, and the parity bits are available for free, it can be very
advantageous to use them.



glen herrmannsfeldt
Mar 31, 2013, 2:34:57 PM

In comp.arch.fpga rickman <gnu...@gmail.com> wrote:

(snip)
>> Well, much of the idea of RISC is that code density isn't very
>> important, and that many of the more complicated instructions made
>> assembly language programming easier, but compilers didn't use them.

> I am somewhat familiar with VLIW. I am also very familiar with
> microcode which is the extreme VLIW. I have coded microcode for an I/O
> processor on an attached array processor. That's like saying I coded a
> DMA controller in a DSP chip, but before DSP chips were around.

Well, yes, VLIW is just about like compiling from source directly to
microcode. The currently in production VLIW processor is Itanium.
I believe 128 bit wide instructions, which specify many different
operations that can happen at the same time. 128 general and 128
floating point registers, 128 bit wide data bus, but it uses a lot
of power. Something like 100W, with Icc of 100A and Vcc about 1V.

I have an actual RX2600 dual Itanium box, but don't run it very
often, mostly because of the power used.

(snip)

>> But every level of logic adds delay. Using a wide bus to fast memory
>> is more efficient that a complicated decoder. But sometimes RISC went
>> too far. In early RISC, there was the idea of one cycle per instruction.
>> They couldn't do that for multiply, so they added multiply-step, an
>> instruction that you execute many times for each multiply operation.
>> (And maybe no divide at all.)

> I"m not sure what your point is. What part of this is "too far"? This
> is exactly the type of design I am doing, but to a greater extent.

The early SPARC didn't have a multiply instruction. (Since it couldn't
be done in one cycle.) Instead, they did multiply, at least the Sun
machines did, through software emulation, with a software trap.

Early on, when I started using a Sun4/110, I generated a whole set of fonts
for TeX from Metafont source, which does a lot of multiply. Unix (SunOS)
keeps track of user time (what your program does) and system time (what
the OS does while executing your program). Multiply counted as system
time, and could be a large fraction of the total time.

>> For VLIW, a very wide instruction word allows for specifying many
>> different operations at the same time. It relies on complicated
>> compilers to optimally pack the multiple operations into the
>> instruction stream.

> Yes, in theory that is what VLIW is. This is just one step removed from
> microcode where the only limitation to how parallel operations can be is
> the data/address paths themselves. The primary application of VLIW I
> have seen is in the TI 6000 series DSP chips. But in reality this is
> not what I consider VLIW. This design uses eight CPUs which are mostly
> similar, but not quite identical with two sets of four CPU units sharing
> a register file, IIRC. In reality each of the eight CPUs gets its own
> 32 bit instruction stream. They all operate in lock step, but you can't
> do eight FIR filters. I think of the four in a set, two are set up to
> do full math, etc and two are able to generate addresses. So this is
> really two CPUs with dual MACs and two address generators as it ends up
> being used most of the time. But then they make it run at clocks of
> over 1 GHz so it is a damn fast DSP and handles most of the cell phone
> calls as part of the base station.

For Itanium, the different units do different things. There are
instruction formats that divide up the bits in different ways to make
optimal use of the bits. I used to have the manual nearby, but I don't
see it right now.

(snip)

> That is a result of the heavy pipelining that is being done. I like to
> say my design is not pipelined, but someone here finally convinced me
> that my design *is* pipelined with the execution in parallel with the
> next instruction fetch, but there is never a need to stall or flush
> because there are no conflicts.

Yes, that counts, but it gets much more interesting with pipelines
like the Cray-1 that, after some cycles of latency, generate one result
every clock cycle.

> I want a design to be fast, but not at the expense of complexity. That
> is one way I think like Chuck Moore. Keep it simple and that gives you
> speed.

(snip)

> In Forth speak a POP would be a DROP. That is not often used in Forth
> really, or in my apps. I just wrote some code for my stack CPU and I
> think there were maybe two DROPs in just over 100 instructions. I am
> talking about the DUPs, SWAPs, OVERs and such. The end up being needed
> enough that it makes the register design look good... at least at first
> blush. I am looking at how to organize a register based instruction set
> without expanding the size of the instructions. I'm realizing that is
> one issue with registers, you have to specify them. But I don't need to
> make the machine totally general like the goal for RISC. That is so
> writing compilers is easier. I don't need to consider that.

For x87, they avoid the DUP, SWAP, and such by instructions being able
to specify any register in the stack. You can, for example, add any
stack register (there are eight) to the top of stack. I haven't thought
about it for a while, but I believe either pushing the result, or
replacing the previous top of the stack.

(snip)

>> Well, the stack design has the advantage that you can use instruction
>> bits either for a memory address or for the operation, allowing for much
>> smaller instructions. But that only works as long as everything is in
>> the right place on the stack.

> Yes, it is a tradeoff between instruction size and the number of ops
> needed to get a job done. I'm looking at trimming the instruction size
> down to give a workable subset for register operations.

Seems to me that one possibility is to have a really functional stack
operation instruction, such that, with the given number of bits, it
allows for the most operations. Some combination of DUP, SWAP, and POP
all at once. Though that isn't easy for the stack itself.

-- glen

Jon Elson
Apr 1, 2013, 4:11:18 PM

glen herrmannsfeldt wrote:


> Seems to me that much of the design of VAX was to improve code
> density when main memory was still a very significant part of the
> cost of a machine.
The original VAX series (780, then 730 and 750) and the uVAX I and
uVAX II had no cache, so reducing rate of memory fetches was also
important. The first cached machine I used was the uVAX III (KA650)
and a good part of the performance increase was due to the cache.

Jon

Jecel
Apr 1, 2013, 4:18:02 PM

On Saturday, March 30, 2013 7:50:02 PM UTC-3, rickman wrote:
> [...] Still this is
> rather interesting, a stack CPU to implement C code... who'da thunk it?

The academic CRISP architecture, which became the short lived commercial AT&T Hobbit processor (used in early Newton and BeBox prototypes), was supposed to be the ultimate C processor and it was a stack machine.

Eric LaForest designed a very neat hybrid architecture which was very MIPS-like but with 8 visible registers instead of 32. R7 was also the return stack, so reading from it implied R> and writing to it meant >R. R0 and R1 were the data stack.

-- Jecel

Rod Pemberton

Apr 1, 2013, 8:10:27 PM
"Arlet Ottens" <usen...@c-scape.nl> wrote in message
news:5157e1a1$0$6924$e4fe...@news2.news.xs4all.nl...
> On 03/30/2013 10:54 PM, Rod Pemberton wrote:

> >>
> >> I guess looking at other peoples designs (such as Chuck's)
> >> has changed my perspective over the years so that I am
> >> willing and able to do optimizations in ways I would not have
> >> wanted to do in the past. But I am a bit surprised that there
> >> has been so much emphasis on stack oriented MISC machines
> >> which it may well be that register based MISC designs are
> >> also very efficient, at least if you aren't building them to
> >> service a C compiler or trying to match some ideal RISC
> >> model.
> >>
> >
> > Are those your actual results or did you just reiterate what
> > is on Wikipedia? Yes, that's a serious question. Read the
> > MISC page: [link]
> >
> > See ... ?!
>
> It sounds to me rickman is questioning the (unsupported) claims
> on wikipedia that stack based machines have an advantage in size
> and/or simplicity, not reiterating them.
>

Well, you snipped a lot of context. I wouldn't have reformatted
all of it either.

Anyway, what I took from rickman's statements was that he had only
managed to confirm what is already known about MISC according to
Wikipedia. I.e., what was/is the point?

> > Code density is a CISC concept. I don't see how it applies to
> > your MISC project. Increasing code density for a MISC
> > processor means implementing more powerful instructions,
> > i.e., those that do more work, while minimizing bytes in the
> > instruction opcode encoding. Even if you implement
> > CISC-like instructions, you can't forgo the MISC instructions
> > you already have in order to add the CISC-like instructions.
> > So, to do that, you'll need to increase the size of the
> > instruction set, as well as implement a more complicated
> > instruction decoder. I.e., that means the processor will no
> > longer be MISC, but MISC+minimal CISC hybrid, or pure
> > CISC...
>
> It is perfectly possible to trade one type of MISC processor for
> another one. The choice between stack and register based is an
> obvious one. If you switch from stack to register based, there's
> no need to keep stack manipulation instructions around.
>

Yes, true. But, what does eliminating MISC stack instructions and
replacing them with MISC register instructions have to do with the
CISC concept of code density or with CISC instructions? ...

> > Also, you cross-posted to comp.arch.fpga. While they'll
> > likely be familiar with FPGAs, most there are not going
> > to be familiar with the features of stack-based processors
> > or Forth processors that you discuss indirectly within your
> > post. They might not be familiar with ancient CISC
> > concepts such as "code density" either, or understand why
> > it was important at one point in time. E.g., I suspect this
> > Forth related stuff from above won't be widely
> > understood on c.a.f. without clarification:
>
> The design of simple and compact processors is of great interest
> to many FPGA engineers. Plenty of FPGA designs need some
> sort of control processor, and for cost reduction it's important
> to use minimal resources. Like rickman said, this involves a
> careful balance between implementation complexity, speed,
> and code density, while also considering how much work it
> is to write/maintain the software that's running on the
> processor.
>
> Code density is still critically important. Fast memory is
> small, both on FPGA as well as general purpose processors.
>

So, shouldn't he dump the entire MISC instruction set he has, and
implement a CISC instruction set instead? That's the only way
he's going to get the "critically important" code density, which
I'll take it you rank well above MISC as being important. Of
course, a CISC instruction set requires a more complicated
instruction decoder ... So, it seems that either way he proceeds,
he is being constrained by the "minimal resources" of his FPGA.
That was what he stated:

"My efforts have shown it hard to improve on code density by a
significant degree while simultaneously minimizing the resources
used by the design."

I.e., if the FPGA he is attempting to use is insufficient to do
what he wants or needs, then it's insufficient. Or, he needs some
new techniques. He didn't explicitly ask for any though ...

Glen mentioned the numerous address modes of the VAX. The 6502
also had alot of address modes and had instructions which used
zero-page as a set of fast registers. I would think that early,
compact designs like the 6502 and Z80 could be useful to rickman.
They had low transistor counts.

Z80 8500 transistors
6502 9000 transistors
8088 29000 transistors (for comparison...)


Rod Pemberton


glen herrmannsfeldt

Apr 1, 2013, 8:54:06 PM
In comp.arch.fpga Rod Pemberton <do_no...@notemailnotq.cpm> wrote:

(snip)

> Well, you snipped alot of context. I wouldn't have reformatted
> all of it either.

> Anyway, what I took from rickman's statements was that he had only
> managed to confirm what is already known about MISC according to
> Wikipedia. I.e., what was/is the point?

(snip, someone wrote)
>> It is perfectly possible to trade one type of MISC processor for
>> another one. The choice between stack and register based is an
>> obvious one. If you switch from stack to register based, there's
>> no need to keep stack manipulation instructions around.

> Yes, true. But, what does eliminating MISC stack instructions and
> replacing them with MISC register instructions have to do with the
> CISC concept of code density or with CISC instructions? ...

Seems to me that one could still Huffman code the opcode, even
within the MISC concept. That is, use fewer bits for more common
operations, or where it otherwise simplifies the result.

As someone noted, you can have an N-1 bit load immediate instruction
where N is the instruction size.
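
A minimal sketch of such a prefix coding (Python, field widths
invented for the example):

    # Illustrative sketch only: a 9-bit word where shorter prefixes go
    # to the more common operations, and '1' + 8 bits is a literal.
    def decode(word9):
        if word9 & 0x100:                    # 1xxxxxxxx -> 8-bit literal
            return ("LIT", word9 & 0xFF)
        if word9 & 0x080:                    # 01xxxxxxx -> 7-bit jump offset
            return ("JMP", word9 & 0x7F)
        return ("ALU", word9 & 0x7F)         # 00xxxxxxx -> everything else

    print(decode(0b100101010))               # ('LIT', 42)
    print(decode(0b010000011))               # ('JMP', 3)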

In one of Knuth's TAOCP books he describes a two instruction computer.

Seems like if one is interested in MISC, someone should build that one.

Also, maybe a MIX and MMIX machine, maybe the decimal version.

For those who don't read TAOCP, MIX is defined independent of the
underlying base. Programs in MIXAL are supposed to assemble and
run correctly on hosts that use any base within the range specified.

Instruction bytes have, if I remember, between 64 and 100 possible
values, such that six bits or two decimal digits are possible
representations.

I believe that allows for bases 2, 3, 4, 8, 9, 10, and 16.

-- glen

Arlet Ottens

Apr 2, 2013, 2:25:15 AM
On 04/02/2013 02:10 AM, Rod Pemberton wrote:

> Yes, true. But, what does eliminating MISC stack instructions and
> replacing them with MISC register instructions have to do with the
> CISC concept of code density or with CISC instructions? ...

I don't think 'code density' is a CISC concept at all. Code density
applies to any kind of instruction encoding.

Any kind of MISC architecture will have a certain code density (for a
given application), and require a certain amount of FPGA resources. And
if you store the code in local FPGA memory, and you know your
application, you can even convert it all to FPGA resources.

Obviously, one MISC design will have better resource use than another,
and one of the questions is whether we can make any kind of general
statement about stack vs register implementation.

> So, shouldn't he dump the entire MISC instruction set he has, and
> implement a CISC instruction set instead? That's the only way
> he's going to get the "critically important" code density, which
> I'll take it you rank well above MISC as being important. Of
> course, a CISC instruction set requires a more complicated
> instruction decoder ... So, it seems that either way he proceeds,
> he is being contrained by the "minimal resources" of his FPGA.

Exactly. If you want to get FPGA resources low, a MISC design seems most
appropriate. But within the MISC concept, there are still plenty of
design decisions left.

Really, the biggest problem with FPGA CPU design is that there are too
many design decisions, and each decision influences everything else.

> Glen mentioned the numerous address modes of the VAX. The 6502
> also had alot of address modes and had instructions which used
> zero-page as a set of fast registers. I would think that early,
> compact designs like the 6502 and Z80 could be useful to rickman.
> They had low transistor counts.
>
> Z80 8500 transistors
> 6502 9000 transistors
> 8088 29000 transistors (for comparison...)

Low transistor counts do not necessarily translate to low FPGA
resources. Early CPUs used dynamic storage, dual clock latches, pass
logic and tri-state buses to create really small designs that don't
necessarily map well to FPGAs. On the other hand, FPGAs have specific
features (depending on the brand) that can be exploited to create really
tight designs.

Also, these processors are slow, using multi-cycle instructions and 8-bit
operations. That may not be acceptable. And even if low performance
is acceptable, there are numerous other ways where you can trade speed
for code density, so you'd have to consider these too. For instance, I
can replace pipelining with multi-cycle execution, or use microcode.


rickman

Apr 2, 2013, 9:54:40 AM
On 3/31/2013 2:34 PM, glen herrmannsfeldt wrote:
> In comp.arch.fpga rickman<gnu...@gmail.com> wrote:

(snip)
>
> I have an actual RX2600 dual Itanium box, but don't run it very
> often, mostly because of the power used.

A 100 watt lightbulb uses about $10 a month if left on 24/7. So I
wouldn't worry too much about the cost of running your Itanium. If you
turn it off when you aren't using it I can't imagine it would really
cost anything noticeable to run.


> For Itanium, the different units do different things. There are
> instruction formats that divide up the bits in different ways to make
> optimal use of the bits. I used to have the manual nearby, but I don't
> see it right now.

Yes, the array processor I worked on was coded from scratch, very
laboriously. The Itanium is trying to run existing code as fast as
possible. So they have a number of units to do similar things, but also
different types, all working in parallel as much as possible. Also the
parallelism in the array processor was all controlled by the programmer.
In regular x86 processors the parallelism is controlled by the chip
itself. I'm amazed sometimes at just how much they can get the chip to
do, no wonder there are 100's of millions of transistors on the thing.
I assume parallelism in the Itanium is back to the compiler smarts to
control since it needs to be coded into the VLIW instructions.


> For x87, they avoid the DUP, SWAP, and such by instructions being able
> to specify any register in the stack. You can, for example, add any
> stack register (there are eight) to the top of stack. I haven't thought
> about it for a while, but I believe either pushing the result, or
> replacing the previous top of the stack.

That is something I have not yet looked at, a hybrid approach with a
stack, but also with stack top relative addressing. It is important to
keep the hardware simple, but that might not be too hard to implement.
Arlet talked about his hybrid processor design with multiple stacks. I
would need to give this a bit of thought as the question becomes, what
possible advantage would a hybrid processor have over the other two?
Actually, the image in my mind is a bit like the C language stack frame
model. You can work with addressing relative to the top of stack and
when a sub is called it layers its variables on top of the existing
stack. That would require a bit of entry and exit code, so there will
be a tradeoff between simple register addressing and simple subroutine
entry.


> Seems to me that one possibility is to have a really functional stack
> operation instruction, such that, with the give number of bits, it
> allows for the most opertions. Some combination of DUP, SWAP, and POP
> all at once. Though that isn't easy for the stack itself.

Like many tradeoffs, this can complicate the hardware. If you want to
be able to combine an ADD with a SWAP or OVER, you need to be able to
access more than two operands on the stack at once. In my designs that
means pulling another register out of the proper stack. Rather than one
register and a block of memory, each stack would need two registers and
the block of memory along with the input multiplexers. This would need
to be examined in context of common code to see how much it would help
vs the expense of added hardware.

I'm still not complete in my analysis of a register based MISC design,
but at the moment I think the register approach gives better instruction
size efficiency/faster execution as well as simpler hardware design.

--

Rick

rickman

Apr 2, 2013, 10:50:22 AM
That's interesting. I'm not sure I have a clear idea of how to make use
of such an architecture in a program.

My current register based approach has two "special" registers which can
do some limited addressing of memory. R7 can do indirect, indirect
post-increment or indirect pre-decrement on either store or load. R6
can only do indirect pre-decrement on store and indirect post-increment
on load to implement a stack. R6 will be used as the subroutine linkage
stack pointer. R7 is for general memory access. It is starting to
sound a little like one of the old DEC machines. I should review the
docs on the PDP-11 and also the TI MSP430 which are both similar. I
can't add all the complexity they are using, but they might give me some
ideas.
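
A minimal sketch of the R6 behaviour described above (Python, sizes
and names invented): pre-decrement on store and post-increment on
load give a descending stack.

    # Illustrative sketch only: R6 as a full descending stack pointer.
    mem = [0] * 64
    reg = [0] * 8                            # reg[6] = linkage stack pointer

    def push_r6(value):
        reg[6] -= 1                          # indirect pre-decrement store
        mem[reg[6]] = value

    def pop_r6():
        value = mem[reg[6]]                  # indirect post-increment load
        reg[6] += 1
        return value

    reg[6] = len(mem)                        # empty stack at top of RAM
    push_r6(111); push_r6(222)
    assert pop_r6() == 222 and pop_r6() == 111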

I am using some of my stack machine ideas in the register design. Like
a literal instruction is just one bit with the other 8 bits as literal
data. Jump and Call instructions have a four bit literal field included
so that they can be used without the literal extension for short jumps
saving a full opcode. But this comes at a cost. The register specs use
a lot of opcode space. So, in an opcode size that has lots of spare
opcodes in the stack design because of the implied operands, now is
squeezed for space and must be severely limited compared to many
register designs with larger opcodes.
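
To make the cost concrete, a small sketch (Python, encodings invented)
of a jump that fits the 4-bit field versus one that needs the literal
extension:

    # Illustrative sketch only: instruction count for short vs. long jumps.
    def assemble_jump(offset):
        if -8 <= offset <= 7:                # fits the 4-bit field
            return [("JMP", offset & 0xF)]   # one 9-bit word
        return [("LIT", offset & 0xFF),      # literal extension word
                ("JMP", 0)]                  # plus the jump itself

    print(assemble_jump(5))                  # short form: 1 word
    print(assemble_jump(100))                # long form: 2 words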

--

Rick

rickman

Apr 2, 2013, 12:04:24 PM
You clearly don't understand my post or the Wiki article or possibly
both. I suggest you reread both and then ask questions if you still
don't understand.


> Code density is a CISC concept. I don't see how it applies to
> your MISC project.

Yes, I get that you don't understand. Do you have a specific question?


> Increasing code density for a MISC processor
> means implementing more powerful instructions, i.e., those that do
> more work, while minimizing bytes in the instruction opcode
> encoding.

Yes, that part you seem to understand.


> Even if you implement CISC-like instructions, you can't
> forgo the MISC instructions you already have in order to add the
> CISC-like instructions.

Really? I can't drop the entire instruction set and start over? Who
says so? Am I breaking a law?


> So, to do that, you'll need to increase
> the size of the instruction set, as well as implement a more
> complicated instruction decoder.

Define "increase the size of the instruction set". I am using a 9 bit
opcode for my stack design and am using a similar 9 bit opcode for the
register design. In what way is the register design using a larger
instruction set?

That was exactly the blinder I was wearing until now. I had read a lot
about register CPU instruction sets where the intent was not in line
with MISC goals. MicroBlaze is a good example. I think it uses well in
excess of 1000 LUTs, maybe multiple 1000's. I need something that is
much smaller and it hadn't occurred to me that perhaps the common goals
of register designs (lots of registers, orthogonality, address mode
flexibility, etc) could be limited or even tossed out the window. I
don't need a machine that is easy for a C compiler to produce code for.
My goals are for minimal hardware without losing any more performance
than is essential. In particular I work in FPGAs, so the design needs
to work well in that environment.


> I.e., that means the processor
> will no longer be MISC, but MISC+minimal CISC hybrid, or pure
> CISC...

Nonsense. "Minimal Instruction Set Computer (MISC) is a processor
architecture with a very small number of basic operations and
corresponding opcodes." from Wikipedia. BTW, I don't think the term
MISC is widely used and is not well defined. This is the only web page
I found that even tries to define it.

Actually, if you consider only the opcodes and not the operand
combinations, I think the register design may have fewer instructions
than does the stack design. But the register design still is in work so
I'm not done counting yet.

There are some interesting corners to be explored. For example a MOV
rx,rx is essentially a NOP. There are eight of these. So instead, why
not make them useful by clearing the register? So MOV rx,rx is a clear
to be given the name CLR rx... unless the rx is r7 in which case it is
indeed a MOV r7,r7 which is now a NOP to be coded as such. The CLR r7
is not needed because LIT 0 already does the same job. Even better the
opcode for a NOP is 0x1FF or octal 777. That's very easy to remember
and recognize. It feels good to find convenient features like this and
makes me like the register MISC design.
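
A small sketch of that decode corner (Python, with an invented
3-bit/3-bit operand layout):

    # Illustrative sketch only: reuse of MOV rx,rx encodings as CLR,
    # with MOV r7,r7 kept as the NOP.
    def decode_mov(operands6):
        dst = (operands6 >> 3) & 7
        src = operands6 & 7
        if src != dst:
            return ("MOV", dst, src)
        if dst == 7:
            return ("NOP",)
        return ("CLR", dst)

    print(decode_mov(0o25))                  # ('MOV', 2, 5)
    print(decode_mov(0o33))                  # ('CLR', 3)
    print(decode_mov(0o77))                  # ('NOP',)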


> No offense, but you seem to be "reinventing the wheel" in terms of
> microprocessor design. You're coming to the same conclusions that
> were found in the 1980's, e.g., concluding a register based
> machine can perform better than a stack based machine, except
> you've applied it to MISC in an FPGA package... How is that a new
> conclusion?

I don't see how you can say that. I don't know of other MISC designs
that are very good. I think the picoBlaze is one, but I don't think it
was designed to be very MISC really. It was designed to be small,
period. It has a lot of fixed features and so can't be altered without
tossing old code. Some of the MicroChip devices might be MISC. I have
not worked with them. I might want to take a look actually. There may
be useful ideas.

But "reinventing the wheel"? I don't think so.

--

Rick

glen herrmannsfeldt

Apr 2, 2013, 1:20:44 PM
In comp.arch.fpga rickman <gnu...@gmail.com> wrote:
> On 3/31/2013 2:34 PM, glen herrmannsfeldt wrote:
>> In comp.arch.fpga rickman<gnu...@gmail.com> wrote:

> (snip)

>> I have an actual RX2600 dual Itanium box, but don't run it very
>> often, mostly because of the power used.

> A 100 watt lightbulb uses about $10 a month if left on 24/7. So I
> wouldn't worry too much about the cost of running your Itanium. If you
> turn it off when you aren't using it I can't imagine it would really
> cost anything noticeable to run.

It is a dual processor box, plus all the rest of the systems
in the box. Yes, I was considering running it all the time, but
it is too expensive for that.

>> For Itanium, the different units do different things. There are
>> instruction formats that divide up the bits in different ways to make
>> optimal use of the bits. I used to have the manual nearby, but I don't
>> see it right now.

> Yes, the array processor I worked on was coded from scratch, very
> laboriously. The Itanium is trying to run existing code as fast as
> possible. So they have a number of units to do similar things, but also
> different types, all working in parallel as much as possible. Also the
> parallelism in the array processor was all controlled by the programmer.
> In regular x86 processors the parallelism is controlled by the chip
> itself. I'm amazed sometimes at just how much they can get the chip to
> do, no wonder there are 100's of millions of transistors on the thing.
> I assume parallelism in the Itanium is back to the compiler smarts to
> control since it needs to be coded into the VLIW instructions.

Seems to me that the big problem with the original Itanium was the
need to also run x86 code. That delayed the release for some time, and
in that time other processors had advanced. I believe that later
versions run x86 code in software emulation, maybe with some hardware
assist.

-- glen

rickman

Apr 2, 2013, 1:26:36 PM
You really need to reread the wiki description of MISC. There seems to
be a disconnect.


>>> Code density is a CISC concept. I don't see how it applies to
>>> your MISC project. Increasing code density for a MISC
>>> processor means implementing more powerful instructions,
>>> i.e., those that do more work, while minimizing bytes in the
>>> instruction opcode encoding. Even if you implement
>>> CISC-like instructions, you can't forgo the MISC instructions
>>> you already have in order to add the CISC-like instructions.
>>> So, to do that, you'll need to increase the size of the
>>> instruction set, as well as implement a more complicated
>>> instruction decoder. I.e., that means the processor will no
>>> longer be MISC, but MISC+minimal CISC hybrid, or pure
>>> CISC...
>>
>> It is perfectly possible to trade one type of MISC processor for
>> another one. The choice between stack and register based is an
>> obvious one. If you switch from stack to register based, there's
>> no need to keep stack manipulation instructions around.
>>
>
> Yes, true. But, what does eliminating MISC stack instructions and
> replacing them with MISC register instructions have to do with the
> CISC concept of code density or with CISC instructions? ...

Weren't you the person who brought CISC into this discussion? Why are
you asking this question about CISC?
I have "dumped" the stack related portion of the instruction set and
replaced it with register references. Why would I do otherwise?


> I.e., if the FPGA he is attempting to use is insufficient to do
> what he wants or needs, then it's insufficient. Or, he needs some
> new techniques. He didn't explicitly ask for any though ...

No one said anything about "insufficient". I am looking for an optimal
design for a small CPU that can be used efficiently in an FPGA.


> Glen mentioned the numerous address modes of the VAX. The 6502
> also had alot of address modes and had instructions which used
> zero-page as a set of fast registers. I would think that early,
> compact designs like the 6502 and Z80 could be useful to rickman.
> They had low transistor counts.
>
> Z80 8500 transistors
> 6502 9000 transistors
> 8088 29000 transistors (for comparison...)

I'm not counting transistors. That is one of the fallacies of comparing
chip designs to FPGA designs. When you have the flexibility of using
transistors you can do things in ways that are efficient that are not
efficient in the LUTs of an FPGA.

The two big constraints are resources used in the FPGA and opcode size.
If an instruction would require addition of resources to the design,
it needs to justify that addition. An instruction also needs to justify
the portion of the opcode space it requires. Operations that specify
more than one register become very expensive in terms of opcode bits.
Operations that require additional data paths become expensive in terms
of resources. Doing as much as possible with as little as possible is
what I'm looking for.

BTW, the ZPU is a pretty good example of just how small a design can be.
But it is very slow. That's the other side of the equation. Improve
speed as much as possible while not using more resources than possible.
The optimal point depends on the coefficients used... in other words,
my judgement. This is not entirely objective.

--

Rick

rickman

Apr 2, 2013, 1:31:39 PM
On 4/1/2013 8:54 PM, glen herrmannsfeldt wrote:
>
> Seems to me that one could still Huffman code the opcode, even
> within the MISC concept. That is, use fewer bits for more common
> operations, or where it otherwise simplifies the result.
>
> As someone noted, you can have an N-1 bit load immediate instruction
> where N is the instruction size.

Yup, I am still using the high bit to indicate an immediate operand.
This requires an implied location, so this literal is always in R7, the
addressing register. In my stack design it is always the return stack,
also the addressing register.

Jumps and calls still contain a small literal to be used alone when the
relative jump is within range or with the literal instruction when not.

I think this is very similar to Huffman encoding.


> In one of Knuth's TAOCP books he describes a two instruction computer.
>
> Seems like if one is interested in MISC, someone should build that one.

You can design a one instruction computer, but there is a balance
between resources used and the effectiveness of the resulting design.
The effectiveness of this sort of design is too low.


> Also, maybe a MIX and MMIX machine, maybe the decimal version.
>
> For those who don't read TAOCP, MIX is defined independent of the
> underlying base. Programs in MIXAL are supposed to assemble and
> run correctly on hosts that use any base within the range specified.
>
> Instruction bytes have, if I remember, between 64 and 100 possible
> values, such that six bits or two decimal digits are possible
> representations.
>
> I believe that allows for bases 2, 3, 4, 8, 9, 10, and 16.

Doesn't sound like an especially practical computer. Has anyone ever
built one?

--

Rick

rickman

Apr 2, 2013, 1:33:02 PM
I think you have a very good handle on the nature of my goals.

--

Rick

the_gavino_himself

Apr 2, 2013, 2:17:04 PM
any low priced pc that can run firefox coming out with your chip?

glen herrmannsfeldt

Apr 2, 2013, 3:03:29 PM
In comp.arch.fpga rickman <gnu...@gmail.com> wrote:

(snip, I wrote)
>> Also, maybe a MIX and MMIX machine, maybe the decimal version.

>> For those who don't read TAOCP, MIX is defined independent of the
>> underlying base. Programs in MIXAL are supposed to assemble and
>> run correctly on hosts that use any base within the range specified.

>> Instruction bytes have, if I remember, between 64 and 100 possible
>> values, such that six bits or two decimal digits are possible
>> representations.

>> I believe that allows for bases 2, 3, 4, 8, 9, 10, and 16.

> Doesn't sound like an especially practical computer.
> Has anyone ever built one?

Well, it exists mostly to write the examples (and, I believe,
homework problems) in the book. I believe that there have been
software emulations, but maybe no hardware (FPGA) versions.

MIXAL programs should be base independent, but actual implementations
are likely one base.

(The model number, 1009 or MIX in roman numerals, is supposed to
be the average of the model numbers of some popular machines.)

There is also the DLX, a RISC machine used by Hennessy and Patterson
in their book, which could also be in roman numerals.

-- glen

Jecel

Apr 3, 2013, 2:39:12 AM
On Tuesday, April 2, 2013 11:50:22 AM UTC-3, Rickman wrote:

> My current register based approach has two "special" registers which can
> do some limited addressing of memory. R7 can do indirect, indirect
> post-increment or indirect pre-decrement on either store or load. R6
> can only do indirect pre-decrement on store and indirect post-increment
> on load to implement a stack. R6 will be used as the subroutine linkage
> stack pointer. R7 is for general memory access. It is starting to
> sound a little like one of the old DEC machines. I should review the
> docs on the PDP-11 and also the TI MSP430 which are both similar. I
> can't add all the complexity they are using, but they might give me some
> ideas.

You might want to also look at the TMS9900 processor. And the Inmos Transputer was a nice mix of MISC and CISC, stack and "registers".

> I am using some of my stack machine ideas in the register design. Like
> a literal instruction is just one bit with the other 8 bits as literal
> data. Jump and Call instructions have a four bit literal field included
> so that they can be used without the literal extension for short jumps
> saving a full opcode. But this comes at a cost. The register specs use
> a lot of opcode space. So, in an opcode size that has lots of spare
> opcodes in the stack design because of the implied operands, now is
> squeezed for space and must be severely limited compared to many
> register designs with larger opcodes.

My graduation project at the university in 1990 was a 12 bit MISC processor (3 four bit instructions per word), so I actually started playing around with this before Chuck Moore. In some later designs, I was worried about how many instructions were really 5 bits plus a whole word for the associated literal. So I created two versions of such instructions. Jump32, for example, would use the next 32 bit word as the address while Jump5 would use the next slot instead. That allowed the most common case to be 10 bits (5 for the opcode and 5 for the address). I also had a bunch of registers, but they weren't well integrated into the instruction set. PushReg and PopReg transferred data between them and the data stack.

-- Jecel

David Brown

Apr 3, 2013, 3:12:23 AM
x86 compatibility was not the "big" problem with the Itanium (though it
didn't help). There were two far bigger problems. One is that the chip
was targeted at maximising throughput with little regard for power
efficiency, since it was for the server market - so all of the logic was
running all of the time to avoid latencies, and it has massive caches
that run as fast as possible. The result here is that the original
devices had a power density exceeding the core of a nuclear reactor (it
was probably someone from AMD who worked that out...).

The big problem, however, is that the idea with VLIW is that the
compiler does all the work scheduling instructions in a way that lets
them run in parallel. This works in some specialised cases - some DSP's
have this sort of architecture, and some types of mathematical
algorithms suit it well. But when Intel started work on the Itanium,
compilers were not up to the task - Intel simply assumed they would work
well enough by the time the chips were ready. Unfortunately for Intel,
compiler technology never made it - and in fact, it will never work
particularly well for general code. There are too many unpredictable
branches and conditionals to predict parallelism at compile time. So
most real-world Itanium code uses only about a quarter or so of the
processing units in the cpu at any one time (though some types of code
can work far better). Thus Itanium chips run at half the real-world
speed of "normal" processors, while burning through at least twice the
power.



rickman

Apr 3, 2013, 9:54:51 AM
On 4/3/2013 2:39 AM, Jecel wrote:
> On Tuesday, April 2, 2013 11:50:22 AM UTC-3, Rickman wrote:
>
>> My current register based approach has two "special" registers which can
>> do some limited addressing of memory. R7 can do indirect, indirect
>> post-increment or indirect pre-decrement on either store or load. R6
>> can only do indirect pre-decrement on store and indirect post-increment
>> on load to implement a stack. R6 will be used as the subroutine linkage
>> stack pointer. R7 is for general memory access. It is starting to
>> sound a little like one of the old DEC machines. I should review the
>> docs on the PDP-11 and also the TI MSP430 which are both similar. I
>> can't add all the complexity they are using, but they might give me some
>> ideas.
>
> You might want to also look at the TMS9900 processor. And the Inmos Transputer was a nice mix of MISC and CISC, stack and "registers".

Yes, good idea. I am very familiar with the TMS9900 family. I had a
Technico 9900 single board computer many years ago... actually, I may
still have that one. I also built a 9995 SBC myself. The aspect of the
9900 family that stood out the most was the registers in memory which
could be moved by adjusting the "workspace" pointer. My MISC register
design only uses 8 registers so by using something similar to the
workspace concept it could provide for very fast context switching. At
the very least using distributed RAM I could have two sets of registers,
one for "user" mode and one for interrupts.

Its interesting that at the time the 9900 instruction set was designed,
the processor didn't really run faster than the memory. Then the
processors got a lot faster when using internal registers and the
workspace pointer idea became a bottleneck. Now in FPGAs, internal RAM
is not any slower than registers really, so the workspace pointer can
work again. The limitation on memory now is the number of ports. It
would be much more useful if FPGA memory were triple ported allowing
more to happen in a single clock cycle.
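
A minimal sketch of the workspace idea (Python, sizes invented): a
context switch is just a change of the workspace pointer, with no
register copying.

    # Illustrative sketch only: registers as a window into RAM.
    ram = [0] * 256
    wp = 0                                   # workspace pointer

    def read_reg(n):
        return ram[wp + n]

    def write_reg(n, value):
        ram[wp + n] = value

    write_reg(3, 42)                         # user context at wp = 0
    wp = 128                                 # interrupt: second register bank
    write_reg(3, 7)                          # user's r3 is untouched
    wp = 0
    assert read_reg(3) == 42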


>> I am using some of my stack machine ideas in the register design. Like
>> a literal instruction is just one bit with the other 8 bits as literal
>> data. Jump and Call instructions have a four bit literal field included
>> so that they can be used without the literal extension for short jumps
>> saving a full opcode. But this comes at a cost. The register specs use
>> a lot of opcode space. So, in an opcode size that has lots of spare
>> opcodes in the stack design because of the implied operands, now is
>> squeezed for space and must be severely limited compared to many
>> register designs with larger opcodes.
>
> My graduation project at the university in 1990 was a 12 bit MISC processor (3 four bit instructions per word), so I actually started playing around with this before Chuck Moore. In some later designs, I was worried about how many instructions were really 5 bits plus a whole word for the associated literal. So I created two versions of such instructions. Jump32, for example, would use the next 32 bit word as the address while Jump5 would use the next slot instead. That allowed the most common case to be 10 bits (5 for the opcode and 5 for the address). I also had a bunch of registers, but they weren't well integrated into the instruction set. PushReg and PopReg transferred data between them and the data stack.

I don't remember when or where I read this, but the most common literals
are small values. So it can be useful to have even just a 4 bit literal
for these cases saving a full word of memory.

I have a few more ideas on optimizations for the register based design,
then I will go back to writing code. I think the best way to optimize
the design is to write code and see where the problems are. Looking at
optimizations in the register based code has even given me ideas on how
to improve the stack based design.

--

Rick

Rod Pemberton

Apr 3, 2013, 8:34:06 PM
"rickman" <gnu...@gmail.com> wrote in message
news:kjf48e$5qu$1...@dont-email.me...

> Weren't you the person who brought CISC into this discussion?

Yes.

> Why are you asking this question about CISC?

You mentioned code density. AISI, code density is purely a CISC
concept. They go together and are effectively inseparable.

RISC was about effectively using all the processor clock cycles by
using fast instructions. RISC wasn't concerned about the encoded
size of instructions, how much memory a program consumed, the
cost of memory, or how fast memory needed to be.

CISC was about reducing memory consumed per instruction. CISC
reduced the average size of encoded instructions while also
increasing the amount of work each instruction performs. CISC was
typically little-endian to reduce the space needed for integer
encodings. However, increasing the amount of work per instruction
produces highly specialized instructions that are the
characteristic of CISC. You only need to look at the x86
instruction set to find some, e.g., STOS, LODS, XLAT, etc. They
are also slow to decode and execute as compared to RISC.

So, if memory is cheap and fast, there is no point in improving
code density, i.e., use RISC. If memory is expensive or slow, use
CISC.

Arlet mentioned changes to a processor that appeared to me to have
nothing to do with increasing or decreasing code density. AISI,
the changes he mentioned would only affect what was in current set
of MISC instructions, i.e., either a set of register-based MISC
instructions or a set of stack-based MISC instructions. This was
stated previously.



Rod Pemberton







Rod Pemberton

Apr 3, 2013, 8:35:36 PM
"rickman" <gnu...@gmail.com> wrote in message
news:kjeve8$tvm$1...@dont-email.me...
> On 3/30/2013 5:54 PM, Rod Pemberton wrote:
...

> > Even if you implement CISC-like instructions, you can't
> > forgo the MISC instructions you already have in order to
> > add the CISC-like instructions.
>
> Really? I can't drop the entire instruction set and start over?
> Who says so? Am I breaking a law?
>

It's just a logical outcome, AISI. The design criteria that you
stated was that of producing MISC processors. MISC seems to be
purely about the minimizing the quantity of instructions. You've
produced a MISC processor. So, if you now change your mind about
MISC and add additional instructions to your processor's
instruction set, especially non-MISC instructions, you're
effectively going against your own stated design requirement: MISC
or reducing the quantity of instructions. So, it'd be you who
says so, or not... It just seems contradictory with your past
self to change course now with your current self.

> > So, to do that, you'll need to increase
> > the size of the instruction set, as well as implement
> > a more complicated instruction decoder.
>
> Define "increase the size of the instruction set".

You'll have more instructions in your instruction set.

> I am using a 9 bit opcode for my stack design and am
> using a similar 9 bit opcode for the register design. In what
> way is the register design using a larger instruction set?
>

You haven't added any additional CISC-like instructions, yet. You
just exchanged stack operations for register operations. So,
none for now.

> > I.e., that means the processor will no longer be MISC,
>> but MISC+minimal CISC hybrid, or pure
> > CISC...
>
> Nonsense. "Minimal Instruction Set Computer (MISC) is a
> processor architecture with a very small number of basic
> operations and corresponding opcodes." from Wikipedia.
> BTW, I don't think the term MISC is widely used and is not
> well defined. This is the only web page I found that even
> tries to define it.
>
> Actually, if you consider only the opcodes and not the operand
> combinations, I think the register design may have fewer
> instructions than does the stack design. But the register
> design still is in work so I'm not done counting yet.
>

How exactly do fewer instructions contribute to increased code
density for the remaining instructions? The eliminated
instructions are no longer a measured component of code density.
I.e., they no longer consume memory and therefore aren't measured.

> There are some interesting corners to be explored. For example
> a MOV rx,rx is essentially a NOP. There are eight of these. So
> instead, why not make them useful by clearing the register?
> So MOV rx,rx is a clear to be given the name CLR rx... unless
> the rx is r7 in which case it is indeed a MOV r7,r7 which is now
> a NOP to be coded as such. The CLR r7 is not needed because
> LIT 0 already does the same job. Even better the opcode for a
> NOP is 0x1FF or octal 777. That's very easy to remember
> and recognize. It feels good to find convenient features like
> this and makes me like the register MISC design.

I can see that you're attempting to minimize the quantity of
implemented instructions. Although similar in nature, that's not
the same as improving the code density. Are you conflating two
different concepts? One of them reduces the encoded size of an
instruction, while the other eliminates instructions ...


How are you going to attempt to increase the code density for your
processor?

1) adding new, additional, more powerful instructions that you
don't already have
2) merging existing instruction into fewer instructions
3) finding a more compact method of instruction encoding
4) use little-endian to reduce encoded sizes of integers
5) none or other

I'd think it should be #1 and #3 and #4, or #2 and #3 and #4, or
"other" and #3 and #4 ...


Rod Pemberton


glen herrmannsfeldt

Apr 3, 2013, 10:07:22 PM
In comp.arch.fpga Rod Pemberton <do_no...@notemailnotq.cpm> wrote:
> "rickman" <gnu...@gmail.com> wrote in message
> news:kjf48e$5qu$1...@dont-email.me...
>> Weren't you the person who brought CISC into this discussion?

> Yes.

>> Why are you asking this question about CISC?

> You mentioned code density. AISI, code density is purely a CISC
> concept. They go together and are effectively inseparable.

They do go together, but I am not so sure that they are inseparable.

CISC began when much coding was done in pure assembler, and anything
that made that easier was useful. (One should figure out the relative
costs, but at least it was in the right direction.)

That brought instructions like S/360 EDMK and VAX POLY.
(Stories are that on most VAX models, POLY is slower than an
explicit loop.) Now, there is no need to waste bits, and so
instruction formats were defined to use the available bits.

S/360 (and successors) only have three different instruction lengths,
and even then sometimes waste bits.

The VAX's huge number of different instruction lengths, and also IA32's,
does seem to be for code size efficiency. VAX was also defined with
a 512 byte page size, even after S/370 had 2K and 4K pages.
Way too small, but maybe seemed right at the time.

> RISC was about effectively using all the processor clock cycles by
> using fast instructions. RISC wasn't concerned about the encoded
> size of instructions, how much memory a program consumed, the
> cost of memory, or how fast memory needed to be.

Yes, but that doesn't mean that CISC is concerned with the size
of instructions.

> CISC was about reducing memory consumed per instruction. CISC
> reduced the average size of encoded instructions while also
> increasing the amount of work each instruction performs.

Even if it is true (and it probably is) that CISC tends to make
efficient use of the bits, that doesn't prove that is what CISC
was about.

As above, CISC was about making coding easier for programmers,
specifically assembly programmers. Now, complex instructions take
less space than a series of simpler instructions, but then again
one could use a subroutine call.

The PDP-10 has the indirect bit, allowing for nested indirection,
which may or may not make efficient use of that bit. S/360 uses
a sequence of L (load) instructions to do indirection.

Instruction usage statistics noting how often L (load) was
executed in S/360 code may have been the beginning of RISC.

> CISC was typically little-endian to reduce the space needed
> for integer encodings.

This I don't understand at all. They take the same amount
of space. Little endian does make it slightly easier to
do a multiword (usually byte) add, and that may have helped
for the 6502. It allows one to propagate the carry in the
same order one reads bytes from memory.

But once you add multiply and divide, the advantage is pretty
small.

> However, increasing the amount of work per instruction
> produces highly specialized instructions that are the
> characteristic of CISC. You only need to look at the x86
> instruction set to find some, e.g., STOS, LODS, XLAT, etc. They
> are also slow to decode and execute as compared to RISC.

Those are not very CISCy compared with some S/360 or VAX
instructions. Now, compare to S/360 TR which will translate
by looking up bytes in a lookup table for 1 to 256 byte
long strings. (Unless I remember wrong, XLAT does one byte.)

> So, if memory is cheap and fast, there is no point in improving
> code density, i.e., use RISC. If memory is expensive or slow, use
> CISC.

Well, RISC is more toward using simpler instructions that compilers
actually generate and executing them fast. Having one instruction
size helps things go fast, and tends to be less efficient with bits.

Even so, I believe that you will find that RISC designers try to
make efficient use of the bits available, within the single size
instruction constraint.

> Arlet mentioned changes to a processor that appeared to me to have
> nothing to do with increasing or decreasing code density. AISI,
> the changes he mentioned would only affect what was in current set
> of MISC instructions, i.e., either a set of register-based MISC
> instructions or a set of stack-based MISC instructions. This was
> stated previously.

Decoding multiple different instruction formats tends to require
complicated demultiplexers which are especially hard to do in
an FPGA. Even so, one can make efficient use of the bits
and still be MISC.

-- glen

Syd Rumpo

Apr 4, 2013, 4:38:34 AM
On 29/03/2013 21:00, rickman wrote:
> I have been working with stack based MISC designs in FPGAs for some
> years. All along I have been comparing my work to the work of others.
> These others were the conventional RISC type processors supplied by the
> FPGA vendors as well as the many processor designs done by individuals
> or groups as open source.

<snip>

Can you achieve as fast interrupt response times on a register-based
machine as a stack machine? OK, shadow registers buy you one fast
interrupt, but that's sort of a one-level 2D stack.

Even the venerable RTX2000 had an impressive (IIRC) 200ns interrupt
response time.

Cheers
--
Syd

Arlet Ottens

Apr 4, 2013, 5:31:38 AM
It depends on the implementation.

The easiest thing would be to not save anything at all before jumping to
the interrupt handler. This would make the interrupt response really
fast, but you'd have to save the registers manually before using them.
It would benefit systems that don't need many (or any) registers in the
interrupt handler. And even saving 4 registers at 100 MHz only takes an
additional 40 ns.

If you have parallel access to the stack/program memory, you could like
the Cortex, and save a few (e.g. 4) registers on the stack, while you
fetch the interrupt vector, and refill the execution pipeline at the
same time. This adds a considerable bit of complexity, though.

If you keep the register file in a large memory, like a internal block
RAM, you can easily implement multiple sets of shadow registers.

Of course, an FPGA comes with flexible hardware such as large FIFOs, so
you can generally avoid the need for super fast interrupt response. In
fact, you may not even need interrupts at all.




Albert van der Horst

Apr 4, 2013, 7:16:17 AM
In article <kjin8q$so5$1...@speranza.aioe.org>,
glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>In comp.arch.fpga Rod Pemberton <do_no...@notemailnotq.cpm> wrote:
>> "rickman" <gnu...@gmail.com> wrote in message
>> news:kjf48e$5qu$1...@dont-email.me...
>>> Weren't you the person who brought CISC into this discussion?
>
>> Yes.
>
>>> Why are you asking this question about CISC?
>
>> You mentioned code density. AISI, code density is purely a CISC
>> concept. They go together and are effectively inseparable.
>
>They do go together, but I am not so sure that they are inseperable.
>
>CISC began when much coding was done in pure assembler, and anything
>that made that easier was useful. (One should figure out the relative
>costs, but at least it was in the right direction.)

But, of course, this is a fallacy. The same goal is accomplished by
macros, and better. Code density is the only valid reason.

<SNIP>
>
>Decoding multiple different instruction formats tends to require
>complicated demultiplexers which are especially hard to do in
>an FPGA. Even so, one can make efficient use of the bits
>and still be MISC.

>
>-- glen
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Arlet Ottens

Apr 4, 2013, 7:25:48 AM
On 04/04/2013 01:16 PM, Albert van der Horst wrote:

>>> You mentioned code density. AISI, code density is purely a CISC
>>> concept. They go together and are effectively inseparable.
>>
>> They do go together, but I am not so sure that they are inseperable.
>>
>> CISC began when much coding was done in pure assembler, and anything
>> that made that easier was useful. (One should figure out the relative
>> costs, but at least it was in the right direction.)
>
> But, of course, this is a fallacy. The same goal is accomplished by
> macro's, and better. Code densitity is the only valid reason.

Speed is another valid reason.

glen herrmannsfeldt

Apr 4, 2013, 8:44:10 AM
Presumably some combination of ease of coding, speed, and also Brooks'
"Second System Effect".

Paraphrasing from "Mythical Man Month" since I haven't read it recently,
the ideas that designers couldn't implement in their first system that
they designed, for cost/efficiency/whatever reasons, come out in the
second system.

Brooks wrote that more for OS/360 (software) than for S/360 (hardware),
but it might still have some effect on the hardware, and maybe also
for VAX.

There are a number of VAX instructions that seem like a good idea, but
as I understand it ended up slower than if done without the special
instructions.

As examples, both the VAX POLY and INDEX instruction. When VAX was new,
compiled languages (Fortran for example) pretty much never did array
bounds testing. It was just too slow. So VAX supplied INDEX, which in
one instruction did the multiply/add needed for a subscript calculation
(you do one INDEX for each subscript) and also checked that the
subscript was in range. Nice idea, but it seems that even with INDEX
it was still too slow.

Then POLY evaluates a whole polynomial, such as is used to approximate
many mathematical functions, but again, as I understand it, too slow.
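
Roughly what those two instructions compute, written out as a plain
Python sketch (operand details of the real instructions simplified):

    # Illustrative sketch only, not the exact VAX operand order.
    def vax_index(subscript, low, high, size, indexin):
        # one INDEX per subscript: range check, then scale and accumulate
        if not (low <= subscript <= high):
            raise IndexError("subscript range trap")
        return (indexin + subscript) * size

    def vax_poly(x, coeffs):
        # POLY: polynomial evaluation by Horner's rule,
        # coefficients given from the highest-order term down
        result = 0.0
        for c in coeffs:
            result = result * x + c
        return result

    print(vax_poly(2.0, [1, 0, -3]))         # x**2 - 3 at x = 2 -> 1.0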

Both the PDP-10 and S/360 have the option for an index register on
many instructions, where when register 0 is selected no indexing is
done. VAX instead has indexed as a separate address mode selected by
the address mode byte. Is that the most efficient use for those bits?

-- glen

glen herrmannsfeldt

Apr 4, 2013, 8:49:54 AM
In comp.arch.fpga Syd Rumpo <use...@nononono.co.uk> wrote:

(snip)
> Can you achieve as fast interrupt response times on a register-based
> machine as a stack machine? OK, shadow registers buy you one fast
> interrupt, but that's sort of a one-level 2D stack.

If you disable interrupts so that another one doesn't come along
before you can save enough state for the first one, yes.

S/360 does it with no stack. You have to have some place in the
low (first 4K) address range to save at least one register.
The hardware saves the old PSW at a fixed (for each type of
interrupt) address, which you also have to move somewhere else
before enabling more interrupts of the same type.

> Even the venerable RTX2000 had an impressive (IIRC) 200ns interrupt
> response time.

-- glen

Arlet Ottens

Apr 4, 2013, 9:02:34 AM
On 04/04/2013 02:49 PM, glen herrmannsfeldt wrote:
> In comp.arch.fpga Syd Rumpo <use...@nononono.co.uk> wrote:
>
> (snip)
>> Can you achieve as fast interrupt response times on a register-based
>> machine as a stack machine? OK, shadow registers buy you one fast
>> interrupt, but that's sort of a one-level 2D stack.
>
> If you disable interrupts so that another one doesn't come along
> before you can save enough state for the first one, yes.
>
> S/360 does it with no stack. You have to have some place in the
> low (first 4K) address range to save at least one register.
> The hardware saves the old PSW at a fixed (for each type of
> interrupt) address, which you also have to move somewhere else
> before enabling more interrupts of the same type.

ARM7 is similar. PC and PSW are copied to registers, and further
interrupts are disabled. The hardware does not touch the stack. If you
want to make nested interrupts, the programmer is responsible for saving
these registers.

ARM Cortex has changed that, and it saves registers on the stack. This
allows interrupt handlers to be written as regular higher language
functions, and also allows easy nested interrupts.

When dealing with back-to-back interrupts, the Cortex takes a shortcut,
and does not pop/push the registers, but just leaves them on the stack.



Elizabeth D. Rather

Apr 4, 2013, 2:36:55 PM
On 4/3/13 11:31 PM, Arlet Ottens wrote:
> On 04/04/2013 10:38 AM, Syd Rumpo wrote:
>> On 29/03/2013 21:00, rickman wrote:
>>> I have been working with stack based MISC designs in FPGAs for some
>>> years. All along I have been comparing my work to the work of others.
>>> These others were the conventional RISC type processors supplied by the
>>> FPGA vendors as well as the many processor designs done by individuals
>>> or groups as open source.
>>
>> <snip>
>>
>> Can you achieve as fast interrupt response times on a register-based
>> machine as a stack machine? OK, shadow registers buy you one fast
>> interrupt, but that's sort of a one-level 2D stack.
>>
>> Even the venerable RTX2000 had an impressive (IIRC) 200ns interrupt
>> response time.
>
> It depends on the implementation.
>
> The easiest thing would be to not save anything at all before jumping to
> the interrupt handler. This would make the interrupt response really
> fast, but you'd have to save the registers manually before using them.
> It would benefit systems that don't need many (or any) registers in the
> interrupt handler. And even saving 4 registers at 100 MHz only takes an
> additional 40 ns.

The best interrupt implementation just jumps to the handler code. The
implementation knows what registers it has to save and restore, which
may be only one or two. Saving and restoring large register files takes
cycles!

> If you have parallel access to the stack/program memory, you could like
> the Cortex, and save a few (e.g. 4) registers on the stack, while you
> fetch the interrupt vector, and refill the execution pipeline at the
> same time. This adds a considerable bit of complexity, though.
>
> If you keep the register file in a large memory, like a internal block
> RAM, you can easily implement multiple sets of shadow registers.
>
> Of course, an FPGA comes with flexible hardware such as large FIFOs, so
> you can generally avoid the need for super fast interrupt response. In
> fact, you may not even need interrupts at all.

Interrupts are good. I don't know why people worry about them so!

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

rickman

Apr 3, 2013, 10:42:06 PM
On 4/3/2013 8:34 PM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message
> news:kjf48e$5qu$1...@dont-email.me...
>
>> Weren't you the person who brought CISC into this discussion?
>
> Yes.
>
>> Why are you asking this question about CISC?
>
> You mentioned code density. AISI, code density is purely a CISC
> concept. They go together and are effectively inseparable.

Ok, so that's how you see it.


> RISC was about effectively using all the processor clock cycles by
> using fast instructions. RISC wasn't concerned about the encoded
> size of instructions, how much memory a program consumed, the
> cost of memory, or how fast memory needed to be.

Don't know why you are even mentioning RISC.


> CISC was about reducing memory consumed per instruction. CISC
> reduced the average size of encoded instructions while also
> increasing the amount of work each instruction performs. CISC was
> typically little-endian to reduce the space needed for integer
> encodings. However, increasing the amount of work per instruction
> produces highly specialized instructions that are the
> characteristic of CISC. You only need to look at the x86
> instruction set to find some, e.g., STOS, LODS, XLAT, etc. They
> are also slow to decode and execute as compared to RISC.

I think CISC was not solely about reducing memory used per instruction.
CISC was not an area of work. CISC was not even coined until long
after many CISC machines were designed. Most CISC computers were
designed with very different goals in mind. For example, the x86, a
CISC processor, was initially designed to extend the earlier 8 bit
instruction set to a 16 bit processor and then to a 32 bit processor. The goal was just
to develop an instruction set that was backwards compatible with
existing processors while adding capabilities that would make 32 bit
processors marketable.


> So, if memory is cheap and fast, there is no point in improving
> code density, i.e., use RISC. If memory is expensive or slow, use
> CISC.

LOL, that is a pretty MINIMAL analysis of computers, so I guess it is
MACA, Minimal Analysis of Computer Architectures.


> Arlet mentioned changes to a processor that appeared to me to have
> nothing to do with increasing or decreasing code density. AISI,
> the changes he mentioned would only affect what was in current set
> of MISC instructions, i.e., either a set of register-based MISC
> instructions or a set of stack-based MISC instructions. This was
> stated previously.

This is so far out of context I can't comment.

--

Rick

rickman

unread,
Apr 4, 2013, 12:09:53 AM4/4/13
to
On 4/3/2013 8:35 PM, Rod Pemberton wrote:
> "rickman"<gnu...@gmail.com> wrote in message
> news:kjeve8$tvm$1...@dont-email.me...
>> On 3/30/2013 5:54 PM, Rod Pemberton wrote:
> ...
>
>>> Even if you implement CISC-like instructions, you can't
>>> forgo the MISC instructions you already have in order to
>>> add the CISC-like instructions.
>>
>> Really? I can't drop the entire instruction set and start over?
>> Who says so? Am I breaking a law?
>>
>
> It's just a logical outcome, AISI. The design criteria that you
> stated was that of producing MISC processors. MISC seems to be
> purely about the minimizing the quantity of instructions. You've
> produced a MISC processor. So, if you now change your mind about
> MISC and add additional instructions to your processor's
> instruction set, especially non-MISC instructions, you're
> effectively going against your own stated design requirement: MISC
> or reducing the quantity of instructions. So, it'd be you who
> says so, or not... It just seems contradictory with your past
> self to change course now with your current self.

Your logic seems to be flawed on so many levels. I don't think I stated
that producing a MISC processor was a "design criteria". It doesn't
even make sense to have that as a "design criteria".

I never said I was "adding" instructions to some existing instruction
set. In fact, I think I've said that the instruction set for the
register based MISC processor is so far, *smaller* than the instruction
set for the stack based MISC processor as long as you don't consider
each combination of X and Y in MOV rx,ry to be a separate instruction.
If you feel each combination is a separate instruction then they both
have approximately the same number of instructions since they both have
9 bit instructions and so have 512 possible instructions.
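
(As a rough illustration of how quickly a 9 bit space gets used up, here
is a minimal C sketch that counts the encodings a MOV with two 3-bit
register selects would occupy; the field positions and the MOV opcode
value are assumptions for illustration, not the actual format of either
design.)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 9-bit instruction: a 3-bit opcode field plus two 3-bit
   register selects.  All field positions and the MOV opcode value are
   assumed purely for illustration. */
#define OP_MOV 0u

static const char *decode(uint16_t insn)
{
    unsigned op  = (insn >> 6) & 0x7;   /* bits 8..6 */
    unsigned dst = (insn >> 3) & 0x7;   /* bits 5..3 */
    unsigned src =  insn       & 0x7;   /* bits 2..0 */

    if (op == OP_MOV)
        return (dst == src) ? "reusable rx==ry slot" : "MOV rx,ry";
    return "other";
}

int main(void)
{
    unsigned mov = 0, diag = 0;
    for (uint16_t i = 0; i < 512; i++) {
        const char *kind = decode(i);
        if (kind[0] == 'M') mov++;
        else if (kind[0] == 'r') diag++;
    }
    /* prints 56 and 8: MOV eats 64 of the 512 encodings, and the 8
       rx==rx slots are free to be reused for something else */
    printf("MOV slots: %u, reusable diagonal slots: %u\n", mov, diag);
    return 0;
}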


>>> So, to do that, you'll need to increase
>>> the size of the instruction set, as well as implement
>>> a more complicated instruction decoder.
>>
>> Define "increase the size of the instruction set".
>
> You'll have more instructions in your instruction set.

Sorry, that isn't a good definition because you used part of the term
you are defining in the definition.


>> I am using a 9 bit opcode for my stack design and am
>> using a similar 9 bit opcode for the register design. In what
>> way is the register design using a larger instruction set?
>>
>
> You haven't added any additional CISC-like instructions, yet. You
> just exchanged stack operations for register operations. So,
> none for now.

Ok, now we are getting somewhere. In fact, if you read my other posts,
you will find that I *won't* be adding any CISC instructions because one
of my stated "design criteria" is that each instruction executes in one
clock cycle. It's pretty hard to design a simple machine that can do
"complex" instructions without executing them in multiple clock cycles.


>>> I.e., that means the processor will no longer be MISC,
>>> but MISC+minimal CISC hybrid, or pure
>>> CISC...
>>
>> Nonsense. "Minimal Instruction Set Computer (MISC) is a
>> processor architecture with a very small number of basic
>> operations and corresponding opcodes." from Wikipedia.
>> BTW, I don't think the term MISC is widely used and is not
>> well defined. This is the only web page I found that even
>> tries to define it.
>>
>> Actually, if you consider only the opcodes and not the operand
>> combinations, I think the register design may have fewer
>> instructions than does the stack design. But the register
>> design still is in work so I'm not done counting yet.
>>
>
> How exactly does fewer instructions contribute to increased code
> density for the remaining instructions? The eliminated
> instructions are no longer a measured component of code density.
> I.e., they no longer consume memory and therefore aren't measured.

Not sure what you mean here. Code density is how many instructions it
takes to do a given amount of work. I measure this by writing code and
counting the instructions it takes. Right now I have a section of code
I am working on that performs the DDS calculations from a set of control
inputs to the DDS. This is what I was working on when I realized that a
register based design likely could do this without the stack ops (mainly
OVER, but also nearly all the others that just work on the top two
stack items).

So far it appears the register based instructions are significantly more
compact than the stack based instructions. Just as important, the
implementation appears to be simpler for the register based design. But
that is just *so far*. I am still working on this. The devil is in the
details; I may find that some aspects of what I am doing cause problems
and can't be done in the instruction formats I am planning, or that
something blows the hardware up to be much bigger than I am picturing at
the moment.


>> There are some interesting corners to be explored. For example
>> a MOV rx,rx is essentially a NOP. There are eight of these. So
>> instead, why not make them useful by clearing the register?
>> So MOV rx,rx is a clear to be given the name CLR rx... unless
>> the rx is r7 in which case it is indeed a MOV r7,r7 which is now
>> a NOP to be coded as such. The CLR r7 is not needed because
>> LIT 0 already does the same job. Even better the opcode for a
>> NOP is 0x1FF or octal 777. That's very easy to remember
>> and recognize. It feels good to find convenient features like
>> this and makes me like the register MISC design.
>
> I can see that you're attempting to minimize the quantity of
> implemented instructions. Although similar in nature, that's not
> the same as improving the code density. Are you conflating two
> different concepts? One of them reduces the encoded size of an
> instruction, while the other eliminates instructions ...

You really don't seem to understand what I am doing. You continually
misinterpret what I explain.


> How are you going to attempt to increase the code density for your
> processor?
>
> 1) adding new, additional, more powerful instructions that you
> don't already have
> 2) merging existing instruction into fewer instructions
> 3) finding a more compact method of instruction encoding
> 4) use little-endian to reduce encoded sizes of integers
> 5) none or other
>
> I'd think it should be #1 and #3 and #4, or #2 and #3 and #4, or
> "other" and #3 and #4 ...

Uh, I am designing an instruction set that does as much as possible with
as little hardware as possible. When you say, "new, additional"
instructions, compared to what? When you say "more compact", again,
compared to what exactly?

When you say "litte-endian" to reduce encoded integer size, what exactly
is that? Are you referring to specifying an integer in small chunks so
that sign extension allows the specification to be limited in length?
Yes, that is done on both the stack and register based designs.
Koopmans paper lists literals and calls as some of the most frequently
used instructions, so optimizing literals optimizes the most frequently
used instructions.
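
(To make the chunked-literal idea concrete, here is a small C sketch.
The 5-bit payload width, the shift-and-OR accumulation and the opcode
split are assumptions for illustration only, not the encoding of either
design; the point is just that small signed constants fit in one
instruction word and only large ones pay for extra chunks.)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical "LIT" instruction: a 4-bit opcode plus a 5-bit payload.
   The first LIT sign-extends its chunk; each following LIT shifts the
   accumulated value left by 5 and ORs in the next chunk. */
#define CHUNK_BITS 5
#define CHUNK_MASK ((1u << CHUNK_BITS) - 1)

static int32_t lit_first(uint16_t insn)
{
    int32_t chunk = insn & CHUNK_MASK;
    if (chunk & (1 << (CHUNK_BITS - 1)))   /* sign extend the 5-bit field */
        chunk -= 1 << CHUNK_BITS;
    return chunk;
}

static int32_t lit_next(int32_t acc, uint16_t insn)
{
    return (acc << CHUNK_BITS) | (int32_t)(insn & CHUNK_MASK);
}

int main(void)
{
    printf("%d\n", lit_first(0x1D));                  /* 0b11101 -> -3  */
    printf("%d\n", lit_next(lit_first(0x09), 0x0C));  /* 9, 12  -> 300  */
    return 0;
}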

--

Rick

rickman

unread,
Apr 4, 2013, 5:04:05 PM4/4/13
to
On 4/4/2013 8:44 AM, glen herrmannsfeldt wrote:
> In comp.arch.fpga Arlet Ottens<usen...@c-scape.nl> wrote:
>> On 04/04/2013 01:16 PM, Albert van der Horst wrote:
>
>>>>> You mentioned code density. AISI, code density is purely a CISC
>>>>> concept. They go together and are effectively inseparable.
>
>>>> They do go together, but I am not so sure that they are inseperable.
>
>>>> CISC began when much coding was done in pure assembler, and anything
>>>> that made that easier was useful. (One should figure out the relative
>>>> costs, but at least it was in the right direction.)
>
>>> But, of course, this is a fallacy. The same goal is accomplished by
>>> macro's, and better. Code densitity is the only valid reason.

Albert, do you have a reference about this?
I think you have just described the CISC instruction development
concept. Build a new machine, add some new instructions. No big
rationale, no "CISC" concept, just "let's make it better, why not add
some instructions?"

I believe if you check you will find the term CISC was not even coined
until after RISC was invented. So CISC really just means, "what we used
to do".

--

Rick

glen herrmannsfeldt

unread,
Apr 4, 2013, 5:34:52 PM4/4/13
to
In comp.arch.fpga rickman <gnu...@gmail.com> wrote:

(snip, then I wrote)

>> Then POLY evaluates a whole polynomial, such as is used to approximate
>> many mathematical functions, but again, as I understand it, too slow.

>> Both the PDP-10 and S/360 have the option for an index register on
>> many instructions, where when register 0 is selected no indexing is
>> done. VAX instead has indexed as a separate address mode selected by
>> the address mode byte. Is that the most efficient use for those bits?

> I think you have just described the CISC instruction development
> concept. Build a new machine, add some new instructions. No big
> rational, no "CISC" concept, just "let's make it better, why not add
> some instructions?"

Yes, but remember that there is competition and each has to have
some reason why someone should buy their product. Adding new instructions
was one way to do that.

> I believe if you check you will find the term CISC was not even coined
> until after RISC was invented. So CISC really just means, "what we used
> to do".

Well, yes, but why did "we used to do that"? For S/360, a lot of
software was still written in pure assembler, for one reason to make
it faster, and for another to make it smaller. And people were just
starting to learn that people (writing software) are more expensive
than machines (hardware). Well, that is about the point that it was
true. For earlier machines you were lucky to get one compiler and
enough system to run it.

And VAX was enough later and even more CISCy.

-- glen

Alex McDonald

unread,
Apr 4, 2013, 6:30:37 PM4/4/13
to
On Apr 4, 10:04 pm, rickman <gnu...@gmail.com> wrote:
> On 4/4/2013 8:44 AM, glen herrmannsfeldt wrote:
>
> > In comp.arch.fpga Arlet Ottens<usene...@c-scape.nl>  wrote:
> >> On 04/04/2013 01:16 PM, Albert van der Horst wrote:
>
> >>>>> You mentioned code density.  AISI, code density is purely a CISC
> >>>>> concept.  They go together and are effectively inseparable.
>
> >>>> They do go together, but I am not so sure that they are inseperable.
>
> >>>> CISC began when much coding was done in pure assembler, and anything
> >>>> that made that easier was useful. (One should figure out the relative
> >>>> costs, but at least it was in the right direction.)
>
> >>> But, of course, this is a fallacy. The same goal is accomplished by
> >>> macro's, and better. Code densitity is the only valid reason.
>
> Albert, do you have a reference about this?
>
>


Let's take two commonly used S/360 opcodes as an example of CISC; some
move operations. MVC (move 0 to 255 bytes) MVCL (move 0 to 16M bytes).
MVC does no padding or truncation. MVCL can pad and truncate, but
unlike MVC will do nothing and report overflow if the operands
overlap. MVC appears to other processors as a single indivisible
operation; every processor (including IO processors) sees storage as
either before the MVC or after it; it's not interruptible. MVCL is
interruptible, and partial products can be observed by other
processors. MVCL requires 4 registers and their contents are updated
after completion of the operation; MVC requires 1 for variable length
moves, 0 for fixed and its contents are preserved. MVCL has a high
code setup cost; MVC has none.

Writing a macro to do multiple MVCs and mimic the behaviour of MVCL?
Why not? It's possible, if a little tricky. And by all accounts, MVC
in a loop is faster than MVCL too. IBM even provided a macro; $MVCL.
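
(A rough C analogy of what such a macro expands to; a sketch only, under
the assumption that we only want the plain copy case, while the real
$MVCL also has to deal with padding, overlap and the length-in-register
form.)

#include <stddef.h>
#include <string.h>

/* Do a long move as repeated short "MVC"s: each inner move handles at
   most 256 bytes, which is the MVC limit. */
static void long_move(void *dst, const void *src, size_t len)
{
    char *d = dst;
    const char *s = src;

    while (len > 256) {              /* full 256-byte chunks */
        memcpy(d, s, 256);
        d += 256;
        s += 256;
        len -= 256;
    }
    if (len)                         /* final short chunk */
        memcpy(d, s, len);
}

int main(void)
{
    char a[1000], b[1000];
    memset(a, 'x', sizeof a);
    long_move(b, a, sizeof a);
    return b[999] == 'x' ? 0 : 1;
}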

But then, when you look at MVCL usage closely, there are a few
defining characteristics that are very useful. It can zero memory, and
the millicode (IBM's word for microcode) recognizes 4K boundaries for
4K lengths and optimises it; it's faster than 16 MVCs.

There's even a MVPG instruction for moving 4K aligned pages! What are
those crazy instruction set designers thinking?

The answer's a bit more than just code density; it never really was
about that. In all the years I wrote IBM BAL, I never gave code
density a serious thought -- with one exception. That was the 4K base
address limit; a base register could only span 4K, so code that was
bigger than that, you had to have either fancy register footwork or
waste registers for multiple bases.

It was more about giving assembler programmers choice and variety to
get the best out of the box before the advent of optimising compilers;
a way, if you like, of exposing the potential of the micro/millicode
through the instruction set. "Here I want you to zero memory" meant an
MVCL. "Here I am moving 8 bytes from A to B" meant using MVC. A
knowledgeable assembler programmer could out-perform a compiler.
(Nowadays quality compilers do a much better job of instruction
selection than humans, especially for pipelined processors that
stall.)

Hence CISC instruction sets (at least, IMHO and for IBM). They were
there for people and performance, not for code density.



rickman

unread,
Apr 4, 2013, 6:57:48 PM4/4/13
to
That's an interesting question. The short answer is yes, but it
requires that I provide circuitry to do two things. One is to save both
the Processor Status Word (PSW) and the return address to the stack in
one cycle. The stack computer has two stacks and I can save these two
items in one clock cycle. Currently my register machine uses a stack in
memory pointed to by a register, so it would require *two* cycles to
save two words. But the memory is dual ported and I can use a tiny bit
of extra logic to save both words at once and bump the pointer by two.

The other task is to save registers. The stack design doesn't really
need to do that, the stack is available for new work and the interrupt
routine just needs to finish with the stack in the same state as when it
started. I've been thinking about how to handle this in the register
machine.

The registers are really two registers and one bank of registers. R6
and R7 are "special" in that they have a separate incrementer to support
the addressing modes. They need a separate write port so they can be
updated in parallel with the other registers. I have considered
"saving" the registers by just bumping the start address of the
registers in the RAM, but that only saves R0-R5. I could use LUT RAM
for R6 and R7 as well. This would provide two sets of registers for
R0-R5 and up to 16 sets for R6 and R7. The imbalance isn't very useful,
but at least there would be a set for the main program and a set for
interrupts with the caveat that nothing can be retained between
interrupts. This also means interrupts can't be interrupted other than
at specific points where the registers are not used for storage.

I'm also going to look at using a block RAM for the registers. With
only two read and write ports this makes the multiply step cycle longer
though.

Once that issue is resolved the interrupt response then becomes the same
as the stack machine - 1 clock cycle or 20 ns.
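
(A behavioural sketch in C of the bank-switch idea described above,
modelling the register file as a small RAM whose base offset is bumped
on interrupt entry.  The sizes, the single alternate bank and the
function names are assumptions for illustration, not the actual
hardware.)

#include <stdint.h>

enum { NREGS = 6, NBANKS = 2 };        /* R0-R5, two banks         */

static uint16_t regfile[NBANKS * NREGS];
static unsigned bank;                  /* 0 = main, 1 = interrupt  */

static uint16_t rd(unsigned r)             { return regfile[bank * NREGS + r]; }
static void     wr(unsigned r, uint16_t v) { regfile[bank * NREGS + r] = v; }

/* Entry/exit just flip the bank; with only two banks, interrupts
   cannot nest, as noted above.  (The PSW and return address saves are
   not modelled here.) */
static void irq_enter(void) { bank = 1; }
static void irq_exit(void)  { bank = 0; }

int main(void)
{
    wr(0, 123);          /* main program uses its R0              */
    irq_enter();
    wr(0, 999);          /* the handler's R0 is a different cell  */
    irq_exit();
    return rd(0) == 123 ? 0 : 1;    /* main program's R0 survives */
}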

--

Rick

Alex McDonald

unread,
Apr 4, 2013, 7:07:20 PM4/4/13
to
On Apr 3, 5:34 pm, "Rod Pemberton" <do_not_h...@notemailnotq.cpm>
wrote:
>  CISC was
> typically little-endian to reduce the space needed for integer
> encodings.

As long as you discount IBM mainframes. They are big endian. Or
Burroughs/Unisys; they were big endian too. Or the Motorola 68K; it
was big endian. For little-endian CISC, only the VAX and x86 come to
mind. Of those only the x86 survives.

glen herrmannsfeldt

unread,
Apr 4, 2013, 7:15:37 PM4/4/13
to
In comp.arch.fpga Alex McDonald <bl...@rivadpm.com> wrote:

(snip, someone wrote)
>> >>> But, of course, this is a fallacy. The same goal is accomplished by
>> >>> macro's, and better. Code densitity is the only valid reason.

>> Albert, do you have a reference about this?

> Let's take two commonly used S/360 opcodes as an example of CISC; some
> move operations. MVC (move 0 to 255 bytes) MVCL (move 0 to 16M bytes).

MVC moves 1 to 256 bytes, conveniently. (Unless you want 0.)

> MVC does no padding or truncation. MVCL can pad and truncate, but
> unlike MVC will do nothing and report overflow if the operands
> overlap. MVC appears to other processors as a single indivisible
> operation; every processor (including IO processors) sees storage as
> either before the MVC or after it;

I haven't looked recently, but I didn't think it locked out I/O.
Seems that one of the favorite tricks for S/360 was modifying
channel programs while they are running. (Not to mention self-
modifying channel programs.) Seems that MVC would be convenient
for that. It might be that MVC interlocks on CCW fetch such that
only whole CCWs are fetched, though.

> it's not interruptible. MVCL is
> interruptible, and partial products can be observed by other
> processors. MVCL requires 4 registers and their contents are updated
> after completion of the operation; MVC requires 1 for variable length
> moves, 0 for fixed and its contents are preserved. MVCL has a high
> code setup cost; MVC has none.

> Writing a macro to do multiple MVCs and mimic the behaviour of MVCL?
> Why not? It's possible, if a little tricky. And by all accounts, MVC
> in a loop is faster than MVCL too. IBM even provided a macro; $MVCL.

> But then, when you look at MVCL usage closely, there are a few
> defining characteristics that are very useful. It can zero memory, and
> the millicode (IBM's word for microcode) recognizes 4K boundaries for
> 4K lengths and optimises it; it's faster than 16 MVCs.

As far as I understand, millicode isn't exactly like microcode,
but does allow for more complicated new instructions to be more
easily implemented.

> There's even a MVPG instruction for moving 4K aligned pages! What are
> those crazy instruction set designers thinking?

> The answer's a bit more than just code density; it never really was
> about that. In all the years I wrote IBM BAL, I never gave code
> density a serious thought -- with one exception. That was the 4K base
> address limit; a base register could only span 4K, so code that was
> bigger than that, you had to have either fancy register footwork or
> waste registers for multiple bases.

Compared to VAX, S/360 is somewhat RISCy. Note only three different
instruction lengths and, for much of the instruction set only two
address modes. If processors fast path the more popular instructions,
like L and even MVC, it isn't so far from RISC.

> It was more about giving assembler programmers choice and variety to
> get the best out of the box before the advent of optimising compilers;

Though stories are that even the old Fortran H could come close to
good assembly programmers, and likely better than the average assembly
programmer.

> a way, if you like, of exposing the potential of the micro/millicode
> through the instruction set. "Here I want you to zero memory" meant an
> MVCL. "Here I am moving 8 bytes from A to B" meant using MVC. A
> knowledgeable assembler programmer could out-perform a compiler.
> (Nowadays quality compilers do a much better job of instruction
> selection than humans, especially for pipelined processors that
> stall.)

For many processors, MVC was much faster on appropriately aligned
data, such as the 8 bytes from A to B. Then again, some might
use LD and STD.

> Hence CISC instruction sets (at least, IMHO and for IBM). They were
> there for people and performance, not for code density.

I noticed some time ago that the hex opcodes for add instructions
end in A, and for divide in D. (That leaves B for subtract and
C for multiply, but not so hard to remember.)

If they really wanted to reduce code size, they should have added
a load indirect register instruction. (RR format.) A good
fraction of L (load) instructions have both base and offset
zero, (or, equivalently, index and offset).

-- glen

rickman

unread,
Apr 4, 2013, 7:30:28 PM4/4/13
to
Sure, none of this stuff was done without some purpose. My point is
that there was no *common* theme to the various CISC instruction sets.
Everybody was doing their own thing until RISC came along with a basic
philosophy. Someone felt the need to give a name to the previous way of
doing things and CISC seemed appropriate. No special meaning in the
name actually, just a contrast to the "Reduced" in RISC.

I don't think this is a very interesting topic really. It started in
response to a comment by Rod.

--

Rick

rickman

unread,
Apr 4, 2013, 8:07:24 PM4/4/13
to
On 4/4/2013 7:16 AM, Albert van der Horst wrote:
> In article<kjin8q$so5$1...@speranza.aioe.org>,
> glen herrmannsfeldt<g...@ugcs.caltech.edu> wrote:
>> In comp.arch.fpga Rod Pemberton<do_no...@notemailnotq.cpm> wrote:
>>> "rickman"<gnu...@gmail.com> wrote in message
>>> news:kjf48e$5qu$1...@dont-email.me...
>>>> Weren't you the person who brought CISC into this discussion?
>>
>>> Yes.
>>
>>>> Why are you asking this question about CISC?
>>
>>> You mentioned code density. AISI, code density is purely a CISC
>>> concept. They go together and are effectively inseparable.
>>
>> They do go together, but I am not so sure that they are inseperable.
>>
>> CISC began when much coding was done in pure assembler, and anything
>> that made that easier was useful. (One should figure out the relative
>> costs, but at least it was in the right direction.)
>
> But, of course, this is a fallacy. The same goal is accomplished by
> macro's, and better. Code densitity is the only valid reason.

I'm pretty sure that conclusion is not correct. If you have an
instruction that does two or three memory accesses in one instruction
and you replace it with three instructions that do one memory access
each, you end up with two extra memory accesses (the fetches for the two
added instructions). How is this faster?

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.

--

Rick

Alex McDonald

unread,
Apr 4, 2013, 8:26:24 PM4/4/13
to
On Apr 4, 4:15 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> In comp.arch.fpga Alex McDonald <b...@rivadpm.com> wrote:
>
> (snip, someone wrote)
>
> >> >>> But, of course, this is a fallacy. The same goal is accomplished by
> >> >>> macro's, and better. Code densitity is the only valid reason.
> >> Albert, do you have a reference about this?
> > Let's take two commonly used S/360 opcodes as an example of CISC; some
> > move operations. MVC (move 0 to 255 bytes) MVCL (move 0 to 16M bytes).
>
> MVC moves 1 to 256 bytes, conveniently. (Unless you want 0.)

My bad; the encoding is 0 to 255 but it's interpreted as +1.

>
> > MVC does no padding or truncation. MVCL can pad and truncate, but
> > unlike MVC will do nothing and report overflow if the operands
> > overlap. MVC appears to other processors as a single indivisible
> > operation; every processor (including IO processors) sees storage as
> > either before the MVC or after it;
>
> I haven't looked recently, but I didn't think it locked out I/O.
> Seems that one of the favorite tricks for S/360 was modifying
> channel programs while they are running. (Not to mention self-
> modifying channel prorams.) Seems that MVC would be convenient
> for that. It might be that MVC interlocks on CCW fetch such that
> only whole CCWs are fetched, though.

Certainly on the S/360 whole CCWs, since it had a precise interrupt
model and MVC wasn't (and still isn't) interruptible. The S/370
allowed interrupts on page faults for the target and source, but that
is done before the instruction is executed.

IPL operates just like that; it issues a fixed CCW that reads in data
that's a PSW and some CCWs, and away she goes...
A modern Z series has more instructions than the average Forth has
words; it's in the high hundreds.

>
> > It was more about giving assembler programmers choice and variety to
> > get the best out of the box before the advent of optimising compilers;
>
> Though stories are that even the old Fortran H could come close to
> good assembly programmers, and likely better than the average assembly
> programmer.

Fortran/H was a good compiler. The early PL/I was horrible, and there
was a move to use it for systems programming work. I never did so due
to its incredibly bad performance.

>
> > a way, if you like, of exposing the potential of the micro/millicode
> > through the instruction set. "Here I want you to zero memory" meant an
> > MVCL. "Here I am moving 8 bytes from A to B" meant using MVC. A
> > knowledgeable assembler programmer could out-perform a compiler.
> > (Nowadays quality compilers do a much better job of instruction
> > selection than humans, especially for pipelined processors that
> > stall.)
>
> For many processors, MVC was much faster on appropriately aligned
> data, such as the 8 bytes from A to B. Then again, some might
> use LD and STD.

OK, 9 bytes. :-)

>
> > Hence CISC instruction sets (at least, IMHO and for IBM). They were
> > there for people and performance, not for code density.
>
> I noticed some time ago that the hex opcodes for add instructions
> end in A, and for divide in D. (That leaves B for subtract and
> C for multiply, but not so hard to remember.)
>
> If they really wanted to reduce code size, they should have added
> a load indirect register instruction. (RR format.) A good
> fraction of L (load) instructions have both base and offset
> zero, (or, equivalently, index and offset).

Agreed. And had they wanted to, a single opcode for the standard
prolog & epilog; for example:

STM 14,12,12(13)       save registers 14 through 12 in caller's save area
LR 12,15               entry address in R15 becomes the base register
LA 15,SAVE             address of this routine's own save area
ST 15,8(13)            forward-chain it from the caller's save area
ST 13,4(15)            back-chain the caller's save area from ours
LR 13,15               make the new save area current

could have been the single op

ENTRY SAVE

It was macros every time, which is in direct opposition to Albert's
assertion.

>
> -- glen

Albert van der Horst

unread,
Apr 4, 2013, 9:17:31 PM4/4/13
to
In article <kjkpnp$qdp$1...@dont-email.me>, rickman <gnu...@gmail.com> wrote:
>On 4/4/2013 8:44 AM, glen herrmannsfeldt wrote:
>> In comp.arch.fpga Arlet Ottens<usen...@c-scape.nl> wrote:
>>> On 04/04/2013 01:16 PM, Albert van der Horst wrote:
>>
>>>>>> You mentioned code density. AISI, code density is purely a CISC
>>>>>> concept. They go together and are effectively inseparable.
>>
>>>>> They do go together, but I am not so sure that they are inseperable.
>>
>>>>> CISC began when much coding was done in pure assembler, and anything
>>>>> that made that easier was useful. (One should figure out the relative
>>>>> costs, but at least it was in the right direction.)
>>
>>>> But, of course, this is a fallacy. The same goal is accomplished by
>>>> macro's, and better. Code densitity is the only valid reason.
>
>Albert, do you have a reference about this?

Not in a wikipedia sense where you're not allowed to mention original
research, and are only quoting what is in the books. It is more experience
than school knowledge.

If you want to see what can be done with a good macro processor like m4
study the one source of the 16/32/64 bit ciforth x86 for linux/Windows/Apple.
See my site below.

The existence of an XLAT instruction (to name an example) OTOH does virtually
nothing to make the life of an assembler programmer better.

Groetjes Albert
>
>--
>
>Rick

glen herrmannsfeldt

unread,
Apr 5, 2013, 12:10:47 AM4/5/13
to
In comp.arch.fpga Alex McDonald <bl...@rivadpm.com> wrote:
> On Apr 3, 5:34 pm, "Rod Pemberton" <do_not_h...@notemailnotq.cpm>
> wrote:
>> CISC was
>> typically little-endian to reduce the space needed for integer
>> encodings.

> As long as you discount IBM mainframes. They are big endian. Or
> Borroughs/Unisys; they were big endian too. Or the Motorola 68K; it
> was big endian. For little-endian CISC, only the VAX and x86 come to
> mind. Of those only the x86 survives.

To me, the only one where little endian seems reasonable is the 6502.
They did amazingly well with a small number of gates. Note, for one,
that on subroutine call the 6502 doesn't push the address of the next
instruction on the stack. That would have taken too much logic.
It pushes the address minus one, as, it seems, that is what is
in the register at the time. RET adds one after the pop.

Two byte addition is slightly easier in little endian order, but
only slightly. It doesn't help at all for multiply and divide.

VAX was little endian because the PDP-11 was, though I am not sure
that there was a good reason for that.

-- glen

Alex McDonald

unread,
Apr 5, 2013, 2:49:33 AM4/5/13
to
On Apr 4, 9:10 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
.
>
> VAX was little endian because the PDP-11 was, though I am not sure
> that there was a good reason for that.
>
> -- glen

John Savard's take on it: http://www.plex86.org/Computer_Folklore/Little-Endian-1326.html

Mark Wills

unread,
Apr 5, 2013, 3:41:38 AM4/5/13
to

Exactly. The CISC moniker was simply applied to an entire *generation*
of processors by the RISC guys. The term RISC was coined by David
Patterson at the University of California between 1980 and 1984.
http://en.wikipedia.org/wiki/Berkeley_RISC

*Until* that time, there was no need for the term "CISC", because
there was no RISC concept that required a differentiation!

I always thought of the Z80 as a CISC 8-bitter; it has some useful
memory move instructions (LDIR etc).

Mark Wills

unread,
Apr 5, 2013, 3:51:27 AM4/5/13
to
On Apr 5, 1:07 am, rickman <gnu...@gmail.com> wrote:
> On 4/4/2013 7:16 AM, Albert van der Horst wrote:
>
>
>
>
>
> > In article<kjin8q$so...@speranza.aioe.org>,
> > glen herrmannsfeldt<g...@ugcs.caltech.edu>  wrote:
> >> In comp.arch.fpga Rod Pemberton<do_not_h...@notemailnotq.cpm>  wrote:
> >>> "rickman"<gnu...@gmail.com>  wrote in message
> >>>news:kjf48e$5qu$1...@dont-email.me...
> >>>> Weren't you the person who brought CISC into this discussion?
>
> >>> Yes.
>
> >>>> Why are you asking this question about CISC?
>
> >>> You mentioned code density.  AISI, code density is purely a CISC
> >>> concept.  They go together and are effectively inseparable.
>
> >> They do go together, but I am not so sure that they are inseperable.
>
> >> CISC began when much coding was done in pure assembler, and anything
> >> that made that easier was useful. (One should figure out the relative
> >> costs, but at least it was in the right direction.)
>
> > But, of course, this is a fallacy. The same goal is accomplished by
> > macro's, and better. Code densitity is the only valid reason.
>
> I'm pretty sure that conclusion is not correct.  If you have an
> instruction that does two or three memory accesses in one instruction
> and you replace it with three instructions that do one memory access
> each, you end up with two extra memory accesses.  How is this faster?
>
> That is one of the reasons why I want to increase code density, in my
> machine it automatically improves execution time as well as reducing the
> amount of storage needed.
>
> --
>
> Rick

I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a sequential process.

Arlet Ottens

unread,
Apr 5, 2013, 8:31:33 AM4/5/13
to
On 04/05/2013 09:51 AM, Mark Wills wrote:

>> I'm pretty sure that conclusion is not correct. If you have an
>> instruction that does two or three memory accesses in one instruction
>> and you replace it with three instructions that do one memory access
>> each, you end up with two extra memory accesses. How is this faster?
>>
>> That is one of the reasons why I want to increase code density, in my
>> machine it automatically improves execution time as well as reducing the
>> amount of storage needed.

> I think you're on the right track. With FPGAs it's really quite simple
> to execute all instructions in a single cycle. It's no big deal at all
> - with MPY and DIV being exceptions. In the 'Forth CPU world' even
> literals can be loaded in a single cycle. It then comes down to
> careful selection of your instruction set. With a small enough
> instruction set one can pack more than one instruction in a word - and
> there's your code density. If you can pack more than one instruction
> in a word, you can execute them in a single clock cycle. With added
> complexity, you may even be able to execute them in parallel rather
> than as a process.

Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.


Mark Wills

unread,
Apr 5, 2013, 10:33:10 AM4/5/13
to
> since it makes instruction decoding more efficient and simpler.

If you're referring to general purpose CPUs I'm inclined to agree.
When commenting earlier, I had in mind a Forth CPU, which executes
Forth words as native CPU instructions. That is, the instruction set
is Forth.

Since there is no need for things like addressing modes in (shall we
say) classical Forth processors then you don't actually need things
like bit-fields in which to encode registers and/or addressing modes*.
All you're left with is the instructions themselves. And you don't
need that many bits for that.

A Forth chip that I am collaborating on right now has two 6-bit
instruction slots per 16-bit word, and a 4-bit 'special' field for
other stuff. We haven't allocated all 64 instructions yet.

* Even though it's not strictly necessary in a classical registerless
Forth CPU, bit-fields can be useful. We're using a couple of bits to
tell the ALU if a word pushes results or pops arguments, for example.
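
(For what it's worth, here is a small C sketch of packing and unpacking
that 6/6/4 word layout.  The bit placement, with slot 0 in the low bits
and the 'special' field at the top, is an assumption for illustration;
the post doesn't say how the fields are actually ordered.)

#include <stdint.h>

/* Two 6-bit instruction slots plus a 4-bit "special" field in one
   16-bit word. */
static uint16_t pack(uint8_t slot0, uint8_t slot1, uint8_t special)
{
    return (uint16_t)(((special & 0x0Fu) << 12) |
                      ((slot1   & 0x3Fu) <<  6) |
                       (slot0   & 0x3Fu));
}

static void unpack(uint16_t w, uint8_t *slot0, uint8_t *slot1,
                   uint8_t *special)
{
    *slot0   =  w        & 0x3F;
    *slot1   = (w >>  6) & 0x3F;
    *special = (w >> 12) & 0x0F;
}

int main(void)
{
    uint8_t s0, s1, sp;
    unpack(pack(0x21, 0x3F, 0xA), &s0, &s1, &sp);
    return (s0 == 0x21 && s1 == 0x3F && sp == 0xA) ? 0 : 1;
}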

Arlet Ottens

unread,
Apr 5, 2013, 11:01:27 AM4/5/13
to
Well, I was thinking about general purpose applications. A Forth CPU may
map well to a Forth program, but you also have to take into account how
well the problem you want to solve maps to a Forth program.

A minimal stack based CPU can be efficient if the values you need can be
kept on top of the stack. But if you need 5 or 6 intermediate values,
you'll need to store them in memory, resulting in expensive access to
them. Even getting the address of a memory location where a value is
stored can be expensive.

Compare that to a register based machine with 8 registers. You need 3
more opcode bits, but you get immediate access to a pool of 8
intermediate values. And, with some clever encoding (like rickman
suggested) some operations can be restricted to a subset of the
registers, relaxing the number of encoding bits required.

It would be interesting to see a comparison using non-trivial
applications, and see how much code is required for one of those minimal
stack CPUs compared to a simple register based CPU.

Brad Eckert

unread,
Apr 5, 2013, 1:14:59 PM4/5/13
to
On Friday, April 5, 2013 5:31:33 AM UTC-7, Arlet Ottens wrote:
> Multiple instructions per word sounds like a bad idea. It requires
> instructions that are so small that they can't do very much, so you need
> more of them. And if you need 2 or more small instructions to do
> whatever 1 big instruction does, it's better to use 1 big instruction
> since it makes instruction decoding more efficient and simpler.

I think he's referring to Novix style stack machines, which are VLIW-like in that bit fields in the instruction directly control certain functional units, but the chip is simple enough (having implicit operands) that the VLIW instruction is still only 16 bits wide.

Bowman's J1 is a nice (although specialized) example. 200 lines of Verilog and a 69% code size reduction compared to the C/microblaze version of the same app.

Brad Eckert

unread,
Apr 5, 2013, 1:21:24 PM4/5/13
to era...@forth.com
On Thursday, April 4, 2013 11:36:55 AM UTC-7, Elizabeth D. Rather wrote:
>
> Interrupts are good. I don't know why people worry about them so!
>
If you're designing your own CPU, they tend to bite you where the sun doesn't shine. Interrupts are a corner case that demands thorough testing.

Cooperative multitasking goes a long way, especially if hardware takes care of any tight time constraints.

glen herrmannsfeldt

unread,
Apr 5, 2013, 1:33:30 PM4/5/13
to
In comp.arch.fpga Arlet Ottens <usen...@c-scape.nl> wrote:

(snip on MISC, RISC, and CISC)

> Multiple instructions per word sounds like a bad idea. It requires
> instructions that are so small that they can't do very much, so you need
> more of them. And if you need 2 or more small instructions to do
> whatever 1 big instruction does, it's better to use 1 big instruction
> since it makes instruction decoding more efficient and simpler.

Well, it depends on your definition of instruction.

The CDC 60 bit computers have multiple instructions per 60 bit word,
but then the word is big enough.

Now, consider that on many processors an instruction can push or pop
only one register to/from the stack.

The 6809 push and pop instructions have a bit mask that allows up
to eight register to be pushed or popped in one instruction.
(It takes longer for more, but not as long as for separate
instructions.)

For x87, there are instructions that do some math function and
then do, or don't, remove values from the stack. Some might count
those as two or three instructions.

-- glen

Elizabeth D. Rather

unread,
Apr 5, 2013, 1:53:26 PM4/5/13
to
I don't know about hardware issues, but I do know that cooperative
multitasking and interrupts work well together in an appropriately
designed program. I have addressed these design issues in several of my
books.

Arlet Ottens

unread,
Apr 5, 2013, 2:13:54 PM4/5/13
to
On 04/05/2013 07:14 PM, Brad Eckert wrote:
> On Friday, April 5, 2013 5:31:33 AM UTC-7, Arlet Ottens wrote:
>> Multiple instructions per word sounds like a bad idea. It requires
>> instructions that are so small that they can't do very much, so you need
>> more of them. And if you need 2 or more small instructions to do
>> whatever 1 big instruction does, it's better to use 1 big instruction
>> since it makes instruction decoding more efficient and simpler.
>
> I think he's referring to Novix style stack machines, which are VLIW-like in that bit fields in the instruction directly control certain functional units, but the chip is simple enough (having implicit operands) that the VLIW instruction is still only 16 bits wide.

Yes, I've seen those, but I don't think the code density is going to be
all that impressive, because the individual 5 (or so) bit wide
instructions are very limited in what they can do. Given typical groups
of instructions like alu, call, return, branch, you need some bits to
designate the group. Now, if you only have 5 bits, there's not going to
be much left to actually define the operation or the operands.

So, they do stuff like leave out the subtract operator, and use
negate/add instead. But if you're going to implement an adder in an
FPGA, it doesn't require any more resources to make it a full
adder/subtracter. The bit inversion can be done for free, and in the
same time. Likewise, once you set up a path from memory, though muxes,
through an ALU, back to memory, it's better to make it a big fat ALU
rather than a simple one. This requires a few more bits in the opcode
space, though.

> Bowman's J1 is a nice (although specialized) example. 200 lines of Verilog and a 69% code size reduction compared to the C/microblaze version of the same app.

But the J1 uses 16 bit instructions, not VLIW-style multiple
instructions per word.

J1 also doesn't look very code dense to me. Agreed, it's better than
microblaze, but microblaze isn't exactly tough competition.

For instance, J1 spends half the opcode space on literals, but out of
the possible 32k literals, the typical program may only use a few dozen.
That means that most of the literal space is wasted.

75% of the remaining instructions contain a 13 bit address, but most
addresses are not going to be valid entry points, so most of those are
going to be wasted too.

The good part about J1 is the small implementation, mostly due to an
instruction encoding that doesn't require much hardware. A lot of the
mux selector bits are coming straight out of the instruction.
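
(For reference, a C sketch of the top-level decode Arlet is describing:
one bit spent on literals, the next bits selecting jump / conditional
branch / call / ALU, with a 13-bit target in the rest of the word.  The
exact bit positions and class ordering are assumptions here; check
Bowman's J1 paper for the real layout.)

#include <stdint.h>
#include <stdio.h>

/* J1-like 16-bit instruction: bit 15 set means a 15-bit literal,
   otherwise bits 14..13 pick the class and bits 12..0 hold a 13-bit
   target (or the ALU control bits). */
static void decode(uint16_t insn)
{
    if (insn & 0x8000) {
        printf("LIT     %u\n", insn & 0x7FFFu);
        return;
    }
    uint16_t low13 = insn & 0x1FFFu;
    switch ((insn >> 13) & 0x3u) {
    case 0: printf("JMP     %u\n", low13); break;
    case 1: printf("0BRANCH %u\n", low13); break;
    case 2: printf("CALL    %u\n", low13); break;
    case 3: printf("ALU     %04x\n", (unsigned)low13); break;
    }
}

int main(void)
{
    decode(0x8005);   /* literal 5                     */
    decode(0x4010);   /* call to address 16 (assumed)  */
    return 0;
}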


Arlet Ottens

unread,
Apr 5, 2013, 2:15:51 PM4/5/13
to
On 04/05/2013 08:13 PM, Arlet Ottens wrote:
> On 04/05/2013 07:14 PM, Brad Eckert wrote:
>> On Friday, April 5, 2013 5:31:33 AM UTC-7, Arlet Ottens wrote:
>>> Multiple instructions per word sounds like a bad idea. It requires
>>> instructions that are so small that they can't do very much, so you need
>>> more of them. And if you need 2 or more small instructions to do
>>> whatever 1 big instruction does, it's better to use 1 big instruction
>>> since it makes instruction decoding more efficient and simpler.
>>
>> I think he's referring to Novix style stack machines, which are
>> VLIW-like in that bit fields in the instruction directly control
>> certain functional units, but the chip is simple enough (having
>> implicit operands) that the VLIW instruction is still only 16 bits wide.

Just after I hit 'send' on my previous message, I understood you meant
something different than what I had assumed. Sorry about the confusion !



Jecel

unread,
Apr 5, 2013, 5:30:19 PM4/5/13
to
On Wednesday, April 3, 2013 10:54:51 AM UTC-3, Rickman wrote:

> Its interesting that at the time the 9900 instruction set was designed,
> the processor didn't really run faster than the memory. Then the
> processors got a lot faster when using internal registers and the
> workspace pointer idea became a bottleneck.

Yes, computer history is more like a spiral than a line. The future often looks more like the past than like the present. While memory was actually twice as fast as microprocessors in the 1970s (which allowed the Apple II and similar designs to get video for free), TTL based minicomputer processors (like the TI990) were faster than memory and that is the technology for which the instruction set was designed.

> Now in FPGAs, internal RAM
> is not any slower than registers really, so the workspace pointer can
> work again.

Which was the trick that the Transputer used. By the time it came along, you could have a 33MHz microprocessor connected to a 5MHz memory. So it had a 4KB internal RAM that matched the processor speed. You didn't have to point the workspace pointer to it, but it was much faster when you did.

> The limitation on memory now is the number of ports. It
> would be much more useful if FPGA memory were triple ported allowing
> more to happen in a single clock cycle.

In my designs the Block RAMs run at twice the clock of the processor, so with two ports I get up to four data transfers per processor clock cycle. This matches the features of the Xilinx Virtex 4 and Spartan 6 chips that I use.

> I don't remember when or where I read this, but the most common literals
> are small values. So it can be useful to have even just a 4 bit literal
> for these cases saving a full word of memory.

Most articles about the m-code instruction set (for Wirth's Lillith workstation) have data about the output of the Modula-2 compiler which confirm what you said.

> I have a few more ideas on optimizations for the register based design,
> then I will go back to writing code. I think the best way to optimize
> the design is to write code and see where the problems are. Looking at
> optimizations in the register based code has even given me ideas on how
> to improve the stack based design.

Testing actual code is very important. For the 12 bit Forth processor I mentioned before, I used a trivial benchmark of adding two vectors and saving the result in a third vector. The first design didn't have an ADD instruction, but instead built that up from NAND and ROL (rotate left). Something like:

: NOT DUP NAND ;
: XOR OVER OVER NOT NAND >R SWAP NOT NAND R> NAND ;

and so on until I got to ADD. Each 12 bit addition took 700 clock cycles this way and the benchmark had lots and lots of ADDs. So it seemed obvious that it was worth going from 12 to 13 instructions by including a hardware adder which could do the job in a single clock cycle. I was very disappointed when the benchmark only became 5 times faster. That is a significant improvement, of course, but nothing like the 700 times faster that the often used instruction had become.

For other designs, I used the tree search algorithm that Jan Gray showed in his Circuit Cellar articles:

http://www.fpgacpu.org/xsoc/cc.html

-- Jecel

rickman

unread,
Apr 5, 2013, 5:53:55 PM4/5/13
to
On 4/5/2013 5:30 PM, Jecel wrote:
> On Wednesday, April 3, 2013 10:54:51 AM UTC-3, Rickman wrote:
>
>> Its interesting that at the time the 9900 instruction set was designed,
>> the processor didn't really run faster than the memory. Then the
>> processors got a lot faster when using internal registers and the
>> workspace pointer idea became a bottleneck.
>
> Yes, computer history is more like a spiral than a line. The future often looks more like the past than like the present. While memory was actually twice as fast as microprocessors in the 1970s (which allowed the Apple II and similar designs to get video for free), TTL based minicomputer processors (like the TI990) were faster than memory and that is the technology for which the instruction set was designed.
>
>> Now in FPGAs, internal RAM
>> is not any slower than registers really, so the workspace pointer can
>> work again.
>
> Which was the trick that the Transputer used. By the time it came along, you could have a 33MHz microprocessor connected to a 5MHz memory. So it had a 4KB internal RAM that matched the processor speed. You didn't have to point to workspace pointer to it, but it was much faster when you did.
>
>> The limitation on memory now is the number of ports. It
>> would be much more useful if FPGA memory were triple ported allowing
>> more to happen in a single clock cycle.
>
> In my designs the Block RAMs run at twice the clock of the processor, so with two ports I get up to four data transfers per processor clock cycle. This matches the features of the Xilinx Virtex 4 and Spartan 6 chips that I use.

That is a very interesting idea. But it would be hard to do I think.
The logic driving the memory then has to work at the double speed. This
makes timing constraints a real PITA. That's another nice feature of a
CPU that always runs in one clock cycle, no long paths to specify.


>> I don't remember when or where I read this, but the most common literals
>> are small values. So it can be useful to have even just a 4 bit literal
>> for these cases saving a full word of memory.
>
> Most articles about the m-code instruction set (for Wirth's Lillith workstation) have data about the output of the Modula-2 compiler which confirm what you said.

I think it was when I learned about the instruction set of some new
computer... 30 years ago, lol. I was told they had a four bit literal
instruction for those short literals because they were needed a lot more
often than the full machine word size. I don't recall if the field was
signed or not.


>> I have a few more ideas on optimizations for the register based design,
>> then I will go back to writing code. I think the best way to optimize
>> the design is to write code and see where the problems are. Looking at
>> optimizations in the register based code has even given me ideas on how
>> to improve the stack based design.
>
> Testing actual code is very important. For the 12 bit Forth processor I mentioned before, I used a trivial benchmark of adding two vectors and saving the result in a third vector. The first design didn't have an ADD instruction, but instead built that up from NAND and ROL (rotate left). Something like:
>
> : NOT DUP NAND ;
> : XOR OVER OVER NOT NAND >R SWAP NOT NAND R> NAND ;
>
> and so on until I got to ADD. Each 12 bit addition took 700 clock cycles this way and the benchmark had lots and lots of ADDs. So it seemed obvious that it was worth going from 12 to 13 instructions by including a hardware adder which could do the job in a single clock cycle. I was very disappointed when the benchmark only became 5 times faster. That is a significant improvement, of course, but nothing like the 700 times faster that the often used instruction had become.
>
> For other designs, I used the tree search algorithm that Jan Gray showed in his Circuit Cellar articles:
>
> http://www.fpgacpu.org/xsoc/cc.html

I'm more concerned with highly embedded apps like the stuff I do. I'm
not really trying to design a general purpose processor or even one to
run Forth really. I am just trying to get some work done... although I
spend far, far too much time on this. It is in no small part a bit like
a hobby. But as part of my work it also has to be useful.

The original goal of my MISC was to run Forth. But now Forth is just a
tool I will likely use to write the assembler for whatever machine I end
up using. I'm not good enough at Forth to write a cross compiler yet.

--

Rick

rickman

unread,
Apr 5, 2013, 6:00:34 PM4/5/13
to
On 4/4/2013 9:17 PM, Albert van der Horst wrote:
> In article<kjkpnp$qdp$1...@dont-email.me>, rickman<gnu...@gmail.com> wrote:
>>
>> Albert, do you have a reference about this?
>
> Not in a wikipedia sense where you're not allowed to mention original
> research, and are only quoting what is in the books. It is more experience
> that school knowledge.
>
> If you want to see what can be done with a good macro processor like m4
> study the one source of the 16/32/64 bit ciforth x86 for linux/Windows/Apple.
> See my site below.
>
> The existance of an XLAT instruction (to name an example) OTOH does virtually
> nothing to make the life of an assembler programmer better.
>
> Groetjes Albert

Ok, I'm a bit busy with a number of things to go into this area now, but
I appreciate the info.

I have used macro assemblers in the past and got quite good at them. I
even developed a micro programmed board which used a macro assembler and
added new opcodes for my design (it was a company boilerplate design
adapted to a new host) to facilitate some self test functions. I sorta
got yelled at because this meant you needed my assembler adaptations to
assemble the code for the board. I wasn't ordered to take it out, so I
didn't; it remained. I left within a year and the company eventually
folded. You can make a connection if you wish... lol

--

Rick

rickman

unread,
Apr 5, 2013, 6:08:47 PM4/5/13
to
I have looked at the multiple instruction in parallel thing and have not
made any conclusions yet. To do that you need a bigger instruction word
and smaller instruction opcodes. The opcodes essentially have to become
specific for the execution units. My design has three, the data stack,
the return stack and the instruction fetch. It is a lot of work to
consider this because there are so many tradeoffs to analyze.

One issue that has always bugged me is that allocating some four or five
bits for the instruction fetch instruction seems very wasteful when some
70-90% of the time the instruction is IP < IP+1. Trying to Huffman
encode this is a bit tricky as what do you do with the unused bit??? I
gave up looking at this until after I master the Rubik's Cube. lol

It does clearly have potential, it's just a bear to unlock without
adding a lot of data pathways to the design.

--

Rick

rickman

unread,
Apr 5, 2013, 6:25:01 PM4/5/13
to
Just to make a point, my original post wasn't really about Forth
processors specifically. It was about MISC processors which may or may
not be programmed in Forth.


> A minimal stack based CPU can be efficient if the values you need can be
> kept on top of the stack. But if you need 5 or 6 intermediate values,
> you'll need to store them in memory, resulting in expensive access to
> them. Even getting the address of a memory location where a value is
> stored can be expensive.

There aren't many algorithms that can't be dealt with reasonably using
the data and return stacks. The module I was coding that made me want
to try a register approach has four input variables, two double
precision. These are in memory and have to be read in because this is
really an interrupt routine and there is no stack input. The main
process would have access to these parameters to update them.

The stack routine uses up to 9 levels on the data stack currently. But
that is because to optimize the execution time I was waiting until the
end when I could save them off more efficiently all at once. But - in
the process of analyzing the register processor I realized that I had an
unused opcode that would allow me to save the parameters in the same way
I am doing it in the register based code, reducing the stack usage to
five items max.

I understand that many stack based machines only have an 8 level data
stack, period. The GA144 is what, 10 words, 8 circular and two registers?


> Compare that to a register based machine with 8 registers. You need 3
> more opcode bits, but you get immediate access to a pool of 8
> intermediate values. And, with some clever encoding (like rickman
> suggested) some operations can be restricted to a subset of the
> registers, relaxing the number of encoding bits required.

That's the whole enchilada with a register MISC, figuring out how to
encode the registers in the opcodes. I think I've got a pretty good
trade off currently, but I have not looked at running any Forth code on
it yet. This may be much less efficient than a stack machine... duh!


> It would be interesting to see a comparison using non-trivial
> applications, and see how much code is required for one of those minimal
> stack CPUs compared to a simple register based CPU.

I have been invited to give a presentation to the SVFIG on my design. I
need to work up some details and will let you know when that will be.
They are talking about doing a Google+ hangout which I suppose is a
video since it would be replayed for the meeting. I'm not sure I can be
ready in time for the next meeting, but I'll see. I don't think my
design is all that novel in the grand scheme of MISC. So I likely will
do this as a comparison between the two designs. I think I also need to
bone up on some of the other designs out there like the J1 and the B16.

--

Rick

AKE

unread,
Apr 5, 2013, 6:29:56 PM4/5/13
to
On Friday, March 29, 2013 9:00:50 PM UTC, rickman wrote:
>
> In the thread on stack ops it was pointed out repeatedly that very often
> the stack operands would be optimized to register operands, meaning they
> wouldn't need to do the stack ops at all really. So I took a look at a
> register based MISC design. Guess what, I don't see the disadvantage!

Anton Ertl has an interesting page here in which he compares the performance of a number of Forths on a collection of benchmarks.

http://www.complang.tuwien.ac.at/forth/performance.html

The relevant quote is below:

'These results also show that the following statement has not become outdated yet: ``The resulting machine code is a factor of two or three slower than the equivalent code written in machine language, due mainly to the use of the stack rather than registers.'' [Ros86]'

If I'm understanding this correctly, I think it is in support of your thought that MISC + registers may be a winning combination.

AKE

rickman

unread,
Apr 5, 2013, 6:49:17 PM4/5/13
to
People keep quoting the 200 lines of Verilog with no mention of how many
LUTs/FFs it uses. I've never even counted the lines of code in any of my
designs because that is not a useful metric... is it?

I have never looked hard at the J1, but I did check it out earlier
today. It uses some of the same ideas as my design but with a larger
word size. The common stuff is the Huffman'ish encoding of literals and
branch instructions. I should compare the J1 to my design in terms of
code density. I looked for a benchmark on the J1, but J1 doesn't do
well in Google searches.

--

Rick

rickman

unread,
Apr 5, 2013, 6:54:28 PM4/5/13
to
5 bit instructions are only used on a few MISC machines.


>> Bowman's J1 is a nice (although specialized) example. 200 lines of
>> Verilog and a 69% code size reduction compared to the C/microblaze
>> version of the same app.
>
> But the J1 uses 16 bit instructions, not VLIW-style multiple
> instructions per word.
>
> J1 also doesn't look very code dense to me. Agreed, it's better than
> microblaze, but microblaze isn't exactly tough competition.
>
> For instance, J1 spends half the opcode space on literals, but out of
> the possible 32k literals, the typical program may only use a few dozen.
> That means that most of the literal space is wasted.

Check Koopman's book. I believe literals are way up near the top of the
usage list for most apps.


> 75% of the remaining instructions contain a 13 bit address, but most
> address are not going to be valid entry points, so most of those are
> going to be wasted too.

Uh, that's true for any method of generating an address. I don't think
this is a valid way of evaluating opcode usage.


> The good part about J1 is the small implementation, mostly due to an
> instruction encoding that doesn't require much hardware. A lot of the
> mux selector bits are coming straight out of the instruction.

How small is it? Maybe I should download the code and test it myself.

--

Rick

rickman

unread,
Apr 5, 2013, 7:32:07 PM4/5/13
to
I'm not sure this is still a relevant quote. The footnote is Ros86,
which I believe means it is from 1986, no? That's nearly 30 years ago!

--

Rick

Anton Ertl

unread,
Apr 6, 2013, 10:16:19 AM4/6/13
to
So what? The part that AKE cited says that it had not become outdated
when the page above was written (which was in 1996 or shortly after).
If it held true for so long, it might still hold true today. If you
want to know, go out and measure it! The page contains the measured
code, so it's not that hard. Current iForth and VFX might do better
(or maybe not), but for everything else I expect that the factor of
2-3 still holds.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2013: http://www.euroforth.org/ef13/

rickman

unread,
Apr 6, 2013, 3:50:36 PM4/6/13
to
On 4/6/2013 10:16 AM, Anton Ertl wrote:
> rickman<gnu...@gmail.com> writes:
>> On 4/5/2013 6:29 PM, AKE wrote:
>>> Anton Ertl has an interesting page here in which he compares the performance of a number of Forths on a collection of benchmarks.
>>>
>>> http://www.complang.tuwien.ac.at/forth/performance.html
>>>
>>> The relevant quote is below:
>>>
>>> 'These results also show that the following statement has not become outdated yet: ``The resulting machine code is a factor of two or three slower than the equivalent code written in machine language, due mainly to the use of the stack rather than registers.'' [Ros86]'
>>>
>>> If I'm understanding this correctly, I think it is in support of your thought that MISC + registers may be a winning combination.
>>
>> I'm not sure this is still a relevant quote. The footnote is Ros86
>> which I believe means it was in 1986, no? That nearly 30 years ago!
>
> So what? The part that AKE cited says that it had not become outdated
> when the page above was written (which was in 1996 or shortly after).
> If it held true for so long, it might still hold true today. If you
> want to know, go out an measure it! The page contains the measured
> code, so it's not that hard. Current iForth and VFX might do better
> (or maybe not), but for everything else I expect that the factor of
> 2-3 still holds.

Except that unless I am missing something this doesn't hold. The
analysis completely ignores the results of FLK which appears to be a
Forth. The reported speeds are 0.83 to 1.99 compared to f2c.opt and
compare similarly to the hand coded C, and it is even faster in the fib
benchmark, where both references score a 1.00.

Is there a reason why this Forth was tested, but not included in the
analysis?

--

Rick

Jecel

unread,
Apr 6, 2013, 4:27:06 PM4/6/13
to
On Friday, April 5, 2013 6:53:55 PM UTC-3, Rickman wrote:

> I'm more concerned with highly embedded apps like the stuff I do. I'm
> not really trying to design a general purpose processor or even one to
> run Forth really. I am just trying to get some work done... although I
> spend far, far too much time on this. It is in no small part a bit like
> a hobby. But as part of my work it also has to be useful.

The benchmark I used for evaluating embedded applications for my designs was the ability to set, for example, 3 bits in an 8 bit I/O register to a given value. This is normally rather awkward for a stack machine, so for some projects I had a separate stack processor for running user code and a very simple register processor for I/O. On another project, I added the following four instructions to the stack processor so it could do the job itself:

test - the result of ANDing T with the data from the i/o port
is pushed on the stack

clear - the result of ANDing the inverse of T with the data
from the i/o port is saved to the port

set - the result of ORing T with the data from the i/o port
is saved to the port

toggle - the result of XORing T with the data from the i/o
port is saved to the port

As you might imagine, "T" is the register holding the value of the top of the data stack. But in this version I had moved from 5 to 8 bit instructions so it was easy to add stuff like this. The improvement in the benchmarks was very impressive.
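
In hardware it was little more than a read-modify-write path on the port
register. Something along these lines in Verilog (only a sketch to show
the idea; the module and signal names here are invented, not the actual
design):

    module io_rmw (
        input            clk,
        input            io_op_en,    // one of the four opcodes is executing
        input      [1:0] io_op,       // 00 test, 01 clear, 10 set, 11 toggle
        input      [7:0] t,           // top of the data stack
        output reg [7:0] port_q,      // the thread's dedicated i/o port register
        output     [7:0] test_result  // value pushed on the stack by "test"
    );
        // test: AND of T with the port data, pushed on the stack
        assign test_result = port_q & t;

        // clear/set/toggle: read the port, modify with T, write it back,
        // all within the same clock cycle
        always @(posedge clk)
            if (io_op_en)
                case (io_op)
                    2'b01: port_q <= port_q & ~t;  // clear
                    2'b10: port_q <= port_q |  t;  // set
                    2'b11: port_q <= port_q ^  t;  // toggle
                    default: ;                     // test leaves the port alone
                endcase
    endmodule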

-- Jecel

Jecel

unread,
Apr 6, 2013, 4:37:26 PM4/6/13
to
On Saturday, April 6, 2013 5:27:06 PM UTC-3, I wrote:

> toggle - the result of XORing T with the data from the i/o
> port is saved to the port

I suppose "the i/o port" is confusing since I didn't mention any address. This processor could run 16 different threads and 8 of them had one dedicated i/o port each. A thread could address other i/o ports as well, but not with these 4 special instructions.

Another detail is that these instructions read data from the i/o port, modify it and then put it back. This fits in the normal operation of the processor and happens in a single clock cycle. You would need three normal stack instructions otherwise, which would take up more memory and 3 clock cycles.

-- Jecel
http://www.merlintec.com/swiki/hardware/9.html is more optimized for Smalltalk than for Forth

rickman

unread,
Apr 6, 2013, 5:32:03 PM4/6/13
to
On 4/5/2013 3:51 AM, Mark Wills wrote:
>
> I think you're on the right track. With FPGAs it's really quite simple
> to execute all instructions in a single cycle. It's no big deal at all
> - with MPY and DIV being exceptions. In the 'Forth CPU world' even
> literals can be loaded in a single cycle. It then comes down to
> careful selection of your instruction set. With a small enough
> instruction set one can pack more than one instruction in a word - and
> there's your code density. If you can pack more than one instruction
> in a word, you can execute them in a single clock cycle. With added
> complexity, you may even be able to execute them in parallel rather
> than as a process.

This morning I found one exception to the one cycle rule. Block RAM.
My original design was for the Altera ACEX part which is quite old now,
but had async read on the block rams for memory and stack. So the main
memory and stacks could be accessed for reading and writing in the same
clock cycle, read/modify/write. You can't do that with today's block
RAMs, they are totally synchronous.

I was looking at putting the registers in a block RAM so I could get two
read ports and two write ports. But this doesn't work the way an async
read RAM will. So I may consider using a multiphase clock. The
operations that really mess me up are the register indirect reads of
memory, like stack accesses... or any memory accesses really, since that
is the only way to address memory, via a register. So the address has to be
read from register RAM, then the main memory RAM is read, then the
result is written back to the register RAM. Wow! That is three clock
edges I'll need.

If I decide to go with using block RAM for registers it will give me N
sets of regs so I have a big motivation. It also has potential for
reducing the total amount of logic since a lot of the multiplexing ends
up inside the RAM.

The multiphase clocking won't be as complex as using multiple machine
cycles for more complex instructions. But it is more complex than the
good old simple clock I have worked with. It will also require some
tighter timing and more complex timing constraints which are always hard
to get correct.

--

Rick

rickman

unread,
Apr 6, 2013, 5:37:03 PM4/6/13
to
I think multiple instructions per word is a good idea if you have a wide
disparity in speed between instruction memory and the CPU clock rate.
If not, why introduce the added complexity? Well, unless you plan to
execute them simultaneously... I took a look at a 16 bit VLIW idea
once and didn't care for the result. There are just too many control
points so that a proper VLIW design would need more than just 16 bits I
think. At least in the stack design.

An 18 bit instruction word might be worth looking at in the register
CPU. But getting more parallelism in the register design will require
more datapaths and there goes the "minimal" part of the MISC.

So many options, so little time...

--

Rick

rickman

unread,
Apr 6, 2013, 5:44:56 PM4/6/13
to
That can be useful. The automatic pop of operands is sometimes
expensive because it requires a DUP beforehand. In my design the fetch/store
words have versions to drop the address from the return stack or
increment and hold on to it. In looking at the register design I
realized it would be useful to just make the plus and the drop both
options so now there are three versions, fetch, fetch++ and fetchK (keep).

--

Rick

Marcel Hendrix

unread,
Apr 6, 2013, 7:33:00 PM4/6/13
to
rickman <gnu...@gmail.com> writes Re: MISC - Stack Based vs. Register Based

> On 4/6/2013 10:16 AM, Anton Ertl wrote:
>> rickman<gnu...@gmail.com> writes:
[..]
>> So what? The part that AKE cited says that it had not become outdated
>> when the page above was written (which was in 1996 or shortly after).
>> If it held true for so long, it might still hold true today. If you
>> want to know, go out an measure it! The page contains the measured
>> code, so it's not that hard. Current iForth and VFX might do better
>> (or maybe not), but for everything else I expect that the factor of
>> 2-3 still holds.
[..]

I redid the tests just now (i7 920 @2.66 GHz, Windows 7). Times in ms.

# sieve bubble matmul fib ssa mm-rtcg
# gforth 0.285 0.376 0.223 0.334 0.269 0.322
# gforth-fast 0.250 0.300 0.176 0.267 0.224 0.501
# SF 3.4.2 0.119 0.146 0.042 0.048 0.056 0.541
# Vfx 4.5 0.094 0.109 0.031 0.047 0.031 0.234
# iForth 4.0 0.089 0.088 0.020 0.082 0.037 0.420
# gcc manual 0.071 0.086 0.028 0.052
# f2c 0.070 0.074 0.036 0.052

f2c gforth gforth-fast SF3.4.2 Vfx 4.5 iForth 4.0 gcc
# sieve 1 4.1 3.6 1.7 1.34 1.27 1
# bubble 1 5.1 4.1 1.98 1.5 1.19 1.16
# matmul 1? 6.2 4.9 1.17 0.86 0.56 0.78
# fib 1 6.4 5.13 0.92 0.9 1.58 1

# (old) Webpage results
# f2c gcc iForth Gforth
# sieve 1.00 0.86 2.16 5.05
# bubble 1.00 0.87 2.32 6.25
# matmul 1.00 1.10 2.19 5.92
# fib 1.00 1.00 1.32 3.79

1) The matmul test for f2c core-dumps.
2) The Sieve algorithm for the manual C program is not the same as Forth uses.
3) iForth 4.0 and the C-manual programs are 64-bit programs, the others 32-bit
4) The results for fib are a parlor trick for most compilers

-marcel


AKE

unread,
Apr 7, 2013, 5:57:18 AM4/7/13
to
On Sunday, April 7, 2013 12:33:00 AM UTC+1, Marcel Hendrix wrote:
>
>
> I redid the tests just now (i7 920 @2.66 GHz, Window 7). Times in ms.
>
> # sieve bubble matmul fib ssa mm-rtcg
> # gforth 0.285 0.376 0.223 0.334 0.269 0.322
> # gforth-fast 0.250 0.300 0.176 0.267 0.224 0.501
> # SF 3.4.2 0.119 0.146 0.042 0.048 0.056 0.541
> # Vfx 4.5 0.094 0.109 0.031 0.047 0.031 0.234
> # iForth 4.0 0.089 0.088 0.020 0.082 0.037 0.420
> # gcc manual 0.071 0.086 0.028 0.052
> # f2c 0.070 0.074 0.036 0.052
>
>

Broadly speaking, it looks like these tests segment Forths/compilers into roughly three classes:

Gforth | SwiftForth | VFX/iForth/Gcc/f2c


where moving from left to right gives a non-trivial speed-up (relative to the previous class).

I don't know much about the differences between the various Forths -- what roughly would be the reason why VFX or iForth posts significantly better times than Gforth?

Is it a difference in compiler technology? Threading mode? Or perhaps something about the test cases?

Stephen Pelc

unread,
Apr 7, 2013, 7:02:11 AM4/7/13
to
On Sun, 7 Apr 2013 02:57:18 -0700 (PDT), AKE
<assadebr...@gmail.com> wrote:

>I don't know much about the differences between the various
>Forths -- what roughly would be the reason why VFX or iForth
>posts significantly better times than Gforth?

iForth and VFX Forth use analytical code generators - they
model the Forth data stack at the very least. This reduces
the number of instructions generated, and also reduces the
amount of memory traffic. Far more is carried in registers.

Stephen

--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

Anton Ertl

unread,
Apr 7, 2013, 9:51:32 AM4/7/13
to
Yes. Gforth is based on a threaded code interpreter, with various
techniques that make it a good bit faster (if there is nothing
blocking them; there are quite big variations between builds of Gforth
because of this [*]; I usually see a better speedup from gforth-fast
over gforth). Swiftforth is a subroutine-threaded system with
inlining and high-level peephole optimization (what Gforth calls
static superinstructions). VFX and iForth are more sophisticated
("analytic") compilers that allocate stack items to registers, and
should not have the problem that Rose mentioned to the same degree.

> Or perhaps something about the test cases?

It seems that these small benchmarks are more amenable to optimization
than most application benchmarks.

[*] E.g., from Gforth's Benchres file:

sieve bubble matrix fib
0.152 0.208 0.076 0.244 gforth-fast 0.7.0; Core 2 3GHz (Xeon 5160) 64 bits;
gcc-4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

Anton Ertl

unread,
Apr 7, 2013, 10:13:45 AM4/7/13
to
m...@iae.nl (Marcel Hendrix) writes:
>I redid the tests just now (i7 920 @2.66 GHz, Window 7). Times in ms.
[...]
> f2c gforth gforth-fast SF3.4.2 Vfx 4.5 iForth 4.0 gcc
># sieve 1 4.1 3.6 1.7 1.34 1.27 1
># bubble 1 5.1 4.1 1.98 1.5 1.19 1.16
># matmul 1? 6.2 4.9 1.17 0.86 0.56 0.78
># fib 1 6.4 5.13 0.92 0.9 1.58 1

Thanks.

>2) The Sieve algorithm for the manual C program is not the same as Forth uses.

Does not look that different to me; the Forth program does have some
induction variable optimizations applied manually. The C program also
contains an iteration counter that the Forth program doesn't, and that
might slow it down. I find it remarkable that f2c and c-manual
produced the same results in old times and now, though.

>4) The results for fib are a parlor trick for most compilers

C compilers do cool things with that nowadays (I guess that's the kind
of real-world programs that Andrew referred to elsewhere:-). I would
not expect Forth compilers to do that (it does not help on what counts
as real world in the Forth world); so what parlor trick do you mean?

Anton Ertl

unread,
Apr 7, 2013, 10:22:29 AM4/7/13
to
rickman <gnu...@gmail.com> writes:
>Except that unless I am missing something this doesn't hold. The
>analysis completely ignores the results of FLK which appears to be a
>Forth. The reported speeds are 0.83 to 1.99 compared to f2c.opt and
>compares similarly to the hand coded C an even is faster in the fib
>benchmark where both references score a 1.00.
>
>Is there a reason why this Forth was tested, but not included in the
>analysis?

Looking at the original text (section 4.5 of
<http://www.complang.tuwien.ac.at/papers/ertl96diss.ps.gz>), I see
that FLK did not occur there. So apparently I started with that text,
and later updated it with newer data, but did not rewrite the analysis
when I added FLK to the data.

Marcel Hendrix

unread,
Apr 7, 2013, 11:14:18 AM4/7/13
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes Re: MISC - Stack Based vs. Register Based

> m...@iae.nl (Marcel Hendrix) writes:
>>I redid the tests just now (i7 920 @2.66 GHz, Window 7). Times in ms.
[...]
>> f2c gforth gforth-fast SF3.4.2 Vfx 4.5 iForth 4.0 gcc
>># sieve 1 4.1 3.6 1.7 1.34 1.27 1
>># bubble 1 5.1 4.1 1.98 1.5 1.19 1.16
>># matmul 1? 6.2 4.9 1.17 0.86 0.56 0.78
>># fib 1 6.4 5.13 0.92 0.9 1.58 1

>Thanks.

>>2) The Sieve algorithm for the manual C program is not the same as Forth uses.

> Does not look that different to me; the Forth programs does have some
> induction variable optimizations applied manually. The C program also
> contains an iteration counter that the Forth porgram doesn't, and that
> might slow it down.

OK, I looked again and the core is indeed the same, a nested for loop.
However, both Vfx and iForth are about twice as fast with the 'original'
Forth Sieve. Of course, that could be caused by specific peephole
optimizations.

> I find it remarkable that f2c and c-manual
> produced the same results in old times and now, though.

Yes, I also wondered about that. I compiled with MinGW64's version of gcc
and did not change the options in the makefile (only -O3).

I had to tweak the makefile (to explicitly call bash for the time command, and
to change the ftp path to a local one). The -DUNIX macro didn't work on Windows,
-DMSC fixed it, but in the end I had to hack the timer calls in one of the C
files. The makefile should also have CC=gcc.

Note that the f2c matmul seems to crash, or maybe bash crashed (core dumped,
but I still got output from the `time` command).

>>4) The results for fib are a parlor trick for most compilers

> C compilers do cool things with that nowadays (I guess that's the kind
> of real-world programs that Andrew referred to elsewhere:-). I would
> not expect Forth compilers to do that (it does not help on what counts
> as real world in the Forth world); so what parlor trick do you mean?

FIB is about the only program that I know that benefits so much from
having TOS in a register. Even SF is a lot faster than iForth here, while
it is far slower for all other programs I tested. Maybe it is fundamentally
better to have the TOS in a register. I thought that it would become a hindrance
when the compiler reached a certain level of sophistication, but at least
for iForth (and Vfx) that level is still a bit far off.

-marcel

Anton Ertl

unread,
Apr 7, 2013, 11:31:41 AM4/7/13
to
m...@iae.nl (Marcel Hendrix) writes:
>Note that the f2c matmul seems to crash, or maybe bash crashed (core dumped,
>but I still got output from the `time` command).

Looks to me like a 32/64-bit portability issue; the sizes of the
arrays are hard-coded in bytes in the generated C code. Try compiling
with -m32.

>>>4) The results for fib are a parlor trick for most compilers
>
>> C compilers do cool things with that nowadays (I guess that's the kind
>> of real-world programs that Andrew referred to elsewhere:-). I would
>> not expect Forth compilers to do that (it does not help on what counts
>> as real world in the Forth world); so what parlor trick do you mean?
>
>FIB is about the only program that I know that benefits so much from
>having TOS in a register. Even SF is a lot faster than iForth here, while
>it is far slower for all other programs I tested. Maybe it is fundamentally
>better to have the TOS in a register.

Yes, all the data I have indicates that it is a good idea.

1) http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz

Figure 21 shows the benefits (in terms of reduced loads+stores) of
keeping the TOS in a register between primitives (and why more doesn't
help).

Figure 24 shows that, if you optimize basic blocks and have 1 or 2
registers, keeping the TOS in a register at basic block boundaries is
optimal, while for more registers, keeping TOS and NOS in registers is
slightly better (for the performance model assumed there).

2) http://www.complang.tuwien.ac.at/papers/ertl%26gregg05.ps.gz

Here we see timing data from real machines instead of a performance
model, and the discussion says (in Section 4.4):

|The number of instructions executed is smallest for the canonical
|stack representation with one register (except for some of the smaller
|benchmarks). Similarly, for the single-representation stack caches,
|the one with one register executes the least instructions.
|
|The PPC7400 timings behave quite similar to the instruction counts,
|although the timing reduction is somewhat higher than the instruction
|reduction; on the PPC7447A and especially the PPC970 the times for
|canonical representations with more than one registers rise much more
|slowly (and sometimes not at all).
|
|Nevertheless, even on those CPUs using the one-register representation
|as canonical representation or, for single-representation stack
|caches, as the representation is optimal for many benchmarks, and
|close to optimal on the others.

> I thought that it would become a hindrance
>when the compiler reached a certain level of sophistication, but at least
>for iForth (and Vfx) that level is still a bit far off.

Well, if your compiler works on larger units, it will do much of that
automatically within the units, but at unit boundaries (where you
return to the canonical stack state) it is still an advantage; I think
that, with larger units, keeping more stack items in registers at
boundaries might be beneficial, but have no data on that.

Working on Gforth I originally thought that keeping the TOS in a
register would increase the register pressure, but that turned out not
to be the case. Most primitives load the TOS at some early point
anyway and store it later, and the point of highest register pressure
is between these points; keeping the TOS in a register across NEXT
does not increase register pressure.

Elizabeth D. Rather

unread,
Apr 7, 2013, 3:16:39 PM4/7/13
to
These benchmarks are useful for certain comparisons, but don't
necessarily give an overall picture of the particular implementation as
regards a particular application. For example, the overhead on OS calls
varies tremendously from one implementation to another, as do other
aspects such as multi-thread support (and overhead), user convenience,
debugging aids, etc.

Still, I hope you can see that your original question about optimization
(using stack vs. memory) doesn't have a simple answer!

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Brian Davis

unread,
Apr 7, 2013, 5:59:42 PM4/7/13
to
rickman wrote:
>
> So the main memory and stacks could be accessed for reading and writing
> in the same clock cycle, read/modify/write. You can't do that with today's
> block RAMs, they are totally synchronous.
>
I had the same problem when I first moved my XC4000 based RISC
over to the newer parts with registered Block RAM.

I ended up using opposite edge clocking, with a dual port BRAM,
to get what appears to be single cycle access on the data and
instruction ports.

As this approach uses the same clock, the constraints are painless;
but you now have half a clock for address -> BRAM setup, and half
for the BRAM data <-> core data setup. The latter can cause some
timing issues if the core is configured with a byte lane mux
so as to support 8/16/32 bit {sign extending} loads.
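
In sketch form it is just the usual inferred RAM with the memory clocked
on the other edge (simplified, signal names made up, not the code I
actually use):

    module bram_negedge #(parameter AW = 10, DW = 16) (
        input               clk,      // same clock the core runs on (rising edge)
        input               we,
        input  [AW-1:0]     waddr, raddr,
        input  [DW-1:0]     wdata,
        output reg [DW-1:0] rdata
    );
        reg [DW-1:0] mem [0:(1<<AW)-1];

        // Clocking the RAM on the falling edge means the core sees read
        // data "in the same cycle", at the cost of half a period for the
        // address setup and half a period for the data back to the core.
        always @(negedge clk) begin
            if (we) mem[waddr] <= wdata;
            rdata <= mem[raddr];
        end
    endmodule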

-Brian

Paul Rubin

unread,
Apr 7, 2013, 6:21:08 PM4/7/13
to
Jecel <je...@merlintec.com> writes:
> The academic CRISP architecture... was supposed to be the ultimate C
> processor and it was a stack machine.

I think the stack machines generally discussed in the Forth universe
don't have any way to access data in the interior of the stack. They
are pure LIFO. CRISP by contrast was optimized to access C local
variables by reaching inside stack frames. So it's not comparable to
MISC stack machines.

rickman

unread,
Apr 8, 2013, 1:16:52 AM4/8/13
to
Yes, that was one way to solve the problem. The other I considered was
to separate the read and write on the two ports. Then the read would be
triggered from the address that was at the input to the address
register... from the previous cycle. So the read would *always* be done
and the data presented whether you used it or not. I'm not sure how
much power this would waste, but the timing impact would be small.
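
In other words, feed the RAM's read port from the signal ahead of the
address register, something like this (a rough sketch with invented
names, not my actual design):

    module ram_early_read #(parameter AW = 10, DW = 16) (
        input               clk,
        input               we,
        input  [AW-1:0]     addr_next,  // address *before* the address register
        input  [DW-1:0]     wdata,
        output reg [AW-1:0] addr_r,     // registered address, seen by the rest of the core
        output reg [DW-1:0] rdata       // lines up with addr_r after the clock edge
    );
        reg [DW-1:0] mem [0:(1<<AW)-1];

        always @(posedge clk) begin
            addr_r <= addr_next;
            if (we) mem[addr_r] <= wdata;  // write port uses the registered address
            rdata  <= mem[addr_next];      // read is *always* done, from the next address
        end
    endmodule

To the core the RAM then looks as if it had an async read, at the cost
of that read happening every cycle whether the data is used or not.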

I looked at making the register block RAM part of the main memory
address space. This would require a minimum of three clock cycles in a
machine cycle, read address or data from register, use address to read
or write data from/to memory and then write data to register. If it
helps timing, the memory write can be done at the same time as the
register write. I'm not crazy about this approach, but I'm considering
how useful it would be to have direct address capability of the multiple
register banks.

Some of the comments about register vs. stacks and what I have seen of
the J1 has made me think about a hybrid approach using stacks in memory,
but with offset access, so items further down in the stack can be
operands, not just TOS and NOS. This has potential for saving stack
operations. The J1 has a two bit field controlling the stack pointer; I
assume that is +1 to -2, or 1 push to 2 pops. The author claims this
provides some ability to combine Forth functions into one instruction,
but doesn't provide details. I guess the compiler code would have to be
examined to find out what combinations would be useful.
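
If it is a signed two bit field, the decode is trivial - sign extend and
add to the stack pointer. Something like this (my guess at the
mechanism, not taken from the J1 source):

    module sp_delta (
        input            clk,
        input            en,     // instruction actually modifies the stack
        input      [1:0] dd,     // 01 = +1 push, 00 = 0, 11 = -1 pop, 10 = -2 pops
        output reg [4:0] dsp     // data stack pointer (depth 32 assumed here)
    );
        always @(posedge clk)
            if (en)
                dsp <= dsp + {{3{dd[1]}}, dd};   // sign-extend the 2-bit field
    endmodule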

The compiler end is not my strong suit, but I suppose I could figure out
how to take advantage of features like this.

--

Rick

rickman

unread,
Apr 8, 2013, 1:24:07 AM4/8/13
to
I just posted in response elsewhere in this thread that this discussion
and learning a bit about the J1 has tickled a grey cell along those same
lines. Most of the stack ops are for duplicating items so when they are
operated on they can be popped or for getting them into position so they
can be operated on. Having random access into the stack would help to
minimize these ops. The code could do its work from the top of the
stack and before exiting lop off any data that isn't needed.

In essence, the stack would be treated like a register file where TOS is
R0 and the register index is an index into the stack from the top down.
Of course, I need to give this more thought... a *lot* more thought.
It just goes well with a very simple hardware structure for the
stack/register logic... I think.
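
The read side of that really is simple - the stack sits in a small LUT
RAM and the operand field is just subtracted from the stack pointer. A
sketch of what I have in mind (all names and widths invented, nothing
from an actual design):

    module stack_rf #(parameter DW = 16, DEPTH = 16) (
        input               clk,
        input               push,      // write 'wdata' at the new top of stack
        input  [DW-1:0]     wdata,
        input  [3:0]        sel,       // operand index: 0 = TOS ("R0"), 1 = NOS, ...
        output [DW-1:0]     operand
    );
        reg [DW-1:0] stk [0:DEPTH-1];  // circular stack in distributed RAM
        reg [3:0]    sp = 0;           // points at the current TOS

        // async read, indexed down from the top of the stack
        assign operand = stk[sp - sel];

        always @(posedge clk)
            if (push) begin
                stk[sp + 4'd1] <= wdata;
                sp <= sp + 4'd1;
            end
    endmodule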

--

Rick

Syd Rumpo

unread,
Apr 8, 2013, 5:18:08 AM4/8/13
to
The PSC1000 allowed you to copy the top 16 (IIRC) return stack registers
to TOS, just like r@ but deeper - I used the symbols r@ r1@, r2@ .. rF@
for this. You could also drop these with N rdrop in (IIRC again) a
single cycle.

With real code, I found this to be very useful for locals and oft-used
variables. Push them with >r, use them with rN@ and tidy up with N rdrop.

Cheers
--
Syd

Andrew Haley

unread,
Apr 8, 2013, 5:18:06 AM4/8/13
to
We-ell, you have to distinguish between theoretical "pure" stack
machines, Forth machines, and classical stack machines such as the
Burroughs 5500/6500 et al. Even Novix had a couple of registers;
there are some things that are really hard to do without them.

Andrew.

Albert van der Horst

unread,
Apr 8, 2013, 5:52:52 AM4/8/13
to
This is the approach we took with the FIETS chip, about 1980, emulated
on an Osborne CP/M computer, never built. The emulation could run a Forth
and it benefited from reaching 8 deep into both the return and the data
stack. It still would be interesting to build it using a modern FPGA.

>
>The compiler end is not my strong suit, but I suppose I could figure out
>how to take advantage of features like this.
>
>--
>
>Rick

Groetjes Albert
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Bernd Paysan

unread,
Apr 8, 2013, 12:53:56 PM4/8/13
to
Marcel Hendrix wrote:

> rickman <gnu...@gmail.com> writes Re: MISC - Stack Based vs. Register
> Based
>
>> On 4/6/2013 10:16 AM, Anton Ertl wrote:
>>> rickman<gnu...@gmail.com> writes:
> [..]
>>> So what? The part that AKE cited says that it had not become outdated
>>> when the page above was written (which was in 1996 or shortly after).
>>> If it held true for so long, it might still hold true today. If you
>>> want to know, go out an measure it! The page contains the measured
>>> code, so it's not that hard. Current iForth and VFX might do better
>>> (or maybe not), but for everything else I expect that the factor of
>>> 2-3 still holds.
> [..]
>
> I redid the tests just now (i7 920 @2.66 GHz, Window 7). Times in ms.
>
> # sieve bubble matmul fib ssa mm-rtcg
> # gforth 0.285 0.376 0.223 0.334 0.269 0.322
> # gforth-fast 0.250 0.300 0.176 0.267 0.224 0.501

On my new core i7 @ 3.4 GHz, gforth-fast gets

sieve bubble matmul fib
gforth 0.253 0.343 0.166 0.456
gforth-fast 0.088 0.138 0.048 0.132

Didn't test the others, but gforth-fast is really much faster on Core i7
than the memory-updating Gforth (Core i7 is write-port limited, which
explains this factor 3). 64 bits. Apples to apples, there's a good reason
why iForth is 64 bits, and exactly the same reason applies to Gforth, too:
more registers, more fun.

32 bits suck, due to the lack of registers, and apparently some of our hand-
tuned register assignments don't work on GCC 4.7, which needs investigation.
That's even slower than your 2.66GHz CPU:

sieve bubble matmul fib
gforth 0.370 0.489 0.274 0.414
gforth-fast 0.318 0.429 0.196 0.313

% ./gforth-fast-x32 --diag -e bye
*** performance problems ***
automatic register allocation: performance degradation possible

Well, that explains a lot... If that's the case, don't bother to benchmark
it.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/

Brad Eckert

unread,
Apr 8, 2013, 12:57:09 PM4/8/13
to
On Friday, April 5, 2013 11:13:54 AM UTC-7, Arlet Ottens wrote:
> Yes, I've seen those, but I don't think the code density is going to be
> all that impressive, because the individual 5 (or so) bit wide
> instructions are very limited in what they can do.

I think most of the code density comes from factoring, which means 16-bit calls. The semantic density of the rest of the instruction set is secondary. A general purpose CPU tries to balance things out, but a CPU designed for one application can make the ISA strong in one area and then just throw in enough extra hardware to cover the bare essentials.

So the time critical code has good semantic density due to the instruction set coding and the rest has good semantic density due to factoring.

Rod Pemberton

unread,
Apr 8, 2013, 11:55:50 PM4/8/13
to
"Albert van der Horst" <alb...@spenarnc.xs4all.nl> wrote in
message news:515e262b$0$26895$e4fe...@dreader37.news.xs4all.nl...

> The existance of an XLAT instruction (to name an example)
> OTOH does virtually nothing to make the life of an
> assembler programmer better.
>

Why do you say that?

It seems good for 256 byte (or less) lookup tables, 8-bit
character translation, simple decompression algorithms, etc. You
can even use it for multiple tables at once, e.g., using XCHG to
swap BX. It's definitely difficult for a compiler implementer to
determine when to use such a CISC instruction.


Rod Pemberton


rickman

unread,
Apr 9, 2013, 10:13:33 AM4/9/13
to
On 4/8/2013 12:57 PM, Brad Eckert wrote:
> On Friday, April 5, 2013 11:13:54 AM UTC-7, Arlet Ottens wrote:
>> Yes, I've seen those, but I don't think the code density is going to be
>> all that impressive, because the individual 5 (or so) bit wide
>> instructions are very limited in what they can do.
>
> I think most of the code density comes from factoring, which means 16-bit calls. The semantic density of rest of the instruction set is secondary. A general purpose CPU tries to balance things out, but a CPU designed for one application can make the ISA strong in one area and then just throw in enough extra hardware to cover the bare essentials.

I'm not sure the code density comes from factoring and factoring does
not imply 16 bit calls. Code density comes from matching the
instruction set to the task and using an instruction encoding that
optimizes the most frequent operations for length. Factoring is not
really an instruction set issue, that is a programmer issue.

The assumption that a call implies 16 bit operands is not valid. More
than one instruction set design uses sign extension and variable length
values to express addresses. In my instruction set a call and jump are
relative and a single opcode (8 or 9 bits depending on the hardware) can
extend ± some number of locations without added parameters. So in the
shortest case the literal is only four bits.
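
For illustration only (a made-up encoding, not my actual opcode map):
the displacement is a short signed field sign extended onto the program
counter, so a nearby call or jump needs no extra word.

    module pc_rel #(parameter AW = 12) (
        input               clk,
        input               take_branch,
        input      [3:0]    disp,       // 4-bit signed displacement in the opcode
        output reg [AW-1:0] pc
    );
        // sign extend the short field to the full PC width
        wire [AW-1:0] disp_sx = {{(AW-4){disp[3]}}, disp};

        always @(posedge clk)
            pc <= take_branch ? pc + disp_sx : pc + 1'b1;
    endmodule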


> So the time critical code has good semantic density due to the instruction set coding and the rest has good semantic density due to factoring.

That is an interesting way to look at it.

--

Rick