The RISC penalty

Duane Sand

unread,

Dec 7, 1995, 3:00:00 AM12/7/95

to

The December 1995 issue of IEEE MICRO magazine has an interesting and
inflamatory article in its Micro View column, named "The RISC Penalty",
written by Tom Pittman. He has been working on faster replacements for
PowerMacs' 68K emulator, using dynamic recompilation of 68K code into
mildly-optimized PowerPC code. He and others found that recompilation
did not give the speedups they expected over simple interpretation,
when running large real programs. He says the main problem is the
excessive number of code cache misses caused by the poor code density
of RISC code. He calls the whole RISC architecture thing a fad, a
severe case of The Emperor's New Clothes. He argues for maybe going to
a byte-coded stack machine instruction set, a la Burroughs.

Torben AEgidius Mogensen

unread,

Dec 7, 1995, 3:00:00 AM12/7/95

to

d.s...@ix.netcom.com (Duane Sand) writes:

As processors tend to get faster faster than memory, you can argue
that taking extra time to decode instructions matters little, as it is
feeding the instructions to the processor that takes time. Increasing
code density is a way to reduce the problem of keeping the processor
busy. Larger instruction caches help too, but only if the code is
reasonably well-behaved. Caches are good for Fortran programs with
loops, but bad for OO programs that jump all over the place.

However, increasing code density need not be a move away from RISC.
RISC is a lot more than uniform 32-bit instructions (which is the main
reason for bad density). RISC also means avoiding hazards like
multiple memory references in a single instruction (which makes
restarting difficult) and a preference for using registers instead of
memory, which is good if the memory is slow compared to the processor.

ARM Ltd. has with the Thumb instruction set shown that you can make a
compact RISC instruction set. You could probably get even better code
density for RISC than this, if you designed the instruction set from
scratch with code density as the main motivation. This need not mean
going to variable length instructions or stack code. Variable length
instructions complicate the pipeline, but I don't think they will be a
major deviation from the RISC idea. Stack instructions may improve
code density, but the inherent sequentiality of stack operations makes
it a bad choice for the next generation of processors, which get most
of their speed by instruction level parallelism. Using multiple stacks
reduces this problem, but also increases code size.

You can also try to address the memory/processor gap by the Tera
approach of task switching on high-latency operations. But this will
probably mainly be a solution for high-end machines in the foreseeable
future. And the Tera approach only solves the latency problem, the
bandwidth required to keep the processor busy is not reduced.
Increasing code density helps this, but only for the code bandwidth,
not for data bandwidth.

However, in most current RISCs, the average data memory access is less
than one per instruction, often as a consequence of needing an
instruction per memory transfer (load/store multiple registers
instructions helps here). This is no major problem if the code is in
cache, but I think we will see less cache-friendly code in the future.

So, improving code density may well be an important issue in the
future. Doing this should not hurt the potential parallelism of the
code or require more memory accesses. Below I have listed possible
ways of improving code density, some that does and and some does not
do this.

Bad ways:

1) Going to a small set of registers. While this saves bits in the
instruction word, it can potentially increase the number of
memory accesses.

2) Using a stack instead of registers. Hurts parallelism.

3) Combining many operations in single instructions. While this may
make some operation combinations compact, it will inevitably hurt
other combinations.

Good ways:

1) Variable length instructions. While this complicates the
decoding, I don't think this will be the major bottleneck of
future processors. We might even see bit-stream encoding (even
though this failed miserably on the i432). Using such an
encoding, you can even let some registers and constants have
shorter encodings than others. Even though the encoding is
sequential, you can decode in parallel.

2) Letting some instructions access only a subset of registers,
e.g. require registers used in the same instruction to be close
(using small offsets for all but the first register) or by using
a "random" selection of register combinations, or letting some
instructions have implied register usage.

3) Use code locality to shorten instruction coding. E.g. use
distance to receiving/producing instruction instead of storing
intermediates in global register file with uniform coding cost.
You can e.g. let a single bit determine that an argument is the
result of the preceding operations, or that the result will be
used in the next instruction. This allows both producer and
receiver to save (most of) the bits needed for register
specification.

All of the "good" ways listed above complicate instruction decoding,
but this is not now (I believe) the most critical part of instruction
execution. While the latter approach goes against the current trend of
separating producers and consumers in the instruction stream to
increase parallelism, it mainly requires a somewhat larger look-ahead
window to find parallelism.

Torben Mogensen (tor...@diku.dk)

Sanjay Padubidri

unread,

Dec 8, 1995, 3:00:00 AM12/8/95

to

Michel Eftimakis (Michel.E...@VLSI.COM) wrote:
: tor...@diku.dk (Torben AEgidius Mogensen) wrote:
: >Good ways:

: >
: > 1) Variable length instructions. While this complicates the

: [...]
: > 2) Letting some instructions access only a subset of registers,
: [...]
: > 3) Use code locality to shorten instruction coding. E.g. use
: [...]

: There's also another way of reducing memory accesses... It is code
: compression. The code can be uncompressed on the fly at the time of
: execution by the processor.

: Other ideas ?

How about register windows? You could put in an arbitrary number of
registers and not have to worry about using bits to address them. All
you need are two instructions to push and pop windows. This would
particularly be attractive to companies like Intel that have to maintain
binary compatibility.

If you think about it, this suggestion is actually a variation of idea 2 above.

-Sanjay Padubidri
Dept. of Computer Science
SUNY at Stony Brook
sanj...@cs.sunysb.edu

Duane Sand

unread,

Dec 8, 1995, 3:00:00 AM12/8/95

to

In <4a8at8$h...@mercury.IntNet.net> a...@news.IntNet.net (Al Lipscomb)
writes:
>
>: d.s...@ix.netcom.com (Duane Sand) writes:
>: >He calls the whole RISC architecture thing a fad, a

>: >severe case of The Emperor's New Clothes. He argues for maybe
going to
>: >a byte-coded stack machine instruction set, a la Burroughs.
>

>I would say that RISK and the A Series (Burroughs now Unisys) are not
a
>mutualy exclusive set of features. The tagged memory, descriptors and
>stack based architechure have many good points. The CPU of the A has
does
>not have to worry about how memory is laid out, everything is an
arrays.
> ...

Oops. I hope this doesn't turn into yet another isn't A-series-nice
thread.
I'm sure Pittman wasn't thinking of descriptor-oriented tagged-memory
machines. The Mac programs he cares about hardly have any arrays as
such, and probably few vectorizable loops. My pardon for using the
B-word. Pittman actually ended by speculating about an implied-operand
machine with 5-bit opcode words; he didn't mention Burroughs.

IMHO, if one were to design a nice pipeline-able multi-issue machine
for C applications, with great code density, I don't think you would
start by hiding the control info everywhere, in indirect descriptor
words. Sure, you can undo those implied indirections with a bunch of
specialized caches and deeper fetch pipelines, but that's a solving a
problem that is easier to solve cost effectively by not having
descriptors & tags at all.

Al Lipscomb

unread,

Dec 8, 1995, 3:00:00 AM12/8/95

to

Torben AEgidius Mogensen (tor...@diku.dk) wrote:
: d.s...@ix.netcom.com (Duane Sand) writes:

: >of RISC code. He calls the whole RISC architecture thing a fad, a

: >severe case of The Emperor's New Clothes. He argues for maybe going to
: >a byte-coded stack machine instruction set, a la Burroughs.

I would say that RISK and the A Series (Burroughs now Unisys) are not a

mutualy exclusive set of features. The tagged memory, descriptors and
stack based architechure have many good points. The CPU of the A has does
not have to worry about how memory is laid out, everything is an arrays.

: Bad ways:

: 1) Going to a small set of registers. While this saves bits in the
: instruction word, it can potentially increase the number of
: memory accesses.

: 2) Using a stack instead of registers. Hurts parallelism.

This is not always the case. The A Series uses an external model that the
compiler produces stack code for. The processor is a GPR based design.
The combination seems to produce a real work horse. I think this design
may allow the cache hardware to gain more information about the
environment. I would also venture that most subroutines run with only
registers and no real access to the stack.

The tagged descriptors should help also. When the decode units see an
instruction that operates on an array it can notify the cache control to
look in the stack for the descriptor and then go find the array. You
could at least have the descriptor moved into a register and the physical
mapping completed by the time the data in the array is required.

Since the arrays are all dereferenced by hardware you can also employ
some vectored instructions. These instructions can be handled by
independent execution units in the CPU. For example a instruction to
initialize an entire array to zero, or add the contents of two arrays.
Since the hardware understands the arrays, if the descriptor to the
arrays is touched before a long operation is completed an interrupt could
be caused.

Michel Eftimakis

unread,

Dec 8, 1995, 3:00:00 AM12/8/95

to

tor...@diku.dk (Torben AEgidius Mogensen) wrote:

>Good ways:
>
> 1) Variable length instructions. While this complicates the

[...]

> 2) Letting some instructions access only a subset of registers,

[...]

> 3) Use code locality to shorten instruction coding. E.g. use

[...]

There's also another way of reducing memory accesses... It is code
compression. The code can be uncompressed on the fly at the time of
execution by the processor.

Other ideas ?

Bye

--
Michel EFTIMAKIS........................Internet : Michel.E...@vlsi.COM
DECT Design Engineer....................Phone....: (33) 92 96 27 19
VLSI TECHNOLOGY France..................Fax......: (33) 92 96 27 01
505, Route des Lucioles.................CellularP:
Sophia Antipolis - 06560 Valbonne FRANCE

Bernd Paysan

unread,

Dec 9, 1995, 3:00:00 AM12/9/95

to

Torben AEgidius Mogensen wrote:
> Bad ways:
>
> 1) Going to a small set of registers. While this saves bits in the
> instruction word, it can potentially increase the number of
> memory accesses.

Is this bad today? The additional memory accesses would be stack accesses
(thus in cache); but it seems to be more important to restrict the number
of addressing modes compared to 68k (the 68k has 6 bits per address, 3
register, 3 mode). Not having RMW operations is good (for restarting), not
having two memory accesses per operation is good too; but one memory
address (say load and add or so) isn't really a problem.

> 2) Using a stack instead of registers. Hurts parallelism.

You need at least a P6-like approach do gain parallelism out of stack
code. Usual stack code consists of typical pieces like "load something,
process with other values on the stack, store result away". You can easily
transform this stack ops into uOPS, using register renaming to transform
the stack into a GPR address, and schedule all those uOPS into a data flow
execution unit. Expensive, yes, but IMHO less than x86 architecture, as
typical stack operations can be translated 1:1 into uOPS (just add the
register names), and there is little variation in code length (perhaps two
or three formats: one/two with, one without immediate argument).

> 3) Combining many operations in single instructions. While this may
> make some operation combinations compact, it will inevitably hurt
> other combinations.

Creating such a VLIW design with small subinstructions is difficult, but
not impossible. The result of my 4stack processor project was that it is
hard to write good optimized code that uses less than two no-nop
instructions per 64 bit word; perhaps except if loop unrolling is used
(but this is inner loop stuff, thus much more cache hits than misses).
This approach combines the short instructions of a stack processor, and
the parallelism of a VLIW processor. Compared with a single-stack
processor it produces longer code if there is few parallelism to exploint,
but code with much inherent parallelism requires lots of stack operations
on a single stack processor, which results perhaps in longer code.

Example: to show a parallel compiler concept for the 4stack processor, I
wrote a prototype that can compile Forth, a stack language. The code was
the inner loop of a FFT, and in stack code it looks like that:

( memory pointers in local variables: pr, pi qr, qi )
( on the stack: fr, fi, the sine and cosine values)
qr @ qi @
2over 2swap 2over 2over
rot * -rot * swap - 4 -roll
-rot * -rot * swap + swap
pr @ pi @
2over 2over rot swap - pr ! - pi !
rot swap + qr ! + qi !

This does not even include the pointer updates. The "4 -roll" operation
would be very expensive (and I would have replaced it for a real code),
but with the experimental compiler, it is cheap. Assuming every token
takes one byte, the code above would be 45 bytes.

The 4stack code for the same problem is 4 instructions = 32 bytes long,
uses software pipelining, has the pointer update operations missing above,
and is still less readable (lines beginning with ;; comments what is on
the stack - for more informations look into my homepage, the user manual
of the 4stack processor has a detailed description of what happens in the
FFT code). Unlike the single stack code it is not dominated by stack
operations:

;; fr hi fi hr
pick 3s0 mul 0s0 pick 1s0 mul 0s0 ld 2&0: R1 0 #
;; fr hr -- fi hi --

mul 2s1 mul@ mul s1 mul@
;; fr gi 2:hifr/2 fi gr 2:hrfr/2

asr mulr@+ asr -mulr@+ 0 # ld 3&1: R1 +N
;; fr gi/2 (hifr+fihr)/2=:qi
;; fi gr/2 (hrfr-hifi)/2=:qr

add 1s0 subr 0s0 add 3s0 subr 2s0 st 2&0: R1 N+ st 3&1: R1 N+
;; qi+gi/2 qi-gi/2 qr+gr/2 qr-gr/2

The stack code is good, if you compare it with a typical RISC: FFT on
HP-PA, which generates unusual compact RISC code is 24 instructions=96
bytes. The VLIW stack code, however, is better. This is just one example.
If you generate mildly optimized code to transform 68k code into 4stack
code, you may result with something as bad or worse than on the PPC.

> Good ways:
>
> 1) Variable length instructions. While this complicates the
> decoding, I don't think this will be the major bottleneck of
> future processors. We might even see bit-stream encoding (even
> though this failed miserably on the i432). Using such an
> encoding, you can even let some registers and constants have
> shorter encodings than others. Even though the encoding is
> sequential, you can decode in parallel.

No. You have at least to mark the start of the instructions, then it works
(the P6 does this while loading the I-cache line). Even then it remains
expensive. Variable length instructions are expensive, the more you want
to issue at once, the more expensive (and certainly: the more variable the
instruction length, the more expensive).

> 2) Letting some instructions access only a subset of registers,
> e.g. require registers used in the same instruction to be close
> (using small offsets for all but the first register) or by using
> a "random" selection of register combinations, or letting some
> instructions have implied register usage.

All these restrictions complicate the compiler.

> 3) Use code locality to shorten instruction coding. E.g. use
> distance to receiving/producing instruction instead of storing
> intermediates in global register file with uniform coding cost.
> You can e.g. let a single bit determine that an argument is the
> result of the preceding operations, or that the result will be
> used in the next instruction. This allows both producer and
> receiver to save (most of) the bits needed for register
> specification.

This idea is almost the same as a stack processor, with the difference
that you may have register accesses, too (somehow like the transputer).

> All of the "good" ways listed above complicate instruction decoding,
> but this is not now (I believe) the most critical part of instruction
> execution. While the latter approach goes against the current trend of
> separating producers and consumers in the instruction stream to
> increase parallelism, it mainly requires a somewhat larger look-ahead
> window to find parallelism.

As it would be with a parallelizing stack processor.

BTW: the original problem was "mildly optimized PPC code". It should be
clear that you can't get dense code if you don't optimize good. A 68k
instruction like move.l (A6)+,-(A7) will become 4 RISC instructions
(from 2 bytes!). What would be more interesting is how large a recompiled
application becomes, and if this depends on RISC/CISC or if different RISC
implementations have large differences in instruction size. A former
verson of gforth, a portable Forth in C, gave about the following results:

PA-RISC: 32k
i386: 45k
MIPS R3000: 64k

It is interesting, that a popular CISC has less code density than a
popular RISC, and another popular RISC has twice as much code than that.
It should be noticed, that the R4000 does better, and that these numbers
are old, because it is difficult to obtain them with the current version
of gforth (the HP-UX version needs to include some code statically linked,
while the Linux version can make use of the shared libs of this code). And
the 2.7.2 version of GCC produces more compact code for the i486 than the
one that was used to obtain these numbers. It is now in the 34k region.

--
Bernd Paysan
"Late answers are wrong answers!"
http://www.informatik.tu-muenchen.de/~paysan/

Brian N. Miller

unread,

Dec 11, 1995, 3:00:00 AM12/11/95

to

In article <4a738c$2...@odin.diku.dk>, tor...@diku.dk (Torben AEgidius Mogensen) writes:
|
|Caches are ... bad for OO programs.

Has any research validated this claim? There's a slight chance
that OO is less cache friendly. That doesn't mean that OO
code doesn't benefit from cache. Both OO and cache are growing
disproportionally to their alternatives. They must be doing
something right together.

|Variable length instructions complicate the pipeline.

Not necessarily. Instruction decode can extend all instructions
into an internal format of constant width. Irregardless, the "RISC
Penalty" is becoming far too costly to continue demanding instructions
arrive directly from core with fixed size. From an abstract viewpoint,
the claim that RISC instructions are of fixed width is silly anyway.
What a 3-operand architecture calls one instruction, a RISC architecture
is forced to decompose into some _variable_ number of lesser instructions.
At this higher level, the RISC instruction stream doesn't look so clean
anymore.

|The inherent sequentiality of stack operations makes

|it a bad choice for the next generation of processors, which get most
|of their speed by instruction level parallelism.

Can this be so? Does a stack architecture preclude reorder? I'm confused.

|Require registers used in the same instruction to be close
|(using small offsets for all but the first register).
|... Use code locality to shorten instruction coding. E.g. use
| distance to receiving/producing instruction.

Yes. Awhile ago someone claimed in a post here that most register
operands existed merely as temporaries in a reorder buffer. Maybe
operand addressing could be taylored for that.

| Letting some instructions have implied register usage.

Rewarding, but decidedly un-RISC. Hennessey+Patterson praised
RISC for its operand orthagonality.

Torben AEgidius Mogensen

unread,

Dec 12, 1995, 3:00:00 AM12/12/95

to

b...@indica.bbt.com (Brian N. Miller) writes:

>In article <4a738c$2...@odin.diku.dk>, tor...@diku.dk (Torben AEgidius Mogensen) writes:
>|
>|Caches are ... bad for OO programs.

I didn't write that. You must have mixed a reply into the text.
I basically agree with you that cache will benefit OO, but that OO
will give more misses than Fortran-style code.

>|Variable length instructions complicate the pipeline.

If you read my original posting, you will see that I argue that this
is a lesser penalty.

>|The inherent sequentiality of stack operations makes

>|it a bad choice for the next generation of processors, which get most
>|of their speed by instruction level parallelism.

>Can this be so? Does a stack architecture preclude reorder? I'm confused.

On stack architectures, the stack is used by all instructions, and
these change the position/content of elements of the stack. In order
for stack architectures to have compact code, implicit top-of-stack
arguments are used. This essentially means that the next operation in
most cases will depend on the result of the previous (as this is the
one that gives the TOS a value), or else it is just a simple PUSH
operation. O.K., it is possible to combine some operations,
e.g. PUSH operations followed by TOS operations, as it is done on the
T9000. You can also dynamically map stack locations to registers and
detect hazards among these registers instead of the stack as a global
entity. This is, however, rather complicated (though maybe not much
more so than register renaming). Nevertheless, I think it is a bad
idea to design a critical resource (e.g. a common stack) into the ISA
and then try to work around it in the implementation.

>| Letting some instructions have implied register usage.

>Rewarding, but decidedly un-RISC. Hennessey+Patterson praised
>RISC for its operand orthagonality.

One can wonder whether orthogonality is a requirement for or a sympton
of RISC. I tend to think the latter. I'm all for orthogonality, but it
can hurt code density. So when the aim is to improve code density, I'm
willing to sacrifice orthogonality.

Torben Mogensen (tor...@diku.dk)

Dennis O'Connor~

unread,

Dec 13, 1995, 3:00:00 AM12/13/95

to

tor...@diku.dk (Torben AEgidius Mogensen) writes:

] >Rewarding, but decidedly un-RISC. Hennessey+Patterson praised

] >RISC for its operand orthagonality.
]
] One can wonder whether orthogonality is a requirement for or a sympton
] of RISC. I tend to think the latter.

I always thought it was something the compiler writers begged for.
They seem to think that their efforts are better spent on other
things instead of tracking and shuffling data around in the registers
so that the operands are in the right dedicated register for
whatever operation needs to be done.
--
Dennis O'Connor doco...@sedona.intel.com
i960(R) Architecture and Core Design Not an Intel spokesman.
TIP#518 Fear is the enemy.

Preston Briggs

unread,

Dec 14, 1995, 3:00:00 AM12/14/95

to

Torben AEgidius Mogensen wrote:
>] One can wonder whether orthogonality is a requirement for or a sympton
>] of RISC. I tend to think the latter.

and Dennis O'Conner wrote:
>I always thought it was something the compiler writers begged for.
>They seem to think that their efforts are better spent on other
>things instead of tracking and shuffling data around in the registers
>so that the operands are in the right dedicated register for
>whatever operation needs to be done.

Yes! And it's not just the compiler-writer's efforts we're trying to
save, but runtime and code space.

Preston Briggs

Cliff Click

unread,

Dec 14, 1995, 3:00:00 AM12/14/95

to

doco...@sedona.intel.com (Dennis O'Connor~) writes:

> tor...@diku.dk (Torben AEgidius Mogensen) writes:
> ] >Rewarding, but decidedly un-RISC. Hennessey+Patterson praised
> ] >RISC for its operand orthagonality.
> ]

> ] One can wonder whether orthogonality is a requirement for or a sympton
> ] of RISC. I tend to think the latter.
>

> I always thought it was something the compiler writers begged for.
> They seem to think that their efforts are better spent on other
> things instead of tracking and shuffling data around in the registers
> so that the operands are in the right dedicated register for
> whatever operation needs to be done.

Unfortunately, irregular register sets have been around for soooo
long, that this problem is fairly well solved. A nice coalescing
allocator will get the right things in the right registers without too
many extra moves. These allocators work really well on RISC too, so
they are fairly common.

Cliff
--
Cliff Click Compiler Researcher & Designer
RISC Software, Motorola PowerPC Compilers
cli...@risc.sps.mot.com (512) 891-7240

Allen J. Baum

unread,

Dec 14, 1995, 3:00:00 AM12/14/95

to

In article <4ajmrq$q...@odin.diku.dk>, tor...@diku.dk (Torben AEgidius
Mogensen) wrote:

>
> One can wonder whether orthogonality is a requirement for or a sympton

> of RISC. I tend to think the latter. I'm all for orthogonality, but it
> can hurt code density. So when the aim is to improve code density, I'm
> willing to sacrifice orthogonality.

I'm tempted to agree.
I believe that the main reason for orthogonality is to make the
compiler writers job easier. Secondarily, it makes hardware designers
job easier.

Another major reason, however, is that it exposes more opportunities for
optimization and parallelization. If an instruction set is non-orthogonal,
then it is probably that some resource is special cased, or not replicated,
and may become a bottleneck. And, (allegedly) becaue of the orthogonality,
compilers can be written that take advantage of it.

So, there is a fairly clear tradeoff of performance vs. code density.
This was a pretty easy decision a few years ago, but it is becoming
apparent that code density, by itself, is becoming a much more significant
performance issue than before (since the advent RISC, anyway).

I can't quantify, in my head, what degree of problem it is.

If I have an instruction set with half the density of another,
this will not likely cause a factor of two in off-cache traffic, especially
since Istreams exhibit lots of locality.

It's equivalent to having an Icache half the size.
For tiny caches, that could conceivably to double the miss rate, but
for large caches - I don't know, say 20..30% for arguments sake.

So, we're trading an extra 20% of Icache traffic (over and above I+D
traffic) to gain something in performance improvement because of
optimizations (which could actually lower Dcache traffic), making a
compiler writers job easier, and
possibly make the hardware designers job easier. The latter two have
performance implications as well; at 30-40% clock rate improvement a year,
delaying a product because it is more complex means effectively making it
lower performance.

Doing this tradeoff is nontrivial. If clock rates get such that the
off-chip memory traffic completely swamps the bus, then any improvement
in code density is a likely a win.

But..... compilers have been typically biased for performance (although
ARM, I believe, has a compiler switch to bias it towards code density),
which
will (typically) make code density worse.

So, what if we have an orthogonal instruction set with a compiler that is
biased toward code density - will performance improve in a system that is
bottlenecked on off-chip memory traffic? Tough call. My brain hurts.

--
*******************************************************
* Allen J. Baum *
* Apple Computer Inc. MS 305-3B *
* 1 Infinite Loop *
* Cupertino, CA 95014 *
*******************************************************

Doug Siebert

unread,

Dec 14, 1995, 3:00:00 AM12/14/95

to

ba...@apple.com (Allen J. Baum) writes:

>If I have an instruction set with half the density of another,
>this will not likely cause a factor of two in off-cache traffic, especially
>since Istreams exhibit lots of locality.

>It's equivalent to having an Icache half the size.
>For tiny caches, that could conceivably to double the miss rate, but
>for large caches - I don't know, say 20..30% for arguments sake.

When I originated the "64 bit instructions" thread a few weeks ago I got
e-mail from someone who said a couple years ago he had done exactly what I
was looking for -- did cache traces to determine how much additional traffic
from memory to the cache(s) was generated by taking a 32 bit modern RISC
instruction set and making it 64 bits; with all the extra 32 bits completely
wasted. I assume the configuration tested had a normal ordinary cache size.
Anyway, they found only a 3% increase. He asked me to not to mention any
names or details, as this study was never published. At any rate, given this
result, I think the 20-30% is a gross overestimation, though that was pretty
much what I was thinking until I got that mail.

If anyone else knows or or has done any similar studies I'd be interested to
see how the results compare. It would be interesting to hack a compiler to
insert a NOP every other instruction, then run SPEC on that versus SPEC using
the normal compiler, and see how much a few representative sample architectures
suffer/don't suffer from this. I wonder if you could do this by having the
compiler generate assembly code and adding the NOPs there? Or would that mess
up branch targets, etc.?

--
Doug Siebert || "Usenet is essentially Letters to the Editor
University of Iowa || without the editor. Editors don't appreciate
dsie...@icaen.uiowa.edu || this, for some reason." -- Larry Wall
(c) 1995 Doug Siebert. Redistribution via the Microsoft Network is prohibited.

John Carr

unread,

Dec 15, 1995, 3:00:00 AM12/15/95

to

In article <4apuhn$k...@server05.icaen.uiowa.edu>,
Doug Siebert <dsie...@icaen.uiowa.edu> wrote:

[the subject is I-cache miss rates for doubling instruction size:]

>Anyway, they found only a 3% increase.

Does 3% increase mean the hit rate drops from 95% to 94.85% or 92%? If
the latter, considering that a I-cache miss costs 10+ cycles that is a
CPI increase of about .2 (assuming 4 64 bit instructions in a 32 byte
cache block).

>It would be interesting to hack a compiler to
>insert a NOP every other instruction, then run SPEC on that versus SPEC using
>the normal compiler, and see how much a few representative sample architectures
>suffer/don't suffer from this.

This would artificially inhibit superscalar dispatch. The original Alpha,
for example, dual issues aligned quadwords. Adding NOPs would make it at
best a single issue processor.

--
John Carr (j...@mit.edu)

Brian N. Miller

unread,

Dec 15, 1995, 3:00:00 AM12/15/95

to

In article <baum-14129...@piem3.atg.apple.com>, ba...@apple.com (Allen J. Baum) writes:

|It makes hardware designers job easier.

I think perhaps the opposite. True, orthagonality might simplify some VHDL
expressions, and that's a big win. But, orthagonality leads to an explosion
of interconnects, which in turn raises layout, area, and electronic design
concerns, all hopefully handled without human intervention. But the increased
complexity hurts yield _without_ improving speed; better to spend those details
on cache. Orthagonality also combinatorily expands the extent of simulation
needed to provide 100% coverage. Bloating simulation hurts the edit/debug
cycle, and that might hurt the project's schedule.

|If clock rates get such that the
|off-chip memory traffic completely swamps the bus, then any improvement

|in code density is a likely a win. ...ARM, I believe, has a compiler switch
|to bias it towards code density.

I suspect a trend.

Albert Such

unread,

Dec 15, 1995, 3:00:00 AM12/15/95

to

tor...@diku.dk (Torben AEgidius Mogensen) wrote:

>b...@indica.bbt.com (Brian N. Miller) writes:
>

>>In article <4a738c$2...@odin.diku.dk>, tor...@diku.dk (Torben AEgidius Mogensen) writes:
>>| Letting some instructions have implied register usage.

>
>>Rewarding, but decidedly un-RISC. Hennessey+Patterson praised
>>RISC for its operand orthagonality.
>

>One can wonder whether orthogonality is a requirement for or a sympton
>of RISC. I tend to think the latter. I'm all for orthogonality, but it
>can hurt code density. So when the aim is to improve code density, I'm
>willing to sacrifice orthogonality.
>

My understanding is that lacking orthogonality can go against code density:
if certain operations can only work with a reduced set of registers you
have to move operands to/from those registers and, because the number of
registers actually available for those instructions is reduced, you can end
up having more spilling code.

In other posting in this thread, somebody quoted code density for PA-RISC being
higher than for i386. A possible explanation for that is the non orthogonal
instruction set of the i386.

Albert Such
HP-BCD

Brian N. Miller

unread,

Dec 15, 1995, 3:00:00 AM12/15/95

to

In article <4apq2h$4...@oomph.tera.com>, pre...@Tera.COM (Preston Briggs) writes:
|
|It's not just the compiler-writer's efforts we're trying to

|save, but runtime and code space.

As Allen Baum pointed out earlier, what good is a new CPU if its held up
by a compiler that's still in development? I would have thought that by
now that algorithms for handling constraints on register colorization would
be sufficiently generalized so as to be easily applied to a new instruction
set. So my question is:
Does orthagonality _significantly_ accelerate the deployment of a compiler's
back end?

Preston Briggs

unread,

Dec 15, 1995, 3:00:00 AM12/15/95

to

I wrote

>|It's not just the compiler-writer's efforts we're trying to
>|save, but runtime and code space.

and b...@bbt.com writes:
>As Allen Baum pointed out earlier, what good is a new CPU if its held up
>by a compiler that's still in development?

Hmmm. That's been my argument for quite a while against
distributed-memory machines and cache-dependent machines.

>I would have thought that by
>now that algorithms for handling constraints on register colorization would
>be sufficiently generalized so as to be easily applied to a new instruction
>set. So my question is:
>Does orthagonality _significantly_ accelerate the deployment of a compiler's
>back end?

I'm not exactly sure about the connection between my comment and
yours, but... Coloring handles lots of register irregularities in a
reasonable fashion -- meaning as well or better than anything else.
But if you've got a multiplier that requires its operands in r0 and
r1, and an adder that produces it's result in r2, nothing can help you
avoid a reg-reg copy if you've got to multiply a sum. Hence my
argument that irregularity can cost runtime and code space.

I don't know how much orthogonality helps (or irregularity hurts)
compiler development -- that's a tough software engineering
experiment.

Preston Briggs

Bruce Hoult

unread,

Dec 15, 1995, 3:00:00 AM12/15/95

to

d.s...@ix.netcom.com (Duane Sand) writes:
> The December 1995 issue of IEEE MICRO magazine has an interesting and
> inflamatory article in its Micro View column, named "The RISC Penalty",
> written by Tom Pittman. He has been working on faster replacements for
> PowerMacs' 68K emulator, using dynamic recompilation of 68K code into
> mildly-optimized PowerPC code. He and others found that recompilation
> did not give the speedups they expected over simple interpretation,
> when running large real programs. He says the main problem is the
> excessive number of code cache misses caused by the poor code density

> of RISC code. He calls the whole RISC architecture thing a fad, a
> severe case of The Emperor's New Clothes. He argues for maybe going to
> a byte-coded stack machine instruction set, a la Burroughs.

It's hard to imagine a program with less locality of reference than
Apple's own 68K emulator. It basically loads the next opcode, multiplies
by 8, adds a base address, and jumps into a 512 KB switch statement body.
Each branch of the switch statement is 8 bytes, containing one useful
instruction, plus one branch statement that either goes back to the
interpreter loop (for really simple 68K instructions), or else to
additional code that completes the instruction (often shared by a family
of similar opcodes). The total thing is about 800 KB in size.

Connectix's "Speed Doubler" contains 40 KB of PPC code that really does
nearly double the emulation speed on many real-world programs (such as
"Hellcats Over The Pacific" or "Microsoft Word 5.1") -- and far more than
that on typical simple synthetic benchmarks.

Clearly, 40 KB isn't going to contain an entire 68K emulator, so my guess
is that "Speed Doubler" is using Apple's lookup table as a set of code
generation macros for simple-minded dynamic recompilation.

I don't know what kind of speed increase Pittman was expecting, but this
doesn't seem unreasonable to me. A factor of 10 slowdown over native
code seems to be a fairly universal rule of thumb for interpretive
emulators, and Apple's one is pretty much right on that mark. I believe
the DEC guys doing fullblown static recompilation of VAX code for the
Alpha get about a factor of two slowdown over recompilation from the
original source code.

It seems to me that I'd be astounded if a dynamic recompiler with only
a few hundred K of code cache space could come within a factor of two of
static recompilation i.e. 4 times slower than native code. At around 5 or
6 times slower than native, Speed Doubler seems like a good effort to me.

-- Bruce

Henry Baker

unread,

Dec 16, 1995, 3:00:00 AM12/16/95

to

In article <4atcou$9...@oomph.tera.com>, pre...@Tera.COM (Preston Briggs) wrote:

> >So my question is:
> >Does orthagonality _significantly_ accelerate the deployment of a compiler's
> >back end?
>

> I don't know how much orthogonality helps (or irregularity hurts)
> compiler development -- that's a tough software engineering
> experiment.

All other things being equal, the sheer _size_ of the code generator is
a barrier to getting the compiler to work. You can probably estimate to
a first approximation the size of the code generator from the size of the
instruction set manual. Of course, if the vendor cheated, and didn't bother
telling you lots of things about how the instruction set _actually_
operates, then add another year or so to the development, while the SW
people do extensive experiments to figure out what the heck is going on.

Also add another year if you want the maximum performance, and you didn't
tell the SW people about the screw cases where the instructions run
10X slower.

BTW, most HW people think that adding a few instructions shouldn't cost
much in revamping the compiler. This is true if the compiler doesn't
actually generate the instructions! However, if the new instructions are
actually very useful, it could completely obsolete a whole theory of
what sort of code to generate for the optimum performance.

----

To summarize:

1. A smaller instruction set is easier to generate good code for _if_ the
instructions have a relatively simple performance model.

2. Orthogonality is a waste of time, if you have to immediately throw it
away to take get decent performance. E.g., the different models of the IBM/360
all had the same instruction sets, but vast differences in which instructions
were efficient. E.g., on the model 30, byte instructions were very efficient
since most of the datapaths were 8 bits, while on the larger models, the
byte instructions were dreadful due to 32 bit and larger datapaths.

3. A decent model of how to gracefully handle branches to get performance
is probably a lot more important. E.g., if you have deep pipelining, and
the pipelines don't know the eventual destination of the values, then any
branch will be penalized by having to wait for the pipelines to empty
_or_ all branch destinations will have to have a complete understanding of
all the possible states of the pipelines in the preceding portions of the
code. This blows up the code in a fairly spectacular manner without any
redeeming value.

4. With modern machines being held up more & more by data latency issues,
the details of the instruction sets become less important compared with
the problems of modelling the caches, datapaths, functional units, etc.
I would say that _orthogonality of data paths and functional units_ is a much
bigger problem than _orthogonality of instruction sets_.

--
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html

Byron Rakitzis

unread,

Dec 17, 1995, 3:00:00 AM12/17/95

to

In article <4arun9$1...@ansible.bbt.com>, Brian N. Miller <b...@bbt.com> wrote:

>I think perhaps the opposite. True, orthagonality might simplify some VHDL
>expressions, and that's a big win. But, orthagonality leads to an explosion
>of interconnects, which in turn raises layout, area, and electronic design
>concerns, all hopefully handled without human intervention. But the increased
>complexity hurts yield _without_ improving speed; better to spend those details
>on cache. Orthagonality also combinatorily expands the extent of simulation
>needed to provide 100% coverage. Bloating simulation hurts the edit/debug
>cycle, and that might hurt the project's schedule.

I don't know if this was mentioned a while ago on comp.arch, but here
goes:

I heard about an idea to make a non-orthogonal instruction set where
the non-orthogonality is chosen in such a way as to "compress" the
instruction set: The idea being that if you represent the instruction
set as a graph, that the compiler would be able to find short paths in
the graph even though a human mind might get confused about which
register is allowed to be used in which context.

An example of the idea (off the top of my head) is this: suppose that
the base register of a load must be numbered one greater than the target
of a load:

lw r0,(r1) # legal
lw r0,(r2) # illegal

Now this is a terrible example, given, say, how frequently accesses
off a register like (sp) happen in real code, but it would completely
remove 5 bits (assuming 32 registers) out of the instruction encoding.

Perhaps someone can offer better insights into this problem. This is
not my area of expertise. I think this is a really clever idea, though.

Byron.
--
Byron Rakitzis Network Appliance
<by...@netapp.com> 319 N. Bernardo
(415) 428-5104 Mountain View, CA 94043

Mike Haertel

unread,

Dec 18, 1995, 3:00:00 AM12/18/95

to

In article <4b0tfg$2...@nova.netapp.com>,

Byron Rakitzis <by...@netapp.com> wrote:
>An example of the idea (off the top of my head) is this: suppose that
>the base register of a load must be numbered one greater than the target
>of a load:
>
> lw r0,(r1) # legal
> lw r0,(r2) # illegal
>
>Now this is a terrible example, given, say, how frequently accesses
>off a register like (sp) happen in real code, but it would completely
>remove 5 bits (assuming 32 registers) out of the instruction encoding.
>
>Perhaps someone can offer better insights into this problem. This is
>not my area of expertise. I think this is a really clever idea, though.

This is awfully reminiscent of floating-point register pairs.
The compiler issues for dealing with those are well understood.

In general anything that adds weird constraints in how registers
are used will make the compiler's life more difficult. Simple
compilers usually end up generating extra data movement, and their
more sophisticated brethren typically do extra thinking to make sure
results somehow magically get computed into the right registers
in the first place. Sometimes you just can't do that.

So, the price of register usability constraints is extra data movement.
The benefit may be easier and/or faster implementation (but this is
technology dependent) and perhaps reduced I-fetch bandwidth requirements.

An example of easier implementation: Many ISA's have disjoint FP and
integer registers. Historically this often occurred because die area
limitations led to off-chip FPU's. Even nowdays it may be desirable
to have the FP registers near the FPU and a long way away from the rest
of the chip.

It's not always clear when such implementation considerations should
lead to architecturally visible constraints. The FP register issue
is well understood and nobody in the compiler community complains
very much. On the other hand, branch delay slots, which were popular
a few years ago, are pretty much a waste for all concerned nowdays.

The real issue with all such architectural features is, how will they
affect future implementations? It's much easier to have architecturally
separate FP and integer register files, and secretly have them unified
on chip (like P6), then it is to have an architecturally unified register
file and secretly have it split into FP and integer parts on chip...

As for reduced I-fetch bandwidth, I wonder how important it is?
In the 1980's the RISC guys got a lot of milage out of giant
instructions, using large I-caches to compensate for the increased
bandwidth requirements. Does that tradeoff still hold, or are people
starting to feel the pain of today's giant I-cache busting programs?
Is the answer larger caches, or denser instruction encoding? I note
that at least one RISC manufacturer (Acorn, I think) has a new dense
encoding scheme. Has the wheel of reincarnation turned again?
--
Mike Haertel <hae...@ichips.intel.com>

Allen J. Baum

unread,

Dec 18, 1995, 3:00:00 AM12/18/95

to

In article <4arun9$1...@ansible.bbt.com>, b...@bbt.com wrote:

> I said (RE: orthogonality):

>
>> It makes hardware designers job easier.
>

> I think perhaps the opposite. True, orthagonality might simplify some VHDL
> expressions, and that's a big win. But, orthagonality leads to an explosion
> of interconnects,

I don't follow that argument. If some register can be used for some
instructions and not others (non-orthogonal), the read buses are still
connected, so interconnect doesn't change. Decoding is more complicated.

If I have completely dedicated registers, than it is likely I have more
interconnect; I have to be able to load/store them, but I can't use the
same datapaths that the GPRs use, or if I do than I have to have some
additional interconnect for their dedicated function.

This might be just me thinking about some other meaning of the concept of
orthogonality than you were thinking of. Could you elaborate?

Stanley T.H. Chow

unread,

Dec 19, 1995, 3:00:00 AM12/19/95

to

In article <4b4fj7$s...@news.jf.intel.com>,

Mike Haertel <hae...@ichips.intel.com> wrote:
>In article <4b0tfg$2...@nova.netapp.com>,
>Byron Rakitzis <by...@netapp.com> wrote:

[Byron's idea of restricting register combinations]

>>Perhaps someone can offer better insights into this problem. This is
>>not my area of expertise. I think this is a really clever idea, though.
>
>This is awfully reminiscent of floating-point register pairs.
>The compiler issues for dealing with those are well understood.

If I remember correctly, the later IBM 360/370/... mainframes have
several(many?) instructions that tak advantage of this. One of the
instructions used 3? explicite registers and 3? implied registers.

As Mike says, (and mentioned by other posters like Blum), the
register allocations problems are well understood and pretty much
standard these days.

>In general anything that adds weird constraints in how registers
>are used will make the compiler's life more difficult. Simple
>compilers usually end up generating extra data movement, and their
>more sophisticated brethren typically do extra thinking to make sure
>results somehow magically get computed into the right registers
>in the first place. Sometimes you just can't do that.

But as you say, the register allocators have been written and it's
not that difficult to put into a new compiler. Even if lots of
register shuffling is done, it's done at register renaming with
no execute stages. The only down side is not getting all the gain
from code density, but I suspect the amount of regsiter shuffling
will be minimal (depending on number of registers, restrictions
imposed, etc).

>
>So, the price of register usability constraints is extra data movement.
>The benefit may be easier and/or faster implementation (but this is
>technology dependent) and perhaps reduced I-fetch bandwidth requirements.

Agree, a complex trade-off that depends of a bunch of non-arch factors.

>The real issue with all such architectural features is, how will they
>affect future implementations? It's much easier to have architecturally
>separate FP and integer register files, and secretly have them unified
>on chip (like P6), then it is to have an architecturally unified register
>file and secretly have it split into FP and integer parts on chip...

Who knows what the future will bring. Also, most people care a lot
about NOW and only a little about the future; so it makes commercial
sense to limit concerns to the next two generations and face the
future when the time comes.

>As for reduced I-fetch bandwidth, I wonder how important it is?
>In the 1980's the RISC guys got a lot of milage out of giant
>instructions, using large I-caches to compensate for the increased
>bandwidth requirements. Does that tradeoff still hold, or are people
>starting to feel the pain of today's giant I-cache busting programs?

Some people had "giant I-cache busting programs" before RISC :-)

>Is the answer larger caches, or denser instruction encoding? I note
>that at least one RISC manufacturer (Acorn, I think) has a new dense
>encoding scheme. Has the wheel of reincarnation turned again?

I am sure the answer is "BOTH".

--
Stanley Chow; sc...@bnr.ca, stanley....@nt.com; (613) 763-2831
Bell Northern Research Ltd., PO Box 3511 Station C, Ottawa, Ontario
Me? Represent other people? Don't make them laugh so hard.

Stanley T.H. Chow

unread,

Dec 19, 1995, 3:00:00 AM12/19/95

to

In article <baum-18129...@piem3.atg.apple.com>,

Allen J. Baum <ba...@apple.com> wrote:
>In article <4arun9$1...@ansible.bbt.com>, b...@bbt.com wrote:
>> I think perhaps the opposite. True, orthagonality might simplify some VHDL
>> expressions, and that's a big win. But, orthagonality leads to an explosion
>> of interconnects,
>
>I don't follow that argument. If some register can be used for some
>instructions and not others (non-orthogonal), the read buses are still
>connected, so interconnect doesn't change. Decoding is more complicated.
>
>If I have completely dedicated registers, than it is likely I have more
>interconnect; I have to be able to load/store them, but I can't use the
>same datapaths that the GPRs use, or if I do than I have to have some
>additional interconnect for their dedicated function.

I agree.

On the other hand, in the case of completely dedicated registers, the
additional interconnect is likely to be "local", so it's not as bad
(but than, I am way out of my depth here).

With dedicated registers, the register should need fewer R/W ports. I have
no idea what happens to forwarding paths; but if the registers are
sufficiently dedicated, I suppose forwarding just follows the normal
paths.

Chris Brown

unread,

Dec 21, 1995, 3:00:00 AM12/21/95

to

In article <29018...@hoult.actrix.gen.nz>,

Bruce Hoult <Br...@hoult.actrix.gen.nz> wrote:
>Clearly, 40 KB isn't going to contain an entire 68K emulator,

Why ever not? I wrote a MIPS R2000 emulator that does the user mode
and kernel mode instructions, as well as the MMU hardware, limited
ability to interface with native C routines, including a simple
integrated debugger, and a small kernel written in MIPS assembly to
handle exceptions & page faulting for my final year project at
University. OK, so it's interprative, rather than a dynamic
recompiler, and I don't claim to have implemented the full
functionality required for running MIPS compiled software on it, but
its performance is respectible for a non-commercial effort, it's
written in C and compiles to (depending on the machine) between 25 and
35 kilobytes. On a Sparc 5, with very simple code (but handling the
address exceptions, etc.), I've managed to get it up to over 100000
instructions per second.

--
/* _ */main(int k,char**n){char*i=k&1?"+L*;99,RU[,RUo+BeKAA+BECACJ+CAACA"
/* / ` */"CD+LBCACJ*":1[n],j,l=!k,m;do for(m=*i-48,j=l?m/k:m%k;m>>7?k=1<<m+
/* | */8,!l&&puts(&l)**&l:j--;printf(" \0_/"+l));while((l^=3)||l[++i]);}
/* \_,hris Brown -- All opinions expressed are probably wrong. */

Marinos Yannikos

unread,

Dec 21, 1995, 3:00:00 AM12/21/95

to

In article <4bbd92$l...@doc.armltd.co.uk>,

cbr...@armltd.co.uk (Chris Brown) writes:
>In article <29018...@hoult.actrix.gen.nz>,
>Bruce Hoult <Br...@hoult.actrix.gen.nz> wrote:
>>Clearly, 40 KB isn't going to contain an entire 68K emulator,
>
>Why ever not? I wrote a MIPS R2000 emulator that does the user mode
>and kernel mode instructions, as well as the MMU hardware, limited
>ability to interface with native C routines, including a simple
>integrated debugger, and a small kernel written in MIPS assembly to

>handle exceptions & page faulting [...] between 25 and
>35 kilobytes.

Don't forget that the 68K instruction set is "slightly" more complicated
than the simple Mips instruction set of the R2000. Unless you don't mind
up to 4-5 decoding steps per instruction (usually resulting in one slow
indirect jump each), you need a big jump table. Also, using real registers
instead of a register array for the emulated CPU's registers results in
increased code size but much higher speed. Granted, that doesn't render
an intepretive 68K emulator in 40KB impossible (that has to be proven first),
but certainly not very useful.

-nino
--
http://www.complang.tuwien.ac.at/nino

Joe Keane

unread,

Dec 30, 1995, 3:00:00 AM12/30/95

to

I think that this was a good thread.

One thing that Hennessy & Patterson stress repeatedly is that `big code bad'
doesn't cut it as a good argument. To make a useful analysis, it's important
to have real numbers about the trade-off between code size and performance.

It's certainly not true that if you double code size, your machine will run
twice as slow. It won't be exactly the same either, but that's probably much
closer to the truth. My guess for a typical penalty is something like 10%.

Of course, no matter what size a cache is, you're always going to have some
application with frequent accesses just a bit over that size. In that case,
things would work much better if the cache were a bit bigger, or it used a
different replacement algorithm, or things could be stored more compactly.

Fortunately, especially averaged over applications, access tends to have a
pretty `fractaloid' pattern, so the function of miss rate versus cache size
comes out fairly smooth. Any specific application may be just an anomaly.

One problem is that code size is easy to measure, while performance is not.
It depends on lots of icky things like clock speed, superscalar issue, stalls,
branch penalties, cache sizes, cache associativity, external caches, size and
speed of main memory, type of memory, operating system, speed of disks, etc.
Running on a newer or older machine may totally reverse some results you got.
But if you simplify things too much, your arguments become worthless.

I think the biggest problem is that making a new instruction set with good
code density is fun and interesting. There's no end to the amount of clever
encoding and complex instructions that you can dream up. Plus there's a
definite appeal to being able to do a lot of work with just a few bytes.

I think that there's a place for such dense code, just that it shouldn't be
the lowest-level code accessible for a machine. Most code isn't critical to
performance, and plenty of code is executed very rarely. It's a good idea to
have a compiler that can make some kind of byte-code for specified functions,
and i'm sure that we'll see this more. It could be justified entirely by
performance reasons if you could save enough time on page faults.

I like some of the schemes that were suggested earlier. I think that the
basic idea is to start with a really simple, easy-to-decode instruction set
and see if we can make restrictions that will significantly reduce code size
and have only a small effect on other areas, such as pipeline depth and the
number of instructions generated for certain code. Still, we shouldn't be
surprised if such things fail and the stupid instruction set turns out best.

I'm fond of special registers, because i think that it allows you to have a
really large general register set, without making your code size really large
or requiring your general registers to be ridiculously fast and multi-ported.
You end up specifying more moves, but a GPR RISC machine moves three operands
to/from general registers for every instruction, so which is worse really?

It's true that things like special registers or limited register combinations
make the compiler's job more difficult. But it's certainly tractable, so we
should consider what are the benefits and drawbacks. I don't believe a claim
that any move from the GPR model is going to totally defeat compiler writers.

Remember that a half-decent optimizer on a fast machine will almost surely
produce faster code than an incredibly clever optimizer on a slow machine.

--
Joe Keane, amateur architect

Bruce Hoult

unread,

Dec 30, 1995, 3:00:00 AM12/30/95

to

cbr...@armltd.co.uk (Chris Brown) writes:
> In article <29018...@hoult.actrix.gen.nz>,
> Bruce Hoult <Br...@hoult.actrix.gen.nz> wrote:
> >Clearly, 40 KB isn't going to contain an entire 68K emulator,
>
> Why ever not? I wrote a MIPS R2000 emulator that does the user mode
> and kernel mode instructions, as well as the MMU hardware, limited
> ability to interface with native C routines, including a simple
> integrated debugger, and a small kernel written in MIPS assembly to

> handle exceptions & page faulting for my final year project at
> University. OK, so it's interprative, rather than a dynamic
> recompiler, and I don't claim to have implemented the full
> functionality required for running MIPS compiled software on it, but
> its performance is respectible for a non-commercial effort, it's
> written in C and compiles to (depending on the machine) between 25 and
> 35 kilobytes. On a Sparc 5, with very simple code (but handling the
> address exceptions, etc.), I've managed to get it up to over 100000
> instructions per second.

You're quite right. I should have said "Clearly, 40 KB isn't going to
contain an entire, practical, 68K emulator that is faster than Apple's
built-in emulator.". Sorry.

I don't dispute that you can build an emulator that size. In fact that's
something I myself did back in March '82 at the start of my 2nd year at
university when I wrote a 6502 emulator in VAX Pascal. I don't recall the
compiled size, but it was certainly less than 40 KB of compiled code, and
ran at around 20,000 6502 instructions per second on a VAX 11/780.

The problem is that isn't *useful* as a replacement for Apple's interpretive
emulator, which on a 60 MHz PPC601 does around 5,000,000 emulated instructions
per second. You emulator is 50 times slower. *fifty* *times*.

Connectix's "Speed Doubler", on the other hand, contains 40 KB of PPC code,
but allows the PPC601 to emulate 68K code several times faster than Apple's
effort. In fact, using Speed Doubler, a simple loop such as this...

for i := 1 to max do
t := t + i;

... runs *in emulation* at 29 MIPS on my 60 MHz PPC601. That pretty darn good
for 40 KB of code.

-- Bruce

Lawson English

unread,

Jan 1, 1996, 3:00:00 AM1/1/96

to

Joe Keane <j...@netcom.com> wrote:
[snipt]
: One problem is that code size is easy to measure, while performance is not.

: It depends on lots of icky things like clock speed, superscalar issue, stalls,
: branch penalties, cache sizes, cache associativity, external caches, size and
: speed of main memory, type of memory, operating system, speed of disks, etc.
: Running on a newer or older machine may totally reverse some results you got.
: But if you simplify things too much, your arguments become worthless.

Let's not forget overhead of emulated vs native calls, especially if you
have possible "context-switching" going on in there.

Case in point: the Microseconds() call on the Mac is generally held to
have about a 20 microsecond resolution. HOWEVER, in actual timing on a
quadra, the code:

Microseconds(&start);
Microseconds(&end);

Yields an average of 18-20 with a range of 0 (less than 1 microsecond)
latency to 92. That's pretty big, no?

ON a PowerMac 9500/120 (120Mhz 604), the same code has an average latency
of 40 microseconds, with a range of 0 to 307!

Since the minimum is <1u-sec, I suspect that the call itself is not
emulated, however, the current interupt setup is said to be, so if there
is a lot of interupt-driven events occuring at roughly the same time, the
emulation overhead could be horrible, contention-wise. Cache-thrashing
galore.

Another problem is the issue of algorithm design. The standard
column-stretching texture-mapping popularized in Doom can be the source of a
horrible
slowdown on machines that depend on L2 cache coherency for speed. If, by
some unfortunate "coincidence," the scan-line width of your offscreen
buffer is related to a relatively large power of 2 (e.g. 640 =
2x2x2x2x2x2x5), you can be overwriting your cache every few pixels.
"Staggering" your buffer (this works with ANY array that is accessed
vertically, btw) by padding the end with 32 bytes of unused memory, will
eliminate the problem.

The above is an example of how an algorithm that served to speed *up* a
program (column-stretching only requires 1 divide/column as compared to 1
divide/pixel with generic texture-mapping) can actually slow *down* a
different system unless you are aware of where/why the slow-down is
occuring and can compensate.

The most subtle problem with optimization on a PowerPC that I am aware of is
due to I-Cache thrashing because something in your inner loop is
coincidentally conflicting with something else *called* by your inner
loop. This problem (from what I've heard) can be often be eliminated or at
least reduced by reordering the compile sequence of your various
libraries and files, but I don't know of any way of avoiding the problem
completely since you generally can't control where system libraries and
ROM are loaded in relation to your own code.

The moral is that the new architectures require a new way of optimizing.
The other moral is less obvious: the so-called experts on things like
emulators that work for the big companies may not be the world's best in
their field. Compare the speed of the current PCI-only emulator from
Apple, which does dynamic compilation with SpeedDoubler or the 68K
emulator from Ardi(?) which also do dynamic compilation. The third-party
versions generally are faster, simply because the guys that found them
are often folks who have an affinity for that particular problem, whereas
the employees of the maga-companies may not be quite as creative, since
they often were *assigned* the job, rather than created it.

Bottom line:

don't judge an instruction set by how it runs Microsoft Word and company,
but rather by how it runs the latest platform-specific ground-breaking game.

Have a look at the speed of Marathon 2, for instance, to get some idea of
what well-optimized PowerPC code is capable of...
[snipt]

: Remember that a half-decent optimizer on a fast machine will almost surely

: produce faster code than an incredibly clever optimizer on a slow machine.

Not really. Careful choice of algorithms is part of optimization, and the
difference between 2 mathematically equivalent algorithms can often be
the equivalent of 2-4 generations of hardware (or more!) speed-wise.

BTW, the latest processors often have built-in performance evaluators.
E.g., the two performance monitor registers on the PPC 604 that can
monitor about 30 different conditions including D-cache and I-Cache
misses, etc. Has anyone given any thought on how to create compilers and
test-data that could *automatically* take advantage of the information
available via these registers?

Talk about state-of-the-art, feedback-driven optimizing compilers!

--
-------------------------------------------------------------------------------
Lawson English __ __ ____ ___ ___ ____
eng...@primenet.com /__)/__) / / / / /_ /\ / /_ /
/ / \ / / / / /__ / \/ /___ /
-------------------------------------------------------------------------------

Zalman Stern

unread,

Jan 2, 1996, 3:00:00 AM1/2/96

to

Lawson English (eng...@primenet.com) wrote:
: Let's not forget overhead of emulated vs native calls, especially if you

: have possible "context-switching" going on in there.

: Case in point: the Microseconds() call on the Mac is generally held to
: have about a 20 microsecond resolution. HOWEVER, in actual timing on a
: quadra, the code:

: Microseconds(&start);
: Microseconds(&end);

: Yields an average of 18-20 with a range of 0 (less than 1 microsecond)
: latency to 92. That's pretty big, no?

: ON a PowerMac 9500/120 (120Mhz 604), the same code has an average latency
: of 40 microseconds, with a range of 0 to 307!

: galore.

The two machine have different system software and a different bus. The
implementation of Microseconds is also likely different. I would not draw
too many conclusions about PowerPC vs. 68k from this.

For comparison, a routine which queries the POWER style realtime clock
registers on a Power Macintosh 8500/120 (120Mhz 604) delivers a 6us min
time, 7us average, and 2*ms* max time for 1 million iterations. (This call
involves an illegal instruction trap as the POWER RTCL/RTCU registers are
not implemented in the PowerPC architecture.) (Note that the max number
here is not the same as interrupt latency.)

[An example where padding for cache is important.]

: The most subtle problem with optimization on a PowerPC that I am aware of is

: due to I-Cache thrashing because something in your inner loop is
: coincidentally conflicting with something else *called* by your inner
: loop. This problem (from what I've heard) can be often be eliminated or at
: least reduced by reordering the compile sequence of your various
: libraries and files, but I don't know of any way of avoiding the problem
: completely since you generally can't control where system libraries and
: ROM are loaded in relation to your own code.

These are both well known problems. Especially for direct mapped caches.
Early in the days of MIPS (like the mid-80's), there was a program called
cord (and later cord2) that reordered code to minimize I-Cache mises. (I
believe cord was written by Mark Himelstein, though Earl Killian may have
had a hand in it as well. I expect if you crank up the highest optimization
levels of SGI's latest greast compilers, some cache reordering takes
place.)

: The moral is that the new architectures require a new way of optimizing.

Maybe new to Lawson... Some of us have been discussing this stuff on
comp.arch for ten years or more. And I'm sure most of these techniques were
at least known if not perfected ten or twenty years before
that. (I.e. caches have been around for a long long time. The state of the
art in cache analysis and modeling has gotten a lot better in the last 10
to 15 years.)

: The other moral is less obvious: the so-called experts on things like

: emulators that work for the big companies may not be the world's best in
: their field. Compare the speed of the current PCI-only emulator from
: Apple, which does dynamic compilation with SpeedDoubler or the 68K
: emulator from Ardi(?) which also do dynamic compilation. The third-party
: versions generally are faster, simply because the guys that found them
: are often folks who have an affinity for that particular problem, whereas
: the employees of the maga-companies may not be quite as creative, since
: they often were *assigned* the job, rather than created it.

Beware morals from the clueless...

I've met the two guys responsible for Apple's PowerPC emulation
technology. Gary Davidian developed the original interpreter for the first
release of Apple's PowerPC products. Eric Traut implemented the dynamic
recompiling emulator for the second major release of MacOS for PowerPC. The
dynamic recompiling emulator is based on caching the results of the
interpretive emulator with a small amount of post pass optimization. I also
have used SpeedDoubler and have some info on what Ardi's product does.

Both Gary and Eric were (and are I expect) quite gung ho about building
fast emulators. Gary built his interpreter in an environment where a lot of
people believed emulation would never work. Eric did three complete
implementations of the dynamic recompiling emulation technology with
significant quantitative analysis to refine the result. (This was partially
in an attempt to get acceptable performance on a machine with insufficient
cache. Which turned out to be basically impossible.) It got faster and
better each time. The end result is a *product* with significantly better
performance and *rock-solid* reliability.

In this particular instance, both Apple's emulators shipped before the
competition and delivered higher reliability. They are part of Apple's
ongoing system software projects and hence have to work in a complex
"legacy" environment. (As does SpeedDoubler. Ardi's product does not use
Apple's OS so its constraints are different.)

SpeedDoubler certainly did not deliver equivalent reliability to the Apple
emulator at first ship. (The first machine we tried it on would not boot
after the install.) Connectix has done a couple of rev's on it now and I
hear its pretty reliable. This would have been unacceptable in an Apple
product. (Imagine shipping tens of thousands of machines that crash on
boot...) Ardi Executor does not run anywhere near all Macintosh
software. It is a very interesting technology and holds promise for the
future but it does not imply Apple's engineers did less than the best
possible job because they decided to build something they could ship.

-Z-

Lawson English

unread,

Jan 2, 1996, 3:00:00 AM1/2/96

to

Zalman Stern <zal...@netcom.com> wrote:
[lots of corrections and clarifications snipt]

Thanks.

Craig Nelson

unread,

Jan 2, 1996, 3:00:00 AM1/2/96

to

zal...@netcom.com (Zalman Stern) wrote:
>Lawson English (eng...@primenet.com) wrote:

>: The most subtle problem with optimization on a PowerPC that I am aware of is
>: due to I-Cache thrashing because something in your inner loop is
>: coincidentally conflicting with something else *called* by your inner
>: loop. This problem (from what I've heard) can be often be eliminated or at
>: least reduced by reordering the compile sequence of your various
>: libraries and files, but I don't know of any way of avoiding the problem
>: completely since you generally can't control where system libraries and
>: ROM are loaded in relation to your own code.
>
>These are both well known problems. Especially for direct mapped caches.
>Early in the days of MIPS (like the mid-80's), there was a program called
>cord (and later cord2) that reordered code to minimize I-Cache mises. (I
>believe cord was written by Mark Himelstein, though Earl Killian may have
>had a hand in it as well. I expect if you crank up the highest optimization
>levels of SGI's latest greast compilers, some cache reordering takes
>place.)
>

As I tried to point out once before, there is no excuse for having
direct-mapped
I-caches. One may refer to a patent application from LSI or consider the
following short explanation. One wishes to predict the next I-cache line in the
presence of branches. One does this by appending cache indices and hints
as to whether or not the branch is likely to be taken to the cache line.
These are known as branch-following, self-sequencing, or "intelligent"
I-caches.
The speed penalty is approximately a mux and some fanout. Now assume
there is not a branch but consider the next sequential set (non-standard usage)
to be a prefix of the index. Ahh! Now we have a set-associative cache that
accesses as fast as an "intelligent" I-cache.

I thank LSI for the opportunity to develop this and other improvements
to I-caches as part of the Lightning project. Too bad we didn't get
to finish, but that is another story. Some companies are better
not to partner with.

Stefan Monnier

unread,

Jan 2, 1996, 3:00:00 AM1/2/96

to

In article <4c9hum$p...@nnrp1.news.primenet.com>,
Lawson English <eng...@primenet.com> wrote:
] The most subtle problem with optimization on a PowerPC that I am aware of is

] due to I-Cache thrashing because something in your inner loop is
] coincidentally conflicting with something else *called* by your inner
] loop.

Now, why should this only occur on PowerPC ?
It's just related to the presence of an I-cache. Even the 68040 has an I-cache.

Stefan

Andy Belk

unread,

Jan 8, 1996, 3:00:00 AM1/8/96

to

I'm just a programmer visiting here so please excuse me ... ;-)

In article <jgkDKD...@netcom.com> Joe Keane <j...@netcom.com> writes:
> I think that this was a good thread.
>
> One thing that Hennessy & Patterson stress repeatedly is that `big
> code bad' doesn't cut it as a good argument. To make a useful analysis,
> it's important to have real numbers about the trade-off between code
> size and performance.
>
> It's certainly not true that if you double code size, your machine will
> run twice as slow. It won't be exactly the same either, but that's
> probably much closer to the truth. My guess for a typical penalty is
> something like 10%.

Er ... no. If my application is 2 MB on one architecture, it will run
fine and dandy in 3MB of memory. If it's 4 MB, it will likely spend a
lot of it's time swapping (or paging or whatever) and will run more
than 10 times slower.
This is the problem with bloated code (be it the programmer's fault or
the chip architecture ...) in real world situations, and is independent
of the increased bandwidth requirements between CPU and main memory,
which have much smaller effects.

Warning: Due to its censorship, CompuServe and its subscribers
are expressly prohibited from storing or copying this document
in any form.

Andrew Robert Geweke

unread,

Jan 26, 1996, 3:00:00 AM1/26/96

to

Andy Belk (be...@il.us.swissbank.com) wrote:
: I'm just a programmer visiting here so please excuse me ... ;-)

: In article <jgkDKD...@netcom.com> Joe Keane <j...@netcom.com> writes:
: > I think that this was a good thread.
: >
: > One thing that Hennessy & Patterson stress repeatedly is that `big
: > code bad' doesn't cut it as a good argument. To make a useful analysis,
: > it's important to have real numbers about the trade-off between code
: > size and performance.
: >
: > It's certainly not true that if you double code size, your machine will
: > run twice as slow. It won't be exactly the same either, but that's
: > probably much closer to the truth. My guess for a typical penalty is
: > something like 10%.

: Er ... no. If my application is 2 MB on one architecture, it will run
: fine and dandy in 3MB of memory. If it's 4 MB, it will likely spend a
: lot of it's time swapping (or paging or whatever) and will run more
: than 10 times slower.
: This is the problem with bloated code (be it the programmer's fault or
: the chip architecture ...) in real world situations, and is independent
: of the increased bandwidth requirements between CPU and main memory,
: which have much smaller effects.

This is also false. Unless (a) your VM has major problems, or (b) your
code has far less locality than most, you still won't have problems. You
will load in the 3MB of code most frequently used; I would guess that
in a 4MB-code program (this is enormous!), this is going to be well over
99.999....% of the code.

Besides, most memory partitions on programs these days are due to large
data and runtime requirements, not code size. If you're talking about
'typical' personal computer apps, apps with code of over 1MByte are
rather rare.

It's still the memory-to-processor bottleneck that matters more, and
that's why RISC chips have large caches, among other things.

/ ag

Serge Pachkovsky

unread,

Jan 26, 1996, 3:00:00 AM1/26/96

to

Andrew Robert Geweke (gewe...@cps.msu.edu) wrote:
: will load in the 3MB of code most frequently used; I would guess that

: in a 4MB-code program (this is enormous!), this is going to be well over

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most certainly it is not! The code I am now working on, 100.000 lines
in Fortran (which I would classify as a medium-size quantum chemistry
program), compiles to executable with .text segment (i.e., code) size of
1,8Mb using mips2 (32bit) instruction set and 3.3Mb using mips4 (64bit)
instruction set. I can name several QC programs in wide circulation with
code sizes at lest an order of magnitude larger, and I doubt situation
is any different with non-trivial codes in other areas.

Regards,

/Serge.P

--

Russian guy from the Zurich university...

Zalman Stern

unread,

Jan 26, 1996, 3:00:00 AM1/26/96

to

In article <4ea4va$t...@msunews.cl.msu.edu>, gewe...@cps.msu.edu (Andrew

Robert Geweke) wrote:
> Besides, most memory partitions on programs these days are due to large
> data and runtime requirements, not code size. If you're talking about
> 'typical' personal computer apps, apps with code of over 1MByte are
> rather rare.

I'd hardly say "rare." Here are datafork sizes for a few Power Macintosh
native applications:
Microsoft Word 6.0: 3.6M
Microsoft Excel 5.0: 6.5M
Adobe Photoshop 3.0.5: 5.2M
Netscape Navigator 1.1N: 1.2M

These numbers overstate things slightly as they include static data and
relocation info. (Relo info is reasonably dense in Apple's PEF format.)
Netscape may fall under a meg in actual code size, but that's probably no
longer true in 2.0. Note that most "data" for a Macintosh application
lives in resources which are not counted in the above numbers.

One can look at other Macintosh apps and one can look at Windows apps. I'm
pretty sure you'll find that many if not most commercial applications have
more than a megabyte of compiled code.

Zalman Stern, Caveman Programmer, Macromedia Video Products, (415) 378 4539
3 Waters Dr. #100, San Mateo CA, 94403, zal...@macromedia.com
"Its not an 'attitude,' it's a fact!" -- Calvin (Bill Watterson)

Joe Keane

unread,

Jan 29, 1996, 3:00:00 AM1/29/96

to

That reminds me, i found the article that people were commenting on.
I'd say that it's basically `a whole lot of nothing'.

It doesn't help that the author implies that all the major vendors are
conspiring to hide the fact that RISC is a total failure. There may be
something true along these lines, that it's an over-used buzzword, and
recent machines are less simple, but that point could be argued better.

He shows us how an emulator can take dozens of instructions for each one
emulated. Then, binary recompilation, trying to preserve the exception
model, is better but still has some cost. The point isn't clear to me.

The major `discoveries' though, are the following: code runs many times
faster if it fits in cache; and cycle-counting doesn't work so well if
you ignore memory-system overhead. Well, thank you and duh to that.
I'd like to hear comments if you think that i'm missing something.

For comparison, in the same issue, there's an article, with lots of
real, interesting data, by some Japanese guys who made a low-power RISC
chip that has, among other things, better code density than the 68K.

Nathan D. Tuck

unread,

Feb 1, 1996, 3:00:00 AM2/1/96

to

In article <4eav4r$e...@rzunews.unizh.ch>,

Serge Pachkovsky <p...@ocisgi7.unizh.ch> wrote:
>Andrew Robert Geweke (gewe...@cps.msu.edu) wrote:
>: will load in the 3MB of code most frequently used; I would guess that
>: in a 4MB-code program (this is enormous!), this is going to be well over
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>Most certainly it is not! The code I am now working on, 100.000 lines
>in Fortran (which I would classify as a medium-size quantum chemistry
>program), compiles to executable with .text segment (i.e., code) size of
>1,8Mb using mips2 (32bit) instruction set and 3.3Mb using mips4 (64bit)
>instruction set. I can name several QC programs in wide circulation with

Just noting that this would be a result of different optimizations
and levels of loop unrolling etc with the mip2 vs mips4 compilers.
Instruction widths are the same, so there theoretically shouldn't be
any more code. Text size could swell for other reasons though -
storing 64 bit constants and jump tables.

nate

John Bayko

unread,

Feb 2, 1996, 3:00:00 AM2/2/96

to

In article <jgkDLx...@netcom.com>,
Joe Keane <j...@netcom.com> wrote:
> [...]

>
>For comparison, in the same issue, there's an article, with lots of
>real, interesting data, by some Japanese guys who made a low-power RISC
>chip that has, among other things, better code density than the 68K.

The Hitachi SH series - touched briefly in the Great
Microprocessors list, viewable from the CPU Info Center at:

http://infopad.eecs.berkeley.edu/CIC/

or my home page.
But while it's got good code density, (so does the ARM CPU),
there's no indication in the article how fast it is.
But the real point of RISC was to see what you can get rid of and
make a better (in whatever terms you want - faster, cheaper, lower
power, or whatever) CPU from there. I'm sure it'll happen again,
a few decades hence.

--
John Bayko (Tau).
ba...@cs.uregina.ca
http://www.cs.uregina.ca/~bayko