
Unsafe at any speed


Colin Plumb

Oct 31, 1989, 8:07:21 PM
Has anyone read the "Stop Bit" column on the back page of the November
BYTE? Written by Dave Nelson, it engages in RISC-bashing, which I have
no objection to, but uses such ridiculous arguments I have a strong
desire to write a rebuttal. But perhaps someone out there with a
heavier-duty arsenal of facts will do a better job?

Here are the points raised:
- ``Yet RISC microprocessors remain controversial, and for good reason.''
-- Certainly, everyone is worried about binary compatibility issues, which
Mr. Nelson never mentions. But do I hear anyone debating the speedups?
- ``Throwing out complex, multiple-cycle instructions (e.g., floating-point
arithmetic and character-string manipulation)'' -- gee, have I seen anyone
throwing out floating-point math?
- Throwing out multi-cycle instructions raises MIPS ratings without raising
work done -- true enough
- ``RISC architects eagerly sacrifice instruction-set power to increase MIPS.''
-- And I thought the point was to carefully choose so as to maximise the
instruction set power x MIPS product...
- ``"MIPS are like RPM: they tell you how fast the engine is running, but they
don't tell you if it's doing any work."'' -- True, but that doesn't
detract from real performance gains, merely masks them, and the marketroids
are starting to realise that the customers are starting to realise that
native MIPS don't mean much. An Inmos Transputer is about 15 native MIPS,
since each byte is a separate instruction, but I'm not about to call it
15 times a VAX.
- ``In any case, processor speed is only one aspect of computer system
performance. Nevertheless, RISC designers focus on processor speed because
it is the main determinant of their favourite workloads: purely computational
tasks.'' -- No, he doesn't talk at all about I/O bottlenecks, he talks about
the instruction-fetch bottleneck. I wonder what happened to all the work
people have put into memory interfaces. It's gotta be here somewhere...
- RISC architects make the convenient simplification of discarding
instructions that have a low "average frequency of execution," increasing
the frequency-duration product and reducing overall performance. --
Errant nonsense... I have never seen *anything* that does not deal
exclusively with frequency-duration products and the effect of the slowdown
on uncommon instructions on the common ones. That's kind of the point:
not that eliminating the VAX convert instruction is good in itself, but
that the loss is massively compensated for in other ways. (Why am
I preaching to the converted?)
- RISC designs exacerbate the von Neumann bottleneck, due to having to
fetch greater numbers of instructions. -- I haven't noticed the
*number* of instructions rising much, just their *size*. And sequential
instruction fetch is easy to speed up. It's jumps that kill you.
- ``To mitigate the effect of RISC's high ratio of instruction-to-operand
access, RISC designers introduce complex auxiliary-support mechanisms,
such as instruction pipelines and multiple register sets.'' -- I've got
news for you, bub. Register windows (I think that's what he's talking
about, not barrel processors) *increase* the i-d access ratio, like they're
supposed to. And pipelines increase performance dramatically on *any*
architecture. It's just a lot *easier* to implement them on RISC machines.
- ``These mechanisms, together with weak instructions, make system software
for a RISC inherently more complicated, and thus less reliable, than the
corresponding CISC software'' -- I had noticed a distinct trend the other
way.
(This is a goodie)
- ``To speed instruction execution, RISC designers require all operands to
be accessed from typeless registers. Thus, RISC processors acquire artificial
operand uniformity at the expense of rendering operands typeless and
thereby forfeiting processor-based type- and value-checking. Since
incorrect operations on data values is a major source of errors, safety is
being sacrificed for speed.'' -- Oh, now he's arguing for Lisp machines
and data tags? I thought he was talking about traditional CISCs (VAXen,
etc.). The safety argument can be made, although it takes a lot more work
than a bland assertion, and I could point out that most people aren't
much more pleased with a run-time detected fatal error than with a
crash or bogus result, and would rather the programmer didn't make the
mistake in the first place. Such error-checking is useful during
development, but it seems that emulating type-checking on fast processors
is faster than doing it in hardware in many cases.
- ``..."complex" operations (i.e., operations that actually process data).''
-- Your typical VAX program can be transliterated to an R3000 program
usually instruction-for-instruction, with the odd sequence for procedure
calls, if a frame pointer is being maintained. The point is not well
taken.

He closes with crediting RISC with inspiring multiple-instruction-per-cycle
processors, which I'm not very sure about, then

``Safe processor architectures, such as object-oriented designs, are not
yet in favour with computer designers. But the same parallelism that lets
a processor execute multiple instructions per machine cycle can also support
concurrent verification of data and program integrity. When computer
designers realize this, they can design safe processors that also have
terrific MIPS ratings.''

-- If you're trying to argue for tagged-type architectures, you almost lost
me, and sure as heck confused the average BYTE reader, who thinks the
80386 is a CISC of the type you're talking about. And your argument does
not support the conclusion in the least. Go ahead and claim that security
is a better goal than MIPS, but do it coherently like a recent discussion
in comp.arch!

Is someone going to refute this nonsense? I'd burn it with the rest of the
junk mail, except that BYTE's readership is large enough that this would
cause significant air pollution, and then comes the task of deprogramming
all the people who read it before I got there...

(I'm flaming, but I don't think the author is here to launch into a war
with me, so the need for politeness is reduced.)
--
-Colin

Jeffrey Lee

Nov 1, 1989, 9:50:18 AM
ccp...@rose.waterloo.edu (Colin Plumb) writes:

>- ``"MIPS are like RPM: they tell you how fast the engine is running, but they
> don't tell you if it's doing any work."'' -- True, but that doesn't
> detract from real performance gains, merely masks them, and the marketroids
> are starting to realise that the customers are starting to realise that
> native MIPS don't mean much. An Inmos Transputer is about 15 native MIPS,
> since each byte is a separate instruction, but I'm not about to call it
> 15 times a VAX.

Argh! Perhaps it's (past) time to start explicitly using the terms NMIPS
(native MIPS) and VMIPS (virtual [usually VAX relative] MIPS) instead
of just MIPS. Anyone who uses ``MIPS'' to mean NMIPS or VMIPS relative
to anything but a VAX/780 unless EXPLICITLY stated should be publicly
censured. It's probably WAY too late, but maybe if the movers and
shakers started the trend, others might eventually catch on.

> ...confused the average BYTE reader, who thinks the
>80386 is a CISC of the type you're talking about.

Even worse, the 80386 and 68030 (among others) use pipelining and
caching and other tricks being spoken against w.r.t. RISC processors.
If these aren't safe in RISC processors...

Time for a mail-in protest campaign in BYTE's Letters to the Editor?
After all, the best way to reach BYTE readers is through BYTE.

j.

Colin Plumb

Nov 1, 1989, 1:35:40 PM
In article <1989Nov1.0...@jarvis.csri.toronto.edu> jo...@db.toronto.edu (Jeffrey Lee) writes:

> Argh! [Native MIPS vs. VAX MIPS]

I agree that VAX MIPS is a measure of non-zero utility, and thank you
for pointing that out. I'll add it to my rebuttal. Whenever I read
MIPS claims, I suspect that's native (and peak native as well), and
think the marketers are idiots. But if you're just talking about
a chip, it's hard to give system MIPS. "A 15 MIPS machine" is
something that executes my C (Fortran, Lisp, ML, ...) code 15 times faster
than a VAX 11/780.

> Even worse, the 80386 and 68030 (among others) use pipelining and
> caching and other tricks being spoken against w.r.t. RISC processors.
> If these aren't safe in RISC processors...

No, he seems to like cacheing... he was complaining that larger RISC
instructions take up more cache space. Solution: use longer cache
lines. Real cheap! (Instruction) cache line size should be
proportional to the size of the instructions between control transfers,
so if that's the only reason for the increased code size, that's all
the solution that's necessary.
--
-Colin

Mark Robert Thorson

Nov 1, 1989, 9:52:35 PM
I've been mulling over this argument against RISC. Please feel free to
blow holes in it, if you can:

Imagine a RISC chip which has an interface to a coprocessor. Let's say the
coprocessor executes all your CISC-only instructions, like string moves and
complicated procedure calls and 4 x 4 matrix transform, etc.

Now, if there is ANY performance benefit to having CISC instructions (and indeed
there must be, otherwise the RISC vs. CISC argument would not exist) your
system will run faster with the CISC coprocessor plugged in, rather than
interpreting the CISC functions with RISC instructions.

Therefore, the RISC vs. CISC argument boils down to whether the die size
occupied by the CISC functionality is a net gain or a net loss when combined
on the same chip with the RISC CPU.

If the additional chip real estate for that CISC functionality pushes your
chip beyond the die size comfortably supported by your fastest CMOS process,
that CISC functionality could very well be a loser. If this is the case,
RISC is a winner using the technology you have today.

The argument against RISC would be that it does not anticipate future
technology. In the future, it will be just as easy to put either the RISC
or the CISC on a single die. The only question will be whether the CISC
functionality is more performance-enhancing than additional cache space.
Your CISC functionality would have to be pretty lousy to lose the contest
against a small amount of cache. If we choose RISC now, we will be
saddled with a sub-optimal architecture when the future comes a-knocking.

If this argument is true, it implies that a hybrid approach is best.
This would be a RISC CPU with a tightly defined CISC superset. Using
today's technology, the optimal implementation would be either a pure
RISC (with the CISC superset interpreted using an unimplemented instruction
trap) or a RISC with a coprocessor interface. Using tomorrow's technology
the CISC CPU would be implemented directly, with no architectural separation
between RISC and CISC.

Then again, you might argue that it's a good thing to throw away all of
your software every 5 or 10 years. Software designed for yesterday's
machines will not use the power of the new machines. Gotta put all that
FORTRAN and COBOL and IBM 360 and real-mode 80x86 code out with the trash.

Eric S. Raymond

Nov 2, 1989, 8:27:08 AM
In <17...@watdragon.waterloo.edu> Colin Plumb wrote:
> Has anyone read the "Stop Bit" column on the back page of the November
> BYTE? Written by Dave Nelson, it engages in RISC-bashing, which I have
> no objection to, but uses such ridiculous arguments I have a strong
> desire to write a rebuttal.

Looks like you've already done it :-). This guy Dave Nelson has a fixation;
I've seen him utter public flame against RISCs before, and he sells a
benchmark suite that's designed to make RISCs look bad (it almost says as much
in the promo literature). It all reminds me of Lyndon LaRouche's paranoid
thing about the British -- maybe Nelson's wife ran off with a RISC designer :-).
--
Eric S. Raymond = er...@snark.uu.net (mad mastermind of TMN-Netnews)

Barry Shein

Nov 2, 1989, 2:09:00 PM

The thing these arguments seem to ignore is how system speed is
affected, beyond mere register-register add instructions or whatever.

At some point we stop straining CPU speed and start straining memory
bandwidth more and more. RISC's are much harder on memory bandwidth
than CISC's in general (I'd be surprised if anyone doesn't believe
that or has proof otherwise.)

So when we finally start hurting for memory bandwidth (ok, if ever)
then all of a sudden CISCs will look better and better.

I realize this doesn't take into account some factors but try the
following (the other factors ignored are merely linear corrections):

At 50 MIPS, assuming one fetch per IP you need 20 nsec memory
accessing, at 100 MIPS, 10 nsec. To support that you also need
gigahertz memory busses. That's ok, that's just about exactly where we
are today at the cutting edge of technology (tho not with uProcessors,
you don't get 10nsec memory access from them and won't for a while.)
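
(To make the arithmetic explicit, here is a back-of-the-envelope sketch --
not part of the original post; the 4-byte instruction size is an assumption:)

#include <stdio.h>

/* Rough figures only: cycle time and fetch bandwidth needed to feed a CPU
 * at a given native MIPS rate, assuming one 4-byte instruction fetch per
 * instruction and no instruction cache. */
int main(void)
{
    double mips[] = { 50.0, 100.0 };
    for (int i = 0; i < 2; i++) {
        double ips = mips[i] * 1e6;          /* instructions per second */
        double cycle_ns = 1e9 / ips;         /* ns available per fetch  */
        double fetch_mb = ips * 4.0 / 1e6;   /* fetch bandwidth, MB/s   */
        printf("%3.0f MIPS: %4.1f ns per fetch, %5.0f MB/s of fetches\n",
               mips[i], cycle_ns, fetch_mb);
    }
    return 0;
}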

But, eventually something is going to break. Putting more than one CPU
on a shared memory bus only makes things worse faster, putting dozens
on (already done regularly) makes things worse much faster.

This isn't an argument against MIMD, it just shows an easy way to get
into trouble today if you can't wait. You can still build MIMD systems
which beat the heck out of most SISD systems on a lot of common
problems, and can for several years to come. The SISD's have no
advantage in this arena anyhow, they're only hurting later because
they're slower, you don't see the problem yet.

Anyhow, consider the SYSTEM, not just the instruction times.

(note: this argument is not original by any means but seems to need
repeating from time to time.)
--
-Barry Shein

Software Tool & Die, Purveyors to the Trade | b...@world.std.com
1330 Beacon St, Brookline, MA 02146, (617) 739-0202 | {xylogics,uunet}world!bzs

Peter da Silva

Nov 3, 1989, 8:36:49 AM
In article <23...@cup.portal.com> m...@cup.portal.com (Mark Robert Thorson) writes:
> In the future, it will be just as easy to put either the RISC
> or the CISC on a single die.

This is where the argument falls apart. The improvements in technology can be
used to:

(a) Build a CISC CPU using current RISC techniques.
(b) Build a RISC CPU using faster techniques.
(c) Build a SIMD or MIMD RISC CPU (for example, VLIW or Superscalar).

You're assuming that for some reason avenues B and C will be cut off at some
point in the future. You haven't shown this is a reasonable assumption.
--
`-_-' Peter da Silva <pe...@ficc.uu.net> <pe...@sugar.hackercorp.com>.
'U` -------------- +1 713 274 5180.
"That particular mistake will not be repeated. There are plenty of mistakes
left that have not yet been used." -- Andy Tanenbaum (a...@cs.vu.nl)

Stan Chow

Nov 3, 1989, 11:38:46 AM
In article <23...@cup.portal.com> m...@cup.portal.com (Mark Robert Thorson) writes:
> ...

>If this argument is true, it implies that a hybrid approach is best.
>This would be a RISC CPU with a tightly defined CISC superset. Using
>today's technology, the optimal implementation would be either a pure
>RISC (with the CISC superset interpreted using an unimplemented instruction
>trap) or a RISC with a coprocessor interface. Using tomorrow's technology
>the CISC CPU would be implemented directly, with no architectural separation
>between RISC and CISC.
>

This is what Intel has done with the 960 family - have a RISC core with
many micro-coded instructions. With higher silicon density, they can move
more of the instructions into hardware.

Come to think of it, this is what *all* CISCs do. Both the Motorola 68K and
Intel x86 families did this. The early micro-coded instructions have been
migrated into hardware, making newer processors faster.

This is also why some people (including myself) have referred to RISC as
a technology window. It is an optimal architecture for today's technology.

Donald Lindsay

Nov 3, 1989, 11:48:07 AM
In article <1989Nov2.1...@world.std.com> b...@world.std.com
(Barry Shein) writes:
>RISC's are much harder on memory bandwidth
>than CISC's in general (I'd be surprised if anyone doesn't believe
>that or has proof otherwise.)

This is sort-of-true, but it's more meaningful to break it down into
three things:

1) CISCs are slower, this year (but that's not a virtue, unless they
are also cheaper)

2) RISCs have less dense instruction encoding (also not a virtue).

3) RISCs tend to have more registers, hence do fewer loads and stores.

>So when we finally start hurting for memory bandwidth (ok, if ever)
>then all of a sudden CISCs will look better and better.

I don't agree. Programs with bad data locality are fairly common.
Programs with bad code locality are much rarer. This means that the
RISC defect, above - code density - is hidden nicely by caches. My
crystal ball doesn't see this changing, so the RISC/CISC issues won't
be influenced (much) by memory bandwidth.
--
Don D.C.Lindsay Carnegie Mellon Computer Science

Jeff Libby

Nov 3, 1989, 12:55:56 PM
In article <1989Nov2.1...@world.std.com>, b...@world.std.com

(Barry Shein) writes:
>
> The thing these arguments seem to ignore is how system speed is
> affected, beyond mere register-register add instructions or whatever.
>
> At some point we stop straining CPU speed and start straining memory
> bandwidth more and more. RISC's are much harder on memory bandwidth
> than CISC's in general (I'd be surprised if anyone doesn't believe
> that or has proof otherwise.)
>
Higher bandwidth memory systems are primarily driven by higher processor
speeds. A 20 Vax-mips RISC CPU requires more memory bandwidth than a 4
Vax-mips CISC CPU mainly because it's faster. A 20 Vax-mips CISC CPU
will require nearly the same memory bandwidth as the same performance RISC
CPU.

> So when we finally start hurting for memory bandwidth (ok, if ever)
> then all of a sudden CISCs will look better and better.

I guess that I would restate this to say "slower processors will look
better when faster processors aren't really faster because they've run
out of memory bandwidth". I don't agree with the CISC vs. RISC
component of this argument, but otherwise it is pretty reasonable.

---------------------------------------------
Jeff Libby
Mips Computer Systems
{decvax,ucbvax,hplabs,sun,ames,prls}!decwrl!mips!libby or li...@mips.com

Alan Lovejoy

Nov 3, 1989, 1:16:52 PM
In article <23...@cup.portal.com> m...@cup.portal.com (Mark Robert Thorson) writes:
>Imagine a RISC chip which has an interface to a coprocessor. Let's say the
>coprocessor executes all your CISC-only instructions, like string moves and
>complicated procedure calls and 4 x 4 matrix transform, etc.

>Now, if there is ANY performance benefit to having CISC instructions (and indeed
>there must be, otherwise the RISC vs. CISC argument would not exist) your
>system will run faster with the CISC coprocessor plugged in, rather than
>interpreting the CISC functions with RISC instructions.

I detect several fallacious assumptions behind this argument:

Fallacious Assumption #1: There is no loss of efficiency when operations are
performed by a coprocessor. In the real world, the laws of physics and the
limitations of current technology exact a performance penalty for using
a coprocessor. However, I suspect that Mr. Thorson's example was intended
as a "thought experiment," where ignoring the loss of efficiency which
results when signals must travel between chips is not only acceptable, but
necessary for the validity of the argument.

Fallacious Assumption #2: Compilers make full and optimum use of all the
instructions that are made available by the architecture. The falseness
of this assumption has been irrefutably proven. The studies which refuted
this assumption served as one of the primary motivations for the development
of the RISC concept.

Fallacious Assumption #3: The RISC speed advantage derives solely from the
fact that RISC CPU's have simpler and fewer instructions. RISC is not a
dogma which dictates a particular set of characteristics for an instruction
set architecture. It is a design METHODOLOGY which seeks to optimize
performance within the constraints of existing technology. The methodology
can be applied to the design of both the instruction set architecture and
the circuit architecture (i.e., the physical CPU implementation). The newest
implementations of most processors use "RISC-like" hardware features (e.g.,
caches, pipelines, hardwired logic, multiple functional units, etc.). But
it is much harder to retrofit an existing instruction set with "RISC-like"
enhancements (e.g., 3-operand instructions, delay slots, large register files,
register windows, etc.). It is often convenient to conceptually separate
instruction set architecture from circuit architecture, but in the real
world each influences the other in often subtle ways. There are aspects of
a RISC CPU's instruction set architecture which are dictated by the needs
of the circuit architecture, and vice-versa (e.g., fixed-length instructions,
address alignment restrictions). Grafting RISC-like hardware features onto
an old CISC instruction set is not likely to result in an optimum design.
RISC performance also derives from a synergistic confluence of compiler
technology, instruction set semantics and hardware implementation strategies.
Once you have committed yourself to an instruction set architecture, your
ability to optimize the performance of your SYSTEM AS A WHOLE becomes
constrained (because the SYSTEM has fewer degrees of freedom). Of course,
this is a problem for ALL existing architectures, whether they are called
CISC or RISC.

>Therefore, the RISC vs. CISC argument boils down to whether the die size
>occupied by the CISC functionality is a net gain or a net loss when combined
>on the same chip with the RISC CPU.

In contrasting "CISC" with "RISC," it is important to be very clear whether
you mean "the instruction set architecture of the old standby CPU" vs. "the
instruction set architecture of these newfangled processors with no software
base," or whether you mean "instructions with complex semantics" vs.
"instructions with simple semantics." I suspect you mean the latter.

There are several reasons why RISC proponents prefer instructions with
simple semantics:

1. Simple instructions are more likely to be used by code generators, because
they are more generic. Simple instructions are more generic because they
have a finer semantic granularity: they are less likely to have unwanted
and/or unneeded "side-effects." Generic instructions are more likely to
be useful in any given algorithm.

2. It is easier for code generators to optimize using simple instructions,
because there are more degrees of freedom available. The code generator can
more easily emit instructions which only do the work required, in the optimum
sequence of steps, when each instruction has very simple semantics. This
is analogous to the difference between assembly language and high level
languages: it is easier to write optimum code in a lower level language,
in spite of the fact that one HLL statement may require 10 assembly statements.

3. Simple instructions require less instruction-specific circuitry, which
means there is more die area available for circuitry which can be used to
globally (generically) speed-up all instructions (e.g., caches, registers,
pipelines, multiple functional units, etc.).

4. Simple instructions are less likely to cause the processor to stall
while waiting for some subfunction (required by the complex semantics of the
instruction) to be completed. Asking the machine to do more for each
instruction will not increase performance if each instruction will take
longer (real time, not necessarily clock cycles) to execute. The assumption
that additional circuitry can always compensate for the increase in
work-per-instruction is just as false as the assumption that hiring more
workers will always result in getting the project completed sooner. Some
things must be done sequentially, and no amount of additional hardware
will enable you to perform the second step before the first step has been
completed. It is NOT efficient to ask the machine to do multiple things in one
instruction when those things INHERENTLY CANNOT BE DONE IN PARALLEL!!! Hence
the RISC infatuation with REGISTER-REGISTER instructions and simple
addressing modes.

Much more than die size is at issue.


____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me.
Motto: If nanomachines will be able to reconstruct you, YOU AREN'T DEAD YET.

Robert Herndon

Nov 3, 1989, 2:55:55 PM
In article <1989Nov2.1...@world.std.com>, b...@world.std.com (Barry Shein) writes:
>
> The thing these arguments seem to ignore is how system speed is
> affected, beyond mere register-register add instructions or whatever.
>
> At some point we stop straining CPU speed and start straining memory
> bandwidth more and more. RISC's are much harder on memory bandwidth
> than CISC's in general (I'd be surprised if anyone doesn't believe
> that or has proof otherwise.)
>
> So when we finally start hurting for memory bandwidth (ok, if ever)
> then all of a sudden CISCs will look better and better.

Hmmmmm.... One major advantage of RISCs is that because their addressing
modes are so restricted, one tends to optimize the number of programmed
memory references (as opposed to fetch references) fairly carefully.
One of the biggest problems of CISCs is that with an auto-increment-memory-
indirect type of operation, the CPU must wait for memory to return the
pointer so it can increment and write it back and fetch the final indirect
operand, and the processor can't execute any other operations (well, not
without difficult conflict checking) while the CPU is stalled waiting for
memory. In a RISC, one does the read,
does some other stuff (if your optimizer works and there is other
stuff to do), starts the indirect read, increments the result, maybe
does some other stuff, and then as late as possible issues the
write of the incremented value. Provided the pointer isn't kept
in a register somewhere and incremented there.
This is one of the most important wins for RISCs -- yes, memory
is slow, so you make memory reference instructions not dependent on
anything else, and stall the CPU only when necessary.
Fetching here is a separate case: caches work quite well in the
general case for instructions, and branch target caching and delayed
branches also help. Here RISCs are usually worse than CISCs, but
the problem isn't insoluble.
--
Robert Herndon Dept. of Computer Science,
...!ihnp4!umn-cs!herndon Univ. of Minnesota,
her...@umn-cs.ARPA Minneapolis, MN 55455
herndon...@csnet-relay.ARPA

Preston Briggs

Nov 4, 1989, 1:46:38 PM
In article <1989Nov4.0...@ico.isc.com> r...@ico.isc.com (Dick Dunn) writes:
>Finally, what's a realistic difference in code size between RISC and CISC?
>Is it all that much?

>Obviously I'm just conjecturing a counterargument. Anybody got numbers?

Sure, I've got numbers...
Here's a sampling of fortran routines, numerically oriented,
that I experimented with. I started with the linpack routines,
added a few more, and eventually got bored.

code size is measured with "size *.o"

routine  | Sun 3/260 | MIPS M/120 | RT (AOS 4.3) | RT
         |   f77 -O  |   f77 -O2  |    f77 -O    | experimental
---------+-----------+------------+--------------+-------------
daxpy    |       328 |        848 |          476 |          352
ddot     |       360 |        736 |          508 |          332
dgefa    |       560 |        832 |        1,060 |          964
dgesl    |     1,008 |      1,088 |        1,236 |        1,172
dmxpy    |     3,272 |      3,616 |        4,148 |        2,976
dscal    |       168 |        384 |          256 |          212
epslon   |       264 |        144 |          232 |          132
idamax   |       304 |        240 |          400 |          216
matgen   |       432 |      1,392 |          680 |          480
second   |        24 |         48 |           92 |           44
svd      |     7,328 |     11,792 |       12,580 |        6,488
decomp   |     1,912 |      2,768 |        2,544 |        1,720
solve    |       576 |        912 |          776 |          500

We've got other machines around, so someday I'll extend the test.
I'm particularly interested in the SPARC, the RT AIX compiler,
and the IBM 3081. You see, my conclusion from the above chart is that
the compiler maturity matters more than the machine.

Consider the RT. With different compilers, we see it moving from worst to
best on nearly every example. Actually, our experimental compiler isn't
mature -- it's just what happens when you have no deadlines and lots
of curiosity.

Of course, this whole method of measuring architectural merit is
pretty suspect. What happens if one of the compilers is aggressive
and starts unrolling loops? Bigger code, surely; but often faster too.
(for example, does the MIPS compiler unroll daxpy? some of svd? others?)

Preston Briggs

Gil Pilz@Eng@Banyan

Nov 3, 1989, 3:48:46 PM
Umm . . I hate to bother you all but could someone break down the
meaning of the acronyms "MIMD" and "SISD" for me ? I keep seeing them
in this discussion and the context hasn't made the general meanings
clear. Thanks.

Gilbert W. Pilz Jr. g...@banyan.com

"wish I was as cool as Calvin
this sort of thing wouldn't bother me
making fun of bums, making fun of bums
bad karma thing to do"
- too much joy

Preston Briggs

Nov 4, 1989, 2:32:20 PM
In article <26...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs) writes:
>In article <1989Nov4.0...@ico.isc.com> r...@ico.isc.com (Dick Dunn) writes:
>>Finally, what's a realistic difference in code size between RISC and CISC?
>>Is it all that much?

>>Obviously I'm just conjecturing a counterargument. Anybody got numbers?

>Sure, I've got numbers...
>Here's a sampling of fortran routines, numerically oriented,

And certain elements among you ask:
"What about C?
What about systems routines?"

I also compiled a copy of Hans Boehm's conservative garbage collector
on 4 different machines.

          | RT (AOS 4.3) | Sun 3/260 | MIPS M/120 | Sparc
file      |    hc -O     |   cc -O   |   cc -O    | cc -O
----------+--------------+-----------+------------+-------
alloc     |        1,848 |     1,704 |      2,752 | 2,240
allochblk |          904 |     1,000 |      1,296 | 1,288
markroots |          192 |       176 |        192 |   152
misc      |          576 |       568 |        800 |   640
reclaim   |          472 |       504 |        672 |   656

So, what do we see? I'd say the MIPS and the SPARC make bigger
code than the RT or the SUN. (Note that the RT has 2 byte and 4 byte
instructions.) Of course, the MIPS and the SPARC smoke the other two
machines.

Given a slightly larger instruction cache, we should (as evidenced
above) see the same number of instruction fetches. Given RISC machines'
(typically) larger register sets, we should see fewer data accesses.

Preston Briggs

Dick Dunn

Nov 3, 1989, 7:45:29 PM
b...@world.std.com (Barry Shein) writes:

> At some point we stop straining CPU speed and start straining memory
> bandwidth more and more. RISC's are much harder on memory bandwidth
> than CISC's in general (I'd be surprised if anyone doesn't believe
> that or has proof otherwise.)

I assume that Barry means that for a RISC and a CISC running at the same
effective speed (i.e., getting the same amount of work done; running the
same program in the same amount of time), the RISC is much harder on memory
bandwidth. In that case, no, I don't really believe it...I'll buy that it
may push the memory a little more, but I wouldn't expect it to be a lot.

Why not? Two reasons: First, data memory access is approximately the
same...if you're doing the same work, you're going to have essentially the
same data access pattern. (This assumes lots of other things being equal,
which they won't be, of course.:-) Second, I would expect better locality
for code reference than for data reference, hence the I cache ought to do
more good than the D cache. Aren't the pathological cache-busting programs
generally ones which spray data accesses all over the place?

Finally, what's a realistic difference in code size between RISC and CISC?
Is it all that much?

Obviously I'm just conjecturing a counterargument. Anybody got numbers?

--
Dick Dunn r...@ico.isc.com uucp: {ncar,nbires}!ico!rcd (303)449-2870
...Worst-case analysis must never begin with "No one would ever want..."

Larry Weber

Nov 4, 1989, 7:24:31 PM
In article <26...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs) writes:
> stuff deleted ...

>code size is measured with "size *.o"
>
>routine| Sun 3/260 | MIPS M/120 | RT (AOS 4.3) | RT
> | f77 -O | f77 -O2 | f77 -O | experimental
>--------+---------------+---------------+---------------+-------------
>daxpy | 328 | 848 | 476 | 352
>ddot | 360 | 736 | 508 | 332
>dgefa | 560 | 832 | 1,060 | 964
>dgesl | 1,008 | 1,088 | 1,236 | 1,172
>dmxpy | 3,272 | 3,616 | 4,148 | 2,976
>dscal | 168 | 384 | 256 | 212
>epslon | 264 | 144 | 232 | 132
>idamax | 304 | 240 | 400 | 216
>matgen | 432 | 1,392 | 680 | 480
>second | 24 | 48 | 92 | 44
>svd | 7,328 | 11,792 | 12,580 | 6,488
>decomp | 1,912 | 2,768 | 2,544 | 1,720
>solve | 576 | 912 | 776 | 500
>
> ... stuff

> What happens if one of the compilers is agressive
>and starts unrolling loops? Bigger code, surely; but often faster too.
>(for example, does the MIPS compiler unroll daxpy? some of svd? others?)
>

The MIPS compilers do loop unrolling at opt level -O (O2). To check to
see if this is the case, use the -V option of the compiler and look for
references to 2.0 compilers. I would guess that accounts for the big
size differences.

In general the number of instructions don't differ that much between
RISC and CISC except in the following places:
o individual loads and stores instead of store/load multiple
o explicit instructions on prolog/epilog (contrasted with calls on the VAX)
o delay slots (I use 6% as a rule for this expansion).

The space difference also includes fixed instruction size effects. Comparing
MIPS to the 68000, we can see MIPS has three 5-bit register fields for 15 bits and
the 68000 (extreme case) has two 3-bit fields for 6 bits of register designators.
This alone accounts for a significant difference in average instruction size.

On the RISC side is the use of more registers that remove load and store
instructions. The net result of all these effects is RISC machines
requiring more instruction space (10%-50%), similar instruction counts, and
slightly fewer data references (assuming most data references are really
from within a loop for an array - these can't generally go into a register).

While at some point instruction bandwidth could be a limiting factor, using
I caches, wide buses and instruction streaming keeps this under control
for a long while.
--
-Larry Weber DISCLAIMER: I speak only for myself, and I sometimes deny that.
UUCP: {ames,decwrl,prls,pyramid}!mips!larry OR la...@mips.com
DDD: 408-991-0214 or 408-720-1700, x214
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Eric S. Raymond

Nov 4, 1989, 8:37:44 PM
In <1TKvy4#504tdZ=er...@snark.uu.net> I wrote:
> Looks like you've already done it :-). This guy Dave Nelson has a fixation;

I confused two Nelsons. The guy with the anti-RISC fixation is *Neal* Nelson.
Apologies to Dave Nelson; he may be just as misguided ;-), but we have no
evidence that he's flaming for commercial gain.

I visited Neal Nelson's booth at UNIX EXPO, btw. His figures confirm what I
suspected, that my 16Mhz 386 box is within a couple percent of a Sun 3/180's
performance for user count below about 4 (the 6386's ESDI disk I/O subsystem
maxes out faster -- it would be interesting to know if the bottleneck is at
bus, disk, or controller). I trust these figures because both chips are CISCs.

I feel a quiet glow of satisfaction. It's not the bare fact that I've got
a Sun-3 equivalent on my desk that makes me grin, it's the fact that it was
done with cheap commodity technology to build a box that meets about every
real open-system spec there is. The times, they are a-changin'...

Eric S. Raymond

Nov 5, 1989, 11:24:29 AM
In <1989Nov4.0...@ico.isc.com> Dick Dunn wrote:
> Second, I would expect better locality
> for code reference than for data reference, hence the I cache ought to do
> more good than the D cache. Aren't the pathological cache-busting programs
> generally ones which spray data accesses all over the place?

Not necessarily. There's a subtle problem here; good software modularity
practices tend to hurt code locality. If you're calling subroutines a lot
in generated code the PC jumps all over the shop.

I have no statistics on this, but I can easily imagine something like, say,
the inner loop of a threaded-code interpreter busting hell out of the I cache
because it's doing the equivalent of an indexed call indirect to some far-off
routine every couple of instructions.
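
(A minimal sketch of what such an inner loop looks like -- illustrative code,
not from any real interpreter: the program is just an array of handler
indices, so every virtual instruction becomes an indexed indirect call to a
handler that may live far away in the text segment.)

#include <stdio.h>

static int stack[16], sp, pc, running = 1;
static int code[] = { 0, 0, 1, 2, 3 };   /* push 1, push 1, add, print, halt */

static void op_push1(void) { stack[sp++] = 1; }
static void op_add(void)   { sp--; stack[sp-1] += stack[sp]; }
static void op_print(void) { printf("%d\n", stack[sp-1]); }
static void op_halt(void)  { running = 0; }

static void (*handlers[])(void) = { op_push1, op_add, op_print, op_halt };

int main(void)
{
    while (running)
        handlers[code[pc++]]();          /* indexed call indirect, every "instruction" */
    return 0;
}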

Has anyone done any systematic investigation of this issue?

Eliot Moss

Nov 6, 1989, 8:32:14 AM
While I would agree that many subroutine calls, or even a threaded
interpreter, would tend to undermine *sequential* instruction reference, and
thus substantially affect the performance of a sequential pre-fetch buffer, a
true cache is insensitive to jumps. What matters for cache performance is
the *total volume* of instruction locations referenced over a range of time,
and whether they tend to fit in the cache. Processor performance may still be
adversely affected, since it may still be faster to buffer and deliver
sequential instruction words than to take a jump, especially on a fast
pipelined machine, like many RISC cpus.

--

J. Eliot B. Moss, Assistant Professor
Department of Computer and Information Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206; Mo...@cs.umass.edu

sh...@ccs1.cs.umass.edu

Nov 6, 1989, 3:36:33 PM
In article <1TMk2X#Qggn6=er...@snark.uu.net> er...@snark.uu.net (Eric S. Raymond)
writes:

>In <1989Nov4.0...@ico.isc.com> Dick Dunn wrote:
>> Second, I would expect better locality
>> for code reference than for data reference, hence the I cache ought to do
>> more good than the D cache. Aren't the pathological cache-busting programs
>> generally ones which spray data accesses all over the place?
>
>Not necessarily. There's a subtle problem here; good software modularity
>practices tend to hurt code locality. If you're calling subroutines a lot
>in generated code the PC jumps all over the shop.

This happens, for example, in a FORTH machine. FORTH is typically
subroutine threaded, so there is a flurry of subroutine calls
happening at about 4 million a second (in an 8-10 MHz (?) Novix 2016
Forth CPU).

In forth there is a subroutine call every five or so instructions
I would guess.

Wonder if some better cache philosophy can help here.

-- shrikumar ( sh...@ccs1.cs.umass.edu, sh...@ncst.in )

Bill Mangione-Smith

Nov 5, 1989, 7:45:42 PM
In article <1TMN4c#5GdcwV=er...@snark.uu.net> er...@snark.uu.net (Eric S. Raymond) writes:
>It's not the bare fact that I've got
>a Sun-3 equivalent on my desk that makes me grin, it's the fact that it was
>done with cheap commodity technology to build a box that meets about every
>real open-system spec there is.
> Eric S. Raymond = er...@snark.uu.net

But wait, he says. I thought cheap commodity technology, in a box that meets
about every real open-system spec, was *the* definition of a Sun 3.
What, did they actually do something interesting with one of the branches
of the Sun 3 tree that I haven't heard of? Mind you, this is *not* a slam.
I use a dec 3100, which now defines 'cheap commodity technology' in the
realm of real unix boxes.

Bill Mangione-Smith
Advanced Computer Architecture Lab
University of Michigan
bil...@dip.eecs.umich.edu

David Collier-Brown

Nov 5, 1989, 10:45:06 PM
er...@snark.uu.net (Eric S. Raymond) writes:
[concerning the locality of reference of an I-cache]

>Not necessarily. There's a subtle problem here; good software modularity
>practices tend to hurt code locality. If you're calling subroutines a lot
>in generated code the PC jumps all over the shop.

The I-cache tends to "lose" every time a subroutine call occurs, unless
the routine is tiny (relative to the cache size). There is a locality-
of-reference gain in programs that are constructed of "packages" or
"objects", though, at the page-reference level: related code lands in the
same page, +/- some fuzz factor. [1]
On the other hand, you get terrible behavior at startup and closedown time
as you go off and ping all the package_$init() and package_$onexit()
routines.

--dave
[1] This can be improved by "binding" commonly-used code together with
a prelinker, bending the hierarchy a bit (;-)).
--
David Collier-Brown, | davecb@yunexus, ...!yunexus!davecb or
72 Abitibi Ave., | {toronto area...}lethe!dave
Willowdale, Ontario, | Joyce C-B:
CANADA. 416-223-8968 | He's so smart he's dumb.

Robert Firth

Nov 6, 1989, 2:17:15 PM
In article <48...@yunexus.UUCP> dav...@yunexus.UUCP (David Collier-Brown) writes:

> On the other hand, you get terrible behavior at startup and closedown time
>as you go off and ping all the package_$init() and package_$onexit()
>routines.

On the other paw, it's really not all that hard to link all the
initialisation code together in one Psect (or whatever) and just
fall through from one module's init code to the next's. All but
the most seriously stupid linkers can do this.

Steve Warren

Nov 6, 1989, 2:51:32 PM
In article <67...@pdn.paradyne.com> al...@oz.paradyne.com (Alan Lovejoy) writes:
> [...] The newest
>implementations of most processors use "RISC-like" hardware features (e.g.,
>caches, pipelines, hardwired logic, multiple functional units, etc.).
[...]
I enjoyed the article. However, I don't think that these features are part
of the differences between RISC and CISC. They are part of any high-
performance architecture (although in general hard-wired logic is more
prevalent in RISC). All these techniques were being used to optimize
performance before RISC.

--Steve
-------------------------------------------------------------------------
{uunet,sun}!convex!swarren; swa...@convex.COM

Peter da Silva

Nov 6, 1989, 8:35:46 AM
In article <7...@zip.eecs.umich.edu> bil...@dip.eecs.umich.edu.UUCP (Bill Mangione-Smith) writes:
> What, did they actually do something interesting with one of the branches
> of the Sun 3 tree that I haven't heard of?

Mild sarcasm: "Yes, they didn't use an intel CPU".

> Mind you, this is *not* a slam.
> I use a dec 3100, which now defines 'cheap commodity technology' in the
> realm of real unix boxes.

The forefront of cheap commodity technology in the realm of real UNIX boxes
is an AT-bus/80386 running Everex' version of System V UNIX with a VGA, an 80
Meg ST-506 drive, and at least 4 Meg of RAM. We're talking $5,000 workstations
here.


--
`-_-' Peter da Silva <pe...@ficc.uu.net> <pe...@sugar.hackercorp.com>.
'U` -------------- +1 713 274 5180.

"*Real* wizards don't whine about how they paid their dues"
-- Quentin Johnson qu...@atanasoff.cs.iastate.edu

Henry Spencer

Nov 6, 1989, 3:55:21 PM
In article <12...@key.COM> da...@key.com () writes:
>It may turn out that the fixed word size of the RISC processors is too
>much excess baggage to carry. The shorter word size for simple operations
>in the CISC processors, such as adding two registers, could allow more
>than two instructions per cycle with the same size instruction fetch.
>
>Can you say "pipeline interlock"?

Can you say "decoding bottleneck"? That shorter word size gets more
instructions into one fetch, but that *varying* word size means you don't
know where instruction N+1 begins until you've fully decoded instruction N.
It's not an accident that DEC had such a hard time building a pipelining
VAX.
--
A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology
megabyte of flaming. | uunet!attcan!utzoo!henry he...@zoo.toronto.edu

Herman Rubin

Nov 7, 1989, 6:46:24 AM
In article <1989Nov6.2...@utzoo.uucp>, he...@utzoo.uucp (Henry Spencer) writes:
> In article <12...@key.COM> da...@key.com () writes:
> >It may turn out that the fixed word size of the RISC processors is too
> >much excess baggage to carry. The shorter word size for simple operations
> >in the CISC processors, such as adding two registers, could allow more
> >than two instructions per cycle with the same size instruction fetch.
> >
> >Can you say "pipeline interlock"?
>
> Can you say "decoding bottleneck"? That shorter word size gets more
> instructions into one fetch, but that *varying* word size means you don't
> know where instruction N+1 begins until you've fully decoded instruction N.
> It's not an accident that DEC had such a hard time building a pipelining
> VAX.

There are two ways around this bottleneck that I can think of offhand.
Probably others can be used.

One method is to have decode lookahead. That is, the instruction decoder
server moves instructions to decoder units in advance. This does require
an extra level, but still might very well pay off in the long run.

The other one is to have the first, or first few, bits of each instruction
segment identify what that segment is for. This would lengthen the
instructions somewhat, but might also be useful.

I see no good way to handle fixed length instructions. Even some machines
claiming to be RISC, or called RISC by others, do not have this property.
Certainly 16 bits is a lot for the instruction proper, and 32 bits for an
address may be too few. Also, I have posted to this group instances where
the performance of the computer would be greatly enhanced by instruction
modifiers of the order of 8-16 bits, such as which choice of quotient and
remainder to use as a function of the signs of the arguments.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet, UUCP)

Eric S. Raymond

Nov 7, 1989, 9:28:32 AM
In <7...@zip.eecs.umich.edu> Bill Mangione-Smith wrote:
> But wait, he says. I thought cheap commodity technology, in a box that meets
> about every real open-system spec, was *the* definition of a Sun 3.

It ain't `cheap' until an average individual can buy and maintain it out of,
say, 6 months' discretionary income. Sun-3s failed that test. So does the
3100.
--
Eric S. Raymond = er...@snark.uu.net (mad mastermind of TMN-Netnews)

James Shankland

Nov 7, 1989, 4:30:02 PM
In article <1989Nov6.2...@utzoo.uucp> he...@utzoo.uucp (Henry Spencer) writes:
>In article <12...@key.COM> da...@key.com () writes:
>>[suggesting single-size instruction sets may turn out to be too expensive
>>on RISCs]

>>
>Can you say "decoding bottleneck"? That shorter word size gets more
>instructions into one fetch, but that *varying* word size means you don't
>know where instruction N+1 begins until you've fully decoded instruction N.
>It's not an accident that DEC had such a hard time building a pipelining
>VAX.

Well, I try not to steal candy from babies, and I try not to VAX-bash
:-); and it seems to me that you could have a couple of different
instruction lengths without a decoding bottleneck: say, 16-bit and
32-bit instructions, where the msb of the instruction tells you whether
it's a short or a long one; or 16, 32, and 48 bits, with the top two
bits fully determining the instruction length. You don't have to go to
byte-at-a-time decoding, like certain brain-damaged ... uh ... never
mind. No VAX-bashing.
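
(For concreteness, a minimal C sketch of that decoding rule -- the encoding
itself is hypothetical:)

#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: instructions are 16, 32, or 48 bits long, and the
 * top two bits of the first 16-bit parcel fully determine the length, so
 * instruction N+1 can be located without fully decoding instruction N. */
size_t insn_parcels(uint16_t first)
{
    unsigned top2 = first >> 14;
    if (top2 <= 1) return 1;             /* 16-bit instruction */
    if (top2 == 2) return 2;             /* 32-bit instruction */
    return 3;                            /* 48-bit instruction */
}

/* Record the starting offset (in 16-bit parcels) of each instruction. */
size_t find_starts(const uint16_t *code, size_t nparcels,
                   size_t *starts, size_t max)
{
    size_t off = 0, n = 0;
    while (off < nparcels && n < max) {
        starts[n++] = off;
        off += insn_parcels(code[off]);
    }
    return n;
}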

jas

Robert Colwell

Nov 7, 1989, 4:23:28 PM
In article <16...@l.cc.purdue.edu> c...@l.cc.purdue.edu (Herman Rubin) writes:
>> Can you say "decoding bottleneck"? That shorter word size gets more
>> instructions into one fetch, but that *varying* word size means you don't
>> know where instruction N+1 begins until you've fully decoded instruction N.
>> It's not an accident that DEC had such a hard time building a pipelining
>> VAX.
>
>There are two ways around this bottleneck that I can think of offhand.
>Probably others can be used.
>
>One method is to have decode lookahead. That is, the instruction decoder
>server moves instructions to decoder units in advance. This does require
>an extra level, but still might very well pay off in the long run.
>
>The other one is to have the first, or first few, bits of each instruction
>segment identify what that segment is for. This would lengthen the
>instructions somewhat, but might also be useful.
>
>I see no good way to handle fixed length instructions. Even some machines
>claiming to be RISC, or called RISC by others, do not have this property.
>Certainly 16 bits is a lot for the instruction proper, and 32 bits for an
>address may be too few. Also, I have posted to this group instances where
>the performance of the computer would be greatly enhanced by instruction
>modifiers of the order of 8-16 bits, such as which choice of quotient and
>remainder to use as a function of the signs of the arguments.

Herman, someday you and I are going to agree on something, but it looks
like today isn't the day.

The Intel 432 did what you're suggesting above, with the addition
of a bit-aligned instruction stream. They did as much instruction decode
lookahead as they were able, and early fields of an instruction encoded
the sizes of later fields. The shortest instruction was 6 bits and the
longest was longer than 300. My measurements led me to the conclusion
that this cost the 432 something like 15% in performance, and that includes
the primary benefit that all the hardware devoted to this
bought -- smaller object code.

I don't doubt that for certain types of programs or applications, a
machine that differs from another only in some kind of instruction
modifier might be much faster, but I also don't find that observation
particularly compelling. If there's one axiom in machine design that
is almost universally true, it's that you don't get something for nothing.
You'd have to show that modifying the instruction encodings in the way
you mention above would a) benefit enough code to make it worthwhile,
b) would not slow down everything else, whether through cycle time
degradation or additional state at context swap time, and c) would not
make the design too cumbersome or complex (which could cause new bugs
to appear or lengthen the design time.)

It's really hard to adequately document a claim that a change to a
machine architecture would result in a performance improvement without
actually building one. (In fact, it's hard even then!)

Bob Colwell ..!uunet!mfci!colwell
Multiflow Computer or col...@multiflow.com
31 Business Park Dr.
Branford, CT 06405 203-488-6090

John Hascall

Nov 7, 1989, 5:27:01 PM
In article <???> pe...@ficc.uu.net (Peter da Silva) writes:

}In article <???> bil...@dip.eecs.umich.edu.UUCP (Bill Mangione-Smith) writes:

}> Mind you, this is *not* a slam.
}> I use a dec 3100, which now defines 'cheap commodity technology' in the

}The forefront of cheap commodity technology in the realm of real UNIX boxes
}is an AT-bus/80386 running Everex' version of System V UNIX with a VGA, an 80
}Meg ST-506 drive, and at least 4 Meg of RAM. We're talking $5,000 workstations

Since when does VGA a workstation make?

It seems 1 Mpixels is the only one of the 3M's left...
who is satisfied with 1 MIP and 1Mbyte these days???


John Hascall

Peter da Silva

Nov 7, 1989, 11:54:01 PM
Re: what's commodity technology.

I put it like this:

It's not a personal computer unless it costs less than the down payment on
a small car. That's the maximum that an individual can be expected to pay
without a bank loan.

Piercarlo Grandi

Nov 7, 1989, 1:36:26 PM
In article <68...@pt.cs.cmu.edu> lin...@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:

> 1) CISCs are slower, this year (but that's not a virtue, unless they
> are also cheaper)

Current CISC architectures have been designed assuming slow memories, but
not much slower than CPUs, low levels of integration, and an overriding need
for code density, which has been satisfied by packed instruction formats
and the attending slow decode...

> 2) RISCs have less dense instruction encoding (also not a virtue).

There is no good reason to assume that higher level, more complex
architectures than RISCy ones need have the same assumptions in the future.
In particular, simple decode can well be applied to CISC... Dense encoding
is *good*, and it is not at all incompatible with simple decode. E.g. stack
machines.



> 3) RISCs tend to have more registers, hence do fewer loads and stores.

This is the usual myth, that we have discussed so many times -- you need more
than a handful registers only if register assignment is done statically,
which is about equivalent to putting *all* variables in registers, without
any sophisticated optimization whatsoever (and even this consumes relatively
few registers, given the simplicity of most codes); if you use dynamic usage
information, either by manual register variable tagging, or using profile or
heuristic driven algorithms, the point of diminishing returns is between 8
and 16 registers, and 4 can be enough for C integer codes.

The point with RISC, and it is not a virtue, is that, given their load-store
architecture, the cost of access to non registered variables is much higher,
even if they are accessed infrequently; load-store raises the point of
diminishing returns, which is not a virtue.

>So when we finally start hurting for memory bandwidth (ok, if ever)
>then all of a sudden CISCs will look better and better.

> I don't agree. Programs with bad data locality are fairly common.

Mostly because programmers can easily profile programs for speed but not
for locality.

By the way, I do not agree either with the idea that CISC, as we know them
now, will look better and better. Traditional CISC is bad, because
complexity is *always* bad for many reasons. Simpler CPU designs are good
for both software and hardware -- in particular you can use much less highly
integrated technologies, much better tools. Simplicity helps BOTH speed and
reliability.

That the current major attributes of RISC designs -- load-store
architecture, low code density, many registers -- are the best way to
achieve simplicity is however questionable. My champion as to this is the
CRISP thing: if I remember correctly the chip complexity of the CRISP and
the MIPSco chips is exactly the same, at 170k, speed is not that different,
factoring out technology, and the architectures are as different as they can
be.

> Programs with bad code locality are much rarer. This means that the
> RISC defect, above - code density - is hidden nicely by caches. My
> crystal ball doesn't see this changing, so the RISC/CISC issues won't
> be influenced (much) by memory bandwidth.

You are quite right, but only assuming that all of the program is memory
resident, of course -- if not, there are strong hints that even very small
decreases in locality, as may be caused by less dense code, will result in
dramatic increases in paging rates and working set sizes, which is especially
bad on multiprogrammed systems; or even on workstations, as any user of GNU
Emacs or CC can tell you. But then, who ever cared about working set sizes
or paging rates? :-(.

Peter da Silva

Nov 8, 1989, 12:04:28 AM
In article <18...@atanasoff.cs.iastate.edu> has...@atanasoff.UUCP (John Hascall) writes:
> Since when does VGA a workstation make?

Since you could fit a million pixels on one.

> It seems 1 Mpixels is the only one of the 3M's left...
> who is satisfied with 1 MIP and 1Mbyte these days???

Properly applied, .6 Mips can feel quite speedy. My Amiga keeps its screen
up to date a hell of a lot better than the Sun-3s down the hall. But for
a UNIX system... no, that's not good enough any more.

Preston Briggs

Nov 8, 1989, 12:41:28 PM
In article <14...@aber-cs.UUCP> p...@cs.aber.ac.uk (Piercarlo Grandi) writes:
> 3) RISCs tend to have more registers, hence do fewer loads and stores.

>This is the usual myth, that we have discussed so many times -- you need more
>than a handful registers only if register assignment is done statically,
>which is about equivalent to putting *all* variables in registers, without
>any sophisticated optimization whatsoever (and even this consumes relatively
>few registers, given the simplicity of most codes); if you use dynamic usage
>information, either by manual register variable tagging, or using profile or
>heuristic driven algorithms, the point of diminishing returns is between 8
>and 16 registers, and 4 can be enough for C integer codes.

We've "discussed" it before, but I (and others) never really believed
you had any valid points in this area; I think everyone just got
tired of posting.

Other people run programs in languages besides "integer C".
Given numeric fortran, I can "profitably" use a lot of registers
(32, 64, whatever). With reasonable dependence analysis,
you can rearrange loops and array accesses to consume all the registers
any machine has.

Computers have hierarchies of memory. Most machines have at least
two levels, registers and main memory. Many machines have 3,
adding cache. We could count virtual memory as a 4th, and some of the
newest will have multilevel caches as well.

Stack machines are the exception, at least conceptually, having a single
level only.

Optimizer writers (and many programmers) see management of these
memory hierarchies as an opportunity (obligation, challenge).
If we do a good job managing them, our program goes faster.
However, optimizer people think you should let the compiler do
the management. The programmer should worry about correctness
and algorithmic efficiency, as opposed to being concerned with
non-portable micro-optimizations that obscure the intent of the code.


>The point with RISC, and it is not a virtue, is that, given their load-store
>architecture, the cost of access to non registered variables is much higher,
>even if they are accessed infrequently; load-store raises the point of
>diminishing returns, which is not a virtue.

I disagree. Separate load-store instructions allow lower cost of access
to non-registered variables. Why? Imagine trying to add from memory
to a register. A CISC version might be

        ADD r5,0(r8)    ; load from the address specified by r8
                wait for memory, doing
                nothing much in the meantime
                add the value to r5
                throw away the value you just waited for.

        cost = 1 or 2 + memory access time

and a RISC version might be

        LD r7,0(r8)     ; load from the address in r8 into r7
          .
          .             ; do some useful work while waiting for memory
          .
        ADD r5,r6,r7    ; r5 := r6 + r7

cost equals 2. Additionally, the old value of r6 is still
around, and the value from memory is handy if we need it.
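
The same point at the source level, as a rough sketch (my own invented
example, and of course a real compiler does this scheduling itself):

    /* sketch of what the separate load buys you: issue the load early,
       fill the memory latency with independent work, and both the old
       value of b and the loaded value stay available afterwards */
    int add_from_memory(int b, const int *p, int x, int y)
    {
        int m = *p;         /* LD  r7,0(r8) -- start the memory access */
        int busy = x * y;   /* independent work overlaps the load latency */
        int a = b + m;      /* ADD r5,r6,r7 -- r6 (b) and r7 (m) both survive */

        return a + busy;
    }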

> ... Simplicity helps BOTH speed and
>reliability.

Hooray! We agree.


>That the current major attributes of RISC designs -- load-store
>architecture, low code density, many registers -- are the best way to
>achieve simplicity is however questionable. My champion as to this is the
>CRISP thing: if I remember correctly the chip complexity of the CRISP and

But nobody builds CRISPs, do they? Why?

CRISP is a cool chip and fun to read about, but a lot of their
design decisions reflect their desire for simple compilers -- no
optimization, no register allocation, no scheduling. This is ok I guess
(people at Bell have always argued this way), but you give up
performance opportunities. The reference here is

Design tradeoffs to support the C programming language in the CRISP...
Ditzel, McLellan, Berenbaum
ASPLOS II


Preston Briggs
pre...@titan.rice.edu

Alex Colvin

unread,
Nov 8, 1989, 3:57:18 PM11/8/89
to
> The I-cache tends to "lose" every time a subroutine call occurs, unless
> the routine is tiny (relative to the cache size). There is a locality-

How "tiny" are we talking here, or how big are the "sub"routines? Seems like
a couple of them would fit into an 8K cache, loops & all.

Richard H. Gumpertz

unread,
Nov 8, 1989, 12:03:57 PM11/8/89
to
For what it's worth, the IBM "universal" instruction set (read 360/370/etc.)
does have an INSTRUCTION-LENGTH CODE both in the first byte of the instruction
and in the PSW stored on traps.
--
===============================================================================
| Richard H. Gumpertz rhg%cps...@uunet.uu.NET -or- ...uunet!amgraf!cpsolv!rhg |
| Computer Problem Solving, 8905 Mohawk Lane, Leawood, Kansas 66206-1749 |
===============================================================================

Gerry Gleason

unread,
Nov 9, 1989, 2:36:08 PM11/9/89
to
In article <14...@aber-cs.UUCP> p...@cs.aber.ac.uk (Piercarlo Grandi) writes:
> . . . My champion as to this is the
>CRISP thing: if I remember correctly the chip complexity of the CRISP and
>the MIPSco chips is exactly the same, at 170k, speed is not that different,
>factoring out technology, and the architectures are as different as they can
>be.

CRISP exists? I thought AT&T dumped that project. Is someone else making
them?

But you're right, it's a great architecture. Simple where it counts but not
impossible for anyone but a compiler to write code for. I got to make
UNIX run on early silicon of this processor, so I have some experience
with it.

Gerry Gleason

Bob Bentley

unread,
Nov 9, 1989, 2:42:28 PM11/9/89
to
In article <1TMk2X#Qggn6=er...@snark.uu.net> er...@snark.uu.net (Eric S. Raymond) writes:
>
> ... There's a subtle problem here; good software modularity
>practices tend to hurt code locality. If you're calling subroutines a lot
>in generated code the PC jumps all over the shop.
>
>I have no statistics on this, but I can easily imagine something like, say,
>the inner loop of a threaded-code interpreter busting hell out of the I cache
>because it's doing the equivalent of an indexed call indirect to some far-off
>routine every couple of instructions.
>
>Has anyone done any systematic investigation of this issue?
>--

We did a pretty thorough study of cache behavior at BiiN (now, alas, defunct).
This was prompted by the large variance which we observed in cache hit ratios
for different benchmarks; in particular, OS-intensive benchmarks were *much*
worse than CPU-only (Dhrystone, etc.) or Unix utility (grep, diff, etc.) tests.

There were a number of contributing causes to the observed cache behavior
(use of a sub-sectored caching scheme was a major one, since it led to an
effective cache occupancy of < 40%). However, the nature of OS code was
certainly a factor. The BiiN OS was written in Ada in a highly modular fashion.
The first versions of the OS contained considerable amounts of diagnostic code,
and very few routines were inlined. The result was that there was very little
either spatial or temporal locality in the OS code. Calls/branches were very
frequent, tight loops were very rare. Though not as bad as Eric's hypothetical
example, the effect on cache hit ratio and hence on overall system performance
was still significant. There are some definite negative performance
implications to modular programming techniques which need to be kept in mind.
(This is not to say that I am opposed to modular programming techniques,
especially for projects as large and complex as the BiiN OS - as a former
colleague of mine used to ask, "Do you want it to go fast or do you want it to
work?").

Bob Bentley

--------------------------------------------------------------------------------
| Intel Corp., M/S JF1-58 UUCP: r...@omefs3.intel.com |
| 2111 N.E. 25th Avenue Phone: (503) 696-4728 |
| Hillsboro, Oregon 97124 Fax: (503) 696-4515 |
--------------------------------------------------------------------------------

Lawrence Crowl

unread,
Nov 9, 1989, 3:53:56 PM11/9/89
to
In article <1TMk2X#Qggn6=er...@snark.uu.net>
er...@snark.uu.net (Eric S. Raymond) writes:
>good software modularity practices tend to hurt code locality. If you're
>calling subroutines a lot in generated code the PC jumps all over the shop.

For the purposes of caches and locality, it doesn't matter if the PC jumps all
over the shop. What matters is if the PC jumps to someplace it has recently
been. Unrolling loops reduces PC jumping, but also reduces locality and cache
effectiveness. Replacing several calls to a procedure by the inline code to
achieve the same function produces the same effect. By leaving the calls in
place, I reduce the number of (static) instructions that the PC visits, and
thereby increase the PC locality. In short, MODULARITY IMPROVES PC LOCALITY.

There are some tradeoffs involved in modularity that affect performance.
First, the additional call instructions may increase code size for very small
procedures. (Automatic inlining helps here.) Second, the PC jumps drain the
instruction pipeline. If the drain is sufficiently frequent, lower PC
locality may improve performance. Third, procedures may be spread over many
pages and each occupy only a small portion of its page. The result is that
while the INSTRUCTION locality increases, the PAGE locality decreases. This
could reduce performance via translation cache misses and paging.

Suppose we have a highly modular program, and an inlining compiler. The
compiler may choose to call a procedure, or expand it inline. For those
procedures where calls are a net performance loss, the compiler will choose to
expand them inline. Given a modular program, an inlining compiler may emit
either modular code, or non-modular code. The reverse is not true. So, given
an inlining compiler, MODULARITY IS ALWAYS GOOD.
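
One way to picture the compiler's choice is as a little cost model. The
following is only an illustrative sketch -- the names and thresholds are
invented, and no particular compiler works exactly this way:

    #include <stddef.h>

    /* Invented cost model: inline when code growth is cheap or when the
       call site is hot enough that call overhead and pipeline drains
       dominate; otherwise keep the call and its better PC locality. */
    struct callee {
        size_t body_insns;        /* static size of the procedure body */
        unsigned long call_freq;  /* estimated executions of this call site */
    };

    static int should_inline(const struct callee *c)
    {
        const size_t tiny = 8;            /* always inline trivial bodies */
        const unsigned long hot = 10000;  /* "hot" call-site threshold */

        if (c->body_insns <= tiny)
            return 1;                     /* code growth is negligible */
        if (c->call_freq >= hot && c->body_insns <= 64)
            return 1;                     /* call overhead dominates */
        return 0;                         /* keep the call */
    }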

(There are also macro-efficiency and software engineering reasons to prefer
modularity. However, they are outside the scope of this newsgroup.)
--
Lawrence Crowl 716-275-9499 University of Rochester
cr...@cs.rochester.edu Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl Rochester, New York, 14627

Michael A. Pasek

unread,
Nov 10, 1989, 8:58:22 AM11/10/89
to
In article <27...@brazos.Rice.edu> pre...@titan.rice.edu (Preston Briggs) wrote:
>[stuff about how many registers you need deleted]
>...Separate load-store instructions allow lower cost of access
>to non-registered variables. Why? Imagine trying to add from memory
>to a register. A CISC version might be
> ADD r5,0(r8) ; load from address specified by r8
> wait for memory, doing
> nothing much in the meantime
> add the value to r5
> throw away the value you just waited for.
> cost = 1 or 2 + memory access time
>[and promotes the RISC version as:]

> LD r7,0(r8) ; load from address in r8 into r7
> . ; do some useful work while waiting for memory
> ADD r5,r6,r7 ; r5 := r6 + r7
> cost equals 2. Additionally, the old value of r6 is still
> around, and the value from memory is handy if we need it.

I don't know if I'm just stupid, but on the machine I work with, the "CISC"
example takes a total of ONE cycle, assuming that the next instruction does
not reference R5, and is not contained in the memory location referenced
by R8. Also, in the "RISC" example, the "useful work" that may be done
would have the same restrictions as the "CISC" version (no reference to R7,
and the instructions cannot be in the memory location referenced by R8).
Finally, the "throw away" of the value we just waited for can be avoided
by coding the "RISC" example on the "CISC" machine, with the same results:
it takes 2 cycles (if you do that "useful work" in between the two
instructions).

Anyway, I think the RISC/CISC battle will only be a philosophical argument (or
perhaps a religious war), and the real decision of which is best will be made
on a case by case basis by the implementer(s) of the solution(s) to a specific
problem (or set of problems). If one actually is better than the other, the
marketplace will decide (maybe).

M. A. Pasek Switching Software Development NCR Comten, Inc.
(612) 638-7668 CNG--er--PU4 Port Devel. 2700 N. Snelling Ave.
pa...@c10sd3.StPaul.NCR.COM Roseville, MN 55113

Preston Briggs

unread,
Nov 11, 1989, 1:54:44 PM11/11/89
to
In article <17...@ncrcce.StPaul.NCR.COM> pa...@c10sd3.StPaul.NCR.COM (M. A. Pasek) writes:
>In article I wrote:
>>...Separate load-store instructions allow lower cost of access
>>to non-registered variables. Why? Imagine trying to add from memory
>>to a register. A CISC version might be

>> ADD r5,0(r8) ; load from address specified by r8
>> wait for memory, doing
>> nothing much in the meantime
>> add the value to r5
>> throw away the value you just waited for.
>> cost = 1 or 2 + memory access time

>> and a RISC version might be

>> LD r7,0(r8) ; load from address in r8 into r7
>> . ; do some useful work while waiting for memory

>> ADD r5,r6,r7 ; r5 := r6 + r7
>> cost equals 2. Additionally, the old value of r6 is still
>> around, and the value from memory is handy if we need it.

>I don't know if I'm just stupid, but on the machine I work with, the "CISC"
>example takes a total of ONE cycle, assuming that the next instruction does
>not reference R5, and is not contained in the memory location referenced
>by R8.

And (perhaps) not use the adder, which is busy with the previous instruction.
I'm sure you're not stupid -- you're just pointing out some of the
exaggeration in my posting. On the other hand, that's a mighty impressive
cache you've got that will service memory requests in 1 cycle.
My 6809 had memory that fast, but on newer machines, I can't afford
memory to keep up with the CPU. On yet another hand, IBM and CDC
(and others?) have built CPUs that will allow all sorts of out-of-order
execution, but they were complex and expensive (and admittedly fast).
The point is not to waste functional resources waiting for memory.

>Finally, the "throw away" of the value we just waited for can be avoided
>by coding the "RISC" example on the "CISC" machine, with the same results:
>it takes 2 cycles (if you do that "useful work" in between the two
>instructions).

Of course. But RISC people (and this compiler person) argue that
you nearly always want to preserve the value; therefore, why
provide the more complex "reg = reg op memory" (or worse) instructions?

It's kind of a question of balance. Everything on the CPU costs money
and we'd like to keep it all busy all the time. I think this'll be
the big challenge of chips like the i860 and the new IBM erisc.
We'll have an integer adder, an fp adder, and fp multiplier,
and perhaps more. How can we keep it all going? Waiting on memory
is not the answer.

An interesting paper on "balance" is
"Estimating interlock and improving balance for pipelined architectures"
Callahan, Cocke, Kennedy
International Conference on Parallel Processing, 1987

They define several kinds of balance.
The "machine balance" is the rate data can be fetched (number of words
per cycle) divided the rate at which floating point operations
can be performed (number of flops/cycle).

The "loop balance" of a given loop is the number of words accessed
divided by the number of flops performed.

If, for some particular loop, the loop balance is greater than the
machine balance, then we can say the loop is "memory-bound."
On the other hand, if the loop balance is less than the
machine balance, then the loop is "compute-bound."
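
Restating the two definitions as a scrap of code (my own sketch, not from the
paper; the sample numbers are the ones worked through below):

    #include <stdio.h>

    /* balance = words per cycle (machine) or words per iteration (loop),
       divided by flops; loop balance > machine balance ==> memory-bound */
    static double balance(double words, double flops)
    {
        return words / flops;
    }

    int main(void)
    {
        double machine = balance(1.0, 2.0);  /* i860: 1 access / 2 flops */
        double loop    = balance(3.0, 2.0);  /* naive matmul inner loop */

        if (loop > machine)
            printf("memory-bound (%.2f > %.2f)\n", loop, machine);
        else
            printf("compute-bound (%.2f <= %.2f)\n", loop, machine);
        return 0;
    }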

Consider a simple version of matrix multiply.
    do i = 1, n
      do j = 1, n
        C(i, j) = 0
        do k = 1, n
          C(i, j) = C(i, j) + A(i, k) * B(k, j)
        end
      end
    end

In this case the loop balance of the inner loop is 3 accesses/2 flops.
For the i860, we could say its balance is 1 access/2 flops.
(This is all complicated by cache, load double, load quad, non-pipelined
operations, etc. Since the chip can only fetch off-chip every other cycle,
we could claim the balance is much lower. On the other hand, if we don't
use pipelined FP, then the balance is raised.)
So, the i860 is memory-bound in this case, even if all the data
fits in cache.

The paper points out refinements of these balance computations
plus transformations that help match machine balance and loop
balance. For example, we can rewrite the above loop giving
    do i = 1, n
      do j = 1, n
        t = 0
        do k = 1, n
          t = t + A(i, k) * B(k, j)
        end
        C(i, j) = t
      end
    end
where t will end up in a register.

Now the loop balance is 2 accesses/2 flops which is better,
but the i860 would still be memory-bound.
If we "unroll and jam" (as described in the paper), we get
    do i = 1, n
      do j = 1, n, 4
        t0 = 0
        t1 = 0
        t2 = 0
        t3 = 0
        do k = 1, n
          ta = A(i, k)
          t0 = t0 + ta * B(k, j+0)
          t1 = t1 + ta * B(k, j+1)
          t2 = t2 + ta * B(k, j+2)
          t3 = t3 + ta * B(k, j+3)
        end
        C(i, j+0) = t0
        C(i, j+1) = t1
        C(i, j+2) = t2
        C(i, j+3) = t3
      end
    end
where n is divisible by 4 for clarity.

Now the loop balance is 5 accesses/8 flops. This is a lot better
(comparing 5/8 to 1/2). Of course, there are a lot of possible
variations; more are explored in the paper.

Note that we suddenly need 5 fp registers to hold intermediate results.
If we continue unrolling and jamming, we'll want even more registers;
hence our violent reaction to people who suggest that 4 registers
are plenty. Note that it isn't the loop that requires the registers,
or even the transformation; it's the mismatch between the
machine's balance and the loop's balance.


Hillis (in The Connection Machine) takes another tack.
He points out that the cpu is usually kept wonderfully busy,
but that the bulk of memory (often the bulk of a machine) is
usually idle. That is, only a small fraction is being accessed
at a time. This leads him to many processors with proportionally
less memory. His hope is that with an adequate number of processors,
he can keep *all* the memory reasonably busy.

Preston Briggs
pre...@titan.rice.edu

Eric S. Raymond

unread,
Nov 12, 1989, 12:17:26 PM11/12/89
to
In <1989Nov9.2...@cs.rochester.edu> Lawrence Crowl wrote:
> (There are also macro-efficiency and software engineering reasons to prefer
> modularity.

Hearty agreement on that (I'm a rather vigorous partisan of ADT methods,
myself).

You describe a set of heuristics for choosing inlining or non-inlining that
make intuitive sense (at least they do now that I've reread your posting
about three times :-)). But it sounds to me as though effective use of those
heuristics might require maintenance of an awful lot of `macro' information
about the code object and its call patterns, including a global flow
analysis. Is this actually done? If not, what kind of less elaborate state
has been found to suffice?

Marty Fraeman

unread,
Nov 15, 1989, 11:10:18 AM11/15/89
to
In article <63...@dime.cs.umass.edu> sh...@ccs1.cs.umass.edu (H.Shrikumar{sh...@ncst.in}) writes:
>In article <1TMk2X#Qggn6=er...@snark.uu.net> er...@snark.uu.net (Eric S. Raymond)
>writes:
>>In <1989Nov4.0...@ico.isc.com> Dick Dunn wrote:
>>> Second, I would expect better locality
>>> for code reference than for data reference, hence the I cache ought to do
>>> more good than the D cache. Aren't the pathological cache-busting programs
>>> generally ones which spray data accesses all over the place?
>>
>>Not necessarily. There's a subtle problem here; good software modularity
>>practices tend to hurt code locality. If you're calling subroutines a lot
>>in generated code the PC jumps all over the shop.
>
>This happens for example in a FORTH machine, FORTH typically is
>subroutine threaded, so there is a flurry of subroutine calls
>happening at about 4 million a second. (in a 8-10 Mhz (?) Novix 2016
>Forth CPU).
>
>In forth there is a subroutine call every five or so instructions
>I would guess.

We have looked at a similar issue in Forth. Over 90% of sequential
code accesses are less than 6.25 instructions long on the SC32 Forth
engine. This machine can execute most Forth primitives with a single
one cycle instruction except for load and store which take two cycles.
Subroutine calls are one cycle and most returns take zero cycles since
they can generally be combined with another instruction.

We also looked at the effectiveness of instruction caches on this
machine and found that fairly small caches (<16KB) could still achieve
hit rates above 95%. However, since the size of the programs we studied
was fairly modest, our I-cache size result should be taken with a grain
of salt. On the other hand the size of programs we studied was
comparable to the size of single threads on typical real-time
applications we've developed in the past so I believe there is some
significance to our data.

As a final comment on I-cache effectiveness in Forth, keep in mind that
while Forth instruction traces hop all over the place, the hierarchical
nature of most Forth implementations keeps code size much smaller than
usual.

Marty Fraeman

m...@aplcen.apl.jhu.edu
301-953-5000, x8360

JHU/Applied Physics Laboratory
Johns Hopkins Road
Laurel, Md. 20707

Peter da Silva

unread,
Nov 15, 1989, 5:45:09 PM11/15/89
to
> In forth there is a subroutine call every five or so instructions
> I would guess.

Yeh, but most of these calls would be to the same place, so the subroutine
would be in the cache. You'd expect that { @ ! (colon) (exit) + - / * } and
a few others would be in the cache most of the time.

> Wonder if some better cache philosophy can help here.

I'd worry more about pipeline flushes.

Eric Freudenthal

unread,
Nov 16, 1989, 8:33:31 AM11/16/89
to

In <1989Nov4.0...@ico.isc.com> Dick Dunn wrote:
> Second, I would expect better locality
> for code reference than for data reference, hence the I cache ought to do
> more good than the D cache. Aren't the pathological cache-busting programs
> generally ones which spray data accesses all over the place?

No. That is typical of pathological demand-paged-memory-busting
techniques. The reason is that caches typically consist of a large
set of relatively short entries (typically a few to several bytes each), as
compared to demand-paged memory, which consists of a set of LONG
entries measured in kilobytes. A truly random cache access pattern is
pretty good because the lookup scheme is usually a hash function for
which it is easy to generate bad access patterns (typically (addr >>
m)). Furthermore, the amount of memory moved (therefore the time
involved with moving) for a single cache-update is not very great, so
if the extra stuff is never used, very little effort was wasted. In
comparison a single page-in takes so much time, and moves so much
data, that random addressing patterns will cause pathologically bad behavior.
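
To be concrete, here is a small sketch (mine; the line size and set count are
invented) of the usual (addr >> m) lookup and one of the easy-to-generate bad
patterns -- a stride equal to the cache size, which lands every access in the
same set:

    #include <stdio.h>

    #define LINE_SHIFT 5            /* 32-byte lines: m = 5 */
    #define NSETS      256          /* 256 sets => 8KB direct-mapped cache */

    static unsigned set_index(unsigned long addr)
    {
        return (unsigned)((addr >> LINE_SHIFT) % NSETS);
    }

    int main(void)
    {
        unsigned long stride = (unsigned long)NSETS << LINE_SHIFT;  /* 8KB */
        int i;

        /* every reference maps to set 0, so each one evicts the last */
        for (i = 0; i < 8; i++)
            printf("access %d -> set %u\n", i, set_index(i * stride));
        return 0;
    }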

Caches, however, are only useful if the same data is used
repetitively before it is flushed. This seems to be more an issue of
algorithm design than programming style.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Eric Freudenthal
NYU Ultracomputer Lab
715 Broadway, 10th floor
New York, NY 10012

Phone:(212) 998-3345 work
(718) 789-4486 home
Email:freu...@ultra.nyu.edu
