
Floating point performance


Roger Shepherd

Oct 8, 1986, 3:12:27 AM
I was interested to read Ian Kaplan's (...!loral!ian) appeal
for microprocessors with fast floating point. I am a little
concerned with the use of `peak' performance to characterise
the speed of a part, as I don't think that this necessarily
reflects the USABLE performance of the part. I think it
is instructive to look at MFLOPs compared with Whetstones (a
good benchmark of performance on `typical' scientific
programs).

For example, Ian quotes some performance figures as

Intel 80287 < 0.1 MFLOP (say 0.95 MFLOP at 8Mhz?)
National 0.1 MFLOP
Motorola 0.3 MFLOP

To these I'll add the figure for the INMOS transputer (no
co-processor, floating point done in software)

Inmos IMS T414-20 0.09 MFLOP (typical for * and /,
+ and - are slower!)

According to the figures I have to hand, these processors
compare somewhat differently when the Whetstone figures are
compared. For example, I have single length Whetstone
figures as follows for these machines

                              kWhets   MWhets/MFLOP   (normalised)
Intel 80286/80287 (8 Mhz)        300        3.2            1.0
NS 32032 & 32081 (10 Mhz)        128        1.3            0.4
MC 68020 & 68881 (16 & 12.5)     755        2.5            0.8

Inmos IMS T414B-20               663        7.4            2.3

The final column gives some feel for how effective these
processor/co-processor (just processor for the T414)
combinations are at turning MFLOPS into usable floating
point performance.
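
For concreteness, the last two columns follow mechanically from the
kWhets and MFLOP figures above. A few lines of C (modern C, purely
illustrative; the 80287 entry uses 0.095 MFLOP, which is what the
quoted ratio of 3.2 implies) reproduce them:

#include <stdio.h>

int main(void)
{
    /* kWhets and MFLOP figures as given in the table above */
    const char  *name[]   = { "Intel 80286/80287", "NS 32032 & 32081",
                              "MC 68020 & 68881",  "Inmos IMS T414B-20" };
    const double kwhets[] = { 300.0, 128.0, 755.0, 663.0 };
    const double mflop[]  = { 0.095, 0.1,   0.3,   0.09  };
    const double base     = (300.0 / 1000.0) / 0.095;  /* 80287 ratio, used to normalise */
    int i;

    for (i = 0; i < 4; i++) {
        double ratio = (kwhets[i] / 1000.0) / mflop[i]; /* MWhets per MFLOP */
        printf("%-22s %6.0f %6.1f %6.1f\n",
               name[i], kwhets[i], ratio, ratio / base);
    }
    return 0;
}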

Also, I don't quite know why Ian likes the CLIPPER (three
chips on the picture of the (large) module I've seen) but
dislikes the NS 32310 (four chips); they seem to give the
same MFLOP rating. (Does anyone have Whetstone figures for
these two?)

Comparisons against Weiteks (or whatever) are also somewhat
suspect. To use their peak data rate you have to use them in
pipelined mode, their scalar mode tends to be somewhat slower
and it might be possible to build a microprocessor system
that could feed them data and accept results at that rate.
However, if you're only using the chips in that mode I'm not
convinced that you really want all that silicon to be taken
up with a large pipelineable (?) multiplier; I'd rather have
a processor there!

On the same subject (sort of), what measure should be made
of the `goodness' of a floating point micro-processor? How
about MWhetstones per square centi-metre. (Or do all you
guys and girls still use inches? :-) ) Or, how about
MWhetstones per milliwatt?
--
Roger Shepherd
INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
USENET: ...!euroies!shepherd
PHONE: +44 454 616616

Ian Kaplan

Oct 8, 1986, 3:30:17 PM
In article <3...@euroies.UUCP> shep...@euroies.UUCP (Roger Shepherd) writes:
>
> kWhets MWhets/MFLOP
> (normalised)
> Intel 80286/80287 (8 Mhz) 300 3.2 1.0
> NS 30032 & 32081 (10 Mhz) 128 1.3 0.4
> MC 68020 & 68881 (16 & 12.5) 755 2.5 0.8
>
> Inmos IMS T414B-20 663 7.4 2.3
>

These figures are interesting. I am surprised at the figure for the
80287. Intel uses this processor on the Intel Cube, which has very
poor floating point performance. The above table suggests that
reasonable floating point performance could be achieved by increasing the
clock rate. It is not clear to me that this is borne out by reality.
Does anyone have floating point performance numbers for a 12.5 MHz
80287?


>
>Also, I don't quite know why Ian likes the CLIPPER (three
>chips on the picture of the (large) module I've seen) but
>dislikes the NS 32310 (four chips); they seem to give the
>same MFLOP rating. (Does anyone have Whetstone figures for
>these two?)
>

The Whetstone figure for the 32310 is 1.137 MWhets and 0.8 MFLOP.

I like the Clipper because I have the impression that it uses less board
space and power than a 32032, a 32310 and two Weitek chips.
There are other considerations that must be taken into account also,
like the history of a product line.

>
>--
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616


Ian Kaplan
Loral Dataflow Group
Loral Instrumentation
USENET: {ucbvax,decvax,ihnp4}!sdcsvax!loral!ian
ARPA: sdcc6!loral!ian@UCSD
USPS: 8401 Aero Dr. San Diego, CA 92123

Ray Curry

Oct 8, 1986, 4:53:37 PM
>Path: nsc!pyramid!decwrl!decvax!ucbvax!ucbcad!nike!lll-crg!seismo!mcvax!euroies!shepherd
>From: shep...@euroies.UUCP (Roger Shepherd)
>Newsgroups: net.arch
>Subject: Floating point performance
>Message-ID: <3...@euroies.UUCP>


>dislikes the NS 32310 (four chips); they seem to give the
>same MFLOP rating. (Does anyone have Whetstone figures for
>these two?)

>Comparisons against Weiteks (or whatever) are also somewhat
>suspect. To use their peak data rate you have to use them in
>pipelined mode, their scalar mode tends to be somewhat slower

--
>Roger Shepherd
>INMOS Limited, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ, GB
>USENET: ...!euroies!shepherd
>PHONE: +44 454 616616

Just by coincidence, I have been running some floating point benchmarks
on the NS32081 floating point processor and thought I needed to respond
with some more up-to-date numbers. I ran the single precision Whetstone
on the NS32032 and NS32081 at 10MHz on the DB32000 board, and the NS32332
and NS32081 at 15 MHz on the DB332 board. I don't know where the posted
32032-32081 number came from, but I measure better results even using our
older compilers. Our new compilers show marked improvement.

32032-32081 (10MHz) 189 Kwhets (old compiler)
32032-32081 (10MHz) 390 Kwhets (new compiler)
32332-32081 (15MHz) 728 Kwhets (new compiler)

I used the 32332-32081 numbers to generate instruction counts to project
worst case performance for the NS32310 and the NS32381, worst case being
using the identical math routines and minimizing the pipelining of the
32310. These project performance for the 32332-32381 (15MHz) at approx-
imately 1100-1200 KWhets and 32332-32310 (15MHz) at 1500-1600 KWhets.
Since both the 32310 and 32381 will have new instructions that will
impact the math libraries, the real performance could be higher.
Just for interest, preliminary analysis is saying pipelining should
improve performance at least 15% overall (30% for the floating point
portion of the instruction mix).

I would like to add my own question about the value of benchmarks. That
is, what do the people on the net feel about transcendental functions?
The Whetstone seems to me to place more emphasis on them than real life.
One of the reasons for not including them directly in the 32081 was that
it was felt that implementing them in math routines instead of hardware
was more cost effective. Is this true or are transcendentals important
enough for the increased cost of implementing them in hardware?
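
For a feel of what "implementing them in math routines" means, here is
a crude software sine built from nothing but the adds and multiplies
the chip already has: naive range reduction followed by a short
polynomial, good to a few parts in a million. This is only a sketch of
the idea (in C), not National's or anyone else's actual library code.

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

double soft_sin(double x)
{
    double x2;

    x = fmod(x, 2.0 * PI);              /* into (-2*pi, 2*pi) */
    if (x > PI)  x -= 2.0 * PI;         /* into [-pi, pi] */
    if (x < -PI) x += 2.0 * PI;
    if (x > PI / 2.0)  x = PI - x;      /* fold into [-pi/2, pi/2]; sin unchanged */
    if (x < -PI / 2.0) x = -PI - x;

    x2 = x * x;                         /* Taylor series through the x**9 term */
    return x * (1.0 + x2 * (-1.0/6.0 + x2 * (1.0/120.0
             + x2 * (-1.0/5040.0 + x2 * (1.0/362880.0)))));
}

int main(void)
{
    printf("%f %f\n", soft_sin(1.0), sin(1.0));   /* should agree closely */
    return 0;
}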

Henry Spencer

Oct 9, 1986, 5:40:48 PM
> ...what do the people on the net feel about transcendental functions?

> The Whetstone seems to me to place more emphasis on them than real life.
> One of the reasons for not including them directly in the 32081 was that
> it was felt that implementing them in math routines instead of hardware
> was more cost effective. Is this true or are transcendentals important
> enough for the increased cost of implementing them in hardware?

Personally, while I strongly suspect that a software implementation is
more cost-effective than doing them in hardware, putting them on-chip
strikes me as a marvellous way of getting them right once and for all
and encouraging everyone to use the done-right version. (This does assume,
of course, that the chip-maker spends the necessary money to *get* them
right, which requires high-paid specialists and a lot of work.) One
could get much the same effect with a bare-bones arithmetic chip and a
ROM chip containing the math routines, except that ROMs are too easy to
copy and you'd never recover the investment needed to do a good job.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry

Richard Miner

Oct 10, 1986, 9:34:48 AM

I have been doing some work with NEC's uPD7281 and was
curious to know if anyone else in net land has used these. They are
a powerful set of chips that have a pipelined dataflow architecture
and assembly language. You can connect up to 14 chips together to do
parallel computations. We are building an Image Processing coprocessor
for a 680XX workstation with them. I am looking for others to discuss
design and algorithm issues with.


UUCP: !wanginst!ulowell!miner Rich Miner
ARPA: mi...@ulowell.CSNET University of Lowell, Comp Sci Dept
TALK: (617) 452-5000 x2693 Lowell MA 01854 USA
HAL hears the 9000 series is not selling. "Please explain Dave. Why
aren't HAL's selling?" Bowman hesitates. "You aren't Amiga compatible."

ste...@videovax.uucp

Oct 11, 1986, 3:16:29 PM
In article <3...@euroies.UUCP>, Roger Shepherd (shep...@euroies.UUCP)
writes:

> I was interested to read Ian Kaplan's (...!loral!ian) appeal

> for microprocessors with fast floating point. . . .

> For example, Ian quotes some performance figures as
>
> Intel 80287 < 0.1 MFLOP (say 0.95 MFLOP at 8Mhz?)
> National 0.1 MFLOP
> Motorola 0.3 MFLOP

Shouldn't the figure in parentheses for the 80287 be 0.095 MFLOP??

> To these I'll add the figure for the INMOS transputer (no
> co-processor, floating point done in software)
>
> Inmos IMS T414-20 0.09 MFLOP (typical for * and /,
> + and - are slower!)
>
> According to the figures I have to hand, these processors
> compare somewhat differently when the Whetstone figures are
> compared. For example, I have single length Whetstone
> figures as follows for these machines
>
> kWhets MWhets/MFLOP
> (normalised)
> Intel 80286/80287 (8 Mhz) 300 3.2 1.0
> NS 30032 & 32081 (10 Mhz) 128 1.3 0.4
> MC 68020 & 68881 (16 & 12.5) 755 2.5 0.8
>
> Inmos IMS T414B-20 663 7.4 2.3
>
> The final column gives some feel for how effective these
> processor/co-processor (just processor for the T414)
> combinations are at turning MFLOPS into usable floating
> point performance.

As one who has looked at the relative merits of various processors
and coprocessors before making a selection, I am not at all concerned
about "how effective [a] processor/co-processor combination[] [is] at
turning MFLOPS into usable floating point performance." The bottom
line for an application is closely tied to the numbers in the "kWhets"
column. The real question is, "How fast will it run my application?"

Steve Rice

----------------------------------------------------------------------------
{decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

John Mashey

Oct 13, 1986, 5:21:47 AM
In article <19...@videovax.UUCP> ste...@videovax.UUCP (Steven E. Rice) writes:
>In article <3...@euroies.UUCP>, Roger Shepherd (shep...@euroies.UUCP)
>writes:
>> ..... For example, I have single length Whetstone

>> figures as follows for these machines
>> kWhets MWhets/MFLOP
>> (normalised)
>> Intel 80286/80287 (8 Mhz) 300 3.2 1.0
>> NS 30032 & 32081 (10 Mhz) 128 1.3 0.4
>> MC 68020 & 68881 (16 & 12.5) 755 2.5 0.8
>>
>> Inmos IMS T414B-20 663 7.4 2.3
>>
>> The final column gives some feel for how effective these
>> processor/co-processor (just processor for the T414)

>> combinations are at turning MFLOPS into usable floating
>> point performance.
>
>As one who has looked at the relative merits of various processors
>and coprocessors before making a selection, I am not at all concerned
>about "how effective [a] processor/co-processor combination[] [is] at
>turning MFLOPS into usable floating point performance." The bottom
>line for an application is closely tied to the numbers in the "kWhets"
>column. The real question is, "How fast will it run my application?"
--THE RIGHT QUESTION!-----------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that this discussion is very akin to the "peak Mips" versus
"sustained Mips" versus "how fast does it run real programs" argument
in the integer side of the world. I think both Roger and Steven have
some useful points, and, in fact, don't seem to me to disagree very much:
1) (Roger): MFLOPS don't mean very much. (see (1) below, etc)
2) (Steven): and neither do Whetstones!
3) (Roger): propose Whetstones / (peak MFLOPS) as architectural measure.
Note that most vendors spec MFLOPS using cached, back-to-back adds with
both arguments already in registers. For real programs, one also needs to
measure effects of:
a) coprocessor interaction, i.e., can you load/store directly to
the coprocessor from memory, or do you need to copy arguments
thru the CPU? (can make large difference).
b) Pipelining/overlap effects?
c) Number of FP registers.
d) Compiler effects.
(1)In general, peak MFlops don't seem to mean too much. Whetstones seem to
test the FP libraries more than anything else (although this at least
measures SOMETHING a bit more real). (2) A lot of people like LINPACK
MFLops ratings, or Livermore Loops, although the former, at least,
also measures the memory system very strongly, i.e., it's bigger than almost
any cache, and that's quite characteristic of some codes, and totally
uncharacteristic of others.

(3) However, a useful attribute of Roger's measure (or a variant thereof)
is that, by looking at the measure (units of real performance) per Mhz,
you get some idea of architectural efficiency, i.e., smaller numbers are
better, in that (cycle time) is likely to be a property of the technology,
and hard to improve, at a given level of technology. [This is clearly
a RISC-style argument of reducing the cycle count for delivered performance,
and then letting technology carry you forward.] Using the numbers above,
one gets KiloWhets / Mhz, for example:
Machine          Mhz   KWhet   KWhet/Mhz
80287              8     300      40
32332-32081       15     728      50     (these from Ray Curry,
32332-32381       15    1200      80      in <38...@nsc.UUCP>) (projected)
32332-32310       15    1600     100*     ""  ""  (projected)
Clipper?          33    1200?     40     guess? anybody know better #?
68881           12.5     755      60     (from discussion)
68881             20    1240      60     claimed by Moto, in SUN3-260
SUN FPA         16.6    1700     100*    DP (from Hough) (in SUN3-160)
MIPS R2360         8    1160     140*    DP (interim, with restrictions)
MIPS R2010         8    4500     560     DP (simulated)

The *'d ones are boards / controllers for Weitek parts.
The Kwhet/Mhz numbers were heavily rounded: 1-2 digits accuracy is about
all you can extract from this, at best. One can argue about the speed that
should be used for the 68881 systems, since the associated 68020 runs faster.
What you do see is (not surprisingly) that heavily microcoded designs
get less Kwhet/Mhz than those that use either the Weitek parts or are
not microcoded.

As usual, whether you think this means anything or not depends on whether or
not you think Whetstones are a good measure. If not, it would help to
see other things proposed. For some reason, Floating Point benchmarks seem
to vary pretty strongly in their behavioral patterns.
Also, if anybody has better numbers, it would be nice to see them. At least
some of the ones in the list above are of uncertain parentage.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

d...@sun.uucp

Oct 14, 1986, 9:16:52 PM

Mflops Per MHz

David Hough
dho...@sun.com

I'd like to add to John Mashey's recent posting about floating-
point performance. In the following table extracted and revised from
that posting, the Sun-3 measurements are mine; the MIPS numbers are
Mashey's. All KW results indicate thousands of double precision Whet-
stone instructions per second. Results marked * represent implementa-
tions based on Weitek chips. As Mashey points out, it's not clear
whether the MHz should refer to the CPU or FPU, so I included both.

Machine            CPU MHz   FPU MHz     KW   KW/CPUMHz   KW/FPUMHz

Sun-3/160+68881      16.7      16.7     955       60          60
Sun-3/260+68881      25        20      1240       50          60
Sun-3/160+FPA*       16.7      16.7    1840      100         100
Sun-3/260+FPA*       25        16.7    2600      100         160

MIPS R2360*           8         8      1160      140         140   (interim restrictions)
MIPS R2010            8         8      4500      560         560   (simulated)

As you puzzle over the meaning of these results, remember that
elementary transcendental function routines have minor effect on Whet-
stone performance when the hardware is high-performance. Whetstone
benchmark performance is mostly determined by the following code:

      DO 90 I=1,N8
      CALL P3(X,Y,Z)
   90 CONTINUE


      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) / T2
      RETURN
      END

On Weitek 1164/1165-based systems, execution time for the P3 loop is
dominated by the division operation, which is about 6 times slower
than an addition or multiplication and can't be overlapped with any
other operation, inhibiting pipelining. Furthermore, not only can no
1164 operation overlap any 1165 operation, but parallel invocation of
P3 calls can't be justified without doing enough analysis to discover
something far more interesting: the best way to improve Whetstone per-
formance is to do enough global inter-procedural optimization in your
compiler to determine that P3 only needs to be called once. This
gives a 2X performance increase with no hardware work at all! One MIPS
paper suggests that the MIPS compiler does this or something similar.
Maybe benchmark performance should be normalized for software as well
as hardware technology.
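
To make that concrete, here is a C analogue of the transformation (the
constants and iteration count are only illustrative; the benchmark
itself is the Fortran quoted above). Once analysis proves that the
inputs never change inside the loop and that the routine's only effect
is writing its result, one call computes the same answer as N8 calls:

#include <stdio.h>

static double t = 0.499975, t2 = 2.0;    /* stand-ins for the COMMON variables */

static void p3(double x, double y, double *z)
{
    double x1 = x, y1 = y;
    x1 = t * (x1 + y1);
    y1 = t * (x1 + y1);
    *z = (x1 + y1) / t2;
}

int main(void)
{
    double x = 1.0, y = 1.0, z = 0.0;
    long i, n8 = 100000;                 /* iteration count is illustrative */

    for (i = 0; i < n8; i++)             /* the benchmark as written */
        p3(x, y, &z);

    p3(x, y, &z);                        /* what the optimization reduces it to */

    printf("%g\n", z);
    return 0;
}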

I've discussed benchmarking issues at length in the Floating-
Point Programmer's Guide for the Sun Workstation, 3.2 Release, leading
to the recommendation that the nonlinear optimization and zero-finding
that P3 is intended to mimic are better benchmarked by the real thing,
such as the SPICE program. Of course, SPICE is a complicated real
application and its performance is difficult to predict in advance,
and that makes marketing and management scientists everywhere uneasy.

Linear problems are usually characterized by large dimension and
therefore memory and bus performance is as important as peak
floating-point performance; a Linpack benchmark with suitably-
dimensioned arrays is appropriate.
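
Most of Linpack's time goes into a daxpy-style vector update, which is
why the memory system matters so much: two floating-point operations
per element against three memory references. A minimal C sketch of
that loop (illustrative, not the benchmark's actual source):

#include <stdio.h>

/* y := y + a*x.  For vectors larger than the cache, the loads and
   stores dominate, not the adds and multiplies. */
static void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

int main(void)
{
    static double x[1000], y[1000];
    int i;

    for (i = 0; i < 1000; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
    daxpy(1000, 0.5, x, y);
    printf("%g\n", y[0]);                /* prints 2.5 */
    return 0;
}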

I don't know whether RISC or CISC designs will prove to give the
most bang for the buck, but I do have some philosophical questions for
RISC gurus: Is hardware floating point faster than software floating
point on RISC systems? If so, and it is because the FPU technology is
faster than the CPU, then why isn't the CPU fabricated with that tech-
nology? If it's just a matter of obtaining parallelism, then wouldn't
two identical CPU's work just as well and be more flexible for non-
floating-point applications? If there are functional units on the FPU
that aren't on the CPU, should they be on the CPU so non-floating-
point instructions can use them if desirable? If the CPU and FPU are
one chip, cycle times should be slower, but would the reduced communi-
cation overhead compensate? If you use separate heterogeneous proces-
sors, don't you end up with ... a CISC?

ag...@ccvaxa.uucp

Oct 15, 1986, 10:59:00 AM

>> The final column gives some feel for how effective these
>> processor/co-processor (just processor for the T414)

>> combinations are at turning MFLOPS into usable floating
>> point performance.
>
>As one who has looked at the relative merits of various processors
>and coprocessors before making a selection, I am not at all concerned
>about "how effective [a] processor/co-processor combination[] [is] at
>turning MFLOPS into usable floating point performance." The bottom
>line for an application is closely tied to the numbers in the "kWhets"
>column. The real question is, "How fast will it run my application?"
>
> Steve Rice

Well, not only that, but perhaps also "How much will it cost to run my
application?" Users don't care how effective an architecture is, they
only care what the final result is. People interested in design, however,
may be interested in the ratios, since a low performance product with
good figures of merit may show an approach that should be turned into
a high performance product.

Jim Giles

Oct 15, 1986, 2:59:44 PM
In article <81...@sun.uucp> d...@sun.UUCP writes:
>
> Mflops Per MHz
>
> David Hough
> dho...@sun.com
>
Mflops:(Millions of FLoating point OPerations per Second)
MHz: (Millions of cycles per second)

Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
sec^2)

Sounds like an acceleration to me. Must be a measure of how fast computer
speed is improving. Still, the choice of units forces this number to
be small. 8-)

J. Giles
Los Alamos

Kevin Kissell

Oct 15, 1986, 6:34:44 PM

In article <7...@mips.UUCP> ma...@mips.UUCP (John Mashey) writes:
>However, a useful attribute of Roger's measure's (or variant thereof)
>is that looking at the measure (units of real performance) per Mhz,
>you some idea of architectural efficiency, i.e., smaller numbers are
>better, in that (cycle time) is likely to be a property of the technology,
>and hard to improve, at a given level of technology. [This is clearly
>a RISC-style argument of reducing the cycle count for delivered performance,

>and then letting technology carry you forward.] Using the numbers above,


>one gets KiloWhets / Mhz, for example:

I don't understand how someone of John's sophistication can insist on
repeating such a clearly fallacious argument. The statement "cycle time
is likely to be a property of the technology" is simply untrue, as I have
pointed out in previous postings. Cycle time is the product of gate delays
(a property of technology) and the number of sequential gates between latches
(a property of architecture). For example, let us consider two machines
that are familiar to John and myself and yet of interest to the newsgroup:
the MIPS R2000 and the Fairchild Clipper. An 8 Mhz R2000 has a cycle time
of 125ns. A 33Mhz Clipper has a cycle time of 30ns. Yet both are built
with essentially the same 2-micron CMOS technology. I somehow doubt that
Fairchild's CMOS transistors switch four times faster than those of whoever
is secretly building R2000s this week. The difference is architectural.

As I understand it, the R2000 was designed to take advantage of delayed
load/branch techniques, and to execute instructions in a small number of
clocks, which in fact go hand-in-hand. A load or branch can take as little
as two clocks. But the addition of two numbers cannot take less than one
clock, and so the ALU has a leisurely 125ns to do something that it could
in principle have done more quickly, had it been more heavily pipelined.

The Clipper was designed from fairly well-established supercomputer and
mainframe techniques. The cycle time is the time required to do the smallest
amount of useful work - an integer ALU operation at 30ns. Other instructions
must then of course be multiples of that basic unit. Assuming cache hits,
a load takes 4/6 clocks (120/180ns vs 250ns for the R2000) and a branch takes
9 (270ns vs. 250ns for the R2000).

It should be noted that both machines allow for the overlapped execution
of instructions, but in different ways. The R2000 overlaps register
operations with loads and branches using delay slots. The Clipper
overlaps loads but not branches, using resource scoreboarding instead
of delay slots. This means that the R2000 can branch more efficiently
(assuming the assembler can fill the delay slot), but the Clipper can
have more instructions executing concurrently than the R2000 (4 vs 2)
in in-line code.

Draw your own conclusions about "architectural efficiency".

>Machine Mhz KWhet KWhet/Mhz
>80287 8 300 40
>32332-32081 15 728 50 (these from Ray Curry,
>32332-32381 15 1200 80 in <38...@nsc.UUCP>) (projected)
>32332-32310 15 1600 100* "" "" (projected)
>Clipper? 33 1200? 40 guess? anybody know better #?
>68881 12.5 755 60 (from discussion)
>68881 20 1240 60 claimed by Moto, in SUN3-260
>SUN FPA 16.6 1700 100* DP (from Hough) (in SUN3-160)
>MIPS R2360 8 1160 140* DP (interim, with restrictions)
>MIPS R2010 8 4500 560 DP (simulated)

John's guess for the Clipper is off by over a factor of two. The Clipper
FORTRAN compiler was brought up only recently. In its present sane but
unoptimizing state, I obtained the following result on an Interpro 32C
running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
Hills Clipper FORTRAN compiler with Fairchild math libraries:

            Mhz   Kwhet   Kwhet/Mhz
Clipper      33    2920   Who cares?  Kwhet/Kg and Kwhet/cm2 are of
                          more practical consequence.


Kevin D. Kissell
Fairchild Advanced Processor Division

John Mashey

Oct 16, 1986, 12:03:13 AM
In article <81...@sun.uucp> d...@sun.uucp (David Hough) writes:
>
> Mflops Per MHz

> I'd like to add to John Mashey's recent posting about floating-
>point performance....
Thanks; as I'd said, the parentage of the numbers was suspect, so it's good
to see some that I trust more.

>
>Machine CPU Mhz FPU MHz KW KW/CPUMhz KW/FPUMHz
>
>Sun-3/160+68881 16.7 16.7 955 60 60
Oops, I'd thought you guys used 12.5Mhz 68881s at one point [but I checked
the current literature and it says no]. Has it changed recently?

> .... Whetstone


>benchmark performance is mostly determined by the following code:

> (bunch of code) ...


>
>On Weitek 1164/1165-based systems, execution time for the P3 loop is

>dominated by the division operation...


>something far more interesting: the best way to improve Whetstone per-
>formance is to do enough global inter-procedural optimization in your
>compiler to determine that P3 only needs to be called once. This
>gives a 2X performance increase with no hardware work at all! One MIPS
>paper suggests that the MIPS compiler does this or something similar.

Actually, that's an optional optimizing phase whose heuristics are
still being tuned: we didn't use it on this, and in fact, don't generally
use them on synthetic benchmarks at all: it's too destructive!
(There's nothing like seeing functions being grabbed in-line, discovering
that they don't do anything, and then just optimizing the whole thing away.
At least Whetstone computes and prints some numbers, so some real
work got done. Nevertheless, David's comments are appropriate, i.e., we
share the same skepticism of Whetstone, as I'd noted in the original
posting).

>Maybe benchmark performance should be normalized for software as well
>as hardware technology.

True! Some interesting work on that line was done over at Stanford by
Fred Chow, who did a machine-independent optimizer with multiple back-ends
to be able to compare machines using the same compiler technology. That's
probably the best way to factor it out. The other interesting way is to
be able to turn optimizations on/off and see how much difference they make.


>
> I've discussed benchmarking issues at length in the Floating-
>Point Programmer's Guide for the Sun Workstation, 3.2 Release, leading

Is this out yet? Sounds good. Previous memos have been useful.


>to the recommendation that the nonlinear optimization and zero-finding
>that P3 is intended to mimic is better benchmarked by the real thing,
>such as the SPICE program.

Yes, although it would be awfully nice to have smaller hunks of it that
could be turned into reasonable-size benchmarks, especially ones that
could be simulated (in advance of CPU design) a little easier.


>
> Linear problems are usually characterized by large dimension and
>therefore memory and bus performance is as important as peak
>floating-point performance; a Linpack benchmark with suitably-
>dimensioned arrays is appropriate.

Yes.


>
> I don't know whether RISC or CISC designs will prove to give the
>most bang for the buck, but I do have some philosophical questions for
>RISC gurus: Is hardware floating point faster than software floating
>point on RISC systems? If so, and it is because the FPU technology is
>faster than the CPU, then why isn't the CPU fabricated with that tech-
>nology? If it's just a matter of obtaining parallelism, then wouldn't
>two identical CPU's work just as well and be more flexible for non-
>floating-point applications? If there are functional units on the FPU
>that aren't on the CPU, should they be on the CPU so non-floating-
>point instructions can use them if desirable? If the CPU and FPU are
>one chip, cycle times should be slower, but would the reduced communi-
>cation overhead compensate? If you use separate heterogeneous proces-
>sors, don't you end up with ... a CISC?

1) Is hardware FP faster? Yes.
2) No, technology is the same, at least in our case. I don't know what
other people do.
3) It's not just parallelism, but dedicating the right kind of hardware.
A 32-bit integer CPU has no particular reason to have the kinds of datapaths
an FPU needs. There are functional units on the FPU, but they aren't ones
that help the CPU much (or they would have been on the CPU in the first place!)
4) Would reduced communication overhead compensate? Probably not, at the
current state of technology that is generally available. Right now, at least
in anything close to 2micron CMOS, if the FPU is part of the CPU chip, it
just has to be heavily microcoded. It's only when chip shrinkage gets enough
that you can put the fastest FPU together with the CPU on 1 chip, and have
nothing better to put on that chip, that it's worth doing for performance.
(Note: there may be other reasons, or different price/performance aim points
for integrating them, but if you want FP performance, you must dedicate
significant silicon real-estate.)
5) Don't you end up with ... a CISC? I'm not sure what this means. RISC
means different things to different people. What it usually means to us is:
a) Design approach where hardware resources are concentrated on things
that are performance-critical and universal.
b) The belief that in making things fast, instructions and/or
complex addressing formats drop out, NOT as a GOAL, but as a side-effect.
Thus, in our case, we designed a CPU that would go fast for integer performance,
and have a tight-coupled coprocessor interface that would let FP go fast also.
(Note: integer performance is universal, whereas FP is mostly bimodal:
people either don't care about it at all, or want as much as they can get.)
When you measure integer programs, you make choices to include or delete
features, according to the statistics seen in measuring substantial programs.
You do the same thing for FP-intensive programs. Guess what! You discover
that FP Adds, Subtracts, Multiplies (and maybe Divides) are:
a) Good Things
b) Not simulatable by integer arithmetic very quickly.
However, suppose that we'd discovered that FP Divide happened so seldom
that it could be simulated in software at an adequate performance level,
and that taking that silicon and using it to make FP Mult faster gave better
overall performance. In that case, we might have done it that way.
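
(For the curious: "simulated in software at an adequate performance
level" for divide usually means something like Newton-Raphson
iteration on the reciprocal, i.e., building the divide out of the
multiplies and adds you did spend silicon on. The C below is a sketch
of that idea only, not MIPS's actual library routine.)

#include <math.h>
#include <stdio.h>

/* Divide a by b using only add/subtract/multiply plus exponent
   manipulation; each Newton step roughly doubles the correct bits. */
double soft_divide(double a, double b)
{
    int e, negative = 0;
    double m, y;

    if (b == 0.0)
        return a / b;                    /* leave the error case to the usual machinery */
    if (b < 0.0) {
        b = -b;
        negative = 1;
    }
    m = frexp(b, &e);                    /* b = m * 2^e, 0.5 <= m < 1 */
    y = 48.0/17.0 - (32.0/17.0) * m;     /* linear first guess at 1/m */
    y = y * (2.0 - m * y);               /* four Newton-Raphson refinements */
    y = y * (2.0 - m * y);
    y = y * (2.0 - m * y);
    y = y * (2.0 - m * y);
    y = ldexp(y, -e);                    /* 1/b = (1/m) * 2^-e */
    return negative ? a * -y : a * y;
}

int main(void)
{
    printf("%.15g %.15g\n", soft_divide(355.0, 113.0), 355.0 / 113.0);
    return 0;
}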

In any case, we don't see any conflict in having a RISC with FP,
(or decimal, or ...anything where some important class of application needs
hardware thrown at it and can justify the cost of having it.)
Seymour Cray has been doing fast machines for years with similar design
principles (if at a different cost point!) and FP has certainly been there.

Anyway, thanks for the additional data. Also, I'd be happy to see more
discussion on what metrics are reasonable [especially since the original
posting invented "Whetstones/MHz" on the spur of the moment, and there
have been some interesting side discussions generated, both on:
a) Are KWhets a good choice?
b) What's a MHz?]
As can be seen, this business is still clearly in need of benchmarks that:
a) measure something real.
b) measure something understandable.
c) are small enough that they can be run and simulated in reasonable
time.
d) predict real performance of adequate-sized classes of programs.
e) are used by enough people that you can do comparisons.

Joe Buck

Oct 16, 1986, 12:56:14 PM
In article <85...@lanl.ARPA> j...@a.UUCP (Jim Giles) writes:
>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
> sec^2)
>
>Sounds like an acceleration to me. Must be a measure of how fast computer
>speed is improving. Still, the choice of units forces this number to
>be small. 8-)

You multiplied instead of dividing. If you had divided, you would
have found that the number measures floating point operations per
cycle.

--
- Joe Buck {hplabs,fortune}!oliveb!epimass!jbuck, nsc!csi!epimass!jbuck
Entropic Processing, Inc., Cupertino, California

Dennis Griesser

Oct 16, 1986, 5:12:40 PM
In article <81...@sun.uucp> d...@sun.UUCP writes:
> Mflops Per MHz

In article <85...@lanl.ARPA> j...@a.UUCP (Jim Giles) writes:
>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
> sec^2)
>
>Sounds like an acceleration to me. Must be a measure of how fast computer
>speed is improving. Still, the choice of units forces this number to
>be small.

You are not factoring the units out correctly...

    million x flop         second          flop
    --------------  x  ---------------  =  -----
        second         million x cycle     cycle

Sounds reasonable to me.

Dave Seaman

Oct 16, 1986, 10:57:38 PM
In article <85...@lanl.ARPA> j...@a.UUCP (Jim Giles) writes:
>In article <81...@sun.uucp> d...@sun.UUCP writes:
>>
>> Mflops Per MHz
>>
>> David Hough
>> dho...@sun.com
>>
>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
> sec^2)

You didn't divide properly. "Mflops per MHz" means "Mflops divided by MHz",
which, by the "invert the denominator and multiply" rule, comes out to

"Floating point operations per cycle"

after cancelling the "millions of ... per second" from numerator and
denominator.

I'm not claiming that this is a particularly useful measure, but that's
what it means.
--
Dave Seaman
a...@h.cc.purdue.edu

John Gilmore

Oct 17, 1986, 7:10:59 AM
Speaking from MIPS, ma...@mips.UUCP (John Mashey) writes:
> ...looking at the measure (units of real performance) per Mhz,

>you some idea of architectural efficiency, i.e., smaller numbers are
>better, in that (cycle time) is likely to be a property of the technology,
>and hard to improve, at a given level of technology.

Speaking from Fairchild, kis...@garth.UUCP (Kevin Kissell) writes:
> I don't understand how someone of John's sophistication can insist on
> repeating such a clearly fallacious argument. The statement "cycle time

> is likely to be a property of the technology" is simply untrue...

I love it! The Intel/Motorola wars have been fun, but I'm glad they're
temporarily in abeyance. Onward with the RISC versus RISC wars! B*}
--
John Gilmore {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu jgil...@lll-crg.arpa
(C) Copyright 1986 by John Gilmore. May the Source be with you!

Ed Nather

Oct 17, 1986, 12:28:57 PM
When I first started messing with computers (longer ago than I like to
remember) I was discouraged to learn they could not handle numbers as
long as I sometimes needed. Then I learned about floating point -- as
a way to get very large numbers into registers (and memory cells) of
limited length. It sounded great until I learned that you give up
something when you do things that way --simple operations become much
more complex (and slower) using standard hardware. Also, the aphorism
about using a lot of floating operations was brought home to me:
"Using floating point is like moving piles of sand around. Every time
you move one you lose a little sand, and pick up a little dirt."

Has hardware technology progressed to the point where we might want to
consider making a VERY LARGE integer machine -- with integers long
enough so floating point operations would be unnecessary? I'm not
sure how long they would have to be, but 512 bits sounds about right
to start with. This would allow integers to have values up to about
10E150 or so, large enough for Avogadro's number or, with suitable
scaling, Planck's constant. It would allow rapid integer operations
in place of floating point operations. If you could add two 512-bit
integers in a couple of clock cycles, it should be pretty fast.

I guess this would be a somewhat different way of doing parallel
operations rather than serial ones.

Is this crazy?


--
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nat...@astro.AS.UTEXAS.EDU

Robert Montante

Oct 17, 1986, 1:02:20 PM
>> Mflops Per MHz
>> [...]

>Mflops:(Millions of FLoating point OPerations per Second)
>MHz: (Millions of cycles per second)
>
>Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
> sec^2)
>
>Sounds like an acceleration to me. Must be a measure of how fast computer
>speed is improving. Still, the choice of units forces this number to
>be small. 8-)

I get:    10e6 X FLoating_point_OPerations / second
          ------------------------------------------
                   10e6 X Cycles / second

which reduces to
          FLoating_point_OPerations / Cycle

an apparent measure of instruction complexity. But then, if you use a Floating
Point Accelerator, perhaps these interpretations are consistent. 8->

*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*-=-*

RAMontante
Computer Science "Have you hugged ME today?"
Indiana University

Henry Spencer

Oct 17, 1986, 2:00:25 PM
> ... marketing and management scientists ...
      ^                        ^
Syntax error in above line: incompatible concepts!

Eric H Jensen

Oct 17, 1986, 3:13:32 PM
In article <60...@ut-sally.UUCP> nat...@ut-sally.UUCP (Ed Nather) writes:
>more complex (and slower) using standard hardware. Also, the aphorism
>about using a lot of floating operations was brought home to me:
>"Using floating point is like moving piles of sand around. Every time
>you move one you lose a little sand, and pick up a little dirt."

I thought numerical analysis was the plastic sheet you place your sand
on - with some thought (algorithm changes) you can control your
errors most of the time or at least understand them. Then of course
there is always an Extended format ...

>Has hardware technology progressed to the point where we might want to
>consider making a VERY LARGE integer machine -- with integers long

>...


>scaling, Planck's constant. It would allow rapid integer operations
>in place of floating point operations. If you could add two 512-bit
>integers in a couple of clock cycles, it should be pretty fast.

I would not want to be the one to place and route the carry-lookahead
logic for a VERY fast 512 bit adder (you could avoid this by using the
tidbits approach but that has many other implications). The real
killers would be multiply and divide. If you really want large
integers use an efficient bignum package; hardware can help by
providing traps or micro-code support for overflow conditions.
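
For what it's worth, the addition primitive of such a bignum package
is tiny; the expensive parts are multiply and divide, exactly as noted
above. A sketch of 512-bit addition as sixteen 32-bit limbs with
explicit carry propagation (modern C type names for brevity; the names
are made up, not from any particular package):

#include <stdint.h>
#include <stdio.h>

#define LIMBS 16                         /* 16 x 32 bits = 512 bits */

/* r = a + b, operands stored least-significant limb first.
   Returns the carry out of the top limb, i.e. 1 on overflow. */
uint32_t big_add(const uint32_t a[LIMBS], const uint32_t b[LIMBS],
                 uint32_t r[LIMBS])
{
    uint64_t carry = 0;
    int i;

    for (i = 0; i < LIMBS; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)s;              /* low 32 bits */
        carry = s >> 32;                 /* 0 or 1 */
    }
    return (uint32_t)carry;
}

int main(void)
{
    uint32_t a[LIMBS] = { 0xFFFFFFFFu }, b[LIMBS] = { 1u }, r[LIMBS];

    /* unspecified limbs are zero, so this computes (2^32 - 1) + 1 */
    printf("carry=%u r[1]=%u r[0]=%u\n",
           (unsigned)big_add(a, b, r), (unsigned)r[1], (unsigned)r[0]);
    return 0;
}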

--
eric h. jensen (S1 Project @ Lawrence Livermore National Laboratory)
Phone: (415) 423-0229 USMail: LLNL, P.O. Box 5503, L-276, Livermore, Ca., 94550
ARPA: ehj@angband UUCP: ...!decvax!decwrl!mordor!angband!ehj

Josh Knight

Oct 18, 1986, 2:03:23 PM
In article <16...@mordor.ARPA> e...@mordor.UUCP (Eric H Jensen) writes:
>In article <60...@ut-sally.UUCP> nat...@ut-sally.UUCP (Ed Nather) writes:
>>more complex (and slower) using standard hardware. Also, the aphorism
>>about using a lot of floating operations was brought home to me:
>>"Using floating point is like moving piles of sand around. Every time
>>you move one you lose a little sand, and pick up a little dirt."
>
>I thought numerical analysis was the plastic sheet you place your sand
>on - with some thought (algorithm changes) you can control your
>errors most of the time or at least understand them. Then of course
>there is always an Extended format ...
>

There are two realistic limits to the precision used for a particular
problem, assuming that the time to accomplish the calculation is not
an issue. The first is the precision of the input data (in astronomy
this tends to be less than single precision, sometimes significantly
less) and the second is the number of intermediate values you are willing
to store in your calculation (i.e. memory). The number of intermediate
values includes things like the fineness of the grid you use in an
approximation to a continuous problem formulation (although quantum
mechanics makes everything "grainy" at some point, numerical calculations
aren't usually done to that level). As Eric points out, proper handling
of the calculations and some extended precision (beyond what is kept in long
term storage) will provide all the precision that is available with the
given resources. Indeed, the proposal to use long integers is wasteful
of the very resource that is usually in short supply in these calculations,
namely memory (reference to all the "no virtual memory on MY Cray!" verbiage).
When Ed stores the mass of a 10 solar mass star in his simulation of the
evolution of an open cluster as a 512 bit integer, approximately 500 of
the bits are wasted on meaningless precision. The mass of the sun is
of order 10e33 grams, but the precision to which we know the mass is only
five or six decimal digits (limited, I believe, by the precision of G, the
gravitational coupling constant, but the masses of stars other than the
sun are typically much more poorly known), thus storing this number in
a 512 bit integer wastes almost all the bits, only 15-20 of them mean
anything.

I'll admit (before I get flamed) that the IBM 370 floating point format
has some deficiencies when it comes to numerical calculations (truncating
arithmetic and hexadecimal normalization). I will also disclaim that
I speak only for myself, not my employer.
--

Josh Knight, IBM T.J. Watson Research
jo...@ibm.com, jo...@yktvmh.bitnet, ...!philabs!polaris!josh

Jason Zions

Oct 18, 1986, 6:42:29 PM
j...@lanl.ARPA (Jim Giles) / 12:59 pm Oct 15, 1986 /

> Mflops:(Millions of FLoating point OPerations per Second)
> MHz: (Millions of cycles per second)
>
> Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
> sec^2)
>
'Scuse me?
     Mflops per MHz

  =  Mflops / MHz

     Millions Flop / Sec
  =  --------------------
     Millions Hertz / Sec

  =  Floating Point OP / Hertz

  =  FLOP per Cycle.

In other words, how many (or few) floating point operations happen in a
single cycle. Yeah, it's gonna be a small number, but not as silly a number
as your derivation shows.

> J. Giles
> Los Alamos
--
Jason Zions Hewlett-Packard
Colorado Networks Division 3404 E. Harmony Road
Mail Stop 102 Ft. Collins, CO 80525
{ihnp4,seismo,hplabs,gatech}!hpfcdc!hpcnoe!jason

Barry Shein

Oct 18, 1986, 7:22:26 PM

From: nat...@ut-sally.UUCP (Ed Nather)


>Has hardware technology progressed to the point where we might want to
>consider making a VERY LARGE integer machine -- with integers long

>enough so floating point operations would be unnecessary?

Why wouldn't the packed decimal formats of machines like the IBM/370
be sufficient for most uses (31 decimal digits+sign expressed as
nibbles, slightly more complicated size range for multiplication and
division operands, basic arithmetic operations supported.) That's not
a huge range but it's a lot larger than 32-bit binary. I believe it
was developed because a while ago people like the gov't noticed you
can't do anything with 32-bit regs and their budgets, and floating
point was unacceptable for many things. There are no packed decimal
registers so I assume the instructions are basically memory-bandwidth
limited (not unusual.)

The VAX seems to support operand lengths up to 16-bits (65k*2 digits?
I've never tried it.) There is some primitive support for this (ABCD,
SBCD) in the 68K.

-Barry Shein, Boston University

Josh Knight

Oct 18, 1986, 8:33:20 PM
Sorry about all the typos in <753@polaris>.

han...@mips.uucp

Oct 19, 1986, 3:15:53 AM
In article <3...@garth.UUCP> kis...@garth.UUCP (Kevin Kissell) writes:
>I don't understand how someone of John's sophistication can insist on
>repeating such a clearly fallacious argument. The statement "cycle time
>is likely to be a property of the technology" is simply untrue, as I have
>pointed out in previous postings. Cycle time is a the product of gate delays
>(a property of technology) and the number of sequential gates between latches
>(a property of architecture). For example, let us consider two machines
>that are familiar to John and myself and yet of interest to the newsgroup:
>the MIPS R2000 and the Fairchild Clipper. An 8 Mhz R2000 has a cycle time
>of 125ns. A 33Mhz Clipper has a cycle time of 30ns. Yet both are built
>with essentially the same 2-micron CMOS technology. I somehow doubt that
>Fairchild's CMOS transistors switch four times faster than that of whoever
>is secretly building R2000s this week. The difference is architectural.

"cycle time is likely to be a property of the technology" is

clearly a simplification that is useful for making relatively
crude comparisons between widely varying machine designs.
Cycle time, while a crude measure, has the advantage
that it is clearly observable and well-documented.

In practice, the number of sequential gates between latches
is also generally a property of the technology, given that
designers are attempting to optimize their own design.
It is counterproductive to over-pipeline a design, as
pipe registers themselves add delay and complexity.
Let me emphasize, however, that I do not intend to
assert that the Fairchild design is over-pipelined.

Now let us address the general issue of a comparison
of the technology of the two machines discussed above,
(two machines that were clearly chosen entirely at random).
It is indeed safe to assume that an 8 MHz R2000 has a
cycle time of 125 ns. However, 8 MHz is not the maximum
clock rate that the silicon will support - that figure
is 16.67 MHz, or a cycle time of 60 ns (worst case over commercial
temperatures). This 16.67 MHz R2000 part is built in a
2-micron CMOS technology, and Fairchild's part is
built in a process that is also described as a 2-micron CMOS
technology. However, the phrase "2-micron CMOS technology"
is actually very vague.

The available public literature from both companies is
not sufficient to compare these technologies point-by-point,
but I fully expect that Fairchild has pushed harder
on effective transistor gate length and oxide thickness to
reach 33 MHz than MIPS has yet employed to reach 16.67 MHz.
A difference in comparable gate speed of a factor of
two is actually entirely plausible, though we believe the
actual ratio is more on the order of 1.5.

We have been getting our process technology from the same
suppliers week after week. By using a slightly less aggressive
technology, we are able to get reliable, multiple-sourced processing.

>As I understand it, the R2000 was designed to take advantage of delayed
>load/branch techniques, and to execute instructions in a small number of
>clocks, which in fact go hand-in-hand. A load or branch can take as little
>as two clocks. But the addition of two numbers cannot take less than one
>clock, and so the ALU has a leasurely 125ns to do something that it could
>in principle have done more quickly, had it been more heavily pipelined.

I have to disagree on several of the points claimed here.
The R2000 design will execute load and branch instructions
at a rate of one instruction per cycle (a 60 ns cycle),
and takes one 60 ns cycle to perform an integer ALU operation.
In fact, the R2000 will execute ALL instructions in a
single cycle, which substantially simplified the design.
It is, of course, entirely untrue that the addition of
two numbers cannot take less than one clock, but this is
not the heart of the matter: the integer ALU is not
the critical path in the R2000 design.

>The Clipper was designed from fairly well-established supercomputer and
>mainframe techniques. The cycle time is the time required to do the smallest
>amount of useful work - an integer ALU operation at 30ns. Other instructions
>must then of course be multiples of that basic unit. Assuming cache hits,
>a load takes 4/6 clocks (120/180ns vs 250ns for the R2000) and a branch takes
>9 (270ns vs. 250ns for the R2000).

Correcting the numbers above, we have 120/180 ns (Clipper)
vs. 60 ns (R2000) for a load, and 270 ns vs 60 ns for a branch.

>It should be noted that both machines allow for the overlapped execution
>of instructions, but in different ways. The R2000 overlaps register
>operations with loads and branches using delay slots. The Clipper
>overlaps loads but not branches, using resource scoreboarding instead
>of delay slots. This means that the R2000 can branch more efficiently
>(assuming the assembler can fill the delay slot), but the Clipper can
>have more instructions executing concurrently than the R2000 (4 vs 2)
>in in-line code.

Resource scoreboarding is no more effective at using load
delay slots (which are delays inherent in the computation)
than static scheduling. Since instructions are issued in
the order in which they are presented in a scoreboard
controller, an operation that depends on the value of
a pending load instruction must wait for
the load to complete on either machine. The number of
delay cycles is, however, an important factor in
determining performance. It is hardly advantageous
to have 4 cycle (or is it 6 cycle?) load instructions,
no matter how slickly this is portrayed as a feature with
the phrase "can have more instructions executing concurrently."
The R2000 can fill the delay slot with a useful instruction,
(which can even be an additional load instruction) over 70%
of the time. With what frequency can Clipper compilers find
three instructions, none of which can be a load, to
fill the three load delay slots on a Clipper?

>Draw your own conclusions about "architectural efficiency".

The Clipper designers claim 5 MIPS performance at 33 MHz,
while the R2000 performs at 10 MIPS at 16.67 MHz.
The Fairchild technology is as much as twice as
aggressive as the R2000 technology, but the Clipper
only achieves half the performance. My conclusion
is that the R2000 is two to four times as "efficient"
an architecture.

For Clipper to reach the same performance in the same technology,
using their current architecture, they need 66 MHz parts,
with an input clock rate well above the broadcast FM radio band.

>>Machine Mhz KWhet KWhet/Mhz
>>80287 8 300 40
>>32332-32081 15 728 50 (these from Ray Curry,
>>32332-32381 15 1200 80 in <38...@nsc.UUCP>) (projected)
>>32332-32310 15 1600 100* "" "" (projected)
>>Clipper? 33 1200? 40 guess? anybody know better #?
>>68881 12.5 755 60 (from discussion)
>>68881 20 1240 60 claimed by Moto, in SUN3-260
>>SUN FPA 16.6 1700 100* DP (from Hough) (in SUN3-160)
>>MIPS R2360 8 1160 140* DP (interim, with restrictions)
>>MIPS R2010 8 4500 560 DP (simulated)
>
>John's guess for the Clipper is off by over a factor of two. The Clipper
>FORTRAN compiler was brought up only recently. In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
> Mhz Kwhet Kwhet/Mhz
>Clipper 33 2920 Who cares? Kwhet/Kg and Kwhet/cm2 are of
> more practical consequence.
>
>Kevin D. Kissell
>Fairchild Advanced Processor Division

Clipper 33 2930 90 = Kwhet/MHz

I'd like to thank Kevin for providing this performance data
and point out that this ratio is a respectable accomplishment
on Fairchild's part - this number is comparable to the
values obtained by using multiple-chip FP processors
built with Weitek arithmetic units and interfaced to
microcoded processors. While the FP arithmetic operations
take longer in the Clipper than in Weitek parts
(which are built in an unmistakably slower technology),
by reducing communications overhead, the overall performance
comes out comparably well.

Let me make clear why Kwhet/MHz or MIPS/MHz ratios are useful:
they provide some insight into where the emphasis was placed
in the design, and where future derivative designs can reach.
It's my view that Kevin's remarks confirm that the Clipper design
was intended from the start to build a machine with a low MIPS/MHz
ratio, with the clock rate based on the lowest conceivable
executable unit. It should also be clear what level of
architectural efficiency results from optimizing integer
ALU operations (Clipper), rather than by optimizing the architecture
to execute load, store and branch operations (MIPS).

--

Craig Hansen | "Evahthun' tastes
MIPS Computer Systems | bettah when it
...decwrl!mips!hansen | sits on a RISC"

ma...@mips.uucp

Oct 19, 1986, 4:19:32 AM
In article <3...@garth.UUCP> kis...@garth.UUCP (Kevin Kissell) writes:
>In article <7...@mips.UUCP> ma...@mips.UUCP (John Mashey) writes:

>...that are familiar to John and myself and yet of interest to the newsgroup:


>the MIPS R2000 and the Fairchild Clipper. An 8 Mhz R2000 has a cycle time
>of 125ns. A 33Mhz Clipper has a cycle time of 30ns. Yet both are built
>with essentially the same 2-micron CMOS technology. I somehow doubt that
>Fairchild's CMOS transistors switch four times faster than that of whoever
>is secretly building R2000s this week. The difference is architectural.

(One of my colleagues got here first, hansen@mips, in 7...@mips.UUCP,
so I'll just add a few notes where they don't overlap too much.)

There was no intent in the original posting to start a MIPS versus
Clipper war [contrary to John Gilmore's posting in <11...@hoptoad.uucp>:
sorry John, another Moto versus Intel battle we do not need, fun though
it may be to watch!] I was only trying to be reasonably inclusive of
relevant 32-bit micros. However, now that the issue has been raised.....

An 8Mhz R2000 isn't pushing the technology very hard, ON PURPOSE!!!
8Mhz parts appear first, followed by 12s and 16s, for the same reasons you got
12Mhz 68020s before 16s and 25s. Also, I'm told that the 2u design doesn't
push 2u technology as hard as it might have, in order to let the same
design be shrunk to 1.5u and 1.2u with minimal effort.

Now, the reason one might care about MWhets/MHz (or any similar measure
that compares the delivered real performance with some basic technology
speed) is to understand the margin and headroom in a design.
Since Kevin brought the issue up, some hypothetical questions:
a) Will there be 66Mhz Clippers in 2u CMOS?
[To get actual performance like 16Mhz R2000 in 2u;]
[If the answer is yes, I know a bunch of people, not all at
MIPS, either, who have some real tough questions involving
transmission-line effects, how to do ECL or other reduced-
voltage-swing I/O, etc.]
b) If they will be, what year will they be?
[1987?]
c) When will there be bigger / (more in parallel) CAMMU chips?
[Because if there aren't, how are the caches going to
get enough bigger to keep the delivered performance in line
with the CPU clock speed improvements? (for real programs)?
Chips gets faster with shrinks, but they don't magically
get re-laid-out to acquire more memory. CAMMU chips have
some good ideas in them, but they're not very big, especially
compared with the needs of some of the real programs that
people would like to run on high-performance micros. (There
is some real nasty stuff lurking out there! People keep
putting them on our machines, so we know....If the Clipper
FORTRAN compilers just came up recently, and they haven't
yet tried running 500KLOC FORTRAN programs...interesting
times are ahead....)]


>
>The Clipper was designed from fairly well-established supercomputer and

>mainframe techniques....

"fairly well-established supercomputer and mainframe techniques"

is interesting. I can think of 2 ways to read this assertion:
a) High-performance VLSI designs should be done just like big
machines.
OR
b) High-performance VLSI should be designed with good understanding
of big machines, as well as good understanding of the tradeoffs
necessary for VLSI [margin, headroom, packaging constraints, processes,
etc, etc], where those are different from the design tradeoffs of
the big ECL boxes.
I hope Kevin meant b), which most people would agree with.


>
>John's guess for the Clipper is off by over a factor of two. The Clipper

Thanks for the info: all I'd seen were random guesses from people around
the net, and it's a useful contribution to see numbers from somebody
that knows. Hopefully, we'll see more? [I assume that was DP?]

>FORTRAN compiler was brought up only recently. In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
> Mhz Kwhet Kwhet/Mhz
>Clipper 33 2920 Who cares? Kwhet/Kg and Kwhet/cm2 are of
> more practical consequence.

As hansen@mips noted, these are reasonable results, and I'd assume they'll
improve somewhat with more mature compiler technology.

Actually, this raises a set of questions that might be of general interest
in this newsgroup, basically:
1) What metrics are interesting?
2) How do you define them?
3) In what problem domains are they relevant?
4) What are different constraints that people use?
5) How do different metrics correlate, specifically, are some of the simpler
(easier-to-measure) ones good predictors of the more complex ones?

For example, here are some metrics, all of which have appeared in this
newsgroup at some time or other. Proposals are solicited:

a) Clock rate. (Mhz) --
b) Peak Mips [i.e., typically back-to-back cached, register-register adds]. --
c) Sustained Mips ?
d) Benchmark performance relative to other computers ++
e) Peak Mflops [i.e., "" "" for FP] --
f) Dhrystones
g) Whetstones +
h) LINPACK MFlops ++
i) Kwhets / Mflops [g/e] -
j) Kwhets / Mhz [g/a] +
k) Kg
l) cm2 (or cm3)
m) Watts
n) $$ +++
o) Kwhets / Kg [g/k]
p) Kwhets / cm2 [g/l] +
q) Kwhets / Watt [g/m] +
r) (any of the above) / $$ +++(esp if d))
---------
(-- & ++ indicate general impression of these metrics)

What's interesting is that people have all sorts of different constraint
combinations or optimization functions over any of these. Let me try
a few examples, and solicit some more:
1) Maximize g), h) etc, subject to few constraints, i.e., for people who
buy CRAYs, etc, money is (almost) no object.
2) Maximize one of the performance numbers, subject to some constraint.
The constraint might be:
absolute cm2 or cm3, as in some avionics things, i.e., if it
doesn't fit, it doesn't matter how fast it is!
$$: get me the most for some fixed amount of money, and I don't
care if it's 2X faster, even if it's more cost-effective.
3) Performance may not be particularly important at all, relative to
object-code compatibility, software availability, service, etc.

Comments? What sorts of metrics are important to the people who read
this newsgroup? What kinds of constraints? How do you buy machines?
If you buy CPU chips, how do you decide what to pick?

Ed Nather

unread,
Oct 20, 1986, 5:03:19 PM10/20/86
to
In article <7...@polaris.UUCP>, jo...@polaris.UUCP (Josh Knight) writes:
> Indeed, the proposal to use long integers is wasteful
> of the very resource that is usually in short supply in these calculations,
> namely memory (reference to all the "no virtual memory MY Cray!" verbage).
> When Ed stores the mass of a 10 solar mass star in his simulation of the
> evolution of an open cluster as a 512 bit integer, approximately 500 of
> the bits are wasted on meaningless precision.

I guess I didn't really make myself clear in my original posting. I
didn't mean to imply that ONLY 512 bit integers would be available, but
COULD be available, just as bytes and double-bytes are availble as
subsets of the 32-bit integers on a Vax (ooops -- sorry, Josh).
It would not be an unreasonable implementation to have registers of,
say, 128 bits, so "quad integers" of 512 bits would have to be
operated on in 4 pieces. My point was to substitute integer operations
for floating ones and still retain workable precision.

Robert D. Silverman

unread,
Oct 21, 1986, 9:22:13 AM10/21/86
to

Such a machine is in fact being built. The machine physically resides at
U. Oregon and was designed and built by Duncan Buell, Don Chiarulli and
Walter Rudd. It is a 256 bit machine which can be changed dynamically at
run time to act as 8 32 bit processors in parallel, as 4 64 bit processors
etc.

It is intended for research in computational number theory, especially
integer factorization.

Bob Silverman

Stuart D. Gathman

unread,
Oct 21, 1986, 12:02:31 PM10/21/86
to
In article <60...@ut-sally.UUCP>, nat...@ut-sally.UUCP (Ed Nather) writes:

> long as I sometimes needed. Then I learned about floating point . . .
> . . . . It sounded great until I learned that you give up
> something when you do things that way --simple operations become much
> more complex (and slower) using standard hardware. . . .

For problems appropriate to floating point, the input is already
imprecise. Planck's constant is not known to more than a dozen
digits at most. Good floating point software keeps track of
the remaining precision as computations proceed. Even if the
results were computed precisely using rational arithmetic,
the results would be more imprecise than the input. Rounding
in floating point hardware contributes only a minor portion of
the imprecision of the result in properly designed software.

For problems unsuited to floating point, e.g. accounting, yes the
floating point hardware gets in the way. For accounting one should
use large integers: 48 bits is plenty in practice and no special hardware
is needed. The 'BCD' baloney often advocated is just that. Monetary
amounts in accounting are integers. 'BCD' is sometimes used so that
decimal fractions round correctly, but the correct method is to use
integers.
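
A minimal C sketch of the integer-cents approach (the figures and the
round-half-up rule are made up; the point is just that correct decimal
rounding needs no BCD):

#include <stdio.h>

/* Monetary amounts held as integer cents; rounding to the nearest
 * cent is plain integer arithmetic, no BCD required.
 * Hypothetical figures, for illustration only. */
int main(void)
{
    long price = 1999;                       /* $19.99 in cents   */
    long rate  = 7;                          /* 7% tax            */
    long tax   = (price * rate + 50) / 100;  /* round half up     */

    printf("tax = $%ld.%02ld\n", tax / 100, tax % 100);   /* $1.40 */
    return 0;
}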

Rational arithmetic is another place for large integers. Numbers
are represented as the quotient of two large integers. This is where
special hardware might help. Symbolic math often uses rational
arithmetic, but the large integers should be variable length. Numbers
such as '1' and '2' are far more common than 100 digit monsters.
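
To make that concrete, here is a minimal C sketch of rational addition;
'long' merely stands in for the variable-length big integers described
above (an assumption, not a real package):

#include <stdio.h>

/* Rational numbers as quotients of integers, kept in lowest terms. */
typedef struct { long num, den; } rat;

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

static rat rat_add(rat x, rat y)
{
    rat r = { x.num * y.den + y.num * x.den, x.den * y.den };
    long g = gcd(r.num < 0 ? -r.num : r.num, r.den);
    r.num /= g;                          /* reduce to lowest terms */
    r.den /= g;
    return r;
}

int main(void)
{
    rat a = { 1, 3 }, b = { 1, 6 };
    rat s = rat_add(a, b);
    printf("%ld/%ld\n", s.num, s.den);   /* 1/2, exactly */
    return 0;
}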
--
Stuart D. Gathman <..!seismo!{vrdxhq|dgis}!BMS-AT!stuart>

bst...@gorgo.uucp

unread,
Oct 22, 1986, 12:39:00 AM10/22/86
to

AARRRGGH...

There is NO implicitly direct relationship between MegaFlops and clock
speed! Good grief people, what about consideration of the number of tics per
MMU cycle and how this is affected by the need for waitstates at higher clock
speeds? Does the architecture support stackable MMU's, and how might this
affect memory cycle time and how do the cycle times differ amoung various
addressing modes? Flops/Hz is just not a valid measurement of anything. This
is particularly true in view of the fact that "Flops" varies wildly from
benchmark to benchmark.

Steve Blasingame (Oklahoma City)
bst...@eris.Berkeley.Edu
ihnp4!occrsh!gorgo!bsteve


"We burn the tabonga with a might fire and yet it would not die..."
From: FROM HELL IT CAME

ste...@videovax.uucp

unread,
Oct 22, 1986, 2:29:15 AM10/22/86
to
In article <7...@mips.UUCP>, John Mashey (ma...@mips.UUCP) writes:

> Comments? What sorts of metrics are important to the people who read
> this newsgroup? What kinds of constraints? How do you buy machines?
> If you buy CPU chips, how do you decide what to pick?

1. Reputation of the manufacturer -- will they be there next year, and
five years from now? Will the family grow, or will it dead-end?
Will existing software run on new members of the family with minimal
changes and at the same time gain the performance advantages of the
upgraded hardware?

2. Software development environment. Can the desired end result be
developed with available tools, or will we have to create our
own?

3. Performance of the system that will result.

4. Cost of the CPU and FPU chips, and cost of the hardware required to
make it all work.

The ranking is relative, but not rigid -- for example, failing #3 might
require changing #4 or accepting more pain in #2 in order to gain
acceptable performance. Fudging on #1 could be hazardous to your mental
health, though!

John Gilmore

unread,
Oct 22, 1986, 7:24:58 AM10/22/86
to
In article <60...@ut-sally.UUCP> nat...@ut-sally.UUCP (Ed Nather) writes:
>"Using floating point is like moving piles of sand around. Every time
>you move one you lose a little sand, and pick up a little dirt."

IEEE floating point requires an "exact" mode which causes a trap any
time the result of an operation is not exact. This lets your
software know that it has picked up dirt, if it cares, and lets
particularly smart software change to extended precision, long integers,
or whatever.

I was wondering how you represent values <1 in your 512-bit integers...
or are you going to figure out binary points on the fly? In that case
you might as well let hardware do it -- that's called floating point!

Rex Ballard

unread,
Oct 22, 1986, 9:42:49 AM10/22/86
to
In article <31...@h.cc.purdue.edu> a...@h.cc.purdue.edu.UUCP (Dave Seaman) writes:
>In article <85...@lanl.ARPA> j...@a.UUCP (Jim Giles) writes:
>>In article <81...@sun.uucp> d...@sun.UUCP writes:
>>> Mflops Per MHz
> "Floating point operations per cycle"
>
>after cancelling the "millions of ... per second" from numerator and
>denominator.
>
>I'm not claiming that this is a particularly useful measure, but that's
>what it means.

Sounds to me like what is really wanted is cycles/flop average.
This does give an indication of micro-code efficiency of the
architecture used, based on average, rather than advertised
figures.

Obviously the more cycles/flop, the less inherently efficient
the chip architecture is. By figuring these values using
whetstones or similar benchmarks, additional "off-chip" factors
such as set up overhead for the FPU calls are correctly included.

This sounds like an interesting measure of CPU/FPU architecture
in the broader sense of the word.

>Dave Seaman
>a...@h.cc.purdue.edu

Richard Mateosian

unread,
Oct 23, 1986, 1:10:52 AM10/23/86
to
>
>IEEE floating point requires an "exact" mode which causes a trap any
>time the result of an operation is not exact. This lets your
>software know that it has picked up dirt, if it cares, and lets
>particularly smart software change to extended precision, long integers,
>or whatever.
>
You will get an inexact flag if you divide 1.0 by 3.0. It's a signal
that roundoff has occurred, not necessarily that there's been any loss
of precision.
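
For what it's worth, the same experiment can be run today with a minimal
C99 sketch; feclearexcept, fetestexcept, and FE_INEXACT are the modern
<fenv.h> names (assumed here -- 1986 hardware would deliver a trap
instead of a flag test):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double one = 1.0, three = 3.0;
    double q;

    feclearexcept(FE_INEXACT);
    q = one / three;                 /* 1/3 is not exact in binary */
    if (fetestexcept(FE_INEXACT))
        printf("%.17g is inexact -- roundoff occurred\n", q);
    return 0;
}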

Richard Mateosian ...ucbvax!ucbiris!srm 2919 Forest Avenue
415/540-7745 s...@iris.Berkeley.EDU Berkeley, CA 94705

kis...@garth.uucp

unread,
Oct 23, 1986, 4:05:17 AM10/23/86
to
In article <7...@mips.UUCP> ma...@mips.UUCP (John Mashey) writes:
> Right now, at least
>in anything close to 2micron CMOS, if the FPU is part of the CPU chip, it
>just has to be heavily microcoded.

Oh? What law of physics are we violating? ;-)

Peter S. Shenkin

unread,
Oct 23, 1986, 1:42:50 PM10/23/86
to
In article <BMS-AT.253> stu...@BMS-AT.UUCP (Stuart D. Gathman) writes:
>
>For problems appropriate to floating point, the input is already
>imprecise. Planck's constant is not known to more than a dozen
>digits at most. Good floating point software keeps track of
>the remaining precision as computations proceed.

??? I've never heard of this. Could you say more? Until you do, I will....
Read on.

> ...Rounding
>in floating point hardware contributes only a minor portion of
>the imprecision of the result in properly designed software.

I disagree. Consider taking the average of many floating point numbers
which are read in from a file, and which differ greatly in magnitude.
How many there are to average may not be known until EOF is encountered.
The "obvious" way of doing this is to accumulate the sum, then divide
by n. But if some numbers are very large, the very small ones will
fall off the low end of the dynamic range, even if there are a lot of
them; this problem is avoided if one uses higher precision (double
or extended) for the sum. If declaring things this way is what you mean by
properly designed software, OK. But the precision needed for intermediate
values of a computation may greatly exceed that needed for input and
output variables. I call this a rounding problem. I know of no "floating
point software" that will get rid of this. There are, of course, programming
techniques for handling it, some of which are very clever. Again, I suppose
you could say that if you don't implement them then you're not using
properly designed software. But these techniques are time-consuming to
build in to programs, and time consuming to execute; therefore, they
should only be used where they're really needed. But the whole point is that
the precision needed for intermediate results may GREATLY exceed that needed
for input and output variables, and an important part of numerical analysis
is being able to figure out where that is.
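
A minimal C sketch of exactly this failure mode, with made-up data (one
large value followed by many small ones), to show why the accumulator
needs more precision than the inputs:

#include <stdio.h>

/* Accumulating in float loses the many small addends; accumulating
 * in double (or extended) keeps them.  Hypothetical data. */
int main(void)
{
    float  sum_f = 0.0f;
    double sum_d = 0.0;
    long   n = 1000000, i;

    sum_f += 1.0e8f;                 /* one very large value */
    sum_d += 1.0e8;
    for (i = 1; i < n; i++) {        /* 999999 small ones    */
        sum_f += 1.0f;
        sum_d += 1.0;
    }
    printf("float  accumulator: average = %.1f\n", sum_f / n);  /* 100.0 */
    printf("double accumulator: average = %.1f\n", sum_d / n);  /* 101.0 */
    return 0;
}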

Peter S. Shenkin Columbia Univ. Biology Dept., NY, NY 10027
{philabs,rna}!cubsvax!peters cubsvax!pet...@columbia.ARPA

lude...@ubc-cs.uucp

unread,
Oct 24, 1986, 3:50:09 AM10/24/86
to
In article <2...@BMS-AT.UUCP> stu...@BMS-AT.UUCP writes:
>For problems unsuited to floating point, e.g. accounting, yes the
>floating point hardware gets in the way. For accounting one should
>use large integers: 48 bits is plenty in practice and no special hardware
>is needed.

As someone who has done accounting using floating point, I wish to
point out that 8-byte floating point has more
precision than 15 digits of BCD. Remembering that the exponent only
takes a "few" bits, I'll happily use floating point any day instead
of integers (even 48 bit integers). Integers work fine for
accounting as long as one is adding and subtracting but if one has to
multiply (admittedly, not often), there's big trouble. After quite a
number of attempts to make things balance to the penny, I changed to
floating point and all my problems vanished (the code ran faster,
too).
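
A minimal C sketch of that style, with made-up figures; floor(x + 0.5)
is just one way to round to the cent, and the only requirement is that
the cent amounts stay below 2^53:

#include <stdio.h>
#include <math.h>

/* Cents held in a double: integers up to 2^53 (about 9.0e15, i.e.
 * more than 15 decimal digits) are represented exactly, so sums and
 * products of cent amounts stay exact.  Hypothetical figures. */
int main(void)
{
    double price = 1999.0;             /* $19.99 in cents             */
    double qty   = 12345.0;
    double total = price * qty;        /* 24677655 -- exact           */
    double tax   = floor(total * 7.0 / 100.0 + 0.5);  /* round to cent */

    printf("total $%.2f, tax $%.2f\n", total / 100.0, tax / 100.0);
    return 0;
}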

Brian Case

unread,
Oct 24, 1986, 12:41:01 PM10/24/86
to

Jeeze Louise. Of course there is a relationship between MegaFlops and
MHz. What you get is Flops per cycle which is the inverse of cycles
per Flop; cycles per Flop is very interesting, just as cycles per
instruction is very interesting, and the inverse is equally interesting
if not quite as intuitive. What the hell do the number of "tics per
MMU cycle," "stackable MMU's," and the variation of cycle time "amoung
[sic] various addressing modes" have to do with floating point? What
the hell is a "stackable MMU;" and the last time I checked, cycle time
doesn't vary with addressing mode, even on the CISCiest of CISCs (ok,
AMD does have a 2900 clock chip which allows different cycle times to
be selected by microinstructions, but this thing is virtually never used
in the real world). Flops/Hz does measure something; it may not be a
really good way to measure architectural efficiency, but not for any of
the reasons you submit.

holloway

unread,
Oct 24, 1986, 2:56:00 PM10/24/86
to

One correction, the DRAFT (Dynamically Reconfigurable Architecture for
Factoring Things) machine is at Oregon State University.

For more information, details, application ideas, ... contact:

Jim Holloway -- holloway@orstcs
or
Don Chiarulli -- don@pitt

Henry Spencer

unread,
Oct 25, 1986, 9:13:04 PM10/25/86
to
> ... the last time I checked, cycle time
> doesn't vary with addressing mode, even on the CISCiest of CISCs...

Varying cycle time is not uncommon in the PDP11 line, at least in the 11s
I've had cause to investigate in detail. The microcode selects the cycle
time it wants on a cycle-by-cycle basis, presumably with the delays of
different processor sections in mind.

Tim Rentsch

unread,
Oct 26, 1986, 11:14:19 PM10/26/86
to
In article <7...@mips.UUCP> ma...@mips.UUCP (John Mashey) writes:
> Now, the reason one might care about MWhets/MHz (or any similar measure
> that compares the delivered real performance with some basic technology
> speed) is to understand the margin and headroom in a design.

There is a subtle pitfall in arguing that FLOPS/HZ (or IPS/HZ) is a
measure of architectural "goodness". Certainly, measuring FLOPS/HZ
is a reasonable attempt to factor out the particulars of the device
fabrication, which are obviously irrelevant to architecture. (If
your chip runs twice as fast as my chip only because it is 5 times
as small, your process technology is better than mine, but your
architecture may not be.) BUT -- and here is the pitfall -- it just
might be that given identical fabrication methods, the better
FLOPS/HZ choice would still run slower because it would not support
the higher clock rate. RISC proponents would argue that one reason
for having simple instruction sets is to *lower the cycle time* so
that the machine can run faster and get more work done. Your
machine's FLOPS/HZ may be twice as good as mine, but if my HZ is
three times yours (in identical technology), my machine is faster --
and so my architecture is better.


> Comments? What sorts of metrics are important to the people who read
> this newsgroup? What kinds of constraints? How do you buy machines?
> If you buy CPU chips, how do you decide what to pick?

The metrics I'm interested in measure speed. (Basically, I'm hooked
on fast machines.) Other constraints are less interesting because:
(1) I will buy the fastest machine I can afford, and (2) in terms of
architecture, speed is the bottom line -- all else is just
mitigating circumstances. ("I know machine X runs 3 times as fast as
machine Y, but machine X is Gallium Arsenide." Compare
architectures, not technologies.)

Here are my favorite metrics (in no particular order):

(1) micro-micro-benchmark: well defined task, with well defined
algorithm, hand coded in lowest level language available (microcode
if it comes to that) by arbitrarily clever programmer who can take
advantage of all machine dependencies (instruction timings, overlaps
and/or interlocks, special instructions, cache sizes, etc.).
Algorithm can change slightly to take advantage of machine
characteristics, but must be "recognizable".

(1a) same as above, but at assembly language level. instruction set
cleverness is allowed; microcode and special knowledge such as
cache size is not.

(2) micro-benchmark: well defined task, with algorithm given in some
particular programming language (and benchmark must be compiled from
the given algorithm). The point here is to measure the speed of the
machine in "typical" situations, including compiler effectiveness.
the time taken to do the compile is irrelevant, as long as it is
reasonably finite.

(3) macro-benchmark: the problem with (1) and (2) is that they don't
measure all kinds of things that inevitably take place in real
systems. (on the other hand (1) and (2) are easy to run, and also
easy to fudge, so they are more often done.) a macro-benchmark is
like (2) in having a given program, except that the given program is
very large, so that code size is comparable to amount of real memory
on the machine (hopefully code > real memory). now the
effectiveness of the machine for problems-in-the-large will be
measured, including things like swapping speeds and TLB hit rates,
etc. sadly, this is a vague measure because there are so few large
programs which can be used as the benchmark, and many variable
parameters creep in (such as how fast the disk seeks are, etc.).
even so, it is worth remembering that speed in the small is
different from speed in the large, and that the latter is really
what we desire. (or should that be, "what I desire"? :-)

cheers,

txr

Lawrence Crowl

unread,
Oct 27, 1986, 3:18:05 PM10/27/86
to
>>In article <7...@mips.UUCP> ma...@mips.UUCP (John Mashey) writes:
>> Now, the reason one might care about MWhets/MHz (or any similar measure
>> that compares the delivered real performance with some basic technology
>> speed) is to understand the margin and headroom in a design.

>In article <1...@unc.unc.UUCP> ren...@unc.UUCP (Tim Rentsch) writes:
> There is a subtle pitfall in arguing that FLOPS/HZ (or IPS/HZ) is a measure
> of architectural "goodness". Certainly, measuring FLOPS/HZ is a reasonable
> attempt to factor out the particulars of the device fabrication, which are
> obviously irrelevant to architecture. ... BUT -- and here is the pitfall
> -- it just might be that given identical fabrication methods, the better
> FLOPS/HZ choice would still run slower because it would not support
> the higher clock rate.

Perhaps what we are missing is that for a given level of technology, a longer
clock cycle allows us to have a larger depth of combinational circuitry. That
is, we can have each clock work through more gates. So, a 4 MHz clock which
governs propagation through a combinational circuit 4 gates deep will do
roughly the same work as a 1 MHz clock governing propagation through a
combinational circuit 16 gates deep. Perhaps a better measure is the depth of
gates required to implement a FLOP (or an instruction, or a window, etc.).

The very fast clock, heavily pipelined machines like the Cray and Clipper
follow the first approach, while the slower clock, less pipelined machines
like the Berkeley RISC and MIPS follow the second approach. Which is better is
probably dependent upon the technology used to implement the architecture and
the desired speed. For instance, if we want a very fast vector processor, we
should probably choose the fast clock, more pipelined architecture. If we want
a better price/performance ratio, we should probably choose the slow clock,
less pipelined architecture.

BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent. The
quality of an architecture is dependent on the technology used to implement it,
and no architecture is "best" under more than a limited range of technologies.
For instance, under technologies in which the bandwidth to memory is most
limited, stack architectures (Burroughs, Lilith) will be "better". Under
technologies where the ability to process instructions is most limited, the
wide register to register architectures will be "better".
--
Lawrence Crowl 716-275-5766 University of Rochester
cr...@rochester.arpa Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl Rochester, New York, 14627

Frank Adams

unread,
Oct 27, 1986, 5:45:16 PM10/27/86
to
In article <2...@BMS-AT.UUCP> stu...@BMS-AT.UUCP writes:
>For problems unsuited to floating point, e.g. accounting, yes the
>floating point hardware gets in the way. For accounting one should
>use large integers: 48 bits is plenty in practice and no special hardware
>is needed.

48 bits is not always adequate. One sometimes has to perform operations
of the form a*(b/c), rounded to the nearest penny (integer). Doing this
with integer arithmetic requires intermediate results with double the
precision of the final results. With floating point, this is not necessary.
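
A minimal C sketch of that double-width intermediate; 'long long' stands
in for the wider type (an assumption -- the discussion above was about
48-bit integers), and the figures are made up:

#include <stdio.h>

/* a*(b/c) rounded to the nearest penny, in integer cents.  The
 * intermediate a*b needs twice the width of the operands. */
int main(void)
{
    long a = 1000000000;                 /* $10,000,000.00 in cents */
    long b = 7, c = 9;                   /* scale by the ratio 7/9  */
    long long wide = (long long)a * b;   /* would overflow 32 bits  */
    long pennies = (long)((wide + c / 2) / c);   /* round to nearest */

    printf("$%ld.%02ld\n", pennies / 100, pennies % 100);
    return 0;
}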

Frank Adams ihnp4!philabs!pwa-b!mmintl!franka
Multimate International 52 Oakland Ave North E. Hartford, CT 06108

Henry Spencer

unread,
Oct 28, 1986, 12:23:42 PM10/28/86
to
> > ... the last time I checked, cycle time
> > doesn't vary with addressing mode, even on the CISCiest of CISCs...
>
> Varying cycle time is not uncommon in the PDP11 line, at least in the 11s
> I've had cause to investigate in detail. The microcode selects the cycle
> time it wants on a cycle-by-cycle basis, presumably with the delays of
> different processor sections in mind.

Several people have expressed some interest in knowing details, so I think
it's probably worth posting this. The best information on this sort of
thing is in an old CMU tech report titled "Impact of Implementation Design
Tradeoffs on Performance: The PDP-11, A Case Study", by Snow and Siewiorek.
It's long out of print, but is reprinted in Bell, Mudge, & McNamara's
"Computer Engineering" book.

The report is too old to cover the newest 11s. I would guess that varying
cycle times are rather less likely in the LSI implementations, which would
let out the 23, 24, 73, 84, and the T11 and J11 chips. The only other 11
that I can think of which is too new to be covered is the 44, the last of
the MSI implementations. The 44 does have two different cycle times,
although a quick glance at the processor manual doesn't tell me exactly what
it uses them for.

[I can hear cries of "so what does the report say?". Here goes.]

Of the older 11s, up to and including the 60, there are three with
gearshifts: the 10, 34, and 40.

The 11/10 doubles its clock speed for doing multi-bit shifts. This may seem
a bit specialized, especially when you remember that the 10 has no multi-
bit shift instructions! The key thing to remember is that the 10 was a
bare-minimum-cost 11 designed circa 1970, when the variety of MSI chips
was limited. So the 11/10 does not have a byte swapper. When it wants to
get at an odd-numbered byte, it has to do an 8-bit shift. This may sound
like a gross performance disaster, but it's not; the report examined the
possible performance improvement from adding a byte swapper to the 10, and
concluded it would be quite small. Odd-byte accesses simply aren't common.

The 11/34 has a long cycle for bus operations and a short cycle for normal
activity. The report didn't go into detail.

The 11/40 is the fanciest of the lot, with a three-speed gearbox. It uses
an extra-long cycle for the worst-case use of the data paths, reading from
and writing to scratchpad RAM in the same cycle. The medium cycle is for
data-path cycles that don't involve writes to the scratchpad. And the short
cycle is for microinstructions which don't use the data paths at all, and
hence don't care about their propagation delays. As I recall, the report
assessed this multi-speed mechanism as quite effective in speeding up a
relatively simple implementation.

Brian Case

unread,
Oct 28, 1986, 1:44:33 PM10/28/86
to
>Perhaps what we are missing is that for a given level of technology, a longer
>clock cycle allows us to have a larger depth of combinational circuitry. That
>is, we can have each clock work through more gates. So, a 4 MHz clock which
>governs propogation through a combinational circuit 4 gates deep will do
>roughly the same work as a 1 MHz clock governing propogation through a
>combinational circuit 16 gates deep. Perhaps a better measure is the depth of
>gates required to implement a FLOP, (or an instruction, or a window, etc.).

Yes, but if the 4 Mhz/4 gates implementation can support pipelining and the
pipeline can be kept full (one of the major goals of RISC), then it will do
4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
MIPS/MHz or whatever/MHz will be the same! Thus, I still think this isn't
such a bad metric to use for comparison. If pipelining can't be implemented
or the pipeline can't be kept full for a reasonable portion of the time,
then the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading indicator.

>The very fast clock, heavily pipelined machines like the Cray and Clipper
>follow the first approach, while the slower clock, less pipelined machines
>like the Berkley RISC and MIPS follow the second approach. Which is better is

Now wait a minute. I don't think anyone at Berkeley, Stanford, or MIPS Co.
will agree with this statement. The clock speeds may vary among the machines
you mention, but that is basically a consequence of implementation technology.
I think everyone is trying to make pipestages as short as possible so that
future implementations will be able to exploit future technology to the
fullest extent.

>probably dependent upon the technology used to implement the architecture and
>the desired speed. For instance, if we want a very fast vector processor, we
>should probably choose the fast clock, more pipelined architecture. If we want
>a better price/performance ratio, we should probably choose the slow clock,
>less pipelined architecture.

I certainly agree that if a very fast vector processor is required, the highest
clock speed possible with the most pipelining that makes sense should be
chosen. But why should we choose a different approach for the better price/
performance ratio? Unless you are trying only to decrease price (which is
not the same as increasing price/performance), one should still aim for the
highest possible clock speed and pipelining. If the price/performance is
right, I don't care if my add takes one cycle at 1 MHz or 4 at 4Mhz. In
addition, for little extra cost (I claim but can't unconditionally prove),
the 4 at 4 Mhz version will in some cases give me the option of 4 times the
throughput. I do acknowledge that I am starting to talk about a machine
for which FLOPS/MHz may not be a good comparison metric.

>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent. The
>quality of an architecture is dependent on the technology used to implement it,
>and no architecture is "best" under more than a limited range of technologies.
>For instance, under technologies in which the bandwidth to memory is most
>limited, stack architectures (Burroughs, Lilith) will be "better". Under
>technologies where the ability to process instructions is most limited, the
>wide register to register architectures will be "better".

I agree that technology influences (or maybe "should influence") architecture.
But I don't think limited memory bandwidth indicates a stack architecture,
rather, I would say a stack architecture is contraindicated! If memory
bandwidth is a limiting factor on performance, then many registers are needed!
Optimizations which reduce memory bandwidth requirements are those that keep
computed results in registers for later re-use; such optimizations are
difficult, at best, to realize for a stack architecture. When you say "the
ability to process instructions is most limited" I guess that you mean "the
ability to fetch instructions is most limited" (because any processor whose
ability to actually process its own instructions is most limited is probably
not worth discussing). In this case, I would think that shorter instructions
in which some part of operand addressing is implicit (e.g. instructions for a
stack machine) would be indicated; "wide register to register" instructions
would simply make matters worse. Probably the best thing to do is design the
machine right the first time, i.e. give it enough instruction bandwidth.

I fear that this posting reads like a flame; it is not intended to be a flame.

John Mashey

unread,
Oct 29, 1986, 3:09:40 AM10/29/86
to
In article <21...@rochester.ARPA> cr...@rochtest.UUCP (Lawrence Crowl) writes:
>>>In article <7...@mips.UUCP> ma...@mips.UUCP (John Mashey) writes:
> ... MWhets/Mhz, etc, as way to factor out transient technology...

>
>Perhaps what we are missing is that for a given level of technology, a longer
>clock cycle allows us to have a larger depth of combinational circuitry. That
>is, we can have each clock work through more gates. So, a 4 MHz clock which
>governs propogation through a combinational circuit 4 gates deep will do
>roughly the same work as a 1 MHz clock governing propogation through a
>combinational circuit 16 gates deep. Perhaps a better measure is the depth of
>gates required to implement a FLOP, (or an instruction, or a window, etc.).
Can you suggest some numbers for different machines? One of the reasons
I proposed a (simplistic) measure is the absolute difficulty of finding
such things out.

>
>
>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent. The
>quality of an architecture is dependent on the technology used to implement it,
>and no architecture is "best" under more than a limited range of technologies.
>For instance, under technologies in which the bandwidth to memory is most
>limited, stack architectures (Burroughs, Lilith) will be "better". Under
>technologies where the ability to process instructions is most limited, the
>wide register to register architectures will be "better".

Much of this seems true. We always claim that the real meaning of RISC in
VLSI RISC is "Response to Inherent Shifts in Computer technology", i.e.,
in hardware: fast, dense, cheap SRAMs and higher-pincount VLSI packages,
and in software: more use of high-level languages and portable OS's like UNIX.
In the days of core memories, it is likely that the more aggressively
undense RISCs [i.e., those with only 32-bit instructions] would have
been bad ideas for anything but high-end machines.
Given TTL, NMOS, CMOS, ECL, and GaAs, for example, it would be interesting to
hear what people think who are implementing / have implemented the same machine
over multiple technologies [such as DEC VAXen, IBM 370s, and HP Spectrums, all
of which are supposed to exist in at least 3 of the first 4 of the above; I
think most GaAs designs are RISCs, given smaller gate counts.]

Colin Plumb

unread,
Nov 2, 1986, 3:18:50 PM11/2/86
to
In article <19...@mmintl.UUCP> fra...@mmintl.UUCP (Frank Adams) writes:
>>> Comments? What sorts of metrics are important to the people who read
>>> this newsgroup? What kinds of constraints? How do you buy machines?
>>> If you buy CPU chips, how do you decide what to pick?
>>
>>The metrics I'm interested in measure speed. (Basically, I'm hooked
>>on fast machines.) Other constraints are less interesting because:
>>(1) I will buy the fastest machine I can afford, and (2) in terms of
>>architecture, speed is the bottom line -- all else is just
>>mitigating circumstances.
>
>I must disagree. Reliability is at least as important as speed.

I must disagree. The idea is to get as much effective speed out of the
machine as possible. A machine that is down 50% of the time delivers 1/2
of its operational speed to the user as throughput. Turnaround time (which is
what most people are interested in) will suffer more, under most circumstances.

Still, I'd prefer exclusive use of a big VAX that's down from midnight to noon
to exclusive use of a smaller one that's almost never down. My only interest
is how fast the machine gets my work done.

-Colin Plumb (ccp...@watnot.UUCP)

Will someone tell me why everybody puts disclaimers down here?

Eugene Miya N.

unread,
Nov 3, 1986, 2:37:52 PM11/3/86
to
>> Summary: Words about relability, speed, thruput, etc.

The only problem with speed as the one true performance metric is
Harlan Mills' "Law." He says that he will take any program and make it
five times faster or use five times less storage (note: but not both).
The problem with speed is the frequent interchangeability with storage.
The problem comes with problems which have massive storage AND speed
requirements. Yes, getting your job done fast is important, granted,
but people with foresight understand the costs and tradeoffs associated
with "speed at all costs."

We put disclaimers on the bottoms of some of our postings 1) for humor, like
line eater quotes at the heads of some messages, 2) because some of us
have been burned [what goes around, comes around]. This latter is quite
serious. I know enough now to NOT put postings of DEC, Cray, IBM, etc.
internal material even though I've not signed any non-disclosure
agreements. As Gary Perlman has pointed out, the Net is a great place
to do industrial information gathering (espionage).

--eugene miya
NASA Ames Research Center

cds...@alberta.uucp

unread,
Nov 3, 1986, 2:39:26 PM11/3/86
to
In article <12...@watnot.UUCP> ccp...@watnot.UUCP (Colin Plumb) writes:
>In article <19...@mmintl.UUCP> fra...@mmintl.UUCP (Frank Adams) writes:
>>>The metrics I'm interested in measure speed. (Basically, I'm hooked
>>>on fast machines.) Other constraints are less interesting because:
>>
>>I must disagree. Reliability is at least as important as speed.
>
>I must disagree. The the idea is to get as much effective speed out of the
>machine as possible. A machine that is down 50% of the time delivers 1/2
>of its operational speed to the user as throughput. Turnaround time (which is
>what most people are interested in) will suffer more, under most circumstances.
..whatever that paragraph means.

There are a number of things wrong with this attitude. It is completely bogus
to assume that "50% downtime" means "up from noon to midnite only".

A lack of reliability implies that you cannot predict your downtime. 50%
downtime could easily mean that your "big VAX", which takes 20 minutes
to boot, is ready to accept logins for exactly 1 minute before crashing
and remaining down for 21 minutes. Machine crashes are a stochastic process,
and this is no good at all if your probability of failure is high.

Then again, this whole argument is silly. Reliability is in some sense
orthogonal to the other performance statistics. If your machine crashes
all the time, speed simply doesn't enter into the calculation because
the break it puts in the users' work habits is unacceptable. The probability
of losing valuable work, computed results, etc. is also too high.
I suppose one could trade off reliability for speed, but most manufacturers
realize that unreliable machines are extremely costly in service and annoyance
time, and therefore manufacturers try to maximize the reliability.
Unreliable machines are hard to sell.

> -Colin Plumb (ccp...@watnot.UUCP)


Chris Shaw cdshaw@alberta
University of Alberta
CatchPhrase: Bogus as HELL !

David Singer

unread,
Nov 3, 1986, 4:16:46 PM11/3/86
to
In article <12...@watnot.UUCP> ccp...@watnot.UUCP (Colin Plumb) writes:
>In article <19...@mmintl.UUCP> fra...@mmintl.UUCP (Frank Adams) writes:
>>>The metrics I'm interested in measure speed. (Basically, I'm hooked
>>>on fast machines.) Other constraints are less interesting because:
>>>(1) I will buy the fastest machine I can afford, and (2) in terms of
>>>architecture, speed is the bottom line -- all else is just
>>>mitigating circumstances.
>>
>>I must disagree. Reliability is at least as important as speed.
>
>I must disagree. The the idea is to get as much effective speed out of the
>machine as possible. A machine that is down 50% of the time delivers 1/2
>of its operational speed to the user as throughput. Turnaround time (which is
>what most people are interested in) will suffer more, under most circumstances.

I really fail to see how an immediate or rapid answer you can't trust
is of any use at all.

Lawrence Crowl

unread,
Nov 3, 1986, 5:50:59 PM11/3/86
to
>>> ma...@mips.UUCP (John Mashey)
)) cr...@rochtest.UUCP (Lawrence Crowl)
> bc...@amdcad.UUCP (Brian Case)
] ma...@mips.UUCP (John Mashey)
cr...@rochtest.UUCP (Lawrence Crowl)

>>> ... MWhets/Mhz, etc, as way to factor out transient technology...

))Perhaps what we are missing is that for a given level of technology, a longer
))clock cycle allows us to have a larger depth of combinational circuitry. That
))is, we can have each clock work through more gates. So, a 4 MHz clock which
))governs propogation through a combinational circuit 4 gates deep will do
))roughly the same work as a 1 MHz clock governing propogation through a
))combinational circuit 16 gates deep. Perhaps a better measure is the depth of
))gates required to implement a FLOP, (or an instruction, or a window, etc.).

]Can you suggest some numbers for different machines? One of the reasons
]I proposed a (simplsitic) measure is the absolute difficulty of finding
]such thing out.

No, I cannot suggest numbers. I suspect they would be difficult to obtain.
Maybe I should think more next time.

>Yes, but if the 4 Mhz/4 gates implementation can support pipelining and the
>pipeline can be kept full (one of the major goals of RISC), then it will do
>4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
>MIPS/MHz or whatever/MHz will be the same! Thus, I still think this isn't
>such a bad metric to use for comparison. If pipelining can't be implemented
>or the pipeline can't be kept full for a reasonable portion of the time,
>the the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading indicator.

One of us is confused here, and I do not know which. Assume an instruction takes a
constant 16 combinational gates. The 4 MHz and 4 gates will require 4 stages
while the 1 MHz and 16 gates will require one stage. Both machines will
execute 1 MIPS. But they have a factor of 4 difference in MHz/MIPS. If we
pipeline the 4 MHz and 4 gates into a four stage pipeline, the MHz/MIPS will
be the same but the performance will be a factor of 4 different.

))The very fast clock, heavily pipelined machines like the Cray and Clipper
))follow the first approach, while the slower clock, less pipelined machines
))like the Berkley RISC and MIPS follow the second approach. Which is better is

>Now wait a minute. I don't think anyone at Berkeley, Stanford, or MIPS Co.
>will agree with this statement. The clock speeds may vary among the machines
>you mention, but that is basically a consequense of implementation technology.
>I think everyone is trying to make pipestages as short as possible so that
>future implementations will be able to exploit future technology to the
>fullest extent.

There are at least two approaches, exemplified by the following two examples.
The first has a clock controlling progress through three stages from the
register bank to the ALU, through the ALU, and back to the register bank.
The second approach is to do all this in one stage. The first approach has
the potential to pipe while the second has a lower clock rate. In both cases
faster clock rates allow faster implementations. Which machines take which
approach?

))probably dependent upon the technology used to implement the architecture and
))the desired speed. For instance, if we want a very fast vector processor, we
))should probably choose the fast clock, more pipelined architecture. If we
))want a better price/performance ratio, we should probably choose the slow
))clock, less pipelined architecture.

>I certainly agree that if a very fast vector processor is required, the higest
>clock speed possible with the most pipelining that makes sense should be
>chosen. But why should we chose a different approach for the better price/
>performance ratio? Unless you are trying only to decrease price (which is
>not the same as increasing price/performance), one should still aim for the
>highest possible clock speed and pipelining. If the price/performance is
>right, I don't care if my add takes one cycle at 1 MHz or 4 at 4Mhz. In
>addition, for little extra cost (I claim but can't unconditionally prove),
>the 4 at 4 Mhz version will in some cases give me the option of 4 times the
>throughput. I do acknowledge that I am starting to talk about a machine
>for which FLOPS/MHz may not be a good comparison metric.

Higher clock rates generally imply higher quality parts, more EMI shielding,
etc., which implies a higher cost. You do not expect a 3000 RPM engine to
cost the same as an 8000 RPM engine, do you? In addition, exploiting pipeline
potential generally costs significant development effort and gates to control
the piping. Now, adding some pipelining to a simple scheme is probably cost
effective, but adding as much as is possible is not. We must find a balance.

))BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent. The
))quality of an architecture is dependent on the technology used to implement
))it, and no architecture is "best" under more than a limited range of
))technologies. For instance, under technologies in which the bandwidth to
))memory is most limited, stack architectures (Burroughs, Lilith) will be
))"better". Under technologies where the ability to process instructions is
))most limited, the wide register to register architectures will be "better".

>I agree that technology influences (or maybe "should influence") architecture.
>But I don't think limited memory bandwidth indicates a stack architecture,
>rather, I would say a stack archtitecture is contraindicated! If memory
>bandwidth is a limiting factor on performance, then many registers are needed!
>Optimizations which reduce memory bandwidth requirements are those that keep
>computed results in registers for later re-use; such optimizations are
>difficult, at best, to realize for a stack architecture.

Stacks and registers are not incompatible. It is easy to imagine a machine
which did pushes and pops between the stack and a register bank. If register
to register architectures are allowed to store temporaries and local variables
in registers, the stack architecture should be allowed to also. We should
separate the notion of registers as a means to evaluate expressions and as
a storage media.

>When you say "the ability to process instructions is most limited" I guess
>that you mean "the ability to fetch instructions is most limited" (because
>any processor whose ability to actually process its own instructions is most
>limited is probably not worth discussing). In this case, I would think that
>shorter instructions in which some part of operand addressing is implicit
>(e.g. instructions for a stack machine) would be indicated; "wide register to
>register" instructions would simply make matters worse. Probably the best
>thing to do is design the machine right the first time, i.e. give it enough
>instruction bandwidth.

"The ability to fetch instructions" is precisely what I did NOT mean. You
seem to have effectively argued for a stack architecture when bandwidth to
memory is limited. After all, instructions are in memory. What I meant
by "the ability to process instructions" is once you have the instruction
in the CPU, how quickly can you deal with it (relative to getting it into
the CPU in the first place).

Daniel Klein

unread,
Nov 4, 1986, 9:49:47 AM11/4/86
to

Okay, blatant, flaming opinion time...

I really don't care how fast the internal engine has to run to produce my
output. If my little Alfa Romeo is tooling down the highway at 70 MPH with
an internal engine cycle time of 3100 RPM, and I get passed by a Ferrari
doing 110 MPH with an internal engine speed of 4900 RPM, who is going faster?
Certainly not me, no matter how you multiply the numbers! My MPH/RPM is a
little higher, but I got my doors blown off nonetheless.

So if I am able to build some bizarre semi-synchronous architecture with a
2 GHz clock rate, does it mean my machine is slower (when you divide out the
clock in MFlops/MHz)? I don't think so. If we are looking for an esoteric
comparison of architectural efficiency, *then* perhaps we have a reasonable
metric here.

Now, wasn't it interesting how the MIPS machines appeared at the top of the
performance chart in the initial posting by Mashey? Personally, I think RISC
architectures are a good idea, so I'm not arguing architectural values here.
But RISC looks just *great* when you use the clever little formula of
MFlops/MHz. All I care about though, is who gets my jobs done the fastest.


--> The standard disclaimer: my opinions are my own, so there, nyaa nyaa.
--
--=============--=============--=============--=============--=============--
Daniel V. Klein, who lives in Pittsburgh, allegedly works for the Software
Engineering Institute, and strives to survive as best he can.

ARPA: d...@sei.cmu.edu
USENET: {ucbvax,harvard,cadre}!d...@sei.cmu.edu

"The only thing that separates us from the animals is
superstition and mindless rituals".

Jeff Rininger

unread,
Nov 4, 1986, 12:04:00 PM11/4/86
to
In article <7...@spar.SPAR.SLB.COM> sin...@spar.UUCP (David Singer) writes:
>I really fail to see how an immediate or rapid answer you can't trust
>is of any use at all.


In some domains, a rapid, if somewhat uncertain, answer is
much better than no answer at all. One example is the
defensive software for the DARPA "pilot's associate" research.
I may be able to supply a reference if anyone is interested.

Gregory Smith

unread,
Nov 4, 1986, 2:35:46 PM11/4/86
to
In article <7...@spar.SPAR.SLB.COM> sin...@spar.UUCP (David Singer) writes:
>I really fail to see how an immediate or rapid answer you can't trust
>is of any use at all.

This is silly. Broken computers don't give wrong answers. They crash,
or they log soft errors, or they act flaky. It is almost impossible to
imagine a hardware fault that would have no visible effect other than
to make the 'value' (whatever it may be) of the output wrong.

Of course, floating point hardware is a little different, since it
is used only for numerical calculations which are part of the problem
( as opposed to the CPU alu which is also used for indexing, etc.)
You can always arrange to run an FPU diagnostic every 5 mins if this
is an issue.

There are few things more useless than a computer that executes
instructions correctly 999 times out of 1000....

--
----------------------------------------------------------------------
Greg Smith University of Toronto UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

Colin Plumb

unread,
Nov 4, 1986, 3:21:34 PM11/4/86
to

(in response to my posting)...

>
>I really fail to see how an immediate or rapid answer you can't trust
>is of any use at all.

Hm... you have a point there. I was thinking of reliability as an up/down
distinction. You're right that a wrong answer is even worse than no answer.
Still, incorrect answers are rarely hardware faults... They're more usually
software problems.

-Colin Plumb (ccp...@watnot.UUCP)

Guy Harris

unread,
Nov 5, 1986, 5:01:12 PM11/5/86
to
> I really don't care how fast the internal engine has to run to produce my
> output. If my little Alfa Romeo is tooling down the highway at 70 MPH with
> an internal engine cycle time of 3100 RPM, and I get passed by a Ferrari
> doing 110 MPH with an internal engine speed of 4900 RPM, who is going
> faster? Certainly not me, not matter how you multiply the numbers!
> My MPH/RPM is a little higher, but I got my doors blown off nonetheless.

Yes, but what if:

1) Horsepower, say, were linearly proportional to RPM

2) The horsepower needed by both cars to sustain a particular
speed were the same

3) Your Alfa had a redline of 20,000 RPM, while the Ferrari had a
redline of 6000 RPM

4) "All other things are equal"

Then just step on the gas hard enough to get near the redline, and blow the
Ferrari's doors off.

I believe Mashey's thesis is that this is more-or-less the proper analogy;
the maximum clock rate possible is mainly a function of the chip technology,
not the architecture, so an architecture that gets more work done per clock
tick can ultimately be made to run faster than ones that get less work done
per clock tick. I shall voice no opinion on whether this is the case or not
(I don't know enough to *have* an opinion on this), but will just let the
chip designers battle it out.

> So if I am able to build some bizarre semi-synchronous architecture with a
> 2 GHz clock rate, does it mean my machine is slower (when you divide out the
> clock in MFlops/MHz)? I don't think so.

Since MFlops/MHz is !N*O*T! a measure of machine speed, and was never
intended as such by Mashey, the machine is neither faster nor slower "when
you divide out the clock in MFlops/MHz". If you don't divide out the clock,
no, it doesn't mean your machine is slower. Nobody would argue that it did.

> If we are looking for an esoteric comparison of architectural efficiency,
> *then* perhaps we have a reasonable metric here.

Well, what did you *think* MFlops/MHz was intended as? It *was* intended
for comparing architectural efficiency!

Please, people, before you flame this measure as absurd, make sure you're
not flaming it for not being a measure of raw speed; it wasn't *intended* to
be a measure of raw speed. *You*, the end-user, may not be interested in
architectural efficiency, but may only be interested in "how fast something
gets your job done"; the person who has to design and build that something,
however, is going to be interested in architectural efficiency.
--
Guy Harris
{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
g...@sun.com (or g...@sun.arpa)

f...@siemens.uucp

unread,
Nov 6, 1986, 9:07:00 AM11/6/86
to

> This is silly. Broken computers don't give wrong answers. They crash,
> or they log soft errors, or they act flaky. It is almost impossible to
> imagine a hardware fault that would have no visible effect other than
> to make the 'value' (whatever it may be) of the output wrong.

Cheap computers without memory parity checking could have a soft memory
error which would make a data value wrong without crashing the computer,
logging soft errors, or acting flaky. Of course, nobody uses a computer
without error detection, do they? Do you disable parity checking on the
plug-in memory boards for your PC?

-----------------------------------------------------
Frederic W. Brehm (ihnp4!princeton!siemens!fwb)
Siemens Research and Technology Laboratories
105 College Road East
Princeton, NJ 08540
(609) 734-3336

Ken Shoemaker ~

unread,
Nov 6, 1986, 4:40:28 PM11/6/86
to
arguments about architectural efficiency aside, you'd have an easier time
making a system that runs at 8 MHz than one that runs at 33 MHz (or whatever)
even if the overall memory access time requirement is the same. And
you'd have a much easier time making a system that goes at 16 MHz than one
that goes at 66 MHz.
--
The above views are personal.

I've seen the future, I can't afford it...

Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
csnet/arpanet: k...@mipos3.intel.com

Tim Rentsch

unread,
Nov 7, 1986, 2:28:50 AM11/7/86
to
In article <35...@utcsri.UUCP> gr...@utcsri.UUCP (Gregory Smith) writes:

> There are few things more useless than a computer that executes
> instructions correctly 999 times out of 1000....


This prompted a memory which I could not resist sharing with
netland. I know net.arch is not the appropriate place, so followons
to /dev/null, and no flames, ok?

(The following is not original, but I do not remember the source.)

"The code is 99% debugged...."

one in every hundred statements is WRONG!

cheers,

txr

Dik T. Winter

unread,
Nov 8, 1986, 3:43:38 PM11/8/86
to rnews@mcvax

In article <5...@cubsvax.UUCP> pet...@cubsvax.UUCP (Peter S. Shenkin) writes:
>In article <BMS-AT.253> stu...@BMS-AT.UUCP (Stuart D. Gathman) writes:
>>
>> Good floating point software keeps track of
>>the remaining precision as computations proceed.
>
>??? I've never heard of this. Could you say more? Until you do, I will....
>Read on.
>
>> ...Rounding
>>in floating point hardware contributes only a minor portion of
>>the imprecision of the result in properly designed software.
>
>I disagree. Consider taking the average of a many floating point numbers
>which are read in from a file, and which differ greatly in magnitude.
>How many there are to average may not be known until EOF is encountered.
>The "obvious" way of doing this is to accumulate the sum, then divide
>by n. But if some numbers are very large, the very small ones will
>fall off the low end of the dynamic range, even if there are a lot of
>them; this problem is avoided if one uses higher precision (double
>or extended) for the sum. If declaring things this way is what you mean by
>properly designed software, OK. But the precision needed for intermediate
>values of a computation may greatly exceed that needed for input and
>output variables. I call this a rounding problem. I know of no "floating
>point software" that will get rid of this.
>
Well, there are at least three packages dealing with it: ACRITH from IBM
and ARITHMOS from Siemens (they are identical in fact) and a language called
PASCAL-SC on a KWS workstation (a bit obscure I am sure). They are based
on the work by Kulisch et al. from the University of Karlsruhe. They
use arithmetic with directed rounding and accumulation of dot products
in long registers (168 bytes on IBM). On IBM there is microcode support
for this on the 4341 (or 4381 or 43?? or some such beast).

The main purpose is verification of results (at least, that is my opinion).
For instance on a set of linear equations find a solution interval that
contains the true solution with the constraint that the interval is as
small as possible. They then first proceed by finding an approximate solution
using standard techniques, followed by an iterative scheme to obtain as small
an interval as possible using interval arithmetic combined with long registers.
This is superior to standard interval arithmetic because the latter tends
to give much too large intervals.
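
As a rough illustration of the directed-rounding part only, here is a
minimal C99 sketch; fesetround and the FE_* rounding modes are assumed,
and this is nothing like the full ACRITH long-accumulator machinery:

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

/* An interval [lo,hi] guaranteed to contain the true result. */
typedef struct { double lo, hi; } interval;

static interval iadd(interval x, interval y)
{
    interval r;
    int save = fegetround();
    fesetround(FE_DOWNWARD); r.lo = x.lo + y.lo;  /* round toward -inf */
    fesetround(FE_UPWARD);   r.hi = x.hi + y.hi;  /* round toward +inf */
    fesetround(save);
    return r;
}

int main(void)
{
    volatile double one = 1.0, three = 3.0;
    interval third, sum;

    fesetround(FE_DOWNWARD); third.lo = one / three;
    fesetround(FE_UPWARD);   third.hi = one / three;
    fesetround(FE_TONEAREST);

    sum = iadd(third, third);
    printf("[%.17g, %.17g] encloses 2/3\n", sum.lo, sum.hi);
    return 0;
}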
--
dik t. winter, cwi, amsterdam, nederland
UUCP: {seismo,decvax,philabs,okstate,garfield}!mcvax!dik
or: d...@mcvax.uucp
ARPA: dik%mcvax...@seismo.css.gov

Frank Adams

unread,
Dec 31, 1986, 7:46:49 PM12/31/86
to
>> Comments? What sorts of metrics are important to the people who read
>> this newsgroup? What kinds of constraints? How do you buy machines?
>> If you buy CPU chips, how do you decide what to pick?
>
>The metrics I'm interested in measure speed. (Basically, I'm hooked
>on fast machines.) Other constraints are less interesting because:
>(1) I will buy the fastest machine I can afford, and (2) in terms of
>architecture, speed is the bottom line -- all else is just
>mitigating circumstances.

I must disagree. Reliability is at least as important as speed.

Frank Adams ihnp4!philabs!pwa-b!mmintl!franka
