On RISC processors without a scoreboard*, how are the results of memory
references guaranteed to be available before they are used in subsequent
computations? I thought through several scenarios and couldn't find a very
good solution; being unfamiliar with RISCs other than the 88k, I
thought the net might have some answers.
1) If the processor waited until the load was complete before launching
the next instruction, that would guarantee the correct result was available
before an attempt was made to use it. This obviously is very poor from
a performance standpoint though.
2) If the compiler/assembly writer is required to wait a certain number
of ticks after a load before using the results of the load, how is the
required amount of delay determined? Can you assume you always get a
cache hit (I doubt it)? Do you have to use worst case (then why use a
cache at all)? What if the access is across a shared bus and the tape
controller is currently hogging the bus and will be for the next 10ms
(you certainly wouldn't want to use a 10ms delay after each load)? Does
the memory system have to give absolute top priority to CPU requests
so that the required delay is fixed? What if memory refresh were also
pending - it seems here you would have a nasty choice of letting memory
be trashed or letting the CPU use bogus results?
Does #2 mean that a binary compiled for a particular system won't run on
a different system (having a different memory speed) even though they have
the same CPU chip and that instead, the source must be recompiled for each
different system the program is to run on (or if the current system is
sped up/slowed down)? This seems like it would be clumsy and difficult
to manage from a software distribution aspect.
3) Is there some internal register that can be referenced that indicates
when the result is available and software is required to interrogate
the register to determine if the result is ready? This would add overhead to
every load instruction.
4) Is there some other method I haven't thought of that is the method of
choice? Which RISC processors don't have scoreboarding and what method
do they use?
* For those unfamiliar with what scoreboarding is, the scoreboard is
merely an internal register which keeps track of which registers have
stale data (for example the destination register of a load instruction
which has been launched but not completed). If the CPU encounters an
instruction which uses a register which is stale, it waits until the
scoreboard indicates the register is valid again before it begins
the instruction's execution, thus assuring the correct operands are used
by the instruction. Note this is a hardware feature - software is
not required to examine the scoreboard to accomplish this.
As an example, on the 88k, you could do the following sequence:
ld r2,r0,0 ; Assume this ld takes a long time.
add r3,r2,1 ; Add 1 to result of ld and put in r3.
and the result would be correct (for any length of ld time). Without
a scoreboard, the result in r3 would be based on the stale data in r2
(which is probably the wrong value).
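The stall behavior described above can be sketched as a toy model (Python; an illustrative simplification of the idea, not the 88k's actual circuitry):

```python
# Toy model of a bit-per-register scoreboard. A long-latency instruction
# (e.g. a load) marks its destination register stale at issue; writeback
# clears the bit. Issue logic stalls if any source operand is stale.

class Scoreboard:
    def __init__(self, nregs=32):
        self.stale = [False] * nregs

    def issue_load(self, dest):
        self.stale[dest] = True       # result not yet available

    def writeback(self, dest):
        self.stale[dest] = False      # result now valid

    def must_stall(self, *operands):
        # Hardware stalls issue if any source operand is stale.
        return any(self.stale[r] for r in operands)

sb = Scoreboard()
sb.issue_load(2)             # ld r2,r0,0
assert sb.must_stall(2)      # add r3,r2,1 would stall here
sb.writeback(2)              # load completes (however long that took)
assert not sb.must_stall(2)  # the add now issues with the correct operand
```

Note the software never consults this state; the asserts above model what the hardware does automatically.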
--
Just a few more bits in the stream.
The Sneek
The scoreboarding on the 88k (correct me if I'm wrong) associates a bit
for each register in the register file to tell whether the value is
stale or not. During the issuing of an instruction, the operand
registers are checked for staleness, and the pipe is stalled if
required.
The bit-per-register is one way of detecting dependencies. Another way
(used on the Am29000) is to associate a destination register pointer
(and a "ready bit") with each functional unit. Forwarding logic matches
the register operand fields of the instruction being issued with the
destination register pointers of the functional units, stalling if it
finds a match and the value is not available, yet. When the value is
available, it is forwarded to the correct operand bus (A or B, or
perhaps both) so that the instruction does not have to wait for the
value to be written into the register file. This method has the benefit
of integrating closely with the forwarding logic.
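A rough sketch of the scheme Tim describes (Python; the names and structure are my own illustration, not the actual Am29000 logic):

```python
# Toy model of per-functional-unit dependency tracking with forwarding.
# Each unit carries a destination register pointer and a ready bit; issue
# logic compares source operand fields against all units' pointers.

class FunctionalUnit:
    def __init__(self):
        self.dest = None      # register number this unit will write
        self.ready = False    # has the result been computed yet?
        self.value = None

def check_operand(units, reg):
    """Return (stall, forwarded_value) for one source register."""
    for u in units:
        if u.dest == reg:
            if not u.ready:
                return True, None      # match, value pending: stall issue
            return False, u.value      # forward onto the operand bus
    return False, None                 # no match: read the register file

fpu = FunctionalUnit()
fpu.dest = 3                           # say an FP add will write r3
stall, _ = check_operand([fpu], 3)     # a consumer of r3 must stall
fpu.ready, fpu.value = True, 2.5       # result becomes available
# check_operand([fpu], 3) now forwards the value with no stall
```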
-- Tim Olson
Advanced Micro Devices
(t...@delirun.amd.com)
This was discussed a while back. Sounds like it's time to do it again.
NOTE: caveats: there are certainly architectures for which some of the
following doesn't fit. It is intended to be mostly applicable to
"mainline" RISC micros.
>2) If the compiler/assembly writer is required to wait a certain number
>of ticks after a load before using the results of the load, how is the
>required amount of delay determined? Can you assume you always get a
>cache hit (I doubt it)?....
Following are:
1) How the MIPS R2000/R3000 does it.
2) Some thoughts on when scoreboarding does/doesn't make sense, and why.
HOW THE R2000/R3000 WORK
1) A load requires one cycle to initiate, plus one "architectural"
cycle of latency, which must be covered by the software,
i.e., it must insert a nop if it can't find anything useful.
2) If we ever built a machine in which the primary cache had 2 or more
cycles of latency, the software would be responsible for the 1st cycle of
latency, and the hardware responsible for cycles 2,...n. (This preserves the
ability to run the existing object code unchanged.) However, given the
domain of problems we worry about, this is probably irrelevant, as we'll
probably never build a system whose first-level cache has more than 1 cycle of
latency, given the performance impacts. Of course, it's hard to build
a machine that has a zero-cycle load latency :-) I.e., there's a
reason for having exactly 1 cycle of latency, not 2 or 0.
3) A lot of statistics say:
a) Reorganizers get to fill 70-80% of load delays in 1-cycle
latency machines.
b) Reorganizers get to fill (as I recall, it's been awhile)
30% of delays in 2-cycle latency machines.
c) After that, it gets bad (10% or less). [What does this mean?
ans: if you load something, you expect to use it pretty soon.]
4) All of that assumes a cache-hit (which is, of course, the most
frequent case). If there is a cache-miss, the main pipeline freezes
while the cache refill is done:
a) One or more words of data are brought into the D-cache.
b) The stalled instruction resumes after the fixup.
Thus, the software couldn't care less how far away memory is, whether or
not DRAM refresh is running, etc.
5) There are split I & D-caches, and a 1-cycle latency branch is used,
with software also being required to fill the branch delay slot.
Approximately similar numbers (1 slot rather than 0 or 2) apply as
in the load case.
6) As it happens, when I said the main pipeline freezes, what I glossed over
was that there are numerous independent functional units:
a) The main integer unit (i.e., that really knows where the PC is),
including all loads/stores.
b) Integer multiplier/divider
c) FP adder
d) FP multiplier
e) FP divider
The results of b) - e) are interlocked, i.e., if you attempt to use the
result of b) - e), and it isn't ready, it is interlocked by the hardware,
rather than software. You can consider this "partial scoreboarding",
although we don't particularly call it that, as the main pipeline does
completely freeze, rather than trying to issue the following instruction.
7) Now, why should it be that we interlock some things, and not others?
a) Note that the 2 things that are NOT interlocked (load and branch):
Have small (1-cycle) delays that are unlikely to be shortened,
and are undesirable to lengthen
Have about a 70% chance of having a 1-cycle delay slot filled
by software.
b) Note that the things that ARE interlocked:
Are multiple cycle operations (DP ADD = 2, rest are longer).
Are operations whose performance might well change in a future
version (i.e., you might throw more silicon at DP MULT to improve
it from 5 cycles, etc)
Are generally difficult for compilers to handle completely, for
both of the above reasons.
SCOREBOARDING
1) If you look at where scoreboarding came from (I remember it first from
CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones),
you had machines with:
a) Many independent functional units
b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
for example, and the FP operations were of course longer).
c) Multiple memory pipes, with a long latency to memory, and no caches.
2) Note that RISC micros probably have a), but normally, only the FP ops,
and maybe a few others have long multi-cycle operations, and they sure don't
have multiple memory pipes, and they often have caches.
3) Whether a RISC machine has no, some, or complete scoreboarding,
it had better be designed with a pipeline reorganizer in mind, because it's
absolutely silly not to be using one. With one exception (in our case),
a scoreboard can NEVER do any better than a reorganizer. I'll show the
exception later, but here is an example why I say this. Consider:
a = b + c; d = 0;
1 lw r1,b
2 lw r2,c
3 add r3,r2,r1
4 sw r3,a
5 sw zero,d
On a machine with a simple load-interlock, and 1-cycle-latency loads,
the machine would stall for 1 cycle before instruction 3.
Suppose you do more complex scoreboarding, i.e., you continue attempting
to issue instructions? Then, you might do 1, 2, try 3 and save it,
and then either come back and do 3 again (probably) or go on to 4 and
discover that it stalls also. If one looks at integer code sequences,
one discovers that it is hard to discover many things that don't quickly
depend on something else (barring, of course, Multiflow-style compiler
technology and hardware designs not currently feasible in VLSI micros).
Now, the point is that no simple scoreboarding hardware is going to be able
to figure out that the best sequence is:
a = b + c; d = 0;
1 lw r1,b
2 lw r2,c
5 sw zero,d
3 add r3,r2,r1
4 sw r3,a
which ALWAYS wins (given the memory system described). Note that
a 2-cycle latency adds 1 cycle more of delay, even with the reorganized
code. Also, note that if you're willing to add mucho more
hardware, you can lookahead further (really, keeping 1 independent PC or
pipeline for each extra instruction, but compilers can always lookahead
further). Similar things go on for floating-point, i.e., no matter
how much scoreboarding you do, the compiler MUST do some code-spreading
or you will lose performance.
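The reordering shown above is exactly the kind of transformation a reorganizer performs mechanically. A greatly simplified sketch (Python; the dependence rules are my own minimal version, real reorganizers do much more):

```python
# Toy pipeline reorganizer: if the instruction after a load immediately
# uses the load's result, hoist a later independent instruction into the
# delay slot. Instructions are (text, reads, writes) with register sets.

def fill_load_delay(insns):
    out = list(insns)
    for i in range(len(out) - 1):
        text, reads, writes = out[i]
        if not text.startswith("lw"):
            continue
        if not (writes & out[i + 1][1]):
            continue                        # no load-use hazard here
        for j in range(i + 2, len(out)):
            ctext, creads, cwrites = out[j]
            between = out[i + 1:j]
            moveable = (
                not (writes & creads)       # doesn't use this load's result
                and not (cwrites & writes)  # doesn't clobber the load's dest
                and all(not (creads & b[2])     # no flow dependence
                        and not (cwrites & b[1])  # no anti-dependence
                        and not (cwrites & b[2])  # no output dependence
                        for b in between)
            )
            if moveable:
                out.insert(i + 1, out.pop(j))   # fill the delay slot
                return out
    return out

insns = [
    ("lw r1,b",      set(),        {"r1"}),
    ("lw r2,c",      set(),        {"r2"}),
    ("add r3,r2,r1", {"r1", "r2"}, {"r3"}),
    ("sw r3,a",      {"r3"},       set()),
    ("sw zero,d",    set(),        set()),
]
# fill_load_delay(insns) hoists "sw zero,d" up to cover the second load's
# latency, producing the lw, lw, sw, add, sw order that always wins.
```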
Note also that streamlined RISC micros usually do 2 register reads and
1 write per cycle; it's hard to get anything else squeezed in.
As one last example, one of the MIPers here once wrote a reorganizer for
a machine architecture that was fully-interlocked, and improved the
performance at least 5% on some benchmarks.
SUMMARY OF THIS PART: any pipelined machine ought to be done expecting to have
a good reorganizer, even if fully scoreboarded.
The exception I mentioned:
function(x) =>
lw r2,x
jal function
nop
We can't move the lw to cover the jal-delay slot, because we're not sure
the function's first instruction doesn't use it. It turns out that the
number of jal-delay slots left unfilled is fairly low, and the fraction
that might have been filled by a load is even lower, as there is almost
always some argument-gathering that can be moved into the delay slot.
(Most of the jal-delay slots come from code like this:
if (something)
func1();
else
func2();
of course, one can turn on a linker-optimizer that looks at the targets
of jal's and copies the target into the delay slot if it can.)
Anyway, I think the bottom-line is that full scoreboarding:
Is not very relevant to the integer part of the CPU.
Might be useful for long multi-cycle operations (such as FP),
maybe (although Earl Killian has a long analysis that
argues otherwise that I don't have time to condense & post;
maybe Earl will),
Might be helpful for compilers that can't do code scheduling,
even though it is silly to do high-performance pipelined RISCs
without having a good reorganizer.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR ma...@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
Usually by assuming a cache hit; the entire processor stalls on a cache miss.
Details of the stall may vary.
Another thing that register scoreboarding can be used for is to prefetch
data into caches: if you know that you are going to use a particular vector
at some time in the future, you can preload it into the cache so that it
is available when you get back to it. This certainly is a more elegant
solution than a "do the load, but don't wait for the data to ever come back"
method of doing the same thing. This could be useful when working with
very large data sets, i.e., those that don't fit into your cache in a
typical system where the main memory latency is long, but transfer time is
short. But what do I know? And why am I defending Mot?
You don't have to break many eggs to hate omelets -- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
Register scoreboarding is also useful to software types and other end users of
this complicated compiler technology. No matter how mature the compiler is, it
is still a large software program and will have bugs. It may not be worth the
hardware trade offs, but the software costs should also be considered.
Anything that makes the software "safer" seems like a win to me, but I'll admit
I'm biased!
Brian McElhinney
Software Engineer
m...@tc.fluke.com
Sure you can let >1 be writes. In fact, you'd better allow this, if
you're going to interleave your memory banks (and you'll want to do
*that* if your CPU is much faster than your DRAMS and you need a lot
of memory).
What kind of "bus error" are you pondering? If you're thinking of an
access violation, then (I think?) you can just blast the process and
dump whatever state would be helpful to the programmer on the way
out. If you're considering a transient in the hardware, like a
parity error on the bus going to memory, again I'd say you might
either blast the process (and start up some machine-check code) or
you might try to salvage the task. In the TRACE one can have many
memory references in flight simultaneously, and enough info to
restart each is kept in a queue on each CPU board. If an exception
occurs (including a TLB miss) then the trap handler must sort through
these queues and get each reference completed before returning control
to where the trap occurred. I don't think this constitutes an
imprecise interrupt, since the state of the machine can be fixed
exactly.
Bob Colwell mfci!col...@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405 203-488-6090
This sums it up very well.
[ Concerning different method of handling stale registers ]
>value to be written into the register file. This method has the benefit
>of integrating closely with the forwarding logic.
The scoreboard method can allow register forwarding also.
This makes sense for RISC processors with a common data and code bus.
After all, if the data cache misses, the bus is being hogged by the data
unit and the instruction unit then has to stall until the data transaction
finishes (although a few instructions might have been prefetched). I had
gotten used to the split instruction and data buses on the 88k which
yields different behaviour in the above case. On the 88k, a data cache
miss doesn't necessarily create an instruction pipe stall. I think this
is better from a performance standpoint but it does require some type of
scoreboarding to handle the data dependencies properly.
It does; up to 3 memory transactions (4 in a special case) can be pending
simultaneously. 2 of the transactions can be in progress at the same time
(although at different stages of completion) on the external bus (the PBUS).
For example, the following sequence can be done
ld r1,r2,0
st r3,r4,0
ld r5,r6,0
and all 3 transactions will then be simultaneously in progress by
the memory unit. The PBUS timing would look like (assuming cache hits):
Clock PBUS address PBUS data PBUS R/W PBUS response
1 rise r2 - R -
1 fall r2 - R -
2 rise r4 data at *r2 W -
2 fall r4 data in r3 W SUCCESS to ld
3 rise r6 data in r3 R -
3 fall r6 - R SUCCESS to st
4 rise - data at *r6 - -
4 fall - - - SUCCESS to ld
Briefly put, the address lines on the current clock tick correspond
to the next tick's data lines, and the current tick's data lines correspond
to the previous tick's address lines, so 2 transactions are in progress at the
same time. This allows a memory access to be completed each clock tick on
a sustained basis.
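The overlap can be modeled crudely (Python; my own illustration of the two-stage overlap only, assuming every access hits the cache and ignoring rise/fall detail):

```python
# Crude model of the overlapped PBUS timing: the address phase of one
# transaction overlaps the data phase of the previous one, so a result
# completes every tick after the first.

def pbus_schedule(transactions):
    """Return (tick, address_phase, data_phase) rows for back-to-back accesses."""
    rows = []
    for t in range(len(transactions) + 1):
        addr = transactions[t] if t < len(transactions) else None
        data = transactions[t - 1] if t >= 1 else None
        rows.append((t + 1, addr, data))
    return rows

rows = pbus_schedule(["ld r1", "st r3", "ld r5"])
# rows[1] is (2, 'st r3', 'ld r1'): the store's address goes out while the
# first load's data comes back -- two transactions in progress at once.
```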
This is similar (sort of) to what the 88k does, although we do call it
scoreboarding since we interlock on a per-register stale flag rather than a
per-functional-unit done flag. We use the scoreboard on memory loads as well
as on the FP unit.
>a) Note that the 2 things that are NOT interlocked (load and branch):
> Have small (1-cycle) delays that are unlikely to be shortened,
> and are undesirable to lengthen
> Have about a 70% chance of having a 1-cycle delay slot filled
> by software.
But why not interlock on loads as well? It removes the *requirement* that
a load be followed by nop if something useful can't be found to put there
and yet still allows the compiler/coder to put a useful instruction in
the delay slot if one exists. This provides the following benefits.
1) More compact code, since the loads that don't have a useful delay-slot
instruction don't have to have a nop. Of course, this is the less common
case (only 30% of loads), so the difference will likely be slight, but it
is nonetheless an improvement over not interlocking.
2) As one other poster pointed out, it makes the coding easier and more
bug proof since the compiler or coder doesn't have to remember to insert
a nop on the loads with no useful delay slot instruction. On the 88k,
there is no way a user program can access stale registers
regardless of how dumb/bug ridden the compiler or writer is.
I see the above 2 items as improvements created by interlocking on
load results and since the interlock circuitry is already partially in
place (for the FP instructions) I don't see any additional disadvantages so
why not interlock on loads?
>SCOREBOARDING
>1) If you look at where scoreboarding came from (I remember it first from
>CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones),
>you had machines with:
> a) Many independent functional units
> b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
> for example, and the FP operations were of course longer).
> c) Multiple memory pipes, with a long latency to memory, and no caches.
>2) Note that RISC micros probably have a), but normally, only the FP ops,
>and maybe a few others have long multi-cycle operations, and they sure don't
>have multiple memory pipes, and they often have caches.
^^^^^^^^^^^^^^^^^^^^^
The 88k has this as well as a & b (on board FP unit).
>On a machine with a simple load-interlock, and 1-cycle-latency loads,
>the machine would stall for 1 cycle before instruction 3.
>Suppose you do more complex scoreboarding, i.e., you continue attempting
>to issue instructions? Then, you might do 1, 2, try 3 and save it,
>and then either come back and do 3 again (probably) or go on to 4 and
>discover that it stalls also. If one looks at integer code sequences,
>one discovers that it is hard to discover many things that don't quickly
>depend on something else (barring, of course, Multiflow-style compiler
>technology and hardware designs not currently feasible in VLSI micros).
The 88k scoreboarding logic doesn't perform so complex a task. It
merely monitors the instruction stream and stalls the instruction unit if
an attempt is made to execute an instruction which uses stale
registers and then lets the instruction unit continue once the correct
operands are available. For example, in the following code sequence
(assume the ld is a cache miss):
1) ld r2,r3,0 ; Get value.
2) add r3,r3,16 ; Bump pointer
3) add r2,r2,1 ; Increment value.
4) sub r4,r4,1 ; Dec count.
the instruction unit will stall on instruction 3 since it attempts to
use stale data. Even though instruction 4 theoretically could be
executed (as it isn't dependent on the ld results), it won't be started
until the ld is complete, and instruction 3 is completed.
Efficient code on the 88k is just as dependent on a good compiler
(or coder) as any other RISC micro. Scoreboarding on the 88k simply
ensures you won't use stale data inadvertently - it's up to the
compiler to produce efficient code that minimizes stalls.
> Might be useful for long multi-cycle operations (such as FP),
> maybe (although Earl Killian has a long analysis that
> argues otherwise that I don't have time to condense & post;
> maybe Earl will),
Consider the case where multiple FP operations can be in progress at
the same time (as on the 88k). How could you handle this situation without
a per-register good/stale indicator (read: a scoreboard)? Without a
scoreboard, the processor would seem to be required to process memory
accesses and FP operations sequentially. This might be acceptable but
it probably will impact performance adversely in FP or memory intensive
applications.
So it seems to me that scoreboarding is not required as long as the
multi-cycle functional units are single threaded. Here, the per functional
unit done flag seems like it would work ok. However, to obtain the
performance boost from pipelined functional units (like on the 88k),
register scoreboarding is a requirement and not an option.
Can you elaborate a bit? Forwarding requires that you also know which
functional unit(s) to forward from -- don't you need other hardware than
the scoreboard bits for this?
-- Tim Olson
Advanced Micro Devices
(t...@delirun.amd.com)
--
In article <4...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
>What kind of "bus error" are you pondering? If you're thinking of an
>access violation, then (I think?) you can just blast the process and
>dump whatever state would be helpful to the programmer on the way out.
The programmer would appreciate knowing where the error occurred.
If the PC has jumped and a register window has slid by the time the
error comes back, he could be in trouble, unless...
>In the TRACE one can have many
>memory references in flight simultaneously, and enough info to
>restart each is kept in a queue on each CPU board. If an exception
>occurs (including a TLB miss) ... the state of the machine can be fixed
>exactly.
...unless the hardware keeps enough state information. (As we keep building
our machines faster and faster, let's not forget the seat belts :-)
--
------------------------------------------------------------------------------
My opinions are my own. Objects in mirror are closer than they appear.
E-Mail route: ...!pyramid!garth!walter (415) 852-2384
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
------------------------------------------------------------------------------
>Can you elaborate a bit? Forwarding requires that you also know which
>functional unit(s) to forward from -- don't you need other hardware than
>the scoreboard bits for this?
> -- Tim Olson
Yes other hardware is required. My posting was intended to elaborate that
scoreboarding doesn't *preclude* register forwarding - not that scoreboarding
itself is sufficient to implement the feature.
Why do you say that forwarding requires knowledge of which functional
unit will be supplying the fed forward result? The instruction unit knows
which register(s) it is waiting on and can enable feed forward circuitry
so that when the result(s) is available, it is fed to the instruction unit
simultaneously with the result being written to the register file. With
this method, the feed-forward logic can be independent of the functional units.
The instruction unit doesn't know, and doesn't care, what is supplying the
result but just knows that it needs that particular result for the next
instruction and snags it.
Forwarding could also be implemented by connecting the input of the
instruction unit to the output of the particular functional unit that
will provide the result, but this would require a separate feed-forward
path for each functional unit. This would also work but as you say,
requires knowledge of which functional unit the result(s) is expected
from.
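A sketch of the register-number-keyed scheme just described (Python; illustrative only, not the 88k's actual circuit):

```python
# The issue stage knows only which register numbers it awaits; any
# writeback tagged with an awaited number is captured directly, no matter
# which functional unit produced it.

class IssueStage:
    def __init__(self, waiting_on):
        self.waiting_on = set(waiting_on)  # register numbers still needed
        self.captured = {}                 # values snagged off the writeback bus

    def snoop_writeback(self, reg, value):
        # Result always goes to the register file; if awaited, it is also
        # latched directly into the operand input.
        if reg in self.waiting_on:
            self.captured[reg] = value
            self.waiting_on.discard(reg)

    def can_issue(self):
        return not self.waiting_on

stage = IssueStage(waiting_on=[2])   # next instruction needs r2
stage.snoop_writeback(7, 123)        # unrelated writeback: ignored
stage.snoop_writeback(2, 42)         # the load result for r2 arrives
# stage.can_issue() is now True; stage.captured[2] holds the forwarded value
```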
John meant the ability to do multiple data references per cycle, not
the ability to do an instruction fetch and data fetch per cycle (which
any well designed RISC supports). So I don't think the 88k qualifies
here.
> Consider the case where multiple FP operations can be in progress at
> the same time (as on the 88k). How could you handle this situation without
> a per register good/stale indicator (read scoreboard)? Without a
> scoreboard, the processor would seem to be required to process memory
> accesses and FP operations sequentially. This might be acceptable but
> it probably will impact performance adversely in FP or memory intensive
> applications.
You don't need a scoreboard to do what you say, but when the number of
pending results is large, it is the most appropriate technique. The
alternative, providing a register field comparator for each pending
result, is appropriate when the number of pending results is small.
But my question in all this is why did the 88000 choose to fully
pipeline floating point and thus allow such a large number of pending
results? I understand why the 6600 and its successors did it, but the
same analysis for the 88000 suggests it is unnecessary. You don't
have the cache bandwidth to make fp pipelining useful even on large
highly vectorizable problems (i.e. 32b per cycle isn't enough). You
can't feed the fp pipeline fast enough.
For example, for linpack on a load/store machine with 32b/cycle you
need only start a new add every 8 cycles and a new multiply every 8
cycles to run at memory speed. With fp latencies < 8 cycles, no
pipelining is necessary in the individual fp units. Go through the 24
livermore loops and count the number of words loaded/stored per fp
op and you'll see similar results.
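Earl's arithmetic can be sketched as follows (my own illustrative accounting; his figure of 8 cycles presumably also counts loop overhead and other traffic):

```python
# Memory-bandwidth bound on fp issue rate for a daxpy-style inner step,
# y[i] = y[i] + a * x[i], on a load/store machine with a 32-bit bus.
BUS_BYTES_PER_CYCLE = 4      # 32b per cycle to/from memory
DP_WORD = 8                  # bytes per double-precision operand

loads, stores = 2, 1         # x[i] and y[i] in, y[i] out
mem_cycles = (loads + stores) * DP_WORD // BUS_BYTES_PER_CYCLE

# One add and one multiply start per iteration, so each fp unit starts a
# new operation only every mem_cycles (= 6 here) cycles: with fp latencies
# shorter than that, pipelining inside the fp units buys nothing.
```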
The price you paid for pipelining appears to be enormous: the 88100's
fp op latencies average 2.5x longer than the R3010's when measured in
cycles; 3x longer when measured in ns.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
>1) More compact code...
This is a legitimate issue, although it's about a 5% effect.
We considered doing load-interlocks (and can do so in the future in
an upward -compatible way), but didn't for philosophical reasons,
i.e., we had a rule that we wouldn't put features in that we'd have to
live with forever if we couldn't prove they were worthwhile for
performance (1%); there was some concern that for some design directions,
there might be a cycle-time hit if this were in the critical path somewhere.
Anyway, it is a legitimate topic for debate.
>2) As one other poster pointed out, it makes the coding easier and more
>bug proof since the compiler or coder doesn't have to remember to insert
>a nop on the loads with no useful delay slot instruction. On the 88k,
>there is no way a user program can access stale registers
>regardless of how dumb/bug ridden the compiler or writer is.
This one, however, is completely a non-issue, although, to be fair,
it's frequently asked by people running into the R2000 for the first time:
a) It is trivial to make the assembler insert load-nops where
needed. I doubt that our code took more than 10 minutes to write
and debug, and I don't ever remember having problems with that
in almost 4 years.
b) Any software system that couldn't be trusted with this problem,
shouldn't be trusted with ANY problem: really, there are zillions
of harder problems that need to be solved. (well, maybe not zillions,
but many :-)
c) Figuring that a RISC compiler should do optimization, but worrying
that this feature might be buggy, is like worrying about the safety
of flying in a 747 and bringing your own seat-belt because you
don't really trust Boeing to remember to include them :-)
....
> The 88k scoreboarding logic doesn't perform this complex of a task. It
>merely monitors the instruction stream and stalls the instruction unit if
>an attempt is made to execute an instruction which uses stale
>registers and then lets the instruction unit continue once the correct
>operands are available. For example, in the following code sequence
>(assume the ld is a cache miss):
>
>1) ld r2,r3,0 ; Get value.
>2) add r3,r3,16 ; Bump pointer
>3) add r2,r2,1 ; Increment value.
>4) sub r4,r4,1 ; Dec count.
>
>the instruction unit will stall on instruction 3 since it attempts to
>use stale data. Even though instruction 4 theoretically could be
>executed (as it isn't dependent on the ld results), it won't be started
>until the ld is complete, and instruction 3 is completed.
Thanx: we weren't sure whether it had multiple streams or not.
The example seems to indicate that the 88K indeed has a load with
2 cycles of latency (i.e., cycles 2 & 3 above). From the example in
<10...@nud.UUCP> that gave cycles for the ld/st/ld code, one would have
thought there was only 1 latency cycle. Can you say:
a) Are there indeed 2 latency cycles (i.e., that instruction 3
will indeed stall above)?
b) If so, what is the reason for the second latency slot?
(I realize that you may not want to answer this one :-)
Note that our numbers say that in our machines, it would cost us
10-15% in overall performance to go from 1 cycle latency to 2,
and the similarity of machines probably means about the same amount
for an 88K.
>In article <10...@nud.UUCP> t...@nud.UUCP (Tom Armistead) writes:
>> For example, in the following code sequence
>>(assume the ld is a cache miss):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>1) ld r2,r3,0 ; Get value.
>>2) add r3,r3,16 ; Bump pointer
>>3) add r2,r2,1 ; Increment value.
>>4) sub r4,r4,1 ; Dec count.
>>
>>the instruction unit will stall on instruction 3 since it attempts to
>>use stale data.
>Thanx: we weren't sure whether it had multiple streams or not.
>The example seems to indicate that the 88K indeed has a load with
>2 cycles of latency (i.e., cycles 2 & 3 above).
> ...
>Note that our numbers say that in our machines, it would cost us
>10-15% in overall performance to go from 1 cycle latency to 2,
>and the similarity of machines probably means about the same amount
>for an 88K.
>--
>-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
Tom said to assume that the load caused a cache miss, so the example
given does not imply that the load has 2 delay cycles. I think
everyone designing fast new machines HOPES that the m88k will have
some glaring defect like that. I know I sincerely hope that it has
at least ONE delay cycle, or all our gooses are cooked! ;-)
Now, wasn't that a clever way to give an example and still not give
away the timings?
-Mark Hall (smart mailer): mark...@pyramid.pyramid.com
(uucp paths ):
{amdahl|decwrl|sun|seismo|lll-lcc}!pyramid!markhall
I have observed a few cases where hours were wasted searching for a
programming bug, only to find that the CISC (68020) compiler's optimizer
is buggy (so one turns on the optimizer only for a release, since it
slows compilations anyway, thoroughly test again after turning on, etc.).
Are there really *good* reasons to put more trust in RISC optimizers?
Methinks the metaphor a bit overdone.
--
------------------------------------------------------------
Ron Voss Motorola Microsystems Europe, Munich
mcvax!unido!mucmot!ron CIS 73647,752
my opinions are just that, and not necessarily my employer's
------------------------------------------------------------
On the 88k, a load on cache hit does in fact have two delay cycles.
The data memory load/store pipeline is three deep.
-=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP]
(andrew%tekecs....@relay.cs.net) [ARPA]
Ron, I couldn't let that one slide on by. I think it's a truism that
good designers produce good designs, including compilers/optimizers,
and not-so-good designers produce everything else. My experience has
been that the quality of a piece of software is a far stronger
function of the abilities of its creators than of its intrinsic
complexity.
One could perhaps make the case that a RISC optimizer is likely to be
intrinsically less complex than its CISC counterpart due to the
more-easily-understood machine spec, but it's probably impossible to
prove, and I don't think it's as strong an argument as the above.
I also have the feeling that more talented designers are turning
their attentions to compilers than ever before, so perhaps the
overall quality of this kind of software is improving, but I won't
even try to defend that thesis if you don't like it.
I have to agree. I once saw an old version of the Pyramid cc optimizer
totally trash a for(i = 0; i < 7; i++) loop with a printf() in the middle.
I guess that bit of code didn't really do anything. :-) (Talk about FUN
to debug...) And we've been told to avoid the global optimizer like the
plague when any of the code to be tuned is asynchronous, like the signal
handlers we have in 99.99% of our applications...
Phil Kos
Information Systems
...!uunet!pyrdc!osiris!phil The Johns Hopkins Hospital
Baltimore, MD
Assuming the FP operands are not in cache, this is true. However,
there will be some class of problems which can make effective use of
the FP pipelining and assuming that FP pipelining has no bad side effects
(see paragraph below), it only makes sense to provide the feature.
>The price you paid for pipelining appears to be enormous: the 88100's
I didn't design the 88100 (I am not a chip designer at all). However,
I doubt that any speed difference is due to a tradeoff made to provide FP
pipelining. Whether the FP unit was pipelined or not, I think the FP
latencies would still be the same. Do you have some reason to think that
not pipelining can decrease FP latency? If no increase in latency is
incurred to provide pipelining, then providing it can only help
performance.
>fp op latencies average 2.5x longer than the R3010's when measured in
>cycles; 3x longer when measured in ns.
Would you post the R3010 FP stats? Not that I doubt you, I'm just
interested in how each particular R3010 FP instruction compares to the
equivalent 88K instruction.
> a) Are there indeed 2 latency cycles (i.e., that instruction 3
Yes, 2 on loads. 0 latency on stores.
> b) If so, what is the reason for the second latency slot?
Address calculation. The 88k provides a basic set of three addressing
modes. The most frequently used construct for referencing memory is
"register pointer + offset" (offset is rarely 0). Without providing this basic
addressing mode, the code would have to do something like:
add ptr,ptr,offset ;
ld dest,ptr,0
sub ptr,ptr,offset ; If pointer must be preserved.
for most loads and stores. This overhead is worse than a 2 tick latency
on loads. Also note that there is 0 latency on 88k store instructions*.
Since many lds have a corresponding st instruction somewhere, the averaged 88k
latency will be < 2 ticks.
* What is the latency on a R3000 store instruction?
>Note that our numbers say that in our machines, it would cost us
>10-15% in overall performance to go from 1 cycle latency to 2,
Assuming the addressing modes and other aspects of the machine remain
the same, this figure is in the ballpark (although a little high).
However, I think the static machine assumption is not a valid one to make.
Umm, this has nothing whatsoever to do with RISC vs. CISC nor optimizer bugs;
the problem here is that signal handlers (and other forms of asynchrony)
violate assumptions that optimizers tend to make. For example, in the
following case:
extern int j;
int i, k, l, a, b, c;
...
i = j;
k = a + b + c;
l = j;
if the value assigned to "i" happens to be lying around in a register (as a
side effect of the code that performs that assignment) when the value of "j" is
assigned to "l", you can just store that register into "l", right? Wrong. A
signal could arrive between the point at which the value was loaded into the
register and the point at which the value was stored into "l", and the signal
handler can change the value of "j" in that interval. The same can happen with
values stored in registers on I/O devices, or variables shared between
processes.
Unfortunately, optimizers tend not to know that certain external variables are
"safe", and won't change asynchronously, while others are "unsafe" and may
change asynchronously. Some optimizers may simply assume that all external
variables are "safe", in which case you'd better disable the optimizer when
writing code where not all external variables are "safe" (or, at least, cut the
optimization down to a level where the optimizer no longer makes this
assumption).
To see a painfully elaborate discussion of ways to handle this problem, just
search for the word "volatile" in all articles in "comp.lang.c" on your machine
and print all the articles in question. :-)
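A minimal C sketch of the fix hinted at here, using the variable names from the fragment above (the handler is purely illustrative): declaring "j" volatile tells the optimizer it may change asynchronously, so each read of "j" becomes a fresh load rather than a reused register copy.

```c
#include <signal.h>

/* Declaring j volatile tells the optimizer that j may change
 * asynchronously (e.g. from a signal handler), so the assignment
 * to l below must reload j rather than reuse a register copy. */
volatile sig_atomic_t j;
int i, k, l, a, b, c;

void handler(int sig)          /* illustrative: may run between the reads */
{
    (void) sig;
    j = 42;
}

void example(void)
{
    i = j;                     /* first load of j */
    k = a + b + c;             /* a signal could arrive here */
    l = j;                     /* volatile forces a second load */
}
```

Standard C only promises well-defined behavior for volatile sig_atomic_t objects shared with signal handlers; for device registers or shared memory, plain volatile qualification is the usual spelling.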
>I have observed a few cases where hours were wasted searching for a
>programming bug, only to find that the CISC (68020) compiler's optimizer
>is buggy (so one turns on the optimizer only for a release, since it
>slows compilations anyway, thoroughly test again after turning it on, etc.).
>Are there really *good* reasons to put more trust in RISC optimizers?
>Methinks the metaphor a bit overdone.
Sorry, let me try again. I wasn't trying to say optimizers were easy
or safe at all: they're not, in fact, they are like building a 747.
The point was, in fact, that if you're doing an optimizer (building
a 747), THAT's what you worry about, because it's hard. You don't
worry about getting nops after loads ("this feature") (seatbelts),
because if somebody can't get THAT right, they have no business
building optimizers (747s). [No disrespect to Boeing, I just picked the
biggest plane I've flown in.]
1) I have marginally more trust in RISC optimizers than in CISC ones,
on general principles [i.e., at least some parts are simpler.]
2) What I mostly trust/distrust is the optimizer's author(s);
I'd much rather have an expert's CISC optimizer than a turkey's RISC one.
(Fortunately, I get to use a super-expert's RISC optimizer. :-)
I'm interested... at first I thought you said that 99.99% of your applications
were in the signal handlers, but that was wrong. Still, what do you do in
the signal handlers - set flags that other steps can use? Sounds like you
need volatile.
Hey! Why not ask somebody who has to write the compiler? Don't you trust us?
No? Smart idea. Seriously, the reason given for delayed branches is that
(1) the compiler can do it, (2) nobody uses assembly anyway.
I would like to think that (1) is true, but compilers have just as many
problems as anybody else. (2) I have not yet heard of any machine in general
production without an assembler. Even if that were the case, because (1) is
false, somebody has to debug the compiler. It is hard enough to backtranslate
assembly through the compiler to the source language on a conventional
machine.
Given the current ratio of software cost to hardware cost, I feel the more
extreme RISC proposals are misguided.
TA# I didn't design the 88100 (I am not a chip designer at all).
TA# However, I doubt that any speed difference is due to a tradeoff made
TA# to provide FP pipelining. Whether the FP unit was pipelined or not, I
TA# think the FP latencies would still be the same. Do you have some reason
TA# to think that not pipelining can decrease FP latency? If no increase
TA# in latency is incurred to provide pipelining, then providing it can only
TA# help performance.
.........................................................................
The answer to this is, yes, I believe the lack of pipelining does
partially explain why the R3010 has smaller latency than the pipelined
88k. Do you have an alternate suggestion?
For example, non-pipelined functional units can reuse hardware
resources. If it takes 2 shifts to do an fp add, a non-pipelined adder
can have just one shifter but a pipelined adder needs 2 shifters.
This is especially nasty for things like the divider which reuse some
hardware entities (e.g. quotient digit lookup table) many, many times.
An N-stage-pipelined divider would have needed N of these lookup tables.
By avoiding the need for extra hardware, the non-pipelined implementation
frees up valuable chip area. In the R3010 we used that area for
(i) 64-bits x 16-words x 4-ports register file; (ii) separate divider
independent of multiplier; (iii) delay-optimized physical layout and hand
tweaked circuit design employing LARGE transistors; (iv) control unit's
resource scheduler that permits 4 instructions to execute concurrently.
.........................................................................
TA# Would you post the R3010 FP stats? Not that I doubt you, I'm just
TA# interested in how each particular R3010 FP instruction compares to the
TA# equivalent 88K instruction.
.........................................................................
88100 R3010 cycle
Operation cycles cycles ratio
======================================
sp add 5 2 2.5
dp add 6 2 3.0
sp mul 5 4 1.2
dp mul 10 5 2.0
sp div 30 12 2.5
dp div 60 19 3.2
sp convert 5 1-3 1.7-5
dp convert 6 1-3 2-6
abs/neg/mov 5? 1
.........................................................................
eak> You don't have the cache bandwidth to make fp pipelining useful
eak> even on large highly vectorizable problems (i.e. 32b per cycle
eak> isn't enough). You can't feed the fp pipeline fast enough.
TA# Assuming the FP operands are not in cache, this is true. However,
TA# there will be some class of problems which can make effective use of
TA# the FP pipelining and assuming that FP pipelining has no bad side effects
TA# (see paragraph below), it only makes sense to provide the feature.
.........................................................................
I *was* talking about when the operands are in cache. Let me repeat the
example in my original posting:
For example, for linpack on a load/store machine with 32b/cycle you
need only start a new add every 8 cycles and a new multiply every 8
cycles to run at cache speed. With fp latencies < 8 cycles, no
pipelining is necessary in the individual fp units. Go through the 24
livermore loops and count the number of words loaded/stored per fp
op and you'll see similar results.
What's going on here? The DAXPY inner loop of linpack
DY(I) = DY(I) + DA*DX(I)
has a fp add, fp multiply, two loads, and one store (ignoring loop
overhead, which can be effectively eliminated by unrolling). With a
32b interface to the data cache, the two loads and one store take 6
cycles. The two fp ops take 2 cycles to issue. Total 8 cycles for
2 flops. So if your add latency and multiply latency are <= 8 cycles
you can execute DAXPY without any pipelining, running at 75% of your
cache bandwidth.
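For concreteness, here is that DAXPY inner loop in C (a sketch; the cycle tallies in the comment are the ones from this posting, assuming a 32b/cycle cache interface):

```c
/* DAXPY inner loop from the post: DY(I) = DY(I) + DA*DX(I).
 * Per iteration on a load/store machine with a 32b cache interface:
 *   2 double loads  = 4 cycles, 1 double store = 2 cycles,
 *   fp mul + fp add = 2 issue cycles  =>  8 cycles for 2 flops. */
void daxpy(int n, double da, const double *dx, double *dy)
{
    for (int i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
}
```

Loop overhead is ignored here, as in the posting, on the grounds that unrolling effectively eliminates it.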
Let's formalize this:
Let m be the ratio of memory reference cycles to floating point ops
for a computation. Let f1, f2, ... fn be the frequencies of the
different floating point ops (they sum to 1). Let t1, t2, ... tn be
the latencies of the different floating point ops. Let r1, r2, ... rn
be the pipeline rates of the different op units. Assume the
computation is infinitely parallelizable (the most favorable
circumstance for pipelining).
Consider a load/store architecture with one instruction issue per
cycle. This means the time to do one flop is bounded below by m+1
cycles. A given functional unit will require pipelining (ri < ti) to
run at this rate iff ti*fi > m+1. The pipelining required is a new
op every ri=(m+1)/fi cycles. Alternatively, the latency required for
no pipelining is ti=(m+1)/fi.
Example: linpack with one 32-bit access per cycle (e.g. r3000 or
88000): n=2, f1=0.5, f2=0.5, m=3. Thus r1=r2=8.
Example: linpack with one 64-bit access per cycle, or half-precision
linpack with one 32-bit access per cycle: m=1.5. So 5-cycle adds and
multiplies are sufficient (R3010 latencies work fine). Or pipelining
to start new ops every 5.
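The two examples can be checked mechanically. A small C helper (a sketch of the model above, not anybody's production code) computes the latency bound below which unit i needs no pipelining:

```c
/* From the model above: a load/store machine issuing one instruction
 * per cycle takes at least m+1 cycles per flop, so functional unit i
 * avoids pipelining as long as its latency satisfies t_i <= (m+1)/f_i. */
double max_latency_no_pipelining(double m, double f)
{
    return (m + 1.0) / f;
}
```

For linpack with one 32-bit access per cycle (m = 3, f1 = f2 = 0.5) this yields 8, and for m = 1.5 it yields 5, matching the examples above.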
So what you need to do to refute my claim is to show important
problems with low m values and with fi almost one.
One such computation is something like livermore loops 11 and 12 in
single precision, for which n=1, f1=1, m=2, so you need a new add
every 3 cycles. But of course the R3010's 2 cycle add handles that
just fine without pipelining.
Another such computation is something like the multiple-vector
techniques for LU-decomposition. E.g. for 8-way vectors, m=0.625,
f1=0.5, f2=0.5, so you need 3.25 cycle add and multiply to run at peak
rate for sp, 4.5 cycle add and multiply for dp.
Still, I would not consider these to prove the need for pipelining
with 32b cache interfaces. Do you know a computation which does?
What are its parameters in the above terms?
.........................................................................
> Yes, 2 on loads. 0 latency on stores.
>> b) If so, what is the reason for the second latency slot?
> Address calculation. The 88k provides a basic set of three addressing
>modes. The most frequently used construct for referencing memory is
>"register pointer + offset" (offset is rarely 0). Without providing this basic
>addressing mode, the code would have to do something like:
> add ptr,ptr,offset ;
> ld dest,ptr,0
> sub ptr,ptr,offset ; If pointer must be preserved.
>for most loads and stores. This overhead is worse than a 2 tick latency
>on loads. Also note that there is 0 latency on 88k store instructions*.
>Since many lds have a corresponding st instruction somewhere, the averaged 88k
>latency will be < 2 ticks.
We agree on the desirability of non-zero offsets 100% (although our friends
down the street at AMD may not :-) For this kind of design, I can't imagine
that people would do add/ld/sub, but would use a temp register:
add temp,ptr,offset ;
ld dest,temp,0
Anyway, maybe I'm missing something: I still don't understand where the
second latency cycle comes from in the hardware design. You compute
the address and get it on the bus, and the third instruction following
can actually use the data. Moto info says that the CMMU returns data in one
cycle, so it almost sounds like it's costing a cycle on the CPU to
make the data available (i.e., forwarding logic is not fast enough, or
something). Anyway, I still don't understand what address-mode computation
has to do with it, at least not from the given example [which is identical
to what R2000s do]. So please say some more.
>* What is the latency on a R3000 store instruction?
0.
>>Note that our numbers say that in our machines, it would cost us
>>10-15% in overall performance to go from 1 cycle latency to 2,
> Assuming the addressing modes and other aspects of the machine remain
>the same, this figure is in the ballpark (although a little high).
>However, I think the static machine assumption is not a valid one to make.
Can you say what your numbers are, i.e., why is this high?
My reasoning is as follows:
a) Loads are about 20% of the instructions.
b) You get to schedule approx. 70% of the 1st delay slots.
c) You get to schedule approx 30% of the 2nd delay slots, i.e.,
the penalty for a second delay slot is the 70% that you don't
get to schedule: .7 * 20% = 14%
Now, in this area, R2000s and 88Ks are similar enough that I think the
statistics carry over, but counterarguments would be useful.
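The reasoning above reduces to one line of arithmetic; as a sketch:

```c
/* Penalty of a 2nd load delay slot = (fraction of 2nd slots left
 * unfilled) * (fraction of instructions that are loads).
 * With 30% of 2nd slots filled and loads at 20%: 0.7 * 0.20 = 14%. */
double second_slot_penalty(double load_fraction, double fill_rate)
{
    return (1.0 - fill_rate) * load_fraction;
}
```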
>> b) If so, what is the reason for the second latency slot?
The reason for the additional delay slot is twofold. First, the M88100
has allotted .2 clocks to the data path, .7 clocks to a 32 bit adder,
and more than .25 of a clock for driving a signal off chip with .1
clock setup margins. Since we desired all signals to be synchronous,
we rounded this 1.25 clocks up to 1.5 clocks. These 1.5 clocks are
spent performing the addition and delivering the address to the M88200
cache. The cache has 1 clock from valid address to valid data (1.2
valid address to hit). Then the processor uses the remaining .5 clock
to perform byte extraction, sign extension, delivering the result to
the result bus. Therefore the memory pipeline is 3 clocks in length.
The MIPS Co. designers sliced this into 2 clocks and integrated it into
their basic (only) pipeline. This means that when the data cache
misses or is not available (e.g. bus snooping) that their entire
processor skips a beat. Since the M88100 data pipeline is decoupled
from the instruction pipeline, the beat is not skipped unless a data
dependency occurs. This is more beneficial in the case of outbound
store traffic on write-through pages than on the nominal load traffic.
In two clocks they must perform a 32+16 addition and the mmu function
before driving the pins. Moreover, the multiplexed bus requires that
the addresses be latched (externally on both the R2000 and the R3000).
Memory must still be accessed and delivered to the result bus/register
file/forwarding unit. Therefore, the time must come from somewhere,
and it seems to have come out of the memory SRAM speed in the R3000 and
the circuit speeds and margins. According to the data sheet (using a
F373 transparent latch), the MIPS design requires 20ns SRAMs. Our
design (assuming a glue part cache implementation: F373 latch and 3ns
t-sub-OHA* SRAM with spec sheet timing) allows 32ns SRAM parts at 40ns cycle
or 42ns SRAM at 50ns. The MIPS design requires 0ns SRAM at 40 MHz,
while we do not hit this brick wall until 120 MHz.
Since the two SRAM arrays' data pins are connected together and to
three other interfaces, they must be sampled at 2/3 of the phase period
by all interested parties and then rapidly removed from the multiplexed
data bus (the data is only valid for 3.5ns on the data bus). This
requires very strict timing control over this vital feature. In
addition, the Cypress specs I checked had t-sub-HZOE** of 15ns maximum
while the R3000 requires this in 8ns. Note that this is not a problem
as the other bank of SRAMs has a minimum output enable time of 9ns
while the Cypress parts (CY7C161 16K*4) did not.
The MIPS processors do not snoop their bus and therefore leave memory
coherence to the write-through mechanism. In multiprocessing
applications, the memory bus can become saturated with a few processors
on the bus (~4?). Write-back caches cause a sufficient reduction in
memory bus traffic to allow twice the number of processing ensembles to
utilize the bus.
Therefore, the answer to the original question is: we were more
conservative in allotting nanoseconds to functions, and, in particular,
we attempted to beat on the SRAM technology less hard. If we are
correct, this should be more scalable in the future (read ECL/GaAs) as
off-chip delays approach .4 cycle. Alternatively, our design costs
less to manufacture in high volume and allows less costly SRAM parts
than the MIPS Co. design.
This should allow us to push the clock frequency farther than MIPS Co.,
but not at the current mask set. Remember the recent announcement of
33 MHz M68020 and M68030: a little shrink and a lot of tuning can
obtain many MHz. We have just begun to tune these M88k parts.
>>Note that our numbers say that in our machines, it would cost us
>>10-15% in overall performance to go from 1 cycle latency to 2,
Our numbers show that with the decoupling of the pipeline and earliest
load scheduling we suffer 6-8% cycle-to-cycle performance degradation
when comparing the MIPS design to the M88100 design. This is not too
bad when one considers the relaxation of external RAM speed requirements
and allowing the hit logic to run in arrears of the data delivery.
Independent of this current discussion on the length of load pipelines,
one might want to ask the folks at MIPS Co. this question:
Why did you multiplex your memory bus?
Consider factors related to power dissipation. The current M88100
processors are running between .25 and .5 watts @ 20 MHz. If we were to
multiplex the 2 memory ports as did MIPS Co., our worst case power
consumption would be 4 watts. The problem is that the AC power
dissipation is given by:
2*C*V**2*F*N,
where:
C = load capacitance (F),
V = voltage swing (V),
F = frequency (Hz), and
N = number of pins which make transitions.
(For the MC88100, V = 3.8 volts (TTL logic levels) and F = 20 MHz)
The addresses from the instruction port are very highly correlated
(about 1.4 bits per cycle change). The addresses from the data port
are only partially correlated (less so with better compilers). Mixing
these two streams results in almost uncorrelated address streams and
therefore a bigger N, resulting in more power dissipation. Notice that
the pin counts on the two packages are not that much different (144 for
the R3000 vs. 180 for the MC88100) and neither are the power/grounds
pin counts (30 for the R3000 vs. 36 for the MC88100).
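The posted formula is easy to plug numbers into. A sketch in C; the per-pin load capacitance in the usage note is an assumed illustrative value, not a figure from this thread:

```c
/* AC power from the formula above: P = 2 * C * V^2 * F * N, where
 * C = per-pin load (farads), V = voltage swing (volts),
 * F = frequency (Hz), N = number of pins switching per cycle. */
double ac_power_watts(double c, double v, double f, double n)
{
    return 2.0 * c * v * v * f * n;
}
```

With V = 3.8 V and F = 20 MHz as quoted, 32 pins switching into an assumed 30 pF each comes to roughly 0.55 W, which shows how quickly an uncorrelated address stream adds up.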
So why multiplex the memory bus? Your pin count isn't that much lower,
your power dissipation suffers greatly and you tend to create
difficulties in interfacing to your memory system.
* t-sub-OHA = time of output hold from address change
** t-sub-HZOE = time of output enable high to high Z (impedance)
/\ /\ Mitch Alsup
//\\ //\\ Manager of Architecture for the M88000
///\\\ ///\\\
// \\ // \\ Remember: SPARC spelled backwards is .....!
/ \/ \
/ \
> 88100 R3010 cycle
> Operation cycles cycles ratio
> ======================================
> abs/neg/mov 5? 1
To clear up the question mark: on the 88k, floating point values are
kept in general registers, so "abs" is done by ANDing to 0 the sign
bit, "neg" is done by XORing the sign bit, and "mov" is done with
conventional register-to-register move instructions. Thus:
abs 1
neg 1
sp mov 1
dp mov 2
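The same trick in portable C, for illustration (a sketch: the 88k does this directly on general registers, while here the bit pattern is reached via memcpy):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the 88k approach described above: "neg" is an XOR of the
 * sign bit and "abs" an AND that clears it, shown on the IEEE
 * single-precision bit pattern. */
float fneg_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u ^= 0x80000000u;          /* flip the sign bit */
    memcpy(&x, &u, sizeof x);
    return x;
}

float fabs_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u &= 0x7FFFFFFFu;          /* clear the sign bit */
    memcpy(&x, &u, sizeof x);
    return x;
}
```

Note that on a NaN this just flips or clears the sign bit with no arithmetic performed, which is exactly the IEEE-conformance question raised later in the thread.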
mp> The MIPS processors do not snoop their bus and therefore leave
mp> memory coherence to the write-through mechanism. In
mp> multiprocessing applications, the memory bus can become saturated
mp> with a few processors on the bus (~4?). Write-back caches cause a
mp> sufficient reduction in memory bus traffic to allow twice the
mp> number of processing ensembles to utilize the bus.
Write through to a 32b bus does indeed limit you to about 4 processors
(the max supported by the 88000). If you want more than that, use a
64b bus (~8 processors), or use a secondary cache (which has its own
benefits) and make it write-back. Both approaches have already been
implemented in R2000-based systems.
When you build a R3000-based MP, you don't have to limit the amount of
cache per processor (unlike, e.g., the 88000, which allows only 16KB
per processor in a 4-processor system). If you're building an MP,
presumably you're interested in performance, so it seems strange to
cripple each processor with a small cache.
mp> ...in particular, we attempted to beat on the SRAM technology less
mp> hard. If we are correct, this should be more scalable in the
mp> future (read ECL/GaAs) as off-chip delays approach .4 cycle.
It's hard to envision ECL output drive-times approaching 0.4 clocks;
modern ECL parts (e.g. Sony's CXB1100Q 3-NOR) can receive signals from
off-chip, do the NOR function, and drive off chip again (using 100K
levels) in 390 picoseconds. {EDN 6-23-88, p. 97} So (0.4 * Tclock) =
390ps giving Tclock = 975 picoseconds (1.03 GHz) !!! A more
"believable" clock period might be 4ns (a la Cray-2), in which case
the drive-off time is 0.1 clock.... a smaller fraction of the cycle
than you quote for your CMOS design.
mp> Alternatively, our design costs less to manufacture in high volume
mp> and allow less costly SRAM parts than the MIPS Co. design.
An R3000 may require faster SRAMs, but these are multiple-sourced,
off-the-shelf, commodity devices, and the price of 30 16Kx4 20ns SRAMs
is actually lower than that of eight 88200s (sole-sourced).
mp> Independent of this current discussion on the length of load
mp> pipelines, one might want to ask the folks at MIPS Co. this
mp> question:
mp> Why did you multiplex your memory bus?
mp> Consider factors related to power dissipation. The current M88100
mp> processors are running between .25 and .5 watts @ 20Mhz. If we
mp> were to multiplex the 2 memory ports as did MIPS Co., our worst
mp> case power consumption would be 4 watts. The problem is that the
mp> AC power dissipation is given by:
mp> 2*C*V**2*F*N,
mp> N = number of pins which make transitions.
mp> The addresses from the instruction port are very highly correlated
mp> (about 1.4 bits per cycle change). The addresses from the data
mp> port are only partially correlated (less so with better
mp> compilers). Mixing these two streams results in almost
mp> uncorrelated address streams and therefore a bigger N, resulting
mp> in more power dissipation.
Your observation is interesting. But I hope the 88100's packaging
isn't based on an _average_case_ analysis. Malicious programmers (who
write, for example, a branch-to-branch infinite loop) can't melt the
88100, can they? So the worst case is really 4 watts, right? What
does the datasheet say?
Also the cpu subsystem power consumption isn't much reduced by your
observation, since it only applies to the address outputs on the cpu,
which is a small fraction of the total pins in the system (the I-cache
SRAM pins change every cycle even if the address doesn't change much).
Hmm, I guess the address outputs in an 88000 system are a significant
fraction of the pins, unlike the R3000, because you send the full 32b
virtual address off chip, instead of only 18b as in the R3000. That
wastes as much power as you save on average from the demultiplexing,
doesn't it?
mp> Notice that the pin counts on the two packages are not that much
mp> different (144 for the R3000 vs. 180 for the MC88100) and neither
mp> are the power/grounds pin counts (30 for the R3000 vs. 36 for the
mp> MC88100).
MIPS designed its first RISC chip (the R2000) 4 years ago, and at that
time the difference between 180 pins and 144 pins was significant.
The i80386, designed at roughly the same time, actually has fewer
pins -- 132. The 2nd generation (R3000) could have changed the
interface, but we felt it was desirable to provide an upgrade path for
R2000 designs.
Back when RISC vs. CISC was still an issue, people complained RISC
required more bandwidth. We just provided the necessary bandwidth.
When we need more, we'll add more. If that means modifying the
R3000-style cache interface, so what?
Oughtn't the operand be first tested for "IEEE-ness"? Specifically,
what if the operand of neg is a Signaling NaN? Oughtn't this cause
an invalid operation exception?
Or perhaps this is a case of "should" versus "shall" in the IEEE spec.....
--
-- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
...!decwrl!mips!mark (408) 991-0208
The "Recommended Functions & Predicates" section of the standard has:
"-x is x copied with its sign reversed, not 0-x; the distinction is germane
when x is +-0 or NaN."
From this, the obvious conclusion is that -NaN is that same NaN with its
sign reversed.
Same goes for abs(x); "abs(x)=copysign(x,1.0), even if x is a NaN".
--
Gideon Yuval, yu...@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
(alternative E-mail address: decwrl!nsc!taux01!yu...@uunet.uu.net)
>In article <10...@tekecs.TEK.COM>, and...@frip.gwd.tek.com (Andrew Klossner)
>writes
> > To clear up a question: on the 88k, floating point values are
> > kept in general registers, so "abs" is done by ANDing to 0 the
> > sign bit, "neg" is done by XORing the sign bit, and "mov" is
> > done with conventional register-to-register move instructions.
>
>Oughtn't the operand be first tested for "IEEE-ness"? Specifically,
>what if the operand of neg is a Signaling NaN? Oughtn't this cause
>an invalid operation exception?
Not if you expect to actually use or sell the resulting code. Testing
for special cases slows code down horribly, ESPECIALLY in cases like
this where otherwise it would take only 1 cycle or in that vicinity!
It would be fine for debugging.
My copy says "Signaling NaNs shall be reserved operands that signal
the invalid operation exception for every operation listed in
Section 5. Whether copying a signaling NaN without a change of format
signals the invalid operation exception is the implementor's option".
"neg" doesn't seem to be explicitly called out, but is
implied by subtraction, I think, as one of the operations.
Of course, signalling NaNs must be able to set a flag
when encountered, but need not (should) trap.
The question is, of course, whether an explicit test for NaNness
followed by conventional arithmetic might not be faster than
a special purpose FP instruction that combines the test and the
NaN detection flag setting.
Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
ag...@gould.com - preferred, if you have MX records
ag...@xenurus.gould.com - if you don't
...!ihnp4!uiucuxc!ccvaxa!aglew - paths may still be the only way
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
Seems to me a branch-to-branch i-loop would execute entirely
out of an onboard I-Cache if there is one. But, if not, a
5V 20MHz signal driving, say, 30pf dissipates 30mW. Multiply
by 32 and you get about a 1 watt power increase. During this
branch-to-branch, the data address bus would, of course, be idle.
Nastier would be something like :
7FFFFFFF : STORE-0-TO-H55555555
80000000 : BRANCH-TO-HFFFFFFFF
. . .
FFFFFFFF : STORE-FFFFFFFF-TO-HAAAAAAAA
00000000 : BRANCH-TO-H7FFFFFFF
This sequence changes, on average, 64 output-pins per
cycle : 96 during the cycles with the stores, 32 on the branches.
Assuming the buses do the most conservative thing ( float ) during
their non-driven periods. This produces 3 Watts power dissipation.
Unless, of course, there's an on-board I-Cache, in which case
only 32 pins per cycle, average ( 64 in the store cycles, 0 in
the branch cycles ) for a dissipation of 1 Watt.
] Also the cpu subsystem power consumption isn't much reduced by your
] observation, since it only applies to the address outputs on the cpu,
] which is a small fraction of the total pins in the system (the I-cache
] SRAM pins change every cycle even if the address doesn't change much).
How to handle an extra Watt of power dissipation on a chip is a
more difficult question than how to handle it in a board or
hybrid, or how to add it to your power supply.
] Hmm, I guess the address outputs in an 88000 system are a significant
] fraction of the pins, unlike the R3000, because you send the full 32b
] virtual address off chip, instead of only 18b as in the R3000. That
] wastes as much power as you save on average from the demultiplexing,
] doesn't it?
A 16-bit processor would consume less power than a 32-bit processor,
all else being equal. People build the 32-bit ones anyway. But I
digress. So, what can you address with 18 bits ? 256KWords ? Not
enough. I'm kinda surprised if MIPS's address bus is 18 bits,
unless they have some kind of clever external mapping scheme.
] mp> Notice that the pin counts on the two packages are not that much
] mp> different (144 for the R3000 vs. 180 for the MC88100) and neither
] mp> are the power/grounds pin counts (30 for the R3000 vs. 36 for the
] mp> MC88100).
] MIPS designed its first RISC chip (the R2000) 4 years ago, and at that
] time the difference between 180 pins and 144 pins was significant.
] The i80386, designed at roughly the same time, actually has fewer
] pins -- 132. The 2nd generation (R3000) could have changed the
] interface, but we felt it was desirable to provide an upgrade path for
] R2000 designs.
Big-pin-count packages are a necessary evil. They cost more, take
up more real estate, have lower packaging yields, are more difficult
to reliably attach to boards, have more problems from thermal-
expansion-coefficients, et cetera. PGAs can't be surface mounted,
which means you can't put them on both sides of the board and
multi-layer routing gets painful. Fine-pitch pins have their
own problems.
The RPM40 uses a 132 pin LCC. It only outputs new instruction
addresses on branches, traps, et cetera. It uses the same pins
for instruction addresses as for operand addresses, even though
the data buses for each are separate. The instruction memory
system is therefore necessarily a pipelined look-ahead system,
much like those used on vector processors for vector fetch
( after all, instruction memory is essentially a set of vectors,
one vector per basic block ).
We felt we had to go this route, no choice : at the time ( late '85 )
the 132-pin LCC was the biggest thing we knew of that was fully
qualified at 40MHz, and trying to multiplex the instruction and
operand addresses at 80MHz was considered beyond CMOS capability.
] Back when RISC vs. CISC was still an issue, people complained RISC
] required more bandwidth. We just provided the necessary bandwidth.
] When we need more, we'll add more. If that means modifying the
] R3000-style cache interface, so what?
] --
] UUCP: {ames,decwrl,prls,pyramid}!mips!earl
] USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
"RISC vs. CISC" is, I think, not so much an absolute issue
as an issue of which fits a particular technology better.
When memory speeds are up there with adder speeds, RISC wins.
When adders are much faster than memory, perhaps CISC does.
Different horses for different courses, I think.
Woops, gotta go.
--
Dennis O'Connor oconnor%sun...@steinmetz.UUCP ARPA: OCON...@ge-crd.arpa
"Never confuse USENET with something that matters, like PIZZA."
Ho boy! The last time I saw a compiler XOR the sign bit of floating zero
must have been about 1975. Old bugs never die, they just migrate to new
hosts.
Every so often someone says something along the lines of "well,
implementations of architecture X are faster now, but they require
faster memory/<performance measure> than architecture Y." Isn't the
Cray-1 still the champ by that measure? Memory delay is 11 clock
cycles (93ns @ 8.5ns/cycle), although memory is interleaved, so
non-conflicting accesses can proceed concurrently. For how many of
those cycles do the memory chips have to respond?
According to Smith, Weiss, and Pang, in IEEE Trans on Computers, Aug
86, reducing the Cray-1 memory delay to 5 cycles increases the overall
performance (on scalar code) by just over 10%. Increasing the delay
to 20 cycles (if interleaving and transferring the result back to the
cpu alone take 10 cycles, this lets the memory have at least 85ns)
hurts (scalar) performance by about 20%, or to "just" 6 MFLOPS.
(These numbers also say something about caches on Crays.)
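A quick sanity check on the arithmetic quoted above (the 8.5ns cycle
time and cycle counts are from the post; the rest is just multiplication):

```python
# Back-of-envelope check of the Cray-1 figures quoted above.
cycle_ns = 8.5                  # Cray-1 clock, as quoted
print(11 * cycle_ns)            # nominal memory delay: 93.5 ns ("93ns")
print((20 - 10) * cycle_ns)     # 20-cycle budget minus 10 cycles of
                                # interleave/transfer overhead: 85.0 ns
```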
-andy
----
UUCP: {arpa gateways, decwrl, uunet, rutgers}!polya.stanford.edu!andy
ARPA: an...@polya.stanford.edu
(415) 329-1718/723-3088 home/cubicle
Limiting Switching Transients in High-Speed Digital Processors
Henry G. Dietz Paul H. Dietz
ha...@ee.ecn.purdue.edu p...@speech1.cs.cmu.edu
Abstract
Pin counts on CMOS VLSI processors are currently very
high and will probably continue to grow. This causes a
variety of problems, not least of which is the possibility
of encountering unacceptable switching transients when many
output pins change state simultaneously. These transients
can drastically reduce the noise immunity of internal gates,
severely limiting performance.
To limit the number of output pins simultaneously
changing state, we propose to directly manage output
requests on the basis of predictions of the switching tran-
sients implied in each output request. Each chip would be
designed assuming a well-specified parameter, the Maximum
Allowable Switching Transient (MAST), and an output request
which could exceed the MAST would be serialized so that the
MAST is not exceeded. This direct control of switching
transients can be implemented in either a hardware-intensive
or software-intensive style. The overall effect is that a
processor chip may incorporate many pins, yet need not be
designed to survive the worst case of all output pins
attempting to change state simultaneously.
1. Background
There are a number of techniques that have been used to
limit switching transients. These can be grouped into two
major categories: reduction in the number of output pins
that are active at any one time or reduction of the observed
transient itself.
The number of output pins can be reduced by transmit-
ting data serially or by time multiplexing data buses to
serve multiple functions. Alternatively, output times for
various signals can be slightly skewed so that the outputs
are not set simultaneously. Unfortunately, the quest for
higher operating speeds often precludes the obvious
application of these techniques.
To reduce the switching transient generated per output
pin, some manufacturers have devoted large die areas to
decoupling capacitors; but this is not practical for designs
which are already pushing die-size constraints. Other
manufacturers use off-chip capacitors mounted in the same
package as the die, which can provide much larger decoupling
capacitances. However, the series inductance inherent in
going off-chip is greater, limiting the effectiveness.
Another approach, perhaps more generally applicable, is to
maintain separate power buses for output buffers and inter-
nal state logic [Car88]. Also, by careful design of the
output buffer [GaT88], one can make buffer power consumption
more consistent, hence reducing the worst-case values and
achieving significant improvement.
It is reasonable to assume that next-generation devices
will incorporate some combination of these methods. Yet all
of these techniques require that the chip be designed for
the worst case; additional performance gains can be made by
restricting simultaneous output operations only when the
MAST would otherwise be exceeded.
2. Approach
There are two difficulties in directly controlling out-
put state transitions based on potential MAST violation.
The first problem is how to detect or predict when a MAST
violation may occur; this may be done by placing the main
burden either on hardware (detection) or on software
(prediction). The second problem is, given that a particular
output request would exceed the MAST if done simultaneously,
how the hardware can arrange to perform the output pin state
transitions without exceeding the MAST. We will discuss
this second problem first.
2.1. Output Serialization
Given that a particular logically-simultaneous output
operation would exceed the MAST, hardware must intervene to
ensure that the limit is observed. This can be done by
permitting only a fraction of the output pins to change
simultaneously in one cycle and performing the rest of the output
on successive cycles which are inserted just for that pur-
pose. We say that such an output operation has been serial-
ized.
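As a rough sketch of what serialization does (my own illustration, not
from the paper — the function name, word width, and the reading of the
MAST as a pin count are assumptions):

```python
# Sketch: split a logically-simultaneous output transition into
# successive cycles, each flipping at most 'mast' output pins.
# 'prev' is the current state of the output pins, 'new' the requested one.
def serialize_output(prev, new, mast, width=32):
    # bit positions where the output must change state
    flips = [b for b in range(width) if ((prev ^ new) >> b) & 1]
    steps, cur = [], prev
    for i in range(0, len(flips), mast):
        for b in flips[i:i + mast]:    # flip at most 'mast' pins...
            cur ^= 1 << b
        steps.append(cur)              # ...per inserted cycle
    return steps or [prev]             # no change needed: one ordinary cycle

# 8 bits must flip under a MAST of 4, so the output takes two cycles.
print([hex(s) for s in serialize_output(0x0000, 0x00FF, 4)])  # ['0xf', '0xff']
```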
Although output serialization requires only relatively
simple hardware, some care must be taken. For example,
strobe/ready bits must change state only after the
corresponding data bits are in the correct state.
When the MAST is not exceeded, the requested output is
performed in a single cycle. (In this case, the additional
circuitry has no effect.) This is an efficient technique
because, for example, localities in instruction address
space often correspond to minor bit changes in the address
outputs.[1]
In some cases, an optimizing compiler/linker/loader can
significantly enhance this kind of locality - these code
transformations are discussed in detail in the full version
of the paper. A simple example of the type of optimization
possible is to generate code so that jump and call targets
(labels and function/procedure entries) are placed at
addresses which differ from the invoking-instruction's
address in only a few bits (more precisely, causing changes
in fewer than MAST bits). Another example is that a loop
whose code would normally span a high-order memory address
bit change could be moved to a portion of the address map
where fewer address bits change. Even data-related outputs
sometimes can be transformed to minimize simultaneous bit
changes, either by careful layout of data structures or by
recognition of properties of operations being performed.
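For concreteness, a toy version of the jump-target placement decision
might look like this (the names and the candidate-slot interface are
mine, not the paper's):

```python
# Toy model of the jump-target optimization: among the candidate
# addresses a target could be placed at, pick the one whose address
# differs from the invoking instruction's address in the fewest bits.
def bits_changed(src, dst):
    return bin(src ^ dst).count("1")

def place_target(src, candidates):
    return min(candidates, key=lambda a: bits_changed(src, a))

# 0x1004 differs from 0x1000 in one bit; the other slots differ in more.
print(hex(place_target(0x1000, [0x0FFC, 0x1004, 0x2000])))  # 0x1004
```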
2.2. MAST Violation Detection/Prediction
As discussed in section 2.1, compiler technology (e.g.,
flow and other static analyses [AhS86] [DiC88]) can be used
to predict, and hence to alleviate by code motion, etc.,
possible violations of the MAST. This same compiler tech-
nology can be used to predict when the MAST will be violated
and to directly encode that information in the instructions
it generates; hardware would simply serialize any operation
which the compiler tagged as suspect.[2]
Of course, the compiler must conservatively assume that
any operation which it cannot prove stays below the MAST
actually exceeds it. This is unavoidable: some output
values are unknown until runtime, and for those the
compiler must assume that every output bit changes.
The more hardware-intensive alternative is to simply
use a circuit to detect, at runtime, when a proposed output
would actually exceed the MAST, and to invoke serialization
only then. In the full-length paper, several techniques are
presented for constructing such a circuit.
Compared to the software prediction, hardware detection
ensures that all outputs that can be done in a single cycle
are so accomplished, whereas the compiler tagging may cause
some to be unnecessarily serialized. The trade-off is that
the hardware is fairly complex and that the compiler cannot
know precisely how long each instruction will take to exe-
cute (which reduces the effectiveness of many conventional
compiler optimizations).
3. Conclusion
Using either the software-intensive or the hardware-
intensive technique proposed, the concept of directly manag-
ing output pin state changes can provide substantial perfor-
mance increases with only minor impact on the processor
design. Typically, a circuit using these techniques will be
running at or near its MAST, thus making the best possible
use of the available bandwidth.
_________________________
[1] Although it might not be practical, use of Gray
coded rather than 2's-complement integers to represent
addresses would ensure that sequential addresses differ
by only a single bit.
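To illustrate the footnote: the standard binary-to-Gray conversion
is a one-line XOR, and consecutive Gray codewords differ in exactly
one bit (my illustration, not part of the paper):

```python
# Standard binary-to-Gray-code conversion.
def to_gray(n):
    return n ^ (n >> 1)

# Consecutive addresses map to codewords one bit apart:
print(to_gray(3), to_gray(4))   # 2 6  (binary 010 vs 110: one bit apart)
```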
[2] For those who would rather not place such faith
in the compiler, a simple circuit can detect a glitch
on the power bus, thereby detecting an instruction
which the compiler failed to tag but which exceeds the
MAST. The circuit would simply initiate a cold-start.
References
[AhS86] Aho, A. V., Sethi, R., and Ullman, J. D., Com-
pilers: Principles, Techniques, and Tools, Addison
Wesley, Reading, Massachusetts, 1986.
[Car88] Carley, L. R., Personal communication, Jan. 31,
1988.
[DiC88] Dietz, H. G. and Chi, C-H., "A Compiler-Writer's
View of GaAs Computer System Design," IEEE Proc.
of HICSS-21, pp. 256-265, Jan. 1988.
[GaT88] Gabara, T. and Thompson, D., "Ground Bounce Con-
trol in CMOS Integrated Circuits," to appear in
IEEE Proc. of International Solid-State Circuits
Conference, 1988.
With IEEE floating point, negating 0.0 to produce -0.0 is correct.
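This is easy to check on any IEEE machine; a sketch in Python (whose
floats are IEEE doubles):

```python
import math

x = 0.0
y = -x                        # negation flips only the sign bit: -0.0
print(y)                      # -0.0
print(y == 0.0)               # True: IEEE says -0.0 compares equal to +0.0
print(math.copysign(1.0, y))  # -1.0: but the sign bit really is set
```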
--
Bruce Robertson
MicroSage Computer Systems
br...@stride.com
"When you build a R3000-based MP, you don't have to limit the
amount of cache per processor (unlike, e.g., the 88000, which
allows only 16KB per processor in a 4-processor system)."
Presumably Earl meant "data cache" where he wrote "cache."
Just to keep the record straight: in a 4-processor 88k system on a
single M-bus (the bus implemented by the CMMU pinouts), each processor
has 16KB data cache and 16KB instruction cache, for a total of 32KB
cache. If you want more than that, you need more than one M-bus. Some
system designers are doing this.