Consider a simple pipeline: fetch, decode, execute, store.
Each step in the pipeline perforce uses the same number of
clocks, but it may be that some steps don't require as
much time to perform. So we can, for example, double the
clock frequency and introduce a pipeline with more steps:
fetch 1, fetch 2, decode, execute, store 1, store 2. (This
does not necessarily reflect a real example; I don't know
where the bottlenecks are in the real world.) The total
cycle time (from fetch to fetch) is now less than it was
before. "Superpipelining" is simply taking this idea to
further extremes - speed up the clock, introduce more pipe-
line stages (each of which takes less time than in a nonsuper-
pipelined architecture), and you (hopefully) get an overall
increase in throughput.
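A toy cycle count makes the tradeoff concrete (the stage counts and clock periods below are my own illustrative assumptions, not from the post):

```python
# Toy model: time for N instructions to flow through a k-stage pipeline.
# The first instruction takes k cycles to reach the end; after that,
# one instruction completes per cycle (assuming no stalls at all).

def pipeline_time_ns(n_instr, n_stages, cycle_ns):
    cycles = n_stages + (n_instr - 1)
    return cycles * cycle_ns

# 4 stages at a 10 ns clock vs. 6 shorter stages at a 5 ns clock:
base = pipeline_time_ns(1000, 4, 10.0)  # 10030.0 ns
deep = pipeline_time_ns(1000, 6, 5.0)   #  5025.0 ns
```

The deeper pipe wins here only because this ideal model has no stalls; dependences and branches eat into the advantage in practice.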
Now consider a simple CPU: one fetch unit, one ALU, one set
of physical registers. If in general each instruction requires
the use of each of these resources, we can only perform "one"
instruction at a time (actually, we perform one piece of each
of #pipeline_stages instructions at a time). In a superscalar
machine, we add additional ALUs, registers (with renaming, etc.)
and other units so that we can have multiple simultaneous
pipelines.
Superscalar is when a CPU can *complete* more than 1 instruction per
cycle.
Normally, this would occur by having 2 or more pipelines.
Super-pipelined is when the pipeline(s) in a CPU have (typically) more
than the normal number of stages. (Normal usually being considered
5 stages). So a super-pipelined CPU might have a 7-stage pipeline or
a 10-stage pipeline.
These definitions are not meant to be exactly academically correct, but
just to give you an idea of the difference between them.
Mike Schmit
-------------------------------------------------------------------
msc...@ix.netcom.com author:
408-244-6826 Pentium Processor Programming Tools
800-765-8086 ISBN: 0-12-627230-1
-------------------------------------------------------------------
No, that is not enough : a processor with a non-blocking cache
can complete a load and another instruction in the same cycle,
but that does not make it "superscalar".
"Superscalar" in my mind means that the processor can also
issue more than one instruction per cycle, as well as being
able to complete more than one, all from a single thread.
There are various kinds of superscalar, categorized by their
issue templates : first, of course, there is the number of
instructions that can issue per cycle, and the number of
instructions that can complete per cycle (not necessarily
the same values, by the way). Then there is the issue
of whether the template is symmetric or asymmetric : without
considering branches, which in the commoner cases you can't
issue and complete two or more of in parallel, symmetric
means any instruction can issue from any position in the
instruction window, while asymmetric means that not all
orderings of (non-dependent) instructions can issue in
parallel.
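A minimal sketch of the asymmetric case, under the made-up assumption that slot 0 takes only integer ops and slot 1 only FP ops (the classes and slot rules are mine, for illustration only):

```python
# Asymmetric two-way issue: each slot accepts only one instruction class,
# so whether a pair dual-issues depends on the ORDER of the instructions.

SLOT_CLASSES = ("INT", "FP")  # slot 0 is integer-only, slot 1 is FP-only

def can_dual_issue(first_op, second_op):
    # The pair issues together only if it matches the template exactly.
    return (first_op, second_op) == SLOT_CLASSES

print(can_dual_issue("INT", "FP"))  # True:  matches the template
print(can_dual_issue("FP", "INT"))  # False: same ops, wrong order
```

A symmetric template would instead accept any ordering of independent ops.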
Early on, it was very common to have separate floating-point
and integer pipelines, and allow one instruction to issue
and complete per cycle in each pipeline. For machines with
separate integer and FP register sets, this is a very
natural arrangement. This is what I categorize as two-way
asymmetric superscalar. Now, beyond that, we can ask whether
an INT-FP pairing _and_ an FP-INT pairing could issue : if so,
the machine is order-insensitive, if not ...
This is just the system I use for categorizing micro-architectures.
If there is a more formal system, I'm not remembering it.
Whether an otherwise-non-superscalar machine that did branch
folding should be considered "superscalar" is a good question.
I guess it probably should.
] Super-pipelined is when the pipeline(s) in a CPU have (typically) more
] than the normal number of stages. (Normal usually being considered
] 5 stages). So a super-pipelined CPU might have a 7-stage pipeline or
] a 10-stage pipeline.
I personally have never been happy with this definition of
super-pipelined. It's too subjective, and not really useful
from an architecture point of view. Now, if you were to
pipeline so deeply that the result of a common operation
like an integer add was not available for use in the clock
cycle following its execution, THAT I would call super-pipelined.
If you were to pipeline your in-pipeline instruction or data cache,
that would be a good candidate to be called "super-pipelined".
These are better criteria because they speak to the tradeoff
that was made : an extra cycle of result latency was accepted
in exchange for a higher clock rate, the idea being that the
penalty of the occasional stall waiting for a result was more
than offset by the increased clock rate.
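The cost side of that tradeoff can be sketched with toy numbers (the latencies are assumed, not measured): a chain of dependent adds pays the extra cycle of result latency on every link.

```python
# Toy: cycles to finish a chain of N dependent integer adds when each
# add's result becomes usable 'result_latency' cycles after it issues.

def dependent_chain_cycles(n_adds, result_latency):
    # Each add must wait for its predecessor's result before issuing.
    return n_adds * result_latency

classic = dependent_chain_cycles(100, 1)    # 100: result usable next cycle
superpipe = dependent_chain_cycles(100, 2)  # 200: one extra latency cycle
```

For fully dependent code the deeper pipe loses outright; the bet is that independent work usually fills the gap, so the faster clock wins on average.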
--
Dennis O'Connor doco...@sedona.intel.com
i960(R) Architecture and Core Design Not an Intel spokesman.
TIP#518 Fear is the enemy.
"Superpipelining is a new and special term meaning pipelining. The
prefix is attached to increase the probability of funding for research
proposals. There is no theoretical basis distinguishing superpipelining
from pipelining. Etymology of the term is probably similar to the
derivation of the now-common terms methodology and functionality as
pompous substitutes for method and function. The novelty of the term
superpipelining lies in its reliance on a prefix rather than a suffix for
the pompous extension of the root word."
I'll stand by my 1991 definition except that the term has become a
marketing term as well. All other definitions of pipelining I have seen
essentially "undefine" pipelining, placing it in limbo between the new
definition of pipelining and unpipelined design.
While both have the ability of issuing more than one instruction per cycle
(a definition for superscalar), the primary objective of a superscalar
arch. is to reduce the cycles per instruction. Superpipelined reduces
the cycle time itself.
In an ideal case, the super pipelined arch. will not have any structural
dependencies. In a superscalar machine, the only way to remove a structural
dependency is to duplicate the functional resource ($$$$).
Superpipelined, by nature of the arch. having pipelined the 'major' pipe
stages, should not have a prob. with it. As an example, if you have 2
back-to-back floating-point divide operations, and it takes 20 cycles/
operation, and there is only one functional unit, a superscalar arch. will
take 40 cycles to execute the 2 instructions, while in the superpipelined,
this functional unit itself will be pipelined, and would take 30 cycles
(or so).
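Checking the arithmetic of that divide example, with an assumed initiation interval of 10 cycles for the pipelined divider (the interval is a guess chosen to match the "30 cycles (or so)" figure):

```python
# One FP divide unit, 20-cycle latency. Unpipelined: the second divide
# waits for the first to finish. Internally pipelined: a new divide can
# start every 'initiation_interval' cycles.

def total_cycles(latency, initiation_interval, n_ops):
    return latency + initiation_interval * (n_ops - 1)

print(total_cycles(20, 20, 2))  # 40: unpipelined unit (structural hazard)
print(total_cycles(20, 10, 2))  # 30: internally pipelined unit
```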
however, clock skew and pipeline latches are a major overhead of the
superpipelined arch.
"SuperScalar Microprocessor Design", Mike Johnson, chapter 1 (or it 2?)
gives a good description of this.
thanks
=b
--
-------------------------------------------------------------------------------
Bharat P. Baliga-Savel(DM) HaL Computer Systems Inc.
"CBS sucks"
-Nebraska Fans, Fiesta Bowl, 1/2/96
> Here's my definition of Superpipelining (from a Feb 1991 Microprocessor
> Report article "1991: The Year of the RISC"). From a footnote in the
> article.
.....
Amusing, but I have a real definition:
If the issue rate is higher than the minimum execution time for an
instruction, the machine is superpipelined. This can be observed by
seeing if data-dependent minimum time instructions stall.
Needless to say, the R4000 does not fit this definition (memrefs had 3
cycle latency, as opposed to the majority of RISCs with 2)
The DEC PRISM, by repute, (as opposed to fact which I don't know), worked
this way.
The Stellar may have, or maybe it was only ops followed by conditional branches.
The WM machine kind of works this way on paper -
there are 2 ops/inst, and the first op of the second instruction couldn't
depend on the second op of the first inst.
Lots of other machines may be called SuperPipelined, but mostly it's probably
marketing, as Nick suggests; but I would suggest that it may very well become
pejorative instead of complimentary. Most superpipelined machines simply
have longer latencies than a standard, canonical RISC (whatever that is), which
has a latency of 1 for all integer ops but branch and load.
--
*******************************************************
* Allen J. Baum *
* Apple Computer Inc. MS 305-3B *
* 1 Infinite Loop *
* Cupertino, CA 95014 *
*******************************************************
The term superpipelined was coined by Norm Jouppi and David Wall, in
their 1989 ASPLOS paper "Available Instruction-Level Parallelism for
Superscalar and Superpipelined Machines". In their original defn, a
superscalar pipeline was one in which the cycle time was less than the
latency of ANY functional unit. Eg, simple integer add ops retiring at
a rate of one every two clocks, when their regs are interdependent.
This matches the defn Baum gives.
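That criterion is easy to state operationally; a sketch, with toy latencies of my own rather than anything from the Jouppi/Wall paper:

```python
# Jouppi/Wall-style test: the machine counts as superpipelined if
# dependent instances of even its FASTEST op cannot retire every cycle,
# i.e. the minimum op latency exceeds one (short) clock.

def is_superpipelined(min_op_latency_cycles):
    return min_op_latency_cycles > 1

print(is_superpipelined(1))  # False: dependent adds go back-to-back
print(is_superpipelined(2))  # True: dependent adds retire every 2 clocks
```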
In Wall's 1991 ASPLOS paper "Limits of Instruction-Level Parallelism",
he refined the definition as a pipe where the cycle time is less than
the "typical instruction latency", which in practice for practical
modern machines, means Load ops but perhaps not integer add ops. The
intent of this defn was to argue his thesis that multi-issue and deeper
pipes were complementary ways of exploiting the available data-flow
concurrency within sequential instr streams.
AFAIK, Mips was the first company to use the phrase in specs and
marketing literature, for the R4000 which had a +2 cycle latency
following loads, but single-cycle execution of nearly everything else.
This allowed a faster chip and easier compiler codegen model than their
original plan for a classic dual-issue machine with single-cycle
dcache, with both designs soaking up and exploiting the same amount of
instr-level parallelism.
I believe the original SPARCs also had +2 cycles of latency following
integer loads. But these chips were not superpipelined in this sense,
because the whole pipe simply froze on these cycles rather than
allowing independent ops to proceed around the pending load.
[snip]
>
> "Superscalar" in my mind means that the processor can also
> issue more than one instruction per cycle, as well as being
> able to complete more than one, all from a single thread.
>
> There are various kinds of superscalar, categorized by their
> issue templates : first, of course, there is the number of
> instructions that can issue per cycle, and the number of
> instructions that can complete per cycle (not necessarily
> the same values, by the way). Then there is the issue
> of whether the template is symmetric or asymmetric : without
> considering branches, which in the commoner cases you can't
> issue and complete two or more of in parallel, symmetric
> means any instruction can issue from any position in the
> instruction window, while asymmetric means that not all
> orderings of (non-dependent) instructions can issue in
> parallel.
>
Dennis
Thanks for an informative posting. It raises a few questions,
however.
1. What is a cycle? Is it clock cycle, clock phase or *pipeline-
step* (which could be several clock phases) or ...
2. While being capable of issuing more than one instruction per
cycle it seems reasonable that more than one instruction must
be available to issue. How is that accomplished in superscalar
architectures? It seems to me that more than one instruction
must be fetched and decoded per cycle. Is it so?
3. What about *sustained* instruction rate? Is that also more than
one per cycle?
Robert Tjarnstrom
I generally think of "cycle" as the clock rate of the pipestage
latches.
] 2. While being capable of issuing more than one instruction per
] cycle it seems reasonable that more than one instruction must
] be available to issue. How is that accomplished in superscalar
] architectures? It seems to me that more than one instruction
] must be fetched and decoded per cycle. Is it so?
Yep. Wide instruction caches and duplication of decoder logic.
For non-symmetric templates, the "duplication" can be partial.
] 3. What about *sustained* instruction rate? Is that also more than
] one per cycle?
It can be. Depends on memory architecture, algorithm, degree of
optimization, etc.