Explanation of Superpipelining, please?


|S| Norbert Juffa

Mar 5, 1993, 1:06:00 PM
I am currently doing some assembly language programming on SPARC CPUs. I am
implementing some low-level integer arithmetic routines that support long
integers (128 to 2048 bit). These routines are the machine specific part of
an otherwise portable toolbox written in C. Spending some time writing the
machine specific assembly language code is worthwhile since a speedup of about
a factor of 5 can be achieved by this technique compared to an implementation
that is written 100% in C.

I have programmed different routines for three types of SPARC CPUs (SPARC II,
microSPARC, SuperSPARC). I have gotten the manuals for the SuperSPARC (Viking)
CPU from Texas Instruments and have tuned my code to its superscalar
architecture. The SuperSPARC chip can issue up to three instructions per clock
cycle, which are handled by several functional units in parallel. There are
3 integer ALUs, 1 shifter, and a floating-point unit with a separate pipeline.
Due to the nature of my code and the instruction grouping limits imposed by the
SuperSPARC, my code runs at about 1.3 to 1.4 instructions per clock cycle (I
spend lots of time doing multiplication, and the multiplication on SuperSPARC
cannot execute in parallel with other instructions). Other integer code seems
to run at up to 1.8 instructions per clock cycle. So I am well aware of the
advantages of a superscalar CPU that can launch multiple instructions due to
the availability of several independent functional units.

Now I want to project how fast my routines could run on other CPUs.
The R4000 architecture from MIPS/SGI is one of the architectures that look
promising, since it offers 64 bit operation and a reasonably fast HW multiply
(20 clock cycles for 64x64->128 bit), which *can* execute in parallel with
other instructions. Looking at the architectural description of this CPU, I
find that this is a "superpipelined" processor. The pipeline of the R4000 is
8 deep, as opposed to the pipeline of the R3000, which is 5 deep. In the
literature I have (Kane,Heinrich: MIPS RISC Architecture. MIPS: Introduction
to the R4000) no multiple functional units are mentioned for the R4000. However
the MIPS brochure tells me that the CPU launches two instructions on every
clock cycle and "exploits instruction-level parallelism". Does this relate to
the fact that the CPU is clocked at 50 MHz externally and runs at 100 MHz
internally? With respect to the external clock, I can see that two instructions
are launched in every clock cycle, even without multiple functional units. But
what does "superpipelined" really mean? For example, the Intel i486DX2-66 has
a 5 stage pipeline and runs internally at twice the externally applied
frequency. So one could claim that it, too, launches up to two instructions in
every clock cycle, right? I have never heard, though, that the i486DX2-66 was
claimed to be superpipelined. Unfortunately, Patterson&Hennessy doesn't tell
me much about superpipelining either. So what is the difference between a
pipelined, "clock-doubled" CPU and a superpipelined one? Obviously,
superpipelined CPUs seem to have a "deeper" pipeline than used on most
processors (5+/-1). Maybe someone from MIPS/SGI can jump in and explain
the concept of superpipelining (John Mashey, are you out there :-) I am also
puzzled that I haven't heard about the superpipelining concept being used in
other CPUs. Did I miss some important discussion in comp.arch :-(?

Any pointer appreciated.

Norbert
-------------------------------------------------------------------------------
Norbert Juffa email: S_J...@IRAVCL.IRA.UKA.DE Live and let live!

Stefan Monnier

Mar 7, 1993, 10:10:17 PM
In article <1n84q8...@iraul1.ira.uka.de> S_J...@IRAV1.ira.uka.de (|S| Norbert Juffa) writes:
>the MIPS brochure tells me that the CPU launches two instructions on every
>clock cycle and "exploits instruction-level parallelism". Does this relate to
>the fact that the CPU is clocked at 50 MHz externally and runs at 100 MHz
>internally? With respect to the external clock, I can see that two instructions
>are launched in every clock cycle, even without multiple functional units. But
>what does "superpipelined" really mean? For example, the Intel i486DX2-66 has
>a 5 stage pipeline and runs internally at twice the externally applied
>frequency. So one could claim that it, too, launches up to two instruction in
>every clock cycle, right? I have never heard, though, that the i486DX2-66 was
>claimed to be superpipelined. Unfortunately, Patterson&Hennessy doesn't tell
>me much about superpipelining either. So what is the difference between a
>pipelined, "clock-doubled" CPU and a superpipelined one? Obviously,
>superpipelined CPUs seem to have a "deeper" pipeline than used on most
>processors (5+/-1). Maybe someone from MIPS/SGI can jump in and explain
>the concept of superpipelining (John Mashey, are you out there :-) I am also
>puzzled that I haven't heard about the superpipelining concept being used in
>other CPUs. Did I miss some important discussion in comp.arch :-(?
>
>Any pointer appreciated.
>
>Norbert

It seems 'superpipelining' has no precise definition. It just means
that the pipeline depth has been 'artificially' increased:
The way instructions are executed can be naturally divided into a few
'independent' parts ((pre)fetch, decode, data fetch, execute, write
back). As you noticed, this makes ~5 stages.
Superpipelining is reached when you split one or more of these stages
(the longer ones), so as to increase (ideally double) the clock speed.

I don't think the R4000 is the only superpipelined processor. As far
as I know, the Alpha could also be considered 'superpipelined',
although it is rarely referred to as such because it is also superscalar,
and that sounds better!

Superpipelining and superscalar are two orthogonal issues. And to a
certain extent they are interchangeable: doubling the clock rate (and the
instruction latency) is comparable to doubling the number of
instructions executed each cycle.
Superpipelining requires less control logic (but very careful latch
design), but seems to give less speed improvement (if the peak
performance is improved by a factor of 2, the actual improvement could
be around 1.7 for a superscalar processor and around 1.5 for a
superpipelined one).

Stefan

Raul Izahi Lopez Hernandez

Mar 10, 1993, 3:03:54 AM
In article <1n84q8...@iraul1.ira.uka.de> S_J...@IRAV1.ira.uka.de (|S| Norbert Juffa) writes:
>Now I want to make a projection how fast my routines could run on other CPUs.
>The R4000 architecture from MIPS/SGI is one of the architectures that look
>promising, since it offers 64 bit operation and a reasonably fast HW multiply
>(20 clock cycles for 64x64->128 bit), which *can* execute in parallel with
>other instructions. Looking at the architectural description of this CPU, I
>find that this is a "superpipelined" processor. The pipeline of the R4000 is
>8 deep, as opposed to the pipeline of the R3000, which is 5 deep. In the
>literature I have (Kane,Heinrich: MIPS RISC Architecture. MIPS: Introduction
>to the R4000) no multiple functional units are mentioned for the R4000. However
>the MIPS brochure tells me that the CPU launches two instructions on every
>clock cycle and "exploits instruction-level parallelism". Does this relate to
>the fact that the CPU is clocked at 50 MHz externally and runs at 100 MHz
>internally? With respect to the external clock, I can see that two instructions
No.
Clock doubling is different from superpipelining. Superpipelining is
a form of parallelism. Erik Williams (Stanford EE482) has discussed that
superpipelining is parallelism through time, while superscalar execution is
parallelism through space. Both can achieve n instructions per clock cycle.
I know that some people would argue about them being the same since you get
more work done per unit of time but the mechanism is very different. I
am not 100% sure but if the R4400 were to run at 50MHz externally and it did
clock doubling inside, then superpipelined execution would allow it to
run at an equivalent rate of 200MHz. But I believe that the 50MHz outside
and 100MHz 'inside' takes into account superpipelining and not clock
doubling. The Intel486DX2/66 is not superpipelined.
The superpipeline of the R4000 came about because there are small events in
an instruction which last only half a clock cycle. Here the clock cycle is
defined as the duration of the longest indivisible operation, for example
a stage in the adder or an addition itself. So if
the cache tag lookup takes half a cycle, then the implementation is a
candidate for superpipelining. Another half-clock event is the register fetch
for an operation.

I am sure the MIPS folks can illustrate in more detail the pipeline of
the R4000 which I don't have at hand now. All the detailed technical stuff
is at home :) .
Borrowing Erik's diagram from the above-mentioned class:

this is superpipelining:     this is superscalar:
AAAABBBBCCCCDDDD             AAAABBBBCCCCDDDD
 AAAABBBBCCCCDDDD            AAAABBBBCCCCDDDD
  AAAABBBBCCCCDDDD           AAAABBBBCCCCDDDD
   AAAABBBBCCCCDDDD          AAAABBBBCCCCDDDD

    AAAABBBBCCCCDDDD             AAAABBBBCCCCDDDD
     AAAABBBBCCCCDDDD            AAAABBBBCCCCDDDD
      AAAABBBBCCCCDDDD           AAAABBBBCCCCDDDD
       AAAABBBBCCCCDDDD          AAAABBBBCCCCDDDD
|---|---|---|---|---|---|    |---|---|---|---|---|
      clock cycles                clock cycles

Maybe the right 'format' for each line should be AAAABBCCDDDEFFFF but I
hope that you get the idea.
So you can see that after a few pipeline fill cycles you get a lot of
results per clock cycle. Again this is assuming that there is no clock doubling
involved and the example above shows 4-way 'parallelism'.
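
The timing shown in the diagram can be reproduced with a few lines of Python. This is only a sketch under idealized assumptions: no dependencies, no stalls, and four stages (A to D) of one base clock cycle each. The function name and its parameters are made up for illustration. In the superpipelined case a new instruction starts every quarter cycle; in the superscalar case four instructions start together at each cycle boundary.

```python
from fractions import Fraction

def finish_times(n_instructions, degree, mode, stages=4):
    """Finish time of each instruction, measured in base clock cycles.

    mode "superpipelined": one instruction starts every 1/degree cycle.
    mode "superscalar":    `degree` instructions start together each cycle.
    Every instruction occupies `stages` base cycles (A, B, C, D).
    Idealized: no dependencies and no stalls are modelled.
    """
    times = []
    for i in range(n_instructions):
        if mode == "superpipelined":
            start = Fraction(i, degree)
        elif mode == "superscalar":
            start = Fraction(i // degree)
        else:
            raise ValueError(f"unknown mode: {mode}")
        times.append(start + stages)
    return times

sp = finish_times(8, 4, "superpipelined")
ss = finish_times(8, 4, "superscalar")
print("superpipelined, last of 8 done at cycle", float(sp[-1]))
print("superscalar,    last of 8 done at cycle", float(ss[-1]))
```

For the eight instructions drawn above, the superpipelined machine finishes three quarters of a cycle later (5.75 versus 5 cycles), which is why the left time axis is one cycle longer. In steady state both complete four instructions per base cycle.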

>are launched in every clock cycle, even without multiple functional units. But
>what does "superpipelined" really mean? For example, the Intel i486DX2-66 has
>a 5 stage pipeline and runs internally at twice the externally applied
>frequency. So one could claim that it, too, launches up to two instruction in
>every clock cycle, right? I have never heard, though, that the i486DX2-66 was
>claimed to be superpipelined. Unfortunately, Patterson&Hennessy doesn't tell
>me much about superpipelining either. So what is the difference between a
>pipelined, "clock-doubled" CPU and a superpipelined one? Obviously,

Hennessy covers that in a later course... H&P is mostly for EE282.
I am sure that a textbook that deals with that in detail is on the cooking
pan.

>superpipelined CPUs seem to have a "deeper" pipeline than used on most
>processors (5+/-1). Maybe someone from MIPS/SGI can jump in and explain
>the concept of superpipelining (John Mashey, are you out there :-) I am also
>puzzled that I haven't heard about the superpipelining concept being used in
>other CPUs. Did I miss some important discussion in comp.arch :-(?

I am sure that somebody is going to bring up some '60s CYBER or the like
as a superpipelined machine, but today there are not many superpipelined
architectures. However even MIPS says that it might go superscalar if it
finds a need for it. Maybe the TFP is superscalar. I don't have any details
on it.

>
>Any pointer appreciated.
>
>Norbert
>-------------------------------------------------------------------------------
>Norbert Juffa email: S_J...@IRAVCL.IRA.UKA.DE Live and let live!
>


--
-----> All opinions expressed here are my own, not IBM's <-----
Raul Izahi Lopez IBM Bergen Environmental Sciences and Solutions Centre
iz...@bsc.no Thormoehlensgate 55, 5008 Bergen, NORWAY (47-5)54-4653

--
Raul Izahi Lopez Hernandez | GraduateD Student !!! from California U.S.A's
iz...@leland.stanford.edu | Stanford University, Electrical Engineering Dept.
iz...@nova.stanford.edu | & Universidad ITESO, Guadalajara, MEXICO
iz...@bsc.no | Bergen Scientific Centre IBM, Bergen , Norway

Anton Ertl

Mar 10, 1993, 1:36:49 PM
A good article on superpipelining is

@InProceedings{jouppi&wall89,
  author =   "Norman P. Jouppi and David W. Wall",
  title =    "Available Instruction-Level Parallelism for
              Superscalar and Superpipelined Machines",
  crossref = "asplos89",
  pages =    "272--282"
}

If we use the definitions of that paper, any processor where you
can fill delay slots with independent instructions is superpipelined,
i.e. all RISC processors and some newer CISCs.

To make the term more useful, we should use the "degree of
superpipelining" defined in the paper (the same applies to
superscalarity). The degree of superpipelining for the
Multititan (a typical RISC) is 1.7; the CRAY-1 has a degree of
superpipelining of 4.4.

They compute the degree of superpipelining in the following way:

sum over all instructions(frequency*latency)
(The frequency is a percentage here).
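
As a rough illustration of that sum, here is a small Python sketch. The instruction mix is invented for the example and is not taken from the paper; frequencies are written as fractions that add up to 1 rather than as percentages, and latencies are in cycles.

```python
def degree_of_superpipelining(mix):
    """Weighted average latency: sum over classes of frequency * latency.

    mix maps an instruction class to (frequency, latency_in_cycles),
    where the frequencies are fractions of the dynamic instruction count.
    """
    total = sum(freq for freq, _ in mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return sum(freq * latency for freq, latency in mix.values())

# Hypothetical instruction mix, for illustration only.
example_mix = {
    "alu":    (0.6, 1),
    "load":   (0.2, 2),
    "branch": (0.1, 2),
    "fp":     (0.1, 3),
}
degree = degree_of_superpipelining(example_mix)
print(f"degree of superpipelining: {degree:.2f}")
```

With this made-up mix the result is about 1.5: the more frequent the multi-cycle operations, the higher the degree.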

This assumes that everything is fully pipelined, i.e. that one
instruction can be issued every cycle, independent of instruction type
(if there is no data dependency). It's an interesting task to
generalize this. How about:

degree of superpipelining = (number of cycles needed by a program if
every instruction is treated as dependent on its predecessor)/
(number of cycles needed if every instruction is treated as
independent from its predecessor)
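
Short of a trace-driven simulator, the ratio can be tried out on a toy model. The sketch below assumes an in-order machine with fully pipelined units, a fixed issue width, and no cache misses or branch effects; the function names and parameters are hypothetical.

```python
def cycles_needed(latencies, issue_width=1, dependent=False):
    """Cycle count for a toy in-order machine.

    latencies:   latency in cycles of each instruction, in program order
    issue_width: maximum number of instructions issued per cycle
    dependent:   if True, every instruction waits for its predecessor
    """
    issue_cycle = 0   # cycle currently being filled with instructions
    issued_here = 0   # instructions already issued in issue_cycle
    prev_ready = 0    # cycle in which the previous result becomes usable
    finish = 0
    for lat in latencies:
        earliest = prev_ready if dependent else 0
        if earliest > issue_cycle:
            issue_cycle, issued_here = earliest, 0
        if issued_here == issue_width:
            issue_cycle, issued_here = issue_cycle + 1, 0
        issued_here += 1
        prev_ready = issue_cycle + lat
        finish = max(finish, prev_ready)
    return finish

def ratio(latencies, issue_width=1):
    """All-dependent cycle count divided by all-independent cycle count."""
    return (cycles_needed(latencies, issue_width, dependent=True)
            / cycles_needed(latencies, issue_width, dependent=False))

program = [2] * 1000                       # every instruction takes 2 cycles
print(ratio(program, issue_width=1))       # close to 2
print(ratio([1] * 1000, issue_width=2))    # exactly 2 in this model
```

In this model a single-issue machine with 2-cycle latencies and a dual-issue machine with 1-cycle latencies both come out at roughly 2, so the ratio measures the combined effect and does not tell the two designs apart.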

Now get your trace-driven simulators going and post some results!

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen

Erich Nahum

Mar 10, 1993, 2:35:47 PM
In article <1nk7db...@morrow.stanford.edu> iz...@nova1.stanford.edu (Raul Izahi Lopez Hernandez) writes:
> I am sure the MIPS folks can illustrate in more detail the pipeline of
>the R4000 which I don't have at hand now. All the detailed technical stuff
>is at home :) .

I dug out a worthwhile piece from comp.arch about this very subject
from 2 years ago (has it been that long?).

------------------------[start included article]-------------------------
>From: cpr...@mips.COM (Charlie Price)
Newsgroups: comp.arch
Subject: Re: R4000
Date: 11 Feb 91 21:52:07 GMT
Organization: MIPS Computer Systems, Inc

In article <obgVpm200...@andrew.cmu.edu> mh...@andrew.cmu.edu (Mark Hahn) writes:
>isn't MIPS's "superpipelining" just the common trick
>of sticking in a clock doubler?

Superscalar is pretty easy to define, but
what *does* superpipelining really mean?

At least one definition is that it is an implementation in which
extra stages are added to a "normal" pipeline simply to decrease
the clock interval and increase the issue rate.
The R4000 qualifies by this measure.

Another view is that a regular pipeline issues one instruction per
I-cache access latency period. A superpipeline issues two or more
instructions during the cache access latency.
The R4000 also qualifies by this measure.

One superpipeline "feature" that the R4000 does NOT have, is a multi-stage
ALU. The designers squeezed very hard to get the ALU into one clock.

This is a "good thing" and an important detail of the design. It makes
it possible for the result of an ALU operation to be available (by
bypassing) to the ALU stage of the following operation.
This means that the R4000 has no issue restrictions;
this instruction sequence can be issued in the same external cycle:

    sub r1 from r2, result in r3
    or  r3 with r4, result in r5

Pipeline details for the curious ("|" denotes parallel operation):
The R3000 pipeline has 5 stages:

  IF   Instruction fetch from I-cache
  RF   Register Fetch | instruction decode
  ALU  ALU op or load/store address computation
  MEM  D-cache access
  WB   WriteBack results to register file

The cache access time is one clock period and
an instruction is issued in each clock period.
This is an incomplete description, and parts of the processor are
The R4000 has an 8-stage pipeline that takes 4 EXTERNAL clocks:

  IF   I-fetch, First cycle | instr address translation
  IS   I-fetch, Second cycle | instr address translation

  RF   Register Fetch | instruction decode | tag check of I-cache entry
  EX   ALU or load/store address computation

  DF   D-cache access, First cycle | data address translation
  DS   D-cache access, Second cycle | data address translation

  TC   Tag Check of D-cache entry
  WB   WriteBack to register file

Two instructions are issued per EXTERNAL clock,
this is the same period as the on-chip cache latency.
To do this, an internal clock runs at double the external clock and
one instruction is issued per internal clock so
This requires that cache access is pipelined.

This is much like the 3K pipeline except that the cache access
was chopped into two stages, and the D-cache tag check
needed a separate stage before writeback.
The RegisterFetch, EXecute, and WriteBack stages do roughly the same
work as before, just faster.

Squeezing the ALU into one clock required a faster adder.

------------------------[end included article]-------------------------
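
To put numbers on the two pipelines in the included article, here is a small Python sketch. The stage names are copied from the article; the helper function and the assumption that each stage takes exactly one internal clock are mine.

```python
R3000_STAGES = ["IF", "RF", "ALU", "MEM", "WB"]
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def summarize(stages, internal_per_external):
    """Depth, latency and peak issue rate relative to the EXTERNAL clock.

    Assumes one stage per internal clock and one instruction issued
    per internal clock, with no stalls.
    """
    return {
        "stages": len(stages),
        "latency_in_external_clocks": len(stages) / internal_per_external,
        "peak_issue_per_external_clock": internal_per_external,
    }

r3000 = summarize(R3000_STAGES, internal_per_external=1)
r4000 = summarize(R4000_STAGES, internal_per_external=2)
print("R3000:", r3000)
print("R4000:", r4000)
```

Under these assumptions the R4000 comes out at 8 stages in 4 external clocks with 2 instructions issued per external clock, matching the figures quoted above.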

The only thing I would add is that there's a very good description
of superscalar and superpipelined machines by Jouppi and Wall in ASPLOS III, titled
"Available Instruction-Level Parallelism for Superscalar and
Superpipelined Machines." Well worth reading.

-Erich

------------------------------------------------------------------------------
Erich Nahum Department of Computer Science
Networks and Performance Group University of Massachusetts at Amherst
na...@cs.umass.edu Amherst, MA 01003

Dennis O'Connor

Mar 10, 1993, 9:31:32 AM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
] degree of superpipelining = (number of cycles needed by a program if
] every instruction is treated as dependent on its predecessor)/
] (number of cycles needed if every instruction is treated as
] independent from its predecessor)

This is a good test for general "super-ness". But it does not
distinguish between superpipelined and superscalar. Although they
may give similar performance results, the design issues and impact
on things like die size and power are very different.

First, let me use the term interrupt to mean all asynchronous
changes to the flow of instructions in the pipeline: interrupts,
faults, thread switches.

It might be more interesting to define superscalar and superpipelined
in terms of interrupt-relative atomic time intervals.

If a machine can issue more than one instruction between the
boundaries at which the pipeline can be interrupted, it's superscalar.

If the result of a register-register operation is not available in the
following between-interrupts interval, it's superpipelined.

If both the above are true, it's superpipelined and superscalar.

This gives an externally measurable meaning for the words that
agrees well with the hardware-oriented meaning: superpipelined
generally means that the ALU has been pipelined,
while superscalar generally means that different instructions
can be issued to different execution units in a single cycle.

It should be obvious that simply core-clock-doubling a chip won't make
it satisfy either of my "super" criteria, if one considers "interrupts"
(i.e. faults or exceptions) that originate in the core.
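
The two criteria can be written down as a tiny Python function. This is just a restatement of the rules above; the parameter names and the example values are hypothetical and do not describe any particular chip.

```python
def classify_by_interrupt_interval(issues_per_interval, alu_result_delay):
    """Apply the two interrupt-interval criteria.

    issues_per_interval: maximum number of instructions issued between two
                         points at which the pipeline can be interrupted
    alu_result_delay:    intervals until a register-register result can be
                         used (1 means it is usable in the following interval)
    """
    superscalar = issues_per_interval > 1
    superpipelined = alu_result_delay > 1
    if superscalar and superpipelined:
        return "superpipelined and superscalar"
    if superscalar:
        return "superscalar"
    if superpipelined:
        return "superpipelined"
    return "neither"

# Hypothetical machines, for illustration only.
print(classify_by_interrupt_interval(1, 1))  # plain pipeline, clock-doubled or not
print(classify_by_interrupt_interval(2, 1))
print(classify_by_interrupt_interval(1, 2))
print(classify_by_interrupt_interval(3, 2))
```

A clock-doubled core still issues one instruction per interval and still delivers its ALU result in the next interval, so it falls into the "neither" case.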

--
Dennis O'Connor doco...@sedona.intel.com

Anton Ertl

Mar 12, 1993, 4:23:25 AM
In article <DOCONNOR.93...@potato.sedona.intel.com>, doco...@sedona.intel.com (Dennis O'Connor) writes:
|> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
|> ] degree of superpipelining = (number of cycles needed by a program if
|> ] every instruction is treated as dependent on its predecessor)/
|> ] (number of cycles needed if every instruction is treated as
|> ] independent from its predecessor)
|>
|> This is a good test for general "super-ness". But it does not
|> distinguish between superpipelined and superscalar.

You are right. Let's do a few more definitions:

superness = defined above

degree of superpipelining = superness if the machine is restricted to
issuing one instruction per cycle

superscalarity = superness if all operations have a one cycle latency

This all depends on the definition of cycle. For this we can use the
definition given in Dennis' posting, i.e. the time between issuing
consecutive (groups of) instructions.

Note that, with these definitions, for some machines

superness != (degree of superpipelining)*superscalarity

and that some of the interesting properties of the, e.g., SuperSPARC
are only reflected in the superness.

Conor O'Neill

Mar 12, 1993, 11:49:33 AM
>>From: cpr...@mips.COM (Charlie Price)

>At least one definition is that it is an implementation in which
>extra stages are added to a "normal" pipeline simply to decrease
>the clock interval and increase the issue rate.
>The R4000 qualifies by this measure.
>
>Another view is that a regular pipeline issues one instruction per
>I-cache access latency period. A superpipeline issues two or more
>instructions during the cache access latency.
>The R4000 also qualifies by this measure.

How about this for an approximate definition:

If the pipeline runs faster than the cache, but the cache runs at the
same speed as the bus (ie off-chip), then it's superpipelined.

If the pipeline runs at the same speed as the cache, but the cache runs
faster than the bus (ie off-chip), then it's clock-doubled (-tripled, etc).

By this simple definition, the MIPS R4000 is superpipelined,
and the 486-DX2 is clock-doubled.
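
That rule of thumb is easy to state as code. The sketch below is only an illustration: the function is hypothetical, and the clock figures are the round numbers mentioned in this thread (R4000 at 100 MHz internal with cache access and bus at 50 MHz, 486DX2-66 at 66 MHz internal with a 33 MHz bus), not datasheet values.

```python
def classify_by_clocks(pipeline_mhz, cache_mhz, bus_mhz):
    """Classify a CPU by comparing pipeline, cache and bus clock rates."""
    if pipeline_mhz > cache_mhz and cache_mhz == bus_mhz:
        return "superpipelined"
    if pipeline_mhz == cache_mhz and cache_mhz > bus_mhz:
        return "clock-multiplied"
    return "neither (by this rule)"

# Round figures as used in this thread; treat them as assumptions.
print("R4000:    ", classify_by_clocks(100, 50, 50))
print("486DX2-66:", classify_by_clocks(66, 66, 33))
print("plain CPU:", classify_by_clocks(33, 33, 33))
```

"Clock-multiplied" covers the doubled and tripled cases alike. A chip where all three rates differ fits neither branch, which shows how approximate the definition is.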

Hypothetical real answer:
Any multi-syllable word with 'super' in it is pure marketingspeak,
and has no real value whatsoever.

---
Conor O'Neill, Software Group, INMOS Ltd., UK.
UK: co...@inmos.co.uk US: co...@inmos.com
"It's state-of-the-art" "But it doesn't work!" "That is the state-of-the-art".
