Dynamic pipeline length adjustment

lkcl

Aug 16, 2019, 11:43:48 AM
One of our team, Jacob, came up with a very simple idea, to solve the issue that always bites a processor design: increasing the pipeline length comes with an increased clock latency, at slower speeds, but too short a pipeline (too many gates) and the top clock rate is limited.

The idea that Jacob had was to allow every other pipeline stage to dynamically *bypass* its latches. Thus, at slow speed, we can unlock the gates, halve the pipeline length, and reduce clock latency. however under normal circumstances, that would inherently limit the max clock rate.

If we want a boost, we quiesce the pipeline, close the bypass gates, and now the pipeline length is doubled; each combinatorial block is now half the former size, the clock rate can now go up. Yes the pipeline depth is doubled, but so is the clock rate.
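
A minimal Python sketch of the latch-bypass idea, counting only how many clock cycles one value needs to cross the pipe. The four stage functions and the name `run_pipeline` are purely illustrative assumptions, not taken from any real design.

```python
def run_pipeline(stages, value, bypass):
    """Push one value through `stages`; return (result, clock_cycles).

    When `bypass` is True, the latch after every other stage (stage
    0, 2, 4, ...) is treated as transparent, so two combinatorial
    blocks share a single clock cycle.
    """
    cycles = 0
    last = len(stages) - 1
    for i, stage in enumerate(stages):
        value = stage(value)
        # The final latch always stays closed so the result is captured.
        transparent = bypass and i % 2 == 0 and i != last
        if not transparent:
            cycles += 1  # value is captured by a real latch here
    return value, cycles


# Four hypothetical combinatorial blocks.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

deep = run_pipeline(stages, 5, bypass=False)    # all latches active
shallow = run_pipeline(stages, 5, bypass=True)  # every other latch bypassed
print(deep, shallow)
```

Both modes compute the same result, and the bypassed mode needs half the cycles. The price is that each clock period must now cover two blocks, which is why the shallow mode caps the clock rate.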

As we are doing an out of order design, with full Dependency Matrices, the change in pipeline length does not bother us at all.

The question in my mind is: I have never seen this trick used in a commercial processor - ever. So what did we miss? Is there some power or gate penalty for this type of "latch bypass" capability that we haven't thought of?

Thoughts appreciated.

L.

Rick C. Hodgin

Aug 16, 2019, 11:59:46 AM
On 8/16/2019 11:43 AM, lkcl wrote:
> One of our team, Jacob, came up with a very simple idea ...to allow
> every other pipeline stage to dynamically *bypass* its latches. Thus,
> at slow speed, we can unlock the gates, halve the pipeline length, and
> reduce clock latency...If we want a boost, we quiesce the pipeline,
> close the bypass gates, and now the pipeline length is doubled; each
> combinatorial block is now half the former size, the clock rate can
> now go up. Yes the pipeline depth is doubled, but so is the clock rate.
>
> As we are doing an out of order design, with full Dependency Matrices,
> the change in pipeline length does not bother us at all.

I don't know if it's been done before. I've never read about it in an
architecture, but I think it's genius. And now that I've read about it,
it's obvious too. :-)

Kudos to Jacob.

--
Rick C. Hodgin

Rick C. Hodgin

Aug 16, 2019, 12:03:02 PM
On 8/16/2019 11:43 AM, lkcl wrote:
> The idea that Jacob had was to allow every other pipeline stage to
> dynamically *bypass* its latches. Thus, at slow speed, we can unlock
> the gates, halve the pipeline length, and reduce clock latency...


This should be able to be done in more than just 1/1 and 1/2 staging.
You should be able to extend it further to 1/3 and 1/4 when your clock
speed goes to idle / low-workload speeds.

Quite a thing there Jacob's thought up.

--
Rick C. Hodgin

MitchAlsup

Aug 16, 2019, 12:21:00 PM
On Friday, August 16, 2019 at 10:43:48 AM UTC-5, lkcl wrote:
> One of our team, Jacob, came up with a very simple idea, to solve the issue that always bites a processor design: increasing the pipeline length comes with an increased clock latency, at slower speeds, but too short a pipeline (too many gates) and the top clock rate is limited.
>
> The idea that Jacob had was to allow every other pipeline stage to dynamically *bypass* its latches. Thus, at slow speed, we can unlock the gates, halve the pipeline length, and reduce clock latency. however under normal circumstances, that would inherently limit the max clock rate.

IBM has some papers on this, but I can't remember the name(s).

At the HW level, all you need (and WANT) is for both latches in a pipeline
flip-flop to be transparent at the same time. Then everything takes care of
itself.

The IBM papers show not just how to do this, but how to reconfigure the scan
path at the same time.

lkcl

Aug 16, 2019, 1:45:03 PM
On Saturday, August 17, 2019 at 12:03:02 AM UTC+8, Rick C. Hodgin wrote:
> On 8/16/2019 11:43 AM, lkcl wrote:
> > The idea that Jacob had was to allow every other pipeline stage to
> > dynamically *bypass* its latches. Thus, at slow speed, we can unlock
> > the gates, halve the pipeline length, and reduce clock latency...
>
>
> This should be able to be done in more than just 1/1 and 1/2 staging.
> You should be able to extend it further to 1/3 and 1/4 when your clock
> speed goes to idle / low-workload speeds.

yehyeh, and with power going down on a square law, if the workload is inherently parallel (GPU), but affected by latency (register dependencies, tight loops) we do not necessarily need huge numbers of branch prediction units because the pipelines are shorter.

Then (because this is a hybrid CPU / VPU) if we need single thread performance we rack it back up.

L.

Rick C. Hodgin

Aug 16, 2019, 1:54:45 PM
It's a brilliant idea. Make sure you keep Jacob happy. Who
knows what else he may come up with? :-)

--
Rick C. Hodgin

lkcl

Aug 16, 2019, 1:57:12 PM
On Saturday, August 17, 2019 at 12:21:00 AM UTC+8, MitchAlsup wrote:
> On Friday, August 16, 2019 at 10:43:48 AM UTC-5, lkcl wrote:
> > One of our team, Jacob, came up with a very simple idea, to solve the issue that always bites a processor design: increasing the pipeline length comes with an increased clock latency, at slower speeds, but too short a pipeline (too many gates) and the top clock rate is limited.
> >
> > The idea that Jacob had was to allow every other pipeline stage to dynamically *bypass* its latches. Thus, at slow speed, we can unlock the gates, halve the pipeline length, and reduce clock latency. however under normal circumstances, that would inherently limit the max clock rate.
>
> IBM has some papers on this, but I can't remember the name(s).

If anyone happens to know the reference that would be real handy.

> At the HW level, all you need (and WANT) is for both latches in a pipeline
> flip-flop to be transparent at the same time. Then everything takes care of
> itself.

I have yet to work out the circumstances where data would not be lost by opening or closing the latch bypasses.

The simplest safest option is to just let the entire pipe quiesce.

> The IBM papers show not just how to do this, but how to reconfigure the scan
> path at the same time.

Scan path... is that related to testing?
A 1997 patent by TI talks about how to load data into each of the latches (directly) so that the number of test vectors can be reduced, but also each individual stage may be tested as well.

Does that ring any bells?

https://patents.google.com/patent/US5970241

lkcl

Aug 16, 2019, 2:02:15 PM
On Saturday, August 17, 2019 at 1:54:45 AM UTC+8, Rick C. Hodgin wrote:

>
> It's a brilliant idea. Make sure you keep Jacob happy. Who
> knows what else he may come up with? :-)

:)

A partitionable Wallace multiplier so that we can use the same logic for scalars and any width of SIMD, quadratic algorithms for FP rounding accuracy emulation, a pipeline that can do DIV, SQRT and RSQRT, that's so far :)

L.

MitchAlsup

Aug 16, 2019, 4:22:31 PM
On Friday, August 16, 2019 at 12:57:12 PM UTC-5, lkcl wrote:
> On Saturday, August 17, 2019 at 12:21:00 AM UTC+8, MitchAlsup wrote:
> > On Friday, August 16, 2019 at 10:43:48 AM UTC-5, lkcl wrote:
> > > One of our team, Jacob, came up with a very simple idea, to solve the issue that always bites a processor design: increasing the pipeline length comes with an increased clock latency, at slower speeds, but too short a pipeline (too many gates) and the top clock rate is limited.
> > >
> > > The idea that Jacob had was to allow every other pipeline stage to dynamically *bypass* its latches. Thus, at slow speed, we can unlock the gates, halve the pipeline length, and reduce clock latency. however under normal circumstances, that would inherently limit the max clock rate.
> >
> > IBM has some papers on this, but I can't remember the name(s).
>
> If anyone happens to know the reference that would be real handy.
>
> > At the HW level, all you need (and WANT) is for both latches in a pipeline
> > flip-flop to be transparent at the same time. Then everything takes care of
> > itself.
>
> I have yet to work out the circumstances where data would not be lost by opening or closing the latch bypasses.

Err, no.

Every OTHER flip flop is allowed to be completely transparent. The rest perform
normal pipeline staging. This also means you have to "stage" the feedback
loops from the non-transparent flip-flops.

MitchAlsup

Aug 16, 2019, 4:27:29 PM
Easily done::

Take the majority gate carry = (a&b | a&c | b&c)
and make it .......... carry = (a&b&z | a&c&z | b&c&z )
when z = 1 the AOI gate works just like above
when z = 0 no carries pass this point (towards greater significance)

{At least 30 years old}

When each layer in the multiplier tree inverts logic polarity::
This takes a 2-2-2 AOI gate and makes it into a 3-3-3 AOI gate.
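
A hedged Python sketch of the carry-clipping trick above. As a simplification it is shown on a plain ripple-carry adder rather than a multiplier tree, and the function names, the 16-bit width and the 8-bit lane size are all assumptions chosen for illustration.

```python
def majority(a, b, c):
    """Carry-out of a full adder: the majority function of three bits."""
    return (a & b) | (a & c) | (b & c)


def partitioned_add(x, y, width=16, lane=8, split=True):
    """Ripple-carry add of two `width`-bit values.

    With split=True the extra input z is 0 at every lane boundary, so no
    carry crosses it and each `lane`-bit slice adds independently.
    """
    result = 0
    carry = 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        result |= (a ^ b ^ carry) << bit
        # z = 0 on the top bit of each lane when the adder is partitioned.
        z = 0 if (split and (bit + 1) % lane == 0) else 1
        # (a&b&z | a&c&z | b&c&z) is the same as majority(a, b, c) & z.
        carry = majority(a, b, carry) & z
    return result


whole = partitioned_add(0x12FF, 0x3402, split=False)  # one 16-bit add
lanes = partitioned_add(0x12FF, 0x3402, split=True)   # two 8-bit adds
print(hex(whole), hex(lanes))
```

With `split=False` the carry out of the low byte reaches the high byte; with `split=True` it is dropped at the boundary and the two bytes behave as separate adders.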

EricP

Aug 16, 2019, 4:55:37 PM
lkcl wrote:
> On Saturday, August 17, 2019 at 12:21:00 AM UTC+8, MitchAlsup wrote:
>> On Friday, August 16, 2019 at 10:43:48 AM UTC-5, lkcl wrote:
>>> One of our team, Jacob, came up with a very simple idea, to solve the issue that always bites a processor design: increasing the pipeline length comes with an increased clock latency, at slower speeds, but too short a pipeline (too many gates) and the top clock rate is limited.
>>>
>>> The idea that Jacob had was to allow every other pipeline stage to dynamically *bypass* its latches. Thus, at slow speed, we can unlock the gates, halve the pipeline length, and reduce clock latency. however under normal circumstances, that would inherently limit the max clock rate.
>> IBM has some papers on this, but I can't remember the name(s).
>
> If anyone happens to know the reference that would be real handy.

This is by IBM but for a pipelined MPEG decoder:

Fine-Grain Real-Time Reconfigurable Pipelining 2003
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.1015

Other search terms are "transparent pipeline" "collapsed pipeline"
"Adaptive pipeline"

Transparent mode flip-flops for collapsible pipelines 2007
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.182.2056

Adaptive pipeline structures for speculation control 2003
http://apt.cs.manchester.ac.uk/ftp/pub/amulet/papers/efthym_async03.pdf

lkcl

Aug 16, 2019, 11:16:42 PM
On Friday, August 16, 2019 at 9:22:31 PM UTC+1, MitchAlsup wrote:

> Every OTHER flip flop is allowed to be completely transparent. The rest perform
> normal pipeline staging. This also means you have to "stage" the feedback
> loops from the non-transparent flip-flops.

oh - yes, duh, i'd worked that bit out :) or, i should check: this is for being able to do microcode loops, where intermediate data is fed back into the pipeline for further processing, right? you definitely don't want to try to feed back into something that can flip between register and combinatorial.

(btw thank you eric for the references).

lkcl

Aug 16, 2019, 11:21:20 PM
On Friday, August 16, 2019 at 9:27:29 PM UTC+1, MitchAlsup wrote:

> Easily done::
>
> Take the majority gate carry = (a&b | a&c | b&c)
> and make it .......... carry = (a&b&z | a&c&z | b&c&z )
> when z = 1 the AOI gate works just like above
> when z = 0 no carries pass this point (towards greater significance)
>
> {At least 30 years old}
>
> When each layer in the multiplier tree inverts logic polarity::
> This takes a 2-2-2 AOI gate and makes it into a 3-3-3 AOI gate.

https://en.wikipedia.org/wiki/AND-OR-Invert

i'm not sure what you're referring to, Mitch. are you saying that
AOI logic functions can be optimally used in transparent pipeline
flip-flops, or were you referring to one of the (random) things
i mentioned?

apologies.

l.

MitchAlsup

Aug 17, 2019, 1:10:03 PM
No, what I am saying is that there is an easy way to "clip" the carry chains
in a multiplier. Once so clipped, the left and right half multipliers are
independent and can be used for different multiplications at the same time.

The addition of z to the AOI "clips" said carry chain.

already...@yahoo.com

Aug 17, 2019, 4:14:54 PM
On Friday, August 16, 2019 at 8:45:03 PM UTC+3, lkcl wrote:
>
> and with power going down on a square law,
>
> L.

Why do you expect power to go down by square law?
Pipeline latches (flip-flops?) consume power, sure, and you will save half of that, but the power consumed by pipeline latches is typically not a dominant part of a GPU's dynamic power budget.
And your bypass (mux) can't be totally free, so the saving will be even less.
Add static power to that, which nowadays is not negligible, and your saving is even less.
IMHO, you will be lucky if power goes down by a linear law, so at least you don't lose on perf/watt.

That is, unless you somehow manage to lower a voltage after you switched into half-rate mode. But I don't see why you possibly will be able to do it.

lkcl

Aug 17, 2019, 11:20:05 PM
On Saturday, August 17, 2019 at 9:14:54 PM UTC+1, already...@yahoo.com wrote:

> That is, unless you somehow manage to lower a voltage after you switched into half-rate mode. But I don't see why you possibly will be able to do it.

yes, usually, in SoCs, when dropping the clock rate you have an external PMIC (such as the AXP209, or Action Semi ACT8600, both of which are around the USD $1 mark and have multiple DC-DC and LDO outputs) you can then also drop the main core voltage and the DDR2/3 voltage (independently).

l.

already...@yahoo.com

Aug 18, 2019, 3:20:44 AM
My question is not *how* you can reduce a voltage, but *why* do you think that you will be able to do it?
Is it because your original 2n-stage pipeline is poorly balanced in terms of critical path and the n-stage pipeline is balanced better? Or some other reason?
Or do you plan to reduce the clock frequency by more than twice?

Your original post does not explain a motivation sufficiently clearly.

lkcl

Aug 18, 2019, 3:45:15 AM
On Sunday, August 18, 2019 at 8:20:44 AM UTC+1, already...@yahoo.com wrote:
> On Sunday, August 18, 2019 at 6:20:05 AM UTC+3, lkcl wrote:
> > On Saturday, August 17, 2019 at 9:14:54 PM UTC+1, already...@yahoo.com wrote:
> >
> > > That is, unless you somehow manage to lower a voltage after you switched into half-rate mode. But I don't see why you possibly will be able to do it.
> >
> > yes, usually, in SoCs, when dropping the clock rate you have an external PMIC (such as the AXP209, or Action Semi ACT8600, both of which are around the USD $1 mark and have multiple DC-DC and LDO outputs) you can then also drop the main core voltage and the DDR2/3 voltage (independently).
> >
> > l.
>
> My question is not *how* you can reduce a voltage, but *why* do you think that you will be able to do it?

... this makes no sense - i.e. i may be missing something obvious. why? because we will use Cell Libraries that accept variable voltages. um... :)

> Is it because your original 2n-stage pipeline is poorly balanced in terms of critical path and the n-stage pipeline is balanced better?

no. if we haven't designed the pipeline so it can be a 2n-stage and an n-stage, that's.... well... a design flaw.

> Or some other reason?
> Or do you plan to reduce the clock frequency by more than twice?

yes (variable clock rate, just like other SoCs) however that is also a red herring.

> Your original post does not explain a motivation sufficiently clearly.

the original post was not intended to explain a motivation for square-law power consumption: i assumed it was well understood, and so was just a throw-away (afterthought) comment.


W = V^2 / R. therefore, if the voltage may be dropped (because the clock rate is dropped), there are inherent power reductions beyond the linear "expectation"

if we kept the voltage *exactly the same* regardless of the clock rate, there would be no such square-law-related power saving.

if the voltage is insufficient, and the clock rate is increased, the transistors do not have sufficient current to drive to the required thresholds to indicate "1" and "0".

this is typically compensated for by increasing the supply voltage.

therefore, as the clock rate goes up, the [practical, required] power consumption is *not* linear.

the typical voltage swings for 40nm are 0.9v to 1.2v or thereabouts. 0.9v will be useable at around... 400mhz, whilst to get to around 1.2ghz to 1.5ghz a 1.2v core voltage will be required.
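
a rough sketch of the square-law argument, assuming the usual first-order model where dynamic power scales as C * V^2 * f (same switched capacitance and activity at both operating points, static power ignored). the function name is made up for illustration; the voltages and clock rates are the approximate 40nm figures quoted just above.

```python
def relative_dynamic_power(v, f, v_ref, f_ref):
    """Dynamic power relative to a reference operating point.

    Assumes P ~ C * V^2 * f with identical switched capacitance and
    activity factor at both points; static (leakage) power is ignored.
    """
    return (v / v_ref) ** 2 * (f / f_ref)


# Reference point: 1.2 V at 1.2 GHz. Low point: 0.9 V at 400 MHz.
scaled = relative_dynamic_power(0.9, 400e6, 1.2, 1.2e9)
freq_only = relative_dynamic_power(1.2, 400e6, 1.2, 1.2e9)

print(f"voltage and clock lowered: {scaled:.3f} of reference power")
print(f"clock lowered only:        {freq_only:.3f} of reference power")
```

under these assumptions, dropping the clock alone leaves about a third of the dynamic power, while dropping the voltage as well leaves under a fifth. whether the lower voltage is actually reachable at a given clock is a separate question that this model does not answer.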

i heard that there was someone doing cells for globalfoundries 22nm where the supply voltage could go as low as 0.4v. which is deeply impressive.

l.

already...@yahoo.com

Aug 18, 2019, 4:30:39 AM
All you said above is obvious.
What is not obvious is why [in chip that runs predominantly throughput-bound jobs, like GPU or GPGPU] switching to shallow pipeline at lower clock could be advantageous.
Shallow pipeline = much higher [than deep pipeline at the same clock frequency] voltage at marginally better throughput and much better latency. But we already established that latency is relatively unimportant, otherwise we wouldn't build it as GPU/GPGPU. So taking throughput per watt as the main figure of merit and assuming, generously, that the shallow pipeline will require 1.7x higher voltage than the deep pipeline and will deliver 1.2x higher throughput (both figures extremely optimistic), the shallow pipeline loses by a factor of 2.4x (1.7^2 / 1.2).
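
The arithmetic behind that factor, as a small Python sketch. It assumes power scales with the square of the supply voltage at a fixed clock frequency; the function name is hypothetical and the 1.7x and 1.2x inputs are the deliberately generous guesses from the paragraph above, not measurements.

```python
def perf_per_watt_ratio(throughput_gain, voltage_ratio):
    """Perf/watt of the shallow pipeline relative to the deep one.

    Assumes both run at the same clock and that power scales as V^2.
    """
    return throughput_gain / voltage_ratio ** 2


ratio = perf_per_watt_ratio(throughput_gain=1.2, voltage_ratio=1.7)
deep_advantage = 1 / ratio
print(f"shallow/deep perf-per-watt: {ratio:.3f}")
print(f"deep pipeline wins by:      {deep_advantage:.2f}x")
```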

lkcl

Aug 18, 2019, 5:30:39 AM
right. ok. this question makes sense.

> Shallow pipeline = much higher [than deep pipeline at the same clock frequency] voltage at marginally better throughput and much better latency.

ah no. there are three decisions to make (actually 4 in "traditional" design) and unfortunately what you are doing is conflating at least two of those decisions.

* clock: fast or slow
* voltage: high or low
* pipeline length: long or short
* [and for traditional processor design]: parallelism: more or less.


> But we already established that latency is relatively unimportant, otherwise we wouldn't build it as GPU/GPGPU. So taking throughput per watt as main figure of merit and assuming, generously, that shallow pipeline will require 1.7x higher voltage than deep pipeline and will deliver 1.2x higher throughput (both figures extremely optimistic) shallow pipeline loses by factor of 2.4X.

you're confusing several things (which is probably why you're asking the question).

normally, these would be 2 completely separate chips:

A GPU:

* low clock rate
* lower voltage
* short pipelines
* large parallelism (to make up for the lower clock rate)

A CPU:

* high clock rate
* higher voltage (to compensate for the power drain caused by high clock rate)
* longer pipelines
* less parallelism (because the applications can't cope, and the clock is faster anyway)


interestingly, the actual instruction COMPLETION time is the SAME (give-or-take), because whilst the clock rate is (for example) doubled, the pipeline length is (for example also) doubled, when comparing these two designs.


now let's go to the "hybrid" design.

* the amount of actual hardware is the same.
* therefore there is no "advantage" compared to the typical-CPU vs
typical-GPU as you do not have "more parallelism".

however, what you *do* have is the option to open up the flip-flop gates, halve the pipeline length, HALVE the clock rate, and then REDUCE THE VOLTAGE, which you can do because you are running at a LOWER clock rate.

Q: is the instruction completion time reduced?
A: no.

Q: is the power still reduced?
A: YES.

Q: what would the performance be like when compared to a "normal" CPU,
if we took a "normal" CPU and simply halved the clock rate?
A: the performance would be the same.

Q: would the power reduction be the same?
A: yes

Q: would we still be able to reduce the voltage?
A: yes.

Q: so what's different, then?
A: *THE INSTRUCTION COMPLETION TIME*

Q: so the instruction completion latency would be reduced?
A: YES

Q: okaay, so that has implications for the number of branch prediction
units needed, and so on, doesn't it?
A: YEEEEEES.

Q: and the number of Reservation Stations needed in an OoO design?
A: yeeeees

Q: and in an in-order design, the shortened pipelines would mean less
stalling, wouldn't it?
A: yeeeeees!

Q: so overall, the "standard" CPU design, because of the longer pipelines,
would have significant performance / latency penalties compared to the
"flexible" pipeline design?
A: yeeeeees.
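
to put illustrative numbers on the Q&A above, here is a small sketch. the 12-stage depth, the 1.5ghz / 0.75ghz clocks and the function name are assumptions picked to make the arithmetic easy, nothing more.

```python
def completion_time_ns(stages, clock_ghz):
    """Time for one instruction to traverse the pipeline, in nanoseconds."""
    return stages / clock_ghz


hybrid_fast = completion_time_ns(12, 1.5)   # gates closed, full clock
hybrid_slow = completion_time_ns(6, 0.75)   # gates open, half clock
fixed_slow = completion_time_ns(12, 0.75)   # fixed 12-stage CPU, half clock

print(f"hybrid, 12 stages @ 1.5 GHz:  {hybrid_fast} ns")
print(f"hybrid, 6 stages @ 0.75 GHz:  {hybrid_slow} ns")
print(f"fixed, 12 stages @ 0.75 GHz:  {fixed_slow} ns")
```

in this sketch the hybrid keeps the same completion time when it halves its clock, whereas the fixed-depth design doubles it. that is the difference the Q&A is pointing at.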


now let's compare a GPU vs this hybrid design: we're going to go the *other* way. let's start from the gates being OPEN and let's try to ramp up the clock rate and the power.

Q: the "standard" GPU with short pipelines, can we ramp up the clock rate?
A: NO.

Q: why the heck not?
A: because the pipeline is so short - the chain of gates connected
together in each stage is so long - that this sets a MAXIMUM FREQUENCY that
CANNOT BE EXCEEDED.

so the questions stop right there. the "standard" GPU design which has a low clock frequency (low MAXIMUM clock frequency as it turns out), simply cannot be utilised for CPU workloads.


the assumption (i think) that you made was that it would even be *possible* for a (fixed-length) short pipeline design to increase the clock rate.

often this is simply not the case, and in fact, in our design, we will very deliberately pick the number of stages in the "shortened" layout so that it will run successfully at only up to around 800mhz in that configuration.

we will *deliberately* design the number of combinatorial blocks per pipeline stage so that if you want to go above that number (to 1.5ghz) you *have* to close the gates, activate the flip-flops that will be present on every other stage, and double the pipeline length in the process.

*now* you can go to 1.5ghz [and have to increase the voltage to match].


clearly, in this configuration, the increased latency (the longer pipeline depth) would need to be compensated for.

so, actually, yes, you would actually need lots of branch prediction/speculation units, lots more Reservation Stations and so on, in order to "cope" with the longer pipelines.

which has some interesting implications when you go back down to the shorter depth.

the problem with an Out-of-Order design is that if you have a pipeline depth of say 12, you *MUST* have a minimum of 12 Reservation Stations (actually, 12 Reservation Station "Rows", if you use a Tomasulo Algorithm).

the reason is because you cannot - must not - have data in the pipeline that you cannot "re-associate" with its originator.

therefore, if you have say a pipeline depth of 12, and only 6 RS's (6 RS "rows" total in Tomasulo), the maximum utilisation of the pipeline is ONLY FIFTY PERCENT. as in: it must run with ONLY 6 things being processed at any one time.

if this is not obvious: if you add a 7th, what "ID" do you give it, so that you can re-associate it with the Reservation Station when it comes back out the pipeline? there are only 6 RS's, there is only one set of operand latches per RS, there are only 6 "IDs"... so... um... how on earth can you even *put* a 7th item into the pipeline, when you have nowhere to store the result??

this is clearly a problem :)
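
a toy simulation of that constraint, as a sketch only. it assumes one issue per cycle, one RS "ID" per in-flight operation, and that an ID is free for re-use in the same cycle its result leaves the pipeline. the function name and the cycle count are made up for illustration.

```python
from collections import deque


def steady_state_utilisation(depth, num_rs, cycles=1000):
    """Average fraction of pipeline stages holding work.

    Each in-flight operation needs its own Reservation Station ID; an
    operation issued at cycle t leaves the pipeline at cycle t + depth.
    """
    free_ids = deque(range(num_rs))
    in_flight = deque()  # (finish_cycle, rs_id), oldest first
    busy_slots = 0
    for now in range(cycles):
        # Results leaving the pipeline hand their ID back.
        while in_flight and in_flight[0][0] == now:
            free_ids.append(in_flight.popleft()[1])
        # Issue at most one new operation per cycle, if an ID is free.
        if free_ids:
            in_flight.append((now + depth, free_ids.popleft()))
        busy_slots += len(in_flight)
    return busy_slots / (cycles * depth)


print(f"depth 12, 6 RS:  {steady_state_utilisation(12, 6):.2f}")
print(f"depth 12, 12 RS: {steady_state_utilisation(12, 12):.2f}")
print(f"depth 6, 12 RS:  {steady_state_utilisation(6, 12):.2f}")
```

with 6 IDs and a 12-deep pipe the model settles at about half utilisation; with 12 IDs it fills the pipe; with 12 IDs and a 6-deep pipe the pipe is full and half the IDs are left spare.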

so, you *have* to design the Dependency Matrices (or Tomasulo RS's) to cope:

* 12-long pipelines == 12-long RS's (or more).

now let's drop the clock rate, and halve the pipeline length. now you have 12 RS's, but the pipelines are only SIX long. that means that there are now twice as many RS's as there are pipeline stages for them to fill.

so what do you do with all those yummy extra RS's?

stuff them with data of course! :)

now you can do things like:

* extra run-ahead branch-prediction on loops (yay!)
* detect more opportunities for operand-forwarding (w00t!)
* double the multi-issue execution rate from 2 to 4 (or 4 to 8). yippee!

all of which occurs *without* having any kind of major performance penalties or increase in instruction completion latency.

there's probably some more that i haven't thought of.

so it's a huge set of advantages that go well beyond those of a first glance.

l.

Anton Ertl

Aug 18, 2019, 8:18:38 AM
lkcl <luke.l...@gmail.com> writes:
>On Sunday, August 18, 2019 at 8:20:44 AM UTC+1, already...@yahoo.com wrote:
>> Or do you plan to reduce the clock frequency by more than twice?

The whole scheme only makes sense once voltage has reached the
minimum. You can then reduce the clock further to reduce dynamic
power, and once you have reduced it enough to allow the half-depth
configuration to work at that voltage level, you can reconfigure the
pipeline to reduce dynamic power some more.

>the typical voltage swings for 40nm are 0.9v to 1.2v or thereabouts. 0.9v
>will be useable at around... 400mhz, whilst to get to around 1.2ghz to 1.5ghz
>a 1.2v core voltage will be required.

On a Juniper XT (Radeon 6770), which is built in 40nm, the voltage
levels are indeed 0.9V and 1.2V, but the associated clock rates are
400MHz and 850MHz; so apparently the clock rate cost of that voltage
difference is just a little over a factor of 2.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Pedro Pereira

Aug 18, 2019, 9:19:14 AM
It's not really relevant to the discussion, but some systems like the Blackfin BF53x
even have an on-chip switching buck regulator.

Pedro Pereira

Przemysław Cieszyński

Aug 18, 2019, 9:47:26 PM
I had a similar idea a long time ago and the gains are illusory. You still have the full design cost of a deep pipeline, including interstage timing restrictions (BTW I am a software guy). As an alternative, think about NOP interleaving: you can virtually shorten the depth of the pipeline.
A much better idea is to make a reconfigurable processor that can gradually change its mode of operation from OOO to IO (but a SMALL+BIG combo is more efficient and versatile).

lkcl

Aug 19, 2019, 1:41:00 AM
On Monday, August 19, 2019 at 2:47:26 AM UTC+1, Przemysław Cieszyński wrote:
> I had a similar idea a long time ago and the gains are illusory. You still have the full design cost of a deep pipeline, including interstage timing restrictions (BTW I am a software guy).

so am i! :)

> As an alternative, think about NOP interleaving: you can virtually shorten the depth of the pipeline.

honestly: my immediate reaction to that is "yuk". yes it reduces power consumption, no it does nothing to reduce latency.

> A much better idea is to make a reconfigurable processor that can gradually
> change its mode of operation from OOO to IO

interestingly, when using 6600-style Dependency Matrices, i noticed that if you have only one single Function Unit, the end result is an in-order system. in other words, an in-order design is a degenerate case of a 6600-style OoO design.

the problem that i see with the logic that you propose is: if you are going to go to all of the trouble of successfully creating a precise-exception-capable 6600-style OoO machine, *why would you then limit it to in-order*? there's no point (that i can see).

> (but SMALL+BIG combo is more efficient and versatile).

only for certain workloads, where the uses for each are mutually exclusive. that in turn implies that the silicon area is effectively wasted.

whilst it's not _completely_ equivalent to SMALL+BIG, the dynamic transparent latches give the same hardware the opportunity to complete instructions in the same time using far less power.

[it's not completely equivalent because in many SMALL configs the circuits are optimised completely differently to be more power-efficient, and, also, there is the opportunity to use completely different *algorithms* within the pipelines, not just reconfigure the flip-flops between the stages *of* those algorithms].

dynamic transparent flip-flops are basically just another tool in a long line of tools that can be used to design processors. it's not a panacea.

l.

Terje Mathisen

Aug 19, 2019, 4:56:55 AM
My immediate concern was that this extra logic makes the circuit slower,
but only by one extra gate for every byte boundary, i.e. 7 for a 64-bit
unit splittable into byte ops.

OTOH wide adders are already doing one or more forms of carry forwarding
speedups anyway, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

already...@yahoo.com

Aug 19, 2019, 7:29:14 AM
> * higher voltage (to compensate for the power drain caused by high clock rate)

This sentence hints at a rather deep misunderstanding of causes and effects.
I am not good at explanations, but first google hit gave a very competent answer:
https://www.quora.com/Why-do-processors-need-a-higher-voltage-to-run-stably-at-higher-clock-speeds


> * longer pipelines


That's more complicated.
High-performance CPU pipelines tend to have longer decode, dispatch, rename, issue and retirement phases.
However FP execution phase tends to have the same number of stages as GPUs or to be only slightly longer, as 4 vs 3.
Integer execution stages of CPUs can be even shorter than those of GPUs. In CPU world 1 integer execution stage is a norm, esp. outside of IBM. I don't know what is "norm" in GPU realm and if such thing even exists.
So, latencies, measured in clock cycles are quite comparable between GPUs and CPUs, except for latency (more commonly called penalty) of mispredicted branch.

> * less parallelism (because the applications can't cope, and the clock is faster anyway)
>
>
> interestingly, give-or-take, the actual instruction COMPLETION time is the SAME (give-or-take), because whilst the clock rate is (for example) doubled, the pipeline length is (for example also) doubled, when comparing these two designs.
>
>
> now let's go to the "hybrid" design.
>
> * the amount of actual hardware is the same.
> * therefore there is no "advantage" compared to the typical-CPU vs
> typical-GPU as you do not have "more parallelism".
>
> however, what you *do* have is the option to open up the flip-flop gates, halve the pipeline length, HALVE the clock rate, and then REDUCE THE VOLTAGE, which you can do because you are running at a LOWER clock rate.
>
> Q: is the instruction completion time reduced?
> A: no.
>
> Q: is the power still reduced?
> A: YES.
>

Linearly.
Energy per unit of work is about (ignoring 5-10-15%) the same for 2n-stage pipeline at frequency 2x and n-stage at frequency x. Because in both cases the circuit will produce reliable results at about the same voltage.
If you still can't grasp it please reread the answer of Santosh Gannavarapu.

Paul A. Clayton

Aug 19, 2019, 7:37:08 AM
On Friday, August 16, 2019 at 11:43:48 AM UTC-4, lkcl wrote:
> One of our team, Jacob, came up with a very simple idea, to solve
> the issue that always bites a processor design: increasing the
> pipeline length comes with an increased clock latency, at slower
> speeds, but too short a pipeline (too many gates) and the top
> clock rate is limited.
>
> The idea that Jacob had was to allow every other pipeline stage
> to dynamically *bypass* its latches. Thus, at slow speed, we
> can unlock the gates, halve the pipeline length, and reduce
> clock latency. however under normal circumstances, that would
> inherently limit the max clock rate.

I proposed a similar concept a while ago but using wave pipelining
to avoid the intermediate latches. Wave pipelining has substantial
practical issues, so such is unlikely to be implemented.

> If we want a boost, we quiesce the pipeline, close the bypass
> gates, and now the pipeline length is doubled; each combinatorial
> block is now half the former size, the clock rate can now go up.
> Yes the pipeline depth is doubled, but so is the clock rate.

Not quite doubled I would assume because in shallow pipeline mode
the intermediate latch overhead would be smaller. (Manufacturing
variability would also favor a shallower pipeline because variation
would be averaged over larger chunks of logic.)
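
The "not quite doubled" point can be put in numbers with a very small timing sketch. It assumes the clock period is simply logic delay plus a fixed per-stage latch overhead, that the bypassed latch adds no delay of its own, and the delay figures are invented for illustration.

```python
# Minimal timing sketch; the delay values are hypothetical.

def clock_period(logic_delay_ns, latch_overhead_ns):
    """Shortest usable clock period for one pipeline stage."""
    return logic_delay_ns + latch_overhead_ns

def deep_over_shallow_clock_ratio(block_delay_ns, latch_overhead_ns):
    """How much faster the deep (2n-stage) mode can clock than the shallow one."""
    # Shallow mode: two logic blocks back to back, one latch overhead.
    shallow_period = clock_period(2 * block_delay_ns, latch_overhead_ns)
    # Deep mode: one logic block, still one full latch overhead.
    deep_period = clock_period(block_delay_ns, latch_overhead_ns)
    return shallow_period / deep_period

ratio = deep_over_shallow_clock_ratio(block_delay_ns=1.0, latch_overhead_ns=0.2)
```

With these made-up numbers the deep mode clocks roughly 1.83 times faster rather than 2 times, and the gap widens as latch overhead grows relative to the logic delay.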

lkcl

Aug 19, 2019, 8:17:57 AM
On Monday, August 19, 2019 at 12:29:14 PM UTC+1, already...@yahoo.com wrote:

> This sentence hints at a rather deep misunderstanding of causes and effects.
> I am not good at explanations, but the first Google hit gave a very competent answer:
> https://www.quora.com/Why-do-processors-need-a-higher-voltage-to-run-stably-at-higher-clock-speeds

appreciate the link. there's lots of details i don't precisely know,
and this is very informative.

>
> > * longer pipelines
>
>
> That's more complicated.
> High-performance CPU pipelines tend to have longer decode, dispatch, rename, issue and retirement phases.
> However, the FP execution phase tends to have the same number of stages as in GPUs, or to be only slightly longer, e.g. 4 vs 3.
> Integer execution stages of CPUs can be even shorter than those of GPUs. In the CPU world 1 integer execution stage is the norm, esp. outside of IBM. I don't know what the "norm" is in the GPU realm, or if such a thing even exists.
> So latencies, measured in clock cycles, are quite comparable between GPUs and CPUs, except for the latency (more commonly called penalty) of a mispredicted branch.

ok, right. this applies to *ordinary* CPUs and *ordinary* GPUs. where typically the GPU - as a separate processor - communicates with the CPU over shared memory (typically a PCIe bus) and the GPU's FP and INT engine(s) are (typically) heavily optimised for a 32-bit workload, and often do not meet (or need to meet) IEEE754 accuracy requirements.

we're designing a *hybrid* processor that must cope with *both* workloads, often running simultaneously, *both tasks using the same instruction set*.

consequently, the absolute last thing that we can do is create a 64-bit IEEE754 accurate FP unit, and expect the design to be cost-competitive with modern GPUs of today.

so what we're doing is to design a Beast of a microcoded partitionable ALU that is split in similar ways to how Ivan has described the Mill's ALUs.

it will be heavily optimised to complete *32-bit* operations (both FP and INT) efficiently, sacrificing speed and performance on 64 bit by reusing the *32-bit* pipelines and/or by opening up the partitions between two 32-bit pipelines to create a 64-bit-wide one.

deploying "data recycling" tricks - pushing the data back into the pipeline - would clearly extend the completion time (double or quadruple), because only partial results can be created on each iteration, which then have to be amalgamated.

[btw we will also provide early-in and early-out bypass points so that the same ALUs may be used for INT workloads and FP workloads, so it's *not* just FP that's impacted, it's INT as well]

consequently, the effective pipeline length for performing 64-bit arithmetic is *not* 3-4 clock cycles: on some operations it would be more like 12-16.

thus it is *really important* for this hybrid design to get that down, which is why we're looking at the transparent flip-flops.

if we were doing a "traditional CPU", we would not be having this discussion.

if we were doing a "traditional GPU", we would not be having this discussion.

it's only because we are designing a *hybrid* CPU/GPU/VPU that we are even considering these strange (and otherwise unnecessary) design strategies.

l.

lkcl

Aug 19, 2019, 10:05:59 AM
On Monday, August 19, 2019 at 12:37:08 PM UTC+1, Paul A. Clayton wrote:

> > If we want a boost, we quiesce the pipeline, close the bypass
> > gates, and now the pipeline length is doubled; each combinatorial
> > block is now half the former size, the clock rate can now go up.
> > Yes the pipeline depth is doubled, but so is the clock rate.
>
> Not quite doubled I would assume because in shallow pipeline mode
> the intermediate latch overhead would be smaller. (Manufacturing
> variability would also favor a shallower pipeline because variation
> would be averaged over larger chunks of logic.)

... and there will be some aspects that can't have this trick applied:
instruction decode and so on will likely be a fixed pipeline-length
overhead.

l.

lkcl

Aug 19, 2019, 10:37:25 AM
On Monday, August 19, 2019 at 1:17:57 PM UTC+1, lkcl wrote:

> it will be heavily optimised to complete *32-bit* operations (both FP and INT) efficiently, sacrificing speed and performance on 64 bit by reusing the *32-bit* pipelines and/or by opening up the partitions between two 32-bit pipelines to create a 64-bit-wide one.

btw just to clarify, there: that's partitioning on the *width*, not partitioning on the *depth*. [actually we'll add both]

what i was referring to here was that we have a SIMD-capable partitioned wallace tree multiplier block that is capable of either 8x 8-bit MUL, 4x 16-bit MUL, 2x 32-bit MUL or 1x 64-bit MUL pipelined operations.

so we open up the *width-wise* partition gates and instead of 2x 32-bit MUL answers coming out the pipeline, we get 1x 64-bit.

however if the amount of gates needed is too great, what we will do instead is to do two *completely separate* width-partitionable 32-bit-wide MUL units (because we still have to do 8-bit and 16-bit SIMD), then use them with microcoding to feed back partial results, and construct a 64-bit answer with several passes.
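
The "several passes" approach can be sketched as the usual schoolbook decomposition: four 32x32 partial products, shifted and summed. This is a behavioural model only, covering the unsigned case; `mul32` is a hypothetical stand-in for one pass through a 32-bit multiplier, and the signed/unsigned RISC-V variants would need extra correction steps not shown here.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul32(x, y):
    """Stand-in for one pass through a 32x32 -> 64-bit unsigned multiplier."""
    assert 0 <= x <= MASK32 and 0 <= y <= MASK32
    return x * y

def mul64_from_32(a, b):
    """Unsigned 64x64 -> 128-bit multiply built from four 32x32 passes.

    Returns (low 64 bits, high 64 bits) of the product.
    """
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # Four partial products, one per pass through the 32-bit multiplier.
    ll = mul32(a_lo, b_lo)
    lh = mul32(a_lo, b_hi)
    hl = mul32(a_hi, b_lo)
    hh = mul32(a_hi, b_hi)

    # Amalgamate: the cross terms are weighted by 2^32, the high term by 2^64.
    full = ll + ((lh + hl) << 32) + (hh << 64)
    return full & MASK64, full >> 64

low, high = mul64_from_32(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF)
```

The four calls to `mul32` are why reusing a 32-bit pipeline multiplies the completion time for a 64-bit result, which is where a shorter per-pass pipeline depth would help.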

and that will be where the *depth* partitioning particularly comes into play.

l.

already...@yahoo.com

Aug 19, 2019, 10:58:16 AM
First, if your circuit designers are competent, it is unlikely that they really use a Wallace tree. As explained multiple times by Mitch Alsup, Wallace trees have no advantages over a Dadda multiplier and several disadvantages.
Second, a fully pipelined 64-bit multiplier, esp. one that is capable of generating the upper 64 bits of a 128-bit result, is much more expensive than 2x32 or other smaller variants. So in typical low-end designs such multipliers are non-pipelined or partially pipelined, e.g. producing one result every 2 or 3 clock cycles. Most likely that would be the best trade-off for your design as well.

lkcl

Aug 19, 2019, 11:11:30 AM
On Monday, August 19, 2019 at 3:58:16 PM UTC+1, already...@yahoo.com wrote:
> On Monday, August 19, 2019 at 5:37:25 PM UTC+3, lkcl wrote:
> > On Monday, August 19, 2019 at 1:17:57 PM UTC+1, lkcl wrote:
> >
> > > it will be heavily optimised to complete *32-bit* operations (both FP and INT) efficiently, sacrificing speed and performance on 64 bit by reusing the *32-bit* pipelines and/or by opening up the partitions between two 32-bit pipelines to create a 64-bit-wide one.
> >
> > btw just to clarify, there: that's partitioning on the *width*, not partitioning on the *depth*. [actually we'll add both]
> >
> > what i was referring to here was that we have a SIMD-capable partitioned wallace tree multiplier block that is capable of either 8x 8-bit MUL, 4x 16-bit MUL, 2x 32-bit MUL or 1x 64-bit MUL pipelined operations.
> >
> > so we open up the *width-wise* partition gates and instead of 2x 32-bit MUL answers coming out the pipeline, we get 1x 64-bit.
> >
> > however if the amount of gates needed is too great, what we will do instead is to do two *completely separate* width-partitionable 32-bit-wide MUL units (because we still have to do 8-bit and 16-bit SIMD), then use them with microcoding to feed back partial results, and construct a 64-bit answer with several passes.
> >
> > and that will be where the *depth* partitioning particularly comes into play.
> >
> > l.
>
> First, if your circuit designers are competent, it is unlikely that they really use a Wallace tree. As explained multiple times by Mitch Alsup, Wallace trees have no advantages over a Dadda multiplier and several disadvantages.

ok. will point jacob at that.

> Second, a fully pipelined 64-bit multiplier, esp. one that is capable
> of generating the upper 64 bits of a 128-bit result,

and signed/unsigned versions of the same (RISC-V has 4 variants on multiply)

> is much more expensive than 2x32 or other smaller variants.
> So in typical low end designs such multipliers are non-pipelined
> or partially pipelined, e.g. producing one result every 3 or 2
> clock cycles. Most likely it would be the best trade-off for
> your design as well.

sigh we need full SIMD (and don't want separate hardware for each),
and to cover all 4 types of signed/unsigned RISC-V MULs as well.

it's not as straightforward a decision as it seems.

l.

MitchAlsup

Aug 19, 2019, 12:26:18 PM
On Monday, August 19, 2019 at 3:56:55 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, August 16, 2019 at 1:02:15 PM UTC-5, lkcl wrote:
> >> On Saturday, August 17, 2019 at 1:54:45 AM UTC+8, Rick C. Hodgin wrote:
> >>
> >>>
> >>> It's a brilliant idea. Make sure you keep Jacob happy. Who
> >>> knows what else he may come up with? :-)
> >>
> >> :)
> >>
> >> A partitionable Wallace multiplier so that we can use the same logic for scalars and any width of SIMD, quadratic algorithms for FP rounding accuracy emulation, a pipeline that can do DIV, SQRT and RSQRT, that's so far :)
> >>
> >> L.
> >
> > Easily done::
> >
> > Take the majority gate carry = (a&b | a&c | b&c)
> > and make it .......... carry = (a&b&z | a&c&z | b&c&z )
> > when z = 1 the AOI gate works just like above
> > when z = 0 no carries pass this point (towards greater significance)
> >
> > {At least 30 years old}
> >
> > When each layer in the multiplier tree inverts logic polarity::
> > This takes a 2-2-2 AOI gate and makes it into a 3-3-3 AOI gate.
> >
> My immediate concern was that this extra logic makes the circuit slower,
> but only by one extra gate for every byte boundary, i.e. 7 for a 64-bit
> unit splittable into byte ops.

It is only slower insofar as a 3-input gate is slower than a 2-input gate. The number
of gates (of logic and of delay) is equal.
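
The gated carry quoted above can be checked exhaustively with a few lines of Python. This is just a truth-table sketch of the two expressions as written in the post; the function names are hypothetical.

```python
from itertools import product

def majority_carry(a, b, c):
    """Plain full-adder carry: the 2-2-2 majority form."""
    return (a & b) | (a & c) | (b & c)

def gated_carry(a, b, c, z):
    """Carry with the partition control z folded into each AND term (3-3-3 form)."""
    return (a & b & z) | (a & c & z) | (b & c & z)

def check_gated_carry():
    """z = 1 must behave like the plain majority; z = 0 must kill the carry."""
    for a, b, c, z in product((0, 1), repeat=4):
        expected = majority_carry(a, b, c) if z else 0
        if gated_carry(a, b, c, z) != expected:
            return False
    return True

all_cases_match = check_gated_carry()
```

The gate count is unchanged because z is absorbed into the existing AND terms, making each one a 3-input gate rather than adding a new level of logic.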
>
> OTOH wide adders are already doing one or more forms of carry forwarding
> speedups anyway, right?

Adders are often built in 9-bit increments where the 9th bit is used to
determine the carry into the next segment.

When bit-9 gets 00 there is no carry out
When bit-9 gets 01 there is a carry out if there is a carry out of bit-8
When bit-9 gets 10 there is a carry out if there is a carry out of bit-8
When bit-9 gets 11 there is a carry out

This often comes with no gate delay penalty!
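
One way to picture the 9-bit trick is as a software model in which a spare bit sits between each pair of bytes and a single wide addition does the rest. The sketch below assumes eight byte lanes and a 7-bit `join_mask` (a hypothetical name) in which bit i says whether the carry out of byte i should continue into byte i+1. The spare bits are driven with 10 to propagate and 00 to kill, matching two of the four cases listed above.

```python
def spread(x, guard_bits=0):
    """Place byte i of x at bit 9*i, leaving a spare 9th bit above each byte."""
    out = 0
    for i in range(8):
        out |= ((x >> (8 * i)) & 0xFF) << (9 * i)
    return out | guard_bits

def partitioned_add(a, b, join_mask):
    """Add two 64-bit values as byte lanes joined according to join_mask."""
    guards = 0
    for i in range(7):
        if (join_mask >> i) & 1:
            # Spare bits see 1 + 0: a carry out of byte i ripples through.
            guards |= 1 << (9 * i + 8)
        # Otherwise the spare bits see 0 + 0: any carry is absorbed there.
    wide = spread(a, guards) + spread(b)  # one ordinary wide addition

    # Drop the spare bits again to recover the packed 64-bit result.
    result = 0
    for i in range(8):
        result |= ((wide >> (9 * i)) & 0xFF) << (8 * i)
    return result

bytes_result = partitioned_add(0xFF, 0x01, join_mask=0b0000000)
word_result = partitioned_add(0xFF, 0x01, join_mask=0b1111111)
```

With all joins off, 0xFF + 0x01 wraps to 0 within its byte; with all joins on, the same inputs give 0x100, as an ordinary 64-bit add would. No extra logic sits in the carry chain itself, which is the sense in which the partitioning comes for free.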

lkcl

Aug 20, 2019, 2:35:09 AM
On Friday, August 16, 2019 at 9:27:29 PM UTC+1, MitchAlsup wrote:
ok i had to do a bit of reading of the code that jacob's written,
in order to understand it. i *think* this is what's being done.

all of the Full 3-2 Adders have their carry bits shifted, then ANDed
with the "partition mask", before being passed on to the next level
in the reduction.

actually... no it isn't. not precisely. the shift is being done
*after* the carry computation, which of course the tools won't
spot and properly optimise.
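
For readers following along, a word-level model of a 3-2 (carry-save) step with a partition mask might look like the sketch below. It is not jacob's code and all names are hypothetical. It only shows the general idea that carries are shifted up by one bit and then masked wherever they would cross a lane boundary.

```python
MASK64 = (1 << 64) - 1

def lane_boundary_mask(lane_bits):
    """All ones except bit 0 of each lane, where a carry from below would land."""
    mask = MASK64
    for low_bit in range(0, 64, lane_bits):
        mask &= ~(1 << low_bit)
    return mask & MASK64

def csa_step(a, b, c, partition_mask):
    """One 3-2 reduction: three inputs in, a sum word and a carry word out."""
    total = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1  # carries move up one bit
    return total, carry & partition_mask        # and are cut at lane edges

def lanewise_sum3(a, b, c, lane_bits):
    """Add three packed vectors lane by lane using one masked 3-2 step."""
    total, carry = csa_step(a, b, c, lane_boundary_mask(lane_bits))
    lane_mask = (1 << lane_bits) - 1
    out = 0
    for shift in range(0, 64, lane_bits):
        # Final carry-propagate add, done per lane here to keep the check simple.
        lane = ((total >> shift) + (carry >> shift)) & lane_mask
        out |= lane << shift
    return out

as_bytes = lanewise_sum3(0x80, 0x80, 0x00, lane_bits=8)
as_halfwords = lanewise_sum3(0x80, 0x80, 0x00, lane_bits=16)
```

In 8-bit lanes the carry out of 0x80 + 0x80 is discarded, while in 16-bit lanes it lands in bit 8. In hardware the mask would ideally be folded into the carry gate itself, as in the 3-input AND form discussed earlier in the thread, rather than applied as a separate AND after the shift.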

thank you! appreciated the insight.

l.

lkcl

Aug 21, 2019, 5:29:39 AM
On Monday, August 19, 2019 at 5:26:18 PM UTC+1, MitchAlsup wrote:


> Adders are often built in 9-bit increments where the 9th bit is used to
> determine the carry into the next segment.
>
> When bit-9 gets 00 there is no carry out
> When bit-9 gets 01 there is a carry out if there is a carry out of bit-8
> When bit-9 gets 10 there is a carry out if there is a carry out of bit-8
> When bit-9 gets 11 there is a carry out
>
> This often comes with no gate delay penalty!

i was wondering why the hell FPGA DSPs sometimes have 18-bit adders...

EricP

Aug 21, 2019, 8:31:31 AM
To add 17-bit integers, of course.


EricP

Aug 21, 2019, 10:14:48 AM
Because it's just that much better...

https://www.youtube.com/watch?v=KOO5S4vxi0o


