Energy-Performance Trade-offs in Processor Architecture

Brett Davis

Dec 22, 2010, 6:21:29 AM

Energy-Performance Trade-offs in Processor Architecture
and Circuit Design: A Marginal Cost Analysis
http://isca2010.inria.fr/media/slides/Azizi-ISCA2010-EnergyEfficientProcessors-FinalDistributable.pdf

from ISCA 2010
http://isca2010.inria.fr/index.php?option=com_content&view=article&id=68&Itemid=72&lang=en

Brett

Andy "Krazy" Glew

Dec 23, 2010, 3:41:49 PM


A nice paper, but almost by definition incremental, marginal, and
derivative. Which I mean in a mathematical sense, since that is how
they pose the optimization problem.

The example data presented, showing that 2 architectures dominate the
"energy efficient" frontier, 2-wide in-order and out-of-order, must be
taken with a large grain of salt. The slides don't go into detail on
the workload; although the paper does, there are the usual issues of "do
I care about the workload modelled", etc.

In particular, the dismissal of 1-wide OOO may make sense for the
workload and parameters considered - but "obviously" (in an asymptotic
sense) as you increase memory latency, without adding more cache levels,
OOO beats in-order, and 1-wide is as good as 2-wide. And the CAMs that
make out-of-order power hungry can be changed into less power hungry
mechanisms.
Such variations are outside the scope of what was presented in the
paper. They are not outside the scope of analysis: the optimization
framework is a good one. But, I have seen many people think that by
varying a few parameters they have completely explored a design space;
or, rather, they think that the design space is homogeneous and analytic.
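
For concreteness: the "energy efficient frontier" here is just the
Pareto-optimal subset of (performance, energy per op) design points.
A toy Python sketch of extracting it -- the design points below are
invented for illustration and are not the paper's data:

designs = {
    "1-wide in-order": (1.0, 1.0),   # (relative performance, relative energy/op)
    "2-wide in-order": (1.6, 1.3),
    "1-wide OOO":      (1.5, 2.0),
    "2-wide OOO":      (2.4, 2.6),
    "4-wide OOO":      (2.9, 4.5),
}

def pareto_frontier(points):
    # A design stays on the frontier unless some other design is at least as
    # fast and strictly cheaper in energy, or strictly faster and no costlier.
    frontier = []
    for name, (perf, energy) in points.items():
        dominated = any((p2 >= perf and e2 < energy) or (p2 > perf and e2 <= energy)
                        for n2, (p2, e2) in points.items() if n2 != name)
        if not dominated:
            frontier.append((perf, energy, name))
    return sorted(frontier)

for perf, energy, name in pareto_frontier(designs):
    print("%-16s perf=%.1f  energy/op=%.1f" % (name, perf, energy))

With these made-up numbers the 1-wide OOO point falls off the frontier,
which is the shape of result the slides report; whether it falls off for
a workload and memory system you care about is exactly the question above.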

Brett Davis

Dec 23, 2010, 8:16:53 PM

In article <BJedndPnk7yPKY7Q...@giganews.com>,

"Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> wrote:

> On 12/22/2010 3:21 AM, Brett Davis wrote:
> > Energy-Performance Trade-offs in Processor Architecture
> > and Circuit Design: A Marginal Cost Analysis
> > http://isca2010.inria.fr/media/slides/Azizi-ISCA2010-EnergyEfficientProcessors-FinalDistributable.pdf
> >
> > from ISCA 2010
> > http://isca2010.inria.fr/index.php?option=com_content&view=article&id=68&Itemid=72&lang=en
> >
> > Brett
>
>
> A nice paper, but almost by definition incremental, marginal, and
> derivative. Which I mean in a mathematical sense, since that is how
> they pose the optimization problem.
>
> The example data presented, showing that 2 architectures dominate the
> "energy efficient" frontier, 2-wide in-order and out-of-order, must be
> taken with a large grain of salt. The slides don't go into detail on
> the workload; although the paper does, there are the usual issues of "do
> I care about the workload modelled", etc.
>
> In particular, the dismissal of 1-wide OOO may make sense for the
> workload and parameters considered - but "obviously" (in an asymptotic
> sense) as you increase memory latency, without adding more cache levels,
> OOO beats in-order, and 1-wide is as good as 2-wide. And the CAMs that
> make out-of-order power hungry can be changed into less power hungry
> mechanisms.

You are assuming increased latency over time, but I had thought that
over the past decade memory latency was basically flat, maybe improving.
Clock speeds are unchanged over the past decade, the prime driver of latency.
RAM has changed from DDR to DDR3, bigger packets, but much higher clock,
so you get your data faster?

> Such variations are outside the scope of what was presented in the
> paper. They are not outside the scope of analysis: the optimization
> framework is a good one. But, I have seen many people think that by
> varying a few parameters they have completely explored a design space;
> or, rather, they think that the design space is homogeneous and analytic.

On a bandwidth basis DRAM is at equality with ethernet and flash, and
about to be left in the dust:

DDR3 modules can transfer data at a rate of 800-2133 MT/s using both
rising and falling edges of a 400-1066 MHz I/O clock.

10 gigabit ethernet gives ~1000 MT/s

SSD flash drives up to 6 gigabit/s today.

In the long run DRAM is dead: why add a thousand pins for DRAM when
you can add 8 pins for ethernet?
You will have a few gigs of embedded RAM and page from a flash drive.
A Cell-style arch for everyone, lots of CPUs with local RAM streaming
data around as needed.
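
Rough back-of-envelope on the pin argument (my numbers, order-of-magnitude
guesses: DDR3-1600 on a 240-pin DIMM, 10GBASE-T on an 8-pin RJ45):

ddr3_bw   = 1600e6 * 8     # DDR3-1600: 1600 MT/s x 8 bytes/transfer = 12.8 GB/s
ddr3_pins = 240            # whole DIMM: data + address + control + power pins
eth_bw    = 10e9 / 8.0     # 10 Gbit/s = 1.25 GB/s
eth_pins  = 8              # 4 differential pairs on an RJ45

print("DDR3 per pin:     %.0f MB/s" % (ddr3_bw / ddr3_pins / 1e6))   # ~53 MB/s
print("ethernet per pin: %.0f MB/s" % (eth_bw / eth_pins / 1e6))     # ~156 MB/s

The DDR3 channel still wins by an order of magnitude on raw bandwidth,
but per pin (and per board trace) the serial link is already competitive,
which is the crux of the pin-count argument.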

I agree the graphs are flawed, but only because I want performance
comparisons for code/data that fits in L3, so I will know what is
best for a Cell style design.

So my assumptions are that effective latencies are about to go down
radically, not continue the slow march up of the past.

Most apps are single threaded, so the OS will give each of your 16
CPUs its own app to run, and bigger apps will be partitioned to fit
the local RAM of a set of CPUs.

In the longer run a process may jump to the CPU with the code, as
it can be faster and more efficient than moving the code.

iOS for your iPad will likely be the first to convert to the new paradigm.
Economics will dictate the move.
So everyone will be using this hardware, not a fringe phenomenon.

The old rule of infinite CPU demand is dead.
The old rule of infinite DRAM demand is at death's door.
Infinite graphics demand is good for about a decade.

Brett

Robert Myers

Dec 23, 2010, 10:33:27 PM

On Dec 23, 8:16 pm, Brett Davis <gg...@yahoo.com> wrote:

>
> The old rule of infinite CPU demand is dead.
> The old rule of infinite DRAM demand is at death's door.
> Infinite graphics demand is good for about a decade.
>

Do you--does everyone--know something obvious about the future of AI
that I don't?

Robert.

Brett Davis

Dec 24, 2010, 6:16:49 PM

In article
<6711b636-406f-4562...@r29g2000yqj.googlegroups.com>,


A good overview is here:
http://en.wikipedia.org/wiki/Artificial_intelligence

Here is the one approach that has worked; I mentioned this
many months ago, and it has become far more successful than
when I last looked:
http://en.wikipedia.org/wiki/Cyc

Looks like we will have all the parts for a primitive Cylon
in about a decade.
In about two decades maybe it will get to high human IQ.

For performance reasons you would create separate models
optimized with separate data sets, like students taking
different majors at a University.

Maybe 12 models will do. ;)

Will robots dream of electric sheep?

Brett

Robert Myers

Dec 24, 2010, 7:44:41 PM

On Dec 24, 6:16 pm, Brett Davis <gg...@yahoo.com> wrote:
> In article
> <6711b636-406f-4562-864e-eb431a050...@r29g2000yqj.googlegroups.com>,

>  Robert Myers <rbmyers...@gmail.com> wrote:
>
> > On Dec 23, 8:16 pm, Brett Davis <gg...@yahoo.com> wrote:
>
> > > The old rule of infinite CPU demand is dead.
> > The old rule of infinite DRAM demand is at death's door.
> > > Infinite graphics demand is good for about a decade.
>
> > Do you--does everyone--know something obvious about the future of AI
> > that I don't?
>
> > Robert.
>
> A good overview is here:http://en.wikipedia.org/wiki/Artificial_intelligence
>
> Here is the one approach that has worked, I mentioned this
> many months ago, and it has become far more successful than
> when I last looked:http://en.wikipedia.org/wiki/Cyc
>
> Looks like we will have all the parts for a primitive Cylon
> in about a decade.
> In about two decades maybe it will get to high human IQ.
>
> For performance reasons you would create separate models
> optimized with separate data sets, like students taking
> different majors at a University.
>
> Maybe 12 models will do. ;)
>

Interesting pair of taxonomies--AI broken down into slices and all
knowledge broken down into slices (perhaps as few as 12?).

I think I'm less optimistic about the pace of progress in AI than
either you or the Wikipedia article would be, but I am not so
pessimistic that I think AI will fail indefinitely (as, in my
estimation, it has largely done so far--or, at least, fallen so far
short of any reasonable measure of success that it has failed so far).

Thus, I wondered why AI was not included in your list of requirements
for computing power. I assume that AI will more than fill any gap
left by the graphics problem being more or less completely solved.
I assume that the performance/watt figure of merit desired for mobile
applications is infinity.

Robert.

Brett Davis

Dec 25, 2010, 1:04:18 AM

In article
<9db7d64b-8d61-4e9d...@32g2000yqz.googlegroups.com>,

Cell works great as an AI brain, just use tens of thousands in a
hierarchy, and thousands of GPUs for the vision system.
The next generation of Cell will run Unix, so software will not be
the issue it is with today's Cell.

Take a look at today's 1U rack: two CPU dies, with the bulk of the
space and cost used by DRAM, plus hard drives.
Instead use dies with 64 Cell processors and put 64 of them on
the board in a simple 8 x 8 grid. No DRAM chips.

Compare the performance of the old pair of quad cores against
the 4,096 Cell processors and you get a near thousand fold
increase in compute density.
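
Spelling that arithmetic out with the numbers above:

dies_per_board = 64
cells_per_die  = 64
cell_cores     = dies_per_board * cells_per_die   # 4096 Cell processors per 1U
old_cores      = 2 * 4                            # two quad-core dies per 1U

print("%d vs %d cores per 1U, ~%dx" % (cell_cores, old_cores, cell_cores // old_cores))
# -> 4096 vs 8, about 500x: "near thousand fold", give or take a factor of two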

A modern supercomputer will use 10,000 1U rack slices.
That should put us within an order of magnitude of the human brain.

Today's quad cores plus DRAM are buggy whips; ~4 years from now
all of the top 10 supercomputers will be Cell-like or GPU-based.
In ~6 years the top 100 will be dominated by Cells.

It's the software to build a brain that is lagging, not the hardware.
Once that first brain can learn for itself from the internet, that
brain's IQ will quickly grow exponentially.

Brett

PS: Before reading this, most here probably thought I was crazy for
promoting Cell as the future; I was just ahead of the curve, waiting
for the average engineer to catch up.
(Which by definition means I am crazy. ;)

EricP

Dec 25, 2010, 1:57:42 PM

Brett Davis wrote:
>
> It's the software to build a brain that is lagging, not the hardware.
> Once that first brain can learn for itself from the internet, that
> brain's IQ will quickly grow exponentially.

By the way, on Feb 14-16 in North America the IBM "Watson"
natural language AI system will play the two top winners in Jeopardy.

'Jeopardy!' to pit humans against IBM machine
www.physorg.com/news/2010-12-jeopardy-pit-humans-ibm-machine.html

IBM Watson
www.ibmwatson.com

Eric


Robert Myers

Dec 25, 2010, 3:00:51 PM

On Dec 25, 1:04 am, Brett Davis <gg...@yahoo.com> wrote:
.
> It's the software to build a brain that is lagging, not the hardware.

http://news.cnet.com/8301-27083_3-20023112-247.html

"Human brain has more switches than all computers on Earth"

Lots of things about this problem have been mispredicted/misestimated.

Robert.

MitchAlsup

Dec 25, 2010, 6:20:08 PM

On Dec 23, 2:41 pm, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 12/22/2010 3:21 AM, Brett Davis wrote:
>
> > Energy-Performance Trade-offs in Processor Architecture
> > and Circuit Design: A Marginal Cost Analysis
> >http://isca2010.inria.fr/media/slides/Azizi-ISCA2010-EnergyEfficientP...
>
> > from ISCA 2010
> >http://isca2010.inria.fr/index.php?option=com_content&view=article&id...

>
> > Brett
>
> A nice paper, but almost by definition incremental, marginal, and
> derivative.   Which I mean in a mathematical sense, since that is how
> they pose the optimization problem.
>
> The example data presented, showing that 2 architectures dominate the
> "energy efficient" frontier, 2-wide in-order and out-of-order, must be
> taken with a large grain of salt.  

Indeed, no consideration of other-than-Tomasulo reservation station OoO.

For example: no mention of the Thornton scoreboard (or any other
scoreboard); no mention of Value Free Reservation Stations, the
Dispatch Stack, or other general instruction queueing mechanisms.

No consideration of other-than-branch-predicting instruction fetch.
No consideration of really short pipelines a la RISC generation 1 for
the in-order machines.
No consideration of strategies between in-order and out-of-order.

However, the interesting result they did achieve was to find that
optimal performance per power lies in the more-gates-per-cycle range
(20 rather than 16-ish).

All in all nice graduate work, not quite industrial strength.

Mitch

Robert Myers

Dec 25, 2010, 8:09:23 PM

On Dec 25, 6:20 pm, MitchAlsup <MitchAl...@aol.com> wrote:

> All in all nice graduate work, not quite industrial strength.

I'd be willing to give almost anyone points just for trying, but, of
course, I'm not viewing it all from the perspective of you or Andy.

In particular, the cost of exploring the design space seems (to a
naive observer like me) so high that anyone other than AMD and Intel
and now perhaps ARM (or someone like Apple, which can throw away money
with no reasonable hope of return, if it cares to) would find the
enterprise roughly as plausible as attempting to build a business plan
for a vacation condo-sharing operation on Mars.

Suppose that all the requisite simulator software and as much computer
time as you needed were all free. The only thing that isn't free is
human beings with enough wits to submit simulation jobs that mean
something and to interpret the results (which is to say that even
graduate students aren't free enough). Would it make any difference?
Would anyone other than the usual players have the resources to mount
an industrial-strength effort?

Robert.

Paul A. Clayton

Dec 25, 2010, 9:26:40 PM

On Dec 25, 8:09 pm, Robert Myers <rbmyers...@gmail.com> wrote:
> On Dec 25, 6:20 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > All in all nice graduate work, not quite industrial strength.
>
> I'd be willing to give almost anyone points just for trying, but, of
> course, I'm not viewing it all from the perspective of you or Andy.
>
> In particular, the cost of exploring the design space seems (to a
> naive observer like me) so high that anyone other than AMD and Intel
> and now perhaps ARM (or someone like Apple, which can throw away money
> with no reasonable hope of return, if it cares to) would find the
> enterprise roughly as plausible as attempting to build a business plan
> for a vacation condo-sharing operation on Mars.

ISTM that the design space is extraordinarily vast. Even if all
the clever microarchitectural tricks were entirely independent
of all other factors and the economic factors (utility and costs
of specific design points) for all the potential workloads were
well understood, even Intel might not have the resources to master
computer architecture (probably even if limited to x86-based
servers and personal computers).

(I do wonder why certain microarchitectural tricks are not used--
I am still surprised, e.g., that specialized caching has not been
adopted. I also wonder how economically viable a less balanced
processor would be--e.g., I can imagine AMD, as a minority vendor,
targeting slightly inferior median performance with a significantly
superior performance on a few benchmarks [Bulldozer might be such
a processor with superior performance for high thread count
workloads].)

> Suppose that all the requisite simulator software and as much computer
> time as you needed were all free.  The only thing that isn't free is
> human beings with enough wits to submit simulation jobs that mean
> something and to interpret the results (which is to say that even
> graduate students aren't free enough).  Would it make any difference?
> Would anyone other than the usual players have the resources to mount
> an industrial-strength effort?

The graduate students would also need to communicate--a
microarchitectural idea may fail for all implementations the
originator can conceive but mesh beautifully with another
idea and implementation. One also needs to define the
requirements. For personal computers it is not clear how
fitness for purpose can be marketed, and some of the features
of a personal computer are not easily appreciated without
significant use (and the software and software configuration
may have a greater impact than the hardware).

Paul A. Clayton
just a technophile

Terje Mathisen

Dec 26, 2010, 5:02:19 AM

Paul A. Clayton wrote:
> (I do wonder why certain microarchitectural tricks are not used--
> I am still surprised, e.g., that specialized caching has not been
> adopted. I also wonder how economically viable a less balanced
> processor would be--e.g., I can imagine AMD, as a minority vendor,
> targeting slightly inferior median performance with a significantly
> superior performance on a few benchmarks

It is the exact opposite:

Intel have been able to deliver multiple CPUs with one or more severe
Achilles' heels, knowing that pretty much all important sw vendors would
be willing to run their codes through compilers that knew how to avoid
most of the problem areas.

AMD, OTOH, have been forced to deliver much more balanced cores; they
knew they could not get away with stuff like the P4, which would stumble
on both integer MUL and shift operations.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Brett Davis

Dec 27, 2010, 12:51:04 AM

In article
<40a8112c-01a2-4d20...@o4g2000yqd.googlegroups.com>,
Robert Myers <rbmye...@gmail.com> wrote:

> On Dec 25, 1:04 am, Brett Davis <gg...@yahoo.com> wrote:
> .
> > It's the software to build a brain that is lagging, not the hardware.
>
> http://news.cnet.com/8301-27083_3-20023112-247.html
>
> "Human brain has more switches than all computers on Earth"

Obsolete info.

> Lots of things about things about this problem have been mispredicted/
> misestimated.

From Brain Facts and Figures
http://faculty.washington.edu/chudler/facts.html

Average number of neurons in the brain = 100 billion
Today's chips have a billion transistors, so you only need 100 of them.

Number of synapses for a "typical" neuron = 1,000 to 10,000
A supercomputer will have 1,000+ 1U slices.
So a single supercomputer has 1/10th to 1/100th the switches
of a human brain.

EEG - beta wave frequency = 13 to 30 Hz
CPUs run at 3GHz, 100,000 times faster.
The human brain is highly optimized, so let's divide by 100
to be fair; now we are at equality between a supercomputer and
the human brain, computation-wise.
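
The same estimate as a script, so the assumptions are explicit (the
transistors-per-1U figure is my own guess, roughly ten 1-billion-transistor
chips per slice; the rest restates the numbers above):

neurons             = 100e9            # ~100 billion neurons
synapses_per_neuron = (1000, 10000)
brain_switches      = (neurons * synapses_per_neuron[0],   # 1e14
                       neurons * synapses_per_neuron[1])   # 1e15

slices              = 1000             # 1U slices in one supercomputer
transistors_per_1u  = 10e9             # ASSUMPTION: ~10 chips of 1e9 transistors each
machine_switches    = slices * transistors_per_1u          # 1e13

print("brain:   %.0e .. %.0e switches" % brain_switches)
print("machine: %.0e switches" % machine_switches)
print("ratio:   1/%d .. 1/%d" % (brain_switches[0] / machine_switches,
                                 brain_switches[1] / machine_switches))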

If I am off by an order of magnitude, that is only 5 years for
the computer industry to match.
It's software and understanding that we lack, not the hardware to
build a human brain equivalent.

Brett

Paul A. Clayton

Dec 27, 2010, 9:13:44 AM

On Dec 26, 5:02 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:

> Paul A. Clayton wrote:
> > (I do wonder why certain microarchitectural tricks are not used--
> > I am still surprised, e.g., that specialized caching has not been
> > adopted.  I also wonder how economically viable a less balanced
> > processor would be--e.g., I can imagine AMD, as a minority vendor,
> > targeting slightly inferior median performance with a significantly
> > superior performance on a few benchmarks
>
> It is the exact opposite:
>
> Intel have been able to deliver multiple cpus with one or more severe
> achilles' heels, knowing that pretty much all important sw vendors would
> be willing to run their codes through compilers that knew how to avoid
> most of the problems areas.
>
> AMD otoh have been forced to deliver much more balanced cores, they knew
> they could not get away with stuff like the P4 which would stumble on
> both integer MUL and shift operations.

Not quite "exact opposite": I receive the impression that the Pentium
4
glass jaws were not "slightly inferior median performance" issues. I
agree that AMD cannot press for general recompiling of code, but I
also
see AMD as having a greater encouragement to specialize (with a
capacity
of about 20% of the market, generating strong interest in some smaller
groups of users can compensate for weaker interest in most of the
market
[perhaps somewhat explaining popcount and AMD64]; also AMD already
has
difficulty competing head-to-head with Intel due to process
technology
lag and having fewer resources for processor design, so a modest
further
general performance disadvantage may not be as significant [e.g.,
lagging
by about 15% rather than about 10-12%] if some workloads have a
substantial
advantage [e.g.,, 20% or more]). AMD might not be able to afford a
product with a very significant downside, but I was thinking of an
upward long tail (and so lower median) not downward long tail (and so
higher median).

Paul A. Clayton

Dec 27, 2010, 9:55:38 AM

On Dec 27, 12:51 am, Brett Davis <gg...@yahoo.com> wrote:
[snip]

> If I am off by an order of magnitude, that is only 5 years for
> the computer industry to match.
> It's software and understanding that we lack, not the hardware to
> build a human brain equivalent.

It seems to me that one would not want to build a "human brain
equivalent". Such would seem not to match the strengths (and
weaknesses) of the technology we have (e.g., development of
self-modifying hardware seems less further advanced). Such
would also seem to fit less well with the strengths (and
weaknesses) of the human users. If one is going to build a
computer of equivalent magnitude, replicating the functionality
of the human brain seems undesirable (and less practical with
current methods). (Scaling computer hardware seems much better
understood than scaling human brain hardware. This may also
have a significant role in developments for the next few
decades.)

An interesting aspect of computers is the ability to develop
different species. Although there is diversity among human
brains, the diversity is rather constrained.

Eventually, an artificial processor very similar to the source
of human intelligence might be a practical design choice, but
I suspect that superior 'intelligence' will come much earlier
than similar operation. However, I am much more ignorant of
AI than even a typical grad student in that field.

nm...@cam.ac.uk

Dec 27, 2010, 10:04:19 AM

In article <913b0f23-44a3-4215...@m35g2000vbn.googlegroups.com>,

Paul A. Clayton <paaron...@gmail.com> wrote:
>
>Eventually, an artificial processor very similar to the source
>of human intelligence might be a practical design choice, but
>I suspect that superior 'intelligence' will come much earlier
>than similar operation. However, I am much more ignorant of
>AI than even a typical grad student in that field.

Thus demonstrating that you understand it better than they do :-)

My understanding is that most advances consist of throwing away
previously held beliefs, because they have been discovered to be
misleading or even erroneous. There are some areas that are
RELATIVELY simple, like the visual cortex, but even there, a lot
of progress is simply discovering that we know less than we
thought we did.


Regards,
Nick Maclaren.

Brett Davis

Dec 28, 2010, 10:08:26 PM

In article <ggtgp-6ED43D....@netnews.mchsi.com>,

Brett Davis <gg...@yahoo.com> wrote:
> In article <BJedndPnk7yPKY7Q...@giganews.com>,
> "Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> wrote:
> > In particular, the dismissal of 1-wide OOO may make sense for the
> > workload and parameters considered - but "obviously" (in an asymptotic
> > sense) as you increase memory latency, without adding more cache levels,
> > OOO beats in-order, and 1-wide is as good as 2-wide. And the CAMs that
> > make out-of-order power hungry can be changed into less power hungry
> > mechanisms.
>
> You are assuming increased latency over time, but I had thought that
> over the past decade memory latency was basically flat, maybe improving.
> Clock speeds are unchanged over the past decade, the prime driver of latency.
> RAM has changed from DDR to DDR3, bigger packets, but much higher clock,
> so you get your data faster?

I remember DDR1 being ~66 MHz a decade ago, DDR4 will be 2133 MHz.
32 times faster clock, and we have moved from a 32-bit bus to 128+ bits.

DDR3 latencies are about 10ns, from:
http://en.wikipedia.org/wiki/Ddr3

SDRAM has open row read times of 10ns, and CAS latency of 10-15ns:
http://en.wikipedia.org/wiki/SDRAM

We have gone from single core to quad, so you can have some contention
that will sometimes add an ~8-cycle packet wait.
But when you are waiting ~200 cycles normally anyway, that is lost in the noise.
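
Putting that in core-cycle terms (component latencies are ballpark guesses
in line with the Wikipedia figures above, plus a made-up allowance for the
controller and on-chip interconnect):

core_ghz = 3.0
trcd_ns  = 13     # row activate to column command (tRCD), ballpark
cas_ns   = 13     # CAS latency, ballpark
other_ns = 40     # controller queuing + on-chip interconnect, my guess

total_ns = trcd_ns + cas_ns + other_ns
print("~%d ns miss latency = ~%d core cycles" % (total_ns, total_ns * core_ghz))
# -> ~66 ns, ~198 cycles: roughly the ~200 cycles mentioned above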

I am shocked to find out I was right.
DRAM latency is basically unchanged over the past decade.

Meanwhile transistor density has doubled 5 times, a 32 times improvement.
So the average DRAM has 32 times the RAM.

Brett


Alternatives:

Racetrack RAM is promising 20-32ns reads/writes, and a manyfold density improvement.
http://en.wikipedia.org/wiki/Racetrack_memory
http://www.channelregister.co.uk/2010/12/24/ibm_racetrack_memory/
http://latimesblogs.latimes.com/technology/2010/12/ibm-racetrack-memory.html
http://news.cnet.com/8301-11386_3-20026553-76.html

"Now that we are at the development phase, it's more a question of obtaining
this significant investment to build the prototypes quickly."

Out of research, now begging for billions to build real chips?
Methinks GlobalFoundries is more than eager to take on Samsung.
(IBM only has one fab, a small one, and it is busy.)

RRAM is showing <0.3ns switching times; I don't know what the normal addressing overheads are.
http://en.wikipedia.org/wiki/Resistive_random-access_memory

Torben Ægidius Mogensen

Jan 4, 2011, 5:07:35 AM

Brett Davis <gg...@yahoo.com> writes:

> Energy-Performance Trade-offs in Processor Architecture
> and Circuit Design: A Marginal Cost Analysis
> http://isca2010.inria.fr/media/slides/Azizi-ISCA2010-EnergyEfficientProcessors-FinalDistributable.pdf

Some marginally interesting (but hardly surprising) points. But I
found it odd that the graphs that compare a large number of CPUs from
the mid-80s to today do not include any ARM processors, especially since
the article is about power efficiency.

Torben
