
Superscalar vs. Superpipelined


heinz

Oct 14, 2016, 3:08:40 PM
Today's high-performance processor designs seem to have converged on roughly a 4-way superscalar, 20-pipeline-stage design. Why? After all, an 80-way single-stage design or a scalar 80-stage design should deliver the same theoretical peak performance.

What are the tradeoffs of the three designs proposed above?

Peak Performance:
Peak throughput should be the same for all designs, although the pipelined architecture's latency increases because of the interstage registers.

Resources:
The pipelined design requires interstage registers as additional resources.
The superscalar design increases resources linearly with the number of ways, and some resources scale quadratically (forwarding paths between ALUs). Diversified pipelines avoid some duplication by, e.g., providing fewer integer ALUs than ways -> but that leads to structural hazards.

Dependencies:
In the case of RAW dependencies both architectures suffer to the same degree. Mispredictions/flushes should have a similar impact too.

Clocking:
Clock skew becomes a more significant issue for superpipelined designs with high clock speeds than for larger (superscalar) designs.

Scheduling:
In a 4-way superscalar design we have to schedule 4 instructions per cycle, although the clock frequency is 4 times slower than in a scalar design. Nevertheless, there are non-linearities which imho should favor the pipelined design.

Atomicity:
At some point it becomes impossible to break operations down into smaller pipestages (max clock speed is limited; resource duplication is not).

Verdict:
From the above, it seems that superpipelining is the superior approach, limited only by the atomicity of operations and clock skew. Hence, when building a new processor that should be able to execute 80 instructions concurrently, should we first max out pipelining and only then go superscalar?
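
A back-of-envelope sketch of the resource and dependency points above (the cost weights are illustrative assumptions, not measurements):

    # Three hypothetical designs with the same peak of 80 operations in flight.
    designs = [
        ("80-way, 1 stage ", 80,  1),
        ("4-way, 20 stages",  4, 20),
        ("1-way, 80 stages",  1, 80),
    ]
    for name, ways, stages in designs:
        in_flight  = ways * stages   # instructions exposed to a flush: same for all
        bypasses   = ways * ways     # ALU-to-ALU forwarding paths grow ~quadratically
        latch_sets = stages          # interstage registers grow ~linearly with depth
        print(f"{name}: in-flight={in_flight}, bypass paths~{bypasses}, latch sets~{latch_sets}")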

Ivan Godard

Oct 14, 2016, 4:14:21 PM
The design has converged because of scaling limits in classical von
Neumann architecture.

Anyone can build a machine with lots of functional units; novices assume
that it's all about the ISA and invent new instructions. In reality,
it's all about figuring out what to do next, and getting access to what
you are going to do it to. Broadly, it is about decode and data paths. These
are single-cycle issues that are not alleviated by superpipelining.

When we started there were two known ways to break the decode scaling
limit: multiple decoders, as in barrel processors, and multiple data per
instruction, as in GPUs and vector ops. Both work very well when the app
is shaped like the architecture: lots of threads with no latency
constraints (barrels), and embarrassing data parallelism (vector). The
rest of the apps? Not so much.

On the data side, arbitrary source to arbitrary destination (e.g.
general register designs) don't scale. Part of the problem is the sheer
complexity of the crossbar, but most of it is due to hazards that arise
from uncoordinated writing of intermediates.

Superpipelining doesn't help the per-cycle constraints. If you can only
decode four instructions per cycle then you can smear those across
pipeline stages but you will still only get a throughput of four per
cycle. All superpipelining can do is lower the per-stage gate count and
so raise the clock rate of those cycles. That works, and we've gone to
that well a lot, but it seems pretty dry by now. And the more stages the
more you waste in mispredict recovery. If your control flow is
statically known, as in stream processors and many DSPs then you never
miss and you can pipe until the clock stutters for other reasons.
Regular apps - not so much.

We don't pipe much; we want a low clock for power and yield reasons,
although it's good to know that we can run up the clock later if we
want. But that means that we need to find parallelism in other ways, and
that meant we needed to crack the decode and crossbar scaling limits.

We haven't changed the polynomial scaling of decode, but we have changed
the constants in the formula. There are two decoders in each Mill, and
three blocks of ops in each instruction, so we have 6x the decode
capacity. So long as you don't have a linear parse (e.g. x86) you can
decode 4-8 ops per cycle without wire problems. Times six and we can
decode 24-48 ops per cycle, and the decode scaling constraint has become
a code bandwidth constraint - for us.
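
Restating that decode arithmetic as a trivial sketch (the 4-8 ops per block figure is the one assumed in the text):

    # Two decoders per Mill, three blocks of ops per instruction = 6x capacity.
    decoders, blocks_per_instruction = 2, 3
    for ops_per_block in (4, 8):
        print(decoders * blocks_per_instruction * ops_per_block,
              "ops decoded per cycle at", ops_per_block, "ops per block")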

We haven't changed the scaling of the data crossbar either. However, we
got rid of the hazards completely, by making the communication
temporaries Single Assignment. And we pushed the availability issue (a
scoreboard or rename/issue queue in conventional hardware) onto the
compiler. In present memory tech that makes the crossbar constraint into
a data bandwidth constraint. However, advances in on-chip memory may
push us back into the crossbar in the future, and it's not clear whether
we'll be able to scale beyond say 50 operations per cycle if that happens.

heine...@gmail.com

Oct 14, 2016, 5:20:29 PM
Hi Ivan,

I understand why the architecture has converged (limited ILP, complexity, etc.). I also understand that there are alternative designs, but this is not my question.

My question is: What are the tradeoffs between superscalar and superpipelined?

For example, regarding your decoding example: let's assume decoding a CISC (variable-length) instruction takes 1ns. Would you rather have a cycle time of 4ns and decode four instructions per cycle, or a scalar pipeline running at a 1ns cycle time?

In other words, decoding is equally hard in superpipelined and superscalar processors so why favor one over the other?

Ivan Godard

Oct 14, 2016, 6:21:18 PM

Quadibloc

Oct 14, 2016, 6:21:53 PM
I also think that decode is a red herring. One could arrange to decode the
whole program before execution starts; we choose not to do it that way because
there is no need, everything else is the bottleneck.

But the other concern Ivan raised is the critical one.

Data paths.

Typically, in a program, one will have a lot of instructions that depend on
each other. The most natural way to write a program is with each instruction
depending on the one immediately before.

Compilers and other things - out-of-order execution, RISC instruction sets with
larger register files - let one extend that a bit, to have maybe three
instructions doing other stuff between one instruction and the next one that
depends on it.

A pipelined architecture deals nicely with that; it performs several
instructions in parallel, but at different stages of execution, *on the same
hardware*, so when the next dependent instruction comes along, the hardware is
ready for it.

So computer A is pipelined: it has a cycle time of X, and instructions have
latencies of 4X.

The superscalar equivalent of that would be computer B: it has a cycle time of
4X, and instructions have latencies of 4X.

So computer B has to perform addition or multiplication *just as fast* as
computer A, but cost 1/4 as much, only because it isn't pipelined; it doesn't
have the ability to work on extra instructions in between the instructions that
it is already working on.

Sometimes that might work, if doing the instruction, like multiplication or
division, involves many repetitive steps, for each of which separate hardware
is needed in a fully-pipelined computer. But often, that doesn't work.

So pipelining is favored because it gives you more performance from fewer
transistors.

John Savard

Ivan Godard

Oct 14, 2016, 6:44:33 PM
On 10/14/2016 2:20 PM, heine...@gmail.com wrote:
> Hi Ivan,
>
> I understand why architecture has converged (limited ILP, complexity,
> etc) I also understand that there are alternative designs but this is
> not my question.
>
> My question is: What are the tradeoffs between superscalar and
> superpipelined?
>
> For example regarding your decoding example. Lets assume decoding a
> CISC (variable length) instruction takes 1ns. Would you rather have a
> cycle time of 4ns and decode four instructions per cycle or a scalar
> pipeline running at 1ns cycle time?
>
> In other words, decoding is equally hard in superpipelined and
> superscalar processors so why favor one over the other?
>

Unfortunately it's a false dichotomy: you cannot in general swap pipes
for stages.

If you can decode four in 4ns, you may be able to decode one in 1.5 ns
and four in 6ns. There is a fixed interstage communication overhead,
independent of any useful work done by the stage.
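
A sketch of that overhead effect, with the per-stage overhead picked (as an assumption) so that it reproduces the 1.5 ns / 6 ns figures above:

    # Four instructions decoded as one 4 ns block, versus the same work
    # split into four pipelined steps with a fixed interstage overhead.
    work_for_four  = 4.0   # ns of real decode work for four instructions
    stage_overhead = 0.5   # ns assumed per stage boundary (latch + clocking)

    unpipelined_cycle = work_for_four                       # 4 ns, 4 instructions/cycle
    pipelined_stage   = work_for_four / 4 + stage_overhead  # 1.5 ns, 1 instruction/cycle

    print(f"unpipelined: {4 / unpipelined_cycle:.2f} instr/ns, latency {unpipelined_cycle} ns")
    print(f"pipelined:   {1 / pipelined_stage:.2f} instr/ns, latency {4 * pipelined_stage} ns")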

CISC generally requires a linear parse, and so does your example,
because if decode were parallel and you could do one in 1ns then you
could do 4 in 1ns too. But the linear parse is sequential. The fix is to
dump the linear parse - better engineers than you or I have already
gotten all they can out of piping.

The tradeoff is somewhat between latency and throughput, but in practice
chips need both. A modern high-end chip is already at the scaling limit
in both clock (SP) and wayness (SS) for a given power and yield
envelope. Intel does really good work; there is no wider or longer
available even if you are willing to swap for it. An 8-way 4GHz chip
cannot become a 1-way 32GHz chip without melting. And it can't become a
32-way 1GHz chip because it can't parse 32 instructions, and even if it
could there would not be useful ILP in a conventional encoding for it to
use.

We broke those limits - or rather, we pushed them out a ways - by
rethinking from scratch how CPUs work. But even on a Mill, the times
when we can usefully use all the slots on a high-end Mill are rare.
However, there are apps where the critical loop can indeed saturate a
Mill - and the Gold and its high-end brethren are intended for that
rather specialized market.

If you have such markets in mind - some HPC work, stream processing,
some DSP - then I encourage you to think about the needs of the app,
rather than starting a priori from an architecture that even Intel agrees
is antiquated.

heine...@gmail.com

Oct 14, 2016, 7:40:06 PM
Everything both of you say is correct, but my question is still not answered. I still don't understand why a 4-way, 20-pipestage processor is better than a 3-way, 27-stage processor.

> available even if you are willing to swap for it. An 8-way 4GHz chip
> cannot become a 1-way 32GHz chip without melting.

Why? P = C*f*V^2. It doesn't matter whether we double the ways (C) or double the frequency/pipestages.
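
(At constant voltage the equation indeed treats the two options the same; the usual caveat is that raising the frequency tends to require raising the voltage too. A quick sketch, with a made-up 20% voltage bump:)

    # Dynamic power P ~ C * f * V^2, normalized to a baseline of 1.0.
    C, f, V = 1.0, 1.0, 1.0
    base        = C * f * V**2
    wider       = (2 * C) * f * V**2           # double the ways at the same V: 2.0
    faster_same = C * (2 * f) * V**2           # double the clock at the same V: also 2.0
    faster_real = C * (2 * f) * (1.2 * V)**2   # double the clock with ~20% more V: 2.88
    print(base, wider, faster_same, round(faster_real, 2))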

> And it can't become a
> 32-way 1GHz chip because it can't parse 32 instructions, and even if it

Why? If you have enough cycle time available, say a minute, you can decode 100s of instructions per cycle - even sequentially. You might have to do multiple predictions per cycle because multiple loop bodies have to be executed concurrently in the super-wide pipeline, etc., but I see no issue in principle here. If there is, please quantify.

What are the non-linear factors here that make one approach better than the other? We both mentioned one: data paths between ALUs grow quadratically with the number of ALUs. However, I can't find a non-linearity that works in favor of superscalar. As a result, the scalar 32GHz chip should be superior (ignoring clock skew and atomicity, see above).

> could there would not be useful ILP in a conventional encoding for it to
> use.

A scalar 32GHz chip requires exactly the same amount of ILP as a 4-way 8 GHz chip to deliver the same performance if everything else is kept the same.

Unlike you, I am not trying to build a new chip to break the market; I am just trying to understand the status quo.

heine...@gmail.com

Oct 14, 2016, 7:44:15 PM

> So pipelining is favored because it gives you more performance from fewer
> transistors.

I agree, so why do we do superscalar at all? I mentioned some reasons (atomicity, clock skew, diversified ALUs) but is there anything else?

Rick C. Hodgin

Oct 14, 2016, 8:04:49 PM
heinz wrote:
> I still don't understand why a 4-way, 20 pipestage processor
> is better than a 3-way 27 stage processor.

There's a tradeoff in general purpose code execution regarding raw
compute and instructional branching. A 27-stage pipeline would
take ~35% longer to refill on a mispredicted branch.
On most code it will suffer a regular hefty performance penalty
on branches that are not well-predicted.
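
A rough sketch of how that penalty accumulates (the 1-in-6 branch frequency and 5% misprediction rate are assumptions for illustration, and the refill cost is approximated as the full pipeline depth):

    branch_every_n  = 6      # assumed: one branch per six instructions
    mispredict_rate = 0.05   # assumed misprediction rate
    for stages in (20, 27):
        lost = mispredict_rate * stages / branch_every_n   # cycles lost per instruction
        print(f"{stages}-stage pipe: ~{lost:.3f} refill cycles lost per instruction")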

Intel took their Prescott iteration of Pentium-4 to 31 stages, up from 20
in the prior cores (Willamette and Northwood). It required special
high speed RDRAM to keep it fed, and it suffered severe penalties for
mispredicted branches. When it was computing without mispredicted
branches it was an amazing performer. Intel also introduced hint
instructions and assumed branch directions to aid compilers in speeding
up code to compensate for the full restarts which occurred more
often than they would prefer.

Ultimately, because of the high cost of the special RDRAM and the penalties
associated with the rigors of running traditional code, they stepped back
to more traditional pipeline depths, presumably as the easier
method of attacking general purpose code.

Best regards,
Rick C. Hodgin

Heinz

Oct 14, 2016, 8:39:02 PM
Thanks Rick, but I have to disagree again:

> There's a tradeoff in general purpose code execution regarding raw
> compute and instructional branching. A 27-stage pipelie would
> take ~33% longer to refill the pipeline on missed branch predictions.
> On most code it will suffer a regular hefty performance penalty
> on branches that are not well-predicted.

I claim that both designs suffer the same from a misprediction. The number of flushed instructions is the same; they reside either in a deeper or in a wider pipeline.

(Of course the 27-stage design suffers more than a 20-pipestage design of the same width, but keep in mind that we also changed the number of ways.) If I remember correctly, the P4 was 4-way superscalar already.

>
> Intel took their Prescott iteration of Pentium-4 to 31 stages, up from 20
> in the prior cores (Willamette and Northwood). It required special
> high speed RDRAM to keep it fed, and it suffered severe penalties for
> mispredicted branches.

The same issues arise if you go wider instead of deeper. You have to provide more instructions and suffer more from mispreds. Again the number of in-flight instructions is the same in both designs.

Rick C. Hodgin

Oct 14, 2016, 8:53:42 PM
Heinz wrote:
> I claim that both designs suffer the same from a misprediction.
> The number of flushed instructions is the same, they are
> residing either in a deeper or wider pipeline.

Work is not computed until it's gone through the pipeline. The 20-stage
pipe is retiring instructions ~25% faster than the 27-stage pipe. When
the average x86 instruction stream branches every six instructions,
that adds up to something.

> (Of course the 27 pipe design suffers more than a 20 pipestage
> design of the same width but keep in mind that we also changed
> the number of ways). If I remember well P4 was 4-way superscalar
> already.

Willamette needed a notably higher clock to achieve the same
throughput as PIII, because PIII had a 10-stage integer pipeline.
Prescott also needed a notably higher clock speed to perform well
with such a deep pipe. It also ran double-clocked ALUs and the
special higher-speed memory. It had special needs shorter pipes
didn't.

The longer pipeline's primary disadvantage is in refilling. The penalties can be
significant on average code, and there's a reason why the experts
typically stick to 20 or fewer stages.

Quadibloc

Oct 14, 2016, 11:05:19 PM
Well, after you have achieved the maximum possible performance from pipelining,
and you still want even more performance, superscalar is the next thing that
remains available.

For thermal reasons, and because an Earle latch still requires extra gates,
even if it adds no delay, generally speaking pipeline stages in computers are
not made shorter than about eight gate delays - and they're usually
significantly longer than that.

John Savard

Noob

Oct 15, 2016, 4:19:03 PM
On 15/10/2016 01:40, heine...@gmail.com wrote:

> All both of you say is correct but my question is still not answered.
> I still don't understand why a 4-way, 20 pipestage processor is
> better than a 3-way 27 stage processor.

Intel tried to go really deep with the P4.
One problem is branch mispredicts, which flush the pipe AFAIU.

Regards.

Bruce Hoult

Oct 15, 2016, 6:11:14 PM
That applies to every CPU. But Intel tried to do something clever in P4, with something called a "trace cache" instead of a normal L1 instruction cache.

Exact details of how it worked are hard to find, and most accounts seem to be from before or at the same time as the first P4 release, so I have no idea how or if it changed in later iterations.

The basic idea is to record the actual instructions (micro-ops actually) executed along a quite long execution path, in and out of functions, past conditional branches.

If a function executes differently depending on where it is called from, but consistently from each call site, the trace cache can start in the calling function and then correctly predict the behaviour in the called function. Loops with small fixed loop counts can be in effect unrolled in the trace cache with both the branches back to the start of the loop AND the loop exit predicted correctly!
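
A toy sketch of the general idea only (nothing like the real P4 mechanism): traces are keyed by the entry address plus recent branch outcomes, so the same code can have different cached traces depending on how it was reached.

    trace_cache = {}

    def record_trace(entry_pc, branch_history, micro_ops):
        trace_cache[(entry_pc, tuple(branch_history))] = micro_ops

    def fetch(entry_pc, branch_history):
        # Hit: issue the whole pre-decoded, pre-predicted trace at once.
        return trace_cache.get((entry_pc, tuple(branch_history)))

    record_trace(0x400100, [True, False], ["uop_load", "uop_add", "uop_cmp", "uop_jcc"])
    print(fetch(0x400100, [True, False]))    # hit: the recorded trace
    print(fetch(0x400100, [False, False]))   # miss (None): decode the slow way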

Mozilla did a similar thing in software for Javascript in their web browser, with "TraceMonkey". Testing whether values contain integers is very important to JavaScript, but the tests take more time than the actual operations, so it's very important to correctly predict the results of the tests.

Both Pentium 4 and TraceMonkey showed amazing results on a lot of benchmarks. But they both fell down horribly in many complex real-world situations. You could compare them to a boxer with a massively powerful punch but a glass jaw.

Anton Ertl

Oct 16, 2016, 8:52:48 AM
Noob <ro...@127.0.0.1> writes:
>Intel tried to go really deep with the P4.
>One problem is branch mispredicts, which flush the pipe AFAIU.

I don't think that is the reason they did not go deeper. They have
good branch predictors that made it look promising to go for even
deeper pipelines than that of the Prescott (31 stages). The ISCA 2002
proceedings contain several papers that explore the idea of machines
with very deep pipelines with high clock rates; in particular,
Sprangle and Carmean found 52 stages would be optimal for the Pentium
4 microarchitecture (and with the better branch predictors of today,
even deeper pipelines would be even better).
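
A crude back-of-envelope model of the tradeoff the paper quantifies properly (all of the constants below - the total gate depth, the latch overhead, and the misprediction rate - are invented for illustration and are not taken from Sprangle and Carmean):

    def relative_perf(stages, total_gate_delay=300.0, latch_overhead=5.0,
                      mispredicts_per_instr=0.02):
        cycle = total_gate_delay / stages + latch_overhead   # gate delays per cycle
        cpi   = 1.0 + mispredicts_per_instr * stages         # flush cost grows with depth
        return (1.0 / cycle) / cpi

    base = relative_perf(20)
    for s in (20, 31, 52, 80):
        print(f"{s} stages: {relative_perf(s) / base:.2f}x a 20-stage design")
    # Deeper helps for a while, then the flush cost wins: the optimum sits
    # somewhere in the dozens of stages for these made-up constants.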

However, the high clock rates did not appear in practice. The
Prescott only achieved 3.8GHz (<10% faster than the 20-stage
Gallatin); the Tejas, which was supposed to reach 7GHz, was cancelled,
apparently because of heat problems (already at 2.8GHz the Tejas
emitted 150W of heat according to
<https://en.wikipedia.org/wiki/Pentium_4#Successor>); and I guess that
the heat is more concentrated in a small area in such a CPU than in a
GPU with hundreds of units.

What I wonder about is why this problem was not foreseen a few years
earlier, leading Intel to predict 7GHz for Tejas in September 2003,
and then canceling it in May 2004 (it was initially supposed to be
available in 2004).

@InProceedings{sprangle&carmean02,
author = {Eric Sprangle and Doug Carmean},
title = {Increasing Processor Performance by Implementing
Deeper Pipelines},
crossref = {isca02},
pages = {25--34},
annote = {This paper starts with the Willamette (Pentium~4)
pipeline and discusses and evaluates changes to the
pipeline length. In particular, it gives numbers on
how lengthening various latencies would affect IPC;
on a per-cycle basis the ALU latency is most
important, then L1 cache, then L2 cache, then branch
misprediction; however, the total effect of
lengthening the pipeline to double the clock rate
gives the reverse order (because branch
misprediction gains more cycles than the other
latencies). The paper reports 52 pipeline stages
with 1.96 times the original clock rate as optimal
for the Pentium~4 microarchitecture, resulting in a
reduction of 1.45 of core time and an overall
speedup of about 1.29 (including waiting for
memory). Various other topics are discussed, such as
nonlinear effects when introducing bypasses, and
varying cache sizes. Recommended reading.}
}
@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Nick Maclaren

Oct 16, 2016, 9:35:44 AM
In article <2016Oct1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>Noob <ro...@127.0.0.1> writes:
>>Intel tried to go really deep with the P4.
>>One problem is branch mispredicts, which flush the pipe AFAIU.
>
>I don't think that is the reason they did not go deeper. They have
>good branch predictors that made it look promising to go for even
>deeper pipelines than that of the Prescott (31 stages). ...

On benchmarketing code suites, yes. The problem with all extreme
designs is that they suck for anything that doesn't fit their use
pattern - as did the Pentium 4. It doesn't help the user who needs to
run one of the other 10% of codes, and is badly affected by its running
time, that they are better for 90% of codes and on average.

Programs with lots of unpredictable branches are exactly such a
case, as are ones with lots of integer multiplication in the
addressing pipeline. I have had users hit by both, and the answer
was that they would be better off with an older, 'slower' computer.

>However, the high clock rates did not appear in practice. The
>Prescott only achieved 3.8GHz (<10% faster than the 20-stage
>Gallatin); the Tejas, which was supposed to reach 7GHz, was cancelled,
>apparently because of heat problems (already at 2.8GHz the Tejas
>emitted 150W of heat according to
><https://en.wikipedia.org/wiki/Pentium_4#Successor>); and I guess that
>the heat is more concentrated in a small are in such a CPU than in a
>GPU with hundreds of units.
>
>What I wonder about is why this problem was not foreseen a few years
>earlier, leading Intel to predict 7GHz for Tejas in September 2003,
>and then canceling it in May 2004 (it was initially supposed to be
>available in 2004).

It was. The executive suits chose to believe the engineers who told
them what they wanted to hear, rather than those who told the facts
as they were. Note that the power figures weren't predicted in detail,
but they were known to increase rapidly with shrinking process size and
clock rate; it wasn't clear whether the rate of increase would itself
increase, and it was completely unknown whether anything could be done
to keep it under control.

The rest is history.


Regards,
Nick Maclaren.

Quadibloc

Oct 16, 2016, 11:07:53 AM
On Sunday, October 16, 2016 at 7:35:44 AM UTC-6, Nick Maclaren wrote:

> On benchmarketing code suites, yes. The problem about all extreme
> designs is that they suck for anything that doesn't fit their use
> pattern - as did the Pentium 4. It doesn't help if they are better
> for 90% of codes and on average for the user who needs to use one
> of the 10% and is badly affected by its running time.

> Programs with lots of unpredictable branches are exactly such a
> case, as are ones with lots of integer multiplication in the
> addressing pipeline. I have had users hit by both, and the answer
> was that they would be better off with an older, 'slower' computer.

You know what they say, "You can't stop progress"!

There is an obvious workaround, which Intel already has in their
microprocessors - HyperThreading.

With simultaneous multithreading, a microprocessor can behave as if it is twice
as many cores with half as many pipeline stages each.

So if they had found a way to deal with the thermal issues, with some care and
attention, a processor with a 52-stage pipeline could have been better in every
way than what had gone before. However, Intel would have had to have gone to
*at least* four-way SMT instead of just two-way in order to achieve that,
because even 27 stage pipelines are 'way out there. I would be inclined to
recommend 32-way SMT myself, if it were achievable.

In that way, a fancy 8 GHz processor would no longer be embarrassed by the
superior performance of a 250 MHz processor of simpler design on those programs which
are better suited to the latter. And it would offer super fast performance on
newer programs designed for it.

John Savard

Nick Maclaren

Oct 16, 2016, 12:09:09 PM
In article <aa1067a2-e604-421b...@googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>> On benchmarketing code suites, yes. The problem about all extreme
>> designs is that they suck for anything that doesn't fit their use
>> pattern - as did the Pentium 4. It doesn't help if they are better
>> for 90% of codes and on average for the user who needs to use one
>> of the 10% and is badly affected by its running time.
>
>> Programs with lots of unpredictable branches are exactly such a
>> case, as are ones with lots of integer multiplication in the
>> addressing pipeline. I have had users hit by both, and the answer
>> was that they would be better off with an older, 'slower' computer.
>
>You know what they say, "You can't stop progress"!
>
>There is an obvious workaround, which Intel already has in their
>microprocessors - HyperThreading.
>
>With simultaneous multithreading, a microprocessor can behave as if it is twice
>as many cores with half as many pipeline stages each.

I don't want to be rude, but I doubt that you have tested that out
in practice. Even if it were true (which it very rarely is), it
doesn't help with the time each program takes to complete.


Regards,
Nick Maclaren.

Megol

Oct 16, 2016, 5:22:33 PM
On Sunday, October 16, 2016 at 5:07:53 PM UTC+2, Quadibloc wrote:
> You know what they say, "You can't stop progress"!

Physics and economics seem to do a great job of doing just that.

> There is an obvious workaround, which Intel already has in their
> microprocessors - HyperThreading.
>
> With simultaneous multithreading, a microprocessor can behave as if it is twice
> as many cores with half as many pipeline stages each.

No. What you are talking about is a barrel processor and even a barrel processor will feel the impact of a very deep pipeline.

Don't get me wrong - I think barrel processors should be used more - but it isn't a competitor with SMT. Different reasons for existence, different techniques.

> So if they had found a way to deal with the thermal issues, with some care and
> attention, a processor with a 52-stage pipeline could have been better in every
> way than what had gone before. However, Intel would have had to have gone to
> *at least* four-way SMT instead of just two-way in order to achieve that,
> because even 27 stage pipelines are 'way out there. I would be inclined to
> recommend 32-way SMT myself, if it were achievable.

Of course it is achievable. But why would one do that instead of having several cores that have a less deep pipeline and can be powered down if not needed for a certain workload?

There are two parts to power consumption in a processor: static power (leakage+) and dynamic power. The latter would increase in an ultra-SMT processor with an ultra-deep pipeline, as pipeline stages have overheads. The former would also increase, as it is hard to power down pipeline stages to reduce leakage.

> In that way, a fancy 8 GHz processor would no longer be embarrassed by superior
> performance by a 250 MHz processor of simpler design on those programs which
> are better suited to the latter. And it would offer super fast performance on
> newer programs designed for it.

It wouldn't.

Stephen Fuld

Oct 16, 2016, 6:58:18 PM
On 10/16/2016 8:07 AM, Quadibloc wrote:
> On Sunday, October 16, 2016 at 7:35:44 AM UTC-6, Nick Maclaren wrote:
>
>> On benchmarketing code suites, yes. The problem about all extreme
>> designs is that they suck for anything that doesn't fit their use
>> pattern - as did the Pentium 4. It doesn't help if they are better
>> for 90% of codes and on average for the user who needs to use one
>> of the 10% and is badly affected by its running time.
>
>> Programs with lots of unpredictable branches are exactly such a
>> case, as are ones with lots of integer multiplication in the
>> addressing pipeline. I have had users hit by both, and the answer
>> was that they would be better off with an older, 'slower' computer.
>
> You know what they say, "You can't stop progress"!
>
> There is an obvious workaround, which Intel already has in their
> microprocessors - HyperThreading.
>
> With simultaneous multithreading, a microprocessor can behave as if it is twice
> as many cores with half as many pipeline stages each.


Actual tests show that isn't true in general. While SMT can help, it
certainly doesn't double throughput, as there is still some contention
within the processor, and it certainly reduces the effective cache size
per process, which reduces the hit rate and thus increases the time to
complete, as well as increasing contention at the lower memory levels.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Quadibloc

Oct 16, 2016, 11:26:05 PM
On Sunday, October 16, 2016 at 3:22:33 PM UTC-6, Megol wrote:
> On Sunday, October 16, 2016 at 5:07:53 PM UTC+2, Quadibloc wrote:

> > With simultaneous multithreading, a microprocessor can behave as if it is twice
> > as many cores with half as many pipeline stages each.

> No. What you are talking about is a barrel processor and even a barrel processor will feel the impact of a very deep pipeline.

Ah, I thought that simultaneous multithreading was implemented by what was
basically a barrel processor, the difference being that a barrel processor
rotates through N threads on a fixed cycle, whereas with SMT, cycling through
the threads that are active, up to N, is more flexible.

If that is not true, I do have a fundamental misunderstanding about SMT.

John Savard

Quadibloc

Oct 16, 2016, 11:29:57 PM
On Sunday, October 16, 2016 at 10:09:09 AM UTC-6, Nick Maclaren wrote:
> Even if it were true (which it very rarely is), it
> doesn't help with the time each program takes to complete.

Well, not helping with latency doesn't mean, though, that the person with a
program poorly adapted to a deep pipeline would be *better off* with an older,
"slower" processor; now, the two are about the same.

Even _without_ SMT, a processor with a 20 stage pipeline should be the equal of
a processor with a 10 stage pipeline and half the frequency; SMT is just to
avoid waste of throughput. And of course that's an oversimplification, as what
really counts is the execute cycles in each individual instruction.

John Savard

Quadibloc

Oct 16, 2016, 11:35:10 PM
On Sunday, October 16, 2016 at 3:22:33 PM UTC-6, Megol wrote:
> On Sunday, October 16, 2016 at 5:07:53 PM UTC+2, Quadibloc wrote:

> > So if they had found a way to deal with the thermal issues, with some care and
> > attention, a processor with a 52-stage pipeline could have been better in every
> > way than what had gone before. However, Intel would have had to have gone to
> > *at least* four-way SMT instead of just two-way in order to achieve that,
> > because even 27 stage pipelines are 'way out there. I would be inclined to
> > recommend 32-way SMT myself, if it were achievable.

> Of course it is achievable. But why would one do that instead of having
> several cores that have a less deep pipeline and can be powered down if not
> needed for a certain workload?

Because some programs are well adapted to a deep pipeline, and on that design,
having multiple cores at a lower frequency would mean those programs could no
longer run as fast.

If you have two ways to design a processor, which both yield the same
throughput for a given number of transistors, but the second way gives higher
peak performance on suitably optimized programs - with both ways giving equal
performance on older programs not so optimized - the second way is preferable.

Of course, though, the ability to power down some cores could be decisive in an
ultra-low-power design; usually, the limiting factor is not the chip's average
power consumption, but its maximum power consumption, which depends on how one
cools it. So, yes, if you're looking at having to worry about *battery life*,
you would play by different rules.

John Savard

Stephen Fuld

Oct 17, 2016, 2:41:07 AM
Apparently you do. See

https://en.wikipedia.org/wiki/Simultaneous_multithreading

Basically, think of a single-thread superscalar processor. It picks up
(hopefully) multiple operations from a queue and passes them to the
functional units. The limit is the number of functional units or the
within-queue instruction dependencies, whichever is smaller - usually the
dependencies.

Now assume you have two queues, with instructions "tagged" by which
queue they come from. Also there is a second register set associated
with the second queue. Now the scheduler can select instructions from
both queues up to the limit of the number of functional units and the
respective queue's dependencies.

In an ideal situation, you can use the functional units left over from
one queue to service instructions from the other queue. But the reality
is far less optimistic. :-(
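
A little sketch of that two-queue picture (issue width and ready counts are made up; dependencies are reduced to a per-queue "ready" count):

    def issue(ready_q0, ready_q1, functional_units=4):
        take0 = min(ready_q0, functional_units)
        take1 = min(ready_q1, functional_units - take0)
        return take0, take1

    # Thread 0 is dependence-limited to 2 this cycle; thread 1 uses the leftovers.
    print(issue(ready_q0=2, ready_q1=3))   # -> (2, 2)
    # Thread 0 can fill the machine by itself; nothing is left for thread 1.
    print(issue(ready_q0=4, ready_q1=3))   # -> (4, 0)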

Quadibloc

Oct 17, 2016, 3:06:09 AM
That's true, but don't you get cache problems with multiple cores too?

John Savard

Stephen Fuld

Oct 17, 2016, 3:35:49 AM
Only at levels where the cache is shared among multiple cores. The ones
closest to the cores are typically not shared among cores, so they don't
suffer in a multi-core system but do in an SMT system.

Nick Maclaren

Oct 17, 2016, 4:12:01 AM
In article <nu1v00$2op$1...@dont-email.me>,
Right. You still get the contention, of course, which is why a lot
of scalable programs run better using only half the available cores.
In the future, I expect that to drop, so some will use even fewer.

SMT wasn't quite a boondoggle, but it was pretty close. I was disgusted
when I looked at the discrepancies between what the papers actually
showed and what their abstracts claimed they did, which is why SMT
has never been scaled up much. On the Intel MIC, Intel recommend
its use for increasing the effective vector lengths (i.e. for more
data parallelism). God alone knows what IBM and Fujitsu systems
do - I have never seen an in-depth, unbiassed report of it.


Regards,
Nick Maclaren.

Anton Ertl

Oct 17, 2016, 5:56:34 AM
Some might ask: If you use, say, 2-way-SMT on a single core to replace
2 cores, why not let the single core have twice as much L1 cache?

Because a larger cache would require a longer latency and more power
consumption.
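
One common rule of thumb, used here purely as an illustration, is that SRAM access time grows roughly with the square root of capacity:

    from math import sqrt
    base_kb, base_latency_cycles = 32, 4          # assumed baseline L1
    for kb in (32, 64, 128):
        latency = base_latency_cycles * sqrt(kb / base_kb)
        print(f"{kb} KB L1 -> ~{latency:.1f} cycle load-use latency")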

Quadibloc

Oct 17, 2016, 1:52:13 PM
They still suffer a reduction in effective cache size, since now the chip real estate is divided between the cache levels that are not shared.

Of course, the size of L1 cache is limited by other factors in order for it to
run at the expected speed, so it is true that SMT reduces the amount of L1
cache available to a process, and that may have been your point.

John Savard

Niels Jørgen Kruse

Oct 17, 2016, 4:34:57 PM
Nick Maclaren <n...@wheeler.UUCP> wrote:

> Right. You still get the contention, of course, which is why a lot
> of scalable programs run better using only half the available cores.
> In the future, I expect that to drop, so some will use even fewer.

Is that a roundabout way to say that you expect more available cores?

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Anton Ertl

Oct 18, 2016, 2:34:34 AM
n...@wheeler.UUCP (Nick Maclaren) writes:
>In article <2016Oct1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>What I wonder about is why this problem was not foreseen a few years
>>earlier, leading Intel to predict 7GHz for Tejas in September 2003,
>>and then canceling it in May 2004 (it was initially supposed to be
>>available in 2004).
>
>It was. The executive suits chose to believe the engineers that told
>them what they wanted to hear, rather than those that told the facts
>as they were.

And what did they tell them? Did they think that Tejas would eat less
power than it did, or that the increased power could be cooled away
like all the times before (remember when the 30W of the 21064 or the
13W of the original Pentium were considered excessive?). If the
latter, why didn't it work? I remember someone posting here a few
years earlier that the cooling people had told him that they could
manage 1000W/cm^2 (with heat pipes IIRC).

Nick Maclaren

Oct 19, 2016, 4:33:45 PM
In article <2016Oct1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>
>>>What I wonder about is why this problem was not foreseen a few years
>>>earlier, leading Intel to predict 7GHz for Tejas in September 2003,
>>>and then canceling it in May 2004 (it was initially supposed to be
>>>available in 2004).
>>
>>It was. The executive suits chose to believe the engineers that told
>>them what they wanted to hear, rather than those that told the facts
>>as they were.
>
>And what did they tell them? Did they think that Tejas would eat less
>power than it did, or that the increased power could be cooled away
>like all the times before (remember when the 30W of the 21064 or the
>13W of the original Pentium were considered excessive?). If the
>latter, why didn't it work? I remember someone posting here a few
>years earlier that the cooling people had told him that they could
>manage 1000W/cm^2 (with heat pipes IIRC).

Yes, they did. But the customers told them to sod off. I was perhaps
the first person to put the metric MFlops/KW into a procurement
contract but, within 2 years of my doing it, it was widespread.

What the real process engineers said was that they didn't know how
to increase the clock rate and shrink the process without using more
power, and so they could not do so. But the 'can do' brigade had a
more popular message - though, of course, they couldn't do it.


Regards,
Nick Maclaren.

Stefan Monnier

Oct 20, 2016, 10:13:45 AM
> From above, It seems that superpipelining is the superior approach only
> limited by atomicity of operations and clock skew. Hence, when
> building a new processor that should be able to execute 80 instructions
> concurrently, should we first max out pipelining and only then
> go superscalar?

My guess is that the current status quo is roughly due to the following:
- in the abstract, increasing pipeline depth is indeed a better option
than widening the pipe. So we want to use it as much as possible.
- but in a pipeline stage, some percentage of the time/delay/gates is
taken by the need to latch for the next stage. Say the latching costs
"1 gate delay". Then in a design with a "20 gate delay cycle", the
"pipelining overhead" is 5%, but as you pipeline deeper, this
percentage increases.

I don't know the cost of latching, but I suspect that it's no less than
"1 gate delay". And IIUC current designs have a cycle time equivalent
to 10-20 gate delays. This "pipelining" overhead is of course not just
impacting cycle time but also power consumption.
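
The same point as numbers, using the "1 gate delay to latch" assumption from the text:

    latch_delay = 1                    # gate delays spent latching per stage
    for cycle in (20, 10, 5, 3):       # total gate delays per cycle
        useful = cycle - latch_delay
        print(f"{cycle}-gate-delay cycle: {useful} useful, "
              f"{latch_delay / cycle:.1%} pipelining overhead")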

Of course, there might also be other factors:
- deeper pipelining means higher frequency, which means transistors
work harder. It might be more difficult to evacuate the locally
generated heat than in a superscalar design where the same heat is
spread out over more transistors.
- The atomicity issues you mention.


Stefan

paul wallich

Oct 20, 2016, 12:05:27 PM
On 10/20/16 10:13 AM, Stefan Monnier wrote:
>> From above, It seems that superpipelining is the superior approach only
>> limited by atomicity of operations and clock skew. Hence, when
>> building a new processor that should be able to execute 80 instructions
>> concurrently, should we first max out pipelining and only then
>> go superscalar?
>
> My guess is that the current status quo is roughly due to the following:
> - in the abstract, increasing pipeline depth is indeed a better option
> than widening the pipe. So we want to do use as much as possible.
> - but in a pipeline stage, some percentage of the time/delay/gates is
> taken by the need to latch for the next stage. Say the latching costs
> "1 gate delay". Then in a design with a "20 gate delay cycle", the
> "pipelining overhead" is a 5%, but as you pipeline deeper, this
> percentage increases.

At some point, do you run into some kind of minimum amount of work that
you can effectively do in a single pipe stage?

Ivan Godard

Oct 20, 2016, 1:20:35 PM
Yes: getting the clock to the stage, clocking it, and latching the
interstage is not free. That's why the OP question is bogus: it asserted
that width and depth are equivalent, and they are not.

MitchAlsup

Nov 5, 2016, 4:28:31 PM
On Friday, October 14, 2016 at 2:08:40 PM UTC-5, Heinz wrote:
> Today's high performance processor designs seem to have converged to roughly a 4-way superscalar, 20 pipelinestage design. Why?

I have built a 6-wide machine with a 7 stage pipeline (1992)
I have built a 4-wide machine with a ~30 stage pipeline (2005)

In both cases, the floating point multiplier was 54-56 gates deep. In the former design, that multiplier was a 3-clock latency unit, while in the latter it was an 8-clock latency unit. Running the data paths this fast just requires nice hard work, no magic involved.

But what happens in the control unit is completely unexpected. The novice would assume that splitting a stage into two stages results in two stages. It turns out that when you split a 5 stage pipeline into half as many gates per clock as before, you get a 15 stage pipeline, not a 10 stage pipeline.

In effect, every time you split 2 16-gate stages into 4 8-gate stages you, instead, get 6 8-gate stages. This effect occurs because certain things cannot be "pipelined". Every time there is a pipeline feedback loop (forwarding, branch misprediction, load miss, ...) these things get really nasty to split.

Another, more subtle effect has also transpired. One can build an almost-as-fast machine at 20 gates per clock with a pipeline 5 stages shorter than Opteron's. But to take the design from 20 gates to 16 gates adds those 5 more pipeline stages. At the time, we had just gotten branch prediction "good enough" to cover this and get a few more percent of performance instead of the 25% we would have gotten from edumacated guess work.

And then there is the cache hierarchy. One can make small caches with low latency or larger caches with higher latency. 2-cycle loads look good on small benchmarks and fail miserably on database applications. 3-cycle loads deliver a bit lower performance on toy applications, but a lot better performance on real applications. The 30-stage machine above had to use a 4-cycle load - just getting the generated address to the SRAMs took a whole cycle. We even debated short-circuiting Load Alignment under certain circumstances, but ultimately pounded the logic into submission.

So the reason general purpose machines have stabilized where they are is a combination of power, performance, area, and cost. Given a less general purpose nature, one can get 10%-25% better performance at reasonable power and area--but only at the cost of sucking at more general purpose workloads. The 10%-25% is just never enough to abandon general purpose for something a bit better. The costs are simply too high.

Mitch

already...@yahoo.com

Nov 6, 2016, 8:40:58 AM
On Saturday, November 5, 2016 at 10:28:31 PM UTC+2, MitchAlsup wrote:
> On Friday, October 14, 2016 at 2:08:40 PM UTC-5, Heinz wrote:
> > Today's high performance processor designs seem to have converged to roughly a 4-way superscalar, 20 pipelinestage design. Why?
>
> I have built a 6-wide machine with a 7 stage pipeline (1992)

MC88120?
I didn't realize that it was THAT wide!
Even the MIPS R8000 was narrower, and it was a multi-chip design shipped a couple of years later.

> I have built a 4-wide machine with a ~30 stage pipeline (2005)

One of the K9 candidate designs? Or the only one?

MitchAlsup

Nov 6, 2016, 9:58:39 AM
On Sunday, November 6, 2016 at 7:40:58 AM UTC-6, already...@yahoo.com wrote:
> On Saturday, November 5, 2016 at 10:28:31 PM UTC+2, MitchAlsup wrote:
> > On Friday, October 14, 2016 at 2:08:40 PM UTC-5, Heinz wrote:
> > > Today's high performance processor designs seem to have converged to roughly a 4-way superscalar, 20 pipelinestage design. Why?
> >
> > I have built a 6-wide machine with a 7 stage pipeline (1992)
>
> MC88120?
> I didn't realize that it was THAT wide!

Yes, and it could run MATRIX 300 at 5.92 IPC including cache misses, TLB misses, and DRAM refresh cycles.

> Even MIPS R8000 was narrower, and it was a multichip shipped couple of years later.

> > I have built a 4-wide machine with a ~30 stage pipeline (2005)
>
> One of K9 candidate designs? Or the only?

The only one, and it was within a few months of tapeout when the power wall hit.

Mitch

already...@yahoo.com

Nov 6, 2016, 10:24:27 AM
On Sunday, November 6, 2016 at 4:58:39 PM UTC+2, MitchAlsup wrote:
> On Sunday, November 6, 2016 at 7:40:58 AM UTC-6, already...@yahoo.com wrote:
> > On Saturday, November 5, 2016 at 10:28:31 PM UTC+2, MitchAlsup wrote:
> > > On Friday, October 14, 2016 at 2:08:40 PM UTC-5, Heinz wrote:
> > > > Today's high performance processor designs seem to have converged to roughly a 4-way superscalar, 20 pipelinestage design. Why?
> > >
> > > I have built a 6-wide machine with a 7 stage pipeline (1992)
> >
> > MC88120?
> > I didn't realize that it was THAT wide!
>
> Yes, and it could run MATRIX 300 at 5.92 IPC including cache misses, TLB misses, and DRAM refresh cycles.
>

What is MATRIX 300 ?

> > Even MIPS R8000 was narrower, and it was a multichip shipped couple of years later.
>
> > > I have built a 4-wide machine with a ~30 stage pipeline (2005)
> >
> > One of K9 candidate designs? Or the only?
>
> The only, and within a few months of tapeout when the power wall hit.
>
> Mitch

Whose fault?
Were the process people overoptimistic?
Or did the marketing people not realize at the time that they couldn't sell a CPU that consumes more than 100-110W to mass consumers?
Or your folks, the architects?




Anton Ertl

Nov 6, 2016, 12:17:20 PM
Maybe you can shed more light on why projects such as K9 and Tejas
were pursued that far, and then killed. Didn't you have power in your
sights? Or did you assume it could be dealt with? Or something else?

MitchAlsup

Nov 6, 2016, 4:51:57 PM
On Sunday, November 6, 2016 at 9:24:27 AM UTC-6, already...@yahoo.com wrote:
> On Sunday, November 6, 2016 at 4:58:39 PM UTC+2, MitchAlsup wrote:
> > On Sunday, November 6, 2016 at 7:40:58 AM UTC-6, already...@yahoo.com wrote:
> > > On Saturday, November 5, 2016 at 10:28:31 PM UTC+2, MitchAlsup wrote:
> > > > On Friday, October 14, 2016 at 2:08:40 PM UTC-5, Heinz wrote:
> > > > > Today's high performance processor designs seem to have converged to roughly a 4-way superscalar, 20 pipelinestage design. Why?
> > > >
> > > > I have built a 6-wide machine with a 7 stage pipeline (1992)
> > >
> > > MC88120?
> > > I didn't realize that it was THAT wide!
> >
> > Yes, and it could run MATRIX 300 at 5.92 IPC including cache misses, TLB misses, and DRAM refresh cycles.
> >
>
> What is MATRIX 300 ?

An original SPEC benchmark.
>
> > > Even MIPS R8000 was narrower, and it was a multichip shipped couple of years later.
> >
> > > > I have built a 4-wide machine with a ~30 stage pipeline (2005)
> > >
> > > One of K9 candidate designs? Or the only?
> >
> > The only, and within a few months of tapeout when the power wall hit.
> >
> > Mitch
>
> Whose fault?

I was asked to build an 8-gates-per-clock design.

> Process people were overoptimistic?

I was told "Frequency is your friend".

> Or marketing people didn't realize at time that they can't sell CPU that consumes more than 100-110W to mass consumers?

The real question is: why can you sell a 300W GPU but not a 300W CPU?

> Or your folks, architects?

Managers.

Quadibloc

Nov 6, 2016, 5:58:03 PM
Hmm.

I was aware that GPU cards often had big fans on them, but compared to the
bigger fan and heatsink combos sitting on CPUs, I hadn't realized that GPUs got
to use more power than CPUs.

I wonder if this has any connection to the fact that the Kaby Lake i7
processors are *dual-core* units, but with fancier on-chip GPU power (dual core
with HyperThreading used to mean i3, so i7 has lost its meaning, the meaning it
had right up through Skylake, even for laptop parts).

All right... the current high-end is the Nvidia GeForce GTX 1080, and that has
a TDP of 180 watts. Below 300 watts, and slightly below 200 watts - but well
above 100 watts.

And the GPU, while very important to gamers, isn't as central to everything a
computer does as the CPU.

Current Intel Xeon Phi products *do* have a TDP of 300 watts. So people wanting
the ultimate in parallel CPU power _are_ willing to take on that much power
consumption.

But regular Intel products indeed have lower power - a recent Intel Extreme
Edition CPU has a TDP of 140 watts.

And Wikipedia tells me the Pentium 4 topped out at 115 watts.

I do remember a news item about Intel selling - with little publicity - special
high-speed editions of its chips to a select clientele, such as automated stock
market traders, that come at a premium price but which manage to finish their
calculations before the other fellow's computer.

My guess as to "why" would be diminishing returns, plus increased public
consciousness about energy consumption. Multiple cores are sold as offering
increased performance at lower energy costs; that is true for some tasks, but
not others - but the general computer user doesn't need to get latency down to
the lowest possible level.

And the people wanting answers to very big problems quickly - the HPC crowd -
they've *had* to accept running many CPUs in parallel for their very big
problems.

John Savard

paul wallich

Nov 6, 2016, 7:52:43 PM
On 11/6/16 5:58 PM, Quadibloc wrote:
[...]
>> On Sunday, November 6, 2016 at 9:24:27 AM UTC-6, already...@yahoo.com wrote:
>
>>> Or marketing people didn't realize at time that they can't sell CPU that
>>> consumes more than 100-110W to mass consumers?
[...]
> My guess as to "why" would be diminishing returns, plus increased public
> consciousness about energy consumption. Multiple cores are sold as offering
> increased performance at lower energy costs; that is true for some tasks, but
> not others - but the general computer user doesn't need to get latency down to
> the lowest possible level.

It's not just the diminishing returns, it's the knee of the cost curve.
The GPU folks have shown that you can go up to (roughly) 200W aircooled
if you take your thermal design really seriously, but it's not easy. On
the CPU side, more than 100W seems to almost always be accompanied by
interesting cooling technologies that require fairly close monitoring
and regular maintenance to keep the magic smoke inside. (Where "regular
maintenance" means "more often than the expected life of the machine",
but we're talking about mass consumers here.)

If you look at mass computer cost numbers, they haven't changed all that
much in the past 20 years or so. Anything that put a substantial delta
on your price would by definition not be mass.

paul

Torbjorn Lindgren

Nov 6, 2016, 8:20:13 PM
Quadibloc <jsa...@ecn.ab.ca> wrote:
>I wonder if this has any connection to the fact that the Kaby Lake i7
>processors are *dual-core* units, but with fancier on-chip GPU power
>(dual core with HyperThreading used to mean i3, so i7 has lost its
>meaning, the meaning it had right up through Skylake, even for laptop
>parts).

No, that's just Intel not launching the larger Kaby Lakes until they
have yields under control and/or they have enough production capacity,
exactly as they've done with several other launches before this.

Right now there are only low (U) and ultra-low (Y) power versions of
Kaby Lake and in those segment i7 has *ALWAYS* meant dual-core+HT,
just like the current Kaby Lake i7's.

So until higher-power Kaby Lake chips are available everyone uses
Skylake when they need the bigger/higher TDP parts, this is why Apple
just launched several Skylake laptops. We know there's a bunch of
desktop quad-core Kaby Lake parts coming in Q1, the date for the
higher power laptop models aren't known yet (but likely Q1 or even
Q2).

Yes, it sucks that i3/i5/i7 has no real meaning between the various
segments but that's been true since they made those names up. The same
term has always meant very different things on desktop, laptop (35W+)
and low/ultra-low power.


>All right... the current high-end is the Nvidia GeForce GTX 1080, and
>that has a TDP of 180 watts. Below 300 watts, below 200 watts
>slightly - but well above 100 watts.

AFAIK AMD's top-end is still Fury X with a TDP of 275W and their
second-rank card is the 390X also with a TDP of 275W.

And while you're correct about the GTX 1080 TDP, it isn't really
Nvidia's top-end; that's the Titan X Pascal with a 250W TDP. If AMD
had anything even remotely competitive with the 1080 there would be a
1080 Ti card with a 250-275W TDP... Because they don't need to market
them as such, they instead sell them as the Titan X Pascal at a much
higher markup for the AI & heuristics markets, but despite the labelling
it's definitely the top-end Nvidia graphics card and is being used as such.

250W-275W has been the most common TDP for top-end graphics cards
for the last 5+ GPU generations, and it's actually more
likely to exceed 275W than to go below 250W. If AMD Vega performs like
everyone expects, both sides' top-end cards will very likely be in that
250-275W range "soon" again...

During that period there have also been several graphics cards at
300+W TDP, including one where they had to artificially limit it to
get it down to "only" 375W due to PCI-E slot limits so they could sell
it as a PCI-E card! But pretty much everyone who bought it flipped the
"go faster" switch, at which point it has a ~425W TDP; if you didn't plan
to flip that switch you would have bought a different (cheaper) card...


I also need to point out that these are all official TDPs for
manufacturer-clocked cards. In this segment probably more than 50% of
all cards are sold "factory overclocked", which also implies higher or
much higher TDPs... 200W+ TDP isn't uncommon for many OC 1080's, and I
suspect many OC 390X cards are in the 325W TDP range (Fury/Fury X is
much less likely to come with wild overclocks).


>But regular Intel products indeed have lower power - a recent Intel
>Extreme Edition CPU has a TDP of 140 watts.

Looks like Intel's current top-end is the Xeon E5-2679 v4 with a TDP
of 200W; then there are a few E5/E7 v3 and v4 models with a 165W TDP.
Still, most Intel server CPUs are in the 80-95W or 120-135W TDP band
(and then probably 65W). OTOH, dual-socket servers are relatively
common, which doubles the CPU power usage (quad-socket is a much, much
smaller market).


>I do remember a news item about Intel selling - with little publicity
>- special high-speed editions of its chips to a select clientele,
>such as automated stock market traders, that come at a premium price
>but which manage to finish their calculations before the other
>fellow's computer.

Intel will happily make custom combinations of number of cores,
frequency and turbo boost if you order enough chips; it's not exactly
announced, but neither is it really a secret. Oracle, Amazon (Cloud) and
Azure (MS Cloud) are all known to have special editions and it's not
hard to find out how those are configured.

Oracle's version is probably the most technically interesting; AFAIK
the other known ones are just core-vs-speed combinations that Intel
didn't think would sell enough, while Oracle's version is "flexible" in
a way their normal chips aren't.

https://www.extremetech.com/computing/187055-intel-releases-rare-details-of-its-customized-oracle-cpus-and-there-a-lot-more-to-come

I assume the reason "normal" Intel chips don't offer this is that it
requires quite a bit of additional testing, which costs money, and
using it requires special modifications to the OS. If Intel senses a
wider need it may show up in broader deployment.

Oracle certainly has the margins on the product they choose to use it
in to pay Intel a significant premium for CPUs with this feature if
they deemed it worthwhile (which they apparently did).

Anton Ertl

unread,
Nov 7, 2016, 3:45:18 AM11/7/16
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>I wonder if this has any connection to the fact that the Kaby Lake i7
>processors are *dual-core* units, but with fancier on-chip GPU power (dual core
>with HyperThreading used to mean i3, so i7 has lost its meaning, the meaning it
>had right up through Skylake, even for laptop parts).

My Core i3-3227U (Ivy League) has two cores and hyperthreading, too.
The only meaning of i3, i5, i7 that is consistent across power classes
is: i7 is more expensive than i5, which is more expensive than i3.

already...@yahoo.com

unread,
Nov 7, 2016, 4:34:15 AM11/7/16
to
On Monday, November 7, 2016 at 10:45:18 AM UTC+2, Anton Ertl wrote:
> Quadibloc <jsa...@ecn.ab.ca> writes:
> >I wonder if this has any connection to the fact that the Kaby Lake i7
> >processors are *dual-core* units, but with fancier on-chip GPU power (dual core
> >with HyperThreading used to mean i3, so i7 has lost its meaning, the meaning it
> >had right up through Skylake, even for laptop parts).
>
> My Core i3-3227U (Ivy League) has two cores and hyperthreading, too.
> The only meaning of i3, i5, i7 that is consistent across power classes
> is: i7 is more expensive than i5, which is more expensive than i3.
>

I am afraid that you are giving Intel's consistency too much credit.
Look at the Core i3-6167U. It is more expensive than the majority of Core i5-6xxxU models.

Anton Ertl

unread,
Nov 7, 2016, 4:57:51 AM11/7/16
to
I am sure there would be a market for 300W CPUs that are a bit faster
than the fastest CPU available now; Nick Maclaren would not buy them,
but there are people who would. And AMD has been selling the FX-9590
with a TDP of 220W, and Anandtech overclocked an FX-8150 to consume
369W (overclocking the FX-9590 did not produce higher performance due
to automatic throttling).

That brings us back to the question: Why were Tejas and K9 cancelled,
and why didn't they realize the problems that caused the cancellation
earlier?

One thing that was sometimes mentioned at the time was power density.
Even if the cooling people can cool 1000W/cm^2 (i.e. 10W/mm^2), if,
say, 200W arise in an area of, say, 2mm^2, that is 100W/mm^2, ten
times the power density above, and these 2mm^2 may get too hot. That
would be a reason why we get multi-cores with 220W TDP, but not lower
core counts with similar TDP (the FX-9590 has 4 modules with 8 integer
cores). If that is the reason, why didn't they realize the problem
earlier? Were they expecting a cooling technology that did not work out?
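
For concreteness, the unit conversion as a throwaway sketch (the
200W-in-2mm^2 figures are just the hypothetical numbers above, not
data):

#include <stdio.h>

int main(void)
{
    /* assumed cooling limit from the paragraph above */
    double coolable_w_per_cm2 = 1000.0;
    double coolable_w_per_mm2 = coolable_w_per_cm2 / 100.0; /* 1 cm^2 = 100 mm^2 */

    /* hypothetical hot block from the paragraph above */
    double hotspot_watts    = 200.0;
    double hotspot_area_mm2 = 2.0;
    double hotspot_w_per_mm2 = hotspot_watts / hotspot_area_mm2;

    printf("coolable: %.0f W/mm^2, hotspot: %.0f W/mm^2 (%.0fx over)\n",
           coolable_w_per_mm2, hotspot_w_per_mm2,
           hotspot_w_per_mm2 / coolable_w_per_mm2);
    return 0;
}

/* prints: coolable: 10 W/mm^2, hotspot: 100 W/mm^2 (10x over) */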

already...@yahoo.com

unread,
Nov 7, 2016, 5:40:47 AM11/7/16
to
On Sunday, November 6, 2016 at 11:51:57 PM UTC+2, MitchAlsup wrote:
> On Sunday, November 6, 2016 at 9:24:27 AM UTC-6, already...@yahoo.com wrote:
> > On Sunday, November 6, 2016 at 4:58:39 PM UTC+2, MitchAlsup wrote:
> > > On Sunday, November 6, 2016 at 7:40:58 AM UTC-6, already...@yahoo.com wrote:
> > > > On Saturday, November 5, 2016 at 10:28:31 PM UTC+2, MitchAlsup wrote:
> > > > > On Friday, October 14, 2016 at 2:08:40 PM UTC-5, Heinz wrote:
> > > > > > Today's high performance processor designs seem to have converged to roughly a 4-way superscalar, 20 pipelinestage design. Why?
> > > > >
> > > > > I have built a 6-wide machine with a 7 stage pipeline (1992)
> > > >
> > > > MC88120?
> > > > I didn't realize that it was THAT wide!
> > >
> > > Yes, and it could run MATRIX 300 at 5.92 IPC including cache misses, TLB misses, and DRAM refresh cycles.
> > >
> >
> > What is MATRIX 300 ?
>
> An original SPEC benchmark.

Thank you.
Unfortunately, spec.org does not think that we have to know about Spec89. As far as they are concerned, life started at Spec92.
So, the only info I was able to find about matrix300 is sharp criticism of this bench in H&P. Apart from saying that it is a bad benchmark ("99% of the execution time was in a single line"), they mention that it does eight different 300x300 matrix multiplications.

I still don't fully understand the "eight different multiplications" part, but that probably does not matter. What does matter is that for matrix multiplication IPC is the last thing I want to know. Achieved FLOPS/Hz, on the other hand, is moderately interesting. Not as interesting as FLOPS/W and FLOPS/$, but interesting nevertheless.
Do you remember how many FLOPS/Hz you got on matrix300 ?

Comparison with POWER2 and R8000 would be interesting.
As far as I remember, R8000 was doing close to 3 FLOPS/Hz on this sort of easy benchmark, or even 3 FLOPS/Hz with hand tuning, but it consisted of 8 chips and ran at a rather slow clock (75 MHz).
POWER2 performed a little better, but was also somewhat more expensive. Its main strength, though, was in less easy cases than matrix multiplication.
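
For reference, the bookkeeping I mean by FLOPS/Hz, as a throwaway
sketch (the cycle count is a made-up placeholder, not a measurement
from any of these machines):

#include <stdio.h>

int main(void)
{
    long long n = 300;
    /* a dense NxN matrix multiply does ~2*N^3 FP ops
       (one multiply + one add per inner-loop step) */
    long long flops  = 2 * n * n * n;   /* 54,000,000 for N=300 */
    long long cycles = 30000000;        /* hypothetical cycle count */

    printf("FP ops: %lld, FLOPS/Hz = FP ops / cycles = %.2f\n",
           flops, (double)flops / (double)cycles);
    return 0;
}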

Torbjorn Lindgren

unread,
Nov 7, 2016, 8:11:38 AM11/7/16
to
<already...@yahoo.com> wrote:
>On Monday, November 7, 2016 at 10:45:18 AM UTC+2, Anton Ertl wrote:
>> My Core i3-3227U (Ivy League) has two cores and hyperthreading, too.
>> The only meaning of i3, i5, i7 that is consistent across power classes
>> is: i7 is more expensive than i5, which is more expensive than i3.
>
>I am afraid that you are giving to Intel's consistency too much credit.
>Look at Core i3-6167U. It is more expensive than majority of Core
>i5-6xxxU models.

Yeah, if we trust Wikipedia the Skylake-U and Kaby Lake-U i3/i5 prices
are wonky.

Which is NOT something I'd be comfortable assuming! These are
non-socketed processors that don't have any official list price, and
it's known that the pricing, especially for the U/Y parts, is, well,
let's just call it variable. I.e. pricing depends heavily on volume
(single? tray of 100/1000/10k/100k), "tier" (how much other things
you buy) and other factors.

As a result the pricing on all non-retail Intel processors can be
somewhat fuzzy, and for non-socketed ones like these the prices are
pretty much guesses at best.

If there wasn't a price difference for the machine manufacturers or
some other factor involved, Intel wouldn't ever sell any Skylake-U or
Kaby Lake-U i3s except when there are no i5s available. Empirical
evidence suggests there are plenty of i3 models out there, so that
doesn't seem to be the case.

So there's either a price difference or Intel is using some other
factor to force the manufacturers' hands; one obvious way would be if
Intel has some kind of quota forcing them to buy a mix.

Quadibloc

unread,
Nov 7, 2016, 8:25:21 AM11/7/16
to
On Monday, November 7, 2016 at 2:57:51 AM UTC-7, Anton Ertl wrote:
> If that
> is the reason, why didn't they realize the problem earlier? Were they
> expecting a cooling technology that did not work out?

I strongly doubt that the reason is as straightforward as that. Cooling
technologies are mature; it's the silicon that's at the cutting edge.

There was news back when they banned CFCs, but that was quite some time ago.

Yes, one would think that a big company like Intel wouldn't make mistakes like
an individual might make - assuming that doing "more of the same" would keep on
working until experience forces one to recognize that the approach that worked
so well in the past is a dead end.

As has already been noted, a 300 W TDP isn't technically impossible to keep
cool. And so the blame has been assigned to a lack of consumer demand. But
since the product's development was never completed, and so it was never
actually offered for sale, that isn't _quite_ the story either.

So what changed appears to have been *anticipated* consumer demand, and
definitely stuff like that happens in management and not engineering.

John Savard

Quadibloc

unread,
Nov 7, 2016, 8:44:22 AM11/7/16
to
On Sunday, November 6, 2016 at 10:17:20 AM UTC-7, Anton Ertl wrote:

> Maybe you can shed more light at why projects such as K9 and Tejas
> were pursued that far, and then killed. Didn't you have power in your
> sights? Or did you assume it could be dealt with? Or something else?

The K9 was the codename for more than one AMD project. Presumably the first K9,
an eight-issue superscalar successor to the K8, is meant. It is referred to as
having been "tremendously complex", and so it could have been found to be
unimplementable.

And here's a contemporary news account about Tejas:

http://www.theinquirer.net/inquirer/news/1044825/the-intel-tejas-affair-explained

Their story is that the 65nm process became available earlier than expected.
Prescott could be moved to 65nm quickly. Tejas could not, it would take time to
change it over to 65nm. Actually selling the 90nm Tejas would create market
confusion, so it was dropped.

Now, though, if Tejas was so much better, why didn't a 65nm Tejas eventually
come out as the successor to 65nm Prescott? Apparently Cedar Mill was better.

So my suspicion would be that Tejas was a very ambitious design, and thus there
was fear that by the time 65nm Tejas was ready, they'd be moving to 45nm and so
on. *Add* those delays to some feedback Intel might have been getting in the
2005-2008 period that the market was getting more concerned about energy
consumption - maybe the government was breathing on Dell and other OEMs about
their fleet mileage - and a change in direction, abandoning a project that had
fallen behind schedule _anyway_, becomes more plausible.

It doesn't make sense to abandon a chip design that's better than the other
ones you have just because it got delayed in the past, but it's ready now.

It seems mysterious that there would be a sudden turnaround in thinking that
300 watts is just peachy to deciding it's too scary to foist on the consumer.

But I could believe that doubts about the high power consumption *plus* a
project plagued by delays could lead to it getting dropped... or, maybe more
accurately, lost in the shuffle. If a large effort was needed to get Tejas back
on the rails, and it kept looking as if the market was less and less interested
in that sort of thing (multicore Prescott and Cedar Mill selling like hotcakes)
then the decision not to make the effort to scale it not from 90nm to 65nm but
straight to 50nm or whatever starts looking more believable.

John Savard

already...@yahoo.com

unread,
Nov 7, 2016, 9:00:30 AM11/7/16
to
On Sunday, November 6, 2016 at 11:51:57 PM UTC+2, MitchAlsup wrote:
>
> The real question is why can yo sell a 300W GPU but not a 300W CPU?
>

I'd guess the answer is - you can sell 300W CPUs. About as many of them as 300W GPUs or, at best, twice as many.

So the next question is why GPU companies consider it worthwhile and Intel/AMD do not.
I can see two reasons:
1. A 300W GPU uses the same microarchitecture as mass-market 2-50W GPUs, so designing it is a smaller investment than designing/validating a CPU with a completely different microarchitecture.
As a matter of fact, despite limited demand, CPU companies do sell >150W models, but these models are all cheap to develop, because they are variants of either mass-market desktop or mass-market server parts.

2. The difference in selling price between an "enthusiast" 300W GPU and a mass-market 50W GPU is big. CPU companies do not think that they can charge a similar premium for "enthusiast" CPUs. Or, rather, they can, but then they will sell too few. There are technical reasons for it too - 300W GPUs are many times faster than 50W GPUs, and one cannot realistically expect the same performance disparity with CPUs (see the sketch below). The memory wall alone is a sufficient reason, but there are other reasons too.
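
To illustrate the memory-wall point, a roofline-style back-of-envelope
(all numbers below are illustrative placeholders, not measurements):

#include <stdio.h>

/* attainable FLOP/s = min(peak FLOP/s, bandwidth * arithmetic intensity) */
static double attainable(double peak_gflops, double bw_gbs, double flops_per_byte)
{
    double memory_bound = bw_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    double intensity = 0.25;  /* FLOPs per byte; typical of streaming kernels */

    /* hypothetical 50W-class CPU vs a 300W-class CPU with 6x the FP units
       but the same memory system */
    printf("50W-class:  %.1f GFLOP/s\n", attainable(100.0, 50.0, intensity));
    printf("300W-class: %.1f GFLOP/s\n", attainable(600.0, 50.0, intensity));
    return 0;
}

/* both print 12.5 GFLOP/s: once a workload is memory-bound, the extra FP
   horsepower (and the extra watts) buys nothing */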

Oh, what boring stuff I am writing here. Nothing half-original :(


MitchAlsup

unread,
Nov 7, 2016, 9:21:57 AM11/7/16
to
On Monday, November 7, 2016 at 4:40:47 AM UTC-6, already...@yahoo.com wrote:

> Unfortunately, spec.org does not think that we have to know about Spec89. As far as they concerned, the life started at Spec92.
> So, the only info I was able to find about matrix300 is a sharp criticism of this bench in H&P. Apart from saying that it is bad benchmark ("99% of the execution time was in a single line"), they mention that it does eight different 300x300 matrix multiplications.

It performs A*B, A*B^T, A^T*B, and A^T*B^T where ^T denotes transpose row and column. While it is somewhat indicative of BLAS activities, the matrix ends up fitting in too many caches. In my design, however, it did not fit, and the processor was taking a cache miss every other cycle.
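
For readers without the source handy, loop skeletons of the four
variants - a sketch in row-major C, not the benchmark source (which
IIRC is Fortran). The point is only the access pattern: transposing an
operand flips it between unit-stride and stride-N walks through memory.

#define N 300

/* C = A * B : A walked by rows (unit stride in k), B by columns (stride N) */
void mul_ab(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

/* C = A * B^T : both operands walked by rows (unit stride in k) */
void mul_abt(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[j][k];
            C[i][j] = s;
        }
}

/* A^T * B and A^T * B^T follow the same pattern with A indexed as A[k][i],
   i.e. a stride-N walk of A. */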

> I still don't fully understand "eight different multiplications" part, but that's probably does not matter. What does matter is that for matrix multiplication IPC is the last thing I want to know. Achieved FLOPS/Hz, on the other hand, a moderately interesting. Not as interesting as FLOPS/W and FLOPS/$, but interesting nevertheless.

I looked at the code and only found 4.

> Do you remember how many FLOPS/Hz you got on matrix300 ?

1.95 ! with no secondary cache !

Mitch

already...@yahoo.com

unread,
Nov 7, 2016, 9:59:03 AM11/7/16
to
On Monday, November 7, 2016 at 4:21:57 PM UTC+2, MitchAlsup wrote:
> On Monday, November 7, 2016 at 4:40:47 AM UTC-6, already...@yahoo.com wrote:
>
> > Unfortunately, spec.org does not think that we have to know about Spec89. As far as they concerned, the life started at Spec92.
> > So, the only info I was able to find about matrix300 is a sharp criticism of this bench in H&P. Apart from saying that it is bad benchmark ("99% of the execution time was in a single line"), they mention that it does eight different 300x300 matrix multiplications.
>
> It performs A*B, A*B^T, A^T*B, and A^T*B^T where ^T denotes transpose row and column.

O.k. Memory footprint is even smaller than I thought.

> While it is somewhat indicative of BLAS activities, the matrix ends up fitting in too many caches. In my design, however, it did not fit, and the processor was taking a cache miss every other cycle.
>

Why so?
Cache blocking was not allowed by the benchmark rules?
Or was it allowed when done automatically by the compiler (according to H&P, IBM's compiler was smart enough), but not allowed when done by hand?

> > I still don't fully understand "eight different multiplications" part, but that's probably does not matter. What does matter is that for matrix multiplication IPC is the last thing I want to know. Achieved FLOPS/Hz, on the other hand, a moderately interesting. Not as interesting as FLOPS/W and FLOPS/$, but interesting nevertheless.
>
> I looked at the code and only found 4.
>
> > Do you remember how many FLOPS/Hz you got on matrix300 ?
>
> 1.95 ! with no secondary cache !
>
> Mitch

Did you have FMA?


Terje Mathisen

unread,
Nov 7, 2016, 1:18:44 PM11/7/16
to
MitchAlsup wrote:
> On Monday, November 7, 2016 at 4:40:47 AM UTC-6, already...@yahoo.com
> wrote:
>
>> Unfortunately, spec.org does not think that we have to know about
>> Spec89. As far as they concerned, the life started at Spec92. So,
>> the only info I was able to find about matrix300 is a sharp
>> criticism of this bench in H&P. Apart from saying that it is bad
>> benchmark ("99% of the execution time was in a single line"), they
>> mention that it does eight different 300x300 matrix
>> multiplications.
>
> It performs A*B, A*B^T, A^T*B, and A^T*B^T where ^T denotes transpose

Isn't the last one (A^T*B^T) the same as the first one followed by a
transpose? ((A*B)^T)

Terje

> row and column. While it is somewhat indicative of BLAS activities,
> the matrix ends up fitting in too many caches. In my design, however,
> it did not fit, and the processor was taking a cache miss every other
> cycle.
>
>> I still don't fully understand "eight different multiplications"
>> part, but that's probably does not matter. What does matter is that
>> for matrix multiplication IPC is the last thing I want to know.
>> Achieved FLOPS/Hz, on the other hand, a moderately interesting. Not
>> as interesting as FLOPS/W and FLOPS/$, but interesting
>> nevertheless.
>
> I looked at the code and only found 4.
>
>> Do you remember how many FLOPS/Hz you got on matrix300 ?
>
> 1.95 ! with no secondary cache !
>
> Mitch
>


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Nick Maclaren

unread,
Nov 7, 2016, 1:41:46 PM11/7/16
to
In article <d81aa79d-fedf-46ed...@googlegroups.com>,
No, not at all. While we were one of the first sites to start using
performance/watt as a primary benchmarking metric, within a couple
of years it was common. Note that I am talking HPC and commercial
server-room procurements here, not home hobbyist or 'defence'.

The reason for this is quite simply that many sites were running
out of power or cooling capacity, and did not have any practical
way to expand. This was more common in Europe and Japan, of course,
but affected many sites in the USA, too. For example, financial
services involve a lot of gambling with other people's money (the
futures market), and milliseconds matter, so they need to be close
to the exchange server. And a lot of other sites are in densely
developed cities, places with limited power supplies or roof space
for cooling towers, or with development restrictions.

Yes, there is a market for power hogs, but it's smaller than you
might think. Inter alia, many sites would prefer 3x the number of
CPUs, each 1/2 the performance, if it comes in at the same power
budget.


Regards,
Nick Maclaren.

Quadibloc

unread,
Nov 7, 2016, 2:39:41 PM11/7/16
to
On Monday, November 7, 2016 at 11:41:46 AM UTC-7, Nick Maclaren wrote:

> Yes, there is a market for power hogs, but it's smaller than you
> might think. Inter alia, many sites would prefer 3x the number of
> CPUs, each 1/2 the performance, if it comes in at the same power
> budget.

Makes sense; 1.5x the throughput. And if you *can* put 1,000 CPUs to work, then
you probably can use 2,000 CPUs as well.

A fast power hog is what you need when the number of CPUs you can usefully put
to work on your problem is *one*. Or close to it.

I thought that this happened quite often, but I'm no expert.

John Savard

already...@yahoo.com

unread,
Nov 7, 2016, 3:25:47 PM11/7/16
to
On Monday, November 7, 2016 at 8:18:44 PM UTC+2, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Monday, November 7, 2016 at 4:40:47 AM UTC-6, already...@yahoo.com
> > wrote:
> >
> >> Unfortunately, spec.org does not think that we have to know about
> >> Spec89. As far as they concerned, the life started at Spec92. So,
> >> the only info I was able to find about matrix300 is a sharp
> >> criticism of this bench in H&P. Apart from saying that it is bad
> >> benchmark ("99% of the execution time was in a single line"), they
> >> mention that it does eight different 300x300 matrix
> >> multiplications.
> >
> > It performs A*B, A*B^T, A^T*B, and A^T*B^T where ^T denotes transpose
>
> Isn't the last one (A^T*B^T) the sasme as the first one followed by a
> transpose? ((A*B)^T)
>
> Terje
>

No, A^T*B^T is the same as (B*A)^T
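
A quick numeric check of that identity with throwaway 3x3 matrices
(nothing here is from the benchmark):

#include <stdio.h>
#include <math.h>

#define N 3

static void mul(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

static void transpose(double A[N][N], double T[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            T[i][j] = A[j][i];
}

static double maxdiff(double X[N][N], double Y[N][N])
{
    double d = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (fabs(X[i][j] - Y[i][j]) > d)
                d = fabs(X[i][j] - Y[i][j]);
    return d;
}

int main(void)
{
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,10}};
    double B[N][N] = {{2,0,1},{1,3,0},{0,1,4}};
    double At[N][N], Bt[N][N], AtBt[N][N], BA[N][N], BAt[N][N], AB[N][N], ABt[N][N];

    transpose(A, At);
    transpose(B, Bt);
    mul(At, Bt, AtBt);      /* A^T * B^T */
    mul(B, A, BA);
    transpose(BA, BAt);     /* (B*A)^T   */
    mul(A, B, AB);
    transpose(AB, ABt);     /* (A*B)^T   */

    printf("max |A^T*B^T - (B*A)^T| = %g\n", maxdiff(AtBt, BAt)); /* 0        */
    printf("max |A^T*B^T - (A*B)^T| = %g\n", maxdiff(AtBt, ABt)); /* non-zero */
    return 0;
}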

Terje Mathisen

unread,
Nov 7, 2016, 6:17:20 PM11/7/16
to
OK, that makes sense since matrix mul isn't commutative.

Thanks!

This means that the compiler cannot reuse the first result at this point.

Terje

MitchAlsup

unread,
Nov 7, 2016, 8:22:54 PM11/7/16
to
On Monday, November 7, 2016 at 8:59:03 AM UTC-6, already...@yahoo.com wrote:
> On Monday, November 7, 2016 at 4:21:57 PM UTC+2, MitchAlsup wrote:
> > It performs A*B, A*B^T, A^T*B, and A^T*B^T where ^T denotes transpose row and column.
>
> O.k. Memory footprint is even smaller than I thougth.

Memory footprint is 3*300*300*8 = 2,160,000 bytes, so it is not all that small.

In the first transpose (neither) it is strictly 2 misses every cache line DIV 8. In the second it is one cache miss DIV 1 and one cache miss DIV 8, in the third and fourth it is one cache miss DIV 1. Any vector memory system can handle it with ease, and almost any cache system with 256KB and more than one set can handle it with ease.
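
Back-of-envelope for the footprint and the miss counts, assuming
8-byte doubles and 64-byte cache lines (my assumption - the line size
isn't stated above):

#include <stdio.h>

int main(void)
{
    long long n = 300;
    long long bytes = 3 * n * n * 8;    /* A, B and the result, as doubles */
    long long elems_per_line = 64 / 8;  /* 8 doubles per 64-byte line      */

    printf("footprint: %lld bytes (~%.2f MB)\n", bytes, bytes / 1e6);
    printf("unit-stride pass over one matrix: ~%lld line misses (1 per 8 elements)\n",
           n * n / elems_per_line);
    printf("stride-N pass, worst case (no line reuse): ~%lld misses (1 per element)\n",
           n * n);
    return 0;
}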

> > While it is somewhat indicative of BLAS activities, the matrix ends up fitting in too many caches. In my design, however, it did not fit, and the processor was taking a cache miss every other cycle.
> >
>
> Why so?
> Cache blocking was not allowed by benchmark rules?

The code was already blocked prior to being fed to the compiler!

Mitch

MitchAlsup

unread,
Nov 7, 2016, 8:23:31 PM11/7/16
to
On Monday, November 7, 2016 at 12:18:44 PM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Monday, November 7, 2016 at 4:40:47 AM UTC-6, already...@yahoo.com
> > wrote:
> >
> >> Unfortunately, spec.org does not think that we have to know about
> >> Spec89. As far as they concerned, the life started at Spec92. So,
> >> the only info I was able to find about matrix300 is a sharp
> >> criticism of this bench in H&P. Apart from saying that it is bad
> >> benchmark ("99% of the execution time was in a single line"), they
> >> mention that it does eight different 300x300 matrix
> >> multiplications.
> >
> > It performs A*B, A*B^T, A^T*B, and A^T*B^T where ^T denotes transpose
>
> Isn't the last one (A^T*B^T) the sasme as the first one followed by a
> transpose? ((A*B)^T)

A nuance I do not remember.

Bruce Hoult

unread,
Nov 8, 2016, 10:20:27 AM11/8/16
to
On Monday, November 7, 2016 at 1:58:03 AM UTC+3, Quadibloc wrote:
> On Sunday, November 6, 2016 at 2:51:57 PM UTC-7, MitchAlsup wrote:
> > On Sunday, November 6, 2016 at 9:24:27 AM UTC-6, already...@yahoo.com wrote:
>
> > > Or marketing people didn't realize at time that they can't sell CPU that
> > > consumes more than 100-110W to mass consumers?
>
> > The real question is why can yo sell a 300W GPU but not a 300W CPU?
>
> Hmm.
>
> I was aware that GPU cards often had big fans on them, but compared to the
> bigger fan and heatsink combos sitting on CPUs, I hadn't realized that GPUs got
> to use more power than CPUs.
>
> I wonder if this has any connection to the fact that the Kaby Lake i7
> processors are *dual-core* units, but with fancier on-chip GPU power (dual core
> with HyperThreading used to mean i3, so i7 has lost its meaning, the meaning it
> had right up through Skylake, even for laptop parts).

Not so. My 2011 11" MacBook Air has a dual core + HT i7-2677M 1.8 GHz, turbo 2.9 GHz.

My GeekBench3 for it: https://browser.primatelabs.com/geekbench3/1443716
And my iPhone 6s: https://browser.primatelabs.com/geekbench3/4831757

Very very similar!

MB Air: 2410 4894, iPhone 2538 4422 (single core, multi core)

Anton Ertl

unread,
Nov 8, 2016, 11:21:59 AM11/8/16
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Monday, November 7, 2016 at 2:57:51 AM UTC-7, Anton Ertl wrote:
>> If that
>> is the reason, why didn't they realize the problem earlier? Were they
>> expecting a cooling technology that did not work out?
>
>I strongly doubt that the reason is as straightforward as that. Cooling
>technologies are mature, it's the silicon that's at the cutting edge.

If they required cooling technology that exceeded the capabilities of
existing, mature cooling technology, they would need a new cooling
technology.

>So what changed appears to have been *anticipated* consumer demand, and
>definitely stuff like that happens in management and not engineering.

I would expect that that's a consideration they take into account much
earlier in the project (at about the time when they first produce
estimates of power consumption). Ok, there may have been a change of
mind in the management, but it's unlikely that that happens in two
companies at similarly late stages in development (it's even unlikely
in one, but such things happen occasionally).

Anton Ertl

unread,
Nov 8, 2016, 11:48:35 AM11/8/16
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Sunday, November 6, 2016 at 10:17:20 AM UTC-7, Anton Ertl wrote:
>
>> Maybe you can shed more light at why projects such as K9 and Tejas
>> were pursued that far, and then killed. Didn't you have power in your
>> sights? Or did you assume it could be dealt with? Or something else?
...
>And here's a contemporary news account about Tejas:
>
>http://www.theinquirer.net/inquirer/news/1044825/the-intel-tejas-affair-explained
>
>Their story is that the 65nm process became available earlier than expected.
>Prescott could be moved to 65nm quickly. Tejas could not, it would take time to
>change it over to 65nm. Actually selling the 90nm Tejas would create market
>confusion, so it was dropped.

If Tejas had achieved its goals of 7GHz without much loss of IPC
compared to a Pentium 4, it would have been ahead of AMD, which would
have been a significant marketing advantage for Intel. I don't think
"market confusion" trumps that. They have marketing people to deal
with that.

Also, the fact that a similar fate befell Mitch Alsup's K9 makes it
appear more likely that both projects had a similar technical problem
rather than some kind of delay-shrink cancellation.

>It doesn't make sense to abandon a chip design that's better than the other
>ones you have just because it got delayed in the past, but it's ready now.
>
>It seems mysterious that there would be a sudden turnaround in thinking that
>300 watts is just peachy to deciding it's too scary to foist on the consumer.

Right.

>But I could believe that doubts about the high power consumption *plus* a
>project plagued by delays could lead to it getting dropped...

It had already taped out, so the finish line was in sight. Except if
at that time they found something they had not thought of earlier. And
given the similar fate of the K9, I suspect a common cause.

Quadibloc

unread,
Nov 8, 2016, 4:10:05 PM11/8/16
to
On Tuesday, November 8, 2016 at 9:21:59 AM UTC-7, Anton Ertl wrote:

> I would expect that that's a consideration they take into account much
> earlier in the project (at about the time when they first produce
> estimates of power consumption). Ok, there may have been a change of
> mind in the management, but it's unlikely that that happens in two
> companies at similarly late stages in development (it's even unlikely
> in one, but such thinks happen occasionally).

Well, what now seems to have happened - after I did a search for news articles
that appeared at the time of Tejas' cancellation - was this:

- Tejas ran somewhat late, and so it would have had to undergo a process
shrink to be equivalent to other designs already available (one article I saw
noted that making it available as-is would have confused customers; that was
given as "the reason" Tejas died);

- These other designs were multi-core chips, first dual core, then, later, quad
core. Even a quad core had less power consumption than a single-core Tejas.

- Thus, when Intel learned *through actual sales* of multi-core chips that the
market could be persuaded that two cores at X MHz were almost as good as one
core at 2X MHz, then any plans to do a process shrink on Tejas were dropped.

I find that believable - and for AMD to follow the lead of Intel, rather than
to go in a direction which its much bigger and more successful competitor
decided was not worthwhile, also makes sense, as it was smaller and could
therefore afford to take fewer risks.

So it makes sense that they thought 300 W TDP would be acceptable - since
problems are often serial - but when experience showed them they could sell
multi-core chips based on throughput per watt, then Tejas got shoved from a
mainline product to a niche product not worth the cost of further development.
They didn't imagine something wrong about cooling systems: they learned
something they didn't know about what people would enthusiastically buy.

John Savard