Bulldozer details + bobcat

Brett Davis

Nov 13, 2009, 1:45:58 AM
Bulldozer details + bobcat

AMD will sample chips early next year
http://www.theinquirer.net/inquirer/news/1561981/amd-sample-chips
http://www.theregister.co.uk/2009/11/12/amd_client_roadmaps/

AMD Fusion die photo
page 45 has a 4 core + graphics, looks to be 1/4th a top end GPU, 400 pipes?
http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMjR8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1

Bobcat Core page 15, Bulldozer page 16
http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMTl8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1

More Bulldozer page 16+
http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMzJ8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1

more
http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMjd8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1

From
http://phx.corporate-ir.net/phoenix.zhtml?c=74093&p=irol-analystday

Also
http://arstechnica.com/hardware/news/2009/11/amd-bobcat-bulldozer.ars
http://citavia.blog.de/
http://info.nuje.de/Bulldozer_Core_uArch_0.4.png (MAY NOT BE BULLDOZER)

Bobcat is 4-issue: 2 int + load + store, quoted at 90% of PhenomII speed?
I assume PhenomII does 3 int ops, with up to 2 loads + a store issuing
in place of int ops. So Bobcat could be faster for some work?
Same FPU as PhenomII, 1 vector float add + 1 vector float mul?

Bulldozer is not as radical as I had assumed. A simple 4-wide design with
a shared FPU, which is 2 full vector FMAD units.

Bulldozer may not be bigger than PhenomII, so PhenomII would go away,
but for a while AMD will have three desktop CPU product lines.

The decade old K7 full triple pipelines finally dies?
Or does it get a shared FPU upgrade? (Which would crush Core2.)

Core2 is part 4 way for decode/issue, but only 3 integer pipes.
And Atom is crap. ;)

AMD ends up with a nicer distribution of products than Intel,
making Intel look like a small company competing with a dominant one,
without the money or engineering talent to cover all markets.
;)

Brett

Terje Mathisen

Nov 13, 2009, 2:18:57 AM
Brett Davis wrote:
> The decade old K7 full triple pipelines finally dies?
> Or does it get a shared FPU upgrade? (Which would crush Core2.)
>
> Core2 is part 4 way for decode/issue, but only 3 integer pipes.
> And Atom is crap. ;)
>
> AMD ends up with a nicer distribution of products than Intel,
> makes Intel look like a small company competing with a dominate one.
> Not having the money or engineering talent to cover all markets.

I think that's a somewhat optimistic (on behalf of AMD) viewpoint.

Intel have after all been pretty consistent in not allowing AMD x86
performance leads for longer periods of time.

OTOH AMD is indeed doing a pretty good job, they just need to make a bit
more money doing it.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

ArarghMai...@not.at.arargh.com

Nov 13, 2009, 3:45:45 AM
On Fri, 13 Nov 2009 06:45:58 GMT, Brett Davis <gg...@yahoo.com> wrote:

These last 4 URLs all appear to be the same, and go nowhere. :-)


>
>From
>http://phx.corporate-ir.net/phoenix.zhtml?c=74093&p=irol-analystday
>
>Also
>http://arstechnica.com/hardware/news/2009/11/amd-bobcat-bulldozer.ars
>http://citavia.blog.de/
>http://info.nuje.de/Bulldozer_Core_uArch_0.4.png (MAY NOT BE BULLDOZER)
>
>Bobcat is 4 issue: 2 int + load + store, quoted 90% of PhenomII speed?
>I assume PhenomII does 3 int, including up to 2 loads + store
>instead of Int op. So bobcat could be faster for some work?
>Same FPU as PhenomII, 1 float add vector + one float mul vector?
>
>Bulldozer is not as radical as I had assumed. Simple 4 wide with
>shared FPU, which is 2 full FMAD vector.
>
>Bulldozer may not be bigger than PhenomII, so PhenomII would go away,
>but for a while AMD will have three desktop CPU product lines.
>
>The decade old K7 full triple pipelines finally dies?
>Or does it get a shared FPU upgrade? (Which would crush Core2.)
>
>Core2 is part 4 way for decode/issue, but only 3 integer pipes.
>And Atom is crap. ;)
>
>AMD ends up with a nicer distribution of products than Intel,
>makes Intel look like a small company competing with a dominate one.
>Not having the money or engineering talent to cover all markets.
>;)
>
>Brett

--
ArarghMail911 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

To reply by email, remove the extra stuff from the reply address.

Noob

Nov 13, 2009, 4:41:18 AM
Arargh wrote:

> On Fri, 13 Nov 2009 06:45:58 GMT, Brett Davis wrote:
>
>> Bulldozer details + bobcat
>>
>> AMD will sample chips early next year
>> http://www.theinquirer.net/inquirer/news/1561981/amd-sample-chips
>> http://www.theregister.co.uk/2009/11/12/amd_client_roadmaps/
>>
>> AMD fusions die photo
>> page 45 has a 4 core + graphics, looks to be 1/4th a top end GPU, 400 pipes?
>> http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMjR8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1
>>

>> Bobcat Core page 15, Bulldozer page 16

The document in question is

AMD Financial Analyst Day
APU Strategy
Chuck Moore, Corporate Fellow and CTO Technology Development
November 11, 2009

I've uploaded a copy to
http://dl.free.fr/bMjPWZw4s/AMD_APU_Strategy.pdf

Robert Myers

Nov 13, 2009, 9:40:07 AM
On Nov 13, 4:41 am, Noob <r...@127.0.0.1> wrote:
> Arargh wrote:
> > On Fri, 13 Nov 2009 06:45:58 GMT, Brett Davis wrote:
>
> >> Bulldozer details + bobcat
>
> >> AMD will sample chips early next year
> >>http://www.theinquirer.net/inquirer/news/1561981/amd-sample-chips
> >>http://www.theregister.co.uk/2009/11/12/amd_client_roadmaps/
>
> >> AMD fusions die photo
> >> page 45 has a 4 core + graphics, looks to be 1/4th a top end GPU, 400 pipes?
> >>http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMjR8Q2...

>
> >> Bobcat Core page 15, Bulldozer page 16
> >>http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMTl8Q2...
>
> >> More Bulldozer page 16+
> >>http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMzJ8Q2...
>
> >> more
> >>http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9MjAzMjd8Q2...

>
> > These last 4 URLs all appear to be the same, and go nowhere. :-)
>
> The document in question is
>
> AMD Financial Analyst Day
> APU Strategy
> Chuck Moore, Corporate Fellow and CTO Technology Development
> November 11, 2009
>
> I've uploaded a copy to http://dl.free.fr/bMjPWZw4s/AMD_APU_Strategy.pdf

One wonders who suggested renaming Veterans Day as Financial Analyst
Day. Maybe that's why the document was taken down. What a blunder.

Robert Myers.

Brett Davis

Nov 13, 2009, 10:03:34 PM
In article <937qf59ms76d1mig5...@4ax.com>,
ArarghMai...@NOT.AT.Arargh.com wrote:

> >Bulldozer details + bobcat

> >AMD fusions die photo

I hate it when a database gives you bogus one time use links.

This is the starting link:
http://sites.amd.com/us/atwork/promo/events/Pages/analyst-day.aspx

Next is "Executive Presentations and Webcast Archive", which might work:
http://phx.corporate-ir.net/phoenix.zhtml?c=74093&p=irol-analystday

And of course all the links on that page are totally non-copyable.
Idiots.

I changed my opinion of Bulldozer: it is not two cores with a shared FPU.
It is more like one core with two separate integer pipelines, each running
its own thread. Actually a lot depends on how wide the fetch and decode is.

For x86, fetch and decode are a large chunk of die space that is mostly
idle. With that assumption you are back to this performing the same as a
dual core, at 2/3rds the die space. Which would be a pretty overwhelming
advantage.

Unlike hyperthreading this will never cost you performance. And for single
thread benchmarks you turn off the second thread, same as Intel. ;)

Brett

Brett Davis

Nov 13, 2009, 11:06:49 PM
In article <WKedna4VcuL_lWDX...@lyse.net>,
Terje Mathisen <Terje.M...@tmsw.no> wrote:

> Brett Davis wrote:
> > The decade old K7 full triple pipelines finally dies?
> > Or does it get a shared FPU upgrade? (Which would crush Core2.)

Actually it would get back near parity, and that depends on whether the
second FPU would find instructions to issue in the major gaming
benchmarks. It is still going to lose on integer work.

In any case AMD will have Bulldozer for the benchmark queen position.

> > Core2 is part 4 way for decode/issue, but only 3 integer pipes.
> > And Atom is crap. ;)
> >
> > AMD ends up with a nicer distribution of products than Intel,
> > makes Intel look like a small company competing with a dominate one.
> > Not having the money or engineering talent to cover all markets.
>
> I think that's a somewhat optimistic (on behalf of AMD) viewpoint.

If Intel were a smart monopoly they would have spent the extra money
to cover different markets with different optimized designs.
This would have cost more, cutting the outrageous monopoly profits
a tiny bit. But it would have locked AMD into one market segment, which
a monopoly can turn into a loss leader. (A tactic Intel tried.)

Instead they were greedy and stupid.

Next year AMD will be selling three different designs optimized for
three different markets, against Intel's one chip with a crippled cache
and/or a crippled clock to compete in different markets.

The current strategy of using a benchmark queen with die-size and heat
issues in all markets is looking kinda stupid.

"Bobcat, 90% of the performance in less than half the die area."

Intel Atom: half the performance, bundled with Intel-mandated crippled
graphics to guarantee you will throw it away and buy a real computer.

Have I mentioned that Atom is crap? ;)

> Intel have after all been pretty consistent in not allowing AMD x86
> performance leads for longer periods of time.

NetBust was two or three years in second place.

> OTOH AMD is indeed doing a pretty good job, they just need to make a bit
> more money doing it.
>
> Terje

Brett

Brett Davis

Nov 14, 2009, 4:25:30 PM

> For x86 fetch and decode is a large chunk of die space that is mostly idle.
> With that assumption you are back to this performing the same as a dual core,
> at 2/3rds the die space. Which would be a pretty overwhelming advantage.

Here is a Athlon64 die plot:
http://www.amd.com/us-en/assets/content_type/DigitalMedia/Ath64_die_marked_P.jpg

The biggest block is "Fetch Scan Align Microcode"; the second smallest is
the execution units. The instruction cache is shared, the bus unit is
shared; the load/store unit I am not sure about.

So Bulldozer has 2 execution units, 2 L1 data caches, likely two
load/store units, and a doubled FPU that is shared. So indeed 2/3rds the
die space of two full cores.

If L2 size stays the same the total die size and production cost is ~20% greater
for twice the performance. Overwhelming advantage.
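
As a sanity check, the area argument can be put into a quick back-of-envelope script. The block fractions below are my own illustrative guesses (loosely inspired by the die plot above), not measured AMD numbers:

```python
# Back-of-envelope sketch of the Bulldozer area argument.
# All area fractions are illustrative assumptions, not AMD data.

# Suppose one conventional core's area splits roughly as:
FRONT_END = 0.40   # fetch/scan/align/microcode (the biggest block)
INT_EXEC  = 0.20   # integer scheduler + execution units + L1 D$
FPU       = 0.25
OTHER     = 0.15   # bus unit and miscellaneous (shared)

one_core  = FRONT_END + INT_EXEC + FPU + OTHER   # ~1.0 core-units
two_cores = 2 * one_core                         # ~2.0 core-units

# Bulldozer-style module: share front end, FPU and misc,
# replicate only the integer execution block.
module = FRONT_END + 2 * INT_EXEC + FPU + OTHER  # ~1.2 core-units

print(f"two full cores : {two_cores:.2f}")
print(f"one MCMT module: {module:.2f}")
print(f"module / 2 cores = {module / two_cores:.0%}")  # roughly 2/3rds
```

With these guesses the module lands around 60% of two full cores, which is the "2/3rds the die space for dual-core performance" claim above.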

Brett

Andy "Krazy" Glew

Nov 15, 2009, 1:50:01 AM
to Brett Davis
Brett Davis wrote:
> Bulldozer details + bobcat

BRIEF:

AMD's Bulldozer is an MCMT (MultiCluster MultiThreaded)
microarchitecture. That's my baby!

DETAIL:

Thursday was both a very good day and a very bad day for me. Good,
because my MCMT ideas finally seem to be going into a product. Bad,
because I ended up driving 4 hours from where I work with IV in the
Seattle area back to Portland, to my wife who was taken to a hospital
emergency room. The latter is personal. The former is, well, personal
too, but also professional.

I can't express how good it feels to see MCMT become a product. It's
been public for years, but it gets no respect until it is in a product.
It would have been better if I had stayed at Intel to see it through.
I know that I won't get any credit for it. (Except from some of the guys
who were at AMD at the time.) But it feels good nevertheless.

The only bad thing is that some guys I know at AMD say that Bulldozer is
not really all that great a product, but is shipping just because AMD
needs a model refresh. "Sometimes you just gotta ship what you got." If
this is so, and if I deserve any credit for MCMT, then I also deserve
some of the blame. Although it might have been different, better, if I
had stayed.

I came up with MCMT in 1996-2000 while at the University of Wisconsin.
It became public via presentations.

I brought MCMT back to Intel in 2000, and to AMD in 2002.

I was beginning to despair of MCMT ever seeing the light of day. I
thought that when I left AMD in 2004, the MCMT ideas may have left with
me. Apparently not. I must admit that I am surprised to see that the
concept endured so many years - 5+ years after I left, 7+ years to
market. Apparently they didn't have any better ideas.

True, there were rumors. For example, Chuck Moore presented a slide
with Multicluster Multithreading on it to analysts in 2004 or 2005. But
things went quiet. There were several patents filed, with diagrams that
looked very much like the ones I drew for the K10 proposal. But, one
often sees patent applications for cancelled projects.

Of course, AMD has undoubtedly changed and evolved MCMT in many ways
since I first proposed it to them. For example, I called the set of an
integer scheduler, integer execution units, and an L1 data cache a
"cluster", and the whole thing, consisting of shared front end, shared
FP, and 2 or more clusters, a processor core. Apparently AMD is calling
my clusters their cores, and my core their cluster. It has been
suggested that this change of terminology is motivated by marketing, so
that they can say they have twice as many cores.

My original motivation for MCMT was to work around some of the
limitations of Hyperthreading on Willamette. E.g. Willamette had a very
small L0 data cache, 4K in some of the internal proposals, although it
shipped at 8K. Two threads sharing such a tiny L0 data cache thrash.
Indeed, this is one of the reasons why hyperthreading is disabled on
many systems, including many current Nhm based machines with much larger
closest-in caches.

At the time, the small L0s were a given. You couldn't build a
Willamette style "fireball" high frequency machine, and have a much
bigger cache, and still preserve the same small cache latency.

To avoid threads thrashing each other, I wanted to give each thread
their own L0. But, you can't do so, and still keep sharing the
execution units and scheduler - you can't just build a 2X larger array,
or put two arrays side by side, and expect to have the same latency.
Wires. Therefore, I had to replicate the execution units, and enough of
the scheduler so that the "critical loop" of Scheduler->Execution->Data
Cache was all isolated from the other thread/cluster. Hence, the form
of multi-cluster multi-threading you see in Bulldozer.

True, there are differences, and I am sure more will become evident as
more Bulldozer information becomes public. For example, although I came
up with MCMT to make Willamette-style threading faster, I have always
wanted to put SpMT, Speculative Multithreading, on such a substrate.
SpMT has potential to speed up a single thread of execution, by
splitting it up into separate threads and running the separate threads
on different clusters, whereas Willamette-style hyperthreading and
Bulldozer-style MCMT (apparently) only speed up workloads that have
existing independent threads. I still want to build SpMT. My work at
Wisconsin showed that SpMT on a Willamette substrate was constrained by
Willamette's poor threading microarchitecture, so naturally I had to
first create the best explicit threading microarchitecture I could, and
then run SpMT on top of it.

If I received arrows in my back for MCMT, I received 10 times as many
arrows for SpMT. And yet still I have hope for it. Unfortunately, I am
not currently working on SpMT. Haitham Akkary, the father of DMT,
continues the work.

I also tried, and still continue, to explore other ways of speeding up
single threads using multiple clusters.

Although I remain an advocate of SpMT, I have always recognized the
value of MCMT as an explicit threaded microarchitecture.

Perhaps I should say here that my MCMT had a significant difference from
clustering in, say, the Alpha 21264,
http://www.hotchips.org/archives/hc10/2_Mon/HC10.S1/HC10.1.1.pdf
Those clusters bypass to each other: there is a fast bypass within a
cluster, and a slightly slower (+1 cycle) bypass of results between
clusters. The clusters are execution units only, and share the data
cache. This bypassing makes it easy (or at least easier) to spread a
single thread across both clusters. My MCMT clusters, on the other
hand, do NOT bypass to each other. This motivates separate threads per
cluster, whether explicit or implicit.

I have a whole taxonomy of different sorts of clustering:
* fast vs slow bypass clusters
* fully bypassed vs. partially bypassed
* mechanisms to reduce bypassing
* physical layout of clusters
* bit interleaved datapaths
* datapaths flowing in opposite directions,
with bypassing where they touch
* what's in the cluster
* execute only
* execute + data cache
* schedule + execute + data cache
* renamer + schedule + execute + datacache
...
* what gets shared between clusters
* front-end
* renamer?
* data-cache - L0? L1? L2?
* TLBs...
* MSHRs...
* FP...

Anyway: if it has an L0 or L1 data cache in the cluster, with or
without the scheduler, it's my MCMT. If no cache in the cluster, not
mine (although I have enumerated many such possibilities).
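
To make the distinction concrete, here is an illustrative-only sketch of the two clustering styles (the labels and component lists are my reading of this post and the linked 21264 slides, not vendor documentation):

```python
# Illustrative-only sketch of two points in the clustering taxonomy.
# Component lists are assumptions drawn from this thread, not specs.

from dataclasses import dataclass


@dataclass
class ClusterDesign:
    name: str
    per_cluster: tuple          # replicated inside each cluster
    shared: tuple               # shared between clusters
    inter_cluster_bypass: bool  # do results bypass across clusters?


glew_mcmt = ClusterDesign(
    name="MCMT (Bulldozer-style)",
    per_cluster=("scheduler", "int execution units", "L1 data cache"),
    shared=("front end", "FP unit", "L2 cache"),
    inter_cluster_bypass=False,  # hence one thread per cluster
)

alpha_21264 = ClusterDesign(
    name="Alpha 21264 clustering",
    per_cluster=("int execution units",),
    shared=("front end", "scheduler resources", "data cache", "FP unit"),
    inter_cluster_bypass=True,   # +1 cycle cross-cluster bypass
)


def is_mcmt(d: ClusterDesign) -> bool:
    # Glew's litmus test above: a data cache inside the cluster,
    # with or without the scheduler, marks the design as MCMT.
    return any("data cache" in c for c in d.per_cluster)


print(is_mcmt(glew_mcmt), is_mcmt(alpha_21264))
```

The `inter_cluster_bypass` flag captures the key design fork: with bypassing you can spread one thread across clusters (21264), without it you are pushed towards one thread per cluster (MCMT).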

Motivated by my work to use MCMT to speed up single threads, I often
propose a shared L2 instruction scheduler, to load balance between the
clusters dynamically. Although I admit that I only really figured out
how to do that properly after I left AMD, and before I joined Intel.
How to do this is part of the Multi-star microarchitecture, M*, that is
my next step beyond MCMT.

Also, although it is natural to have a single (explicit) thread per
cluster in MCMT, I have also proposed allowing two threads per cluster.
Mainly motivated by SpMT: I could fork to a "runt thread" running in
the same cluster, and then migrate the runt thread to a different
cluster. Intra-cluster forking is faster than inter-cluster forking, and
does not disturb the parent thread.
But, if you are not doing SpMT, there is much less motivation for
multiple threads per cluster. I would not want to do that unless I was
also trying to build a time-switched lightweight threading system.
Which, as you can imagine if you know me, I have also proposed. In
fact, I hope to go to the SC'09 Workshop on that topic.

I will be quite interested to see whether Bulldozer's cluster-private L1
caches (in AMD's swapped terminology, core-private L1 caches) are write
through or write-back. Willamette's L0 was write-through. I leaned
towards write-back, because my goal was to isolate clusters from each
other, to reduce thrashing. Also, because write-back lends itself
better to a speculative versioning cache, useful for SpMT.

With Willamette as background, I leaned towards a relatively small L0
cache in the cluster. Also, such a small L0 can often be pitch-matched
with the cluster execution unit datapath. A big L1, such as Bulldozer
seems to have, nearly always has to lie out of the datapath, and
requires wire turns. Wire turns waste area. I have, from time to time,
proposed putting the alignment muxes and barrel shifters in the wire
turn area. I'm surprised that a large cluster L1 makes sense, but that's
the sort of thing that you can only really tell from layout.

Some posters have been surprised by sharing the FP. Of course, AMD's K7
design, with separate clusters for integer and FP, was already half-way
there. They only had to double the integer cluster. It would have been
harder for Intel to go MCMT, since the P6 family had shared integer and
FP. Willamette might have been easier to go MCMT, since it had separate FP.

Anyway... of course, for FP threads you might like to have
thread-private FP. But, in some ways, it is the advent of expensive FP,
like Bulldozer's 2 sets of 128 bit, 4x32 bit, FMAs, that justify integer
MCMT: the FP is so big that the overhead of replicating the integer
cluster, including the OOO logic, is a drop in the bucket.
You'd like to have per-cluster-thread FP, but such big FP workloads are
often so memory intensive that they thrash the shared-between-clusters
L2 cache: threading may be disabled anyways. As it is, you get good
integer threads via MCMT, and you get 1 integer thread and 1 FP thread.
Two FP threads may have some slowdown, although, again, if memory
intensive they may be blocking on memory, and hence allowing the other
FP thread to use the FP. But two purely computational FP threads will
almost undoubtedly block, unless the schedulers are piss-poor and can't
use all of the FP for a single thread (e.g. by being too small).

I certainly want to explore possibilities such as SpMT and other single
thread speedups. But I know that you can't build all the neat ideas in
one project. Apparently MCMT by itself was enough for AMD Bulldozer.
(Actually, I am sure that there are other new ideas in Bulldozer. Just
apparently not SpMT or spreading a single thread across clusters.) Look
at the time-lag: 10-15 years from when I came up with MCMT in
Wisconsin, 1996-2000. It is now 5-7 years from when I was at AMD,
2002-2004, and it will be another 2 years or so before Bulldozer is a
real force in the marketplace.

I don't expect to get any credit for MCMT. In fact, I'm sure I'm going
to get shit for this post. I don't care. I know. The people who were
there, who saw my presentations and read my proposals, know. But, e.g.
Chuck Moore wasn't there at start; he came in later. Even Mike Haertel,
my usual collaborator, wasn't there; he was hired in later, although
before Chuck. Besides, Mike Haertel thinks that MCMT is obvious.
That's cool, although I ask: if MCMT is obvious, then why isn't Intel
doing it? Companies like Intel and AMD need idea generating people like
me about once every 10 years. In between, they don't need new ideas.
They need new incremental improvements of existing ideas.

Anyway... It's cool to see MCMT becoming real. It gives me hope that my
follow-on to MCMT, M* may still, eventually, also become real.

Andy "Krazy" Glew

Nov 15, 2009, 2:01:53 AM
to Brett Davis
Andy "Krazy" Glew wrote:
> I can't express how good it feels to see MCMT become a product. It's
> been public for years, but it gets no respect until it is in a product.
> It would have been better if I had stayed at Intel to see it through. I
> know that I won't get any credit for it. (Except from some of the guys
> who were at AMD at the time.) But it feels good nevertheless.

Freudian typo?

It would have been better if I had stayed at either Intel or AMD to see
MCMT through.

But MCMT never took hold at Intel. Apparently it did at AMD.

While I never took root at either company. Not for lack of trying. But
neither was a friendly place.

Anton Ertl

Nov 15, 2009, 6:49:55 AM
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:
>There were several patents filed, with diagrams that
>looked very much like the ones I drew for the K10 proposal.

That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
right?

>Anyway: if it has an L0 or L1 data cache in the cluster, with or
>without the scheduler, it's my MCMT.

That reminds me of a paper [kim&smith02] that I have always found very
impressive, and I wondered why I have never seen followup papers or an
implementation (but then I don't follow ISCA and Micro proceedings
closely these days); hmm, looking at the research summary at Smith's
home page, he is still working on something like this. Of course that
work was also done in Wisconsin (same University, right?), so it may
have been inspired by your ideas. Do you have any comments on that?

@InProceedings{kim&smith02,
author = {Ho-Seop Kim and James E. Smith},
title = {An Instruction Set and Microarchitecture for
Instruction Level Distributed Processing},
crossref = {isca02},
pages = {71--81},
url = {http://www.ece.wisc.edu/~hskim/papers/kimh_ildp.pdf},
annote = {This paper addresses the problems of wide
superscalars with communication across the chip and
the number of write ports in the register file. The
authors propose an architecture (ILDP) with
general-purpose registers and with accumulators
(with instructions only accessing one accumulator
(read and/or write) and one register (read or
write); for the accumulators their death is
specified explicitly in the instructions. The
microarchitecture builds \emph{strands} from
instructions working on an accumulator; a strand
starts with an instruction writing to an accumulator
without reading from it, continues with instructions
reading from (and possibly writing to) the
accumulator and ends with an instruction that kills
the accumulator. Strands are allocated to one out of
eight processing elements (PEs) dynamically (i.e.,
accumulators are renamed). A PE consists of
mainly one ALU data path (but also a copy of the
GPRs and an L1 cache). They evaluated this
architecture by translating Alpha binaries into it,
and comparing their architecture to a 4-wide or
8-wide Alpha implementation; their architecture has
a lower L1 cache latency, though. The performance of
ILDP in clock cycles is competitive, and one can
expect faster clocks for ILDP. The paper also
presents data for other stuff, e.g. general-purpose
register writes, which have to be promoted between
strands and which are relatively few.}
}
@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}
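
The strand-formation rule described in the annotation (a strand starts with an instruction that writes an accumulator without reading it, continues through instructions that read/update it, and ends when the accumulator is killed) can be sketched roughly as follows; the instruction representation here is invented purely for illustration:

```python
# Rough sketch of ILDP-style strand formation, per the annotation above.
# Instructions are invented-for-illustration pairs (text, acc_effect):
#   "write" - writes the accumulator without reading it (strand start)
#   "use"   - reads and possibly updates the accumulator
#   "kill"  - last use; the accumulator's death is explicit in the ISA


def build_strands(instructions):
    """Group a linear instruction stream into accumulator strands."""
    strands, current = [], []
    for text, acc_effect in instructions:
        if acc_effect == "write":      # a new strand begins here
            if current:
                strands.append(current)
            current = [text]
        elif acc_effect == "use":
            current.append(text)
        elif acc_effect == "kill":     # accumulator dies: strand ends
            current.append(text)
            strands.append(current)
            current = []
    if current:                        # stream ended mid-strand
        strands.append(current)
    return strands


stream = [
    ("load  a <- [r1]",   "write"),
    ("add   a <- a + r2", "use"),
    ("store [r3] <- a",   "kill"),
    ("load  a <- [r4]",   "write"),
    ("sub   a <- a - r5", "kill"),
]

# Each resulting strand would be dispatched to one processing element,
# with the accumulator renamed onto that PE.
print(build_strands(stream))
```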

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Andy "Krazy" Glew

Nov 15, 2009, 11:54:22 AM
to Anton Ertl
Anton Ertl wrote:
> "Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:
>> There were several patents filed, with diagrams that
>> looked very much like the ones I drew for the K10 proposal.
>
> That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
> right?

So far as I can tell, Barcelona is a K8 warmed over.

There were several K10s. While I wanted to work on low power when I went
to AMD, I was hired to consult on low power and do high end CPU, since
the low power project was already rolling and did not need a new chef.
The first K10 that I knew at AMD was a low power part. When that was
cancelled I was sent off on my lonesome, then with Mike Haertel, to work
on a flagship, out-of-order, aggressive processor, while the original
low power team did something else. When that other low-power project was
cancelled, that team came over to the nascent K10 that I was working on.
My K10 was MCMT, plus a few other things. I had actually had to
promise Fred Weber that I would NOT do anything advanced for this K10 -
no SpMT, just MCMT. But when the other guys came on board, I thought
this meant that I could leave the easy stuff for them, while I tried to
figure out how to do SpMT and/or any other way of using MCMT to speed up
single threads. Part of my motivation was that I had just attended ISCA
2003 in San Diego, where several of the outstanding problems in big machines
had been solved, and I was scared that Intel would come out with
something if we did not.

In retrospect, that fear was unjustified.

Moral: don't give up power.

I don't know what happened to K10 after I left AMD, but I'm guessing
that the names were shifted, a warmed over K8 became K10, and the MCMT
K10 became Bulldozer. Slipping quite a few years.

Anton Ertl

Nov 15, 2009, 12:07:22 PM
"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:
>Anton Ertl wrote:
>> "Andy \"Krazy\" Glew" <ag-...@patten-glew.net> writes:
>>> There were several patents filed, with diagrams that
>>> looked very much like the ones I drew for the K10 proposal.
>>
>> That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
>> right?
>
>So far as I can tell, Barcelona is a K8 warmed over.
...

>I don't know what happened to K10 after I left AMD, but I'm guessing
>that the names were shifted, a warmed over K8 became K10

Yes, that was visible to us outsiders around the time of the release
of Barcelona. IIRC the K-name that we could read about was K8L at
first, later K10. My guess was that K9 was a cancelled project (or
maybe skipped because they did not want a project that was a dog:-).

>and the MCMT
>K10 became Bulldozer. Slipping quite a few years.

In the meantime I have read
<http://www.anandtech.com/printarticle.aspx?i=3674>, and at least going
by the block diagrams there, the Bobcat core, the low-power part from
AMD, will be related to Bulldozer: a core consists of one cluster
(instead of two), and the FPU may be different, but the overall
structure is similar. So maybe your work also ended up in a low-power
part after all.

Andy "Krazy" Glew

Nov 15, 2009, 9:15:24 PM
to Anton Ertl
Anton Ertl wrote:
> That reminds me of a paper [kim&smith02] that I have always found very
> impressive, and I wondered why I have never seen followup papers or an
> implementation (but then I don't follow ISCA and Micro proceedings
> closely these days); hmm, looking at the research summary at Smith's
> home page, he is still working on something like this. Of course that
> work was also done in Wisconsin (same University, right?), so it may
> have been inspired by your ideas. Do you have any comments on that?
>
> @InProceedings{kim&smith02,
> author = {Ho-Seop Kim and James E. Smith},
> title = {An Instruction Set and Microarchitecture for
> Instruction Level Distributed Processing},
> crossref = {isca02},

Same university, but I doubt direct inspiration. I know both Ho-Seop
and Jim Smith. But, they were in EE, while I was in CS under Sohi.

At the time there were many of us playing in the same space. Subrao
Palacharla was, I believe, the first to get published in the area, with
complexity effective, also with Jim Smith. Stole much of my thunder.

However, Subrao's, and especially Ho-Seop's, work was very much of the
flavor of "big out-of-order machines are too complicated, so we have to
find a way to approximate them by putting together simpler components".
Whereas my work was more of "How do I take the OOO designs that I know
are being worked on, and scale them up?" Also, at the time, although
I knew about Willamette - indeed, the scheduler structure of queues
feeding an RS arose from the debate between OOO (me) and in-order (Sager
and Upton) - but the academics at UWisc did not know. And I could not
tell them.

(This led, e.g. to me having to maintain 2 simulators, 1 with public
info and 1 with private info. It also led to me having arguments with
Jim Smith, where Jim said that it was useless to scale up window size
because branch predictors weren't accurate, whereas I knew about
Willamette's predictors, so I knew that much more accuracy was coming.
Much of my work in multilevel branch predictors was just designed to shut
Jim up, by showing that greater accuracy was possible, so I could just
go and work on big windows. Relatedly, I remember trying to persuade
Jim that 300 cycle memory latency was interesting, but he couldn't get
there. Years later he published "A Day in the Life of a Cache Miss",
imho a badly misnamed paper, that started exploring long latencies.)

Anyway, my big problem with ILDP is that it was a microoptimization. To
use it, you would basically have to throw away out-of-order
CPUs, and start over. And
in the first generation, you would just be playing catch up.

I've seen this many times. People think that they can get paid to
re-implement an existing CPU better, with a better, newer,
microarchiture. Maybe so - but remember that you are then competing
with the design team that is already going over the existing design with
a fine tooth comb. I've seen this several times with attempts to do a
new low power microarchitecture. I think that Wmt was much like this -
out-of-order done anew, rather than extending P6 OOO. The folks who
pushed run-ahead were in this camp: they weren't better than OOO, just
more efficient. Or so they believed. Because you also have to remember
that there is risk in doing anything new - so if the new, supposedly more
efficient, microarchitecture does not quickly make the phase change to
provably more efficient, it will get canned.

MCMT is more along the other line. It takes the existing pieces of a
microarchitecture, and rearranges them.

I think ILDP was somewhat in the same camp as RunAhead. A good idea,
but not clearly better than the incumbent OOO.

Andy "Krazy" Glew

unread,
Nov 15, 2009, 9:16:55 PM11/15/09
to Anton Ertl
Anton Ertl wrote:
>> So far as I can tell, Barcelona is a K8 warmed over.
>
> Yes, that was visible to us outsiders around the time of the release
> of Barcelona. IIRC the K-name that we could read about was K8L at
> first, later K10. My guess was that K9 was a cancelled project (or
> maybe skipped because they did not want a project that was a dog:-).

Mitch Alsup was K9.

Brett Davis

unread,
Nov 15, 2009, 9:56:01 PM11/15/09
to
In article <2009Nov1...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> That reminds me of a paper [kim&smith02] that I have always found very
> impressive, and I wondered why I have never seen followup papers or an
> implementation (but then I don't follow ISCA and Micro proceedings
> closely these days); hmm, looking at the research summary at Smith's
> home page, he is still working on something like this. Of course that
> work was also done in Wisconsin (same University, right?), so it may
> have been inspired by your ideas. Do you have any comments on that?
>
> @InProceedings{kim&smith02,
> author = {Ho-Seop Kim and James E. Smith},
> title = {An Instruction Set and Microarchitecture for
> Instruction Level Distributed Processing},

> - anton

This paper is timely for me as I have an 8 way design spec'd out that
I am getting ready to write up, and I have found no one else looking
at 8 way. (Feel free to fix this shortcoming of mine.)

The second paper corrects the mistake of using accumulator instructions.
http://www.ece.wisc.edu/~hskim/

Most people will read these papers and declare this to be a failure,
as indeed this design is. But the flaws can be fixed.

The basic problem is using Alpha RISC ops as his source; the 2002
timeline means he was stuck with GCC. Today, thanks to Apple, we have
LLVM and Clang, which give you access to more parallelism.
http://en.wikipedia.org/wiki/LLVM
http://en.wikipedia.org/wiki/Clang

I am concerned about one quote from the second paper:
"The typical IPC of SPEC INT type workload is below 2."

With loop unrolling you should be able to get arbitrary IPC
multiplication, the problem is the OoO engine being too stupid
due to various real world constraints to make use of the IPC.

Or is there something fundamental that I am missing?

Brett

Brett Davis

unread,
Nov 16, 2009, 12:09:34 AM11/16/09
to
In article <4AFFA499...@patten-glew.net>,

"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> wrote:

> I can't express how good it feels to see MCMT become a product. It's
> been public for years, but it gets no respect until it is in a product.
> It would have been better if I had stayed at Intel to see it through.
> I know that I won't get any credit for it. (Except from some of the guys
> who were at AMD at the time.) But it feels good nevertheless.

Claiming credit can get you pushed out of the company. The old 1%
inspiration and 99% perspiration, with the 99% being pissed at you
for claiming the credit.
I stopped tooting my horn, and now people like me. ;)
(My boss of course knows what I am up to, otherwise I would leave.)

> The only bad thing is that some guys I know at AMD say that Bulldozer is
> not really all that great a product, but is shipping just because AMD
> needs a model refresh. "Sometimes you just gotta ship what you got." If
> this is so, and if I deserve any credit for CMT, then I also deserve
> some of the blame. Although it might have been different, better, if I
> had stayed.

Bulldozer just has to be close to Core2 or better in performance, in
order to stop Intel from the pricing squeeze game they are playing
against Barcelona.
Better to save the company now than wait two more years for a "better"
design that may only be 2% faster, and which will be rolled into the
next needed refresh anyway.

Barcelona was of course a no-brainer re-engineering to remove all the
known bottlenecks throttling the K7 pipeline, so that a new faster
pipeline could be plugged in and not be crippled by instruction fetch,
decode, memory bandwidth, cache bandwidth, etc. Just about everything
in the K10 is faster than the old K7/K8 pipeline inside it needs.
And the end result is quite a lot faster than the K8.

> SpMT has potential to speed up a single thread of execution, by
> splitting it up into separate threads and running the separate threads

> on different clusters. [...]


>
> If I received arrows in my back for MCMT, I received 10 times as many
> arrows for SpMT. And yet still I have hope for it. Unfortunately, I am
> not currently working on SpMT. Haitham Akkary, the father of DMT,
> continues the work.

I have started backing away from SpMT, too many difficult problems to
solve at once, and there are better solutions. If you can go 8 way, a
simple recompile with a loop unroll will get you the same performance
as splitting the loop into odd and even paired threads.
(Not going to add 16 read ports to the register file to do this... ;)

For SpMT I looked at Alpha style paired integer pipelines with a 2 cycle
latency for any rare copies needed between the duplicate register sets.
In loops each pipeline handles its odd or even half of the loop count.
Outside of loops you have both CPUs running the same code, which has
power and heat issues. But you win the all-important benchmark queen position.
(Gamers will love it, server folk will not buy it.)
Each half would have its own instruction pointer, memory latencies in
the non-loop code would re-sync the IPs to near match.
Someone will do this one day.

Apple's solution to this problem is software only, Grand Central
Dispatch, and quite frankly this is the best solution of all. Use lots
of cheap small CPUs and let the OS and compiler and programmer decide
how to chop up the loops.

We are reaching the end of CPU design, little more can be done with
the opcodes and pipelines we have today. This fight between GM and Ford
may end with both suffering under the weight of dozens of little
companies selling "free" multi-clusters. (20 year prediction...)

> I have a whole taxonomy of different sorts of clustering:
> * fast vs slow bypass clusters
> * fully bypassed vs. partially bypassed
> * mechanisms to reduce bypassing
> * physical layout of clusters
> * bit interleaved datapaths
> * datapaths flowing in opposite directions,
> with bypassing where they touch
> * what's in the cluster
> * execute only
> * execute + data cache
> * schedule + execute + data cache
> * renamer + schedule + execute + datacache
> ...
> * what gets shared between clusters
> * front-end
> * renamer?
> * data-cache - L0? L1? L2?
> * TLBs...
> * MSHRs...
> * FP...
>

> Anyway... It's cool to see MCMT becoming real. It gives me hope that my
> follow-on to MCMT, M* may still, eventually, also become real.

So where does my paired odd/even pipelines proposal fit in your taxonomy.

Brett

Andy "Krazy" Glew

unread,
Nov 16, 2009, 12:52:58 AM11/16/09
to Brett Davis
Brett Davis wrote:
>> @InProceedings{kim&smith02

>> author = {Ho-Seop Kim and James E. Smith},
>> title = {An Instruction Set and Microarchitecture for
>> Instruction Level Distributed Processing},
>> - anton
>
> This paper is timely for me as I have an 8 way design spec'd out that
> I am getting ready to write up, and I have found no one else looking
> at 8 way. (Feel free to fix this shortcoming of mine.)

Or take an MCMT machine like Barcelona, which essentially has two
roughly 4-wide clusters - and figure out a way to spread computation
of a single thread across both clusters.

The problem with width is the bypasses.

Andy "Krazy" Glew

unread,
Nov 16, 2009, 1:03:38 AM11/16/09
to Brett Davis
Brett Davis wrote:
> For SpMT I looked at Alpha style paired integer pipelines with a 2 cycle
> latency for any rare copies needed between the duplicate register sets.
> In loops each pipeline handles its odd or even half of the loop count.
> Outside of loops you have both CPUs running the same code, has power
> and heat issues. But you win the all important benchmark queen position.
> (Gamers will love it, server folk will not buy it.)
> Each half would have its own instruction pointer, memory latencies in
> the non-loop code would re-sync the IPs to near match.
> Someone will do this one day.
>
> So where does my paired odd/even pipelines proposal fit in your taxonomy.
>
> Brett

Are you bypassing between the clusters? If so, you have a bypass
cluster. Or else how are you transferring registers across iterations?
It sounds as if you have an incomplete 2 cycle inter-cluster bypass.

I must admit that I am puzzled by using loop-iteration SpMT if you can
do the bypassing between the clusters. I guess that you are using that
"batching" to hopefully reduce inter-cluster bypassing. But then I am
not a big fan of inner loop SpMT. Loops are easy, and are already done
pretty well.

Having both run the same code sounds like Jim Goodman's Datascalar.

So: you seem to be proposing clusters with partial intercluster
bypassing of 2 clocks, batched on alternate loop iterations, with
Datascalar style non-loop execution.

You haven't said enough about the physical layout to talk about those
clustering effects.

Noob

unread,
Nov 16, 2009, 6:01:22 AM11/16/09
to
Brett Davis wrote:

> The basic problem is using Alpha RISC ops as his source, the timeline
> of 2002 means he was stuck with GCC. Today due to Apple we have LLVM
> and Clang which will allow you access to more parallelism.

"We have LLVM due to Apple" ?!

Did Apple support the project from the start?

"The LLVM project started in 2000 at the University of Illinois at
Urbana-Champaign, under the direction of Vikram Adve and Chris Lattner.
In 2005, Apple Inc. hired Lattner and formed a team to work on the LLVM
system for various uses within Apple's development systems."

Andy "Krazy" Glew

unread,
Nov 16, 2009, 3:49:22 PM11/16/09
to Brett Davis
Andy "Krazy" Glew wrote:
> Brett Davis wrote:
>> For SpMT I looked at Alpha style paired integer pipelines ...

>>
>> So where does my paired odd/even pipelines proposal fit in your taxonomy.
>
> You haven't said enough about the physical layout to talk about those
> clustering effects.

The physical layout matters a lot, and hence has its own terminology.

For example, most datapaths are bit interleaved - the actual wires might
look like

ALU 0 s1 bit 0
ALU 0 s2 bit 0
ALU 0 d bit 0

ALU 1 s1 bit 0
ALU 1 s2 bit 0
ALU 1 d bit 0

ALU 0 s1 bit 1
ALU 0 s2 bit 1
ALU 0 d bit 1

ALU 1 s1 bit 1
ALU 1 s2 bit 1
ALU 1 d bit 1

Bit interleaving makes bypassing of same bit to same bit easier.

Unfortunately, bit interleaving introduces an O(N^2) factor to the area.
It doesn't matter for a small degree of superscalarness, but it matters
as you get bigger, as you become wire limited.

For a long time I observed that most processors tended to have only
around 4-6 superscalar. Which corresponds roughly to 12-24 wires
for each bit:
12 = (4 ops * (2 inputs/op + 1 output/op))
to
24 = (6 ops * (3 inputs/op + 1 output/op))

For a while I was puzzled by IBM's Power family, e.g. Power 4, which
seemed to be on the low side. Until I was told that there were extra
wires not related to superscalarness, for DMA, etc.


Anyway - my hypothesis seems to have been broken by recent Intel
machines that are up to 7 wide. But nevertheless...

8-wide on a bit interleaved datapath is pushing the envelope.


8-wide as 2 4-wide bit interleaved datapaths is not so bad, although you
will pay a lot of area for the wire turns to get to side by side
datapaths. So I might call this an 8-wide cluster composed of two
adjacent 4-wide bit-interleaved clusters. With whatever bypassing you
choose.

As I mentioned earlier, there is a sweet trick you can play with
datapaths that are paired and opposite, reflections. You bit interleave
the input and output wires within the, say, 4-wide cluster, and then you
bit interleave the outputs (but not the inputs). Trouble with this
trick is that it takes one of the ends of the datapath away - where do
you put the scheduler? The data cache?
So, I call this "opposed" clustering.


Andy "Krazy" Glew

unread,
Nov 16, 2009, 4:05:44 PM11/16/09
to Brett Davis
Andy "Krazy" Glew wrote:
> Of course, AMD has undoubtedly changed and evolved MCMT in many ways
> since I first proposed it to them. For example, I called the set of an
> integer scheduler, integer execution units, and an L1 data cache a
> "cluster", and the whole thing, consisting of shared front end, shared
> FP, and 2 or more clusters, a processor core. Apparently AMD is calling
> my clusters their cores, and my core their cluster. It has been
> suggested that this change of terminology is motivated by marketing, so
> that they can say they have twice as many cores.
>
> My original motivation for MCMT was to work around some of the
> limitations of Hyperthreading on Willamette.

Now that y'all can see MCMT from a source other than me, it may be
easier to explain some of the reasons why I am interested in SIMT /
Coherent Threading in GPUs like Nvidia (and ATI, and Intel).

The SIMT GPUs take a single instruction through a shared front-end and
distribute it to different "threads" running in different lanes.
Essentially, replicated execution units. Memory may or may not be
replicated; the GPUs seem often to decouple memory from the execution
lanes, as is needed for non-stride-1 accesses.

MCMT, as in Bulldozer, shares the front end, replicates in each cluster
the scheduler, execution units, and L1 data cache. MCMT is typically
superscalar, but if we examined the limit case of a non-superscalar
MCMT, it would be taking one instruction and distributing it to only one
of the clusters.

Roughly speaking, MCMT clusters correspond to SIMD/SIMT vector lanes.

But while SIMT can send the same instruction to multiple lane/clusters,
MCMT does not.

So, the logical question is, why not? Why not send the same
instruction(s) to 2 (or more) clusters in an MCMT machine? If you can
recognize that the clusters are executing the same code?

To do this on an out-of-order processor you would probably need to
replicate the renamer. (Or split it into two stages, one shared, a
hopefully cheaper stage replicated.) But apart from this, it would
work. The scheduler within the MCMT clusters would decouple, and allow
the clusters to operate independently, and perhaps diverge.

This might allow MCMT clustering to scale beyond 2-3.

Downside is, Bulldozer shares the FP. But FP workloads are the
workloads that benefit most from SIMD/SIMT.

Niels Jørgen Kruse

unread,
Nov 16, 2009, 4:49:32 PM11/16/09
to
Andy "Krazy" Glew <ag-...@patten-glew.net> wrote:

Die Photos of PPC970, POWER4 and POWER5 clearly have 2 identical integer
units replicated, so I doubt they are interleaved.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Mayan Moudgill

unread,
Nov 16, 2009, 9:44:39 PM11/16/09
to
Brett Davis wrote:
>
> I am concerned about one quote from the second paper:
> "The typical IPC of SPEC INT type workload is below 2."
>
> With loop unrolling you should be able to get arbitrary IPC
> multiplication, the problem is the OoO engine being too stupid
> due to various real world constraints to make use of the IPC.

Consider the following loop (and assume all loads hit in cache).

while( p != NULL ) {
n++;
p = p->next;
}

Please unroll to arbitrarily multiply the IPC.

Brett Davis

unread,
Nov 17, 2009, 1:22:40 AM11/17/09
to
In article <4B00EB3A...@patten-glew.net>,

"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> wrote:

> Brett Davis wrote:
> > For SpMT I looked at Alpha style paired integer pipelines with a 2 cycle
> > latency for any rare copies needed between the duplicate register sets.
> > In loops each pipeline handles its odd or even half of the loop count.
> > Outside of loops you have both CPUs running the same code, has power
> > and heat issues. But you win the all important benchmark queen position.
> > (Gamers will love it, server folk will not buy it.)
> > Each half would have its own instruction pointer, memory latencies in
> > the non-loop code would re-sync the IPs to near match.
> > Someone will do this one day.
> >
> > So where does my paired odd/even pipelines proposal fit in your taxonomy.
> >
> > Brett
>
> Are you bypassing between the clusters? If so, you have a bypass
> cluster. Or else how are you transferring registers across iterations?
> It sounds as if you have an incomplete 2 cycle inter-cluster bypass.

After looking at the bypass problem I have decided that there will be none.
Minimal signaling: on the loop branch you signal a loop detect, and
after a loop or a few loops signal an attempt to do odd loops only;
the second CPU would ack and do even loops.

So this is separate CPUs running the same code, no sharing of register state.
There are some other things you can do also: if one CPU is farther ahead
and fails a speculated branch, the second CPU has less state to rewind, if
any, and so the second CPU takes the lead. Faster than one CPU.

For the rare branches the speculator knows are a coin flip, you can have
each CPU take each choice. The winner makes progress, the one that loses
will catch up on the next cache miss. Faster than one CPU.

Any time the CPUs are in lock step it may pay to throw one down the
"95% wrong" path on the off chance it wins.

For the moment lets forget that you will be lucky to be 5% faster, and
you could use that second CPU to get 100% on another task.
If all that matters is benchmark peak, and you can sell your Core11 CPU
for $999 while your competitor has to sell his Phantom5 for $200, well,
the profit will pay for the inefficiency.

You would share the FPU/Vector unit, so the area cost would be the same
as the Bulldozer, well within modern die size budgets. If the Bulldozer
integer pipes are next to each other maybe something like this or some
of your designs is planned for now or the future...

Few common folk can make any use of multi-cores; if I can turn my 8-core
Bulldozer into a 4-core that's 5% faster, I will, as will most.

The gamers are happy, the server buyers are happy, win win.

So where does my KISS-simple paired-CPU design fit in your taxonomy, and
has it been done? (Bet not.)

> I must admit that I am puzzled by using loop-iteration SpMT if you can
> do the bypassing between the clusters. I guess that you are using that
> "batching" to hopefully reduce inter-cluster bypassing. But then I am
> not a big fan of inner loop SpMT. Loops are easy, and are already done
> pretty well.

"Loops are easy." ;) Pray tell where these plentiful easy-to-speed-up
areas are in CPU design. ;) Run straight into a brick wall you have. ;)

Brett

Brett Davis

unread,
Nov 17, 2009, 1:39:14 AM11/17/09
to
In article <X--dnTCmHuSLk5_W...@bestweb.net>,
Mayan Moudgill <ma...@bestweb.net> wrote:

Please point out the Spec benchmark that does this. ;)

The problem has to be game-able to make it past the politics
and dollars that decide what makes it into a major benchmark.
That actual problems might be solved is incidental. ;)

Brett

Andy "Krazy" Glew

unread,
Nov 17, 2009, 1:54:02 AM11/17/09
to Brett Davis
Brett Davis wrote:
> In article <4B00EB3A...@patten-glew.net>,
> "Andy \"Krazy\" Glew" <ag-...@patten-glew.net> wrote:
>> Are you bypassing between the clusters? If so, you have a bypass
>> cluster. Or else how are you transferring registers across iterations?
>> It sounds as if you have an incomplete 2 cycle inter-cluster bypass.
>
> After looking at the bypass problem I have decided that there will be none.
> Minimal signaling, on the loop branch you signal a loop detect, and
> after a loop or a few loops signal an attempt to to do odd loops only,
> the second CPU would ack and do even loops.
>
> So this is separate CPUs running the same code, no sharing of register state.
> There are some other things you can do also, if one CPU is farther ahead
> and fails a speculated branch, the second CPU has less state to rewind, if
> any, and so the second CPU takes the lead. Faster than one CPU.

I take it that you are designing an ISA from scratch, rather than trying
to apply SpMT to existing binaries (which have registers live across
iterations).

Several of us have looked at "all state in memory" SpMT.


> Few common folk can make any use of multi-cores, if I can turn my 8 core
> Bulldozer into a 4 core that's 5% faster, I will, as will most.

Actually, if you can turn off 4 of the cores completely, you will
probably save enough power/heat that you can run the remaining cores
more than 5-10% faster.

Your idea may still be good, but you have to win more than 5%.


> "Loops are easy." ;) Pray tell where these plentiful easy to speedup
> areas are in CPU design. ;) Run straight into a brick wall you have. ;)

I haven't.

The guys I left behind have.

Matt

unread,
Nov 17, 2009, 4:54:35 AM11/17/09
to
On 15 Nov., 07:50, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

> Brett Davis wrote:
> > Bulldozer details + bobcat
>
> BRIEF:
>
> AMD's Bulldozer is an MCMT (MultiCluster MultiThreaded)
> microarchitecture.  That's my baby!
>
> DETAIL:
[...]

> True, there are differences, and I am sure more will become evident as
> more Bulldozer information becomes public.  For example, although I came
> up with MCMT to make Willamette-style threading faster, I have always
> wanted to put SpMT, Speculative Multithreading, on such a substrate.
> SpMT has potential to speed up a single thread of execution, by
> splitting it up into separate threads and running the separate threads
> on different clusters, whereas Willamette-style hyperthreading, and
> Bulldozer-style MCMT (apparently), only speed up workloads that have
> existing independent threads. I still want to build SpMT.  My work at
> Wisconsin showed that SpMT on a Willamette substrate was constrained by
> Willamette's poor threading microarchitecture, so naturally I had to
> first create the best explicit threading microarchitecture I could, and
> then run SpMT on top of it.

Have you seen the "eager execution" patent application (app. no.
20090172370)? I mentioned it in my blog a while back (
http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-threading-and-single-thread-execution-6464533/
). This patent application mentions some ways to use the clusters to
speed up single thread performance. There are some more patent
applications on this topic, but they just repeat most of the methods
listed in the "eager execution" one.

Matt
("Dresdenboy")

Terje Mathisen

unread,
Nov 17, 2009, 10:56:10 AM11/17/09
to
Mayan Moudgill wrote:
> Consider the following loop (and assume all loads hit in cache).
>
> while( p != NULL ) {
> n++;
> p = p->next;
> }
>
> Please unroll to arbitrarily multiply the IPC.

The best you can do here is to have the hardware notice that (since each
node is the same size) the stride is mostly constant, so hw prefetch can
have a chance to get the next node/pointers resident in L1.

You'll still be limited to whatever the minimum L1 load-to-use latency
is, and quite often a lot worse.

Using sw to solve the same problem is _very_ problematical, at best:

Having something like Itanium advanced (fault-free) load attempts could
let you try to run ahead, but you would _very_ quickly run out of L1
bandwidth, issue slots and power.

The only solution to this sort of problem is some form of clustering,
where multiple nodes are known to be stored back-to-back in a single
block of memory.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Mayan Moudgill

unread,
Nov 17, 2009, 7:21:02 PM11/17/09
to
Matt wrote:
> Have you seen the "eager execution" patent application (app. no.
> 20090172370)? I mentioned it in my blog a while back (
> http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-threading-and-single-thread-execution-6464533/
> ). This patent application mentions some ways to use the clusters to
> speed up single thread performance. There are some more patent
> applications on this topic, but they just repeat most of the methods
> listed in the "eager execution" one.
>

Had a look; it seems, at a first reading, to be one more of the
"wouldn't it be great to follow both paths of a branch, and once we
figure out which way the branch went, we'll copy registers from one path
over to the other, and, oh, BTW, its not clear that it wouldn't have
been less work to just re-execute the second path" ideas.

Robert Myers

unread,
Nov 18, 2009, 12:01:18 AM11/18/09
to
On Nov 17, 7:21 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> Matt wrote:
> > Have you seen the "eager execution" patent application (app. no.
> > 20090172370)? I mentioned it in my blog a while back (
> >http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-th...

> > ). This patent application mentions some ways to use the clusters to
> > speed up single thread performance. There are some more patent
> > applications on this topic, but they just repeat most of the methods
> > listed in the "eager execution" one.
>
> Had a look; it seems, at a first reading, to be one more of the
> "wouldn't it be great to follow both paths of a branch, and once we
> figure out which way the branch went, we'll copy registers from one path
> over to the other, and, oh, BTW, its not clear that it wouldn't have
> been less work to just re-execute the second path" ideas.

I've stopped reading papers about such schemes. Everything falls
apart when you ask the question, "But is it worth it?" as to power and
die area.

As to copying registers, though, is that what you'd really do? Why
wouldn't it be just like register renaming?

Robert.

Terje Mathisen

unread,
Nov 18, 2009, 2:19:44 AM11/18/09
to
Mayan Moudgill wrote:

> Matt wrote:
>> ). This patent application mentions some ways to use the clusters to
>> speed up single thread performance. There are some more patent
>> applications on this topic, but they just repeat most of the methods
>> listed in the "eager execution" one.
>>
>
> Had a look; it seems, at a first reading, to be one more of the
> "wouldn't it be great to follow both paths of a branch, and once we
> figure out which way the branch went, we'll copy registers from one path
> over to the other, and, oh, BTW, its not clear that it wouldn't have
> been less work to just re-execute the second path" ideas.

I agree.

Even worse, as Andy already mentioned:

With multiple cores and dynamic frequency adjustments depending upon the
number of concurrently active cores, having one core totally idle
instead of trying to follow a low-likelihood branch could allow the
active core enough thermal headroom to run enough faster to make up for
some missed branch predictions.

Brett Davis

unread,
Nov 18, 2009, 2:32:36 AM11/18/09
to
In article <4B02488A...@patten-glew.net>,

"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> wrote:

> > So this is separate CPUs running the same code, no sharing of register state.
> > There are some other things you can do also, if one CPU is farther ahead
> > and fails a speculated branch, the second CPU has less state to rewind, if
> > any, and so the second CPU takes the lead. Faster than one CPU.
>
> I take it that you are designing an ISA from scratch, rather than trying
> to apply SpMT to existing binaries (which have registers live across
> iterations).

The P4 proved that you did not need a new instruction set to do everything
that the RISC chips were doing. It also proved that you needed to recompile
to not risk negative performance improvements. (Later fixed.)

Best would be a CLIW instruction set with lots of visible registers, using
8 ALUs with only 16 x86 registers would be interesting. ;)
But likely not productive...

> Several of us have looked at "all state in memory" SpMT.

That violates the fact that registers are faster.

> > Few common folk can make any use of multi-cores, if I can turn my 8 core
> > Bulldozer into a 4 core thats 5% faster, I will, as will most.
>
> Actually, if you can turn off 4 of the cores completely, you will
> probably save enough power/heat that you can run the remaining cores
> more than 5-10% faster.
>
> Your idea may still be good, but you have to win more than 5%.

On rare occasion it would get 50%, a rigged benchmark will get ~95%.
Intel would get away with this, they have the marketing dollars to
decide what benchmarks are used by publications on and off the web.

Hyperthreading benchmarks are the classic example, speedups you will
never see in the real world. Code fabricated to have an IPC less
than one even after OoO, so two copies can run and not hit any of
the bottlenecks, like only 2 ALUs in the CPU...

Actually Hyperthreading was not that good, could not share an ALU,
the CPU had to be idle, so the benchmark had to be some variation
on pointer chasing code that did little real work. Something you
could optimize away and get a 5 to 10x speedup.

> > "Loops are easy." ;) Pray tell where these plentiful easy to speedup
> > areas are in CPU design. ;) Run straight into a brick wall you have. ;)
>
> I haven't.
>
> The guys I left behind have.

Touche. ;)

Brett

Andy "Krazy" Glew

unread,
Nov 18, 2009, 2:44:39 AM11/18/09
to Mayan Moudgill

Eager execution predates MCMT. Tjaden and Flynn, circa 1968, if I
remember correctly, studied what we might now call eager execution. (By
the way, Tjaden ...)

There might be novelty in implementing Eager Execution ...

My original MCMT eager execution ideas (from Wisconsin) worked in
conjunction with the second level scheduler and the PRF shared between
the clusters. When forking, you copied the renamer state (not really
that many bits). Because the PRF was shared between clusters, you
didn't need to transfer values, although you did need to wait until
values from the parent cluster had been written back, since you might
not have bypassing.

Eager execution is only a good idea if forking is close to free - or at
least does not disturb the parent thread. The same is true for SpMT.

---

I hope that AMD's patent applications are on inventions that were made
after I left AMD, or by people other than me - because, of course, if
they claim to have invented things that I either invented at AMD or
brought to AMD, their patents would have issues.

But I'm not going to investigate. I don't need the aggravation.

Mayan Moudgill

unread,
Nov 18, 2009, 3:16:55 AM11/18/09
to
Andy "Krazy" Glew wrote:
> Mayan Moudgill wrote:
>
>> Matt wrote:
>>
>>> Have you seen the "eager execution" patent application (app. no.
>>> 20090172370)? I mentioned it in my blog a while back (
>>> http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-threading-and-single-thread-execution-6464533/
>>>
>>>
>> Had a look; it seems, at a first reading, to be one more of the
>> "wouldn't it be great to follow both paths of a branch, and once we
>> figure out which way the branch went, we'll copy registers from one
>> path over to the other, and, oh, BTW, its not clear that it wouldn't
>> have been less work to just re-execute the second path" ideas.
>
> When forking, you copied the renamer state (not really
> that many bits).

You'll also have to duplicate the call-return prediction structure.

You'll have to tweak the address matching logic in the store queue.

You'll have to change the branch resolution clean-up. IMO, this is the
difficult bit.
- you have to free up some renamed registers (or, if you're using a
different implementation, partially purge the writeback queue).
- you have to free up entries in the store queue.

On a non-eager machine, the store queue is logically a FIFO; when a
branch mispredict resolves you (logically) move the valid pointer to the
last correctly predicted store. In the case of eager execution, the
store queue becomes a sparse structure, which in turn means that there
are going to be issues with allocation, load/store matching, and
identifying age for commit.

Even free-ing of mispredicted registers could be a little more expensive.
- You may need more bits to distinguish between true/false path predictions.
- If you're going to be eager on more than one branch, then mispredict
resolution can potentially kill trees of rename sequences, not just
straight-line paths.
- Detecting a free register may become more of a challenge.

nm...@cam.ac.uk

unread,
Nov 18, 2009, 5:10:13 AM11/18/09
to
In article <961ed6a9-3bb2-4e33...@c3g2000yqd.googlegroups.com>,
Robert Myers <rbmye...@gmail.com> wrote:

>On Nov 17, 7:21 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
>> Matt wrote:
>> > Have you seen the "eager execution" patent application (app. no.
>> > 20090172370)? I mentioned it in my blog a while back (
>> >http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-th...
>> > ). This patent application mentions some ways to use the clusters to
>> > speed up single thread performance. There are some more patent
>> > applications on this topic, but they just repeat most of the methods
>> > listed in the "eager execution" one.
>>
>> Had a look; it seems, at a first reading, to be one more of the
>> "wouldn't it be great to follow both paths of a branch, and once we
>> figure out which way the branch went, we'll copy registers from one path
>> over to the other, and, oh, BTW, its not clear that it wouldn't have
>> been less work to just re-execute the second path" ideas.
>
>I've stopped reading papers about such schemes. Everything falls
>apart when you ask the question, "But is it worth it?" as to power and
>die area.

Most of them fall apart long before that!

Eager execution may or may not be a winner for a few parallel paths,
but it very clearly can help only with 'pipeline' penalties and not
'preloading' ones. And it is the latter that are the killer in
almost all real programs. Sun's 'scout threads' have more potential,
because they can't be debunked using only the back of an envelope,
but I still have reservations.

I still think that there is serious potential in using a cooperative
approach to branch prediction, as I have posted before. It involves
adding a few, simple instructions that have no effect on the actions
on the program, but may be used to pass information from the compiler
to the prediction unit. Roughly:

SET_PRED n <registers>
Sets predictor n to an undefined hash of the register values.
USE_PRED n <confidence>
Uses that predictor, with confidence between (say) -7 and 7; negative
values imply the confidence of it not being taken.
PUSH_PREDS <predictors>
POP_PREDS <predictors>
Used at function calls and elsewhere to indicate the change in use.
Simply clearing a predictor is always a permitted action, and most
hardware would keep only a very small stack of them.

And, yes, my approach is specifically aimed at O-O languages, where
the same code is used for different objects, and where the path is
dependent more on the object than the instruction. Would it be a
significant gain? Dunno, but simulating it would be a damn good
research topic for a graduate student. Not hard, no special hardware
needed (so cheap), potentially valuable, ....


Regards,
Nick Maclaren.

Stephen Fuld

unread,
Nov 18, 2009, 12:36:40 PM11/18/09
to
nm...@cam.ac.uk wrote:

snip

> I still think that there is serious potential in using a cooperative
> approach to branch prediction, as I have posted before. It involves
> adding a few, simple instructions that have no effect on the actions
> on the program, but may be used to pass information from the compiler
> to the prediction unit.

Let me make sure I understand what you are proposing.

> Roughly:
>
> SET_PRED n <registers>
> Sets predictor n to an undefined hash of the register values.

So this takes some number of registers, whose names are given in the
instruction, creates a hash of the values in those registers and saves
the hash in some dedicated location, indexed by n? Is that an accurate
description?

> USE_PRED n <confidence>
> Uses that predictor, with confidence between (say) -7 and 7; negative
> values imply the confidence of it not being taken.

So this takes the same hash of the same registers and compares the value
of the hash to the one stored at location n. If the values match, use
the confidence value in the instruction to predict the branch. If the
hash values don't match, don't use the confidence value (and perhaps
default to some other predictor). Is that correct?

> PUSH_PREDS <predictors>
> POP_PREDS <predictors>
> Used at function calls and elsewhere to indicate the change in use.
> Simply clearing a predictor is always a permitted action, and most
> hardware would keep only a very small stack of them.

What is specified by the <predictors> in these instructions? I
understand the intent, but not exactly what the instructions specify.

This whole scheme as I described it sounds problematical, so before I
spend much time discussing the problems, please let me know if my
descriptions are accurate, or where I went wrong.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

nm...@cam.ac.uk

unread,
Nov 18, 2009, 12:49:09 PM11/18/09
to
In article <he1bb8$hmg$1...@news.eternal-september.org>,

Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>
> > Roughly:
>>
>> SET_PRED n <registers>
>> Sets predictor n to an undefined hash of the register values.
>
>So this takes some number of registers, whose names are given in the
>instruction, creates a hash of the values in those registers and saves
>the hash in some dedicated location, indexed by n? Is that an accurate
>description?

Not quite. It looks up the hash in a branch predictor structure,
and saves a confidence value, according to whether it is expected
to correspond with branches being taken. That structure is likely
to be a simple LRU.

>> USE_PRED n <confidence>
>> Uses that predictor, with confidence between (say) -7 and 7; negative
>> values imply the confidence of it not being taken.
>
>So this takes the same hash of the same registers and compares the value
>of the hash to the one stored at location n. If the values match, use
>the confidence value in the instruction to predict the branch. If the
>hash values don't match, don't use the confidence value (and perhaps
>default to some other predictor). Is that correct?

No, not at all. It uses the prediction value directly by looking
up the value indexed by n, and matching signs. It will also update
that value, according to the success of the prediction.

>> PUSH_PREDS <predictors>
>> POP_PREDS <predictors>
>> Used at function calls and elsewhere to indicate the change in use.
>> Simply clearing a predictor is always a permitted action, and most
>> hardware would keep only a very small stack of them.
>
>What is specified by the <predictors> in these instructions? I
>understand the intent, but not exactly what the instructions specify.

The indices of the predictors - i.e. the 'n's above.

>This whole scheme as I described it sounds problematical, so before I
>spend much time discussing the problems, please let me know if my
>descriptions are accurate, or where I went wrong.

Certainly. I apologise for being obscure. The basic idea is that
there are a small number of predictor states, which are selected
and used according to code generated by the compiler. The actual
prediction is handled entirely by the hardware, based on what has
happened before (i.e. just as at present).

The only real difference from the current mechanisms is that the
compiler can tell the hardware "these are the registers that control
which object I am handling", rather than the lookup being purely
based on the instruction location (and possibly thread number).


Regards,
Nick Maclaren.

Stephen Fuld

unread,
Nov 18, 2009, 2:23:12 PM11/18/09
to
nm...@cam.ac.uk wrote:
> In article <he1bb8$hmg$1...@news.eternal-september.org>,
> Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>>> Roughly:
>>>
>>> SET_PRED n <registers>
>>> Sets predictor n to an undefined hash of the register values.
>> So this takes some number of registers, whose names are given in the
>> instruction, creates a hash of the values in those registers and saves
>> the hash in some dedicated location, indexed by n? Is that an accurate
>> description?
>
> Not quite. It looks up the hash in a branch predictor structure,
> and saves a confidence value, according to whether it is expected
> to correspond with branches being taken. That structure is likely
> to be a simple LRU.

So the structure is a two dimensional table, one dimension indexed by n,
and the other implementation dependent, but treated as an LRU? What
does each entry in the table contain? It must have the hash, and
presumably a few (4?) bit field to indicate prediction direction and
confidence. Anything else?

So we take the hash. If it is found, we move the corresponding entry to
the top of the LRU. If the hash is not found, we evict the LRU entry
and create a new MRU entry with the hash. Is that right?

>>> USE_PRED n <confidence>
>>> Uses that predictor, with confidence between (say) -7 and 7; negative
>>> values imply the confidence of it not being taken.
>> So this takes the same hash of the same registers and compares the value
>> of the hash to the one stored at location n. If the values match, use
>> the confidence value in the instruction to predict the branch. If the
>> hash values don't match, don't use the confidence value (and perhaps
>> default to some other predictor). Is that correct?
>
> No, not at all. It uses the prediction value directly by looking
> up the value indexed by n, and matching signs.

Signs??? Do you mean the sign of the predictor value in the table and
the sign of the confidence field? So we go to the n array in the table and
match what against the hash value found in each entry? Don't we need to
do the hash on the registers to have something to match against?
Otherwise, how do we know which entry to use? Or do we just use the MRU
entry? Once we pick an entry, what do we do if the signs match versus
if they don't?

> It will also update
> that value, according to the success of the prediction.

So you increment or decrement the value in the entry in the table?


It's getting clearer, but I am still not clear on the details.

nm...@cam.ac.uk

unread,
Nov 18, 2009, 2:45:52 PM11/18/09
to
In article <he1hiv$jmf$1...@news.eternal-september.org>,

Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>
>So the structure is a two dimensional table, one dimension indexed by n,
>and the other implementation dependent, but treated as an LRU? What
>does each entry in the table contain? It must have the hash, and
>presumably a few (4?) bit field to indicate prediction direction and
>confidence.. Anything else?

No, and it needn't even be that. The 'n' need not be stored in the
structure. The hash is looked up and the confidence value (which
includes the direction) copied to the separate register file that
is indexed by n (and, again, contains just a confidence value).

>So we take the hash. If it found, we move the corresponding entry to
>the top of the LRU. If the hash is not found, it evicts the LRU entry
>and creates a new MRU entry with the hash. Is that right?

Yes.

>Signs??? Do you mean the sign of the predictor value in the table and
>the sign of the confidence field? So we go the n array in the table and
>match what against the hash value found in each entry? Don't we need to
>do the hash on the registers to have something to match against?
>Otherwise, how do we know which entry to use? Or do we just use the MRU
>entry? Once we pick an entry, what do we do if the signs match versus
>if they don't?

When using the data, we have the 'n', so we just fetch the confidence
value from the register file indexed by that. This is negated if the
confidence in the USE_PRED is negative, and its sign used to predict
the branch.

>> It will also update
>> that value, according to the success of the prediction.
>
>So you increment or decrement the value in the entry in the table?

Or whatever. That's what I was thinking of as a first experiment.

>It's getting clearer, but I am still not clear on the details.

Nor am I! What I am describing is an approach that might lead to
better prediction for O-O codes. But the details would be modified
according to whatever techniques are already known to be effective
for branch prediction, and as a result of simulation.

My key idea is to get away from using just the instruction address
and thread index for branch prediction, and to use the address of
the object being worked on (or whatever the compiler regards as
appropriate). Really no more than that. But doing even that needs
some handshaking between the compiler and the hardware.

Oh, the other main idea is 'advisory instructions' only, so that
a CPU could ignore them without behavioural changes. But that isn't
a new idea.
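Put together, the mechanism as described might be modelled in software
roughly as follows: a hashed, LRU-managed table of signed confidence
values, plus a small confidence register file indexed by n. The class
name, table sizes, hash, and saturating update rule below are all my own
illustrative assumptions, not part of the proposal.

```python
# Toy model of the SET_PRED / USE_PRED scheme sketched above.
# All structure sizes and the training rule are illustrative guesses.
from collections import OrderedDict

class AdvisoryPredictor:
    def __init__(self, n_regs=4, lru_size=64):
        self.table = OrderedDict()        # hash -> signed confidence, in LRU order
        self.lru_size = lru_size
        self.pred_regs = [0] * n_regs     # confidence register file, indexed by n
        self.pred_hash = [None] * n_regs  # which table entry each n was loaded from

    def set_pred(self, n, reg_values):
        # "Sets predictor n to an undefined hash of the register values":
        # look the hash up in the LRU structure and copy its confidence out.
        h = hash(tuple(reg_values))
        if h in self.table:
            self.table.move_to_end(h)       # hit: make it the MRU entry
        elif len(self.table) >= self.lru_size:
            self.table.popitem(last=False)  # miss: evict the LRU entry
        self.table.setdefault(h, 0)         # new entries start neutral
        self.pred_regs[n] = self.table[h]
        self.pred_hash[n] = h

    def use_pred(self, n, confidence):
        # Negative static confidence flips the sense of the prediction.
        c = self.pred_regs[n]
        if confidence < 0:
            c = -c
        return c >= 0                       # True -> predict taken

    def resolve(self, n, taken):
        # Train toward the actual outcome, saturating the table at +/-7.
        delta = 1 if taken else -1
        self.pred_regs[n] += delta
        h = self.pred_hash[n]
        if h in self.table:
            self.table[h] = max(-7, min(7, self.table[h] + delta))
```

PUSH_PREDS/POP_PREDS would then just save and restore a few of the
pred_regs slots around calls; since everything is advisory, an
implementation that ignores all four instructions behaves identically.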

Regards,
Nick Maclaren.

EACA

unread,
Nov 19, 2009, 1:30:20 AM11/19/09
to
"The only bad thing is that some guys I know at AMD say that Bulldozer
is
not really all that great a product, but is shipping just because AMD
needs a model refresh."

Andy:
Can you elaborate more on this issue? I find it irrational that
AMD's own engineers (or whoever told you that) don't trust their
new micro-architecture enough.

Waiting anxiously for your reply.

nm...@cam.ac.uk

unread,
Nov 19, 2009, 4:58:10 AM11/19/09
to
In article <ggtgp-1B5ECC....@netnews.asp.att.net>,

What I call vectorisable codes. A solved problem since the 1970s.
Unfortunately, outside HPC, they are rare - and, even in that, they
don't dominate.


Regards,
Nick Maclaren.

Mayan Moudgill

unread,
Nov 19, 2009, 6:34:15 AM11/19/09
to
Niels Jørgen Kruse wrote:
> Andy "Krazy" Glew <ag-...@patten-glew.net> wrote:
>
>
>>The physical layout matters a lot, and hence has its own terminology.
>>
>>For example, most datapaths are bit interleaved - the actual wires might
>>look like
....

>>Bit interleaving makes bypassing of same bit to same bit easier.
>>
>>Unfortunately, bit interleaving introduces O(N^2) factor to the area.
>>Doesn't matter for small degree of superscalarness, but matters as you
>>get bigger, as you become wire limited.
>>
....

>>
>>For a while I was puzzled by IBM's Power family, e.g. Power 4, which
>>seemed to be on the low side. Until I was told that there were extra
>>wires not related to superscalarness, for DMA, etc.,
>
>
> Die Photos of PPC970, POWER4 and POWER5 clearly have 2 identical integer
> units replicated, so I doubt they are interleaved.
>

I'll preface this response by saying that I'm not very sure that a good
full custom design might not be able to work around some of the
problems, and that I am not practiced in this area.

Using bit interleaving for bypassing works something like this for two
ALUs, each with 2 inputs and 1 output.
+----------
v XXXX | YYYY
---> MX3==============> YYYY
^ XXXX | YYYY
<=====+=======================
v XXXX | YYYY ||
---> MX3 --> XXXX | YYYY ||
^ XXXX | YYYY ||
<-----+------------ YYYY ||
v XXXX | YYYY ||
---> MX3 =============> YYYY ||
^ XXXX | YYYY ||
+=======================
v XXXX | YYYY
---> MX3 --> XXXX | YYYY
^ XXXX | YYYY
+------------

Here X's and Y's are the logic for the two ALUs, the --- are the wires
for the X ALU input and output, the === are the wires for the Y ALU
input and output. Control wires are not shown. This cell will be (kind
of) replicated 32 times for a 32-bit ALU. Nothing is to scale; in
particular, in real life the muxes might stretch all the way across the
height of the cell, and the wires would be much closer together (or even
on top of each other).

At the left we have the path to the register file. The register file and
the result values feed four 3-1 bypass muxes; two select the X inputs
and two select Y inputs. The X/Y inputs are then sent to the appropriate
ALU cell.

[The reasons why the cells can't be replicated exactly are the
propagate/generate trees for adds and the compare result trees; hence
the empty space between the cells and muxes]

Now, consider where this is useful:
- X must be tall so that there is space for the wires to go over it
- X must be skinny so that the additional wire delay does not impact
cycle time
- X must not have many layers of metal so that it is possible to route
Y's wires over it
- Y must not have too large a cycle time, so that the additional wire
delay (of routing over X) does not kill performance
- X and Y must have roughly the same height.

It pretty much restricts this to simple ALUs; add/subtracts are probably
as complex operations as you can sustain. There is very little chance
that X can contain shifters or multipliers - that'll violate the skinny
and the layers of metal restrictions.

Y can, in general, be a little more complex than X. But, again, if Y is
too much more complex than X, then Y's implementation may have to be
contorted to fit the height of X. Also, Y cannot use a full cycle [it
still has the additional routing delay over X], so it really can't be
a very complex unit.

Bit lane interleaving seems, to me, to be a somewhat overly restrictive
and impractical optimization technique. It will let you be blindingly
fast on a small subset of the total op space.

I do not know anything about the exact thinking that went into the POWER
architects' choices for integer unit layout. However, given that I
probably got my attitudes from them, they may have shared my opinions :)

Stephen Fuld

unread,
Nov 19, 2009, 2:35:16 PM11/19/09
to
nm...@cam.ac.uk wrote:
> In article <he1hiv$jmf$1...@news.eternal-september.org>,
> Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>> So the structure is a two dimensional table, one dimension indexed by n,
>> and the other implementation dependent, but treated as an LRU? What
>> does each entry in the table contain? It must have the hash, and
>> presumably a few (4?) bit field to indicate prediction direction and
>> confidence.. Anything else?
>
> No, and it needn't even be that. The 'n' need not be stored in the
> structure. The hash is looked up and the confidence value (which
> includes the direction) copied to the separate register file that
> is indexed by n (and, again, contains just a confidence value).

OK, then a one dimensional table of m entries, each containing a hash
value and a confidence value. m is implementation dependent; larger
values would keep more "history". And a separate structure, n deep,
containing the confidence values.

snip

> What I am describing is an approach that might lead to
> better prediction for O-O codes. But the details would be modified
> according to whatever techniques are already known to be effective
> for branch prediction, and as a result of simulation.
>
> My key idea is to get away from using just the instruction address
> and thread index for branch prediction, and to use the address of
> the object being worked on (or whatever the compiler regards as
> appropriate). Really no more than that. But doing even that needs
> some handshaking between the compiler and the hardware.

I understand the goal. While I am still not sure of the details of your
proposal, let me make some comments that, I think are applicable.

I have a limited understanding of hardware design, so someone else,
please correct me. Depending upon the number of registers used in the
hash, the instructions could get expensive. For example, if you want to
hash five registers, I think you would need at least three cycles (at
two register read ports per instruction cycle) just to read the values
from the registers. Then you have an "associative memory" lookup of the
m values in the LRU structure. With sufficient hardware, this could
occur in parallel, but for reasonable sized m this is a lot of hardware.
So it might add more cycles to the instruction execution. I believe
the LRU updating of the structure, which is expensive, could be done
after the instruction completes, thus not impacting instruction
execution time. It would limit the rate at which such instructions
could be issued, but that may not be a problem.

But if the instructions take, say 3 cycles, is the improvement in branch
prediction worth the cost?

On the other hand, what about a simplified version - something like the
following. The branch predictor would use a few predefined bits of a
predefined register, appended to the branch address, when indexing
the predictor table. Since branch instructions often don't use the
second register read port, it might not extend the prediction time at
all. The compiler would know which register was used, of course, and
would try to keep that register as the "object id", or object address,
or some unique value. If the value was not wanted for the prediction,
the compiler could add an instruction to zero it out prior to the
branch. This extra instruction might be almost free as it has no
dependencies and could be easily scheduled. This might get some of the
advantages of your proposal, but with a lot less hardware and complexity.


> Oh, the other main idea is 'advisory instructions' only, so that
> a CPU could ignore them without behavioural changes. But that isn't
> a new idea.

Agreed.

nm...@cam.ac.uk

unread,
Nov 19, 2009, 3:00:24 PM11/19/09
to
In article <he46lk$8ph$1...@news.eternal-september.org>,

Stephen Fuld <SF...@Alumni.cmu.edu.invalid> wrote:
>
>I have a limited understanding of hardware design, so someone else,
>please correct me. Depending upon the number of registers used in the
>hash, the instructions could get expensive. For example, if you want to
>hash five registers, I think you would need at least three cycles (at
>two register read ports per instruction cycle) just to read the values
>from the registers. ...

I have none :-) If I were running that project, I would first run
simulations to discover if it had enough potential to be worth doing
some serious analysis. And one of the aspects of that would be just
how simple a hash gave most of the benefit. It might well be just
the bottom 8 bits of a single register!

My take on it is by looking at the code of O-O method functions, and
noting that a lot of the conditional branches are controlled by a
single factor. That isn't always explicit in the data, but that's
no obstacle.


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Nov 20, 2009, 3:32:29 AM11/20/09
to
nm...@cam.ac.uk wrote:
> SET_PRED n <registers>
> Sets predictor n to an undefined hash of the register values.
> USE_PRED n <confidence>
> Uses that predictor, with confidence between (say) -7 and 7; negative
> values imply the confidence of it not being taken.
> PUSH_PREDS <predictors>
> POP_PREDS <predictors>
> Used at function calls and elsewhere to indicate the change in use.
> Simply clearing a predictor is always a permitted action, and most
> hardware would keep only a very small stack of them.
>
> And, yes, my approach is specifically aimed at O-O languages, where
> the same code is used for different objects, and where the path is
> dependent more on the object than the instruction. Would it be a

Isn't it easier to "simply" agree with the compiler writers that the
self pointer will always be loaded into (say) ECX/RCX, and then let the
content of that register, along with the EIP/RIP code address, control
the BTB lookups?

If you're already going to have a multi-level branch predictor, with a
top-level chooser to determine which one to use, then you could have
this object type predictor be one of the lower-level ones.

This at least is both forward and backwards compatible with existing
code and does not require any instruction set changes.
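A minimal sketch of what that BTB index might look like, folding a few
bits of the assumed self-pointer register into the PC-based index; the
field widths and the XOR folding here are purely illustrative
assumptions:

```python
# Hypothetical BTB index combining the branch address (RIP) with the
# object's self pointer (assumed by compiler convention to be in RCX).
def btb_index(rip, rcx, index_bits=12, obj_bits=4):
    pc_part = (rip >> 2) & ((1 << index_bits) - 1)
    # Skip the lowest pointer bits, since objects tend to be aligned.
    obj_part = (rcx >> 4) & ((1 << obj_bits) - 1)
    return pc_part ^ (obj_part << (index_bits - obj_bits))
```

The same branch then gets distinct predictor entries for distinct
objects, which is the history separation Nick's scheme is after, at the
cost of extra aliasing and a fixed register convention.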

Cameron Jack

unread,
Nov 24, 2009, 7:05:06 AM11/24/09
to
On Nov 18, 9:16 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> Andy "Krazy" Glew wrote:
> > Mayan Moudgill wrote:
>
> >> Matt wrote:
>
> >>> Have you seen the "eager execution" patent application (app. no.
> >>> 20090172370)? I mentioned it in my blog a while back (
> >>>http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-th...

Below is a repost of a comment I made at RWT a few weeks back. I had
no replies and am keen to hear what people think of my ideas - whether
they are derivative, someone else's, good or rubbish:

I've been wondering about dynamic predication ever since the first
static implementation in Itanium.

Could you use the existing instruction pointer stack with an
additional branch ID field (like Thread ID) and predicate value, and
additionally tagging each instruction as it passed through the
pipeline with these?

If the processor already uses a Future Register File like the Opteron
already does then it should be an easy matter to reset all renamed
registers with the matching branch ID and logically false, while
completing all those with the matching BID and logically true. You'll
need more rename registers and everything will get bulkier. You'd also
want to limit its usage to simple cases like where the BHT has seen
the branch a lot of times but the outcome in the BTB is less than say
30% certain.

Added: My reasoning for going down the dynamic predication route
rather than DMT as such is less problem with context sharing -
especially with VMs and the extra page table structures now on the
CPU.

Also, since you mention register access becoming a problem I've also
been wondering about using a register "block" use/renaming/retiring
scheme similar to Power5's instruction "blocks" (where up to 5
instructions in a certain mix are build, executed and retired
together). Obviously you'd need a lot more registers, but wiring could
be made a lot simpler - especially renaming could be extremely wide.

Andy "Krazy" Glew

unread,
Nov 24, 2009, 10:19:27 AM11/24/09
to Cameron Jack
Cameron Jack wrote:
> I've been wondering about dynamic predication ever since the first
> static implementation in Itanium.

Go for it!

Some of us have done a lot of work on dynamic predication. (My resume
includes an OOO Itanium, plus I have been working on VLIW and
predication longer than OOO.) But since such work inside companies will
never see the light of day, do not let that hold you back: you are
not so constrained by NDAs and trade secrets.

I worked on the trace cache (back in my MS, before Intel) mainly so that
I could easily see convergence points, and predicate the sides of the
branches. That was naive: you don't need to see the convergence point
when you start predicating.

Perhaps because I have worked on this, I am not so afraid of eager
execution as is Mayan. When renamed registers have multiple copies with
predicate mask sets attached to them, doing the same thing for the store
buffer is not such a step.

My problem with eager execution is not about how to build it, but about
how to make it worthwhile. Sure, you can track branch probabilities,
and then go the other path... But branch predictors are so accurate.

Go for it, Jack. Maybe I should brush off my dynamic predication work
from my MS circa 1990, and/or forget anything I did on Intel and try to
come up with new, better, ways. Trouble is, it is just so painful to
have to forget almost 2 decades' worth of ideas: I almost prefer not to
work on such topics.

Mayan Moudgill

unread,
Nov 24, 2009, 11:55:03 AM11/24/09
to
Andy "Krazy" Glew wrote:

> Perhaps because I have worked on this, I am not so afraid of eager
> execution as is Mayan. When renamed registers have multiple copies with
> predicate mask sets attached to them, doing the same thing for the store
> buffer is not such a step.
>

I don't know if that is correct (BTW: I've worked on this on paper, not
at the VHDL level, though).

Assumption is that all in-flight instructions have an issue order (tag);
this is the sequence number in which that instance of the instruction
would have been executed.

Lets look at the store buffer. Its entries are:
A. invalid
B. store written out to L1
C. committed but not yet written out to L1
D. uncommitted
Case B. is useful for bypassing access to the L1 in some cases. We shall
ignore this case, treating these entries as "free", but its worth
keeping in mind

When a load comes in, it must search the store buffer. Possible answers
from each entry are:
E. missed
F. hit but store is from the future (i.e. was issued later than the load)
G. partial hit from unwritten store
H. complete hit
Case F. can be treated as a miss, but its important to be able to
distinguish between future matches and past matches. Order is also
important when reconciling between multiple G. & H. hits; for each byte,
you have to pick the issued-closest-to-the-load store that wrote to the
byte, and if there is no such store, you have to go to the L1 cache. We
call the case where there is a (partial) hit in the store queue a
(partial) bypass, and the issue order(s) of the bypassing store(s) the
bypass issue order

Additionally, if loads can overtake stores, then there must be a
load-bypass queue in which the load's issue-order, address, and size are
stored, along with the bypassed store issue order(s). [This can be
combined with the load-waiting-for-a-L1-miss queue].

When a new store comes in, an invalid (or free) entry is found, and the
store's address, size, data, and issue order are written to the entry.
[Couple of things: the entry can be allocated earlier, and the
computation of the data and the address can be split].

Additionally, if loads can overtake stores, it must search the load
queue, check to see:
- does it overlap with a load?
- does the load come later in issue-order?
- would it have mattered (i.e. for each overlapping byte, did the byte
get bypassed from a yet-later store)?
If so, corrective action must be taken [generally equivalent to a branch
mispredict]

The store retire logic generally will retire committed stores in issue
order. It may be possible to retire committed stores in other orders,
but things get squirrely w.r.t. memory ordering and overlapping writes.
[BTW: I am ignoring a large number of details w.r.t. memory ordering and
cache coherence protocols]

When a branch mispredict is determined, any store queue entries that are
from the mispredicted path are rolled back; i.e. if an entry's issue
order is after the issue-order of the mispredicted branch, the entry is
marked invalid.

One possible simplification is to allocate store queue entries in issue
order. This allows depth in the store-queue to be a proxy for
issue-order; the closer it is to the head of the queue, the earlier it
was issued. Writing to L1 is always from the head of the queue.
Mispredict rollback moves the tail of the queue to the last known
mispredicted entry. The next free entry to be allocated is at the tail
of the queue. [This can also be done by allocating entries at execution
time and executing stores in-order]

Eager execution complicates the concept of issue-order. It is now
issue-order-on-a-particular-path. The behaviors required of the store
queue remain the same.

However, allocating store queues in issue order is no longer practical;
it would lead to invalid entries between the head and the tail of the
FIFO, and a sparse store queue. This, in turn, would make the store
queue a potential bottleneck (either because of the number of effective
entries if you kept the store queue size the same, or because of the
cycle time if you increased the store queue size).

So, the store queue structure you would need would have to support
out-of-order entries, additional logic to check for order and path,
etc., etc.

I will offer an interesting data-point: On both Power5, and AFAIK,
Core-i7/Nehalem, the store-queue is NOT shared between the two threads;
they have separate queues. The modifications to support eager execution
are similar to, but more complicated than, those required for two threads.
I would suggest that, given store-queue partitioning for 2 threads was
picked over a shared store-queue, the architects found that modifying
the store queue for eager execution would indeed have a cycle time impact.

Brett Davis

unread,
Dec 3, 2009, 2:01:51 AM12/3/09
to
In article <4B01BAD2...@patten-glew.net>,

"Andy \"Krazy\" Glew" <ag-...@patten-glew.net> wrote:

> Andy "Krazy" Glew wrote:
> > Brett Davis wrote:

> >> For SpMT I looked at Alpha style paired integer pipelines ...
> >
> > You haven't said enough about the physical layout to talk about those
> > clustering effects.


>
> The physical layout matters a lot, and hence has its own terminology.
>

> 8-wide on a bit interleaved datapath is pushing the envelope.
>
>
> 8-wide as 2 4 wide bit interleaved datapaths is not so bad, although you
> will pay a lot of area for the wire turns to get to side by side
> datapaths. So I might call this an 8-wide cluster composed of two
> adjacent 4-wide bit-interleaved clusters. With whatever bypassing you
> choose.
>
> As I mentioned earlier, there is a sweet trick you can play with
> datapaths that are paired and opposite, reflections. You bit interleave
> the input and output wires within the, say, 4-wide cluster, and then you
> bit interleave the outputs (but not the inputs). Trouble with this
> trick is that it takes one of the ends of the datapath away - where do
> you put the scheduler? The data cache?
> So, I call this "opposed" clustering.

I am afraid I need a Transistor 101 refresher:

The speed of light in copper, or aluminum, in a modern fab process,
and the resulting mm covered at 3GHz, and translated into the number
of wires you can cross at a right angle, distance wise.

The time a transistor takes to switch, and the resulting number of
transistors you can jump through in 3GHz for a CPU stage.

The number of active inputs you can drive with one transistor's output.
The number of inactive inputs you can drive with one transistor's output.
(Does a hard limit of only two active listeners out of eight help?)


I do not really like the Alpha 4+4 and two cycle delay to cross,
actually I hate it. I can live with 0 cycles to the two closest ALUs, with
a one cycle delay for every second crossed. With 64 bit registers that
means crossing 127 wires. You have like 15 wire layers, so there is a
desire to stack the 2in1out wires to reduce the distance covered by
the bypass wires if you do not interleave. (Ignoring the big economic
incentive to use big previous gen patterning for the upper level wires.)

How many layers are 45nm on a 45nm CPU process, versus 65nm layers,
and 90nm? Most die slices I have seen tend to march up to low res fast.

I am willing to skin this old cat in new ways to make an 8-way design
work. RRAM is going to force many of the upper layers to high res
anyway, so I will be catching a wave that enables high density wire
stacking where it is needed. Lots of breakthroughs and patents to be
had in this area...

Brett

Del Cecchi

unread,
Dec 3, 2009, 4:03:53 PM12/3/09
to

"Brett Davis" <gg...@yahoo.com> wrote in message
news:ggtgp-191C1C....@netnews.asp.att.net...

In CMOS, the transistors don't really have a switching time. Rather
there is delay associated with charging or discharging the gate
(transistor gate, not logic gate) between the on voltage and the off
voltage (usually VDD and Gnd respectively). Since both gate
capacitance and current capability of a transistor are proportional to
width, this leads to the concept of gate limited delay, or the minimum
delay of a gate with no interconnect driving an identical gate, and
driven by an identical gate.

To the gate delay one must add the effects of the interconnect and fan
out.
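
Putting rough numbers on the gate-limited-delay idea (all values
invented; the point is just that fanout multiplies the load capacitance
and hence the delay):

```python
# Delay as charging time: t = C * dV / I, with fanout-of-N
# multiplying the load capacitance by N.

def charge_delay_ps(c_gate_fF, dv_volts, i_drive_mA, fanout=1):
    c = c_gate_fF * 1e-15 * fanout   # total load capacitance (F)
    i = i_drive_mA * 1e-3            # drive current (A)
    return c * dv_volts / i * 1e12   # seconds -> picoseconds

# FO1 (driving one identical gate) vs FO4:
fo1 = charge_delay_ps(2.0, 0.75, 0.5)            # 3 ps
fo4 = charge_delay_ps(2.0, 0.75, 0.5, fanout=4)  # 12 ps
```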

So far as I know, most of the layers in a 45 nm process are 45 nm
although typically the last few metal layers are thicker and coarser
for lower resistance power distribution.

Unless things have changed recently I don't think anybody is doing 15
levels of interconnect.

del


Robert Myers

unread,
Dec 3, 2009, 5:09:03 PM12/3/09
to
On Dec 3, 4:03 pm, "Del Cecchi" <delcecchioftheno...@gmail.com> wrote:

>
> In CMOS, the transistors don't really have a switching time.  Rather
> there is delay associated with charging or discharging the gate
> (transistor gate, not logic gate) between the on voltage and the off
> voltage (usually VDD and Gnd respectively).    Since both gate
> capacitance and current capability of a transistor are proportional to
> width, this leads to the concept of gate limited delay, or the minimum
> delay of a gate with no interconnect driving an identical gate, and
> driven by an identical gate.
>
> To the gate delay one must add the effects of the interconnect and fan
> out.
>
> So far as I know, most of the layers in a 45 nm process are 45 nm
> although typically the last few metal layers are thicker and coarser
> for lower resistance power distribution.
>
> Unless things have changed recently I don't think anybody is doing 15
> levels of interconnect.

Which is it that limits clock speed, though? Irreducible gate delay,
or interconnect delay? Since I understand (after a less-than-polite
exchange here) that interconnect delay *also* depends on gate delay
(because of the use of repeaters on long interconnects), I am prepared
for the answer that there is no simple answer, but I would be
delighted if someone gave it a shot.

Robert.

Del Cecchi

unread,
Dec 3, 2009, 10:06:57 PM12/3/09
to

"Robert Myers" <rbmye...@gmail.com> wrote in message
news:1d2e8d2a-a35c-4d8a...@s20g2000yqd.googlegroups.com...

Robert.

-----------------

In CMOS there really isn't an irreducible gate delay as such. But the
minimum transition times are limited by the inherent capacitance and
the current drive of the transistors. So if you build an inverter
that uses transistors of a size that carry 1 mA with a gate voltage of
1.5 volts (Vdd) and the capacitance of the output node is 50 fF, then
even if you apply an instantaneous voltage to the input it will take
the output 37.5 ps to get to the threshold of Vdd/2.
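
Checking that arithmetic with the constant-current approximation
t = C * dV / I (swing from rail to the Vdd/2 threshold):

```python
# Del's numbers: 50 fF output node, 1.5 V supply, 1 mA drive.
C = 50e-15      # output node capacitance (F)
dV = 1.5 / 2    # swing to the Vdd/2 threshold (V)
I = 1e-3        # drive current (A)
t_ps = C * dV / I * 1e12
# t_ps comes out to 37.5, matching the figure in the post
```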

Any wiring or fan out will increase that time.

Since wires are resistive, the delay of a wire increases faster than
the length of the wire. (distributed RC circuit), so cutting the wire
in pieces and inserting buffers is a win, in general. If the buffers
had zero delay you could use many and reduce the delay of a long wire
to an arbitrary extent. But they don't so you can't.

You can play around with the numbers by using the first approximation
to the FET equations.

Id = constant * W/L * (Vgs - Vt) * Vds       (linear region)
Id = 1/2 * constant * W/L * (Vgs - Vt)**2    (saturation region)

note that the input and output capacitances are approximately
proportional to width of devices and length is determined by process.
(yes, I ignored fringing. too bad)

del

Andy "Krazy" Glew

unread,
Dec 4, 2009, 12:47:22 AM12/4/09
to Robert Myers

Both.

The total delay through any critical loop. A critical loop like the
path from the output of an ALU, through wires and bypass muxes back to
its input, and through the ALU to that same output bit. Or,
equivalently, to the next bit, in a carry save form. Or, longer, for
carry propagate. Or like the path from the input to the AGU (Address
Generation Unit), through the AGU, to the cache decoders, through the
cache data array, and back through alignment and bypass to the input of
the AGU.

You'll hear different answers from different people. E.g. one guy might
say interconnect is 50% of your favorite critical loop, while another
guy says that it is only 10%. When you look closely, you will see that
the first guy considered all of the wires between transistors as
interconnect, while the second guy only considers wires between cells,
e.g. gates, to be interconnect, but thinks that wires within gates are
not interconnect.

With the former definition, 50% was not unreasonable. You don't go too
far wrong thinking of interconnect as 100%, or, rather,
Tcrit = max(interconnect delay, charging time)
and totally ignore switching time. That is, in fact, the simplistic
timing model that I use to do an initial evaluation of all of my designs.
Or, at least, that I used, back in the P4/Wmt/Fireball era.
After that, learn about logical effort.
After that, model it.
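
For the logical-effort step, the one-line model is d = g*h + p per
stage, in units of the basic inverter delay tau. A sketch with
textbook gate values (the loads are invented for illustration):

```python
# Logical effort per stage: g = logical effort (gate topology),
# h = electrical effort (Cout/Cin), p = parasitic delay.

def stage_delay(g, h, p):
    return g * h + p

# Inverter driving 4x its input cap (the classic FO4 delay):
fo4 = stage_delay(g=1.0, h=4.0, p=1.0)   # 5.0 tau

# 2-input NAND (g = 4/3) driving the same 4x load:
nand = stage_delay(g=4/3, h=4.0, p=2.0)  # ~7.3 tau
```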