
AMD Bulldozer released... and it sucks


Dan

Oct 12, 2011, 11:59:43 PM

So, AMD Bulldozer has finally been released after years of delays,
and, quite frankly, based on all the reviews (and the fact that the
top-of-the-line model has an MSRP of $245), it isn't very good.

http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested
http://www.planet3dnow.de/vbulletin/showthread.php?t=399114
http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=102&Itemid=1
http://www.hardocp.com/article/2011/10/11/amd_bulldozer_fx8150_desktop_performance_review
http://www.techspot.com/review/452-amd-bulldozer-fx-cpus/page13.html
http://hothardware.com/Reviews/AMD-FX8150-8Core-Processor-Review-Bulldozer-Has-Landed/
http://vr-zone.com/articles/amd-fx-8150-cpu-overclocking-review-a-bulldozer-for-gamers-/13694.html

The question is, of course, why? I have a few theories:

1. The cache hierarchy is not good

Each module has a 64KB 2-way set-associative I-cache, two 16KB 4-way
set-associative write-through D-caches, and 2MB of L2. A 16KB L1 is a
significant step down from the previous 64KB L1D cache in K10. Not only
that, but the L2 cache has a horrible 21-27 cycle access latency. I'm
very interested in what set of applications this hierarchy is
optimized for, since it's very different from anything else on the
market. Given the current state of BD performance, could it be that
the simulations that led to their cache hierarchy design were
inaccurate?
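
For a rough sense of how much that combination hurts, here's a
back-of-the-envelope average memory access time (AMAT) comparison.
Only the ~24-cycle L2 figure comes from the reviews; the L1 latency,
DRAM latency, and both miss rates are numbers I made up purely for
illustration:

#include <stdio.h>

int main(void)
{
    /* Latencies in cycles; only the ~24-cycle L2 is from the reviews. */
    double l1_lat = 4.0, l2_lat = 24.0, mem_lat = 200.0;   /* assumed */
    double l2_miss = 0.02;                /* assumed L2 local miss rate */
    double miss_16k = 0.08, miss_64k = 0.04; /* assumed L1D miss rates */

    /* AMAT = L1 hit time + L1 miss rate * (L2 time + L2 miss * DRAM) */
    double amat_bd  = l1_lat + miss_16k * (l2_lat + l2_miss * mem_lat);
    double amat_k10 = l1_lat + miss_64k * (l2_lat + l2_miss * mem_lat);

    printf("16KB L1 (BD-like):  AMAT = %.2f cycles\n", amat_bd);
    printf("64KB L1 (K10-like): AMAT = %.2f cycles\n", amat_k10);
    return 0;
}

With these invented rates the 16KB configuration pays roughly an extra
cycle per access over the 64KB one, which has to be made up elsewhere.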

2. Switching from full custom to automatic synthesis caused them to
miss clock frequency goals

Current rumors state that AMD switched from their usual full custom
chip design to a more SoC-like design process (which apparently caused
a mass exodus of ex-DEC engineers). Doing so reportedly cost them 20% in
power/perf/area. I was under the impression, though, that the latest
shrinks of K10 were also synthesized instead of full custom, but what
do I know. The BD die shots don't really look like the chip was fully
synthesized (unlike Bobcat, which is obviously fully synthesized), so
I dunno.

3. The Global Foundries 32nm process sucks

This still doesn't explain why a 2-billion-transistor design is
worse than the 1-billion-transistor Sandy Bridge.

Any comments? I know Andy Glew used to work on it, and now that
Bulldozer is finally out, would you be able to comment about it?

Paul A. Clayton

Oct 13, 2011, 2:49:10 AM

On Oct 12, 10:59 pm, Dan <dzha...@gmail.com> wrote:
[snip]

> The question is, of course, why? I have a few theories:
>
> 1. The cache hierarchy is not good

This seems likely to be a significant contributor. While a small
Dcache is reasonable for a short-cycle-time microarchitecture,
writeback to a shared L2 might cost excessive bandwidth
(perhaps especially if being 'mostly exclusive' means that
most clean L1 evictions will be written back to L2), and the
large size would seem to mandate high latency (excluding
NUCA optimizations). As at least one Real World Tech forum
poster noted, AMD might have been expecting smart
prefetching to hide most of the extra latency. It was also
speculated there that the large L2 might have been intended
to facilitate production of a single-module variant.
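
To make the prefetching speculation concrete, here is a minimal
software sketch of the idea (hardware would do this automatically;
the lookahead distance of 8 elements is a guess, not a tuned value,
and __builtin_prefetch is the GCC intrinsic):

#include <stddef.h>

#define PF_AHEAD 8  /* guessed lookahead; would need per-machine tuning */

double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Request a[i + PF_AHEAD] now so it is (hopefully) in the L1
           by the time the demand load reaches it, hiding the ~24-cycle
           L2 trip behind useful work. */
        if (i + PF_AHEAD < n)
            __builtin_prefetch(&a[i + PF_AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}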

> 2. Switching from full custom to automatic synthesis caused them to
> miss clock frequency goals
>
> Current rumors state that AMD switched from their usual full custom
> chip design to a more SoC-like design process

The previous design was almost certainly not full custom but
full custom only on critical paths. (It is a bit sad that, even with
unlimited processing time, automated methods would have
a 20% performance and 20% power penalty as well as
reliability issues, as stated by Cliff Maier.)

A 20% double whammy would certainly account for a lot
of lost performance.

[snip]


> 3. The Global Foundries 32nm process sucks
>
> This still doesn't explain why a 2-billion-transistor design is
> worse than the 1-billion-transistor Sandy Bridge.
>
> Any comments? I know Andy Glew used to work on it, and now that
> Bulldozer is finally out, would you be able to comment about it?

Based on his 15 Nov 2009 posting to comp.arch
(Message-ID: <4AFFA499...@patten-glew.net>),
Andy Glew was only involved in the early microarchitecture
considerations.

Terje Mathisen

Oct 13, 2011, 3:06:39 AM

Dan wrote:
> 1. The cache hierarchy is not good
>
> Each module has a 64KB 2-way set-associative I-cache, two 16KB 4-way
> set-associative write-through D-caches, and 2MB of L2. A 16KB L1 is a
> significant step down from the previous 64KB L1D cache in K10. Not only
> that, but the L2 cache has a horrible 21-27 cycle access latency. I'm

16K L1 + ~24 cycles to L2???

That is simply broken; I can't think of a single program that would
have that as a "sweet spot". :-(
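
The classic pointer-chase microbenchmark makes that lack of a sweet
spot visible: grow the working set past 16K and every dependent load
should jump toward the quoted L2 latency. This is a simplified sketch
(coarse clock() timing, results in ns rather than cycles, biased
rand()), but the knee in the curve is the point:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    for (size_t ws = 4096; ws <= (size_t)4 << 20; ws <<= 1) {
        size_t n = ws / sizeof(void *);
        void **ring = malloc(n * sizeof *ring);

        /* Sattolo's algorithm: a single random cycle through all n
           slots, so every load depends on the previous one and the
           hardware prefetcher cannot help. */
        for (size_t i = 0; i < n; i++)
            ring[i] = &ring[i];
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            void *t = ring[i]; ring[i] = ring[j]; ring[j] = t;
        }

        long iters = 10 * 1000 * 1000;
        void *p = ring[0];
        clock_t t0 = clock();
        for (long k = 0; k < iters; k++)
            p = *(void **)p;            /* dependent load chain */
        double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / iters;

        printf("%8zu B: %5.1f ns/load (%p)\n", ws, ns, p);
        free(ring);
    }
    return 0;
}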

> very interested in what set of applications this hierarchy is
> optimized for, since it's very different from anything else on the
> market. Given the current state of BD performance, could it be that
> the simulations that led to their cache hierarchy design were
> inaccurate?
>
> 2. Switching from full custom to automatic synthesis caused them to
> miss clock frequency goals
>
> Current rumors state that AMD switched from their usual full custom
> chip design to a more SoC-like design process (which apparently caused
> a mass exodus of ex-DEC engineers). Doing so reportedly cost them 20% in
> power/perf/area. I was under the impression, though, that the latest
> shrinks of K10 were also synthesized instead of full custom, but what
> do I know. The BD die shots don't really look like the chip was fully
> synthesized (unlike Bobcat, which is obviously fully synthesized), so
> I dunno.
>
> 3. The Global Foundries 32nm process sucks
>
> This still doesn't explain why a 2-billion-transistor design is
> worse than the 1-billion-transistor Sandy Bridge.

Bulldozer will have to compete against Ivy Bridge as well, and that
bridge is significantly better than the Sandy one...

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Oct 13, 2011, 1:54:43 PM

On Wednesday, October 12, 2011 10:59:43 PM UTC-5, Dan wrote:
> 2. Switching from full custom to automatic synthesis caused them to
> miss clock frequency goals
>
> Current rumors state that AMD switched from their usual full custom
> chip design to a more SoC-like design process (which apparently caused
> a mass exodus of ex-DEC engineers).

AMD never used full custom (at least from 1999 when I got there to 2007
when I left). AMD used a human-place, auto-route scheme that got the
best of both the full custom and silicon compilation design styles.
{Some major large blocks (SRAMs, multiplier trees, adders, register
files) were essentially full custom.}

Mitch

Andy "Krazy" Glew

Oct 13, 2011, 2:05:22 PM

I worked at AMD 2002-2004. I do not know how or in what ways what I
worked on back then relates to Bulldozer. I highly suspect that if what
I worked on back then *is* substantially related to Bulldozer, you
have your answer: if a design that is 7+ years old is only just now
being introduced, well, that's an awfully long time.

"Now that Bulldozer is finally out, would you be able to comment about
it?" -- No. It doesn't work that way.

===

What I do know is that Bulldozer's so-called "multi-core" design
resembles the "multi-cluster multi-threading" (MCMT) ideas that I came
up with at UWisc, in reaction to Willamette-style multithreading.
Willamette had two threads sharing a tiny L0 data cache - 8K in what
shipped, 16K in later versions, smaller in some proposals.

IMHO having two threads share a 4K or 8K cache for a prolonged period of
time is stupid. So the basic idea behind my multicluster multithreading
is to duplicate the cache, so that each thread has a private cache.
Share the next level out. Now, you can't just duplicate or split the
cache without paying a penalty. So you duplicate the closely coupled
ALUs, and then possibly move out to the scheduler and renamer -
basically putting the critical loop inside the duplicated "cluster".
(Note that you don't necessarily need to duplicate the scheduler, since
it isn't inside the most critical loop; but duplicating the scheduler is
attractive because a fast scheduler grows more difficult with size.)

So, that's the basic idea of my flavor of multicluster multithreading:
duplicate the innermost cache, plus whatever is necessary for the
critical loop.
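
As a crude illustration of that partitioning (all names here are
invented for the sketch, not anything from a real design database):

/* Duplicated per cluster: the critical loop. */
typedef struct { unsigned size_kb, ways; } Cache;
typedef struct {
    Cache     l0_dcache;   /* private tiny D-cache, one per thread */
    unsigned  num_alus;    /* the closely coupled ALUs */
    unsigned  sched_slots; /* optionally duplicated scheduler */
} Cluster;

/* Shared per module: everything outside the critical loop. */
typedef struct {
    Cache   icache;        /* shared instruction cache */
    Cache   l2;            /* the shared next level out */
    Cluster cluster[2];    /* one cluster per thread */
} Module;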

PRO: two threads are not sharing a really tiny cache

CON: when you are only using a single thread, you can't use the extra
space in the other L0$.

I mean, one of the main justifications for multithreading, as opposed to
multiple single threaded cores, is that you can build a larger machine
that sometimes can be used by a single thread. Multiclustering's hard
partitioning makes that more difficult - you can only get the extra
benefit for a single threaded workload in the shared L2 data cache and
shared instruction cache. (And in other shared structures, such as the
shared L2 scheduler I have written about.)

===

I'll stop here.

Mentioning only briefly that one of the big reasons I proposed
multicluster multithreading was to support SpMT, speculative
multithreading. AMD does not have SpMT, AFAIK.

While I think MCMT is one of the best ways to support a multithreaded
workload with a small L0 or L1 cache, I don't know that I would have
built it for a non-SpMT machine.

The "speed demon" motivation for MCMT, threading with a very small L0 or
L1 cache, went away with the right hand turn away from Wmt or K9-style
high frequency design. When a chip like Sandybridge can build a 32K L1
cache without impacting frequency too much - indeed, when design trends
want you to build a lower clock frequency, to improve yields and power -
it is not clear that MCMT has an advantage over SMT with a shared L1 cache.
I.e. I think MCMT is clearly justified with a 4K or 8K L0 cache.
But probably not justified with a 32K L1 cache. 16K? I am sure that
AMD must have simulated it.

Paul A. Clayton

Oct 14, 2011, 1:45:26 PM

On Oct 13, 1:49 am, "Paul A. Clayton" <paaronclay...@gmail.com> wrote:
[snip]

> full custom only on critical paths. (It is a bit sad that, even with
> unlimited processing time, automated methods would have
> a 20% performance and 20% power penalty as well as
> reliability issues, as stated by Cliff Maier.)
>
> A 20% double whammy would certainly account for a lot
> of lost performance.

Oops! It was 20% performance and 20% *area*. That is
still a painful 'double whammy'.
