
Where is Bulldozer's renamer?


Andy Glew

Sep 7, 2010, 11:04:26 PM
By now, most of you will be familiar with Bulldozer and/or the
multicluster multithreading concept:

Shared front end
* branch prediction
* instruction cache
* decode

Separate clusters of the tight loop (AMD calls these cores)
* scheduler
* execution units
* L1 data cache

Shared
* L2 cache
* and in Bulldozer's case, floating point.
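For concreteness, here is a minimal sketch of that sharing structure. It
is purely illustrative - the names (Op, Cluster, FPUnit) are invented
here, not AMD's: a shared front end decodes for both hardware threads and
hands each op to its thread's private cluster, with FP ops going to the
single shared FP unit.

from collections import deque
from dataclasses import dataclass

@dataclass
class Op:
    thread: int          # which hardware thread fetched/decoded this op
    is_fp: bool          # FP/SIMD ops go to the shared FP unit

class Cluster:           # private scheduler + execution units + L1 D$
    def __init__(self):
        self.queue = deque()

    def accept(self, op):
        self.queue.append(op)

class FPUnit(Cluster):   # shared by both clusters in this sketch
    pass

clusters = [Cluster(), Cluster()]   # AMD's "cores"
fp_unit = FPUnit()

def dispatch(decoded_ops):
    # The shared front end steers each op to its thread's cluster;
    # FP ops from either thread land in the one shared FP unit.
    for op in decoded_ops:
        (fp_unit if op.is_fp else clusters[op.thread]).accept(op)

dispatch([Op(thread=0, is_fp=False), Op(thread=1, is_fp=True)])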

By the way: I called this design (with the shared FP optional)
multicluster multithreading, MCMT, back in Wisconsin in 1996-2000.
My term is not appropriate for AMD, since they call the Scheduler/Exec/L1$
a core, and the whole thing a module or cluster. I.e. the terms cluster
and core are swapped. At AMD's 2005 Analyst Day, Chuck Moore had a
slide that mentioned something he called "Cluster Based Multithreading".

... Anyway ...

What I have not seen mentioned is the position of the renamer.

This is largely inspired by The Inquirer's wondering about reverse
multithreading.


If the SX$ cores/clusters are truly independent, then it is probably
natural to put the renamer inside the cores/clusters, giving:
Renamer -> Scheduler -> Execute -> Cache

Sharing the renamer would only mean a bigger lookup array, and hence a
slower one.

However, as I just mentioned in a different post, the renamer consists
of several parts: lookup array, bypass comparators, allocator.

The bypass comparators could be shared, although probably at the cost of
a pipestage.
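To make those three pieces concrete, here is a minimal sketch (register
counts, the flat physical register file, and the group-at-a-time update
are assumptions for illustration, not any particular machine's renamer):

NUM_LOGICAL = 16
NUM_PHYSICAL = 64

rat = list(range(NUM_LOGICAL))                      # lookup array (RAT)
free_list = list(range(NUM_LOGICAL, NUM_PHYSICAL))  # allocator state

def rename_group(group):
    # group: list of (dest, src1, src2) logical register numbers,
    # renamed together in one cycle.  Freeing on retirement is omitted.
    renamed = []
    group_writes = {}                  # what the bypass comparators catch
    for dest, src1, src2 in group:
        # Bypass comparators: a producer earlier in the same group
        # overrides whatever the lookup array says.
        p1 = group_writes.get(src1, rat[src1])
        p2 = group_writes.get(src2, rat[src2])
        # Allocator: hand out a fresh physical register for the dest.
        pd = free_list.pop(0)
        group_writes[dest] = pd
        renamed.append((pd, p1, p2))
    # Lookup array is updated once per group in this sketch.
    for dest, pd in group_writes.items():
        rat[dest] = pd
    return renamed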

As for why this post is related to "reverse multithreading" - which
sounds like something near and dear to my heart, using MCMT and/or
threads to speed up a single logical thread of computation, whether by
SpMT or batching ... :

When I am trying to do SpMT, I tend to have the cluster scheduler be an
S1, with a large S2 scheduler shared between clusters. Similarly for
instruction window. This allows a big OOO thread to get most of the
resources of the machine, with a small thread running on the other
cluster - i.e. it justifies building a bigger OOO machine than you might
otherwise have done.

Also, I use this shared scheduler and instruction window to make
thread migration between clusters easier, cheaper, and less intrusive
on a non-speculative thread.
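As a minimal sketch of that S1/S2 arrangement (sizes and names invented;
not a description of any shipping design): a large shared S2 pool feeds
small per-cluster S1 schedulers, so one big OOO thread can soak up most
of the shared entries while a small thread keeps the other cluster's S1
to itself.

S2_ENTRIES = 64      # shared, large second-level scheduler
S1_ENTRIES = 16      # small, fast per-cluster first-level scheduler

s2 = []                                  # shared S2: holds (cluster_id, op)
s1 = [[], []]                            # private S1 per cluster

def insert(op, cluster_id):
    # New ops enter the shared S2 if it has room, tagged with their cluster.
    if len(s2) < S2_ENTRIES:
        s2.append((cluster_id, op))
        return True
    return False                         # S2 full: stall rename/dispatch

def promote_to_s1():
    # Each cycle, drain ops from the shared S2 into their cluster's S1
    # while that S1 has room.  Nothing stops one thread from owning
    # nearly all of the shared S2 entries.
    remaining = []
    for cluster_id, op in s2:
        if len(s1[cluster_id]) < S1_ENTRIES:
            s1[cluster_id].append(op)
        else:
            remaining.append((cluster_id, op))
    s2[:] = remaining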


When I am doing instruction batching, I might use an S2 scheduler. But
I can get away without the S2 scheduler. However, a shared renamer is
convenient wrt tracking which cluster contains the master copy of a
logical register.

(Hey, here's something new: a two-level renamer, the outer shared level
just indicating which cluster a logical register lives in, the inner
cluster-private level doing the usual rename. With 1-bit cluster IDs you
could probably do this affordably, in parallel with the shared register
in/out comparison or bypassing circuitry for renaming.)
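A sketch of that two-level renamer idea (purely illustrative - the table
sizes, the one-source op shape, and the "remote" marker are invented):
the shared outer level is one bit per logical register naming the owning
cluster; each cluster keeps its ordinary private map.

NUM_LOGICAL = 16

owner = [0] * NUM_LOGICAL       # shared outer level: 1 bit per logical reg
private_rat = [list(range(NUM_LOGICAL)) for _ in range(2)]  # inner, per cluster
free_list = [list(range(NUM_LOGICAL, 48)) for _ in range(2)]

def rename(cluster, dest, src):
    # Rename one op running on `cluster` with one source and one dest.
    if owner[src] != cluster:
        # Master copy lives in the other cluster: a real design would
        # generate a cross-cluster copy or forward here.
        src_phys = ("remote", owner[src], private_rat[owner[src]][src])
    else:
        src_phys = ("local", cluster, private_rat[cluster][src])
    # Inner, cluster-private rename of the destination.
    pd = free_list[cluster].pop(0)
    private_rat[cluster][dest] = pd
    owner[dest] = cluster           # outer level: dest's master copy moves here
    return src_phys, pd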


So, here's why the question "Where is Bulldozer's renamer?" is interesting.

If the renamer is shared between clusters/cores, then Bulldozer may be
moving towards using both clusters/cores to speed up a single thread -
The Inquirer's rumored reverse multithreading.

If the renamer is cluster/core private, then Bulldozer is probably not
moving that direction.

David Kanter

Sep 8, 2010, 3:53:55 AM

If I might toot my own horn here:
http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=6

Retirement tracking is handled per core. Renaming for integer and
memory ops is done in each core. Renaming for FP/SIMD is done within
the shared FP unit.
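Restating that split schematically (the structures and names below are
invented for illustration; only the placement - per-core integer/memory
rename and retirement, shared FP/SIMD rename - comes from the article):

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.int_rat = {}       # per-core integer/memory rename map
        self.rob = []           # per-core retirement tracking

class SharedFP:
    def __init__(self):
        self.fp_rat = {}        # one FP/SIMD rename map serving both cores

cores = [Core(0), Core(1)]
shared_fp = SharedFP()

def rename(core_id, op_id, dest_reg, is_fp_simd):
    # FP/SIMD dests rename in the shared FP unit; everything else renames
    # in the issuing core.  Retirement is tracked in the owning core either way.
    rat = shared_fp.fp_rat if is_fp_simd else cores[core_id].int_rat
    rat[dest_reg] = op_id
    cores[core_id].rob.append(op_id)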

David

Andy Glew

Sep 8, 2010, 9:28:01 AM
On 9/8/2010 12:53 AM, David Kanter wrote:
> On Sep 7, 8:04 pm, Andy Glew<"newsgroup at comp-arch.net"> wrote:
>> By now, most of you will be familiar with Bulldozer and/or the
>> multicluster multithreading concept:
>>...
>> What I have not seen mentioned is the position of the renamer.

> If I might toot my own horn here:


> http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=6
>
> Retirement tracking is handled per core. Renaming for integer and
> memory ops is done in each core. Renaming for FP/SIMD is done within
> the shared FP unit.
>
> David

Thanks, David (or should I say, thanks RWT.com). You're allowed to toot
your own horn, in moderation. RWT.com is one of the best tech sites.

By the way, I encourage you to clip quotes, rather than including all of
my very long posts.

Q: where did you get this information about the renamer position? I did
not see it in any of the Hot Chips slides. I guess you press people get
different slide sets than we of the broader public do, and/or have the
benefit of asking questions in interviews.

Anyway... if there is per-cluster/core renaming (grr, it is so much
simpler to say cluster - as in, there is a shared FP cluster, and two
thread-private integer/cache clusters), then I think it is unlikely
that AMD is trying to move towards "reverse multithreading" the way the
Inquirer suggested.

---

I would have been chagrined if they were.

Matt

Oct 25, 2010, 5:20:34 PM
I'm currently looking into a possible design element of Bulldozer.
There are papers about a "Flywheel" architecture, describing how
differently clocked parts of the pipeline play together. This also
involves two stages of renaming. Further, there was an "accelerated
mode" mentioned in GCC postings about the BD dispatch group creation.
One of the people involved in "Flywheel" is Emil Talpes, at AMD since
2005. And I'm also looking for a reason why so many instructions have
latencies that are a multiple of 2.

Matt / citavia.blog.de

Brett Davis

Oct 26, 2010, 3:09:40 AM
In article
<3f17c21f-069f-42db...@t13g2000yqm.googlegroups.com>,
Matt <Alphar...@gmx.de> wrote:

An IBM POWER question, something about 6 GHz... ;)

Also gives you access to IBM's compiler tech, minus the back end.
Or LLVM and Clang?

Brett

Andy "Krazy" Glew

Oct 26, 2010, 11:25:48 AM
On 10/25/2010 2:20 PM, Matt wrote:
> I'm currently looking into a possible design element of Bulldozer.
> There are papers about a "Flywheel" architecture, describing how
> differently clocked parts of the pipeline play together.

While there are undoubtedly novel aspects in Flywheel,
isn't "differently clocked parts of the pipeline"
what Willamette was doing with its slow/medium/fast clocks?
(medium being instruction fetch, fast being scheduler and integer ALU).

Note: when I say "Willamette" I do not mean "obviously a bad idea
because Pentium 4 failed in the marketplace". There were lots of good
ideas in Willamette, including replay pipelines and multiple clock
domains. Although the replay pipelines had stability problems, those
can be fixed; and I am not aware of any problems with the clock domains,
unless you blame those for the excessive power consumption. Maybe the
"fireball" went too far along the path of hand-tweaked circuits.

Matt

Oct 26, 2010, 12:57:30 PM
On 26 Oct., 17:25, "Andy \"Krazy\" Glew" <a...@SPAM.comp-arch.net>
wrote:

> On 10/25/2010 2:20 PM, Matt wrote:
>
> > I'm currently looking into a possible design element of Bulldozer.
> > There are papers about a "Flywheel" architecture, describing how
> > differently clocked parts of the pipeline play together.
>
> While there are undoubtedly novel aspects in Flywheel,
> isn't "differently clocked parts of the pipeline"
> what Willamette was doing with its slow/medium/fast clocks?
> (medium being instruction fetch, fast being scheduler and integer ALU).

Sure, it is. The interesting thing about Flywheel is the author. Here
I'm just trying to find links between research and actual
implementations. And then there were the two renamers, with the second
one being a kind of renaming updater.

> Note: when I say "Willamette" I do not mean "obviously a bad idea
> because Pentium 4 failed in the marketplace".  There were lots of good
> ideas in Willamette, including replay pipelines and multiple clock
> domains.  Although the replay pipelines had stability problems, those
> can be fixed; and I am not aware of any problems with the clock domains,
> unless you blame those for the excessive power consumption.  Maybe the
> "fireball" went too far along the path of hand-tweaked circuits.

I know several people not so familiar with uarchs who are scared away
by reading "high frequency" or "speed demon" and "Bulldozer" on the
same page of text, because of their experiences with Netburst. What
could make a difference today is that for higher-clocked simple ALUs
(e.g. 2x or even 1.5x the base clock) there is no need to use fast but
power-inefficient logic styles like domino.

So far a lot of research points towards lower FO4 designs. Even Chuck
Moore talked about low FO4 designs in some of his talks.
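Back-of-envelope, to show what lower FO4 per cycle buys (the 10 ps FO4
delay below is just an assumed round number, not a figure for any
particular AMD process):

FO4_DELAY_PS = 10.0     # assumed inverter FO4 delay for the process

def freq_ghz(fo4_per_cycle):
    # Clock frequency if each pipeline stage fits fo4_per_cycle FO4 delays.
    cycle_ps = fo4_per_cycle * FO4_DELAY_PS
    return 1000.0 / cycle_ps

for fo4 in (24, 17, 12):   # a relaxed design, a low-FO4 one, an aggressive one
    print(f"{fo4:>2} FO4/cycle -> {freq_ghz(fo4):.1f} GHz")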

Matt

nedbrek

Oct 26, 2010, 9:19:56 PM
Hello all,

"Andy "Krazy" Glew" <an...@SPAM.comp-arch.net> wrote in message
news:bIidnWEvbr5jb1vR...@giganews.com...


>
> Note: when I say "Willamette" I do not mean "obviously a bad idea because
> Pentium 4 failed in the marketplace". There were lots of good ideas in
> Willamette, including replay pipelines

Problems with replay:
1) Power - you take instructions from dispatch to execute two or three (or
more!) times
2) Complexity - replay becomes a crutch (can't make TLB timing, make it a
replay condition, etc.)

Why do we (fundamentally) need replay?
- need to schedule load dependents well in advance of load hit
  determination
  -> speculative issue for load dependents (not necessarily bad)

- instructions scheduled based on a load hit must be rescheduled on a
  load miss
  -> two options (sketched below):
  1) keep instructions in the scheduler until known good (no replay, P6 style)
  2) release on issue (which requires reinsertion, aka replay)
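A toy sketch contrasting the two options (helper names invented; this is
not P6's or P4's actual machinery): dependents of a load are either held
in the scheduler until hit/miss is known, or released speculatively and
reinserted on a miss.

def keep_until_known_good(dependents, load_hit):
    # Option 1: dependents stay in the scheduler until the load resolves,
    # so nothing issues on a miss and nothing needs to be reinserted.
    issued = list(dependents) if load_hit else []
    still_in_scheduler = [] if load_hit else list(dependents)
    return issued, still_in_scheduler

def release_on_issue(dependents, load_hit):
    # Option 2: dependents issue speculatively assuming a hit; on a miss
    # they all go back around the replay loop, paying dispatch->execute
    # power a second time.
    issued = list(dependents)
    replay_queue = [] if load_hit else list(issued)
    return issued, replay_queue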

P4 had puny (distributed) schedulers and a long schedule->known good
latency. Sandy Bridge has an enormous (unified) scheduler, hopefully they
decided against replay (details might turn up in the Intel journals).

Ned

