SMT exploiting 21264-like clustering?


Paul A. Clayton

May 27, 2010, 6:10:16 PM
Another obvious (possibly half-way decent) idea: Use the duplicated
register file of a clustered processor design like the Alpha 21264 to
hold distinct contexts.

Such static partitioning might not usually be advisable with two
simultaneous threads, but with four (reasonably active) threads it
might be a net gain in many cases. To provide slightly better support
for bursts of ILP, the inter-cluster forwarding could write to
register caches rather than to the other register file, and these
cached values could be used when issuing instructions from the other
cluster. The extra write ports in each cluster could then be used to
support two-result operations if desired.
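The partitioning idea above can be made concrete with a small sketch. This is purely illustrative; all names and sizes (e.g. an 80-entry file, 31 architectural registers) are invented assumptions, not details of the 21264 or any real design:

```python
# Hypothetical sketch: statically partitioning the two duplicated
# register files of a 21264-style two-cluster core among four SMT
# contexts, two contexts pinned to each cluster's file.

NUM_CLUSTERS = 2
CONTEXTS_PER_CLUSTER = 2
REGS_PER_FILE = 80      # assumed physical register file size per cluster
ARCH_REGS = 31          # assumed architectural registers per context

def home_cluster(context_id):
    """Each hardware context is pinned to one cluster's register file."""
    return context_id // CONTEXTS_PER_CLUSTER

def free_rename_regs(context_id):
    """Physical registers left for renaming after the static split:
    each cluster's file is divided evenly among its resident contexts."""
    share = REGS_PER_FILE // CONTEXTS_PER_CLUSTER
    return share - ARCH_REGS

for ctx in range(NUM_CLUSTERS * CONTEXTS_PER_CLUSTER):
    print(ctx, home_cluster(ctx), free_rename_regs(ctx))
```

The point the numbers make is the trade-off in the text: with four resident contexts the per-context rename pool shrinks, which is why the scheme looks better when all four threads are reasonably active than when one thread wants deep speculation.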

(A two-issue-per-cluster processor might share a multiplier/divider
[possibly replicating enough of a multiplier to support independent 16-
bit by 64-bit multiplications??]. At three issues per cluster,
distinct multipliers might make sense.)

(Static partitioning of two threads might make sense when ILP is
relatively low and a single thread gains little from the full issue
width, when the extra registers could be used to support deeper
speculation, or under other circumstances.)

(Obviously, one could also use such register-duplicating clustering to
support SIMD-like operations.)


Paul A. Clayton
just a technophile

Andy 'Krazy' Glew

May 28, 2010, 2:13:19 AM
On 5/27/2010 3:10 PM, Paul A. Clayton wrote:
> Another obvious (possibly half-way decent) idea: Use the duplicated
> register file of a clustered processor design like the Alpha 21264 to
> hold distinct contexts.

Looks like you have found another way of arriving at, another evolutionary path to,

a) AMD's MCMT (Multicluster Multithreading) as in Bulldozer

b) my MultiStar.

I arrived at it from a different path: (a) thinking that most multicluster uarches for single threads were not very
successful, (b) using multicluster for separate threads, and (c) then trying to go back and use the MCMT to speed up a
single thread.

I.e. you

MCST (multicluster singlethread) -> MCMT

me

MCMT -> MCST ?


I wonder what things work out differently when you think this way?


I never liked the inter-cluster bypass of the 21264. Complete bypass networks are expensive; incomplete are a glass
jaw. But, heck, even un-clustered machines now have incomplete bypass networks.

Paul A. Clayton

May 29, 2010, 10:19:07 PM
On May 28, 2:13 am, Andy 'Krazy' Glew <ag-n...@patten-glew.net> wrote:
[snip]

> I.e. you
>
> MCST (multicluster singlethread) -> MCMT
>
> me
>
> MCMT -> MCST ?
>
> I wonder what things work out differently when you think this way?

Well, one of my habits of thinking seems to be to exploit existing
features for alternate uses (e.g., huge page TLB entries holding
PDEs). (This is probably part of the reason I find SMT appealing--
existing [or extreme] ILP core -> choice of single thread
performance or moderately great multithread throughput.)

> I never liked the inter-cluster bypass of the 21264.  Complete bypass networks are expensive; incomplete are a glass
> jaw.  But, heck, even un-clustered machines now have incomplete bypass networks.

I kind of dislike complete bypass because it seems wasteful. (I
would irrationally dislike it even if it were cheap.) Other than
squaring, when is a result used by both inputs of a functional
unit? (Intelligent forwarding would seem desirable, but such
could add excessive delay [aside from area/power costs].)

BTW, could a staggered ALU be used to ease the delay
problem of scheduling/forwarding? If one 'cluster' of
ALUs was staggered a half-cycle relative to the other with
the less significant bits forwarded as soon as available,
could one see some benefit? (I like the Pentium 4
staggered ALU concept. I do wonder if it might be useful
for a low-power design--i.e., addition takes two cycles
to fully complete [less logic activity] but has single cycle
forwarding. [I suspect the ideas in the Pentium 4 are
now tainted with the relative failure of the Pentium 4.])

Andy 'Krazy' Glew

May 30, 2010, 11:18:05 AM
On 5/29/2010 7:19 PM, Paul A. Clayton wrote:

> BTW, could a staggered ALU be used to ease the delay
> problem of scheduling/forwarding? If one 'cluster' of
> ALUs was staggered a half-cycle relative to the other with
> the less significant bits forwarded as soon as available,
> could one see some benefit? (I like the Pentium 4
> staggered ALU concept. I do wonder if it might be useful
> for a low-power design--i.e., addition takes two cycles
> to fully complete [less logic activity] but has single cycle
> forwarding. [I suspect the ideas in the Pentium 4 are
> now tainted with the relative failure of the Pentium 4.])

I don't think that Pentium 4 had what you think of as a staggered ALU.

When I think of staggered ALU, I think of two ALUs, with the second ALU receiving inputs from the first, and possibly
from the generic register file. I.e. something that allows you to execute A+B->C; C+D->E in one clock cycle.
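That dependent-pair idea can be sketched in a couple of lines. This is a behavioral model only, with invented names; it shows the dataflow (second ALU fed directly from the first within the same cycle), not any real implementation:

```python
# Minimal sketch of cascaded ALUs: the second ALU consumes the first
# ALU's result within the same clock cycle, so the dependent pair
# A+B->C; C+D->E completes in one cycle rather than two.

def cascaded_alus(a, b, d):
    c = (a + b) & 0xFFFFFFFF   # first ALU: A + B -> C
    e = (c + d) & 0xFFFFFFFF   # second ALU, fed directly from the first
    return c, e

c, e = cascaded_alus(1, 2, 4)
print(c, e)  # 3 7
```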

Pentium 4 actually just ran the ALUs - and the associated support logic, like the scheduler - at 2X the published
frequency of the core. I.e. if the core was publicly 2.5GHz, the "fireball" was actually running at 5GHz.

The original Pentium 4 ALUs were staggered in that they computed the low 16 bits in one of these fast clock cycles and
the high 16 bits in the next - allowing back-to-back dependent adds. But that is not the widespread definition of a "staggered" ALU.
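The width-pipelined add described here is easy to model. A hedged sketch, with invented function names, of a 32-bit add split into two half-width steps, where the low half (and its carry-out) is available for forwarding a fast cycle before the high half:

```python
# Behavioral model of a Willamette-style width-pipelined ("staggered")
# add: low 16 bits of the sum in fast cycle 1, high 16 bits (consuming
# the carry out of bit 16) in fast cycle 2. A dependent add needs only
# the low half first, so dependent adds can issue back to back.

MASK16 = 0xFFFF

def staggered_add(a, b):
    """Yield a 32-bit sum in two half-width steps."""
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16
    yield lo & MASK16                 # fast cycle 1: low half, forwardable
    hi = (a >> 16) + (b >> 16) + carry
    yield (hi & MASK16) << 16         # fast cycle 2: high half

def full_add(a, b):
    lo, hi = staggered_add(a, b)
    return hi | lo

print(hex(full_add(0x0001FFFF, 0x00000001)))  # carry ripples into high half
```

This also shows why the scheme helps a low-power design as suggested above: each fast cycle only exercises a 16-bit adder, yet dependent operations still see single-(fast-)cycle forwarding of the low half.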

Paul A. Clayton

May 31, 2010, 10:02:10 PM
On May 30, 11:18 am, Andy 'Krazy' Glew <ag-n...@patten-glew.net>
wrote:
[snip]

> The original Pentium 4 ALUs were staggered in that they computed the low 16 bits in one of these fast clock cycles, and
> the high in the next - allowing back to back adds.  But that is not the widespread definition of "staggered" ALU.

I took the term from "Using Internal Redundant Representations and
Limited Bypass to Support Pipelined Adders and Register Files"
(Mary D. Brown, Yale N. Patt; HPCA 2001):
"An example of this concept, called staggered adds, was
implemented in the Intel Pentium 4 [10]. When staggering a 32-bit
add over two cycles, the carry-out of the 16th bit and the lower half
of the result are produced in the first cycle, and the upper half of
the result is produced in the second cycle."

So what is the proper term for this kind of pipelined addition?

(ISTR reading somewhere that the AMD K5 used the early
availability of the less significant bits of a sum to shorten
load latency, so early use of partial results is not an extremely
new idea.)

Andy 'Krazy' Glew

Jun 1, 2010, 11:06:42 AM
On 5/31/2010 7:02 PM, Paul A. Clayton wrote:
> On May 30, 11:18 am, Andy 'Krazy' Glew<ag-n...@patten-glew.net>
> wrote:
> [snip]
>> The original Pentium 4 ALUs were staggered in that they computed the low 16 bits in one of these fast clock cycles, and
>> the high in the next - allowing back to back adds. But that is not the widespread definition of "staggered" ALU.
>
> I took the term from "Using Internal Redundant Representations and
> Limited Bypass to Support Pipelined Adders and Register Files"
> (Mary D. Brown, Yale N. Patt; HPCA 2001):
> "An example of this concept, called staggered adds, was
> implemented in the Intel Pentium 4 [10]. When staggering a 32-bit
> add over two cycles, the carry-out of the 16th bit and the lower half
> of the result are produced in the first cycle, and the upper half of
> the result is produced in the second cycle."
>
> So what is the proper term for this kind of pipelined addition?


I apologize.

Apparently the Willamette team was using the term "staggered ALU",
e.g. in the paper http://www.dre.vanderbilt.edu/~aky/My/ppt/The%20Microarchitecture%20of%20Pentium%204%20Processor.pdf

This use is scattered all over the Internet, in lots of class notes.

(I used the term "width pipelined", but that was really early in the life of Willamette.)

Apparently the ALUs that are set to cascade from one to the other within a clock cycle are more commonly called cascaded
ALUs.


Terms for my lexicon.

Paul A. Clayton

Jun 1, 2010, 2:44:57 PM
On Jun 1, 11:06 am, Andy 'Krazy' Glew <ag-n...@patten-glew.net> wrote:
[snip]
> I apologize.

Well, this ended up helping both of us.

> Apparently the Willamette team was using the term "staggered ALU",

> e.g. in paper http://www.dre.vanderbilt.edu/~aky/My/ppt/The%20Microarchitecture%20o...


>
> This use is scattered all over the Internet, in lots of class notes.
>
> (I used the term "width pipelined", but that was really early in the life of Willamette.)
>
> Apparently the ALUs that are set to cascade from one to the other within a clock cycle are more commonly called cascaded
> ALUs.
>
> Terms for my lexicon.

Thank you for the research (and personal history)!

Perhaps a lexicon/list of abbreviations might be appropriate for your
CompArch wiki. (Thank you also for this donation.)

Andy 'Krazy' Glew

Jun 2, 2010, 12:21:39 AM


Yep. Although I must admit that I have a bit of trouble with the wiki organization in this regard. I find it hard to
define terms in isolation - I often want to define terms via little essays that compare them to related terms.

In my dreams, I rewrite the wiki software so that such terms can easily forward to discussion pages. But mediawiki's
redirection technology is quite primitive, so I think a rewrite is needed. But that is lower priority than me getting
the ability to edit diagrams onto the comp-arch.net wiki.

By the way, in case anyone is interested: my current direction in diagram editing is SVG-edit.

Andy 'Krazy' Glew

Jun 2, 2010, 1:51:02 AM
On 6/1/2010 9:21 PM, Andy 'Krazy' Glew wrote:
> On 6/1/2010 11:44 AM, Paul A. Clayton wrote:
>> On Jun 1, 11:06 am, Andy 'Krazy' Glew<ag-n...@patten-glew.net> wrote:
>> [snip]
>>> I apologize.
>>
>> Well, this ended up helping both of us.
>>
>>> Apparently the Willamette team was using the term "staggered ALU",
>>> e.g. in
>>> paper http://www.dre.vanderbilt.edu/~aky/My/ppt/The%20Microarchitecture%20o...
>>>
>>>
>>> This use is scattered all over the Internet, in lots of class notes.
>>>
>>> (I used the term "width pipelined", but that was really early in the
>>> life of Willamette.)
>>>
>>> Apparently the ALUs that are set to cascade from one to the other
>>> within a clock cycle are more commonly called cascaded
>>> ALUs.
>>>
>>> Terms for my lexicon.
>>
>> Thank you for the research (and personal history)!
>>
>> Perhaps a lexicon/list of abbreviations might be appropriate for your
>> CompArch wiki. (Thank you also for this donation.)

You made me do it:

https://semipublic.comp-arch.net/wiki/Cascaded_ALUs
https://semipublic.comp-arch.net/wiki/Staggered_ALUs
https://semipublic.comp-arch.net/wiki/Width-pipelined_ALUs

If I write a page every evening, I may be finished before I get senile.
