Scoreboard with register renaming

infinis...@gmail.com

Apr 28, 2018, 3:54:17 AM
I was wondering whether anybody (or any company) has tried implementing scoreboarding with register renaming. It seems like a possibility... why hasn't anyone tried? Are there drawbacks to such an approach? Does centralized control circuitry mean layouts with lower clocks? Something the amateur wouldn't see? I mean, is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

Quadibloc

Apr 28, 2018, 2:46:46 PM
I'm not sure I understand you. What you're describing sounds like what
everyone usually does when making an out-of-order computer, the
reservation stations approach being mostly unique to the 91, 95, and 195.

adaptive...@gmail.com

Apr 28, 2018, 2:53:29 PM
On Saturday, April 28, 2018 at 4:54:17 PM UTC+9, infinis...@gmail.com wrote:
> I was wondering whether anybody (or any company) has tried implementing scoreboarding with register renaming. It seems like a possibility... why hasn't anyone tried? Are there drawbacks to such an approach? Does centralized control circuitry mean layouts with lower clocks? Something the amateur wouldn't see? I mean, is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

Designs that combine scoreboarding with register renaming have existed; you can find them by googling the history of the microprocessor.

Best,
S.Takano

infinis...@gmail.com

Apr 28, 2018, 4:37:32 PM
I'm not really a CPU historian beyond Intel/AMD commercial CPUs and whatever I happen to come across when learning about designs. When I looked up dynamic scheduling/out-of-order execution I came across the CDC 6600 and the System/360 Model 91, the first being the first example of scoreboarding and the second being the first example of 'Tomasulo'. When I google scoreboarding with register renaming I just get explanations of the CDC 6600 implementation. If you could point out a particular processor I could look up, it would be greatly appreciated. As to what most out-of-order CPUs use... from what I've seen of commercial out-of-order CPUs (x86, PPC, ARM), they are variations of Tomasulo.

infinis...@gmail.com

Apr 28, 2018, 4:44:51 PM
I googled 'history of the microprocessor', 'history of the CPU', and 'history of mainframe cpu'; all that came up were Intel/x86-centric histories. Do you have a better search term?

EricP

Apr 29, 2018, 12:18:57 PM
infinis...@gmail.com wrote:
> I was wondering whether anybody (or any company) has tried implementing scoreboarding with register renaming. It seems like a possibility... why hasn't anyone tried? Are there drawbacks to such an approach? Does centralized control circuitry mean layouts with lower clocks? Something the amateur wouldn't see? I mean, is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

The word scoreboard is quite overloaded, so you need to be more
explicit in describing what microarchitecture you are thinking about.
When a processor uses a "scoreboard" you really have to look
at exactly how it works to see how it is the same or different.

A CDC 6600 style uArch managed by its scoreboard is
- in order issue, out-of-order execute, out-of-order complete
- with RAW hazards handled by stall in the register read stage
- with WAR hazards handled by stall in the write-back stage
- with WAW hazards handled by stall in the issue stage
- stall for available Function Unit in issue stage
- no forwarding
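
As a rough illustration of the issue-stage rules in the list above, here is a minimal Python sketch. It is not a model of the actual CDC 6600 logic; the function name `can_issue` and the set-based bookkeeping are assumptions made only for this example.

```python
# Hypothetical model of the issue-stage checks listed above
# (not the real CDC 6600 scoreboard logic).
def can_issue(dest_reg, unit, busy_units, pending_writes):
    """Return True if an instruction may issue this cycle.

    busy_units:     set of function-unit names currently occupied.
    pending_writes: set of registers with a write still outstanding.
    """
    if unit in busy_units:          # structural hazard: stall for a free FU
        return False
    if dest_reg in pending_writes:  # WAW hazard: stall in the issue stage
        return False
    return True


busy = {"div"}
pending = {"r0"}
print(can_issue("r3", "add", busy, pending))  # True: adder free, no WAW
print(can_issue("r0", "add", busy, pending))  # False: WAW on r0
print(can_issue("r6", "div", busy, pending))  # False: divider busy
```

RAW and WAR stalls happen later (register read and write-back), so they are deliberately left out of this issue check.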

One important consequence of OoO complete is imprecise exceptions.
e.g.

div r0, r1, r2 ; r0 = r1/r2
add r3, r4, r5 ; r3 = r4+r5

the above add can issue, complete, and update r3;
then later the divider unit notices an exception, say underflow,
and throws an exception, aborting the update to r0.
So r0 is unchanged but the later r3 is changed.
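
The effect can be reproduced with a toy Python model. This is only a sketch under stated assumptions: latencies are made up, operands are read at completion time for simplicity, and a divide-by-zero stands in for the underflow mentioned above.

```python
# Toy model: instructions finish in latency order, not program order.
regs = {"r0": 0, "r1": 1, "r2": 0, "r3": 0, "r4": 2, "r5": 3}

# (program order, destination, assumed latency, operation)
program = [
    (0, "r0", 10, lambda r: r["r1"] / r["r2"]),  # div: faults because r2 == 0
    (1, "r3", 1, lambda r: r["r4"] + r["r5"]),   # add: completes first
]

faulted_at = None
for order, dest, latency, op in sorted(program, key=lambda i: i[2]):
    try:
        regs[dest] = op(regs)
    except ZeroDivisionError:
        faulted_at = order  # fault is seen after the younger add wrote r3
        break

print(regs["r0"], regs["r3"], faulted_at)  # 0 5 0
```

The older instruction (program order 0) faults, yet the younger add has already changed r3, which is exactly the imprecise state described.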

Adding a renamer in the instruction decode stage:
- eliminates WAR and WAR hazard detection and it becomes
subsumed by a more general resource availability check
for a free physical register.
Note that physical register status now tracks pending reads,
so a physical register is free if it is:
not the architecture register, not busy, and no reads pending.
- allows rollback to prior committed architected state
implementing precise exceptions.

It also needs some logic for rollback to a prior
architected register set, and to track and commit.
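
The "free physical register" condition above can be written down directly. The following is a minimal sketch; the class `PhysReg` and the helper `allocate` are hypothetical names, and a real design would use a free list rather than a linear scan.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    is_architectural: bool = False  # currently holds committed architected state
    busy: bool = False              # a write to it is still in flight
    pending_reads: int = 0          # issued readers that have not read it yet

    def is_free(self) -> bool:
        return (not self.is_architectural
                and not self.busy
                and self.pending_reads == 0)


def allocate(regs):
    """Return the index of a free physical register, or None (decode stalls)."""
    for i, r in enumerate(regs):
        if r.is_free():
            r.busy = True
            return i
    return None


regs = [PhysReg(is_architectural=True), PhysReg(pending_reads=1), PhysReg()]
print(allocate(regs))  # 2
print(allocate(regs))  # None: nothing free, so decode would stall
```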

In the end we have:
- in order issue, out-of-order execute, out-of-order complete
- with RAW hazards handled by stall in the register read stage
- stall for free physical register in the instruction decode stage
- stall for available FU in issue stage
- no forwarding
- precise interrupts

Now the question becomes, once you have done all this,
how did the performance change?
Is having precise interrupts sufficient cause to
justify adding renaming?

Reservation stations (with renamer) allow distribution of the
pending instructions from the decoder stage to the various stations
so you don't stall at the issue stage for an available FU.
Instead each RS tracks its own FU and data availability,
and does its own OoO scheduling and OoO complete.
But now you have to figure out how to get all the necessary
information and data to each RS.
One question, for example, is how many registers should each RS have?
Too few and you stall at decode, too many and they sit idle.
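
To make "each RS tracks its own data availability" concrete, here is a small Python sketch of a station that captures operands from a result broadcast. The class and the tag names (`p7`, `p9`) are invented for illustration, and every operand is assumed to be outstanding when the instruction is dispatched.

```python
class ReservationStation:
    def __init__(self, op, src_tags):
        self.op = op
        # tag -> value; None means the operand has not been produced yet
        self.operands = {tag: None for tag in src_tags}

    def snoop(self, tag, value):
        """Capture a broadcast result if this station is waiting on its tag."""
        if tag in self.operands and self.operands[tag] is None:
            self.operands[tag] = value

    def ready(self):
        return all(v is not None for v in self.operands.values())


rs = ReservationStation("add", ["p7", "p9"])
rs.snoop("p7", 4)
print(rs.ready())  # False: still waiting on p9
rs.snoop("p9", 6)
print(rs.ready())  # True: the station can schedule itself on its FU
```

Each station holds full operand values here, which is where the storage cost of the scheme comes from.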

Eric

infinis...@gmail.com

Apr 29, 2018, 3:00:09 PM
On Sunday, April 29, 2018 at 12:18:57 PM UTC-4, EricP wrote:
>
> The word scoreboard is quite overloaded, so you need to be more
> explicit in describing what microarchitecture you are thinking about.
> When a processor uses a "scoreboard" you really have to look
> at exactly how it works to see how it is the same or different.

I was using the term scoreboarding both loosely and specifically. How come both... I was contemplating centralized control circuitry vs the distributed method found in Tomasulo. So there is both the possibility of scoreboarding with register renaming, and there are other possibilities.


> Adding a renamer in the instruction decode stage:
> - eliminates WAR and WAR hazard detection and it becomes
> subsumed by a more general resource availability check
> for a free physical register.
> Note that physical register status now tracks pending reads,
> so a physical register is free if it is:
> not the architecture register, not busy, and no reads pending.
> - allows rollback to prior committed architected state
> implementing precise exceptions.

First you wrote WAR and WAR, I assume you meant WAR and WAW. Second I thought the renamer would be relevant in the issue and read operands stage since there is no instruction decode stage in a scoreboard.

> In the end we have:
> - in order issue, out-of-order execute, out-of-order complete
> - with RAW hazards handled by stall in the register read stage
> - stall for free physical register in the instruction decode stage
> - stall for available FU in issue stage
> - no forwarding
> - precise interrupts

I thought reorder buffers allowed for precise interrupts. Register renaming alone does not. In fact you state out of order completion which implies imprecise interrupts.

> Now the question becomes, once you have done all this,
> how did the performance change?
> Is having precise interrupts sufficient cause to
> justify adding renaming?

Again you're not adding precise interrupts with register renaming but you are getting rid of the stalls caused by RAW and WAW hazards. Which are the primary reasons Tomasulo is abstractly better than Scoreboarding.

>
> Reservation stations (with renamer) allows distribution of the
> pending instructions from the decoder stage to the various stations
> so you don't stall at the issue stage for an available FU.
> Instead each RS tracks its own FU and data availability,
> and does its own OoO scheduling and OoO complete.
> But now you have to figure out how to get all the necessary
> information and data to each RS.
> One question, for example, is how many registers should each RS have?
> Too few and you stall at decode, too many and they sit idle.
>
> Eric
I understand the purpose of reservation stations but I wonder if this distributed design (physically) has a positive impact on clock speed vs a centralized solution. Also the only reason the scoreboard in the 6600 stalled if a FU was busy was that the FU's weren't pipelined. That should be easily remedied. In fact IIRC the successor to the 6600 the 7600(???) implemented pipelined FU's.

EricP

Apr 29, 2018, 4:50:41 PM
infinis...@gmail.com wrote:
> On Sunday, April 29, 2018 at 12:18:57 PM UTC-4, EricP wrote:
>> The word scoreboard is quite overloaded, so you need to be more
>> explicit in describing what microarchitecture you are thinking about.
>> When a processor uses a "scoreboard" you really have to look
>> at exactly how it works to see how it is the same or different.
>
> I was using the term scoreboarding both loosely and specifically. How come both... I was contemplating centralized control circuitry vs the distributed method found in Tomasulo. So there is both the possibility of scoreboarding with register renaming, and there are other possibilities.
>
>
>> Adding a renamer in the instruction decode stage:
>> - eliminates WAR and WAR hazard detection and it becomes
>> subsumed by a more general resource availability check
>> for a free physical register.
>> Note that physical register status now tracks pending reads,
>> so a physical register is free if it is:
>> not the architecture register, not busy, and no reads pending.
>> - allows rollback to prior committed architected state
>> implementing precise exceptions.
>
> First you wrote WAR and WAR, I assume you meant WAR and WAW. Second I thought the renamer would be relevant in the issue and read operands stage since there is no instruction decode stage in a scoreboard.

Yes, that was a typo - should have been WAR and WAW.

Sure there is an instruction decoder for a scoreboarded cpu.
The scoreboard is a kind of scheduler, it tracks dependencies
and decides when to issue to the FU's.

The renamer removes architectural register dependencies
but it also separates the execute completion from state commit.

The renamer contains two sets of maps from architecture to
physical registers, the future set and the committed set.
At state commit, the committed map gets updated with the
latest physical register. If something goes wrong, the committed
set can be copied into the future set effecting a rollback.
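
A minimal Python sketch of the two-map idea just described, assuming an initial identity mapping and leaving physical register allocation to the caller (the class and method names are hypothetical):

```python
class Renamer:
    def __init__(self, num_arch):
        # Initially architecture register i lives in physical register i.
        self.committed = list(range(num_arch))
        self.future = list(range(num_arch))

    def rename_dest(self, arch, new_phys):
        self.future[arch] = new_phys        # mapping seen by later decodes

    def commit(self, arch, phys):
        self.committed[arch] = phys         # mapping is now architected state

    def rollback(self):
        self.future = list(self.committed)  # discard all uncommitted mappings


r = Renamer(4)
r.rename_dest(0, 10)
r.rename_dest(3, 11)
r.commit(0, 10)    # the write to arch register 0 commits
r.rollback()       # something went wrong before register 3 committed
print(r.future)    # [10, 1, 2, 3]
```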

>> In the end we have:
>> - in order issue, out-of-order execute, out-of-order complete
>> - with RAW hazards handled by stall in the register read stage
>> - stall for free physical register in the instruction decode stage
>> - stall for available FU in issue stage
>> - no forwarding
>> - precise interrupts
>
> I thought reorder buffers allowed for precise interrupts. Register renaming alone does not. In fact you state out of order completion which implies imprecise interrupts.

To me it is the renamer that records the committed vs future state,
and that is what creates precise interrupts.

Or to put it a different way, one could build a cpu with
a ROB and OoO scheduler and completion but _without_ a renamer.
It would NOT have precise interrupts.

I am trying to look at each potential change in isolation.
What is required to add JUST a renamer?
What is required for JUST OoO issuing and completion?
What is required for JUST forwarding?

A ROB is typically also associated with Out-of-Order issuing
and has other features associated with it,
and I was trying to separate out the changes for just adding
a renamer while retaining the original in-order issue.

But yes, adding the renamer would need something to trigger
the renamer state commit. I wanted something very simple.
I thought about various shift registers but none really worked out.
The simplest I can think of is a circular buffer with just
an architecture register#, a physical register#, and a Done flag.
It didn't seem fair to call that a ROB though that is partly its job.
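
That circular buffer could look roughly like the following Python sketch. A `deque` stands in for the hardware head/tail pointers, and the names `CommitBuffer`, `allocate` and `commit_ready` are assumptions made for the example.

```python
from collections import deque


class CommitBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # oldest instruction at the left

    def allocate(self, arch, phys):
        if len(self.entries) == self.size:
            return None         # buffer full: decode must stall
        entry = {"arch": arch, "phys": phys, "done": False}
        self.entries.append(entry)
        return entry

    def commit_ready(self):
        """Pop finished entries from the head, stopping at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            committed.append((e["arch"], e["phys"]))
        return committed


buf = CommitBuffer(4)
older = buf.allocate(0, 16)
younger = buf.allocate(3, 17)
younger["done"] = True      # the younger instruction completes first
print(buf.commit_ready())   # []: the head is not done, nothing commits
older["done"] = True
print(buf.commit_ready())   # [(0, 16), (3, 17)]: commits in program order
```

Each committed (arch, phys) pair is what would be written into the committed map.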

>> Now the question becomes, once you have done all this,
>> how did the performance change?
>> Is having precise interrupts sufficient cause to
>> justify adding renaming?
>
> Again you're not adding precise interrupts with register renaming but you are getting rid of the stalls caused by RAW and WAW hazards. Which are the primary reasons Tomasulo is abstractly better than Scoreboarding.

Yes it does make interrupts precise if you add some minimal
support logic, like that circular buffer I mention above.
It is the renamer that allows the state commit vs rollback.

Tomasulo bypasses the in-order issue bottleneck with reservation stations.
It ALSO has a renamer and therefore a future vs committed state.

>> Reservation stations (with renamer) allows distribution of the
>> pending instructions from the decoder stage to the various stations
>> so you don't stall at the issue stage for an available FU.
>> Instead each RS tracks its own FU and data availability,
>> and does its own OoO scheduling and OoO complete.
>> But now you have to figure out how to get all the necessary
>> information and data to each RS.
>> One question, for example, is how many registers should each RS have?
>> Too few and you stall at decode, too many and they sit idle.
>>
>> Eric
> I understand the purpose of reservation stations but I wonder if this distributed design (physically) has a positive impact on clock speed vs a centralized solution. Also the only reason the scoreboard in the 6600 stalled if a FU was busy was that the FU's weren't pipelined. That should be easily remedied. In fact IIRC the successor to the 6600 the 7600(???) implemented pipelined FU's.

From a concurrency tracking point of view you can have
multiple FU's, or pipelined FU's, or multiple pipelined FU's.
All of these have multiple instructions in flight.
But a pipeline can only start or complete one instruction per clock,
while multiple FU's start or complete multiple instructions per clock
and so can have more port and bus resource contentions.

Eric




infinis...@gmail.com

Apr 29, 2018, 6:16:00 PM
On Sunday, April 29, 2018 at 4:50:41 PM UTC-4, EricP wrote:

> > First you wrote WAR and WAR, I assume you meant WAR and WAW. Second I thought the renamer would be relevant in the issue and read operands stage since there is no instruction decode stage in a scoreboard.
>
> Yes, that was a typo - should have been WAR and WAW.
>
> Sure there is an instruction decoder for a scoreboarded cpu.
> The scoreboard is a kind of scheduler, it tracks dependencies
> and decides when to issue to the FU's.

Of course there is an instruction decoder, but it is part of the issue stage (at least in the 6600).

>
> The renamer removes architectural register dependencies
> but it also separates the execute completion from state commit.

With renaming but no ROB, history buffer, or some alternative method to deal with state per in-flight instruction pointer (I'm assuming no speculative hardware; I don't remember how rollback of speculative execution occurs, but IIRC it's usually with a ROB), the processor would only have access to the most recent state being considered by the hardware.

>
> The renamer contains two sets of maps from architecture to
> physical registers, the future set and the committed set.
> At state commit, the committed map gets updated with the
> latest physical register. If something goes wrong, the committed
> set can be copied into the future set effecting a rollback.

You would need multiple mappings just for the future set, since you would need processor context for each in-flight instruction pointer. And once committed you can't roll back; that's the definition of committed, i.e. you're sure this state is coherent (at least in the case of in-order retirement, which is necessary for precise interrupts).

>
> >> In the end we have:
> >> - in order issue, out-of-order execute, out-of-order complete
> >> - with RAW hazards handled by stall in the register read stage
> >> - stall for free physical register in the instruction decode stage
> >> - stall for available FU in issue stage
> >> - no forwarding
> >> - precise interrupts
> >
> > I thought reorder buffers allowed for precise interrupts. Register renaming alone does not. In fact you state out of order completion which implies imprecise interrupts.
>
> To me it is the renamer that records the committed vs future state,
> and that is what creates precise interrupts.
>
> Or to put it a different way, one could build a cpu with
> a ROB and OoO scheduler and completion but _without_ a renamer.
> It would NOT have precise interrupts.

Could you build an OoO with an ROB but without renaming? That doesn't seem feasible. Also, since when does a renamer keep track of per-instruction state?

>
> I am trying to look at each potential change in isolation.
> What is required to add JUST a renamer?
> What is required for JUST OoO issuing and completion?
> What is required for JUST forwarding?
>
> A ROB is typically also associated with Out-of-Order issuing
> and has other features associated with it,
> and I was trying to separate out the changes for just adding
> a renamer while retaining the original in-order issue.

A ROB is associated with OoO retirement not issuing. And moreover it enables in order commit. In using a ROB you commit state changes in order... in doing so you don't need to keep a full context per instruction you just need to keep track of what each instruction does to the state.

>
> But yes, adding the renamer would need something to trigger
> the renamer state commit. I wanted something very simple.
> I thought about various shift registers but none really worked out.
> The simplest I can think of is a circular buffer with just
> an architecture register#, a physical register#, and a Done flag.
> It didn't seem fair to call that a ROB though that is partly its job.

I haven't delved into ROB's in a while, but IIRC your 'circular buffer' sounds like it's supposed to do what an ROB does. But you never mentioned a circular buffer in your original post; you only mentioned the renamer.

>
> >> Now the question becomes, once you have done all this,
> >> how did the performance change?
> >> Is having precise interrupts sufficient cause to
> >> justify adding renaming?
> >
> > Again you're not adding precise interrupts with register renaming but you are getting rid of the stalls caused by RAW and WAW hazards. Which are the primary reasons Tomasulo is abstractly better than Scoreboarding.
>
> Yes it does make interrupts precise if you add some minimal
> support logic, like that circular buffer I mention above.
> It is the renamer that allows the state commit vs rollback.

Like I said above you didn't mention a circular buffer in your original post and you're saying the circular buffer is responsible not the renamer.

>
> Tomasulo bypasses the in-order issue bottleneck with reservation stations.
> It ALSO has a renamer and therefore a future vs committed state.

Tomasulo issues in order, and has imprecise interrupts. It's only when you add additional hardware that you get precise interrupts.



MitchAlsup

Apr 29, 2018, 10:44:41 PM
On Saturday, April 28, 2018 at 2:54:17 AM UTC-5, infinis...@gmail.com wrote:
> I was wondering whether anybody (or any company) has tried implementing scoreboarding with register renaming.

The CDC 6600 scoreboard renamed registers into the function unit delivering the RAW result

The CDC 7600 scoreboard renamed registers into the function unit and pipeline number of the RAW result

Neither did register renaming.

> It seems like a possibility... why hasn't anyone tried? Are there drawbacks to such an approach? Does centralized control circuitry mean layouts with lower clocks? Something the amateur wouldn't see? I mean, is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?

I have a 140 page chapter in a comp-arch book discussing scoreboards. Find a way to e-mail me and I will send it to you.

It discusses adding forwarding to a scoreboard, having multiple writes to the same register, and advancing the state of the scoreboard art.

BTW: the ENTIRE CDC 6600 scoreboard was smaller than a single entry in the 360-91 reservation station scheme.

Quadibloc

Apr 29, 2018, 11:44:35 PM
I may be all wrong, but to me he sounds like he is describing exactly
what everyone else is already doing: instead of reservation stations, they
use register renaming with centralized control, which is sort of
like a "scoreboard", just not exactly like the 6600, because now
register renaming is added.

infinis...@gmail.com

Apr 30, 2018, 2:38:54 AM
On Sunday, April 29, 2018 at 10:44:41 PM UTC-4, MitchAlsup wrote:
> On Saturday, April 28, 2018 at 2:54:17 AM UTC-5, infinis...@gmail.com wrote:
> > I was wondering whether anybody (or any company) has tried implementing scoreboarding with register renaming.
>
> The CDC 6600 scoreboard renamed registers into the function unit delivering the RAW result
>

I thought the 6600 scoreboard buffers/stalls in the read operand stage if the
operands aren't available?

> The CDC 7600 scoreboard renamed registers into the function unit and pipeline number of the RAW result
>
> Neither did register renaming.
>
If neither did register renaming, then why did you just say they renamed registers?

> > It seems like a possibility... why hasn't anyone tried? Are there drawbacks to such an approach? Does centralized control circuitry mean layouts with lower clocks? Something the amateur wouldn't see? I mean, is there any real advantage to having the reservation stations issue instructions to execution units instead of a centralized solution?
>
> I have a 140 page chapter in a comp-arch book discussing scoreboards. Find a way to e-mail me and I will send it to you.
>
and how would I do that?

> It discusses adding forwarding to a scoreboard, having multiple writes to the same register, and advancing the state of the scoreboard art.
>
> BTW: the ENTIRE CDC 6600 scoreboard was smaller than a single entry in the 360-91 reservation station scheme.
This seems impossible.

infinis...@gmail.com

Apr 30, 2018, 3:14:55 AM
Who is doing this exactly?

Quadibloc

Apr 30, 2018, 3:39:32 AM
Pretty much everyone who is making an OoO CPU. But I could be wrong.

John Savard

Stephen Fuld

Apr 30, 2018, 11:05:23 AM
On 4/29/2018 7:44 PM, MitchAlsup wrote:


snip

> I have a 140 page chapter in a comp-arch book discussing scoreboards. Find a way to e-mail me and I will send it to you.
>
> It discusses adding forwarding to a scoreboard, having multiple writes to the same register, and advancing the state of the scoreboard art.


Is this a book that you are writing or an existing book by someone else?



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Apr 30, 2018, 10:29:04 PM
Apparently you aren't worth the trouble.

EricP

May 1, 2018, 12:36:29 PM
infinis...@gmail.com wrote:
> On Sunday, April 29, 2018 at 4:50:41 PM UTC-4, EricP wrote:
>
>> The renamer removes architectural register dependencies
>> but it also separates the execute completion from state commit.
>
> With renaming with no ROB or history buffer or ...alt method to deal with per inflight Instruction pointer state (i'm assuming no speculative hardware, I don't remember how rollback of spec ex occurs but IIRC its usually with a ROB) the processor would only have access to the most recent state being considered by the hardware.

Ok... we are talking about different kinds of renamers.

I'm thinking of a single unified physical register file for
both future and committed results, with renaming using two
Register Alias Tables (RAT) for future FRAT and committed CRAT.
On commit the CRAT map is updated.

You are thinking of the ROB storing future result registers
that are later copied to the architecture registers on commit.
The RAT then selects between ROB or architecture register sources.

(A third way is using separate physical file and architecture file,
and copying from physical to architecture on commit.)

The Design Space of Register Renaming Techniques, Sima 2000
https://classes.soe.ucsc.edu/cmpe202/Fall04/papers/rat.pdf

My original reason for looking at this was to see what it would take
to allow variable latency instructions, with out-of-order completion,
with precise interrupts, at low cost.
Keeping in-order issue (maybe dual issue but still in-order)
with a unified physical file, a dual RAT renamer and a
circular commit order buffer looks like it would work.

Let's say there is a 4-bit architecture register number (ARN),
and 6-bit physical register number (PRN).

In the unified register approach a RAM-based RAT uses the 4-bit ARN
to select a map entry and read out a 6-bit PRN. There are 2 RAT tables,
a Future RAT (FRAT) and Committed RAT (CRAT).
There are also wires running horizontally from CRAT to FRAT allowing
all future entries to be reset to their committed values to effect
a rollback in one clock.

The single physical register file holds both future and committed values.
Source operands can come from the issuing instruction, or from
the physical file, and results all go to the physical file.

That just leaves a simple circular Commit Order Buffer (COM) to
trigger CRAT updates on commit (and committed program counter, etc).
The commit writes the committed ARN->PRN mapping into CRAT, making
the future registers "real", and gives precise interrupts.
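
For a feel of the sizes involved with a 4-bit ARN and a 6-bit PRN, here is a short back-of-envelope calculation in Python. It assumes every encodable register is actually implemented, which the post does not state.

```python
ARN_BITS = 4
PRN_BITS = 6

arch_regs = 2 ** ARN_BITS        # architecture registers
phys_regs = 2 ** PRN_BITS        # physical registers in the unified file
rat_bits = arch_regs * PRN_BITS  # storage bits in one RAT (FRAT or CRAT)
spare = phys_regs - arch_regs    # registers left over for in-flight results

print(arch_regs, phys_regs, rat_bits, spare)  # 16 64 96 48
```

So each of the two tables is only 96 bits, and at most 48 uncommitted results can be outstanding before decode stalls for a free physical register.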

Eric

MitchAlsup

May 1, 2018, 12:47:00 PM
On Tuesday, May 1, 2018 at 11:36:29 AM UTC-5, EricP wrote:
> infinis...@gmail.com wrote:
> > On Sunday, April 29, 2018 at 4:50:41 PM UTC-4, EricP wrote:
> >
> >> The renamer removes architectural register dependencies
> >> but it also separates the execute completion from state commit.
> >
> > With renaming with no ROB or history buffer or ...alt method to deal with per inflight Instruction pointer state (i'm assuming no speculative hardware, I don't remember how rollback of spec ex occurs but IIRC its usually with a ROB) the processor would only have access to the most recent state being considered by the hardware.
>
> Ok... we are talking about different kinds of renamers.
>
> I'm thinking of a single unified physical register file for
> both future and committed results, with renaming using two
> Register Alias Tables (RAT) for future FRAT and committed CRAT.
> On commit the CRAT map is updated.
>
> You are thinking of the ROB storing future result registers
> that are later copied to the architecture registers on commit.
> The RAT then selects between ROB or architecture register sources.
>
> (A third way is using separate physical file and architecture file,
> and copying from physical to architecture on commit.)

A fourth way is to use a physical register file using a CAM to map logical
registers to youngest physical register at issue and a decoder to write
into the physical registers. Each issue cycle the valid bits of the PRN
are written into a history buffer. Recovery is as simple as reading the
valid bits back out and writing them into the CAM.
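
One possible reading of this scheme, sketched in Python. This is a guess at the mechanism, not the actual hardware: the class `CamRenamer` and its methods are hypothetical, the caller chooses the new physical register, and a register is assumed not to be reused while a saved checkpoint still marks it valid.

```python
class CamRenamer:
    def __init__(self, num_phys, num_arch):
        self.tag = [None] * num_phys     # logical register each entry maps
        self.valid = [False] * num_phys  # True = youngest mapping for its tag
        for a in range(num_arch):        # assumed initial identity mapping
            self.tag[a], self.valid[a] = a, True
        self.history = []                # saved valid-bit vectors

    def lookup(self, arch):
        """CAM match: the valid entry whose tag equals the logical register."""
        for p, (t, v) in enumerate(zip(self.tag, self.valid)):
            if v and t == arch:
                return p
        raise KeyError(arch)

    def checkpoint(self):
        self.history.append(list(self.valid))
        return len(self.history) - 1

    def rename(self, arch, new_phys):
        self.valid[self.lookup(arch)] = False  # old mapping no longer youngest
        self.tag[new_phys], self.valid[new_phys] = arch, True

    def recover(self, checkpoint_id):
        self.valid = list(self.history[checkpoint_id])
        del self.history[checkpoint_id:]       # drop this and younger checkpoints


cam = CamRenamer(num_phys=8, num_arch=4)
cp = cam.checkpoint()
cam.rename(1, 5)
print(cam.lookup(1))  # 5
cam.recover(cp)
print(cam.lookup(1))  # 1
```

Only the valid bits are saved and restored; the tags stay in place, which is what keeps recovery cheap in this reading.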

infinis...@gmail.com

May 3, 2018, 5:29:18 PM
I was reviewing 'standard' scoreboarding and want to make sure I understand something correctly. The instruction window(max) is equal to the number of function units, correct? As in only one instruction can be issued per function unit.

MitchAlsup

May 3, 2018, 5:52:28 PM
On Thursday, May 3, 2018 at 4:29:18 PM UTC-5, infinis...@gmail.com wrote:
> I was reviewing 'standard' scoreboarding and want to make sure I understand something correctly. The instruction window(max) is equal to the number of function units, correct? As in only one instruction can be issued per function unit.

In std scoreboard, yes.

However, it is easy to add function unit pipelining to the model so the
RAW hazards (forwarding) are denoted by [FU,p#]. This is basically what the CDC 7600 did. In this case the maximum number of instructions in flight is FU*p#, and each FU can be issued only 1 instruction per clock, but multiple FUs can be issued to per clock.
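
The two window bounds can be compared with a couple of lines of Python. The unit count and pipeline depth below are made-up numbers, not figures for any particular machine.

```python
def max_in_flight(num_fus, pipeline_depth=1):
    """Upper bound on instructions in flight: one per FU pipeline stage."""
    return num_fus * pipeline_depth


print(max_in_flight(10))     # 10: standard scoreboard, one instruction per FU
print(max_in_flight(10, 4))  # 40: pipelined FUs, FU * p#
```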

Also note: one can add forwarding by having the FU ship the "go_write" signal 1 cycle earlier than CDC 6600 did.

But one MAJOR trick (not often noticed by the amateur) is that the Scoreboard only requires latches in the pipeline (rather than requiring flip-flops). This gets rid of a LOT of area and power.

infinis...@gmail.com

May 6, 2018, 4:32:42 AM
In Tomasulo the instruction window is the size of the reorder buffer right?

MitchAlsup

May 6, 2018, 2:26:19 PM
On Sunday, May 6, 2018 at 3:32:42 AM UTC-5, infinis...@gmail.com wrote:
> In Tomasulo the instruction window is the size of the reorder buffer right?

Which is equal to the total number of stations.

infinis...@gmail.com

May 6, 2018, 9:42:42 PM
Are there any scoreboarding implementations that mimic this behavior by buffering more instructions in the read operand stage?

MitchAlsup

May 7, 2018, 11:45:39 AM
K9, a cancelled AMD chip designed for 5GHz operating frequency in 2006, used
value free reservation stations to save RS area. After the instruction fired
into execution, the register file was read and forwarding performed.
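
A minimal Python sketch of what "value free" could mean, under stated assumptions: the station stores only operand tags, results are assumed to be in the register file by the time the instruction fires, and forwarding is left out. The class name and tag names are hypothetical and say nothing about how K9 was actually built.

```python
class ValueFreeStation:
    def __init__(self, op, src_tags):
        self.op = op
        self.src_tags = list(src_tags)
        self.waiting = set(src_tags)  # tags not yet seen on the result bus

    def snoop(self, tag):
        self.waiting.discard(tag)     # only the tag is recorded, never a value

    def fire(self, regfile):
        """Read operands from the register file at fire time, if all are ready."""
        if self.waiting:
            return None
        return [regfile[t] for t in self.src_tags]


regfile = {"p7": 4, "p9": 6}
st = ValueFreeStation("add", ["p7", "p9"])
st.snoop("p7")
print(st.fire(regfile))  # None: still waiting on p9
st.snoop("p9")
print(st.fire(regfile))  # [4, 6]
```

The station never holds operand data, which is the area saving; the cost is a register file read after the instruction fires.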

thomas....@gmail.com

May 7, 2018, 8:30:52 PM
Am Montag, 7. Mai 2018 17:45:39 UTC+2 schrieb MitchAlsup:

> K9, a cancelled AMD chip designed for 5GHz operating frequency in 2006,

Out of curiosity: Why was it canceled? I would assume it hit the thermal wall?

Regards,

Thomas

infinis...@gmail.com

May 8, 2018, 10:08:15 AM
So only tags from the CDB are propagated to the reservation stations? Did this require more ports on the register file?
What is your opinion on this type of design?
And, as Thomas asks above, why was it cancelled?

MitchAlsup

May 8, 2018, 1:03:01 PM
Power Wall.

MitchAlsup

May 8, 2018, 1:05:05 PM
On Tuesday, May 8, 2018 at 9:08:15 AM UTC-5, infinis...@gmail.com wrote:
> On Monday, May 7, 2018 at 11:45:39 AM UTC-4, MitchAlsup wrote:
> > On Sunday, May 6, 2018 at 8:42:42 PM UTC-5, infinis...@gmail.com wrote:
> > > On Sunday, May 6, 2018 at 2:26:19 PM UTC-4, MitchAlsup wrote:
> > > > On Sunday, May 6, 2018 at 3:32:42 AM UTC-5, infinis...@gmail.com wrote:
> > > > > In Tomasulo the instruction window is the size of the reorder buffer right?
> > > >
> > > > Which is equal to the total number of stations.
> > >
> > > Are there any scoreboarding implementations that mimic this behavior by buffering more instructions in the read operand stage?
> >
> > K9, a cancelled AMD chip designed for 5GHz operating frequency in 2006, used
> > value free reservation stations to save RS area. After the instruction fired
> > into execution, the register file was read and forwarding performed.
>
> So only tags from the CDB are propagated to the reservation stations?
Yes
> Did this require more ports on the register file?
No, the Reservation Stations would only fire if ports were available.
> What is your opinion on this type of design?
Value-free RS:: they are OK--saves lots of area, enables larger execution windows
8-gates/clock:: way too fast
> and like thomas asks above why was it cancelled.
Power Consumption

Quadibloc

May 8, 2018, 4:06:09 PM
There is a Wikipedia article on the K9, but I wouldn't be surprised if it
was talking about a different design.

Instead of describing AMD's answer to the Pentium 4, the design described
was ambitious in a completely different way: it was to issue up to
eight instructions simultaneously in the same clock cycle.

A chip that could do that wouldn't need to run at 5 GHz to be fast.

Ivan Godard

May 8, 2018, 4:19:35 PM
Fancy that!

infinis...@gmail.com

May 8, 2018, 4:44:13 PM
On Tuesday, May 8, 2018 at 1:05:05 PM UTC-4, MitchAlsup wrote:
> > Did this require more ports on the register file?
> No the Reservation Stations would only fire if ports were available.

What arbitrated port access? There must be some type of synchronization.

MitchAlsup

May 8, 2018, 6:19:18 PM
On Tuesday, May 8, 2018 at 3:06:09 PM UTC-5, Quadibloc wrote:
> There is a Wikipedia article on the K9, but I wouldn't be surprised if it
> was talking about a different design.
>
> Instead of describing AMD's answer to the Pentium 4, the design described
> was ambitious in a completely different way: it was to issue up to
> eight instructions simultaneously in the same clock cycle.

Err, no::

K9 fetched 8 instructions every other cycle and made 2 branch predictions
associated with 3 next fetch addresses every other cycle.

K9 issued 4 instructions per cycle and took 2 cycles to issue a fetch width.

The major BW limiter was the renamer, not the register file.

> A chip that could do that wouldn't need to run at 5 GHz to be fast.

Was not my choice. Dirk said:: "Frequency is your friend" and we designed
a chip to his exacting specification. Most of the logic was running in
SPICE at 5GHz at the time of cancellation, and a majority of the layout
was done.

infinis...@gmail.com

May 8, 2018, 7:58:56 PM
Also, besides arbitrating port access, wouldn't this require more logic to track the lifetime of rename registers?

MitchAlsup

May 8, 2018, 10:03:31 PM
Rename register size:: 7-bits
Register size.......:: 84-bits

So for every operand you save out of the reservation station, you get 12 new rename registers as a trade-off. (This would have been only 9 new rename entries were the registers not required to support 80-bit arithmetic.)

So, given 4*16 reservation station entries, each with 2 operands, not capturing the result data saves 168 (128) flip-flops per entry, at a cost of retaining the 7 bits one already needed to read the register file (and remember these are the post-rename register names). In total we saved 10,752 flip-flops, or about 100K gates (equivalent to about 2.5 64-bit FMAC units).

The cost was another 2 cycles in the pipeline--about 4*2*84 (672) bits plus one more set of 8 rename registers (8*7 = 56); call it 750 bits.

Now, a question for those attempting to pay attention:: which is bigger, 100K or 7.5K? Second question:: by a little or by a lot?

The real question is why others did not discard the operand capture part of their reservation stations.
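
The arithmetic in this post can be checked with a few lines of Python. This is only a restatement of the figures quoted above. The constant `GATES_PER_FLOP = 10` is an assumption, inferred from the post's own rounding (10,752 flip-flops called about 100K gates, 750 bits called 7.5K), and the variable names are made up for illustration.

```python
RENAME_TAG_BITS = 7        # size of a renamed register name
REGISTER_BITS = 84         # register width quoted in the post
STATIONS = 4 * 16          # reservation station entries
OPERANDS_PER_STATION = 2
GATES_PER_FLOP = 10        # assumption: rough ratio implied by the post's rounding

# Rename names that fit in the storage of one captured operand.
tags_per_operand = REGISTER_BITS // RENAME_TAG_BITS        # 84-bit registers
tags_per_operand_64 = 64 // RENAME_TAG_BITS                # 64-bit registers

# Flip-flops no longer needed once stations stop capturing operand values.
flops_per_station = OPERANDS_PER_STATION * REGISTER_BITS
flops_saved = STATIONS * flops_per_station

# Added state: two more pipeline stages plus one more set of rename registers.
added_bits = 4 * 2 * REGISTER_BITS + 8 * RENAME_TAG_BITS

gates_saved = flops_saved * GATES_PER_FLOP
gates_added = added_bits * GATES_PER_FLOP
ratio = gates_saved / gates_added

print(tags_per_operand, tags_per_operand_64, flops_saved, added_bits)
print(f"saved ~{gates_saved} gates, added ~{gates_added} gates, ratio ~{ratio:.1f}x")
```

Under that assumed gate ratio, the saving is more than an order of magnitude larger than the added state, which answers "by a little or by a lot?".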

Ivan Godard

May 9, 2018, 12:12:02 AM
Perhaps, in those days before branch predictors got better, the added
mispredict penalty was considered more important than the area/power?

infinis...@gmail.com

May 9, 2018, 4:01:53 PM
On Sunday, May 6, 2018 at 2:26:19 PM UTC-4, MitchAlsup wrote:
I've been reviewing Tomasulo and just want to make sure I'm understanding this correctly. The original Tomasulo didn't have a reorder buffer or a history buffer or checkpoint repair so the register file had to have the same number of registers as there are reservation stations? This seems obvious but I just want to make sure I'm thinking about this correctly.

infinis...@gmail.com

May 9, 2018, 4:15:49 PM
I understand the savings implied by getting rid of values in the reservation stations. But I was talking about management/control. With reservation stations that hold both tags and values, once a value is on the CDB the rename register becomes free again, since those values are buffered in the reservation stations. If the reservation stations don't buffer values, then you would need to track when all instructions dependent on an output have read their operands before you know a rename register is free again... right?

MitchAlsup

May 9, 2018, 4:57:21 PM
No, the register file still had 4 registers, but there were 2+3+3+2 stations
{ADD, MUL, LD, ST}. Here registers were renamed into station numbers.

Once you rename into a reorder buffer (so you can recover) the number of ROB
entries has to be at least as large as the number of RS entries.
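
A toy Python sketch of "registers were renamed into station numbers", as described above: a source register either supplies its value at issue or is replaced by the number of the station that will produce it, and a result broadcast updates the register only if that station is still its latest writer. The class name `TomasuloRename`, the station labels, and the method names are hypothetical. This is a simplified model, not the 360/91's actual logic, and it has no reorder buffer.

```python
class TomasuloRename:
    def __init__(self, num_regs=4):
        self.regs = [0.0] * num_regs        # architectural FP registers
        self.producer = [None] * num_regs   # station that will write each register
        self.stations = {}                  # station id -> list of operands or tags

    def issue(self, station_id, dest, srcs):
        operands = []
        for r in srcs:
            tag = self.producer[r]
            # Capture the value if it is current, else remember the producing station.
            operands.append(self.regs[r] if tag is None else ("tag", tag))
        self.stations[station_id] = operands
        self.producer[dest] = station_id    # dest is now renamed to this station

    def broadcast(self, station_id, value):
        # Common data bus: every waiting consumer captures the value.
        for ops in self.stations.values():
            for i, op in enumerate(ops):
                if op == ("tag", station_id):
                    ops[i] = value
        for r, tag in enumerate(self.producer):
            if tag == station_id:           # only the latest writer updates the register
                self.regs[r] = value
                self.producer[r] = None
        self.stations.pop(station_id, None)


t = TomasuloRename()
t.regs = [1.0, 2.0, 3.0, 4.0]
t.issue("ADD1", dest=0, srcs=[1, 2])   # F0 = F1 + F2
t.issue("MUL1", dest=3, srcs=[0, 2])   # F3 = F0 * F2, waits on ADD1
t.issue("ADD2", dest=0, srcs=[1, 1])   # F0 written again (WAW on F0)
t.broadcast("ADD1", 3.0)               # MUL1 captures it; F0 still belongs to ADD2
```

In this model the number of architectural registers and the number of stations are independent, because the stations themselves act as the rename targets.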

MitchAlsup

May 9, 2018, 4:59:21 PM
With value capture stations, you read the register file at issue time.
With value free stations, you read the file at execute time.

The number of ports needed is larger at issue time than at execute time
because you are almost never executing as fast as you can issue (Mill
excluded.)

infinis...@gmail.com

May 9, 2018, 5:20:36 PM
Yeah, I realized this upon reading more about it in a book I own. But I don't understand how the original implementation dealt with exceptions and interrupts. I suppose with an external interrupt you just stop issuing instructions and wait till all pending instructions complete so you have consistency. But let's say you get a divide by zero in the FPU, then what?

infinis...@gmail.com

May 9, 2018, 5:41:23 PM
On Wednesday, May 9, 2018 at 4:59:21 PM UTC-4, MitchAlsup wrote:

> With value capture stations, you read the register file at issue time.
> With value free stations, you read the file at execute time.
>
> The number of ports needed is larger at issue time than at execute time
> because you are almost never executing as fast as you can issue (Mill
> excluded.)

I understand and agree with the first part of what you said, and I can see what you're saying in the second part. But this really doesn't address my concern about knowing when a rename register becomes free to use again. If you have to read the register file at execute time, the lifetime of a rename register no longer ends implicitly when it is written to. So how is this handled?

infinis...@gmail.com

May 9, 2018, 5:55:14 PM
On Wednesday, May 9, 2018 at 5:41:23 PM UTC-4, infinis...@gmail.com wrote:

> I understand and agree with the first part of what you said, and I can see what you're saying in the second part. But this really doesn't address my concern about knowing when a rename register becomes free to use again. If you have to read the register file at execute time, the lifetime of a rename register no longer ends implicitly when it is written to. So how is this handled?

Oh wait I realize what I was misunderstanding. The rename register is still alive until it gets 'retired'. So basically the ROB is involved with its lifetime.

MitchAlsup

May 9, 2018, 6:27:58 PM
If you rename into a physical register file, you don't have this problem.
A physical register file is like a ROB except that it can cover both OoO
state and architectural state at the same time. It also avoids the movement
at retirement time.
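
One common way to picture this, sketched in Python under stated assumptions: each destination gets a fresh physical register from a free list, and when an instruction retires in order, the physical register that previously held its destination is returned to the free list. No value is copied at retirement. The class `PhysRegRename`, its sizes, and its method names are hypothetical, and recovery from mispredicts or exceptions is left out.

```python
from collections import deque


class PhysRegRename:
    def __init__(self, num_arch=4, num_phys=8):
        self.map = list(range(num_arch))               # arch reg -> phys reg
        self.free = deque(range(num_arch, num_phys))   # unallocated phys regs
        self.in_flight = deque()                       # (new_phys, old_phys), program order

    def rename(self, dest, srcs):
        src_phys = [self.map[r] for r in srcs]         # read current mappings first
        new = self.free.popleft()                      # an empty list would mean: stall
        self.in_flight.append((new, self.map[dest]))
        self.map[dest] = new
        return new, src_phys

    def retire(self):
        # In-order retirement: no older instruction remains that could still read
        # the previous mapping of this destination, so that register is freed.
        new, old = self.in_flight.popleft()
        self.free.append(old)
        return old


r = PhysRegRename()
new1, srcs1 = r.rename(dest=0, srcs=[1, 2])   # R0 = R1 op R2
new2, srcs2 = r.rename(dest=1, srcs=[0, 0])   # R1 = R0 op R0, reads the new R0
freed = r.retire()                            # first instruction retires
```

The freed register is the old home of R0, not the one just written. This is why a value-free station can still read its operand at execute time: a physical register stays allocated until a later writer of the same architectural register retires.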

Anton Ertl

May 10, 2018, 2:43:42 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Wednesday, May 9, 2018 at 3:15:49 PM UTC-5, infinis...@gmail.com wrote:
>> On Tuesday, May 8, 2018 at 10:03:31 PM UTC-4, MitchAlsup wrote:
>> > The real question is why others did not discard the operand capture part of their reservation stations.

A while ago, I read somewhere that Intel switched to value-free when
they introduced AVX on the Sandy Bridge, because value capture is too
expensive with 256-bit values.

>With value capture stations, you read the register file at issue time.
>With value free stations, you read the file at execute time.
>
>The number of ports needed is larger at issue time than at execute time
>because you are almost never executing as fast as you can issue (Mill
>excluded.)

What about forwarding? Value capture stations can grab the results
from the bus, saving a register read port. Value-free stations
sometimes (how often?) miss that opportunity. Whether the overall
read port requirements are higher or lower is not clear. And of
course, if you want to make use of that to reduce overall read port
requirements, you complicate the logic because you need to arbitrate
the remaining read ports.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

May 10, 2018, 2:56:54 AM
infinis...@gmail.com writes:
[Tomasulo]
>But I don't understand how the original implementation dealt with exemptions and
>interrupts. I suppose with an external interrupt you just stop issuing
>instructions and wait till all pending instructions complete so you have
>consistency. But lets say you get a divide by zero in the FPU then what?

That's why you often read about "imprecise interrupts" in connection
with Tomasulo. In particular,
<https://en.wikipedia.org/wiki/Tomasulo_algorithm#Exceptions> says:

|[Programs] that experience "imprecise" exceptions generally cannot
|restart or re-execute, as the system cannot determine the specific
|instruction that took the exception.

infinis...@gmail.com

May 10, 2018, 11:26:49 AM
I understand that, but what I'm saying is: that's the solution, halt the program? Basically what gets me is that the purpose of Tomasulo was supposedly to provide better performance while remaining compatible with the rest of the 360 series. So are imprecise interrupts/exceptions common to the whole series?

infinis...@gmail.com

May 10, 2018, 11:36:03 AM
When you say physical register file, you're referring to a 'merged' register file where both rename and architectural registers live? Wouldn't you still need to keep track of when an instruction needs to retire if you want in-order retirement, i.e. precise exceptions? If you're saying no, then why not?

Nick Maclaren

May 10, 2018, 11:47:19 AM
In article <041d2250-3ba7-41f6...@googlegroups.com>,
<infinis...@gmail.com> wrote:
>
>I understand that but what I'm saying is thats the solution halt the
>program? Basically what gets me is that the purpose of Tomasulo was
>supposedly to provide better performance while remaining compatible with
>the rest of the 360 series. So are imprecise interrupts/exceptions
>common to the whole series?

Yes. Models vary as to which exceptions are imprecise, but their
existence is common to the whole series. Even the System/390, though
I vaguely recall that few models of that range actually had any
imprecise exceptions. But it's a heck of a long time ago now!

Actually, you can very often restart and even sometimes reexecute,
but that needs the compiler and run-time system writer to cooperate
to enable it. Attempting it without that CAN be done, but you don't
want to go there, I can assure you!


Regards,
Nick Maclaren.

Anton Ertl

May 10, 2018, 12:15:46 PM
infinis...@gmail.com writes:
>I understand that but what I'm saying is thats the solution halt the
>program? Basically what gets me is that the purpose of Tomasulo was
>supposedly to provide better performance while remaining compatible with
>the rest of the 360 series. So are imprecise interrupts/exceptions
>common to the whole series?

I don't think so.

My take on this is: The 360/91 was built and sold as a supercomputer,
and behaves like one, not like a true S/360 family member. I.e.,
correctness is sacrificed for speed. And I think that's exactly what
the customers wanted (well, except the two customers who used them to
run Cobol programs; they did not get speed, but at least they also did
not get imprecise interrupts, because they did not use the FPU).

Why did IBM make it run the S/360 instructions, then? My guess is
that the main reason is that this was part of the marketing of the
S/360 line, i.e., it sends the message "We cover all bases", from the
cheap Model 20 (which apparently was even less of a proper member than
the 91) to the supercomputer Models 91 and 195. Wikipedia reports
$113k for the Model 30 (and the 20 was certainly cheaper) and
$7M-$12.5M for the 195.

A benefit (but probably not decisive) is that they could use S/360
software on the Model 91; in particular, they did not need a
separate Fortran compiler.

MitchAlsup

unread,
May 10, 2018, 12:33:01 PM5/10/18